CN104199974A - Microblog-oriented dynamic topic detection and evolution tracking method - Google Patents

Microblog-oriented dynamic topic detection and evolution tracking method

Info

Publication number
CN104199974A
Authority
CN
China
Prior art keywords
theme
microblogging
data
time interval
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410488391.6A
Other languages
Chinese (zh)
Inventor
闫碧莹
邓攀
余雷
赵鑫
袁伟
万安格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SINOPARADOFT (BEIJING) PARALLEL SOFTWARE Co Ltd
Original Assignee
SINOPARADOFT (BEIJING) PARALLEL SOFTWARE Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SINOPARADOFT (BEIJING) PARALLEL SOFTWARE Co Ltd filed Critical SINOPARADOFT (BEIJING) PARALLEL SOFTWARE Co Ltd
Priority to CN201410488391.6A priority Critical patent/CN104199974A/en
Publication of CN104199974A publication Critical patent/CN104199974A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a microblog-oriented dynamic topic detection and evolution tracking method, belonging to the technical field of intelligent information processing. The method comprises the steps of: 1, building a distributed crawler to acquire microblog data; 2, pre-processing the microblog data; 3, performing Chinese word segmentation and removing stop words to obtain a word set VOC; 4, performing LDA (latent Dirichlet allocation) clustering on the microblog data of each time interval to extract latent topics; 5, screening out the hot microblog topics in each time interval; 6, performing hierarchical clustering on the hot topics over the global time span to obtain the aggregation and differentiation relations between topics; 7, visualizing the topic evolution process according to those relations. The method mines, at low time complexity, the topic-word distribution of an event in different periods and the fine-grained sub-topics of the same topic in different periods; it is efficient and robust and has considerable practical value.

Description

A microblog-oriented dynamic topic detection and evolution tracking method
Technical field
The invention belongs to the field of intelligent information processing, and specifically relates to a microblog-oriented dynamic topic detection and evolution tracking method.
Background art
With the explosive growth of text on the Internet, it is increasingly difficult for people to obtain information about topics or events of interest from massive text in a timely manner. Topic Detection and Tracking (TDT) aims to organize streams of text by event, and has produced a series of core techniques that address this need. Topic tracking is one of the subtasks of TDT; it helps people collect and organize scattered information and understand a topic as a whole. Topic evolution analysis, a newer research direction within TDT, builds on topic tracking: it aims to discover the relations between the events within a topic and the course of the topic's development, and to present the full story of the topic clearly to the user.
The mainstream approach to topic evolution at present is the Dynamic Topic Model (DTM). Its main idea is to use the posterior distribution of the model parameters at the current time as the condition for the model parameters at the next time. Viewed globally, the whole topic evolution model is still a graphical model, but deriving the model parameters is more difficult. In addition, because the data are processed as a whole, a single modeling pass yields the topic representation at all time points, but the model cannot add new text online: when new text arrives, the data must be discretized again and remodeled globally.
Compared with DTM, the Dynamic Mixture Model (DMM) makes a stronger temporal assumption. In DTM, the texts within each time window are exchangeable, whereas in DMM texts arrive strictly in chronological order, with only one text arriving at each moment. From this angle, DMM is an online topic evolution model. Although DMM can compute the probability distribution of the same topic at different times in one pass, it has the following problems:
1. The three-dimensional matrix occupies too much memory. Because the topic distribution must be solved for each time period, a three-dimensional topic-time-word matrix has to be built, which greatly increases memory usage.
2. Earlier results are not reused for new text; when new text arrives, everything must be recomputed.
In practical applications, both topic detection and the analysis of topic evolution trends must run in real time. The difficulty is that the volume of document data is very large and the data types are complex, including text from news, forums, blogs and other sources. The above methods all rest on specific assumptions, can only analyze and mine small amounts of experimental data, and cannot meet the needs of practical applications.
Summary of the invention
To address the inability of existing topic detection systems to analyze and compute topic evolution trends, the present invention uses hierarchical clustering to compute, in real time, the similarity relations between topics in different time periods. This allows the evolution of topics over time to be analyzed and a topic evolution graph to be drawn. Specifically, a microblog-oriented dynamic topic detection and evolution tracking method is proposed. The concrete steps are:
Step 1: build a distributed crawler and acquire microblog data.
Step 2: pre-process the microblog data.
Pre-processing comprises denoising and deduplication: removing microblog data whose text length is below a length threshold L, duplicate data, advertising content, auto-reply data and website links. The microblog data include both the post text and its comments.
Step 3: perform Chinese word segmentation on all microblog data, remove stop words, obtain the segmentation result, and form the word set VOC.
Step 4: perform LDA clustering separately on the microblog data of each time interval to extract latent topics.
The word set formed by Chinese word segmentation of a piece of microblog data (post text plus comments) is treated as one document. A document-topic model is built over all documents in each time interval to extract topics and to obtain, for each document, its probability of corresponding to each topic and, for each topic, its probability of generating each word.
Step 5: screen out the hot microblog topics in each time interval.
For the microblog topics obtained in step 4, the topics that each document belongs to are computed for all documents in each time interval; the topics are then ranked by the number of documents assigned to them, and the hot topics are selected.
Step 6: based on the LDA clustering result, perform hierarchical clustering on the hot topics over the global time span, and obtain the aggregation and differentiation relations between the hot topics.
Based on the probability distributions over the word set VOC of the topics obtained in step 4, the hot topics of all time intervals obtained in step 5 are merged and clustered hierarchically over the global time span. The differentiation and aggregation between topics are then computed from the hierarchical clustering result.
Step 7: visualize the topic evolution process according to the aggregation and differentiation relations of the topics.
The advantages and positive effects of the present invention are:
(1) The microblog-oriented dynamic topic detection and evolution tracking method is efficient and robust and has considerable practical value.
(2) The method mines the fine-grained sub-topics of the same topic in different periods.
(3) The method mines the topic-word distribution of an event in different periods at low time complexity.
Description of the drawings
Fig. 1 is a flowchart of the microblog-oriented dynamic topic detection and evolution tracking method of the present invention.
Fig. 2 is a flowchart of the data pre-processing steps in the present invention.
Fig. 3 is a flowchart of Chinese word segmentation in the present invention.
Fig. 4 is a flowchart of hierarchical clustering in the present invention.
Fig. 5 is a schematic diagram of the selected hot topics in the present invention.
Fig. 6 shows the evolution of topics over time in the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and specific embodiments.
The present invention clusters the documents of different times separately based on a topic model, applies hierarchical clustering to the topics of different time intervals, and builds the aggregation and differentiation relations between topics. Topic popularity and influence are taken as the entry point for studying topic characteristics, with emphasis on analyzing the trends of hot topics and high-influence topics.
Patent 200710062943.7 is similar to the present invention. The difference is that, when it subsequently computes similarity across time points, it only compares the topics of two adjacent time points. If a topic dies out for a period and reappears later, it cannot be captured, nor can the differentiation and aggregation of topics.
The present invention is a method for automatically computing topic evolution trends on microblogs. It collects Internet text, gathering a corpus from Sina Weibo, and pre-processes it, including extracting the text and filtering stop words, formula symbols and the like. Topic events are detected per time segment: topic distributions are computed separately at each time point, the topic results of all stages are merged, and hierarchical clustering is applied. By clustering the topics of different time slices hierarchically, topic tracking and evolution analysis are completed, and a topic evolution graph is visualized from the clustering result.
The concrete steps of the microblog-oriented dynamic topic detection and evolution tracking method are shown in Fig. 1:
Step 1: build a distributed crawler and acquire microblog data.
Sina Weibo data are obtained through the Sina API (Application Programming Interface). The present invention uses one month of microblog data; each item of microblog data contains the post text and its comments.
Step 2: pre-process the microblog data.
Pre-processing comprises denoising and deduplication. The concrete steps applied to the microblog data are shown in Fig. 2:
Step 201: remove microblog data whose text length is below the length threshold L;
Specifically, a small program automatically filters out microblog data whose text length is below the length threshold L. The value of L is set empirically or according to the specific field. In this embodiment L is preferably 5;
Step 202: remove duplicate microblog data;
Because microblogs contain a large amount of repeated data, a Bloom filter algorithm or the Simhash algorithm is used to filter out the duplicates.
Step 203: remove advertising content contained in the microblog data;
An advertising-word matching rule base containing commonly used advertising words is set up and used to remove the advertising data contained in the microblog data. Regular expressions are written to match any word in the rule base; the regular expressions depend on the concrete templates.
Step 204: remove auto-reply microblog data that follow specific network reply templates;
Regular expressions matching automatic reply content are set according to the specific network reply templates, and the auto-reply microblog data that match them are removed.
Step 205: remove website links from the microblog data. A regular expression matching website addresses is set and used to remove the website links from the microblog data.
Step 206: repeat step 201: recompute the text length of the microblog data and remove data that no longer meet the length threshold, as a second cleaning pass.
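Steps 201 to 206 can be sketched as a single cleaning pass. The following is a minimal illustration, not the patent's implementation: the advertising words, the auto-reply template and the URL pattern are hypothetical placeholders, an exact-match set stands in for the Bloom filter or Simhash deduplication named in step 202, and an item that matches an advertising word is dropped whole, which is one possible reading of step 203.

```python
import re

# All patterns and word lists below are illustrative assumptions.
MIN_LENGTH = 5  # length threshold L (the embodiment's preferred value)
AD_WORDS = ["限时抢购", "优惠券", "buy now"]  # hypothetical ad-word rule base
AD_PATTERN = re.compile("|".join(re.escape(w) for w in AD_WORDS), re.IGNORECASE)
AUTO_REPLY_PATTERN = re.compile(r"^(转发微博|自动回复[:：].*)$")  # hypothetical template
URL_PATTERN = re.compile(r"https?://\S+")


def clean_microblogs(texts, min_length=MIN_LENGTH):
    """Apply steps 201-206 in order and return the surviving texts."""
    seen = set()  # exact-match stand-in for a Bloom filter or Simhash
    cleaned = []
    for text in texts:
        text = text.strip()
        if len(text) < min_length:                # step 201: too short
            continue
        if text in seen:                          # step 202: duplicate
            continue
        seen.add(text)
        if AD_PATTERN.search(text):               # step 203: advertising
            continue
        if AUTO_REPLY_PATTERN.match(text):        # step 204: automatic reply
            continue
        text = URL_PATTERN.sub("", text).strip()  # step 205: strip links
        if len(text) < min_length:                # step 206: second length pass
            continue
        cleaned.append(text)
    return cleaned
```

The second length check matters because stripping a link can leave a post with too little text to be useful, which is why step 206 repeats step 201.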
Step 3: perform Chinese word segmentation on all microblog data, remove stop words, obtain the segmentation result, and form the word set VOC.
The detailed process is shown in Fig. 3:
Step 301: perform Chinese word segmentation on the microblog data and remove stop words at the same time; a Chinese word segmenter is called to segment the microblog data, and stop words are removed in the same pass;
Step 302: normalize the morphology of English words in the microblog data into a unified form;
The English words contained in the segmentation result of step 301 are converted to a unified form. This includes unifying tense to the simple present and unifying voice to the active voice.
Step 303: compute the document frequency df and the term frequency tf of each word; that is, for each word in the segmentation result obtained in step 302, compute its document frequency df and term frequency tf;
Document frequency df: the number of documents in which the word occurs, divided by the total number of documents in the document set;
Term frequency tf: the number of times the word occurs in a document, divided by the total number of words in that document.
Step 304: compute the feature strength ft of each word; for each word in the segmentation result obtained in step 302, compute its feature strength ft, defined as:
ft = \log\left( \frac{tf}{idf + 1} + 1 \right)
where idf denotes the inverse document frequency, i.e. the reciprocal of the document frequency df;
Step 305: extract the words whose feature strength ft is greater than the feature strength threshold T to form the word set VOC.
Using the feature strength ft computed in step 304, the words whose ft exceeds the threshold T are selected; all such words in the microblog data make up the word set VOC. The threshold T is chosen according to the specific application.
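Steps 303 to 305 can be sketched as follows. This is a hedged illustration only: it reads the feature-strength formula as ft = log(tf / (idf + 1) + 1), which is an assumption about the intended layout of the formula, and it computes tf over the whole corpus rather than per document, which is a simplification the text does not state. The function name `build_voc` is hypothetical.

```python
import math
from collections import Counter


def build_voc(documents, threshold):
    """documents: list of token lists (already segmented, stop words removed)."""
    total_docs = len(documents)
    total_tokens = sum(len(doc) for doc in documents)
    term_counts = Counter(tok for doc in documents for tok in doc)
    doc_counts = Counter(tok for doc in documents for tok in set(doc))

    strengths = {}
    for word, count in term_counts.items():
        df = doc_counts[word] / total_docs  # share of documents containing the word
        tf = count / total_tokens           # corpus-level term frequency (assumption)
        idf = 1.0 / df                      # reciprocal of df, as the text defines it
        # Assumed reading of the formula: log(tf / (idf + 1) + 1)
        strengths[word] = math.log(tf / (idf + 1.0) + 1.0)

    voc = {word for word, ft in strengths.items() if ft > threshold}
    return voc, strengths
```

Under this reading, a word scores highly when it is both frequent and spread across many documents, so the threshold T mainly removes rare words.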
Step 4: perform LDA clustering separately on the microblog data of each time interval to extract latent topics.
In this embodiment, each item of microblog data, expressed over the word set VOC, is treated as one document d. For any time interval period, all documents d within the interval form the document set D. Suppose each document d contains n words; its word sequence is < w1, w2, ..., wn >, where wi denotes the i-th word.
A document-topic model is built over all documents in each time interval to obtain the topic set T and extract the topics, giving the probability that each document corresponds to each topic and the probability that each topic generates each word.
The document-topic model is an LDA topic model based on Gibbs sampling. In each time interval, the document set D of that interval is clustered, and the latent topic set T is mined, denoted < t1, t2, ..., tk >. Each extracted topic is referred to as a topic; this embodiment selects k topics, and ti denotes the i-th topic.
The two result vectors obtained by LDA clustering are expressed as follows:
For any document d, the probabilities of corresponding to the different topics are θ_d = < P_t1, P_t2, ..., P_tk >, where P_ti denotes the probability that document d corresponds to the i-th topic.
For any topic, the probabilities of generating the different words form a vector < P_w1, P_w2, ... > over the words of VOC, where P_wi denotes the probability that the topic generates the i-th word.
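To make the two result vectors concrete, here is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a sketch under stated assumptions rather than the embodiment's implementation: the hyperparameters `alpha` and `beta`, the iteration count, and the names `theta` and `phi` are not given in the text, and a production system would normally use an existing LDA library.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of lists of word ids in range(vocab_size)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, t)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(t, w)
    topic_total = [0] * num_topics                              # n(t)
    assign = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]  # remove the token's current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z  # record the resampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # theta[d][t]: probability that document d corresponds to topic t
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    # phi[t][w]: probability that topic t generates word w
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi
```

Running the sampler once per time interval gives one `theta` and one `phi` per interval, which are the inputs the later steps rely on.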
Step 5: screen out the hot microblog topics in each time interval;
The concrete computation steps are as follows:
Step 501: for all documents in each time interval, compute the topics they belong to.
A document-topic association strength threshold R is set. Among the microblog topics obtained in step 4, for a document d, if its probability P_ti of corresponding to the i-th topic exceeds R, then document d belongs to topic ti. A document d may belong to several topics at the same time.
Step 502: rank all topics by the number of documents d assigned to them, and take the top N topics as hot topics. N is determined from the distribution of document counts across all topics; in the present invention N is preferably 20. Fig. 5 is a schematic diagram of the 20 hot topics.
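Steps 501 and 502 amount to a thresholded count followed by a ranking. A minimal sketch, assuming the document-topic probabilities are available as a list of rows (the function and parameter names are hypothetical):

```python
def select_hot_topics(theta, threshold_r, top_n):
    """theta[d][t] is the probability that document d corresponds to topic t."""
    num_topics = len(theta[0]) if theta else 0
    counts = [0] * num_topics
    for row in theta:
        for t, p in enumerate(row):
            if p > threshold_r:  # a document may count toward several topics
                counts[t] += 1
    # Rank topics by how many documents were assigned to them.
    ranked = sorted(range(num_topics), key=lambda t: counts[t], reverse=True)
    return ranked[:top_n], counts
```

For example, with R = 0.4 a document whose probabilities are 0.5 and 0.45 for two topics counts toward both, matching the statement that one document may belong to several topics.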
Fig. 6 shows the distribution of the top 20 hot topics over time, where the horizontal axis represents time and the vertical axis represents topic popularity after LDA clustering. Each curve in the figure corresponds to one hot topic. The figure shows each topic's course from emergence to extinction, or its fluctuations in between.
Step 6: based on the LDA clustering result, perform hierarchical clustering on the hot topics over the global time span, and obtain the aggregation and differentiation relations between the hot topics.
The concrete steps are shown in Fig. 4:
Step 601: assemble the hot topics of every time interval over the global time span.
The hot topics of each time interval obtained in step 5 are merged across all time intervals to form the global set of hot topics.
Step 602: perform hierarchical clustering on the hot topics to obtain the cluster result.
The probability distribution over words generated by each topic, obtained in step 4, is extracted, giving each topic's probability distribution over the word set VOC in each time interval. Based on these distributions, all hot topics over the global time span are clustered hierarchically, yielding the cluster result for all hot topics across the time intervals.
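Step 602 can be illustrated with a small agglomerative clustering routine over the topic-word distributions. The text does not name a distance measure, a linkage rule or a stopping criterion, so the choices below (Jensen-Shannon divergence, average linkage, a distance cut-off `max_distance`) are assumptions made only for this sketch, and it is not the algorithm of the embodiment.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (natural log)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_topics(distributions, max_distance):
    """Average-linkage agglomerative clustering; returns one label per topic."""
    n = len(distributions)
    clusters = [[i] for i in range(n)]
    dist = {
        (i, j): js_divergence(distributions[i], distributions[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

    def linkage(a, b):
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best[0] > max_distance:  # nothing similar enough is left to merge
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Because the hot topics of all intervals are clustered together, two topics from different intervals receive the same label whenever their word distributions are close, which is what the later relation rules depend on.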
Step 603: from the hierarchical clustering result, combined with time information, obtain the aggregation and differentiation relations of the hot topics.
The aggregation and differentiation relations of hot topics are defined as follows:
If topics t1 and t2 are topics in two consecutive time intervals period1 and period2 respectively, and t1 and t2 belong to the same cluster, then t2 is regarded as having evolved from topic t1;
If topics t1 and t2 are two topics in time interval period1, topic t3 is a topic in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t3 is regarded as possibly formed by the aggregation of topics t1 and t2;
If topic t1 is a topic in time interval period1, topics t2 and t3 are two topics in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t2 and t3 are regarded as having differentiated from topic t1;
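The three rules can be sketched as a lookup over (cluster, interval) groups. In this hedged illustration each topic is a hypothetical `(topic_id, period_index, cluster_label)` tuple, consecutive intervals are those whose indices differ by one, and the two-topic cases in the text are generalized to "two or more", which is an assumption.

```python
from collections import defaultdict


def derive_relations(topics):
    """topics: iterable of (topic_id, period_index, cluster_label) tuples."""
    by_key = defaultdict(list)
    for topic_id, period, label in topics:
        by_key[(label, period)].append(topic_id)

    relations = []
    for (label, period), earlier in sorted(by_key.items()):
        later = by_key.get((label, period + 1))  # same cluster, next interval
        if not later:
            continue
        if len(earlier) == 1 and len(later) == 1:
            relations.append(("evolution", earlier, later))
        elif len(earlier) > 1 and len(later) == 1:
            relations.append(("aggregation", earlier, later))
        elif len(earlier) == 1 and len(later) > 1:
            relations.append(("differentiation", earlier, later))
        # Many-to-many groups are not defined in the text and are skipped here.
    return relations
```

Each returned triple names the relation type, the topics in the earlier interval and the topics in the later interval.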
Step 7: visualize the topic evolution process according to the aggregation and differentiation relations of the topics.
The aggregation and differentiation relations obtained are expressed as a topological network, which shows the tracked evolution of the hot topics. HTML5 and the d3.js data visualization library are used to render the dynamic aggregation and differentiation of topics, visualizing the evolution tracking of hot topics.
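On the data side, a topological network of this kind is often handed to a browser as a JSON object with a node list and a link list. The sketch below prepares such an object in Python; the `nodes`/`links` layout and the field names are assumptions borrowed from common force-directed graph examples, not something the text specifies, and the input format of `(relation_type, earlier_ids, later_ids)` triples is hypothetical.

```python
import json


def to_graph_json(relations):
    """relations: list of (relation_type, earlier_topic_ids, later_topic_ids)."""
    nodes = {}
    links = []
    for kind, earlier, later in relations:
        for topic_id in earlier + later:
            nodes.setdefault(topic_id, {"id": topic_id})  # one node per topic
        for source in earlier:
            for target in later:
                # One directed edge from each earlier topic to each later topic.
                links.append({"source": source, "target": target, "type": kind})
    graph = {"nodes": list(nodes.values()), "links": links}
    return json.dumps(graph, ensure_ascii=False)
```

The front end can then style each edge by its `type` field to distinguish evolution, aggregation and differentiation.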
Embodiment: the following algorithm is adopted for hierarchical clustering

Claims (5)

1. A microblog-oriented dynamic topic detection and evolution tracking method, characterized by comprising the following steps:
Step 1: build a distributed crawler and acquire microblog data;
Step 2: pre-process the microblog data;
Step 3: perform Chinese word segmentation on all microblog data, remove stop words, obtain the segmentation result, and form the word set VOC;
Step 4: perform LDA clustering separately on the microblog data of each time interval to extract latent topics;
Each item of microblog data, expressed over the word set VOC, is treated as one document; a document-topic model is built over all documents in each time interval to extract topics and to obtain the probability that each document corresponds to each topic and the probability that each topic generates each word;
Step 5: screen out the hot microblog topics in each time interval;
Step 6: based on the LDA clustering result, perform hierarchical clustering on the hot topics over the global time span, and obtain the aggregation and differentiation relations between the hot topics;
Based on the probability distributions over the word set VOC of the topics obtained in step 4, the hot topics of all time intervals obtained in step 5 are merged and clustered hierarchically over the global time span; the differentiation and aggregation relations between topics are then obtained from the hierarchical clustering result;
Step 7: visualize the topic evolution process according to the aggregation and differentiation relations of the topics.
2. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that the pre-processing comprises denoising and deduplication, specifically removing microblog data whose text length is below a length threshold L, duplicate data, advertising content, auto-reply data and website links, wherein the microblog data include the post text and its comments.
3. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that step 3 specifically comprises:
Step 301: perform Chinese word segmentation on the microblog data and remove stop words at the same time;
Step 302: normalize the morphology of English words in the microblog data into a unified form;
Step 303: compute the document frequency df and the term frequency tf of each word;
Step 304: compute the feature strength ft of each word, defined as:
ft = \log\left( \frac{tf}{idf + 1} + 1 \right)
where idf denotes the inverse document frequency, i.e. the reciprocal of the document frequency df;
Step 305: extract the words whose feature strength ft is greater than the feature strength threshold T to form the word set VOC.
4. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that the concrete steps of step 5 are as follows:
Step 501: for all documents in each time interval, compute the topics they belong to;
A document-topic association strength threshold R is set; among the microblog topics obtained in step 4, for a document d, if its probability P_ti of corresponding to the i-th topic exceeds R, then document d belongs to topic ti, and a document d can belong to several topics at the same time;
Step 502: rank all topics by the number of documents d assigned to them, and take the top N topics as hot topics.
5. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that step 6 specifically comprises:
Step 601: assemble the hot topics of every time interval over the global time span;
The hot topics of each time interval are merged across all time intervals to form the global set of hot topics;
Step 602: perform hierarchical clustering on the hot topics to obtain the cluster result;
The probability distribution over words generated by each topic is extracted, giving each topic's probability distribution over the word set VOC in each time interval; based on these distributions, all hot topics over the global time span are clustered hierarchically, yielding the cluster result for all hot topics across the time intervals;
Step 603: from the hierarchical clustering result, combined with time information, obtain the aggregation and differentiation relations of the topics;
The aggregation and differentiation relations of hot topics are defined as follows:
If topics t1 and t2 are topics in two consecutive time intervals period1 and period2 respectively, and t1 and t2 belong to the same cluster, then t2 is regarded as having evolved from topic t1;
If topics t1 and t2 are two topics in time interval period1, topic t3 is a topic in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t3 is regarded as possibly formed by the aggregation of topics t1 and t2;
If topic t1 is a topic in time interval period1, topics t2 and t3 are two topics in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t2 and t3 are regarded as having differentiated from topic t1.
CN201410488391.6A 2013-09-22 2014-09-22 Microblog-oriented dynamic topic detection and evolution tracking method Pending CN104199974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410488391.6A CN104199974A (en) 2013-09-22 2014-09-22 Microblog-oriented dynamic topic detection and evolution tracking method

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201310432003 2013-09-22
CN201310432003.8 2013-09-22
CN201410488391.6A CN104199974A (en) 2013-09-22 2014-09-22 Microblog-oriented dynamic topic detection and evolution tracking method

Publications (1)

Publication Number Publication Date
CN104199974A true CN104199974A (en) 2014-12-10

Family

ID=52085267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410488391.6A Pending CN104199974A (en) 2013-09-22 2014-09-22 Microblog-oriented dynamic topic detection and evolution tracking method

Country Status (1)

Country Link
CN (1) CN104199974A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809751A (en) * 2015-04-30 2015-07-29 百度在线网络技术(北京)有限公司 Method and device for generating event group evolution diagram
CN105335349A (en) * 2015-08-26 2016-02-17 天津大学 Time window based LDA microblog topic trend detection method and apparatus
CN105389377A (en) * 2015-11-18 2016-03-09 清华大学 Topic mining based event cluster acquisition method
CN105447067A (en) * 2014-09-30 2016-03-30 华东师范大学 Adaptive sampling method for hot spot microblog data in social media
CN105550365A (en) * 2016-01-15 2016-05-04 中国科学院自动化研究所 Visualization analysis system based on text topic model
CN105608217A (en) * 2015-12-31 2016-05-25 中国科学院电子学研究所 Method for displaying hot topics based on remote sensing data
CN105787025A (en) * 2016-02-24 2016-07-20 腾讯科技(深圳)有限公司 Network platform public account classifying method and device
CN106484724A (en) * 2015-08-31 2017-03-08 富士通株式会社 Information processor and information processing method
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 Hierarchical clustering method and system for massive document sets
CN106951554A (en) * 2017-03-29 2017-07-14 浙江大学 Method for mining and visualizing hierarchical news hotspots and their evolution
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 Text information clustering method and text information clustering system
CN107644089A (en) * 2017-09-26 2018-01-30 武大吉奥信息技术有限公司 Hot event extraction method based on network media
CN107908698A (en) * 2017-11-03 2018-04-13 广州索答信息科技有限公司 Topic web crawler method, electronic device, storage medium and system
CN108052636A (en) * 2017-12-20 2018-05-18 北京工业大学 Method, apparatus and terminal device for determining text topic relevance
CN108108353A (en) * 2017-12-19 2018-06-01 北京邮电大学 Video semantic annotation method, apparatus and electronic device based on bullet-screen comments
CN108241610A (en) * 2016-12-26 2018-07-03 上海神计信息***工程有限公司 Online topic detection method and system for text streams
CN108717421A (en) * 2018-04-23 2018-10-30 深圳市城市规划设计研究院有限公司 Social media text topic extraction method and system based on spatio-temporal change
CN109684480A (en) * 2018-12-30 2019-04-26 杭州翼兔网络科技有限公司 Industry-based clustering method
CN109710936A (en) * 2018-12-27 2019-05-03 中电科大数据研究院有限公司 Cross-level government document bulletin topic analysis method
CN109739988A (en) * 2018-12-30 2019-05-10 杭州翼兔网络科技有限公司 Industry popularity acquisition method
CN109859808A (en) * 2018-07-25 2019-06-07 武汉心络科技有限公司 Medical data acquisition method and system
CN109885760A (en) * 2019-01-22 2019-06-14 上海交通大学 Information source tracing method and system based on user interest
CN110096704A (en) * 2019-04-29 2019-08-06 扬州大学 Dynamic topic discovery algorithm for short text streams
CN110222172A (en) * 2019-05-15 2019-09-10 北京邮电大学 Multi-source network public opinion topic mining method based on improved hierarchical clustering
CN110428102A (en) * 2019-07-31 2019-11-08 杭州电子科技大学 Major event trend forecasting method based on HC-TC-LDA
CN112380342A (en) * 2020-11-10 2021-02-19 福建亿榕信息技术有限公司 Electric power document theme extraction method and device
CN112597269A (en) * 2020-12-25 2021-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Stream data event text topic and detection system
US11244013B2 (en) 2018-06-01 2022-02-08 International Business Machines Corporation Tracking the evolution of topic rankings from contextual data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101571853A (en) * 2009-05-22 2009-11-04 哈尔滨工程大学 Evolution analysis device and method for contents of network topics
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment
US20130151531A1 (en) * 2011-12-13 2013-06-13 Xerox Corporation Systems and methods for scalable topic detection in social media

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
崔凯: "基于LDA的主题演化研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
张婷: "科学传播研究的可视化分析", 《中国博士学位论文全文数据库 经济与管理科学辑》 *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447067A (en) * 2014-09-30 2016-03-30 华东师范大学 Adaptive sampling method for hot spot microblog data in social media
CN104809751A (en) * 2015-04-30 2015-07-29 百度在线网络技术(北京)有限公司 Method and device for generating event group evolution diagram
CN104809751B (en) * 2015-04-30 2017-11-24 百度在线网络技术(北京)有限公司 Method and device for generating event group evolution diagram
CN105335349A (en) * 2015-08-26 2016-02-17 天津大学 Time window based LDA microblog topic trend detection method and apparatus
CN106484724A (en) * 2015-08-31 2017-03-08 富士通株式会社 Information processor and information processing method
CN105389377A (en) * 2015-11-18 2016-03-09 清华大学 Topic mining based event cluster acquisition method
CN105389377B (en) * 2015-11-18 2019-02-05 清华大学 Topic mining based event cluster acquisition method
CN105608217A (en) * 2015-12-31 2016-05-25 中国科学院电子学研究所 Method for displaying hot topics based on remote sensing data
CN105608217B (en) * 2015-12-31 2019-03-26 中国科学院电子学研究所 Method for displaying hot topics based on remote sensing data
CN105550365A (en) * 2016-01-15 2016-05-04 中国科学院自动化研究所 Visualization analysis system based on text topic model
CN105787025B (en) * 2016-02-24 2021-07-09 腾讯科技(深圳)有限公司 Network platform public account classification method and device
CN105787025A (en) * 2016-02-24 2016-07-20 腾讯科技(深圳)有限公司 Network platform public account classifying method and device
CN107133238A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 Text information clustering method and text information clustering system
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN106815310A (en) * 2016-12-20 2017-06-09 华南师范大学 Hierarchical clustering method and system for massive document sets
CN108241610A (en) * 2016-12-26 2018-07-03 上海神计信息***工程有限公司 Online topic detection method and system for text streams
CN106951554B (en) * 2017-03-29 2021-04-20 浙江大学 Hierarchical news hotspot and evolution mining and visualization method thereof
CN106951554A (en) * 2017-03-29 2017-07-14 浙江大学 Method for mining and visualizing hierarchical news hotspots and their evolution
CN107644089A (en) * 2017-09-26 2018-01-30 武大吉奥信息技术有限公司 Hot event extraction method based on network media
CN107644089B (en) * 2017-09-26 2020-08-04 武大吉奥信息技术有限公司 Hot event extraction method based on network media
CN107908698A (en) * 2017-11-03 2018-04-13 广州索答信息科技有限公司 Topic web crawler method, electronic device, storage medium and system
CN107908698B (en) * 2017-11-03 2021-04-13 广州索答信息科技有限公司 Topic web crawler method, electronic device, storage medium and system
CN108108353B (en) * 2017-12-19 2020-11-10 北京邮电大学 Video semantic annotation method, apparatus and electronic device based on bullet-screen comments
CN108108353A (en) * 2017-12-19 2018-06-01 北京邮电大学 Video semantic annotation method, apparatus and electronic device based on bullet-screen comments
CN108052636B (en) * 2017-12-20 2022-02-25 北京工业大学 Method, apparatus and terminal device for determining text topic relevance
CN108052636A (en) * 2017-12-20 2018-05-18 北京工业大学 Method, apparatus and terminal device for determining text topic relevance
CN108717421A (en) * 2018-04-23 2018-10-30 深圳市城市规划设计研究院有限公司 Social media text topic extraction method and system based on spatio-temporal change
US11244013B2 (en) 2018-06-01 2022-02-08 International Business Machines Corporation Tracking the evolution of topic rankings from contextual data
CN109859808A (en) * 2018-07-25 2019-06-07 武汉心络科技有限公司 Medical data acquisition method and system
CN109710936A (en) * 2018-12-27 2019-05-03 中电科大数据研究院有限公司 Cross-level government document bulletin topic analysis method
CN109739988A (en) * 2018-12-30 2019-05-10 杭州翼兔网络科技有限公司 Industry popularity acquisition method
CN109684480B (en) * 2018-12-30 2021-01-05 北京人民在线网络有限公司 Industry-based clustering method
CN109684480A (en) * 2018-12-30 2019-04-26 杭州翼兔网络科技有限公司 Industry-based clustering method
CN109885760B (en) * 2019-01-22 2020-12-29 上海交通大学 Information tracing method and system based on user interests
CN109885760A (en) * 2019-01-22 2019-06-14 上海交通大学 Information source tracing method and system based on user interest
CN110096704A (en) * 2019-04-29 2019-08-06 扬州大学 Dynamic topic discovery algorithm for short text streams
CN110096704B (en) * 2019-04-29 2023-05-05 扬州大学 Dynamic topic discovery method for short text streams
CN110222172B (en) * 2019-05-15 2021-03-16 北京邮电大学 Multi-source network public opinion topic mining method based on improved hierarchical clustering
CN110222172A (en) * 2019-05-15 2019-09-10 北京邮电大学 Multi-source network public opinion topic mining method based on improved hierarchical clustering
CN110428102A (en) * 2019-07-31 2019-11-08 杭州电子科技大学 Major event trend forecasting method based on HC-TC-LDA
CN110428102B (en) * 2019-07-31 2021-11-09 杭州电子科技大学 HC-TC-LDA-based major event trend prediction method
CN112380342A (en) * 2020-11-10 2021-02-19 福建亿榕信息技术有限公司 Electric power document theme extraction method and device
CN112597269A (en) * 2020-12-25 2021-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Stream data event text topic and detection system

Similar Documents

Publication Publication Date Title
CN104199974A (en) Microblog-oriented dynamic topic detection and evolution tracking method
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN103793501B (en) Topic community discovery method based on social networks
CN103812872A (en) Network water army behavior detection method and system based on mixed Dirichlet process
CN104239539A (en) Microblog information filtering method based on multi-information fusion
CN104077417A (en) Figure tag recommendation method and system in social network
CN103530603A (en) Video abnormality detection method based on causal loop diagram model
CN103793489A (en) Method for discovering topics of communities in on-line social network
CN104504024A (en) Method and system for mining keywords based on microblog content
US9183598B2 (en) Identifying event-specific social discussion threads
CN103823792A (en) Method and equipment for detecting hotspot events from text document
Farseev et al. bbridge: A big data platform for social multimedia analytics
CN109271488A (en) Method and system for discovering causal relationships between social network users by combining behavior sequences and text information
CN109376231A (en) Media hotspot tracking method and system
CN104536830A (en) KNN text classification method based on MapReduce
CN105678626B (en) Method and device for mining overlapped communities
Rao et al. An optimal machine learning model based on selective reinforced Markov decision to predict web browsing patterns
CN104199947A (en) Important person speech supervision and incidence relation excavating method
CN103761246A (en) Link network based user domain identifying method and device
CN110069686A (en) User behavior analysis method, apparatus, computer device and storage medium
CN105243095A (en) Microblog text based emotion classification method and system
CN112765313A (en) False information detection method based on original text and comment information analysis algorithm
CN104636324A (en) Topic tracing method and system
Yu et al. Mining hidden interests from twitter based on word similarity and social relationship for OLAP

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141210