CN108197144A - A hot topic discovery method based on BTM and Single-pass - Google Patents

A hot topic discovery method based on BTM and Single-pass

Info

Publication number
CN108197144A
CN108197144A
Authority
CN
China
Prior art keywords
topic
data
cluster
talked
btm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711210195.2A
Other languages
Chinese (zh)
Other versions
CN108197144B (en)
Inventor
许国艳
夭荣朋
张网娟
平萍
朱帅
李敏佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201711210195.2A priority Critical patent/CN108197144B/en
Publication of CN108197144A publication Critical patent/CN108197144A/en
Application granted granted Critical
Publication of CN108197144B publication Critical patent/CN108197144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a hot topic discovery method based on BTM and Single-pass. The method first performs topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set, then vectorizes the result using VSM, applies an improved Single-pass algorithm to cluster the vectorized results, and consolidates the cluster results into a new cluster result. The hot topic discovery method is additionally parallelized to speed up topic mining on large data volumes. The invention handles the sparsity of microblog data well and scales to massive data; the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, handles newly arriving data effectively, and analyzes the lasting influence of hot topics well; and, by improving data-processing efficiency through the MapReduce framework, topic discovery quality is maintained.

Description

A hot topic discovery method based on BTM and Single-pass
Technical field
The present invention relates to a hot topic discovery method based on BTM and Single-pass, and belongs to text clustering in the field of data mining.
Background technology
With the spread of smartphones and the Internet, people can follow the latest major national and social events at any moment through microblog apps. Hot microblog topic discovery is therefore of great value in fields such as business and scientific research, and more and more scholars are carrying out related research on microblogs.
Traditional hot topic discovery typically relies on the LDA topic model and algorithms such as K-Means. However, the traditional LDA model is mainly aimed at long texts: it performs poorly on short texts such as microblog posts, and it struggles with the data sparsity and strong contextual relevance that characterize microblog data.
To handle large data sets, traditional hot topic discovery techniques have hit a bottleneck. First, the amount of data to be processed during hot topic discovery is huge, and a single host or processor handles it slowly and laboriously. Second, topic mining with a plain BTM model over the data is too slow. Finally, modeling with the BTM topic model alone does not account for the streaming nature of microblog data, so its classification performance needs improvement. Hot microblog topic discovery in a distributed environment is therefore particularly important.
Invention content
Object of the invention: In view of the deficiencies of the prior art, the present invention proposes a hot topic discovery method based on BTM and Single-pass that is suited to short, sparse, streaming text data; at the same time, the method adapts to large data volumes and speeds up topic mining.
Technical solution: The present invention provides a hot topic discovery method based on BTM and Single-pass, comprising a Mapper stage and a Reducer stage of the MapReduce framework;
The Mapper stage specifically includes:
(1) preprocessing the input data set D;
(2) evenly distributing the preprocessed data set D over C nodes, each node holding a fixed quota of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set;
(4) vectorizing the result with VSM on each node;
(5) performing cluster analysis on each node with the improved Single-pass algorithm to obtain partial topics;
(6) outputting the partial topics;
The Reducer stage specifically includes:
(1) inputting the initial cluster results from each node;
(2) choosing the cluster result of the master node as the initial cluster centers;
(3) clustering the cluster results of the remaining nodes against those of the master node with the improved Single-pass algorithm, finally obtaining the hot topics;
(4) outputting the hot topics.
Preferably, in step (5), the cluster analysis specifically includes:
1) dividing the data set D into n data slices D1, D2, ..., Dn of a certain size and feeding the slices in as input data in order; each slice is clustered internally on its own, yielding a cluster result per slice;
2) selecting slice D1 as the first part and clustering it internally to obtain the cluster result of the first part;
3) for slices D2, ..., Dn, clustering each slice internally before input, obtaining a cluster result per slice;
4) computing the similarity between each cluster center obtained from slice D2 and each already-existing cluster center from slice D1, denoted sim(D2di, D1di), where D1di and D2di denote cluster centers from slices D1 and D2 respectively;
5) selecting the maximum similarity value max = max(sim(D2di, D1di)) and comparing it with a threshold c: if max = max(sim(D2di, D1di)) >= c, D2di is merged into the class with which it has the highest similarity; if max = max(sim(D2di, D1di)) < c, a new topic is created with D2di as its cluster center;
6) consolidating the cluster results of slices D1 and D2 into a new cluster result;
7) repeating 4), 5) and 6) until all data in the data set have been processed, obtaining the final result.
Preferably, the internal clustering uses the classic Single-pass algorithm.
Preferably, a word pair refers to two arbitrary, unordered, distinct words that co-occur in the same data slice after the data set has been preprocessed.
Preferably, in step (3), the topic modeling uses the Gibbs sampling method; each sampling iteration uses the word-pair set of each node to obtain the topic matrix, i.e., for each word pair b = (wi, wj) ∈ BC, its probability under topic k in the BTM model is computed.
Advantageous effects: Compared with the prior art, the remarkable advantages of the present invention are: 1. by working on word pairs, the BTM topic model solves the data-sparsity problem well, and its processing capability for this data is better than that of the LDA model; 2. K-means is common in traditional document clustering, but microblog data is a form of stream data that K-means cannot cluster well, so the present invention analyzes it with the Single-pass incremental clustering algorithm; 3. the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, handles newly arriving data effectively, and analyzes the lasting influence of hot topics well; 4. the present invention further applies distributed parallel processing to the hot topic discovery method based on BTM and Single-pass, improving data-processing efficiency through the MapReduce framework while maintaining topic discovery quality.
Description of the drawings
Fig. 1 is the flow chart of the improved Single-pass algorithm;
Fig. 2 is the parallelization flow chart of the hot topic discovery algorithm of the present invention;
Fig. 3 is the topology of the BTM model;
Fig. 4 is the basic flow chart of the hot topic discovery algorithm based on BTM and Single-pass in the present invention;
Fig. 5 shows the F values of BTM under different values of K;
Fig. 6 is the analysis chart of the β value in the BTM model;
Fig. 7 is a schematic diagram of performance under different thresholds;
Fig. 8 shows partial topic experimental results;
Fig. 9 compares the experiments of the three methods;
Fig. 10 is the hot-word trend chart used by the present invention;
Fig. 11 compares single-machine and distributed running times;
Fig. 12 plots running time against the number of servers.
Specific embodiment
The present invention provides a hot topic discovery method based on BTM and Single-pass. The method is suited to short, sparse, streaming text data; at the same time, to adapt the method to large data volumes and speed up topic mining, the computation is parallelized. The main steps of the method are: (1) cluster analysis with the improved Single-pass algorithm; (2) MapReduce distributed parallel processing of the hot topic discovery method based on BTM and Single-pass.
(1) Cluster analysis with the improved Single-pass algorithm
As in Fig. 1, the data set D is divided into multiple data slices of a certain size, i.e. D1, D2, ..., Dn, which are fed in sequentially after decomposition.
1) The slices D1, D2, ..., Dn serve as input data in order; each part first clusters internally on its own, in a manner similar to the classic Single-pass algorithm, yielding a cluster result per part;
2) slice D1 is selected as the first part and clustered with the classic Single-pass algorithm, obtaining the cluster result of this part;
3) the parts D2, ..., Dn are each clustered internally before input, again with the classic Single-pass algorithm, obtaining a cluster result per part;
4) each cluster center obtained from D2 is compared for similarity with each cluster center of the already-existing part, where D1di and D2di denote cluster centers of D1 and D2 respectively; a word pair here means two arbitrary, unordered, distinct words that co-occur in the same fragment after the data set has been segmented;
5) the maximum similarity value max = max(sim(D2di, D1di)) is selected and compared with a threshold c, whose value may differ per situation: if the maximum similarity is greater than or equal to the threshold, i.e. max = max(sim(D2di, D1di)) >= c, the center is merged into the class with which it has the highest similarity; if the maximum similarity is below the threshold, i.e. max = max(sim(D2di, D1di)) < c, a new topic is created with the center as a new cluster center;
6) the cluster results of the two parts D1 and D2 are consolidated into a new cluster result;
7) steps 4), 5) and 6) are repeated until all data have been processed, obtaining the final result.
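The slice-then-merge procedure in steps 1) to 7) can be sketched as follows. This is a minimal illustration, not the patented implementation: the vector representation, the center-update rule (running mean), and cosine similarity are assumptions.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small epsilon against zero vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def single_pass(vectors, threshold):
    """Classic single-pass: assign each vector to the most similar
    existing cluster centre, or open a new cluster below the threshold."""
    centres, members = [], []
    for v in vectors:
        sims = [cosine(v, c) for c in centres]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            members[k].append(np.asarray(v, dtype=float))
            centres[k] = np.mean(members[k], axis=0)  # recompute the centre
        else:
            centres.append(np.asarray(v, dtype=float))
            members.append([np.asarray(v, dtype=float)])
    return centres

def improved_single_pass(data, n_slices, threshold):
    """Improved single-pass: cluster each slice internally first, then
    merge each slice's centres into the accumulated global centres."""
    slices = np.array_split(np.asarray(data, dtype=float), n_slices)
    global_centres = single_pass(slices[0], threshold)   # slice D1 seeds the result
    for part in slices[1:]:                              # slices D2..Dn
        for centre in single_pass(part, threshold):      # internal clustering
            sims = [cosine(centre, g) for g in global_centres]
            if sims and max(sims) >= threshold:
                k = int(np.argmax(sims))
                global_centres[k] = (global_centres[k] + centre) / 2.0
            else:
                global_centres.append(centre)            # max < c: new topic
    return global_centres
```

Because only slice centres (not every document) are compared against the global centres, the merge step touches far fewer points than re-running single-pass over the whole data set, which is the complexity reduction the description claims.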
(2) MapReduce distributed parallel processing of the hot topic discovery method based on BTM and Single-pass
1) As in Fig. 2, all preprocessed data D are evenly distributed over C nodes, so each node handles roughly D/C of the data volume. Each node holds a certain amount of word pairs b = (wi, wj) ∈ BC; for the whole data set D, the global word-pair set is B = {B1, B2, ..., Bc}.
2) The local data are sampled with the Gibbs sampling method of the BTM model; each iteration uses the word-pair set of each node to obtain a topic matrix, i.e., for each word pair b = (wi, wj) ∈ BC, its probability under topic k is computed.
3) Each node vectorizes the above result on its own host, in preparation for the clustering below.
4) Each node first clusters on its own; the cluster result of the master node is chosen as the initial cluster centers, and the improved Single-pass algorithm merges the cluster results of the remaining nodes with those of the master node, yielding the result.
In the BTM model, all word pairs in the data set share the same topic probability distribution, where a topic is understood as a probability distribution over different words. BTM builds on the LDA model and the unigram mixture model: it learns the topics of short, sparse text through the word pairs generated over the entire data set and the associations between words, these associations being expressed by word co-occurrence. The BTM topic model therefore learns topics from the co-occurring words of the entire corpus.
The structure of the BTM topic model is shown in Fig. 3, where θ denotes the global topic distribution of the data set, φ denotes the probability distribution of words under a single topic, K denotes the number of topics, |B| denotes the number of word pairs in the data set, z denotes the latent topic, and Wi, Wj are the two distinct words of a word pair b. α and β are the hyperparameters of the Dirichlet priors over θ and φ, respectively.
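The word pairs (biterms) that BTM models are unordered pairs of distinct words co-occurring in the same short document. A minimal sketch of biterm extraction under that definition (the function name is illustrative):

```python
from itertools import combinations

def extract_biterms(docs):
    """Extract the biterm set: every unordered pair of distinct words
    that co-occurs within the same (short) document."""
    biterms = []
    for words in docs:
        # sorted(set(...)) removes duplicates and fixes a canonical pair order
        for wi, wj in combinations(sorted(set(words)), 2):
            biterms.append((wi, wj))
    return biterms
```

For short microblog posts every in-document pair is kept, which is why BTM sidesteps the sparsity that defeats per-document topic models such as LDA.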
As in Fig. 4, the hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
Step 1: preprocess the acquired data set and remove noise data that contribute nothing to hot topic discovery, such as stop words, hyperlinks and special characters;
Step 2: segment the data with the Chinese word segmentation tool NLPIR of the Institute of Computing Technology, Chinese Academy of Sciences;
Step 3: obtain the relevant information of the required feature items, keeping related verbs, nouns, topic labels, timestamps, etc. as feature items; deduplicate words and count word frequencies;
Step 4: model the above data with the BTM model, computing the document-topic values p(z | d) and topic-word values p(w | z) of the data set; for every document di in the data set, compute its topic influence with the formula Inf(di), and compute the weight values with the formula wk(di), looping k from 0 to K, where K is the number of feature words.
(1) Inf(di) quantifies the influence power of a microblog post:

Inf(di) = α · Ncom(di)/MAX{Ncom(dj)} + β · Nrep(di)/MAX{Nrep(dj)} + γ · Nsup(di)/MAX{Nsup(dj)}    (Formula 1)

where Ncom denotes the comment count of a microblog post, Nrep its repost count and Nsup its like count; MAX{Ncom(dj)}, MAX{Nrep(dj)} and MAX{Nsup(dj)} denote the maximum comment, repost and like counts over the document collection. α, β, γ are parameters satisfying α + β + γ = 1.
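A sketch of the influence score as a weighted sum of comment, repost and like counts, each normalised by the corpus-wide maximum. The dictionary keys and the default weight split are assumptions for illustration; the patent only requires α + β + γ = 1.

```python
def influence(d, corpus, alpha=0.4, beta=0.3, gamma=0.3):
    """Inf(d): weighted sum of comment ('com'), repost ('rep') and like
    ('sup') counts, each normalised by the corpus maximum.
    Weights must satisfy alpha + beta + gamma = 1."""
    max_com = max(x["com"] for x in corpus)
    max_rep = max(x["rep"] for x in corpus)
    max_sup = max(x["sup"] for x in corpus)
    return (alpha * d["com"] / max_com
            + beta * d["rep"] / max_rep
            + gamma * d["sup"] / max_sup)
```

The normalisation keeps each term in [0, 1], so Inf(d) itself lies in [0, 1] and is comparable across posts.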
(2) TF-IDF is a common feature-weighting technique, calculated as

TFij-IDFi = TFij × IDFi    (Formula 2)

where the term frequency TF denotes the frequency with which a given feature word occurs in a document (the larger the value, the more important the word), and the inverse document frequency IDF denotes in how many documents of the whole corpus the given feature word occurs. TF and IDF are calculated as

TFij = nij / Σk nkj,    IDFi = log( |D| / (1 + |{ j : ti ∈ dj }|) )

where nij is the frequency of the given word in the document, Σk nkj is the total number of words in that document, |D| is the total number of documents, and |{ j : ti ∈ dj }| is the number of documents containing the feature word, to which 1 is added to avoid a zero denominator.
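The TF and IDF formulas above translate directly into code; this is a minimal sketch operating on tokenised documents (lists of words), with the +1 smoothing in the IDF denominator as the text specifies.

```python
import math

def tf(term, doc):
    """TF: frequency of the term in the document, normalised by length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """IDF: log of total documents over (1 + documents containing the term);
    the +1 guards against a zero denominator."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + n_containing))

def tf_idf(term, doc, docs):
    # Formula 2: the product of the two factors
    return tf(term, doc) * idf(term, docs)
```

Note that with this smoothing, a word appearing in every document gets a slightly negative IDF, which still ranks it below rarer words.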
Step 5: vectorize the above results with the VSM vector space model, representing each document di as a weight vector di = (wi1, wi2, ..., wiK) whose components are the TF-IDF weights of its feature words, obtaining a result matrix M.
Step 6: apply the improved Single-pass algorithm described above to matrix M for incremental clustering, obtaining the final result.
The parallelized hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
Parallelization for the large data volume of microblogs mainly consists of the algorithm design of two parts, the Mapper stage and the Reducer stage of the MapReduce framework.
(1) The Mapper stage
Step 1: input the data set D, the number of topics K, and the parameters α, β;
Step 2: preprocess the acquired data set and remove noise data that contribute nothing to hot topic discovery, such as stop words, hyperlinks and special characters;
Step 3: evenly distribute all preprocessed data D over C nodes, each node holding a certain amount of word pairs b = (wi, wj) ∈ BC, and randomly assign a topic k to each word pair b;
Step 4: Zi ← Zi + Zj; on host i, the global topic-word matrix is the sum of those of the other hosts and the local one. Traverse all C nodes to obtain the global topic-word matrix of every host.
Step 5: each iteration uses the word-pair set of each node to obtain the topic matrix, i.e., for each word pair b = (wi, wj) ∈ BC, the probability under topic k is

P(z = k | z¬b, BC) ∝ (nk|C + α) · (nwi|k + β)(nwj|k + β) / (Σw nw|k + Nβ)²

where nk|C denotes the number of word pairs on process C assigned topic number k, nwi|k and nwj|k denote the numbers of times words wi and wj are assigned topic number k, N denotes the size of the full dictionary of the data set, and α, β denote the prior parameters;
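A sketch of the unnormalised Gibbs conditional for assigning one biterm to topic k, assuming the standard BTM form of this update (the count containers and function name are illustrative, not from the patent):

```python
def btm_topic_prob(k, wi, wj, n_k, n_wk, N, alpha, beta):
    """Unnormalised BTM Gibbs conditional for biterm (wi, wj) under topic k.
    n_k[k]: biterms currently on topic k; n_wk[k][w]: count of word w under
    topic k; N: vocabulary size; alpha, beta: Dirichlet priors."""
    denom = sum(n_wk[k].values()) + N * beta
    return ((n_k[k] + alpha)
            * (n_wk[k].get(wi, 0) + beta)
            * (n_wk[k].get(wj, 0) + beta)
            / (denom ** 2))
```

In a full sampler one would evaluate this for every k, normalise, draw a topic, and update the counts; here only the per-topic score is shown.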
Step 6: on each node, vectorize the result on the local host with VSM;
Step 7: on each node, first perform local clustering with the improved Single-pass algorithm, obtaining partial topics;
Step 8: output the partial topics.
(2) The Reducer stage
Step 1: input the initial cluster results from each node;
Step 2: choose the cluster result of the master node as the initial cluster centers;
Step 3: with the improved Single-pass algorithm, cluster the cluster results of the remaining nodes against those of the master node, finally obtaining the hot topics;
Step 4: output the hot topics.
Experimental analysis
The experimental data of the present invention is a real data set from Sina Weibo collected by a crawler; the crawled data mainly contain 20,000 records on topics such as "Alphago" and "the Rio Olympic Games".
There is currently no unified evaluation criterion for hot microblog topic discovery methods, so the present invention tests with the following evaluation indices.
From the standpoint of accuracy, the evaluation indices published by NIST (National Institute of Standards and Technology) are used, including precision P, recall R, F value, miss rate Pmiss and false-alarm rate PFA:

P = a / (a + b),  R = a / (a + c),  F = 2PR / (P + R),  Pmiss = c / (a + c),  PFA = b / (b + d)

where a is the number of relevant microblog texts found by detection, b the number of irrelevant microblog texts found by detection, c the number of relevant microblog texts not detected, and d the number of irrelevant microblog texts not detected.
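The five NIST-style indices follow mechanically from the contingency counts a, b, c, d; a minimal sketch:

```python
def topic_metrics(a, b, c, d):
    """a: relevant detected; b: irrelevant detected;
    c: relevant missed; d: irrelevant not detected."""
    P = a / (a + b)            # precision
    R = a / (a + c)            # recall
    F = 2 * P * R / (P + R)    # F value (harmonic mean of P and R)
    p_miss = c / (a + c)       # miss rate: 1 - R
    p_fa = b / (b + d)         # false-alarm rate
    return P, R, F, p_miss, p_fa
```

Note Pmiss is simply 1 - R, which is why the threshold experiments below can trade the miss rate against the false-alarm rate.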
Value analysis of the parameters of the BTM algorithm model
The BTM model is the first step of topic identification on the microblog corpus in the hot topic discovery method of the present invention, and the result obtained with BTM strongly influences the subsequent incremental clustering, so the present invention first analyzes the values of the parameters of the BTM topic model. The BTM topic model is an unsupervised model; the number of topics K of the data set must be set before modeling, and models with different topic counts estimate differently and affect BTM's performance, so the value of K is determined first. This experiment tries K = 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15 and compares the runs using the F value as the evaluation index; the results are shown in Fig. 5.
As can be seen, the F value trends downward as the number of topics K gradually increases, and the effect is best when K is 7. Because microblogs are short, sparse texts containing few words, comparatively few word pairs are formed; during modeling each word pair is assigned one topic, and setting too many topics may dilute the word-topic probabilities, which in turn affects the document-topic probabilities and the final result. The present invention therefore chooses K = 7.
With the value of K determined, the parameters α and β of BTM are analyzed. The value of α follows the empirical rule α = 50/K, so α is 50/7. With α fixed, the value of β is analyzed; the results are shown in Fig. 6, where the bars from left to right represent precision, recall and F value.
As can be seen, the overall effect is relatively stable when β is 0.01, so the present invention sets β to 0.01. Summing up the two experiments, the parameter values given by the present invention are K = 7, α = 50/7 and β = 0.01.
Threshold selection analysis
The threshold setting has an important influence on the clustering effect of the Single-pass incremental clustering method, and the cluster result in turn has an important influence on the final topic discovery, so threshold setting is an important aspect. This experiment evaluates threshold selection with two indices, the miss rate Pmiss and the false-alarm rate PFA. Portions of the topic data above are selected for testing, with 1000 records per topic. The results for different thresholds are shown in Fig. 7.
The experiment shows that the false-alarm rate is inversely related to the threshold while the miss rate is directly related to it: as the threshold value grows, the false-alarm rate decreases gradually and the miss rate increases. The overall effect is comparatively good when the threshold is 0.4, so the present invention sets the threshold to 0.4.
Comparison with other hot topic discovery methods
The present invention mines hot topics with BTM and the improved Single-pass algorithm, representing each topic by a keyword set; owing to space limits, partial test results for the "Alphago" topic are shown in Fig. 8.
To verify the validity of the proposed method, it is compared against the BTM topic model and the BTK method; since the F value combines the P and R indices, this experiment uses the F value as the evaluation index, with results shown in Fig. 9.
As Fig. 9 shows, the cluster results differ across topics, which is mainly caused by the topics themselves. For example, the cluster result of the "Rio Olympic Games" topic is comparatively poor because the Olympic Games involve many different events, including "diving", "table tennis", "volleyball" and so on; covering so many events makes word co-occurrence within BTM word pairs less pronounced, which in turn lowers the clustering effect.
The present invention also compares against the hot-word trends of the Weibo Index provided by the microblog platform; a hot-word trend intuitively shows the tendency and frequency of a keyword over a period of time. Since the "Rio Olympic Games" data contain several different hot topics, they are suitable for comparison; data from August 5 to 21, 2016 are selected. The keywords obtained for the "Rio Olympic Games" data with the method of the present invention are:
Table 1: Rio Olympic Games keywords
Fig. 10 shows the hot-word trends during the Rio Olympic Games. Since the frequency of "women's volleyball" on August 21 is far higher than that of the other hot words, the women's volleyball series is moved to a secondary axis so that the trends of the other hot words display better. Fig. 10 shows that "first gold", "prehistoric power", "Horton", "women's volleyball" and "Zhang Jike" surged in heat within a period of the Games and became hot topics, while "mudslide" and "scrambled eggs with tomato" are not Games-specific keywords and thus occur with low frequency. This very intuitively proves the validity of the proposed algorithm.
The experimental results show that the hot topic discovery algorithm based on BTM and Single-pass performs better than the other methods, and the F values prove that the hot topic discovery algorithm based on BTM and Single-pass is effective and feasible.
Parallelization experimental analysis of the hot topic discovery algorithm based on BTM and Single-pass
This experiment uses 4 servers in total: one machine serves as the Master node, deploying the NameNode and JobTracker, while the other 3 servers serve as Slave nodes, deploying the DataNode and TaskTracker. The operating system on all servers is Ubuntu 14.04, the Hadoop version used is 2.6.5, and the JDK version is jdk-8u121-linux-i586.
The experimental data set of the present invention is 5 days of Sina Weibo microblog data from January 5 to January 10, 2017, crawled with a crawler tool, about 1.1 GB in size. The obtained data are in XML form; their content is extracted and uploaded for distributed storage through the open-source HDFS Java toolkit.
This experiment parallelizes the hot topic discovery algorithm based on BTM and Single-pass, so its evaluation is weighed from two aspects: the quality of hot topic discovery and the speed of data processing. The Coherence value is used as the index measuring the quality of hot topic discovery.
The Coherence value is calculated as

C(z, V(z)) = Σ_{t=2..T} Σ_{l=1..t-1} log( (D(v_t(z), v_l(z)) + 1) / D(v_l(z)) )

where V(z) = (V1(z), V2(z), ..., VT(z)) denotes the top T words under a known topic z, sorted from high to low by the value of the probability p(w | z); D(v) denotes the number of documents in which word v occurs, and D(v, v') the number in which the two words occur together. The quality of hot topic discovery is proportional to the value of C(z, V(z)): the larger the value, the better the quality.
To evaluate the quality of topic discovery in the single-machine environment and the distributed environment, the present invention takes the average Coherence over all topics as the evaluation criterion, calculated as

Coherence_avg = (1/K) Σ_{z=1..K} C(z, V(z))
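A sketch of the coherence score and its average over topics, assuming precomputed document frequencies and co-document frequencies (the container shapes are illustrative):

```python
import math

def coherence(top_words, doc_freq, co_doc_freq):
    """C(z, V): sum over ordered word pairs (t > l) of
    log((D(v_t, v_l) + 1) / D(v_l)), with top_words already sorted
    by p(w|z) in descending order."""
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            vt, vl = top_words[t], top_words[l]
            # co-occurrence counts are symmetric; try both key orders
            co = co_doc_freq.get((vt, vl), co_doc_freq.get((vl, vt), 0))
            score += math.log((co + 1) / doc_freq[vl])
    return score

def avg_coherence(per_topic_scores):
    """Average coherence across all K topics."""
    return sum(per_topic_scores) / len(per_topic_scores)
```

The +1 inside the log keeps the argument positive when two top words never co-occur; higher (less negative) scores indicate more coherent topics.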
Where topic discovery quality is similar, data-processing speed is measured by the time taken to process the data.
Performance analysis
Topic discovery quality
Under the same configuration, the present invention sets the size of T to 30 and uses the Coherence value to compare the quality of topic discovery in the single-machine environment and the distributed environment; the experimental results are shown in the table below:
Table 2: Comparison of Coherence values
The results in Table 2 show that the Coherence values in the single-machine and distributed environments are similar; in the distributed environment the Coherence value slowly decreases as the number of nodes grows, but on the whole the topic discovery quality differs little.
Topic discovery speed
To better illustrate the influence of the distributed environment on topic discovery speed, the present invention runs two groups of comparative experiments: the first keeps the number of nodes in the distributed environment constant and varies the data volume; the second keeps the data volume constant and varies the number of nodes.
For the first group of experiments, the experimental environment is 4 servers, 1 serving as the Master node and the rest as Slave nodes; the experimental data are divided into four sizes, 256 MB, 512 MB, 768 MB and 1024 MB, and the time taken is measured in minutes. The experimental results are shown in the table below:
Table 3: Time taken for topic discovery on a single machine and on distributed Hadoop
To show more intuitively how the single-machine and distributed environments change as the data volume grows, the results are also plotted as a chart, as shown in Fig. 11.
Table 3 shows that as the amount of processed data increases, the gap between the time taken in the Hadoop distributed environment and on the single machine gradually widens, so distributed Hadoop is more efficient for processing jobs with large data volumes.
The second group of experiments keeps the data size constant and varies the number of nodes in the distributed setup to compare the time taken. A data size of 1024 MB is chosen, the node count increments from 1 to 4, and the experimental results are shown in Fig. 12.
Fig. 12 intuitively shows that when processing data of the same size, the time taken keeps decreasing as the number of Hadoop nodes grows. Of course, as the node count grows, the Slave nodes exchange data with the Master node more and more often, and the rate at which the time decreases slows; a reasonable number of nodes should therefore be set.
Experiments verify that, with little difference in topic discovery quality, the parallelized algorithm markedly improves topic discovery speed and reduces the time taken to process the data.

Claims (5)

1. A hot topic discovery method based on BTM and Single-pass, characterized in that the method uses the distributed parallel processing of MapReduce and includes a Mapper stage and a Reducer stage of the MapReduce framework;
The Mapper stage specifically includes:
(1) preprocessing the input data set D;
(2) evenly assigning the preprocessed data set D to C nodes, each node containing a certain number of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set;
(4) vectorizing the results with VSM on each node;
(5) performing cluster analysis with the improved Single-pass algorithm on each node to obtain partial topics;
(6) outputting the partial topics;
The Reducer stage specifically includes:
(1) inputting the initial clustering results from each node;
(2) choosing the clustering result of the master node as the initial cluster centers;
(3) clustering the clustering results of the remaining nodes with that of the master node using the improved Single-pass algorithm, finally obtaining the hot topics;
(4) outputting the hot topics.
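For illustration only, the Mapper/Reducer flow of claim 1 can be sketched as a plain-Python simulation. The function names, the cosine similarity measure, and the simple representative-vector clustering below are assumptions standing in for the improved Single-pass algorithm of claims 4 and 5; a real deployment would run inside Hadoop MapReduce rather than over in-memory lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two document/topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_single_pass(vectors, threshold):
    """Stand-in local clustering: keep one representative vector per cluster."""
    reps = []
    for v in vectors:
        sims = [cosine(v, r) for r in reps]
        if not (sims and max(sims) >= threshold):
            reps.append(v)  # opens a new local cluster
    return reps

def mapper(node_vectors, threshold):
    # Mapper steps (4)-(6): vectorized documents of one node -> partial topics
    return cluster_single_pass(node_vectors, threshold)

def reducer(partial_topics_per_node, threshold):
    # Reducer steps (1)-(4): the master node's clusters seed the centers,
    # the other nodes' partial topics are merged in one pass
    master, *others = partial_topics_per_node  # step (2)
    result = list(master)
    for topics in others:  # step (3)
        for t in topics:
            sims = [cosine(t, r) for r in result]
            if not (sims and max(sims) >= threshold):
                result.append(t)  # dissimilar partial topic becomes a new topic
    return result  # hot topics, step (4)
```

As a usage sketch, two simulated nodes each produce partial topics via `mapper`, and `reducer` merges them into the final hot-topic list.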
2. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (2) of the Mapper stage, a word pair refers to any two different unordered words that co-occur in the same data slice after the data set has been preprocessed.
3. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (3) of the Mapper stage, the topic modeling uses Gibbs sampling as the sampling method; the sampling process iterates over the word-pair tuple set of each node to obtain the topic matrix, i.e., for each word pair b=(wi, wj)∈BC, its probability under topic k in the BTM model is computed.
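The per-pair topic probability of claim 3 can be illustrated with the standard BTM Gibbs-sampling conditional from the original BTM literature; the patent does not print its exact formula here, so the expression, count arrays, and names below (n_k, n_wk, n_dot_k, symmetric priors alpha and beta, vocabulary size W) are assumptions, not the claimed computation verbatim.

```python
def btm_topic_prob(k, wi, wj, n_k, n_wk, n_dot_k, alpha, beta, W):
    """Unnormalized P(z = k | b = (wi, wj)) for one word pair (biterm).

    n_k[k]     : number of biterms currently assigned to topic k
    n_wk[w][k] : number of times word w is assigned to topic k
    n_dot_k[k] : total word assignments to topic k (= 2 * n_k[k])
    """
    return ((n_k[k] + alpha)
            * (n_wk[wi][k] + beta) * (n_wk[wj][k] + beta)
            / ((n_dot_k[k] + W * beta + 1) * (n_dot_k[k] + W * beta)))

def sample_topic_dist(wi, wj, K, n_k, n_wk, n_dot_k, alpha, beta, W):
    """Normalized topic distribution for the pair (wi, wj) over K topics."""
    weights = [btm_topic_prob(k, wi, wj, n_k, n_wk, n_dot_k, alpha, beta, W)
               for k in range(K)]
    total = sum(weights)
    return [w / total for w in weights]
```

In one Gibbs iteration, a topic is drawn from this distribution for each word pair and the count arrays are updated before moving to the next pair.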
4. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (5) of the Mapper stage, the cluster analysis specifically includes:
(51) dividing the data set D into n data slices D1, D2, ..., Dn at a certain scale and taking the data slices in order as input data; each data slice is clustered internally on its own, yielding the clustering result of each data slice;
(52) selecting the data slice D1 as the first part and clustering D1 internally to obtain the clustering result of the first part;
(53) for the data slices D2, ..., Dn, clustering each data slice internally before input to obtain the clustering result of each data slice;
(54) computing the similarity between each cluster center obtained from the data slice D2 and each existing cluster center of the data slice D1, denoted sim(D2di, D1di), where D1di and D2di respectively denote different word pairs in the data slices D1 and D2;
(55) selecting the maximum similarity value max = max(sim(D2di, D1di)) and judging whether max exceeds the threshold c: if max = max(sim(D2di, D1di)) ≥ c, assigning D2di to the class with which its similarity value is largest; if max = max(sim(D2di, D1di)) < c, creating a new topic with D2di as its cluster center;
(56) combining the clustering results of the data slices D1 and D2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set have been processed, obtaining the final result.
5. The hot topic discovery method based on BTM and Single-pass according to claim 4, characterized in that the internal clustering uses the Single-pass algorithm.
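Steps (51)-(57) of claim 4 can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity stands in for the unspecified similarity measure, and each cluster's first member is used as its center; the actual claimed method may compute centers differently.

```python
import math

def sim(a, b):
    """Cosine similarity between two cluster-center vectors (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def internal_cluster(vectors, c):
    """Single-pass inside one data slice (claim 5); a cluster is a list of vectors."""
    clusters = []
    for v in vectors:
        sims = [sim(v, cl[0]) for cl in clusters]
        if sims and max(sims) >= c:
            clusters[sims.index(max(sims))].append(v)  # join the best cluster
        else:
            clusters.append([v])  # open a new cluster
    return clusters

def merge_slices(slices, c):
    """Steps (51)-(57): slice-by-slice merge of cluster centers."""
    result = internal_cluster(slices[0], c)               # step (52)
    for piece in slices[1:]:                              # step (53)
        for cluster in internal_cluster(piece, c):
            center = cluster[0]
            sims = [sim(center, r[0]) for r in result]    # step (54)
            if sims and max(sims) >= c:                   # step (55): max >= c
                result[sims.index(max(sims))].extend(cluster)
            else:                                         # max < c: new topic
                result.append(cluster)
        # result now holds the merged clustering, step (56)
    return result                                         # loop = step (57)
```

With a high threshold c, vectors near an existing center are absorbed into that topic, while dissimilar slice centers open new topics, matching the max ≥ c / max < c branch of step (55).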
CN201711210195.2A 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass Active CN108197144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711210195.2A CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711210195.2A CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Publications (2)

Publication Number Publication Date
CN108197144A true CN108197144A (en) 2018-06-22
CN108197144B CN108197144B (en) 2021-02-09

Family

ID=62573247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711210195.2A Active CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Country Status (1)

Country Link
CN (1) CN108197144B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109509110A (en) * 2018-07-27 2019-03-22 Fuzhou University Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * 2018-12-18 2019-05-07 University of Electronic Science and Technology of China Short text topic clustering method based on fused BTM model
CN110046260A (en) * 2019-04-16 2019-07-23 Guangzhou University Knowledge-graph-based darknet topic discovery method and system
CN110134958A (en) * 2019-05-14 2019-08-16 Nanjing University Short text topic mining method based on semantic word network
CN110297988A (en) * 2019-07-06 2019-10-01 Sichuan University Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111090811A (en) * 2019-12-24 2020-05-01 Beijing Institute of Technology Method and system for extracting massive news hot topics
CN113378558A (en) * 2021-05-25 2021-09-10 Zhejiang University of Technology RESTful API document topic distribution extraction method based on representative word pairs
CN115718680A (en) * 2023-01-09 2023-02-28 Jiangling Motors Corp., Ltd. Data reading method, system, computer and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745000A (en) * 2014-01-24 2014-04-23 Fuzhou University Hot topic detection method for Chinese microblogs
EP2068526B1 (en) * 2007-11-06 2014-04-30 Intel Corporation End-to-end network security with traffic visibility
CN104834632A (en) * 2015-05-13 2015-08-12 Beijing University of Technology Microblog topic detection and hotspot evaluation method based on semantic expansion
CN106599029A (en) * 2016-11-02 2017-04-26 Focus Technology Co., Ltd. Chinese short text clustering method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2068526B1 (en) * 2007-11-06 2014-04-30 Intel Corporation End-to-end network security with traffic visibility
CN103745000A (en) * 2014-01-24 2014-04-23 Fuzhou University Hot topic detection method for Chinese microblogs
CN104834632A (en) * 2015-05-13 2015-08-12 Beijing University of Technology Microblog topic detection and hotspot evaluation method based on semantic expansion
CN106599029A (en) * 2016-11-02 2017-04-26 Focus Technology Co., Ltd. Chinese short text clustering method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BO HUANG ET AL.: "Microblog Topic Detection Based on LDA Model and Single-Pass Clustering", International Conference on Rough Sets and Current Trends in Computing *
LIANG Yanan et al.: "Microblog Topic Discovery Based on Topic Model Modeling", China Master's Theses Full-text Database, Information Science and Technology *
WANG Yu: "Research on Hot Topic Discovery of Network Public Opinion Based on Cloud Computing", China Master's Theses Full-text Database, Information Science and Technology *
TAN Zhen: "Research on Hot Topic Discovery and Evolution Analysis Based on MapReduce", Wanfang Data Knowledge Service Platform *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109509110A (en) * 2018-07-27 2019-03-22 Fuzhou University Microblog hot topic discovery method based on improved BBTM model
CN109509110B (en) * 2018-07-27 2021-08-31 Fuzhou University Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * 2018-12-18 2019-05-07 University of Electronic Science and Technology of China Short text topic clustering method based on fused BTM model
CN110046260A (en) * 2019-04-16 2019-07-23 Guangzhou University Knowledge-graph-based darknet topic discovery method and system
CN110134958A (en) * 2019-05-14 2019-08-16 Nanjing University Short text topic mining method based on semantic word network
CN110134958B (en) * 2019-05-14 2021-05-18 Nanjing University Short text topic mining method based on semantic word network
CN110297988A (en) * 2019-07-06 2019-10-01 Sichuan University Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111090811A (en) * 2019-12-24 2020-05-01 Beijing Institute of Technology Method and system for extracting massive news hot topics
CN111090811B (en) * 2019-12-24 2023-09-01 Beijing Institute of Technology Massive news hot topic extraction method and system
CN113378558A (en) * 2021-05-25 2021-09-10 Zhejiang University of Technology RESTful API document topic distribution extraction method based on representative word pairs
CN113378558B (en) * 2021-05-25 2024-04-16 Zhejiang University of Technology RESTful API document topic distribution extraction method based on representative word pairs
CN115718680A (en) * 2023-01-09 2023-02-28 Jiangling Motors Corp., Ltd. Data reading method, system, computer and readable storage medium

Also Published As

Publication number Publication date
CN108197144B (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN108197144A (en) A kind of much-talked-about topic based on BTM and Single-pass finds method
Liu et al. Text features extraction based on TF-IDF associating semantic
EP2092419B1 (en) Method and system for high performance data metatagging and data indexing using coprocessors
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN106599054B (en) Method and system for classifying and pushing questions
US9449096B2 (en) Identifying influencers for topics in social media
WO2017097231A1 (en) Topic processing method and device
CN111581949B (en) Method and device for disambiguating name of learner, storage medium and terminal
CN107066555A (en) Towards the online topic detection method of professional domain
EP3539018A1 (en) Apparatus and method for semantic search
CN107608953B (en) Word vector generation method based on indefinite-length context
Shetty et al. Automatic extractive text summarization using K-means clustering
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN105404677B (en) A kind of search method based on tree structure
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
Kim et al. A web service for author name disambiguation in scholarly databases
CN104536830A (en) KNN text classification method based on MapReduce
Backes Effective unsupervised author disambiguation with relative frequencies
CN105740448A (en) Topic-oriented multi-microblog time sequence abstracting method
Long et al. Tcsst: transfer classification of short & sparse text using external data
Zhou et al. The survey of large-scale query classification
Ye et al. An End‐to‐End Rumor Detection Model Based on Feature Aggregation
CN105426490B (en) A kind of indexing means based on tree structure
Sailaja et al. An overview of pre-processing text clustering methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant