CN108197144A - A hot topic discovery method based on BTM and Single-pass - Google Patents

A hot topic discovery method based on BTM and Single-pass

Info

Publication number
CN108197144A
CN108197144A
Authority
CN
China
Prior art keywords
topic
data
cluster
talked
btm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711210195.2A
Other languages
Chinese (zh)
Other versions
CN108197144B (en)
Inventor
许国艳
夭荣朋
张网娟
平萍
朱帅
李敏佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201711210195.2A priority Critical patent/CN108197144B/en
Publication of CN108197144A publication Critical patent/CN108197144A/en
Application granted granted Critical
Publication of CN108197144B publication Critical patent/CN108197144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a hot topic discovery method based on BTM and Single-pass. The method first performs topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set, then vectorizes the result using VSM, applies an improved Single-pass algorithm to cluster the vectorized results, and consolidates the cluster results into a new cluster result. The hot topic discovery method is additionally parallelized to speed up topic mining on large data volumes. The invention handles the sparsity of microblog data well and scales to massive data; the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, handles newly arriving data effectively, and analyzes the lasting influence of hot topics well; and, by improving data-processing efficiency through the MapReduce framework, topic discovery quality is maintained.

Description

A hot topic discovery method based on BTM and Single-pass
Technical field
The present invention relates to a hot topic discovery method based on BTM and Single-pass, and belongs to text clustering in the field of data mining.
Background technology
With the spread of smartphones and the Internet, people can follow the latest major national and social events at any moment through microblog apps. Hot microblog topic discovery is therefore of great value in fields such as business and scientific research, and more and more scholars are carrying out related research on microblogs.
Traditional hot topic discovery typically relies on the LDA topic model and algorithms such as K-Means. However, the traditional LDA model is mainly aimed at long texts: it performs poorly on short texts such as microblog posts, and it struggles with the data sparsity and strong contextual relevance that characterize microblog data.
To handle large data sets, traditional hot topic discovery techniques have hit a bottleneck. First, the amount of data to be processed during hot topic discovery is huge, and a single host or processor handles it slowly and laboriously. Second, topic mining with a plain BTM model over the data is too slow. Finally, modeling with the BTM topic model alone does not account for the streaming nature of microblog data, so its classification performance needs improvement. Hot microblog topic discovery in a distributed environment is therefore particularly important.
Invention content
Object of the invention: In view of the deficiencies of the prior art, the present invention proposes a hot topic discovery method based on BTM and Single-pass that is suited to short, sparse, streaming text data; at the same time, the method adapts to large data volumes and speeds up topic mining.
Technical solution: The present invention provides a hot topic discovery method based on BTM and Single-pass, comprising a Mapper stage and a Reducer stage of the MapReduce framework;
The Mapper stage specifically includes:
(1) preprocessing the input data set D;
(2) evenly distributing the preprocessed data set D over C nodes, each node holding a fixed quota of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set;
(4) vectorizing the result with VSM on each node;
(5) performing cluster analysis on each node with the improved Single-pass algorithm to obtain partial topics;
(6) outputting the partial topics;
The Reducer stage specifically includes:
(1) inputting the initial cluster results from each node;
(2) choosing the cluster result of the master node as the initial cluster centers;
(3) clustering the cluster results of the remaining nodes against those of the master node with the improved Single-pass algorithm, finally obtaining the hot topics;
(4) outputting the hot topics.
Preferably, in step (5), the cluster analysis specifically includes:
1) dividing the data set D into n data slices D1, D2, ..., Dn of a certain size and feeding the slices in as input data in order; each slice is clustered internally on its own, yielding a cluster result per slice;
2) selecting slice D1 as the first part and clustering it internally to obtain the cluster result of the first part;
3) for slices D2, ..., Dn, clustering each slice internally before input, obtaining a cluster result per slice;
4) computing the similarity between each cluster center obtained from slice D2 and each already-existing cluster center from slice D1, denoted sim(D2di, D1di), where D1di and D2di denote cluster centers from slices D1 and D2 respectively;
5) selecting the maximum similarity value max = max(sim(D2di, D1di)) and comparing it with a threshold c: if max = max(sim(D2di, D1di)) >= c, D2di is merged into the class with which it has the highest similarity; if max = max(sim(D2di, D1di)) < c, a new topic is created with D2di as its cluster center;
6) consolidating the cluster results of slices D1 and D2 into a new cluster result;
7) repeating 4), 5) and 6) until all data in the data set have been processed, obtaining the final result.
Preferably, the internal clustering uses the classic Single-pass algorithm.
Preferably, a word pair refers to two arbitrary, unordered, distinct words that co-occur in the same data slice after the data set has been preprocessed.
Preferably, in step (3), the topic modeling uses the Gibbs sampling method; each sampling iteration uses the word-pair set of each node to obtain the topic matrix, i.e., for each word pair b = (wi, wj) ∈ BC, its probability under topic k in the BTM model is computed.
Advantageous effects: Compared with the prior art, the remarkable advantages of the present invention are: 1. by working on word pairs, the BTM topic model solves the data-sparsity problem well, and its processing capability for this data is better than that of the LDA model; 2. K-means is common in traditional document clustering, but microblog data is a form of stream data that K-means cannot cluster well, so the present invention analyzes it with the Single-pass incremental clustering algorithm; 3. the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, handles newly arriving data effectively, and analyzes the lasting influence of hot topics well; 4. the present invention further applies distributed parallel processing to the hot topic discovery method based on BTM and Single-pass, improving data-processing efficiency through the MapReduce framework while maintaining topic discovery quality.
Description of the drawings
Fig. 1 is the flow chart of the improved Single-pass algorithm;
Fig. 2 is the parallelization flow chart of the hot topic discovery algorithm of the present invention;
Fig. 3 is the topology of the BTM model;
Fig. 4 is the basic flow chart of the hot topic discovery algorithm based on BTM and Single-pass in the present invention;
Fig. 5 shows the F values of BTM under different values of K;
Fig. 6 is the analysis chart of the β value in the BTM model;
Fig. 7 is a schematic diagram of performance under different thresholds;
Fig. 8 shows partial topic experimental results;
Fig. 9 compares the experiments of the three methods;
Fig. 10 is the hot-word trend chart used by the present invention;
Fig. 11 compares single-machine and distributed running times;
Fig. 12 plots running time against the number of servers.
Specific embodiment
The present invention provides a hot topic discovery method based on BTM and Single-pass. The method is suited to short, sparse, streaming text data; at the same time, to adapt the method to large data volumes and speed up topic mining, the computation is parallelized. The main steps of the method are: (1) cluster analysis with the improved Single-pass algorithm; (2) MapReduce distributed parallel processing of the hot topic discovery method based on BTM and Single-pass.
(1) Cluster analysis with the improved Single-pass algorithm
As in Fig. 1, the data set D is divided into multiple data slices of a certain size, i.e. D1, D2, ..., Dn, which are fed in sequentially after decomposition.
1) The slices D1, D2, ..., Dn serve as input data in order; each part first clusters internally on its own, in a manner similar to the classic Single-pass algorithm, yielding a cluster result per part;
2) slice D1 is selected as the first part and clustered with the classic Single-pass algorithm, obtaining the cluster result of this part;
3) the parts D2, ..., Dn are each clustered internally before input, again with the classic Single-pass algorithm, obtaining a cluster result per part;
4) each cluster center obtained from D2 is compared for similarity with each cluster center of the already-existing part, where D1di and D2di denote cluster centers of D1 and D2 respectively; a word pair here means two arbitrary, unordered, distinct words that co-occur in the same fragment after the data set has been segmented;
5) the maximum similarity value max = max(sim(D2di, D1di)) is selected and compared with a threshold c, whose value may differ per situation: if the maximum similarity is greater than or equal to the threshold, i.e. max = max(sim(D2di, D1di)) >= c, the center is merged into the class with which it has the highest similarity; if the maximum similarity is below the threshold, i.e. max = max(sim(D2di, D1di)) < c, a new topic is created with the center as a new cluster center;
6) the cluster results of the two parts D1 and D2 are consolidated into a new cluster result;
7) steps 4), 5) and 6) are repeated until all data have been processed, obtaining the final result.
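The slice-then-merge procedure in steps 1) to 7) can be sketched as follows. This is a minimal illustration, not the patented implementation: the vector representation, the center-update rule (running mean), and cosine similarity are assumptions.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small epsilon against zero vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def single_pass(vectors, threshold):
    """Classic single-pass: assign each vector to the most similar
    existing cluster centre, or open a new cluster below the threshold."""
    centres, members = [], []
    for v in vectors:
        sims = [cosine(v, c) for c in centres]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            members[k].append(np.asarray(v, dtype=float))
            centres[k] = np.mean(members[k], axis=0)  # recompute the centre
        else:
            centres.append(np.asarray(v, dtype=float))
            members.append([np.asarray(v, dtype=float)])
    return centres

def improved_single_pass(data, n_slices, threshold):
    """Improved single-pass: cluster each slice internally first, then
    merge each slice's centres into the accumulated global centres."""
    slices = np.array_split(np.asarray(data, dtype=float), n_slices)
    global_centres = single_pass(slices[0], threshold)   # slice D1 seeds the result
    for part in slices[1:]:                              # slices D2..Dn
        for centre in single_pass(part, threshold):      # internal clustering
            sims = [cosine(centre, g) for g in global_centres]
            if sims and max(sims) >= threshold:
                k = int(np.argmax(sims))
                global_centres[k] = (global_centres[k] + centre) / 2.0
            else:
                global_centres.append(centre)            # max < c: new topic
    return global_centres
```

Because only slice centres (not every document) are compared against the global centres, the merge step touches far fewer points than re-running single-pass over the whole data set, which is the complexity reduction the description claims.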
(2) MapReduce distributed parallel processing of the hot topic discovery method based on BTM and Single-pass
1) As in Fig. 2, all preprocessed data D are evenly distributed over C nodes, so each node handles roughly D/C of the data volume. Each node holds a certain amount of word pairs b = (wi, wj) ∈ BC; for the whole data set D, the global word-pair set is B = {B1, B2, ..., Bc}.
2) The local data are sampled with the Gibbs sampling method of the BTM model; each iteration uses the word-pair set of each node to obtain a topic matrix, i.e., for each word pair b = (wi, wj) ∈ BC, its probability under topic k is computed.
3) Each node vectorizes the above result on its own host, in preparation for the clustering below.
4) Each node first clusters on its own; the cluster result of the master node is chosen as the initial cluster centers, and the improved Single-pass algorithm merges the cluster results of the remaining nodes with those of the master node, yielding the result.
In the BTM model, all word pairs in the data set share the same topic probability distribution, where a topic is understood as a probability distribution over different words. BTM builds on the LDA model and the unigram mixture model: it learns the topics of short, sparse text through the word pairs generated over the entire data set and the associations between words, these associations being expressed by word co-occurrence. The BTM topic model therefore learns topics from the co-occurring words of the entire corpus.
The structure of the BTM topic model is shown in Fig. 3, where θ denotes the global topic distribution of the data set, φ denotes the probability distribution of words under a single topic, K denotes the number of topics, |B| denotes the number of word pairs in the data set, z denotes the latent topic, and Wi, Wj are the two distinct words of a word pair b. α and β are the hyperparameters of the Dirichlet priors over θ and φ, respectively.
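The word pairs (biterms) that BTM models are unordered pairs of distinct words co-occurring in the same short document. A minimal sketch of biterm extraction under that definition (the function name is illustrative):

```python
from itertools import combinations

def extract_biterms(docs):
    """Extract the biterm set: every unordered pair of distinct words
    that co-occurs within the same (short) document."""
    biterms = []
    for words in docs:
        # sorted(set(...)) removes duplicates and fixes a canonical pair order
        for wi, wj in combinations(sorted(set(words)), 2):
            biterms.append((wi, wj))
    return biterms
```

For short microblog posts every in-document pair is kept, which is why BTM sidesteps the sparsity that defeats per-document topic models such as LDA.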
As in Fig. 4, the hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
Step 1: preprocess the acquired data set and remove noise data that contribute nothing to hot topic discovery, such as stop words, hyperlinks and special characters;
Step 2: segment the data with the Chinese word segmentation tool NLPIR of the Institute of Computing Technology, Chinese Academy of Sciences;
Step 3: obtain the relevant information of the required feature items, keeping related verbs, nouns, topic labels, timestamps, etc. as feature items; deduplicate words and count word frequencies;
Step 4: model the above data with the BTM model, computing the document-topic values p(z | d) and topic-word values p(w | z) of the data set; for every document di in the data set, compute its topic influence with the formula Inf(di), and compute the weight values with the formula wk(di), looping k from 0 to K, where K is the number of feature words.
(1) Inf(di) quantifies the influence power of a microblog post:

Inf(di) = α · Ncom(di)/MAX{Ncom(dj)} + β · Nrep(di)/MAX{Nrep(dj)} + γ · Nsup(di)/MAX{Nsup(dj)}    (Formula 1)

where Ncom denotes the comment count of a microblog post, Nrep its repost count and Nsup its like count; MAX{Ncom(dj)}, MAX{Nrep(dj)} and MAX{Nsup(dj)} denote the maximum comment, repost and like counts over the document collection. α, β, γ are parameters satisfying α + β + γ = 1.
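A sketch of the influence score as a weighted sum of comment, repost and like counts, each normalised by the corpus-wide maximum. The dictionary keys and the default weight split are assumptions for illustration; the patent only requires α + β + γ = 1.

```python
def influence(d, corpus, alpha=0.4, beta=0.3, gamma=0.3):
    """Inf(d): weighted sum of comment ('com'), repost ('rep') and like
    ('sup') counts, each normalised by the corpus maximum.
    Weights must satisfy alpha + beta + gamma = 1."""
    max_com = max(x["com"] for x in corpus)
    max_rep = max(x["rep"] for x in corpus)
    max_sup = max(x["sup"] for x in corpus)
    return (alpha * d["com"] / max_com
            + beta * d["rep"] / max_rep
            + gamma * d["sup"] / max_sup)
```

The normalisation keeps each term in [0, 1], so Inf(d) itself lies in [0, 1] and is comparable across posts.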
(2) TF-IDF is a common feature-weighting technique, calculated as

TFij-IDFi = TFij × IDFi    (Formula 2)

where the term frequency TF denotes the frequency with which a given feature word occurs in a document (the larger the value, the more important the word), and the inverse document frequency IDF denotes in how many documents of the whole corpus the given feature word occurs. TF and IDF are calculated as

TFij = nij / Σk nkj,    IDFi = log( |D| / (1 + |{ j : ti ∈ dj }|) )

where nij is the frequency of the given word in the document, Σk nkj is the total number of words in that document, |D| is the total number of documents, and |{ j : ti ∈ dj }| is the number of documents containing the feature word, to which 1 is added to avoid a zero denominator.
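The TF and IDF formulas above translate directly into code; this is a minimal sketch operating on tokenised documents (lists of words), with the +1 smoothing in the IDF denominator as the text specifies.

```python
import math

def tf(term, doc):
    """TF: frequency of the term in the document, normalised by length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """IDF: log of total documents over (1 + documents containing the term);
    the +1 guards against a zero denominator."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + n_containing))

def tf_idf(term, doc, docs):
    # Formula 2: the product of the two factors
    return tf(term, doc) * idf(term, docs)
```

Note that with this smoothing, a word appearing in every document gets a slightly negative IDF, which still ranks it below rarer words.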
Step 5: vectorize the above results with the VSM vector space model, representing each document di as a weight vector di = (wi1, wi2, ..., wiK) whose components are the TF-IDF weights of its feature words, obtaining a result matrix M.
Step 6: apply the improved Single-pass algorithm described above to matrix M for incremental clustering, obtaining the final result.
The parallelized hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
Parallelization for the large data volume of microblogs mainly consists of the algorithm design of two parts, the Mapper stage and the Reducer stage of the MapReduce framework.
(1) The Mapper stage
Step 1: input the data set D, the number of topics K, and the parameters α, β;
Step 2: preprocess the acquired data set and remove noise data that contribute nothing to hot topic discovery, such as stop words, hyperlinks and special characters;
Step 3: evenly distribute all preprocessed data D over C nodes, each node holding a certain amount of word pairs b = (wi, wj) ∈ BC, and randomly assign a topic k to each word pair b;
Step 4: Zi ← Zi + Zj; on host i, the global topic-word matrix is the sum of those of the other hosts and the local one. Traverse all C nodes to obtain the global topic-word matrix of every host.
Step 5: each iteration uses the word-pair set of each node to obtain the topic matrix, i.e., for each word pair b = (wi, wj) ∈ BC, the probability under topic k is

P(z = k | z¬b, BC) ∝ (nk|C + α) · (nwi|k + β)(nwj|k + β) / (Σw nw|k + Nβ)²

where nk|C denotes the number of word pairs on process C assigned topic number k, nwi|k and nwj|k denote the numbers of times words wi and wj are assigned topic number k, N denotes the size of the full dictionary of the data set, and α, β denote the prior parameters;
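A sketch of the unnormalised Gibbs conditional for assigning one biterm to topic k, assuming the standard BTM form of this update (the count containers and function name are illustrative, not from the patent):

```python
def btm_topic_prob(k, wi, wj, n_k, n_wk, N, alpha, beta):
    """Unnormalised BTM Gibbs conditional for biterm (wi, wj) under topic k.
    n_k[k]: biterms currently on topic k; n_wk[k][w]: count of word w under
    topic k; N: vocabulary size; alpha, beta: Dirichlet priors."""
    denom = sum(n_wk[k].values()) + N * beta
    return ((n_k[k] + alpha)
            * (n_wk[k].get(wi, 0) + beta)
            * (n_wk[k].get(wj, 0) + beta)
            / (denom ** 2))
```

In a full sampler one would evaluate this for every k, normalise, draw a topic, and update the counts; here only the per-topic score is shown.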
Step 6: on each node, vectorize the result on the local host with VSM;
Step 7: on each node, first perform local clustering with the improved Single-pass algorithm, obtaining partial topics;
Step 8: output the partial topics.
(2) The Reducer stage
Step 1: input the initial cluster results from each node;
Step 2: choose the cluster result of the master node as the initial cluster centers;
Step 3: with the improved Single-pass algorithm, cluster the cluster results of the remaining nodes against those of the master node, finally obtaining the hot topics;
Step 4: output the hot topics.
Experimental analysis
The experimental data of the present invention is a real data set from Sina Weibo collected by a crawler; the crawled data mainly contain 20,000 records on topics such as "Alphago" and "the Rio Olympic Games".
There is currently no unified evaluation criterion for hot microblog topic discovery methods, so the present invention tests with the following evaluation indices.
From the standpoint of accuracy, the evaluation indices published by NIST (National Institute of Standards and Technology) are used, including precision P, recall R, F value, miss rate Pmiss and false-alarm rate PFA:

P = a / (a + b),  R = a / (a + c),  F = 2PR / (P + R),  Pmiss = c / (a + c),  PFA = b / (b + d)

where a is the number of relevant microblog texts found by detection, b the number of irrelevant microblog texts found by detection, c the number of relevant microblog texts not detected, and d the number of irrelevant microblog texts not detected.
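The five NIST-style indices follow mechanically from the contingency counts a, b, c, d; a minimal sketch:

```python
def topic_metrics(a, b, c, d):
    """a: relevant detected; b: irrelevant detected;
    c: relevant missed; d: irrelevant not detected."""
    P = a / (a + b)            # precision
    R = a / (a + c)            # recall
    F = 2 * P * R / (P + R)    # F value (harmonic mean of P and R)
    p_miss = c / (a + c)       # miss rate: 1 - R
    p_fa = b / (b + d)         # false-alarm rate
    return P, R, F, p_miss, p_fa
```

Note Pmiss is simply 1 - R, which is why the threshold experiments below can trade the miss rate against the false-alarm rate.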
Value analysis of the parameters of the BTM algorithm model
The BTM model is the first step of topic identification on the microblog corpus in the hot topic discovery method of the present invention, and the result obtained with BTM strongly influences the subsequent incremental clustering, so the present invention first analyzes the values of the parameters of the BTM topic model. The BTM topic model is an unsupervised model; the number of topics K of the data set must be set before modeling, and models with different topic counts estimate differently and affect BTM's performance, so the value of K is determined first. This experiment tries K = 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15 and compares the runs using the F value as the evaluation index; the results are shown in Fig. 5.
As can be seen, the F value trends downward as the number of topics K gradually increases, and the effect is best when K is 7. Because microblogs are short, sparse texts containing few words, comparatively few word pairs are formed; during modeling each word pair is assigned one topic, and setting too many topics may dilute the word-topic probabilities, which in turn affects the document-topic probabilities and the final result. The present invention therefore chooses K = 7.
With the value of K determined, the parameters α and β of BTM are analyzed. The value of α follows the empirical rule α = 50/K, so α is 50/7. With α fixed, the value of β is analyzed; the results are shown in Fig. 6, where the bars from left to right represent precision, recall and F value.
As can be seen, the overall effect is relatively stable when β is 0.01, so the present invention sets β to 0.01. Summing up the two experiments, the parameter values given by the present invention are K = 7, α = 50/7 and β = 0.01.
Threshold selection analysis
The threshold setting has an important influence on the clustering effect of the Single-pass incremental clustering method, and the cluster result in turn has an important influence on the final topic discovery, so threshold setting is an important aspect. This experiment evaluates threshold selection with two indices, the miss rate Pmiss and the false-alarm rate PFA. Portions of the topic data above are selected for testing, with 1000 records per topic. The results for different thresholds are shown in Fig. 7.
The experiment shows that the false-alarm rate is inversely related to the threshold while the miss rate is directly related to it: as the threshold value grows, the false-alarm rate decreases gradually and the miss rate increases. The overall effect is comparatively good when the threshold is 0.4, so the present invention sets the threshold to 0.4.
Comparison with other hot topic discovery methods
The present invention mines hot topics with BTM and the improved Single-pass algorithm, representing each topic by a keyword set; owing to space limits, partial test results for the "Alphago" topic are shown in Fig. 8.
To verify the validity of the proposed method, it is compared against the BTM topic model and the BTK method; since the F value combines the P and R indices, this experiment uses the F value as the evaluation index, with results shown in Fig. 9.
As Fig. 9 shows, the cluster results differ across topics, which is mainly caused by the topics themselves. For example, the cluster result of the "Rio Olympic Games" topic is comparatively poor because the Olympic Games involve many different events, including "diving", "table tennis", "volleyball" and so on; covering so many events makes word co-occurrence within BTM word pairs less pronounced, which in turn lowers the clustering effect.
The present invention also compares against the hot-word trends of the Weibo Index provided by the microblog platform; a hot-word trend intuitively shows the tendency and frequency of a keyword over a period of time. Since the "Rio Olympic Games" data contain several different hot topics, they are suitable for comparison; data from August 5 to 21, 2016 are selected. The keywords obtained for the "Rio Olympic Games" data with the method of the present invention are:
Table 1: Rio Olympic Games keywords
Fig. 10 shows the hot-word trends during the Rio Olympic Games. Since the frequency of "women's volleyball" on August 21 is far higher than that of the other hot words, the women's volleyball series is moved to a secondary axis so that the trends of the other hot words display better. Fig. 10 shows that "first gold", "prehistoric power", "Horton", "women's volleyball" and "Zhang Jike" surged in heat within a period of the Games and became hot topics, while "mudslide" and "scrambled eggs with tomato" are not Games-specific keywords and thus occur with low frequency. This very intuitively proves the validity of the proposed algorithm.
The experimental results show that the hot topic discovery algorithm based on BTM and Single-pass performs better than the other methods, and the F values prove that the hot topic discovery algorithm based on BTM and Single-pass is effective and feasible.
Parallelization experimental analysis of the hot topic discovery algorithm based on BTM and Single-pass
This experiment uses 4 servers in total: one machine serves as the Master node, deploying the NameNode and JobTracker, while the other 3 servers serve as Slave nodes, deploying the DataNode and TaskTracker. The operating system on all servers is Ubuntu 14.04, the Hadoop version used is 2.6.5, and the JDK version is jdk-8u121-linux-i586.
The experimental data set of the present invention is 5 days of Sina Weibo microblog data from January 5 to January 10, 2017, crawled with a crawler tool, about 1.1 GB in size. The obtained data are in XML form; their content is extracted and uploaded for distributed storage through the open-source HDFS Java toolkit.
This experiment parallelizes the hot topic discovery algorithm based on BTM and Single-pass, so its evaluation is weighed from two aspects: the quality of hot topic discovery and the speed of data processing. The Coherence value is used as the index measuring the quality of hot topic discovery.
The Coherence value is calculated as

C(z, V(z)) = Σ_{t=2..T} Σ_{l=1..t-1} log( (D(v_t(z), v_l(z)) + 1) / D(v_l(z)) )

where V(z) = (V1(z), V2(z), ..., VT(z)) denotes the top T words under a known topic z, sorted from high to low by the value of the probability p(w | z); D(v) denotes the number of documents in which word v occurs, and D(v, v') the number in which the two words occur together. The quality of hot topic discovery is proportional to the value of C(z, V(z)): the larger the value, the better the quality.
To evaluate the quality of topic discovery in the single-machine environment and the distributed environment, the present invention takes the average Coherence over all topics as the evaluation criterion, calculated as

Coherence_avg = (1/K) Σ_{z=1..K} C(z, V(z))
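A sketch of the coherence score and its average over topics, assuming precomputed document frequencies and co-document frequencies (the container shapes are illustrative):

```python
import math

def coherence(top_words, doc_freq, co_doc_freq):
    """C(z, V): sum over ordered word pairs (t > l) of
    log((D(v_t, v_l) + 1) / D(v_l)), with top_words already sorted
    by p(w|z) in descending order."""
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            vt, vl = top_words[t], top_words[l]
            # co-occurrence counts are symmetric; try both key orders
            co = co_doc_freq.get((vt, vl), co_doc_freq.get((vl, vt), 0))
            score += math.log((co + 1) / doc_freq[vl])
    return score

def avg_coherence(per_topic_scores):
    """Average coherence across all K topics."""
    return sum(per_topic_scores) / len(per_topic_scores)
```

The +1 inside the log keeps the argument positive when two top words never co-occur; higher (less negative) scores indicate more coherent topics.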
Where topic discovery quality is similar, data-processing speed is measured by the time taken to process the data.
Performance analysis
Topic discovery quality
Under the same configuration, the present invention sets the size of T to 30 and uses the Coherence value to compare the quality of topic discovery in the single-machine environment and the distributed environment; the experimental results are shown in the table below:
Table 2: Comparison of Coherence values
The results in Table 2 show that the Coherence values in the single-machine and distributed environments are similar; in the distributed environment the Coherence value slowly decreases as the number of nodes grows, but on the whole the topic discovery quality differs little.
Topic discovery speed
To better illustrate the influence of the distributed environment on topic discovery speed, the present invention runs two groups of comparative experiments: the first keeps the number of nodes in the distributed environment constant and varies the data volume; the second keeps the data volume constant and varies the number of nodes.
For the first group of experiments, the experimental environment is 4 servers, 1 serving as the Master node and the rest as Slave nodes; the experimental data are divided into four sizes, 256 MB, 512 MB, 768 MB and 1024 MB, and the time taken is measured in minutes. The experimental results are shown in the table below:
Table 3: Time taken for topic discovery on a single machine and on distributed Hadoop
To show more intuitively how the single-machine and distributed environments change as the data volume grows, the results are also plotted as a chart, as shown in Fig. 11.
Table 3 shows that as the amount of processed data increases, the gap between the time taken in the Hadoop distributed environment and on the single machine gradually widens, so distributed Hadoop is more efficient for processing jobs with large data volumes.
The second group of experiments keeps the data size constant and varies the number of nodes in the distributed setup to compare the time taken. A data size of 1024 MB is chosen, the node count increments from 1 to 4, and the experimental results are shown in Fig. 12.
Fig. 12 intuitively shows that when processing data of the same size, the time taken keeps decreasing as the number of Hadoop nodes grows. Of course, as the node count grows, the Slave nodes exchange data with the Master node more and more often, and the rate at which the time decreases slows; a reasonable number of nodes should therefore be set.
Experiments verify that, with little difference in topic discovery quality, the parallelized algorithm markedly improves topic discovery speed and reduces the time taken to process the data.

Claims (5)

1. A hot topic discovery method based on BTM and Single-pass, characterized in that the method uses the distributed parallel processing of MapReduce and includes a Mapper stage and a Reducer stage of the MapReduce framework;
The Mapper stage specifically includes:
(1) preprocessing the input data set D;
(2) evenly assigning the preprocessed data set D to C nodes, each node containing a certain number of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set;
(4) vectorizing the results with VSM on each node;
(5) performing cluster analysis with the improved Single-pass algorithm on each node to obtain partial topics;
(6) outputting the partial topics;
The Reducer stage specifically includes:
(1) inputting the initial clustering results from each node;
(2) choosing the clustering result of the master node as the initial cluster centers;
(3) clustering the clustering results of the remaining nodes with that of the master node using the improved Single-pass algorithm, finally obtaining the hot topics;
(4) outputting the hot topics.
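For illustration only, the Mapper/Reducer flow of claim 1 can be sketched as a plain-Python simulation. The function names, the cosine similarity measure, and the simple representative-vector clustering below are assumptions standing in for the improved Single-pass algorithm of claims 4 and 5; a real deployment would run inside Hadoop MapReduce rather than over in-memory lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two document/topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_single_pass(vectors, threshold):
    """Stand-in local clustering: keep one representative vector per cluster."""
    reps = []
    for v in vectors:
        sims = [cosine(v, r) for r in reps]
        if not (sims and max(sims) >= threshold):
            reps.append(v)  # opens a new local cluster
    return reps

def mapper(node_vectors, threshold):
    # Mapper steps (4)-(6): vectorized documents of one node -> partial topics
    return cluster_single_pass(node_vectors, threshold)

def reducer(partial_topics_per_node, threshold):
    # Reducer steps (1)-(4): the master node's clusters seed the centers,
    # the other nodes' partial topics are merged in one pass
    master, *others = partial_topics_per_node  # step (2)
    result = list(master)
    for topics in others:  # step (3)
        for t in topics:
            sims = [cosine(t, r) for r in result]
            if not (sims and max(sims) >= threshold):
                result.append(t)  # dissimilar partial topic becomes a new topic
    return result  # hot topics, step (4)
```

As a usage sketch, two simulated nodes each produce partial topics via `mapper`, and `reducer` merges them into the final hot-topic list.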
2. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (2) of the Mapper stage, a word pair refers to any two different unordered words that co-occur in the same data slice after the data set has been preprocessed.
3. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (3) of the Mapper stage, the topic modeling uses Gibbs sampling as the sampling method; the sampling process iterates over the word-pair tuple set of each node to obtain the topic matrix, i.e., for each word pair b=(wi, wj)∈BC, its probability under topic k in the BTM model is computed.
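The per-pair topic probability of claim 3 can be illustrated with the standard BTM Gibbs-sampling conditional from the original BTM literature; the patent does not print its exact formula here, so the expression, count arrays, and names below (n_k, n_wk, n_dot_k, symmetric priors alpha and beta, vocabulary size W) are assumptions, not the claimed computation verbatim.

```python
def btm_topic_prob(k, wi, wj, n_k, n_wk, n_dot_k, alpha, beta, W):
    """Unnormalized P(z = k | b = (wi, wj)) for one word pair (biterm).

    n_k[k]     : number of biterms currently assigned to topic k
    n_wk[w][k] : number of times word w is assigned to topic k
    n_dot_k[k] : total word assignments to topic k (= 2 * n_k[k])
    """
    return ((n_k[k] + alpha)
            * (n_wk[wi][k] + beta) * (n_wk[wj][k] + beta)
            / ((n_dot_k[k] + W * beta + 1) * (n_dot_k[k] + W * beta)))

def sample_topic_dist(wi, wj, K, n_k, n_wk, n_dot_k, alpha, beta, W):
    """Normalized topic distribution for the pair (wi, wj) over K topics."""
    weights = [btm_topic_prob(k, wi, wj, n_k, n_wk, n_dot_k, alpha, beta, W)
               for k in range(K)]
    total = sum(weights)
    return [w / total for w in weights]
```

In one Gibbs iteration, a topic is drawn from this distribution for each word pair and the count arrays are updated before moving to the next pair.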
4. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (5) of the Mapper stage, the cluster analysis specifically includes:
(51) dividing the data set D into n data slices D1, D2, ..., Dn at a certain scale and taking the data slices in order as input data; each data slice is clustered internally on its own, yielding the clustering result of each data slice;
(52) selecting the data slice D1 as the first part and clustering D1 internally to obtain the clustering result of the first part;
(53) for the data slices D2, ..., Dn, clustering each data slice internally before input to obtain the clustering result of each data slice;
(54) computing the similarity between each cluster center obtained from the data slice D2 and each existing cluster center of the data slice D1, denoted sim(D2di, D1di), where D1di and D2di respectively denote different word pairs in the data slices D1 and D2;
(55) selecting the maximum similarity value max = max(sim(D2di, D1di)) and judging whether max exceeds the threshold c: if max = max(sim(D2di, D1di)) ≥ c, assigning D2di to the class with which its similarity value is largest; if max = max(sim(D2di, D1di)) < c, creating a new topic with D2di as its cluster center;
(56) combining the clustering results of the data slices D1 and D2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set have been processed, obtaining the final result.
5. The hot topic discovery method based on BTM and Single-pass according to claim 4, characterized in that the internal clustering uses the Single-pass algorithm.
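Steps (51)-(57) of claim 4 can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity stands in for the unspecified similarity measure, and each cluster's first member is used as its center; the actual claimed method may compute centers differently.

```python
import math

def sim(a, b):
    """Cosine similarity between two cluster-center vectors (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def internal_cluster(vectors, c):
    """Single-pass inside one data slice (claim 5); a cluster is a list of vectors."""
    clusters = []
    for v in vectors:
        sims = [sim(v, cl[0]) for cl in clusters]
        if sims and max(sims) >= c:
            clusters[sims.index(max(sims))].append(v)  # join the best cluster
        else:
            clusters.append([v])  # open a new cluster
    return clusters

def merge_slices(slices, c):
    """Steps (51)-(57): slice-by-slice merge of cluster centers."""
    result = internal_cluster(slices[0], c)               # step (52)
    for piece in slices[1:]:                              # step (53)
        for cluster in internal_cluster(piece, c):
            center = cluster[0]
            sims = [sim(center, r[0]) for r in result]    # step (54)
            if sims and max(sims) >= c:                   # step (55): max >= c
                result[sims.index(max(sims))].extend(cluster)
            else:                                         # max < c: new topic
                result.append(cluster)
        # result now holds the merged clustering, step (56)
    return result                                         # loop = step (57)
```

With a high threshold c, vectors near an existing center are absorbed into that topic, while dissimilar slice centers open new topics, matching the max ≥ c / max < c branch of step (55).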
CN201711210195.2A 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass Active CN108197144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711210195.2A CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711210195.2A CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Publications (2)

Publication Number Publication Date
CN108197144A true CN108197144A (en) 2018-06-22
CN108197144B CN108197144B (en) 2021-02-09

Family

ID=62573247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711210195.2A Active CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Country Status (1)

Country Link
CN (1) CN108197144B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109509110A (en) * 2018-07-27 2019-03-22 Fuzhou University Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * 2018-12-18 2019-05-07 University of Electronic Science and Technology of China Short text topic clustering method based on fused BTM model
CN110046260A (en) * 2019-04-16 2019-07-23 Guangzhou University Knowledge-graph-based darknet topic discovery method and system
CN110134958A (en) * 2019-05-14 2019-08-16 Nanjing University Short text topic mining method based on semantic word network
CN110297988A (en) * 2019-07-06 2019-10-01 Sichuan University Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111090811A (en) * 2019-12-24 2020-05-01 Beijing Institute of Technology Method and system for extracting massive news hot topics
CN113378558A (en) * 2021-05-25 2021-09-10 Zhejiang University of Technology RESTful API document topic distribution extraction method based on representative word pairs
CN115718680A (en) * 2023-01-09 2023-02-28 Jiangling Motors Corp., Ltd. Data reading method, system, computer and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745000A (en) * 2014-01-24 2014-04-23 Fuzhou University Hot topic detection method for Chinese microblogs
EP2068526B1 (en) * 2007-11-06 2014-04-30 Intel Corporation End-to-end network security with traffic visibility
CN104834632A (en) * 2015-05-13 2015-08-12 Beijing University of Technology Microblog topic detection and hotspot evaluation method based on semantic expansion
CN106599029A (en) * 2016-11-02 2017-04-26 Focus Technology Co., Ltd. Chinese short text clustering method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2068526B1 (en) * 2007-11-06 2014-04-30 Intel Corporation End-to-end network security with traffic visibility
CN103745000A (en) * 2014-01-24 2014-04-23 Fuzhou University Hot topic detection method for Chinese microblogs
CN104834632A (en) * 2015-05-13 2015-08-12 Beijing University of Technology Microblog topic detection and hotspot evaluation method based on semantic expansion
CN106599029A (en) * 2016-11-02 2017-04-26 Focus Technology Co., Ltd. Chinese short text clustering method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BO HUANG ET AL.: "Microblog Topic Detection Based on LDA Model and Single-Pass Clustering", International Conference on Rough Sets and Current Trends in Computing *
LIANG Yanan et al.: "Microblog Topic Discovery Based on Topic Model Modeling", China Master's Theses Full-text Database, Information Science and Technology *
WANG Yu: "Research on Hot Topic Discovery of Network Public Opinion Based on Cloud Computing", China Master's Theses Full-text Database, Information Science and Technology *
TAN Zhen: "Research on Hot Topic Discovery and Evolution Analysis Based on MapReduce", Wanfang Data Knowledge Service Platform *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109509110A (en) * 2018-07-27 2019-03-22 Fuzhou University Microblog hot topic discovery method based on improved BBTM model
CN109509110B (en) * 2018-07-27 2021-08-31 Fuzhou University Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * 2018-12-18 2019-05-07 University of Electronic Science and Technology of China Short text topic clustering method based on fused BTM model
CN110046260A (en) * 2019-04-16 2019-07-23 Guangzhou University Knowledge-graph-based darknet topic discovery method and system
CN110134958A (en) * 2019-05-14 2019-08-16 Nanjing University Short text topic mining method based on semantic word network
CN110134958B (en) * 2019-05-14 2021-05-18 Nanjing University Short text topic mining method based on semantic word network
CN110297988A (en) * 2019-07-06 2019-10-01 Sichuan University Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111090811A (en) * 2019-12-24 2020-05-01 Beijing Institute of Technology Method and system for extracting massive news hot topics
CN111090811B (en) * 2019-12-24 2023-09-01 Beijing Institute of Technology Massive news hot topic extraction method and system
CN113378558A (en) * 2021-05-25 2021-09-10 Zhejiang University of Technology RESTful API document topic distribution extraction method based on representative word pairs
CN113378558B (en) * 2021-05-25 2024-04-16 Zhejiang University of Technology RESTful API document topic distribution extraction method based on representative word pairs
CN115718680A (en) * 2023-01-09 2023-02-28 Jiangling Motors Corp., Ltd. Data reading method, system, computer and readable storage medium

Also Published As

Publication number Publication date
CN108197144B (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN108197144A (en) A kind of much-talked-about topic based on BTM and Single-pass finds method
Liu et al. Text features extraction based on TF-IDF associating semantic
EP2092419B1 (en) Method and system for high performance data metatagging and data indexing using coprocessors
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN106599054B (en) Method and system for classifying and pushing questions
US9449096B2 (en) Identifying influencers for topics in social media
WO2017097231A1 (en) Topic processing method and device
CN111581949B (en) Method and device for disambiguating name of learner, storage medium and terminal
CN107066555A (en) Towards the online topic detection method of professional domain
EP3539018A1 (en) Apparatus and method for semantic search
CN107608953B (en) Word vector generation method based on indefinite-length context
Shetty et al. Automatic extractive text summarization using K-means clustering
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN105404677B (en) A kind of search method based on tree structure
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
Kim et al. A web service for author name disambiguation in scholarly databases
CN104536830A (en) KNN text classification method based on MapReduce
Backes Effective unsupervised author disambiguation with relative frequencies
CN105740448A (en) Topic-oriented multi-microblog time sequence abstracting method
Long et al. Tcsst: transfer classification of short & sparse text using external data
Zhou et al. The survey of large-scale query classification
Ye et al. An End‐to‐End Rumor Detection Model Based on Feature Aggregation
CN105426490B (en) A kind of indexing means based on tree structure
Sailaja et al. An overview of pre-processing text clustering methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant