CN108197144A - A hot topic discovery method based on BTM and Single-pass - Google Patents
A hot topic discovery method based on BTM and Single-pass
- Publication number
- CN108197144A CN108197144A CN201711210195.2A CN201711210195A CN108197144A CN 108197144 A CN108197144 A CN 108197144A CN 201711210195 A CN201711210195 A CN 201711210195A CN 108197144 A CN108197144 A CN 108197144A
- Authority
- CN
- China
- Prior art keywords
- topic
- data
- cluster
- talked
- btm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a hot topic discovery method based on BTM and Single-pass. The method first performs topic modeling with the BTM topic model to obtain the topic distribution of the corpus, then vectorizes the result with a vector space model (VSM), and applies an improved Single-pass algorithm to cluster the vectorized results; the cluster results are then consolidated into a new clustering. In addition, the hot topic discovery method is parallelized to raise its topic-mining speed on large data volumes. The invention handles the sparsity of microblog data well and scales to massive data; the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, handles newly arriving data effectively, and analyzes the lasting influence of hot topics well; and while the MapReduce framework improves data-processing efficiency on the data set, topic discovery quality is maintained.
Description
Technical field
The present invention relates to a hot topic discovery method based on BTM and Single-pass, and belongs to text clustering in the field of data mining.
Background art
With the popularity of smartphones and the Internet, people can follow the latest national and social issues at any moment through microblog apps. Discovering hot microblog topics therefore has great value in fields such as business and scientific research, and more and more scholars are carrying out related research on microblogs.
Traditional hot topic discovery typically relies on the LDA topic model and algorithms such as K-Means. However, the traditional LDA model is mainly designed for long texts and performs poorly on short texts such as microblog posts; moreover, LDA has difficulty with the data sparsity and strong contextual relevance that characterize microblog data.
Traditional hot topic discovery techniques also hit a bottleneck when processing large data sets. First, the data volume to be processed in hot topic discovery is huge, and a single host or processor is slow and laborious. Second, topic mining over the data with a plain BTM model is too slow. Finally, modeling with the BTM topic model alone does not account for the streaming nature of microblog data, so its classification quality needs improvement. Hot microblog topic discovery in a distributed environment is therefore particularly important.
Summary of the invention
Object of the invention: in view of the deficiencies of the prior art, the present invention proposes a hot topic discovery method based on BTM and Single-pass suited to short, sparse, streaming text; at the same time, the method adapts to large data volumes and speeds up topic mining.
Technical solution: the present invention provides a hot topic discovery method based on BTM and Single-pass; the method comprises a Mapper stage and a Reducer stage of the MapReduce framework.
The Mapper stage specifically includes:
(1) preprocessing the input data set D;
(2) evenly assigning the preprocessed data set D to C nodes, each node holding a fixed quota of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling with the BTM topic model to obtain the topic distribution of the corpus;
(4) vectorizing the result with VSM on each node;
(5) running cluster analysis with the improved Single-pass algorithm on each node to obtain local topics;
(6) outputting the local topics.
The Reducer stage specifically includes:
(1) inputting the initial cluster result on each node;
(2) choosing the cluster result of the master node as the initial cluster centers;
(3) clustering the cluster results of the remaining nodes against the cluster result of the master node with the improved Single-pass algorithm to finally obtain the hot topics;
(4) outputting the hot topics.
Preferably, in step (5), the cluster analysis specifically includes:
1) dividing the data set D by a fixed size into n data slices D1, D2, ..., Dn, and feeding the slices in order as input data; each slice is clustered internally on its own, yielding a cluster result per slice;
2) selecting slice D1 as the first part and clustering it internally to obtain the cluster result of the first part;
3) for the slices D2, ..., Dn, clustering each slice internally before input, obtaining a cluster result per slice;
4) computing the similarity between each cluster center obtained from slice D2 and each existing cluster center of slice D1, denoted sim(D2di, D1di), where D1di and D2di denote different word pairs in slices D1 and D2 respectively;
5) selecting the maximum similarity value max = max(sim(D2di, D1di)) and comparing it against a threshold c: if max = max(sim(D2di, D1di)) >= c, D2di is merged into the class with the highest similarity; if max = max(sim(D2di, D1di)) < c, a new topic is created with D2di as its cluster center;
6) consolidating the cluster results of slices D1 and D2 into a new cluster result;
7) repeating 4), 5) and 6) until all data in the data set have been processed, yielding the final result.
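The slice-wise clustering steps above can be sketched in Python. This is a minimal illustration, not the patented implementation: documents are represented as sparse term-weight dictionaries, similarity is assumed to be cosine, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    # cosine similarity between two sparse vectors (dicts: term -> weight)
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(docs, c):
    # classic Single-pass: each doc joins its most similar cluster, or starts a new one
    clusters = []  # each cluster: {"centroid": vec, "members": [vec, ...]}
    for d in docs:
        best, best_sim = None, -1.0
        for cl in clusters:
            s = cosine(d, cl["centroid"])
            if s > best_sim:
                best, best_sim = cl, s
        if best is not None and best_sim >= c:
            best["members"].append(d)
            # recompute the centroid as the mean of the member vectors
            terms = set().union(*best["members"])
            n = len(best["members"])
            best["centroid"] = {t: sum(m.get(t, 0.0) for m in best["members"]) / n
                                for t in terms}
        else:
            clusters.append({"centroid": dict(d), "members": [d]})
    return clusters

def improved_single_pass(slices, c):
    # each slice clusters itself internally first; slice-level cluster centers
    # are then merged against the accumulated result using the same threshold c
    result = single_pass(slices[0], c)
    for sl in slices[1:]:
        for cl in single_pass(sl, c):
            best, best_sim = None, -1.0
            for r in result:
                s = cosine(cl["centroid"], r["centroid"])
                if s > best_sim:
                    best, best_sim = r, s
            if best is not None and best_sim >= c:
                best["members"].extend(cl["members"])  # merge into nearest topic
            else:
                result.append(cl)  # start a new topic from this cluster center
    return result
```

Because each slice is clustered before merging, each incoming document is compared against a handful of slice-level centers rather than against every document seen so far, which is the source of the complexity reduction claimed for the improved algorithm.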
Preferably, the internal clustering uses the classic Single-pass algorithm.
Preferably, a word pair refers to two arbitrary, unordered, distinct words that appear together in the same data slice after the data set has been preprocessed.
Preferably, in step (3), the topic modeling uses Gibbs sampling: each sampling iteration derives the topic matrix from each node's set of word-pair tuples, i.e. for each word pair b = (wi, wj) ∈ BC, its probability under topic k in the BTM model is computed.
Advantageous effects: compared with the prior art, the notable advantages of the present invention are: 1. the invention uses the word pairs of the BTM topic model to solve the data sparsity problem well, and the model processes the data better than the LDA model; 2. traditional document clustering commonly uses K-Means, but microblog data arrives as a stream that K-Means cannot cluster well, so the invention analyzes it with the Single-pass incremental clustering algorithm; 3. the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, handles newly arriving data effectively, and analyzes the lasting influence of hot topics well; 4. the invention further applies distributed parallel processing to the hot topic discovery method based on BTM and Single-pass: the MapReduce framework improves data-processing efficiency on the data set while topic discovery quality is maintained.
Description of the drawings
Fig. 1 is the flow chart of the improved Single-pass algorithm;
Fig. 2 is the parallelization flow chart of the hot topic discovery algorithm of the present invention;
Fig. 3 is the topology of the BTM model;
Fig. 4 is the basic flow chart of the hot topic discovery algorithm based on BTM and Single-pass in the present invention;
Fig. 5 shows the F values of BTM under different K values;
Fig. 6 is the analysis chart of the β value in the BTM model;
Fig. 7 is the performance diagram for different thresholds;
Fig. 8 shows partial topic experimental results;
Fig. 9 compares the three methods experimentally;
Fig. 10 is the hot-word trend chart used by the present invention;
Fig. 11 compares single-machine and distributed times;
Fig. 12 plots the time used against the number of servers.
Specific embodiments
The present invention provides a hot topic discovery method based on BTM and Single-pass, suited to use on short, sparse, streaming text; at the same time, to let the method adapt to large data volumes and speed up topic mining, the computation is parallelized. The main steps of the method are: (1) cluster analysis with the improved Single-pass algorithm; (2) MapReduce distributed parallel processing of the hot topic discovery method based on BTM and Single-pass.
(1) Cluster analysis with the improved Single-pass algorithm
As in Fig. 1, the data set D is divided by a fixed size into multiple data slices D1, D2, ..., Dn, which are fed in order after decomposition.
1) The slices D1, D2, ..., Dn serve as input data in order; each part first clusters itself internally, in a manner similar to the classic Single-pass algorithm, yielding a cluster result per part;
2) slice D1 is selected as the first part and clustered with the classic Single-pass algorithm, yielding the cluster result of this part;
3) the parts D2, ..., Dn each cluster themselves internally before input, also with the classic Single-pass algorithm, yielding a cluster result per part;
4) each cluster center obtained from D2 is compared by similarity with each cluster center of the already existing part, where D1di and D2di denote different word pairs in the D1 and D2 data; a word pair refers to two arbitrary, unordered, distinct words that appear together in the same fragment after the data set has been segmented;
5) the maximum similarity value max = max(sim(D2di, D1di)) is selected and compared against the threshold c, which can be chosen differently for different situations: if the maximum similarity is greater than or equal to the threshold, i.e. max = max(sim(D2di, D1di)) >= c, the pair is merged into the class with the highest similarity; if the maximum similarity is below the threshold, i.e. max = max(sim(D2di, D1di)) < c, a new topic is created with it as the cluster center;
6) the cluster results of the two parts D1 and D2 are consolidated into a new cluster result;
7) steps 4), 5) and 6) repeat until all data have been processed, yielding the final result.
(2) MapReduce distributed parallel processing of the hot topic discovery method based on BTM and Single-pass
1) As in Fig. 2, all preprocessed data D are evenly assigned to C nodes, so each node handles roughly D/C of the data volume. Each node holds a certain amount of word pairs b = (wi, wj) ∈ BC; for the whole data set D, the global word-pair set is B = {B1, B2, ..., BC}.
2) The local data are sampled with the Gibbs method in the BTM model: each iteration derives the topic matrix from each node's word-pair tuple set, i.e. for each word pair b = (wi, wj) ∈ BC, its probability under topic k is computed.
3) Each node vectorizes the result on its own host, preparing the result obtained above for the clustering below.
4) Each node first clusters separately; the cluster result of the master node is chosen as the initial cluster centers, and the improved Single-pass algorithm merges and consolidates the cluster results of the remaining nodes with that of the master node to obtain the result.
In the BTM model, all word pairs in the data set share the same topic probability distribution, where a topic is understood as a probability distribution over different words. BTM builds on the LDA model and the unigram mixture model: it learns the topics of short, sparse text through the word pairs generated from the entire data set and the associations between words, which are represented by word co-occurrence. The BTM topic model therefore uses the co-occurring word pairs of the whole data set for topic modeling and learning.
The structure of the BTM topic model is shown in Fig. 3, where θ denotes the global topic distribution of the data set, φ denotes the probability distribution over words under a single topic, K denotes the number of topics, |B| denotes the number of word pairs in the data set, z denotes the latent topic, and wi, wj are the two distinct words of a word pair b. α and β are the hyperparameters of the Dirichlet priors on θ and φ respectively.
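The word pairs (biterms) that BTM models are simply all unordered pairs of distinct words co-occurring in one short document. A minimal sketch of that extraction step (function name is illustrative, not from the patent):

```python
from itertools import combinations

def extract_biterms(docs):
    # a biterm is an unordered pair of distinct words co-occurring in one
    # (short) document; docs is a list of token lists
    biterms = []
    for words in docs:
        # sorted(set(...)) deduplicates tokens and fixes a canonical pair order
        for wi, wj in combinations(sorted(set(words)), 2):
            biterms.append((wi, wj))
    return biterms
```

Because biterms are pooled over the whole corpus rather than per document, even very short posts contribute usable co-occurrence evidence, which is how BTM sidesteps the sparsity that hurts LDA on microblog text.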
As in Fig. 4, the hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
Step 1: preprocess the collected data set, removing noise that contributes nothing to hot topic discovery, such as stop words, hyperlinks and special characters;
Step 2: segment the data with the Chinese word segmentation tool NLPIR proposed by the Institute of Computing Technology, Chinese Academy of Sciences;
Step 3: obtain the information for the required feature items, keeping relevant verbs, nouns, topic labels, timestamps and the like as feature items, consolidating duplicate words and counting word frequencies;
Step 4: model the above data with the BTM model and compute the values p(z|d) and p(w|z) (document-topic and topic-word) of the data set; for every item di in the data set, compute the topic influence with the formula Inf(di), and compute the weight values with the formula wk(di), looping k from 0 to K, where K is the number of feature words.
(1) Inf(di) quantifies the influence of one microblog post; the formula is
Inf(di) = α·Ncom(di)/MAX{Ncom(dj)} + β·Nrep(di)/MAX{Nrep(dj)} + γ·Nsup(di)/MAX{Nsup(dj)}   (Formula 1)
where Ncom denotes the number of comments on a post, Nrep the number of reposts, and Nsup the number of likes; MAX{Ncom(dj)}, MAX{Nrep(dj)} and MAX{Nsup(dj)} denote the maximum comment, repost and like counts in the document collection. α, β, γ are parameters satisfying α + β + γ = 1.
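A short sketch of this influence score in Python. The original formula survives only through its term definitions, so this assumes the natural reading: a weighted sum of max-normalized comment, repost and like counts; the weight values here are illustrative only.

```python
def influence(n_com, n_rep, n_sup, max_com, max_rep, max_sup,
              alpha=0.4, beta=0.4, gamma=0.2):
    # weighted, max-normalized combination of comments, reposts and likes;
    # the weights must sum to 1 (these particular values are illustrative)
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return (alpha * n_com / max_com
            + beta * n_rep / max_rep
            + gamma * n_sup / max_sup)
```

With this form the score lies in [0, 1]: a post that simultaneously holds the collection maxima for comments, reposts and likes scores exactly 1, and a post with no interactions scores 0.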
(2) TF-IDF is a common feature weighting technique, computed as
TFij-IDFi = TFij × IDFi   (Formula 2)
where the term frequency TF is the frequency of a given feature word in a document (the larger the value, the more important the word), and the inverse document frequency IDF reflects how many documents in the whole collection contain the word. TF and IDF are computed as
TFij = nij / Σk nkj,  IDFi = log(|D| / (|{j : ti ∈ dj}| + 1))
where nij is the frequency of the given word in the document, Σk nkj is the total number of words in that document, |D| is the number of documents, and |{j : ti ∈ dj}| is the number of documents containing the feature word; 1 is added to it to avoid a zero denominator.
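The TF-IDF weighting above, including the +1 smoothing in the IDF denominator, can be sketched directly (function name is illustrative):

```python
import math

def tf_idf(term, doc, docs):
    # TF: relative frequency of the term in this document
    tf = doc.count(term) / len(doc)
    # IDF: log(|D| / (df + 1)); the +1 avoids a zero denominator for
    # terms that appear in no document
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / (df + 1))
    return tf * idf
```

A term concentrated in few documents gets a high weight, while a term present in nearly every document gets a weight near (or, with this smoothing, slightly below) zero, so topic-discriminative words dominate the document vectors fed to the VSM step.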
Step 5: vectorize the result obtained above with the VSM vector space model, obtaining a result matrix M.
Step 6: apply the improved Single-pass algorithm described above to the matrix M for incremental clustering, obtaining the final result.
The parallelized hot topic discovery algorithm based on BTM and Single-pass comprises the following steps. Parallelization for the large data volume of microblogs mainly involves designing the Mapper-stage and Reducer-stage algorithms of the MapReduce framework.
(1) Mapper stage
Step 1: input the data set D, the number of topics K, and the parameters α, β;
Step 2: preprocess the collected data set, removing noise that contributes nothing to hot topic discovery, such as stop words, hyperlinks and special characters;
Step 3: evenly assign all preprocessed data D to C nodes, each node holding a certain amount of word pairs b = (wi, wj) ∈ BC, and randomly assign a topic k to each word pair b;
Step 4: Zi ← Zi + Zj: on host i, the global topic-word matrix is the sum of those of the other hosts and of the local machine; traversing all the nodes C yields the global topic-word matrix of every host.
Step 5: each iteration derives the topic matrix from each node's word-pair tuple set, i.e. for each word pair b = (wi, wj) ∈ BC, the probability under topic k is
P(z = k | wi, wj) ∝ (nk|C + α) · (nwi|k + β)(nwj|k + β) / (Σw nw|k + Nβ)²
where nk|C denotes the number of word-pair tuples assigned topic number k on process C, nwi|k and nwj|k denote the numbers of times the words wi and wj are assigned topic number k, N denotes the size of the overall dictionary of the data set, and α, β denote the prior parameters;
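The per-biterm topic probability of Step 5 can be sketched as follows. This assumes the standard BTM Gibbs conditional built from the counts the text defines (topic counts, per-topic word counts, dictionary size and the Dirichlet priors); the count containers and function name are illustrative.

```python
def biterm_topic_prob(k, wi, wj, n_k, n_wk, n_sum_k, N, alpha, beta):
    # unnormalized P(z = k | wi, wj) for one biterm in BTM Gibbs sampling:
    #   (n_k + alpha) * (n_{wi|k} + beta) * (n_{wj|k} + beta) / (n_{.|k} + N*beta)^2
    # n_k[k]:     biterms currently assigned topic k
    # n_wk[k]:    dict word -> times that word is assigned topic k
    # n_sum_k[k]: total word assignments under topic k
    # N:          dictionary size; alpha, beta: Dirichlet priors
    num = (n_k[k] + alpha) * (n_wk[k].get(wi, 0) + beta) * (n_wk[k].get(wj, 0) + beta)
    den = (n_sum_k[k] + N * beta) ** 2
    return num / den
```

In the sampler these unnormalized values are computed for every topic k and normalized into a categorical distribution from which the biterm's new topic is drawn; the counts are then updated and the next biterm is visited.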
Step 6: each node vectorizes the result on its own host with VSM;
Step 7: each node first clusters separately with the improved Single-pass algorithm, obtaining local topics;
Step 8: output the local topics.
(2) Reducer stage
Step 1: input the initial cluster result on each node;
Step 2: choose the cluster result of the master node as the initial cluster centers;
Step 3: with the improved Single-pass algorithm, cluster the cluster results of the remaining nodes against the cluster result of the master node to finally obtain the hot topics;
Step 4: output the hot topics.
Experimental analysis
The experimental data of the present invention is a real data set collected from Sina Weibo by a web crawler; the crawled content mainly comprises 20,000 items on topics such as "AlphaGo" and "the Rio Olympic Games".
There is currently no unified evaluation standard for hot microblog topic discovery methods; the present invention tests with the following evaluation indices.
In terms of accuracy, the evaluation indices published by NIST (National Institute of Standards and Technology) are used, including precision P, recall R, F value, miss rate Pmiss and false-alarm rate PFA. The formulas are as follows:
P = a/(a+b),  R = a/(a+c),  F = 2PR/(P+R),  Pmiss = c/(a+c),  PFA = b/(b+d)
where a is the number of relevant microblog texts found by detection, b the number of irrelevant microblog texts found by detection, c the number of relevant microblog texts not found by detection, and d the number of irrelevant microblog texts not found by detection.
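Packed into one helper, these detection metrics over the four counts a, b, c, d look like this (a small sketch; the function name is illustrative):

```python
def detection_metrics(a, b, c, d):
    # a: relevant & detected      b: irrelevant & detected
    # c: relevant & missed        d: irrelevant & not detected
    P = a / (a + b)              # precision
    R = a / (a + c)              # recall
    F = 2 * P * R / (P + R)      # F value (harmonic mean of P and R)
    p_miss = c / (a + c)         # miss rate  = 1 - R
    p_fa = b / (b + d)           # false-alarm rate
    return P, R, F, p_miss, p_fa
```

Note that Pmiss is just the complement of recall, while PFA is computed against the irrelevant texts, so the two rates move in opposite directions as the clustering threshold changes, exactly the trade-off examined in the threshold analysis below.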
Value analysis of the parameters in the BTM algorithm model
The BTM model is the first step of topic identification over the microblog corpus in the hot topic discovery method of the present invention, and the result obtained with BTM has a large effect on the subsequent incremental clustering, so the present invention first analyzes the values of the parameters of the BTM topic model. The BTM topic model is an unsupervised model: the number of topics K of the data set must be set before modeling, models with different topic counts give different estimates, and the performance of BTM is affected, so the value of K must be determined first. This experiment tries K values of 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 15 and compares them using the F value as the evaluation index; the experimental results are shown in Fig. 5.
It can be seen that as the number of topics K gradually increases, the F value trends downward, and the effect is best when K is 7. Because microblog posts are sparse short texts containing few words, the word pairs formed are correspondingly few; during modeling each word pair must be assigned a topic, and setting too large a topic count may split the word-pair-topic probabilities, which in turn affects the document-topic probabilities so that the final result suffers. The present invention therefore chooses 7 as the K value.
With the value of K determined, the parameters α and β of BTM are analyzed. For α, the empirical value α = 50/K is adopted, so the value of α is 50/7. With the value of α determined, the value of β is analyzed; the experimental results are shown in Fig. 6, where the bars from left to right represent precision, recall and F value respectively.
It can be seen that the overall effect is relatively stable when β is 0.01, so the present invention takes β = 0.01. Summing up the two experiments, the parameter values given by the present invention are: K = 7, α = 50/7, β = 0.01.
Threshold selection analysis
The threshold setting strongly influences the clustering effect of the Single-pass incremental clustering method, and the cluster result in turn strongly influences the final topic discovery, so setting the threshold is an important aspect. This experiment evaluates threshold selection in terms of the miss rate Pmiss and the false-alarm rate PFA. A portion of the topic data above is selected for testing, with 1,000 items per topic. The experimental results for the different thresholds are shown in Fig. 7.
The experiment shows that the false-alarm rate is inversely related to the threshold while the miss rate is directly related: as the threshold value grows, the false-alarm rate gradually decreases and the miss rate grows. The overall effect is comparatively good when the threshold is 0.4, so the present invention takes 0.4 as the threshold.
Comparison with other hot topic discovery methods
The present invention mines hot topics based on BTM and the improved Single-pass algorithm, and represents a topic by its keyword set. Owing to length limits, partial test results for the "AlphaGo" topic are shown in Fig. 8.
To verify the validity of the proposed method, it is compared with the BTM topic model and the BTK method. Since the F value combines the P and R indices, this experiment uses the F value as the evaluation index; the experimental results are shown in Fig. 9.
As Fig. 9 shows, the cluster results differ across topics, which is mainly caused by the topics themselves. For example, the cluster result of the "Rio Olympic Games" topic is comparatively poor because the Olympic Games involve many different events, including "diving", "table tennis", "volleyball" and others; the many events make the word co-occurrence among the BTM word pairs less pronounced, which relatively reduces the clustering effect.
The present invention also compares against the hot-word trends in the Weibo Index provided by the microblog platform; hot-word trends intuitively show the tendency and frequency of a keyword over a period of time. Since the "Rio Olympic Games" data contain multiple different hot topics, they are suitable for comparison; the data from August 5 to 21, 2016 are selected. The keywords obtained by the method of the present invention on the "Rio Olympic Games" data are:
Table 1: Rio Olympic Games keywords
Fig. 10 illustrates the hot-word trends during the Rio Olympic Games. Since the frequency of "women's volleyball" on August 21 is far above the other hot words, the women's volleyball data are plotted on a secondary axis so the trends of the other hot words show better. Fig. 10 shows that "first gold", "prehistoric powers", "Horton", "women's volleyball" and "Zhang Jike" surged in popularity within a period of the Games and became hot topics, while "mudslide" and "scrambled eggs with tomato" appear at low frequency because they are not keywords specific to the Games. This intuitively proves the validity of the proposed algorithm.
The experimental results show that the hot topic discovery algorithm based on BTM and Single-pass performs better than the other methods; the F values prove that the algorithm is effective and feasible.
Parallelization experimental analysis of the hot topic discovery algorithm based on BTM and Single-pass
This experiment uses 4 servers in total: one machine serves as the Master node, on which the NameNode and JobTracker are deployed; the other 3 servers serve as Slave nodes, on which the DataNode and TaskTracker are deployed. The operating system on all servers is Ubuntu 14.04, the Hadoop version is 2.6.5, and the JDK version is jdk-8u121-linux-i586.
The experimental data set of the present invention is 5 days of Sina Weibo microblog data from January 5 to January 10, 2017, crawled by a crawler tool, about 1.1 GB in size. The obtained data are in XML form; their content is extracted and uploaded for distributed storage through the open-source HDFS Java toolkit.
This experiment parallelizes the hot topic discovery algorithm based on BTM and Single-pass, so its evaluation is weighed in two respects: the quality of hot topic discovery and the speed of data processing. The Coherence value is used as the index measuring the quality of hot topic discovery.
The Coherence value is computed as follows:
C(z; V(z)) = Σ(t=2..T) Σ(l=1..t-1) log((D(vt(z), vl(z)) + 1) / D(vl(z)))
where V(z) = (v1(z), v2(z), ..., vT(z)) are the top T words under a known topic z, sorted from high to low by the value of the probability p(w|z); D(v) denotes the number of documents in which the word v occurs, and D(v, v') denotes the number of documents in which the two words occur together. The quality of hot topic discovery is proportional to the value of C(z; V(z)): the larger the value, the better the quality.
To evaluate the quality of topic discovery under the single-machine environment and under the distributed environment, the present invention takes the average Coherence over all topics as the evaluation criterion, computed as the mean of C(z; V(z)) over the K topics.
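The Coherence quality measure above can be sketched in Python. This assumes the UMass-style formulation matching the reconstructed formula (pairwise log co-occurrence ratios over the top T words of a topic); function names are illustrative.

```python
import math

def coherence(top_words, docs):
    # C(z; V) = sum over pairs (t, l), l < t, of
    #   log((D(v_t, v_l) + 1) / D(v_l))
    # top_words must be ordered by p(w|z) descending; docs are word sets
    def D(*words):
        # number of documents containing all the given words
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            score += math.log((D(top_words[t], top_words[l]) + 1) / D(top_words[l]))
    return score

def avg_coherence(topics, docs):
    # mean coherence over all discovered topics, as used to compare
    # single-machine and distributed topic quality
    return sum(coherence(tw, docs) for tw in topics) / len(topics)
```

Words that genuinely co-occur push the log ratios toward zero from below, so a topic whose top words travel together scores higher (less negative) than one whose top words rarely share a document.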
When the topic discovery quality is similar, data processing speed is measured by the time used to process the data.
Performance analysis
Topic discovery quality
Under the same configuration, the present invention sets T to 30 and uses the Coherence value to compare the quality of topics discovered under the single-machine environment and under the distributed environment. The experimental results are shown in the following table:
Table 2: Coherence value comparison
The results in Table 2 show that the Coherence values under the single-machine and distributed environments are comparable; under the distributed environment the Coherence value slowly decreases as the number of nodes increases, but overall the topic discovery quality differs little.
Topic discovery speed
To better illustrate the influence of the distributed environment on topic discovery speed, the present invention compares two groups of experiments: the first group keeps the number of nodes in the distributed environment constant and changes the data volume; the second group keeps the data volume constant and changes the number of nodes.
For the first group, the experimental environment is 4 servers, 1 as the Master node and the rest as Slave nodes; the experimental data are divided into four sizes, namely 256 MB, 512 MB, 768 MB and 1024 MB, and the time used is measured in minutes. The experimental results are shown in the following table:
Table 3: time used for topic discovery on a single machine and on the Hadoop distributed setup
To show more intuitively how the single-machine and distributed environments change as the data volume increases, the results are also shown graphically, as in Fig. 11.
Table 3 shows that as the volume of processed data increases, the gap between the time used in the Hadoop distributed environment and on a single machine gradually widens, so the Hadoop distributed setup is more efficient for jobs with large data volumes.
The second group of experiments keeps the data size constant at 1024 MB and changes the number of nodes in the distributed setup, incremented from 1 to 4, to compare the time used; the experimental results are shown in Fig. 12.
Fig. 12 shows intuitively that when processing data of the same size, the time used keeps decreasing as the number of Hadoop nodes increases. Of course, as the node count grows, the Slave nodes exchange data with the Master node more and more often, and the trend of time reduction slows, so a reasonable number of nodes should be set.
Experiments verify that the parallelized algorithm significantly improves topic discovery speed and reduces the time used to process the data, while the topic discovery quality differs little.
Claims (5)
1. A hot topic discovery method based on BTM and Single-pass, characterized in that the method uses the distributed parallel processing of MapReduce, comprising the Mapper stage and the Reducer stage of the MapReduce framework;
the Mapper stage specifically comprises:
(1) preprocessing an input data set D;
(2) distributing the preprocessed data set D evenly across C nodes, each node holding a fixed quantity of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling with the BTM topic model to obtain the topic distribution of the corpus data set;
(4) vectorizing the result on each node using the VSM;
(5) performing cluster analysis on each node using the improved Single-pass algorithm to obtain partial topics;
(6) outputting the partial topics;
the Reducer stage specifically comprises:
(1) inputting the initial clustering results from each node;
(2) selecting the clustering result of the master node as the initial cluster centers;
(3) clustering the clustering results of the remaining nodes against the clustering result of the master node using the improved Single-pass algorithm to finally obtain the hot topics;
(4) outputting the hot topics.
2. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (2) of the Mapper stage, a word pair refers to any two different, unordered words that co-occur in the same data slice after the data set has been preprocessed.
3. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (3) of the Mapper stage, the topic modeling uses Gibbs sampling as its sampling method; the sampling process iterates over the word-pair tuple set of each node to obtain the topic matrix, i.e., for each word pair b=(wi, wj)∈BC, its probability under topic k in the BTM model is computed.
4. The hot topic discovery method based on BTM and Single-pass according to claim 1, characterized in that, in step (5) of the Mapper stage, the cluster analysis specifically comprises:
(51) dividing the data set D into n data slices D1, D2, ..., Dn of a given size and taking the data slices in order as input data; each data slice is clustered internally on its own, yielding the clustering result of each data slice;
(52) selecting the data slice D1 as the first part, clustering D1 internally, and obtaining the clustering result of the first part;
(53) for the data slices D2, ..., Dn, clustering each data slice internally on its own before input, obtaining the clustering result of each data slice;
(54) computing the similarity between each cluster center obtained from the data slice D2 and each existing cluster center of the data slice D1, denoted sim(D2di, D1di), wherein D1di and D2di respectively represent different word pairs in the data slices D1 and D2;
(55) selecting the maximum similarity value max=max(sim(D2di, D1di)) and judging whether max exceeds the threshold c; if max=max(sim(D2di, D1di))≥c, assigning D2di to the class with which its similarity value is largest; if max=max(sim(D2di, D1di))<c, creating a new topic with D2di as its cluster center;
(56) merging the clustering results of the data slices D1 and D2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set have been processed, obtaining the final result.
5. The hot topic discovery method based on BTM and Single-pass according to claim 4, characterized in that the internal clustering uses the Single-pass algorithm.
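As an illustrative sketch only, not the patented implementation: assuming cosine similarity over VSM term-weight vectors and a fixed threshold c (both left abstract in claim 4), the slice-by-slice merge of steps (51)-(57) could be outlined in Python as follows. The function names and the dict-based vector layout are hypothetical.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold):
    """Plain Single-pass clustering (claim 5): each vector joins the most
    similar existing cluster if similarity >= threshold, else seeds a new one."""
    clusters = []  # each cluster: {"center": vec, "members": [vec, ...]}
    for vec in vectors:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine_sim(vec, c["center"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best["members"].append(vec)
        else:
            clusters.append({"center": vec, "members": [vec]})
    return clusters

def merge_slices(slices, threshold):
    """Improved Single-pass per steps (51)-(57): cluster each slice
    internally, then fold each slice's cluster centers into the clusters
    accumulated so far using the same threshold rule."""
    merged = single_pass(slices[0], threshold)       # step (52): D1 seeds the result
    for data in slices[1:]:                          # steps (53)-(57): D2..Dn in order
        local = single_pass(data, threshold)         # internal clustering of the slice
        for center in (c["center"] for c in local):  # steps (54)-(55)
            sims = [cosine_sim(center, c["center"]) for c in merged]
            if sims and max(sims) >= threshold:
                merged[sims.index(max(sims))]["members"].append(center)
            else:                                    # below c: new topic, step (55)
                merged.append({"center": center, "members": [center]})
    return merged                                    # step (56): updated result
```

Because only the slices' cluster centers, rather than every document, are compared against the accumulated clusters, the per-slice comparison cost stays low as new data arrives, which matches the claimed aim of reducing computation while keeping the algorithm stable.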
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711210195.2A CN108197144B (en) | 2017-11-28 | 2017-11-28 | Hot topic discovery method based on BTM and Single-pass |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108197144A true CN108197144A (en) | 2018-06-22 |
CN108197144B CN108197144B (en) | 2021-02-09 |
Family
ID=62573247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711210195.2A Active CN108197144B (en) | 2017-11-28 | 2017-11-28 | Hot topic discovery method based on BTM and Single-pass |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108197144B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103745000A (en) * | 2014-01-24 | 2014-04-23 | Fuzhou University | Hot topic detection method for Chinese micro-blogs
EP2068526B1 (en) * | 2007-11-06 | 2014-04-30 | Intel Corporation | End-to-end network security with traffic visibility
CN104834632A (en) * | 2015-05-13 | 2015-08-12 | Beijing University of Technology | Microblog topic detection and hot topic evaluation method based on semantic expansion
CN106599029A (en) * | 2016-11-02 | 2017-04-26 | Focus Technology Co., Ltd. | Chinese short text clustering method
Non-Patent Citations (4)
Title |
---|
BO HUANG ET AL.: "Microblog Topic Detection Based on LDA Model and Single-Pass Clustering", International Conference on Rough Sets and Current Trends in Computing *
LIANG YANAN ET AL.: "Microblog Topic Discovery Based on Topic Model Modeling", China Master's Theses Full-text Database, Information Science and Technology *
WANG YU: "Research on Hot Topic Discovery in Online Public Opinion Based on Cloud Computing", China Master's Theses Full-text Database, Information Science and Technology *
TAN ZHEN: "Research on Hot Topic Discovery and Evolution Analysis Methods Based on MapReduce", Wanfang Data Knowledge Service Platform *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109509110A (en) * | 2018-07-27 | 2019-03-22 | Fuzhou University | Hot microblog topic discovery method based on an improved BBTM model
CN109509110B (en) * | 2018-07-27 | 2021-08-31 | Fuzhou University | Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * | 2018-12-18 | 2019-05-07 | University of Electronic Science and Technology of China | Short text topic clustering method based on a fused BTM model
CN110046260A (en) * | 2019-04-16 | 2019-07-23 | Guangzhou University | Knowledge-graph-based darknet topic discovery method and system
CN110134958A (en) * | 2019-05-14 | 2019-08-16 | Nanjing University | Short text topic mining method based on a semantic word network
CN110134958B (en) * | 2019-05-14 | 2021-05-18 | Nanjing University | Short text topic mining method based on semantic word network
CN110297988A (en) * | 2019-07-06 | 2019-10-01 | Sichuan University | Hot topic detection method based on weighted LDA and an improved Single-Pass clustering algorithm
CN111090811A (en) * | 2019-12-24 | 2020-05-01 | Beijing Institute of Technology | Method and system for extracting hot topics from massive news
CN111090811B (en) * | 2019-12-24 | 2023-09-01 | Beijing Institute of Technology | Massive news hot topic extraction method and system
CN113378558A (en) * | 2021-05-25 | 2021-09-10 | Zhejiang University of Technology | RESTful API document topic distribution extraction method based on representative word pairs
CN113378558B (en) * | 2021-05-25 | 2024-04-16 | Zhejiang University of Technology | RESTful API document topic distribution extraction method based on representative word pairs
CN115718680A (en) * | 2023-01-09 | 2023-02-28 | Jiangling Motors Co., Ltd. | Data reading method, system, computer and readable storage medium
Also Published As
Publication number | Publication date |
---|---|
CN108197144B (en) | 2021-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108197144A (en) | Hot topic discovery method based on BTM and Single-pass | |
Liu et al. | Text features extraction based on TF-IDF associating semantic | |
EP2092419B1 (en) | Method and system for high performance data metatagging and data indexing using coprocessors | |
CN103279478B (en) | Distributed mutual-information-based document feature extraction method | |
CN106599054B (en) | Method and system for classifying and pushing questions | |
US9449096B2 (en) | Identifying influencers for topics in social media | |
WO2017097231A1 (en) | Topic processing method and device | |
CN111581949B (en) | Method and device for disambiguating name of learner, storage medium and terminal | |
CN107066555A (en) | Towards the online topic detection method of professional domain | |
EP3539018A1 (en) | Apparatus and method for semantic search | |
CN107608953B (en) | Word vector generation method based on indefinite-length context | |
Shetty et al. | Automatic extractive text summarization using K-means clustering | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN105404677B (en) | Tree-structure-based search method | |
CN105787121B (en) | Microblog event summary extraction method based on multiple story lines | |
Wu et al. | Extracting topics based on Word2Vec and improved Jaccard similarity coefficient | |
Kim et al. | A web service for author name disambiguation in scholarly databases | |
CN104536830A (en) | KNN text classification method based on MapReduce | |
Backes | Effective unsupervised author disambiguation with relative frequencies | |
CN105740448B (en) | Topic-oriented multi-microblog temporal summarization method | |
Long et al. | Tcsst: transfer classification of short & sparse text using external data | |
Zhou et al. | The survey of large-scale query classification | |
Ye et al. | An End‐to‐End Rumor Detection Model Based on Feature Aggregation | |
CN105426490B (en) | Tree-structure-based indexing method | |
Sailaja et al. | An overview of pre-processing text clustering methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||