CN109325092A - Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information - Google Patents

Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Info

Publication number
CN109325092A
CN109325092A CN201811438180.6A CN201811438180A
Authority
CN
China
Prior art keywords
theme
phrase
word
parallelization
nonparametric
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811438180.6A
Other languages
Chinese (zh)
Inventor
林立晖
饶洋辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201811438180.6A priority Critical patent/CN109325092A/en
Publication of CN109325092A publication Critical patent/CN109325092A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to the technical fields of natural language processing and artificial intelligence within machine learning, and more particularly to a nonparametric parallelized hierarchical Dirichlet process (HDP) topic model system that fuses phrase information. The system is divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function. While accelerating HDP computation, the proposed model also models the implicit relations of phrases in text. Compared with the prior art, it achieves parallelization while preserving the nonparametric character of HDP, compensates for the shortcomings of traditional topic models by fusing phrase semantics, overcomes the high computational cost of serial HDP and the loss of topic information, and improves the qualitative and quantitative performance of the model.

Description

Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information
Technical field
The present invention relates to the technical fields of natural language processing and artificial intelligence within machine learning, and more particularly to a nonparametric parallelized hierarchical Dirichlet process topic model system that fuses phrase information.
Background art
Existing topic models fall broadly into two classes, parametric and nonparametric. The former includes Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA); the latter corresponds to the traditional Hierarchical Dirichlet Process (HDP). Commercial applications currently favor traditional parametric topic models, but these methods require manually specified parameters, the parameters strongly affect the final mining results, and tuning is difficult in practice. The *LDA project released by Tencent is a large-scale parallel topic model implemented on a computer cluster, used to analyze the large volume of text that users generate on social media.
Probabilistic Latent Semantic Analysis (hereinafter PLSA) was the earliest method to model the text generation process with probability distributions, and it is one of the foundations on which traditional topic models were developed. PLSA analyzes text from the viewpoint of probability and statistics: each word in a document belongs to one specific topic, each topic governs a probability distribution over words, and the documents themselves obey some probability distribution. Accordingly, the generative process of a document in PLSA is shown in Fig. 1.
Here P(d_i) is the occurrence probability of document d_i, P(z_i|d_i) is the probability of topic z_i given document d_i, and P(w_i|z_i) is the occurrence probability of word w_i given topic z_i. However, this method can only mine implicit information from an existing text collection and cannot cope with new data; moreover, the more documents there are, the more parameters PLSA has, which hinders large-scale application.
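The generative process of Fig. 1 can be sketched as a small simulation: pick a document d_i, draw its topic z_i from P(z|d_i), then draw a word w_i from P(w|z_i). The toy sizes, random seed, and Dirichlet-initialized distributions below are illustrative assumptions, not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PLSA generative process: d_i, then z_i ~ P(z|d_i), then w_i ~ P(w|z_i).
n_docs, n_topics, n_words = 3, 2, 5
p_d = np.full(n_docs, 1.0 / n_docs)                      # P(d_i)
p_z_given_d = rng.dirichlet(np.ones(n_topics), n_docs)   # P(z_i | d_i)
p_w_given_z = rng.dirichlet(np.ones(n_words), n_topics)  # P(w_i | z_i)

def generate_token():
    d = rng.choice(n_docs, p=p_d)
    z = rng.choice(n_topics, p=p_z_given_d[d])
    w = rng.choice(n_words, p=p_w_given_z[z])
    return d, z, w

tokens = [generate_token() for _ in range(10)]
```

Note that all parameters here are fixed up front; PLSA's actual difficulty, as the text states, is that these tables must be estimated from a specific corpus and grow with the number of documents.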
Latent Dirichlet Allocation (hereinafter LDA) builds on the PLSA model by adding prior knowledge to the document and topic distributions: the parameters of the distributions that documents and topics obey are themselves assumed to follow conjugate prior distributions. Compared with PLSA, LDA conforms better to the rules of probability and statistics, and by introducing prior knowledge it can mine the topic information of text more accurately. The generative process of the LDA model is shown in Fig. 2.
Here z and w are the topic and word as in PLSA, θ and φ are the parameters of the distributions that topics and words obey, and α and β are the parameters of the prior distributions that those two parameters obey. Compared with PLSA, LDA adds the two hyperparameters α and β, avoiding PLSA's drawback of having to estimate the document and topic distributions from statistics over a particular text collection, and speeding up mining. The problem with LDA, however, is precisely the choice of parameters: when the values of α and β generate parameters θ and φ close to the true document and topic distributions, and the number of topics is set appropriately by hand, the model works well; otherwise no meaningful clustering is obtained.
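The LDA generative process of Fig. 2 can likewise be sketched in a few lines: α and β generate θ and φ, which then generate each token's topic z and word w. All sizes and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy LDA generative process: alpha, beta -> theta, phi -> z, w.
K, V, doc_len = 3, 8, 6        # topics, vocabulary size, tokens per document
alpha, beta = 0.5, 0.1         # the two hyperparameters LDA adds over PLSA

phi = rng.dirichlet(np.full(V, beta), K)   # phi_k ~ Dirichlet(beta), one per topic
theta = rng.dirichlet(np.full(K, alpha))   # theta_d ~ Dirichlet(alpha), one document

z = rng.choice(K, size=doc_len, p=theta)               # topic of each token
w = np.array([rng.choice(V, p=phi[k]) for k in z])     # word of each token
```

The sensitivity the text describes shows up here directly: small α and β concentrate θ and φ on few topics/words, while large values flatten them, so the same corpus can yield very different clusterings.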
The Hierarchical Dirichlet Process (hereinafter HDP) is the most important nonparametric topic model; its generative process is shown in Fig. 3. A Dirichlet process serves as the prior distribution that generates the next-level child Dirichlet process; sampling then selects the parameters of the topic distribution, which in turn determine the words. Because the Dirichlet process is itself discrete and partitions its probability measure space into infinitely many atoms (the atoms after partitioning still sum to 1, so it remains a probability distribution), it can automatically determine the optimal number of clusters, i.e. the number of topics in the topic model, avoiding the drawback of specifying the topic count by hand; it is therefore an effective nonparametric topic model. But the parameter inference of HDP itself is extremely complex: conventional serial algorithms run far more slowly than PLSA and LDA and cannot cope with large amounts of text.
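The discreteness of the Dirichlet process and the fact that its atoms sum to 1 can be illustrated with the standard truncated stick-breaking construction. This is a generic textbook construction, not the patent's inference procedure; the truncation level and concentration parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(gamma, n_atoms):
    """Truncated stick-breaking weights of a Dirichlet process.

    beta_k ~ Beta(1, gamma); pi_k = beta_k * prod_{j<k} (1 - beta_j).
    The infinite sequence of atoms sums to 1; we truncate and fold the
    leftover stick into the last atom so the draw is still a distribution.
    """
    betas = rng.beta(1.0, gamma, n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pi = betas * remaining
    pi[-1] += 1.0 - pi.sum()   # remaining stick mass
    return pi

pi = stick_breaking(gamma=1.0, n_atoms=50)
```

In an HDP topic model these atoms are the topics: a new document can always allocate mass to a previously unused atom, which is why the number of topics need not be fixed in advance.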
The above topic models each have obvious drawbacks:
1. PLSA can only learn the distribution of the current data set, cannot be extended to new data sets, and its number of parameters grows linearly with the number of documents. This is caused by the design of PLSA itself, which must discover the distribution regularities in the data set through large amounts of counting; the absence of prior knowledge also harms its results.
2. LDA requires manually specified hyperparameters, whose settings strongly affect the experimental results; in practical engineering it is hard to tune parameters that work well. This is because LDA cannot acquire prior knowledge by itself and must rely on values specified manually from experience (empirical parameters are often used); across different data sets the parameters generalize poorly.
3. The parameter inference of HDP is extremely complex, making large-scale application difficult. This is because the distributional form of the Dirichlet process is extremely complex, and even traditional variational inference or Gibbs sampling methods cannot effectively shorten the parameter estimation time.
Moreover, a common drawback of the above prior art is the assumption that words in text are mutually independent and that their corresponding topics are likewise independent. This assumption is unreasonable from both a linguistic and a probabilistic standpoint: assuming the random variables are independent and identically distributed is over-idealized, and in natural language words constantly combine with and influence one another, especially in the phrase structures that necessarily exist in every language, where several words jointly express one complete piece of semantic information. If each word is treated independently, the phrase information contained in the text cannot be mined.
Summary of the invention
To overcome at least one of the drawbacks of the prior art described above, the present invention provides a nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information, which resolves the drawback that existing parallelization techniques lose the nonparametric character of HDP.
The technical scheme is: a nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information, divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function.
In the present invention, after analyzing traditional topic models it is not hard to see that, as long as the computational efficiency problem is solved, HDP is a fine nonparametric method, and a mathematical method can be introduced to model phrases and remedy the shortcomings of traditional topic models. Since traditional HDP is difficult to parallelize, we build on an existing equivalent model of parallel HDP, propose a parallelized HDP topic model based on Gibbs sampling together with a real-time topic adjustment mechanism, and thereby resolve the drawback that existing parallelization techniques lose the nonparametric character of HDP. Meanwhile, we introduce a copula function to model the implicit relations of phrases and incorporate phrase information into the parallelization framework, mining the implicit topics of text more effectively.
Compared with the prior art, the beneficial effects are: while accelerating HDP computation, the proposed model also models the implicit relations of phrases in text. It achieves parallelization while preserving the nonparametric character of HDP, compensates for the shortcomings of traditional topic models by fusing phrase semantics, overcomes the high computational cost of serial HDP and the loss of topic information, and improves the qualitative and quantitative performance of the model.
Description of the drawings
Fig. 1 is a schematic diagram of the generative process of a document in existing PLSA.
Fig. 2 is a schematic diagram of the generative process of the existing LDA model.
Fig. 3 is a schematic diagram of the generative process of the existing hierarchical Dirichlet process nonparametric topic model.
Fig. 4 is a schematic diagram of the manager-executor mechanism of the present invention.
Fig. 5 is the first schematic diagram of the copula modeling procedure of the present invention.
Fig. 6 is the second schematic diagram of the copula modeling procedure of the present invention.
Fig. 7 is a schematic diagram of the real-time topic adjustment mechanism of the present invention.
Fig. 8 is a schematic diagram of the copula modeling and phrase information fusion of the present invention.
Detailed description of the embodiments
The attached figures are only for illustrative purposes and shall not be construed as limiting this patent. To better illustrate this embodiment, certain components in the figures may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product. For those skilled in the art, the omission of certain known structures and their descriptions in the figures is understandable. The positional relationships described in the figures are for illustration only and shall not be understood as limiting this patent.
In the present embodiment, the nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information is divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function.
As shown in Fig. 4, the design of the parallelization mechanism is as follows:
A manager-executor mechanism is designed to synchronize global topic information and to carry out unified topic addition and deletion operations. Executor threads and the manager thread take turns, forming a pipelined structure and achieving efficient parallelism. In each iteration, every executor thread updates its own topic-word information and reports it to the manager thread after the iteration; the manager then makes topic addition/deletion decisions based on the topic-word information of all threads.
As shown in Fig. 7, the real-time topic adjustment is as follows:
Two rules drive the real-time adjustment mechanism: "if the word count of a topic is still below some threshold after several iterations, the topic is considered to have died out and is deleted" and "if after a certain number of iterations there are still words not assigned to any topic, the number of topics is considered insufficient and topics are added".
During real-time topic adjustment, the thresholds governing topic addition and deletion are set as follows:
ε = 1% × (MaximumIteration)
p = 1% × (NumberOfWordsInDataset)
That is, after ε iterations, a topic whose word count is below p is deleted.
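The two thresholds above can be written down directly; the function names are illustrative.

```python
def pruning_thresholds(max_iterations, n_words_in_dataset):
    """Thresholds from the description: 1% of each quantity."""
    epsilon = 0.01 * max_iterations       # grace period in iterations
    p = 0.01 * n_words_in_dataset         # minimum word count per topic
    return epsilon, p

def should_delete(topic_word_count, iterations_done, epsilon, p):
    # after epsilon iterations, a topic with fewer than p words is deemed dead
    return iterations_done >= epsilon and topic_word_count < p

eps, p = pruning_thresholds(max_iterations=1000, n_words_in_dataset=50000)
```

For example, with 1000 iterations and 50,000 words the thresholds are ε = 10 iterations and p = 500 words.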
As shown in Fig. 8 (where Fig. 5 is a schematic diagram of the first part of the copula modeling algorithm and Fig. 6 of the second part), the implicit-relation modeling of phrases via the copula function is as follows:
A copula function is introduced to model the implicit relations of phrases. According to Sklar's theorem:
Theorem 3.1. Given a p-dimensional distribution F with univariate margins F_1, F_2, ..., F_p, there always exists a copula function C such that for all combinations of random variables (x_1, x_2, ..., x_p) ∈ R^p,
F(x_1, ..., x_p) = C(F_1(x_1), ..., F_p(x_p))   (5)
If a phrase of length p is sampled in a document, a vector U = {u_1, u_2, ..., u_p} of the same length is sampled from the copula function and converted into samples on the word distribution via the following equation, derived from Sklar's theorem:
C(u_1, u_2, ..., u_p) = F(F^{-1}(u_1), ..., F^{-1}(u_p))
The form of the copula sample U on the word distribution is computed via the quantile function and the probability integral transform:
Applying the calculation above in HDP, the topic of each word in the phrase is computed:
We then transform U = (u_1, ..., u_L) into Z = (z_1, ..., z_L), where z_i, i ∈ {1, ..., L}, is the topic assignment of the i-th word in the phrase. Once z_i = z_j for some i, j ∈ {1, ..., L}, i ≠ j, we push the i-th and j-th words into, or remove them from, X′_dk simultaneously.
Through the correlation provided by the copula function, the topics of all words in the same phrase are constrained to lie close to one another within a small range, which also accords with linguistic intuition: "the topics generating the words of the same phrase have a high probability of being similar or even identical."
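The copula step described above can be sketched with a Gaussian copula — one common choice; the patent does not specify the copula family here. Correlated uniforms U are drawn for a phrase of length L, and each u_i is pushed through the quantile function (inverse CDF) of the current topic distribution, so the resulting topic assignments z_i tend to agree within the phrase. All names, the correlation value, and the topic weights are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def gaussian_copula_sample(L, rho):
    """Draw U = (u_1..u_L) with uniform margins and pairwise correlation rho."""
    cov = np.full((L, L), rho) + (1.0 - rho) * np.eye(L)
    x = rng.multivariate_normal(np.zeros(L), cov)
    # probability integral transform: Phi(x_i) is Uniform(0, 1)
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def topics_from_uniforms(u, topic_probs):
    """Quantile transform: invert the CDF of the discrete topic distribution."""
    cdf = np.cumsum(topic_probs)
    cdf[-1] = 1.0                     # guard against floating-point round-off
    return np.searchsorted(cdf, u)    # z_i = F^{-1}(u_i)

topic_probs = np.array([0.2, 0.5, 0.3])    # current per-document topic weights
u = gaussian_copula_sample(L=3, rho=0.95)  # highly correlated within a phrase
z = topics_from_uniforms(u, topic_probs)   # topic of each word in the phrase
```

With rho close to 1 the u_i are nearly equal, so the z_i usually coincide — the "similar or even identical topics" behavior the text describes; with rho = 0 the words would be sampled independently, as in the prior art.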
Obviously, the above embodiment is merely an example given to clearly illustrate the present invention and is not a limitation on its embodiments. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1. A nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information, characterized in that it is divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function.
2. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 1, characterized in that the design of the parallelization mechanism is as follows:
A manager-executor mechanism is designed to synchronize global topic information and to carry out unified topic addition and deletion operations. Executor threads and the manager thread take turns, forming a pipelined structure and achieving efficient parallelism. In each iteration, every executor thread updates its own topic-word information and reports it to the manager thread after the iteration; the manager then makes topic addition/deletion decisions based on the topic-word information of all threads.
3. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 1, characterized in that the real-time topic adjustment is as follows:
Two rules drive the real-time adjustment mechanism: "if the word count of a topic is still below some threshold after several iterations, the topic is considered to have died out and is deleted" and "if after a certain number of iterations there are still words not assigned to any topic, the number of topics is considered insufficient and topics are added".
4. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 3, characterized in that during the real-time topic adjustment, the thresholds governing topic addition and deletion are set as follows:
ε = 1% × (MaximumIteration)
p = 1% × (NumberOfWordsInDataset)
That is, after ε iterations, a topic whose word count is below p is deleted.
5. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 1, characterized in that the modeling of the implicit relations of phrases via the copula function is as follows:
A copula function is introduced to model the implicit relations of phrases. According to Sklar's theorem:
Theorem 3.1. Given a p-dimensional distribution F with univariate margins F_1, F_2, ..., F_p, there always exists a copula function C such that for all combinations of random variables (x_1, x_2, ..., x_p) ∈ R^p,
F(x_1, ..., x_p) = C(F_1(x_1), ..., F_p(x_p))   (5)
If a phrase of length p is sampled in a document, a vector U = {u_1, u_2, ..., u_p} of the same length is sampled from the copula function and converted into samples on the word distribution via the following equation, derived from Sklar's theorem:
C(u_1, u_2, ..., u_p) = F(F^{-1}(u_1), ..., F^{-1}(u_p))
The form of the copula sample U on the word distribution is computed via the quantile function and the probability integral transform:
Applying the calculation above in HDP, the topic of each word in the phrase is computed:
We then transform U = (u_1, ..., u_L) into Z = (z_1, ..., z_L), where z_i, i ∈ {1, ..., L}, is the topic assignment of the i-th word in the phrase. Once z_i = z_j for some i, j ∈ {1, ..., L}, i ≠ j, we push the i-th and j-th words into, or remove them from, X′_dk simultaneously.
Through the correlation provided by the copula function, the topics of all words in the same phrase are constrained to lie close to one another within a small range, which also accords with linguistic intuition: "the topics generating the words of the same phrase have a high probability of being similar or even identical."
CN201811438180.6A 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information Pending CN109325092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811438180.6A CN109325092A (en) 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811438180.6A CN109325092A (en) 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Publications (1)

Publication Number Publication Date
CN109325092A true CN109325092A (en) 2019-02-12

Family

ID=65258841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811438180.6A Pending CN109325092A (en) 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Country Status (1)

Country Link
CN (1) CN109325092A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885839A * 2019-03-04 2019-06-14 Sun Yat-sen University A parallelized topic model based on topic identification weighting and sampling-based reconstruction
CN113344107A * 2021-06-25 2021-09-03 Tsinghua Shenzhen International Graduate School Topic analysis method and system based on kernel principal component analysis and LDA (Latent Dirichlet Allocation)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1085429A1 (en) * 1999-09-20 2001-03-21 NCR International, Inc. Classifying data in a database
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm
CN102411638A (en) * 2011-12-30 2012-04-11 中国科学院自动化研究所 Method for generating multimedia summary of news search result
US20140278771A1 (en) * 2013-03-13 2014-09-18 Salesforce.Com, Inc. Systems, methods, and apparatuses for rendering scored opportunities using a predictive query interface
CN104063399A (en) * 2013-03-22 2014-09-24 杭州金弩信息技术有限公司 Method and system for automatically identifying emotional probability borne by texts

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1085429A1 (en) * 1999-09-20 2001-03-21 NCR International, Inc. Classifying data in a database
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm
CN102411638A (en) * 2011-12-30 2012-04-11 中国科学院自动化研究所 Method for generating multimedia summary of news search result
US20140278771A1 (en) * 2013-03-13 2014-09-18 Salesforce.Com, Inc. Systems, methods, and apparatuses for rendering scored opportunities using a predictive query interface
CN105229633A (en) * 2013-03-13 2016-01-06 萨勒斯福斯通讯有限公司 For realizing system, method and apparatus disclosed in data upload, process and predicted query API
CN104063399A (en) * 2013-03-22 2014-09-24 杭州金弩信息技术有限公司 Method and system for automatically identifying emotional probability borne by texts

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885839A * 2019-03-04 2019-06-14 Sun Yat-sen University A parallelized topic model based on topic identification weighting and sampling-based reconstruction
CN113344107A * 2021-06-25 2021-09-03 Tsinghua Shenzhen International Graduate School Topic analysis method and system based on kernel principal component analysis and LDA (Latent Dirichlet Allocation)
CN113344107B (en) * 2021-06-25 2023-07-11 清华大学深圳国际研究生院 Topic analysis method and system based on kernel principal component analysis and LDA

Similar Documents

Publication Publication Date Title
CN102222092B (en) Massive high-dimension data clustering method for MapReduce platform
CN103207856B (en) A kind of Ontological concept and hierarchical relationship generation method
CN103838863B (en) A kind of big data clustering algorithm based on cloud computing platform
CN103106616B (en) Based on community discovery and the evolution method of resource consolidation and characteristics in spreading information
CN101727391B (en) Method for extracting operation sequence of software vulnerability characteristics
CN106294715A (en) A kind of association rule mining method based on attribute reduction and device
CN114238958A (en) Intrusion detection method and system based on traceable clustering and graph serialization
CN109325092A Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information
Riedy et al. Multithreaded community monitoring for massive streaming graph data
Guo Research on anomaly detection in massive multimedia data transmission network based on improved PSO algorithm
Mostaeen et al. Clonecognition: machine learning based code clone validation tool
Chen et al. Scalable generation of large-scale unstructured meshes by a novel domain decomposition approach
CN104217013A (en) Course positive and negative mode excavation method and system based on item weighing and item set association degree
CN106294140B (en) A kind of PoC rapid generation for submitting explanation based on code storage
Himmelspach et al. Sequential processing of PDEVS models
Song et al. Parallel incremental association rule mining framework for public opinion analysis
Wang et al. Mining high-utility temporal patterns on time interval–based data
Sengupta et al. Benchmark generator for dynamic overlapping communities in networks
Fan et al. Decision tree evolution using limited number of labeled data items from drifting data streams
Bailey et al. Efficient incremental mining of contrast patterns in changing data
CN107153870A (en) The power prediction system of small blower fan
Hou et al. Simulating the dynamics of urban land quantity in China from 2020 to 2070 under the Shared Socioeconomic Pathways
Liu et al. Role-based approach for decentralized dynamic service composition.
Tan et al. Causality and consistency of state update schemes in synchronous agent-based simulations
Zhao et al. Realization of intrusion detection system based on the improved data mining technology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190212

RJ01 Rejection of invention patent application after publication