CN109325092A - Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information - Google Patents

Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Info

Publication number
CN109325092A
CN109325092A CN201811438180.6A CN201811438180A
Authority
CN
China
Prior art keywords
theme
phrase
word
parallelization
nonparametric
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811438180.6A
Other languages
Chinese (zh)
Inventor
林立晖
饶洋辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201811438180.6A priority Critical patent/CN109325092A/en
Publication of CN109325092A publication Critical patent/CN109325092A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to the technical fields of natural language processing and artificial intelligence within machine learning, and more particularly to a nonparametric parallelized hierarchical Dirichlet process (HDP) topic model system that fuses phrase information. The system is divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function. While accelerating HDP computation, the proposed model also models the implicit relations of phrases in text. Compared with the prior art, it achieves parallelization while preserving the nonparametric character of HDP, compensates for the shortcomings of traditional topic models by fusing phrase semantics, overcomes the high computational cost of serial HDP and the loss of topic information, and improves the qualitative and quantitative performance of the model.

Description

Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information
Technical field
The present invention relates to the technical fields of natural language processing and artificial intelligence within machine learning, and more particularly to a nonparametric parallelized hierarchical Dirichlet process topic model system that fuses phrase information.
Background art
Existing topic models fall broadly into two classes, parametric and nonparametric. The former includes Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA); the latter corresponds to the traditional Hierarchical Dirichlet Process (HDP). Commercial applications currently favor traditional parametric topic models, but these methods require manually specified parameters, the parameters strongly affect the final mining results, and tuning is difficult in practice. The *LDA project released by Tencent is a large-scale parallel topic model implemented on a computer cluster, used to analyze the large volume of text that users generate on social media.
Probabilistic Latent Semantic Analysis (hereinafter PLSA) was the earliest method to model the text generation process with probability distributions, and it is one of the foundations on which traditional topic models were developed. PLSA analyzes text from the viewpoint of probability and statistics: each word in a document belongs to one specific topic, each topic governs a probability distribution over words, and the documents themselves obey some probability distribution. Accordingly, the generative process of a document in PLSA is shown in Fig. 1.
Here P(d_i) is the occurrence probability of document d_i, P(z_i|d_i) is the probability of topic z_i given document d_i, and P(w_i|z_i) is the occurrence probability of word w_i given topic z_i. However, this method can only mine implicit information from an existing text collection and cannot cope with new data; moreover, the more documents there are, the more parameters PLSA has, which hinders large-scale application.
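The generative process of Fig. 1 can be sketched as a small simulation: pick a document d_i, draw its topic z_i from P(z|d_i), then draw a word w_i from P(w|z_i). The toy sizes, random seed, and Dirichlet-initialized distributions below are illustrative assumptions, not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PLSA generative process: d_i, then z_i ~ P(z|d_i), then w_i ~ P(w|z_i).
n_docs, n_topics, n_words = 3, 2, 5
p_d = np.full(n_docs, 1.0 / n_docs)                      # P(d_i)
p_z_given_d = rng.dirichlet(np.ones(n_topics), n_docs)   # P(z_i | d_i)
p_w_given_z = rng.dirichlet(np.ones(n_words), n_topics)  # P(w_i | z_i)

def generate_token():
    d = rng.choice(n_docs, p=p_d)
    z = rng.choice(n_topics, p=p_z_given_d[d])
    w = rng.choice(n_words, p=p_w_given_z[z])
    return d, z, w

tokens = [generate_token() for _ in range(10)]
```

Note that all parameters here are fixed up front; PLSA's actual difficulty, as the text states, is that these tables must be estimated from a specific corpus and grow with the number of documents.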
Latent Dirichlet Allocation (hereinafter LDA) builds on the PLSA model by adding prior knowledge to the document and topic distributions: the parameters of the distributions that documents and topics obey are themselves assumed to follow conjugate prior distributions. Compared with PLSA, LDA conforms better to the rules of probability and statistics, and by introducing prior knowledge it can mine the topic information of text more accurately. The generative process of the LDA model is shown in Fig. 2.
Here z and w are the topic and word as in PLSA, θ and φ are the parameters of the distributions that topics and words obey, and α and β are the parameters of the prior distributions that those two parameters obey. Compared with PLSA, LDA adds the two hyperparameters α and β, avoiding PLSA's drawback of having to estimate the document and topic distributions from statistics over a particular text collection, and speeding up mining. The problem with LDA, however, is precisely the choice of parameters: when the values of α and β generate parameters θ and φ close to the true document and topic distributions, and the number of topics is set appropriately by hand, the model works well; otherwise no meaningful clustering is obtained.
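The LDA generative process of Fig. 2 can likewise be sketched in a few lines: α and β generate θ and φ, which then generate each token's topic z and word w. All sizes and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy LDA generative process: alpha, beta -> theta, phi -> z, w.
K, V, doc_len = 3, 8, 6        # topics, vocabulary size, tokens per document
alpha, beta = 0.5, 0.1         # the two hyperparameters LDA adds over PLSA

phi = rng.dirichlet(np.full(V, beta), K)   # phi_k ~ Dirichlet(beta), one per topic
theta = rng.dirichlet(np.full(K, alpha))   # theta_d ~ Dirichlet(alpha), one document

z = rng.choice(K, size=doc_len, p=theta)               # topic of each token
w = np.array([rng.choice(V, p=phi[k]) for k in z])     # word of each token
```

The sensitivity the text describes shows up here directly: small α and β concentrate θ and φ on few topics/words, while large values flatten them, so the same corpus can yield very different clusterings.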
The Hierarchical Dirichlet Process (hereinafter HDP) is the most important nonparametric topic model; its generative process is shown in Fig. 3. A Dirichlet process serves as the prior distribution that generates the next-level child Dirichlet process; sampling then selects the parameters of the topic distribution, which in turn determine the words. Because the Dirichlet process is itself discrete and partitions its probability measure space into infinitely many atoms (the atoms after partitioning still sum to 1, so it remains a probability distribution), it can automatically determine the optimal number of clusters, i.e. the number of topics in the topic model, avoiding the drawback of specifying the topic count by hand; it is therefore an effective nonparametric topic model. But the parameter inference of HDP itself is extremely complex: conventional serial algorithms run far more slowly than PLSA and LDA and cannot cope with large amounts of text.
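The discreteness of the Dirichlet process and the fact that its atoms sum to 1 can be illustrated with the standard truncated stick-breaking construction. This is a generic textbook construction, not the patent's inference procedure; the truncation level and concentration parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(gamma, n_atoms):
    """Truncated stick-breaking weights of a Dirichlet process.

    beta_k ~ Beta(1, gamma); pi_k = beta_k * prod_{j<k} (1 - beta_j).
    The infinite sequence of atoms sums to 1; we truncate and fold the
    leftover stick into the last atom so the draw is still a distribution.
    """
    betas = rng.beta(1.0, gamma, n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pi = betas * remaining
    pi[-1] += 1.0 - pi.sum()   # remaining stick mass
    return pi

pi = stick_breaking(gamma=1.0, n_atoms=50)
```

In an HDP topic model these atoms are the topics: a new document can always allocate mass to a previously unused atom, which is why the number of topics need not be fixed in advance.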
The above topic models each have obvious drawbacks:
1. PLSA can only learn the distribution of the current data set, cannot be extended to new data sets, and its number of parameters grows linearly with the number of documents. This is caused by the design of PLSA itself, which must discover the distribution regularities in the data set through large amounts of counting; the absence of prior knowledge also harms its results.
2. LDA requires manually specified hyperparameters, whose settings strongly affect the experimental results; in practical engineering it is hard to tune parameters that work well. This is because LDA cannot acquire prior knowledge by itself and must rely on values specified manually from experience (empirical parameters are often used); across different data sets the parameters generalize poorly.
3. The parameter inference of HDP is extremely complex, making large-scale application difficult. This is because the distributional form of the Dirichlet process is extremely complex, and even traditional variational inference or Gibbs sampling methods cannot effectively shorten the parameter estimation time.
Moreover, a common drawback of the above prior art is the assumption that words in text are mutually independent and that their corresponding topics are likewise independent. This assumption is unreasonable from both a linguistic and a probabilistic standpoint: assuming the random variables are independent and identically distributed is over-idealized, and in natural language words constantly combine with and influence one another, especially in the phrase structures that necessarily exist in every language, where several words jointly express one complete piece of semantic information. If each word is treated independently, the phrase information contained in the text cannot be mined.
Summary of the invention
To overcome at least one of the drawbacks of the prior art described above, the present invention provides a nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information, which resolves the drawback that existing parallelization techniques lose the nonparametric character of HDP.
The technical scheme is: a nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information, divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function.
In the present invention, after analyzing traditional topic models it is not hard to see that, as long as the computational efficiency problem is solved, HDP is a fine nonparametric method, and a mathematical method can be introduced to model phrases and remedy the shortcomings of traditional topic models. Since traditional HDP is difficult to parallelize, we build on an existing equivalent model of parallel HDP, propose a parallelized HDP topic model based on Gibbs sampling together with a real-time topic adjustment mechanism, and thereby resolve the drawback that existing parallelization techniques lose the nonparametric character of HDP. Meanwhile, we introduce a copula function to model the implicit relations of phrases and incorporate phrase information into the parallelization framework, mining the implicit topics of text more effectively.
Compared with the prior art, the beneficial effects are: while accelerating HDP computation, the proposed model also models the implicit relations of phrases in text. It achieves parallelization while preserving the nonparametric character of HDP, compensates for the shortcomings of traditional topic models by fusing phrase semantics, overcomes the high computational cost of serial HDP and the loss of topic information, and improves the qualitative and quantitative performance of the model.
Description of the drawings
Fig. 1 is a schematic diagram of the generative process of a document in existing PLSA.
Fig. 2 is a schematic diagram of the generative process of the existing LDA model.
Fig. 3 is a schematic diagram of the generative process of the existing hierarchical Dirichlet process nonparametric topic model.
Fig. 4 is a schematic diagram of the manager-executor mechanism of the present invention.
Fig. 5 is the first schematic diagram of the copula modeling procedure of the present invention.
Fig. 6 is the second schematic diagram of the copula modeling procedure of the present invention.
Fig. 7 is a schematic diagram of the real-time topic adjustment mechanism of the present invention.
Fig. 8 is a schematic diagram of the copula modeling and phrase information fusion of the present invention.
Detailed description of the embodiments
The attached figures are only for illustrative purposes and shall not be construed as limiting this patent. To better illustrate this embodiment, certain components in the figures may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product. For those skilled in the art, the omission of certain known structures and their descriptions in the figures is understandable. The positional relationships described in the figures are for illustration only and shall not be understood as limiting this patent.
In the present embodiment, the nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information is divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function.
As shown in Fig. 4, the design of the parallelization mechanism is as follows:
A manager-executor mechanism is designed to synchronize global topic information and to carry out unified topic addition and deletion operations. Executor threads and the manager thread take turns, forming a pipelined structure and achieving efficient parallelism. In each iteration, every executor thread updates its own topic-word information and reports it to the manager thread after the iteration; the manager then makes topic addition/deletion decisions based on the topic-word information of all threads.
As shown in Fig. 7, the real-time topic adjustment is as follows:
Two rules drive the real-time adjustment mechanism: "if the word count of a topic is still below some threshold after several iterations, the topic is considered to have died out and is deleted" and "if after a certain number of iterations there are still words not assigned to any topic, the number of topics is considered insufficient and topics are added".
During real-time topic adjustment, the thresholds governing topic addition and deletion are set as follows:
ε = 1% × (MaximumIteration)
p = 1% × (NumberOfWordsInDataset)
That is, after ε iterations, a topic whose word count is below p is deleted.
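The two thresholds above can be written down directly; the function names are illustrative.

```python
def pruning_thresholds(max_iterations, n_words_in_dataset):
    """Thresholds from the description: 1% of each quantity."""
    epsilon = 0.01 * max_iterations       # grace period in iterations
    p = 0.01 * n_words_in_dataset         # minimum word count per topic
    return epsilon, p

def should_delete(topic_word_count, iterations_done, epsilon, p):
    # after epsilon iterations, a topic with fewer than p words is deemed dead
    return iterations_done >= epsilon and topic_word_count < p

eps, p = pruning_thresholds(max_iterations=1000, n_words_in_dataset=50000)
```

For example, with 1000 iterations and 50,000 words the thresholds are ε = 10 iterations and p = 500 words.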
As shown in Fig. 8 (where Fig. 5 is a schematic diagram of the first part of the copula modeling algorithm and Fig. 6 of the second part), the implicit-relation modeling of phrases via the copula function is as follows:
A copula function is introduced to model the implicit relations of phrases. According to Sklar's theorem:
Theorem 3.1. Given a p-dimensional distribution F with univariate margins F_1, F_2, ..., F_p, there always exists a copula function C such that for all combinations of random variables (x_1, x_2, ..., x_p) ∈ R^p,
F(x_1, ..., x_p) = C(F_1(x_1), ..., F_p(x_p))   (5)
If a phrase of length p is sampled in a document, a vector U = {u_1, u_2, ..., u_p} of the same length is sampled from the copula function and converted into samples on the word distribution via the following equation, derived from Sklar's theorem:
C(u_1, u_2, ..., u_p) = F(F^{-1}(u_1), ..., F^{-1}(u_p))
The form of the copula sample U on the word distribution is computed via the quantile function and the probability integral transform:
Applying the calculation above in HDP, the topic of each word in the phrase is computed:
We then transform U = (u_1, ..., u_L) into Z = (z_1, ..., z_L), where z_i, i ∈ {1, ..., L}, is the topic assignment of the i-th word in the phrase. Once z_i = z_j for some i, j ∈ {1, ..., L}, i ≠ j, we push the i-th and j-th words into, or remove them from, X′_dk simultaneously.
Through the correlation provided by the copula function, the topics of all words in the same phrase are constrained to lie close to one another within a small range, which also accords with linguistic intuition: "the topics generating the words of the same phrase have a high probability of being similar or even identical."
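The copula step described above can be sketched with a Gaussian copula — one common choice; the patent does not specify the copula family here. Correlated uniforms U are drawn for a phrase of length L, and each u_i is pushed through the quantile function (inverse CDF) of the current topic distribution, so the resulting topic assignments z_i tend to agree within the phrase. All names, the correlation value, and the topic weights are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def gaussian_copula_sample(L, rho):
    """Draw U = (u_1..u_L) with uniform margins and pairwise correlation rho."""
    cov = np.full((L, L), rho) + (1.0 - rho) * np.eye(L)
    x = rng.multivariate_normal(np.zeros(L), cov)
    # probability integral transform: Phi(x_i) is Uniform(0, 1)
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def topics_from_uniforms(u, topic_probs):
    """Quantile transform: invert the CDF of the discrete topic distribution."""
    cdf = np.cumsum(topic_probs)
    cdf[-1] = 1.0                     # guard against floating-point round-off
    return np.searchsorted(cdf, u)    # z_i = F^{-1}(u_i)

topic_probs = np.array([0.2, 0.5, 0.3])    # current per-document topic weights
u = gaussian_copula_sample(L=3, rho=0.95)  # highly correlated within a phrase
z = topics_from_uniforms(u, topic_probs)   # topic of each word in the phrase
```

With rho close to 1 the u_i are nearly equal, so the z_i usually coincide — the "similar or even identical topics" behavior the text describes; with rho = 0 the words would be sampled independently, as in the prior art.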
Obviously, the above embodiment is merely an example given to clearly illustrate the present invention and is not a limitation on its embodiments. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1. A nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information, characterized in that it is divided into three parts: first, the design of a parallelization mechanism; second, real-time topic adjustment; third, modeling the implicit relations of phrases via a copula function.
2. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 1, characterized in that the design of the parallelization mechanism is as follows:
A manager-executor mechanism is designed to synchronize global topic information and to carry out unified topic addition and deletion operations. Executor threads and the manager thread take turns, forming a pipelined structure and achieving efficient parallelism. In each iteration, every executor thread updates its own topic-word information and reports it to the manager thread after the iteration; the manager then makes topic addition/deletion decisions based on the topic-word information of all threads.
3. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 1, characterized in that the real-time topic adjustment is as follows:
Two rules drive the real-time adjustment mechanism: "if the word count of a topic is still below some threshold after several iterations, the topic is considered to have died out and is deleted" and "if after a certain number of iterations there are still words not assigned to any topic, the number of topics is considered insufficient and topics are added".
4. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 3, characterized in that during the real-time topic adjustment, the thresholds governing topic addition and deletion are set as follows:
ε = 1% × (MaximumIteration)
p = 1% × (NumberOfWordsInDataset)
That is, after ε iterations, a topic whose word count is below p is deleted.
5. The nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information according to claim 1, characterized in that the modeling of the implicit relations of phrases via the copula function is as follows:
A copula function is introduced to model the implicit relations of phrases. According to Sklar's theorem:
Theorem 3.1. Given a p-dimensional distribution F with univariate margins F_1, F_2, ..., F_p, there always exists a copula function C such that for all combinations of random variables (x_1, x_2, ..., x_p) ∈ R^p,
F(x_1, ..., x_p) = C(F_1(x_1), ..., F_p(x_p))   (5)
If a phrase of length p is sampled in a document, a vector U = {u_1, u_2, ..., u_p} of the same length is sampled from the copula function and converted into samples on the word distribution via the following equation, derived from Sklar's theorem:
C(u_1, u_2, ..., u_p) = F(F^{-1}(u_1), ..., F^{-1}(u_p))
The form of the copula sample U on the word distribution is computed via the quantile function and the probability integral transform:
Applying the calculation above in HDP, the topic of each word in the phrase is computed:
We then transform U = (u_1, ..., u_L) into Z = (z_1, ..., z_L), where z_i, i ∈ {1, ..., L}, is the topic assignment of the i-th word in the phrase. Once z_i = z_j for some i, j ∈ {1, ..., L}, i ≠ j, we push the i-th and j-th words into, or remove them from, X′_dk simultaneously.
Through the correlation provided by the copula function, the topics of all words in the same phrase are constrained to lie close to one another within a small range, which also accords with linguistic intuition: "the topics generating the words of the same phrase have a high probability of being similar or even identical."
CN201811438180.6A 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information Pending CN109325092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811438180.6A CN109325092A (en) 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811438180.6A CN109325092A (en) 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Publications (1)

Publication Number Publication Date
CN109325092A true CN109325092A (en) 2019-02-12

Family

ID=65258841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811438180.6A Pending CN109325092A (en) 2018-11-27 2018-11-27 Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information

Country Status (1)

Country Link
CN (1) CN109325092A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885839A * 2019-03-04 2019-06-14 Sun Yat-sen University A parallelized topic model based on topic identification weighting and sampling-based reconstruction
CN113344107A * 2021-06-25 2021-09-03 Tsinghua Shenzhen International Graduate School Topic analysis method and system based on kernel principal component analysis and LDA (Latent Dirichlet Allocation)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1085429A1 (en) * 1999-09-20 2001-03-21 NCR International, Inc. Classifying data in a database
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm
CN102411638A (en) * 2011-12-30 2012-04-11 中国科学院自动化研究所 Method for generating multimedia summary of news search result
US20140278771A1 (en) * 2013-03-13 2014-09-18 Salesforce.Com, Inc. Systems, methods, and apparatuses for rendering scored opportunities using a predictive query interface
CN104063399A (en) * 2013-03-22 2014-09-24 杭州金弩信息技术有限公司 Method and system for automatically identifying emotional probability borne by texts

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1085429A1 (en) * 1999-09-20 2001-03-21 NCR International, Inc. Classifying data in a database
CN101710333A (en) * 2009-11-26 2010-05-19 西北工业大学 Network text segmenting method based on genetic algorithm
CN102411638A (en) * 2011-12-30 2012-04-11 中国科学院自动化研究所 Method for generating multimedia summary of news search result
US20140278771A1 (en) * 2013-03-13 2014-09-18 Salesforce.Com, Inc. Systems, methods, and apparatuses for rendering scored opportunities using a predictive query interface
CN105229633A (en) * 2013-03-13 2016-01-06 萨勒斯福斯通讯有限公司 For realizing system, method and apparatus disclosed in data upload, process and predicted query API
CN104063399A (en) * 2013-03-22 2014-09-24 杭州金弩信息技术有限公司 Method and system for automatically identifying emotional probability borne by texts

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885839A * 2019-03-04 2019-06-14 Sun Yat-sen University A parallelized topic model based on topic identification weighting and sampling-based reconstruction
CN113344107A * 2021-06-25 2021-09-03 Tsinghua Shenzhen International Graduate School Topic analysis method and system based on kernel principal component analysis and LDA (Latent Dirichlet Allocation)
CN113344107B (en) * 2021-06-25 2023-07-11 清华大学深圳国际研究生院 Topic analysis method and system based on kernel principal component analysis and LDA

Similar Documents

Publication Publication Date Title
CN102222092B (en) Massive high-dimension data clustering method for MapReduce platform
CN103207856B (en) A kind of Ontological concept and hierarchical relationship generation method
CN103838863B (en) A kind of big data clustering algorithm based on cloud computing platform
CN103106616B (en) Based on community discovery and the evolution method of resource consolidation and characteristics in spreading information
CN101727391B (en) Method for extracting operation sequence of software vulnerability characteristics
CN106294715A (en) A kind of association rule mining method based on attribute reduction and device
CN114238958A (en) Intrusion detection method and system based on traceable clustering and graph serialization
CN109325092A Nonparametric parallelized hierarchical Dirichlet process topic model system fusing phrase information
Riedy et al. Multithreaded community monitoring for massive streaming graph data
Guo Research on anomaly detection in massive multimedia data transmission network based on improved PSO algorithm
Mostaeen et al. Clonecognition: machine learning based code clone validation tool
Chen et al. Scalable generation of large-scale unstructured meshes by a novel domain decomposition approach
CN104217013A (en) Course positive and negative mode excavation method and system based on item weighing and item set association degree
CN106294140B (en) A kind of PoC rapid generation for submitting explanation based on code storage
Himmelspach et al. Sequential processing of PDEVS models
Song et al. Parallel incremental association rule mining framework for public opinion analysis
Wang et al. Mining high-utility temporal patterns on time interval–based data
Sengupta et al. Benchmark generator for dynamic overlapping communities in networks
Fan et al. Decision tree evolution using limited number of labeled data items from drifting data streams
Bailey et al. Efficient incremental mining of contrast patterns in changing data
CN107153870A (en) The power prediction system of small blower fan
Hou et al. Simulating the dynamics of urban land quantity in China from 2020 to 2070 under the Shared Socioeconomic Pathways
Liu et al. Role-based approach for decentralized dynamic service composition.
Tan et al. Causality and consistency of state update schemes in synchronous agent-based simulations
Zhao et al. Realization of intrusion detection system based on the improved data mining technology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190212

RJ01 Rejection of invention patent application after publication