CN105740354A - Adaptive latent Dirichlet allocation model selection method and apparatus


Info

Publication number
CN105740354A
Authority
CN
China
Prior art keywords
theme
word
topics
document
dictionary
Prior art date
2016-01-26
Legal status
Granted
Application number
CN201610050982.4A
Other languages
Chinese (zh)
Other versions
CN105740354B (en)
Inventor
程光权
陈发君
刘忠
黄金才
朱承
修保新
陈超
冯旸赫
龙开亮
Current Assignee
Changsha Yuanben Information Technology Co Ltd
National University of Defense Technology
Original Assignee
Changsha Yuanben Information Technology Co Ltd
National University of Defense Technology
Priority date
2016-01-26
Filing date
2016-01-26
Publication date
2016-07-06
Application filed by Changsha Yuanben Information Technology Co Ltd, National University of Defense Technology filed Critical Changsha Yuanben Information Technology Co Ltd
Priority to CN201610050982.4A
Publication of CN105740354A
Application granted
Publication of CN105740354B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification


Abstract

The present invention provides an adaptive latent Dirichlet allocation (LDA) model selection method and apparatus. The method comprises: initializing an empirical topic number K according to the corpus scale; repeatedly updating the topic number K by computing the average cosine similarity of the topic-word probability distributions of an LDA model; obtaining, through multiple rounds of iterative computation, a K value better suited to the current corpus than the initial topic number; and outputting the corresponding LDA model as the final result model. By dynamically adjusting the topic number K, the unreasonable models caused by setting K subjectively from personal experience are avoided to some degree, and the precision of the model is improved.

Description

Method and apparatus for adaptive latent Dirichlet allocation model selection
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and apparatus for adaptive latent Dirichlet allocation (LDA) model selection.
Background art
With the rapid development of the Internet, the volume of information grows daily, and people increasingly need to retrieve and acquire information efficiently. Since network information is expressed mostly as text, the automatic classification of text has become an important research focus in the field of information retrieval. Automatic document classification determines the category of a document from its content; methods based on statistical machine learning are the most widely used, and one common model is the latent Dirichlet allocation (LDA) model.
The LDA model is a topic model that can identify hidden topic information in large document sets or corpora, yielding document-topic and topic-word probability distributions. In text mining it is applied to text topic identification, text classification, and text similarity computation. When computing an LDA model, the setting of the topic number K is limited by personal experience, while corpora of different scales have different characteristics, and even corpora of the same scale differ from one another. This makes setting K very challenging: an unreasonable K value often degrades the precision of the LDA model and thereby affects subsequent analysis and computation.
Summary of the invention
The object of the present invention is to provide a method and apparatus for adaptive latent Dirichlet allocation model selection that solve the technical problem identified above.
One aspect of the present invention provides an adaptive latent Dirichlet allocation model selection method, comprising the following steps:
Step S100: converting a corpus into a document-frequency matrix F used for LDA model computation, setting an initial topic number K according to the corpus scale, and iteratively computing the LDA model;
Step S200: after each round of iteration, depending on whether the average cosine similarity r of the topic-word probability distributions of the LDA model has increased or decreased relative to the average cosine similarity r_old of the previous round of iterative computation, changing the direction dk of the topic-number update, then updating the topic number K, and continuing the iterative computation with the updated K as the topic number of the next round of LDA model iteration; when the iterative computation terminates, the LDA model of the current corpus is obtained.
The update rule of the topic number K is: k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; alpha is a learning factor, which may be set to a fixed value or to a sequence of values varying with the iteration count.
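As an illustration, the update rule can be written as a small function; a minimal Python sketch (the function name is illustrative, and rounding the result to an integer of at least 2 is an assumption the patent does not specify):

```python
def update_topic_number(k_old: int, dk: int, alpha: float) -> int:
    """Update rule of the topic number K: k = k_old + dk*alpha*k_old."""
    k = k_old + dk * alpha * k_old
    # Rounding and the floor of 2 topics are assumptions; the patent
    # only states the arithmetic update rule.
    return max(2, int(round(k)))
```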
Further, the construction of the document-frequency matrix F comprises the following steps:
Step S110: numbering the documents in the corpus with sequential indices;
Step S120: performing word segmentation on each document and building a dictionary;
Step S130: building from the dictionary a document-frequency matrix F of the following form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents and N is the number of words in the dictionary; each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
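A minimal sketch of constructing F from already-segmented documents (function and variable names are illustrative, not from the patent):

```python
import numpy as np

def build_doc_frequency_matrix(docs):
    """docs: list of documents, each a list of segmented words.
    Returns the M x N document-frequency matrix F and the dictionary."""
    dictionary = sorted({w for doc in docs for w in doc})  # word list (step S120)
    index = {w: j for j, w in enumerate(dictionary)}
    F = np.zeros((len(docs), len(dictionary)), dtype=np.int64)
    for i, doc in enumerate(docs):
        for w in doc:
            F[i, index[w]] += 1  # f_{i,j}: count of word j in document i
    return F, dictionary
```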
Further, initializing the topic number K also includes initializing dk, eps, alpha, and maxIteration, where maxIteration is the maximum number of iterations and eps is the iteration termination threshold.
Further, the LDA model is solved with an algorithm based on Gibbs sampling.
Further, the topic-word probability distribution φ in the LDA model is a K×N matrix, where K is the topic number and N is the number of words in the dictionary; each row represents the word probability distribution of one topic. The probability φ_{k,t} of the t-th word of the k-th topic is computed as:
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{N}\left(n_k^{(t')} + \beta_{t'}\right)}$$
where β_t is the t-th component of the hyperparameter vector β of the Dirichlet prior of the topic-word probability distribution, and n_k^{(t)} is the number of times the t-th word occurs in the k-th topic.
The cosine similarity d_{x,y} between topic vectors and the average cosine similarity r are computed, respectively, by:
$$d_{x,y} = \frac{\sum_{n=1}^{N} \varphi_{x,n}\,\varphi_{y,n}}{\lVert\varphi_x\rVert\,\lVert\varphi_y\rVert}, \qquad r = \frac{\sum_{x=0}^{K-2}\sum_{y=x+1}^{K-1} d_{x,y}}{K(K-1)/2}$$
where d_{x,y} is the cosine similarity between the x-th and y-th topic vectors of φ, φ_{x,n} is the probability of the n-th word of the x-th topic, ‖φ_x‖ is the norm of the x-th topic vector, K is the topic number, and N is the number of words in the dictionary.
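A minimal numpy sketch of these two formulas (names are illustrative):

```python
import numpy as np

def average_cosine_similarity(phi):
    """phi: K x N topic-word probability matrix.
    Returns the average cosine similarity r over all topic pairs."""
    unit = phi / np.linalg.norm(phi, axis=1, keepdims=True)  # unit-length rows
    sims = unit @ unit.T                     # sims[x, y] = d_{x,y}
    iu = np.triu_indices(phi.shape[0], k=1)  # the K*(K-1)/2 distinct pairs
    return float(sims[iu].mean())            # mean over pairs equals r
```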
Further, the update rule of dk is: when r - r_old > 0, the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0, dk is unchanged. The termination condition of the dk update is: when the change |r - r_old| is less than eps or the iteration count reaches maxIteration, the iterative computation terminates.
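A sketch of the direction update and the stopping test under these rules (reading the threshold test as |r - r_old| < eps, as in the detailed description below):

```python
def update_direction(dk, r, r_old):
    """Reverse dk when the average similarity r increased; otherwise keep it."""
    return -dk if r - r_old > 0 else dk

def should_stop(r, r_old, eps, iteration, max_iteration):
    """Stop when r has converged or the iteration budget is exhausted."""
    return abs(r - r_old) < eps or iteration >= max_iteration
```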
Another aspect of the present invention provides an adaptive latent Dirichlet allocation model selection apparatus implementing the above method, comprising:
an initial value setting module, configured to convert a corpus into a document-frequency matrix F used for LDA model computation, and to set an initial topic number K according to the corpus scale and iteratively compute the LDA model;
an iteration update module, configured to, after each round of iteration, depending on whether the average cosine similarity r of the topic-word probability distributions of the LDA model has increased or decreased relative to the average cosine similarity r_old of the previous round of iterative computation, change the direction dk of the topic-number update, then update the topic number K, and continue the iterative computation with the updated K as the topic number of the next round of LDA model iteration; when the iterative computation terminates, the LDA model of the current corpus is obtained.
The update rule of the topic number K is: k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; alpha is a learning factor, which may be set to a fixed value or to a sequence of values varying with the iteration count.
Further, the initial value setting module includes:
a numbering module, for numbering the documents in the corpus with sequential indices;
a dictionary building module, for performing word segmentation on each document and building a dictionary;
a document-frequency matrix computing module, for building from the dictionary a document-frequency matrix F of the following form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents and N is the number of words in the dictionary; each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
Technical effects of the present invention:
The adaptive latent Dirichlet allocation model selection method provided by the present invention can compute, for corpora of different scales, an LDA model whose topic number K is more suitable than the initial setting; when the number of iterations is sufficiently large, an LDA model approaching the optimum can be computed, effectively improving the precision of the LDA model.
By dynamically adjusting the topic number K, the method provided by the present invention effectively avoids the unreasonable models caused by setting K subjectively from personal experience, improving the precision of the model.
By setting a suitable number of iterations, the method provided by the present invention can be used for real-time computation on small-batch corpora, selecting a model suited to the corpus.
For the above and other aspects of the present invention to become more apparent, refer to the following description of embodiments of the adaptive latent Dirichlet allocation model selection method and apparatus according to the present invention.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of a preferred embodiment of the latent Dirichlet allocation model selection method provided by the present invention;
Fig. 2 is a schematic structural diagram of a preferred embodiment of the latent Dirichlet allocation model selection apparatus provided by the present invention.
Detailed description of the invention
The accompanying drawings, which form part of this application, are provided for a further understanding of the present invention; the schematic embodiments of the present invention and their description serve to explain the present invention and do not constitute an undue limitation of it.
Referring to Fig. 1, one aspect of the present invention provides an adaptive latent Dirichlet allocation model selection method, comprising the following steps:
Step S100: converting a corpus into a document-frequency matrix F used for LDA model computation, setting an initial topic number according to the corpus scale, and iteratively computing the LDA model;
Step S200: after each round of iteration, depending on whether the average cosine similarity r of the topic-word probability distributions of the LDA model has increased or decreased relative to the average cosine similarity r_old of the previous round of iterative computation, changing the direction dk of the topic-number update, then updating the topic number K, and continuing the iterative computation with the updated K as the topic number of the next round of LDA model iteration; when the iterative computation terminates, the LDA model of the current corpus is obtained.
The update rule of the topic number K is: k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; alpha is a learning factor, which may be set to a fixed value or to a sequence of values varying with the iteration count.
The initial topic number set here according to the corpus scale may be based on empirical values; alternatively, in practice the formula K = N/V may be used, where K is the topic number, N is the number of words in the dictionary, and V is the average number of words contained in each topic. With this formula, V can simply be set to a fixed empirical value such as 100, or valued according to the size of N (for example V = 100 when N < 10000, V = 200 when 10000 < N < 100000, and so on). The dictionary is the set of words, stored as a list.
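A sketch of this initialization; the thresholds follow the example above, and the tier beyond 100000 words is an assumption, since the patent leaves the continuation open:

```python
def initial_topic_number(n_words: int) -> int:
    """K = N/V, with the average words-per-topic V tiered by dictionary size N."""
    if n_words < 10_000:
        v = 100
    elif n_words < 100_000:
        v = 200
    else:
        v = 400  # assumed continuation of the tiering; the patent leaves it open
    return max(2, n_words // v)
```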
Through the above steps the initially set topic number K is revised automatically, yielding a topic number K that meets the requirements of the specific document corpus and avoiding the trouble of repeatedly revising the topic number of the LDA model by hand. The method provided by the present invention uses the cosine similarity measure as the evaluation criterion of LDA model quality; subsequent measurements show that topic numbers obtained with this measure effectively improve the accuracy of the resulting LDA model on all kinds of corpora. By dynamically adjusting the topic number K, the unreasonable models caused by setting K subjectively from personal experience are avoided to some degree and the precision of the model is improved; by setting a suitable number of iterations, the method can be used for real-time computation on small-batch corpora, selecting an LDA model suited to the corpus and thereby improving selection efficiency.
Preferably, the construction of the document-frequency matrix F comprises the following steps:
Step S110: numbering the documents in the corpus with sequential indices;
Step S120: performing word segmentation on each document and building a dictionary;
Step S130: building from the dictionary a document-frequency matrix F of the following form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents and N is the number of words in the dictionary; each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
The document-frequency matrix F is the vectorized representation of the documents in the corpus. It is a bag-of-words model (i.e. the spatial position relations between words are ignored and only word statistics are kept): each row holds the word frequencies of one document, a frequency of 0 meaning the document does not contain the corresponding word and a nonzero frequency meaning it does. The documents in the corpus must first be converted into the document-frequency matrix F before they can be used for LDA cluster computation.
Preferably, initializing the topic number K also includes initializing dk, eps, alpha, and maxIteration, where dk, the change direction of K, is randomly initialized to 1 or -1 (1 is forward, -1 is backward); alpha is a learning factor that may be set to a fixed value or to a sequence of values varying with the iteration count; maxIteration is the maximum number of iterations; and eps is the iteration termination threshold. Setting these parameters, combined with the subsequent computation, effectively improves the accuracy of the updates of the topic number K and the accuracy of the LDA models computed with the iteratively obtained topic number K.
LDA model clustering is performed on the document-frequency matrix F with the topic number K. The LDA model may be solved by an algorithm based on variational EM (Variational EM), an algorithm based on Gibbs sampling, or an algorithm based on expectation propagation (Expectation-Propagation); an algorithm based on Gibbs sampling is preferred.
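As an illustration of one clustering round, the sketch below fits an LDA model on F and returns the topic-word distribution. scikit-learn's LatentDirichletAllocation uses variational inference rather than the Gibbs sampler preferred here, so it serves only as a readily available stand-in:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(F, k, beta=0.01):
    """Fit an LDA model with k topics on the document-frequency matrix F
    and return the k x N topic-word probability matrix phi."""
    lda = LatentDirichletAllocation(n_components=k,
                                    topic_word_prior=beta,  # the beta hyperparameter
                                    learning_method="batch",
                                    random_state=0)
    lda.fit(F)
    # components_ holds unnormalized topic-word pseudo-counts;
    # normalizing each row yields the probability distribution phi.
    return lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```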
Preferably, the topic-word probability distribution φ in the LDA model is a K×N matrix, where K is the topic number and N is the number of words in the dictionary; each row represents the word probability distribution of one topic, and the probability φ_{k,t} of the t-th word of the k-th topic is computed as:
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{N}\left(n_k^{(t')} + \beta_{t'}\right)}$$
where β_t is the t-th component of the hyperparameter vector β of the Dirichlet prior of the topic-word probability distribution, and n_k^{(t)} is the number of times the t-th word occurs in the k-th topic.
φ can be computed by Gibbs sampling; for the detailed computation see the technical report "Parameter estimation for text analysis" published by G. Heinrich in 2005.
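Given the count matrix accumulated by a Gibbs sampler, the estimator reduces to a two-line computation; a sketch with an illustrative count-array name and a scalar (symmetric) beta:

```python
import numpy as np

def estimate_phi(n_kt, beta):
    """n_kt: K x N matrix, n_kt[k, t] = times word t is assigned to topic k.
    Returns phi with phi[k, t] = (n_kt[k, t] + beta) / sum_t'(n_kt[k, t'] + beta)."""
    smoothed = n_kt + beta
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```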
The cosine similarity d_{x,y} between topic vectors and the average cosine similarity r are computed, respectively, by:
$$d_{x,y} = \frac{\sum_{n=1}^{N} \varphi_{x,n}\,\varphi_{y,n}}{\lVert\varphi_x\rVert\,\lVert\varphi_y\rVert}, \qquad r = \frac{\sum_{x=0}^{K-2}\sum_{y=x+1}^{K-1} d_{x,y}}{K(K-1)/2}$$
where d_{x,y} is the cosine similarity between the x-th and y-th topic vectors of φ, φ_{x,n} is the probability of the n-th word of the x-th topic, ‖φ_x‖ is the norm of the x-th topic vector, K is the topic number, and N is the number of words in the dictionary (the dictionary built in step S120). The average cosine similarity r computed with these formulas measures the correlation between topics: the smaller r is, the lower the correlation between topics; r = 0 means the topics are orthogonal, which is the optimal clustering result.
Preferably, the update rule of dk is: when r - r_old > 0 (r increased), the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0 (r decreased), dk is unchanged, where r is the average similarity of the current iteration and r_old is the average similarity of the previous round. Updating dk by this rule ensures that the average cosine similarity r keeps decreasing, so that the best clustering result is obtained.
Preferably, LDA model clustering iteration on the document-frequency matrix F continues with the current topic number K; when the change |r - r_old| of the average similarity r is less than eps or the iteration count reaches maxIteration, the procedure terminates and the current LDA model is output. Terminating under this condition ensures that the iteration stops once the average cosine similarity r no longer decreases meaningfully, so that the best clustering result is obtained.
The method provided by the present invention specifically includes the following steps:
a) Preprocess the corpus to obtain the document-frequency matrix F:
1. number the documents of the corpus with sequential indices;
2. perform word segmentation on each document and build a dictionary;
3. build the document-frequency matrix F from the dictionary, in the form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
b) Initialize the topic number K according to the number of words in the dictionary; initialize dk, eps, alpha, and maxIteration, where dk, the change direction of K, is randomly initialized to 1 or -1 (1 is forward, -1 is backward); alpha is a learning factor that may be set to a fixed value or to a sequence of values varying with the iteration count; maxIteration is the maximum number of iterations; and eps is the iteration termination threshold.
c) Perform LDA model clustering on the document-frequency matrix F with topic number K. The main methods of solving the LDA model are algorithms based on variational EM (Variational EM), on Gibbs sampling, and on expectation propagation (Expectation-Propagation); the implementation example of the present invention assumes a Gibbs sampling solver, but is not limited to the Gibbs sampling algorithm.
d) From the topic-word probability distribution φ of the LDA model obtained in c), compute the cosine similarities between topic vectors and the average similarity r:
$$d_{i,j} = \frac{\sum_{n=1}^{N} \varphi_{i,n}\,\varphi_{j,n}}{\lVert\varphi_i\rVert\,\lVert\varphi_j\rVert}, \qquad r = \frac{\sum_{i=0}^{K-2}\sum_{j=i+1}^{K-1} d_{i,j}}{K(K-1)/2}$$
where d_{i,j} is the cosine similarity between the i-th and j-th topic vectors of φ, φ_{i,n} is the probability of the n-th word of the i-th topic, ‖φ_i‖ is the norm of the i-th topic vector, K is the topic number, and N is the number of words in the dictionary.
e) Update the topic number K with the formula k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration.
f) Update dk: when r - r_old > 0 (r increased), the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0 (r decreased), dk is unchanged, where r is the average similarity of the current iteration and r_old is the average similarity of the previous iteration.
g) Jump to c) with the new topic number K and continue the iteration; when the change |r - r_old| of the average similarity r is less than eps or the iteration count reaches maxIteration, the procedure terminates and the current LDA model is output (see the sketch following these steps).
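A minimal end-to-end sketch of steps a) through g), wiring together the helpers sketched earlier (all names are illustrative; fit_lda stands in for the preferred Gibbs solver):

```python
import random

def select_lda_model(F, eps=1e-4, alpha=0.1, max_iteration=50):
    """Adaptive LDA model selection: adjust the topic number K until the
    average topic-topic cosine similarity r stops decreasing."""
    k = initial_topic_number(F.shape[1])       # b) K from dictionary size
    dk = random.choice([1, -1])                # b) random initial direction
    phi = fit_lda(F, k)                        # c) first LDA model
    r_old = average_cosine_similarity(phi)     # d) initial average similarity
    for _ in range(max_iteration):
        k = update_topic_number(k, dk, alpha)  # e) update K
        phi = fit_lda(F, k)                    # c) re-cluster with the new K
        r = average_cosine_similarity(phi)     # d) new average similarity
        dk = update_direction(dk, r, r_old)    # f) flip dk if r increased
        if abs(r - r_old) < eps:               # g) converged
            break
        r_old = r
    return phi, k                              # g) output the current LDA model
```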
Referring to Fig. 2, another aspect of the present invention provides an adaptive latent Dirichlet allocation model selection apparatus implementing the above method, comprising:
an initial value setting module 100, configured to convert a corpus into a document-frequency matrix F used for LDA model computation, and to set an initial topic number K according to the corpus scale and iteratively compute the LDA model;
an iteration update module 200, configured to, after each round of iteration, depending on whether the average cosine similarity r of the topic-word probability distributions of the LDA model has increased or decreased relative to the average cosine similarity r_old of the previous round of iterative computation, change the direction dk of the topic-number update, then update the topic number K, and continue the iterative computation with the updated K as the topic number of the next round of LDA model iteration; when the iterative computation terminates, the LDA model of the current corpus is obtained.
The update rule of the topic number K is: k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; alpha is a learning factor, which may be set to a fixed value or to a sequence of values varying with the iteration count.
Using this apparatus avoids the situation where a topic number K chosen purely from experience fails to suit many unscreened corpora, and thereby improves the efficiency of producing LDA models targeted at different corpora.
Preferably, the initial value setting module includes:
a numbering module, for numbering the documents in the corpus with sequential indices;
a dictionary building module, for performing word segmentation on each document and building a dictionary;
a document-frequency matrix computing module, for building from the dictionary a document-frequency matrix F of the following form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents and N is the number of words in the dictionary; each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
The document-frequency matrix F obtained with this apparatus is the vectorized representation of the documents in the corpus. It is a bag-of-words model (the spatial position relations between words are ignored; only word statistics are kept): each row holds the word frequencies of one document, a frequency of 0 meaning the document does not contain the corresponding word and a nonzero frequency meaning it does. The documents in the corpus are first converted into the document-frequency matrix F before LDA cluster computation is applied.
Those skilled in the art will understand that the scope of the present invention is not restricted to the examples discussed above, and that several changes and modifications may be made to them without departing from the scope of the present invention as defined by the appended claims. Although the present invention has been illustrated and described in detail in the drawings and the specification, such illustration and description are explanatory or schematic only, and not restrictive; the present invention is not limited to the disclosed embodiments.
Variations of the disclosed embodiments can be understood and effected by those skilled in the art in practicing the present invention, from a study of the drawings, the specification, and the appended claims. In the claims, the term "comprising" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope of the present invention.

Claims (8)

1. An adaptive latent Dirichlet allocation model selection method, characterized by comprising the following steps:
Step S100: converting a corpus into a document-frequency matrix F used for LDA model computation, setting an initial topic number K according to the corpus scale, and iteratively computing the LDA model;
Step S200: after each round of iteration, depending on whether the average cosine similarity r of the topic-word probability distributions of the LDA model has increased or decreased relative to the average cosine similarity r_old of the previous round of iterative computation, changing the direction dk of the topic-number update, then updating the topic number K, and continuing the iterative computation with the updated K as the topic number of the next round of LDA model iteration; when the iterative computation terminates, the LDA model of the current corpus is obtained;
wherein the update rule of the topic number K is: k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; and alpha is a learning factor, which may be set to a fixed value or to a sequence of values varying with the iteration count.
2. The adaptive latent Dirichlet allocation model selection method according to claim 1, characterized in that the construction of the document-frequency matrix F comprises the following steps:
Step S110: numbering the documents in the corpus with sequential indices;
Step S120: performing word segmentation on each document and building a dictionary;
Step S130: building from the dictionary a document-frequency matrix F of the following form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents and N is the number of words in the dictionary; each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
3. The adaptive latent Dirichlet allocation model selection method according to claim 2, characterized in that initializing the topic number K also includes initializing dk, eps, alpha, and maxIteration, where maxIteration is the maximum number of iterations and eps is the iteration termination threshold.
4. The adaptive latent Dirichlet allocation model selection method according to claim 3, characterized in that the LDA model is solved with an algorithm based on Gibbs sampling.
5. The adaptive latent Dirichlet allocation model selection method according to claim 4, characterized in that:
the topic-word probability distribution φ in the LDA model is a K×N matrix, where K is the topic number and N is the number of words in the dictionary; each row represents the word probability distribution of one topic, and the probability φ_{k,t} of the t-th word of the k-th topic is computed as:
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{N}\left(n_k^{(t')} + \beta_{t'}\right)}$$
where β_t is the t-th component of the hyperparameter vector β of the Dirichlet prior of the topic-word probability distribution, and n_k^{(t)} is the number of times the t-th word occurs in the k-th topic;
the cosine similarity d_{x,y} between topic vectors and the average cosine similarity r are computed, respectively, by:
$$d_{x,y} = \frac{\sum_{n=1}^{N} \varphi_{x,n}\,\varphi_{y,n}}{\lVert\varphi_x\rVert\,\lVert\varphi_y\rVert}, \qquad r = \frac{\sum_{x=0}^{K-2}\sum_{y=x+1}^{K-1} d_{x,y}}{K(K-1)/2}$$
where d_{x,y} is the cosine similarity between the x-th and y-th topic vectors of φ, φ_{x,n} is the probability of the n-th word of the x-th topic, ‖φ_x‖ is the norm of the x-th topic vector, K is the topic number, and N is the number of words in the dictionary.
6. The adaptive latent Dirichlet allocation model selection method according to claim 5, characterized in that the update rule of dk is: when r - r_old > 0, the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0, dk is unchanged; and the termination condition of the dk update is: when the change |r - r_old| is less than eps or the iteration count reaches maxIteration, the iterative computation terminates.
7. An adaptive latent Dirichlet allocation model selection apparatus implementing the method according to any one of claims 1 to 6, characterized by comprising:
an initial value setting module, configured to convert a corpus into a document-frequency matrix F used for LDA model computation, and to set an initial topic number K according to the corpus scale and iteratively compute the LDA model;
an iteration update module, configured to, after each round of iteration, depending on whether the average cosine similarity r of the topic-word probability distributions of the LDA model has increased or decreased relative to the average cosine similarity r_old of the previous round of iterative computation, change the direction dk of the topic-number update, then update the topic number K, and continue the iterative computation with the updated K as the topic number of the next round of LDA model iteration; when the iterative computation terminates, the LDA model of the current corpus is obtained;
wherein the update rule of the topic number K is: k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; and alpha is a learning factor, which may be set to a fixed value or to a sequence of values varying with the iteration count.
8. The adaptive latent Dirichlet allocation model selection apparatus according to claim 7, characterized in that the initial value setting module includes:
a numbering module, for numbering the documents in the corpus with sequential indices;
a dictionary building module, for performing word segmentation on each document and building a dictionary;
a document-frequency matrix computing module, for building from the dictionary a document-frequency matrix F of the following form:
$$F = \begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N-1} & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N-1} & f_{2,N} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ f_{M-1,1} & f_{M-1,2} & \cdots & f_{M-1,N-1} & f_{M-1,N} \\ f_{M,1} & f_{M,2} & \cdots & f_{M,N-1} & f_{M,N} \end{pmatrix}$$
where M is the number of documents and N is the number of words in the dictionary; each row of the matrix represents a document, each column represents a word's index in the dictionary, and f_{i,j} is the number of times the j-th word of the dictionary occurs in the i-th document.
CN201610050982.4A 2016-01-26 2016-01-26 Method and apparatus for adaptive latent Dirichlet allocation model selection Active CN105740354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610050982.4A CN105740354B (en) 2016-01-26 2016-01-26 Method and apparatus for adaptive latent Dirichlet allocation model selection


Publications (2)

Publication Number Publication Date
CN105740354A 2016-07-06
CN105740354B CN105740354B (en) 2018-11-30

Family

ID=56246575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610050982.4A Active CN105740354B (en) 2016-01-26 2016-01-26 Method and apparatus for adaptive latent Dirichlet allocation model selection

Country Status (1)

Country Link
CN (1) CN105740354B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491417A * 2017-07-06 2017-12-19 复旦大学 A document generation method under a topic model based on a specific partition
CN107656919A * 2017-09-12 2018-02-02 中国软件与技术服务股份有限公司 An optimal LDA automatic model selection method based on minimum average inter-topic similarity
CN107798043A * 2017-06-28 2018-03-13 贵州大学 A text clustering method in which long texts assist short texts, based on the Dirichlet multinomial mixture model
CN109829151A * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A text segmentation method based on the hierarchical Dirichlet model
CN111966702A * 2020-08-17 2020-11-20 中国银行股份有限公司 Spark-based incremental update method and system for a financial-information bag-of-words model
CN113935321A * 2021-10-19 2022-01-14 昆明理工大学 An adaptive iterative Gibbs sampling method for LDA topic models

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080071778A1 (en) * 2004-04-05 2008-03-20 International Business Machines Corporation Apparatus for selecting documents in response to a plurality of inquiries by a plurality of clients by estimating the relevance of documents
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN105243152A (en) * 2015-10-26 2016-01-13 同济大学 Graph model-based automatic abstracting method


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798043A * 2017-06-28 2018-03-13 贵州大学 A text clustering method in which long texts assist short texts, based on the Dirichlet multinomial mixture model
CN107798043B * 2017-06-28 2022-05-03 贵州大学 Text clustering method in which long texts assist short texts, based on the Dirichlet multinomial mixture model
CN107491417A * 2017-07-06 2017-12-19 复旦大学 A document generation method under a topic model based on a specific partition
CN107491417B * 2017-07-06 2021-06-22 复旦大学 Document generation method under a topic model based on a specific partition
CN107656919A * 2017-09-12 2018-02-02 中国软件与技术服务股份有限公司 An optimal LDA automatic model selection method based on minimum average inter-topic similarity
CN109829151A * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A text segmentation method based on the hierarchical Dirichlet model
CN109829151B * 2018-11-27 2023-04-21 国网浙江省电力有限公司 Text segmentation method based on the hierarchical Dirichlet model
CN111966702A * 2020-08-17 2020-11-20 中国银行股份有限公司 Spark-based incremental update method and system for a financial-information bag-of-words model
CN111966702B * 2020-08-17 2023-08-18 中国银行股份有限公司 Spark-based incremental update method and system for a financial-information bag-of-words model
CN113935321A * 2021-10-19 2022-01-14 昆明理工大学 An adaptive iterative Gibbs sampling method for LDA topic models
CN113935321B * 2021-10-19 2024-03-26 昆明理工大学 Adaptive iterative Gibbs sampling method for LDA topic models

Also Published As

Publication number Publication date
CN105740354B (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN105740354A Adaptive latent Dirichlet allocation model selection method and apparatus
CN105868178B A multi-document automatic abstract generation method based on phrase topic modeling
CN105260390B A group-oriented item recommendation method based on joint probability matrix factorization
CN104573046A Comment analysis method and system based on word vectors
CN106649272B A named entity recognition method based on a mixed model
CN107590565A A method and device for constructing a building energy consumption prediction model
CN103617435B Image sorting method and system for active learning
CN106251174A Information recommendation method and device
CN104657496A Method and equipment for calculating information heat value
CN104899298A Microblog sentiment analysis method based on large-scale corpus feature learning
CN106021223A Sentence similarity calculation method and system
CN109816438B Information pushing method and device
CN106651542A Goods recommendation method and apparatus
CN106055673A Chinese short-text sentiment classification method based on text feature embedding
CN103886047A Stream-data-oriented distributed online recommendation method
CN104468413B A network service method and system
CN109871858A Prediction model building and object recommendation method and system, device and storage medium
CN105224959A Ranking model training method and device
CN106294863A An abstraction method for fast understanding of massive text
CN103942375A High-speed press slider dimension robust design method based on intervals
CN102541920A Method and device for improving accuracy through collaborative filtering based jointly on users and items
CN106503209A A topic heat prediction method and system
CN101887443A Method and device for classifying texts
CN103761266A Click-through rate prediction method and system based on multistage logistic regression
CN110825850A Natural language topic classification method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant