CN105740354A - Adaptive latent Dirichlet allocation model selection method and apparatus - Google Patents
- Publication number
- CN105740354A CN105740354A CN201610050982.4A CN201610050982A CN105740354A CN 105740354 A CN105740354 A CN 105740354A CN 201610050982 A CN201610050982 A CN 201610050982A CN 105740354 A CN105740354 A CN 105740354A
- Authority
- CN
- China
- Prior art keywords
- theme
- word
- topics
- document
- dictionary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides an adaptive latent Dirichlet allocation (LDA) model selection method and apparatus. The method comprises: initializing an empirical topic number K according to the corpus scale; continuously updating the topic number K by calculating the mean cosine distance similarity of the topic-word probability distributions of an LDA model; obtaining, through multiple rounds of iterative calculation, a K value better suited to the current corpus than the initial topic number; and outputting the corresponding LDA model as the final result. By dynamically adjusting the topic number K, the model-quality problems caused by subjective, experience-based settings are avoided to some degree, and the precision of the model is improved.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and device for adaptive latent Dirichlet allocation (LDA) model selection.
Background art
With the rapid development of the Internet, the volume of information grows daily, and users increasingly demand efficient retrieval and acquisition of information. Because network information is mostly textual, automatic classification of text has become an important research focus in the field of information retrieval. Automatic document classification determines the category of a document from its content; methods based on statistical machine learning are the most widely used, and one common model among them is the latent Dirichlet allocation (LDA) model.
The LDA model is a topic model that can identify the hidden topic information in a large document set or corpus, yielding a document-topic probability distribution and a topic-word probability distribution. In the field of text mining it is applied to text topic identification, text classification and text similarity computation. In LDA computation, the choice of the topic number K is limited by personal experience, yet corpora of different sizes have different characteristics, and even corpora of the same size differ from one another. This makes setting K challenging: an unreasonable K value often causes poor LDA model precision and thereby affects subsequent analysis and calculation.
Summary of the invention
It is an object of the present invention to provide a method and device for adaptive latent Dirichlet allocation model selection that solve the above technical problem.
One aspect of the present invention provides an adaptive latent Dirichlet allocation model selection method, comprising the following steps:
Step S100: converting the corpus into a document frequency matrix F used for LDA model calculation, setting an initial topic number K according to the corpus scale, and iteratively computing the LDA model;
Step S200: after each round of iteration, according to whether the mean cosine distance similarity r of the topic-word probability distribution of the LDA model has increased or decreased relative to the mean cosine distance similarity r_old of the previous round of iterative calculation, changing the change direction dk of the topic number K, then updating the topic number K, continuing the iterative calculation with the updated topic number K as the topic number of the next round of LDA model iteration, and, after the iterative calculation ends, obtaining the LDA model of the current corpus.
The update rule of the topic number K is k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; and alpha is a learning factor, which may be set to a fixed value or to a sequence that varies with the iteration count.
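The K update rule above can be sketched in code (a minimal illustration; rounding the result to an integer topic count and flooring it at 1 are assumptions, since the text leaves the discretization of K unstated):

```python
def update_topic_count(k_old: int, dk: int, alpha: float) -> int:
    """Update rule k = k_old + dk * alpha * k_old from the text.

    dk is the search direction (+1 or -1); alpha is the learning factor.
    Rounding, and the floor of one topic, are assumptions not stated in the text.
    """
    k = k_old + dk * alpha * k_old
    return max(1, round(k))

# With K = 100 and alpha = 0.1, the next round uses 110 topics (direction +1)
# or 90 topics (direction -1).
print(update_topic_count(100, +1, 0.1))   # 110
print(update_topic_count(100, -1, 0.1))   # 90
```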
Further, the construction of the document frequency matrix F comprises the following steps:
Step S110: numbering the documents in the corpus with sequential indices;
Step S120: segmenting each document into words and building a dictionary;
Step S130: creating, based on the dictionary, the document frequency matrix F = [f_{i,j}] of size M x N,
where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column corresponds to a word's index in the dictionary, and f_{i,j} is the number of times the j-th dictionary word occurs in the i-th document.
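Steps S110-S130 can be sketched as follows (a minimal bag-of-words illustration; the function name is an assumption, and the documents are taken to be already segmented into words, since word segmentation itself is outside this sketch):

```python
from collections import Counter

def build_doc_freq_matrix(documents):
    """Build the M x N document frequency matrix F of steps S110-S130.

    `documents` is a list of already-segmented documents (lists of words).
    """
    # Step S120: build the dictionary (distinct words, in first-seen order).
    dictionary = []
    seen = set()
    for doc in documents:
        for word in doc:
            if word not in seen:
                seen.add(word)
                dictionary.append(word)
    # Step S130: row i holds the counts f_{i,j} of dictionary word j in document i.
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        matrix.append([counts.get(word, 0) for word in dictionary])
    return dictionary, matrix

docs = [["topic", "model", "topic"], ["model", "selection"]]
vocab, F = build_doc_freq_matrix(docs)
# vocab == ["topic", "model", "selection"]; F == [[2, 1, 0], [0, 1, 1]]
```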
Further, initializing the topic number K also includes initializing dk, eps, alpha and maxIteration, where maxIteration is the maximum iteration count and eps is the iteration termination threshold.
Further, the LDA model is solved with an algorithm based on Gibbs sampling.
Further, the topic-word probability distribution φ of the LDA model is a K x N matrix, where K is the topic number and N is the number of words in the dictionary; each row represents the word probability distribution of one topic. The probability φ_{k,t} of the t-th word of the k-th topic is computed as:
φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t'=1}^{N} (n_k^{(t')} + β_{t'})
where β_t is the t-th component of the hyperparameter vector β of the Dirichlet prior on the topic-word distribution, and n_k^{(t)} is the number of times the t-th word occurs in the k-th topic.
The cosine distance similarity d_{x,y} between topic vectors and the mean cosine distance similarity r are computed by the following formulas:
d_{x,y} = Σ_{n=1}^{N} φ_{x,n} φ_{y,n} / (‖φ_x‖ ‖φ_y‖),    r = (2 / (K(K-1))) Σ_{x=1}^{K-1} Σ_{y=x+1}^{K} d_{x,y}
where d_{x,y} is the cosine distance similarity between the x-th and y-th topic vectors of φ, φ_{x,n} is the probability of the n-th word of the x-th topic, ‖φ_x‖ is the norm of the x-th topic vector, K is the topic number, and N is the number of words in the dictionary.
Further, the update rule of dk is: when r - r_old > 0, the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0, dk is unchanged. The termination condition of the dk updates is: when r - r_old falls below eps, or the iteration count reaches maxIteration, the iterative calculation ends.
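The dk update and termination rules can be sketched as follows (taking the absolute change |r - r_old| for the eps test is an assumption; the text states the comparison without an absolute value):

```python
def update_direction(dk: int, r: float, r_old: float) -> int:
    """Reverse dk when the mean similarity r increased; otherwise keep it."""
    return -dk if r - r_old > 0 else dk

def should_stop(r: float, r_old: float, eps: float,
                iteration: int, max_iteration: int) -> bool:
    """Termination test: the similarity change is below eps, or the
    iteration budget is exhausted.  Using the absolute change is an
    assumption; the text only compares r - r_old with eps."""
    return abs(r - r_old) < eps or iteration >= max_iteration
```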
Another aspect of the present invention provides an adaptive latent Dirichlet allocation model selection device for the above method, comprising:
an initial value setting module, for converting the corpus into the document frequency matrix F used for LDA model calculation, setting the initial topic number K according to the corpus scale, and iteratively computing the LDA model;
an iteration update module, for changing, after each round of iteration, the change direction dk of the topic number K according to whether the mean cosine distance similarity r of the topic-word probability distribution of the LDA model has increased or decreased relative to the mean cosine distance similarity r_old of the previous round of iterative calculation, then updating the topic number K, continuing the iterative calculation with the updated topic number K as the topic number of the next round of LDA model iteration, and, after the iterative calculation ends, obtaining the LDA model of the current corpus.
The update rule of the topic number K is k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; and alpha is a learning factor, which may be set to a fixed value or to a sequence that varies with the iteration count.
Further, the initial value setting module includes:
a numbering module, for numbering the documents in the corpus with sequential indices;
a dictionary creation module, for segmenting each document into words and building a dictionary;
a document frequency matrix computing module, for creating, based on the dictionary, the document frequency matrix F = [f_{i,j}] of size M x N,
where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column corresponds to a word's index in the dictionary, and f_{i,j} is the number of times the j-th dictionary word occurs in the i-th document.
Technical effects of the present invention:
The adaptive latent Dirichlet allocation model selection method provided by the present invention can, for corpora of different sizes, compute an LDA model whose K value is more suitable than the initial setting, and, given a sufficiently large iteration budget, can compute an LDA model that approaches the optimum, thereby effectively improving the precision of the LDA model.
The method provided by the present invention, by dynamically adjusting the topic number K, effectively avoids the model-quality problems caused by setting K subjectively from personal experience, and improves the precision of the model.
By setting a suitable iteration count, the method provided by the present invention can be used for real-time calculation on small-batch corpora, selecting a model suited to the corpus.
For details, refer to the description below of the embodiments of the adaptive latent Dirichlet allocation model selection method and device according to the present invention, which will make the above and other aspects of the invention apparent.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of a preferred embodiment of the Dirichlet model selection method provided by the present invention;
Fig. 2 is a schematic structural diagram of a preferred embodiment of the Dirichlet model selection device provided by the present invention.
Detailed description of the invention
The accompanying drawings, which form part of this application, are provided for a further understanding of the present invention; the schematic embodiments of the present invention and their description serve to explain the invention and do not constitute an undue limitation of it.
Referring to Fig. 1, one aspect of the present invention provides an adaptive latent Dirichlet allocation model selection method, comprising the following steps:
Step S100: converting the corpus into a document frequency matrix F used for LDA model calculation, setting an initial topic number K according to the corpus scale, and iteratively computing the LDA model;
Step S200: after each round of iteration, according to whether the mean cosine distance similarity r of the topic-word probability distribution of the LDA model has increased or decreased relative to the mean cosine distance similarity r_old of the previous round of iterative calculation, changing the change direction dk of the topic number K, then updating the topic number K, continuing the iterative calculation with the updated topic number K as the topic number of the next round of LDA model iteration, and, after the iterative calculation ends, obtaining the LDA model of the current corpus.
The update rule of the topic number K is k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; and alpha is a learning factor, which may be set to a fixed value or to a sequence that varies with the iteration count.
Here, the initial topic number set according to the corpus scale may be based on an empirical value, or may in practice be set with the formula K = N/V, where K is the topic number, N is the number of words in the dictionary, and V is the average number of words contained in each topic. V may simply be set to a fixed empirical value such as 100, or chosen according to the size of N (for example V = 100 when N < 10000, V = 200 when 10000 < N < 100000, and so on). The dictionary is the set of words, stored as a list.
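The K = N/V initialization can be sketched as follows (the first two thresholds mirror the example values in the text; the fallback value for very large N is an assumed continuation of that pattern):

```python
def initial_topic_count(n_words: int) -> int:
    """Initialise K = N / V, choosing V from the dictionary size N.

    V = 100 for N < 10000 and V = 200 for 10000 <= N < 100000 follow the
    example in the text; V = 400 for larger N is an assumption.
    """
    if n_words < 10_000:
        v = 100
    elif n_words < 100_000:
        v = 200
    else:
        v = 400  # assumed continuation of the pattern
    return max(1, n_words // v)

print(initial_topic_count(5_000))    # 50
print(initial_topic_count(50_000))   # 250
```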
The above steps allow the initially set topic number K to be modified automatically, yielding a topic number K that matches the specific document corpus and avoiding the trouble of repeatedly revising the topic number of the LDA model by hand. The method provided by the present invention uses the cosine distance measure as the evaluation criterion of LDA model quality; subsequent tests show that topic numbers obtained with this measure effectively improve the accuracy of the resulting LDA model on various corpora. By dynamically adjusting the topic number K, the model-quality problems caused by setting K subjectively from personal experience are avoided to a certain extent, and the precision of the model is improved; by setting a suitable iteration count, the method can be used for real-time calculation on small-batch corpora, selecting an LDA model suited to the corpus and thereby improving screening efficiency.
Preferably, the construction of the document frequency matrix F comprises the following steps:
Step S110: numbering the documents in the corpus with sequential indices;
Step S120: segmenting each document into words and building a dictionary;
Step S130: creating, based on the dictionary, the document frequency matrix F = [f_{i,j}] of size M x N,
where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column corresponds to a word's index in the dictionary, and f_{i,j} is the number of times the j-th dictionary word occurs in the i-th document.
The document frequency matrix F is a vectorized representation of the documents in the corpus. It is a bag-of-words model (i.e. it ignores the positional relationships between words and keeps only their count statistics); each row records the word frequencies of one document, where a frequency of 0 means the document does not contain the word and a non-zero frequency means it does. The documents of the corpus must first be converted into the document frequency matrix F before they can be used for LDA clustering.
Preferably, initializing the topic number K also includes initializing dk, eps, alpha and maxIteration, where dk is the change direction of K, randomly initialized to 1 or -1 (1 is forward, -1 is backward); alpha is the learning factor, which may be set to a fixed value or to a sequence varying with the iteration count; maxIteration is the maximum iteration count; and eps is the iteration termination threshold. Setting these parameters and combining them with the subsequent calculation effectively improves the accuracy with which this iterative method updates the topic number K, and hence the accuracy of the LDA model computed with the resulting K.
The topic number K is then used to perform LDA clustering on the document frequency matrix F. The LDA model may be solved with an algorithm based on variational EM, an algorithm based on Gibbs sampling, or an algorithm based on expectation propagation; an algorithm based on Gibbs sampling is preferred.
Preferably, the topic-word probability distribution φ of the LDA model is a K x N matrix, where K is the topic number and N is the number of words in the dictionary; each row represents the word probability distribution of one topic. The probability φ_{k,t} of the t-th word of the k-th topic is computed as:
φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t'=1}^{N} (n_k^{(t')} + β_{t'})
where β_t is the t-th component of the hyperparameter vector β of the Dirichlet prior on the topic-word distribution, and n_k^{(t)} is the number of times the t-th word occurs in topic k.
This can be computed via Gibbs sampling; for the detailed calculation see the 2005 technical report by G. Heinrich, "Parameter estimation for text analysis".
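The φ computation from Gibbs-sampling counts can be sketched as follows (assuming a plain count matrix `counts[k][t]` of topic-word assignment counts as input; the function name is illustrative):

```python
def topic_word_distribution(counts, beta):
    """phi[k][t] = (n_k_t + beta[t]) / sum_t'(n_k_t' + beta[t']).

    `counts[k][t]` is the number of times dictionary word t was assigned
    to topic k during Gibbs sampling; `beta` is the Dirichlet prior vector.
    Each output row is a probability distribution over the dictionary.
    """
    phi = []
    for row in counts:
        denom = sum(n + b for n, b in zip(row, beta))
        phi.append([(n + b) / denom for n, b in zip(row, beta)])
    return phi

counts = [[8, 1, 1], [1, 1, 8]]
phi = topic_word_distribution(counts, beta=[0.1, 0.1, 0.1])
# Each row sums to 1; phi[0][0] = 8.1 / 10.3
```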
The cosine distance similarity d_{x,y} between topic vectors and the mean cosine distance similarity r are computed by the following formulas:
d_{x,y} = Σ_{n=1}^{N} φ_{x,n} φ_{y,n} / (‖φ_x‖ ‖φ_y‖),    r = (2 / (K(K-1))) Σ_{x=1}^{K-1} Σ_{y=x+1}^{K} d_{x,y}
where d_{x,y} is the cosine distance similarity between the x-th and y-th topic vectors of φ; φ_{x,n} is the probability of the n-th word of the x-th topic; ‖φ_x‖ is the norm of the x-th topic vector; K is the topic number; and N is the number of words in the dictionary (the dictionary built in step S120). The mean cosine distance similarity r obtained with these formulas measures the correlation between topics: the smaller r is, the less correlated the topics are, and r = 0 means the topics are mutually orthogonal, which is the optimal clustering result.
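The pairwise cosine similarity and its mean r can be sketched as follows (averaging over the K(K-1)/2 unordered topic pairs is assumed from the term "mean cosine distance similarity"):

```python
import math

def mean_cosine_similarity(phi):
    """Mean pairwise cosine similarity r over the K topic-word vectors.

    Smaller r means less correlated (more distinct) topics; r = 0 would
    mean mutually orthogonal topics, the ideal clustering per the text.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    k = len(phi)
    pairs = [(x, y) for x in range(k) for y in range(x + 1, k)]
    return sum(cos(phi[x], phi[y]) for x, y in pairs) / len(pairs)

print(mean_cosine_similarity([[1.0, 0.0], [0.0, 1.0]]))  # 0.0 — orthogonal topics
```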
Preferably, the update rule of dk is: when r - r_old > 0 (r increased), the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0 (r decreased), dk is unchanged, where r is the mean similarity of the current iteration and r_old is the mean similarity of the previous round. Updating dk by this rule ensures that the mean cosine distance similarity r keeps decreasing, so that the best clustering result is obtained.
Preferably, the current topic number K is used to continue the LDA clustering iteration on the document frequency matrix F; when the change of the mean similarity, r - r_old, falls below eps, or the iteration count reaches maxIteration, the flow ends and the current LDA model is output. Terminating the dk updates under this condition likewise ensures that r keeps decreasing toward the best clustering result.
The method provided by the present invention specifically includes the following steps:
a) Preprocess the corpus to obtain the document frequency matrix F:
1. number the corpus documents with sequential indices;
2. segment each document into words and build a dictionary;
3. based on the dictionary, create the document frequency matrix F = [f_{i,j}] of size M x N, where M is the number of documents, N is the number of words in the dictionary, each row represents a document, each column corresponds to a word's index in the dictionary, and f_{i,j} is the number of times the j-th dictionary word occurs in the i-th document.
b) Initialize the topic number K according to the number of words in the dictionary, and initialize dk, eps, alpha and maxIteration, where dk, the change direction of K, is randomly initialized to 1 or -1 (1 is forward, -1 is backward); alpha is the learning factor, which may be set to a fixed value or to a sequence varying with the iteration count; maxIteration is the maximum iteration count; and eps is the iteration termination threshold.
c) Use the topic number K to perform LDA clustering on the document frequency matrix F. The main algorithms for solving the LDA model are variational EM, Gibbs sampling, and expectation propagation; the implementation case of the present invention assumes Gibbs sampling, but is not limited to it.
d) From the topic-word probability distribution φ of the LDA model obtained in c), compute the cosine distance similarity between topic vectors and the mean similarity r:
d_{i,j} = Σ_{n=1}^{N} φ_{i,n} φ_{j,n} / (‖φ_i‖ ‖φ_j‖),    r = (2 / (K(K-1))) Σ_{i=1}^{K-1} Σ_{j=i+1}^{K} d_{i,j}
where d_{i,j} is the cosine distance similarity between the i-th and j-th topic vectors of φ, φ_{i,n} is the probability of the n-th word of the i-th topic, ‖φ_i‖ is the norm of the i-th topic vector, K is the topic number, and N is the number of words in the dictionary.
e) Update the topic number K with the formula k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration.
f) Update dk: when r - r_old > 0 (r increased), the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0 (r decreased), dk is unchanged, where r is the mean similarity of the current iteration and r_old is the mean similarity of the previous iteration.
g) Jump back to c) with the new topic number K and continue iterating; when the change of the mean similarity, r - r_old, falls below eps, or the iteration count reaches maxIteration, the flow ends and the current LDA model is output.
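Steps b)-g) above can be sketched as a driver loop (the hook `train_and_score` is hypothetical: it stands in for training an LDA model with k topics by Gibbs sampling and returning its mean cosine similarity r; the absolute-value convergence test and the integer rounding of K are assumptions not stated in the text):

```python
import random

def adaptive_lda_topic_count(train_and_score, k_init, alpha=0.1,
                             eps=1e-3, max_iteration=50):
    """Adapt the topic number K until the mean similarity r stabilises.

    `train_and_score(k)` is a caller-supplied (hypothetical) hook that
    trains an LDA model with k topics and returns its mean cosine
    similarity r; a real implementation would run Gibbs sampling on the
    document frequency matrix F here.
    """
    dk = random.choice([1, -1])                       # b) random initial direction
    r_old = train_and_score(k_init)
    k = max(1, round(k_init + dk * alpha * k_init))   # e) first K update
    for _ in range(max_iteration):
        r = train_and_score(k)                        # c)-d) train, measure r
        if abs(r - r_old) < eps:                      # g) converged (abs assumed)
            break
        if r - r_old > 0:                             # f) flip direction when r rose
            dk = -dk
        r_old = r
        k = max(1, round(k + dk * alpha * k))         # e) next K
    return k

# Toy score: similarity is smallest at K = 20, so the search drifts toward it.
best = adaptive_lda_topic_count(lambda k: abs(k - 20) / 100.0, k_init=10)
```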
Referring to Fig. 2, another aspect of the present invention provides an adaptive latent Dirichlet allocation model selection device for the above method, comprising:
an initial value setting module 100, for converting the corpus into the document frequency matrix F used for LDA model calculation, setting the initial topic number K according to the corpus scale, and iteratively computing the LDA model;
an iteration update module 200, for changing, after each round of iteration, the change direction dk of the topic number K according to whether the mean cosine distance similarity r of the topic-word probability distribution of the LDA model has increased or decreased relative to the mean cosine distance similarity r_old of the previous round of iterative calculation, then updating the topic number K, continuing the iterative calculation with the updated topic number K as the topic number of the next round of LDA model iteration, and, after the iterative calculation ends, obtaining the LDA model of the current corpus.
The update rule of the topic number K is k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration; dk is the change direction of the topic number K, randomly initialized to 1 or -1; and alpha is a learning factor, which may be set to a fixed value or to a sequence varying with the iteration count.
Using this device avoids selecting the topic number K purely from experience, a practice poorly suited to many unscreened corpora, and thereby improves the efficiency with which targeted LDA models are produced for different corpora.
Preferably, the initial value setting module includes:
a numbering module, for numbering the documents in the corpus with sequential indices;
a dictionary creation module, for segmenting each document into words and building a dictionary;
a document frequency matrix computing module, for creating, based on the dictionary, the document frequency matrix F = [f_{i,j}] of size M x N,
where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column corresponds to a word's index in the dictionary, and f_{i,j} is the number of times the j-th dictionary word occurs in the i-th document.
The document frequency matrix F obtained with this device is a vectorized representation of the documents in the corpus. It is a bag-of-words model (i.e. it ignores the positional relationships between words and keeps only their count statistics); each row records the word frequencies of one document, where a frequency of 0 means the document does not contain the word and a non-zero frequency means it does. The documents of the corpus must first be converted into the document frequency matrix F before they can be used for LDA clustering.
Those skilled in the art will understand that the scope of the present invention is not restricted to the examples discussed above, and that changes and modifications may be made without departing from the scope defined by the appended claims. Although the invention has been illustrated and described in detail in the drawings and the description, such illustration and description are explanatory or schematic only and not restrictive; the invention is not limited to the disclosed embodiments.
Through study of the drawings, the specification and the claims, those skilled in the art can, when implementing the invention, understand and realize variations of the disclosed embodiments. In the claims, the term "comprises" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope of the invention.
Claims (8)
1. An adaptive latent Dirichlet allocation (LDA) model selection method, characterised by comprising the following steps:
Step S100: converting a corpus into a document frequency matrix F used for LDA model calculation, setting an initial topic number K according to the corpus scale, and iteratively computing the LDA model;
Step S200: after each round of iteration, according to whether the mean cosine distance similarity r of the topic-word probability distribution of the LDA model has increased or decreased relative to the mean cosine distance similarity r_old of the previous round of iterative calculation, changing the change direction dk of the topic number K, then updating the topic number K, continuing the iterative calculation with the updated topic number K as the topic number of the next round of LDA model iteration, and, after the iterative calculation ends, obtaining the LDA model of the current corpus;
wherein the update rule of the topic number K is k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration, dk is the change direction of the topic number K, randomly initialized to 1 or -1, and alpha is a learning factor that may be set to a fixed value or to a sequence varying with the iteration count.
2. The adaptive latent Dirichlet allocation model selection method according to claim 1, characterised in that the construction of the document frequency matrix F comprises the following steps:
Step S110: numbering the documents in the corpus with sequential indices;
Step S120: segmenting each document into words and building a dictionary;
Step S130: creating, based on the dictionary, the document frequency matrix F = [f_{i,j}] of size M x N,
where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column corresponds to a word's index in the dictionary, and f_{i,j} is the number of times the j-th dictionary word occurs in the i-th document.
3. The adaptive latent Dirichlet allocation model selection method according to claim 2, characterised in that initializing the topic number K also includes initializing dk, eps, alpha and maxIteration, where maxIteration is the maximum iteration count and eps is the iteration termination threshold.
4. The adaptive latent Dirichlet allocation model selection method according to claim 3, characterised in that the LDA model is solved with an algorithm based on Gibbs sampling.
5. The adaptive latent Dirichlet allocation model selection method according to claim 4, characterised in that the topic-word probability distribution φ of the LDA model is a K x N matrix, where K is the topic number and N is the number of words in the dictionary; each row represents the word probability distribution of one topic, and the probability φ_{k,t} of the t-th word of the k-th topic is computed as:
φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t'=1}^{N} (n_k^{(t')} + β_{t'})
where β_t is the t-th component of the hyperparameter vector β of the Dirichlet prior on the topic-word distribution, and n_k^{(t)} is the number of times the t-th word occurs in the k-th topic;
the cosine distance similarity d_{x,y} between topic vectors and the mean cosine distance similarity r are computed by the following formulas:
d_{x,y} = Σ_{n=1}^{N} φ_{x,n} φ_{y,n} / (‖φ_x‖ ‖φ_y‖),    r = (2 / (K(K-1))) Σ_{x=1}^{K-1} Σ_{y=x+1}^{K} d_{x,y}
where d_{x,y} is the cosine distance similarity between the x-th and y-th topic vectors of φ, φ_{x,n} is the probability of the n-th word of the x-th topic, ‖φ_x‖ is the norm of the x-th topic vector, K is the topic number, and N is the number of words in the dictionary.
6. The adaptive latent Dirichlet allocation model selection method according to claim 5, characterised in that the update rule of dk is: when r - r_old > 0, the direction of dk is reversed, i.e. dk = -1*dk; when r - r_old < 0, dk is unchanged; and the termination condition of the dk updates is: when r - r_old falls below eps, or the iteration count reaches maxIteration, the iterative calculation ends.
7. An adaptive latent Dirichlet allocation model selection device for the method of any one of claims 1 to 6, characterised by comprising:
an initial value setting module, for converting the corpus into the document frequency matrix F used for LDA model calculation, setting the initial topic number K according to the corpus scale, and iteratively computing the LDA model;
an iteration update module, for changing, after each round of iteration, the change direction dk of the topic number K according to whether the mean cosine distance similarity r of the topic-word probability distribution of the LDA model has increased or decreased relative to the mean cosine distance similarity r_old of the previous round of iterative calculation, then updating the topic number K, continuing the iterative calculation with the updated topic number K as the topic number of the next round of LDA model iteration, and, after the iterative calculation ends, obtaining the LDA model of the current corpus;
wherein the update rule of the topic number K is k = k_old + dk*alpha*k_old, where k_old is the topic number of the previous iteration, dk is the change direction of the topic number K, randomly initialized to 1 or -1, and alpha is a learning factor that may be set to a fixed value or to a sequence varying with the iteration count.
8. The adaptive latent Dirichlet model selection apparatus according to claim 7, characterized in that the initial value setting module comprises:
a numbering module, configured to number the documents in the corpus in order as an index;
a dictionary building module, configured to perform word segmentation on each document and build a dictionary;
a document-frequency matrix calculation module, configured to build, based on the dictionary, the document-frequency matrix F = (f_{i,j}) with M rows and N columns, where M is the number of documents, N is the number of words in the dictionary, each row of the matrix represents a document, each column represents a word index in the dictionary, and f_{i,j} denotes the number of times the j-th word in the dictionary occurs in the i-th document.
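As an illustration of the numbering, dictionary, and matrix modules, the following Python sketch builds F from already-segmented documents. Passing in token lists is a simplifying assumption (real Chinese text would first go through a word segmenter), and the variable names are ours:

```python
def build_document_frequency_matrix(documents):
    # documents: list of M token lists, indexed in order (0 .. M-1).
    # Build the dictionary: each distinct word gets a column index.
    dictionary = {}
    for doc in documents:
        for word in doc:
            dictionary.setdefault(word, len(dictionary))
    # F is M x N with F[i][j] = count of dictionary word j in document i.
    n = len(dictionary)
    frequency = [[0] * n for _ in documents]
    for i, doc in enumerate(documents):
        for word in doc:
            frequency[i][dictionary[word]] += 1
    return dictionary, frequency
```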
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610050982.4A CN105740354B (en) | 2016-01-26 | 2016-01-26 | The method and device of adaptive potential Di Li Cray model selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105740354A true CN105740354A (en) | 2016-07-06 |
CN105740354B CN105740354B (en) | 2018-11-30 |
Family
ID=56246575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610050982.4A Active CN105740354B (en) | 2016-01-26 | 2016-01-26 | The method and device of adaptive potential Di Li Cray model selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740354B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080071778A1 (en) * | 2004-04-05 | 2008-03-20 | International Business Machines Corporation | Apparatus for selecting documents in response to a plurality of inquiries by a plurality of clients by estimating the relevance of documents |
CN105069143A (en) * | 2015-08-19 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Method and device for extracting keywords from document |
CN105243152A (en) * | 2015-10-26 | 2016-01-13 | 同济大学 | Graph model-based automatic abstracting method |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107798043A (en) * | 2017-06-28 | 2018-03-13 | 贵州大学 | The Text Clustering Method of long text auxiliary short text based on the multinomial mixed model of Di Li Crays |
CN107798043B (en) * | 2017-06-28 | 2022-05-03 | 贵州大学 | Text clustering method for long text auxiliary short text based on Dirichlet multinomial mixed model |
CN107491417A (en) * | 2017-07-06 | 2017-12-19 | 复旦大学 | A kind of document structure tree method under topic model based on particular division |
CN107491417B (en) * | 2017-07-06 | 2021-06-22 | 复旦大学 | Document generation method based on specific division under topic model |
CN107656919A (en) * | 2017-09-12 | 2018-02-02 | 中国软件与技术服务股份有限公司 | A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme |
CN109829151A (en) * | 2018-11-27 | 2019-05-31 | 国网浙江省电力有限公司 | A kind of text segmenting method based on layering Di Li Cray model |
CN109829151B (en) * | 2018-11-27 | 2023-04-21 | 国网浙江省电力有限公司 | Text segmentation method based on hierarchical dirichlet model |
CN111966702A (en) * | 2020-08-17 | 2020-11-20 | 中国银行股份有限公司 | Spark-based financial information bag-of-words model incremental updating method and system |
CN111966702B (en) * | 2020-08-17 | 2023-08-18 | 中国银行股份有限公司 | Spark-based financial information word bag model increment updating method and system |
CN113935321A (en) * | 2021-10-19 | 2022-01-14 | 昆明理工大学 | Adaptive iteration Gibbs sampling method suitable for LDA topic model |
CN113935321B (en) * | 2021-10-19 | 2024-03-26 | 昆明理工大学 | Adaptive iterative Gibbs sampling method suitable for LDA topic model |
Similar Documents
Publication | Title |
---|---|
CN105740354A | Adaptive latent Dirichlet model selection method and apparatus |
CN105868178B | Multi-document automatic summarization method based on phrase topic modeling |
CN105260390B | Group-oriented item recommendation method based on joint probability matrix factorization |
CN104573046A | Comment analysis method and system based on word vectors |
CN106649272B | Named entity recognition method based on a hybrid model |
CN107590565A | Method and device for building a building energy consumption prediction model |
CN103617435B | Image classification method and system based on active learning |
CN106251174A | Information recommendation method and device |
CN104657496A | Method and device for calculating an information popularity value |
CN104899298A | Microblog sentiment analysis method based on feature learning over a large-scale corpus |
CN106021223A | Sentence similarity calculation method and system |
CN109816438B | Information pushing method and device |
CN106651542A | Commodity recommendation method and apparatus |
CN106055673A | Chinese short-text sentiment classification method based on text feature embedding |
CN103886047A | Distributed online recommendation method for streaming data |
CN104468413B | Network service method and system |
CN109871858A | Prediction model building and object recommendation method, system, device and storage medium |
CN105224959A | Ranking model training method and device |
CN106294863A | Summarization method for rapid understanding of massive text |
CN103942375A | Interval-based robust design method for high-speed press slider dimensions |
CN102541920A | Method and device for improving collaborative filtering accuracy jointly based on users and items |
CN106503209A | Topic popularity prediction method and system |
CN101887443A | Text classification method and device |
CN103761266A | Click-through rate prediction method and system based on multistage logistic regression |
CN110825850A | Natural language topic classification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||