CN108090231A - A topic model optimization method based on information entropy - Google Patents
A topic model optimization method based on information entropy
- Publication number: CN108090231A
- Application number: CN201810029097.7A
- Authority: CN (China)
- Prior art keywords: theme, topic, lexical item, phrase
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a topic model optimization method based on information entropy, belonging to the field of text classification technology. The main technical scheme of the invention relates to constructing a topic model and using the constructed model to calculate the topic relevance of document content. Specifically, information entropy and mutual information are used to mine, from a topic corpus, the feature lexical items that uniquely characterize a theme; the feature lexical items meeting threshold conditions form a theme dictionary, with which the topic model is trained and the topic relevance of document content is calculated. The invention is particularly suitable for calculating the topic relevance of document content whose subject keyword items are interdependent: according to statistics such as information entropy, high-granularity, strongly characteristic theme feature lexical items can be mined and feature lexical items effectively aggregated, realizing optimized classification of document content with dependent themes.
Description
Technical field
The present invention relates to a topic model optimization method based on information entropy, and belongs to the field of text classification technology.
Background technology
In recent years, mass data information has brought great convenience, but it has likewise brought huge challenges to the analysis and retrieval of information. Against the background of big data, how to rapidly obtain the required information from mass data has become a problem that people urgently need to solve.
Data takes many complex forms. Compared with intuitive data forms such as video and audio, text data is the most abstract and condensed form of data. In machine learning and natural language processing, it is often necessary to mine the latent semantic relations contained in text lexical items from large amounts of text. Earlier information-retrieval websites performed a preliminary semantic analysis of text content through shallow semantic parsing to determine the correlation between documents and search terms. With the continuous development of society and technology, however, people hope to obtain accurate answers quickly in a question-and-answer manner, and such frequent and efficient interaction requires machines to analyze and understand text semantics more deeply.
Through the learning and prediction of topic models, the topic distribution of a text can be obtained, realizing tasks such as text clustering, classification, retrieval, expansion and recommendation, with applications in fields such as text mining, sentiment analysis, recommendation systems, digital books, public-opinion monitoring, data acquisition, social networking sites and personalized retrieval.
Traditional topic representation models mainly include the Boolean model, the vector space model, the probabilistic model and the language model. The Boolean model represents a particular topic by a set of subject keywords; the relevance of a document to a topic can be judged simply by computing the intersection with the keyword set. Although the Boolean model is easy to implement, it does not consider the weights of keywords, cannot compute similarity accurately, and its binary results cannot effectively distinguish degrees of topic relevance. The vector space model makes up for the Boolean model's defect of treating all keywords as equally important: it refines the binary keyword weight into a quantitative measure of each keyword's different contribution to the topic. However, the vector space model does not consider the semantic information of lexical items and cannot judge, at the level of semantic understanding, text content whose lexical items differ but whose semantics are related.
Owing to the close relation between document semantics and document topics, methods that model document topics from the perspective of document generation have come into being. The PLSA (Probabilistic Latent Semantic Analysis) topic model models the document generation process from the frequentist point of view: the frequentist school holds that model parameters, though unknown, are fixed, so methods such as maximum likelihood estimation can be applied to compute them. The Bayesian school, which stands opposed to the frequentist school, holds that since the parameters are unknown they are themselves random variables obeying corresponding distributions. Adding corresponding prior distributions to the parameters on the basis of the PLSA model yields the LDA (Latent Dirichlet Allocation) topic model.
As a complete generative probabilistic topic model, LDA has a three-layer Bayesian network structure of feature words, topics and documents; by modeling a corpus, it mines the latent semantic information in the corpus. With the development and application of the LDA model, extended models based on LDA have gradually been proposed. To better discover the correlation information between implicit topics, the CTM model replaces the Dirichlet distribution in LDA with a logistic-normal distribution; the PAM model uses a directed acyclic graph to represent the implicit semantic information between topics, so as to mine the hierarchical relations existing between topics more effectively; the sLDA model adds class labels, making the construction and prediction of topic structure information more accurate. These extended models make full use of the LDA model's powerful capacity for representing text. Compared with other topic models, LDA introduces probability theory into the model, has a clear hierarchical structure that accords with the actual conditions of text, and possesses strong semantic clustering characteristics in a big-data environment; meanwhile, by constructing the topic layer and the feature-word layer through Dirichlet distributions, it can quickly process huge topic corpora and effectively avoid over-fitting during training.
Whether a topic model's representation is accurate is an important factor restricting the computation of text topic relevance. In practice, however, the texts in a text library can be grouped into sets by some structured attribute, and the texts within each set share commonalities; these commonalities are ignored by topic models such as LDA, which assume that texts are independent. Therefore, when most keyword items are shared between text topics, how to exploit the advantages of the LDA model to build a new topic model that effectively distinguishes the topic category a document belongs to is the key to our research.
Content of the invention
The object of the present invention is to improve the classification accuracy of dependent themes in text classification. Aimed at the characteristic that document content may differ in lexical items yet be semantically related, a topic model optimization method based on information entropy is proposed: using information entropy and mutual information statistics, the feature lexical items that uniquely characterize a theme are mined from a topic corpus; the feature lexical items meeting threshold conditions are selected to train a topic model; the topic relevance of document content is then calculated according to this topic model to distinguish the topic category a document belongs to. The method is particularly suitable for cases where the subject keyword sets have large intersections or the themes stand in superior-subordinate inclusion relations, and can effectively distinguish the topic category a document belongs to.
A topic model optimization method based on information entropy comprises the following steps:
Step 1. Train an LDA topic model to obtain a theme dictionary.
Specifically, the LDA topic model is trained on a topic corpus, which the user selects as needed, to obtain the theme dictionary. Step 1 comprises the following sub-steps:
Step 1.1 Select a document d_m with probability p(d_m), where m ∈ [1, M] and M is the number of documents;
Step 1.2 Generate the topic multinomial distribution θ_m of document d_m from a Dirichlet prior distribution, where α is the parameter of the Dirichlet prior distribution;
Step 1.3 According to θ_m, generate the topic k = z_{m,n} of the n-th word, where z_{m,n} denotes the topic of the n-th word in the m-th document;
Step 1.4 Again use a Dirichlet prior distribution to generate the lexical-item multinomial distribution φ_k of topic k = z_{m,n}, where β is the parameter of the Dirichlet prior distribution;
Step 1.5 Generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in the m-th document;
Step 1.6 For the N_m words of document d_m, repeat steps 1.2 to 1.5 to generate the corresponding topic words;
Step 1.7 Repeat steps 1.1 to 1.6 for the remaining M-1 documents to generate the theme dictionary w_{M,N}, where N is the number of words in the M documents.
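The generative process of steps 1.1 to 1.7 can be sketched in code. The following is an illustrative sketch only (not part of the patent); all function and variable names are hypothetical, and NumPy is assumed for the Dirichlet and categorical draws:

```python
import numpy as np

def generate_corpus(M, K, V, N_m, alpha, beta, rng=None):
    """Sketch of the LDA generative process of steps 1.1-1.7:
    each topic draws a term distribution phi_k ~ Dirichlet(beta);
    each document draws a topic mixture theta_m ~ Dirichlet(alpha);
    each word is generated by sampling a topic z, then a term w from phi_z.
    M documents, K topics, V vocabulary size, N_m words per document."""
    rng = rng or np.random.default_rng(0)
    phi = rng.dirichlet([beta] * V, size=K)        # step 1.4: term distribution per topic
    corpus = []
    for m in range(M):
        theta = rng.dirichlet([alpha] * K)         # step 1.2: topic distribution of d_m
        doc = []
        for n in range(N_m):
            z = rng.choice(K, p=theta)             # step 1.3: topic of the n-th word
            w = rng.choice(V, p=phi[z])            # step 1.5: word drawn from phi_z
            doc.append((int(z), int(w)))
        corpus.append(doc)
    return corpus

corpus = generate_corpus(M=3, K=2, V=10, N_m=5, alpha=0.5, beta=0.1)
```

The sketch makes explicit that φ_k is shared across documents while θ_m is drawn per document, which is what the two Dirichlet priors α and β parameterize.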
Step 2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize the theme.
Specifically, the topic corpus is scanned, and the candidate feature lexical items meeting certain threshold conditions are mined on the basis of the left/right information entropy and mutual information statistics.
Step 2 comprises two sub-steps, step 2.1 and step 2.2:
Step 2.1 Using the left/right information entropy and mutual information statistics, determine the feature lexical items capable of uniquely characterizing the theme. The computation is shown in formula (1): a candidate feature lexical item phrase must satisfy
HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, (1)
where a_1, a_2, a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively.
It should be noted that the left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined by formulas (2) and (3):
HL(phrase) = -Σ_x p(x|phrase) log p(x|phrase), (2)
HR(phrase) = -Σ_y p(y|phrase) log p(y|phrase), (3)
where character string x and character string y denote the left-adjacent and right-adjacent character strings of the candidate feature lexical item phrase respectively, p(x|phrase) denotes the probability that character string x is the left-adjacent string of phrase, and p(y|phrase) denotes the probability that character string y is the right-adjacent string of phrase.
Meanwhile, the mutual information of a candidate feature lexical item phrase is calculated by formula (4):
I(phrase) = log [ p(x, y) / (p(x) p(y)) ], (4)
where character string x and character string y denote the left and right constituent strings of the candidate feature lexical item phrase, p(x, y) denotes the probability that the candidate feature lexical item phrase occurs, and p(x) and p(y) denote the probabilities that character strings x and y occur individually. It should be noted that the larger the mutual information value I(phrase), the higher the correlation between character strings x and y, that is, the higher the probability that x and y join into a word.
Meanwhile, w_i ∈ phrase denotes a substring w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1. p(topic|w_i) can be obtained by the Bayesian formula, as shown in formula (5):
p(topic|w_i) = p(w_i|topic) p(topic) / p(w_i), (5)
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i|topic) is obtained from step 1.
Step 2.2 According to the ranking of the candidate feature lexical items' probability values p(phrase) from high to low, screen the candidate feature lexical items meeting the threshold conditions; specifically, set different threshold parameters according to formulas (1) and (5), and compute and select the top K candidate feature lexical items.
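The left/right entropy of formulas (2) and (3) and the mutual information of formula (4) can be sketched directly from their definitions. The following is a minimal illustrative sketch (not the patent's implementation); the function names and the tiny example corpus string are assumptions:

```python
import math
from collections import Counter

def left_right_entropy(corpus_text, phrase):
    """Formulas (2) and (3): entropy of the distribution of characters adjacent
    to `phrase` in the corpus. Higher entropy means the phrase occurs in freer
    contexts, i.e. behaves more like an independent word."""
    left, right = Counter(), Counter()
    start = corpus_text.find(phrase)
    while start != -1:
        if start > 0:
            left[corpus_text[start - 1]] += 1      # left-adjacent character
        end = start + len(phrase)
        if end < len(corpus_text):
            right[corpus_text[end]] += 1           # right-adjacent character
        start = corpus_text.find(phrase, start + 1)
    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0
    return entropy(left), entropy(right)

def mutual_information(p_xy, p_x, p_y):
    """Formula (4): pointwise mutual information log p(x,y) / (p(x) p(y))."""
    return math.log(p_xy / (p_x * p_y))

hl, hr = left_right_entropy("abcXYd efXYg hXYi", "XY")
```

Screening by formula (1) then amounts to keeping only phrases whose `hl`, `hr` and mutual information all exceed the chosen thresholds a_1, a_2, a_3.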
Step 3. Taking the candidate feature lexical items as the theme dictionary, train the topic model.
The training of step 3 comprises the following sub-steps:
Step 3.1 Random initialization: to each word w_i in each document of the topic corpus, randomly assign a topic number z_i;
Step 3.2 Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling computation is shown in formula (6):
p(z_i = k | z_{¬i}, w) ∝ φ̂_{k,t} · θ̂_{m,k} = [(n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t))] · [(n_{m,¬i}^{(k)} + α_k) / (Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k))], (6)
where it is assumed that lexical item w_i = t; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes rejecting the i-th assignment; n_k^{(t)} denotes the number of times lexical item t occurs in topic k; β_t is the Dirichlet prior of lexical item t; n_m^{(k)} denotes the number of times topic k occurs in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of lexical item t in topic k; θ̂_{m,k} is the probability of topic k in document m; V denotes the number of lexical items in the topic corpus; K is the number of topics in the theme dictionary; and M is the number of documents.
Step 3.3 Repeat the sampling process of step 3.2 until the Gibbs sampling converges; convergence means that the probability values obtained by sampling with formula (6) approach the joint distribution of words and topics.
Step 3.4 Count the co-occurrence frequency matrices of document topics and lexical items in the topic corpus to obtain the topic model.
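Steps 3.1 to 3.4 describe a collapsed Gibbs sampler. The following is an illustrative sketch under symmetric priors (a single α and β rather than per-topic α_k and per-term β_t), with all names hypothetical; it is not the patent's implementation:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha, beta, iters=20, seed=0):
    """Collapsed Gibbs sampler of steps 3.1-3.4: each word's topic is resampled
    from p(z_i=k|...) ∝ (n_kt + beta)/(n_k + V*beta) * (n_mk + alpha), as in
    formula (6), after removing the current word's own counts.
    docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_kt = np.zeros((K, V))                          # topic-term counts
    n_mk = np.zeros((len(docs), K))                  # document-topic counts
    z = [[int(rng.integers(K)) for _ in d] for d in docs]   # step 3.1: random init
    for m, d in enumerate(docs):
        for i, t in enumerate(d):
            n_kt[z[m][i], t] += 1
            n_mk[m, z[m][i]] += 1
    for _ in range(iters):                           # steps 3.2-3.3: sweep until convergence
        for m, d in enumerate(docs):
            for i, t in enumerate(d):
                k = z[m][i]
                n_kt[k, t] -= 1; n_mk[m, k] -= 1     # reject the i-th assignment (¬i)
                p = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta) * (n_mk[m] + alpha)
                k = int(rng.choice(K, p=p / p.sum()))   # sample from formula (6)
                n_kt[k, t] += 1; n_mk[m, k] += 1
                z[m][i] = k
    return n_kt, n_mk                                # step 3.4: co-occurrence matrices

n_kt, n_mk = gibbs_lda([[0, 1, 0], [2, 3, 2]], V=4, K=2, alpha=0.5, beta=0.1)
```

The per-document denominator of formula (6) is constant over k, so the sketch omits it before normalizing; the returned count matrices are the co-occurrence frequency matrices of step 3.4.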
Step 4. Calculate the topic relevance of document content: the probability distribution value of a document over a predetermined topic is taken as its topic relevance value.
Specifically, using the topic model, the topic distribution probability values of the document are taken as the relevance values for its topics.
Note that the documents in step 4 differ from the documents of the topic corpus in steps 1 to 3: the latter are training documents, used to train the topic model; the former are prediction documents, to which the topic model is applied.
Thus, through steps 1 to 4, the topic model optimization method based on information entropy is completed.
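The relevance value of step 4 can be read off from the smoothed document-topic distribution. A minimal sketch (hypothetical names; assuming the usual estimate θ̂_{m,k} = (n_mk + α)/(n_m + Kα) consistent with formula (6)):

```python
def topic_relevance(doc_topic_counts, alpha):
    """Step 4 sketch: topic relevance of a document is its smoothed
    document-topic probability theta_mk = (n_mk + alpha) / (n_m + K*alpha)."""
    K = len(doc_topic_counts)
    total = sum(doc_topic_counts) + K * alpha
    return [(n + alpha) / total for n in doc_topic_counts]

# a document whose 10 words were mostly assigned to topic 0
theta = topic_relevance([8, 1, 1], alpha=0.5)
```

The resulting values sum to one over the topics, and the document is classified to the topic with the largest relevance value.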
Advantageous effects
Compared with existing topic model optimization methods, the topic model optimization method based on information entropy of the present invention has the following advantages:
1. The method of the invention is better suited to calculating the topic relevance of document content whose subject keyword items are interdependent;
2. The method of the invention can, according to the information entropy and mutual information statistics, mine high-granularity, strongly characteristic theme feature lexical items and effectively aggregate feature lexical items, avoiding the defect that low-granularity lexical items blur the topic and let one word span multiple topics, thereby realizing optimized classification of document content with dependent themes;
3. The method of the invention only needs to compute the information entropy and mutual information statistics of lexical items before training the model; the operation is simple and the running cost of the method is small.
Description of the drawings
Fig. 1 is the flow diagram of the topic model optimization method based on information entropy of the present invention and of embodiment 1;
Fig. 2 is the schematic diagram of the LDA probabilistic model in embodiment 2 of the topic model optimization method based on information entropy of the present invention;
Fig. 3 is the schematic diagram of the doc-topic-word path probabilities sampled by the Gibbs sampling formula in embodiment 3 of the topic model optimization method based on information entropy of the present invention.
Specific embodiments
The invention will be further described below with reference to the accompanying drawings and specific embodiments. In order to make the technical solutions and advantages in the examples of the present application more clearly understood, exemplary embodiments of the application are described in more detail below in conjunction with the drawings; obviously, the described embodiments are only a part of the embodiments of the application, rather than an exhaustion of all embodiments. It should be noted that, where no conflict arises, the examples in the application may be combined with one another.
The present invention defines independent themes as themes whose subject keyword sets differ significantly and which have no superior-subordinate inclusion relations, and dependent themes as themes whose subject keyword sets have large intersections or stand in superior-subordinate inclusion relations. The purpose of defining the two concepts of independent theme and dependent theme is to further subdivide the specific category that document content belongs to.
The applicant has analyzed the existing technical methods that model document topics from the perspective of document generation. Taking the LDA model method as an example, if the model is applied directly to document content relevance computation, then when most keyword items are shared between topics, the model is not accurate enough in classifying documents of dependent themes.
The example of the present application provides a topic model optimization method based on information entropy: the information entropy statistic measures the degree of freedom of a word string from the outside, while the mutual information statistic measures the cohesion of a word string from the inside; thereby the feature lexical items that uniquely characterize a theme are determined, the theme dictionary is generated accordingly, the topic model is trained by the Gibbs sampling method, and the topic relevance values of document content are calculated. Using statistics such as information entropy, high-granularity, strongly characteristic theme feature lexical items can be mined and feature lexical items effectively aggregated, avoiding the defect that low-granularity lexical items blur the topic and let one word span multiple topics. It should be noted that high granularity here refers to combinations of multiple words, and low granularity refers to single words.
The scheme in the example of the present application can be applied to fields such as text mining, sentiment analysis, recommendation systems, digital books, public-opinion monitoring, data acquisition, social networking sites and personalized retrieval.
Embodiment 1
Embodiment 1 of the present invention elaborates the topic model optimization method based on information entropy; Fig. 1 is the implementation flow chart of the invention. The method comprises the following steps:
S1. Train the LDA topic model to obtain the theme dictionary.
For an LDA model with M documents and K topics, the generative process of a document in the LDA model is as follows:
S1.A For each document d_m, m ∈ [1, M], select document d_m with probability p(d_m), where M is the number of documents;
S1.B For each document d_m, m ∈ [1, M], sample the topic multinomial distribution θ_m of document d_m, where α is the parameter of the Dirichlet prior distribution;
S1.C For each document d_m, m ∈ [1, M], sample the topic k = z_{m,n} of the n-th word of document d_m, where z_{m,n} denotes the topic of the n-th word in the m-th document;
S1.D For each topic k ∈ [1, K], sample the lexical-item multinomial distribution φ_k of topic k, where β is the parameter of the Dirichlet prior distribution;
S1.E For each document d_m, m ∈ [1, M], generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in the m-th document;
S1.F For the N_m words of document d_m, repeat steps S1.B to S1.E to generate the corresponding topic words w_{m,N};
S1.G Repeat steps S1.A to S1.F for the remaining M-1 documents to generate the theme dictionary w_{M,N}, where N is the number of words in the M documents.
S2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize the theme.
S2.A Using the information entropy and mutual information statistics, determine the feature lexical items capable of uniquely characterizing the theme. The computation is as follows: a candidate feature lexical item phrase must satisfy HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2, a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively.
The left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined as
HL(phrase) = -Σ_x p(x|phrase) log p(x|phrase),
HR(phrase) = -Σ_y p(y|phrase) log p(y|phrase),
where character string x and character string y denote the left-adjacent and right-adjacent character strings of phrase respectively, p(x|phrase) denotes the probability that character string x is the left-adjacent string of phrase, and p(y|phrase) denotes the probability that character string y is the right-adjacent string of phrase.
The mutual information of a candidate feature lexical item phrase is calculated as
I(phrase) = log [ p(x, y) / (p(x) p(y)) ],
where character string x and character string y denote the left and right constituent strings of phrase, p(x, y) denotes the probability that phrase occurs, and p(x) and p(y) denote the probabilities that x and y occur individually. It should be noted that the larger the mutual information value I(phrase), the higher the correlation between x and y, that is, the higher the probability that x and y join into a word.
Meanwhile, w_i ∈ phrase denotes a substring w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1.
Meanwhile, p(topic|w_i) can be obtained by the Bayesian formula:
p(topic|w_i) = p(w_i|topic) p(topic) / p(w_i),
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i|topic) is obtained from the joint distribution of words and topics in S1.
S2.B According to the ranking of the feature lexical items' probability values p(phrase) from high to low, screen the candidate feature lexical items meeting the threshold conditions; specifically, set different threshold parameters according to the above formulas, and compute and select the top K feature lexical items.
S3. Taking the candidate feature lexical items as the theme dictionary, train the topic model. The specific training steps include:
S3.A Random initialization: to each word w_i in each document of the topic corpus, randomly assign a topic number z_i;
S3.B Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling computation is
p(z_i = k | z_{¬i}, w) ∝ φ̂_{k,t} · θ̂_{m,k} = [(n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t))] · [(n_{m,¬i}^{(k)} + α_k) / (Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k))],
where it is assumed that lexical item w_i = t; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes rejecting the i-th assignment; n_k^{(t)} denotes the number of times lexical item t occurs in topic k; β_t is the Dirichlet prior of lexical item t; n_m^{(k)} denotes the number of times topic k occurs in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of lexical item t in topic k; θ̂_{m,k} is the probability of topic k in document m; V denotes the total number of lexical items in the topic corpus; K is the total number of topics in the theme dictionary; and M is the number of documents.
S3.C Repeat the above sampling process until the Gibbs sampling converges;
S3.D Count the co-occurrence frequency matrices of document topics and lexical items in the topic corpus to obtain the topic model.
S4. Calculate the topic relevance of document content: the probability distribution value of a document over a predetermined topic is taken as its topic relevance value.
Thus, through S1 to S4, the topic model optimization method based on information entropy of the present embodiment is completed.
Embodiment 2
The present embodiment specifically describes the theme dictionary generation process of the LDA topic model described in step 1 of the present invention. LDA, as a document-topic generative model, contains the three-layer structure of feature words, topics and documents, and is a three-layer Bayesian probabilistic model, as shown in Fig. 2. As can be seen from the probabilistic model of Fig. 2, the generation of the theme dictionary corresponds to two independent Dirichlet-Multinomial conjugate structures.
S2.A Generative process 1: α → θ_m → z_m, corresponding to step 1, generates the topic numbers corresponding to all the words in the m-th document. From the conjugate-structure property and the mutual independence of documents, the generative probability of the topics in the entire topic corpus is
p(z|α) = Π_{m=1}^{M} [ Δ(n_m + α) / Δ(α) ],
where n_m = (n_m^{(1)}, ..., n_m^{(K)}), n_m^{(k)} denotes the number of lexical items generated by the k-th topic in the m-th document, K is the number of topics, and M is the number of documents.
S2.B Generative process 2: β → φ_k → w_k, corresponding to step 1, generates all the lexical items whose topic number is k in the M documents. Since the process of generating a lexical item's topic number and the process of generating the lexical item can be exchanged with each other, the generative process of a document can be regarded as first generating the topic numbers of all lexical items in the document, and then regenerating the different lexical items for all identical topic numbers. From the conjugate-structure property and the mutual independence of documents, the generative probability of the lexical items in the entire corpus is
p(w|z, β) = Π_{k=1}^{K} [ Δ(n_k + β) / Δ(β) ],
where n_k = (n_k^{(1)}, ..., n_k^{(V)}), n_k^{(t)} denotes the number of occurrences of lexical item t among the lexical items generated by the k-th topic.
S2.C According to p(z|α) and p(w|z, β), the joint distribution of the words and topics of each keyword item is obtained as
p(w, z|α, β) = p(w|z, β) p(z|α) = Π_{k=1}^{K} [ Δ(n_k + β) / Δ(β) ] · Π_{m=1}^{M} [ Δ(n_m + α) / Δ(α) ].
Thus, through steps S2.A to S2.C, the theme dictionary generation process of the LDA topic model is completed.
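The Δ(·) normalizers in the conjugate generative probabilities above can be evaluated numerically in log space via the log-gamma function, since Δ(x) = Π_i Γ(x_i) / Γ(Σ_i x_i). A minimal sketch, assuming a symmetric prior and hypothetical function names:

```python
from math import lgamma

def log_delta(counts):
    """log Δ(x) = sum_i log Γ(x_i) - log Γ(sum_i x_i), the Dirichlet normalizer
    appearing in p(z|α) and p(w|z,β) above."""
    return sum(lgamma(x) for x in counts) - lgamma(sum(counts))

def log_p_topics(doc_topic_counts, alpha, K):
    """One factor of p(z|α) for a single document: log [Δ(n_m + α) / Δ(α)],
    with a symmetric prior α on all K topics."""
    prior = [alpha] * K
    posterior = [n + alpha for n in doc_topic_counts]
    return log_delta(posterior) - log_delta(prior)

# a 3-word document with all words assigned to topic 0, K = 2, alpha = 0.5
lp = log_p_topics([3, 0], alpha=0.5, K=2)
```

Working in log space avoids overflow of the gamma functions for realistic count vectors.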
Embodiment 3
The present embodiment specifically describes the Gibbs sampling method described in step 3 of the present invention, and gives the doc-topic-word path probability schematic diagram shown in Fig. 3. The steps are as follows:
S3.A Random initialization: to each word w_i in each document of the topic corpus, randomly assign a topic number z_i;
S3.B Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling computation is
p(z_i = k | z_{¬i}, w) ∝ φ̂_{k,t} · θ̂_{m,k} = [(n_{k,¬i}^{(t)} + β_t) / (Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t))] · [(n_{m,¬i}^{(k)} + α_k) / (Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k))],
where it is assumed that lexical item w_i = t; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes rejecting the i-th assignment; n_k^{(t)} denotes the number of times lexical item t occurs in topic k; β_t is the Dirichlet prior of lexical item t; n_m^{(k)} denotes the number of times topic k occurs in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of lexical item t in topic k; θ̂_{m,k} is the probability of topic k in document m; V denotes the total number of lexical items in the topic corpus; K is the total number of topics in the theme dictionary; and M is the number of documents.
The right-hand side of the formula is p(topic|doc) · p(word|topic), that is, the probability of the path doc → topic → word. Since there are K topics, the physical meaning of the Gibbs sampling formula is exactly to sample among these K paths, as shown in Fig. 3.
S3.C Repeat the above sampling process until the Gibbs sampling converges;
S3.D Count the co-occurrence frequency matrices of document topics and lexical items in the topic corpus to obtain the topic model.
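The doc → topic → word path interpretation can be made concrete: for a given word, the sampler chooses among K paths with probabilities proportional to p(topic|doc) · p(word|topic). A minimal illustrative sketch with hypothetical names and toy numbers:

```python
import numpy as np

def path_probabilities(theta_m, phi, t):
    """The K doc->topic->word path probabilities for word t in document m:
    p(z=k|doc) * p(word=t|z=k), normalized over the K paths — exactly the
    K terms the Gibbs sampling formula samples among."""
    p = np.array([theta_m[k] * phi[k][t] for k in range(len(theta_m))])
    return p / p.sum()

theta_m = [0.7, 0.3]                # p(topic|doc) for K = 2 topics
phi = [[0.5, 0.5], [0.9, 0.1]]      # p(word|topic) over a 2-word vocabulary
p = path_probabilities(theta_m, phi, t=0)
```

Here path 0 contributes 0.7 · 0.5 = 0.35 and path 1 contributes 0.3 · 0.9 = 0.27, so the word is more likely to be reassigned to topic 0.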
Embodiment 4
Based on examples detailed above 1, the present embodiment provides a kind of specifically topic model optimization method based on comentropy, the party
Method is specifically realized based on today's tops topic corpus and is calculated the document content degree of correlation using topic model.
S4.A based on experience value, quasi-definite Dirichlet prior distributions parameter alpha=0.5 and β=0.1, Gibbs
Sampling maximum iterations are 2000 times, and based on comentropy and mutual information statistic, correspondence is calculated in number of topics 70
Theme candidate feature lexical item.
Specifically, candidates' feature lexical item such as " Group Co., Ltd ", " capital market ", " China's economic " and " financial institution "
It can substantially represent finance and economic theme;Candidates' feature lexical items such as " smart mobile phones ", " Internet era " and " artificial intelligence technology "
It can substantially represent scientific and technological class theme;Candidates' feature lexical item such as " USN " and " weaponry " can substantially represent military class master
Topic;Candidates' feature lexical item such as " Cixi empress dowager " and " Chinese history " can substantially represent history class theme.
S4.B re-starts training and obtains topic model using candidate feature lexical item as the theme dictionary of topic model.
Specifically, 3 topics, 9877 terms and 121 documents are chosen for retraining, yielding:
Feature terms of the "data" topic (the 29 highest-weight terms are selected here): 11, 90, company, double ten, technology, work, platform, AI, more than, China, Internet, artificial intelligence, Alibaba, Tmall, product, enterprise, JD.com, development, Ali, future, brand, user, global, 100 million yuan, 2017, 10, number, automation, Didi.
Feature terms of the "equipment" topic (the 29 highest-weight terms are selected here): DNF, player, version, damage, occupation, skill, Lu Ke, game, trade council, upgrade, attribute, increase, epic, amplification, authentic, update, abyss, robot, Korean server, time, sweeping, stone, cross-region, clean, 90, strength, correction, national server, weapon.
Feature terms of the "vehicle" topic (the 29 highest-weight terms are selected here): automobile, design, price, engine, engine oil, configuration, use, standard, Alphard, power, space, car buying, 4S shop, interior, sales, market, feel, BMW, friend, influence, domestic, 310w, car owner, automatic, styling, vehicle, consumer, SUV, car purchase.
S4.C Apply the topic model to compute the topic relevance value of document content, taking the document's probability distribution value on a predetermined topic as its topic relevance value.
Specifically, using the topic dictionaries of the "data", "equipment" and "vehicle" topics above, technology, automobile and game corpora are chosen and the topic relevance values of document content in each corpus are computed.
Topic relevance values of document content in the "technology" corpus (the first 3 documents are cited here): doc:0, topic:0 (0.520574), topic:1 (0.314514), topic:2 (0.164912); doc:1, topic:0 (0.738012), topic:1 (0.135914), topic:2 (0.126075); doc:2, topic:0 (0.813056), topic:2 (0.122989), topic:1 (0.063955).
Topic relevance values of document content in the "automobile" corpus (the first 3 documents are cited here): doc:0, topic:2 (0.755955), topic:0 (0.143801), topic:1 (0.100244); doc:1, topic:2 (0.736144), topic:0 (0.144676), topic:1 (0.119180); doc:2, topic:2 (0.614256), topic:0 (0.298078), topic:1 (0.087666).
Topic relevance values of document content in the "game" corpus (the first 3 documents are cited here): doc:0, topic:0 (0.395853), topic:1 (0.336999), topic:2 (0.267147); doc:1, topic:1 (0.607892), topic:2 (0.252507), topic:0 (0.139601); doc:2, topic:1 (0.420732), topic:0 (0.314079), topic:2 (0.265189).
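The relevance values above can be ranked per document as in the following sketch (illustration only), here using the first "technology" document's distribution from the embodiment:

```python
def subject_relativity(topic_dist):
    """Rank the topics of one document by probability (step S4.C): the
    document's topic relevance value for a topic is its probability mass."""
    return sorted(enumerate(topic_dist), key=lambda kv: kv[1], reverse=True)

# doc:0 of the "technology" corpus from the embodiment above.
ranked = subject_relativity([0.520574, 0.314514, 0.164912])
```

The top-ranked entry identifies the document's topic category; here topic 0 (the "data" topic) dominates, matching the listing above.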
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (5)
1. A topic model optimization method based on information entropy, characterized in that: feature terms that can uniquely characterize a topic are mined from a topic corpus using information entropy and mutual information statistics, feature terms meeting threshold conditions are selected, the topic model is trained, the topic relevance of document content is computed, and the topic category of each document is distinguished; the method is particularly suitable for cases where topic keyword sets overlap heavily or topics stand in superior-subordinate inclusion relations, and it can effectively distinguish the topic category a document belongs to; the method comprises the following steps:
Step 1. Train an LDA topic model to obtain a topic dictionary;
specifically, the LDA topic model is trained on the topic corpus to obtain the topic dictionary;
Step 2. Mine, using information entropy, the candidate feature terms that uniquely characterize the topic from the topic corpus;
specifically, the topic corpus is scanned, and candidate feature terms meeting specific threshold conditions are mined based on left/right information entropy and mutual information statistics;
Step 3. Using the candidate feature terms as the topic dictionary, train to obtain the topic model;
Step 4. Compute the topic relevance of document content, taking the document's probability distribution value on a predetermined topic as its topic relevance value;
specifically, using the topic model, the document's topic distribution probability value is taken as its topic relevance value;
thus, from step 1 to step 4, a topic model optimization method based on information entropy is completed.
2. The topic model optimization method based on information entropy according to claim 1, characterized in that: the topic corpus in step 1 is selected by the user as needed, and step 1 specifically comprises the following sub-steps:
Step 1.1 Select a document d_m according to probability p(d_m), where m ∈ [1, M] and M is the number of documents;
Step 1.2 Generate the topic multinomial distribution θ_m of document d_m from a Dirichlet prior distribution, where α is the parameter of the Dirichlet prior distribution;
Step 1.3 According to θ_m, generate the topic k = z_{m,n} of the n-th word of document d_m, where z_{m,n} denotes the n-th topic in document m;
Step 1.4 Again using a Dirichlet prior distribution, generate the term multinomial distribution φ_k of topic k = z_{m,n}, where β is the parameter of the Dirichlet prior distribution;
Step 1.5 Generate the topic word w_{m,n} from the term multinomial distribution φ_k, where w_{m,n} denotes the n-th word in document m;
Step 1.6 For the N_m words in document d_m, repeat steps 1.2 to 1.5 to generate the corresponding topic words;
Step 1.7 Repeat steps 1.1 to 1.6 another M−1 times for the M documents to generate the topic dictionary w_{M,N}, where N is the number of words in the M documents.
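For illustration only (not part of the claim), the generative process of steps 1.1 to 1.7 can be sketched in Python with numpy; the corpus sizes, priors and seed below are arbitrary assumptions:

```python
import numpy as np

def generate_corpus(M, N, K, V, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process of steps 1.1-1.7."""
    rng = np.random.default_rng(seed)
    # Step 1.4: term multinomial distribution phi_k for every topic k.
    phi = rng.dirichlet([beta] * V, size=K)
    docs = []
    for m in range(M):                                 # step 1.7: all M documents
        theta = rng.dirichlet([alpha] * K)             # step 1.2: theta_m ~ Dir(alpha)
        words = []
        for n in range(N):                             # step 1.6: N_m words
            k = rng.choice(K, p=theta)                 # step 1.3: topic z_{m,n}
            words.append(int(rng.choice(V, p=phi[k]))) # step 1.5: word w_{m,n}
        docs.append(words)
    return docs

docs = generate_corpus(M=3, N=5, K=2, V=10, alpha=0.5, beta=0.1)
```

In real LDA training this process is inverted: the words are observed and the hidden θ and φ are inferred, which is what the Gibbs sampling of claim 4 does.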
3. The topic model optimization method based on information entropy according to claim 1, characterized in that step 2 specifically comprises the following sub-steps:
Step 2.1 Use information entropy and mutual information statistics to determine the feature terms that can uniquely characterize a topic; the calculation method is shown in formula (1):
$$p(\mathrm{phrase}) = \sum_{w_i \in \mathrm{phrase}} p(\mathrm{topic} \mid w_i) \Big/ \sum p(\mathrm{phrase}) \tag{1}$$
A candidate feature term phrase must satisfy: HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2 and a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase), respectively;
The left information entropy HL(phrase) and right information entropy HR(phrase) of a candidate feature term phrase are defined by formulas (2) and (3):
$$HR(\mathrm{phrase}) = -\sum_{y} p(y \mid \mathrm{phrase}) \log p(y \mid \mathrm{phrase}) \tag{2}$$
$$HL(\mathrm{phrase}) = -\sum_{x} p(x \mid \mathrm{phrase}) \log p(x \mid \mathrm{phrase}) \tag{3}$$
where strings x and y denote the left-adjacent string and right-adjacent string of the candidate feature term phrase, respectively; p(x | phrase) denotes the probability that string x is the left-adjacent string of phrase, and p(y | phrase) denotes the probability that string y is the right-adjacent string of phrase;
The mutual information of a candidate feature term phrase is computed by formula (4):
$$I(\mathrm{phrase}) = \sum_{y \in \mathrm{phrase}} \sum_{x \in \mathrm{phrase}} p(x, y) \log\!\left(\frac{p(x, y)}{p(x)\,p(y)}\right) \tag{4}$$
where strings x and y denote the left-adjacent string and right-adjacent string of the candidate feature term phrase, respectively; p(x, y) denotes the probability that the candidate feature term phrase occurs, and p(x) and p(y) denote the probabilities that strings x and y occur independently; it should be noted that the larger the mutual information I(phrase), the higher the correlation between strings x and y, i.e., the higher the probability that x and y join into a word;
meanwhile, w_i ∈ phrase denotes a substring w_i of phrase (1 ≤ i ≤ 3), and Σp(phrase) is a normalization factor with 0 ≤ p(phrase) ≤ 1;
meanwhile, p(topic | w_i) can be obtained from Bayes' formula, as shown in formula (5):
$$p(\mathrm{topic} \mid w_i) = \frac{p(w_i \mid \mathrm{topic})\,p(\mathrm{topic})}{p(w_i)} \tag{5}$$
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i | topic) is obtained from step 1;
Step 2.2 According to the ranking of candidate feature terms by probability value p(phrase) from high to low, screen the candidate feature terms that meet the threshold conditions; specifically, set different threshold parameters according to formulas (1) and (5), and compute and select the top K candidate feature terms.
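As an illustrative sketch of formulas (2) to (4) (not part of the claim), the left or right entropy of a candidate phrase and one term of its mutual information sum can be computed as follows; the neighbor strings and counts are hypothetical:

```python
import math
from collections import Counter

def boundary_entropy(neighbors):
    """Left or right information entropy of a candidate phrase, formulas
    (2)/(3): the entropy of the distribution of strings adjacent to it."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

def mi_term(p_xy, p_x, p_y):
    """One term of the mutual information sum of formula (4): how much more
    often parts x and y co-occur than they would if independent."""
    return p_xy * math.log(p_xy / (p_x * p_y))

# Hypothetical left-neighbor counts of a phrase; diverse neighbors mean
# high entropy, i.e. the phrase has a free boundary on that side.
left_neighbors = Counter({"the": 3, "a": 2, "this": 2})
hl = boundary_entropy(left_neighbors)
```

A phrase is kept as a candidate feature term only if HL > a_1, HR > a_2 and I > a_3, with the thresholds a_1, a_2, a_3 tuned on the corpus.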
4. The topic model optimization method based on information entropy according to claim 1, characterized in that the retraining in step 3 specifically comprises the following sub-steps:
Step 3.1 Random initialization: randomly assign a topic number z_i to each word w_i of every document in the topic corpus;
Step 3.2 Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update it in the topic corpus, where the Gibbs sampling formula is computed as shown in (6):
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \hat{\theta}_{m,k} \cdot \hat{\varphi}_{k,t} = \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} \left(n_{k,\neg i}^{(t)} + \beta_t\right)} \tag{6}$$
where the term w_i = t is assumed; z_i denotes the topic variable corresponding to the i-th word; ¬i denotes excluding the i-th item; n_{k,¬i}^{(t)} denotes the number of occurrences of term t in topic k; β_t is the Dirichlet prior of term t; n_{m,¬i}^{(k)} denotes the number of occurrences of topic k in document m; α_k is the Dirichlet prior of topic k; φ̂_{k,t} denotes the probability of term t in topic k; θ̂_{m,k} is the probability of topic k in document m; V is the total number of terms in the topic corpus, K is the total number of topics in the topic dictionary, and M is the number of documents;
Step 3.3 Repeat the sampling process of step 3.2 until Gibbs sampling converges;
here, Gibbs sampling convergence means that the probability values obtained by sampling with formula (6) approach the joint distribution of words and topics;
Step 3.4 Count the co-occurrence frequency matrix of document topics and terms in the topic corpus to obtain the topic model.
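As an illustrative sketch (not part of the claim), the per-topic sampling weights of formula (6) can be computed with numpy as below. Since the document-side denominator of formula (6) does not depend on k, it cancels under the proportionality and is omitted; the toy counts are assumptions:

```python
import numpy as np

def resample_weights(n_m, n_kt, t, alpha, beta):
    """Normalized per-topic sampling weights of formula (6) for term t in
    one document, with symmetric priors alpha and beta.
    n_m[k]     = count of topic k in the document (current word removed);
    n_kt[k][t] = count of term t in topic k (current word removed)."""
    K, V = n_kt.shape
    doc_part = n_m + alpha                                   # n_{m,!i}^{(k)} + alpha
    word_part = (n_kt[:, t] + beta) / (n_kt.sum(axis=1) + V * beta)
    p = doc_part * word_part
    return p / p.sum()                                       # normalize, then sample z_i

# Toy counts: K = 2 topics, V = 2 terms, resampling a word with term id 0.
p = resample_weights(np.array([3.0, 1.0]),
                     np.array([[2.0, 0.0], [1.0, 1.0]]),
                     t=0, alpha=0.5, beta=0.1)
```

The new topic z_i would then be drawn from this distribution, the counts updated, and the scan continued until convergence (step 3.3).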
5. The topic model optimization method based on information entropy according to claim 1, characterized in that the documents in step 4 are different from the documents in the topic corpus of steps 1 to 3: the latter are training documents, used to train the topic model; the former are prediction documents, to which the topic model is applied.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810029097.7A CN108090231A (en) | 2018-01-12 | 2018-01-12 | A kind of topic model optimization method based on comentropy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108090231A true CN108090231A (en) | 2018-05-29 |
Family
ID=62183108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810029097.7A Pending CN108090231A (en) | 2018-01-12 | 2018-01-12 | A kind of topic model optimization method based on comentropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090231A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324700A (en) * | 2013-06-08 | 2013-09-25 | 同济大学 | Noumenon concept attribute learning method based on Web information |
Non-Patent Citations (4)
Title |
---|
JULY: "An Accessible Explanation of the LDA Topic Model" (通俗理解LDA主题模型), 《HTTPS://BLOG.CSDN.NET/V_JULY_V/ARTICLE/DETAILS/41209515》 * |
LIN YULAN: "Research on Interactive Text Topic Mining Based on LDA Model, Taking Customer Service Chat Records as an Example", 《2017 INTERNATIONAL CONFERENCE ON COMPUTER TECHNOLOGY, ELECTRONICS AND COMMUNICATION (ICCTEC)》 * |
HANKCS (码农场 > Natural Language Processing): "Phrase Extraction and Identification Based on Mutual Information and Left/Right Information Entropy" (基于互信息和左右信息熵的短语提取识别), 《HTTP://WWW.HANKCS.COM/NLP/EXTRACTION-AND-IDENTIFICATION-OF-MUTUAL-INFORMATION-ABOUT-THE-PHRASE-BASED-ON-INFORMATION-ENTROPY.HTML》 * |
HUANG YONG: "Research on a Feature Dimensionality Reduction Method Combining Improved Mutual Information with LDA" (改进的互信息与LDA结合的特征降维方法研究), 《China Masters' Theses Full-text Database (Electronic Journal), Information Science and Technology Series》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271623A (en) * | 2018-08-16 | 2019-01-25 | 龙马智芯(珠海横琴)科技有限公司 | Text emotion denoising method and system |
CN109344252A (en) * | 2018-09-12 | 2019-02-15 | 东北大学 | Microblogging file classification method and system based on high-quality topic expansion |
CN109376347A (en) * | 2018-10-16 | 2019-02-22 | 北京信息科技大学 | A kind of HSK composition generation method based on topic model |
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN109919427A (en) * | 2019-01-24 | 2019-06-21 | 平安科技(深圳)有限公司 | Model subject under discussion duplicate removal appraisal procedure, server and computer readable storage medium |
WO2020199591A1 (en) * | 2019-03-29 | 2020-10-08 | 平安科技(深圳)有限公司 | Text categorization model training method, apparatus, computer device, and storage medium |
CN110347977A (en) * | 2019-06-28 | 2019-10-18 | 太原理工大学 | A kind of news automated tag method based on LDA model |
CN111507098A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN111507098B (en) * | 2020-04-17 | 2023-03-21 | 腾讯科技(深圳)有限公司 | Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium |
CN112100492A (en) * | 2020-09-11 | 2020-12-18 | 河北冀联人力资源服务集团有限公司 | Batch delivery method and system for resumes of different versions |
CN113032573A (en) * | 2021-04-30 | 2021-06-25 | 《中国学术期刊(光盘版)》电子杂志社有限公司 | Large-scale text classification method and system combining theme semantics and TF-IDF algorithm |
CN113032573B (en) * | 2021-04-30 | 2024-01-23 | 同方知网数字出版技术股份有限公司 | Large-scale text classification method and system combining topic semantics and TF-IDF algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180529 |