CN106919557A - A document vector generation method combining a topic model - Google Patents

A document vector generation method combining a topic model Download PDF

Info

Publication number
CN106919557A
CN106919557A (application CN201710096926.9A)
Authority
CN
China
Prior art keywords
document
word
topic
vector
generation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710096926.9A
Other languages
Chinese (zh)
Inventor
阳可欣 (Yang Kexin)
王美华 (Wang Meihua)
印鉴 (Yin Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Heng Electrical Information Polytron Technologies Inc
Sun Yat Sen University
Original Assignee
Guangdong Heng Electrical Information Polytron Technologies Inc
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Heng Electrical Information Polytron Technologies Inc and Sun Yat Sen University
Priority to CN201710096926.9A
Publication of CN106919557A
Pending legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/30 Semantic analysis
    • G06F16/35 Clustering; Classification
    • G06F16/374 Thesaurus
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The present invention provides a document vector generation method that combines a topic model. The method obtains a document collection and preprocesses it, then trains an LDA model on the collection to obtain the topic of each word in every document, combines each word with its topic to form a <word, topic> pair, and inputs the document collection composed of <word, topic> pairs into the Doc2vec document-vector model to train topic document vectors. By incorporating the topic information of the words in a document into the training of the document vectors, the method produces document vectors that carry topic information, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.

Description

A document vector generation method combining a topic model
Technical field
The present invention relates to the field of document processing methods, and more particularly to a document vector generation method that combines a topic model.
Background
A document vector is a representation of a document as a vector. In natural language processing tasks such as text classification and text similarity computation, the quality of the document vectors directly affects the results of the task. An effective vector representation of documents is therefore particularly important. The earliest document representation is the bag-of-words model (Bag-of-Words, BOW), which represents a document as a vector whose dimension equals the vocabulary size; the value at each position is the number of times the word corresponding to that position occurs in the document. This representation is high-dimensional and sparse, treats words as independent of one another, and ignores word order as well as syntactic and semantic information. With the development of deep learning, methods for training document vectors with neural networks have emerged. The Doc2vec document-vector model, proposed in 2014, builds on the word2vec word-vector model by adding a document feature vector during neural network training, so that document vectors are trained directly while word vectors are trained. This document vector model captures the semantic and ordering information between words, but ignores the polysemy of words: the same word expresses different meanings in different contexts, yet corresponds to a single vector in the model, so the different senses of a word cannot be expressed well, which degrades the quality of the document vectors.
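As an illustration of the sparsity and loss of word order that the background describes, a minimal bag-of-words sketch (not part of the patent; the vocabulary and sample text are invented):

```python
from collections import Counter

def bow_vector(doc_tokens, vocabulary):
    """Represent a document as a count vector over a fixed vocabulary."""
    counts = Counter(doc_tokens)
    return [counts.get(word, 0) for word in vocabulary]

vocab = ["the", "bank", "river", "credit"]
vec = bow_vector(["the", "bank", "of", "the", "river"], vocab)
# Each position counts one vocabulary word; word order is lost,
# and out-of-vocabulary words ("of") are simply dropped.
```

With a realistic vocabulary the vector has tens of thousands of dimensions, almost all zero, which is exactly the sparsity problem the Doc2vec approach addresses.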
Summary of the invention
The present invention provides a document vector generation method that combines a topic model. The method can train better document vectors, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
To achieve the above technical effect, the technical scheme of the invention is as follows:
A document vector generation method combining a topic model comprises the following steps:
S1: obtaining a document collection and preprocessing it;
S2: training an LDA model on the document collection to obtain the topic of each word in every document;
S3: combining words and their topics to form <word, topic> pairs;
S4: inputting the document collection composed of <word, topic> pairs into the Doc2vec document-vector model and training topic document vectors.
Further, the preprocessing in step S1 proceeds as follows:
The text content of each document is extracted to represent the document. All punctuation marks in the documents are removed. All low-frequency words are removed, with the threshold set to 5: a low-frequency word is a word that occurs fewer than 5 times in the whole document collection. All stop words are removed; stop words are function words without substantive meaning, including 'the', 'is', 'at', 'which' and 'on'.
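The preprocessing above can be sketched as follows. This is a minimal illustration under simplifying assumptions (whitespace tokenization, a tiny English stop-word list taken from the examples in the text), not the patent's exact procedure:

```python
import string
from collections import Counter

STOP_WORDS = {"the", "is", "at", "which", "on"}  # examples given in the text

def preprocess(documents, min_count=5):
    """Strip punctuation, then drop stop words and words occurring
    fewer than min_count times across the whole collection."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(table).split() for doc in documents]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc
             if w not in STOP_WORDS and counts[w] >= min_count]
            for doc in tokenized]
```

Note that the low-frequency threshold is applied to counts over the whole collection, as the text specifies, not per document.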
Further, the detailed procedure of step S2 is as follows:
S21: determining the number of topics k in the LDA model;
S22: randomly assigning a topic to each word of every document in the collection;
S23: scanning the whole document collection with Gibbs sampling and, for each word in every document, sampling a new topic for it;
S24: repeating the Gibbs sampling process until the model converges;
S25: obtaining the topic of each word in every document.
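Steps S21 to S25 correspond to standard collapsed Gibbs sampling for LDA. A compact sketch under that assumption (the hyperparameters, iteration count and corpus are illustrative, not values from the patent):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word via collapsed Gibbs sampling (S21-S25)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dk = defaultdict(int)   # topic counts per document
    n_kw = defaultdict(int)   # word counts per topic
    n_k = defaultdict(int)    # total words per topic
    z = [[rng.randrange(k) for _ in d] for d in docs]   # S22: random init
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            t = z[m][i]
            n_dk[m, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):                              # S23-S24: sweeps
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                t = z[m][i]                             # remove current assignment
                n_dk[m, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # p(z=t) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[m, t2] + alpha) * (n_kw[t2, w] + beta) /
                           (n_k[t2] + V * beta) for t2 in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[m][i] = t
                n_dk[m, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    return z                                            # S25: topic of each word
```

In practice convergence (S24) is judged by monitoring the log-likelihood or perplexity rather than a fixed iteration count.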
Further, in step S3 each word in every document is combined with its topic to form a <word, topic> pair. When the same word expresses different senses in different contexts, it receives different topics and therefore forms different <word, topic> pairs, which serves to distinguish the senses of polysemous words.
Further, the Doc2vec document-vector model in step S4 extends the Word2vec word-vector model by adding a document feature vector that represents the remaining information of the current document; the current word is then predicted from this remaining document information together with the words in the context window, and word vectors and document vectors are trained jointly. Inputting the document collection composed of <word, topic> pairs into Doc2vec incorporates the topic information into the training of the document vectors: the current <word, topic> pair is predicted from the remaining document information and the <word, topic> pairs in the context window, and topic-word vectors and topic document vectors are finally trained.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The method of the invention obtains a document collection and preprocesses it, trains an LDA model on the collection to obtain the topic of each word in every document, combines words and their topics to form <word, topic> pairs, and inputs the document collection composed of <word, topic> pairs into the Doc2vec document-vector model to train topic document vectors. The process incorporates the topic information of the words in a document into the training of the document vectors, so that document vectors carrying topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a diagram of the topic document vector model.
Detailed description of the embodiments
The accompanying drawings are for illustration only and shall not be construed as limiting this patent.
To better illustrate the embodiment, some parts of the drawings are omitted, enlarged or reduced, and do not represent the size of the actual product.
For those skilled in the art, it is understandable that some known structures and their descriptions may be omitted in the drawings.
The technical scheme of the invention is further described below with reference to the drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a document vector generation method combining a topic model comprises the following steps:
Step 1: obtaining a document collection and preprocessing it. The source of the collection is unrestricted and can be diverse: website news, film reviews, tweets, and so on. The text content of each document is extracted to represent the document. The collection is then preprocessed: all punctuation marks are removed; all low-frequency words are removed, with the threshold set to 5, a low-frequency word being a word that occurs fewer than 5 times in the whole collection; and all stop words are removed, stop words being function words without substantive meaning, such as 'the', 'is', 'at', 'which' and 'on'.
Step 2: training an LDA model on the document collection to obtain the topic of each word in every document:
(1) determining the number of topics K in the LDA model;
(2) randomly assigning a topic to each word of every document in the collection;
(3) computing the document-topic distribution θ and the topic-word distribution φ:

θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{k'=1}^{K} (n_m^{(k')} + α_{k'})

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t'=1}^{V} (n_k^{(t')} + β_{t'})

where θ_{m,k} is the probability that document m generates topic k, n_m^{(k)} is the number of times document m generates topic k, and α is a parameter vector; φ_{k,t} is the probability that topic k generates word t, n_k^{(t)} is the number of times topic k generates word t, and β is a parameter vector;
(4) scanning the whole document collection and, for each word in every document, sampling a new topic for it with the Gibbs sampling formula. The formula is as follows; it is proportional to the document-topic probability multiplied by the topic-word probability:

p(z_i = k | z_{¬i}, w) ∝ [(n_{m,¬i}^{(k)} + α_k) / Σ_{k'=1}^{K} (n_{m,¬i}^{(k')} + α_{k'})] · [(n_{k,¬i}^{(t)} + β_t) / Σ_{t'=1}^{V} (n_{k,¬i}^{(t')} + β_{t'})]

where ¬i denotes counts computed with the current word position i excluded;
(5) repeating the Gibbs sampling process until the model converges;
(6) finally obtaining the topic of each word in every document.
Step 3: combining each word in every document with its topic to form a <word, topic> pair. When the same word expresses different senses in different contexts, it receives different topics and therefore forms different <word, topic> pairs. For example, in the sentence "The bank of a river", bank expresses the sense of a riverbank; LDA assigns bank the topic topic1, forming the pair <bank, topic1>. In the sentence "The bank agreed further credits", bank expresses the sense of a financial institution; LDA assigns bank the topic topic2, forming the pair <bank, topic2>. After topic information is incorporated, the same word is therefore expressed as different <word, topic> pairs under different topics, which can be used to distinguish the senses of polysemous words.
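Forming the pairs from the topic assignments of the previous step is a simple element-wise combination. A minimal sketch (the tokens and topic ids are taken from the illustrative 'bank' example in the text; the `word|topic` string encoding is one possible choice, not specified by the patent):

```python
def make_pairs(doc_tokens, doc_topics):
    """Combine each word with its LDA topic into a (word, topic) token."""
    return [f"{w}|{z}" for w, z in zip(doc_tokens, doc_topics)]

river_doc = make_pairs(["bank", "river"], [1, 1])
money_doc = make_pairs(["bank", "credits"], [2, 2])
# The two occurrences of "bank" become distinct tokens ("bank|1" vs "bank|2"),
# so the downstream model can learn a separate vector for each sense.
```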
Step 4: inputting the document collection composed of <word, topic> pairs into the Doc2vec document-vector model and training topic document vectors.
As shown in Fig. 2, it is first necessary to initialize a topic document vector D_m for every document doc_m and a topic-word vector w_t for every <word, topic> pair <w_t, z_t>, and then to iteratively update the topic document vectors and topic-word vectors by maximizing the following objective function:

L = Σ_{<w,z> ∈ C} log p(<w,z> | D, context(<w,z>))

where C denotes all <word, topic> pairs in the document collection, and p(<w,z> | D, context(<w,z>)) denotes the probability of predicting the current <word, topic> pair from the remaining information of the current document and the <word, topic> pairs in the context window. The objective function maximizes the sum of these log probabilities, and the probability is computed as follows:

p(<w,z> | D, context(<w,z>)) = exp(U_{<w,z>}^T X_w) / Σ_{<w',z'>} exp(U_{<w',z'>}^T X_w)

where X_w is the weighted average of the current document vector and the topic-word vectors in the context window, and c denotes the window size. To guarantee that the probabilities sum to 1, a softmax normalization is applied here, with U as the softmax parameters.
The topic document vectors, the topic-word vectors and the other parameters of the neural network are updated by continually optimizing the above objective function, and the topic document vectors obtained from training are finally saved.
The present invention combines the LDA topic generation model and incorporates the topic information of the words in a document into the training of the document vectors, so that topic document vectors carrying topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
Identical or similar reference numerals correspond to identical or similar parts.
The positional relationships described in the drawings are for illustration only and shall not be construed as limiting this patent.
Obviously, the above embodiment of the present invention is merely an example given for clarity of illustration and is not a restriction on the embodiments of the present invention. For those of ordinary skill in the art, other changes in different forms may also be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all embodiments. Any modification, equivalent replacement and improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the invention.

Claims (5)

1. A document vector generation method combining a topic model, characterized by comprising the following steps:
S1: obtaining a document collection and preprocessing it;
S2: training an LDA model on the document collection to obtain the topic of each word in every document;
S3: combining words and their topics to form <word, topic> pairs;
S4: inputting the document collection composed of <word, topic> pairs into the Doc2vec document-vector model and training topic document vectors.
2. The document vector generation method combining a topic model according to claim 1, characterized in that the preprocessing in step S1 proceeds as follows:
the text content of each document is extracted to represent the document; all punctuation marks are removed; all low-frequency words are removed, with the threshold set to 5, a low-frequency word being a word that occurs fewer than 5 times in the whole document collection; and all stop words are removed, stop words being function words without substantive meaning, including 'the', 'is', 'at', 'which' and 'on'.
3. The document vector generation method combining a topic model according to claim 2, characterized in that the detailed procedure of step S2 is as follows:
S21: determining the number of topics k in the LDA model;
S22: randomly assigning a topic to each word of every document in the collection;
S23: scanning the whole document collection with Gibbs sampling and, for each word in every document, sampling a new topic for it;
S24: repeating the Gibbs sampling process until the model converges;
S25: obtaining the topic of each word in every document.
4. The document vector generation method combining a topic model according to claim 3, characterized in that in step S3 each word in every document is combined with its topic to form a <word, topic> pair; when the same word expresses different senses in different contexts, it receives different topics and therefore forms different <word, topic> pairs, which serves to distinguish the senses of polysemous words.
5. The document vector generation method combining a topic model according to claim 4, characterized in that the Doc2vec document-vector model in step S4 extends the Word2vec word-vector model by adding a document feature vector representing the remaining information of the current document; the current word is predicted from this remaining document information together with the words in the context window, and word vectors and document vectors are trained; the document collection composed of <word, topic> pairs is input into Doc2vec, incorporating the topic information into the training of the document vectors; the current <word, topic> pair is predicted from the remaining document information and the <word, topic> pairs in the context window; and topic-word vectors and topic document vectors are finally trained.
CN201710096926.9A 2017-02-22 2017-02-22 A document vector generation method combining a topic model Pending CN106919557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710096926.9A CN106919557A (en) 2017-02-22 2017-02-22 A document vector generation method combining a topic model


Publications (1)

Publication Number Publication Date
CN106919557A true CN106919557A (en) 2017-07-04

Family

ID=59454560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710096926.9A Pending CN106919557A (en) 2017-02-22 2017-02-22 A document vector generation method combining a topic model

Country Status (1)

Country Link
CN (1) CN106919557A (en)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
牛力强: "基于神经网络的文本向量表示与建模研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815474B (en) * 2017-11-20 2022-09-23 深圳市腾讯计算机***有限公司 Word sequence vector determination method, device, server and storage medium
CN109815474A (en) * 2017-11-20 2019-05-28 深圳市腾讯计算机***有限公司 A kind of word order column vector determines method, apparatus, server and storage medium
CN108090178B (en) * 2017-12-15 2020-08-25 北京锐安科技有限公司 Text data analysis method, text data analysis device, server and storage medium
CN108090178A (en) * 2017-12-15 2018-05-29 北京锐安科技有限公司 A kind of text data analysis method, device, server and storage medium
CN108345686A (en) * 2018-03-08 2018-07-31 广州赫炎大数据科技有限公司 A kind of data analysing method and system based on search engine technique
CN108345686B (en) * 2018-03-08 2021-12-28 广州赫炎大数据科技有限公司 Data analysis method and system based on search engine technology
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning
CN108984526B (en) * 2018-07-10 2021-05-07 北京理工大学 Document theme vector extraction method based on deep learning
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN110032642A (en) * 2019-03-26 2019-07-19 广东工业大学 The modeling method of the manifold topic model of word-based insertion
CN110032642B (en) * 2019-03-26 2022-02-11 广东工业大学 Modeling method of manifold topic model based on word embedding
WO2020253583A1 (en) * 2019-06-20 2020-12-24 首都师范大学 Written composition off-topic detection method
CN111339296A (en) * 2020-02-20 2020-06-26 电子科技大学 Document theme extraction method based on introduction of adaptive window in HDP model
CN111339296B (en) * 2020-02-20 2023-03-28 电子科技大学 Document theme extraction method based on introduction of adaptive window in HDP model
CN111353303B (en) * 2020-05-25 2020-08-25 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN111353303A (en) * 2020-05-25 2020-06-30 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN113591473A (en) * 2021-07-21 2021-11-02 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec
CN113591473B (en) * 2021-07-21 2024-03-12 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170704)