CN106919557A - A document vector generation method combining a topic model - Google Patents
A document vector generation method combining a topic model
- Publication number
- CN106919557A CN106919557A CN201710096926.9A CN201710096926A CN106919557A CN 106919557 A CN106919557 A CN 106919557A CN 201710096926 A CN201710096926 A CN 201710096926A CN 106919557 A CN106919557 A CN 106919557A
- Authority
- CN
- China
- Prior art keywords
- document
- word
- theme
- vector
- generation method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a document vector generation method combining a topic model. The method obtains a document collection and preprocesses it, then trains an LDA model on the collection to obtain the topic of each word in every document, and combines each word with its topic to form a ⟨word, topic⟩ pair. The document collection composed of ⟨word, topic⟩ pairs is input into the Doc2vec document model, which is trained to generate topic document vectors. In this way the topic information of the words in a document is incorporated into the training of the document vectors, so that document vectors containing topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
Description
Technical field
The present invention relates to the field of document processing methods, and more particularly to a document vector generation method combining a topic model.
Background technology
A document vector is a representation of a document as a vector. In natural language processing tasks such as text classification and text similarity computation, the quality of the document vectors directly affects the results of the task, so representing a document effectively as a vector is particularly important. The earliest document representation is the bag-of-words model (Bag-of-Words, BOW), which represents a document as a vector with the same dimension as the vocabulary; the value at each position is the number of times the word corresponding to that position occurs in the document. This representation is high-dimensional and sparse, words are treated as independent of one another, and word order as well as syntactic and semantic information are all ignored. With the development of deep learning, methods that train document vectors with neural networks have appeared. The Doc2vec document model, proposed in 2014, builds on the word2vec word vector model by adding a document feature vector during neural network training, so that a document vector is trained directly alongside the word vectors. This document vector model captures the semantic and word-order information between words, but ignores the polysemy problem: the same word expresses different meanings in different contexts, yet corresponds to a single vector in the model, so the different senses of a word cannot be expressed well, which affects the quality of the document vectors.
Summary of the invention
The present invention provides a document vector generation method combining a topic model. The method can train document vectors of better quality, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
To achieve the above technical effect, the technical scheme of the present invention is as follows:
A document vector generation method combining a topic model comprises the following steps:
S1: obtain a document collection and preprocess it;
S2: train an LDA model on the document collection to obtain the topic of each word in every document;
S3: combine words and their topics into ⟨word, topic⟩ pairs;
S4: input the document collection composed of ⟨word, topic⟩ pairs into the Doc2vec document model, and train it to generate topic document vectors.
Further, the preprocessing in step S1 proceeds as follows:
The text content of each document is extracted to represent that document, and all punctuation marks in the document are removed; all low-frequency words in the document are removed, where the threshold is set to 5 and a low-frequency word is a word that occurs fewer than 5 times in the whole document collection; and all stop words are removed, where stop words are function words without concrete meaning, including 'the', 'is', 'at', 'which' and 'on'.
Further, the detailed process of step S2 is as follows:
S21: determine the number of topics k in the LDA model;
S22: randomly assign a topic to each word of every document in the collection;
S23: scan the whole document collection with Gibbs sampling, and for each word in every document, sample and update its topic;
S24: repeat the Gibbs sampling process until the model converges;
S25: obtain the topic of each word in every document.
Further, in step S3 each word in every document is combined with its topic to form a ⟨word, topic⟩ pair. When the same word expresses different senses in different contexts it has different topics, and therefore forms different ⟨word, topic⟩ pairs, which can be used to distinguish the senses of polysemous words.
Further, the Doc2vec document model in step S4 adds a document feature vector on top of the Word2vec word vector model to represent the remaining information of the current document, then predicts the current word from this remaining document information together with the words in the context window, and finally trains word vectors and a document vector. Inputting the document collection composed of ⟨word, topic⟩ pairs into Doc2vec incorporates the topic information into the training of the document vectors: the current ⟨word, topic⟩ pair is predicted from the remaining information of the current document and the ⟨word, topic⟩ pairs in the context window, and topic word vectors and topic document vectors are finally trained.
Compared with the prior art, the beneficial effects of the technical scheme of the present invention are as follows:
The method of the invention obtains a document collection and preprocesses it, then trains an LDA model on the collection to obtain the topic of each word in every document, combines words and topics into ⟨word, topic⟩ pairs, and inputs the document collection composed of ⟨word, topic⟩ pairs into the Doc2vec document model to train topic document vectors. This process incorporates the topic information of the words in a document into the training of the document vectors, so that topic document vectors containing topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a diagram of the topic document vector model.
Detailed description of the embodiments
The drawings are for illustration only and cannot be construed as limiting this patent;
To better illustrate this embodiment, some parts in the drawings may be omitted, enlarged or reduced, and do not represent the size of the actual product;
For those skilled in the art, it is understandable that some known structures and their descriptions may be omitted from the drawings.
The technical scheme of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a document vector generation method combining a topic model comprises the following steps:
Step 1: obtain a document collection and preprocess it. The source of the document collection is unrestricted and can be diverse: website news, film reviews, tweets and so on. The text content of each document is extracted to represent that document. The collection is then preprocessed: all punctuation marks in the documents are removed; all low-frequency words are removed, where the threshold is set to 5 and a low-frequency word is a word that occurs fewer than 5 times in the whole collection; and all stop words are removed, where stop words are function words without concrete meaning, such as 'the', 'is', 'at', 'which' and 'on'.
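The preprocessing of Step 1 can be sketched in a few lines of Python (an illustrative standard-library sketch, not the patent's own code; the lower-casing, the regular expression and the tiny stop-word set are assumptions, since the patent only fixes the threshold of 5 and names 'the', 'is', 'at', 'which' and 'on' as example stop words):

```python
import re
from collections import Counter

# Assumed illustrative stop-word set; the patent lists these five as examples.
STOP_WORDS = {"the", "is", "at", "which", "on"}
MIN_COUNT = 5  # a low-frequency word occurs fewer than 5 times in the whole collection

def preprocess(documents):
    """Step 1: strip punctuation, drop low-frequency words, remove stop words."""
    # Replace punctuation with spaces, then tokenize on whitespace.
    tokenized = [re.sub(r"[^\w\s]", " ", doc.lower()).split() for doc in documents]
    # Count word frequency over the WHOLE collection to identify low-frequency words.
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if freq[w] >= MIN_COUNT and w not in STOP_WORDS]
            for doc in tokenized]
```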
Step 2: train an LDA model on the document collection to obtain the topic of each word in every document:
(1) Determine the number of topics K in the LDA model;
(2) Randomly assign a topic to each word of every document in the collection;
(3) Compute the document-topic distribution θ and the topic-word distribution φ, as follows:

θ_mk = (n_m^(k) + α_k) / Σ_{k'=1..K} (n_m^(k') + α_{k'}),  φ_kt = (n_k^(t) + β_t) / Σ_{t'=1..V} (n_k^(t') + β_{t'})

where θ_mk is the probability that document m generates topic k, n_m^(k) is the number of times document m generates topic k, and α is a parameter vector; φ_kt is the probability that topic k generates word t, n_k^(t) is the number of times topic k generates word t, and β is a parameter vector.
(4) Scan the whole document collection and, for each word in every document, sample and update its topic with the Gibbs sampling formula, which is proportional to the document-topic probability multiplied by the topic-word probability:

p(z_i = k | z_¬i, w) ∝ (n_m^(k) + α_k) · (n_k^(t) + β_t) / Σ_{t'=1..V} (n_k^(t') + β_{t'})

(5) Repeat the Gibbs sampling process until the model converges;
(6) The topic of each word in every document is finally obtained.
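Steps (1) to (6) describe collapsed Gibbs sampling for LDA and can be sketched as follows (an illustrative standard-library implementation; the hyperparameter values α, β, the iteration count and the random seed are assumptions, since the patent provides no code):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (Step 2).
    docs: list of token lists; K: number of topics.
    Returns z, where z[m][i] is the topic of word i of document m."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    n_mk = [[0] * K for _ in docs]               # document-topic counts n_m^(k)
    n_kt = [defaultdict(int) for _ in range(K)]  # topic-word counts n_k^(t)
    n_k = [0] * K                                # per-topic totals
    # S22: randomly assign a topic to every word, then build the count tables.
    z = [[rng.randrange(K) for _ in d] for d in docs]
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k = z[m][i]
            n_mk[m][k] += 1; n_kt[k][w] += 1; n_k[k] += 1
    # S23/S24: repeatedly sweep the collection, resampling each word's topic with
    # p(z=k) proportional to (n_mk + alpha) * (n_kt + beta) / (n_k + V*beta).
    for _ in range(iters):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[m][i]
                n_mk[m][k] -= 1; n_kt[k][w] -= 1; n_k[k] -= 1  # exclude current word
                weights = [(n_mk[m][t] + alpha) * (n_kt[t][w] + beta) / (n_k[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1; n_kt[k][w] += 1; n_k[k] += 1
    return z  # S25: the topic of each word in every document
```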
Step 3: combine each word in every document with its topic to form ⟨word, topic⟩ pairs. When the same word expresses different senses in different contexts it has different topics, and therefore forms different ⟨word, topic⟩ pairs. For example, in the sentence "The bank of a river", bank expresses the sense of a riverbank; LDA assigns bank the topic topic1, forming the pair ⟨bank, topic1⟩. In the sentence "The bank agreed further credits", bank expresses the sense of a financial institution; LDA assigns bank the topic topic2, forming the pair ⟨bank, topic2⟩. After topic information is incorporated, the same word under different topics is thus expressed as different ⟨word, topic⟩ pairs, which can be used to distinguish the senses of polysemous words.
Step 4: input the document collection composed of ⟨word, topic⟩ pairs into the Doc2vec document model, and train it to generate topic document vectors.
As shown in Fig. 2, the topic document vector D_m of every document doc_m and the topic word vector w_t of every ⟨word, topic⟩ pair ⟨w_t, z_t⟩ must first be initialized; the topic document vectors and topic word vectors are then updated iteratively by maximizing the following objective function:

L = Σ_{⟨w,z⟩ ∈ C} log p(⟨w,z⟩ | D, context(⟨w,z⟩))

where C denotes all ⟨word, topic⟩ pairs in the document collection, and p(⟨w,z⟩ | D, context(⟨w,z⟩)) denotes the probability of predicting the current ⟨word, topic⟩ pair from the remaining information of the current document and the ⟨word, topic⟩ pairs in the context window. The objective function seeks to maximize this sum of log probabilities, and the probability is computed as:

p(⟨w,z⟩ | D, context(⟨w,z⟩)) = exp(U_{⟨w,z⟩}ᵀ X_w) / Σ_{⟨w',z'⟩} exp(U_{⟨w',z'⟩}ᵀ X_w)

where X_w is a weighted average of the current document vector and the topic word vectors in the context window, and c denotes the window size. To guarantee that the probabilities sum to 1, a softmax normalization is applied, where U is the softmax parameter matrix.
By continually optimizing the above objective function, the topic document vectors, the topic word vectors and the other parameters of the neural network are updated, and the topic document vectors obtained by training are finally saved.
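The prediction probability in the objective above can be sketched as follows (an illustrative standard-library sketch; the unweighted averaging of the document vector with the context vectors is a simplifying assumption, and the gradient updates that actually train the vectors are omitted):

```python
import math

def predict_pair_probs(doc_vec, context_vecs, U):
    """Compute p(<w,z> | D, context(<w,z>)) as in Step 4: X_w averages the
    current topic document vector with the topic word vectors in the context
    window, then a softmax over the output parameters U (one row per
    candidate <word, topic> pair) yields a probability distribution."""
    dim = len(doc_vec)
    vecs = [doc_vec] + context_vecs
    # X_w: average of the document vector and the context pair vectors.
    x_w = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    # Dot product of each output row U_<w,z> with X_w.
    scores = [sum(u[j] * x_w[j] for j in range(dim)) for u in U]
    mx = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]      # softmax: probabilities sum to 1
```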
The present invention combines the LDA topic generation model to incorporate the topic information of the words in a document into the training of the document vectors, so that topic document vectors containing topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
The same or similar reference labels correspond to the same or similar parts;
The positional relationships described in the drawings are for illustration only and cannot be construed as limiting this patent;
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the invention, and is not a limitation on the embodiments of the present invention. For those of ordinary skill in the art, other changes of different forms can also be made on the basis of the above description. There is no need, and no way, to exhaustively list all embodiments. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.
Claims (5)
1. A document vector generation method combining a topic model, characterized by comprising the following steps:
S1: obtaining a document collection and preprocessing it;
S2: training an LDA model on the document collection to obtain the topic of each word in every document;
S3: combining words and their topics into ⟨word, topic⟩ pairs;
S4: inputting the document collection composed of ⟨word, topic⟩ pairs into the Doc2vec document model, and training it to generate topic document vectors.
2. The document vector generation method combining a topic model according to claim 1, characterized in that the preprocessing in step S1 proceeds as follows:
the text content of each document is extracted to represent that document, and all punctuation marks in the document are removed;
all low-frequency words in the document are removed, where the threshold is set to 5 and a low-frequency word is a word that occurs fewer than 5 times in the whole document collection;
all stop words are removed, where stop words are function words without concrete meaning, including 'the', 'is', 'at', 'which' and 'on'.
3. The document vector generation method combining a topic model according to claim 2, characterized in that the detailed process of step S2 is as follows:
S21: determining the number of topics k in the LDA model;
S22: randomly assigning a topic to each word of every document in the collection;
S23: scanning the whole document collection with Gibbs sampling, and for each word in every document, sampling and updating its topic;
S24: repeating the Gibbs sampling process until the model converges;
S25: obtaining the topic of each word in every document.
4. The document vector generation method combining a topic model according to claim 3, characterized in that in step S3 each word in every document is combined with its topic to form a ⟨word, topic⟩ pair; when the same word expresses different senses in different contexts it has different topics, and therefore forms different ⟨word, topic⟩ pairs, which are used to distinguish the senses of polysemous words.
5. The document vector generation method combining a topic model according to claim 4, characterized in that the Doc2vec document model in step S4 adds a document feature vector on top of the Word2vec word vector model to represent the remaining information of the current document, then predicts the current word from the remaining document information together with the words in the context window, and finally trains word vectors and a document vector; inputting the document collection composed of ⟨word, topic⟩ pairs into Doc2vec incorporates the topic information into the training of the document vectors: the current ⟨word, topic⟩ pair is predicted from the remaining information of the current document and the ⟨word, topic⟩ pairs in the context window, and topic word vectors and topic document vectors are finally trained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710096926.9A CN106919557A (en) | 2017-02-22 | 2017-02-22 | A kind of document vector generation method of combination topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710096926.9A CN106919557A (en) | 2017-02-22 | 2017-02-22 | A kind of document vector generation method of combination topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106919557A true CN106919557A (en) | 2017-07-04 |
Family
ID=59454560
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710096926.9A Pending CN106919557A (en) | 2017-02-22 | 2017-02-22 | A kind of document vector generation method of combination topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106919557A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090178A (en) * | 2017-12-15 | 2018-05-29 | 北京锐安科技有限公司 | A kind of text data analysis method, device, server and storage medium |
CN108345686A (en) * | 2018-03-08 | 2018-07-31 | 广州赫炎大数据科技有限公司 | A kind of data analysing method and system based on search engine technique |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | Based on RNN, the news recommended method of attention mechanism and theme characterizing method |
CN109815474A (en) * | 2017-11-20 | 2019-05-28 | 深圳市腾讯计算机***有限公司 | A kind of word order column vector determines method, apparatus, server and storage medium |
CN110032642A (en) * | 2019-03-26 | 2019-07-19 | 广东工业大学 | The modeling method of the manifold topic model of word-based insertion |
CN111339296A (en) * | 2020-02-20 | 2020-06-26 | 电子科技大学 | Document theme extraction method based on introduction of adaptive window in HDP model |
CN111353303A (en) * | 2020-05-25 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Word vector construction method and device, electronic equipment and storage medium |
WO2020253583A1 (en) * | 2019-06-20 | 2020-12-24 | 首都师范大学 | Written composition off-topic detection method |
CN113591473A (en) * | 2021-07-21 | 2021-11-02 | 西北工业大学 | Text similarity calculation method based on BTM topic model and Doc2vec |
Non-Patent Citations (1)
Title |
---|
牛力强 (Niu Liqiang): "基于神经网络的文本向量表示与建模研究" (Research on text vector representation and modeling based on neural networks), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815474B (en) * | 2017-11-20 | 2022-09-23 | 深圳市腾讯计算机***有限公司 | Word sequence vector determination method, device, server and storage medium |
CN109815474A (en) * | 2017-11-20 | 2019-05-28 | 深圳市腾讯计算机***有限公司 | A kind of word order column vector determines method, apparatus, server and storage medium |
CN108090178B (en) * | 2017-12-15 | 2020-08-25 | 北京锐安科技有限公司 | Text data analysis method, text data analysis device, server and storage medium |
CN108090178A (en) * | 2017-12-15 | 2018-05-29 | 北京锐安科技有限公司 | A kind of text data analysis method, device, server and storage medium |
CN108345686A (en) * | 2018-03-08 | 2018-07-31 | 广州赫炎大数据科技有限公司 | A kind of data analysing method and system based on search engine technique |
CN108345686B (en) * | 2018-03-08 | 2021-12-28 | 广州赫炎大数据科技有限公司 | Data analysis method and system based on search engine technology |
CN108984526A (en) * | 2018-07-10 | 2018-12-11 | 北京理工大学 | A kind of document subject matter vector abstracting method based on deep learning |
CN108984526B (en) * | 2018-07-10 | 2021-05-07 | 北京理工大学 | Document theme vector extraction method based on deep learning |
CN109492157B (en) * | 2018-10-24 | 2021-08-31 | 华侨大学 | News recommendation method and theme characterization method based on RNN and attention mechanism |
CN109492157A (en) * | 2018-10-24 | 2019-03-19 | 华侨大学 | Based on RNN, the news recommended method of attention mechanism and theme characterizing method |
CN110032642A (en) * | 2019-03-26 | 2019-07-19 | 广东工业大学 | The modeling method of the manifold topic model of word-based insertion |
CN110032642B (en) * | 2019-03-26 | 2022-02-11 | 广东工业大学 | Modeling method of manifold topic model based on word embedding |
WO2020253583A1 (en) * | 2019-06-20 | 2020-12-24 | 首都师范大学 | Written composition off-topic detection method |
CN111339296A (en) * | 2020-02-20 | 2020-06-26 | 电子科技大学 | Document theme extraction method based on introduction of adaptive window in HDP model |
CN111339296B (en) * | 2020-02-20 | 2023-03-28 | 电子科技大学 | Document theme extraction method based on introduction of adaptive window in HDP model |
CN111353303B (en) * | 2020-05-25 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Word vector construction method and device, electronic equipment and storage medium |
CN111353303A (en) * | 2020-05-25 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Word vector construction method and device, electronic equipment and storage medium |
CN113591473A (en) * | 2021-07-21 | 2021-11-02 | 西北工业大学 | Text similarity calculation method based on BTM topic model and Doc2vec |
CN113591473B (en) * | 2021-07-21 | 2024-03-12 | 西北工业大学 | Text similarity calculation method based on BTM topic model and Doc2vec |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106919557A (en) | A kind of document vector generation method of combination topic model | |
CN108073677B (en) | Multi-level text multi-label classification method and system based on artificial intelligence | |
Dos Santos et al. | Deep convolutional neural networks for sentiment analysis of short texts | |
CN109359297B (en) | Relationship extraction method and system | |
CN106776713A (en) | It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis | |
CN111125367B (en) | Multi-character relation extraction method based on multi-level attention mechanism | |
CN109635288A (en) | A kind of resume abstracting method based on deep neural network | |
CN111914091A (en) | Entity and relation combined extraction method based on reinforcement learning | |
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method | |
CN108710611A (en) | A kind of short text topic model generation method of word-based network and term vector | |
CN112395417A (en) | Network public opinion evolution simulation method and system based on deep learning | |
CN111967267B (en) | XLNET-based news text region extraction method and system | |
CN105975497A (en) | Automatic microblog topic recommendation method and device | |
CN116629275A (en) | Intelligent decision support system and method based on big data | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics | |
CN111125370A (en) | Relation extraction method suitable for small samples | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN111126037B (en) | Thai sentence segmentation method based on twin cyclic neural network | |
CN106610949A (en) | Text feature extraction method based on semantic analysis | |
CN104123336B (en) | Depth Boltzmann machine model and short text subject classification system and method | |
CN107832307B (en) | Chinese word segmentation method based on undirected graph and single-layer neural network | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN113095063A (en) | Two-stage emotion migration method and system based on masking language model | |
CN112131879A (en) | Relationship extraction system, method and device | |
CN110597982A (en) | Short text topic clustering algorithm based on word co-occurrence network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170704 |