CN106919557A - A document vector generation method combining a topic model - Google Patents

A document vector generation method combining a topic model Download PDF

Info

Publication number
CN106919557A
CN106919557A (application CN201710096926.9A)
Authority
CN
China
Prior art keywords
document
word
topic
vector
generation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710096926.9A
Other languages
Chinese (zh)
Inventor
阳可欣 (Yang Kexin)
王美华 (Wang Meihua)
印鉴 (Yin Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Heng Electrical Information Polytron Technologies Inc
Sun Yat Sen University
Original Assignee
Guangdong Heng Electrical Information Polytron Technologies Inc
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Heng Electrical Information Polytron Technologies Inc and Sun Yat Sen University
Priority to CN201710096926.9A
Publication of CN106919557A
Pending legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/30 Semantic analysis
    • G06F16/35 Clustering; Classification
    • G06F16/374 Thesaurus
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The present invention provides a document vector generation method that combines a topic model. The method obtains a document collection and preprocesses it, then trains an LDA model on the collection to obtain the topic of each word in every document, combines each word with its topic to form a <word, topic> pair, and inputs the document collection composed of <word, topic> pairs into the Doc2vec document-vector model to train topic document vectors. By incorporating the topic information of the words in a document into the training of the document vectors, the method produces document vectors that carry topic information, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.

Description

A document vector generation method combining a topic model
Technical field
The present invention relates to the field of document processing methods, and more particularly to a document vector generation method that combines a topic model.
Background
A document vector is a representation of a document as a vector. In natural language processing tasks such as text classification and text similarity computation, the quality of the document vectors directly affects the results of the task. An effective vector representation of documents is therefore particularly important. The earliest document representation is the bag-of-words model (Bag-of-Words, BOW), which represents a document as a vector whose dimension equals the vocabulary size; the value at each position is the number of times the word corresponding to that position occurs in the document. This representation is high-dimensional and sparse, treats words as independent of one another, and ignores word order as well as syntactic and semantic information. With the development of deep learning, methods for training document vectors with neural networks have emerged. The Doc2vec document-vector model, proposed in 2014, builds on the word2vec word-vector model by adding a document feature vector during neural network training, so that document vectors are trained directly while word vectors are trained. This document vector model captures the semantic and ordering information between words, but ignores the polysemy of words: the same word expresses different meanings in different contexts, yet corresponds to a single vector in the model, so the different senses of a word cannot be expressed well, which degrades the quality of the document vectors.
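As an illustration of the sparsity and loss of word order that the background describes, a minimal bag-of-words sketch (not part of the patent; the vocabulary and sample text are invented):

```python
from collections import Counter

def bow_vector(doc_tokens, vocabulary):
    """Represent a document as a count vector over a fixed vocabulary."""
    counts = Counter(doc_tokens)
    return [counts.get(word, 0) for word in vocabulary]

vocab = ["the", "bank", "river", "credit"]
vec = bow_vector(["the", "bank", "of", "the", "river"], vocab)
# Each position counts one vocabulary word; word order is lost,
# and out-of-vocabulary words ("of") are simply dropped.
```

With a realistic vocabulary the vector has tens of thousands of dimensions, almost all zero, which is exactly the sparsity problem the Doc2vec approach addresses.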
Summary of the invention
The present invention provides a document vector generation method that combines a topic model. The method can train better document vectors, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
To achieve the above technical effect, the technical scheme of the invention is as follows:
A document vector generation method combining a topic model comprises the following steps:
S1: obtaining a document collection and preprocessing it;
S2: training an LDA model on the document collection to obtain the topic of each word in every document;
S3: combining words and their topics to form <word, topic> pairs;
S4: inputting the document collection composed of <word, topic> pairs into the Doc2vec document-vector model and training topic document vectors.
Further, the preprocessing in step S1 proceeds as follows:
The text content of each document is extracted to represent the document. All punctuation marks in the documents are removed. All low-frequency words are removed, with the threshold set to 5: a low-frequency word is a word that occurs fewer than 5 times in the whole document collection. All stop words are removed; stop words are function words without substantive meaning, including 'the', 'is', 'at', 'which' and 'on'.
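The preprocessing above can be sketched as follows. This is a minimal illustration under simplifying assumptions (whitespace tokenization, a tiny English stop-word list taken from the examples in the text), not the patent's exact procedure:

```python
import string
from collections import Counter

STOP_WORDS = {"the", "is", "at", "which", "on"}  # examples given in the text

def preprocess(documents, min_count=5):
    """Strip punctuation, then drop stop words and words occurring
    fewer than min_count times across the whole collection."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(table).split() for doc in documents]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc
             if w not in STOP_WORDS and counts[w] >= min_count]
            for doc in tokenized]
```

Note that the low-frequency threshold is applied to counts over the whole collection, as the text specifies, not per document.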
Further, the detailed procedure of step S2 is as follows:
S21: determining the number of topics k in the LDA model;
S22: randomly assigning a topic to each word of every document in the collection;
S23: scanning the whole document collection with Gibbs sampling and, for each word in every document, sampling a new topic for it;
S24: repeating the Gibbs sampling process until the model converges;
S25: obtaining the topic of each word in every document.
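Steps S21 to S25 correspond to standard collapsed Gibbs sampling for LDA. A compact sketch under that assumption (the hyperparameters, iteration count and corpus are illustrative, not values from the patent):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word via collapsed Gibbs sampling (S21-S25)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dk = defaultdict(int)   # topic counts per document
    n_kw = defaultdict(int)   # word counts per topic
    n_k = defaultdict(int)    # total words per topic
    z = [[rng.randrange(k) for _ in d] for d in docs]   # S22: random init
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            t = z[m][i]
            n_dk[m, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):                              # S23-S24: sweeps
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                t = z[m][i]                             # remove current assignment
                n_dk[m, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # p(z=t) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[m, t2] + alpha) * (n_kw[t2, w] + beta) /
                           (n_k[t2] + V * beta) for t2 in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[m][i] = t
                n_dk[m, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    return z                                            # S25: topic of each word
```

In practice convergence (S24) is judged by monitoring the log-likelihood or perplexity rather than a fixed iteration count.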
Further, in step S3 each word in every document is combined with its topic to form a <word, topic> pair. When the same word expresses different senses in different contexts, it receives different topics and therefore forms different <word, topic> pairs, which serves to distinguish the senses of polysemous words.
Further, the Doc2vec document-vector model in step S4 extends the Word2vec word-vector model by adding a document feature vector that represents the remaining information of the current document; the current word is then predicted from this remaining document information together with the words in the context window, and word vectors and document vectors are trained jointly. Inputting the document collection composed of <word, topic> pairs into Doc2vec incorporates the topic information into the training of the document vectors: the current <word, topic> pair is predicted from the remaining document information and the <word, topic> pairs in the context window, and topic-word vectors and topic document vectors are finally trained.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The method of the invention obtains a document collection and preprocesses it, trains an LDA model on the collection to obtain the topic of each word in every document, combines words and their topics to form <word, topic> pairs, and inputs the document collection composed of <word, topic> pairs into the Doc2vec document-vector model to train topic document vectors. The process incorporates the topic information of the words in a document into the training of the document vectors, so that document vectors carrying topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a diagram of the topic document vector model.
Detailed description of the embodiments
The accompanying drawings are for illustration only and shall not be construed as limiting this patent.
To better illustrate the embodiment, some parts of the drawings are omitted, enlarged or reduced, and do not represent the size of the actual product.
For those skilled in the art, it is understandable that some known structures and their descriptions may be omitted in the drawings.
The technical scheme of the invention is further described below with reference to the drawings and embodiments.
Embodiment 1
As shown in Fig. 1, a document vector generation method combining a topic model comprises the following steps:
Step 1: obtaining a document collection and preprocessing it. The source of the collection is unrestricted and can be diverse: website news, film reviews, tweets, and so on. The text content of each document is extracted to represent the document. The collection is then preprocessed: all punctuation marks are removed; all low-frequency words are removed, with the threshold set to 5, a low-frequency word being a word that occurs fewer than 5 times in the whole collection; and all stop words are removed, stop words being function words without substantive meaning, such as 'the', 'is', 'at', 'which' and 'on'.
Step 2: training an LDA model on the document collection to obtain the topic of each word in every document:
(1) determining the number of topics K in the LDA model;
(2) randomly assigning a topic to each word of every document in the collection;
(3) computing the document-topic distribution θ and the topic-word distribution φ:

θ_{m,k} = (n_m^{(k)} + α_k) / Σ_{k'=1}^{K} (n_m^{(k')} + α_{k'})

φ_{k,t} = (n_k^{(t)} + β_t) / Σ_{t'=1}^{V} (n_k^{(t')} + β_{t'})

where θ_{m,k} is the probability that document m generates topic k, n_m^{(k)} is the number of times document m generates topic k, and α is a parameter vector; φ_{k,t} is the probability that topic k generates word t, n_k^{(t)} is the number of times topic k generates word t, and β is a parameter vector;
(4) scanning the whole document collection and, for each word in every document, sampling a new topic for it with the Gibbs sampling formula. The formula is as follows; it is proportional to the document-topic probability multiplied by the topic-word probability:

p(z_i = k | z_{¬i}, w) ∝ [(n_{m,¬i}^{(k)} + α_k) / Σ_{k'=1}^{K} (n_{m,¬i}^{(k')} + α_{k'})] · [(n_{k,¬i}^{(t)} + β_t) / Σ_{t'=1}^{V} (n_{k,¬i}^{(t')} + β_{t'})]

where ¬i denotes counts computed with the current word position i excluded;
(5) repeating the Gibbs sampling process until the model converges;
(6) finally obtaining the topic of each word in every document.
Step 3: combining each word in every document with its topic to form a <word, topic> pair. When the same word expresses different senses in different contexts, it receives different topics and therefore forms different <word, topic> pairs. For example, in the sentence "The bank of a river", bank expresses the sense of a riverbank; LDA assigns bank the topic topic1, forming the pair <bank, topic1>. In the sentence "The bank agreed further credits", bank expresses the sense of a financial institution; LDA assigns bank the topic topic2, forming the pair <bank, topic2>. After topic information is incorporated, the same word is therefore expressed as different <word, topic> pairs under different topics, which can be used to distinguish the senses of polysemous words.
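Forming the pairs from the topic assignments of the previous step is a simple element-wise combination. A minimal sketch (the tokens and topic ids are taken from the illustrative 'bank' example in the text; the `word|topic` string encoding is one possible choice, not specified by the patent):

```python
def make_pairs(doc_tokens, doc_topics):
    """Combine each word with its LDA topic into a (word, topic) token."""
    return [f"{w}|{z}" for w, z in zip(doc_tokens, doc_topics)]

river_doc = make_pairs(["bank", "river"], [1, 1])
money_doc = make_pairs(["bank", "credits"], [2, 2])
# The two occurrences of "bank" become distinct tokens ("bank|1" vs "bank|2"),
# so the downstream model can learn a separate vector for each sense.
```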
Step 4: inputting the document collection composed of <word, topic> pairs into the Doc2vec document-vector model and training topic document vectors.
As shown in Fig. 2, it is first necessary to initialize a topic document vector D_m for every document doc_m and a topic-word vector w_t for every <word, topic> pair <w_t, z_t>, and then to iteratively update the topic document vectors and topic-word vectors by maximizing the following objective function:

L = Σ_{<w,z> ∈ C} log p(<w,z> | D, context(<w,z>))

where C denotes all <word, topic> pairs in the document collection, and p(<w,z> | D, context(<w,z>)) denotes the probability of predicting the current <word, topic> pair from the remaining information of the current document and the <word, topic> pairs in the context window. The objective function maximizes the sum of these log probabilities, and the probability is computed as follows:

p(<w,z> | D, context(<w,z>)) = exp(U_{<w,z>}^T X_w) / Σ_{<w',z'>} exp(U_{<w',z'>}^T X_w)

where X_w is the weighted average of the current document vector and the topic-word vectors in the context window, and c denotes the window size. To guarantee that the probabilities sum to 1, a softmax normalization is applied here, with U as the softmax parameters.
The topic document vectors, the topic-word vectors and the other parameters of the neural network are updated by continually optimizing the above objective function, and the topic document vectors obtained from training are finally saved.
The present invention combines the LDA topic generation model and incorporates the topic information of the words in a document into the training of the document vectors, so that topic document vectors carrying topic information can be trained, thereby improving the accuracy of natural language processing tasks such as text classification and text similarity computation.
Identical or similar reference numerals correspond to identical or similar parts.
The positional relationships described in the drawings are for illustration only and shall not be construed as limiting this patent.
Obviously, the above embodiment of the present invention is merely an example given for clarity of illustration and is not a restriction on the embodiments of the present invention. For those of ordinary skill in the art, other changes in different forms may also be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all embodiments. Any modification, equivalent replacement and improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the invention.

Claims (5)

1. A document vector generation method combining a topic model, characterized by comprising the following steps:
S1: obtaining a document collection and preprocessing it;
S2: training an LDA model on the document collection to obtain the topic of each word in every document;
S3: combining words and their topics to form <word, topic> pairs;
S4: inputting the document collection composed of <word, topic> pairs into the Doc2vec document-vector model and training topic document vectors.
2. The document vector generation method combining a topic model according to claim 1, characterized in that the preprocessing in step S1 proceeds as follows:
the text content of each document is extracted to represent the document; all punctuation marks are removed; all low-frequency words are removed, with the threshold set to 5, a low-frequency word being a word that occurs fewer than 5 times in the whole document collection; and all stop words are removed, stop words being function words without substantive meaning, including 'the', 'is', 'at', 'which' and 'on'.
3. The document vector generation method combining a topic model according to claim 2, characterized in that the detailed procedure of step S2 is as follows:
S21: determining the number of topics k in the LDA model;
S22: randomly assigning a topic to each word of every document in the collection;
S23: scanning the whole document collection with Gibbs sampling and, for each word in every document, sampling a new topic for it;
S24: repeating the Gibbs sampling process until the model converges;
S25: obtaining the topic of each word in every document.
4. The document vector generation method combining a topic model according to claim 3, characterized in that in step S3 each word in every document is combined with its topic to form a <word, topic> pair; when the same word expresses different senses in different contexts, it receives different topics and therefore forms different <word, topic> pairs, which serves to distinguish the senses of polysemous words.
5. The document vector generation method combining a topic model according to claim 4, characterized in that the Doc2vec document-vector model in step S4 extends the Word2vec word-vector model by adding a document feature vector representing the remaining information of the current document; the current word is predicted from this remaining document information together with the words in the context window, and word vectors and document vectors are trained; the document collection composed of <word, topic> pairs is input into Doc2vec, incorporating the topic information into the training of the document vectors; the current <word, topic> pair is predicted from the remaining document information and the <word, topic> pairs in the context window; and topic-word vectors and topic document vectors are finally trained.
CN201710096926.9A 2017-02-22 2017-02-22 A document vector generation method combining a topic model Pending CN106919557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710096926.9A CN106919557A (en) 2017-02-22 2017-02-22 A document vector generation method combining a topic model


Publications (1)

Publication Number Publication Date
CN106919557A true CN106919557A (en) 2017-07-04

Family

ID=59454560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710096926.9A Pending CN106919557A (en) 2017-02-22 2017-02-22 A document vector generation method combining a topic model

Country Status (1)

Country Link
CN (1) CN106919557A (en)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
牛力强: "基于神经网络的文本向量表示与建模研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815474B (en) * 2017-11-20 2022-09-23 深圳市腾讯计算机***有限公司 Word sequence vector determination method, device, server and storage medium
CN109815474A (en) * 2017-11-20 2019-05-28 深圳市腾讯计算机***有限公司 A kind of word order column vector determines method, apparatus, server and storage medium
CN108090178B (en) * 2017-12-15 2020-08-25 北京锐安科技有限公司 Text data analysis method, text data analysis device, server and storage medium
CN108090178A (en) * 2017-12-15 2018-05-29 北京锐安科技有限公司 A kind of text data analysis method, device, server and storage medium
CN108345686A (en) * 2018-03-08 2018-07-31 广州赫炎大数据科技有限公司 A kind of data analysing method and system based on search engine technique
CN108345686B (en) * 2018-03-08 2021-12-28 广州赫炎大数据科技有限公司 Data analysis method and system based on search engine technology
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning
CN108984526B (en) * 2018-07-10 2021-05-07 北京理工大学 Document theme vector extraction method based on deep learning
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN109492157A (en) * 2018-10-24 2019-03-19 华侨大学 Based on RNN, the news recommended method of attention mechanism and theme characterizing method
CN110032642A (en) * 2019-03-26 2019-07-19 广东工业大学 The modeling method of the manifold topic model of word-based insertion
CN110032642B (en) * 2019-03-26 2022-02-11 广东工业大学 Modeling method of manifold topic model based on word embedding
WO2020253583A1 (en) * 2019-06-20 2020-12-24 首都师范大学 Written composition off-topic detection method
CN111339296A (en) * 2020-02-20 2020-06-26 电子科技大学 Document theme extraction method based on introduction of adaptive window in HDP model
CN111339296B (en) * 2020-02-20 2023-03-28 电子科技大学 Document theme extraction method based on introduction of adaptive window in HDP model
CN111353303B (en) * 2020-05-25 2020-08-25 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN111353303A (en) * 2020-05-25 2020-06-30 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN113591473A (en) * 2021-07-21 2021-11-02 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec
CN113591473B (en) * 2021-07-21 2024-03-12 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170704)