CN108984526B - Document theme vector extraction method based on deep learning - Google Patents


Info

Publication number
CN108984526B
CN108984526B (application CN201810748564.1A)
Authority
CN
China
Prior art keywords
vector
word
representing
time
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810748564.1A
Other languages
Chinese (zh)
Other versions
CN108984526A (en)
Inventor
高扬 (Gao Yang)
黄河燕 (Huang Heyan)
陆池 (Lu Chi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810748564.1A
Publication of CN108984526A
Application granted
Publication of CN108984526B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a document theme vector extraction method based on deep learning, and belongs to the technical field of natural language processing. The method extracts local deep semantic information with a convolutional neural network and learns time-sequence information with an LSTM model, making the vector semantics more comprehensive; it exploits the implicit co-occurrence relation between context phrases and document themes, avoiding the weakness of some sentence-based theme vector models on short texts; and it organically combines the CNN and LSTM models through an attention mechanism, learning the deep semantics, time-sequence information, and salient information of contexts, thereby constructing a more effective model for extracting document theme vectors.

Description

Document theme vector extraction method based on deep learning
Technical Field
The invention relates to a document theme vector extraction method based on deep learning, and belongs to the technical field of natural language processing.
Background
In today's big data era, how to discover the topics of massive internet text data is a research focus. Analyzing the theme of text data yields the document theme vector, which is essentially a deep semantic representation of the document and an intrinsic combination of theme and semantics. Extracted document theme vectors can be widely applied in natural language processing tasks, including public opinion analysis of social networks and new media, timely acquisition of news hotspots, and the like. Therefore, how to extract document theme vectors efficiently is an important research topic.
For text data, the theme is not necessarily embodied directly in the specific text content, which makes the theme implied by a text difficult to mine: the theme meaning of a document must be extracted from the relationships among its words, sentences, and paragraphs, combined with the discourse structure of the document. With the enrichment of statistical natural language processing methods and corpora in recent years, text topic modeling methods based on word-topic and document-topic distributions have been proposed in succession. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution, to estimate the probability distribution of document topics by training on document data, and then to cluster documents according to their topics.
To correctly analyze the topic of each document, the conventional approach performs topic analysis on every word of the text, but this has a major problem: the words that really determine the text topic account for only a small fraction of all words, so the conventional method spends a large amount of analysis on topic-irrelevant words. On one hand, these irrelevant words inflate the computational cost; on the other hand, the text topic is extracted inaccurately, and the deep semantics of the text cannot be mined in combination with its internal associations.
With the improvement of hardware performance and the continuous expansion of data scale, deep learning has been widely applied in various fields, and experimental results have improved greatly over previous baselines. Owing to characteristics such as elegant models and flexible architectures, deep learning has in recent years been widely applied to methods combining word embedding and document embedding. Among all deep learning methods, CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models are the two most popular. In natural language processing tasks, text analysis based on CNN and LSTM models can discover the latent semantic features of text well, greatly helping semantic analysis in tasks such as automatic summarization, sentiment analysis, and machine translation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, solve the problem of mining the deep semantics of a text in combination with its internal associations, and provide a document theme vector extraction method based on deep learning. The invention focuses on modeling and analyzing the document theme vector, mining the implicit association between text features and the theme vector, and thereby learning the document theme vector.
The core idea of the invention is as follows: extract the semantics of the context phrase with the CNN and input them into the LSTM model; use an attention mechanism to weigh the importance of words at different positions and with different meanings in the text, thereby retaining important information; complete the organic combination of the CNN and LSTM models, mine the internal associations between contexts, and learn document theme vectors with deep semantics and saliency.
The method of the invention is realized by the following technical scheme.
A document theme vector extraction method based on deep learning comprises the following steps:
step one, performing relevant definition, specifically as follows:
definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
definition 2: predicted word w_{d+1}, denoting the target word to be learned;
definition 3: window words, several words appearing consecutively in the text, between which hidden internal associations exist;
definition 4: context phrase, the window words appearing before the position of the predicted word; with window length l, the context phrase is denoted w_{d-l}, w_{d-l+1}, ..., w_d;
Definition 5: document theme mapping matrix, learned by the LDA algorithm (Latent Dirichlet Allocation), each row representing the theme of one document;
definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step two, learn the semantic vector of the context phrase using the CNN.
Step three, learn the semantics of the context phrase using an LSTM model, obtaining the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
Step four, organically combine the CNN and LSTM models through an attention mechanism, obtaining the mean h̄ of the context phrase semantic vectors.
Step five, predict the target word w_{d+1} from the mean h̄ of the context phrase semantic vectors and the document topic information by logistic regression, obtaining the prediction probability of the target word w_{d+1}.
Advantageous effects
Compared with the prior art, the document theme vector extraction method based on deep learning has the following beneficial effects:
1. extracting local deep semantic information by using CNN;
2. the LSTM model is utilized to learn out the time sequence information, so that the vector semantics are more comprehensive;
3. the implicit co-occurrence relation between context phrases and document themes is exploited, avoiding the weakness of some sentence-based theme vector models on short texts;
4. the CNN model and the LSTM model are organically combined by utilizing an attention mechanism, deep semantics, time sequence information and significant information of context are learned, and a model for extracting document theme vectors is more effectively constructed.
Drawings
FIG. 1 is a flowchart of a document theme vector extraction method based on deep learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the method of the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
A document theme vector extraction method based on deep learning is implemented in the following basic steps:
step one, performing relevant definition, specifically as follows:
definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
definition 2: predicted word w_{d+1}, denoting the target word to be learned;
definition 3: window words, several words appearing consecutively in the text, between which hidden internal associations exist;
definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words appearing before the position of the predicted word, the length of the context phrase being l;
definition 5: document theme mapping matrix, learned by the LDA algorithm, each row representing the theme of one document;
definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step two, learn the semantic vector Context of the context phrase using the CNN.
The specific implementation process is as follows:
step 2.1, train the word vector matrix of document D using an algorithm such as word2vec; the matrix size is n × m, where n denotes the length of the word vector matrix and m its width;
step 2.2, extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
step 2.3, calculate the semantic vector Context of the context phrase using the CNN. Specifically, the vector matrix M obtained in step 2.2 is convolved with K convolution kernels of size C_l × C_m;
where K denotes the number of convolution kernels (K = 128 in this embodiment), C_l denotes the length of each kernel with C_l = l, and C_m denotes the width of each kernel with C_m = m;
The semantic vector Context of the context phrase is calculated by formula (1):
Context_k = Σ_{p=1..l} Σ_{q=1..m} c_pq · M_pq + b, 1 ≤ k ≤ K (1)
Context = [Context_1, Context_2, ..., Context_K]
where Context_k denotes the k-th dimension of the semantic vector of the context phrase, l the context phrase length, m the width of the word vector matrix (i.e. the word vector dimension), d the starting position of the first word in the context phrase, c_pq the weight parameter in row p, column q of the k-th convolution kernel, M_pq the data in row p, column q of the vector matrix M, and b the bias parameter of the convolution kernel;
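A minimal numpy sketch of formula (1) follows, assuming each of the K kernels spans the entire l × m matrix M (since C_l = l and C_m = m), so that each kernel yields one scalar dimension of Context; the sizes and random initialization are illustrative only:

```python
import numpy as np

l, m, K = 4, 128, 128                      # window length, word-vector dim, kernel count
rng = np.random.default_rng(0)
M = rng.normal(size=(l, m))                # vector matrix M of the context phrase
C = rng.normal(0, 0.01, size=(K, l, m))   # K convolution kernels with weights c_pq
b = np.zeros(K)                            # bias parameter b per kernel

# Context_k = sum_p sum_q c_pq * M_pq + b   (formula (1))
context = np.array([np.sum(C[k] * M) + b[k] for k in range(K)])
print(context.shape)   # (128,) -> semantic vector Context of the context phrase
```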
Step three, learn the semantics of the context phrase using the LSTM model, obtaining the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
The specific implementation process is as follows:
step 3.1, assign d - l to t, i.e. t = d - l, where t denotes the t-th time step;
step 3.2, assign the word vector of w_t to x_t, where x_t denotes the word vector input at time t and w_t the word input at time t;
here the word vector of w_t is obtained by lookup in the word vector matrix output in step 2.1, i.e. by extracting the word vector at the position of w_t in the vector matrix M;
step 3.3, take x_t as the input of the LSTM model, obtaining the hidden layer vector h_t at time t;
The specific implementation process is as follows:
step 3.3.1, compute the forget gate f_t at time t, which controls what information is forgotten, by formula (2):
f_t = σ(W_f x_t + U_f h_{t-1} + b_f) (2)
where W_f and U_f denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_f a bias vector parameter; when t = d - l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is a zero vector; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.2, compute the input gate i_t at time t, which controls what new information is added at the current time, by formula (3):
i_t = σ(W_i x_t + U_i h_{t-1} + b_i) (3)
where W_i and U_i denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_i a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.3, compute the candidate update information c̃_t at time t by formula (4):
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c) (4)
where W_c and U_c denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_c a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
step 3.3.4, compute the information at time t by adding the retained information of the previous time and the update of the current time, by formula (5):
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (5)
where c_t denotes the information at time t, f_t the forget gate at time t, c_{t-1} the information at time t-1, i_t the input gate at time t, c̃_t the candidate update at time t, and ∘ the element-wise (Hadamard) product of vectors;
step 3.3.5, compute the output gate o_t at time t, which controls the output information, by formula (6):
o_t = σ(W_o x_t + U_o h_{t-1} + b_o) (6)
where W_o and U_o denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_o a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model. The parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o of steps 3.3.1-3.3.3 and 3.3.5 have different matrix elements, and the elements of the bias vector parameters b_f, b_i, b_c, b_o also differ;
step 3.3.6, compute the hidden layer vector h_t at time t by formula (7):
h_t = o_t ∘ c_t (7)
where o_t denotes the output gate at time t and c_t the information at time t;
step 3.4, judge whether t equals d; if not, increment t by 1 and jump to step 3.2; if so, output the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and jump to step four;
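A minimal numpy sketch of step three follows, implementing formulas (2)-(7) exactly as written in the patent (note that formula (7) uses h_t = o_t ∘ c_t, without the tanh of textbook LSTM formulations); shapes, initialization, and the stand-in word vectors are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, h_dim = 128, 128                        # word-vector dim and hidden dim (illustrative)
rng = np.random.default_rng(0)
W = {g: rng.normal(0, 0.01, (h_dim, m)) for g in "fico"}      # W_f, W_i, W_c, W_o
U = {g: rng.normal(0, 0.01, (h_dim, h_dim)) for g in "fico"}  # U_f, U_i, U_c, U_o
b = {g: np.zeros(h_dim) for g in "fico"}                      # b_f, b_i, b_c, b_o

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate, formula (2)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate, formula (3)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate update, formula (4)
    c_t = f_t * c_prev + i_t * c_tilde                          # cell information, formula (5)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate, formula (6)
    h_t = o_t * c_t                                             # hidden vector, formula (7) as written
    return h_t, c_t

# Feed the context phrase word vectors x_{d-l}..x_d in sequence; h and c start at zero.
xs = [rng.normal(size=m) for _ in range(4)]   # stand-ins for the word vectors
h, c = np.zeros(h_dim), np.zeros(h_dim)
hidden = []
for x in xs:
    h, c = lstm_step(x, h, c)
    hidden.append(h)                          # h_{d-l}, ..., h_d
```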
Step four, combine the CNN and LSTM models using an attention mechanism to obtain the mean h̄ of the context phrase semantic vectors.
The specific implementation process is as follows:
step 4.1, use the context phrase semantic vector obtained in step two to compute, through the attention mechanism, the importance factor α of each word with respect to the semantic vector of the context phrase, by formula (8):
α_t = e^{Context^T x_t} / Σ_{i=d-l..d} e^{Context^T x_i}, d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d] (8)
where α_t denotes the importance factor of the word at time t with respect to the semantic vector of the context phrase, Context denotes the semantic vector of the context phrase obtained in step two, x_t the word vector input at time t, and x_i the word vector input at time i; T denotes vector transposition; e denotes the exponential function with the natural constant e as base;
step 4.2, compute the weighted hidden layer vectors h′ based on the attention mechanism, by formula (9):
h′_t = α_t · h_t, d-l ≤ t ≤ d
h′ = [h′_{d-l}, h′_{d-l+1}, ..., h′_d] (9)
where h′_t denotes the weighted hidden layer vector at time t, α_t the importance factor of the word at time t with respect to the semantic vector of the context phrase, and h_t the hidden layer vector at time t;
step 4.3, compute the mean h̄ of the context phrase semantic vectors with a mean-pooling operation, by formula (10):
h̄ = (1 / (l + 1)) Σ_{t=d-l..d} h′_t (10)
where h′_t denotes the weighted hidden layer vector at time t;
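A minimal numpy sketch of step four follows, covering formulas (8)-(10); the random vectors stand in for the Context vector of step two and the hidden vectors of step three, and the max-shift for numerical stability is an addition, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
l, dim = 3, 128                    # window length l; Context, x_t, and h_t share dim here
context = rng.normal(size=dim)     # stand-in for the semantic vector Context of step two
xs = [rng.normal(size=dim) for _ in range(l + 1)]   # word vectors x_{d-l}..x_d
hs = [rng.normal(size=dim) for _ in range(l + 1)]   # hidden vectors h_{d-l}..h_d

# alpha_t = e^{Context^T x_t} / sum_i e^{Context^T x_i}   (formula (8))
scores = np.array([context @ x for x in xs])
scores -= scores.max()             # numerical-stability shift (an addition)
alphas = np.exp(scores) / np.exp(scores).sum()

# h'_t = alpha_t * h_t                                    (formula (9))
weighted = [a * h for a, h in zip(alphas, hs)]

# mean-pooling over the l+1 weighted hidden vectors       (formula (10))
h_bar = np.mean(weighted, axis=0)
print(h_bar.shape)                 # (128,)
```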
Step five, predict the target word w_{d+1} from the mean h̄ of the context phrase semantic vectors and the document topic information by logistic regression, obtaining the prediction probability of the target word w_{d+1}. The specific implementation process is as follows:
step 5.1, learn the document theme mapping matrix with the LDA algorithm, and then, according to the document theme mapping matrix and doc_id, map each document to a one-dimensional vector D_z whose length equals the width of the word vector matrix in step 2.1;
step 5.2, concatenate the vector D_z output by step 5.1 with the mean h̄ of the context phrase semantic vectors output by step four, obtaining the concatenated vector V_d = [D_z, h̄];
step 5.3, use the V_d output by step 5.2 to predict the target word w_{d+1}, classifying by logistic regression; the objective function is shown in formula (11):
P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1..|V|} exp(θ_i^T V_d) (11)
where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i the parameter corresponding to word w_i of the vocabulary, |V| the size of the vocabulary, and V_d the concatenated vector obtained in step 5.2; exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
Step 5.4, calculating a loss function of the target function (11) by using a cross entropy method through a formula (12):
L=-log(P(y=wd+1|Vd)) (12)
wherein, wd+1Representing a target word, VdIs the concatenation vector of step 4.2, log () represents a base-10 logarithmic function;
and (3) updating and solving the loss function (12) by a Sampled Softmax algorithm and a small batch random gradient descent parameter updating method to obtain a document theme vector.
Through steps one to five, the extraction of a document theme vector with deep semantics and saliency is completed.
Examples
This example describes an implementation of the present invention, as shown in FIG. 1.
As can be seen from FIG. 1, the flow of the document theme vector extraction method based on deep learning of the present invention is as follows:
Step A, preprocessing: meaningless symbols such as special characters in the corpus are removed first, and the text is then segmented into words. Word segmentation divides a continuous text sequence into individual words according to given lexical rules, so that a sentence is decomposed into consecutive meaningful word strings for subsequent analysis; here the PTB tokenizer is used for segmentation. After segmentation, a vocabulary is constructed from the original text; in this embodiment the vocabulary takes the top 20000 words of the training text, i.e. the vocabulary size |V| is 20000. After the vocabulary is selected, vocabulary index data of the original corpus is constructed according to the word indices, and this index data serves as the input of the model.
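A minimal Python sketch of step A follows, with whitespace tokenization standing in for the PTB tokenizer and frequency ranking assumed for the "top 20000 words" vocabulary selection:

```python
from collections import Counter

raw_docs = ["Deep learning extracts document topics .",
            "Topic vectors capture deep semantics of documents ."]

# Whitespace tokenization as a stand-in for the PTB tokenizer.
tokenized = [doc.lower().split() for doc in raw_docs]

# Keep the 20000 most frequent words as the vocabulary V (far fewer exist here).
counts = Counter(w for doc in tokenized for w in doc)
vocab = [w for w, _ in counts.most_common(20000)]
word2idx = {w: i for i, w in enumerate(vocab)}

# Vocabulary index data of the corpus: each text becomes a list of word indices.
indexed = [[word2idx[w] for w in doc if w in word2idx] for doc in tokenized]
print(indexed)
```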
Step B, learn word vectors using the word2vec algorithm. The words of the documents are input into the word2vec algorithm to obtain word vectors, with the objective function of formula (13):
L = Σ_{i=1..Corp} Σ_{-k≤j≤k, j≠0} log p(w_{i+j} | w_i) (13)
where k is the window size, i the position of the current word, and Corp the number of words in the corpus; 128-dimensional word vectors are learned by gradient descent;
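A minimal sketch of step B using gensim's Word2Vec (gensim 4.x API) as a stand-in for the patent's word2vec training; the 128-dimensional vectors follow the embodiment, while the window size and skip-gram setting are assumptions:

```python
from gensim.models import Word2Vec

sentences = [["deep", "learning", "extracts", "document", "topics"],
             ["topic", "vectors", "capture", "deep", "semantics"]]

# vector_size=128 matches the embodiment; window and sg are assumed values.
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, sg=1)

vec = model.wv["deep"]   # a 128-dimensional word vector
print(vec.shape)         # (128,)
```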
Step C, extract the context phrase semantic vector using the CNN, and learn the context phrase hidden layer vectors using the LSTM;
the extraction of the context phrase semantic vector by the CNN and the learning of the context phrase hidden layer vectors by the LSTM are computed in parallel; specifically, in this embodiment:
extracting the context phrase semantic vector with the CNN: first, K convolution kernels of size C_l × C_m are randomly initialized from a Gaussian distribution; for a given context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the phrase is mapped into a matrix of size l × m through the word vectors learned in step B, where l is the context phrase length and m the word vector dimension; convolving this matrix with the randomly initialized kernels as in formula (1) yields the vector Context, i.e. the semantic vector of the context phrase;
learning the context phrase hidden layer vectors with the LSTM: the word vectors corresponding to the context phrase w_{d-l}, w_{d-l+1}, ..., w_d are input into the LSTM model in sequence, with every dimension of the hidden layer vector h_0 at time 0 set to 0; the forget gates, input gates, output gates, and finally the context phrase hidden layer vectors are then computed in turn using formulas (2)-(7), with the hidden dimension set to 128;
Step D, compute the weighted semantic vectors using the attention mechanism, and compute the document topic distribution;
the computation of the weighted semantic vectors with the attention mechanism and the computation of the document topic distribution run in parallel; specifically, in this embodiment:
computing the weighted semantic vectors with the attention mechanism: based on the word vectors obtained in step B and the context phrase semantic vector obtained in step C, the attention mechanism is applied to each word of the context phrase to obtain the attention factor α_t; α_t is a real number between 0 and 1, and the larger it is, the more word vector information of the corresponding position is kept in the final mean-pooling layer; its size therefore indicates the importance of the current word for representing the meaning of the whole phrase, i.e. more important words receive more attention;
computing the document topic distribution: specifically, using the LDA algorithm, the documents D are input into LDA to obtain the topic distribution of each document, which is used directly as the final result and denoted D_z;
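A minimal sketch of the document topic distribution D_z using gensim's LdaModel as the LDA implementation; the topic count and training passes are assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["deep", "learning", "topic", "model"],
        ["word", "vector", "semantics", "topic"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# num_topics and passes are illustrative choices, not from the patent.
lda = LdaModel(bow_corpus, num_topics=4, id2word=dictionary, passes=10)

# Topic distribution of document 0, used directly as D_z in the embodiment.
D_z = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
print(D_z)   # [(topic_id, probability), ...]
```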
Step E, predict the target word and learn the document theme vector: the weighted semantic vector mean and D_z are directly concatenated, the probability of the target word is then maximized, and the document theme vector is obtained by the Sampled Softmax algorithm with mini-batch stochastic gradient descent parameter updates.

Claims (2)

1. A document theme vector extraction method based on deep learning is characterized by comprising the following steps:
step one, performing relevant definition, specifically as follows:
definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
definition 2: predicted word w_{d+1}, denoting the target word to be learned;
definition 3: window words, words appearing consecutively in the text, between which hidden internal associations exist;
definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words appearing before the position of the predicted word, the length of the context phrase being l;
definition 5: document theme mapping matrix, learned by the LDA algorithm, each row representing the theme of one document;
definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
step two, learning the semantic vector of the context phrase using a Convolutional Neural Network (CNN); the specific steps are as follows:
step 2.1, training a word vector matrix of document D, the size of which is n × m, where n denotes the length of the word vector matrix and m its width;
step 2.2, extracting the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
step 2.3, calculating the semantic vector Context of the context phrase using the CNN, wherein the vector matrix M obtained in step 2.2 is convolved with K convolution kernels of size C_l × C_m;
where K denotes the number of convolution kernels, C_l denotes the length of each kernel with C_l = l, and C_m denotes the width of each kernel with C_m = m;
the semantic vector Context of the context phrase is calculated by formula (1):
Context_k = Σ_{p=1..l} Σ_{q=1..m} c_pq · M_pq + b, 1 ≤ k ≤ K (1)
Context = [Context_1, Context_2, ..., Context_K]
where Context_k denotes the k-th dimension of the semantic vector of the context phrase, l the context phrase length, m the width of the word vector matrix (i.e. the word vector dimension), d the starting position of the first word in the context phrase, c_pq the weight parameter in row p, column q of the convolution kernel, M_pq the data in row p, column q of the vector matrix M, and b the bias parameter of the convolution kernel;
step three, learning the semantics of the context phrase using the long short-term memory network model LSTM, obtaining the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d; the specific implementation process is as follows:
step 3.1, assigning d - l to t, i.e. t = d - l, where t denotes the t-th time step;
step 3.2, assigning the word vector of w_t to x_t, where x_t denotes the word vector input at time t and w_t the word input at time t;
wherein the word vector of w_t is obtained by lookup in the word vector matrix output in step 2.1, i.e. by extracting the word vector at the position of w_t in the vector matrix M;
step 3.3, taking x_t as the input of the LSTM model, obtaining the hidden layer vector h_t at time t;
step 3.4, judging whether t equals d; if not, incrementing t by 1 and jumping to step 3.2; if so, outputting the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and jumping to step four;
step four, organically combining the CNN model and the LSTM model through an attention mechanism to obtain the mean h̄ of the context phrase semantic vectors; the specific implementation method is as follows:
step 4.1, using the context phrase semantic vector obtained in step two to compute, through the attention mechanism, the importance factor α of each word with respect to the semantic vector of the context phrase, by the following formula:
α_t = e^{Context^T x_t} / Σ_{i=d-l..d} e^{Context^T x_i}, d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d]
where α_t denotes the importance factor of the word at time t with respect to the semantic vector of the context phrase, Context denotes the semantic vector of the context phrase obtained in step two, x_t the word vector input at time t, and x_i the word vector input at time i; T denotes vector transposition; e denotes the exponential function with the natural constant e as base;
step 4.2, computing the weighted hidden layer vectors h′ based on the attention mechanism, by the following formula:
h′_t = α_t · h_t, d-l ≤ t ≤ d
h′ = [h′_{d-l}, h′_{d-l+1}, ..., h′_d]
where h′_t denotes the weighted hidden layer vector at time t, α_t the importance factor of the word at time t with respect to the semantic vector of the context phrase, and h_t the hidden layer vector at time t;
step 4.3, computing the mean h̄ of the context phrase semantic vectors with a mean-pooling operation, by formula (10):
h̄ = (1 / (l + 1)) Σ_{t=d-l..d} h′_t (10)
where h′_t denotes the weighted hidden layer vector at time t;
step five, predicting the target word w_{d+1} from the mean h̄ of the context phrase semantic vectors and the document topic information by logistic regression, obtaining the prediction probability of the target word w_{d+1}; the specific steps are as follows:
step 5.1, learning the document theme mapping matrix, then, according to the document theme mapping matrix and doc_id, mapping each document to a one-dimensional vector D_z whose length equals the width of the word vector matrix in step 2.1;
step 5.2, concatenating the vector D_z output by step 5.1 with the mean h̄ of the context phrase semantic vectors output by step four, obtaining the concatenated vector V_d = [D_z, h̄];
step 5.3, using the V_d output by step 5.2 to predict the target word w_{d+1}, classifying by logistic regression, the objective function being shown in formula (11):
P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1..|V|} exp(θ_i^T V_d) (11)
where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i the parameter corresponding to word w_i of the vocabulary, |V| the size of the vocabulary, and V_d the concatenated vector obtained in step 5.2; exp denotes the exponential function with base e, Σ denotes summation; P denotes probability, y denotes the dependent variable, and T denotes matrix transposition;
step 5.4, computing the loss function of objective (11) with the cross-entropy method, by formula (12):
L = -log(P(y = w_{d+1} | V_d)) (12)
where w_{d+1} denotes the target word and V_d is the concatenated vector of step 5.2; log() denotes the logarithmic function;
the loss function (12) is minimized by the Sampled Softmax algorithm with mini-batch stochastic gradient descent parameter updates, obtaining the document theme vector.
2. The method for extracting a document theme vector based on deep learning of claim 1, wherein the specific implementation method of step 3.3 is as follows:
step 3.3.1, computing the forget gate f_t at time t, which controls what information is forgotten, by formula (2):
f_t = σ(W_f x_t + U_f h_{t-1} + b_f) (2)
where W_f and U_f denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_f a bias vector parameter; when t = d - l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is a zero vector; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.2, computing the input gate i_t at time t, which controls what new information is added at the current time, by formula (3):
i_t = σ(W_i x_t + U_i h_{t-1} + b_i) (3)
where W_i and U_i denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_i a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.3, computing the candidate update information c̃_t at time t by formula (4):
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c) (4)
where W_c and U_c denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_c a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
step 3.3.4, computing the information at time t by adding the retained information of the previous time and the update of the current time, by formula (5):
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (5)
where c_t denotes the information at time t, f_t the forget gate at time t, c_{t-1} the information at time t-1, i_t the input gate at time t, c̃_t the candidate update at time t, and ∘ the element-wise (Hadamard) product of vectors;
step 3.3.5, computing the output gate o_t at time t, which controls the output information, by formula (6):
o_t = σ(W_o x_t + U_o h_{t-1} + b_o) (6)
where W_o and U_o denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_o a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model; wherein the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o have different matrix elements, and the elements of the bias vector parameters b_f, b_i, b_c, b_o also differ;
step 3.3.6, computing the hidden layer vector h_t at time t by formula (7):
h_t = o_t ∘ c_t (7)
where o_t denotes the output gate at time t and c_t the information at time t.
CN201810748564.1A 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning Active CN108984526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Publications (2)

Publication Number Publication Date
CN108984526A (en) 2018-12-11
CN108984526B (en) 2021-05-07

Family

ID=64536620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810748564.1A Active CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN108984526B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414483B (en) * 2019-01-04 2023-03-28 阿里巴巴集团控股有限公司 Document processing device and method
CN109871532B (en) * 2019-01-04 2022-07-08 平安科技(深圳)有限公司 Text theme extraction method and device and storage medium
CN109960802A (en) * 2019-03-19 2019-07-02 四川大学 The information processing method and device of narrative text are reported for aviation safety
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 Merge the keyword abstraction method of subject information and two-way LSTM
CN110334358A (en) * 2019-04-28 2019-10-15 厦门大学 A kind of phrase table dendrography learning method of context-aware
CN110083710B (en) * 2019-04-30 2021-04-02 北京工业大学 Word definition generation method based on cyclic neural network and latent variable structure
CN110532395B (en) * 2019-05-13 2021-09-28 南京大学 Semantic embedding-based word vector improvement model establishing method
CN110825848B (en) * 2019-06-10 2022-08-09 北京理工大学 Text classification method based on phrase vectors
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110457674B (en) * 2019-06-25 2021-05-14 西安电子科技大学 Text prediction method for theme guidance
CN110378409B (en) * 2019-07-15 2020-08-21 昆明理工大学 Chinese-Yue news document abstract generation method based on element association attention mechanism
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method
CN110781256B (en) * 2019-08-30 2024-02-23 腾讯大地通途(北京)科技有限公司 Method and device for determining POI matched with Wi-Fi based on sending position data
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN111125434B (en) * 2019-11-26 2023-06-27 北京理工大学 Relation extraction method and system based on ensemble learning
CN111274789B (en) * 2020-02-06 2021-07-06 支付宝(杭州)信息技术有限公司 Training method and device of text prediction model
CN111696624B (en) * 2020-06-08 2022-07-12 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111753540B (en) * 2020-06-24 2023-04-07 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112632966B (en) * 2020-12-30 2023-07-21 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538B (en) * 2020-12-30 2022-10-14 北京理工大学 Text vector retrieval method combined with external knowledge
CN112699662B (en) * 2020-12-31 2022-08-16 太原理工大学 False information early detection method based on text structure algorithm
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909537B (en) * 2017-02-07 2020-04-07 中山大学 One-word polysemous analysis method based on topic model and vector space

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Topic Discovery for Short Texts Using Word Embeddings; Guangxu Xun et al.; 2016 IEEE 16th International Conference on Data Mining; 2016-12-31; full text *
Sentiment analysis based on word vector technology and hybrid neural networks (基于词向量技术和混合神经网络的情感分析); Hu Chaoju et al.; Application Research of Computers (计算机应用研究); 2017-12-12; Vol. 35, No. 12; full text *

Also Published As

Publication number Publication date
CN108984526A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN110196980B (en) Domain migration on Chinese word segmentation task based on convolutional network
CN111930942B (en) Text classification method, language model training method, device and equipment
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111753058A (en) Text viewpoint mining method and system
CN111984791A (en) Long text classification method based on attention mechanism
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115757773A (en) Method and device for classifying problem texts with multi-value chains
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
Chan et al. Applying and optimizing NLP model with CARU
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN115186670B (en) Method and system for identifying domain named entities based on active learning
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN114722818A (en) Named entity recognition model based on anti-migration learning
CN114357166A (en) Text classification method based on deep learning
CN113988054A (en) Entity identification method for coal mine safety field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant