CN108984526B - Document theme vector extraction method based on deep learning - Google Patents


Info

Publication number
CN108984526B
CN108984526B (application CN201810748564.1A)
Authority
CN
China
Prior art keywords
vector
word
representing
time
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810748564.1A
Other languages
Chinese (zh)
Other versions
CN108984526A (en)
Inventor
高扬 (Gao Yang)
黄河燕 (Huang Heyan)
陆池 (Lu Chi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810748564.1A
Publication of CN108984526A
Application granted
Publication of CN108984526B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a document theme vector extraction method based on deep learning, and belongs to the technical field of natural language processing. The method extracts local deep semantic information with a convolutional neural network and learns time-sequence information with an LSTM model, making the vector semantics more comprehensive; it exploits the implicit co-occurrence relation between context phrases and document themes, avoiding the weakness of some sentence-based theme vector models on short texts; and it organically combines the CNN and LSTM models through an attention mechanism, learning the deep semantics, time-sequence information, and salient information of contexts, thereby constructing a more effective model for extracting document theme vectors.

Description

Document theme vector extraction method based on deep learning
Technical Field
The invention relates to a document theme vector extraction method based on deep learning, and belongs to the technical field of natural language processing.
Background
In today's big data era, how to discover the topics of massive internet text data is a research focus. Analyzing the theme of text data yields the document theme vector, which is essentially a deep semantic representation of the document and an intrinsic combination of theme and semantics. Extracted document theme vectors can be widely applied in natural language processing tasks, including public opinion analysis of social networks and new media, timely acquisition of news hotspots, and the like. Therefore, how to extract document theme vectors efficiently is an important research topic.
For text data, the theme is not necessarily embodied directly in the specific text content, which makes the theme implied by a text difficult to mine: the theme meaning of a document must be extracted from the relationships among its words, sentences, and paragraphs, combined with the discourse structure of the document. With the enrichment of statistical natural language processing methods and corpora in recent years, text topic modeling methods based on word-topic and document-topic distributions have been proposed in succession. Their basic idea is to assume that the topic of each word and document obeys a statistical probability distribution, to estimate the probability distribution of document topics by training on document data, and then to cluster documents according to their topics.
To correctly analyze the topic of each document, the conventional approach performs topic analysis on every word of the text, but this has a major problem: the words that really determine the text topic account for only a small fraction of all words, so the conventional method spends a large amount of analysis on topic-irrelevant words. On one hand, these irrelevant words inflate the computational cost; on the other hand, the text topic is extracted inaccurately, and the deep semantics of the text cannot be mined in combination with its internal associations.
With the improvement of hardware performance and the continuous expansion of data scale, deep learning has been widely applied in various fields, and experimental results have improved greatly over previous baselines. Owing to characteristics such as elegant models and flexible architectures, deep learning has in recent years been widely applied to methods combining word embedding and document embedding. Among all deep learning methods, CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models are the two most popular. In natural language processing tasks, text analysis based on CNN and LSTM models can discover the latent semantic features of text well, greatly helping semantic analysis in tasks such as automatic summarization, sentiment analysis, and machine translation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, solve the problem of mining the deep semantics of a text in combination with its internal associations, and provide a document theme vector extraction method based on deep learning. The invention focuses on modeling and analyzing the document theme vector, mining the implicit association between text features and the theme vector, and thereby learning the document theme vector.
The core idea of the invention is as follows: extract the semantics of the context phrase with the CNN and input them into the LSTM model; use an attention mechanism to weigh the importance of words at different positions and with different meanings in the text, thereby retaining important information; complete the organic combination of the CNN and LSTM models, mine the internal associations between contexts, and learn document theme vectors with deep semantics and saliency.
The method of the invention is realized by the following technical scheme.
A document theme vector extraction method based on deep learning comprises the following steps:
step one, performing relevant definition, specifically as follows:
definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
definition 2: predicted word w_{d+1}, denoting the target word to be learned;
definition 3: window words, several words appearing consecutively in the text, between which hidden internal associations exist;
definition 4: context phrase, the window words appearing before the position of the predicted word; with window length l, the context phrase is denoted w_{d-l}, w_{d-l+1}, ..., w_d;
Definition 5: document theme mapping matrix, learned by the LDA algorithm (Latent Dirichlet Allocation), each row representing the theme of one document;
definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step two, learn the semantic vector of the context phrase using the CNN.
Step three, learn the semantics of the context phrase using an LSTM model, obtaining the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
Step four, organically combine the CNN and LSTM models through an attention mechanism, obtaining the mean h̄ of the context phrase semantic vectors.
Step five, predict the target word w_{d+1} from the mean h̄ of the context phrase semantic vectors and the document topic information by logistic regression, obtaining the prediction probability of the target word w_{d+1}.
Advantageous effects
Compared with the prior art, the document theme vector extraction method based on deep learning has the following beneficial effects:
1. extracting local deep semantic information by using CNN;
2. the LSTM model is utilized to learn out the time sequence information, so that the vector semantics are more comprehensive;
3. the implicit co-occurrence relation between context phrases and document themes is exploited, avoiding the weakness of some sentence-based theme vector models on short texts;
4. the CNN model and the LSTM model are organically combined by utilizing an attention mechanism, deep semantics, time sequence information and significant information of context are learned, and a model for extracting document theme vectors is more effectively constructed.
Drawings
FIG. 1 is a flowchart of a document theme vector extraction method based on deep learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the method of the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
A document theme vector extraction method based on deep learning is implemented in the following basic steps:
step one, performing relevant definition, specifically as follows:
definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
definition 2: predicted word w_{d+1}, denoting the target word to be learned;
definition 3: window words, several words appearing consecutively in the text, between which hidden internal associations exist;
definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words appearing before the position of the predicted word, the length of the context phrase being l;
definition 5: document theme mapping matrix, learned by the LDA algorithm, each row representing the theme of one document;
definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d.
Step two, learn the semantic vector Context of the context phrase using the CNN.
The specific implementation process is as follows:
step 2.1, train the word vector matrix of document D using an algorithm such as word2vec; the matrix size is n × m, where n denotes the length of the word vector matrix and m its width;
step 2.2, extract the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
step 2.3, calculate the semantic vector Context of the context phrase using the CNN. Specifically, the vector matrix M obtained in step 2.2 is convolved with K convolution kernels of size C_l × C_m;
where K denotes the number of convolution kernels (K = 128 in this embodiment), C_l denotes the length of each kernel with C_l = l, and C_m denotes the width of each kernel with C_m = m;
The semantic vector Context of the context phrase is calculated by formula (1):
Context_k = Σ_{p=1..l} Σ_{q=1..m} c_pq · M_pq + b, 1 ≤ k ≤ K (1)
Context = [Context_1, Context_2, ..., Context_K]
where Context_k denotes the k-th dimension of the semantic vector of the context phrase, l the context phrase length, m the width of the word vector matrix (i.e. the word vector dimension), d the starting position of the first word in the context phrase, c_pq the weight parameter in row p, column q of the k-th convolution kernel, M_pq the data in row p, column q of the vector matrix M, and b the bias parameter of the convolution kernel;
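A minimal numpy sketch of formula (1) follows, assuming each of the K kernels spans the entire l × m matrix M (since C_l = l and C_m = m), so that each kernel yields one scalar dimension of Context; the sizes and random initialization are illustrative only:

```python
import numpy as np

l, m, K = 4, 128, 128                      # window length, word-vector dim, kernel count
rng = np.random.default_rng(0)
M = rng.normal(size=(l, m))                # vector matrix M of the context phrase
C = rng.normal(0, 0.01, size=(K, l, m))   # K convolution kernels with weights c_pq
b = np.zeros(K)                            # bias parameter b per kernel

# Context_k = sum_p sum_q c_pq * M_pq + b   (formula (1))
context = np.array([np.sum(C[k] * M) + b[k] for k in range(K)])
print(context.shape)   # (128,) -> semantic vector Context of the context phrase
```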
Step three, learn the semantics of the context phrase using the LSTM model, obtaining the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d.
The specific implementation process is as follows:
step 3.1, assign d - l to t, i.e. t = d - l, where t denotes the t-th time step;
step 3.2, assign the word vector of w_t to x_t, where x_t denotes the word vector input at time t and w_t the word input at time t;
here the word vector of w_t is obtained by lookup in the word vector matrix output in step 2.1, i.e. by extracting the word vector at the position of w_t in the vector matrix M;
step 3.3, take x_t as the input of the LSTM model, obtaining the hidden layer vector h_t at time t;
The specific implementation process is as follows:
step 3.3.1, compute the forget gate f_t at time t, which controls what information is forgotten, by formula (2):
f_t = σ(W_f x_t + U_f h_{t-1} + b_f) (2)
where W_f and U_f denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_f a bias vector parameter; when t = d - l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is a zero vector; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.2, compute the input gate i_t at time t, which controls what new information is added at the current time, by formula (3):
i_t = σ(W_i x_t + U_i h_{t-1} + b_i) (3)
where W_i and U_i denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_i a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.3, compute the candidate update information c̃_t at time t by formula (4):
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c) (4)
where W_c and U_c denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_c a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
step 3.3.4, compute the information at time t by adding the retained information of the previous time and the update of the current time, by formula (5):
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (5)
where c_t denotes the information at time t, f_t the forget gate at time t, c_{t-1} the information at time t-1, i_t the input gate at time t, c̃_t the candidate update at time t, and ∘ the element-wise (Hadamard) product of vectors;
step 3.3.5, compute the output gate o_t at time t, which controls the output information, by formula (6):
o_t = σ(W_o x_t + U_o h_{t-1} + b_o) (6)
where W_o and U_o denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_o a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model. The parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o of steps 3.3.1-3.3.3 and 3.3.5 have different matrix elements, and the elements of the bias vector parameters b_f, b_i, b_c, b_o also differ;
step 3.3.6, compute the hidden layer vector h_t at time t by formula (7):
h_t = o_t ∘ c_t (7)
where o_t denotes the output gate at time t and c_t the information at time t;
step 3.4, judge whether t equals d; if not, increment t by 1 and jump to step 3.2; if so, output the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and jump to step four;
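A minimal numpy sketch of step three follows, implementing formulas (2)-(7) exactly as written in the patent (note that formula (7) uses h_t = o_t ∘ c_t, without the tanh of textbook LSTM formulations); shapes, initialization, and the stand-in word vectors are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, h_dim = 128, 128                        # word-vector dim and hidden dim (illustrative)
rng = np.random.default_rng(0)
W = {g: rng.normal(0, 0.01, (h_dim, m)) for g in "fico"}      # W_f, W_i, W_c, W_o
U = {g: rng.normal(0, 0.01, (h_dim, h_dim)) for g in "fico"}  # U_f, U_i, U_c, U_o
b = {g: np.zeros(h_dim) for g in "fico"}                      # b_f, b_i, b_c, b_o

def lstm_step(x_t, h_prev, c_prev):
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate, formula (2)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate, formula (3)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate update, formula (4)
    c_t = f_t * c_prev + i_t * c_tilde                          # cell information, formula (5)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate, formula (6)
    h_t = o_t * c_t                                             # hidden vector, formula (7) as written
    return h_t, c_t

# Feed the context phrase word vectors x_{d-l}..x_d in sequence; h and c start at zero.
xs = [rng.normal(size=m) for _ in range(4)]   # stand-ins for the word vectors
h, c = np.zeros(h_dim), np.zeros(h_dim)
hidden = []
for x in xs:
    h, c = lstm_step(x, h, c)
    hidden.append(h)                          # h_{d-l}, ..., h_d
```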
Step four, combine the CNN and LSTM models using an attention mechanism to obtain the mean h̄ of the context phrase semantic vectors.
The specific implementation process is as follows:
step 4.1, use the context phrase semantic vector obtained in step two to compute, through the attention mechanism, the importance factor α of each word with respect to the semantic vector of the context phrase, by formula (8):
α_t = e^{Context^T x_t} / Σ_{i=d-l..d} e^{Context^T x_i}, d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d] (8)
where α_t denotes the importance factor of the word at time t with respect to the semantic vector of the context phrase, Context denotes the semantic vector of the context phrase obtained in step two, x_t the word vector input at time t, and x_i the word vector input at time i; T denotes vector transposition; e denotes the exponential function with the natural constant e as base;
step 4.2, compute the weighted hidden layer vectors h′ based on the attention mechanism, by formula (9):
h′_t = α_t · h_t, d-l ≤ t ≤ d
h′ = [h′_{d-l}, h′_{d-l+1}, ..., h′_d] (9)
where h′_t denotes the weighted hidden layer vector at time t, α_t the importance factor of the word at time t with respect to the semantic vector of the context phrase, and h_t the hidden layer vector at time t;
step 4.3, compute the mean h̄ of the context phrase semantic vectors with a mean-pooling operation, by formula (10):
h̄ = (1 / (l + 1)) Σ_{t=d-l..d} h′_t (10)
where h′_t denotes the weighted hidden layer vector at time t;
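A minimal numpy sketch of step four follows, covering formulas (8)-(10); the random vectors stand in for the Context vector of step two and the hidden vectors of step three, and the max-shift for numerical stability is an addition, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
l, dim = 3, 128                    # window length l; Context, x_t, and h_t share dim here
context = rng.normal(size=dim)     # stand-in for the semantic vector Context of step two
xs = [rng.normal(size=dim) for _ in range(l + 1)]   # word vectors x_{d-l}..x_d
hs = [rng.normal(size=dim) for _ in range(l + 1)]   # hidden vectors h_{d-l}..h_d

# alpha_t = e^{Context^T x_t} / sum_i e^{Context^T x_i}   (formula (8))
scores = np.array([context @ x for x in xs])
scores -= scores.max()             # numerical-stability shift (an addition)
alphas = np.exp(scores) / np.exp(scores).sum()

# h'_t = alpha_t * h_t                                    (formula (9))
weighted = [a * h for a, h in zip(alphas, hs)]

# mean-pooling over the l+1 weighted hidden vectors       (formula (10))
h_bar = np.mean(weighted, axis=0)
print(h_bar.shape)                 # (128,)
```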
Step five, predict the target word w_{d+1} from the mean h̄ of the context phrase semantic vectors and the document topic information by logistic regression, obtaining the prediction probability of the target word w_{d+1}. The specific implementation process is as follows:
step 5.1, learn the document theme mapping matrix with the LDA algorithm, and then, according to the document theme mapping matrix and doc_id, map each document to a one-dimensional vector D_z whose length equals the width of the word vector matrix in step 2.1;
step 5.2, concatenate the vector D_z output by step 5.1 with the mean h̄ of the context phrase semantic vectors output by step four, obtaining the concatenated vector V_d = [D_z, h̄];
step 5.3, use the V_d output by step 5.2 to predict the target word w_{d+1}, classifying by logistic regression; the objective function is shown in formula (11):
P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1..|V|} exp(θ_i^T V_d) (11)
where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i the parameter corresponding to word w_i of the vocabulary, |V| the size of the vocabulary, and V_d the concatenated vector obtained in step 5.2; exp denotes the exponential function with base e, Σ denotes summation, P denotes probability, y denotes the dependent variable, and T denotes matrix transposition.
Step 5.4, calculating a loss function of the target function (11) by using a cross entropy method through a formula (12):
L=-log(P(y=wd+1|Vd)) (12)
wherein, wd+1Representing a target word, VdIs the concatenation vector of step 4.2, log () represents a base-10 logarithmic function;
and (3) updating and solving the loss function (12) by a Sampled Softmax algorithm and a small batch random gradient descent parameter updating method to obtain a document theme vector.
Through steps one to five, the extraction of a document theme vector with deep semantics and saliency is completed.
Examples
This example describes an implementation of the present invention, as shown in FIG. 1.
As can be seen from FIG. 1, the flow of the document theme vector extraction method based on deep learning of the present invention is as follows:
Step A, preprocessing: meaningless symbols such as special characters in the corpus are removed first, and the text is then segmented into words. Word segmentation divides a continuous text sequence into individual words according to given lexical rules, so that a sentence is decomposed into consecutive meaningful word strings for subsequent analysis; here the PTB tokenizer is used for segmentation. After segmentation, a vocabulary is constructed from the original text; in this embodiment the vocabulary takes the top 20000 words of the training text, i.e. the vocabulary size |V| is 20000. After the vocabulary is selected, vocabulary index data of the original corpus is constructed according to the word indices, and this index data serves as the input of the model.
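A minimal Python sketch of step A follows, with whitespace tokenization standing in for the PTB tokenizer and frequency ranking assumed for the "top 20000 words" vocabulary selection:

```python
from collections import Counter

raw_docs = ["Deep learning extracts document topics .",
            "Topic vectors capture deep semantics of documents ."]

# Whitespace tokenization as a stand-in for the PTB tokenizer.
tokenized = [doc.lower().split() for doc in raw_docs]

# Keep the 20000 most frequent words as the vocabulary V (far fewer exist here).
counts = Counter(w for doc in tokenized for w in doc)
vocab = [w for w, _ in counts.most_common(20000)]
word2idx = {w: i for i, w in enumerate(vocab)}

# Vocabulary index data of the corpus: each text becomes a list of word indices.
indexed = [[word2idx[w] for w in doc if w in word2idx] for doc in tokenized]
print(indexed)
```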
Step B, learn word vectors using the word2vec algorithm. The words of the documents are input into the word2vec algorithm to obtain word vectors, with the objective function of formula (13):
L = Σ_{i=1..Corp} Σ_{-k≤j≤k, j≠0} log p(w_{i+j} | w_i) (13)
where k is the window size, i the position of the current word, and Corp the number of words in the corpus; 128-dimensional word vectors are learned by gradient descent;
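A minimal sketch of step B using gensim's Word2Vec (gensim 4.x API) as a stand-in for the patent's word2vec training; the 128-dimensional vectors follow the embodiment, while the window size and skip-gram setting are assumptions:

```python
from gensim.models import Word2Vec

sentences = [["deep", "learning", "extracts", "document", "topics"],
             ["topic", "vectors", "capture", "deep", "semantics"]]

# vector_size=128 matches the embodiment; window and sg are assumed values.
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, sg=1)

vec = model.wv["deep"]   # a 128-dimensional word vector
print(vec.shape)         # (128,)
```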
Step C, extract the context phrase semantic vector using the CNN, and learn the context phrase hidden layer vectors using the LSTM;
the extraction of the context phrase semantic vector by the CNN and the learning of the context phrase hidden layer vectors by the LSTM are computed in parallel; specifically, in this embodiment:
extracting the context phrase semantic vector with the CNN: first, K convolution kernels of size C_l × C_m are randomly initialized from a Gaussian distribution; for a given context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the phrase is mapped into a matrix of size l × m through the word vectors learned in step B, where l is the context phrase length and m the word vector dimension; convolving this matrix with the randomly initialized kernels as in formula (1) yields the vector Context, i.e. the semantic vector of the context phrase;
learning the context phrase hidden layer vectors with the LSTM: the word vectors corresponding to the context phrase w_{d-l}, w_{d-l+1}, ..., w_d are input into the LSTM model in sequence, with every dimension of the hidden layer vector h_0 at time 0 set to 0; the forget gates, input gates, output gates, and finally the context phrase hidden layer vectors are then computed in turn using formulas (2)-(7), with the hidden dimension set to 128;
Step D, compute the weighted semantic vectors using the attention mechanism, and compute the document topic distribution;
the computation of the weighted semantic vectors with the attention mechanism and the computation of the document topic distribution run in parallel; specifically, in this embodiment:
computing the weighted semantic vectors with the attention mechanism: based on the word vectors obtained in step B and the context phrase semantic vector obtained in step C, the attention mechanism is applied to each word of the context phrase to obtain the attention factor α_t; α_t is a real number between 0 and 1, and the larger it is, the more word vector information of the corresponding position is kept in the final mean-pooling layer; its size therefore indicates the importance of the current word for representing the meaning of the whole phrase, i.e. more important words receive more attention;
computing the document topic distribution: specifically, using the LDA algorithm, the documents D are input into LDA to obtain the topic distribution of each document, which is used directly as the final result and denoted D_z;
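A minimal sketch of the document topic distribution D_z using gensim's LdaModel as the LDA implementation; the topic count and training passes are assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["deep", "learning", "topic", "model"],
        ["word", "vector", "semantics", "topic"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# num_topics and passes are illustrative choices, not from the patent.
lda = LdaModel(bow_corpus, num_topics=4, id2word=dictionary, passes=10)

# Topic distribution of document 0, used directly as D_z in the embodiment.
D_z = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
print(D_z)   # [(topic_id, probability), ...]
```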
Step E, predict the target word and learn the document theme vector: the weighted semantic vector mean and D_z are directly concatenated, the probability of the target word is then maximized, and the document theme vector is obtained by the Sampled Softmax algorithm with mini-batch stochastic gradient descent parameter updates.

Claims (2)

1. A document theme vector extraction method based on deep learning is characterized by comprising the following steps:
step one, performing relevant definition, specifically as follows:
definition 1: document D, D = [w_1, w_2, ..., w_i, ..., w_n], where w_i denotes the i-th word of document D;
definition 2: predicted word w_{d+1}, denoting the target word to be learned;
definition 3: window words, words appearing consecutively in the text, between which hidden internal associations exist;
definition 4: context phrase w_{d-l}, w_{d-l+1}, ..., w_d, the window words appearing before the position of the predicted word, the length of the context phrase being l;
definition 5: document theme mapping matrix, learned by the LDA algorithm, each row representing the theme of one document;
definition 6: N_d and doc_id, where N_d denotes the number of documents in the corpus and doc_id denotes the position of a document; each document corresponds to a unique doc_id, with 1 ≤ doc_id ≤ N_d;
step two, learning the semantic vector of the context phrase using a Convolutional Neural Network (CNN); the specific steps are as follows:
step 2.1, training a word vector matrix of document D, the size of which is n × m, where n denotes the length of the word vector matrix and m its width;
step 2.2, extracting the word vector corresponding to each word of the context phrase from the word vector matrix obtained in step 2.1, obtaining the vector matrix M of the context phrase w_{d-l}, w_{d-l+1}, ..., w_d;
step 2.3, calculating the semantic vector Context of the context phrase using the CNN, wherein the vector matrix M obtained in step 2.2 is convolved with K convolution kernels of size C_l × C_m;
where K denotes the number of convolution kernels, C_l denotes the length of each kernel with C_l = l, and C_m denotes the width of each kernel with C_m = m;
the semantic vector Context of the context phrase is calculated by formula (1):
Context_k = Σ_{p=1..l} Σ_{q=1..m} c_pq · M_pq + b, 1 ≤ k ≤ K (1)
Context = [Context_1, Context_2, ..., Context_K]
where Context_k denotes the k-th dimension of the semantic vector of the context phrase, l the context phrase length, m the width of the word vector matrix (i.e. the word vector dimension), d the starting position of the first word in the context phrase, c_pq the weight parameter in row p, column q of the convolution kernel, M_pq the data in row p, column q of the vector matrix M, and b the bias parameter of the convolution kernel;
step three, learning the semantics of the context phrase using the long short-term memory network model LSTM, obtaining the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d; the specific implementation process is as follows:
step 3.1, assigning d - l to t, i.e. t = d - l, where t denotes the t-th time step;
step 3.2, assigning the word vector of w_t to x_t, where x_t denotes the word vector input at time t and w_t the word input at time t;
wherein the word vector of w_t is obtained by lookup in the word vector matrix output in step 2.1, i.e. by extracting the word vector at the position of w_t in the vector matrix M;
step 3.3, taking x_t as the input of the LSTM model, obtaining the hidden layer vector h_t at time t;
step 3.4, judging whether t equals d; if not, incrementing t by 1 and jumping to step 3.2; if so, outputting the hidden layer vectors h_{d-l}, h_{d-l+1}, ..., h_d and jumping to step four;
step four, organically combining the CNN model and the LSTM model through an attention mechanism to obtain the mean h̄ of the context phrase semantic vectors; the specific implementation method is as follows:
step 4.1, using the context phrase semantic vector obtained in step two to compute, through the attention mechanism, the importance factor α of each word with respect to the semantic vector of the context phrase, by the following formula:
α_t = e^{Context^T x_t} / Σ_{i=d-l..d} e^{Context^T x_i}, d-l ≤ t ≤ d
α = [α_{d-l}, α_{d-l+1}, ..., α_d]
where α_t denotes the importance factor of the word at time t with respect to the semantic vector of the context phrase, Context denotes the semantic vector of the context phrase obtained in step two, x_t the word vector input at time t, and x_i the word vector input at time i; T denotes vector transposition; e denotes the exponential function with the natural constant e as base;
step 4.2, computing the weighted hidden layer vectors h′ based on the attention mechanism, by the following formula:
h′_t = α_t · h_t, d-l ≤ t ≤ d
h′ = [h′_{d-l}, h′_{d-l+1}, ..., h′_d]
where h′_t denotes the weighted hidden layer vector at time t, α_t the importance factor of the word at time t with respect to the semantic vector of the context phrase, and h_t the hidden layer vector at time t;
step 4.3, computing the mean h̄ of the context phrase semantic vectors with a mean-pooling operation, by formula (10):
h̄ = (1 / (l + 1)) Σ_{t=d-l..d} h′_t (10)
where h′_t denotes the weighted hidden layer vector at time t;
step five, predicting the target word w_{d+1} from the mean h̄ of the context phrase semantic vectors and the document topic information by logistic regression, obtaining the prediction probability of the target word w_{d+1}; the specific steps are as follows:
step 5.1, learning the document theme mapping matrix, then, according to the document theme mapping matrix and doc_id, mapping each document to a one-dimensional vector D_z whose length equals the width of the word vector matrix in step 2.1;
step 5.2, concatenating the vector D_z output by step 5.1 with the mean h̄ of the context phrase semantic vectors output by step four, obtaining the concatenated vector V_d = [D_z, h̄];
step 5.3, using the V_d output by step 5.2 to predict the target word w_{d+1}, classifying by logistic regression, the objective function being shown in formula (11):
P(y = w_{d+1} | V_d) = exp(θ_{d+1}^T V_d) / Σ_{i=1..|V|} exp(θ_i^T V_d) (11)
where θ_{d+1} is the parameter corresponding to the position of the target word w_{d+1}, θ_i the parameter corresponding to word w_i of the vocabulary, |V| the size of the vocabulary, and V_d the concatenated vector obtained in step 5.2; exp denotes the exponential function with base e, Σ denotes summation; P denotes probability, y denotes the dependent variable, and T denotes matrix transposition;
step 5.4, computing the loss function of objective (11) with the cross-entropy method, by formula (12):
L = -log(P(y = w_{d+1} | V_d)) (12)
where w_{d+1} denotes the target word and V_d is the concatenated vector of step 5.2; log() denotes the logarithmic function;
the loss function (12) is minimized by the Sampled Softmax algorithm with mini-batch stochastic gradient descent parameter updates, obtaining the document theme vector.
2. The method for extracting a document theme vector based on deep learning of claim 1, wherein the specific implementation method of step 3.3 is as follows:
step 3.3.1, computing the forget gate f_t at time t, which controls what information is forgotten, by formula (2):
f_t = σ(W_f x_t + U_f h_{t-1} + b_f) (2)
where W_f and U_f denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_f a bias vector parameter; when t = d - l, h_{t-1} = h_{d-l-1}, and h_{d-l-1} is a zero vector; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.2, computing the input gate i_t at time t, which controls what new information is added at the current time, by formula (3):
i_t = σ(W_i x_t + U_i h_{t-1} + b_i) (3)
where W_i and U_i denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_i a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model;
step 3.3.3, computing the candidate update information c̃_t at time t by formula (4):
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c) (4)
where W_c and U_c denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_c a bias vector parameter; tanh denotes the hyperbolic tangent function, an activation function of the LSTM model;
step 3.3.4, computing the information at time t by adding the retained information of the previous time and the update of the current time, by formula (5):
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (5)
where c_t denotes the information at time t, f_t the forget gate at time t, c_{t-1} the information at time t-1, i_t the input gate at time t, c̃_t the candidate update at time t, and ∘ the element-wise (Hadamard) product of vectors;
step 3.3.5, computing the output gate o_t at time t, which controls the output information, by formula (6):
o_t = σ(W_o x_t + U_o h_{t-1} + b_o) (6)
where W_o and U_o denote parameter matrices, x_t the word vector input at time t, h_{t-1} the hidden layer vector at time t-1, and b_o a bias vector parameter; σ denotes the Sigmoid function, an activation function of the LSTM model; wherein the parameter matrices W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o have different matrix elements, and the elements of the bias vector parameters b_f, b_i, b_c, b_o also differ;
step 3.3.6, computing the hidden layer vector h_t at time t by formula (7):
h_t = o_t ∘ c_t (7)
where o_t denotes the output gate at time t and c_t the information at time t.
CN201810748564.1A 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning Active CN108984526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810748564.1A CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Publications (2)

Publication Number Publication Date
CN108984526A (en) 2018-12-11
CN108984526B (en) 2021-05-07

Family

ID=64536620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810748564.1A Active CN108984526B (en) 2018-07-10 2018-07-10 Document theme vector extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN108984526B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414483B (en) * 2019-01-04 2023-03-28 阿里巴巴集团控股有限公司 Document processing device and method
CN109871532B (en) * 2019-01-04 2022-07-08 平安科技(深圳)有限公司 Text theme extraction method and device and storage medium
CN109960802A (en) * 2019-03-19 2019-07-02 四川大学 The information processing method and device of narrative text are reported for aviation safety
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 Merge the keyword abstraction method of subject information and two-way LSTM
CN110334358A (en) * 2019-04-28 2019-10-15 厦门大学 A kind of phrase table dendrography learning method of context-aware
CN110083710B (en) * 2019-04-30 2021-04-02 北京工业大学 Word definition generation method based on cyclic neural network and latent variable structure
CN110532395B (en) * 2019-05-13 2021-09-28 南京大学 Semantic embedding-based word vector improvement model establishing method
CN110825848B (en) * 2019-06-10 2022-08-09 北京理工大学 Text classification method based on phrase vectors
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN110457674B (en) * 2019-06-25 2021-05-14 西安电子科技大学 Text prediction method for theme guidance
CN110378409B (en) * 2019-07-15 2020-08-21 昆明理工大学 Chinese-Yue news document abstract generation method based on element association attention mechanism
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method
CN110781256B (en) * 2019-08-30 2024-02-23 腾讯大地通途(北京)科技有限公司 Method and device for determining POI matched with Wi-Fi based on sending position data
CN110766073B (en) * 2019-10-22 2023-10-27 湖南科技大学 Mobile application classification method for strengthening topic attention mechanism
CN111125434B (en) * 2019-11-26 2023-06-27 北京理工大学 Relation extraction method and system based on ensemble learning
CN111274789B (en) * 2020-02-06 2021-07-06 支付宝(杭州)信息技术有限公司 Training method and device of text prediction model
CN111696624B (en) * 2020-06-08 2022-07-12 天津大学 DNA binding protein identification and function annotation deep learning method based on self-attention mechanism
CN111753540B (en) * 2020-06-24 2023-04-07 云南电网有限责任公司信息中心 Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112632966B (en) * 2020-12-30 2023-07-21 绿盟科技集团股份有限公司 Alarm information marking method, device, medium and equipment
CN112685538B (en) * 2020-12-30 2022-10-14 北京理工大学 Text vector retrieval method combined with external knowledge
CN112699662B (en) * 2020-12-31 2022-08-16 太原理工大学 False information early detection method based on text structure algorithm
CN112966551A (en) * 2021-01-29 2021-06-15 湖南科技学院 Method and device for acquiring video frame description information and electronic equipment
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909537B (en) * 2017-02-07 2020-04-07 中山大学 One-word polysemous analysis method based on topic model and vector space

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069143A (en) * 2015-08-19 2015-11-18 百度在线网络技术(北京)有限公司 Method and device for extracting keywords from document
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning
CN106919557A (en) * 2017-02-22 2017-07-04 中山大学 A kind of document vector generation method of combination topic model
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Topic Discovery for Short Texts Using Word Embeddings; Guangxu Xun et al.; 2016 IEEE 16th International Conference on Data Mining; 2016-12-31; full text *
Sentiment analysis based on word vector technology and hybrid neural networks (基于词向量技术和混合神经网络的情感分析); Hu Chaoju et al.; Application Research of Computers (计算机应用研究); 2017-12-12; Vol. 35, No. 12; full text *

Also Published As

Publication number Publication date
CN108984526A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN110196980B (en) Domain migration on Chinese word segmentation task based on convolutional network
CN111930942B (en) Text classification method, language model training method, device and equipment
CN111738007B (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111753058A (en) Text viewpoint mining method and system
CN111984791A (en) Long text classification method based on attention mechanism
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115757773A (en) Method and device for classifying problem texts with multi-value chains
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
Chan et al. Applying and optimizing NLP model with CARU
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN115186670B (en) Method and system for identifying domain named entities based on active learning
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN114722818A (en) Named entity recognition model based on anti-migration learning
CN114357166A (en) Text classification method based on deep learning
CN113988054A (en) Entity identification method for coal mine safety field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant