CN112527959B - News classification method based on pooling convolution embedding and attention distribution neural network - Google Patents

News classification method based on pooling convolution embedding and attention distribution neural network

Info

Publication number
CN112527959B
CN112527959B (application CN202011443363.4A)
Authority
CN
China
Prior art keywords
news
text
vector
attention
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011443363.4A
Other languages
Chinese (zh)
Other versions
CN112527959A (en)
Inventor
唐贤伦
郝博慧
彭德光
钟冰
闫振甫
王会明
张璞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011443363.4A priority Critical patent/CN112527959B/en
Publication of CN112527959A publication Critical patent/CN112527959A/en
Application granted granted Critical
Publication of CN112527959B publication Critical patent/CN112527959B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a news classification method based on a pooling-free convolution embedding and attention-distribution neural network, which uses features and weights as the key factors in the classification process. The mechanism is to apply a convolution over the embedding layer to extract local features, to delete the pooling layer so as to reduce information loss, and then to add an attention mechanism that reassigns weights to obtain the global features of the text. The model captures not only deep features of the text but also the importance of each part of the news. Convolutional neural networks (CNNs) play an important role in text classification tasks because of their advantages in extracting local features and position-invariant features. The attention mechanism strengthens the weight of key information because it extracts text context information and focuses on the important parts; combining the two yields a stronger feature-extraction capability. Combining the pooling-free CNN with the global attention mechanism to handle the news classification problem can significantly improve the accuracy of text classification.

Description

News classification method based on pooling convolution embedding and attention distribution neural network
Technical Field
The invention belongs to the field of Chinese news text classification, and particularly relates to a news classification method based on a pooling-free convolution embedding and attention-distribution neural network.
Background
Text classification is a classical task in NLP: it assigns a specified text its corresponding label. Currently, text classification methods are mainly divided into traditional machine-learning text classification and deep-learning text classification.
Traditional machine-learning text classification methods include K-nearest neighbor (KNN), maximum entropy, support vector machines (SVM), and the like. The core idea of the KNN algorithm is that if most of the k nearest neighbors of a sample in feature space belong to a certain class, the sample also belongs to that class and shares features with the samples of that class. Because the category is decided by a vote among the nearest neighbors, the method is suitable when each class has a sufficient number of samples in the training dataset. The principle of maximum entropy is that, when learning a probabilistic model, the model with the maximum entropy is the best model; that is, maximum entropy can be understood as selecting the model with the largest entropy among the set of models satisfying the constraints. SVM is a generalized linear classifier that performs binary classification of data by supervised learning. Deep-learning algorithms are now widely applied to text classification. A recurrent neural network (RNN) is a time-series-based neural network model that can capture long-term dependencies between sequence elements. However, as the length of the sequence increases, it becomes difficult for a standard RNN to capture long-term dependencies and therefore to model the entire sequence; during modeling, some information may be lost, and there are problems of vanishing and exploding gradients. Convolutional neural networks (CNNs) are also applied to text classification tasks and have great advantages in capturing local features and position-invariant features. Long short-term memory networks (LSTM) can model the relationships between sentences; the LSTM adds three gate structures to the RNN, alleviating the vanishing- and exploding-gradient problems. In contrast to the LSTM, the gated recurrent unit (GRU) has only two gate structures, an update gate and a reset gate, so the GRU has fewer parameters and converges better during training. Moreover, the hierarchical attention model incorporates the attention mechanism into a hierarchical GRU model so that the model can better capture the important information of a document. In recent years, attention mechanisms have been widely used in the field of text classification because they can distinguish the importance of each word to the classification result.
Since a computer cannot directly process a text sequence, it is important to express the text in a form that the computer can understand; this is called text vectorization.
The invention aims to solve the problems that the semantic information of the input text is insufficient and that the pooling layer causes information loss and reduces classification accuracy.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a news classification method based on a pooling-free convolution embedding and attention-distribution neural network. The technical scheme of the invention is as follows:
A news classification method based on a pooling-free convolution embedding and attention-distribution neural network, comprising the following steps:
step 1: collecting a news text dataset, carrying out standardized format processing and word segmentation on the news texts, obtaining the feature vectors of the news by word embedding, randomly splitting the labelled news data according to news category, and dividing the corpus into a training set, a test set and a verification set, wherein the training set is used for training the news classification model, the verification set is used for verifying whether the model is reasonable, and the test set is used for testing the classification effect of the model;
step 2: inputting the feature vectors obtained by word embedding of the training set of the corpus in step 1 into a CNN convolutional neural network, and cancelling the pooling layer in the CNN;
step 3: inputting the feature vectors obtained after word embedding and pooling-free convolution in step 2 into an attention mechanism, and redistributing the weights of the feature vectors of the text so as to train the news classification model;
step 4: inputting the text vectors of the test set of the corpus in step 1 into the CNN, classifying the news categories according to the model trained in step 3, and calculating the accuracy of the news classification.
2. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 1, wherein in said step 1: the news dataset is collected and, for Chinese news, the format of the dataset is normalized into the form "tag + '\t' + news"; the word-segmented news text words are used as the input of the word embedding layer to obtain the feature vectors x_0, x_1, x_2, ..., x_t of a group of words. The feature vectors are a language that the computer can recognize. For the text category labels, an alphabet of size m is specified for the input language, and each character is encoded using one-of-m encoding; then, the character sequence vector is converted to a fixed length l_0: all characters exceeding the length l_0 are ignored, and sequences shorter than l_0 are padded with 0 at the end.
3. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 2, wherein said step 2 inputs the word vectors x_0, x_1, x_2, ..., x_n of the training set of the corpus in step 1 into the CNN and cancels the pooling layer of the character convolution network, specifically: the word vectors with a distributed representation are input into a one-dimensional convolution network, which comprises an input layer, a convolution layer and an output layer; the pooling layer of the convolutional neural network is cancelled in order to preserve the text features as much as possible, and the one-dimensional convolution is calculated as the convolution sum of a discrete function and a discrete kernel function:

c(y) = Σ_{x=1}^{k} τ(x) · δ(y·d − x + b)   (1)

where τ(x) is the discrete kernel function, the input discrete function is δ(x), d is the step size, b is a bias term, x represents a word vector and n represents the number of news word vectors.

4. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 3, wherein b = k − d + 1 is an offset constant; the module is parameterized by a set of kernel functions τ_ij(x), i = 1, 2, …, v, j = 1, 2, …, w; each input δ_i(x) or output c_j(y) is called a "feature", and v and w represent the numbers of input and output features; the output c_j(y) is the sum of the convolutions of δ_i(x) with τ_ij(x).
5. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 4, wherein said step 3 inputs the feature vectors obtained after word embedding and pooling-free convolution in step 2 into the attention mechanism and redistributes the weights of the feature vectors of the text so as to train the news classification model, specifically comprising:

the feature vectors obtained in step 2 are input into the attention model; each word x_0, x_1, ..., x_n is represented in vector form and input into the convolution unit to obtain the outputs h_0, h_1, …, h_n, and this output serves as the input of the attention mechanism, source = {h_0, h_1, …, h_n}, from which the final feature vector of the text is calculated. In the attention mechanism, the hidden-layer state h_t at time t is randomly initialized and updated as a parameter during training, while the source-side context vector s_t is given; the source-side context vector s_t is calculated as a weighted sum of the individual inputs, as follows:

s_t = Σ_{s=1}^{L} a_t(s) · h̄_s   (2)

where L represents the news text length, a_t(s) represents a variable-length alignment vector, and h̄_s represents the hidden-layer state of the encoder.

The context vector s_t should consider all hidden states of the encoder; in the attention-mechanism part, the hidden state h_t of the decoder at time t is compared with each source hidden state h̄_s of the encoder to generate the variable-length alignment vector a_t(s):

a_t(s) = align(h_t, h̄_s) = exp(f_a(h_t, h̄_s)) / Σ_{s'=1}^{L} exp(f_a(h_t, h̄_{s'}))   (3)

where f_a is a content-based function of the decoder hidden state h_t at time t and the source hidden states h̄_s of the encoder, and the denominator sums the content function over all source hidden states of the encoder starting from the initial position s' = 1.

f_a has 3 different formulations:

f_a(h_t, h̄_s) = h_t^T h̄_s (dot); h_t^T W_a h̄_s (general); v_a^T tanh(W_a [h_t; h̄_s]) (concat)   (4)

where W_a is the weight matrix of the attention model and v_a is a weight vector of the attention model.

At each time step, the model infers a variable-length alignment weight vector from the current target state and all source states, and then, based on a_t(s), calculates the global context vector as a weighted average over all source states.

The hidden-layer state h_t at time t and the context vector s_t are combined to generate the following attention hidden state of the decoder:

h̃_t = tanh(W_c [s_t; h_t])   (5)

where h̃_t represents the new attention hidden-state vector, W_c represents the fully connected weight matrix of the attention model, and u represents the number of attention-mechanism hidden units.
6. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 5, wherein, after introducing the attention mechanism, the final representation of the text is calculated as follows:

u_t = tanh(W_s h_t + b_s)   (6)

w_t = exp(u_t^T u_s) / Σ_t exp(u_t^T u_s)   (7)

v = Σ_t w_t h_t   (8)

In the calculation process, W_s represents the weight coefficient matrix of the attention model, h_t is the feature representation of the convolution at time t, u_t is a hidden-layer representation of the neural network, u_s is a randomly initialized context vector, also known as the semantic representation of the input, w_t is the importance weight normalized by the Softmax function, and v is the final feature vector of the text.
7. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 6, wherein said step 4 inputs the text vectors of the test set of the corpus in step 1 into the CNN, classifies the news categories according to the model trained in step 3, and calculates the accuracy of the news classification, specifically comprising:

the model uses the Leaky_ReLU activation function, which introduces a leakage value in the negative half-axis of the ReLU and is therefore called the Leaky ReLU function; unlike the ReLU, it assigns a non-zero slope to all negative values, as follows:

y_g = x_g, if x_g ≥ 0;   y_g = x_g / a_g, if x_g < 0   (9)

where a_g is fixed and g indexes the corresponding different channels a_g. Finally, multi-classification is carried out through a Softmax classifier to obtain the result:

result = softmax(v)   (10)

result is a vector whose dimension is the number of categories; the value in each dimension lies in the range [0, 1] and represents the probability that the text falls into the corresponding category. The predicted category of the input sentence is:

prediction = argmax(result)   (11)
The invention has the following advantages and beneficial effects:
The invention uses features and weights as the key factors in the classification process. The mechanism first converts the news text into word vectors using an embedding layer; these word vectors are input into a convolution operation to extract local features. The pooling layer of a conventional convolutional network is deleted to reduce information loss, because the pooling layer in effect down-samples the input, typically by keeping only the maximum output of each filter, and therefore ignores part of the news information, as described in claim 2. According to claim 4, the local feature vectors obtained after the pooling-free convolution are input into a global attention mechanism to redistribute the weights, thereby obtaining the global features of the text. Because of the risk that neurons become inactive in the negative interval, Leaky_ReLU is selected as the activation function, and finally the accuracy of news classification is calculated through Softmax. In conventional practice, because of the uniform structure of convolutional networks, the information loss caused by the internal pooling layer is often ignored when the network is optimized. Aiming at this problem, the model proposed in this patent captures the local features of the text, reduces the information loss present in the unified structure of a traditional neural network, and captures the importance of each part of the text. Therefore, handling the text classification problem by combining a pooling-free convolutional network with attention weight distribution can significantly improve the accuracy of news classification.
Drawings
FIG. 1 is a block diagram of a method for news classification based on a pooled convolutional embedded and attention distributed neural network in accordance with a preferred embodiment of the present invention;
FIG. 2 is a pooling-free convolution embedding and attention-distributing neural network model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and specifically described below with reference to the drawings in the embodiments of the present invention. The described embodiments are only a few embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
In the present invention, as shown in FIG. 1, a one-dimensional convolution operation is used first, and the pooling layer is cancelled in the convolutional network to reduce information loss, so that the semantic features and position-invariant features of the input sequence are extracted. The semantic features are then used as the input of the attention mechanism to obtain global features with reassigned weights. The global feature vector is input into the fully connected layer and classified by the activation functions Leaky_ReLU and Softmax.
Step 1: collect the news dataset and, for Chinese news, normalize the dataset format into the form "tag + '\t' + news". Randomly split the dataset into a training set, a test set and a verification set. The training set is used to train the news classification model, the verification set is used to verify whether the model is reasonable, and the test set is used to test the classification effect of the model.
The word-segmented news text words are used as the input of the word embedding layer to obtain the feature vectors x_0, x_1, x_2, ..., x_t of a group of words. The feature vectors are a language that the computer can recognize. For the text category labels, an alphabet of size m is specified for the input language, and each character is encoded using one-of-m encoding; then, the character sequence vector is converted to a fixed length l_0: all characters exceeding the length l_0 are ignored, and sequences shorter than l_0 are padded with 0 at the end.
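For illustration only (not part of the original disclosure), this preprocessing could be sketched in Python as follows; the jieba tokenizer, the function names and the fixed-length value are assumptions:

```python
# Minimal preprocessing sketch for step 1 (illustrative; names and the jieba
# tokenizer are assumptions, not taken from the patent text).
import jieba

MAX_LEN = 600          # the fixed length l_0 (value chosen for illustration)
PAD_ID = 0             # index used to pad sequences shorter than l_0

def load_dataset(path):
    """Read lines in the 'tag\tnews' format into (label, text) pairs."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tag, news = line.rstrip("\n").split("\t", 1)
            samples.append((tag, news))
    return samples

def encode(text, vocab):
    """Segment the news text into words and map them to integer ids,
    truncating or zero-padding to the fixed length l_0."""
    tokens = list(jieba.cut(text))
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens][:MAX_LEN]
    ids += [PAD_ID] * (MAX_LEN - len(ids))   # pad shorter sequences with 0
    return ids
```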
Step 2: after step 1, a temporal convolution module is added, which is a one-dimensional convolution operation. Convolutional neural network models have been widely used for image recognition and are also used for text classification. A CNN is a deep neural network that mainly consists of an input layer, hidden layers and an output layer. The input layer is responsible for receiving the input variables. The hidden layers comprise convolution layers and pooling layers for learning features of the input information. The output layer is composed of fully connected layers.
Convolution operations at different scales can extract more complex features of the text. The implementation of the CNN is represented by the following formulas:

h_i = f(σ · x_{i:i+k−1} + b)   (1)

ĥ = max(h)   (2)

h = [h_1, h_2, …, h_{n−k+1}]   (3)

where x represents an embedded word and x_{i:i+k−1} is a window of embedded words, σ denotes a filter whose function is to generate new features by the convolution operation, f is a nonlinear function, h_i represents a feature obtained by the convolution operation, ĥ is the largest feature in the set of features h obtained by the convolution operation, and b represents a bias term.

The present invention uses one-dimensional convolution and thus convolves only in the row direction. The downward arrow in the figure indicates that the convolution kernel moves from top to bottom, and the convolution step size is set to 3. h_1, h_2, h_3 represent the features obtained by extraction. After h_1, h_2, h_3 are obtained, the feature vector H is a feature representation of the entire sentence. That is, the convolution kernel k is convolved with the window vector at each location to generate a feature map H ∈ R^(l−m+1), where l is the input text length and m is the window size. Each element h_j of the feature map H is calculated as the following equation:

h_j = f(σ_j ⊙ k + b)   (4)

where ⊙ denotes the element-wise multiplication of matrix elements, b is the bias term, and f is the activation function.
When the model input is a discrete function δ(x) ∈ [1, l] and a discrete kernel function τ(x) ∈ [1, k], where δ(x), τ(x) ∈ R, and the step size is d, the convolution c(y) ∈ [1, ⌊(l − k)/d⌋ + 1] between δ(x) and τ(x) is calculated as follows:

c(y) = Σ_{x=1}^{k} τ(x) · δ(y·d − x + b)   (5)

Here x represents a word vector, n represents the number of news word vectors, and b = k − d + 1 is the offset constant. Similar to the conventional convolutional neural networks used in computer vision, the module is parameterized by a set of kernel functions τ_ij(x), which we refer to as "weights", with i = 1, 2, …, v and j = 1, 2, …, w. Each input δ_i(x) or output c_j(y) is referred to as a "feature", and v and w represent the numbers of input and output features. The output c_j(y) is the sum of the convolutions of δ_i(x) with τ_ij(x).
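As an illustrative aid (not part of the original disclosure), Eq. (5) can be implemented directly; the following NumPy sketch keeps the 1-based indices of the formula and is only a reference implementation:

```python
import numpy as np

def discrete_conv(delta, tau, d):
    """Direct implementation of Eq. (5): c(y) = sum_x tau(x) * delta(y*d - x + b),
    with offset b = k - d + 1, input delta of length l and kernel tau of length k."""
    l, k = len(delta), len(tau)
    b = k - d + 1                                  # offset constant
    n_out = (l - k) // d + 1                       # number of valid output positions
    c = np.zeros(n_out)
    for y in range(1, n_out + 1):                  # 1-based indices as in the formula
        c[y - 1] = sum(tau[x - 1] * delta[y * d - x + b - 1] for x in range(1, k + 1))
    return c
```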
The pooling-free convolution cancels the max-pooling layer in the CNN, because the pooling operation may lose some semantic information. The resulting continuous higher-order feature representation is then fed into the attention mechanism.
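An embedding-plus-convolution block with the pooling layer omitted could be sketched in PyTorch as follows; this is an illustrative sketch, and the layer sizes, kernel width and padding are assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn

class PoolingFreeConv(nn.Module):
    """Embedding followed by a 1-D convolution; the usual max-pooling layer is
    deliberately omitted so that the per-position features are kept intact."""
    def __init__(self, vocab_size, embed_dim=128, num_filters=256, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=kernel_size // 2)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        h = torch.relu(self.conv(x))         # (batch, num_filters, seq_len) - no pooling
        return h.transpose(1, 2)             # (batch, seq_len, num_filters)
```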
Step 3: the feature vectors obtained in step 2 are input into the attention model. Each word x_0, x_1, ..., x_n is represented in vector form and input into the convolution unit to obtain the outputs h_0, h_1, …, h_n, and this output serves as the input of the attention mechanism, source = {h_0, h_1, …, h_n}, from which the final feature vector of the text is calculated. In the attention mechanism, the hidden-layer state h_t at time t is randomly initialized and updated as a parameter during training, while the source-side context vector s_t is given; the source-side context vector s_t is calculated as a weighted sum of the individual inputs, as follows:

s_t = Σ_{s=1}^{L} a_t(s) · h̄_s   (6)

where L represents the news text length, a_t(s) represents a variable-length alignment vector, and h̄_s represents the hidden-layer state of the encoder.

The context vector s_t should consider all hidden states of the encoder; in the attention-mechanism part, the hidden state h_t of the decoder at time t is compared with each source hidden state h̄_s of the encoder to generate the variable-length alignment vector a_t(s):

a_t(s) = align(h_t, h̄_s) = exp(f_a(h_t, h̄_s)) / Σ_{s'=1}^{L} exp(f_a(h_t, h̄_{s'}))   (7)

where f_a is a content-based function of the decoder hidden state h_t at time t and the source hidden states h̄_s of the encoder, and the denominator sums the content function over all source hidden states of the encoder starting from the initial position s' = 1.

f_a has 3 different formulations:

f_a(h_t, h̄_s) = h_t^T h̄_s (dot); h_t^T W_a h̄_s (general); v_a^T tanh(W_a [h_t; h̄_s]) (concat)   (8)

where W_a is the weight matrix of the attention model and v_a is a weight vector of the attention model.

At each time step, the model infers a variable-length alignment weight vector from the current target state and all source states, and then, based on a_t(s), calculates the global context vector as a weighted average over all source states.

The hidden-layer state h_t at time t and the context vector s_t are combined to generate the following attention hidden state of the decoder:

h̃_t = tanh(W_c [s_t; h_t])   (9)

where h̃_t represents the new attention hidden-state vector, W_c represents the fully connected weight matrix of the attention model, and u represents the number of attention-mechanism hidden units.
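The global attention described above can be sketched as follows; this sketch assumes the 'general' score form h_t^T W_a h̄_s and standard tensor shapes, and is an illustration rather than the exact parameterization of the invention:

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Global attention: score each source state against the target state,
    normalize with softmax to get a_t(s), and return the context vector s_t."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)   # weight matrix W_a

    def forward(self, h_t, h_src):
        # h_t: (batch, hidden_dim)        target/decoder state
        # h_src: (batch, L, hidden_dim)   source/encoder states
        scores = torch.bmm(self.W_a(h_src), h_t.unsqueeze(2)).squeeze(2)  # (batch, L)
        a_t = torch.softmax(scores, dim=1)                                # alignment vector a_t(s)
        s_t = torch.bmm(a_t.unsqueeze(1), h_src).squeeze(1)               # context vector s_t
        return s_t, a_t
```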
After introducing the attention mechanism, the final representation of the text is calculated as follows:

u_t = tanh(W_s h_t + b_s)   (10)

w_t = exp(u_t^T u_s) / Σ_t exp(u_t^T u_s)   (11)

v = Σ_t w_t h_t   (12)

In the calculation process, W_s represents the weight coefficient matrix of the attention model, h_t is the feature representation of the convolution at time t, u_t is a hidden-layer representation of the neural network, and u_s is a randomly initialized context vector, which may also be referred to as the semantic representation of the input. w_t is the importance weight normalized by the Softmax function, and v is the final feature vector of the text.
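Equations (10)–(12) amount to a learned word-importance weighting over the convolution features; a minimal sketch, with dimension names assumed, is:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Implements u_t = tanh(W_s h_t + b_s), w_t = softmax(u_t . u_s), v = sum_t w_t h_t."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_s = nn.Linear(hidden_dim, hidden_dim)         # W_s and b_s
        self.u_s = nn.Parameter(torch.randn(hidden_dim))     # randomly initialized context vector u_s

    def forward(self, h):                        # h: (batch, T, hidden_dim)
        u = torch.tanh(self.W_s(h))              # (batch, T, hidden_dim)
        w = torch.softmax(u @ self.u_s, dim=1)   # importance weights w_t, (batch, T)
        v = (w.unsqueeze(2) * h).sum(dim=1)      # final text feature vector v, (batch, hidden_dim)
        return v, w
```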
Step 4: after step 3, the model uses the Leaky_ReLU activation function. The rectified linear unit (ReLU) is the most commonly used activation function in neural networks and can be computed efficiently. When the input is positive, the derivative is non-zero, which allows gradient-based learning. However, when the input value of the ReLU is negative, the output is 0 and the first derivative is also 0. This prevents the neuron from updating its parameters, so the neuron stops learning; this phenomenon is known as "dead neurons".
The ReLU has produced many variants. In the present invention, to overcome the drawbacks of the ReLU, a leakage value is introduced in the negative half-axis of the ReLU; the resulting function is therefore called the Leaky ReLU. Unlike the ReLU, the Leaky ReLU assigns a non-zero slope to all negative values, as follows:
y_g = x_g, if x_g ≥ 0;   y_g = x_g / a_g, if x_g < 0   (13)

where a_g is fixed and g indexes the different channels, each with its corresponding fixed a_g. The Leaky_ReLU function is a variant of the classical, widely used ReLU activation function. Since its derivative is always non-zero, the number of silent neurons is reduced, ensuring that gradient-based learning continues after entering the negative interval.
Finally, multi-classification is performed by the Softmax classifier to obtain the result:
result=softmax(v) (14)
result is a vector whose dimension is the number of categories. The value in each dimension lies in the range [0, 1] and represents the probability that the text falls into the corresponding category. The predicted category of the input sentence is:
prediction=argmax(result) (15)
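The classification head described in step 4 (fully connected layer, Leaky_ReLU activation, Softmax and argmax) might be sketched as follows; the layer size and the negative-slope value are illustrative assumptions:

```python
import torch
import torch.nn as nn

class NewsClassifierHead(nn.Module):
    """Fully connected layer with Leaky_ReLU, then Softmax over the category scores."""
    def __init__(self, hidden_dim, num_classes, negative_slope=0.01):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)
        self.act = nn.LeakyReLU(negative_slope)   # non-zero slope on the negative half-axis

    def forward(self, v):                         # v: (batch, hidden_dim) text feature vector
        result = torch.softmax(self.act(self.fc(v)), dim=1)   # per-class probabilities in [0, 1]
        prediction = result.argmax(dim=1)                      # predicted category index
        return result, prediction
```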
the system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above examples should be understood as illustrative only and not limiting the scope of the invention. Various changes and modifications to the present invention may be made by one skilled in the art after reading the teachings herein, and such equivalent changes and modifications are intended to fall within the scope of the invention as defined in the appended claims.

Claims (4)

1. A news classification method based on a pooling-free convolution embedding and attention-distribution neural network, characterized by comprising the following steps:
step 1: collecting a news text dataset, carrying out standardized format processing and word segmentation on the news texts, obtaining the feature vectors of the news by word embedding, randomly splitting the labelled news data according to news category, and dividing the corpus into a training set, a test set and a verification set, wherein the training set is used for training the news classification model, the verification set is used for verifying whether the model is reasonable, and the test set is used for testing the classification effect of the model;
step 2: inputting the feature vectors obtained by word embedding of the training set of the corpus in step 1 into a CNN convolutional neural network, and cancelling the pooling layer in the CNN;
step 3: inputting the feature vectors obtained after word embedding and pooling-free convolution in step 2 into an attention mechanism, and redistributing the weights of the feature vectors of the text so as to train the news classification model;
step 4: inputting the text vectors of the test set of the corpus in step 1 into the CNN, classifying the news categories according to the model trained in step 3, and calculating the accuracy of the news classification;
said step 2 inputs the word vectors x_0, x_1, x_2, ..., x_n of the training set of the corpus in step 1 into the CNN and cancels the pooling layer of the character convolution network, specifically: the word vectors with a distributed representation are input into a one-dimensional convolution network, which comprises an input layer, a convolution layer and an output layer; the pooling layer of the convolutional neural network is cancelled in order to preserve the text features as much as possible, and the one-dimensional convolution is calculated as the convolution sum of a discrete function and a discrete kernel function:

c(y) = Σ_{x=1}^{k} τ(x) · δ(y·d − x + b)   (1)

where τ(x) is the discrete kernel function, the input discrete function is δ(x), d is the step size, b is a bias term, x represents a word vector and n represents the number of news word vectors;

said b = k − d + 1 is an offset constant; the module is parameterized by a set of kernel functions τ_ij(x), i = 1, 2, …, v, j = 1, 2, …, w; each input δ_i(x) or output c_j(y) is called a "feature", v and w represent the numbers of input and output features, and the output c_j(y) is the sum of the convolutions of δ_i(x) with τ_ij(x);
said step 4 inputs the text vectors of the test set of the corpus in step 1 into the CNN, classifies the news categories according to the model trained in step 3, and calculates the accuracy of the news classification, specifically comprising:

the model uses the Leaky_ReLU activation function, which introduces a leakage value in the negative half-axis of the ReLU and is therefore called the Leaky_ReLU function; unlike the ReLU, it assigns a non-zero slope to all negative values, as follows:

y_g = x_g, if x_g ≥ 0;   y_g = x_g / a_g, if x_g < 0   (9)

where a_g is fixed and g indexes the corresponding different channels; finally, multi-classification is carried out through a Softmax classifier to obtain the result:

result = softmax(v)   (10)

result is a vector whose dimension is the number of categories; the value in each dimension lies in the range [0, 1] and represents the probability that the text falls into the corresponding category; the predicted category of the input sentence is:

prediction = argmax(result)   (11).
2. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 1, wherein in said step 1: the news dataset is collected and, for Chinese news, the format of the dataset is normalized into the form "tag + '\t' + news"; the word-segmented news text words are used as the input of the word embedding layer to obtain the feature vectors x_0, x_1, x_2, ..., x_t of a group of words; the feature vectors are a language that the computer can recognize; for the text category labels, the alphabet of the input language is specified, and each character is encoded using one-of-1024 encoding; then, the character sequence vector is converted to a fixed length l_0: all characters exceeding the length l_0 are ignored, and sequences shorter than l_0 are padded with 0 at the end.
3. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 1, wherein said step 3 inputs the feature vectors obtained after word embedding and pooling-free convolution in step 2 into the attention mechanism and redistributes the weights of the feature vectors of the text so as to train the news classification model, specifically comprising:

the feature vectors obtained in step 2 are input into the attention model; each feature vector x_0, x_1, ..., x_n is represented in vector form and input into the convolution unit to obtain the outputs h_0, h_1, …, h_n, and this output serves as the input of the attention mechanism, source = {h_0, h_1, …, h_n}, from which the final feature vector of the text is calculated; in the attention mechanism, the hidden-layer state h_t at time t is randomly initialized and updated as a parameter during training, while the source-side context vector s_t is given; the source-side context vector s_t is calculated as a weighted sum of the individual inputs, as follows:

s_t = Σ_{s=1}^{L} a_t(s) · h̄_s   (2)

where L represents the news text length, a_t(s) represents a variable-length alignment vector, and h̄_s represents the hidden-layer state of the encoder;

the context vector s_t should consider all hidden states of the encoder; in the attention-mechanism part, the hidden state h_t of the decoder at time t is compared with each source hidden state h̄_s of the encoder to generate the variable-length alignment vector a_t(s):

a_t(s) = align(h_t, h̄_s) = exp(f_a(h_t, h̄_s)) / Σ_{s'=1}^{L} exp(f_a(h_t, h̄_{s'}))   (3)

where f_a is a content-based function of the decoder hidden state h_t at time t and the source hidden states h̄_s of the encoder, and the denominator sums the content function over all source hidden states of the encoder starting from the initial position s' = 1;

f_a has 3 different formulations:

f_a(h_t, h̄_s) = h_t^T h̄_s (dot); h_t^T W_a h̄_s (general); v_a^T tanh(W_a [h_t; h̄_s]) (concat)   (4)

where W_a is the weight matrix of the attention model and v_a is a weight vector of the attention model;

at each time step, the model infers a variable-length alignment weight vector from the current target state and all source states, and then, based on a_t(s), calculates the global context vector as a weighted average over all source states;

the hidden-layer state h_t at time t and the context vector s_t are combined to generate the following attention hidden state of the decoder:

h̃_t = tanh(W_c [s_t; h_t])   (5)

where h̃_t represents the new attention hidden-state vector, W_c represents the fully connected weight matrix of the attention model, and u represents the number of attention-mechanism hidden units.
4. The news classification method based on a pooling-free convolution embedding and attention-distribution neural network according to claim 3, wherein, after introducing the attention mechanism, the final representation of the text is calculated as follows:

u_t = tanh(W_s h_t + b_s)   (6)

w_t = exp(u_t^T u_s) / Σ_t exp(u_t^T u_s)   (7)

v = Σ_t w_t h_t   (8)

In the calculation process, W_s represents the weight coefficient matrix of the attention model, h_t is the feature representation of the convolution at time t, u_t is a hidden-layer representation of the neural network, u_s is a randomly initialized context vector, also known as the semantic representation of the input, w_t is the importance weight normalized by the Softmax function, and v is the final feature vector of the text.
CN202011443363.4A 2020-12-11 2020-12-11 News classification method based on pooling convolution embedding and attention distribution neural network Active CN112527959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011443363.4A CN112527959B (en) 2020-12-11 2020-12-11 News classification method based on pooling convolution embedding and attention distribution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011443363.4A CN112527959B (en) 2020-12-11 2020-12-11 News classification method based on pooling convolution embedding and attention distribution neural network

Publications (2)

Publication Number Publication Date
CN112527959A CN112527959A (en) 2021-03-19
CN112527959B true CN112527959B (en) 2023-05-30

Family

ID=75000138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011443363.4A Active CN112527959B (en) 2020-12-11 2020-12-11 News classification method based on pooling convolution embedding and attention distribution neural network

Country Status (1)

Country Link
CN (1) CN112527959B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177110B (en) * 2021-05-28 2022-09-16 中国人民解放军国防科技大学 False news detection method and device, computer equipment and storage medium
CN114334159B (en) * 2022-03-16 2022-06-17 四川大学华西医院 Postoperative risk prediction natural language data enhancement model and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110772268A (en) * 2019-11-01 2020-02-11 哈尔滨理工大学 Multimode electroencephalogram signal and 1DCNN migration driving fatigue state identification method
CN111292305A (en) * 2020-01-22 2020-06-16 重庆大学 Improved YOLO-V3 metal processing surface defect detection method
CN111783688A (en) * 2020-07-02 2020-10-16 吉林大学 Remote sensing image scene classification method based on convolutional neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180329884A1 (en) * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN109492108B (en) * 2018-11-22 2020-12-15 上海唯识律简信息科技有限公司 Deep learning-based multi-level fusion document classification method and system
CN109597891B (en) * 2018-11-26 2023-04-07 重庆邮电大学 Text emotion analysis method based on bidirectional long-and-short-term memory neural network
US11494615B2 (en) * 2019-03-28 2022-11-08 Baidu Usa Llc Systems and methods for deep skip-gram network based text classification
CN110188194B (en) * 2019-04-26 2020-12-01 哈尔滨工业大学(深圳) False news detection method and system based on multitask learning model
CN110609897B (en) * 2019-08-12 2023-08-04 北京化工大学 Multi-category Chinese text classification method integrating global and local features

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110772268A (en) * 2019-11-01 2020-02-11 哈尔滨理工大学 Multimode electroencephalogram signal and 1DCNN migration driving fatigue state identification method
CN111292305A (en) * 2020-01-22 2020-06-16 重庆大学 Improved YOLO-V3 metal processing surface defect detection method
CN111783688A (en) * 2020-07-02 2020-10-16 吉林大学 Remote sensing image scene classification method based on convolutional neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Jiawen Yang et al.; IEEE Access; 2018; 65130-65138 *
Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network; Marios Anthimopoulos et al.; IEEE Transactions on Medical Imaging (No. 5); 1207-1216 *
Medical Text Information Extraction Based on Deep Learning; Tu Wenbo; China Master's Theses Full-text Database, Medicine & Health Sciences (No. 1); E054-85 *
Improved Attention Acoustic Model Using Minimal Gated Unit Structure; Long Xingyan et al.; Journal of Signal Processing (No. 06); 739-748 *

Also Published As

Publication number Publication date
CN112527959A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN107526785B (en) Text classification method and device
CN112329680B (en) Semi-supervised remote sensing image target detection and segmentation method based on class activation graph
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
WO2023134084A1 (en) Multi-label identification method and apparatus, electronic device, and storage medium
US11900250B2 (en) Deep learning model for learning program embeddings
CN112417153B (en) Text classification method, apparatus, terminal device and readable storage medium
CN112527959B (en) News classification method based on pooling convolution embedding and attention distribution neural network
CN111475622A (en) Text classification method, device, terminal and storage medium
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113837370A (en) Method and apparatus for training a model based on contrast learning
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN113553510B (en) Text information recommendation method and device and readable medium
CN114139676A (en) Training method of domain adaptive neural network
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
CN115456043A (en) Classification model processing method, intent recognition method, device and computer equipment
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN109597982B (en) Abstract text recognition method and device
CN116824583A (en) Weak supervision video scene graph generation method and device and electronic equipment
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN112989052B (en) Chinese news long text classification method based on combination-convolution neural network
CN115186670A (en) Method and system for identifying domain named entities based on active learning
CN115700555A (en) Model training method, prediction method, device and electronic equipment
CN113627192A (en) Relation extraction method and device based on two-layer convolutional neural network
Deebadi Understanding Impact of Twitter Feed on Bitcoin Price and Trading Patterns
CN112183103A (en) Convolutional neural network entity relationship extraction method fusing different pre-training word vectors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant