CN114297379A - Text binary classification method based on Transformer - Google Patents

Text binary classification method based on Transformer

Info

Publication number
CN114297379A
Authority
CN
China
Prior art keywords
text
module
features
transformer
sequence
Prior art date
Legal status
Pending
Application number
CN202111539076.8A
Other languages
Chinese (zh)
Inventor
张磊
康辉
江珊
杨经纬
李鑫
李春
高宁
Current Assignee
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Digital Intelligence Technology Co Ltd filed Critical China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202111539076.8A priority Critical patent/CN114297379A/en
Publication of CN114297379A publication Critical patent/CN114297379A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Transformer-based text binary classification method, and relates to the technical field of text classification. After the lengths of the text sequences are unified, the sequences are input into a neural network model in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence, and the model classifies the text sequence features. By effectively extracting the internal patterns of irregular text sequence data, the method achieves efficient classification and improves classification stability.

Description

Text binary classification method based on Transformer
Technical Field
The invention relates to the technical field of text classification, and in particular to a Transformer-based text binary classification method.
Background
Traditional neural network models mainly learn mappings from data points (vectors) to data points (vectors), whereas recurrent neural networks learn mappings from sequences of data points to sequences of data points; models in common use in recent years include the gated recurrent unit, the long short-term memory network and the Transformer. Among them, the Transformer has performed well in machine translation, natural language processing and related fields and has become a research hotspot.
Real-world data often consists of very long sequences; some are regular and some are completely irregular, yet patterns can still be found within them. Classifying such data is an important direction of deep learning research.
Among current deep learning methods for sequence data, the CNN + LSTM combination is the preferred scheme and performs very well when classifying regular sequences: compared with the instability and over-training problems caused by an RNN, each convolution kernel of the CNN can be customised, and a customised CNN processes one-dimensional data stably. However, for the problem of classifying irregular text sequences of different lengths, training a CNN + LSTM model on such data tends to fall into a local optimum after roughly 200-300 training epochs, causing performance to drop sharply; although training may later return to normal, the model then produces an extremely large number of classes, in the extreme case treating every irregular sequence as its own class, and the classification effect is very poor.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a Transformer-based text binary classification method which effectively extracts the internal patterns of irregular text sequence data, thereby achieving efficient classification and improving classification stability.
In order to achieve this technical purpose, the invention adopts the following technical scheme. The Transformer-based text binary classification method specifically comprises the following steps:
(1) preprocessing the text sequence data and unifying the lengths of all text sequences;
(2) constructing a neural network model in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence, wherein the CNN module sharpens the features to be extracted, the Transformer module extracts features, the LSTM module further refines the features, and the fully connected layer classifies the features;
(3) inputting the preprocessed text sequences into the neural network model constructed in step (2) and outputting the text classification result.
Further, the preprocessing of the text sequence data in step (1) is specifically as follows: taking the text sequence with the longest length in the text sequence data as the standard, the character '#' is appended to the end of every other text sequence until all text sequences have the same length.
Further, the step (3) comprises the following sub-steps:
(3.1) sequentially inputting the preprocessed text sequences into the CNN module to extract coarse modular features, with the number of convolutions set so that each text sequence is convolved into a sequence segment of length 50-100;
(3.2) performing sliding-window processing on the extracted coarse modular features through the self-attention mechanism in the Transformer module to divide them into subsequences, inputting the subsequences into the recurrent neural network in the Transformer module to extract features of higher similarity, and encoding and decoding the extracted features through the self-attention mechanism to obtain the subsequence with the highest commonality as the modular feature;
(3.3) inputting the modular features into the LSTM module to further refine them, and outputting the feature classification result through the fully connected layer.
Further, the CNN module in step (3.1) uses a convolution kernel of 1 × 3.
Further, in the sliding window processing in step (3.2), the width of the sliding window is set to be 30% of the length of the sequence segment.
Further, the extraction of the features of higher similarity in step (3.2) is specifically as follows: the recurrent neural network in the Transformer module performs a convolution operation on each subsequence, where the length of the convolution kernel equals the width of the sliding window, so that each subsequence is reduced to a length of 1 after the convolution; the results are then ordered according to the sliding-window order to obtain the features of higher similarity.
Compared with the prior art, the invention has the following beneficial effects: the Transformer-based text binary classification method captures the internal patterns of irregular text sequence data through the self-attention mechanism of the Transformer model, extracts features of higher similarity through the embedded recurrent neural network, and refines the features through the LSTM module, thereby effectively extracting the common internal patterns and distributions shared by different irregular text sequences, using changes in these internal patterns as the basis for classification, and effectively solving the problem of classifying irregular text sequence data.
Drawings
FIG. 1 is a flow chart of the Transformer-based text binary classification method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below with reference to the drawings and the embodiments. It should be emphasised that the described embodiments are only some, rather than all, of the possible embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The invention provides a text binary classification method based on a Transformer, which specifically comprises the following steps:
(1) The text sequence data is preprocessed: taking the text sequence with the longest length in the data as the standard, the character '#' is appended to the end of every other text sequence until all text sequences have the same length, as sketched below.
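A minimal sketch of this padding step, assuming the sequences arrive as Python strings; the '#' padding symbol and the longest-sequence rule follow the description above, while the function name pad_sequences is illustrative only.

# Minimal padding sketch: append '#' to every sequence until all sequences
# share the length of the longest one.
def pad_sequences(sequences, pad_symbol="#"):
    max_len = max(len(s) for s in sequences)               # longest sequence sets the standard
    return [s + pad_symbol * (max_len - len(s)) for s in sequences]

# Example with three unordered letter sequences of different lengths.
print(pad_sequences(["ABQZ", "KLMNOPQ", "XY"]))
# -> ['ABQZ###', 'KLMNOPQ', 'XY#####']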
(2) A neural network model is constructed in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence. The CNN module sharpens the features to be extracted so that they become more prominent; the Transformer module extracts features, the LSTM module further refines them, and the fully connected layer classifies them.
(3) The preprocessed text sequences are input into the neural network model constructed in step (2), and the text classification result is output. This specifically comprises the following sub-steps:
(3.1) The preprocessed text sequences are sequentially input into the CNN module to extract coarse modular features. The CNN module convolves each text sequence according to a predetermined feature pattern using 1 × 3 convolution kernels, and the number of convolutions is set so that each text sequence is convolved into a sequence segment of length 50-100. If too few convolutions are used, too many features are retained and the training effect of the Transformer module degrades; if too many are used, the text sequence data is severely lost. A sketch of this module follows.
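A sketch of the CNN module under stated assumptions: the description does not say how the sequence is shortened to 50-100 elements, so strided 1 × 3 convolutions (PyTorch, stride 3) are assumed here purely for illustration, and the channel count and number of layers are likewise illustrative.

import torch
import torch.nn as nn

# Sketch of the CNN module: stacked 1x3 convolutions over a digitised, padded
# text sequence. The stride of 3 is an assumption used only so that a length-5700
# input shrinks to roughly 70 elements, in line with the embodiment below.
class CoarseFeatureCNN(nn.Module):
    def __init__(self, channels=1, stride=3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
        )

    def forward(self, x):              # x: (batch, 1, padded_length)
        return self.convs(x)           # shorter "coarse modular feature" segment

x = torch.randn(8, 1, 5700)            # a batch of padded, digitised sequences
print(CoarseFeatureCNN()(x).shape)     # torch.Size([8, 1, 71])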
(3.2) The extracted coarse modular features are processed with a sliding window through the self-attention mechanism in the Transformer module, with the window width set to 30% of the length of the sequence segment, dividing the segment into subsequences. The subsequences are input into the recurrent neural network embedded in the Transformer module to extract features of higher similarity, and these features are then encoded and decoded by the self-attention mechanism to obtain the subsequence with the highest commonality as the modular feature. In this way the method searches for patterns in the text sequence and converts its features into a more distinct form through the self-attention mechanism in the Transformer module. For handling the commonalities of text sequences, traditional CNN and LSTM models struggle to reason in a human-like way, whereas the self-attention mechanism in the Transformer module is closer to how the brain processes text and is well suited to natural language processing. Meanwhile, because non-image data is small in volume and dimensionality, the recurrent neural network embedded in the Transformer module is designed as a two-layer recurrent neural network, which reduces expansion at the parallel level and avoids a large amount of redundancy.
The extraction of the features of higher similarity in the invention is specifically as follows: the recurrent neural network in the Transformer module performs a convolution operation on each subsequence, where the length of the convolution kernel equals the width of the sliding window, so that each subsequence is reduced to a length of 1 after the convolution; the results are then ordered according to the sliding-window order to obtain the features of higher similarity. A sketch of this window-and-convolve step follows.
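A numeric sketch of this step, with two stated assumptions: the sliding window moves with stride 1 (the stride is not given), and the learned convolution performed by the embedded recurrent network is replaced by a fixed averaging kernel of the same length purely for illustration; the function name window_similarity_features is hypothetical.

import numpy as np

# Sliding-window split followed by a convolution whose kernel is as long as the
# window, so each window collapses to a single value kept in window order.
def window_similarity_features(segment, window_ratio=0.3, kernel=None):
    w = max(1, int(len(segment) * window_ratio))          # window width = 30% of segment length
    windows = [segment[i:i + w] for i in range(len(segment) - w + 1)]
    if kernel is None:
        kernel = np.ones(w) / w                           # stand-in for the learned kernel
    return np.array([float(np.dot(win, kernel)) for win in windows])

segment = np.random.rand(70)                              # coarse feature segment from the CNN
print(window_similarity_features(segment).shape)          # (50,) for a width-21 window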
(3.3) The modular features are input into the LSTM module to further refine them so that the classification accuracy is optimal, and the feature classification result is output through the fully connected layer. A sketch of the assembled pipeline follows.
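An end-to-end sketch of the described pipeline under stated assumptions: the standard PyTorch TransformerEncoder is used as a stand-in for the patent's modified Transformer module (which embeds a recurrent network), and the layer sizes, stride and model dimensions are illustrative rather than taken from the patent.

import torch
import torch.nn as nn

# CNN -> Transformer -> LSTM -> fully connected layer, as described above.
class TransformerTextClassifier(nn.Module):
    def __init__(self, d_model=64, num_classes=2):
        super().__init__()
        self.cnn = nn.Conv1d(1, d_model, kernel_size=3, stride=81, padding=1)   # coarse features
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)   # feature extraction
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)                 # feature refinement
        self.fc = nn.Linear(d_model, num_classes)                               # feature classification

    def forward(self, x):                    # x: (batch, 1, padded_length)
        h = self.cnn(x).transpose(1, 2)      # (batch, segment_length, d_model)
        h = self.transformer(h)
        h, _ = self.lstm(h)
        return self.fc(h[:, -1])             # logits for the two classes

logits = TransformerTextClassifier()(torch.randn(4, 1, 5700))
print(logits.shape)                          # torch.Size([4, 2])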
The Transformer-based text binary classification method thus effectively extracts the internal patterns of irregular text sequence data, thereby achieving efficient classification and improving classification stability.
Examples
In this embodiment, unordered text sequences are formed from the 26 capital letters, with the letters arranged in different ways, and the task is to distinguish these sequences. First, the 26 letters are digitised and '#' is replaced with 0, converting the text sequence data into numeric data for the subsequent convolutional neural network computation; meanwhile, the sequences are classified by manual labelling. FIG. 1 is a flow chart of the Transformer-based text binary classification method, which specifically comprises the following steps:
(1) The maximum sequence length k in the unordered text sequences is known to be 5700. Every other unordered text sequence is padded with 0 after its last element until all sequences have length k, completing the preprocessing of the text sequences.
(2) A neural network model is constructed, comprising a CNN module, a Transformer module, an LSTM module and a fully connected layer connected in sequence. The CNN module uses 1 × 3 convolution kernels and preliminarily extracts data features, sharpening the features to be extracted so that they become more prominent; the Transformer module extracts features, the LSTM module further refines them, and the fully connected layer classifies them.
(3) The preprocessed text sequences are input into the neural network model constructed in step (2), and the text classification result is output. This specifically comprises the following sub-steps:
(3.1) The CNN module in this embodiment uses four convolutional layers: a CONV1D_1 layer, a CONV1D_2 layer, a CONV1D_3 layer and a CONV1D_4 layer, whose convolution kernels are [1,0,1], [1,1,0], [0,1,0] and [0,0,1] respectively. The previously preprocessed text sequences are input into the CNN module and convolved through these kernels into sequence segments of length 70, in which the modular features are present but not yet distinct. A sketch of these layers is given below.
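A sketch of these four layers with the quoted kernels loaded as fixed weights; whether the kernels are fixed or merely initial values is not stated, and the stride is assumed only so that the output segment length lands near 70.

import torch
import torch.nn as nn

# Four 1x3 convolution layers with the kernels quoted in the embodiment.
kernels = {
    "conv1d_1": [1.0, 0.0, 1.0],
    "conv1d_2": [1.0, 1.0, 0.0],
    "conv1d_3": [0.0, 1.0, 0.0],
    "conv1d_4": [0.0, 0.0, 1.0],
}

layers = []
for name, weights in kernels.items():
    conv = nn.Conv1d(1, 1, kernel_size=3, stride=3, padding=1, bias=False)
    with torch.no_grad():
        conv.weight.copy_(torch.tensor(weights).view(1, 1, 3))   # load the fixed kernel
    layers.append(conv)

cnn = nn.Sequential(*layers)
segment = cnn(torch.randn(1, 1, 5700))     # a preprocessed length-5700 sequence
print(segment.shape)                       # about (1, 1, 71) with the assumed stride of 3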
(3.2) The sequence segments are sent to the Transformer module for feature extraction. The length-70 segments are processed with a sliding window through the self-attention mechanism in the Transformer module, with the window length set to 21, and the resulting 50 subsequences are sequentially input into the embedded recurrent neural network. Operating on the 50 subsequences with a 1 × 20 convolution kernel yields 50 numbers, which are combined in order into the features of higher similarity; after the encoding and decoding operations of the self-attention mechanism, the subsequence with the highest commonality is obtained as the modular feature. A numeric check of this window arithmetic follows.
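A quick, self-contained numeric check of this embodiment's window arithmetic; as in the general description, a kernel as long as the window is assumed so that each window collapses to a single value, and the averaging kernel is only a stand-in for the learned one.

import numpy as np

# A length-70 segment with a width-21 window yields 70 - 21 + 1 = 50 subsequences;
# a kernel as long as the window then collapses each one to a single value.
segment = np.random.rand(70)
w = 21
windows = [segment[i:i + w] for i in range(len(segment) - w + 1)]
kernel = np.ones(w) / w                       # stand-in for the learned kernel
values = [float(np.dot(win, kernel)) for win in windows]
print(len(windows), len(values))              # 50 50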
(3.3) The modular features are input into the LSTM module to further refine them, and the feature classification result is output through the fully connected layer. The Transformer-based text binary classification method thus classifies unordered text sequences by their internal patterns. This is very similar to human language: similar spoken-language patterns are automatically grouped into the same class, and different languages are separated according to their different internal patterns, achieving pattern-level classification.
Traditional deep learning methods include CNN, RNN and LSTM. A multi-layer CNN reaches an accuracy of only 78% on the sequence classification task in this embodiment, and its accuracy drops after about 500 training epochs as the number of epochs increases; an RNN runs into this problem even sooner. Combining the CNN with an LSTM improves performance, with the model reaching 81% accuracy. The Transformer-based text binary classification method adds a Transformer module between the CNN and the LSTM; by focusing attention on the features through the self-attention mechanism, the stability of the model is greatly improved and the classification accuracy stabilises above 90%.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-described embodiment; all technical solutions falling under the concept of the present invention belong to its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principles of the invention also fall within the protection scope of the invention.

Claims (6)

1. A Transformer-based text binary classification method, characterised by comprising the following steps:
(1) preprocessing the text sequence data and unifying the lengths of all text sequences;
(2) constructing a neural network model in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence, wherein the CNN module sharpens the features to be extracted, the Transformer module extracts features, the LSTM module further refines the features, and the fully connected layer classifies the features;
(3) inputting the preprocessed text sequences into the neural network model constructed in step (2) and outputting the text classification result.
2. The Transformer-based text binary classification method according to claim 1, wherein the preprocessing of the text sequence data in step (1) is specifically as follows: taking the text sequence with the longest length in the text sequence data as the standard, the character '#' is appended to the end of every other text sequence until all text sequences have the same length.
3. The Transformer-based text binary classification method according to claim 1, wherein the step (3) comprises the following sub-steps:
(3.1) sequentially inputting the preprocessed text sequences into the CNN module to extract coarse modular features, with the number of convolutions set so that each text sequence is convolved into a sequence segment of length 50-100;
(3.2) performing sliding-window processing on the extracted coarse modular features through the self-attention mechanism in the Transformer module to divide them into subsequences, inputting the subsequences into the recurrent neural network in the Transformer module to extract features of higher similarity, and encoding and decoding the extracted features through the self-attention mechanism to obtain the subsequence with the highest commonality as the modular feature;
(3.3) inputting the modular features into the LSTM module to further refine them, and outputting the feature classification result through the fully connected layer.
4. The Transformer-based text binary classification method according to claim 3, wherein the CNN module in step (3.1) uses a 1 × 3 convolution kernel.
5. The Transformer-based text binary classification method according to claim 3, wherein, in the sliding-window processing in step (3.2), the width of the sliding window is set to 30% of the length of the sequence segment.
6. The Transformer-based text binary classification method according to claim 3, wherein the extraction of the features of higher similarity in step (3.2) is specifically as follows: the recurrent neural network in the Transformer module performs a convolution operation on each subsequence, where the length of the convolution kernel equals the width of the sliding window, so that each subsequence is reduced to a length of 1 after the convolution; the results are then ordered according to the sliding-window order to obtain the features of higher similarity.
CN202111539076.8A 2021-12-16 2021-12-16 Text binary classification method based on Transformer Pending CN114297379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111539076.8A CN114297379A (en) 2021-12-16 2021-12-16 Text binary classification method based on Transformer


Publications (1)

Publication Number Publication Date
CN114297379A (en) 2022-04-08

Family

ID=80966929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111539076.8A Pending CN114297379A (en) 2021-12-16 2021-12-16 Text binary classification method based on Transformer

Country Status (1)

Country Link
CN (1) CN114297379A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858933A (en) * 2020-07-10 2020-10-30 暨南大学 Character-based hierarchical text emotion analysis method and system
CN111858932A (en) * 2020-07-10 2020-10-30 暨南大学 Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN112802568A (en) * 2021-02-03 2021-05-14 紫东信息科技(苏州)有限公司 Multi-label stomach disease classification method and device based on medical history text
CN113177633A (en) * 2021-04-20 2021-07-27 浙江大学 Deep decoupling time sequence prediction method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220408