CN114297379A - Text binary classification method based on Transformer - Google Patents

Text binary classification method based on Transformer

Info

Publication number
CN114297379A
Authority
CN
China
Prior art keywords
text
module
features
transformer
sequence
Prior art date
Legal status
Pending
Application number
CN202111539076.8A
Other languages
Chinese (zh)
Inventor
张磊
康辉
江珊
杨经纬
李鑫
李春
高宁
Current Assignee
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Digital Intelligence Technology Co Ltd filed Critical China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202111539076.8A priority Critical patent/CN114297379A/en
Publication of CN114297379A publication Critical patent/CN114297379A/en
Pending legal-status Critical Current


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Transformer-based text binary classification method, and relates to the technical field of text classification. After the lengths of the text sequences are unified, the sequences are input into a neural network model in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence, and the model classifies the text sequence features. By effectively extracting the internal patterns of irregular text sequence data, the method achieves efficient classification and improves classification stability.

Description

Text binary classification method based on Transformer
Technical Field
The invention relates to the technical field of text classification, and in particular to a Transformer-based text binary classification method.
Background
Traditional neural network models mainly learn mappings from data points (vectors) to data points (vectors), whereas recurrent neural networks learn mappings from sequences of data points to sequences of data points; models in common use in recent years include the gated recurrent unit, the long short-term memory network and the Transformer. Among them, the Transformer has performed well in machine translation, natural language processing and related fields and has become a research hotspot.
Real-world data often consists of very long sequences; some are regular and some are completely irregular, yet patterns can still be found within them. Classifying such data is an important direction of deep learning research.
Among current deep learning methods for sequence data, the CNN + LSTM combination is the preferred scheme and performs very well when classifying regular sequences: compared with the instability and over-training problems caused by an RNN, each convolution kernel of the CNN can be customised, and a customised CNN processes one-dimensional data stably. However, for the problem of classifying irregular text sequences of different lengths, training a CNN + LSTM model on such data tends to fall into a local optimum after roughly 200-300 training epochs, causing performance to drop sharply; although training may later return to normal, the model then produces an extremely large number of classes, in the extreme case treating every irregular sequence as its own class, and the classification effect is very poor.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a Transformer-based text binary classification method which effectively extracts the internal patterns of irregular text sequence data, thereby achieving efficient classification and improving classification stability.
In order to achieve this technical purpose, the invention adopts the following technical scheme. The Transformer-based text binary classification method specifically comprises the following steps:
(1) preprocessing the text sequence data and unifying the lengths of all text sequences;
(2) constructing a neural network model in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence, wherein the CNN module sharpens the features to be extracted, the Transformer module extracts features, the LSTM module further refines the features, and the fully connected layer classifies the features;
(3) inputting the preprocessed text sequences into the neural network model constructed in step (2) and outputting the text classification result.
Further, the preprocessing of the text sequence data in step (1) is specifically as follows: taking the text sequence with the longest length in the text sequence data as the standard, the character '#' is appended to the end of every other text sequence until all text sequences have the same length.
Further, the step (3) comprises the following sub-steps:
(3.1) sequentially inputting the preprocessed text sequences into the CNN module to extract coarse modular features, with the number of convolutions set so that each text sequence is convolved into a sequence segment of length 50-100;
(3.2) performing sliding-window processing on the extracted coarse modular features through the self-attention mechanism in the Transformer module to divide them into subsequences, inputting the subsequences into the recurrent neural network in the Transformer module to extract features of higher similarity, and encoding and decoding the extracted features through the self-attention mechanism to obtain the subsequence with the highest commonality as the modular feature;
(3.3) inputting the modular features into the LSTM module to further refine them, and outputting the feature classification result through the fully connected layer.
Further, the CNN module in step (3.1) uses a convolution kernel of 1 × 3.
Further, in the sliding window processing in step (3.2), the width of the sliding window is set to be 30% of the length of the sequence segment.
Further, the extraction of the features of higher similarity in step (3.2) is specifically as follows: the recurrent neural network in the Transformer module performs a convolution operation on each subsequence, where the length of the convolution kernel equals the width of the sliding window, so that each subsequence is reduced to a length of 1 after the convolution; the results are then ordered according to the sliding-window order to obtain the features of higher similarity.
Compared with the prior art, the invention has the following beneficial effects: the Transformer-based text binary classification method captures the internal patterns of irregular text sequence data through the self-attention mechanism of the Transformer model, extracts features of higher similarity through the embedded recurrent neural network, and refines the features through the LSTM module, thereby effectively extracting the common internal patterns and distributions shared by different irregular text sequences, using changes in these internal patterns as the basis for classification, and effectively solving the problem of classifying irregular text sequence data.
Drawings
FIG. 1 is a flow chart of the Transformer-based text binary classification method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below with reference to the drawings and the embodiments. It should be emphasised that the described embodiments are only some, rather than all, of the possible embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The invention provides a text binary classification method based on a Transformer, which specifically comprises the following steps:
(1) The text sequence data is preprocessed: taking the text sequence with the longest length in the data as the standard, the character '#' is appended to the end of every other text sequence until all text sequences have the same length, as sketched below.
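A minimal sketch of this padding step, assuming the sequences arrive as Python strings; the '#' padding symbol and the longest-sequence rule follow the description above, while the function name pad_sequences is illustrative only.

# Minimal padding sketch: append '#' to every sequence until all sequences
# share the length of the longest one.
def pad_sequences(sequences, pad_symbol="#"):
    max_len = max(len(s) for s in sequences)               # longest sequence sets the standard
    return [s + pad_symbol * (max_len - len(s)) for s in sequences]

# Example with three unordered letter sequences of different lengths.
print(pad_sequences(["ABQZ", "KLMNOPQ", "XY"]))
# -> ['ABQZ###', 'KLMNOPQ', 'XY#####']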
(2) A neural network model is constructed in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence. The CNN module sharpens the features to be extracted so that they become more prominent; the Transformer module extracts features, the LSTM module further refines them, and the fully connected layer classifies them.
(3) The preprocessed text sequences are input into the neural network model constructed in step (2), and the text classification result is output. This specifically comprises the following sub-steps:
(3.1) The preprocessed text sequences are sequentially input into the CNN module to extract coarse modular features. The CNN module convolves each text sequence according to a predetermined feature pattern using 1 × 3 convolution kernels, and the number of convolutions is set so that each text sequence is convolved into a sequence segment of length 50-100. If too few convolutions are used, too many features are retained and the training effect of the Transformer module degrades; if too many are used, the text sequence data is severely lost. A sketch of this module follows.
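A sketch of the CNN module under stated assumptions: the description does not say how the sequence is shortened to 50-100 elements, so strided 1 × 3 convolutions (PyTorch, stride 3) are assumed here purely for illustration, and the channel count and number of layers are likewise illustrative.

import torch
import torch.nn as nn

# Sketch of the CNN module: stacked 1x3 convolutions over a digitised, padded
# text sequence. The stride of 3 is an assumption used only so that a length-5700
# input shrinks to roughly 70 elements, in line with the embodiment below.
class CoarseFeatureCNN(nn.Module):
    def __init__(self, channels=1, stride=3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
            nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1),
        )

    def forward(self, x):              # x: (batch, 1, padded_length)
        return self.convs(x)           # shorter "coarse modular feature" segment

x = torch.randn(8, 1, 5700)            # a batch of padded, digitised sequences
print(CoarseFeatureCNN()(x).shape)     # torch.Size([8, 1, 71])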
(3.2) The extracted coarse modular features are processed with a sliding window through the self-attention mechanism in the Transformer module, with the window width set to 30% of the length of the sequence segment, dividing the segment into subsequences. The subsequences are input into the recurrent neural network embedded in the Transformer module to extract features of higher similarity, and these features are then encoded and decoded by the self-attention mechanism to obtain the subsequence with the highest commonality as the modular feature. In this way the method searches for patterns in the text sequence and converts its features into a more distinct form through the self-attention mechanism in the Transformer module. For handling the commonalities of text sequences, traditional CNN and LSTM models struggle to reason in a human-like way, whereas the self-attention mechanism in the Transformer module is closer to how the brain processes text and is well suited to natural language processing. Meanwhile, because non-image data is small in volume and dimensionality, the recurrent neural network embedded in the Transformer module is designed as a two-layer recurrent neural network, which reduces expansion at the parallel level and avoids a large amount of redundancy.
The extraction of the features of higher similarity in the invention is specifically as follows: the recurrent neural network in the Transformer module performs a convolution operation on each subsequence, where the length of the convolution kernel equals the width of the sliding window, so that each subsequence is reduced to a length of 1 after the convolution; the results are then ordered according to the sliding-window order to obtain the features of higher similarity. A sketch of this window-and-convolve step follows.
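A numeric sketch of this step, with two stated assumptions: the sliding window moves with stride 1 (the stride is not given), and the learned convolution performed by the embedded recurrent network is replaced by a fixed averaging kernel of the same length purely for illustration; the function name window_similarity_features is hypothetical.

import numpy as np

# Sliding-window split followed by a convolution whose kernel is as long as the
# window, so each window collapses to a single value kept in window order.
def window_similarity_features(segment, window_ratio=0.3, kernel=None):
    w = max(1, int(len(segment) * window_ratio))          # window width = 30% of segment length
    windows = [segment[i:i + w] for i in range(len(segment) - w + 1)]
    if kernel is None:
        kernel = np.ones(w) / w                           # stand-in for the learned kernel
    return np.array([float(np.dot(win, kernel)) for win in windows])

segment = np.random.rand(70)                              # coarse feature segment from the CNN
print(window_similarity_features(segment).shape)          # (50,) for a width-21 window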
(3.3) The modular features are input into the LSTM module to further refine them so that the classification accuracy is optimal, and the feature classification result is output through the fully connected layer. A sketch of the assembled pipeline follows.
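An end-to-end sketch of the described pipeline under stated assumptions: the standard PyTorch TransformerEncoder is used as a stand-in for the patent's modified Transformer module (which embeds a recurrent network), and the layer sizes, stride and model dimensions are illustrative rather than taken from the patent.

import torch
import torch.nn as nn

# CNN -> Transformer -> LSTM -> fully connected layer, as described above.
class TransformerTextClassifier(nn.Module):
    def __init__(self, d_model=64, num_classes=2):
        super().__init__()
        self.cnn = nn.Conv1d(1, d_model, kernel_size=3, stride=81, padding=1)   # coarse features
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)   # feature extraction
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)                 # feature refinement
        self.fc = nn.Linear(d_model, num_classes)                               # feature classification

    def forward(self, x):                    # x: (batch, 1, padded_length)
        h = self.cnn(x).transpose(1, 2)      # (batch, segment_length, d_model)
        h = self.transformer(h)
        h, _ = self.lstm(h)
        return self.fc(h[:, -1])             # logits for the two classes

logits = TransformerTextClassifier()(torch.randn(4, 1, 5700))
print(logits.shape)                          # torch.Size([4, 2])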
The Transformer-based text binary classification method thus effectively extracts the internal patterns of irregular text sequence data, thereby achieving efficient classification and improving classification stability.
Examples
In this embodiment, unordered text sequences are formed from the 26 capital letters, with the letters arranged in different ways, and the task is to distinguish these sequences. First, the 26 letters are digitised and '#' is replaced with 0, converting the text sequence data into numeric data for the subsequent convolutional neural network computation; meanwhile, the sequences are classified by manual labelling. FIG. 1 is a flow chart of the Transformer-based text binary classification method, which specifically comprises the following steps:
(1) The maximum sequence length k in the unordered text sequences is known to be 5700. Every other unordered text sequence is padded with 0 after its last element until all sequences have length k, completing the preprocessing of the text sequences.
(2) A neural network model is constructed, comprising a CNN module, a Transformer module, an LSTM module and a fully connected layer connected in sequence. The CNN module uses 1 × 3 convolution kernels and preliminarily extracts data features, sharpening the features to be extracted so that they become more prominent; the Transformer module extracts features, the LSTM module further refines them, and the fully connected layer classifies them.
(3) The preprocessed text sequences are input into the neural network model constructed in step (2), and the text classification result is output. This specifically comprises the following sub-steps:
(3.1) The CNN module in this embodiment uses four convolutional layers: a CONV1D_1 layer, a CONV1D_2 layer, a CONV1D_3 layer and a CONV1D_4 layer, whose convolution kernels are [1,0,1], [1,1,0], [0,1,0] and [0,0,1] respectively. The previously preprocessed text sequences are input into the CNN module and convolved through these kernels into sequence segments of length 70, in which the modular features are present but not yet distinct. A sketch of these layers is given below.
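A sketch of these four layers with the quoted kernels loaded as fixed weights; whether the kernels are fixed or merely initial values is not stated, and the stride is assumed only so that the output segment length lands near 70.

import torch
import torch.nn as nn

# Four 1x3 convolution layers with the kernels quoted in the embodiment.
kernels = {
    "conv1d_1": [1.0, 0.0, 1.0],
    "conv1d_2": [1.0, 1.0, 0.0],
    "conv1d_3": [0.0, 1.0, 0.0],
    "conv1d_4": [0.0, 0.0, 1.0],
}

layers = []
for name, weights in kernels.items():
    conv = nn.Conv1d(1, 1, kernel_size=3, stride=3, padding=1, bias=False)
    with torch.no_grad():
        conv.weight.copy_(torch.tensor(weights).view(1, 1, 3))   # load the fixed kernel
    layers.append(conv)

cnn = nn.Sequential(*layers)
segment = cnn(torch.randn(1, 1, 5700))     # a preprocessed length-5700 sequence
print(segment.shape)                       # about (1, 1, 71) with the assumed stride of 3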
(3.2) The sequence segments are sent to the Transformer module for feature extraction. The length-70 segments are processed with a sliding window through the self-attention mechanism in the Transformer module, with the window length set to 21, and the resulting 50 subsequences are sequentially input into the embedded recurrent neural network. Operating on the 50 subsequences with a 1 × 20 convolution kernel yields 50 numbers, which are combined in order into the features of higher similarity; after the encoding and decoding operations of the self-attention mechanism, the subsequence with the highest commonality is obtained as the modular feature. A numeric check of this window arithmetic follows.
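A quick, self-contained numeric check of this embodiment's window arithmetic; as in the general description, a kernel as long as the window is assumed so that each window collapses to a single value, and the averaging kernel is only a stand-in for the learned one.

import numpy as np

# A length-70 segment with a width-21 window yields 70 - 21 + 1 = 50 subsequences;
# a kernel as long as the window then collapses each one to a single value.
segment = np.random.rand(70)
w = 21
windows = [segment[i:i + w] for i in range(len(segment) - w + 1)]
kernel = np.ones(w) / w                       # stand-in for the learned kernel
values = [float(np.dot(win, kernel)) for win in windows]
print(len(windows), len(values))              # 50 50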
(3.3) The modular features are input into the LSTM module to further refine them, and the feature classification result is output through the fully connected layer. The Transformer-based text binary classification method thus classifies unordered text sequences by their internal patterns. This is very similar to human language: similar spoken-language patterns are automatically grouped into the same class, and different languages are separated according to their different internal patterns, achieving pattern-level classification.
Traditional deep learning methods include CNN, RNN and LSTM. A multi-layer CNN reaches an accuracy of only 78% on the sequence classification task in this embodiment, and its accuracy drops after about 500 training epochs as the number of epochs increases; an RNN runs into this problem even sooner. Combining the CNN with an LSTM improves performance, with the model reaching 81% accuracy. The Transformer-based text binary classification method adds a Transformer module between the CNN and the LSTM; by focusing attention on the features through the self-attention mechanism, the stability of the model is greatly improved and the classification accuracy stabilises above 90%.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-described embodiment; all technical solutions falling under the concept of the present invention belong to its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principles of the invention also fall within the protection scope of the invention.

Claims (6)

1. A Transformer-based text binary classification method, characterised by comprising the following steps:
(1) preprocessing the text sequence data and unifying the lengths of all text sequences;
(2) constructing a neural network model in which a CNN module, a Transformer module, an LSTM module and a fully connected layer are connected in sequence, wherein the CNN module sharpens the features to be extracted, the Transformer module extracts features, the LSTM module further refines the features, and the fully connected layer classifies the features;
(3) inputting the preprocessed text sequences into the neural network model constructed in step (2) and outputting the text classification result.
2. The Transformer-based text binary classification method according to claim 1, wherein the preprocessing of the text sequence data in step (1) is specifically as follows: taking the text sequence with the longest length in the text sequence data as the standard, the character '#' is appended to the end of every other text sequence until all text sequences have the same length.
3. The Transformer-based text binary classification method according to claim 1, wherein the step (3) comprises the following sub-steps:
(3.1) sequentially inputting the preprocessed text sequences into the CNN module to extract coarse modular features, with the number of convolutions set so that each text sequence is convolved into a sequence segment of length 50-100;
(3.2) performing sliding-window processing on the extracted coarse modular features through the self-attention mechanism in the Transformer module to divide them into subsequences, inputting the subsequences into the recurrent neural network in the Transformer module to extract features of higher similarity, and encoding and decoding the extracted features through the self-attention mechanism to obtain the subsequence with the highest commonality as the modular feature;
(3.3) inputting the modular features into the LSTM module to further refine them, and outputting the feature classification result through the fully connected layer.
4. The Transformer-based text binary classification method according to claim 3, wherein the CNN module in step (3.1) uses a 1 × 3 convolution kernel.
5. The Transformer-based text binary classification method according to claim 3, wherein, in the sliding-window processing in step (3.2), the width of the sliding window is set to 30% of the length of the sequence segment.
6. The Transformer-based text binary classification method according to claim 3, wherein the extraction of the features of higher similarity in step (3.2) is specifically as follows: the recurrent neural network in the Transformer module performs a convolution operation on each subsequence, where the length of the convolution kernel equals the width of the sliding window, so that each subsequence is reduced to a length of 1 after the convolution; the results are then ordered according to the sliding-window order to obtain the features of higher similarity.
CN202111539076.8A 2021-12-16 2021-12-16 Text binary classification method based on Transformer Pending CN114297379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111539076.8A CN114297379A (en) 2021-12-16 2021-12-16 Text binary classification method based on Transformer


Publications (1)

Publication Number Publication Date
CN114297379A (en) 2022-04-08

Family

ID=80966929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111539076.8A Pending CN114297379A (en) 2021-12-16 2021-12-16 Text binary classification method based on Transformer

Country Status (1)

Country Link
CN (1) CN114297379A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858933A (en) * 2020-07-10 2020-10-30 暨南大学 Character-based hierarchical text emotion analysis method and system
CN111858932A (en) * 2020-07-10 2020-10-30 暨南大学 Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN112802568A (en) * 2021-02-03 2021-05-14 紫东信息科技(苏州)有限公司 Multi-label stomach disease classification method and device based on medical history text
CN113177633A (en) * 2021-04-20 2021-07-27 浙江大学 Deep decoupling time sequence prediction method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220408