CN114443845A - BERT-based multi-feature fine-granularity Chinese short text sentiment classification method - Google Patents

BERT-based multi-feature fine-granularity Chinese short text sentiment classification method

Info

Publication number
CN114443845A
CN114443845A (application CN202210066218.1A)
Authority
CN
China
Prior art keywords
bert
model
features
probability
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210066218.1A
Other languages
Chinese (zh)
Inventor
丁晓静
卓胜祥
范华俊
左宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuxu Network Technology Shanghai Co ltd
Original Assignee
Xuxu Network Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuxu Network Technology Shanghai Co ltd filed Critical Xuxu Network Technology Shanghai Co ltd
Priority to CN202210066218.1A priority Critical patent/CN114443845A/en
Publication of CN114443845A publication Critical patent/CN114443845A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a BERT-based multi-feature fine-grained Chinese short text sentiment classification method, which comprises the following steps: step A, comprehensive expression of multi-dimensional features: the valid input features of the model include 4 types: one-hot encoding features, position encoding features, glyph (character-shape) features and pinyin features; the four features have the same dimension and are further averaged to obtain a comprehensive feature expression, and this feature is passed through a BERT model to obtain the final feature expression; the BERT Transformer stacks multiple multi-head self-attention and feed-forward neural network modules. The added glyph and pinyin features make the model tolerant, to a certain extent, of similar-glyph or homophone errors in the input text, so that the relevant semantics can still be extracted correctly even when such errors occur; the model can therefore adapt to erroneous real-world texts, and the prediction accuracy of the model is improved.

Description

BERT-based multi-feature fine-granularity Chinese short text sentiment classification method
Technical Field
The invention relates to the technical field of networks, in particular to a BERT-based multi-feature fine-grained Chinese short text sentiment classification method.
Background
The goal of sentiment analysis is to analyze, from text, the emotional tendency that people express toward entities and their attributes; the earliest research on this technology dates back to the 2003 work on article reviews by the two scholars Nasukawa and Yi. With the development of social media such as microblogs and of e-commerce platforms, a large amount of content with emotional tendency is generated, providing the data basis required for sentiment analysis. Today, sentiment analysis is widely used in many fields. For example: in commodity retail, user reviews are very important feedback for retailers and manufacturers; performing sentiment analysis on massive user reviews makes it possible to quantify how strongly users praise or criticize a product and its competitors, revealing the appeal of the product and how it compares with competing products. In the field of social public opinion, analyzing public opinion on hot social events makes it possible to grasp public-opinion trends effectively. For enterprise public opinion, sentiment analysis allows a company to quickly learn how society evaluates it, providing a basis for strategic planning and improving the company's competitiveness in the market. In financial trading, analyzing traders' attitudes toward stocks and other financial derivatives provides auxiliary evidence for market trading.
Existing popular sentiment analysis models can be roughly divided into two parts:
1. Feature extraction, i.e., producing an encoded representation of the text. Encoding methods fall into two categories: autoregressive and autoencoding. Autoregression is a unidirectional model based on the decoder part of the Transformer model; autoencoding is a bidirectional model based on the encoder part of the Transformer model.
The Transformer was proposed by Ashish Vaswani et al. of the Google team in June 2017 in the paper Attention Is All You Need and has become a classic in NLP tasks. Its model structure is shown in fig. 1.
2. Mapping from features to emotion categories: generally, a fully connected layer and a softmax layer are appended, which transform the features into a vector whose dimension equals the number of emotion categories and then normalize it to obtain the probability of each category.
In the prior art, a classification layer is added directly on top of the original BERT pre-trained model for fine-tuning. The BERT model obtained by pre-training on a large amount of general corpus is fine-tuned with domain-specific corpus and task-specific labeled corpus, so as to fully extract the meaning of each token of the specific corpus under the specific task.
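As a rough illustration of this prior-art fine-tuning scheme, a minimal Python sketch using the open-source HuggingFace transformers library is given below; the model name, number of labels, example sentences and hyperparameters are assumptions chosen for illustration only and are not part of the disclosed method.

```python
# Illustrative sketch of the prior-art scheme: a pre-trained BERT with an added
# classification head, fine-tuned on labeled sentences. The model name,
# num_labels, learning rate and toy data are assumptions for illustration.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["我喜欢它。", "太糟糕了。"]        # toy labeled corpus
labels = torch.tensor([1, 0])               # e.g. 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**inputs, labels=labels)    # classification layer over the [CLS] vector
outputs.loss.backward()                     # cross-entropy loss on the labels
optimizer.step()
```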
The first prior art has the following disadvantages:
a) The features of the original BERT model are too limited: the features of the input encoding part are only a one-hot encoding vector, a position encoding vector and a token-type vector, and the token-type vector is fixed and carries no useful information, because only a single sentence is input in the sentiment analysis scenario.
b) Directly fine-tuning with the prior art when labeled data are scarce easily leads to overfitting, and the robustness of the model cannot be guaranteed.
c) Influenced by the initialization of the classification layer and by hyperparameters such as the learning rate, batch size and weight decay rate, the model may fall into different local optima, and these optima perform differently on different test sets; if only a single model's result is finally adopted, the effect may therefore be biased.
d) Most sentiment classification in industry is labeled with 2-3 categories, such as positive, negative and neutral. In practical applications this classification is too coarse; human emotional expression and tendency are far more detailed and complex, so coarse-grained sentiment categories carry too little information and are not conducive to subsequent in-depth analysis.
Disclosure of Invention
The invention aims to provide a BERT-based multi-feature fine-grained Chinese short text sentiment classification method to solve the problems raised in the background art above.
In order to achieve the purpose, the invention provides the following technical scheme:
a BERT-based multi-feature fine-grained Chinese short text sentiment classification method comprises the following steps:
step A, comprehensive expression of multi-dimensional features: the valid input features of the model include 4 types: one-hot encoding features, position encoding features, glyph (character-shape) features and pinyin features; the four features have the same dimension and are further averaged to obtain a comprehensive feature expression, and this feature is passed through a BERT model to obtain the final feature expression; the BERT Transformer stacks multiple multi-head self-attention and feed-forward neural network modules, where the self-attention modules use a bidirectional attention mechanism, that is, each token attends to the context on both its left and right sides; the multi-head self-attention formula is MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W0, with head_i = Attention(Q·WiQ, K·WiK, V·WiV), where W0 is the weight matrix used for dimensionality reduction after the heads are concatenated and WiQ, WiK, WiV are the weight matrices of Q, K and V respectively; the formula of the Attention is
Attention(Q, K, V) = softmax(Q·K^T / √dk)·V
where Q, K and V are the input query, key and value vectors respectively and dk is the vector dimension; the multi-head self-attention module reduces the computational resources consumed by reducing the dimensionality;
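For illustration only, the attention formulas above can be sketched with a few lines of numpy; the sequence length, dimensions and random weight matrices below are assumed values and do not correspond to the actual BERT parameters.

```python
# Minimal sketch of scaled dot-product attention and multi-head self-attention.
# Dimensions and random weights are illustrative assumptions only.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # Q·K^T / sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax over the keys
    return w @ V

def multi_head(X, W_q, W_k, W_v, W_o):
    # one projection per head; W_o reduces the concatenated heads back to d_model
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

seq_len, d_model, n_heads, d_head = 8, 768, 12, 64
X = np.random.randn(seq_len, d_model)                 # averaged token features
W_q = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_k = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_v = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_o = np.random.randn(n_heads * d_head, d_model)
out = multi_head(X, W_q, W_k, W_v, W_o)               # shape (seq_len, d_model)
```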
step B, mapping text vector features to probabilistic emotion classification features: the text vector features obtained in the previous step are mapped to emotion classification features through a classification layer, giving a feature representation of dimension class_size; this representation is turned into probabilities by a softmax layer, i.e., each dimension's value lies between 0 and 1 and all dimension values sum to 1; the classification layer formula is S = W^T·X + b, where W is the n×j fully connected weight matrix, b is a bias term and X is the vector output by the feature extraction layer; the resulting S enters the softmax layer
Pi = e^(Si) / Σ_k e^(Sk) (k = 1, …, j)
where Pi is the probability of text category i, Si is the value of the i-th neuron output by the classification layer, and j is the number of prediction categories;
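A minimal sketch of step B, assuming an illustrative embedding_size of 768 and class_size of 6 (neither value is specified by the method itself), is:

```python
# Sketch of step B: classification layer S = W^T·X + b followed by softmax.
# embedding_size and class_size are illustrative assumptions.
import numpy as np

embedding_size, class_size = 768, 6
W = np.random.randn(embedding_size, class_size)    # fully connected weight matrix
b = np.zeros(class_size)                           # bias term

x = np.random.randn(embedding_size)                # text vector from the feature extraction layer
S = W.T @ x + b                                    # class_size-dimensional scores
P = np.exp(S - S.max())
P /= P.sum()                                       # softmax: each P_i in (0, 1), all sum to 1
```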
step C, model fusion: the predictions of the top 3 models obtained under different parameter settings are weighted-averaged.
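Step C can be sketched as follows; the number of classes, the class probabilities and the fusion weights are assumed values for illustration.

```python
# Sketch of step C: weighted average of the class probabilities predicted by
# the three best models. All numbers are illustrative assumptions.
import numpy as np

probs = np.array([
    [0.10, 0.70, 0.20],   # model 1 class probabilities
    [0.15, 0.60, 0.25],   # model 2
    [0.05, 0.80, 0.15],   # model 3
])
weights = np.array([0.4, 0.3, 0.3])    # e.g. proportional to validation performance
fused = weights @ probs                # weighted average per class
prediction = int(fused.argmax())       # fused emotion category
```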
As a further technical scheme of the invention: the one-hot encoding feature is obtained by generating a vocab_size × embedding_size encoding matrix and looking up each token's dictionary id in this matrix.
As a further technical scheme of the invention: the position encoding features are inherited from the 512 × embedding_size encoding matrix in the BERT pre-trained model and can encode texts of at most 512 tokens.
As a further technical scheme of the invention: the glyph features use three fonts: FangSong (imitation Song), running script and clerical script, and are obtained by applying convolution and pooling operations to the graphical rendering of each character.
As a further technical scheme of the invention: the pinyin features are obtained by mapping the full pinyin letters of each Chinese character through an embedding and then averaging.
As a further technical scheme of the invention: the BERT model structure is a stack of Transformer encoders, aiming to pre-train deep bidirectional representations by jointly conditioning on context in all layers.
As a further technical scheme of the invention: the classification layer is a feed-forward network of size embedding_size × class_size.
As a further technical scheme of the invention: the training process of the model comprises two steps:
1) Mask ML unsupervised training: for unlabeled text in the specific field, training data are constructed through the Mask ML strategy and the model is pre-trained, that is, for each token in each sentence (see the sketch after this list):
with 85% probability, the original token is kept unchanged;
with 15% probability, the token is selected and handled as follows:
with 80% probability, the current token is replaced by the character [MASK];
with 10% probability, the current token is replaced by a token randomly drawn from the vocabulary;
with 10% probability, the original token is kept unchanged;
2) Supervised training for text classification: according to the annotation labels, the cross-entropy loss between the probability result output at the [CLS] position and the true annotation is calculated, the gradient of each parameter is computed by back-propagation, and the parameters are updated.
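The masking strategy of step 1) can be sketched as follows; the toy sentence, the vocabulary and the character-level tokenisation are assumptions for illustration.

```python
# Sketch of the Mask ML data construction: 85% of tokens are kept unchanged;
# 15% are selected, of which 80% become [MASK], 10% a random vocabulary token
# and 10% stay unchanged. Sentence and vocabulary are illustrative assumptions.
import random

def mask_tokens(tokens, vocab):
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < 0.15:                   # 15%: selected for prediction
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80% of the selected tokens
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random vocabulary token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:                                        # 85%: keep unchanged, not predicted
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

tokens = ["我", "喜", "欢", "它", "。"]
vocab = ["我", "你", "他", "喜", "欢", "它", "好", "。"]
masked, targets = mask_tokens(tokens, vocab)
```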
Compared with the prior art, the invention has the following beneficial effects: the added glyph and pinyin features make the model tolerant, to a certain extent, of similar-glyph or homophone errors in the input text, so that the relevant semantics can still be extracted correctly even when such errors occur; the model can therefore adapt to erroneous real-world texts, and the prediction accuracy of the model is improved. In the model training process, unsupervised text pre-training is fully exploited as the basis of supervised training: the semantic expression characteristics of the domain-specific text are learned from unlabeled text, which avoids overfitting of the model when labeled text is scarce and improves the robustness of the model. The final model fusion takes into account the bias of a single model caused by initialization and hyperparameter selection, and averaging the models' results makes the overall effect more stable.
Drawings
FIG. 1 is a diagram of a current popular emotion analysis model.
FIG. 2 is a schematic diagram of fine tuning by adding a classification layer directly on the basis of an original BERT pre-training model.
FIG. 3 is a schematic diagram of a stack of Transformer encoders.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1: a BERT-based multi-feature fine-grained Chinese short text sentiment classification method comprises the following steps:
Step A: comprehensive expression of the multi-dimensional features: the valid input features of the model include 4 types: one-hot encoding features, position encoding features, glyph features and pinyin features. The one-hot encoding feature is obtained by generating a vocab_size × embedding_size encoding matrix and looking up each token's dictionary id in this matrix; the position encoding features are inherited from the 512 × embedding_size encoding matrix in the BERT pre-trained model and can encode texts of at most 512 tokens; the glyph features use three fonts: FangSong (imitation Song), running script and clerical script, with convolution and pooling operations applied to the graphical rendering of each character; the pinyin features are obtained by mapping the full pinyin letters of each Chinese character through an embedding and then averaging. The four features have the same dimension and are further averaged to obtain a comprehensive feature expression. This feature is passed through a BERT model to obtain the final feature expression. The BERT model structure is a stack of Transformer encoders (as shown in FIG. 3), aiming to pre-train deep bidirectional representations by jointly conditioning on context in all layers.
The BERT Transformer stacks multiple multi-head self-attention and feed-forward neural network modules, where the self-attention modules use a bidirectional attention mechanism, that is, each token attends to the context on both its left and right sides. The multi-head self-attention formula is MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W0, with head_i = Attention(Q·WiQ, K·WiK, V·WiV), where W0 is the weight matrix used for dimensionality reduction after the heads are concatenated and WiQ, WiK, WiV are the weight matrices of Q, K and V respectively. The Attention formula is
Attention(Q, K, V) = softmax(Q·K^T / √dk)·V
where Q, K and V are the input query, key and value vectors respectively and dk is the vector dimension; the multi-head self-attention module reduces the computational resources consumed by reducing the dimensionality.
Step B, mapping text vector features to probabilistic emotion classification features:
The text vector features obtained in the previous step are mapped to emotion classification features through the classification layer (a feed-forward network of size embedding_size × class_size), giving a feature representation of dimension class_size; this representation is turned into probabilities by a softmax layer, i.e., each dimension's value lies between 0 and 1 and all dimension values sum to 1. The classification layer formula is S = W^T·X + b, where W is the n×j fully connected weight matrix, b is a bias term and X is the vector output by the feature extraction layer. The resulting S enters the softmax layer, whose formula is
Pi = e^(Si) / Σ_k e^(Sk) (k = 1, …, j)
where Pi is the probability of text category i, Si is the value of the i-th neuron output by the classification layer, and j is the number of prediction categories.
Step C, model fusion: the predictions of the top 3 models obtained under different parameter settings are weighted-averaged.
A softmax layer over all emotion categories is added at the [CLS] position of the model output, and the predicted emotion category is output, e.g.: happy.
Example 2: on the basis of Example 1, the training process of the model is divided into two steps:
1) Mask ML unsupervised training: for unlabeled text in the specific field, training data are constructed through the Mask ML strategy and the model is pre-trained, that is, for each token in each sentence:
with 85% probability, the original token is kept unchanged;
with 15% probability, the token is selected and handled as follows:
with 80% probability, the current token is replaced by the character [MASK];
with 10% probability, the current token is replaced by a token randomly drawn from the vocabulary;
with 10% probability, the original token is kept unchanged;
For example:
Original sentence: I like it.
Model input: [CLS] I like [MASK].
A softmax layer over the whole vocabulary is added at the [MASK] position of the model output, and the predicted output is: it;
2) Supervised training for text classification: according to the annotation labels, the cross-entropy loss between the probability result output at the [CLS] position and the true annotation is calculated, the gradient of each parameter is computed by back-propagation, and the parameters are updated.
For example:
Original sentence: I like it.
Model input: [CLS] I like it.
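For this example, the cross-entropy computation of step 2) at the [CLS] position can be sketched as follows; the logit values and the label index are assumed for illustration, and in practice the gradient is back-propagated through the whole model.

```python
# Sketch of the supervised loss: softmax over the classification-layer output
# at [CLS], then cross-entropy against the annotated label. Values are
# illustrative assumptions.
import numpy as np

logits = np.array([0.3, 2.1, -0.5])    # classification-layer output at the [CLS] position
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # predicted probability distribution
true_label = 1                          # annotated emotion category
loss = -np.log(probs[true_label])       # cross-entropy loss for this sample
```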
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is given in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted only for clarity, and those skilled in the art should take the description as a whole; the technical solutions in the embodiments may also be combined as appropriate to form other embodiments understandable by those skilled in the art.

Claims (8)

1. A BERT-based multi-feature fine-grained Chinese short text sentiment classification method is characterized by comprising the following steps:
step A, comprehensive expression of multi-dimensional features: the valid input features of the model include 4 types: one-hot encoding features, position encoding features, glyph (character-shape) features and pinyin features; the four features have the same dimension and are further averaged to obtain a comprehensive feature expression, and this feature is passed through a BERT model to obtain the final feature expression; the BERT Transformer stacks multiple multi-head self-attention and feed-forward neural network modules, where the self-attention modules use a bidirectional attention mechanism, that is, each token attends to the context on both its left and right sides; the multi-head self-attention formula is MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W0, with head_i = Attention(Q·WiQ, K·WiK, V·WiV), where W0 is the weight matrix used for dimensionality reduction after the heads are concatenated and WiQ, WiK, WiV are the weight matrices of Q, K and V respectively, and the Attention formula is
Attention(Q, K, V) = softmax(Q·K^T / √dk)·V
where Q, K and V are the input query, key and value vectors respectively and dk is the vector dimension; the multi-head self-attention module reduces the computational resources consumed by reducing the dimensionality;
step B, mapping text vector features to probabilistic emotion classification features: the text vector features obtained in the previous step are mapped to emotion classification features through a classification layer, giving a feature representation of dimension class_size; this representation is turned into probabilities by a softmax layer, i.e., each dimension's value lies between 0 and 1 and all dimension values sum to 1; the classification layer formula is S = W^T·X + b, where W is the n×j fully connected weight matrix, b is a bias term and X is the vector output by the feature extraction layer; the resulting S enters the softmax layer
Pi = e^(Si) / Σ_k e^(Sk) (k = 1, …, j)
where Pi is the probability of text category i, Si is the value of the i-th neuron output by the classification layer, and j is the number of prediction categories;
step C, model fusion: the predictions of the top 3 models obtained under different parameter settings are weighted-averaged.
2. The method as claimed in claim 1, wherein the one-hot encoding feature is obtained by generating a vocab_size × embedding_size encoding matrix and looking up each token's dictionary id in this matrix.
3. The method as claimed in claim 1, wherein the position encoding features are inherited from the 512 × embedding_size encoding matrix in the BERT pre-trained model and can encode texts of at most 512 tokens.
4. The BERT-based multi-feature fine-grained Chinese short text sentiment classification method of claim 3, wherein the glyph features use three fonts: FangSong (imitation Song), running script and clerical script, and are obtained by applying convolution and pooling operations to the graphical rendering of each character.
5. The BERT-based multi-feature fine-grained Chinese short text sentiment classification method of claim 4, wherein the pinyin features are obtained by mapping the full pinyin letters of each Chinese character through an embedding and then averaging.
6. The method of claim 1, wherein the BERT model structure is a stack of Transformer encoders, aiming to pre-train deep bidirectional representations by jointly conditioning on context in all layers.
7. The BERT-based multi-feature fine-grained Chinese short text sentiment classification method of claim 1, wherein the classification layer is a feed-forward network of size embedding_size × class_size.
8. The BERT-based multi-feature fine-grained Chinese short text sentiment classification method according to claim 1, wherein the model training process is divided into two steps:
1) Mask ML unsupervised training: for unlabeled text in the specific field, training data are constructed through the Mask ML strategy and the model is pre-trained, that is, for each token in each sentence:
with 85% probability, the original token is kept unchanged;
with 15% probability, the token is selected and handled as follows:
with 80% probability, the current token is replaced by the character [MASK];
with 10% probability, the current token is replaced by a token randomly drawn from the vocabulary;
with 10% probability, the original token is kept unchanged;
2) Supervised training for text classification: according to the annotation labels, the cross-entropy loss between the probability result output at the [CLS] position and the true annotation is calculated, the gradient of each parameter is computed by back-propagation, and the parameters are updated.
CN202210066218.1A 2022-01-20 2022-01-20 BERT-based multi-feature fine-granularity Chinese short text sentiment classification method Pending CN114443845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210066218.1A CN114443845A (en) 2022-01-20 2022-01-20 BERT-based multi-feature fine-granularity Chinese short text sentiment classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210066218.1A CN114443845A (en) 2022-01-20 2022-01-20 BERT-based multi-feature fine-granularity Chinese short text sentiment classification method

Publications (1)

Publication Number Publication Date
CN114443845A true CN114443845A (en) 2022-05-06

Family

ID=81367463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210066218.1A Pending CN114443845A (en) 2022-01-20 2022-01-20 BERT-based multi-feature fine-granularity Chinese short text sentiment classification method

Country Status (1)

Country Link
CN (1) CN114443845A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
KR20210040851A (en) * 2020-06-03 2021-04-14 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Text recognition method, electronic device, and storage medium
CN112395417A (en) * 2020-11-18 2021-02-23 长沙学院 Network public opinion evolution simulation method and system based on deep learning
CN113239690A (en) * 2021-03-24 2021-08-10 浙江工业大学 Chinese text intention identification method based on integration of Bert and fully-connected neural network

Similar Documents

Publication Publication Date Title
Poria et al. Aspect extraction for opinion mining with a deep convolutional neural network
CN113128229B (en) Chinese entity relation joint extraction method
CN111797898B (en) Online comment automatic reply method based on deep semantic matching
CN109598586B (en) Recommendation method based on attention model
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN112417854A (en) Chinese document abstraction type abstract method
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN115687626A (en) Legal document classification method based on prompt learning fusion key words
CN114462420A (en) False news detection method based on feature fusion model
CN113869055A (en) Power grid project characteristic attribute identification method based on deep learning
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
Zeng et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
Shao et al. Controllable image caption with an encoder-decoder optimization structure
Yong et al. A new emotion analysis fusion and complementary model based on online food reviews
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN111858930A (en) Method for establishing social e-commerce user portrait
CN116663539A (en) Chinese entity and relationship joint extraction method and system based on Roberta and pointer network
CN116663566A (en) Aspect-level emotion analysis method and system based on commodity evaluation
CN114443845A (en) BERT-based multi-feature fine-granularity Chinese short text sentiment classification method
CN113032558B (en) Variable semi-supervised hundred degree encyclopedia classification method integrating wiki knowledge
CN112733526B (en) Extraction method for automatically identifying tax collection object in financial file
CN114925689A (en) Medical text classification method and device based on BI-LSTM-MHSA
CN114925658A (en) Open text generation method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination