CN113822340A - Image-text emotion recognition method based on attention mechanism - Google Patents

Image-text emotion recognition method based on attention mechanism

Info

Publication number
CN113822340A
CN113822340A (application CN202110992751.6A)
Authority
CN
China
Prior art keywords
text
features
picture
layer
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110992751.6A
Other languages
Chinese (zh)
Inventor
刘博
徐毓笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110992751.6A priority Critical patent/CN113822340A/en
Publication of CN113822340A publication Critical patent/CN113822340A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image-text emotion recognition method based on an attention mechanism. The recently popular attention mechanism is introduced to mine information within each modality and to learn the interaction between modalities, and, because each modality contributes differently to emotion classification, a decision-level fusion rule is designed to integrate the classification results of all modalities into a final emotion recognition result. Decision-level fusion with a designed fusion rule integrates the classification probabilities of the individual classifiers and improves the final emotion recognition accuracy. The method supplements and optimizes multi-modal feature extraction, feature fusion and related aspects of image-text comment emotion recognition; it effectively mines intra-modal information, builds interaction between modalities and improves the accuracy of image-text emotion recognition.

Description

Image-text emotion recognition method based on attention mechanism
Technical Field
The invention belongs to the fields of computer vision and natural language processing and is mainly used for emotion recognition of image-text comments on Internet social media.
Background
With the rapid development of social media, users tend to express opinions and share experiences on platforms such as Twitter, Facebook and Sina Weibo. The content they publish is becoming increasingly diverse in both substance and form: unlike conventional plain-text comments, users increasingly pair text comments with pictures. Traditional text-based sentiment analysis has therefore evolved into multi-modal sentiment analysis, whose goal is to automatically identify the basic attitude in a comment, extract the user's emotion and understand user behaviour; this has important practical applications.
How to effectively use the information in both the visual and textual content of image-text comments is a challenging problem in multi-modal sentiment analysis: compared with single-modality sentiment analysis, a multi-modal method must effectively fuse information across modalities. Current multi-modal sentiment analysis faces three problems. First, the information within each modality is not fully extracted: the emotion of a picture cannot be abstracted from its low-level and mid-level features alone, and comment text is casual and short, so traditional text representations fail to mine important semantic information. Second, the information of the modalities must be fused effectively, removing redundant information while letting the modalities complement each other. Third, the modalities contribute differently to emotion classification, and how to assign the weight of each modality is also a problem.
The attention mechanism imitates the focusing ability of the human eye and attends to the more important and valuable information. Introducing an attention mechanism allows reasonable weights to be assigned to information of different dimensions within the same modality, so that context information is processed accurately, and assigning weights to the different modalities addresses the unequal contribution of pictures and text to emotion classification. Existing multi-modal feature fusion methods can be divided into data-level fusion, feature-level fusion and decision-level fusion. Data-level fusion unifies the collected data sets into one whole according to some rule; it is complex to implement and the resulting data often contain too much redundant information. Feature-level fusion extracts features from each modality, constructs a joint vector and feeds it into a classifier for emotion classification; common operations are concatenation, element-wise addition and element-wise multiplication. Decision-level fusion builds a classifier for each modality and integrates the resulting classification results according to some rule to obtain the final emotion recognition result. Decision-level fusion is comparatively simple, and a properly designed decision-level fusion formula can achieve considerable recognition accuracy.
Disclosure of Invention
Aimed at network comments on Internet social media, the invention provides an image-text emotion recognition method based on an attention mechanism. The recently popular attention mechanism is introduced to better mine intra-modal information and learn inter-modal interactions, and, because the modalities contribute differently to emotion classification, a decision-level fusion rule is designed to integrate the per-modality classification results into a final emotion recognition result.
A self-attention mechanism is introduced to better mine the emotional information inside each modality, and a cross-attention mechanism is introduced to build interaction between the modalities. The rationale is that an attention mechanism lets the model devote more attention resources to the parts it focuses on, so as to obtain more detailed information while weakening attention to other, relatively unimportant parts; it extracts higher-value information from a large amount of information and improves processing efficiency. In the image-text comment emotion recognition task, text features and picture features are obtained by preliminary feature extraction. Because the information of the modalities is related, a cross-modal encoding layer is added in which the picture and the text serve as auxiliary information for each other: masked features can be inferred from aligned elements of the other modality, and the relations between the modalities are discovered and constructed so that information from different modalities can interact. The text features, picture features and the multi-modal features obtained through cross-attention are then each fed into a self-encoding layer, where further feature selection is performed by self-attention. With a careful design and combination of these self-attention and cross-attention layers, the method extracts high-quality text features, picture features and multi-modal features from the input data.
The method adopts decision-level fusion and designs a fusion rule to integrate the classification probabilities of the individual classifiers, improving the final emotion recognition accuracy. Traditional feature-level fusion simply combines text features and picture features, ignores the structural coupling between text and picture, and is poorly interpretable. In real social-media comment data the contributions of picture and text to emotion classification are not equal, and different data affect the classification result to different degrees. The advantage of decision-level fusion is that an independent classifier can be built for each modality and the final decision obtained by giving different weights to the result of each classifier. The method analyses the features of each modality independently, sets a fusion rule and assigns a weight to the classification result of each modality, which solves the problem of unequal modality contributions and improves recognition accuracy.
The image-text comment emotion recognition method supplements and optimizes multi-modal feature extraction, feature fusion and related aspects; it effectively mines intra-modal information, builds interaction between the modalities and improves the accuracy of image-text emotion recognition.
The method comprises the following steps:
Step 1: preprocess the image-text comment data and convert it into the data format required by the model input.
Step 2: perform preliminary feature extraction on the preprocessed text data and picture data with pre-trained models to obtain text features and picture features.
Step 3: input the text features and picture features obtained in step 2, as auxiliary information for each other, to the cross-modal encoding layer, and use a cross-attention mechanism to learn the interaction between the modalities.
Step 4: input the text features, picture features and multi-modal features obtained in step 3 into a self-attention encoding layer to assign reasonable weights to information of different dimensions within the features and perform further feature selection.
Step 5: input the text features, picture features and multi-modal features obtained in step 4 into their respective multilayer perceptrons to obtain emotion recognition results.
Step 6: give respective weights to the emotion classification probabilities obtained by the classifiers and perform decision-level fusion by weighting to obtain the final emotion classification result.
Drawings
FIG. 1 is a flow chart of the method of operation.
FIG. 2 is a model diagram of the method.
Fig. 3 shows a sample image-text comment.
Detailed Description
The present invention is described in detail below with reference to examples and the accompanying drawings.
The embodiment of the invention takes image-text comments only as an example, but the algorithm can be extended to any multi-modal emotion classification problem. For the image-text comment sample shown in Fig. 3, the emotional tendency is 'happy'. A model is designed for this task; after the model is trained to optimality, a new image-text comment sample can be input and its emotional tendency output. The steps are described in detail below.
Step 1: the image-text comment data are preprocessed and converted into the data format required by the model input.
Data preprocessing is an important step of the method, especially for user comments from social-media platforms, where the data are raw and unstructured. The main preprocessing steps are as follows (a code sketch follows the list):
Deleting special symbols: on social-media platforms, user-generated content usually contains special symbols such as the "@" symbol that points to another user; the information after this symbol usually concerns user privacy and is not useful for sentiment analysis, so the tokens following "@" are deleted.
Word segmentation: the comment text is split into words with a common word-segmentation tool; words become the basic unit of further text processing.
Removing stop words: in natural language processing, certain words are filtered out because they carry little value (so-called "stop words"); common stop words are therefore deleted from the comment text.
Adjusting the picture size: each picture is resized to 224 × 224 pixels.
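A minimal preprocessing sketch along these lines. The regular expression, the jieba segmenter and the stop-word list are illustrative choices rather than anything prescribed by the patent:

```python
import re
import jieba                      # a common Chinese word-segmentation tool (assumed choice)
from PIL import Image

STOP_WORDS = {"的", "了", "是", "我"}   # illustrative stop-word list

def preprocess_text(comment: str) -> list[str]:
    # delete "@user" mentions and the tokens that follow them
    comment = re.sub(r"@\S+", "", comment)
    # word segmentation
    words = jieba.lcut(comment)
    # remove stop words and whitespace tokens
    return [w for w in words if w.strip() and w not in STOP_WORDS]

def preprocess_image(path: str) -> Image.Image:
    # resize the picture to 224 x 224 pixels
    return Image.open(path).convert("RGB").resize((224, 224))
```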
Step 2: preliminary feature extraction is performed on the preprocessed text data and picture data with pre-trained models to obtain text features and picture features.
(1) Text feature extraction
For the word sequence $w_1, \ldots, w_m$ of a text comment obtained in step 1, the special token [CLS] is added to the beginning of the sequence and the special token [SEP] to its end, and each word $w_i$ is mapped by a pre-trained RoBERTa model to a 768-dimensional vector:

$t_i' = \mathrm{RoBERTa}(w_i), \quad t_i' \in \mathbb{R}^{768}$
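A sketch of this step with the HuggingFace transformers library. The checkpoint name is an assumption — the patent only specifies a pre-trained RoBERTa model with 768-dimensional outputs:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# assumed checkpoint; any pre-trained (Chinese) RoBERTa with 768-dim hidden states would fit
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def extract_text_features(words: list[str]) -> torch.Tensor:
    # the tokenizer adds [CLS] at the start and [SEP] at the end automatically
    inputs = tokenizer(" ".join(words), return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        out = roberta(**inputs)
    return out.last_hidden_state.squeeze(0)   # [sequence_length, 768]
```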
(2) picture feature extraction
Picture features are extracted with the advanced pre-trained model ResNet152, which has been pre-trained on roughly 14 million images. The pictures attached to user comments have already been resized to 224 × 224 pixels in step 1. The ResNet152 model then average-pools the picture over a 7 × 7 grid, producing 49 output vectors per picture, each of dimension 2048:

$\mathrm{ResNet}(I) = \{\, r_i' \in \mathbb{R}^{2048} \,\}$
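A sketch of the picture branch with torchvision, keeping the convolutional backbone and dropping the final pooling and classification layers so that the 7 × 7 grid of 2048-dimensional region vectors is exposed (the weights enum assumes torchvision ≥ 0.13; older versions use `pretrained=True`):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(resnet.children())[:-2])   # output: [B, 2048, 7, 7]
backbone.eval()

to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_features(img) -> torch.Tensor:
    x = to_tensor(img).unsqueeze(0)            # [1, 3, 224, 224]
    with torch.no_grad():
        fmap = backbone(x)                     # [1, 2048, 7, 7]
    # flatten the 7 x 7 grid into 49 vectors of dimension 2048
    return fmap.flatten(2).transpose(1, 2).squeeze(0)
```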
and 3, inputting the text features and the image features obtained in the step 2 as auxiliary information to a cross-modal coding layer, and learning the interaction between different modes by using a cross attention mechanism.
The attention mechanism mines information in the data from a set of context vectors $\{y_j\}$ that are related to a query vector $x$. An attention layer first computes a matching score $a_j$ between the query vector $x$ and each context vector $y_j$; the scores are then normalized by the softmax function, and the output of the attention layer is the sum of the context vectors weighted by the normalized scores:

$\alpha_j = \mathrm{softmax}_j(a_j) = \dfrac{\exp(a_j)}{\sum_k \exp(a_k)}$

$\mathrm{Att}\big(x, \{y_j\}\big) = \sum_j \alpha_j\, y_j$

In the transformer formulation, $q_i$, $k_i$ and $v_i$ denote the query, key and value vectors, which are computed as linear mappings of the input sequence, and

$A = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right)$

is the attention map, which predicts how the different elements of the input sequence affect each other.
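A small sketch of this computation written as standard scaled dot-product attention; the exact scoring function in the original filing is given only as an equation image, so this form is an assumption consistent with the surrounding description:

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """q: [B, Lq, d]; k, v: [B, Lk, d] — queries, keys and values are
    linear mappings of the input sequence(s)."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # matching scores
    attn_map = F.softmax(scores, dim=-1)                    # normalized scores
    return attn_map @ v, attn_map                           # weighted sum, attention map
```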
The invention adopts cross-modal transformer encoding layers, which use the text features to mine the emotional regions in the picture and use the picture features to mine the emotional words in the text description that are associated with the picture. Each layer of the cross-modal encoder consists of a bidirectional cross-attention sublayer and two feed-forward sublayers; $N_c$ such layers are stacked in the cross-modal encoder, with the output of the $k$-th layer serving as the input of the $(k{+}1)$-th layer. Inside the $k$-th layer the bidirectional cross-attention sublayer is applied first; it contains two unidirectional cross-attention sublayers, one from language to vision and one from vision to language:

$\hat{h}_i^{k} = \mathrm{CrossAtt}_{L \to V}\big(h_i^{k-1},\ \{v_1^{k-1}, \ldots, v_n^{k-1}\}\big)$

$\hat{v}_j^{k} = \mathrm{CrossAtt}_{V \to L}\big(v_j^{k-1},\ \{h_1^{k-1}, \ldots, h_m^{k-1}\}\big)$

The cross-attention layers exchange information between the two modalities and align their entities, fully mining the relevance and complementarity of the image-text data.
Step 4: the text features, picture features and multi-modal features obtained in step 3 are each input to a self-attention encoding layer, which assigns reasonable weights to information of different dimensions within the features and performs further feature selection.
The picture features obtained in step 3 are passed through global average pooling and concatenated with the text features obtained in step 2 to give the text feature $T \in \mathbb{R}^{65 \times 768}$; the text features obtained in step 3 are passed through global average pooling and concatenated with the picture features obtained in step 2 to give the picture feature $V \in \mathbb{R}^{17 \times 768}$; and the text features and picture features obtained in step 3 are concatenated to give the multi-modal joint feature $M \in \mathbb{R}^{32 \times 768}$.

The invention adopts transformer encoding layers to further encode the text feature $T$, the picture feature $V$ and the multi-modal joint feature $M$ separately. Each layer of the encoder contains a self-attention sublayer and a feed-forward sublayer; the feed-forward sublayer consists of two fully connected layers, and a residual connection and a normalization layer follow each sublayer. The text encoder and the picture encoder have $N_t$ and $N_p$ layers respectively. The self-attention layers are:

$T' = \mathrm{text\text{-}attention}(T, T, T), \quad T' \in \mathbb{R}^{65 \times 768}$

$V' = \mathrm{vision\text{-}attention}(V, V, V), \quad V' \in \mathbb{R}^{17 \times 768}$

$M' = \mathrm{multimodal\text{-}attention}(M, M, M), \quad M' \in \mathbb{R}^{32 \times 768}$
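A sketch of how the three feature matrices might be assembled and then further encoded with standard transformer encoder layers. The pooling/concatenation details, layer counts and sequence lengths are assumptions made for illustration; only the stated output shapes and the self-attention/feed-forward structure come from the text:

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 12

def build_joint_features(txt_x, img_x, txt_p, img_p):
    """txt_x / img_x: cross-modal outputs from step 3 ([Lt, 768] / [Lv, 768]);
    txt_p / img_p: preliminary features from step 2.
    Returns the text, picture and multi-modal feature matrices T, V, M."""
    T = torch.cat([txt_p, img_x.mean(dim=0, keepdim=True)], dim=0)  # GAP of picture appended to text
    V = torch.cat([img_p, txt_x.mean(dim=0, keepdim=True)], dim=0)  # GAP of text appended to picture
    M = torch.cat([txt_x, img_x], dim=0)                            # multi-modal joint feature
    return T, V, M

def make_encoder(num_layers: int) -> nn.TransformerEncoder:
    # self-attention sublayer + feed-forward sublayer, residual + normalization after each
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       dim_feedforward=4 * d_model, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

text_encoder = make_encoder(num_layers=2)   # N_t layers (value assumed)
image_encoder = make_encoder(num_layers=2)  # N_p layers (value assumed)
mm_encoder = make_encoder(num_layers=2)

# T_prime = text_encoder(T.unsqueeze(0)); V_prime = image_encoder(V.unsqueeze(0)); ...
```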
and 5, respectively inputting the text features, the picture features and the multi-mode features obtained in the step 4 into respective multilayer perceptrons to obtain emotion recognition results.
Step 4 yields three outputs: the text feature $T'$, the picture feature $V'$ and the multi-modal feature $M'$. Each output is fed into its own multilayer perceptron to obtain the probability of each category:

$P_1(y \mid T') = \mathrm{MLP}(T')$

$P_2(y \mid V') = \mathrm{MLP}(V')$

$P_3(y \mid M') = \mathrm{MLP}(M')$
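A sketch of the per-modality classification heads. The hidden width, the number of emotion classes and the mean-pooling of the sequence dimension are assumptions; the patent only specifies one multilayer perceptron per modality producing class probabilities:

```python
import torch.nn as nn

class EmotionHead(nn.Module):
    """Multilayer perceptron mapping one feature matrix to class probabilities."""
    def __init__(self, d_model: int = 768, n_classes: int = 3, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Dropout(0.1), nn.Linear(hidden, n_classes))

    def forward(self, feats):                     # feats: [B, L, 768]
        pooled = feats.mean(dim=1)                # pool over the sequence dimension
        return self.mlp(pooled).softmax(dim=-1)   # P_i(y | .)

text_head, image_head, mm_head = EmotionHead(), EmotionHead(), EmotionHead()
```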
and 6, giving respective weights of the emotion classification probabilities obtained by the classifiers, and performing decision-level fusion in a weighting mode to obtain a final emotion classification result.
Step 5 gives the emotion recognition probability vectors of the three classifiers (text, picture and multi-modal):

$P_i = (p_{i1}, p_{i2}, \ldots, p_{ic})^{\top}, \quad 1 \le i \le 3$

where $p_{ij}$ is the recognition rate of the $i$-th modality for the $j$-th emotional state, $c$ is the number of emotion classes, $\lvert P_i \rvert = 1$ and $p_{ij} \in [0, 1]$ ($1 \le i \le 3$, $1 \le j \le c$). From these, a multi-modal emotion recognition weighting matrix $W_i$ is obtained (its definition appears as an equation image in the original filing).
The classification probabilities of the modalities are fused by linear weighting:

$P = \sum_{i=1}^{3} W_i\, P_i$
The class with the highest probability is selected as the final recognition result according to the maximum rule:

$\hat{y} = \arg\max_{j}\; P_j$
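A sketch of the decision-level fusion under one simple reading of the rule: one weight per modality, linear weighting and a maximum rule. The exact construction of the weighting matrix $W_i$ is given only as an equation image in the original filing, so the weights below are placeholders:

```python
import numpy as np

def fuse_decisions(p_text, p_image, p_mm, weights=(0.3, 0.2, 0.5)):
    """p_*: class-probability vectors from the three classifiers (each sums to 1);
    weights: per-modality fusion weights (placeholder values, not from the patent)."""
    probs = np.stack([p_text, p_image, p_mm])    # shape [3, c]
    w = np.asarray(weights).reshape(-1, 1)       # shape [3, 1]
    fused = (w * probs).sum(axis=0)              # linear weighted fusion
    return int(fused.argmax()), fused            # maximum rule

# example usage
label, fused = fuse_decisions(np.array([0.2, 0.7, 0.1]),
                              np.array([0.3, 0.4, 0.3]),
                              np.array([0.1, 0.8, 0.1]))
```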
the learning rate is set to be 5e-5, the dropout rate is 0.1, the multi-head attention amount is 12, the whole model is trained for 12 epochs, the model provided by the method is optimally trained by using cross-classification cross entropy calculation based on mass Loss of back propagation, and the weight and the deviation are continuously adjusted, so that the Loss function achieves the convergence effect. The method is used for experiments on the image-text emotion recognition data set, and the accuracy is improved. And inputting the new image-text social comment sample into the trained model to obtain the emotion recognition result of the sample.
The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.

Claims (8)

1. An image-text emotion recognition method based on an attention mechanism, characterized by comprising the following steps:
step 1, preprocessing image-text comment data and converting the image-text comment data into a data format required by an input model;
step 2, performing preliminary feature extraction on the preprocessed text data and picture data by using pre-trained models to obtain text features and picture features;
step 3, inputting the text characteristics and the picture characteristics obtained in the step 2 as auxiliary information to a cross-modal coding layer, and learning the interaction between different modes by using a cross attention mechanism;
step 4, respectively inputting the text features, the picture features and the multi-modal features obtained in the step 3 into a self-attention coding layer to distribute reasonable weights for information of different dimensions in the features, and further selecting the features;
step 5, respectively inputting the text features, the picture features and the multi-mode features obtained in the step 4 into respective multilayer perceptrons to obtain emotion recognition results;
and 6, giving respective weights of the emotion classification probabilities obtained by the classifiers, and performing decision-level fusion in a weighting mode to obtain a final emotion classification result.
2. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein:
the data preprocessing comprises: deleting special symbols: on a social-media platform, user-generated content usually contains special symbols such as an "@" symbol pointing to another user; the information after this symbol often concerns user privacy and is not useful for the emotion analysis task, so the words after "@" are deleted;
word segmentation: the comment text is split into words with a common word-segmentation tool, and words become the basic unit of further text processing; removing stop words: common stop words in the comment text are deleted.
3. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein:
word sequence of text comments w obtained in step 1i,...wmWill specially mark [ CLS ]]Added to the beginning of a word sequence, special marks [ SEP ]]Added to the end of a word sequence, the word w is transformed by a pre-trained Roberta modeliMapping into 768-dimensional vector: the picture extraction employs an advanced pre-trained model Resnet 152.
4. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein: the text features and picture features obtained in step 2 are input, as auxiliary information for each other, to the cross-modal encoding layer, and a cross-attention mechanism is used to learn the interaction between the modalities; the attention mechanism mines information in the data from a set of context vectors $\{y_j\}$ related to a query vector $x$; an attention layer first computes a matching score between the query vector $x$ and each context vector $y_j$; the scores are then normalized by the softmax function, and the output of the attention layer is the sum of the context vectors weighted by the normalized scores.
5. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein:
cross-modal transformer encoding layers use the text features to mine the emotional regions in the picture and use the picture features to mine the emotional words in the text description that are associated with the picture; each layer of the cross-modal encoder consists of a bidirectional cross-attention sublayer and two feed-forward sublayers; $N_c$ layers are stacked in the cross-modal encoder, with the output of the $k$-th layer serving as the input of the $(k{+}1)$-th layer; inside the $k$-th layer the bidirectional cross-attention sublayer is applied first, and it contains two unidirectional cross-attention sublayers, one from language to vision and one from vision to language:

$\hat{h}_i^{k} = \mathrm{CrossAtt}_{L \to V}\big(h_i^{k-1},\ \{v_1^{k-1}, \ldots, v_n^{k-1}\}\big)$

$\hat{v}_j^{k} = \mathrm{CrossAtt}_{V \to L}\big(v_j^{k-1},\ \{h_1^{k-1}, \ldots, h_m^{k-1}\}\big)$

the cross-attention layers exchange information between the two modalities and align their entities, fully mining the relevance and complementarity of the image-text data.
6. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein:
the text features, picture features and multi-modal features obtained in step 3 are each input to a self-attention encoding layer, which assigns reasonable weights to information of different dimensions within the features and performs further feature selection;
the picture features obtained in step 3 are passed through global average pooling and concatenated with the text features obtained in step 2 to give the text feature $T \in \mathbb{R}^{65 \times 768}$; the text features obtained in step 3 are passed through global average pooling and concatenated with the picture features obtained in step 2 to give the picture feature $V \in \mathbb{R}^{17 \times 768}$; the text features and picture features obtained in step 3 are concatenated to give the multi-modal joint feature $M \in \mathbb{R}^{32 \times 768}$;
transformer encoding layers further encode the text feature $T$, the picture feature $V$ and the multi-modal joint feature $M$ separately; each layer of the encoder contains a self-attention sublayer and a feed-forward sublayer, the feed-forward sublayer consists of two fully connected layers, and a residual connection and a normalization layer follow each sublayer; the text encoder and the picture encoder have $N_t$ and $N_p$ layers respectively.
7. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein:
the text features, picture features and multi-modal features obtained in step 4 are input to their respective multilayer perceptrons to obtain emotion recognition results;
the three outputs obtained in step 4 are the text feature T', the picture feature V' and the multi-modal feature M'; each output is input to its own multilayer perceptron to obtain the probability of each category.
8. The attention-mechanism-based image-text emotion recognition method as claimed in claim 1, wherein:
the emotion classification probabilities obtained by the classifiers are given respective weights, and decision-level fusion is performed by weighting to obtain the final emotion classification result;
the emotion recognition rates of the three classifiers (text, picture and multi-modal) are obtained in step 5, the classification probabilities of the modalities are fused by linear weighting, and the class with the highest probability is selected as the final recognition result according to the maximum rule.
CN202110992751.6A 2021-08-27 2021-08-27 Image-text emotion recognition method based on attention mechanism Pending CN113822340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110992751.6A CN113822340A (en) 2021-08-27 2021-08-27 Image-text emotion recognition method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110992751.6A CN113822340A (en) 2021-08-27 2021-08-27 Image-text emotion recognition method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN113822340A true CN113822340A (en) 2021-12-21

Family

ID=78913663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110992751.6A Pending CN113822340A (en) 2021-08-27 2021-08-27 Image-text emotion recognition method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113822340A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140673A (en) * 2022-02-07 2022-03-04 人民中科(济南)智能技术有限公司 Illegal image identification method, system and equipment
CN114343670A (en) * 2022-01-07 2022-04-15 北京师范大学 Interpretation information generation method and electronic equipment
CN115423050A (en) * 2022-11-04 2022-12-02 暨南大学 False news detection method and device, electronic equipment and storage medium
CN116049397A (en) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion
WO2024082891A1 (en) * 2022-10-20 2024-04-25 华为技术有限公司 Data processing method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080249764A1 (en) * 2007-03-01 2008-10-09 Microsoft Corporation Smart Sentiment Classifier for Product Reviews
CN112132075A (en) * 2020-09-28 2020-12-25 腾讯科技(深圳)有限公司 Method and medium for processing image-text content
CN112598067A (en) * 2020-12-25 2021-04-02 中国联合网络通信集团有限公司 Emotion classification method and device for event, electronic equipment and storage medium
CN113065577A (en) * 2021-03-09 2021-07-02 北京工业大学 Multi-modal emotion classification method for targets
CN113158875A (en) * 2021-04-16 2021-07-23 重庆邮电大学 Image-text emotion analysis method and system based on multi-mode interactive fusion network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080249764A1 (en) * 2007-03-01 2008-10-09 Microsoft Corporation Smart Sentiment Classifier for Product Reviews
CN112132075A (en) * 2020-09-28 2020-12-25 腾讯科技(深圳)有限公司 Method and medium for processing image-text content
CN112598067A (en) * 2020-12-25 2021-04-02 中国联合网络通信集团有限公司 Emotion classification method and device for event, electronic equipment and storage medium
CN113065577A (en) * 2021-03-09 2021-07-02 北京工业大学 Multi-modal emotion classification method for targets
CN113158875A (en) * 2021-04-16 2021-07-23 重庆邮电大学 Image-text emotion analysis method and system based on multi-mode interactive fusion network

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114343670A (en) * 2022-01-07 2022-04-15 北京师范大学 Interpretation information generation method and electronic equipment
CN114140673A (en) * 2022-02-07 2022-03-04 人民中科(济南)智能技术有限公司 Illegal image identification method, system and equipment
WO2024082891A1 (en) * 2022-10-20 2024-04-25 华为技术有限公司 Data processing method and related device
CN115423050A (en) * 2022-11-04 2022-12-02 暨南大学 False news detection method and device, electronic equipment and storage medium
CN116049397A (en) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion
CN116049397B (en) * 2022-12-29 2024-01-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion

Similar Documents

Publication Publication Date Title
Yang et al. Image-text multimodal emotion classification via multi-view attentional network
CN110717047B (en) Web service classification method based on graph convolution neural network
Gong et al. Hashtag recommendation using attention-based convolutional neural network.
CN113065577A (en) Multi-modal emotion classification method for targets
CN113822340A (en) Image-text emotion recognition method based on attention mechanism
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN113378989B (en) Multi-mode data fusion method based on compound cooperative structure characteristic recombination network
CN107818084B (en) Emotion analysis method fused with comment matching diagram
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
Wang et al. Docstruct: A multimodal method to extract hierarchy structure in document for general form understanding
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN110287323A (en) A kind of object-oriented sensibility classification method
CN115034224A (en) News event detection method and system integrating representation of multiple text semantic structure diagrams
CN111597341B (en) Document-level relation extraction method, device, equipment and storage medium
CN115455970A (en) Image-text combined named entity recognition method for multi-modal semantic collaborative interaction
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN114201605A (en) Image emotion analysis method based on joint attribute modeling
CN113535949B (en) Multi-modal combined event detection method based on pictures and sentences
Yang et al. CLIP-KD: An Empirical Study of Distilling CLIP Models
Mahima et al. A text-based hybrid approach for multiple emotion detection using contextual and semantic analysis
US20240119716A1 (en) Method for multimodal emotion classification based on modal space assimilation and contrastive learning
CN117671460A (en) Cross-modal image-text emotion analysis method based on hybrid fusion
CN117539999A (en) Cross-modal joint coding-based multi-modal emotion analysis method
CN116822513A (en) Named entity identification method integrating entity types and keyword features
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination