CN117671460A - Cross-modal image-text emotion analysis method based on hybrid fusion - Google Patents

Cross-modal image-text emotion analysis method based on hybrid fusion

Info

Publication number
CN117671460A
Authority
CN
China
Prior art keywords
text
image
cross
modal
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311693001.4A
Other languages
Chinese (zh)
Inventor
袁志祥
杜姝敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Technology AHUT
Original Assignee
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Technology AHUT filed Critical Anhui University of Technology AHUT
Priority to CN202311693001.4A priority Critical patent/CN117671460A/en
Publication of CN117671460A publication Critical patent/CN117671460A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/86Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal image-text emotion analysis method based on hybrid fusion, which belongs to the technical fields of natural language processing and computer vision and comprises the following steps: single-modal feature extraction; cross-modal feature fusion; attention mechanism and combined pooling; local-to-global conversion; and decision fusion and mapping. The invention provides an image-text emotion analysis model with a cross-modal attention mechanism: the data of the two modalities are first input into fully connected layers and mapped into a common space, so that, after the influence of redundant information is reduced, the model acquires cross-modal emotion features carrying the degree of image-text correlation through modal interaction; key emotion features are emphasized through an attention mechanism, strengthening the effectiveness of feature fusion; and a hybrid fusion model combining feature fusion and decision fusion is provided, which captures the correlation between different modalities, makes fuller and more comprehensive use of the information, and improves the accuracy of the decision.

Description

Cross-modal image-text emotion analysis method based on hybrid fusion
Technical Field
The invention relates to the technical field of natural language processing and computer vision, in particular to a cross-modal image-text emotion analysis method based on hybrid fusion.
Background
Emotion analysis is the task of identifying and analyzing the emotional information contained in data by means of computer technology. The technology is widely used in applications such as brand management, market research, public opinion monitoring and user feedback analysis. Through emotion analysis, enterprises can understand consumers' attitudes and emotional responses to their products or services and thereby make better decisions and improvements.
With the continuous development of Internet technology, more and more people publish personal views on social media such as Douyin (TikTok), Xiaohongshu (RED) and Weibo in a form that combines pictures and text. Traditional emotion analysis is mainly based on text data; however, emotion analysis of text alone requires understanding the semantic and contextual information in the text, the expression of natural language is highly complex and varied, emotion is sometimes expressed implicitly or indirectly, and different people may read different emotional tendencies from the same text. In visual emotion analysis, emotion is a person's subjective experience of the feeling expressed by the elements of interest. For example, for a picture showing a person walking on a road with an umbrella on a rainy, gray day, it can be inferred from the colors, objects and scene of the picture that the expression may be a negative emotion. The information in pictures can supplement the content expressed by text, and analyzing the emotion information comprehensively improves the accuracy of emotion analysis. Therefore, taking the combination of image and text as the data source for emotion analysis is of great significance and is currently a research direction of real value.
Early studies of multi-modal emotion analysis mostly resorted to feature fusion or decision fusion. Feature fusion extracts the emotion features of the different modalities separately and fuses them by direct or weighted concatenation. One example is the CNN-Multi model: Cai et al. designed, on the basis of CNNs, a text CNN for extracting text features, an image CNN for extracting image features, and a Multi-CNN that takes the concatenated image and text features as input, with the classification result of the Multi-CNN's output feature vector serving as the final emotion analysis result. Another example is the CBOW-DA-LR model: this model for processing text and visual information is an extension of CBOW-LR, where LR is logistic regression and CBOW (Continuous Bag-of-Words) is a model that converts text into word vectors, predicting a word mainly from its context information; a new task based on a denoising autoencoder (DA) applied to the image is added, aimed at obtaining a mid-level representation. In its final form, the image description obtained from the DA is concatenated with the word representation obtained from CBOW to form the new descriptor of the word window in the tweet, and the classification task is finally performed using LR. Decision fusion classifies the emotion features of the different modalities with separate classifiers and then fuses the classification results with reasonable rules. For example, Cai Guoyong et al. input the text and image into a text emotion analysis model (Text-CNN) and an image emotion analysis model (Image-CNN) based on convolutional neural networks (CNN) respectively to obtain text features and image features; the text feature vector and image feature vector are fused and input into a word-level CNN fusion model, a phrase-level CNN fusion model and a sentence-level CNN fusion model respectively, which output image-text fusion vectors at three semantic levels; these are input into three classifiers for classification to obtain image-text emotion analysis results at the three semantic levels, and finally the three results are fused by an integrated classifier for decision to obtain the final image-text emotion analysis result.
These two approaches have the following problems. (1) The two modalities contain unnecessary, repeated or redundant information that provides no new or useful content but merely repeats information that has already appeared; such redundant information makes information transmission inefficient and repetitive, so that the model has difficulty mining the effective information. (2) When feature fusion is performed by simple feature concatenation, the information of the different modalities does not interact effectively, the computational complexity increases, and no higher weight is assigned to the features that carry emotion information, so the model cannot focus on the information with obvious emotional tendency. (3) A single decision result may be erroneous or unreliable: because the model may be unable to capture the complexity and diversity of the data, a single decision may not provide a comprehensive and accurate solution and ignores other possible angles and factors. Therefore, a cross-modal image-text emotion analysis method based on hybrid fusion is proposed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the heterogeneity of image and text data and at the multi-modal fusion methods in the prior art, and in order to further realize inter-modal fusion and cross-modal modeling and to balance the contribution of the different modalities to the task, a cross-modal image-text emotion analysis method based on hybrid fusion is provided.
The invention solves the above technical problem through the following technical solution, which comprises the following steps:
S1: extraction of unimodal vectors
Extracting text information and picture information, and correspondingly obtaining feature vectors of the text and the picture;
S2: cross-modal feature fusion
Mapping the two modalities to the same dimension, and correlating the two modal feature vectors of step S1 to obtain an image-text interaction feature vector dominated by image information and an image-text interaction feature vector dominated by text information;
S3: attention mechanism and combined pooling
Inputting the two feature vectors obtained in step S2 into a cross-modal attention mechanism, and screening the features using combined pooling after the cross-modal attention processing;
S4: local-to-global conversion
Converting from local to global the two single-modal feature vectors that have been mapped into the common space and the vector output by the attention mechanism in step S3;
S5: decision fusion and mapping
Classifying the two single-modal outputs and the cross-modal output respectively, and assigning corresponding weights to the three classification results for decision fusion, so as to obtain the final prediction result.
Further, in step S1, text information is extracted using the RoBERTa pre-training model to obtain the feature vector of the text; for the input text sequence text = {T_1, T_2, ..., T_n}, where n represents the text length, the embedded representation of the text is obtained through the RoBERTa pre-training model as follows:
T_i* = RoBERTa(text)
Further, in step S1, a ViT model is used to obtain the feature vector V_j* of the picture p_j, as follows:
V_j* = ViT(p_j)
Further, in step S2, the specific processing procedure is as follows:
S21: mapping the two modalities to the same dimension and correlating the two modal feature vectors of step S1, as follows:
t_i = W_ti(ReLU(W_t T_i* + b_t))
v_j = W_vj(ReLU(W_v V_j* + b_v))
wherein W_t, W_ti, W_v, W_vj, b_t, b_v denote the learnable weight matrices and biases of the fully connected layers, and a similarity matrix C representing the correlation scores between different image and text features is obtained;
S22: after the similarity matrix C is input into the fully connected layer, a matrix O is obtained through an activation function;
S23: the image feature vector v_j is multiplied element-wise by the matrix O to obtain the image-text interaction feature, which is then passed through a residual connection to obtain the image-dominant image-text interaction feature vector v_j*; the text-dominant image-text interaction feature vector t_i* is obtained in the same way.
Further, in step S22, the similarity matrix C and the matrix O are computed as follows:
C = (t_i W_iC)(v_j W_jC)^T
O = sigmoid(W_CO C + b_CO)
wherein W_iC and W_jC denote the learnable weight matrices of the fully connected layers that linearly transform t_i and v_j respectively, W_CO and b_CO are the learnable weight matrix and bias of the fully connected layer that linearly transforms C, t_i is the text feature vector output from the common space, and v_j is the image feature vector output from the common space.
Further, in step S23, the image-dominant image-text interaction feature vector v_j* is obtained as the residual combination of v_j with its element-wise product with O, and the text-dominant image-text interaction feature vector t_i* is obtained in the same way.
Further, in step S3, the following processing steps are specifically included:
S31: inputting the feature vectors v_j* and t_i* into a cross-modal attention mechanism, in which the image features serve as the main modality and the text features serve as the auxiliary modality, and the expressive capacity of the image features is improved by utilizing the semantic information of the text, the main-modality and auxiliary-modality features being linearly transformed by learnable weight matrices;
S32: after the cross-modal attention processing, screening the features using combined pooling, which specifically comprises retaining the most emotion-salient features with a max-pooling layer and retaining the contextual information of the features with an average-pooling layer;
S33: taking the concatenation of the outputs of the two pooling layers as the output feature of the text-assisted image, and obtaining the feature of the image-assisted text in the same way;
S34: finally, concatenating the text-assisted image feature and the image-assisted text feature obtained after feature screening, and taking the result as the image-text interaction feature.
Further, in step S4, one-dimensional convolutions are used to extract the mapping from local features to global features to obtain key features and achieve feature dimension reduction, as follows:
X^(k) = Conv1D(LayerNorm(ReLU(Conv1D(x^(k)))))
wherein x^(k) with k taking the values v, t and f refers to the input image, text and image-text interaction representations respectively.
Further, in step S5, the following processing steps are specifically included:
S51: inputting the two mapped single-modal feature vectors and the image-text interaction feature vector into a softmax layer for classification to obtain the classification results ŷ^(v), ŷ^(t), ŷ^(f) respectively; the emotion classification is calculated as
ŷ^(k) = softmax(W_m X^(k) + b_m)
wherein v, t and f in k represent the image, text and image-text interaction modalities respectively, and W_m, b_m denote the weight and bias of the fully connected layer;
S52: the final decision result is obtained by weighting the decision results of the different modalities, wherein α and β are the text classification weight and the image classification weight respectively.
Compared with the prior art, the invention has the following advantages. The cross-modal image-text emotion analysis method based on hybrid fusion provides an image-text emotion analysis model with a cross-modal attention mechanism: the data of the two modalities are first input into fully connected layers and mapped into a common space, so that, after the influence of redundant information is reduced, the model acquires cross-modal emotion features carrying the degree of image-text correlation through modal interaction; key emotion features are emphasized through an attention mechanism, strengthening the effectiveness of feature fusion; and a hybrid fusion model combining feature fusion and decision fusion is provided, which captures the correlation between different modalities, makes fuller and more comprehensive use of the information, and improves the accuracy of the decision.
Drawings
FIG. 1 is a schematic diagram of a cross-modal graph emotion analysis model according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of feature screening using combinatorial pooling in accordance with the first embodiment of the invention.
Detailed Description
The following describes in detail the examples of the present invention, which are implemented on the premise of the technical solution of the present invention, and detailed embodiments and specific operation procedures are given, but the scope of protection of the present invention is not limited to the following examples.
Example 1
The embodiment provides a technical scheme: a cross-modal image-text emotion analysis method based on hybrid fusion comprises the following steps:
step one: extracting the single-mode characteristics of the pictures and the texts;
step two: the characteristic interaction of the graph-text cross-mode;
step three: using an attention mechanism with added combination pooling;
step four: converting the two single-mode feature vectors and the image-text interaction feature vector from local to whole;
step five: and respectively classifying the two single-mode and cross-mode outputs, and distributing corresponding weights to the three classification results for decision fusion, so that a final prediction result is obtained.
The corresponding cross-modal image-text emotion analysis model of the method is shown in figure 1.
The specific contents of each step in the above-mentioned cross-mode image-text emotion analysis method are further described below.
1. Single-mode extraction
1.1 extraction of text features
The RoBERTa model (Robustly Optimized BERT Pretraining Approach) is a leading pre-training model proposed by Facebook AI. It is one of the basic models that has made breakthroughs in the field of Natural Language Processing (NLP). RoBERTa is an improved version of the BERT (Bidirectional Encoder Representations from Transformers) model that improves the performance and generalization ability of the model by optimizing and improving the training process.
Therefore, the RoBERTa pre-training model is used to extract the text information and obtain the feature vector of the text. For the input text sequence text = {T_1, T_2, ..., T_n}, where n represents the text length, the embedded representation of the text is obtained through the RoBERTa pre-training model, as shown in equation (1):

T_i* = RoBERTa(text)    (1)
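A minimal sketch of this text branch is given below, assuming the HuggingFace transformers implementation of RoBERTa; the checkpoint name ("roberta-base") and the maximum sequence length are illustrative choices rather than values fixed by this embodiment.

```python
# Sketch of step 1.1: token-level text features from a pre-trained RoBERTa.
# Assumes the HuggingFace `transformers` package; checkpoint and max_length
# are illustrative, not prescribed by this description.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")

def extract_text_features(text: str) -> torch.Tensor:
    """Return the embedded representation T_i* of an input text sequence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = roberta(**inputs)
    return outputs.last_hidden_state  # (1, n_tokens, 768)
```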
1.2 extraction of image features
Although traditional convolutional neural networks such as ResNet have made great breakthroughs on the gradient explosion and vanishing problems through residual connections and achieve good results on image classification tasks, the limitation of local receptive fields causes a certain loss of global information. The ViT (Vision Transformer) model is an image classification model that adopts the Transformer architecture: it cuts a picture into a sequence of image patches, adds position information to the patches through position embeddings, and captures the important relations between different patches with a self-attention mechanism, which improves the interpretability of the model's decisions and further learns the representation of image features; at the same time, it does not restrict the size and resolution of the image and reduces the dependence on hand-crafted feature engineering. Therefore, this embodiment adopts the ViT model to obtain the feature vector V_j* of the picture p_j, as shown in equation (2):

V_j* = ViT(p_j)    (2)
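A matching sketch for the image branch, assuming the HuggingFace transformers implementation of ViT; the checkpoint name is an illustrative choice.

```python
# Sketch of step 1.2: patch-level image features from a pre-trained ViT.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def extract_image_features(image: Image.Image) -> torch.Tensor:
    """Return the embedded representation V_j* of an input picture p_j."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state  # (1, n_patches + 1, 768)
```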
2. cross-modal feature fusion
Considering that the same group of image-text data may contain repeated information, and considering the heterogeneity of multi-modal data, the image and text data are mapped through fully connected layers into a common space of the same dimension, which reduces the influence of the redundant information in the multi-modal data; once the two modalities are mapped to the same dimension, they can be correlated with each other, as shown in equations (3) and (4):

t_i = W_ti(ReLU(W_t T_i* + b_t))    (3)
v_j = W_vj(ReLU(W_v V_j* + b_v))    (4)

where W_t, W_ti, W_v, W_vj, b_t, b_v denote the learnable weight matrices and biases of the fully connected layers, and a similarity matrix C representing the correlation scores between different image and text features is obtained.
After the similarity matrix C is input into a fully connected layer, the matrix O is obtained through an activation function. The similarity matrix C and the matrix O are computed as follows:

C = (t_i W_iC)(v_j W_jC)^T    (5)
O = sigmoid(W_CO C + b_CO)    (6)

where W_iC and W_jC denote the learnable weight matrices of the fully connected layers that linearly transform t_i and v_j respectively, W_CO and b_CO are the learnable weight matrix and bias of the fully connected layer that linearly transforms C, t_i is the text feature vector output from the common space, and v_j is the image feature vector output from the common space;
image feature vector v j The method comprises the steps of multiplying the matrix O point to obtain a characteristic vector of image-text interaction, and processing the characteristics after interaction through residual error connection in order to improve the information flow of a model and relieve the gradient disappearance problem, so as to obtain a characteristic vector v of image-text interaction based on image information j * And similarly, obtaining a characteristic vector t of image-text interaction based on text information i * The formula is as follows:
3. attention mechanism
The attention mechanism originally came from research in the field of computer vision. The inspiration comes from the attention mechanism of the human visual system: when viewing an image, we consciously focus attention on the regions of interest. A similar mechanism is therefore introduced here, giving higher weight to the information in the feature vectors that carries obvious emotional connotation. In this embodiment, image-text feature fusion is realized through an improved Transformer. The feature vectors v_j* and t_i* are input into a cross-modal attention mechanism, in which the image features serve as the main modality and the text features serve as the auxiliary modality; the expressive capacity of the image features is improved by utilizing the semantic information of the text, with the main-modality and auxiliary-modality features linearly transformed by learnable weight matrices. After the cross-modal attention processing, in order to effectively extract the important features of the modal interaction, the features are screened using combined pooling, as shown in FIG. 2.
The combined pooling process is as follows: a max-pooling layer is used to retain the most emotion-salient features; and, because emotion information is closely tied to its context, an average-pooling layer is used at the same time to retain the contextual information of the features and prevent the loss of other important information that would affect the analysis and judgment of the overall emotion. The concatenation of the outputs of the two pooling layers is taken as the output feature of the text-assisted image, and the feature of the image-assisted text is obtained in the same way. Finally, the text-assisted image feature and the image-assisted text feature obtained after feature screening are concatenated and taken as the image-text interaction feature.
4. conversion from local to global
Since the text and the image are converted into feature vectors by the RoBERTa model and the ViT model respectively, and all the words and image patches are finally connected together to form the feature vectors of the text and the image, these feature vectors are built from local information. In order to further strengthen the relations between features, one-dimensional convolutions are used to extract the mapping from local features to global features and obtain the key features, achieving feature dimension reduction, as shown in equation (19):

X^(k) = Conv1D(LayerNorm(ReLU(Conv1D(x^(k)))))    (19)

where k takes the values v, t and f, referring to the input image, text and image-text interaction representations respectively: when k = v, x^(k) is the single-modal image feature vector; when k = t, x^(k) is the single-modal text feature vector; and when k = f, x^(k) is the image-text interaction feature vector.
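A PyTorch sketch of equation (19); the channel sizes and kernel width are illustrative assumptions, and the transposes only adapt the (batch, length, feature) layout to PyTorch's Conv1d convention.

```python
import torch
import torch.nn as nn

class LocalToGlobal(nn.Module):
    """Sketch of section 4: X^(k) = Conv1D(LayerNorm(ReLU(Conv1D(x^(k))))).
    Channel sizes and kernel width are illustrative assumptions."""

    def __init__(self, in_dim=256, hidden_dim=128, out_dim=64, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(hidden_dim)
        self.conv2 = nn.Conv1d(hidden_dim, out_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, length, in_dim), k in {v, t, f}
        h = self.conv1(x.transpose(1, 2))      # (B, hidden_dim, length)
        h = torch.relu(h)
        h = self.norm(h.transpose(1, 2))       # LayerNorm over the feature axis
        h = self.conv2(h.transpose(1, 2))      # (B, out_dim, length)
        return h.transpose(1, 2)               # X^(k): (B, length, out_dim)
```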
5. Decision fusion and mapping
The two mapped single-modal feature vectors and the image-text interaction feature vector are input into a softmax layer for classification, and the classification results ŷ^(v), ŷ^(t), ŷ^(f) are obtained respectively; the emotion classification is calculated as

ŷ^(k) = softmax(W_m X^(k) + b_m)

where k takes the values v, t and f, representing the image, text and image-text interaction modalities respectively, and W_m, b_m denote the weight and bias of the fully connected layer: when k = v, X^(k) is the dimension-reduced single-modal image feature vector; when k = t, X^(k) is the dimension-reduced single-modal text feature vector; and when k = f, X^(k) is the dimension-reduced image-text interaction feature vector.
Decision fusion is a simple and effective method for fusing the features of multi-source information; it makes full use of the diversity of the information and thereby improves the reliability and robustness of the decision. Considering that the decision results of different modalities differ in importance, the final decision result is obtained by weighting the decision results of the different modalities, where α and β are the text classification weight and the image classification weight respectively, and a grid search is used to determine the weight parameters of the two single-modal decision results.
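A sketch of this weighted decision fusion and of the grid search over α and β; the text does not spell out how the remaining weight is assigned to the cross-modal branch, so giving it 1 − α − β is an assumption, as are the search step and the use of validation accuracy as the selection criterion.

```python
import itertools
import torch

def decision_fusion(y_text, y_image, y_fusion, alpha, beta):
    """Weighted decision fusion over the three softmax outputs; assigning the
    remaining weight (1 - alpha - beta) to the cross-modal branch is an assumption."""
    return alpha * y_text + beta * y_image + (1.0 - alpha - beta) * y_fusion

def grid_search_weights(y_text, y_image, y_fusion, labels, step=0.05):
    """Grid search for the two single-modal weights on a validation set."""
    best, best_acc = (0.0, 0.0), 0.0
    values = torch.arange(0, 1 + step, step).tolist()
    for alpha, beta in itertools.product(values, repeat=2):
        if alpha + beta > 1.0:
            continue
        pred = decision_fusion(y_text, y_image, y_fusion, alpha, beta).argmax(dim=-1)
        acc = (pred == labels).float().mean().item()
        if acc > best_acc:
            best, best_acc = (alpha, beta), acc
    return best, best_acc
```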
In this embodiment, cross entropy is used as the loss function to train both the single-modal and the multi-modal classification of the model; an AdamW optimizer is adopted to optimize the model parameters, and dropout and weight decay are used during optimization to prevent the model from overfitting. In the loss function, k ∈ [1, 3] with k ∈ N*, ŷ denotes the classification result predicted by the model, and y denotes the true emotion label.
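A sketch of the training objective and optimizer setup under stated assumptions: equal weights for the three cross-entropy terms and the learning-rate and weight-decay values are illustrative, not values given in this embodiment.

```python
import torch
import torch.nn as nn

# Cross-entropy applied to each of the three branches (image, text,
# image-text interaction); summing them with equal weight is an assumption.
criterion = nn.CrossEntropyLoss()

def total_loss(logits_v, logits_t, logits_f, labels):
    return (criterion(logits_v, labels)
            + criterion(logits_t, labels)
            + criterion(logits_f, labels))

def make_optimizer(model: nn.Module, lr: float = 2e-5, weight_decay: float = 0.01):
    """AdamW with weight decay, as described above; lr and weight_decay are illustrative."""
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```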
Example two
The experiments in this embodiment use the public MVSA dataset to verify the performance of the cross-modal image-text emotion analysis model (hereinafter "the present model"). The image-text pairs of the dataset are all collected from Twitter comments, and the dataset consists of two independent parts, MVSA-Single and MVSA-Multi; the emotion labels are divided into three categories: positive, negative and neutral. Each image-text pair of MVSA-Single carries one pair of image-text emotion labels, for 5129 image-text pairs in total. Each image-text pair of MVSA-Multi carries three pairs of image-text emotion labels, for 19600 image-text pairs in total. After data preprocessing, MVSA-Single contains 4511 pairs and MVSA-Multi contains 17505 pairs. The MVSA dataset was randomly divided into a training set (80%), a validation set (10%) and a test set (10%).
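A small sketch of the 80%/10%/10% random split of the preprocessed samples; the random seed is an arbitrary choice.

```python
import random

def split_dataset(samples, seed=42):
    """Randomly split preprocessed MVSA samples into 80% train, 10% val, 10% test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    train = samples[: int(0.8 * n)]
    val = samples[int(0.8 * n): int(0.9 * n)]
    test = samples[int(0.9 * n):]
    return train, val, test
```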
From the ablation data in Table 1 below, it can be seen that the performance of the model degrades to varying degrees after the model structure is changed. Removing the MLP has the worst effect: because the MVSA-Single data volume is small, the features the model can learn are limited, its generalization ability is poor, and the feature relevance between the different modalities is weakened, so its accuracy and F1 are far lower than those of the original model. Compared with the variant that removes the attention mechanism, the original model improves accuracy and F1 on MVSA-Single by 7.36% and 7.79% respectively, and on MVSA-Multi by 3.24% and 2.67% respectively, which shows that the attention mechanism makes the model pay more attention to the feature information related to emotion and helps the model locate the key parts accurately. In the variant that retains only the cross-modal decision, accuracy and F1 drop by 3.79% and 3.57% on MVSA-Single and by 2.83% and 3.26% on MVSA-Multi respectively, which shows that combining the decision results of different modalities is necessary: it makes full use of the information of the different modalities and improves the accuracy of emotion analysis.
Table 1 data from ablation experiments
The present model was compared to the baseline model as follows:
(1) SentiBank: 1200 adjective-noun pairs (ANP) are mined from each picture as image features, and image emotion classification is achieved through the ANP.
(2) SentiStrength: semantic analysis and emotion-intensity scoring are performed based on the text content and sentence structure. The present model is far superior to these two single-modal models.
(3) SentiBank+SentiStrength: the classification results of SentiBank and SentiStrength are combined to realize fusion at the decision stage.
(4) CNN-Multichannel: features are extracted in the CNN using a plurality of channels of a plurality of different filter widths.
(5) CNN-Multi: and respectively designing a text CNN model and an image CNN model based on the CNN, splicing the extracted image features and the text features, and inputting the spliced image features and the spliced text features into a multi-CNN for classifying image-text fusion.
(6) CBOW+DA+LR: text information is acquired using a CBOW model with negative sampling, and the visual information is processed based on a denoising autoencoder.
(7) DNN-LR: the multi-depth convolutional neural network consisting of text DNN and image DNN uses logistic regression to fuse classification results.
Compared with the single-modal models SentiBank (image), SentiStrength and CNN-Multichannel (text), the accuracy of the present model is greatly improved, which shows that, when the emotion information of the image and text modalities is combined, multi-modal emotion analysis has a clear advantage over considering single-modal emotion information alone. Compared with the multi-modal baseline models, the SentiBank+SentiStrength model performs worst: it only performs decision fusion of the SentiBank and SentiStrength models, with no interaction between the feature information of the two modalities, and its effect is far below that of the deep-learning-based methods. The CBOW+DA+LR model concatenates the obtained image-text features and classifies them directly with logistic regression; compared with it, the present model improves accuracy and F1 on the MVSA-Single dataset by 10.24% and 10.86% respectively, and on the MVSA-Multi dataset by 6.35% and 3.75% respectively. The CNN-Multi model extracts the features of each single modality and then directly concatenates the two single-modal features, so the features of different modalities lack relevance and no attention mechanism is added; compared with it, the present model improves accuracy and F1 on the MVSA-Single dataset by 12.9% and 16.01% respectively, and on the MVSA-Multi dataset by 4.18% and 3.29% respectively. The DNN-LR model, a multi-deep-convolutional-neural-network model composed of a text DNN and an image DNN, fuses the probability results without cross-modal fusion; compared with it, the present model improves accuracy and F1 on the MVSA-Single dataset by 12.68% and 13.35% respectively, and on the MVSA-Multi dataset by 2.71% and 1.15% respectively.
The effect comparisons are shown in tables 2 and 3 below:
table 2 comparison table of model effects in single mode
Table 3 comparison table of model effects in single mode
In summary, in the cross-modal image-text emotion analysis method based on hybrid fusion of this embodiment, after the single-modal features of the image and the text are extracted, the image features and text features are mapped into a common space, which reduces the influence of redundant information on the model and lets the model exploit the correlation of the multi-modal information more effectively, so that the image and text features interact better. Second, after the image-text interaction, the importance of the interaction features is highlighted through the attention mechanism, and combined pooling is applied to the feature vectors with salient emotional features, retaining the contextual information of the features and preventing the information loss that one-sided information would cause. The feature vectors of the text and the image extracted by the RoBERTa and ViT models, as well as the subsequent single-modal and cross-modal feature interactions, are based on the local features of words and image patches; the mapping from local features to global features is realized through two one-dimensional convolutions and LayerNorm, and after feature dimension reduction a softmax is applied to the two single-modal feature vectors and the cross-modal feature vector respectively. Since the decision results of different modalities may carry different biases and limitations, assigning certain weights to the two single-modal decision results and letting them assist the cross-modal prediction improves the reliability of the model.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (9)

1. The cross-modal image-text emotion analysis method based on hybrid fusion is characterized by comprising the following steps:
S1: extraction of unimodal vectors
Extracting text information and picture information, and correspondingly obtaining feature vectors of the text and the picture;
S2: cross-modal feature fusion
Mapping the two modalities to the same dimension, and correlating the two modal feature vectors of step S1 to obtain an image-text interaction feature vector dominated by image information and an image-text interaction feature vector dominated by text information;
S3: attention mechanism and combined pooling
Inputting the two feature vectors obtained in step S2 into a cross-modal attention mechanism, and screening the features using combined pooling after the cross-modal attention processing;
S4: local-to-global conversion
Converting from local to global the two single-modal feature vectors that have been mapped into the common space and the vector output by the attention mechanism in step S3;
S5: decision fusion and mapping
Classifying the two single-modal outputs and the cross-modal output respectively, and assigning corresponding weights to the three classification results for decision fusion, so as to obtain the final prediction result.
2. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 1, wherein in step S1, text information is extracted using the RoBERTa pre-training model to obtain the feature vector of the text; for the input text sequence text = {T_1, T_2, ..., T_n}, where n represents the text length, the embedded representation of the text is obtained through the RoBERTa pre-training model as follows:
T_i* = RoBERTa(text).
3. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 1, wherein in step S1, a ViT model is adopted to obtain the feature vector V_j* of the picture p_j as follows:
V_j* = ViT(p_j).
4. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 1, wherein in step S2 the specific processing procedure is as follows:
S21: mapping the two modalities to the same dimension and correlating the two modal feature vectors of step S1, as follows:
t_i = W_ti(ReLU(W_t T_i* + b_t))
v_j = W_vj(ReLU(W_v V_j* + b_v))
wherein W_t, W_ti, W_v, W_vj, b_t, b_v denote the learnable weight matrices and biases of the fully connected layers, and a similarity matrix C representing the correlation scores between different image and text features is obtained;
S22: after the similarity matrix C is input into the fully connected layer, a matrix O is obtained through an activation function;
S23: the image feature vector v_j is multiplied element-wise by the matrix O to obtain the image-text interaction feature, which is then passed through a residual connection to obtain the image-dominant image-text interaction feature vector v_j*; the text-dominant image-text interaction feature vector t_i* is obtained in the same way.
5. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 4, wherein in step S22, the similarity matrix C and the matrix O are computed as follows:
C = (t_i W_iC)(v_j W_jC)^T
O = sigmoid(W_CO C + b_CO)
wherein W_iC and W_jC denote the learnable weight matrices of the fully connected layers that linearly transform t_i and v_j respectively, W_CO and b_CO are the learnable weight matrix and bias of the fully connected layer that linearly transforms C, t_i is the text feature vector output from the common space, and v_j is the image feature vector output from the common space.
6. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 5, wherein in step S23, the image-dominant image-text interaction feature vector v_j* is obtained as the residual combination of v_j with its element-wise product with O, and the text-dominant image-text interaction feature vector t_i* is obtained in the same way.
7. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 6, wherein step S3 specifically comprises the following processing steps:
S31: inputting the feature vectors v_j* and t_i* into a cross-modal attention mechanism, in which the image features serve as the main modality and the text features serve as the auxiliary modality, and the expressive capacity of the image features is improved by utilizing the semantic information of the text, the main-modality and auxiliary-modality features being linearly transformed by learnable weight matrices;
S32: after the cross-modal attention processing, screening the features using combined pooling, which specifically comprises retaining the most emotion-salient features with a max-pooling layer and retaining the contextual information of the features with an average-pooling layer;
S33: taking the concatenation of the outputs of the two pooling layers as the output feature of the text-assisted image, and obtaining the feature of the image-assisted text in the same way;
S34: finally, concatenating the text-assisted image feature and the image-assisted text feature obtained after feature screening, and taking the result as the image-text interaction feature.
8. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 7, wherein in step S4, one-dimensional convolutions are used to extract the mapping from local features to global features to obtain key features and achieve feature dimension reduction, as follows:
X^(k) = Conv1D(LayerNorm(ReLU(Conv1D(x^(k)))))
wherein x^(k) with k taking the values v, t and f refers to the input image, text and image-text interaction representations respectively.
9. The cross-modal image-text emotion analysis method based on hybrid fusion according to claim 8, wherein step S5 specifically comprises the following processing steps:
S51: inputting the two mapped single-modal feature vectors and the image-text interaction feature vector into a softmax layer for classification to obtain the classification results ŷ^(v), ŷ^(t), ŷ^(f) respectively, wherein the superscripts v, t and f denote the image, text and image-text interaction modalities respectively, and W_m, b_m denote the weight and bias of the fully connected layer used for the emotion classification;
S52: the final decision result is obtained by weighting the decision results of the different modalities, wherein α and β are the text classification weight and the image classification weight respectively.
CN202311693001.4A 2023-12-05 2023-12-05 Cross-modal image-text emotion analysis method based on hybrid fusion Pending CN117671460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311693001.4A CN117671460A (en) 2023-12-05 2023-12-05 Cross-modal image-text emotion analysis method based on hybrid fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311693001.4A CN117671460A (en) 2023-12-05 2023-12-05 Cross-modal image-text emotion analysis method based on hybrid fusion

Publications (1)

Publication Number Publication Date
CN117671460A true CN117671460A (en) 2024-03-08

Family

ID=90078558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311693001.4A Pending CN117671460A (en) 2023-12-05 2023-12-05 Cross-modal image-text emotion analysis method based on hybrid fusion

Country Status (1)

Country Link
CN (1) CN117671460A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities

Similar Documents

Publication Publication Date Title
Prudviraj et al. M-FFN: multi-scale feature fusion network for image captioning
Zhang et al. Weakly supervised emotion intensity prediction for recognition of emotions in images
WO2023280064A1 (en) Audiovisual secondary haptic signal reconstruction method based on cloud-edge collaboration
Hoang et al. Context-aware emotion recognition based on visual relationship detection
CN109829499B (en) Image-text data fusion emotion classification method and device based on same feature space
CN117671460A (en) Cross-modal image-text emotion analysis method based on hybrid fusion
Rajalakshmi et al. Static and dynamic isolated Indian and Russian sign language recognition with spatial and temporal feature detection using hybrid neural network
CN116432019A (en) Data processing method and related equipment
CN113064968A (en) Social media emotion analysis method and system based on tensor fusion network
CN114201605A (en) Image emotion analysis method based on joint attribute modeling
Gandhi et al. Multimodal sentiment analysis: review, application domains and future directions
CN117574904A (en) Named entity recognition method based on contrast learning and multi-modal semantic interaction
Varshney et al. A Comprehensive Survey on Event Analysis Using Deep Learning
CN115099234A (en) Chinese multi-mode fine-grained emotion analysis method based on graph neural network
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Luo et al. Multimodal reconstruct and align net for missing modality problem in sentiment analysis
Mou et al. Compressed video action recognition with dual-stream and dual-modal transformer
Uddin et al. Dynamic facial expression understanding using deep spatiotemporal LDSP on spark
Ma et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos
CN117171303A (en) Joint multi-mode aspect-level emotion analysis method based on self-adaptive attention fusion
Liu et al. Image-text fusion transformer network for sarcasm detection
CN116704398A (en) Short video value evaluation method with omnibearing and multi-information fusion
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN117216536A (en) Model training method, device and equipment and storage medium
CN116049393A (en) Aspect-level text emotion classification method based on GCN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination