CN115577161A - Multi-modal emotion analysis model fusing emotion resources - Google Patents

Multi-modal emotion analysis model fusing emotion resources

Info

Publication number
CN115577161A
CN115577161A (application number CN202211262518.3A)
Authority
CN
China
Prior art keywords
emotion
modal
text
information
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211262518.3A
Other languages
Chinese (zh)
Inventor
Peng Junjie (彭俊杰)
Li Aiguo (李爱国)
Li Song (李松)
Li Lu (李璐)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuzhou Daxi Energy Technology Co ltd
Original Assignee
Xuzhou Daxi Energy Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuzhou Daxi Energy Technology Co ltd filed Critical Xuzhou Daxi Energy Technology Co ltd
Priority to CN202211262518.3A priority Critical patent/CN115577161A/en
Publication of CN115577161A publication Critical patent/CN115577161A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/906 Clustering; Classification
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-modal emotion analysis model fusing emotion resources, belonging to the technical field of emotion analysis. The model comprises: a unimodal feature extraction layer module for initial extraction of text, visual and auditory modal features; a unimodal feature depth extraction layer module that captures the dynamics within each modality using a Transformer and designs an emotion word classification prediction task for emotion embedding learning and extraction; a cross-modal feature interactive learning layer module that uses the emotion embedding to complete multi-modal feature interactive learning, so that the other modal features can perceive the emotion information in the text; and a prediction layer module that feeds the emotion features learned by the first three layers into a deep neural network to complete the final prediction task. The model is the first to perform emotion analysis with emotion resources as the dominant information, and an emotion word classification prediction task is designed for emotion knowledge learning.

Description

Multi-modal emotion analysis model fusing emotion resources
Technical Field
The invention relates to the technical field of emotion analysis, and in particular to a multi-modal emotion analysis model fusing emotion resources.
Background
Emotion is one of the most important kinds of information in interpersonal communication. Emotional expression helps people understand each other's attitudes toward things and promotes communication and mutual understanding. With the development of artificial intelligence technology, especially the rapid rise of social media platforms represented by Facebook and YouTube and the popularization of artificial intelligence products such as intelligent customer service and catering service robots, emotion analysis technology has received more and more attention. With emotion analysis technology, products and media platforms can understand users' emotions and intentions more accurately, so emotion analysis has attracted wide attention and in-depth research from both academia and industry.
Emotion analysis aims to identify users' opinions and attitudes toward things or people, and this research plays an important role in understanding the intentions of people in different groups. Traditional emotion analysis mainly analyzes and identifies the emotion of a research object through single-modal data, but single-modal data is not robust: it is easily affected by subjective factors and the external environment, so the recognition rate is not high. For example, in scenarios such as face occlusion or noise interference in speech, the effective information contained in a single modality is reduced and the accuracy of emotion analysis drops.
In order to make full use of the information contained in data from multiple modalities and improve the accuracy of emotion analysis, multi-modal emotion analysis has drawn extensive attention. Multi-modal emotion analysis aims to judge a person's attitude or emotional tendency from information in multiple dimensions, such as speech signals and visual signals. Compared with single-modal data, multi-modal data contains richer emotion information, and mining the complementary information among modalities with effective fusion methods can effectively improve the accuracy of emotion analysis and reduce classification errors. At present, there are many methods for fusing multi-modal data, such as graph-based multi-modal fusion and LSTM-based multi-modal fusion. Research and experiments in recent years show that attention-based multi-modal fusion is more advantageous in terms of performance and efficiency, and text-centered multi-modal attention fusion in particular has achieved remarkable results.
Although the research described above has made progress, challenges remain in multi-modal emotion analysis. Emotion resources (such as positive words, negative words and adverbs) are important carriers of emotion polarity in text and play an important role in text emotion classification tasks such as sentence-level emotion classification, aspect-level emotion classification and opinion extraction. In the field of multi-modal emotion analysis, however, little research has examined the usefulness of these emotion resources or incorporated the emotion knowledge contained in the text modality as key information into the multi-modal emotion analysis task, so emotion information is not sufficiently mined and utilized.
To address these problems, the application designs an attention-based multi-modal emotion analysis model that performs modal fusion with emotion resources as the dominant information. It fully mines emotion resource information from the text and fuses it with the text modality as enhanced information, models the complex dependencies among different modalities through a multi-head-attention-based framework, and learns the information differences among modalities, so as to realize the learning and fusion of multi-modal features.
Disclosure of Invention
The invention aims to provide a multi-modal emotion analysis model fusing emotion resources, so as to solve the problems raised in the background art.
In order to achieve this purpose, the invention provides the following technical scheme: the multi-modal emotion analysis model fusing emotion resources comprises:
a unimodal feature extraction layer module for initial extraction of text, visual and auditory modal features;
a unimodal feature depth extraction layer module, which captures the dynamics within each modality using a Transformer and designs an emotion word classification prediction task for emotion embedding learning and extraction;
a cross-modal interactive learning layer module, which uses the emotion embedding to complete multi-modal feature interactive learning, so that the other modal features can perceive the emotion information in the text;
and a prediction layer module, in which the feature representations learned by the first three layers are fed into a deep neural network to complete the final prediction task.
Furthermore, the unimodal feature depth extraction layer module comprises an emotion resource acquisition and representation module and a unimodal language feature learning module. The emotion resource acquisition and representation module acquires emotion resources based on an opinion lexicon, classifies them, and then performs emotion word classification prediction through the formulas; the unimodal language feature learning module can capture long-range interdependent features within a single modality.
Furthermore, the cross-modal interactive learning layer module comprises a multi-modal emotion perception module that substitutes the emotion embedding for the text features, a visual modality learning module that performs inter-modal learning of emotion features with the visual modality, and an auditory modality learning module that performs inter-modal learning of emotion features with the auditory modality.
Further, the multi-modal emotion analysis method comprises the following steps:
s1: single-mode feature extraction:
a given utterance usually contains three modalities: text, visual and acoustic. For the text modality, considering that the large-scale pre-trained language model BERT has strong language representation and feature extraction capabilities, BERT is used for initial feature extraction; for the acoustic and visual modalities, Bi-LSTM is used for feature extraction, taking into account the contextual correlation and temporal ordering within each modality;
s2: depth feature extraction:
a1: acquisition and representation of emotion resources: in addition to the three modalities of text, vision and acoustics commonly used in traditional multi-modal emotion analysis, emotion words can provide a more accurate direction for feature learning. Through the learning of emotion knowledge, the initial text feature vector can learn the emotion word information it contains, so that it carries clearer semantic information and suffers less noise interference;
the application selects Bing Liu's opinion lexicon as the main basis of emotion resources and annotates emotion word labels on the text modality of the experimental datasets. The lexicon lists positive and negative English emotion words, including non-standard emotion words such as misspellings and slang variants; accordingly, the positions of emotion words and non-emotion words in the original text are labeled according to the opinion lexicon, and short sentences are padded with 0s at the end to keep the label lengths consistent;
in addition, the application designs an auxiliary classification task for emotion word prediction. To ensure that the emotion information can be effectively fused into the high-dimensional feature representation, feature compression is performed while preserving nonlinearity, so that the emotion feature values lie between 0 and 1, conform to a probability interpretation and reduce information loss;
a2: unimodal language feature learning: the Transformer alleviates the slow training of RNNs and improves computational parallelism, which makes it well suited to feature learning over unaligned modalities, and it plays an important role in natural language processing. Therefore, in order to capture long-range interdependent features within a single modality and extract richer semantic information from the context representation, an RNN-based structure is not used to capture modality sequence information; instead, a Transformer-based structure is used to generate the sequence features of each modality. Furthermore, unlike single-head attention, the multi-head self-attention mechanism, a key component of the Transformer, can capture multiple correlations within a modality by introducing multiple queries;
a3: cross-modal interactive learning: considering that the text cannot contain all emotion information, and owing to the limitations of the opinion lexicon, the annotation work only labels emotion words; words such as modal particles and adverbs of degree, which can also affect the emotion expressed by a modality, are not labeled, and complete coverage of new Internet slang cannot be guaranteed. Therefore, using the emotion embedding as the dominant factor for inter-modal feature learning is not sufficient for accurate emotion analysis. In addition, the visual and acoustic modalities contain extra information not present in the text modality, which is helpful for learning and extracting emotion information. Considering these points, the application introduces information from the other two modalities to compensate for the deficiencies of the emotion embedding representation;
a4: prediction layer: the obtained feature representations are concatenated and fed into a deep neural network to complete the final prediction;
s3: experimental setup and analysis:
b1: statistics of the data set: three public multi-modal emotion analysis data sets are selected for experiments, MOSI, MOSEI and IEMOCAP;
CMU-MOSI: the dataset consists of 2199 video monologue segments from YouTube. Each segment has an emotion intensity label in the range [-3, +3], where +3 denotes a strongly positive emotion and -3 a strongly negative emotion. The training set, validation set and test set of the dataset contain 1284, 229 and 686 video clips, respectively;
CMU-MOSEI: the dataset is an improvement on CMU-MOSI, with richer video clips and more human subjects. It contains 22856 video monologue segments from YouTube; the training set, validation set and test set consist of 16326, 1871 and 4659 video segments, respectively;
IEMOCAP: the dataset contains 4453 dialogue segments annotated with nine emotion categories such as happiness, anger, sadness and neutrality. Because some emotion labels are imbalanced, the first four emotion labels are chosen for the experiments. The training set, validation set and test set of the dataset consist of 2717, 798 and 938 video segments, respectively;
b2:Baselines:
TFN: TFN fuses the interactions of single mode, double mode and triple mode, and uses Cartesian product to fuse the tensor;
LMF: the LMF is improved on the basis of the TFN, and the calculation memory during multi-mode tensor fusion is reduced by using a low-rank decomposition factor;
MulT: the MulT utilizes a cross-modal attention module to carry out information interaction between the modalities on the basis of a Transformer encoder;
ICCN: the ICCN attaches the acoustic modal information and the visual modal information to the text modal, and performs multi-modal fusion by exploring the hidden relation between the language information and the non-language information;
TCSP: the TCSP takes a text as a center, learns the shared and private semantics of the modalities by using a cross-modality prediction task, and performs multi-modality emotion prediction by fusing semantic features;
BIMHA: BIMHA considers the relative importance of and relationships between modality pairs and extends multi-head attention to enhance the information;
HEMT: HEMT proposes a method based on holographic reduced representation, a compressed version of the outer-product model, to promote cross-modal high-order fusion;
PMR: PMR introduces a message center on the basis of the cross-modal Transformer to interact with each modality; through repeated iterations it completes the complementation of common information and the internal features of each modality, and uses the finally generated features for emotion prediction;
b3: parameter settings: the method uses 768-dimensional BERT pre-trained word vectors as text features; the number of attention heads is 8, the dropout of multi-task learning is 0.2, and the dropout of the prediction layer is 0.5. For the MOSI dataset, the initial learning rate is 3e-5, the batch size is 16, and the numbers of hidden units of the text, acoustic and visual modalities are 128, 16 and 32, respectively. For MOSEI, the initial learning rate is 1e-5, the batch size is 32, and the numbers of hidden units of the text, acoustic and visual modalities are 128, 32 and 64, respectively. For the IEMOCAP dataset, the initial learning rates for the four emotions of happiness, anger, sadness and neutrality are 3e-3, 4e-4, 7e-4 and 6e-4, respectively, the batch size is 32, and the numbers of hidden units of the text, acoustic and visual modalities are 128, 32 and 16, respectively;
the present application uses the Adam optimizer for training together with an early-stopping strategy, and uses 6 different evaluation metrics to evaluate the performance of the model: for classification tasks, the binary classification accuracy (Acc-2), five-class accuracy (Acc-5), seven-class accuracy (Acc-7) and F1 score (F1-score) are used; for the regression task, the mean absolute error (MAE) and Pearson correlation coefficient (Corr) are used. For all metrics except MAE, higher values are better; for MAE, lower is better;
Result analysis:
c1: analysis of the experimental results on CMU-MOSI: the performance of TFN and LMF is poor, because these two methods do not consider the difference in the proportion of effective information between modalities. MulT and BIMHA both pay attention to the importance of information interaction between modalities and respectively combine a Transformer framework and a multi-head attention mechanism to mine and capture inter-modal information, so both improve on earlier work, but they do not reach the expected level. Among these methods, ICCN and PMR perform more prominently: ICCN provides useful guidance for cross-modal fusion by exploiting the hidden relation between the text modality and non-text modalities, and PMR pays attention to the asynchrony of unaligned multi-modal data and introduces a message center to assist the information interaction among modalities. Although both methods achieve good performance, the method of the present application still outperforms them, because it considers the complementarity and differences among modalities, notices the significance of emotion resources for modality fusion, and improves the efficiency of inter-modal information fusion and extraction through the participation of emotion knowledge embedding;
in addition, the model of the present application exhibits excellent performance on both the classification and regression tasks. For the classification task, the method achieves optimal results in terms of both F1 score and classification accuracy, which improve by 0.82% and 0.75% respectively over the best baseline results. For the regression task, the model also performs outstandingly; in particular, the correlation (Corr) increases by 8.3%, which is a significant improvement;
c2: the experimental results on CMU-MOSEI report the performance of the model on the regression and classification tasks, respectively. For the regression task, the method of the present application performs very well on both the mean absolute error (MAE) and the correlation (Corr), which are 1.3% and 7.9% better than the best baseline results, respectively. For the classification task, the method also achieves the optimal results: compared with the best baseline, the F1 score and binary accuracy are 0.55% and 0.6% higher, and the five-class and seven-class accuracies are 0.51% and 0.09% higher. The worst-performing method is TFN, whose evaluation metrics are far below those of the present method; in particular, on the F1 score and binary accuracy the present method is 5.74% and 6.28% higher than TFN, because TFN neglects the different contributions of the modalities during fusion. ICCN and BIMHA come close and show good experimental results, but they are still inferior to the present method on all evaluation metrics, probably because the present method considers both the differences in modal interaction information that BIMHA focuses on and the text information that ICCN focuses on; in addition, it uses a multi-task joint learning framework to improve the generalization of the model and blends emotion knowledge embedding into the inter-modal feature learning process, so it achieves better performance;
c3: the results on the IEMOCAP dataset show that the method is highly competitive on negative emotions such as Angry and Sad; in particular, the binary classification accuracy and F1 score for the Angry emotion are 0.54% and 0.24% higher, respectively. This indicates that, compared with other methods, the present method is more sensitive to negative emotions and perceives them more easily. Under the Happy emotion label, the method also achieves the second-best result; although ICCN and BIMHA exceed the present method in binary classification accuracy and F1 score respectively, the gap between them is extremely small. The reason the present method does not reach the optimal result may be that the labels used in emotion word classification prediction are coarse-grained emotion vocabulary labels rather than fine-grained emotion labels, so the model does not learn emotion knowledge accurately enough, which affects the subsequent emotion classification results;
s4: ablation experiments: first, in the unimodal emotion prediction task, prediction using the text modality performs far better than the other two modalities. This is probably because the text features are obtained with a model trained on a large-scale corpus while the other two modalities are extracted manually, so the text modality may contain richer information and is well suited to feature learning and emotion prediction;
second, in the bimodal emotion prediction task, the performance of the model is much worse when only the acoustic and visual modalities are used, which demonstrates the necessity of the text modality: it contains more useful information than the other modalities.
Compared with the prior art, the invention has the beneficial effects that:
1. to the best of the applicant's knowledge, the model is the first to perform multi-modal emotion analysis with emotion resources as the dominant information;
2. emotion word labels are annotated on the text information, an emotion word classification prediction task is designed for emotion knowledge learning, and the relationship between emotion embedding and text embedding is discussed;
3. experiments on the three datasets CMU-MOSI, CMU-MOSEI and IEMOCAP show that, compared with existing multi-modal emotion analysis methods, the method has obvious advantages and better generalization.
Drawings
FIG. 1 is a schematic block diagram of a multi-modal emotion analysis model of the present invention;
FIG. 2 is a MANSEL model overall architecture of the present invention;
FIG. 3 is a schematic block diagram of the multi-modal emotion recognition module of the present invention.
In the figure: 1. a single modal feature extraction layer module; 2. a single modal feature depth extraction layer module; 21. an emotion resource acquisition and expression module; 22. a unimodal language feature learning module; 3. a cross-modal interactive learning layer module; 31. a multi-modal emotion perception module; 32. a visual modality learning module; 33. an auditory modality learning module; 4. and a prediction layer module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a technical solution: the multi-modal emotion analysis model fusing emotion resources comprises:
a unimodal feature extraction layer module 1 for initial extraction of text, visual and auditory modal features;
a unimodal feature depth extraction layer module 2, which captures the dynamics within each modality using a Transformer and designs an emotion word classification prediction task for emotion embedding learning and extraction;
a cross-modal interactive learning layer module 3, which uses the emotion embedding to complete multi-modal feature interactive learning, so that the other modal features can perceive the emotion information in the text;
and a prediction layer module 4, in which the feature representations learned by the first three layers are fed into a deep neural network to complete the final prediction task.
The unimodal feature depth extraction layer module 2 comprises an emotion resource acquisition and representation module 21 and a unimodal language feature learning module 22. The emotion resource acquisition and representation module acquires emotion resources based on an opinion lexicon, classifies them, and then performs emotion word classification prediction through the formulas; the unimodal language feature learning module can capture long-range interdependent features within a single modality.
The cross-modal interactive learning layer module 3 comprises a multi-modal emotion perception module 31 that substitutes the emotion embedding for the text features, a visual modality learning module 32 that performs inter-modal learning of emotion features with the visual modality, and an auditory modality learning module 33 that performs inter-modal learning of emotion features with the auditory modality.
The multi-modal emotion analysis method comprises the following steps:
s1: unimodal feature extraction:
a given utterance usually contains three modalities: text, visual and acoustic. For the text modality, considering that the large-scale pre-trained language model BERT has strong language representation and feature extraction capabilities, BERT is used for initial feature extraction; for the acoustic and visual modalities, Bi-LSTM is used for feature extraction, taking into account the contextual correlation and temporal ordering within each modality;
using the formulas:

$h_t = \mathrm{BERT}(x_t; \theta_t)$   (1)

$h_a = \mathrm{BiLSTM}(x_a; \theta_a)$   (2)

$h_v = \mathrm{BiLSTM}(x_v; \theta_v)$   (3)

where $h_a \in \mathbb{R}^{d_a}$ and $h_v \in \mathbb{R}^{d_v}$ denote the outputs of the final hidden layer of the Bi-LSTM, and $h_t \in \mathbb{R}^{d_t}$ denotes the output of the first word vector of the last BERT layer. $d_t$, $d_a$ and $d_v$ denote the dimensions of the text, acoustic and visual feature vectors, respectively; $\theta_t$ denotes the learnable parameters of BERT, and $\theta_a$ and $\theta_v$ denote the learnable parameters of the LSTMs used for acoustic and visual feature extraction, respectively;
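A minimal sketch of this unimodal extraction step is given below, assuming PyTorch and the HuggingFace transformers library; the module name, the input dimensions and the choice of bert-base-uncased are illustrative assumptions rather than the exact configuration used in the application.

```python
import torch.nn as nn
from transformers import BertModel

class UnimodalFeatureExtractor(nn.Module):
    """Initial feature extraction: BERT for text (Eq. 1), Bi-LSTM for acoustic/visual (Eqs. 2-3)."""
    def __init__(self, d_a_in=74, d_v_in=35, d_a=16, d_v=32):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # pre-trained language model
        self.lstm_a = nn.LSTM(d_a_in, d_a, batch_first=True, bidirectional=True)
        self.lstm_v = nn.LSTM(d_v_in, d_v, batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask, x_a, x_v):
        # h_t: contextual token representations from the last BERT layer
        h_t = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # h_a, h_v: Bi-LSTM outputs capturing contextual correlation and temporal order
        h_a, _ = self.lstm_a(x_a)
        h_v, _ = self.lstm_v(x_v)
        return h_t, h_a, h_v
```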
s2: depth feature extraction:
a1: acquisition and representation of emotion resources: in addition to the three modalities of text, vision and acoustics commonly used in traditional multi-modal emotion analysis, emotion words can provide a more accurate direction for feature learning. Through the learning of emotion knowledge, the initial text feature vector can learn the emotion word information it contains, so that it carries clearer semantic information and suffers less noise interference;
the application selects Bing Liu's opinion lexicon as the main basis of emotion resources and annotates emotion word labels on the text modality of the experimental datasets. The lexicon lists positive and negative English emotion words, including non-standard emotion words such as misspellings and slang variants; accordingly, the positions of emotion words and non-emotion words in the original text are labeled according to the opinion lexicon, and short sentences are padded with 0s at the end to keep the label lengths consistent, as sketched below;
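The labeling step can be sketched as follows, assuming the positive and negative word lists of the opinion lexicon have been loaded into Python sets; the function name, the binary label convention (1 for an emotion word, 0 otherwise) and the exact padding behaviour are illustrative.

```python
def label_emotion_words(tokens, pos_words, neg_words, max_len):
    """Mark emotion-word positions in a tokenized sentence using an opinion lexicon.

    tokens    : lower-cased word tokens of the text modality
    pos_words : set of positive opinion words
    neg_words : set of negative opinion words
    max_len   : fixed label length; shorter sentences are padded with 0s at the end
    """
    labels = [1 if t in pos_words or t in neg_words else 0 for t in tokens]
    return (labels + [0] * max_len)[:max_len]

# Example: "the plot was surprisingly good but the ending felt awful"
# yields labels with 1s at the positions of "good" and "awful" and 0s elsewhere.
```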
in addition, the application designs an auxiliary classification task for emotion word prediction. To ensure that the emotion information can be effectively fused into the high-dimensional feature representation, feature compression is performed while preserving nonlinearity, so that the emotion feature values lie between 0 and 1, conform to a probability interpretation and reduce information loss;
as shown in formulas (4)-(6), the text feature $h_t$ is passed through a fully connected layer for dimension unification; the feature representation $H$ captured by the fully connected layer is squared and summed, and then compressed with a nonlinear function, so that short vectors shrink to almost zero length and long vectors shrink to a length slightly below 1:

$H = W_s h_t + b_s$   (4)

$t_i = \lVert H_i \rVert^2$   (5)

$E_{s,i} = \dfrac{t_i}{1+t_i} \cdot \dfrac{H_i}{\lVert H_i \rVert}$   (6)

where $W_s$ and $b_s$ are the weight and bias of the fully connected layer, and the output of the fully connected layer is $H \in \mathbb{R}^{L \times d_s}$. The vector obtained after squaring and summing $H$ is denoted $t = \{t_i : 1 \le i \le L\}$, where $t_i$ is the squared-sum output of the $i$-th word representation $H_i$, $d_s$ is the dimension of the feature vector $H$, $L$ is the sequence length, $E_s$ is the finally learned emotion feature vector, and $\lVert \cdot \rVert$ denotes the norm.

The learned emotion feature vector $E_s$ is sent to the emotion word classification prediction layer to obtain the final prediction of the emotion word positions $\hat{y}_{sent}$, as shown in formula (7), where $W_{sent}$ and $b_{sent}$ are the weight and bias of that layer:

$\hat{y}_{sent} = \mathrm{softmax}(W_{sent} E_s + b_{sent})$   (7)
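The compression in formulas (4)-(6) behaves like a "squash" nonlinearity: token vectors with small norm are pushed toward zero and long vectors toward unit length. The sketch below, assuming PyTorch tensors of shape (batch, L, d_s), is one way to realize formulas (4)-(7); the class name, the dimensions and the use of raw logits for the emotion word classifier are assumptions.

```python
import torch
import torch.nn as nn

class EmotionEmbedding(nn.Module):
    """Project text features, squash them, and predict emotion-word positions (Eqs. 4-7)."""
    def __init__(self, d_t=768, d_s=128, n_labels=2):
        super().__init__()
        self.fc = nn.Linear(d_t, d_s)               # H = W_s h_t + b_s             (4)
        self.classifier = nn.Linear(d_s, n_labels)  # emotion-word prediction layer (7)

    def forward(self, h_t):
        H = self.fc(h_t)                                  # (batch, L, d_s)
        t = (H ** 2).sum(dim=-1, keepdim=True)            # squared norm per token   (5)
        E_s = (t / (1.0 + t)) * H / torch.sqrt(t + 1e-9)  # squash: norm in (0, 1)   (6)
        y_sent = self.classifier(E_s)                     # logits over emotion-word labels
        return E_s, y_sent
```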
a2: unimodal language feature learning: the Transformer alleviates the slow training of RNNs and improves computational parallelism, which makes it well suited to feature learning over unaligned modalities, and it plays a very important role in the field of natural language processing. Therefore, in order to capture long-range interdependent features within a single modality and extract richer semantic information from the context representation, an RNN-based structure is not used to capture modality sequence information; instead, a Transformer-based structure is used to generate the sequence features of each modality. Furthermore, unlike single-head attention, the multi-head self-attention mechanism, a key component of the Transformer, can capture multiple correlations within a modality by introducing multiple queries;
specifically, as shown in equations (8)-(13):

$Q_i^m = h_m W_i^{Q_m}$   (8)

$K_i^m = h_m W_i^{K_m}$   (9)

$V_i^m = h_m W_i^{V_m}$   (10)

$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\dfrac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$   (11)

$\mathrm{head}_i = \mathrm{Attention}(Q_i^m, K_i^m, V_i^m)$   (12)

$E_m = \mathrm{MultiHeadAttention}(Q_m, K_m, V_m) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$   (13)

where the modality $m \in \{l, v, a\}$, with $l$, $v$, $a$ denoting the text, visual and acoustic modalities respectively; $Q_i^m$, $K_i^m$ and $V_i^m$ denote the query, key and value learned by modality $m$ at the $i$-th attention head $\mathrm{head}_i$; $W_i^{Q_m}$, $W_i^{K_m}$ and $W_i^{V_m}$ are learnable projection matrices; $Q_i$, $K_i$ and $V_i$ denote the query, key and value of $\mathrm{head}_i$; $d_q = d_k = d_v = d_m / h$, where $h$ is the number of heads; and $\mathrm{Concat}(\cdot)$ is the concatenation operation;
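A compact sketch of this intra-modal step, using PyTorch's built-in nn.MultiheadAttention as a stand-in for formulas (8)-(13); the residual connection, layer normalization and hidden sizes are assumptions about details the text does not spell out.

```python
import torch.nn as nn

class IntraModalEncoder(nn.Module):
    """Multi-head self-attention over one modality's sequence (Q, K, V all come from h_m)."""
    def __init__(self, d_m=128, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_m, num_heads=n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_m)

    def forward(self, h_m):
        # Heads are projected, attended and concatenated internally (Eqs. 8-13).
        E_m, _ = self.attn(h_m, h_m, h_m)
        return self.norm(h_m + E_m)  # residual + layer norm, as in a standard Transformer block

# One encoder per modality, each with its own parameters, e.g.:
# enc_l, enc_a, enc_v = IntraModalEncoder(128), IntraModalEncoder(32), IntraModalEncoder(64)
```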
a3: cross-modal interactive learning: in previous multi-modal emotion analysis work, emotion information has at most served as auxiliary information in the fusion of the text, visual and acoustic modalities. The present application holds that, after shallow emotion knowledge learning, the interference caused by the non-emotion features of the text is reduced, so the obtained emotion embedding contains more semantic information and is better suited to replace the text for inter-modal feature learning. Therefore, a multi-modal emotion perception module for cross-modal feature fusion is designed in the cross-modal interactive learning layer.
The structure of the multi-modal emotion perception module is shown in fig. 2. The application guides the acoustic and visual features with a cross-modal multi-head attention mechanism in the emotion embedding feature space, so that emotion information can be perceived and expressed during cross-modal interactive learning. With $f \in \{sa, sv\}$, the cross-modal multi-head attention mechanism is formulated as follows:

$\mathrm{head}_i^f = \mathrm{Attention}(Q_i^f, K_i^f, V_i^f) = \mathrm{softmax}\!\left(\dfrac{Q_i^f (K_i^f)^{\top}}{\sqrt{d_k}}\right) V_i^f$   (14)

$E_f = \mathrm{MultiHeadAttention}(Q_f, K_f, V_f) = \mathrm{Concat}(\mathrm{head}_1^f, \ldots, \mathrm{head}_h^f)\, W^{O_f}$   (15)

where $W_i^{Q_f}$, $W_i^{K_f}$ and $W_i^{V_f}$ are learnable projection matrices, $d_q = d_k = d_v = d_s / h$, $h$ is the number of heads, $1 \le i \le h$, and $\mathrm{Concat}(\cdot)$ is the concatenation operation; $sa$ and $sv$ denote the emotion embedding-acoustic and emotion embedding-visual modality pairs, respectively, and $E_f$ denotes the output of the multi-modal emotion perception module.
In addition, considering that the text cannot contain all emotion information, and owing to the limitations of the opinion lexicon, the annotation work only labels emotion words; words such as modal particles and adverbs of degree, which can also affect the emotion expressed by a modality, are not labeled, and complete coverage of new Internet slang cannot be guaranteed. Therefore, using the emotion embedding as the dominant factor for inter-modal feature learning is not sufficient for accurate emotion analysis. Moreover, the visual and acoustic modalities contain extra information not present in the text modality, which is helpful for learning and extracting emotion information. Considering these points, the application introduces information from the other two modalities to compensate for the deficiencies of the emotion embedding representation;
$E_{at} = \mathrm{MultiHeadAttention}(Q_{at}, K_{at}, V_{at})$   (16)

$E_{vt} = \mathrm{MultiHeadAttention}(Q_{vt}, K_{vt}, V_{vt})$   (17)

$E_o = \mathrm{Concat}(E_{sa}, E_{sv}, E_{at}, E_{vt})$   (18)

where $E_o$ is the final output of the cross-modal interactive learning layer, $E_{at}$ and $E_{vt}$ are the cross-modal multi-head attention representations of the acoustic-text and visual-text modality pairs, and $\mathrm{Concat}(\cdot)$ is the concatenation operation.
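Under the assumption that all modal sequences have been projected to a common dimension, the cross-modal interactive learning layer of formulas (14)-(18) can be sketched as follows. The emotion embedding E_s acts as the query against the acoustic and visual sequences; for the acoustic-text and visual-text branches the text sequence is taken as key/value, and the outputs are mean-pooled before concatenation. These last two choices are assumptions, since the text only names the modality pairs.

```python
import torch
import torch.nn as nn

class CrossModalEmotionPerception(nn.Module):
    """Cross-modal interactive learning layer (Eqs. 14-18), sketched with nn.MultiheadAttention."""
    def __init__(self, d=128, n_heads=8):
        super().__init__()
        self.attn_sa = nn.MultiheadAttention(d, n_heads, batch_first=True)  # emotion emb. -> acoustic
        self.attn_sv = nn.MultiheadAttention(d, n_heads, batch_first=True)  # emotion emb. -> visual
        self.attn_at = nn.MultiheadAttention(d, n_heads, batch_first=True)  # acoustic-text branch
        self.attn_vt = nn.MultiheadAttention(d, n_heads, batch_first=True)  # visual-text branch

    def forward(self, E_s, E_l, E_a, E_v):
        # Emotion embedding guides the non-text modalities: query = E_s, key/value = E_a or E_v.
        E_sa, _ = self.attn_sa(E_s, E_a, E_a)
        E_sv, _ = self.attn_sv(E_s, E_v, E_v)
        # Complementary acoustic-text and visual-text branches (query/key assignment is an assumption).
        E_at, _ = self.attn_at(E_a, E_l, E_l)
        E_vt, _ = self.attn_vt(E_v, E_l, E_l)
        pool = lambda x: x.mean(dim=1)  # sequence -> vector; pooling before Concat is an assumption
        return torch.cat([pool(E_sa), pool(E_sv), pool(E_at), pool(E_vt)], dim=-1)  # E_o, Eq. (18)
```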
a4: prediction layer: the obtained feature representations are concatenated and fed into a deep neural network to complete the final prediction task, as shown in formulas (19)-(20), where $h_{fusion}$ is the input feature of the prediction layer and $\hat{y}$ is the final prediction:

$h_{fusion} = \mathrm{Concat}(E_o, E_t)$   (19)

$\hat{y} = \mathrm{DNN}(h_{fusion})$   (20)
In addition, during model training the application uses the regression loss function L1Loss, and the training loss consists of two parts, as shown in formula (21): the L1 loss of the main prediction task and the loss of the auxiliary emotion word classification prediction task.
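A sketch of the prediction layer and the two-part training objective follows; the two-layer MLP with ReLU, the pooled vector inputs and the weight alpha on the auxiliary emotion-word loss are assumptions, since the text only states that the concatenated features are fed into a deep neural network and that L1Loss is combined with the auxiliary classification task.

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """h_fusion = Concat(E_o, E_t) fed into a small deep neural network (Eqs. 19-20)."""
    def __init__(self, d_in, d_hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Dropout(0.5),            # prediction-layer dropout reported in the parameter settings
            nn.Linear(d_hidden, 1),
        )

    def forward(self, E_o, E_t):
        # E_o and E_t are assumed to be pooled vector representations of matching batch size.
        h_fusion = torch.cat([E_o, E_t], dim=-1)
        return self.mlp(h_fusion).squeeze(-1)

# Two-part training loss (Eq. 21): L1 regression loss plus the auxiliary emotion-word loss.
l1_loss, ce_loss = nn.L1Loss(), nn.CrossEntropyLoss()

def training_loss(y_hat, y, sent_logits, sent_labels, alpha=1.0):
    main = l1_loss(y_hat, y)
    aux = ce_loss(sent_logits.reshape(-1, sent_logits.size(-1)), sent_labels.reshape(-1))
    return main + alpha * aux  # alpha is an assumed weighting between the two parts
```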
s3: experimental setup and analysis:
b1: statistics of the data set: three public multi-modal emotion analysis datasets were selected for experiments, MOSI, MOSEI and IEMOCAP;
the basic statistics of the data set are shown in table 1.
TABLE 1 data set statistics for MOSI, MOSEI and IEMOCAP
Dataset      Train    Valid    Test     Total
CMU-MOSI     1284     229      686      2199
CMU-MOSEI    16326    1871     4659     22856
IEMOCAP      2717     798      938      4453
CMU-MOSI: the dataset consists of 2199 video monologue segments from YouTube. Each segment has an emotion intensity label in the range [-3, +3], where +3 denotes a strongly positive emotion and -3 a strongly negative emotion. The training set, validation set and test set of the dataset contain 1284, 229 and 686 video clips, respectively;
CMU-MOSEI: the dataset is an improvement on CMU-MOSI, with richer video clips and more human subjects. It contains 22856 video monologue segments from YouTube; the training set, validation set and test set consist of 16326, 1871 and 4659 video segments, respectively;
IEMOCAP: the dataset contains 4453 dialogue segments annotated with nine emotion categories such as happiness, anger, sadness and neutrality. Because some emotion labels are imbalanced, the first four emotion labels are chosen for the experiments. The training set, validation set and test set of the dataset consist of 2717, 798 and 938 video segments, respectively;
b2:Baselines:
TFN: TFN fuses the interactions of single mode, double mode and triple mode, and uses Cartesian product to fuse the tensor;
LMF: the LMF is improved on the basis of the TFN, and the calculation memory during multi-mode tensor fusion is reduced by using a low-rank decomposition factor;
MulT: the MulT utilizes a cross-modal attention module to carry out information interaction between the modalities on the basis of a Transformer encoder;
ICCN: the ICCN attaches the acoustic modal information and the visual modal information to the text modal, and performs multi-modal fusion by exploring the hidden relation between the language information and the non-language information;
TCSP: the TCSP takes a text as a center, learns the shared and private semantics of the modalities by using a cross-modality prediction task, and performs multi-modality emotion prediction by fusing semantic features;
BIMHA: BIMHA considers the relative importance of and relationships between modality pairs and extends multi-head attention to enhance the information;
HEMT: HEMT proposes a method based on holographic reduced representation, a compressed version of the outer-product model, to promote cross-modal high-order fusion;
PMR: PMR introduces a message center on the basis of the cross-modal Transformer to interact with each modality; through repeated iterations it completes the complementation of common information and the internal features of each modality, and uses the finally generated features for emotion prediction;
b3: parameter settings: the application uses 768-dimensional BERT pre-trained word vectors as text features; the number of attention heads is 8, the dropout of multi-task learning is 0.2, and the dropout of the prediction layer is 0.5. For the MOSI dataset, the application sets the initial learning rate to 3e-5, the batch size to 16, and the numbers of hidden units of the text, acoustic and visual modalities to 128, 16 and 32, respectively. For MOSEI, the initial learning rate is 1e-5, the batch size is 32, and the numbers of hidden units of the text, acoustic and visual modalities are 128, 32 and 64, respectively. For the IEMOCAP dataset, the initial learning rates for the four emotions of happiness, anger, sadness and neutrality are 3e-3, 4e-4, 7e-4 and 6e-4, respectively, the batch size is 32, and the numbers of hidden units of the text, acoustic and visual modalities are 128, 32 and 16, respectively;
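For reference, the hyperparameters listed above can be collected into a single configuration mapping; the key names below are illustrative.

```python
# Hyperparameters as reported in the text (key names are illustrative).
CONFIG = {
    "common":  {"text_dim": 768, "attention_heads": 8,
                "dropout_multitask": 0.2, "dropout_prediction": 0.5},
    "MOSI":    {"lr": 3e-5, "batch_size": 16,
                "hidden": {"text": 128, "acoustic": 16, "visual": 32}},
    "MOSEI":   {"lr": 1e-5, "batch_size": 32,
                "hidden": {"text": 128, "acoustic": 32, "visual": 64}},
    "IEMOCAP": {"lr": {"happy": 3e-3, "angry": 4e-4, "sad": 7e-4, "neutral": 6e-4},
                "batch_size": 32,
                "hidden": {"text": 128, "acoustic": 32, "visual": 16}},
}
```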
the present application uses the Adam optimizer for training together with an early-stopping strategy, and uses 6 different evaluation metrics to evaluate the performance of the model: for classification tasks, the binary classification accuracy (Acc-2), five-class accuracy (Acc-5), seven-class accuracy (Acc-7) and F1 score (F1-score) are used; for the regression task, the mean absolute error (MAE) and Pearson correlation coefficient (Corr) are used. For all metrics except MAE, higher values are better; for MAE, lower is better;
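The six evaluation metrics can be computed as sketched below, assuming NumPy, SciPy and scikit-learn are available; binarizing the continuous scores at zero and rounding them into five or seven buckets follow common practice on these datasets and are assumptions here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_pred, y_true):
    """y_pred, y_true: arrays of continuous sentiment scores in [-3, 3]."""
    mae = np.mean(np.abs(y_pred - y_true))                 # mean absolute error (lower is better)
    corr = pearsonr(y_pred, y_true)[0]                     # Pearson correlation coefficient
    acc2 = accuracy_score(y_true > 0, y_pred > 0)          # binary accuracy (Acc-2)
    f1 = f1_score(y_true > 0, y_pred > 0)                  # F1 score
    acc7 = accuracy_score(np.clip(np.round(y_true), -3, 3),
                          np.clip(np.round(y_pred), -3, 3))   # seven-class accuracy (Acc-7)
    acc5 = accuracy_score(np.clip(np.round(y_true), -2, 2),
                          np.clip(np.round(y_pred), -2, 2))   # five-class accuracy (Acc-5)
    return {"MAE": mae, "Corr": corr, "Acc-2": acc2, "F1": f1, "Acc-5": acc5, "Acc-7": acc7}
```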
Result analysis:
c1: from the experimental results of the model on CMU-MOSI, the analysis shows that the performance of TFN and LMF is poor, because these two methods do not consider the difference in the proportion of effective information between modalities. MulT and BIMHA both pay attention to the importance of information interaction between modalities and respectively combine a Transformer framework and a multi-head attention mechanism to mine and capture inter-modal information, so both improve on earlier work, but they do not reach the expected level. Among these methods, ICCN and PMR perform more prominently: ICCN provides useful guidance for cross-modal fusion by exploiting the hidden relation between the text modality and non-text modalities, and PMR pays attention to the asynchrony of unaligned multi-modal data and introduces a message center to assist the information interaction among modalities. Although both methods achieve good performance, the method of the present application still outperforms them, because it considers the complementarity and differences among modalities, notices the significance of emotion resources for modality fusion, and improves the efficiency of inter-modal information fusion and extraction through the participation of emotion knowledge embedding;
in addition, the model of the application exhibits excellent performance on both the classification and regression tasks. For the classification task, the method achieves optimal results in terms of both F1 score and classification accuracy, which improve by 0.82% and 0.75% respectively over the best baseline results. For the regression task, the model also performs outstandingly; in particular, the correlation (Corr) increases by 8.3%, which is a significant improvement;
Table 2: experimental results on CMU-MOSI (↑ indicates that a higher value is better; ↓ indicates that a lower value is better)
(The full table is provided as an image in the original publication.)
Table 2 shows the experimental results of the model on CMU-MOSI. As can be seen from the table, the method of the present application is superior to the other models on all metrics;
c2: the experimental results of the model on CMU-MOSEI report the performance on the regression and classification tasks, respectively. For the regression task, the method of the present application performs very well on both regression metrics, mean absolute error (MAE) and correlation (Corr), which are 1.3% and 7.9% better than the best baseline results, respectively. For the classification task, the method also achieves the optimal results: compared with the best baseline, the F1 score and binary accuracy are 0.55% and 0.6% higher, and the five-class and seven-class accuracies are 0.51% and 0.09% higher. The worst-performing method is TFN, whose evaluation metrics are far below those of the present method; in particular, on the F1 score and binary accuracy the present method is 5.74% and 6.28% higher than TFN, because TFN neglects the different contributions of the modalities during fusion. ICCN and BIMHA come close and show good experimental results, but they are still inferior to the present method on all evaluation metrics, probably because the present method considers both the differences in modal interaction information that BIMHA focuses on and the text information that ICCN focuses on; in addition, it uses a multi-task joint learning framework to improve the generalization of the model and blends emotion knowledge embedding into the inter-modal feature learning process, so it achieves better performance;
Table 3: experimental results on CMU-MOSEI (↑ indicates that a higher value is better; ↓ indicates that a lower value is better)
(The full table is provided as an image in the original publication.)
c3: the results of the model on the IEMOCAP dataset show that the method is highly competitive on negative emotions such as Angry and Sad; in particular, the binary classification accuracy and F1 score for the Angry emotion are 0.54% and 0.24% higher, respectively. This indicates that, compared with other methods, the present method is more sensitive to negative emotions and perceives them more easily. Under the Happy emotion label, the method also achieves the second-best result; although ICCN and BIMHA exceed the present method in binary classification accuracy and F1 score respectively, the gap between them is extremely small. The reason the present method does not reach the optimal result may be that the labels used in emotion word classification prediction are coarse-grained emotion vocabulary labels rather than fine-grained emotion labels, so the model does not learn emotion knowledge accurately enough, which affects the subsequent emotion classification results;
Table 4: experimental results on IEMOCAP (↑ indicates that a higher value is better; ↓ indicates that a lower value is better)
(The full table is provided as an image in the original publication.)
Table 4 shows the results of the model on the IEMOCAP dataset. Except for TCSP and HEMT, whose results on this dataset are not available and whose implementation details are not disclosed, the application compares against the other existing methods used on the first two datasets; the best and second-best results are highlighted in the table. As can be seen from Table 4, the method of the application ranks in the top two on the performance metrics, and its overall performance is clearly superior to that of the other methods;
s4: ablation experiments: in order to verify the effectiveness of the model, the present application conducted experiments from two aspects. On the one hand, the contributions of different modality combinations were explored, as shown in Table 5; on the other hand, the contributions of different module combinations were explored, as shown in Table 6;
first, in the unimodal emotion prediction task, prediction using the text modality performs far better than the other two modalities. This is probably because the text features are obtained with a model trained on a large-scale corpus while the other two modalities are extracted manually, so the text modality may contain richer information and is well suited to feature learning and emotion prediction;
Table 5: experimental results for different modality combinations (↑ indicates that a higher value is better; ↓ indicates that a lower value is better)
(The full table is provided as an image in the original publication.)
Secondly, in the bimodal emotion prediction task, it can be known that the performance of the model is much poorer when the acoustic and visual modalities are used, which proves the necessity of using the text modality, and the text modality contains more useful information than other modalities;
Table 6: experimental results for different module combinations. (The highest experimental results are shown in bold. "w/o U" means the unimodal depth feature extraction layer is not used; "w/o E" means the emotion embedding extraction layer is removed; "w/o N" means the multi-modal emotion perception layer is not used.)
(The full table is provided as an image in the original publication.)
To summarize: considering that emotion knowledge can provide a basis for emotion judgment, the application proposes an attention-based multi-modal emotion resource perception model, designs a word-level classification prediction task to capture an emotion embedding representation, and uses this representation in place of the text modality to guide cross-modal fusion.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. The multi-modal emotion analysis model fusing emotion resources is characterized by comprising:
a unimodal feature extraction layer module (1) for initial extraction of text, visual and auditory modal features;
a unimodal feature depth extraction layer module (2), which captures the dynamics within each modality using a Transformer and designs an emotion word classification prediction task for emotion embedding learning and extraction;
a cross-modal interactive learning layer module (3), which uses the emotion embedding to complete multi-modal feature interactive learning, so that the other modal features can perceive the emotion information in the text;
and a prediction layer module (4), in which the feature representations learned by the first three layers are fed into a deep neural network to complete the final prediction task.
2. The multi-modal emotion analysis model fusing emotion resources of claim 1, wherein: the unimodal feature depth extraction layer module (2) comprises an emotion resource acquisition and representation module (21) and a unimodal language feature learning module (22), wherein the emotion resource acquisition and representation module acquires emotion resources based on an opinion lexicon, classifies them, and then performs emotion word classification prediction through the formulas, and the unimodal language feature learning module can capture long-range interdependent features within a single modality.
3. The multi-modal emotion analysis model fusing emotion resources of claim 1, wherein: the cross-modal interactive learning layer module (3) comprises a multi-modal emotion perception module (31) that substitutes the emotion embedding for the text features, a visual modality learning module (32) that performs inter-modal learning of emotion features with the visual modality, and an auditory modality learning module (33) that performs inter-modal learning of emotion features with the auditory modality.
4. The multi-modal emotion analysis model fusing emotion resources is characterized in that the multi-modal emotion analysis method comprises the following steps:
s1: single mode feature extraction:
a given utterance usually contains three modalities: text, visual and acoustic. For the text modality, considering that the large-scale pre-trained language model BERT has strong language representation and feature extraction capabilities, BERT is used for initial feature extraction; for the acoustic and visual modalities, Bi-LSTM is used for feature extraction, taking into account the contextual correlation and temporal ordering within each modality;
s2: depth feature extraction:
a1: obtaining and expressing emotion resources: besides three modes of text, vision and acoustics which are generally adopted in traditional multi-mode emotion analysis, the emotion words can also provide more accurate identification direction for feature learning, and through learning of emotion knowledge, the initial text feature vectors can learn emotion word information contained in the initial text feature vectors, so that the initial text feature vectors are rich in more definite semantic information and less in noise interference;
according to the method, a Liu ice viewpoint dictionary is selected as a main basis of emotion resources, emotion word label labeling is carried out on a text mode of an experimental data set, positive and negative English emotion words are listed in the Liu ice viewpoint dictionary, and abnormal emotion words such as misspelling and slang deformation are included, so that positions of the emotion words and the non-emotion words in an original text are labeled according to the viewpoint dictionary, and 0 character is filled at the tail of a short sentence to ensure consistency of label lengths;
in addition, the application designs a classification auxiliary task for emotion word prediction. In order to ensure that the emotion information can be effectively fused into high-dimensional feature representation, feature compression is carried out on the basis of following nonlinearity, the emotion feature value is ensured to be between 0 and 1, the probability rule is met, and the purpose of reducing information loss is achieved;
a2: unimodal language feature learning: the Transformer overcomes the slow training of RNNs, improves computational parallelism, is well suited to feature learning on non-aligned modalities, and plays a very important role in natural language processing. Therefore, in order to capture long-distance interdependent features within a single modality and extract richer semantic information from contextual representations, a Transformer-based structure rather than an RNN-based structure is selected to generate the sequence information of each modality. Furthermore, unlike a single-head attention mechanism, the multi-head self-attention mechanism serves as a key component of the Transformer and can capture multiple correlations inside a modality by introducing multiple queries;
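The sketch below illustrates this multi-head self-attention component, wrapped around PyTorch's built-in nn.MultiheadAttention; the residual connection and layer normalization follow the standard Transformer layout, and the 8 heads match the parameter settings quoted later. The wrapper itself is an assumed implementation, not the claimed one.

```python
# Illustrative sketch of step A2 (assumed wrapper; not prescribed by the claims).
import torch.nn as nn

class UnimodalSelfAttention(nn.Module):
    def __init__(self, dim, heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, key_padding_mask=None):
        # Multiple heads issue multiple queries, capturing several long-distance
        # dependency patterns inside one modality in parallel.
        out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        return self.norm(x + out)   # residual connection as in the Transformer
```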
a3: cross-modal interactive learning: considering that text cannot contain all emotion information, and due to the limitations of the opinion lexicon, the labeling work only tags emotion words; words that also affect how a modality expresses emotion, such as modal particles and modal adverbs, are not tagged, and complete coverage of new internet words cannot be guaranteed. Therefore, using emotion embeddings as the dominant factor for inter-modal feature learning is not sufficient for accurate emotion analysis. In addition, the visual and acoustic modalities contain extra information not present in the text modality, which facilitates the learning and extraction of emotion information. Considering these points, information from the other two modalities is introduced to supplement the deficiencies of the emotion embedding representation;
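The following sketch illustrates the cross-modal interaction idea of step A3: the emotion-enhanced text representation acts as the query, while the visual and acoustic sequences supply the keys and values, so each auxiliary modality supplements what the emotion embedding misses. Combining the two cross-modal views by simple addition is an assumption made for illustration, not a detail fixed by this application.

```python
# Illustrative sketch of step A3 (additive fusion of the two views is assumed).
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim, heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod, context_mod):
        # query: emotion-enhanced text; key/value: visual or acoustic sequence
        out, _ = self.attn(query_mod, context_mod, context_mod)
        return self.norm(query_mod + out)

class CrossModalInteraction(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.text_with_video = CrossModalBlock(dim)
        self.text_with_audio = CrossModalBlock(dim)

    def forward(self, emo_text, video, audio):
        tv = self.text_with_video(emo_text, video)   # supplement from the visual modality
        ta = self.text_with_audio(emo_text, audio)   # supplement from the acoustic modality
        return tv + ta                               # combine the complementary views
```

Keeping the text stream on the query side reflects the observation, repeated in the ablation experiments below, that the text modality carries the richest information.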
a4: prediction layer: the obtained feature representations are concatenated and fed into a deep neural network to complete the final prediction;
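A minimal sketch of the prediction layer of step A4: the learned representations are concatenated and passed through a small feed-forward network. The hidden size is an assumption; the dropout of 0.5 follows the prediction-layer setting quoted below.

```python
# Illustrative sketch of step A4 (layer sizes assumed; out_dim=1 for a
# regression-style sentiment score).
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_dim, hidden=128, out_dim=1, dropout=0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, out_dim))

    def forward(self, *features):
        fused = torch.cat(features, dim=-1)   # splice the learned representations
        return self.mlp(fused)
```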
s3: experiments and analysis:
b1: statistics of the data set: three public multi-modal emotion analysis datasets were selected for experiments, MOSI, MOSEI and IEMOCAP;
CMU-MOSI: the data set consists of 2199 video monologue segments from YouTube, and each segment carries an emotion intensity label in the range [-3, +3], where +3 indicates a strongly positive emotion and -3 a strongly negative emotion. The training set, validation set and test set of this data set contain 1284, 229 and 686 video segments respectively;
CMU-MOSEI: this data set is an extension of CMU-MOSI, with richer video segments and human subjects. It contains 22856 video monologue segments from YouTube; the training set, validation set and test set consist of 16326, 1871 and 4659 video segments respectively;
IEMOCAP: the data set contains 4453 dialogue segments annotated with nine emotion categories such as happiness, anger, sadness and neutrality; because some emotion labels are imbalanced, the first four emotion labels are selected for the experiments. The training set, validation set and test set consist of 2717, 798 and 938 video segments respectively;
b2: Baselines:
TFN: TFN fuses unimodal, bimodal and trimodal interactions and uses the Cartesian product for tensor fusion;
LMF: LMF improves on TFN and reduces the computational memory of multi-modal tensor fusion by using low-rank decomposition factors;
MulT: MulT uses a cross-modal attention module on top of a Transformer encoder to perform information interaction between modalities;
ICCN: ICCN adds acoustic and visual modality information to the text modality and performs multi-modal fusion by exploring the hidden relation between linguistic and non-linguistic information;
TCSP: TCSP takes the text as the center, learns the shared and private semantics of the modalities with a cross-modal prediction task, and performs multi-modal emotion prediction by fusing the semantic features;
BIMHA: BIMHA considers the relative importance of and relationships between paired modalities and extends multi-head attention to enhance inter-modal information;
HEMT: HEMT proposes a method based on holographic reduced representations, a compressed version of the outer-product model, to promote high-order cross-modal fusion;
PMR: PMR introduces an information center on top of the cross-modal Transformer to interact with each modality, completes the complementation of common information and modality-internal features over repeated cycles, and uses the finally generated features for emotion prediction;
b3: parameter settings: the present application uses 768-dimensional BERT pre-trained word vectors as text features; the number of attention heads is 8, the dropout of multi-task learning is 0.2, and the dropout of the prediction layer is 0.5. For the MOSI data set, the initial learning rate is set to 3e-5, the batch size to 16, and the numbers of hidden units of the text, acoustic and visual modalities to 128, 16 and 32 respectively; for MOSEI, the initial learning rate is 1e-5, the batch size is 32, and the hidden unit numbers of the text, acoustic and visual modalities are 128, 32 and 64 respectively; for the IEMOCAP data set, the initial learning rates for the happy, angry, sad and neutral emotions are 3e-3, 4e-4, 7e-4 and 6e-4 respectively, the batch size is 32, and the hidden unit numbers of the text, acoustic and visual modalities are 128, 32 and 16 respectively;
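For readability, the settings just listed can be collected into a single configuration structure. The sketch below simply transcribes the values quoted above; the key names and nesting are assumptions made for illustration and are not part of the application.

```python
# Hyperparameters as quoted above, gathered into an (assumed) config dict.
CONFIG = {
    "bert_dim": 768, "attention_heads": 8,
    "dropout_multitask": 0.2, "dropout_prediction": 0.5,
    "MOSI":    {"lr": 3e-5, "batch_size": 16,
                "hidden": {"text": 128, "acoustic": 16, "visual": 32}},
    "MOSEI":   {"lr": 1e-5, "batch_size": 32,
                "hidden": {"text": 128, "acoustic": 32, "visual": 64}},
    "IEMOCAP": {"lr": {"happy": 3e-3, "angry": 4e-4, "sad": 7e-4, "neutral": 6e-4},
                "batch_size": 32,
                "hidden": {"text": 128, "acoustic": 32, "visual": 16}},
}
```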
the present application trains with the Adam optimizer and an early-stopping strategy, and uses 6 different evaluation indicators to evaluate the performance of the model: for the classification tasks, the binary classification accuracy (Acc-2), five-class accuracy (Acc-5), seven-class accuracy (Acc-7) and F1 value (F1-score) are used as evaluation indexes; for the regression task, the mean absolute error (MAE) and the Pearson correlation coefficient (Corr) are used. Higher values are better for all evaluation indexes except the mean absolute error;
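A minimal sketch of the six evaluation indicators is given below, using numpy and scikit-learn. The mapping of continuous sentiment scores to 2-, 5- and 7-way classes by thresholding at zero and rounding with clipping is a common convention in this literature, but it is an assumption here, not something fixed by the application.

```python
# Illustrative sketch of the evaluation metrics (score-to-class binning assumed).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(preds, labels):
    preds, labels = np.asarray(preds), np.asarray(labels)
    mae = np.mean(np.abs(preds - labels))                        # Mean Absolute Error
    corr = np.corrcoef(preds, labels)[0, 1]                      # Pearson correlation
    acc2 = accuracy_score(labels >= 0, preds >= 0)               # binary accuracy
    f1 = f1_score(labels >= 0, preds >= 0, average="weighted")   # F1 score
    acc5 = accuracy_score(np.clip(np.round(labels), -2, 2),
                          np.clip(np.round(preds), -2, 2))       # five-class accuracy
    acc7 = accuracy_score(np.clip(np.round(labels), -3, 3),
                          np.clip(np.round(preds), -3, 3))       # seven-class accuracy
    return {"MAE": mae, "Corr": corr, "Acc-2": acc2, "F1": f1,
            "Acc-5": acc5, "Acc-7": acc7}
```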
b4: result analysis:
c1: analysis of the experimental results on CMU-MOSI: TFN and LMF perform poorly, because neither method considers the difference in the proportion of effective information between modalities. MulT and BIMHA both attend to the importance of inter-modal information interaction and respectively combine the Transformer architecture and a multi-head attention mechanism to mine and capture inter-modal information; both therefore improve on earlier work, but still fall short of the expected level. Among these methods, ICCN and PMR stand out: ICCN provides useful guidance for cross-modal fusion by exploiting the hidden relation between the text modality and the non-text modalities, while PMR attends to the asynchrony of unaligned multi-modal data and introduces a message center to assist the information interaction process between modalities. Although both obtain good performance, the present method additionally attends to the importance of emotion resources for modality fusion while considering the complementarity and differences between modalities, and improves the efficiency of information fusion and extraction between modalities through the participation of embedded emotion knowledge; these two methods are therefore still inferior to the method of the present application on the evaluation indexes;
in addition, the model of the present application exhibits excellent performance on both the classification and regression tasks. For the classification tasks, the method achieves the best results in terms of both F1 score and classification accuracy, which improve by 0.82% and 0.75% respectively over the best baseline. For the regression task, the model also performs exceptionally well; in particular, the correlation (Corr) increases by 8.3%, a significant improvement;
c2: the experimental results of the model on CMU-MOSEI are reported for the regression task and the classification task respectively. For the regression task, the method of the present application performs very well on both the mean absolute error (MAE) and the correlation (Corr), which are 1.3% and 7.9% better than the best baseline results respectively. For the classification task, the method also obtains the best results: compared with the best baseline, the F1 score and binary accuracy are 0.55% and 0.6% higher respectively, and the five-class and seven-class accuracies are 0.51% and 0.09% higher respectively. The worst-performing method is TFN, whose evaluation indexes are all far below those of the present method; in particular, on the F1 score and binary accuracy the present method is 5.74% and 6.28% higher than TFN respectively, because TFN neglects the difference in contributions of different modalities when fusing them. ICCN and BIMHA show better experimental results but are still inferior to the present method on all evaluation indexes, probably because the present method not only considers the differences in inter-modal interaction information noted by BIMHA but also the text information emphasized by ICCN; in addition, it uses a multi-task joint learning framework to improve the generalization of the model and embeds emotion knowledge into the inter-modal feature learning process, and therefore achieves better performance;
c3: the results on the IEMOCAP data set show that the method is strongly competitive on negative emotions such as Angry and Sad; in particular, the classification accuracy and F1 score for the Angry emotion are higher by 0.54% and 0.24% respectively, indicating that, compared with other methods, the present method is more sensitive to negative emotions and perceives them more easily. Under the Happy emotion label, the method obtains the second-best score; although ICCN and BIMHA respectively exceed it in classification accuracy and F1 score, the gaps between the present method and these two methods are extremely small. The gap from the optimal result is thought to arise because the labels used in the emotion word classification prediction of the present application are coarse-grained sentiment word labels rather than fine-grained emotion labels, so the model's learning of emotion knowledge is not accurate enough, which affects the subsequent emotion classification results;
s4: ablation experiments: first, in the unimodal emotion prediction task, prediction using the text modality performs far better than the other two modalities, probably because the text features are obtained with a large-scale corpus whereas the other two modalities are extracted manually; the text modality therefore likely contains richer information and is well suited to feature learning and emotion prediction.
Second, in the bimodal emotion prediction task, the performance of the model is much worse when only the acoustic and visual modalities are used, which demonstrates the necessity of the text modality: it contains more useful information than the other modalities.
CN202211262518.3A 2022-10-14 2022-10-14 Multi-mode emotion analysis model fusing emotion resources Pending CN115577161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211262518.3A CN115577161A (en) 2022-10-14 2022-10-14 Multi-mode emotion analysis model fusing emotion resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211262518.3A CN115577161A (en) 2022-10-14 2022-10-14 Multi-mode emotion analysis model fusing emotion resources

Publications (1)

Publication Number Publication Date
CN115577161A true CN115577161A (en) 2023-01-06

Family

ID=84585083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211262518.3A Pending CN115577161A (en) 2022-10-14 2022-10-14 Multi-mode emotion analysis model fusing emotion resources

Country Status (1)

Country Link
CN (1) CN115577161A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204850A (en) * 2023-03-14 2023-06-02 匀熵智能科技(无锡)有限公司 Multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention
CN116204850B (en) * 2023-03-14 2023-11-03 匀熵智能科技(无锡)有限公司 Multi-mode emotion analysis method based on dynamic gradient and multi-view collaborative attention
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention
CN116415137A (en) * 2023-06-08 2023-07-11 讯飞医疗科技股份有限公司 Emotion quantification method, device, equipment and storage medium based on multi-modal characteristics
CN116415137B (en) * 2023-06-08 2023-10-10 讯飞医疗科技股份有限公司 Emotion quantification method, device, equipment and storage medium based on multi-modal characteristics
CN116595181A (en) * 2023-07-07 2023-08-15 湖南师范大学 Personalized dialogue method and system combining emotion analysis
CN116595181B (en) * 2023-07-07 2023-10-03 湖南师范大学 Personalized dialogue method and system combining emotion analysis
CN116975776A (en) * 2023-07-14 2023-10-31 湖北楚天高速数字科技有限公司 Multi-mode data fusion method and device based on tensor and mutual information
CN117331460A (en) * 2023-09-26 2024-01-02 武汉北极光数字科技有限公司 Digital exhibition hall content optimization method and device based on multidimensional interaction data analysis
CN117331460B (en) * 2023-09-26 2024-06-21 武汉北极光数字科技有限公司 Digital exhibition hall content optimization method and device based on multidimensional interaction data analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination