CN115730203A - Voice emotion recognition method based on global perception cross-modal feature fusion network - Google Patents
- Publication number
- CN115730203A (application number CN202211489099.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- modal
- features
- fusion
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of speech emotion recognition, and discloses a speech emotion recognition method based on a global perception cross-modal feature fusion network. The method comprises a multi-modal emotion recognition model, wherein the multi-modal emotion recognition model comprises an SER (speech emotion recognition) branch and an ASR (automatic speech recognition) branch.
Description
Technical Field
The invention relates to the technical field of speech emotion recognition, in particular to a speech emotion recognition method based on a global perception cross-modal feature fusion network.
Background
Speech, as the first attribute of language, plays a decisive supporting role in it: speech carries not only the textual content the speaker expresses but also the emotional information the speaker intends to convey, and the same text can differ greatly under different emotional expressions. Because of the importance of emotion in ordinary human conversation, speech emotion recognition has therefore received increasing attention. Taking the ubiquitous virtual voice assistants (such as Alexa, Siri, *** assistant and Cortana) as an example, as the number of people interacting with them grows, they must infer the user's emotion and react appropriately to improve the user experience. However, humans express emotion not only through speech but also in many other ways, such as words, body posture and facial expression. Therefore, to correctly understand the emotion expressed in speech, we need to fully understand the emotional information contained in the various modalities.
In real life, speech emotion recognition helps people communicate better. Emotions usually appear in a conversation in multiple forms, such as voice and text; however, most existing emotion recognition systems use only single-modal features and ignore the interaction among multi-modal information.
Disclosure of Invention
The invention provides the following technical scheme: a speech emotion recognition system based on a global perception cross-modal feature fusion network comprises a multi-modal emotion recognition model, wherein the multi-modal emotion recognition model comprises an SER branch and an ASR branch. The SER branch comprises wav2vec2.0, Roberta-base, a residual cross-modal fusion attention module, a global perception block and a fully connected layer; the ASR branch comprises a transcription layer, an audio feature layer and a fully connected layer. The SER branch calculates the cross-entropy loss between the predicted emotion labels and the true emotion labels, the ASR branch calculates the CTC loss from the audio features of the wav2vec2.0 model and the corresponding text transcription, and finally the cross-entropy loss and the CTC loss are added to obtain the loss value of the training part.
Preferably, the multi-modal emotion recognition model comprises a problem statement;
the data set D has k utterances u_i, each utterance corresponding to a label l_i, and each utterance consists of a speech segment a_i and a text transcription t_i, where u_i = (a_i, t_i) and t_i is ASR-transcribed text or manually annotated text; the proposed network model takes u_i as input and assigns the correct emotion label to any given utterance:

⟨U, L⟩ = {{u_i = ⟨a_i, t_i⟩, l_i} | i ∈ [1, k]}  (1)
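The problem statement above can be mirrored in a plain data structure; a toy sketch (all names and values below are ours, for illustration only — the patent defines only the abstract tuples u_i = (a_i, t_i) with labels l_i):

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """One sample u_i = (a_i, t_i) with its emotion label l_i."""
    audio: list   # raw waveform samples a_i
    text: str     # transcription t_i (ASR output or manual annotation)
    label: str    # emotion label l_i

# A toy dataset D with k = 2 utterances.
dataset = [
    Utterance(audio=[0.0, 0.1, -0.2], text="i am fine", label="neutral"),
    Utterance(audio=[0.3, -0.1, 0.0], text="this is great", label="happy"),
]

def predict(u: Utterance) -> str:
    """Placeholder for the GCF-Net forward pass: map u_i to an emotion label."""
    return "neutral"
```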
Preferably, the multi-modal emotion recognition model comprises feature coding;
in feature coding, the audio information and text information of each utterance are encoded by the corresponding encoders into wav2vec2.0 features and text features, respectively, which are the inputs of the proposed model.
Preferably, the multi-modal emotion recognition model comprises speech coding;
the wav2vec2.0 features contain the rich prosodic information needed for emotion recognition. In our model we use a pre-trained wav2vec2.0 model as the raw-audio-waveform encoder to extract wav2vec2.0 features; the model is based on the Transformer structure for representing speech audio sequences and extracts features by fitting a set of ASR modeling units shorter than phonemes. After comparing the two released versions of the wav2vec2.0 model, we choose the wav2vec2-base model with a feature dimension of 768. We input the audio data a_i of the i-th utterance into the pre-trained wav2vec2.0 model to obtain a contextual embedding E_A ∈ R^(j×D_A), where D_A is the size of the audio feature embedding, which can be expressed as:

E_A = F_wav2vec2.0(a_i)

where F_wav2vec2.0 denotes the pre-trained wav2vec2.0 model acting as the audio feature processor, and j depends on the size of the original audio and on the CNN feature extraction layer of the wav2vec2.0 model, which extracts frames from the raw audio with a 20 ms stride and a 25 ms window; in our experiments the parameters of the CNN feature extraction layer are kept fixed.
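The dependence of j on the audio length can be checked against the CNN configuration; a minimal sketch (the kernel/stride values below are the published wav2vec2-base defaults, an assumption on our part, not taken from the patent):

```python
# Kernel sizes and strides of the 7 CNN layers in wav2vec2-base
# (values from the public wav2vec 2.0 release; assumed, not from the patent).
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]

def num_frames(num_samples: int) -> int:
    """Number of feature frames j produced for a raw waveform of given length."""
    length = num_samples
    for k, s in zip(KERNELS, STRIDES):
        length = (length - k) // s + 1  # valid convolution output length
    return length

# One second of 16 kHz audio yields ~49 frames, i.e. a ~20 ms hop,
# matching the frame rate described in the text.
print(num_frames(16000))  # → 49
```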
Preferably, the multi-modal emotion recognition model comprises a context text representation;
the text data are input into the Roberta-base model for encoding. Before text features are extracted, the input text is tokenized: separators are added and sentences are segmented to prevent semantic confusion. The tokenized text data are fine-tuned with the corresponding utterances, and the contextual text embedding E_T ∈ R^(m×D_T) can be expressed as:

E_T = F_Roberta-base(t_i)

where F_Roberta-base denotes the text feature extraction function, m depends on the number of tokens in the text, and D_T is the dimension size of the text feature embedding.
Preferably, the residual cross-modal fusion attention module is composed of two parallel fusion blocks for the different modalities, specifically a speech-text residual cross-modal fusion attention block and a text-speech residual cross-modal fusion attention block, and the module uses a cross-modal attention mechanism to complete the multi-modal information interaction.

Preferably, the audio information a_i and the text information t_i are passed through the corresponding pre-trained feature extractors to generate the associated audio and text features. A residual cross-modal fusion attention block is composed of a cross-modal attention mechanism, a linear layer, a normalization layer, a dropout layer, a Gaussian error linear unit activation function and a residual structure. The two residual cross-modal fusion attention blocks differ in the query, key and value of the cross-modal attention mechanism: the speech-text block takes the audio features as queries and the text features as keys and values, while the text-speech block takes the text features as queries and the audio features as keys and values, both carrying out the interaction of audio and text with a multi-head attention mechanism. First the audio and text features interact through the multi-head attention mechanism; the interacted features then pass through the linear layer, the normalization layer and the dropout layer, and are finally connected to the block's initial audio or text features through the residual structure;

where Φ1 denotes the learning function of the first speech-text or text-speech residual cross-modal fusion attention block;

in addition, the output of the first residual cross-modal fusion attention block is combined with the initial audio or text features and fed to a second residual cross-modal fusion attention block, so that a plurality of residual cross-modal fusion attention blocks are stacked together to generate the corresponding multi-modal fusion features of the two branches.

Our fusion strategy differs from previously proposed multimodal fusion models in that, to better integrate the multimodal information, we always keep the keys and values unchanged: the keys and values of every residual cross-modal fusion attention block are the initial audio and text features of the module. Finally, the outputs of the two residual cross-modal fusion attention blocks are concatenated to generate the final multi-modal fusion feature.
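The block structure described above can be sketched with single-head attention in NumPy (a simplified stand-in for the multi-head version; the layer sizes, random weights and omission of dropout are our assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attention(query, key, value):
    """Single-head scaled dot-product attention: query attends over key/value."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores -= scores.max(-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(-1, keepdims=True)
    return weights @ value

def fusion_block(x_q, x_kv, w_linear):
    """Attend, project through a linear layer + norm, then add the residual."""
    attended = cross_attention(x_q, x_kv, x_kv)
    projected = layer_norm(attended @ w_linear)
    return x_q + projected                    # residual connection

rng = np.random.default_rng(0)
audio = rng.normal(size=(49, 768))   # E_A: j frames x D_A
text = rng.normal(size=(12, 768))    # E_T: m tokens x D_T
w = rng.normal(size=(768, 768)) * 0.02

# Speech-text block: audio queries attend over text keys/values, and vice versa.
fused_a = fusion_block(audio, text, w)
fused_t = fusion_block(text, audio, w)
print(fused_a.shape, fused_t.shape)  # (49, 768) (12, 768)
```

Note how each block leaves its query modality's sequence length unchanged, which is what allows the residual connection back to the initial features.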
Preferably, after the multi-modal fusion feature has been extracted by the residual cross-modal fusion attention module, the global perception module captures its global information. The size of the multi-modal fusion feature varies with the lengths of the audio (j) and the text (m); therefore, before inputting it into the global perception block, we must map its sequence dimension to 1 (i.e., set the values of j and m to 1). The global perception module is composed of two fully connected layers, a convolutional layer, two normalization layers, a Gaussian error linear unit activation function and a multiplication operation. The output dimensions of the first and last fully connected layers in the module are 4D_f and D_f, respectively (where D_f = 768). After projection through the Gaussian error linear unit activation function, the output is split into two 2D_f halves whose multiplication enhances cross-dimensional feature mixing. Finally, the output of the global perception fusion module is integrated for classification; the corresponding equation is:

y_i = FC(F_global-aware), y_i ∈ R^C  (9)

where Φ_global-aware denotes the function of the global perception module applied to the multi-modal fusion feature, F_global-aware its output, and C the number of emotion classes.
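The gated mixing described above can be sketched as follows (a simplified sketch: the convolutional and normalization layers are omitted, the class count of 4 and all random weights are our assumptions):

```python
import numpy as np

D_F, NUM_CLASSES = 768, 4

def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(1)
w1 = rng.normal(size=(D_F, 4 * D_F)) * 0.02    # first FC: D_f -> 4*D_f
w2 = rng.normal(size=(2 * D_F, D_F)) * 0.02    # last FC:  2*D_f -> D_f
w_cls = rng.normal(size=(D_F, NUM_CLASSES)) * 0.02

def global_aware(f):
    """Project up, split the GELU output into two halves, multiply them."""
    h = gelu(f @ w1)          # (4*D_f,)
    a, b = np.split(h, 2)     # two 2*D_f halves
    mixed = a * b             # multiplication enhances cross-dimensional mixing
    return mixed @ w2         # back to D_f

fusion_feature = rng.normal(size=(D_F,))  # pooled multi-modal fusion feature
out = global_aware(fusion_feature)
logits = out @ w_cls                      # y_i = FC(F_global-aware)
print(out.shape, logits.shape)            # (768,) (4,)
```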
Preferably, the multi-modal emotion recognition model comprises a connectionist temporal classification (CTC) layer. Using the CTC loss as the loss function lets gradients be back-propagated effectively, so we calculate the CTC loss from the wav2vec2.0 features and the text transcription information t_i;

where |V| = 32 is the size of our vocabulary, consisting of the 26 letters of the alphabet and some punctuation marks;

furthermore, we need the output feature y_i of the global perception block and the true emotion label l_i to calculate the cross-entropy loss;

L_CrossEntropy = CrossEntropy(y_i, l_i)  (12)

finally, we introduce a hyper-parameter α that combines the two loss functions into one loss, α effectively controlling the relative importance of the CTC loss.

L = L_CrossEntropy + α·L_CTC, α ∈ (0, 1)  (13)
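The combined objective of equation (13) can be illustrated numerically; a toy sketch (the CTC value is a stand-in constant, since a real CTC computation needs frame-level alignments, and the probabilities are invented):

```python
import numpy as np

def cross_entropy(probs, label_idx):
    """Cross-entropy of one predicted distribution against its true label."""
    return -np.log(probs[label_idx])

alpha = 0.1                          # weight on the auxiliary CTC term
probs = np.array([0.7, 0.1, 0.1, 0.1])
ce_loss = cross_entropy(probs, 0)    # true class is index 0
ctc_loss = 2.5                       # stand-in value for L_CTC
total = ce_loss + alpha * ctc_loss   # L = L_CrossEntropy + alpha * L_CTC
print(round(total, 4))               # → 0.6067
```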
Preferably, the speech emotion recognition method based on the global perception cross-modal feature fusion network comprises the following main steps:
s1: respectively extracting wav2vec2.0 features and text features through a pre-training model of transfer learning;
s2: fusing features from different modalities through a residual cross-modality fusion attention module;
s3: introducing a global perception module to capture important emotion information of the multi-modal fusion features from different scales;
s4: validating the method through numerous experiments performed on the IEMOCAP data set.
Advantageous effects
Compared with the prior art, the invention provides a speech emotion recognition method based on a global perception cross-modal feature fusion network, which has the following beneficial effects:
the invention provides a global perception cross-modal feature fusion network (GCF-Net) for speech emotion recognition. In GCF-Net, a residual across-modal fusion attention module is designed to help a network to extract rich features from audio and text, a global fusion block is added behind the residual across-modal fusion attention module to further extract the features rich in emotion in a global range, automatic speech recognition is introduced as an auxiliary task for calculating the loss of the connotation time classification, and experimental results on an IEMOCAP data set show that the residual across-modal fusion attention module, the global perception module and the automatic speech recognition calculate the loss of the connotation time classification and improve the performance of the model.
Drawings
FIG. 1 is a schematic structural diagram of a multi-modal emotion recognition model of a speech emotion recognition method based on a global perception cross-modal feature fusion network;
FIG. 2 is a schematic diagram of the residual cross-modal fusion attention module structure of the speech emotion recognition method based on a global perception cross-modal feature fusion network according to the present invention;
FIG. 3 is a schematic diagram of the speech-text residual cross-modal fusion attention block structure of the speech emotion recognition method based on a global perception cross-modal feature fusion network according to the present invention;
FIG. 4 is a schematic diagram of a global fusion block structure of a speech emotion recognition method based on a global perceptual cross-modal feature fusion network of the present invention;
FIG. 5 is a schematic diagram of confusion matrix results of different modes of a speech emotion recognition method based on a global perception cross-mode feature fusion network.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment;
referring to fig. 1, a speech emotion recognition system based on a global perception cross-modal feature fusion network comprises a multi-modal emotion recognition model, wherein the multi-modal emotion recognition model comprises an SER branch and an ASR branch. The SER branch comprises wav2vec2.0, Roberta-base, a residual cross-modal fusion attention module, a global perception block and a fully connected layer; the ASR branch comprises a transcription layer, an audio feature layer and a fully connected layer. The SER branch calculates the cross-entropy loss between the predicted emotion labels and the true emotion labels, the ASR part calculates the CTC loss from the audio features of the wav2vec2.0 model and the corresponding text transcription, and finally the cross-entropy loss and the CTC loss are added to obtain the loss value of the training part.
Further, the multi-modal emotion recognition model comprises a problem statement;
the data set D has k utterances u_i, each utterance corresponding to a label l_i, and each utterance consists of a speech segment a_i and a text transcription t_i, where u_i = (a_i, t_i) and t_i is ASR-transcribed text or manually annotated text; the proposed network model takes u_i as input and assigns the correct emotion label to any given utterance:

⟨U, L⟩ = {{u_i = ⟨a_i, t_i⟩, l_i} | i ∈ [1, k]}  (1)
further, the multi-modal emotion recognition model comprises feature coding;
in feature coding, the audio information and text information of each utterance are encoded by the corresponding encoders into wav2vec2.0 features and text features, respectively, which are the inputs of the proposed model.
Further, the multi-modal emotion recognition model comprises speech coding;
the wav2vec2.0 features contain the rich prosodic information needed for emotion recognition. In our model we use a pre-trained wav2vec2.0 model as the raw-audio-waveform encoder to extract wav2vec2.0 features; the model is based on the Transformer structure for representing speech audio sequences and extracts features by fitting a set of ASR modeling units shorter than phonemes. After comparing the two released versions of the wav2vec2.0 model, we choose the wav2vec2-base model with a feature dimension of 768. We input the audio data a_i of the i-th utterance into the pre-trained wav2vec2.0 model to obtain a contextual embedding E_A ∈ R^(j×D_A), where D_A is the size of the audio feature embedding, which can be expressed as:

E_A = F_wav2vec2.0(a_i)

where F_wav2vec2.0 denotes the pre-trained wav2vec2.0 model acting as the audio feature processor, and j depends on the size of the original audio and on the CNN feature extraction layer of the wav2vec2.0 model, which extracts frames from the raw audio with a 20 ms stride and a 25 ms window; in our experiments the parameters of the CNN feature extraction layer are kept fixed.
Further, the multi-modal emotion recognition model comprises context text representation;
the text data are input into the Roberta-base model for encoding. Before text features are extracted, the input text is tokenized: separators are added and sentences are segmented to prevent semantic confusion. The tokenized text data are fine-tuned with the corresponding utterances, and the contextual text embedding E_T ∈ R^(m×D_T) can be expressed as:

E_T = F_Roberta-base(t_i)

where F_Roberta-base denotes the text feature extraction function, m depends on the number of tokens in the text, and D_T is the dimension size of the text feature embedding.
Example two;
referring to fig. 2 and 3, the residual cross-modal fusion attention module is composed of two parallel fusion blocks for the different modalities, specifically a speech-text residual cross-modal fusion attention block and a text-speech residual cross-modal fusion attention block. The module uses a cross-modal attention mechanism to complete the multi-modal information interaction: a fusion layer based on cross-modal attention is designed to combine the latent representations of the different models, and a new global perception fusion module is introduced to obtain the key emotion information of the multi-modal fusion features.
Further, the audio information a_i and the text information t_i are passed through the corresponding pre-trained feature extractors to generate the associated audio and text features. A residual cross-modal fusion attention block is composed of a cross-modal attention mechanism, a linear layer, a normalization layer, a dropout layer, a Gaussian error linear unit activation function and a residual structure. The two residual cross-modal fusion attention blocks differ in the query, key and value of the cross-modal attention mechanism: the speech-text block takes the audio features as queries and the text features as keys and values, while the text-speech block takes the text features as queries and the audio features as keys and values, both carrying out the interaction of audio and text with a multi-head attention mechanism. First the audio and text features interact through the multi-head attention mechanism; the interacted features then pass through the linear layer, the normalization layer and the dropout layer, and are finally connected to the block's initial audio or text features through the residual structure;

where Φ1 denotes the learning function of the first speech-text or text-speech residual cross-modal fusion attention block;

in addition, the output of the first residual cross-modal fusion attention block is combined with the initial audio or text features and fed to a second residual cross-modal fusion attention block, so that a plurality of residual cross-modal fusion attention blocks are stacked together to generate the corresponding multi-modal fusion features of the two branches.

Our fusion strategy differs from previously proposed multimodal fusion models in that, to better integrate the multimodal information, we always keep the keys and values unchanged: the keys and values of every residual cross-modal fusion attention block are the initial audio and text features of the module. Finally, the outputs of the two residual cross-modal fusion attention blocks are concatenated to generate the final multi-modal fusion feature.
Example three;
referring to fig. 4, after the multi-modal fusion feature has been extracted by the residual cross-modal fusion attention module, the global perception module captures its global information. The size of the multi-modal fusion feature varies with the lengths of the audio (j) and the text (m); therefore, before inputting it into the global perception block, we must map its sequence dimension to 1 (i.e., set the values of j and m to 1). The global perception module is composed of two fully connected layers, a convolutional layer, two normalization layers, a Gaussian error linear unit activation function and a multiplication operation. The output dimensions of the first and last fully connected layers in the module are 4D_f and D_f, respectively (where D_f = 768). After projection through the Gaussian error linear unit activation function, the output is split into two 2D_f halves whose multiplication enhances cross-dimensional feature mixing. Finally, the output of the global perception fusion module is integrated for classification; the corresponding equation is:

y_i = FC(F_global-aware), y_i ∈ R^C  (9)

where Φ_global-aware denotes the function of the global perception module applied to the multi-modal fusion feature, F_global-aware its output, and C the number of emotion classes.
Further, the multi-modal emotion recognition model comprises a connectionist temporal classification (CTC) layer. Using the CTC loss as the loss function lets gradients be back-propagated effectively, so we calculate the CTC loss from the wav2vec2.0 features and the text transcription information t_i;

where |V| = 32 is the size of our vocabulary, consisting of the 26 letters of the alphabet and some punctuation marks;

furthermore, we need the output feature y_i of the global perception block and the true emotion label l_i to calculate the cross-entropy loss;

L_CrossEntropy = CrossEntropy(y_i, l_i)  (12)

finally, we introduce a hyper-parameter α that combines the two loss functions into one loss, α effectively controlling the relative importance of the CTC loss.

L = L_CrossEntropy + α·L_CTC, α ∈ (0, 1)  (13)
Experimental examples
(1) Data set
In line with much of the prior work in the speech emotion recognition literature, we train and evaluate our model on the IEMOCAP data set, a multimodal data set that is the benchmark for multimodal emotion recognition studies. It contains 12 hours of improvised and scripted audiovisual data from 10 theater actors (five male and five female) in five dyadic sessions, with the emotion information of each session presented in four forms: video, audio, transcription, and motion capture of facial movement.
Owing to the experimental requirements, we select the audio and transcription data of the IEMOCAP data set to evaluate our model. Like most studies, we select five emotion labels: happy, angry, neutral, sad and excited. Since happy and excited are highly similar, we relabel all excited samples as happy. Furthermore, we randomly split the data into training (80%) and test (20%) parts and evaluate our model with five-fold cross-validation.
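The relabeling and the 80/20 split described above can be sketched as follows (a toy illustration with synthetic sample IDs and labels; the real split operates on IEMOCAP utterances):

```python
import random

# Toy utterance list: (id, emotion). "excited" is merged into "happy",
# mirroring the relabeling described above.
samples = [(i, e) for i, e in enumerate(
    ["happy", "angry", "neutral", "sad", "excited"] * 20)]
samples = [(i, "happy" if e == "excited" else e) for i, e in samples]

rng = random.Random(42)
rng.shuffle(samples)
split = int(0.8 * len(samples))      # 80% train / 20% test
train, test = samples[:split], samples[split:]
print(len(train), len(test))         # → 80 20
```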
(2) Experimental setup
To explore the advantages of multi-modality, we construct two single-modal baselines using the text and speech modalities respectively. The text baseline uses Roberta-base as the contextual text encoder and then classifies with a single linear layer and a softmax activation function; the speech baseline uses a similar setup, replacing only the encoder with the pre-trained wav2vec2.0 model. In addition, Table 1 shows the basic hyper-parameter settings of our experiments.
TABLE 1 Hyper-parameter settings.
Video card | NVIDIA GeForce 940MX
Batch size | 2
Gradient accumulation | 4
Epochs | 100
α | 0; 0.001; 0.01; 0.1; 1
Optimizer | Adam
Learning rate | 1e-5
Loss functions | Cross-entropy and CTC loss
Evaluation metrics | Weighted accuracy and unweighted accuracy
(3) Ablation experiment
To better understand the contributions of the different modules in the proposed GCF-Net model, we perform multiple ablation studies on each module using the IEMOCAP data set. Weighted accuracy and unweighted accuracy are selected as the evaluation metrics.
To verify the impact of each modality, we train the proposed network using only audio features or only text features as input, without applying the fusion model. In Table 2, we can see that fusing the two features combines their advantages and significantly improves the emotion recognition rate compared with either single feature.
Table 2 Ablation experiments with different modal features.
Model | Weighted accuracy | Unweighted accuracy
Roberta-base baseline | 69.27% | 69.89%
Wav2vec2.0 baseline | 79.76% | 78.66%
Roberta-base + Wav2vec2.0 | 82.80% | 82.01%
Furthermore, we investigate the impact of the global perception module on the proposed model. In Table 3, we can see that adding the global perception block improves the weighted and unweighted accuracy by 1.0% and 1.5%, respectively. This shows that the global perception module can extract more important emotion information and improve the performance of the model.
Table 3 Ablation experiment on the global perception module.
Model | Weighted accuracy | Unweighted accuracy
Without global perception module | 81.73% | 80.92%
With global perception module | 82.80% | 82.01%
With the global perception block added, we also set up an ablation experiment for the residual cross-modal attention fusion block. Since two residual cross-modal attention fusion blocks are placed in parallel, this experiment verifies the effect of different numbers of residual cross-modal attention fusion layers in the proposed model. Table 4 shows that the model performs best with four layers of residual cross-modal attention fusion blocks (m = 4); when m = 5 the accuracy decreases. We therefore take m = 4 as the best choice.
TABLE 4 Ablation experiment on the number of residual cross-modal attention fusion block layers.
Number of layers | Weighted accuracy | Unweighted accuracy
1 | 79.66% | 78.85%
2 | 80.79% | 79.56%
3 | 82.16% | 81.10%
4 | 82.80% | 82.01%
5 | 80.95% | 80.47%
The hyper-parameter α controls the strength of the CTC loss; we therefore vary α from 0 to 1 to obtain different strengths. Table 5 shows the effect of different α values on our best model. We can see that the positive impact of the CTC loss is largest when α = 0.1.
TABLE 5 Ablation experiment on the value of α.
α | WA | UA
0 | 81.6% | 81.1%
0.001 | 81.9% | 81.6%
0.01 | 81.2% | 80.1%
0.1 | 82.8% | 82.0%
1 | 77.0% | 76.2%
(4) Error analysis
We visualize the performance of the different modalities on the different emotion categories with confusion matrices. FIG. 5 shows the confusion matrix of each modality: wav2vec2.0, Roberta-base, and the multi-modal fusion, respectively.
As shown in fig. 5 (a), the audio-only model confuses happiness with neutrality, so the recognition rates of these two emotions are much lower than those of the other two emotions; the recognition rate of anger, in particular, reaches 86.94%. In general, most emotions are easily confused with neutrality. Our observations are consistent with those reported by others, who argue that neutrality lies at the center of the activation space, making it more challenging to distinguish from the other classes. Compared with fig. 5 (a), fig. 5 (b) predicts happiness well. This result is reasonable: the word distributions of happiness and the other emotions differ more markedly than the audio signal data and thus provide more emotional information. On the other hand, its prediction of sadness is the worst, being confused with neutrality 23.71% of the time.
The model in fig. 5 (c) compensates for the deficiencies of the first two models (fig. 5 (a) and (b)) by fusing the two modal features. We can see that the prediction rate of every emotion except neutrality reaches 80%, and the prediction of sadness in particular reaches 91.27%. The recognition rates of anger and neutrality, however, decrease slightly.
Comparative example
As shown in Table 6, we compare with today's mainstream multimodal emotion recognition models using the same modality data. Our model achieves state-of-the-art results in both weighted and unweighted accuracy, which further demonstrates the effectiveness of the proposed model.
TABLE 6 Quantitative comparison with mainstream multimodal methods on the IEMOCAP data set.
Method | Weighted accuracy | Unweighted accuracy | Year
Xu et al. [24] | 70.40% | 69.50% | 2019
Liu et al. [59] | 72.40% | 70.10% | 2020
Makiuchi et al. [31] | 73.50% | 73.00% | 2021
Cai et al. [32] | 78.15% | - | 2021
Morais et al. [60] | 77.36% | 77.76% | 2022
Ghosh et al. [52] | 77.64% | - | 2022
GCF-Net (our model) | 82.80% | 82.01% | -
In conclusion, the invention provides a global perception cross-modal feature fusion network (GCF-Net) for speech emotion recognition. In GCF-Net, a residual cross-modal fusion attention module is designed to help the network extract rich features from audio and text; a global fusion block is added behind the residual cross-modal fusion attention module to further extract emotion-rich features over a global range; and automatic speech recognition is introduced as an auxiliary task that computes the connectionist temporal classification (CTC) loss. Experimental results on the IEMOCAP data set show that the residual cross-modal fusion attention module, the global perception module and the auxiliary CTC loss all improve the performance of the model.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A speech emotion recognition method based on a global perception cross-modal feature fusion network, characterized by comprising the following main steps:
S1: extracting wav2vec2.0 features and text features, respectively, through pre-trained models obtained by transfer learning;
S2: fusing features from different modalities through a residual cross-modal fusion attention module;
S3: introducing a global perception module to capture important emotion information of the multi-modal fusion features from different scales;
S4: validating the effectiveness of the method through numerous experiments performed on the IEMOCAP data set.
2. The speech emotion recognition method based on the global perception cross-modal feature fusion network according to claim 1, comprising a multi-modal emotion recognition model, wherein: the multi-modal emotion recognition model comprises an SER part and an ASR part; the SER part comprises wav2vec2.0, RoBERTa-base, the residual cross-modal fusion attention module, the global perception block and a fully connected layer; the ASR part comprises a transcription layer, an audio feature layer and a fully connected layer; the SER part calculates a CrossEntropy loss from the predicted emotion labels and the real emotion labels, the ASR part calculates a CTC loss from the audio features of the wav2vec2.0 model and the corresponding text transcription, and finally the CrossEntropy loss and the CTC loss are added to obtain the loss value for training.
3. The method according to claim 2, wherein the multi-modal emotion recognition model comprises a problem statement:
a data set D has k utterances u_i, each utterance corresponding to a label l_i; each utterance consists of a speech segment a_i and a text transcription t_i, i.e. u_i = (a_i, t_i), where t_i is either ASR-transcribed text or manually annotated text; the proposed network model takes u_i as input and assigns the correct emotion label to any given utterance:
⟨U, L⟩ = {{u_i = ⟨a_i, t_i⟩, l_i} | i ∈ [1, k]} (1)
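A hypothetical mini-dataset in the ⟨U, L⟩ form of equation (1) can be written as plain Python tuples; the audio samples, transcriptions, and labels below are invented for illustration only.

```python
# Each element pairs an utterance u_i = (a_i, t_i) with its emotion label l_i:
# a_i is a dummy waveform sample list, t_i its transcription.
dataset = [
    (([0.01, -0.02, 0.03], "i am so happy today"), "happy"),
    (([0.00, 0.05, -0.01], "leave me alone"), "angry"),
]
k = len(dataset)  # k utterances, as in the claim
for (a_i, t_i), l_i in dataset:
    assert isinstance(t_i, str) and isinstance(l_i, str)
```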
4. The method according to claim 2, wherein the multi-modal emotion recognition model comprises feature coding:
in feature coding, the audio information and the text information of each utterance are encoded by the corresponding encoder into wav2vec2.0 features and text features, respectively, which are input to the proposed model.
5. The method according to claim 2, wherein the multi-modal emotion recognition model comprises speech coding:
wav2vec2.0 features contain the rich prosodic information needed for emotion recognition; in our model, a pre-trained wav2vec2.0 model is used as the raw-audio-waveform coder to extract wav2vec2.0 features; the model, based on the Transformer structure, represents the speech audio sequence by fitting a set of ASR modeling units shorter than phonemes; in addition, after comparing the two versions of the wav2vec2.0 model, we choose the wav2vec2-base model with a dimension size of 768; we input the audio data a_i of the i-th utterance into the pre-trained wav2vec2.0 model to obtain a context-embedded representation X_A ∈ R^(j×D_A), where D_A represents the size of the audio feature embedding; thus, X_A can be expressed as:
X_A = F_wav2vec2.0(a_i) (2)
wherein F_wav2vec2.0 represents the pre-trained wav2vec2.0 model as the audio feature processing function; j depends on the size of the raw audio and on the CNN feature extraction layer in the wav2vec2.0 model, which extracts frames from the raw audio with a stride of 20 ms and a window of 25 ms; in our experiments, the parameters of the CNN feature extraction layer are kept frozen.
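The number of frames j can be estimated with back-of-envelope arithmetic, assuming (as the text states) a 25 ms window and 20 ms stride at a 16 kHz sampling rate; exact counts depend on the model's padding behaviour, so this is a sketch rather than the wav2vec2.0 implementation.

```python
SAMPLE_RATE = 16_000
WIN = int(0.025 * SAMPLE_RATE)   # 400 samples per analysis window (25 ms)
HOP = int(0.020 * SAMPLE_RATE)   # 320 samples between frames (20 ms stride)

def num_frames(num_samples: int) -> int:
    """Frames obtainable without padding: floor((n - win) / hop) + 1."""
    if num_samples < WIN:
        return 0
    return (num_samples - WIN) // HOP + 1

j = num_frames(3 * SAMPLE_RATE)  # frame count for a 3-second utterance
```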
6. The method according to claim 2, wherein the multi-modal emotion recognition model comprises a context text representation:
the text data is input into a RoBERTa-base model for coding; before extracting text features, the input text is tokenized, separators are added, and sentences are split; the tokenized text data and the corresponding utterances are then used for fine-tuning; the context embedding X_T ∈ R^(m×D_T) can be expressed as:
X_T = F_RoBERTa-base(t_i) (3)
wherein F_RoBERTa-base represents the text feature extraction function, m depends on the number of tokens in the text, and D_T is the dimension size of the text feature embedding.
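The marking step can be pictured with a toy tokenizer. Real RoBERTa tokenization uses learned byte-pair encoding; here plain whitespace splitting with `<s>`/`</s>` separators stands in for it purely to show how the token count m arises.

```python
def mark_text(sentence: str) -> list:
    """Toy stand-in for tokenization: lowercase, split on whitespace,
    and wrap with RoBERTa-style start/end separators."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

tokens = mark_text("I am fine")
m = len(tokens)  # m, the number of tokens, sets the length of X_T
```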
7. The method according to claim 2, wherein: the residual cross-modal fusion attention module is composed of two parallel fusion blocks for the different modalities, specifically a speech-text residual cross-modal fusion attention block and a text-speech residual cross-modal fusion attention block, and completes multi-modal information interaction using a cross-modal attention mechanism.
8. The method according to claim 7, wherein: the speech-text residual cross-modal fusion attention block uses the audio features X_A as queries and the text features X_T as keys and values, carrying out the interaction of audio and text with a multi-head attention mechanism; the text-speech residual cross-modal fusion attention block uses the text features X_T as queries and the audio features X_A as keys and values; first, the audio features and the text features interact through the multi-head attention mechanism, and the interacted features then pass through a linear layer, a normalization layer and a dropout layer and are connected to the block's initial audio features or text features through a residual structure:
F^(1) = Φ_1(X_A, X_T) (4)
wherein Φ_1 represents the learning function of the first speech-text residual cross-modal fusion attention block or the first text-speech residual cross-modal fusion attention block;
in addition, the output of the first residual cross-modal fusion attention block is combined with the initial audio features or text features and fed to the second residual cross-modal fusion attention block, so that a plurality of residual cross-modal fusion attention blocks are stacked to generate the corresponding multi-modal fusion features F_A and F_T;
our fusion strategy differs from previously proposed multi-modal fusion models in that, to better integrate multi-modal information, the keys and values are kept unchanged throughout: the keys and values of each residual cross-modal fusion attention block are the initial audio and text features of the module; finally, the outputs of the two residual cross-modal fusion attention blocks are concatenated to generate the final multi-modal fusion feature F.
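The core of the speech-text block can be sketched with single-head scaled dot-product attention in NumPy. This is a minimal illustration with toy dimensions: the patented module additionally uses multiple heads plus linear, normalization and dropout layers, all omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values):
    """Audio frames query text tokens (or vice versa), with a residual
    connection back to the queries, as in the claim."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)      # (j, m) audio-text affinities
    attended = softmax(scores, axis=-1) @ keys_values  # (j, d) text-informed audio
    return attended + queries                          # residual structure

rng = np.random.default_rng(0)
x_a = rng.standard_normal((6, 8))  # j = 6 audio frames, toy dimension d = 8
x_t = rng.standard_normal((4, 8))  # m = 4 text tokens
fused = cross_modal_attention(x_a, x_t)  # speech-text direction
```

Swapping the arguments (`cross_modal_attention(x_t, x_a)`) gives the text-speech direction with X_T as queries.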
9. The method according to claim 2, wherein: the global perception module captures global information of the multi-modal fusion features extracted by the residual cross-modal fusion attention module; since the size of the multi-modal fusion features changes with the lengths of the audio (j) and the text (m), before the multi-modal fusion features are input into the global perception block their length dimension must be mapped to a fixed size l (i.e. both j and m are mapped to l); the structure of the global perception module is composed of two fully connected layers, a convolution layer, two normalization layers, a Gaussian error linear unit (GELU) activation function and a multiplication operation; the output dimensions of the first and last fully connected layers in the module are 4D_f and D_f respectively (where D_f = 768); after projection through the GELU activation function, the output is split into two 2D_f parts whose multiplication enhances cross-dimension feature mixing; finally, the output of the global perception fusion module is integrated for classification, and the corresponding equations are:
F_global-aware = Φ_global-aware(F) (8)
y_i = FC(F_global-aware), y_i ∈ R^C (9)
wherein Φ_global-aware is the function applying the global perception module to the multi-modal fusion features, and C is the number of emotion classes.
10. The method according to claim 2, wherein: the multi-modal emotion recognition model comprises a Connectionist Temporal Classification (CTC) layer, and the CTC loss is used as a loss function to efficiently backpropagate gradients; we therefore calculate the CTC loss from the wav2vec2.0 features X_A and the text transcription information t_i:
L_CTC = CTC(X_A, t_i) (11)
wherein V = 32 is the size of our vocabulary, consisting of the 26 letters of the alphabet and some punctuation marks;
furthermore, we use the output feature y_i of the global perception block and the true emotion label l_i to calculate the cross entropy loss:
L_CrossEntropy = CrossEntropy(y_i, l_i) (12)
finally, we introduce a hyper-parameter α that combines the two loss functions into one loss; α effectively controls the relative importance of the CTC loss:
L = L_CrossEntropy + α·L_CTC, α ∈ (0, 1) (13)
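The combination in equation (13) amounts to simple arithmetic over two scalars. In this sketch the CTC term is a placeholder value (a real implementation would compute it from the wav2vec2.0 features and the transcription, e.g. with a library CTC loss), and the logits and α are invented for illustration.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross entropy of a single prediction against an integer label."""
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return -log_probs[label]

emotion_logits = np.array([2.0, 0.5, -1.0, 0.1])  # C = 4 emotion classes
l_ce = cross_entropy(emotion_logits, label=0)     # equation (12)
l_ctc = 1.7     # placeholder for the CTC loss of equation (11)
alpha = 0.1     # hyper-parameter in (0, 1), as in equation (13)
total = l_ce + alpha * l_ctc                      # equation (13)
```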
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211489099.7A CN115730203A (en) | 2022-11-25 | 2022-11-25 | Voice emotion recognition method based on global perception cross-modal feature fusion network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115730203A true CN115730203A (en) | 2023-03-03 |
Family
ID=85298296
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116089906A (en) * | 2023-03-13 | 2023-05-09 | 山东大学 | Multi-mode classification method and system based on dynamic context representation and mode fusion |
CN116778967A (en) * | 2023-08-28 | 2023-09-19 | 清华大学 | Multi-mode emotion recognition method and device based on pre-training model |
CN116778967B (en) * | 2023-08-28 | 2023-11-28 | 清华大学 | Multi-mode emotion recognition method and device based on pre-training model |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||