CN114494969A - Emotion recognition method based on multimodal speech information complementation and gating - Google Patents

Emotion recognition method based on multimodal speech information complementation and gating Download PDF

Info

Publication number
CN114494969A
CN114494969A (application CN202210106236.8A)
Authority
CN
China
Prior art keywords
features
fusion
representation
mode
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210106236.8A
Other languages
Chinese (zh)
Inventor
刘峰
李知函
齐佳音
周爱民
李志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University Of International Business And Economics
East China Normal University
Original Assignee
Shanghai University Of International Business And Economics
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University Of International Business And Economics, East China Normal University filed Critical Shanghai University Of International Business And Economics
Priority to CN202210106236.8A
Publication of CN114494969A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an emotion recognition method based on multimodal speech information complementation and gating, which belongs to the technical field of multimodal emotion recognition and comprises the following steps: S1, extracting audio features and text features from a target video; S2, performing bidirectional feature fusion on the audio features and the text features; S3, adjusting the proportion of the fused representations in the result of the bidirectional fusion of S2 through a learnable gating mechanism and outputting the result; and S4, concatenating the outputs of the learnable gating mechanism of S3 to finally obtain the emotion category output. The invention applies a gating mechanism to the cross-attention module to decide whether to retain source-modality information or override target-modality information, and adjusts the proportion of the source-modality and target-modality information, thereby balancing recognition accuracy against the number of model parameters.

Description

Emotion recognition method based on multimodal speech information complementation and gating
Technical Field
The invention relates to the technical field of multimodal emotion recognition, and in particular to an emotion recognition method based on multimodal speech information complementation and gating.
Background
Emotion plays a key role in interpersonal communication, and not only linguistic information but also acoustic information conveys an individual's emotional state. In many areas, such as human-computer interaction, healthcare, and cognitive science, there is considerable interest in developing tools that recognize emotion in human vocal expression. The recent rise of deep learning has advanced emotion recognition, and application demands have in turn driven the development of high-performance lightweight models.
Much existing work improves speech emotion recognition using audio-only features. Representations based on low-level descriptors (LLDs) are extracted with deep learning networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Variant modular structures, such as CNN-LSTM, are also used in this area to extract feature sequences and capture temporal dependencies.
However, linguistic information and acoustic information are equally important for emotion recognition, so both the text and audio modalities should be taken into account to accomplish the task of multimodal emotion recognition. For the audio modality, feature extraction is similar to that in single-modality speech emotion recognition. For the text modality, a word embedding model such as GloVe is typically used. What makes multimodal emotion recognition more challenging than single-modality emotion recognition is the process of modality fusion. Some early work combined different features as inputs to deep neural networks; to fuse modalities at a deeper level, the Transformer architecture has been widely used to strengthen the learned modality-fusion representations.
Despite the improvements made by previous work, little attention has been paid to the proportion and balance of the modality-fusion representations.
Disclosure of Invention
The invention aims to provide an emotion recognition method based on multimodal speech information complementation and gating, which can adjust the proportion of the modality-fusion representations and balance emotion recognition accuracy against the number of model parameters.
In order to achieve this purpose, the invention adopts the following technical scheme:
the emotion recognition method based on multimode voice information complementation and gate control comprises the following steps: s1, extracting audio features and text features in the target video; s2, performing feature bidirectional fusion on the audio features and the text features; s3, the proportion of fusion representation in the result of bidirectional fusion in S2 is adjusted through a learnable door control mechanism and output; and S4, splicing the output of the learnable door control mechanism in S3, and finally obtaining emotion category output.
S2 includes: taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation, and fusing the text features and the audio features through a Transformer cross-attention mechanism to obtain a first fused representation; and taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation, and fusing the audio features and the text features through a Transformer cross-attention mechanism to obtain a second fused representation.
More specifically, S2 includes: taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation; fusing the text features and the audio features through a Transformer cross-attention mechanism; performing cross-layer connection and normalization through a residual module to obtain a first intermediate fused representation; and enhancing the first intermediate fused representation through a fully connected layer and normalization to obtain the first fused representation. Symmetrically, taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation; fusing the audio features and the text features through a Transformer cross-attention mechanism; performing cross-layer connection and normalization through a residual module to obtain a second intermediate fused representation; and enhancing the second intermediate fused representation through a fully connected layer and normalization to obtain the second fused representation.
S3 is: proportionally fusing, through a learnable gating mechanism, the first fused representation with the first original-modality representation to obtain a first intermediate output, and proportionally fusing the second fused representation with the second original-modality representation to obtain a second intermediate output.
S4 is: concatenating the first intermediate output and the second intermediate output to finally obtain the emotion category output.
The method is applied to the public dataset CMU-MOSEI and optimized with an Adam optimizer during training.
Drawings
The invention and its features, aspects and advantages will become more apparent from reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings. Like reference symbols in the various drawings indicate like elements. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is an architecture diagram of the emotion recognition method based on multimodal speech information complementation and gating provided by the present invention;
FIG. 2 is a comparison of parameter counts and F1 scores on CMU-MOSEI for different models, provided by the present invention.
Detailed Description
The invention will be further described with reference to the following drawings and specific examples, which are not intended to limit the invention thereto.
In the prior art, most speech emotion recognition models only consider information from the speech modality and ignore text, i.e. the semantic information in the text, and thus lack a balanced fusion of semantic and audio information. Moreover, most current networks are influenced by large-scale pre-trained models and have huge parameter counts, which makes them difficult to deploy in scenarios with strict requirements on real-time performance and model size.
In the emotion recognition method based on multimodal speech information complementation and gating, as shown in FIG. 1, audio features and text features are first extracted from the target video. The text modality is processed with pre-trained GloVe word embeddings, each embedding being a 300-dimensional vector. For the audio modality, COVAREP is used to extract low-level 74-dimensional vectors, including 12 Mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmentation features, peak slope parameters, and pitch maxima.
Audio features and text features are extracted with a CNN-BiLSTM and a BiLSTM, respectively. A text sequence Xt ∈ R^(Tt×dt) is encoded with the BiLSTM, which can be expressed as
Ht=BiLSTM(Xt)
where Ht ∈ R^(Tt×d) denotes the encoded text features.
For the audio modality, the audio sequence is represented as Xa ∈ R^(Ta×da). A one-dimensional convolution is applied first, i.e.
Xa′=Conv1D(Xa)
and the BiLSTM then encodes this result as its input,
Ha=BiLSTM(Xa′)
where Ha ∈ R^(Ta×d) denotes the encoded audio features.
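By way of illustration, a minimal PyTorch sketch of the two encoders described above might look as follows; the 300-dimensional text embeddings and 74-dimensional audio frames follow the text, while the hidden sizes, kernel size, and class names are assumptions:

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """BiLSTM over 300-dimensional GloVe word embeddings (hidden size assumed)."""
    def __init__(self, in_dim=300, hid_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, x_t):                      # x_t: (batch, T_t, 300)
        h_t, _ = self.bilstm(x_t)                # h_t: (batch, T_t, 2 * hid_dim)
        return h_t

class AudioEncoder(nn.Module):
    """1-D convolution over 74-dimensional COVAREP frames, followed by a BiLSTM."""
    def __init__(self, in_dim=74, conv_dim=64, hid_dim=64, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, conv_dim, kernel_size=kernel, padding=kernel // 2)
        self.bilstm = nn.LSTM(conv_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, x_a):                      # x_a: (batch, T_a, 74)
        z = self.conv(x_a.transpose(1, 2))       # Conv1d expects (batch, channels, T_a)
        h_a, _ = self.bilstm(z.transpose(1, 2))  # h_a: (batch, T_a, 2 * hid_dim)
        return h_a
```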
After feature extraction, the cross-attention module of the Transformer reinforces the features of one modality with the features of the other modality, and a gating mechanism is used as a flow-control unit to balance the proportion of the two modalities.
The source modality is defined as HS ∈ R^(TS×dS) and the target modality as HT ∈ R^(TT×dT), where S, T ∈ {t, a}; that is, fusion is performed bidirectionally, with text as the source modality and audio as the target modality, and with audio as the source modality and text as the target modality. In particular, let dS = dT = d, i.e. the source modality and the target modality have the same feature dimension.
In this embodiment, taking the text features as the source modality Ht and the audio features as the target modality Ha, the text features serve as the first original-modality representation. The text features and audio features are fused through the Transformer cross-attention mechanism, i.e. cross attention, with the following formulas:
Q=WQ×Ha
K=WK×Ht
V=WV×Ht
H′=softmax(QK^T/√d)V
where WQ, WK, WV ∈ R^(d×d) are linear transformations of the original features and H′ is the cross-attention output. Then, cross-layer connection and normalization are performed through a residual module:
ht→a=LN(H′+Ha)
where ht→a ∈ R^(Ta×d) denotes the first intermediate fused representation obtained by fusing the source modality into the target modality.
Finally, the first intermediate fused representation is further enhanced with a fully connected layer and normalization to obtain:
Ht→a′=LN(ht→a+FFN(ht→a))
where Ht→a′ ∈ R^(Ta×d) denotes the first fused representation obtained by fusing the source modality Ht toward the target modality Ha, and FFN denotes the fully connected feed-forward layer.
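A hedged sketch of one such cross-attention fusion block, following the formulas above with a single attention head (the head count, dimensions, and class name are assumptions not fixed by the text):

```python
import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Single-head cross attention: query from the target modality, key/value from
    the source modality, followed by residual + LayerNorm and an FFN + LayerNorm."""
    def __init__(self, d=128, ffn_dim=256):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))

    def forward(self, h_src, h_tgt):               # h_src: (batch, T_src, d), h_tgt: (batch, T_tgt, d)
        q = self.w_q(h_tgt)
        k = self.w_k(h_src)
        v = self.w_v(h_src)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        h_prime = attn @ v                          # H': cross-attention output, (batch, T_tgt, d)
        h_mid = self.ln1(h_prime + h_tgt)           # intermediate fused representation
        return self.ln2(h_mid + self.ffn(h_mid))    # fused representation toward the target modality
```

Under this sketch, the text-to-audio direction uses one instance applied as fusion_ta(h_t, h_a) on the encoder outputs Ht and Ha, and the audio-to-text direction uses a second instance applied as fusion_at(h_a, h_t), with separate parameters per direction.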
Symmetrically, taking the audio features as the source modality Ha and the text features as the target modality Ht, the audio features serve as the second original-modality representation. The audio features and text features are fused through the Transformer cross-attention mechanism with the following formulas:
Q=WQ×Ht
K=WK×Ha
V=WV×Ha
H′=softmax(QK^T/√d)V
where WQ, WK, WV ∈ R^(d×d) are linear transformations of the original features and H′ is the cross-attention output. Then, cross-layer connection and normalization are performed through a residual module:
ha→t=LN(H′+Ht)
where ha→t ∈ R^(Tt×d) denotes the second intermediate fused representation obtained by fusing the source modality into the target modality.
Finally, the second intermediate fused representation is further enhanced with a fully connected layer and normalization to obtain:
Ha→t′=LN(ha→t+FFN(ha→t))
where Ha→t′ ∈ R^(Tt×d) denotes the second fused representation obtained by fusing the source modality Ha toward the target modality Ht, and FFN denotes the fully connected feed-forward layer.
The first fused representation and the first original-modality representation are then proportionally fused through a learnable gating mechanism to obtain the first intermediate output:
Ht→a=Ht→a′×Gi+Ha×Gr
where Gi denotes the integration gate, which adjusts the weight of the fused information, and Gr denotes the retention gate, which adjusts the weight of the original information.
Likewise, the second fused representation and the second original-modality representation are proportionally fused through the learnable gating mechanism to obtain the second intermediate output:
Ha→t=Ha→t′×Gi+Ht×Gr
where, as above, Gi denotes the integration gate and Gr denotes the retention gate.
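One plausible reading of the integration gate Gi and retention gate Gr, as learnable sigmoid-bounded scalar weights, is sketched below; the patent does not fix their exact parameterization, so this is an assumption:

```python
import torch
import torch.nn as nn

class LearnableGate(nn.Module):
    """Mixes the fused representation with the original target-modality features
    through an integration gate G_i and a retention gate G_r (scalar gates assumed)."""
    def __init__(self):
        super().__init__()
        self.g_integrate = nn.Parameter(torch.zeros(1))   # weight of the fused information
        self.g_retain = nn.Parameter(torch.zeros(1))      # weight of the original information

    def forward(self, h_fused, h_orig):
        return h_fused * torch.sigmoid(self.g_integrate) + h_orig * torch.sigmoid(self.g_retain)
```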
Finally, the first intermediate output and the second intermediate output are concatenated to obtain the emotion category output, with the following formula:
ŷ=Transformer([Ht→a, Ha→t])
where Ht→a denotes the fused vector representation in the text-to-audio direction, Ha→t denotes the fused vector representation in the audio-to-text direction, [·,·] denotes the concatenation operation, Transformer denotes a Transformer encoder, and ŷ denotes the predicted emotion category.
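A sketch of this final step under stated assumptions (the concatenation axis, pooling, encoder depth, and number of emotion classes are not fixed by the text):

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Concatenates the two gated fusion outputs, encodes them with a Transformer
    encoder, and predicts emotion logits."""
    def __init__(self, d=128, n_heads=4, n_layers=1, n_classes=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, h_t2a, h_a2t):                 # (batch, T_a, d), (batch, T_t, d)
        fused = torch.cat([h_t2a, h_a2t], dim=1)     # [.,.]: concatenate along the time axis (assumed)
        encoded = self.encoder(fused)                # (batch, T_a + T_t, d)
        return self.classifier(encoded.mean(dim=1))  # pooled emotion logits
```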
Finally, the method is applied to the public dataset CMU-MOSEI, using the Adam optimizer and a learning-rate decay technique during training. The experimental results show that the model based on this method achieves the best performance with the smallest parameter count (only 0.432M). The method balances accuracy against parameter count: it maintains considerable accuracy for practical applications while remaining lightweight. The detailed comparison is shown in FIG. 2.
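A minimal stand-in for the training setup mentioned above (Adam plus learning-rate decay); the model and batches here are placeholders, and the learning rate, decay schedule, loss, and epoch count are assumptions:

```python
import torch
import torch.nn as nn

# Stand-in model and random batches, purely to illustrate the optimizer setup; the real
# model assembled from the modules sketched above and a CMU-MOSEI loader would replace them.
model = nn.Sequential(nn.Linear(74, 128), nn.ReLU(), nn.Linear(128, 6))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                   # Adam, as stated in the text
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)   # learning-rate decay (schedule assumed)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    feats, labels = torch.randn(32, 74), torch.randint(0, 6, (32,))
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()   # decay the learning rate once per epoch
```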
In summary, the invention fuses the modality information through the Transformer-based cross-complementation module, while the learnable gating mechanism controls the information flow and stabilizes the training process. In the experiments, models based on this method were compared with baseline models and further ablation studies were performed. The results show the effectiveness of each component, which underpins both the performance and the light weight of the model. The algorithm balances accuracy and parameter count; compared with existing methods, the proposed model achieves the best results with the fewest parameters, and the experimental results show that the method is better suited to real-world applications.
The present invention applies a gating mechanism to the cross-attention module to decide whether to retain the source-modality information or override the target-modality information. In addition, most existing models rely on a large number of learnable parameters and overlook promising application areas, such as human-computer interaction, that require real-time, lightweight models. A lightweight model is therefore necessary to improve the feasibility and practicality of speech emotion recognition applications.
The above is a description of the preferred embodiment of the invention. It should be understood that the invention is not limited to the particular embodiment described above; devices and structures not described in detail should be understood as being implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or modify them into equivalent embodiments, without departing from the technical solution of the invention. Therefore, any simple modification or equivalent change made to the above embodiment according to the technical essence of the present invention still falls within the scope of protection of the technical solution of the present invention.

Claims (6)

1. An emotion recognition method based on multimodal speech information complementation and gating, characterized by comprising the following steps:
S1, extracting audio features and text features from a target video;
S2, performing bidirectional feature fusion on the audio features and the text features;
S3, adjusting the proportion of the fused representations in the result of the bidirectional fusion of S2 through a learnable gating mechanism, and outputting the result;
and S4, concatenating the outputs of the learnable gating mechanism of S3 to finally obtain the emotion category output.
2. The emotion recognition method based on multimodal speech information complementation and gating according to claim 1, wherein S2 comprises:
taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation, and fusing the text features and the audio features through a Transformer cross-attention mechanism to obtain a first fused representation;
and taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation, and fusing the audio features and the text features through a Transformer cross-attention mechanism to obtain a second fused representation.
3. The emotion recognition method based on multimodal speech information complementation and gating according to claim 1, wherein S2 comprises:
taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation;
fusing the text features and the audio features through a Transformer cross-attention mechanism, and performing cross-layer connection and normalization through a residual module to obtain a first intermediate fused representation;
enhancing the first intermediate fused representation through a fully connected layer and normalization to obtain the first fused representation;
taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation;
fusing the audio features and the text features through a Transformer cross-attention mechanism, and performing cross-layer connection and normalization through a residual module to obtain a second intermediate fused representation;
and enhancing the second intermediate fused representation through a fully connected layer and normalization to obtain the second fused representation.
4. The emotion recognition method based on multimodal speech information complementation and gating according to claim 2 or 3, wherein S3 is:
proportionally fusing, through a learnable gating mechanism, the first fused representation with the first original-modality representation to obtain a first intermediate output, and proportionally fusing the second fused representation with the second original-modality representation to obtain a second intermediate output.
5. The emotion recognition method based on multimodal speech information complementation and gating according to claim 4, wherein S4 is:
concatenating the first intermediate output and the second intermediate output to finally obtain the emotion category output.
6. The emotion recognition method based on multimodal speech information complementation and gating according to claim 1, wherein the method is applied to the public dataset CMU-MOSEI and optimized with an Adam optimizer during training.
CN202210106236.8A 2022-01-28 2022-01-28 Emotion recognition method based on multimodal speech information complementation and gating Pending CN114494969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106236.8A CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210106236.8A CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating

Publications (1)

Publication Number Publication Date
CN114494969A true CN114494969A (en) 2022-05-13

Family

ID=81477008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210106236.8A Pending CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating

Country Status (1)

Country Link
CN (1) CN114494969A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238749A (en) * 2022-08-04 2022-10-25 中国人民解放军军事科学院***工程研究院 Feature fusion modulation identification method based on Transformer
CN115238749B (en) * 2022-08-04 2024-04-23 中国人民解放军军事科学院***工程研究院 Modulation recognition method based on feature fusion of transducer
CN117423168A (en) * 2023-12-19 2024-01-19 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion
CN117423168B (en) * 2023-12-19 2024-04-02 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion

Similar Documents

Publication Publication Date Title
Huang et al. Attention assisted discovery of sub-utterance structure in speech emotion recognition.
Gu et al. Speech intention classification with multimodal deep learning
JP2023509031A (en) Translation method, device, device and computer program based on multimodal machine learning
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
Shashidhar et al. Combining audio and visual speech recognition using LSTM and deep convolutional neural network
Seo et al. Wav2kws: Transfer learning from speech representations for keyword spotting
CN115329779B (en) Multi-person dialogue emotion recognition method
CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating
Zhang et al. Multi-head attention fusion networks for multi-modal speech emotion recognition
CN111382257A (en) Method and system for generating dialog context
CN112597841B (en) Emotion analysis method based on door mechanism multi-mode fusion
CN110569869A (en) feature level fusion method for multi-modal emotion detection
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
Qu et al. Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
CN114882862A (en) Voice processing method and related equipment
Xu et al. A comprehensive survey of automated audio captioning
CN117892237A (en) Multi-modal dialogue emotion recognition method and system based on hypergraph neural network
CN116955579B (en) Chat reply generation method and device based on keyword knowledge retrieval
Singh et al. A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora
CN116860943A (en) Multi-round dialogue method and system for dialogue style perception and theme guidance
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116052291A (en) Multi-mode emotion recognition method based on non-aligned sequence
Liu et al. Keyword retrieving in continuous speech using connectionist temporal classification
Shin et al. Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN.
Song et al. Towards realizing sign language to emotional speech conversion by deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination