CN114494969A - Emotion recognition method based on multimodal speech information complementation and gating - Google Patents

Emotion recognition method based on multimodal speech information complementation and gating Download PDF

Info

Publication number
CN114494969A
CN114494969A (application CN202210106236.8A)
Authority
CN
China
Prior art keywords
features
fusion
representation
mode
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210106236.8A
Other languages
Chinese (zh)
Inventor
刘峰
李知函
齐佳音
周爱民
李志斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University Of International Business And Economics
East China Normal University
Original Assignee
Shanghai University Of International Business And Economics
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University Of International Business And Economics, East China Normal University filed Critical Shanghai University Of International Business And Economics
Priority to CN202210106236.8A
Publication of CN114494969A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an emotion recognition method based on multimodal speech information complementation and gating, which belongs to the technical field of multimodal emotion recognition and comprises the following steps: S1, extracting audio features and text features from a target video; S2, performing bidirectional feature fusion on the audio features and the text features; S3, adjusting the proportion of the fused representations in the result of the bidirectional fusion of S2 through a learnable gating mechanism and outputting the result; and S4, concatenating the outputs of the learnable gating mechanism of S3 to finally obtain the emotion category output. The invention applies a gating mechanism to the cross-attention module to decide whether to retain source-modality information or override target-modality information, and adjusts the proportion of the source-modality and target-modality information, thereby balancing recognition accuracy against the number of model parameters.

Description

Emotion recognition method based on multimodal speech information complementation and gating
Technical Field
The invention relates to the technical field of multimodal emotion recognition, and in particular to an emotion recognition method based on multimodal speech information complementation and gating.
Background
Emotion plays a key role in interpersonal communication, and not only linguistic information but also acoustic information conveys an individual's emotional state. In many areas, such as human-computer interaction, healthcare, and cognitive science, there is considerable interest in developing tools that recognize emotion in human vocal expression. The recent rise of deep learning has advanced emotion recognition, and application demands have in turn driven the development of high-performance lightweight models.
Much existing work improves speech emotion recognition using audio-only features. Representations based on low-level descriptors (LLDs) are extracted with deep learning networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Variant modular structures, such as CNN-LSTM, are also used in this area to extract feature sequences and capture temporal dependencies.
However, linguistic information and acoustic information are equally important for emotion recognition, so both the text and audio modalities should be taken into account to accomplish the task of multimodal emotion recognition. For the audio modality, feature extraction is similar to that in single-modality speech emotion recognition. For the text modality, a word embedding model such as GloVe is typically used. What makes multimodal emotion recognition more challenging than single-modality emotion recognition is the process of modality fusion. Some early work combined different features as inputs to deep neural networks; to fuse modalities at a deeper level, the Transformer architecture has been widely used to strengthen the learned modality-fusion representations.
Despite the improvements made by previous work, little attention has been paid to the proportion and balance of the modality-fusion representations.
Disclosure of Invention
The invention aims to provide an emotion recognition method based on multimodal speech information complementation and gating, which can adjust the proportion of the modality-fusion representations and balance emotion recognition accuracy against the number of model parameters.
In order to achieve this purpose, the invention adopts the following technical scheme:
the emotion recognition method based on multimode voice information complementation and gate control comprises the following steps: s1, extracting audio features and text features in the target video; s2, performing feature bidirectional fusion on the audio features and the text features; s3, the proportion of fusion representation in the result of bidirectional fusion in S2 is adjusted through a learnable door control mechanism and output; and S4, splicing the output of the learnable door control mechanism in S3, and finally obtaining emotion category output.
S2 includes: taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation, and fusing the text features and the audio features through a Transformer cross-attention mechanism to obtain a first fused representation; and taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation, and fusing the audio features and the text features through a Transformer cross-attention mechanism to obtain a second fused representation.
More specifically, S2 includes: taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation; fusing the text features and the audio features through a Transformer cross-attention mechanism; performing cross-layer connection and normalization through a residual module to obtain a first intermediate fused representation; and enhancing the first intermediate fused representation through a fully connected layer and normalization to obtain the first fused representation. Symmetrically, taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation; fusing the audio features and the text features through a Transformer cross-attention mechanism; performing cross-layer connection and normalization through a residual module to obtain a second intermediate fused representation; and enhancing the second intermediate fused representation through a fully connected layer and normalization to obtain the second fused representation.
S3 is: proportionally fusing, through a learnable gating mechanism, the first fused representation with the first original-modality representation to obtain a first intermediate output, and proportionally fusing the second fused representation with the second original-modality representation to obtain a second intermediate output.
S4 is: concatenating the first intermediate output and the second intermediate output to finally obtain the emotion category output.
The method is applied to the public dataset CMU-MOSEI and optimized with an Adam optimizer during training.
Drawings
The invention and its features, aspects and advantages will become more apparent from reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings. Like reference symbols in the various drawings indicate like elements. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is an architecture diagram of the emotion recognition method based on multimodal speech information complementation and gating provided by the present invention;
FIG. 2 is a comparison of parameter counts and F1 scores on CMU-MOSEI for different models, provided by the present invention.
Detailed Description
The invention will be further described with reference to the following drawings and specific examples, which are not intended to limit the invention thereto.
In the prior art, most speech emotion recognition models only consider information from the speech modality and ignore text, i.e. the semantic information in the text, and thus lack a balanced fusion of semantic and audio information. Moreover, most current networks are influenced by large-scale pre-trained models and have huge parameter counts, which makes them difficult to deploy in scenarios with strict requirements on real-time performance and model size.
In the emotion recognition method based on multimodal speech information complementation and gating, as shown in FIG. 1, audio features and text features are first extracted from the target video. The text modality is processed with pre-trained GloVe word embeddings, each embedding being a 300-dimensional vector. For the audio modality, COVAREP is used to extract low-level 74-dimensional vectors, including 12 Mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmentation features, peak slope parameters, and pitch maxima.
Audio features and text features are extracted with a CNN-BiLSTM and a BiLSTM, respectively. A text sequence Xt ∈ R^(Tt×dt) is encoded with the BiLSTM, which can be expressed as
Ht=BiLSTM(Xt)
where Ht ∈ R^(Tt×d) denotes the encoded text features.
For the audio modality, the audio sequence is represented as Xa ∈ R^(Ta×da). A one-dimensional convolution is applied first, i.e.
Xa′=Conv1D(Xa)
and the BiLSTM then encodes this result as its input,
Ha=BiLSTM(Xa′)
where Ha ∈ R^(Ta×d) denotes the encoded audio features.
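By way of illustration, a minimal PyTorch sketch of the two encoders described above might look as follows; the 300-dimensional text embeddings and 74-dimensional audio frames follow the text, while the hidden sizes, kernel size, and class names are assumptions:

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """BiLSTM over 300-dimensional GloVe word embeddings (hidden size assumed)."""
    def __init__(self, in_dim=300, hid_dim=64):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, x_t):                      # x_t: (batch, T_t, 300)
        h_t, _ = self.bilstm(x_t)                # h_t: (batch, T_t, 2 * hid_dim)
        return h_t

class AudioEncoder(nn.Module):
    """1-D convolution over 74-dimensional COVAREP frames, followed by a BiLSTM."""
    def __init__(self, in_dim=74, conv_dim=64, hid_dim=64, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, conv_dim, kernel_size=kernel, padding=kernel // 2)
        self.bilstm = nn.LSTM(conv_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, x_a):                      # x_a: (batch, T_a, 74)
        z = self.conv(x_a.transpose(1, 2))       # Conv1d expects (batch, channels, T_a)
        h_a, _ = self.bilstm(z.transpose(1, 2))  # h_a: (batch, T_a, 2 * hid_dim)
        return h_a
```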
After feature extraction, the cross-attention module of the Transformer reinforces the features of one modality with the features of the other modality, and a gating mechanism is used as a flow-control unit to balance the proportion of the two modalities.
The source modality is defined as HS ∈ R^(TS×dS) and the target modality as HT ∈ R^(TT×dT), where S, T ∈ {t, a}; that is, fusion is performed bidirectionally, with text as the source modality and audio as the target modality, and with audio as the source modality and text as the target modality. In particular, let dS = dT = d, i.e. the source modality and the target modality have the same feature dimension.
In this embodiment, taking the text features as the source modality Ht and the audio features as the target modality Ha, the text features serve as the first original-modality representation. The text features and audio features are fused through the Transformer cross-attention mechanism, i.e. cross attention, with the following formulas:
Q=WQ×Ha
K=WK×Ht
V=WV×Ht
H′=softmax(QK^T/√d)V
where WQ, WK, WV ∈ R^(d×d) are linear transformations of the original features and H′ is the cross-attention output. Then, cross-layer connection and normalization are performed through a residual module:
ht→a=LN(H′+Ha)
where ht→a ∈ R^(Ta×d) denotes the first intermediate fused representation obtained by fusing the source modality into the target modality.
Finally, the first intermediate fused representation is further enhanced with a fully connected layer and normalization to obtain:
Ht→a′=LN(ht→a+FFN(ht→a))
where Ht→a′ ∈ R^(Ta×d) denotes the first fused representation obtained by fusing the source modality Ht toward the target modality Ha, and FFN denotes the fully connected feed-forward layer.
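A hedged sketch of one such cross-attention fusion block, following the formulas above with a single attention head (the head count, dimensions, and class name are assumptions not fixed by the text):

```python
import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Single-head cross attention: query from the target modality, key/value from
    the source modality, followed by residual + LayerNorm and an FFN + LayerNorm."""
    def __init__(self, d=128, ffn_dim=256):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))

    def forward(self, h_src, h_tgt):               # h_src: (batch, T_src, d), h_tgt: (batch, T_tgt, d)
        q = self.w_q(h_tgt)
        k = self.w_k(h_src)
        v = self.w_v(h_src)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        h_prime = attn @ v                          # H': cross-attention output, (batch, T_tgt, d)
        h_mid = self.ln1(h_prime + h_tgt)           # intermediate fused representation
        return self.ln2(h_mid + self.ffn(h_mid))    # fused representation toward the target modality
```

Under this sketch, the text-to-audio direction uses one instance applied as fusion_ta(h_t, h_a) on the encoder outputs Ht and Ha, and the audio-to-text direction uses a second instance applied as fusion_at(h_a, h_t), with separate parameters per direction.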
Symmetrically, taking the audio features as the source modality Ha and the text features as the target modality Ht, the audio features serve as the second original-modality representation. The audio features and text features are fused through the Transformer cross-attention mechanism with the following formulas:
Q=WQ×Ht
K=WK×Ha
V=WV×Ha
H′=softmax(QK^T/√d)V
where WQ, WK, WV ∈ R^(d×d) are linear transformations of the original features and H′ is the cross-attention output. Then, cross-layer connection and normalization are performed through a residual module:
ha→t=LN(H′+Ht)
where ha→t ∈ R^(Tt×d) denotes the second intermediate fused representation obtained by fusing the source modality into the target modality.
Finally, the second intermediate fused representation is further enhanced with a fully connected layer and normalization to obtain:
Ha→t′=LN(ha→t+FFN(ha→t))
where Ha→t′ ∈ R^(Tt×d) denotes the second fused representation obtained by fusing the source modality Ha toward the target modality Ht, and FFN denotes the fully connected feed-forward layer.
The first fused representation and the first original-modality representation are then proportionally fused through a learnable gating mechanism to obtain the first intermediate output:
Ht→a=Ht→a′×Gi+Ha×Gr
where Gi denotes the integration gate, which adjusts the weight of the fused information, and Gr denotes the retention gate, which adjusts the weight of the original information.
Likewise, the second fused representation and the second original-modality representation are proportionally fused through the learnable gating mechanism to obtain the second intermediate output:
Ha→t=Ha→t′×Gi+Ht×Gr
where, as above, Gi denotes the integration gate and Gr denotes the retention gate.
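One plausible reading of the integration gate Gi and retention gate Gr, as learnable sigmoid-bounded scalar weights, is sketched below; the patent does not fix their exact parameterization, so this is an assumption:

```python
import torch
import torch.nn as nn

class LearnableGate(nn.Module):
    """Mixes the fused representation with the original target-modality features
    through an integration gate G_i and a retention gate G_r (scalar gates assumed)."""
    def __init__(self):
        super().__init__()
        self.g_integrate = nn.Parameter(torch.zeros(1))   # weight of the fused information
        self.g_retain = nn.Parameter(torch.zeros(1))      # weight of the original information

    def forward(self, h_fused, h_orig):
        return h_fused * torch.sigmoid(self.g_integrate) + h_orig * torch.sigmoid(self.g_retain)
```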
Finally, the first intermediate output and the second intermediate output are concatenated to obtain the emotion category output, with the following formula:
ŷ=Transformer([Ht→a, Ha→t])
where Ht→a denotes the fused vector representation in the text-to-audio direction, Ha→t denotes the fused vector representation in the audio-to-text direction, [·,·] denotes the concatenation operation, Transformer denotes a Transformer encoder, and ŷ denotes the predicted emotion category.
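A sketch of this final step under stated assumptions (the concatenation axis, pooling, encoder depth, and number of emotion classes are not fixed by the text):

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Concatenates the two gated fusion outputs, encodes them with a Transformer
    encoder, and predicts emotion logits."""
    def __init__(self, d=128, n_heads=4, n_layers=1, n_classes=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, h_t2a, h_a2t):                 # (batch, T_a, d), (batch, T_t, d)
        fused = torch.cat([h_t2a, h_a2t], dim=1)     # [.,.]: concatenate along the time axis (assumed)
        encoded = self.encoder(fused)                # (batch, T_a + T_t, d)
        return self.classifier(encoded.mean(dim=1))  # pooled emotion logits
```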
Finally, the method is applied to the public dataset CMU-MOSEI, using the Adam optimizer and a learning-rate decay technique during training. The experimental results show that the model based on this method achieves the best performance with the smallest parameter count (only 0.432M). The method balances accuracy against parameter count: it maintains considerable accuracy for practical applications while remaining lightweight. The detailed comparison is shown in FIG. 2.
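A minimal stand-in for the training setup mentioned above (Adam plus learning-rate decay); the model and batches here are placeholders, and the learning rate, decay schedule, loss, and epoch count are assumptions:

```python
import torch
import torch.nn as nn

# Stand-in model and random batches, purely to illustrate the optimizer setup; the real
# model assembled from the modules sketched above and a CMU-MOSEI loader would replace them.
model = nn.Sequential(nn.Linear(74, 128), nn.ReLU(), nn.Linear(128, 6))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                   # Adam, as stated in the text
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)   # learning-rate decay (schedule assumed)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    feats, labels = torch.randn(32, 74), torch.randint(0, 6, (32,))
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()   # decay the learning rate once per epoch
```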
In summary, the invention fuses the modality information through the Transformer-based cross-complementation module, while the learnable gating mechanism controls the information flow and stabilizes the training process. In the experiments, models based on this method were compared with baseline models and further ablation studies were performed. The results show the effectiveness of each component, which underpins both the performance and the light weight of the model. The algorithm balances accuracy and parameter count; compared with existing methods, the proposed model achieves the best results with the fewest parameters, and the experimental results show that the method is better suited to real-world applications.
The present invention applies a gating mechanism to the cross-attention module to decide whether to retain the source-modality information or override the target-modality information. In addition, most existing models rely on a large number of learnable parameters and overlook promising application areas, such as human-computer interaction, that require real-time, lightweight models. A lightweight model is therefore necessary to improve the feasibility and practicality of speech emotion recognition applications.
The above is a description of the preferred embodiment of the invention. It should be understood that the invention is not limited to the particular embodiment described above; devices and structures not described in detail should be understood as being implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or modify them into equivalent embodiments, without departing from the technical solution of the invention. Therefore, any simple modification or equivalent change made to the above embodiment according to the technical essence of the present invention still falls within the scope of protection of the technical solution of the present invention.

Claims (6)

1. An emotion recognition method based on multimodal speech information complementation and gating, characterized by comprising the following steps:
S1, extracting audio features and text features from a target video;
S2, performing bidirectional feature fusion on the audio features and the text features;
S3, adjusting the proportion of the fused representations in the result of the bidirectional fusion of S2 through a learnable gating mechanism, and outputting the result;
and S4, concatenating the outputs of the learnable gating mechanism of S3 to finally obtain the emotion category output.
2. The emotion recognition method based on multimodal speech information complementation and gating according to claim 1, wherein S2 comprises:
taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation, and fusing the text features and the audio features through a Transformer cross-attention mechanism to obtain a first fused representation;
and taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation, and fusing the audio features and the text features through a Transformer cross-attention mechanism to obtain a second fused representation.
3. The emotion recognition method based on multimodal speech information complementation and gating according to claim 1, wherein S2 comprises:
taking the text features as the source modality and the audio features as the target modality, with the text features serving as the first original-modality representation;
fusing the text features and the audio features through a Transformer cross-attention mechanism, and performing cross-layer connection and normalization through a residual module to obtain a first intermediate fused representation;
enhancing the first intermediate fused representation through a fully connected layer and normalization to obtain the first fused representation;
taking the audio features as the source modality and the text features as the target modality, with the audio features serving as the second original-modality representation;
fusing the audio features and the text features through a Transformer cross-attention mechanism, and performing cross-layer connection and normalization through a residual module to obtain a second intermediate fused representation;
and enhancing the second intermediate fused representation through a fully connected layer and normalization to obtain the second fused representation.
4. The emotion recognition method based on multimodal speech information complementation and gating according to claim 2 or 3, wherein S3 is:
proportionally fusing, through a learnable gating mechanism, the first fused representation with the first original-modality representation to obtain a first intermediate output, and proportionally fusing the second fused representation with the second original-modality representation to obtain a second intermediate output.
5. The emotion recognition method based on multimodal speech information complementation and gating according to claim 4, wherein S4 is:
concatenating the first intermediate output and the second intermediate output to finally obtain the emotion category output.
6. The emotion recognition method based on multimodal speech information complementation and gating according to claim 1, wherein the method is applied to the public dataset CMU-MOSEI and optimized with an Adam optimizer during training.
CN202210106236.8A 2022-01-28 2022-01-28 Emotion recognition method based on multimodal speech information complementation and gating Pending CN114494969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106236.8A CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210106236.8A CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating

Publications (1)

Publication Number Publication Date
CN114494969A true CN114494969A (en) 2022-05-13

Family

ID=81477008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210106236.8A Pending CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating

Country Status (1)

Country Link
CN (1) CN114494969A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238749A (en) * 2022-08-04 2022-10-25 中国人民解放军军事科学院***工程研究院 Feature fusion modulation identification method based on Transformer
CN115238749B (en) * 2022-08-04 2024-04-23 中国人民解放军军事科学院***工程研究院 Modulation recognition method based on feature fusion of transducer
CN117423168A (en) * 2023-12-19 2024-01-19 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion
CN117423168B (en) * 2023-12-19 2024-04-02 湖南三湘银行股份有限公司 User emotion recognition method and system based on multi-modal feature fusion

Similar Documents

Publication Publication Date Title
Huang et al. Attention assisted discovery of sub-utterance structure in speech emotion recognition.
Gu et al. Speech intention classification with multimodal deep learning
JP2023509031A (en) Translation method, device, device and computer program based on multimodal machine learning
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
Shashidhar et al. Combining audio and visual speech recognition using LSTM and deep convolutional neural network
Seo et al. Wav2kws: Transfer learning from speech representations for keyword spotting
CN115329779B (en) Multi-person dialogue emotion recognition method
CN114494969A (en) Emotion recognition method based on multimodal speech information complementation and gating
Zhang et al. Multi-head attention fusion networks for multi-modal speech emotion recognition
CN111382257A (en) Method and system for generating dialog context
CN112597841B (en) Emotion analysis method based on door mechanism multi-mode fusion
CN110569869A (en) feature level fusion method for multi-modal emotion detection
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
Qu et al. Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
CN114882862A (en) Voice processing method and related equipment
Xu et al. A comprehensive survey of automated audio captioning
CN117892237A (en) Multi-modal dialogue emotion recognition method and system based on hypergraph neural network
CN116955579B (en) Chat reply generation method and device based on keyword knowledge retrieval
Singh et al. A lightweight 2D CNN based approach for speaker-independent emotion recognition from speech with new Indian Emotional Speech Corpora
CN116860943A (en) Multi-round dialogue method and system for dialogue style perception and theme guidance
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116052291A (en) Multi-mode emotion recognition method based on non-aligned sequence
Liu et al. Keyword retrieving in continuous speech using connectionist temporal classification
Shin et al. Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN.
Song et al. Towards realizing sign language to emotional speech conversion by deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination