CN113076847B - Multi-mode emotion recognition method and system - Google Patents

Multi-mode emotion recognition method and system

Info

Publication number
CN113076847B
CN113076847B CN202110333007.5A CN202110333007A CN113076847B CN 113076847 B CN113076847 B CN 113076847B CN 202110333007 A CN202110333007 A CN 202110333007A CN 113076847 B CN113076847 B CN 113076847B
Authority
CN
China
Prior art keywords
emotion
voice
emotional
component
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110333007.5A
Other languages
Chinese (zh)
Other versions
CN113076847A (en)
Inventor
姜晓庆
陈贞翔
杨倩
郑永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Sizheng Information Technology Co ltd
University of Jinan
Original Assignee
Shandong Sizheng Information Technology Co ltd
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Sizheng Information Technology Co ltd, University of Jinan filed Critical Shandong Sizheng Information Technology Co ltd
Priority to CN202110333007.5A priority Critical patent/CN113076847B/en
Publication of CN113076847A publication Critical patent/CN113076847A/en
Application granted granted Critical
Publication of CN113076847B publication Critical patent/CN113076847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • G10L15/05Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The scheme uses a novel and robust endpoint detection algorithm for the voice component of emotion video samples: it exploits the prediction residual conditional entropy parameter generated during sample reconstruction under the compressed sensing theory, calculates the residual conditional entropy difference over the iterations of the Orthogonal Matching Pursuit (OMP) algorithm, completes endpoint detection against an empirical threshold, and performs feature learning of voiced-segment emotional voice on the reconstructed samples. Meanwhile, the facial expression images are screened with the endpoint detection result of the emotional voice, and only the facial expression images that coincide in time with active emotional voice are retained, which enhances the emotion distinguishability of the facial expression data set and reduces redundancy. The emotional voice features and the facial expression features are fused, and an effective multi-modal emotion recognition model is trained, achieving effective multi-modal emotion recognition.

Description

Multi-mode emotion recognition method and system
Technical Field
The disclosure belongs to the technical field of emotion recognition, and particularly relates to a multi-mode emotion recognition method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Emotion recognition is a research hotspot in the field of affective computing. Emotion signals of two modalities, emotional voice and facial expression images, are easy to acquire and carry a large amount of emotion information; they are two important and strongly correlated data sources in emotion recognition research.
The inventors found that the emotional voice data and facial expression images currently used in the field of multi-modal emotion recognition are generally obtained by separately storing the emotional voice component and the image component of emotion video samples. Feature extraction and learning for emotional voice samples are both performed on voiced segments, so endpoint detection is an essential preprocessing step of emotional voice processing. Emotion video is easily disturbed by noise during acquisition, and the noise is especially evident in the emotional voice, where it degrades endpoint detection accuracy. How to improve the noise robustness of emotional voice processing and the accuracy of voice feature extraction in noisy environments is therefore an important problem to be solved.
In addition, for facial expression image acquisition, the existing approach is to store all images in the emotion video samples. Because facial expressions change slowly, this indiscriminate, unscreened acquisition ignores the connection between the expression modes of different modalities and does not consider the emotion distinguishability of the facial expression images. As a result, the acquired expression images have low emotion distinguishability and high redundancy, and the models trained on them in subsequent emotion recognition research perform poorly.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a multi-modal emotion recognition method and system.
According to a first aspect of the embodiments of the present disclosure, there is provided a multi-modal emotion recognition method, including:
extracting emotion voice components and emotion image components in the emotion video, and respectively storing the emotion voice components and the emotion image components;
performing end point detection on the emotional voice component by using an emotional voice residual error conditional entropy difference end point detection method to obtain an end point detection result of each frame of voice;
screening the emotion images in the emotion image components based on the endpoint detection results of the emotion voice components, and eliminating the emotion images of the silence segments in the emotion image components;
respectively extracting the characteristics of the reconstructed emotion voice component and the screened emotion image component;
fusing the characteristics of the emotional voice component and the characteristics of the emotional image component, and training an emotional recognition model through the fused characteristics;
and realizing multi-mode emotion recognition by using the trained emotion recognition model.
Furthermore, the emotion voice residual conditional entropy difference endpoint detection method is realized on the basis of the prediction residual generated during the iterative execution of the orthogonal matching pursuit algorithm.
Further, performing endpoint detection on the emotional voice component by using the emotion voice residual conditional entropy difference endpoint detection method comprises the following steps:
step (1): framing the emotional voice component to obtain a short-time voice frame, and acquiring an observed value of the short-time voice frame;
step (2): calculating the residual error between the last iteration estimation value and the observation value and the correlation between the residual error and the sensing matrix according to the observation value;
step (3): searching the observation matrix for the atom with the maximum correlation, and updating the support set for signal reconstruction;
step (4): approximating the short-time voice frame by using a least square method to obtain an estimated value of the short-time voice frame;
step (5): updating the residual error, calculating the conditional entropy of the residual error, and iteratively executing step (2) to step (5) until the sparsity condition is reached and then stopping iteration;
step (6): calculating a residual conditional entropy difference value of the first iteration and the last iteration, and splicing the short-time reconstructed voice frames to obtain the whole voice data sample;
step (7): judging the reconstructed emotional voice component by using a preset threshold: if the value is higher than the threshold, the frame voice is considered a voiced segment; if the value is lower than the threshold, the frame voice is considered a silent segment; thereby obtaining an endpoint detection result of the frame voice.
Furthermore, the observed value of the short-time voice frame is obtained by completing the sparse transformation of the voice frame with the discrete cosine transform and using a Gaussian random matrix as the observation matrix to obtain the observed value of the voice frame.
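For concreteness, the following Python sketch (an illustration by the editor, not code from the patent; the frame length, the number of observations, and the use of NumPy/SciPy are assumptions) builds the DCT sparsifying basis, the Gaussian random observation matrix, and the resulting sensing matrix, and produces the observed value of one short-time voice frame.

```python
import numpy as np
from scipy.fftpack import idct

def build_matrices(n, m, seed=0):
    """Build the DCT sparsifying basis, the Gaussian random observation matrix,
    and the resulting sensing matrix for voice frames of length n with m observations."""
    # Inverse-DCT basis: column j is the j-th DCT atom in the time domain,
    # so a frame x satisfies x = psi @ s with s (approximately) sparse.
    psi = idct(np.eye(n), norm='ortho', axis=0)
    # Gaussian random observation matrix.
    rng = np.random.default_rng(seed)
    phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    # Sensing matrix used by the OMP reconstruction stage.
    A = phi @ psi
    return psi, phi, A

def observe_frame(frame, phi):
    """Observed value of one short-time voice frame: y = phi @ x."""
    return phi @ frame

# Example: 256-sample frames compressed to 80 observations (sizes are arbitrary).
psi, phi, A = build_matrices(n=256, m=80)
```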
Further, the residual conditional entropy is calculated by the following formula:

σ_e = H(r_t | A_{t-1}·x̂_{t-1}), with r_t = y − A_t·x̂_t,

wherein y is the observed value of the voice frame; r_t is the reconstruction residual of the t-th iteration; A_t is the support set formed by the atoms of the sensing matrix in the t-th iteration of the OMP algorithm, and x̂_t is the estimated value calculated by the least square method in the t-th iteration; A_{t-1} is the support set formed by the atoms of the sensing matrix in the (t-1)-th iteration of the OMP algorithm, and x̂_{t-1} is the estimated value calculated by the least square method in the (t-1)-th iteration.
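A minimal sketch of the endpoint-detection loop of steps (2) to (7) for one frame is given below. It is an illustration only: the patent does not specify how the conditional entropy H(·|·) is estimated numerically, so a simple histogram-based estimator is used as a stand-in, and the empirical threshold comparison is left to the caller. The observation y and sensing matrix A can be produced by the observation sketch above.

```python
import numpy as np

def cond_entropy(r, prev_est, bins=16):
    """Histogram-based estimate of H(r | prev_est) = H(r, prev_est) - H(prev_est).
    (The patent does not specify the estimator; this is only a stand-in.)"""
    joint, _, _ = np.histogram2d(r, prev_est, bins=bins)
    pj = joint / joint.sum()
    pm = pj.sum(axis=0)                       # marginal distribution of prev_est
    h_joint = -np.sum(pj[pj > 0] * np.log2(pj[pj > 0]))
    h_marg = -np.sum(pm[pm > 0] * np.log2(pm[pm > 0]))
    return h_joint - h_marg

def omp_frame_entropy_diff(y, A, K):
    """Run OMP for one frame observation y with sensing matrix A and sparsity K.
    Returns the sparse coefficient estimate and the residual conditional entropy
    difference between the last and the first iteration."""
    m, n = A.shape
    r = y.copy()                              # first iteration: residual = observed value
    support = []
    s_hat = np.zeros(n)
    prev_est = np.zeros(m)                    # A_{t-1} @ x_{t-1}, zero before the first iteration
    entropies = []
    for _ in range(K):                        # sparsity condition: at most K iterations
        corr = np.abs(A.T @ r)                # correlation between residual and sensing matrix
        corr[support] = 0.0
        support.append(int(np.argmax(corr)))  # atom with maximum correlation, update support set
        A_t = A[:, support]
        s_sub, *_ = np.linalg.lstsq(A_t, y, rcond=None)   # least-squares approximation
        est = A_t @ s_sub
        r = y - est                           # update residual
        entropies.append(cond_entropy(r, prev_est))
        prev_est = est
    s_hat[support] = s_sub                    # reconstructed frame is psi @ s_hat
    return s_hat, entropies[-1] - entropies[0]
```

The returned difference is then compared with the empirical threshold to decide whether the frame belongs to a voiced segment or a silent segment.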
Further, screening the emotion images in the emotion image component based on the voice endpoint detection result and eliminating the emotion images of the silent segments in the emotion image component comprises the following steps:
performing image screening according to the voice endpoint detection result: if a frame of voice is a voiced segment, the video images of the corresponding time period are retained and a face detection algorithm is used to obtain the facial expression images; if the frame of voice is a silent segment, the video images of the corresponding time period are discarded; the effective facial expression images are then stored.
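As one possible realization of this screening step (the patent does not name a particular face detection algorithm or frame timing; the OpenCV Haar cascade detector, the 10 ms speech-frame hop, and the timestamp alignment below are assumptions of this sketch), the following Python code keeps only the facial expression images whose timestamps fall inside voiced speech frames.

```python
import cv2  # OpenCV (opencv-python), used here only as one possible face detection backend

def screen_expression_images(video_path, voiced_flags, hop_s=0.010,
                             cascade_path=cv2.data.haarcascades + "haarcascade_frontalface_default.xml"):
    """Keep only facial expression images whose timestamps fall inside voiced speech frames.

    voiced_flags : list of booleans, one per short-time speech frame (endpoint detection result)
    hop_s        : speech frame hop in seconds (10 ms is an assumption, not from the patent)
    """
    detector = cv2.CascadeClassifier(cascade_path)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    kept = []
    idx = 0
    while True:
        ok, image = cap.read()
        if not ok:
            break
        t = idx / fps                              # timestamp of this video image
        speech_frame = int(t / hop_s)              # speech frame covering this timestamp
        if speech_frame < len(voiced_flags) and voiced_flags[speech_frame]:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) > 0:
                x, y, w, h = faces[0]
                kept.append(image[y:y + h, x:x + w])   # effective facial expression image
        idx += 1
    cap.release()
    return kept
```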
Further, the features of the reconstructed emotion voice component are extracted as follows:
according to the voice endpoint detection result, features are extracted from the voice frames of the voiced segments, and the features of the emotional voice component are obtained through time-frequency domain and spectrum analysis.
According to a second aspect of the embodiments of the present disclosure, there is provided a multimodal emotion recognition system, including:
the data acquisition module is used for extracting the emotional voice component and the emotional image component in the emotional video and respectively storing the emotional voice component and the emotional image component;
the endpoint detection module is used for carrying out endpoint detection on the emotional voice component by utilizing an emotional voice residual error conditional entropy difference endpoint detection method to obtain an endpoint detection result of each frame of voice;
the image screening module is used for screening the emotion images in the emotion image components based on the endpoint detection results of the emotion voice components and eliminating the emotion images of the silence segments in the emotion image components;
the feature extraction module is used for respectively extracting features of the reconstructed emotion voice component and the screened emotion image component;
the model training module is used for fusing the characteristics of the emotion voice component and the characteristics of the emotion image component and training the emotion recognition model through the fused characteristics;
and the emotion recognition module is used for realizing multi-mode emotion recognition by utilizing the trained emotion recognition model.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the multi-modal emotion recognition method when executing the program.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal emotion recognition method as described above.
Compared with the prior art, the beneficial effect of this disclosure is:
(1) The disclosed scheme considers the relation between signals of different modalities during emotion expression: facial expression images with better emotion distinguishability are acquired according to whether emotional voice is present, which makes feature learning of the facial expression images more effective and improves the performance of the emotion recognition model;
(2) Effective detection of emotional voice in the disclosed scheme is realized by the residual conditional entropy difference endpoint detection method, and the endpoint detection algorithm is noise-robust;
(3) The voice samples used for feature learning in the disclosed scheme are reconstructed voiced-segment emotional voice samples. The proposed endpoint detection algorithm completes endpoint detection of the emotional voice while reconstructing the samples, which requires little computation and saves computing resources. Meanwhile, the unvoiced part of the reconstructed samples is suppressed during sample reconstruction, so the samples show a more obvious unvoiced/voiced distinction during feature learning, which can improve the accuracy and effectiveness of emotional voice feature learning;
(4) The fusion of effective emotional voice features and facial expression image features in the disclosed scheme enables training of a multi-modal emotion recognition model with better performance.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
Fig. 1 is a block diagram of processing a speech signal by using the compressed sensing (CS) theory according to the first embodiment of the disclosure;
fig. 2(a) is a speech time domain waveform diagram in a process of reconstructing a speech sample by using an OMP algorithm according to a first embodiment of the present disclosure;
fig. 2(b) is a graph of the difference between the conditional entropy of the last iteration and the conditional entropy of the first iteration of the OMP algorithm according to the first embodiment of the present disclosure;
fig. 3 is a block diagram illustrating screening of facial expression images and feature learning based on voice endpoint detection results according to an embodiment of the disclosure;
fig. 4 is an overall flowchart of a multi-modal emotion recognition method according to a first embodiment of the present disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example one:
the embodiment aims to provide a multi-modal emotion recognition method.
A method of multimodal emotion recognition, comprising:
extracting emotion voice components and emotion image components in the emotion video, and respectively storing the emotion voice components and the emotion image components;
performing end point detection on the emotional voice component by using an emotional voice residual error conditional entropy difference end point detection method to obtain an end point detection result of each frame of voice;
screening the emotion images in the emotion image components based on the endpoint detection results of the emotion voice components, and eliminating the emotion images of the silence segments in the emotion image components;
respectively extracting the characteristics of the reconstructed emotion voice component and the screened emotion image component;
fusing the characteristics of the emotional voice component and the characteristics of the emotional image component, and training an emotional recognition model through the fused characteristics; here, the emotion recognition model may adopt an SVM model (when there are fewer training samples) or a deep learning model (when there are more training samples);
and realizing multi-modal emotion recognition by using the trained emotion recognition model.
Further, the features of the reconstructed emotion voice component are extracted as follows:
according to the voice endpoint detection result, features are extracted from the voice frames of the voiced segments, and the features of the emotional voice component are obtained through time-frequency domain and spectrum analysis. The specific features include: prosodic features (e.g., fundamental frequency, short-time energy, and time-related features such as sample duration, voiced-segment duration, and speech rate), psychoacoustic features (e.g., the first, second, and third formants), spectral features (e.g., MFCC parameters), and statistical parameters (maximum, minimum, mean) of the above features.
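As an illustration of this feature extraction (the use of librosa, the default frame settings, and the reduction to maximum/minimum/mean statistics are assumptions of this sketch; formant features would additionally require LPC analysis and are omitted), the following Python code computes fundamental frequency, short-time energy, and MFCC parameters for a reconstructed voiced-segment signal and reduces them to a fixed-length feature vector.

```python
import numpy as np
import librosa

def speech_emotion_features(voiced_signal, sr=16000):
    """Frame-level features of a reconstructed voiced-segment signal,
    reduced to (max, min, mean) statistics per feature dimension."""
    # Prosodic features: fundamental frequency and short-time energy.
    f0 = librosa.yin(voiced_signal, fmin=50, fmax=400, sr=sr)
    energy = librosa.feature.rms(y=voiced_signal)[0]
    # Spectral features: MFCC parameters.
    mfcc = librosa.feature.mfcc(y=voiced_signal, sr=sr, n_mfcc=13)
    feats = [f0[np.newaxis, :], energy[np.newaxis, :], mfcc]
    stats = []
    for f in feats:
        stats.append(np.concatenate([f.max(axis=1), f.min(axis=1), f.mean(axis=1)]))
    return np.concatenate(stats)   # fixed-length feature vector for one sample
```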
Specifically, for ease of understanding, the method of the present disclosure is described in detail below with reference to the accompanying drawings:
The disclosed scheme uses a novel and robust endpoint detection algorithm for the voice component of emotion video samples: it exploits the prediction residual conditional entropy parameter generated during sample reconstruction under the Compressed Sensing (CS) theory, calculates the residual conditional entropy difference over the iterations of the Orthogonal Matching Pursuit (OMP) algorithm, completes endpoint detection against an empirical threshold, and performs feature learning of voiced-segment emotional voice on the reconstructed samples. Meanwhile, the facial expression images are screened with the endpoint detection result of the emotional voice, and only the facial expression images that coincide in time with active emotional voice are retained, which enhances the emotion distinguishability of the facial expression data set and reduces redundancy. The emotional voice features and the facial expression features are fused, and an effective multi-modal emotion recognition model is trained, achieving effective multi-modal emotion recognition.
As shown in fig. 4, an overall process of the multi-modal emotion recognition method of the present disclosure is shown, including:
Step 1: storing the voice and the images in the emotion video sample separately;
Step 2: framing the emotional voice to obtain short-time voice frames, and obtaining the observed value of each short-time voice frame through the sparse transformation and observation process of FIG. 1; specifically, the sparse transformation of the voice frame is completed with the discrete cosine transform (DCT), and a Gaussian random matrix is used as the observation matrix to obtain the observed value of the voice frame;
Step 3: calculating, from the observed value, the residual between the estimate of the previous iteration and the observed value, and the correlation between the residual and the sensing matrix; specifically, in the first iteration, the residual is set to the observed value of the voice frame, the correlation coefficient between the residual and the sensing matrix is calculated, and this coefficient represents the correlation between the residual and the sensing matrix; in subsequent iterations, the residual between the estimate of the previous iteration and the voice observed value and the correlation coefficient between the residual and the sensing matrix are calculated, and this coefficient represents the correlation between the residual and the sensing matrix;
Step 4: searching the observation matrix for the atom with the maximum correlation, and updating the support set used for signal reconstruction;
Step 5: approximating the signal by the least square method to obtain an estimate of the signal;
Step 6: updating the residual and calculating the conditional entropy of the residual; repeating step 3 to step 5 until the sparsity condition is reached, then stopping the iteration; after sparse transformation, the short-time voice frame yields a K-sparse signal, meaning that only K values in the signal are nonzero and the rest are 0 or close to 0; K is called the sparsity of the signal. During reconstruction, if the number of iterations is less than K, the iteration continues, otherwise it stops; this judgment condition is the sparsity condition;
Step 7: calculating the residual conditional entropy difference between the first iteration and the last iteration, and splicing the short-time reconstructed voice frames to reconstruct the whole voice sample;
Step 8: judging against a preset threshold, the threshold being set from empirical values: above the threshold, the frame is considered a voiced segment; below the threshold, the frame is considered a silent segment; thereby obtaining the endpoint detection result of each frame of voice;
Step 9: performing feature learning on the voiced segments according to the voice endpoint detection result; meanwhile, performing image screening according to the voice endpoint detection result: if a frame of voice is a voiced segment, the video images of the corresponding time period are retained and a face detection algorithm is applied to obtain the facial expression images; if the frame is a silent segment, the video images of the corresponding time period are discarded; the effective facial expression images are then stored and used for feature learning;
Step 10: completing the feature fusion of the emotional voice and the facial expression images; the feature fusion here is feature-layer fusion, that is, the extracted voice features and image features are spliced into one feature vector (multiple samples yield a high-dimensional feature set), as illustrated in the sketch following step 12;
Step 11: training the multi-modal emotion recognition model;
Step 12: inputting the features of a test sample to complete multi-modal emotion recognition.
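The feature-layer fusion of step 10 and the model training of step 11 can be sketched as follows (an illustration only: the RBF-kernel SVM and its hyperparameters are assumptions, chosen because the description mentions an SVM model for smaller training sets; a deep learning model could be substituted for larger data sets).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(speech_feats, image_feats):
    """Feature-layer fusion: splice the speech feature vector and the
    facial-expression feature vector of one sample into a single vector."""
    return np.concatenate([speech_feats, image_feats])

def train_emotion_model(fused_matrix, labels):
    """Train a multi-modal emotion recognition model on the fused feature set.
    An RBF-kernel SVM is used here; the kernel and hyperparameters are
    assumptions of this sketch, not values given in the patent."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, probability=True))
    model.fit(fused_matrix, labels)
    return model

# Usage: X = np.vstack([fuse_features(s, i) for s, i in zip(speech_list, image_list)])
#        clf = train_emotion_model(X, emotion_labels)
#        prediction = clf.predict(fuse_features(test_speech, test_image).reshape(1, -1))
```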
Furthermore, the emotion voice residual conditional entropy difference endpoint detection method adopted by the disclosure is based on the prediction residual generated during the iterative execution of the Orthogonal Matching Pursuit (OMP) algorithm. The OMP algorithm is a common algorithm in voice signal reconstruction, and residual calculation is a key part of it; from the information theory perspective, acquiring voice information during the iterations means a reduction of the residual entropy. The present disclosure introduces the conditional entropy σ_e between the residual of the t-th iteration and the signal estimate of the previous iteration to determine the degree to which the voice component has been extracted from the reconstruction residual. The process of processing the voice signal with the compressed sensing (CS) theory is shown in FIG. 1.
in the OMP algorithm, the reconstructed residual r obtained in the t-th iterationtThe calculation formula of (2) is as follows:
Figure BDA0002997005420000081
wherein A istIs the t-th iteration process of the OMP algorithmA sensing matrix of a supporting set of atoms of the sensing matrix,
Figure BDA0002997005420000082
is the estimated value calculated by the least square method in the t-th iteration process.
σ_e is calculated as:

σ_e = H(r_t | A_{t-1}·x̂_{t-1})

wherein A_{t-1} is the support set formed by the atoms of the sensing matrix selected in the (t-1)-th iteration of the OMP algorithm, and x̂_{t-1} is the estimated value calculated by the least square method in the (t-1)-th iteration.
When the iteration is completed, the difference between the residual conditional entropy of the last iteration and that of the first iteration is calculated, and the endpoint detection result is obtained by comparison with an empirical threshold.
FIG. 2(a) and FIG. 2(b) show, for one voice sample reconstructed with the OMP algorithm, the time-domain waveform and the difference between the residual conditional entropy of the last iteration and that of the first iteration. As can be seen from FIG. 2(a) and FIG. 2(b), the residual conditional entropy difference during the iterations corresponds well to the significant components of the voice sample: the trend of the σ_e difference over time matches the positions of the voiced segments (including unvoiced and voiced sounds) in the original waveform, so the start and end points of the reconstructed voice sample can be determined with an empirical threshold condition.
FIG. 3 shows the block diagram of image screening and feature learning based on the endpoint detection result of the emotion voice residual conditional entropy difference. Specifically, facial expression image screening and feature learning based on the voice endpoint detection result include the following steps:
Step 1: screening the facial expression images collected from the emotion video with the endpoint detection result of the sample reconstruction residual conditional entropy difference, removing the expression images of silent segments and preserving only the facial expression images of voiced segments;
Step 2: processing the screened expression images with a face detection algorithm to obtain the facial expression image data set;
Step 3: completing feature learning on the facial expression image data set, for example with an existing pre-trained multi-layer convolutional neural network model combined with transfer learning.
Further, the scheme of the present disclosure mainly solves the following problems:
(1) The emotional voice component in the emotion video is processed with the compressed sensing theory: the sparse transformation of the emotional voice is completed with the discrete cosine transform, a Gaussian random matrix is used as the observation matrix, the Orthogonal Matching Pursuit (OMP) algorithm is used as the reconstruction algorithm, and the prediction residual conditional entropy parameter of the compressed-sensing reconstruction of the emotional voice is proposed;
(2) An effective and robust emotional voice endpoint detection method based on the residual conditional entropy difference is realized. During the reconstruction of the compressed-sensing-processed voice sample, the method calculates, in each iteration of the Orthogonal Matching Pursuit (OMP) algorithm, the conditional entropy between the prediction residual and the signal estimate of the previous iteration, and completes endpoint detection of the emotional voice according to the residual conditional entropy difference before and after the iterations. Because the endpoint detection method is built on a compressed sensing reconstruction algorithm, and noise, which is not sparse under any condition, cannot be reconstructed from the observed value, the algorithm is robust to noise. During reconstruction, unvoiced sounds have noise-like characteristics and are suppressed by the OMP reconstruction, which is favorable for the unvoiced/voiced distinction of the reconstructed samples and for improving feature learning accuracy.
(3) The endpoint detection result of the emotional voice is applied to the acquisition of facial images from the emotion video samples: the facial expression images of silent segments are discarded so that the acquired facial expression images have, as far as possible, better emotion distinguishability, which makes feature learning of the facial expression images more effective.
(4) Completing feature fusion of emotional voice and facial expressions, training an effective emotion recognition model and completing multi-modal emotion recognition.
Example two:
the embodiment aims at providing a multi-modal emotion recognition system.
A multi-modal emotion recognition system, comprising:
the data acquisition module is used for extracting the emotional voice component and the emotional image component in the emotional video and respectively storing the emotional voice component and the emotional image component;
the endpoint detection module is used for carrying out endpoint detection on the emotional voice component by utilizing an emotional voice residual error conditional entropy difference endpoint detection method to obtain an endpoint detection result of each frame of voice;
the image screening module is used for screening the emotion images in the emotion image components based on the endpoint detection results of the emotion voice components and eliminating the emotion images of the silence segments in the emotion image components;
the feature extraction module is used for respectively extracting features of the reconstructed emotion voice component and the screened emotion image component;
the model training module is used for fusing the characteristics of the emotion voice component and the characteristics of the emotion image component and training the emotion recognition model through the fused characteristics;
and the emotion recognition module is used for realizing multi-mode emotion recognition by utilizing the trained emotion recognition model.
Example three:
the embodiment provides a method for detecting the working state of a call center customer service worker, and the detection method utilizes the multi-mode emotion recognition method.
The customer service staff need communicate with the customer when dealing with the questions and continuously answer various questions of the customer, the work has the characteristics of complicated content and high pressure, meanwhile, the attitude of the customer is not friendly under certain conditions, the customer service staff can generate certain negative emotion under the working environment, and the service quality can be seriously influenced if the customer service staff has certain negative emotion such as disgust or anger and the like, and the psychological health of the customer service staff is very unfavorable. The multi-mode emotion recognition method can be effectively applied to detection of the working state of the customer service staff, and when the emotion state of the customer service staff is abnormal, a prompt is given, so that the method is more beneficial to the customer service staff to adjust the emotion of the customer service staff and improve the service quality. The original recording system of the call center can be combined with a camera to realize the collection of the multi-mode emotion signals.
Based on this, the embodiment provides a method for detecting the working state of a customer service worker in a call center, which includes:
acquiring a face video image of a customer service worker during working in real time by using image acquisition equipment;
processing the face video image by using the multi-modal emotion recognition method to realize multi-modal emotion recognition;
judging the emotional state of the customer service staff based on the emotional recognition result, quantizing the working state score of the customer service staff according to the emotional state, judging that the working state of the customer service staff is abnormal when the working state score is lower than a set threshold value, and sending a corresponding alarm.
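The scoring and alarm step can be sketched as follows (an illustration only: the emotion-to-score weights, the sliding-window length, and the threshold value are assumptions of this sketch; the patent only states that the working state score is quantized from the emotional state and compared with a set threshold).

```python
from collections import deque

# Illustrative weights: negative emotions lower the score (values are assumptions).
EMOTION_WEIGHTS = {"happy": 1.0, "neutral": 0.8, "sad": 0.4, "disgust": 0.2, "angry": 0.0}

class WorkStateMonitor:
    """Quantize a working-state score from recent emotion recognition results
    and raise an alarm when the score falls below a set threshold."""

    def __init__(self, threshold=0.5, window=20):
        self.threshold = threshold
        self.recent = deque(maxlen=window)   # sliding window of recognized emotions

    def update(self, recognized_emotion):
        self.recent.append(EMOTION_WEIGHTS.get(recognized_emotion, 0.5))
        score = sum(self.recent) / len(self.recent)
        if score < self.threshold:
            print(f"ALERT: abnormal working state, score={score:.2f}")  # send the prompt/alarm
        return score

# Usage: monitor = WorkStateMonitor(); monitor.update("angry")
```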
Through this scheme, the problem of detecting the working state of customer service staff can be effectively solved: when the emotional state of a staff member is abnormal, a prompt is given, so that the staff member can adjust his or her emotions in time and improve service quality.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of embodiment one. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in the first embodiment may be directly implemented by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The multi-mode emotion recognition method and the multi-mode emotion recognition system can be realized and have wide application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (9)

1. A multi-modal emotion recognition method, comprising:
extracting emotion voice components and emotion image components in the emotion video, and respectively storing the emotion voice components and the emotion image components;
performing end point detection on the emotional voice component by using an emotional voice residual error conditional entropy difference end point detection method to obtain an end point detection result of each frame of voice; the method comprises the following steps:
step (1): framing the emotional voice component to obtain a short-time voice frame, and acquiring an observed value of the short-time voice frame;
step (2): calculating the residual error between the last iteration estimation value and the observation value and the correlation between the residual error and the sensing matrix according to the observation value;
step (3): searching the observation matrix for the atom with the maximum correlation, and updating the support set for signal reconstruction;
step (4): approximating the short-time speech frame by using a least square method to obtain an estimated value of the short-time speech frame;
step (5): updating the residual error, calculating the conditional entropy of the residual error, and iteratively executing steps (2) to (5) until the sparsity condition is reached and then stopping iteration;
step (6): calculating a residual conditional entropy difference value of the first iteration and the last iteration, and splicing the short-time reconstructed voice frames to reconstruct a whole voice data sample;
step (7): judging the reconstructed emotion voice component by using a preset threshold: if the value is higher than the threshold, the frame voice is considered a voiced segment, and if it is lower than the threshold, the frame voice is considered a silent segment, thereby obtaining an end point detection result of the frame voice;
screening the emotion images in the emotion image components based on the endpoint detection results of the emotion voice components, and eliminating the emotion images of the silence segments in the emotion image components;
respectively extracting the characteristics of the reconstructed emotion voice component and the screened emotion image component;
fusing the characteristics of the emotional voice component and the characteristics of the emotional image component, and training an emotional recognition model through the fused characteristics;
and realizing multi-mode emotion recognition by using the trained emotion recognition model.
2. The method of claim 1, wherein the emotion speech residual conditional entropy difference end point detection method is implemented on the basis of prediction residuals generated in the iterative execution process of the orthogonal matching pursuit algorithm.
3. The method as claimed in claim 1, wherein the observed value of the short-time speech frame is obtained by performing the sparse transformation of the speech frame using the discrete cosine transform and obtaining the observed value of the speech frame using a Gaussian random matrix as the observation matrix.
4. The method of claim 1, wherein the residual conditional entropy is calculated by the formula:

σ_e = H(r_t | A_{t-1}·x̂_{t-1}), with r_t = y − A_t·x̂_t,

wherein r_t is the reconstructed residual obtained in the t-th iteration, y is the observed value of the speech frame, A_t is a support set formed by atoms of the sensing matrix in the t-th iteration process of the OMP algorithm, and x̂_t is an estimated value calculated by a least square method in the t-th iteration process; A_{t-1} is a support set formed by atoms of the sensing matrix in the (t-1)-th iteration process of the OMP algorithm, and x̂_{t-1} is an estimated value calculated by a least square method in the (t-1)-th iteration process.
5. The method of claim 1, wherein screening the emotion images in the emotion image component based on the speech endpoint detection result and removing the emotion images of the silent segments in the emotion image component comprises the following steps:
screening images according to the voice endpoint detection result, if the frame of voice is a vocal section, reserving the video image of the corresponding time section and acquiring a facial expression image by using a facial detection algorithm; if the frame voice is a silent section, discarding the video image of the corresponding time section; and stores the effective facial expression image.
6. The method of claim 1, wherein the step of extracting the features of the reconstructed emotion speech component comprises: and according to the voice endpoint detection result, extracting features based on the voice frame of the voiced segment, and obtaining the features of the emotional voice component through time-frequency domain and spectrum analysis.
7. A multi-modal emotion recognition system, comprising:
the data acquisition module is used for extracting the emotional voice component and the emotional image component in the emotional video and respectively storing the emotional voice component and the emotional image component;
the endpoint detection module is used for carrying out endpoint detection on the emotional voice component by utilizing an emotional voice residual error conditional entropy difference endpoint detection method to obtain an endpoint detection result of each frame of voice;
the image screening module is used for screening the emotion images in the emotion image components based on the endpoint detection results of the emotion voice components and eliminating the emotion images of the silence segments in the emotion image components;
the feature extraction module is used for respectively extracting features of the reconstructed emotion voice component and the screened emotion image component;
the model training module is used for fusing the characteristics of the emotion voice component and the characteristics of the emotion image component and training the emotion recognition model through the fused characteristics;
and the emotion recognition module is used for realizing multi-mode emotion recognition by utilizing the trained emotion recognition model.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-modal emotion recognition method as recited in any of claims 1-6.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements a method of multimodal emotion recognition as recited in any of claims 1-6.
CN202110333007.5A 2021-03-29 2021-03-29 Multi-mode emotion recognition method and system Active CN113076847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110333007.5A CN113076847B (en) 2021-03-29 2021-03-29 Multi-mode emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110333007.5A CN113076847B (en) 2021-03-29 2021-03-29 Multi-mode emotion recognition method and system

Publications (2)

Publication Number Publication Date
CN113076847A CN113076847A (en) 2021-07-06
CN113076847B true CN113076847B (en) 2022-06-17

Family

ID=76610982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110333007.5A Active CN113076847B (en) 2021-03-29 2021-03-29 Multi-mode emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN113076847B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion
CN114359660B (en) * 2021-12-20 2022-08-26 合肥工业大学 Multi-modal target detection method and system suitable for modal intensity change
CN115331658B (en) * 2022-10-13 2023-01-24 山东商业职业技术学院 Voice recognition method
CN116779095A (en) * 2023-04-27 2023-09-19 西南交通大学 Knowledge data dual-drive reasonable medication prediction method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506752A (en) * 2015-01-06 2015-04-08 河海大学常州校区 Similar image compression method based on residual compression sensing
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression
CN110556129A (en) * 2019-09-09 2019-12-10 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409296B (en) * 2018-10-30 2020-12-01 河北工业大学 Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN110717410A (en) * 2019-09-23 2020-01-21 湖南检信智能科技有限公司 Voice emotion and facial expression bimodal recognition system
CN110826466B (en) * 2019-10-31 2023-10-03 陕西励爱互联网科技有限公司 Emotion recognition method, device and storage medium based on LSTM audio-video fusion
CN111128242B (en) * 2020-01-02 2023-01-24 渤海大学 Multi-mode emotion information fusion and identification method based on double-depth network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506752A (en) * 2015-01-06 2015-04-08 河海大学常州校区 Similar image compression method based on residual compression sensing
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression
CN110556129A (en) * 2019-09-09 2019-12-10 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method

Also Published As

Publication number Publication date
CN113076847A (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN113076847B (en) Multi-mode emotion recognition method and system
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
US9536547B2 (en) Speaker change detection device and speaker change detection method
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
KR101988222B1 (en) Apparatus and method for large vocabulary continuous speech recognition
US6985858B2 (en) Method and apparatus for removing noise from feature vectors
Chapaneri Spoken digits recognition using weighted MFCC and improved features for dynamic time warping
JPWO2009078093A1 (en) Non-speech segment detection method and non-speech segment detection apparatus
KR101892733B1 (en) Voice recognition apparatus based on cepstrum feature vector and method thereof
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN108682432B (en) Speech emotion recognition device
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Ananthi et al. SVM and HMM modeling techniques for speech recognition using LPCC and MFCC features
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
Subhashree et al. Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling
CN110648655B (en) Voice recognition method, device, system and storage medium
JP2009003008A (en) Noise-suppressing device, speech recognition device, noise-suppressing method and program
KR101236539B1 (en) Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization
Perez et al. Aphasic speech recognition using a mixture of speech intelligibility experts
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
KR100897555B1 (en) Apparatus and method of extracting speech feature vectors and speech recognition system and method employing the same
Poorjam et al. A parametric approach for classification of distortions in pathological voices
KR100969138B1 (en) Method For Estimating Noise Mask Using Hidden Markov Model And Apparatus For Performing The Same
Wiśniewski et al. Automatic detection of prolonged fricative phonemes with the hidden Markov models approach
Jayasimha et al. Personalizing speech start point and end point detection in asr systems from speaker embeddings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant