CN110516696B - Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression - Google Patents

Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression

Info

Publication number
CN110516696B
CN110516696B
Authority
CN
China
Prior art keywords
emotion
voice
expression
data
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910632006.3A
Other languages
Chinese (zh)
Other versions
CN110516696A (en)
Inventor
肖婧
黄永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910632006.3A priority Critical patent/CN110516696B/en
Publication of CN110516696A publication Critical patent/CN110516696A/en
Application granted granted Critical
Publication of CN110516696B publication Critical patent/CN110516696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a self-adaptive weight bimodal fusion emotion recognition method based on voice and facial expression, which comprises the following steps: acquiring emotion voice and facial expression data, matching the emotion data to emotion categories, and selecting a training sample set and a test sample set; extracting voice emotion features from the voice data and dynamic expression features from the expression data; based on the voice emotion features and the expression features respectively, learning with a deep learning method based on a semi-supervised autoencoder, and obtaining classification results and per-class output probabilities from a softmax classifier; and finally, fusing the two single-modality emotion recognition results at the decision layer and obtaining the final emotion recognition result with a self-adaptive weighting method. Because the emotion features of different modalities differ in how well they characterize an individual's emotion, the invention adopts a self-adaptive weight fusion method and therefore achieves higher accuracy and objectivity.

Description

Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression
Technical Field
The invention relates to the field of emotion recognition in affective computing, and in particular to a self-adaptive weight bimodal fusion emotion recognition method based on voice and facial expression.
Background
In recent years, with the development of artificial intelligence and robotics, conventional human-computer interaction can no longer meet user needs; novel human-computer interaction requires emotional communication, so emotion recognition has become key to the development of human-computer interaction technology and an academic research hotspot. Emotion recognition is a multidisciplinary research topic: by enabling computers to understand and recognize human emotion, and thereby predict and understand human behavioral tendencies and psychological states, efficient and harmonious human-computer emotional interaction can be realized.
Human emotion is expressed in many ways, such as speech, facial expression, gesture, and text, from which effective information can be extracted to analyze emotion correctly. Facial expression and voice are the most salient and most easily analyzed cues and have therefore been widely studied and applied. The psychologist Mehrabian proposed the formula: emotional expression = 7% words + 38% vocal tone + 55% facial expression; that is, a person's speech and facial expression together cover 93% of emotional information and form the core of human communication. In the process of emotional expression, facial deformation conveys inner emotion effectively and intuitively and is one of the most important sources of feature information for emotion recognition, while voice features can also express rich emotion.
In recent years, the development of the Internet and the proliferation of social media have greatly enriched the ways people communicate, for example through video and audio, making multimodal emotion recognition possible. Conventional single-modality recognition suffers from the problem that a single emotion feature may not represent the emotional state well; for example, a person expressing sadness may show little change in facial expression, yet the sad, dejected emotion can still be distinguished from the low, slow voice. Multimodal recognition makes the information of different modalities complementary, provides more emotional information for emotion recognition, and improves recognition accuracy. However, while single-modality emotion recognition research is relatively mature, multimodal emotion recognition methods still need development and refinement. Multimodal emotion recognition therefore has very important practical significance, and bimodal emotion recognition based on the most dominant cues, facial expression and voice, has important research significance and practical value. Conventional weighting methods ignore individual variability, so an adaptive weighting method is needed for weight assignment.
Disclosure of Invention
The invention aims to provide a self-adaptive weight bimodal fusion emotion recognition method based on voice and facial expression, realizing complementarity between modal information and self-adaptive weight assignment that accounts for individual differences.
For this purpose, the invention adopts the following technical scheme:
an identification method based on self-adaptive weight bimodal fusion of voice and facial expression is characterized by comprising the following steps:
S1, acquiring emotion voice and facial expression data, matching the emotion data to emotion categories, and selecting a training sample set and a test sample set;
S2, extracting voice emotion features from the voice data and dynamic expression features from the expression data: first automatically extracting the expression peak frame, obtaining the dynamic image sequence from expression onset to the expression peak, and normalizing the variable-length image sequence into a fixed-length image sequence as the dynamic expression features;
S3, based on the voice emotion features and the expression features respectively, learning with a deep learning method based on a semi-supervised autoencoder, and obtaining classification results and per-class output probabilities from a softmax classifier;
and S4, fusing the two single-mode emotion recognition results in a decision layer, and obtaining a final emotion recognition result by adopting a self-adaptive weight distribution method.
Further, the specific steps of the step S2 are as follows:
S2A.1: for the voice emotion data, dividing the obtained voice sample into multiple frames by framing, and windowing the framed speech segments to obtain the voice emotion signal;
S2A.2: for the voice emotion signal obtained in S2A.1, extracting frame-level low-level features such as fundamental frequency F0, short-time energy, frequency perturbation (jitter), amplitude perturbation (shimmer), harmonic-to-noise ratio, and Mel-frequency cepstral coefficients;
S2A.3: aggregating the frame-level low-level features over the utterance level formed by multiple frames, applying statistical functionals such as maximum, minimum, mean, and standard deviation to the low-level features to obtain the voice emotion features (a minimal code sketch follows);
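By way of illustration, a minimal NumPy sketch of steps S2A.1–S2A.3 is given below. The 25 ms / 10 ms framing at an assumed 16 kHz sampling rate, the Hamming window, and the reduced descriptor set (short-time energy and zero-crossing rate only) are illustrative assumptions; the full descriptor and functional set described above is not reproduced here.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D speech signal into overlapping frames and apply a Hamming
    window (S2A.1). frame_len=400 / hop=160 correspond to 25 ms / 10 ms frames
    at an assumed 16 kHz sampling rate."""
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def low_level_descriptors(frames):
    """Frame-level LLD contours (S2A.2). Only short-time energy and
    zero-crossing rate are shown; F0, jitter, shimmer, HNR and MFCCs would be
    computed per frame in the same way."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)          # shape (n_frames, n_lld)

def statistical_functionals(lld):
    """Aggregate the LLD contours over the whole utterance (S2A.3)."""
    return np.concatenate([lld.max(0), lld.min(0), lld.mean(0), lld.std(0)])

# usage: x is a mono speech waveform as a NumPy array
# feats = statistical_functionals(low_level_descriptors(frame_and_window(x)))
```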
S2B.1: for the facial expression data, first applying a coordinate transformation to the obtained three-dimensional coordinates of the facial expression feature points: taking the nose tip as the center point, a rotation matrix is obtained using the SVD principle, and the points are rotated by multiplying with this matrix so as to eliminate the influence of head pose changes (see the alignment sketch below).
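The exact SVD-based procedure is not spelled out above; the sketch below assumes the standard Kabsch/Procrustes alignment, mapping the nose-tip-centered landmarks of each frame onto a reference (e.g. neutral) frame. The landmark index of the nose tip and the choice of reference frame are assumptions.

```python
import numpy as np

def align_landmarks(frame_pts, ref_pts, nose_idx=0):
    """Remove head-pose rotation from one frame of 3-D facial landmarks (S2B.1).
    frame_pts, ref_pts: (N, 3) arrays; nose_idx marks the nose-tip landmark
    (index 0 is an assumption). Returns the rotated, nose-centered landmarks."""
    p = frame_pts - frame_pts[nose_idx]          # center on the nose tip
    q = ref_pts - ref_pts[nose_idx]
    u, _, vt = np.linalg.svd(p.T @ q)            # SVD of the cross-covariance (Kabsch)
    d = np.sign(np.linalg.det(vt.T @ u.T))       # guard against an improper reflection
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T      # rotation that maps p onto q
    return p @ r.T
```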
S2B.2, extracting peak expression frames by using a slow feature analysis method, wherein the specific steps are as follows:
1) Treating each dynamic image sequence sample as a time-varying input signal x(t) = [x_1(t), x_2(t), …, x_I(t)]^T;
2) Normalizing x(t) so that it has zero mean and unit variance;
3) Applying a nonlinear expansion to the input signal, converting the problem into a linear SFA problem;
4) Whitening the expanded data;
5) Solving with the linear SFA method (a numerical sketch follows).
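A compact NumPy sketch of steps 1)–5) follows, assuming a quadratic nonlinear expansion and the standard linear SFA solution (whiten the expanded signal, then take the directions in which its temporal derivative varies least). How the slow-feature output is used to pick the peak frame is not detailed above, so the final selection rule is only indicated as a comment.

```python
import numpy as np

def quadratic_expansion(x):
    """Nonlinear (quadratic) expansion of a (T, I) signal: original components
    plus all monomials x_i * x_j with i <= j (step 3)."""
    T, I = x.shape
    cross = [x[:, i] * x[:, j] for i in range(I) for j in range(i, I)]
    return np.column_stack([x] + cross)

def slow_feature_analysis(x, n_slow=1):
    """Linear SFA (steps 2, 4, 5): normalize, whiten, then find the directions
    in which the temporal derivative varies least."""
    x = (x - x.mean(0)) / (x.std(0) + 1e-12)                 # zero mean, unit variance
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    keep = evals > 1e-10
    whiten = evecs[:, keep] / np.sqrt(evals[keep])           # whitening matrix
    z = x @ whiten
    dz = np.diff(z, axis=0)                                  # temporal derivative
    _, d_evecs = np.linalg.eigh(np.cov(dz, rowvar=False))
    w = whiten @ d_evecs[:, :n_slow]                         # slowest directions first
    return x @ w                                             # slow-feature outputs (T, n_slow)

# usage on one expression sequence x_seq of shape (T, I):
# slow = slow_feature_analysis(quadratic_expansion(x_seq), n_slow=1)
# the frame where the slow-feature value is extremal is then taken as the
# expression peak (an assumption about the selection rule)
```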
S2B.3: after the dynamic expression sequence from the expression onset frame to the expression peak frame is obtained, normalizing the variable-length dynamic features to a fixed length by linear interpolation.
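A minimal sketch of the fixed-length normalization in S2B.3, assuming the dynamic expression sequence is stored as a (T, D) array of per-frame features and that per-dimension linear interpolation over a normalized time axis is intended; the target length of 16 frames is an arbitrary illustrative choice.

```python
import numpy as np

def resample_sequence(seq, target_len=16):
    """Linearly interpolate a variable-length sequence (T, D) to a fixed
    length (target_len, D)."""
    seq = np.asarray(seq, dtype=float)
    t_src = np.linspace(0.0, 1.0, len(seq))
    t_dst = np.linspace(0.0, 1.0, target_len)
    return np.column_stack([np.interp(t_dst, t_src, seq[:, d])
                            for d in range(seq.shape[1])])
```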
Further, the specific steps of the step S3 are as follows:
S3.1, for the data of one modality, inputting both unlabeled and labeled training samples; the encoding and decoding of the autoencoder produce the reconstructed data, and the softmax classifier produces the class output;
S3.2, calculating the unsupervised representation-learning reconstruction error E_r and the supervised classification error E_c;
S3.3, constructing the optimization objective, taking both the reconstruction error and the classification error into account, where α balances the two terms:
E(θ) = αE_r + (1-α)E_c
and S3.4, updating parameters by a gradient descent method until the objective function converges.
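A hedged PyTorch sketch of the semi-supervised autoencoder objective in S3.1–S3.4 is given below. The single hidden layer, its size, the use of mean-squared reconstruction error for E_r and cross-entropy for E_c, and the SGD optimizer are illustrative assumptions, not the exact architecture of the invention.

```python
import torch
import torch.nn as nn

class SemiSupervisedAE(nn.Module):
    """Shared encoder feeding both a decoder (reconstruction) and a softmax head."""
    def __init__(self, in_dim, hid_dim=256, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.decoder = nn.Linear(hid_dim, in_dim)
        self.classifier = nn.Linear(hid_dim, n_classes)   # softmax is applied inside the loss

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), self.classifier(h)

def semi_supervised_loss(model, x_unlab, x_lab, y_lab, alpha=0.5):
    """E(theta) = alpha * E_r + (1 - alpha) * E_c  (S3.3)."""
    x_all = torch.cat([x_unlab, x_lab], dim=0)
    recon, _ = model(x_all)
    e_r = nn.functional.mse_loss(recon, x_all)            # reconstruction error, all samples
    _, logits = model(x_lab)
    e_c = nn.functional.cross_entropy(logits, y_lab)      # classification error, labeled samples only
    return alpha * e_r + (1 - alpha) * e_c

# S3.4: update parameters by gradient descent until convergence, e.g.
# opt = torch.optim.SGD(model.parameters(), lr=1e-3)
# loss = semi_supervised_loss(model, xu, xl, yl); opt.zero_grad(); loss.backward(); opt.step()
```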
Further, the specific steps of the step S4 are as follows:
S4.1, acquiring the per-class output probabilities of the two modalities for a test sample from the softmax classifier, and calculating a variable δ_k; δ_k can be used to measure how well the modality characterizes the emotion, and self-adaptive weight assignment is realized according to the value of δ_k for each sample, where J is the number of classes in the system, P = {p_j | j = 1, …, J} is the vector formed by the sample output probabilities, p_j is the probability of each class output by the softmax classifier, and d denotes the Euclidean distance between two vectors.
S4.2, mapping δ_k into [0, 1] according to the following formula and using the result as the weight, where a and b are self-selected parameters determined by the specific situation:
u_k = 1 - 1/[1 + exp(-a(δ_k - b))];
S4.3, obtaining the fused output probability vector P_final = {p_final_j | j = 1, …, J} according to the following formula; the class with the highest probability is the recognized class, where p_j,k is the probability of class j output by single-modality emotion recognition with modality k, and there are K modalities in total.
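The explicit formulas for δ_k and for the fused probabilities are not reproduced in the text above, so the sketch below fills them with plausible stand-ins: δ_k is taken as the Euclidean distance d between the softmax output vector P of modality k and the one-hot vector of its top class, and the fusion is a weight-normalized sum of the per-modality class probabilities. Both choices are assumptions consistent with, but not necessarily identical to, the description.

```python
import numpy as np

def modality_weight(p, a=10.0, b=0.5):
    """Adaptive weight for one modality (S4.1-S4.2). delta_k is assumed to be
    the Euclidean distance d between the softmax output P and the one-hot
    vector of its top class (smaller delta = sharper, more reliable output).
    a and b are the user-selected mapping parameters."""
    p = np.asarray(p, dtype=float)
    one_hot = np.eye(len(p))[np.argmax(p)]
    delta = np.linalg.norm(p - one_hot)                  # d(P, one-hot of top class)
    return 1.0 - 1.0 / (1.0 + np.exp(-a * (delta - b)))  # u_k = 1 - 1/[1+exp(-a(delta-b))]

def fuse(p_speech, p_face, a=10.0, b=0.5):
    """Decision-level fusion (S4.3), assuming a weight-normalized sum of the
    per-modality class probabilities; the recognized class is the arg-max."""
    u = np.array([modality_weight(p_speech, a, b), modality_weight(p_face, a, b)])
    probs = np.vstack([p_speech, p_face])                # (K=2, J)
    p_final = (u[:, None] * probs).sum(0) / (u.sum() + 1e-12)
    return p_final, int(np.argmax(p_final))

# usage with softmax outputs over J=4 classes (neutral, happy, sad, angry):
# p_final, label = fuse([0.1, 0.7, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2])
```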
Compared with the prior art, the invention has the following beneficial effects: the self-adaptive weight bimodal fusion emotion recognition method based on voice and facial expression achieves accurate and efficient recognition on a standard database; because the emotion features of different modalities differ in how well they characterize an individual's emotion, the self-adaptive weight fusion method provides higher accuracy and objectivity, achieving an 83% recognition rate on the IEMOCAP emotion database, an improvement of about 3% over conventional fixed weight assignment.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the identification method of the present invention.
Fig. 2 is a flow chart of step S3 of the present invention.
FIG. 3 is a flow chart of adaptive weight distribution according to the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are provided only to illustrate the invention and are not to be construed as limiting its scope.
Example 1: referring to fig. 1-3, a recognition method based on adaptive weight bimodal fusion of speech and facial expressions, the method comprising the steps of:
S1, acquiring emotion voice and facial expression data, matching the emotion data to emotion categories, and selecting a training sample set and a test sample set;
S2, extracting voice emotion features from the voice data and dynamic expression features from the expression data: first automatically extracting the expression peak frame, obtaining the dynamic image sequence from expression onset to the expression peak, and normalizing the variable-length image sequence into a fixed-length image sequence as the dynamic expression features;
S3, based on the voice emotion features and the expression features respectively, learning with a deep learning method based on a semi-supervised autoencoder, and obtaining classification results and per-class output probabilities from a softmax classifier;
and S4, fusing the two single-mode emotion recognition results in a decision layer, and obtaining a final emotion recognition result by adopting a self-adaptive weight distribution method.
Further, the specific steps of the step S2 are as follows:
S2A.1: for the voice emotion data, dividing the obtained voice sample into multiple frames by framing, and windowing the framed speech segments to obtain the voice emotion signal;
S2A.2: for the voice emotion signal obtained in S2A.1, extracting frame-level low-level features such as fundamental frequency F0, short-time energy, frequency perturbation (jitter), amplitude perturbation (shimmer), harmonic-to-noise ratio, and Mel-frequency cepstral coefficients;
S2A.3: aggregating the frame-level low-level features over the utterance level formed by multiple frames, applying statistical functionals such as maximum, minimum, mean, and standard deviation to the low-level features to obtain the voice emotion features;
S2B.1: for the facial expression data, first applying a coordinate transformation to the obtained three-dimensional coordinates of the facial expression feature points: taking the nose tip as the center point, a rotation matrix is obtained using the SVD principle, and the points are rotated by multiplying with this matrix so as to eliminate the influence of head pose changes.
S2B.2, extracting peak expression frames by using a slow feature analysis method, wherein the specific steps are as follows:
1) Treating each dynamic image sequence sample as a time-varying input signal x(t) = [x_1(t), x_2(t), …, x_I(t)]^T;
2) Normalizing x(t) so that it has zero mean and unit variance;
3) Applying a nonlinear expansion to the input signal, converting the problem into a linear SFA problem;
4) Whitening the expanded data;
5) Solving with the linear SFA method.
S2B.3: after the dynamic expression sequence from the expression onset frame to the expression peak frame is obtained, normalizing the variable-length dynamic features to a fixed length by linear interpolation.
Further, the specific steps of the step S3 are as follows:
S3.1, for the data of one modality, inputting both unlabeled and labeled training samples; the encoding and decoding of the autoencoder produce the reconstructed data, and the softmax classifier produces the class output;
S3.2, calculating the unsupervised representation-learning reconstruction error E_r and the supervised classification error E_c;
S3.3, constructing the optimization objective, taking both the reconstruction error and the classification error into account, where α balances the two terms:
E(θ) = αE_r + (1-α)E_c
and S3.4, updating parameters by a gradient descent method until the objective function converges.
Further, the specific steps of the step S4 are as follows:
S4.1, acquiring the per-class output probabilities of the two modalities for a test sample from the softmax classifier, and calculating a variable δ_k; δ_k can be used to measure how well the modality characterizes the emotion, and self-adaptive weight assignment is realized according to the value of δ_k for each sample, where J is the number of classes in the system, P = {p_j | j = 1, …, J} is the vector formed by the sample output probabilities, p_j is the probability of each class output by the softmax classifier, and d denotes the Euclidean distance between two vectors.
S4.2, mapping δ_k into [0, 1] according to the following formula and using the result as the weight, where a and b are self-selected parameters:
u_k = 1 - 1/[1 + exp(-a(δ_k - b))];
S4.3, obtaining the fused output probability vector P_final = {p_final_j | j = 1, …, J} according to the following formula; the class with the highest probability is the recognized class, where p_j,k is the probability of class j output by single-modality emotion recognition with modality k, and there are K modalities in total.
Application examples: referring to fig. 1-3, in this example, the IEMOCAP emotion database is used as a material, and the simulation platform is MATLAB R2014a.
As shown in FIG. 1, the emotion recognition method based on the self-adaptive weight double-mode fusion of voice and expression mainly comprises the following steps:
S1, acquiring emotion voice and facial expression data, matching the emotion data to emotion categories, and selecting a training sample set and a test sample set. Four emotion categories are selected: neutral, happy, sad, and angry.
S2, extracting voice emotion features from the voice data and dynamic expression features from the expression data: first the expression peak frame is extracted automatically, the dynamic image sequence from expression onset to the expression peak is obtained, and the variable-length image sequence is normalized into a fixed-length image sequence to serve as the dynamic expression features. The voice features are extracted with the open-source feature extraction toolkit openSMILE using the INTERSPEECH 2010 Paralinguistic Challenge standard feature set, yielding 1582-dimensional features. For the dynamic facial expression features, the peak expression frame is extracted with the slow feature analysis method, a threshold is then set to locate the expression onset frame, the dynamic expression sequence from the onset frame to the peak frame is obtained, and the variable-length dynamic features are normalized by linear interpolation.
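For reference, the 1582-dimensional IS10 feature set is typically produced by invoking the openSMILE command-line extractor; a minimal wrapper sketch is shown below. The binary name and config path are installation-dependent assumptions, since the description only states that openSMILE and the INTERSPEECH 2010 feature set are used.

```python
import subprocess

def extract_is10_features(wav_path, out_path,
                          binary="SMILExtract",
                          config="config/is09-13/IS10_paraling.conf"):
    """Run the openSMILE command-line extractor with the INTERSPEECH 2010
    Paralinguistic Challenge configuration (1582 features per utterance).
    -C / -I / -O are the standard openSMILE options; the config path varies
    between openSMILE versions and is an assumption here."""
    subprocess.run([binary, "-C", config, "-I", wav_path, "-O", out_path],
                   check=True)

# extract_is10_features("sample.wav", "sample_is10.arff")
```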
S3, learning is carried out by adopting a deep learning method based on a semi-supervised automatic encoder based on the voice emotion characteristics and the expression characteristics respectively, and a classification result and various class output probabilities are obtained through a softmax classifier.
And S4, fusing the two single-mode emotion recognition results in a decision layer, and obtaining a final emotion recognition result by adopting a self-adaptive weight distribution method.
As shown in fig. 2, the step S3 of semi-supervised classification specifically includes:
S3.1, for the data of one modality, inputting training samples both without and with labels; the encoding and decoding of the autoencoder produce the reconstructed data, while the softmax classifier produces the class output.
And S3.2, computing the unsupervised representation-learning reconstruction error E_r and the supervised classification error E_c.
And S3.3, constructing the optimization objective, taking both the reconstruction error and the classification error into account, where α balances the two terms:
E(θ) = αE_r + (1-α)E_c
And S3.4, updating parameters by a gradient descent method until the objective function converges.
As shown in fig. 3, the specific steps of the step S4 are as follows:
S4.1, acquiring the per-class output probabilities of the two modalities for a test sample from the softmax classifier, and calculating a variable δ_k; δ_k can be used to measure how well the modality characterizes the emotion, and self-adaptive weight assignment is realized according to the value of δ_k for each sample. Here J is the number of classes in the system, P = {p_j | j = 1, …, J} is the vector formed by the sample output probabilities, p_j is the probability of each class output by the softmax classifier, and d denotes the Euclidean distance between two vectors.
S4.2, mapping δ_k into [0, 1] according to the following formula and using the result as the weight, where a and b are self-selected parameters:
u_k = 1 - 1/[1 + exp(-a(δ_k - b))]
S4.3, obtaining the fused output probability vector P_final = {p_final_j | j = 1, …, J} according to the following formula; the class with the highest probability is the recognized class, where p_j,k is the probability of class j output by single-modality emotion recognition with modality k, and there are K modalities in total.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and the equivalent substitutions or alternatives made on the basis of the above-mentioned technical solutions are all included in the scope of the present invention.

Claims (4)

1. The self-adaptive weight bimodal fusion emotion recognition method based on voice and facial expression is characterized by comprising the following steps of:
S1, acquiring emotion voice data and facial expression data, matching the emotion data to emotion categories, and selecting a training sample set and a test sample set;
S2, extracting voice emotion features from the voice data and dynamic expression features from the expression data, by first automatically extracting the expression peak frame, obtaining the dynamic image sequence from expression onset to the expression peak, and normalizing the variable-length image sequence into a fixed-length image sequence serving as the dynamic expression features;
S3, based on the voice emotion features and the expression features respectively, learning with a deep learning method based on a semi-supervised autoencoder, and obtaining classification results and per-class output probabilities from a softmax classifier;
s4, fusing the two single-mode emotion recognition results in a decision layer, adopting a self-adaptive weight distribution method to obtain a final emotion recognition result,
the decision layer fusion step based on the self-adaptive weight in the step S4 is as follows:
S4.1, acquiring the per-class output probabilities of the two modalities for a test sample from the softmax classifier, and calculating a variable δ_k, wherein δ_k can be used to measure how well the modality characterizes the emotion, and self-adaptive weight assignment is realized according to the value of δ_k for each sample, J is the number of classes in the system, P = {p_j | j = 1, …, J} is the vector consisting of the sample output probabilities, p_j is the probability of each class output by the softmax classifier, and d denotes the Euclidean distance between two vectors;
S4.2, mapping δ_k into [0, 1] according to the following formula and using the result as the weight, wherein a and b are self-selected parameters:
u_k = 1 - 1/[1 + exp(-a(δ_k - b))];
S4.3, obtaining the fused output probability vector P_final = {p_final_j | j = 1, …, J} according to the following formula, the class with the highest probability being the recognized class, wherein p_j,k is the probability of class j output by single-modality emotion recognition with modality k, and there are K modalities in total.
2. the method for identifying the self-adaptive weight bimodal fusion emotion based on voice and facial expression according to claim 1, wherein the specific steps of extracting emotion features in the step S2 are as follows:
S2A.1: for the voice emotion data, dividing the obtained voice sample into multiple frames by framing, and windowing the framed speech segments to obtain the voice emotion signal;
S2A.2: for the voice emotion signal obtained in S2A.1, extracting frame-level low-level features such as fundamental frequency F0, short-time energy, frequency perturbation (jitter), amplitude perturbation (shimmer), harmonic-to-noise ratio, and Mel-frequency cepstral coefficients;
S2A.3: aggregating the frame-level low-level features over the utterance level formed by multiple frames, applying statistical functionals such as maximum, minimum, mean, and standard deviation to the low-level features to obtain the voice emotion features;
S2B.1: for the facial expression data, first applying a coordinate transformation to the obtained three-dimensional coordinates of the facial expression feature points: taking the nose tip as the center point, a rotation matrix is obtained using the SVD principle, and the points are rotated by multiplying with this matrix so as to eliminate the influence of head pose changes;
S2B.2. extracting peak expression frame by slow feature analysis method,
S2B.3, after the dynamic expression sequence from the expression onset frame to the expression peak frame is obtained, normalizing the variable-length dynamic features to a fixed length by a linear interpolation method.
3. The method for identifying the self-adaptive weight bimodal fusion emotion based on voice and facial expression according to claim 1, wherein the specific steps of semi-supervised learning in the step S3 are as follows:
S3.1, for data of one modality, inputting both unlabeled and labeled training samples, the encoding and decoding of the autoencoder producing the reconstructed data and the softmax classifier producing the class output;
S3.2, calculating the unsupervised representation-learning reconstruction error E_r and the supervised classification error E_c;
S3.3, constructing the optimization objective function while taking both the reconstruction error E_r and the classification error E_c into account:
E(θ) = αE_r + (1-α)E_c
And S3.4, updating parameters by a gradient descent method until the objective function converges.
4. The method for identifying the self-adaptive weight bimodal fusion emotion based on voice and facial expressions according to claim 2, characterized in that S2B.2 comprises the following steps:
1) Treating each dynamic image sequence sample as a time-varying input signal x(t) = [x_1(t), x_2(t), …, x_I(t)]^T;
2) Normalizing x(t) so that it has zero mean and unit variance;
3) Performing nonlinear expansion on the input signal, converting the problem into a linear SFA problem;
4) Performing data whitening;
5) Solving with the linear SFA method.
CN201910632006.3A 2019-07-12 2019-07-12 Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression Active CN110516696B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910632006.3A CN110516696B (en) 2019-07-12 2019-07-12 Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910632006.3A CN110516696B (en) 2019-07-12 2019-07-12 Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression

Publications (2)

Publication Number Publication Date
CN110516696A CN110516696A (en) 2019-11-29
CN110516696B true CN110516696B (en) 2023-07-25

Family

ID=68623425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910632006.3A Active CN110516696B (en) 2019-07-12 2019-07-12 Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression

Country Status (1)

Country Link
CN (1) CN110516696B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110677598B (en) * 2019-09-18 2022-04-12 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium
CN111027215B (en) * 2019-12-11 2024-02-20 中国人民解放军陆军工程大学 Character training system and method for virtual person
CN111401268B (en) * 2020-03-19 2022-11-15 内蒙古工业大学 Multi-mode emotion recognition method and device for open environment
CN111460494B (en) * 2020-03-24 2023-04-07 广州大学 Multi-mode deep learning-oriented privacy protection method and system
CN112006697B (en) * 2020-06-02 2022-11-01 东南大学 Voice signal-based gradient lifting decision tree depression degree recognition system
CN112101096B (en) * 2020-08-02 2023-09-22 华南理工大学 Multi-mode fusion suicide emotion perception method based on voice and micro-expression
CN112401886B (en) * 2020-10-22 2023-01-31 北京大学 Processing method, device and equipment for emotion recognition and storage medium
CN112418034A (en) * 2020-11-12 2021-02-26 元梦人文智能国际有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN112528835B (en) * 2020-12-08 2023-07-04 北京百度网讯科技有限公司 Training method and device of expression prediction model, recognition method and device and electronic equipment
CN113076847B (en) * 2021-03-29 2022-06-17 济南大学 Multi-mode emotion recognition method and system
CN113033450B (en) * 2021-04-02 2022-06-24 山东大学 Multi-mode continuous emotion recognition method, service inference method and system
CN113343860A (en) * 2021-06-10 2021-09-03 南京工业大学 Bimodal fusion emotion recognition method based on video image and voice
CN113780198B (en) * 2021-09-15 2023-11-24 南京邮电大学 Multi-mode emotion classification method for image generation
CN114912502B (en) * 2021-12-28 2024-03-29 天翼数字生活科技有限公司 Double-mode deep semi-supervised emotion classification method based on expressions and voices
CN114626430B (en) * 2021-12-30 2022-10-18 华院计算技术(上海)股份有限公司 Emotion recognition model training method, emotion recognition device and emotion recognition medium
CN115240649B (en) * 2022-07-19 2023-04-18 于振华 Voice recognition method and system based on deep learning
CN116561533B (en) * 2023-07-05 2023-09-29 福建天晴数码有限公司 Emotion evolution method and terminal for virtual avatar in educational element universe

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105976809B (en) * 2016-05-25 2019-12-17 中国地质大学(武汉) Identification method and system based on speech and facial expression bimodal emotion fusion

Also Published As

Publication number Publication date
CN110516696A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110516696B (en) Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression
Wani et al. A comprehensive review of speech emotion recognition systems
Jahangir et al. Deep learning approaches for speech emotion recognition: State of the art and research challenges
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
Bhat et al. Automatic assessment of sentence-level dysarthria intelligibility using BLSTM
He et al. Multimodal depression recognition with dynamic visual and audio cues
Huang et al. Natural language processing methods for acoustic and landmark event-based features in speech-based depression detection
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
CN112006697A (en) Gradient boosting decision tree depression recognition method based on voice signals
CN110147548A (en) The emotion identification method initialized based on bidirectional valve controlled cycling element network and new network
CN113297383B (en) Speech emotion classification method based on knowledge distillation
Huang et al. Speech emotion recognition using convolutional neural network with audio word-based embedding
Swain et al. A DCRNN-based ensemble classifier for speech emotion recognition in Odia language
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
Ling An acoustic model for English speech recognition based on deep learning
Shah et al. Articulation constrained learning with application to speech emotion recognition
Zhao et al. [Retracted] Standardized Evaluation Method of Pronunciation Teaching Based on Deep Learning
Rangra et al. Emotional speech-based personality prediction using NPSO architecture in deep learning
Zhang et al. Emotion recognition in speech using multi-classification SVM
Cao et al. Emotion recognition from children speech signals using attention based time series deep learning
Yang [Retracted] Design of Service Robot Based on User Emotion Recognition and Environmental Monitoring
CN112951270B (en) Voice fluency detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant