CN114724222B - AI digital human emotion analysis method based on multiple modes

AI digital human emotion analysis method based on multiple modes

Info

Publication number
CN114724222B
CN114724222B (application CN202210394800.0A)
Authority
CN
China
Prior art keywords
emotion
face
recognition model
voice
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210394800.0A
Other languages
Chinese (zh)
Other versions
CN114724222A (en)
Inventor
陈再蝶
朱晓秋
章星星
樊伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kangxu Technology Co ltd
Original Assignee
Kangxu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kangxu Technology Co ltd filed Critical Kangxu Technology Co ltd
Priority to CN202210394800.0A (Critical)
Publication of CN114724222A
Application granted
Publication of CN114724222B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Acoustics & Sound (AREA)
  • Hospice & Palliative Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal AI digital human emotion analysis method, which comprises the following steps: S1, facial expression recognition and emotion judgment, with the output result a input into a multi-modal emotion analysis module; S2, speech emotion recognition and judgment, with the output result e input into the multi-modal emotion analysis module; S3, text emotion recognition and judgment, with the output result f input into the multi-modal emotion analysis module; S4, in the multi-modal emotion analysis module, the results a, e and f are combined according to the available modalities, the average probability value of the combination is taken as the final emotion judgment result g, and g is output to the AI digital human. By judging the user's emotional state across multiple modalities, the method judges that state more comprehensively and accurately and better captures the meaning the user intends to express; it is suitable not only for chat robots in financial scenarios, but can also serve as a chat robot in other vertical fields such as medicine, education and services.

Description

AI digital human emotion analysis method based on multiple modes
Technical Field
The invention relates to the technical field of AI digital humans, and in particular to a multi-modal AI digital human emotion analysis method.
Background
An AI digital human system generally consists of five modules: character image, voice generation, animation generation, audio/video synthesis and display, and interaction. The interaction module gives the AI digital human its interactive capability: it recognizes the user's intent through intelligent technologies such as speech and semantic recognition, determines the digital human's subsequent speech and actions according to the current intent, and drives the character into the next round of interaction. During interaction, the AI digital human needs to judge the client's emotion accurately in order to provide accurate service. Existing approaches either judge the emotional tendency of the text after semantic understanding, or capture the client's facial expression through a camera and provide the digital human with emotion analysis through expression recognition.
First, facial expression recognition depends on face detection. Traditional face detection methods often miss faces and lack robustness; faces are frequently not detected in profile views or in poorly lit environments, which degrades the emotion analysis result;
Second, in specific scenarios such as finance, medical care and education, an AI digital human generally needs to understand the sentiment of the client's text semantics and, in combination with the business scenario, correctly judge whether the client's semantics are positive (yes), negative (no) or neutral. However, this text semantic understanding capability requires a large data corpus or a manually constructed dictionary and is therefore highly dependent on data and human resources; in broader scenarios, text semantic understanding alone is insufficient for judging the client's emotion;
Finally, some existing AI digital humans judge the user's emotional state from speech features. One approach recognizes the speech as text and then judges emotion from the text, which depends heavily on the accuracy of speech recognition; another judges emotion directly from the speech signal, but feature extraction for emotion judgment from raw speech is still immature, so the accuracy of the resulting emotion judgment is low;
In summary, single-modality emotion recognition is less accurate when judging a client's emotional state than comprehensive multi-modal recognition over images, speech, text and the like. The invention therefore provides a multi-modal AI digital human emotion analysis method.
Disclosure of Invention
To solve the technical problems mentioned in the background art, a multi-modal AI digital human emotion analysis method is provided.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A multi-modal AI digital human emotion analysis method comprises the following steps:
S1, facial expression recognition and emotion judgment:
S11, acquiring an image through a camera module as the original image A to be detected;
S12, resizing the original image A to be detected to (640, 640, 3) to obtain an image B;
S13, inputting the image B into a trained RetinaFace face detection and recognition model, and outputting a face detection frame C;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
S15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S16, outputting the emotion category corresponding to the maximum probability value as result a, which is input into the multi-modal emotion analysis module;
S2, speech emotion recognition and judgment:
S21, collecting voice E through a voice acquisition module;
S22, inputting the voice E into a voice emotion judgment model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S23, outputting the emotion category corresponding to the maximum probability value as result e to the multi-modal emotion analysis module;
S3, text emotion recognition and judgment:
S31, collecting the voice E through the voice acquisition module and converting it into text F;
S32, inputting the text F into a text emotion recognition model, scoring the text F for emotion, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S33, outputting the emotion category corresponding to the maximum probability value as result f to the multi-modal emotion analysis module;
S4, in the multi-modal emotion analysis module, combining the results a, e and f according to the emotion combination cases, taking the average probability value of the combination as the final emotion judgment result g, and outputting g to the AI digital human.
As a further description of the above technical solution:
The training method of the RetinaFace face detection and recognition model is as follows: the widerface dataset, which contains at least 32203 pictures, is selected and divided into a training set, a validation set and a test set in a ratio of 4:1:5; the training set is used to train the RetinaFace face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and size transformation.
As a further description of the above technical solution:
the RetinaFace face detection and recognition model comprises 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module, and in the RetinaFace face detection and recognition model the global loss function L is:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
where L_cls is the softmax loss of the face/non-face binary classifier, L_box is the smooth-L1 box regression loss, L_pts is the facial key-point regression loss, L_pixel is the dense regression loss, p_i is the predicted probability that the i-th anchor is a face, p_i* is the ground-truth label of the i-th anchor (1 for a face, 0 for a non-face), λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, t_i = {t_x, t_y, t_w, t_h} are the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} are the ground-truth box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} are the predicted facial key-point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} are the ground-truth facial key-point coordinates.
As a further description of the above technical solution:
The training method of the facial expression recognition model is as follows: a dataset of the seven categories of face images "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral" is selected, containing at least 35887 pictures with at least 3000 pictures per category; it is divided into a training set, a validation set and a test set in a ratio of 8:1:1 and used to train the facial expression recognition model.
As a further description of the above technical solution:
The facial expression recognition model comprises a CNN backbone network and a stream module; the CNN backbone network comprises an inverted residual block BlockA and a downsampling module BlockB, the left auxiliary branch of the downsampling module BlockB uses AvgPool, and the stream module downsamples with a depthwise convolution (DWConv) layer with stride greater than 1 and outputs a one-dimensional feature vector.
As a further description of the above technical solution:
The emotion combination cases include the following:
Emotion combination case one: combined output of results e and f: if the AI digital human does not include a camera module, image information cannot be obtained in real time, so the emotional state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is used as the final emotion judgment result g1;
Emotion combination case two: combined output of results a and e: if the AI digital human does not include a text emotion recognition model, a text emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is used as the final emotion judgment result g2;
Emotion combination case three: combined output of results a and f: if the AI digital human does not include a voice emotion judgment model, a voice emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is used as the final emotion judgment result g3;
Emotion combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is used as the final emotion judgment result g4.
As a further description of the above technical solution:
On the premise that the IOU of the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated by precision and recall:
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
where TP = GT ∩ PRED: correct predictions (true positives), i.e. samples that are actually positive and are predicted as positive by the model;
FP = PRED - (GT ∩ PRED): incorrect predictions (false positives), i.e. samples predicted as positive by the model but actually negative;
FN = GT - (GT ∩ PRED): missed predictions (false negatives), i.e. samples that are actually positive but predicted as negative by the model;
specifically, GT denotes the ground-truth annotation of the picture input to the RetinaFace face detection and recognition model, PRED denotes the prediction output by the RetinaFace face detection and recognition model, and IOU denotes the ratio of the intersection to the union of the prediction result and the ground truth for a picture of the widerface dataset input to the RetinaFace face detection and recognition model.
In summary, owing to the above technical scheme, the invention has the following beneficial effects: the method is suitable not only for chat robots in financial scenarios, but can also serve as a chat robot in other vertical fields such as medicine, education and services.
Drawings
Fig. 1 is a schematic flow chart of the facial expression recognition emotion judgment provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the network structure of the RetinaFace face detection and recognition model provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the CNN backbone network provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the CNN backbone network structure modules provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the SSH context module structure provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the IOU provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the protection scope of the present invention.
Example 1
Referring to Figs. 1 to 6, the present invention provides the following technical solution: a multi-modal AI digital human emotion analysis method comprising the following steps:
S1, facial expression recognition and emotion judgment:
S11, acquiring an image through a camera module as the original image A to be detected;
S12, resizing the original image A to be detected to (640, 640, 3) to obtain an image B;
S13, inputting the image B into a trained RetinaFace face detection and recognition model, and outputting a face detection frame C;
Specifically, the training method of the RetinaFace face detection and recognition model is as follows: the widerface dataset, which contains at least 32203 pictures, is selected and divided into a training set, a validation set and a test set in a ratio of 4:1:5 for training the RetinaFace face detection and recognition model; during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and size transformation;
Traditional face detection methods suffer from missed detections and low robustness; by augmenting the training data and training the RetinaFace face detection and recognition model, a deep detection model with high accuracy is obtained.
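As an illustration of this augmentation pipeline, the following is a minimal sketch assuming a PyTorch/torchvision implementation; the patent does not name a framework, the jitter magnitudes are placeholder values, and for detector training the box and key-point coordinates would have to be transformed together with the image, which is omitted here:

    # Hypothetical augmentation sketch; torchvision and the parameter values are assumptions.
    from torchvision import transforms

    train_augment = transforms.Compose([
        transforms.ColorJitter(brightness=0.3, saturation=0.3, hue=0.1),  # brightness/saturation/hue change
        transforms.RandomResizedCrop(size=640, scale=(0.5, 1.0)),         # random crop + size transformation
        transforms.RandomHorizontalFlip(p=0.5),                           # mirror flip
        transforms.ToTensor(),
    ])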
Further, as shown in Fig. 2, the RetinaFace face detection and recognition model includes 5 pyramid feature maps, an SSH context module and a deformable convolution network (DCN) module. The SSH context module is added to the 5 pyramid feature maps to improve the detection accuracy for small faces, the DCN module is introduced to further improve accuracy, and 5 facial key points are added to improve the accuracy of the detection algorithm on the hard subset of the widerface dataset;
Specifically, the SSH context module improves small-face detection by introducing context into the feature map. In a two-stage detector, context is usually integrated by enlarging the window around the candidate proposals; this strategy is imitated here with simple convolution layers. Fig. 5 shows the context layers integrated into the detection module. Because the anchors are classified and regressed in a convolutional manner, a larger filter (larger convolution kernel) plays the same role as enlarging the window around the proposals in a two-stage detector; for this purpose, 5×5 and 7×7 filters (convolution kernels) are used in the SSH context module, which enlarges the receptive field in proportion to the stride of the corresponding layer and thus enlarges the target scale of each detection module. To reduce the number of parameters, stacks of 3×3 convolution kernels are used in place of the larger kernels, achieving an equal or larger receptive field with fewer parameters; with the SSH context module, the average precision on the widerface dataset is improved.
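A minimal PyTorch sketch of such a context branch is given below; the channel split (in -> in/2 -> in/4 per branch) and the ReLU placement are assumptions, since the text only specifies that stacked 3×3 convolutions replace the 5×5 and 7×7 kernels:

    import torch
    import torch.nn as nn

    class SSHContext(nn.Module):
        """Context branch of an SSH-style module: stacked 3x3 convs emulate the
        5x5 and 7x7 receptive fields with fewer parameters (channel widths assumed)."""
        def __init__(self, in_ch):
            super().__init__()
            self.conv_shared = nn.Sequential(
                nn.Conv2d(in_ch, in_ch // 2, 3, padding=1), nn.ReLU(inplace=True))
            # one extra 3x3 conv -> effective 5x5 receptive field
            self.conv_5x5 = nn.Conv2d(in_ch // 2, in_ch // 4, 3, padding=1)
            # two extra 3x3 convs -> effective 7x7 receptive field
            self.conv_7x7 = nn.Sequential(
                nn.Conv2d(in_ch // 2, in_ch // 4, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch // 4, in_ch // 4, 3, padding=1))

        def forward(self, x):
            shared = self.conv_shared(x)
            return torch.cat([self.conv_5x5(shared), self.conv_7x7(shared)], dim=1)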
The loss function comprises the classification loss, the box regression loss, the facial key-point loss and the dense regression loss; introducing the facial key points improves the regression accuracy of the box. In the RetinaFace face detection and recognition model, the global loss function L is:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
where L_cls is the softmax loss of the face/non-face binary classifier, L_box is the smooth-L1 box regression loss, L_pts is the facial key-point regression loss, L_pixel is the dense regression loss, p_i is the predicted probability that the i-th anchor is a face, p_i* is the ground-truth label of the i-th anchor (1 for a face, 0 for a non-face), λ1 = 0.25, λ2 = 0.1 and λ3 = 0.01 balance the weights of the different loss terms, t_i = {t_x, t_y, t_w, t_h} are the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} are the ground-truth box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} are the predicted facial key-point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} are the ground-truth facial key-point coordinates;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
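Steps S12 to S14 can be sketched as follows; OpenCV is an assumption, and the detector's detect() interface is a hypothetical placeholder for the trained RetinaFace model:

    import cv2

    def preprocess_for_expression(image_a, detector):
        # S12: resize the captured original image A to (640, 640, 3) -> image B
        image_b = cv2.resize(image_a, (640, 640))
        # S13: run face detection; assume it returns one box C as (x1, y1, x2, y2)
        x1, y1, x2, y2 = detector.detect(image_b)   # hypothetical interface
        # S14: crop the target face region and resize it to 224 x 224 -> face image D
        face_d = cv2.resize(image_b[y1:y2, x1:x2], (224, 224))
        return face_d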
S15, inputting the face image D into the trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
Specifically, the training method of the facial expression recognition model is as follows: a dataset of the seven categories of face images "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral" is selected, containing at least 35887 pictures with at least 3000 pictures per category; it is divided into a training set, a validation set and a test set in a ratio of 8:1:1 and used to train the facial expression recognition model;
Further, as shown in Figs. 3 and 4, the facial expression recognition model includes a CNN backbone network and a stream module. The CNN backbone network includes an inverted residual block BlockA and a downsampling module BlockB; the left auxiliary branch of BlockB uses AvgPool, because it can embed multi-scale information and aggregate features from different receptive fields, which improves performance. In the stream module, a depthwise convolution (DWConv) layer with stride greater than 1 is used for downsampling, and a one-dimensional feature vector is output;
For facial expression recognition, the input image is resized to 224 × 224 and fed into the CNN backbone network, where convolution operations extract features and produce a 7 × 7 feature map. After the backbone, the stream module is used to better exploit the feature map information: the DWConv layer with stride greater than 1 downsamples the feature map and a one-dimensional feature vector (1 × 7) is output, which reduces the overfitting risk of a fully connected layer; the loss is then computed on this feature vector for prediction. A fast downsampling strategy is adopted at the early stage of the CNN backbone so that the feature map size shrinks quickly with few parameters, avoiding the weak feature-embedding capability and long processing time caused by slow downsampling under limited computing power.
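A sketch of this stream module is shown below; the backbone channel width of 512 is an assumption, since the text only fixes the 7 × 7 input feature map and the 7-dimensional output:

    import torch
    import torch.nn as nn

    class StreamHead(nn.Module):
        """Sketch of the 'stream module': a depthwise conv (stride > 1) downsamples
        the 7x7 backbone feature map and a 1x1 conv maps it to the 7 emotion logits,
        avoiding a large fully connected layer. The channel count is an assumption."""
        def __init__(self, channels=512, num_classes=7):
            super().__init__()
            # depthwise conv: groups == in_channels, stride 7 collapses 7x7 -> 1x1
            self.dw = nn.Conv2d(channels, channels, kernel_size=7, stride=7, groups=channels)
            self.pw = nn.Conv2d(channels, num_classes, kernel_size=1)

        def forward(self, x):              # x: (N, channels, 7, 7)
            x = self.pw(self.dw(x))        # -> (N, 7, 1, 1)
            return torch.flatten(x, 1)     # -> (N, 7) one-dimensional feature vector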
S16, outputting the emotion category corresponding to the maximum probability value as result a, which is input into the multi-modal emotion analysis module;
S2, speech emotion recognition and judgment:
S21, collecting voice E through a voice acquisition module;
S22, inputting the voice E into a voice emotion judgment model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio (a feature-extraction sketch follows step S23), and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S23, outputting the emotion category corresponding to the maximum probability value as result e to the multi-modal emotion analysis module;
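The feature extraction named in step S22 can be sketched as follows; librosa is an assumption (the patent does not specify a toolkit), and mean-pooling over time is an illustrative choice:

    import numpy as np
    import librosa

    def speech_emotion_features(wav_path, n_mfcc=13):
        """Extract the features named in S22 and return one pooled feature vector."""
        y, sr = librosa.load(wav_path, sr=None)
        zcr      = librosa.feature.zero_crossing_rate(y)            # zero-crossing rate
        rms      = librosa.feature.rms(y=y)                         # amplitude (RMS energy)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # spectral centroid
        mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # Mel-frequency cepstral coefficients
        # mean-pool each feature over time and concatenate; the pooling choice is an assumption
        return np.concatenate([f.mean(axis=1) for f in (zcr, rms, centroid, mfcc)])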
S3, text emotion recognition and judgment:
S31, collecting the voice E through the voice acquisition module and converting it into text F;
S32, inputting the text F into a text emotion recognition model and scoring the text F for emotion (a sketch follows step S33), outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S33, outputting the emotion category corresponding to the maximum probability value as result f to the multi-modal emotion analysis module;
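A sketch of steps S32 and S33 follows; tokenizer and text_model are hypothetical stand-ins for whatever seven-class text emotion classifier is deployed, since the patent does not identify a specific architecture:

    import torch
    import torch.nn.functional as F

    EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprise", "neutral"]

    def text_emotion_result(text, tokenizer, text_model):
        # tokenizer/text_model are hypothetical placeholders for the trained text emotion recognition model
        inputs = tokenizer(text)                      # e.g. a tensor of token ids
        logits = text_model(inputs)                   # expected shape (1, 7)
        probs = F.softmax(logits, dim=-1).squeeze(0)  # probability values of the seven emotion categories
        return EMOTIONS[int(torch.argmax(probs))], probs   # result f: top category plus probabilities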
S4, in the multi-modal emotion analysis module, combining the results a, e and f according to the emotion combination cases, taking the average probability value of the combination as the final emotion judgment result g, and outputting g to the AI digital human;
Specifically, the emotion combination cases include the following (a fusion sketch follows these cases):
Emotion combination case one: combined output of results e and f: if the AI digital human does not include a camera module, image information cannot be obtained in real time, so the emotional state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is used as the final emotion judgment result g1;
Emotion combination case two: combined output of results a and e: if the AI digital human does not include a text emotion recognition model, a text emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is used as the final emotion judgment result g2;
Emotion combination case three: combined output of results a and f: if the AI digital human does not include a voice emotion judgment model, a voice emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is used as the final emotion judgment result g3;
Emotion combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is used as the final emotion judgment result g4;
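The combination step can be sketched as follows: whichever of the three seven-class probability vectors are available are averaged, and the category with the highest averaged probability is returned as the final result g (NumPy is used here only for illustration):

    import numpy as np

    EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprise", "neutral"]

    def fuse_emotions(prob_a=None, prob_e=None, prob_f=None):
        """Average the available 7-class probability vectors (face a, voice e, text f)
        and return the emotion with the highest averaged probability as result g."""
        available = [np.asarray(p) for p in (prob_a, prob_e, prob_f) if p is not None]
        if not available:
            raise ValueError("at least one modality result is required")
        mean_probs = np.mean(available, axis=0)     # average probability value
        return EMOTIONS[int(np.argmax(mean_probs))], mean_probs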
The method is suitable not only for chat robots in financial scenarios, but can also be used as a chat robot in other vertical fields such as medicine, education and services.
Referring to Fig. 6, on the premise that the IOU of the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated by precision and recall:
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
where TP = GT ∩ PRED: correct predictions (true positives), i.e. samples that are actually positive and are predicted as positive by the model;
FP = PRED - (GT ∩ PRED): incorrect predictions (false positives), i.e. samples predicted as positive by the model but actually negative;
FN = GT - (GT ∩ PRED): missed predictions (false negatives), i.e. samples that are actually positive but predicted as negative by the model;
specifically, GT denotes the ground-truth annotation of the picture input to the RetinaFace face detection and recognition model, PRED denotes the prediction output by the RetinaFace face detection and recognition model, and IOU denotes the ratio of the intersection to the union of the prediction result and the ground truth for a picture of the widerface dataset input to the RetinaFace face detection and recognition model;
Precision indicates how accurate the positive predictions are, while recall evaluates the coverage of the actual positive samples: the higher the recall, the larger the proportion of actual positive samples that are detected. By analysing the missed and false detections together, it can be determined whether the RetinaFace face detection and recognition model generalizes insufficiently in some special scenarios, and corresponding optimizations, such as image enhancement, can be applied.
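The evaluation described above can be sketched as follows; the greedy one-to-one matching of predictions to ground-truth boxes is an assumed convention, since the patent only fixes the IOU > 0.5 criterion and the precision/recall formulas:

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    def precision_recall(gt_boxes, pred_boxes, iou_thresh=0.5):
        """A prediction counts as TP if it overlaps an unmatched GT box with IoU > 0.5."""
        matched, tp = set(), 0
        for pb in pred_boxes:
            best_j, best_iou = -1, 0.0
            for j, gb in enumerate(gt_boxes):
                if j in matched:
                    continue
                overlap = iou(pb, gb)
                if overlap > best_iou:
                    best_j, best_iou = j, overlap
            if best_iou > iou_thresh:
                tp += 1
                matched.add(best_j)
        fp = len(pred_boxes) - tp                       # false positives
        fn = len(gt_boxes) - tp                         # false negatives (missed faces)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall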
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and inventive concept of the present invention, shall fall within the protection scope of the present invention.

Claims (5)

1. A multi-modal AI digital human emotion analysis method, characterized by comprising the following steps: S1, facial expression recognition and emotion judgment:
S11, acquiring an image through a camera module as the original image A to be detected;
S12, resizing the original image A to be detected to (640, 640, 3) to obtain an image B;
S13, inputting the image B into a trained RetinaFace face detection and recognition model, and outputting a face detection frame C;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
S15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S16, outputting the emotion category corresponding to the maximum probability value as result a, which is input into the multi-modal emotion analysis module;
S2, speech emotion recognition and judgment:
S21, collecting voice E through a voice acquisition module;
S22, inputting the voice E into a voice emotion judgment model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S23, outputting the emotion category corresponding to the maximum probability value as result e to the multi-modal emotion analysis module;
S3, text emotion recognition and judgment:
S31, collecting the voice E through the voice acquisition module and converting it into text F;
S32, inputting the text F into a text emotion recognition model, scoring the text F for emotion, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S33, outputting the emotion category corresponding to the maximum probability value as result f to the multi-modal emotion analysis module;
S4, in the multi-modal emotion analysis module, combining the results a, e and f according to the emotion combination cases, taking the average probability value of the combination as the final emotion judgment result g, and outputting g to the AI digital human;
the RetinaFace face detection and recognition model comprises 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module, and in the RetinaFace face detection and recognition model the global loss function L is:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
where L_cls is the softmax loss of the face/non-face binary classifier, L_box is the smooth-L1 box regression loss, L_pts is the facial key-point regression loss, L_pixel is the dense regression loss, p_i is the predicted probability that the i-th anchor is a face, p_i* is the ground-truth label of the i-th anchor (1 for a face, 0 for a non-face), λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, t_i = {t_x, t_y, t_w, t_h} are the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} are the ground-truth box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} are the predicted facial key-point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} are the ground-truth facial key-point coordinates;
the facial expression recognition model comprises a CNN backbone network and a stream module; the CNN backbone network comprises an inverted residual block BlockA and a downsampling module BlockB, the left auxiliary branch of the downsampling module BlockB uses AvgPool, and the stream module downsamples with a depthwise convolution (DWConv) layer with stride greater than 1 and outputs a one-dimensional feature vector.
2. The multi-modal AI digital human emotion analysis method of claim 1, wherein the training method of the RetinaFace face detection and recognition model is as follows: the widerface dataset, which contains at least 32203 pictures, is selected and divided into a training set, a validation set and a test set in a ratio of 4:1:5; the training set is used to train the RetinaFace face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and size transformation.
3. The multi-modal AI digital human emotion analysis method of claim 1, wherein the training method of the facial expression recognition model is as follows: a dataset of the seven categories of face images "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral" is selected, containing at least 35887 pictures with at least 3000 pictures per category; it is divided into a training set, a validation set and a test set in a ratio of 8:1:1 and used to train the facial expression recognition model.
4. The multi-modal AI digital human emotion analysis method of claim 1, wherein the emotion combination cases include the following:
Emotion combination case one: combined output of results e and f: if the AI digital human does not include a camera module, image information cannot be obtained in real time, so the emotional state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is used as the final emotion judgment result g1;
Emotion combination case two: combined output of results a and e: if the AI digital human does not include a text emotion recognition model, a text emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is used as the final emotion judgment result g2;
Emotion combination case three: combined output of results a and f: if the AI digital human does not include a voice emotion judgment model, a voice emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is used as the final emotion judgment result g3;
Emotion combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is used as the final emotion judgment result g4.
5. The multi-modal AI digital human emotion analysis method of claim 1, wherein, on the premise that the IOU of the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated by precision and recall:
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
where TP = GT ∩ PRED: correct predictions (true positives), i.e. samples that are actually positive and are predicted as positive by the model;
FP = PRED - (GT ∩ PRED): incorrect predictions (false positives), i.e. samples predicted as positive by the model but actually negative;
FN = GT - (GT ∩ PRED): missed predictions (false negatives), i.e. samples that are actually positive but predicted as negative by the model;
specifically, GT denotes the ground-truth annotation of the picture input to the RetinaFace face detection and recognition model, PRED denotes the prediction output by the RetinaFace face detection and recognition model, and IOU denotes the ratio of the intersection to the union of the prediction result and the ground truth for a picture of the widerface dataset input to the RetinaFace face detection and recognition model.
CN202210394800.0A 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes Active CN114724222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394800.0A CN114724222B (en) 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210394800.0A CN114724222B (en) 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes

Publications (2)

Publication Number Publication Date
CN114724222A CN114724222A (en) 2022-07-08
CN114724222B true CN114724222B (en) 2024-04-19

Family

ID=82244023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394800.0A Active CN114724222B (en) 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes

Country Status (1)

Country Link
CN (1) CN114724222B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641837A (en) * 2022-12-22 2023-01-24 北京资采信息技术有限公司 Intelligent robot conversation intention recognition method and system
CN116520980A (en) * 2023-04-03 2023-08-01 湖北大学 Interaction method, system and terminal for emotion analysis of intelligent shopping guide robot in mall
CN117234369B (en) * 2023-08-21 2024-06-21 华院计算技术(上海)股份有限公司 Digital human interaction method and system, computer readable storage medium and digital human equipment
CN117576279B (en) * 2023-11-28 2024-04-19 世优(北京)科技有限公司 Digital person driving method and system based on multi-mode data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609572A (en) * 2017-08-15 2018-01-19 中国科学院自动化研究所 Multi-modal emotion identification method, system based on neutral net and transfer learning
CN109558935A (en) * 2018-11-28 2019-04-02 黄欢 Emotion recognition and exchange method and system based on deep learning
KR20190119863A (en) * 2018-04-13 2019-10-23 인하대학교 산학협력단 Video-based human emotion recognition using semi-supervised learning and multimodal networks
CN111564164A (en) * 2020-04-01 2020-08-21 中国电力科学研究院有限公司 Multi-mode emotion recognition method and device
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112766173A (en) * 2021-01-21 2021-05-07 福建天泉教育科技有限公司 Multi-mode emotion analysis method and system based on AI deep learning
CN113158828A (en) * 2021-03-30 2021-07-23 华南理工大学 Facial emotion calibration method and system based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815785A (en) * 2018-12-05 2019-05-28 四川大学 A kind of face Emotion identification method based on double-current convolutional neural networks


Also Published As

Publication number Publication date
CN114724222A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN114724222B (en) AI digital human emotion analysis method based on multiple modes
CN111523462B (en) Video sequence expression recognition system and method based on self-attention enhanced CNN
CN106599800A (en) Face micro-expression recognition method based on deep learning
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN109255289B (en) Cross-aging face recognition method based on unified generation model
CN109886153B (en) Real-time face detection method based on deep convolutional neural network
CN111582397A (en) CNN-RNN image emotion analysis method based on attention mechanism
CN112183334B (en) Video depth relation analysis method based on multi-mode feature fusion
CN112906485A (en) Visual impairment person auxiliary obstacle perception method based on improved YOLO model
CN112183468A (en) Pedestrian re-identification method based on multi-attention combined multi-level features
CN108416314B (en) Picture important face detection method
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN110738160A (en) human face quality evaluation method combining with human face detection
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN111666845A (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN117149944B (en) Multi-mode situation emotion recognition method and system based on wide time range
CN110222636A (en) The pedestrian's attribute recognition approach inhibited based on background
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN111652307A (en) Intelligent nondestructive identification method and device for redwood furniture based on convolutional neural network
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN114662605A (en) Flame detection method based on improved YOLOv5 model
KR20210011707A (en) A CNN-based Scene classifier with attention model for scene recognition in video

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
    Address after: No. 2-206, No. 1399 Liangmu Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100; Applicant after: Kangxu Technology Co.,Ltd.; Country or region after: China
    Address before: 310000 2-206, 1399 liangmu Road, Cangqian street, Yuhang District, Hangzhou City, Zhejiang Province; Applicant before: Zhejiang kangxu Technology Co.,Ltd.; Country or region before: China
GR01 Patent grant