CN114724222B - AI digital human emotion analysis method based on multiple modes

AI digital human emotion analysis method based on multiple modes

Info

Publication number
CN114724222B
CN114724222B (application CN202210394800.0A)
Authority
CN
China
Prior art keywords
emotion
face
recognition model
voice
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210394800.0A
Other languages
Chinese (zh)
Other versions
CN114724222A (en)
Inventor
陈再蝶
朱晓秋
章星星
樊伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kangxu Technology Co ltd
Original Assignee
Kangxu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kangxu Technology Co ltd filed Critical Kangxu Technology Co ltd
Priority to CN202210394800.0A (Critical)
Publication of CN114724222A
Application granted
Publication of CN114724222B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Acoustics & Sound (AREA)
  • Hospice & Palliative Care (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal AI digital human emotion analysis method, which comprises the following steps: S1, facial expression recognition and emotion judgment, with the output result a input into a multi-modal emotion analysis module; S2, speech emotion recognition and judgment, with the output result e input into the multi-modal emotion analysis module; S3, text emotion recognition and judgment, with the output result f input into the multi-modal emotion analysis module; S4, in the multi-modal emotion analysis module, the results a, e and f are combined according to the available modalities, the average probability value of the combination is taken as the final emotion judgment result g, and g is output to the AI digital human. By judging the user's emotional state across multiple modalities, the method judges that state more comprehensively and accurately and better captures the meaning the user intends to express; it is suitable not only for chat robots in financial scenarios, but can also serve as a chat robot in other vertical fields such as medicine, education and services.

Description

AI digital human emotion analysis method based on multiple modes
Technical Field
The invention relates to the technical field of AI digital humans, and in particular to a multi-modal AI digital human emotion analysis method.
Background
An AI digital human system generally consists of five modules: character image, voice generation, animation generation, audio/video synthesis and display, and interaction. The interaction module gives the AI digital human its interactive capability: it recognizes the user's intent through intelligent technologies such as speech and semantic recognition, determines the digital human's subsequent speech and actions according to the current intent, and drives the character into the next round of interaction. During interaction, the AI digital human needs to judge the client's emotion accurately in order to provide accurate service. Existing approaches either judge the emotional tendency of the text after semantic understanding, or capture the client's facial expression through a camera and provide the digital human with emotion analysis through expression recognition.
First, facial expression recognition depends on face detection. Traditional face detection methods often miss faces and lack robustness; faces are frequently not detected in profile views or in poorly lit environments, which degrades the emotion analysis result;
Second, in specific scenarios such as finance, medical care and education, an AI digital human generally needs to understand the sentiment of the client's text semantics and, in combination with the business scenario, correctly judge whether the client's semantics are positive (yes), negative (no) or neutral. However, this text semantic understanding capability requires a large data corpus or a manually constructed dictionary and is therefore highly dependent on data and human resources; in broader scenarios, text semantic understanding alone is insufficient for judging the client's emotion;
Finally, some existing AI digital humans judge the user's emotional state from speech features. One approach recognizes the speech as text and then judges emotion from the text, which depends heavily on the accuracy of speech recognition; another judges emotion directly from the speech signal, but feature extraction for emotion judgment from raw speech is still immature, so the accuracy of the resulting emotion judgment is low;
In summary, single-modality emotion recognition is less accurate when judging a client's emotional state than comprehensive multi-modal recognition over images, speech, text and the like. The invention therefore provides a multi-modal AI digital human emotion analysis method.
Disclosure of Invention
To solve the technical problems mentioned in the background art, a multi-modal AI digital human emotion analysis method is provided.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A multi-modal AI digital human emotion analysis method comprises the following steps:
S1, facial expression recognition and emotion judgment:
S11, acquiring an image through a camera module as the original image A to be detected;
S12, resizing the original image A to be detected to (640, 640, 3) to obtain an image B;
S13, inputting the image B into a trained RetinaFace face detection and recognition model, and outputting a face detection frame C;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
S15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S16, outputting the emotion category corresponding to the maximum probability value as result a, which is input into the multi-modal emotion analysis module;
S2, speech emotion recognition and judgment:
S21, collecting voice E through a voice acquisition module;
S22, inputting the voice E into a voice emotion judgment model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S23, outputting the emotion category corresponding to the maximum probability value as result e to the multi-modal emotion analysis module;
S3, text emotion recognition and judgment:
S31, collecting the voice E through the voice acquisition module and converting it into text F;
S32, inputting the text F into a text emotion recognition model, scoring the text F for emotion, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S33, outputting the emotion category corresponding to the maximum probability value as result f to the multi-modal emotion analysis module;
S4, in the multi-modal emotion analysis module, combining the results a, e and f according to the emotion combination cases, taking the average probability value of the combination as the final emotion judgment result g, and outputting g to the AI digital human.
As a further description of the above technical solution:
The training method of the RetinaFace face detection and recognition model is as follows: the widerface dataset, which contains at least 32203 pictures, is selected and divided into a training set, a validation set and a test set in a ratio of 4:1:5; the training set is used to train the RetinaFace face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and size transformation.
As a further description of the above technical solution:
the RetinaFace face detection and recognition model comprises 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module, and in the RetinaFace face detection and recognition model the global loss function L is:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
where L_cls is the softmax loss of the face/non-face binary classifier, L_box is the smooth-L1 box regression loss, L_pts is the facial key-point regression loss, L_pixel is the dense regression loss, p_i is the predicted probability that the i-th anchor is a face, p_i* is the ground-truth label of the i-th anchor (1 for a face, 0 for a non-face), λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, t_i = {t_x, t_y, t_w, t_h} are the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} are the ground-truth box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} are the predicted facial key-point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} are the ground-truth facial key-point coordinates.
As a further description of the above technical solution:
The training method of the facial expression recognition model is as follows: a dataset of the seven categories of face images "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral" is selected, containing at least 35887 pictures with at least 3000 pictures per category; it is divided into a training set, a validation set and a test set in a ratio of 8:1:1 and used to train the facial expression recognition model.
As a further description of the above technical solution:
The facial expression recognition model comprises a CNN backbone network and a stream module; the CNN backbone network comprises an inverted residual block BlockA and a downsampling module BlockB, the left auxiliary branch of the downsampling module BlockB uses AvgPool, and the stream module downsamples with a depthwise convolution (DWConv) layer with stride greater than 1 and outputs a one-dimensional feature vector.
As a further description of the above technical solution:
The emotion combination cases include the following:
Emotion combination case one: combined output of results e and f: if the AI digital human does not include a camera module, image information cannot be obtained in real time, so the emotional state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is used as the final emotion judgment result g1;
Emotion combination case two: combined output of results a and e: if the AI digital human does not include a text emotion recognition model, a text emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is used as the final emotion judgment result g2;
Emotion combination case three: combined output of results a and f: if the AI digital human does not include a voice emotion judgment model, a voice emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is used as the final emotion judgment result g3;
Emotion combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is used as the final emotion judgment result g4.
As a further description of the above technical solution:
On the premise that the IOU of the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated by precision and recall:
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
where TP = GT ∩ PRED: correct predictions (true positives), i.e. samples that are actually positive and are predicted as positive by the model;
FP = PRED - (GT ∩ PRED): incorrect predictions (false positives), i.e. samples predicted as positive by the model but actually negative;
FN = GT - (GT ∩ PRED): missed predictions (false negatives), i.e. samples that are actually positive but predicted as negative by the model;
specifically, GT denotes the ground-truth annotation of the picture input to the RetinaFace face detection and recognition model, PRED denotes the prediction output by the RetinaFace face detection and recognition model, and IOU denotes the ratio of the intersection to the union of the prediction result and the ground truth for a picture of the widerface dataset input to the RetinaFace face detection and recognition model.
In summary, owing to the above technical scheme, the invention has the following beneficial effects: the method is suitable not only for chat robots in financial scenarios, but can also serve as a chat robot in other vertical fields such as medicine, education and services.
Drawings
Fig. 1 is a schematic flow chart of the facial expression recognition emotion judgment provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the network structure of the RetinaFace face detection and recognition model provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the CNN backbone network provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the CNN backbone network structure modules provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the SSH context module structure provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the IOU provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the protection scope of the present invention.
Example 1
Referring to Figs. 1 to 6, the present invention provides the following technical solution: a multi-modal AI digital human emotion analysis method comprising the following steps:
S1, facial expression recognition and emotion judgment:
S11, acquiring an image through a camera module as the original image A to be detected;
S12, resizing the original image A to be detected to (640, 640, 3) to obtain an image B;
S13, inputting the image B into a trained RetinaFace face detection and recognition model, and outputting a face detection frame C;
Specifically, the training method of the RetinaFace face detection and recognition model is as follows: the widerface dataset, which contains at least 32203 pictures, is selected and divided into a training set, a validation set and a test set in a ratio of 4:1:5 for training the RetinaFace face detection and recognition model; during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and size transformation;
Traditional face detection methods suffer from missed detections and low robustness; by augmenting the training data and training the RetinaFace face detection and recognition model, a deep detection model with high accuracy is obtained.
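As an illustration of this augmentation pipeline, the following is a minimal sketch assuming a PyTorch/torchvision implementation; the patent does not name a framework, the jitter magnitudes are placeholder values, and for detector training the box and key-point coordinates would have to be transformed together with the image, which is omitted here:

    # Hypothetical augmentation sketch; torchvision and the parameter values are assumptions.
    from torchvision import transforms

    train_augment = transforms.Compose([
        transforms.ColorJitter(brightness=0.3, saturation=0.3, hue=0.1),  # brightness/saturation/hue change
        transforms.RandomResizedCrop(size=640, scale=(0.5, 1.0)),         # random crop + size transformation
        transforms.RandomHorizontalFlip(p=0.5),                           # mirror flip
        transforms.ToTensor(),
    ])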
Further, as shown in Fig. 2, the RetinaFace face detection and recognition model includes 5 pyramid feature maps, an SSH context module and a deformable convolution network (DCN) module. The SSH context module is added to the 5 pyramid feature maps to improve the detection accuracy for small faces, the DCN module is introduced to further improve accuracy, and 5 facial key points are added to improve the accuracy of the detection algorithm on the hard subset of the widerface dataset;
Specifically, the SSH context module improves small-face detection by introducing context into the feature map. In a two-stage detector, context is usually integrated by enlarging the window around the candidate proposals; this strategy is imitated here with simple convolution layers. Fig. 5 shows the context layers integrated into the detection module. Because the anchors are classified and regressed in a convolutional manner, a larger filter (larger convolution kernel) plays the same role as enlarging the window around the proposals in a two-stage detector; for this purpose, 5×5 and 7×7 filters (convolution kernels) are used in the SSH context module, which enlarges the receptive field in proportion to the stride of the corresponding layer and thus enlarges the target scale of each detection module. To reduce the number of parameters, stacks of 3×3 convolution kernels are used in place of the larger kernels, achieving an equal or larger receptive field with fewer parameters; with the SSH context module, the average precision on the widerface dataset is improved.
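A minimal PyTorch sketch of such a context branch is given below; the channel split (in -> in/2 -> in/4 per branch) and the ReLU placement are assumptions, since the text only specifies that stacked 3×3 convolutions replace the 5×5 and 7×7 kernels:

    import torch
    import torch.nn as nn

    class SSHContext(nn.Module):
        """Context branch of an SSH-style module: stacked 3x3 convs emulate the
        5x5 and 7x7 receptive fields with fewer parameters (channel widths assumed)."""
        def __init__(self, in_ch):
            super().__init__()
            self.conv_shared = nn.Sequential(
                nn.Conv2d(in_ch, in_ch // 2, 3, padding=1), nn.ReLU(inplace=True))
            # one extra 3x3 conv -> effective 5x5 receptive field
            self.conv_5x5 = nn.Conv2d(in_ch // 2, in_ch // 4, 3, padding=1)
            # two extra 3x3 convs -> effective 7x7 receptive field
            self.conv_7x7 = nn.Sequential(
                nn.Conv2d(in_ch // 2, in_ch // 4, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch // 4, in_ch // 4, 3, padding=1))

        def forward(self, x):
            shared = self.conv_shared(x)
            return torch.cat([self.conv_5x5(shared), self.conv_7x7(shared)], dim=1)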
The loss function comprises the classification loss, the box regression loss, the facial key-point loss and the dense regression loss; introducing the facial key points improves the regression accuracy of the box. In the RetinaFace face detection and recognition model, the global loss function L is:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
where L_cls is the softmax loss of the face/non-face binary classifier, L_box is the smooth-L1 box regression loss, L_pts is the facial key-point regression loss, L_pixel is the dense regression loss, p_i is the predicted probability that the i-th anchor is a face, p_i* is the ground-truth label of the i-th anchor (1 for a face, 0 for a non-face), λ1 = 0.25, λ2 = 0.1 and λ3 = 0.01 balance the weights of the different loss terms, t_i = {t_x, t_y, t_w, t_h} are the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} are the ground-truth box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} are the predicted facial key-point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} are the ground-truth facial key-point coordinates;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
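Steps S12 to S14 can be sketched as follows; OpenCV is an assumption, and the detector's detect() interface is a hypothetical placeholder for the trained RetinaFace model:

    import cv2

    def preprocess_for_expression(image_a, detector):
        # S12: resize the captured original image A to (640, 640, 3) -> image B
        image_b = cv2.resize(image_a, (640, 640))
        # S13: run face detection; assume it returns one box C as (x1, y1, x2, y2)
        x1, y1, x2, y2 = detector.detect(image_b)   # hypothetical interface
        # S14: crop the target face region and resize it to 224 x 224 -> face image D
        face_d = cv2.resize(image_b[y1:y2, x1:x2], (224, 224))
        return face_d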
S15, inputting the face image D into the trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
Specifically, the training method of the facial expression recognition model is as follows: a dataset of the seven categories of face images "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral" is selected, containing at least 35887 pictures with at least 3000 pictures per category; it is divided into a training set, a validation set and a test set in a ratio of 8:1:1 and used to train the facial expression recognition model;
Further, as shown in Figs. 3 and 4, the facial expression recognition model includes a CNN backbone network and a stream module. The CNN backbone network includes an inverted residual block BlockA and a downsampling module BlockB; the left auxiliary branch of BlockB uses AvgPool, because it can embed multi-scale information and aggregate features from different receptive fields, which improves performance. In the stream module, a depthwise convolution (DWConv) layer with stride greater than 1 is used for downsampling, and a one-dimensional feature vector is output;
For facial expression recognition, the input image is resized to 224 × 224 and fed into the CNN backbone network, where convolution operations extract features and produce a 7 × 7 feature map. After the backbone, the stream module is used to better exploit the feature map information: the DWConv layer with stride greater than 1 downsamples the feature map and a one-dimensional feature vector (1 × 7) is output, which reduces the overfitting risk of a fully connected layer; the loss is then computed on this feature vector for prediction. A fast downsampling strategy is adopted at the early stage of the CNN backbone so that the feature map size shrinks quickly with few parameters, avoiding the weak feature-embedding capability and long processing time caused by slow downsampling under limited computing power.
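A sketch of this stream module is shown below; the backbone channel width of 512 is an assumption, since the text only fixes the 7 × 7 input feature map and the 7-dimensional output:

    import torch
    import torch.nn as nn

    class StreamHead(nn.Module):
        """Sketch of the 'stream module': a depthwise conv (stride > 1) downsamples
        the 7x7 backbone feature map and a 1x1 conv maps it to the 7 emotion logits,
        avoiding a large fully connected layer. The channel count is an assumption."""
        def __init__(self, channels=512, num_classes=7):
            super().__init__()
            # depthwise conv: groups == in_channels, stride 7 collapses 7x7 -> 1x1
            self.dw = nn.Conv2d(channels, channels, kernel_size=7, stride=7, groups=channels)
            self.pw = nn.Conv2d(channels, num_classes, kernel_size=1)

        def forward(self, x):              # x: (N, channels, 7, 7)
            x = self.pw(self.dw(x))        # -> (N, 7, 1, 1)
            return torch.flatten(x, 1)     # -> (N, 7) one-dimensional feature vector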
S16, outputting the emotion category corresponding to the maximum probability value as result a, which is input into the multi-modal emotion analysis module;
S2, speech emotion recognition and judgment:
S21, collecting voice E through a voice acquisition module;
S22, inputting the voice E into a voice emotion judgment model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio (a feature-extraction sketch follows step S23), and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S23, outputting the emotion category corresponding to the maximum probability value as result e to the multi-modal emotion analysis module;
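The feature extraction named in step S22 can be sketched as follows; librosa is an assumption (the patent does not specify a toolkit), and mean-pooling over time is an illustrative choice:

    import numpy as np
    import librosa

    def speech_emotion_features(wav_path, n_mfcc=13):
        """Extract the features named in S22 and return one pooled feature vector."""
        y, sr = librosa.load(wav_path, sr=None)
        zcr      = librosa.feature.zero_crossing_rate(y)            # zero-crossing rate
        rms      = librosa.feature.rms(y=y)                         # amplitude (RMS energy)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # spectral centroid
        mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # Mel-frequency cepstral coefficients
        # mean-pool each feature over time and concatenate; the pooling choice is an assumption
        return np.concatenate([f.mean(axis=1) for f in (zcr, rms, centroid, mfcc)])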
S3, text emotion recognition and judgment:
S31, collecting the voice E through the voice acquisition module and converting it into text F;
S32, inputting the text F into a text emotion recognition model and scoring the text F for emotion (a sketch follows step S33), outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S33, outputting the emotion category corresponding to the maximum probability value as result f to the multi-modal emotion analysis module;
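A sketch of steps S32 and S33 follows; tokenizer and text_model are hypothetical stand-ins for whatever seven-class text emotion classifier is deployed, since the patent does not identify a specific architecture:

    import torch
    import torch.nn.functional as F

    EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprise", "neutral"]

    def text_emotion_result(text, tokenizer, text_model):
        # tokenizer/text_model are hypothetical placeholders for the trained text emotion recognition model
        inputs = tokenizer(text)                      # e.g. a tensor of token ids
        logits = text_model(inputs)                   # expected shape (1, 7)
        probs = F.softmax(logits, dim=-1).squeeze(0)  # probability values of the seven emotion categories
        return EMOTIONS[int(torch.argmax(probs))], probs   # result f: top category plus probabilities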
S4, in the multi-modal emotion analysis module, combining the results a, e and f according to the emotion combination cases, taking the average probability value of the combination as the final emotion judgment result g, and outputting g to the AI digital human;
Specifically, the emotion combination cases include the following (a fusion sketch follows these cases):
Emotion combination case one: combined output of results e and f: if the AI digital human does not include a camera module, image information cannot be obtained in real time, so the emotional state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is used as the final emotion judgment result g1;
Emotion combination case two: combined output of results a and e: if the AI digital human does not include a text emotion recognition model, a text emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is used as the final emotion judgment result g2;
Emotion combination case three: combined output of results a and f: if the AI digital human does not include a voice emotion judgment model, a voice emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is used as the final emotion judgment result g3;
Emotion combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is used as the final emotion judgment result g4;
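The combination step can be sketched as follows: whichever of the three seven-class probability vectors are available are averaged, and the category with the highest averaged probability is returned as the final result g (NumPy is used here only for illustration):

    import numpy as np

    EMOTIONS = ["angry", "disgusted", "fearful", "happy", "sad", "surprise", "neutral"]

    def fuse_emotions(prob_a=None, prob_e=None, prob_f=None):
        """Average the available 7-class probability vectors (face a, voice e, text f)
        and return the emotion with the highest averaged probability as result g."""
        available = [np.asarray(p) for p in (prob_a, prob_e, prob_f) if p is not None]
        if not available:
            raise ValueError("at least one modality result is required")
        mean_probs = np.mean(available, axis=0)     # average probability value
        return EMOTIONS[int(np.argmax(mean_probs))], mean_probs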
The method is suitable not only for chat robots in financial scenarios, but can also be used as a chat robot in other vertical fields such as medicine, education and services.
Referring to Fig. 6, on the premise that the IOU of the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated by precision and recall:
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
where TP = GT ∩ PRED: correct predictions (true positives), i.e. samples that are actually positive and are predicted as positive by the model;
FP = PRED - (GT ∩ PRED): incorrect predictions (false positives), i.e. samples predicted as positive by the model but actually negative;
FN = GT - (GT ∩ PRED): missed predictions (false negatives), i.e. samples that are actually positive but predicted as negative by the model;
specifically, GT denotes the ground-truth annotation of the picture input to the RetinaFace face detection and recognition model, PRED denotes the prediction output by the RetinaFace face detection and recognition model, and IOU denotes the ratio of the intersection to the union of the prediction result and the ground truth for a picture of the widerface dataset input to the RetinaFace face detection and recognition model;
Precision indicates how accurate the positive predictions are, while recall evaluates the coverage of the actual positive samples: the higher the recall, the larger the proportion of actual positive samples that are detected. By analysing the missed and false detections together, it can be determined whether the RetinaFace face detection and recognition model generalizes insufficiently in some special scenarios, and corresponding optimizations, such as image enhancement, can be applied.
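The evaluation described above can be sketched as follows; the greedy one-to-one matching of predictions to ground-truth boxes is an assumed convention, since the patent only fixes the IOU > 0.5 criterion and the precision/recall formulas:

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    def precision_recall(gt_boxes, pred_boxes, iou_thresh=0.5):
        """A prediction counts as TP if it overlaps an unmatched GT box with IoU > 0.5."""
        matched, tp = set(), 0
        for pb in pred_boxes:
            best_j, best_iou = -1, 0.0
            for j, gb in enumerate(gt_boxes):
                if j in matched:
                    continue
                overlap = iou(pb, gb)
                if overlap > best_iou:
                    best_j, best_iou = j, overlap
            if best_iou > iou_thresh:
                tp += 1
                matched.add(best_j)
        fp = len(pred_boxes) - tp                       # false positives
        fn = len(gt_boxes) - tp                         # false negatives (missed faces)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall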
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and inventive concept of the present invention, shall fall within the protection scope of the present invention.

Claims (5)

1. A multi-modal AI digital human emotion analysis method, characterized by comprising the following steps: S1, facial expression recognition and emotion judgment:
S11, acquiring an image through a camera module as the original image A to be detected;
S12, resizing the original image A to be detected to (640, 640, 3) to obtain an image B;
S13, inputting the image B into a trained RetinaFace face detection and recognition model, and outputting a face detection frame C;
S14, cropping the target face region from the face detection frame C and resizing it to a 224 × 224 face image D;
S15, inputting the face image D into a trained facial expression recognition model, classifying the face image D with a convolutional neural network, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S16, outputting the emotion category corresponding to the maximum probability value as result a, which is input into the multi-modal emotion analysis module;
S2, speech emotion recognition and judgment:
S21, collecting voice E through a voice acquisition module;
S22, inputting the voice E into a voice emotion judgment model, extracting the zero-crossing rate, amplitude, spectral centroid and Mel-frequency cepstral coefficients of the audio, and obtaining probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S23, outputting the emotion category corresponding to the maximum probability value as result e to the multi-modal emotion analysis module;
S3, text emotion recognition and judgment:
S31, collecting the voice E through the voice acquisition module and converting it into text F;
S32, inputting the text F into a text emotion recognition model, scoring the text F for emotion, and outputting probability values for the seven emotion categories "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral";
S33, outputting the emotion category corresponding to the maximum probability value as result f to the multi-modal emotion analysis module;
S4, in the multi-modal emotion analysis module, combining the results a, e and f according to the emotion combination cases, taking the average probability value of the combination as the final emotion judgment result g, and outputting g to the AI digital human;
the RetinaFace face detection and recognition model comprises 5 pyramid feature maps, an SSH context module and a deformable convolution network DCN module, and in the RetinaFace face detection and recognition model the global loss function L is:
L = L_cls(p_i, p_i*) + λ1·p_i*·L_box(t_i, t_i*) + λ2·p_i*·L_pts(l_i, l_i*) + λ3·p_i*·L_pixel
where L_cls is the softmax loss of the face/non-face binary classifier, L_box is the smooth-L1 box regression loss, L_pts is the facial key-point regression loss, L_pixel is the dense regression loss, p_i is the predicted probability that the i-th anchor is a face, p_i* is the ground-truth label of the i-th anchor (1 for a face, 0 for a non-face), λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, t_i = {t_x, t_y, t_w, t_h} are the predicted box coordinates, t_i* = {t_x*, t_y*, t_w*, t_h*} are the ground-truth box coordinates, l_i = {l_x1, l_y1, l_x2, l_y2, l_x3, l_y3, l_x4, l_y4, l_x5, l_y5} are the predicted facial key-point coordinates, and l_i* = {l_x1*, l_y1*, l_x2*, l_y2*, l_x3*, l_y3*, l_x4*, l_y4*, l_x5*, l_y5*} are the ground-truth facial key-point coordinates;
the facial expression recognition model comprises a CNN backbone network and a stream module; the CNN backbone network comprises an inverted residual block BlockA and a downsampling module BlockB, the left auxiliary branch of the downsampling module BlockB uses AvgPool, and the stream module downsamples with a depthwise convolution (DWConv) layer with stride greater than 1 and outputs a one-dimensional feature vector.
2. The multi-modal AI digital human emotion analysis method of claim 1, wherein the training method of the RetinaFace face detection and recognition model is as follows: the widerface dataset, which contains at least 32203 pictures, is selected and divided into a training set, a validation set and a test set in a ratio of 4:1:5; the training set is used to train the RetinaFace face detection and recognition model, and during training the pictures are augmented by brightness change, saturation adjustment, hue adjustment, random cropping, mirror flipping and size transformation.
3. The multi-modal AI digital human emotion analysis method of claim 1, wherein the training method of the facial expression recognition model is as follows: a dataset of the seven categories of face images "angry", "disgusted", "fearful", "happy", "sad", "surprise" and "neutral" is selected, containing at least 35887 pictures with at least 3000 pictures per category; it is divided into a training set, a validation set and a test set in a ratio of 8:1:1 and used to train the facial expression recognition model.
4. The multi-modal AI digital human emotion analysis method of claim 1, wherein the emotion combination cases include the following:
Emotion combination case one: combined output of results e and f: if the AI digital human does not include a camera module, image information cannot be obtained in real time, so the emotional state is judged from the voice and text features; the average probability value ef of results e and f is taken, and the emotion category corresponding to ef is used as the final emotion judgment result g1;
Emotion combination case two: combined output of results a and e: if the AI digital human does not include a text emotion recognition model, a text emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and voice features; the average probability value ae of results a and e is taken, and the emotion category corresponding to ae is used as the final emotion judgment result g2;
Emotion combination case three: combined output of results a and f: if the AI digital human does not include a voice emotion judgment model, a voice emotion judgment result cannot be output in real time, so the emotional state is judged from the face image and text features; the average probability value af of results a and f is taken, and the emotion category corresponding to af is used as the final emotion judgment result g3;
Emotion combination case four: combined output of results a, e and f: the average probability value aef of results a, e and f is taken, and the emotion category corresponding to aef is used as the final emotion judgment result g4.
5. The multi-modal AI digital human emotion analysis method of claim 1, wherein, on the premise that the IOU of the face detection frame C and the ground truth GT is greater than 0.5, the face detection and recognition model is evaluated by precision and recall:
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
where TP = GT ∩ PRED: correct predictions (true positives), i.e. samples that are actually positive and are predicted as positive by the model;
FP = PRED - (GT ∩ PRED): incorrect predictions (false positives), i.e. samples predicted as positive by the model but actually negative;
FN = GT - (GT ∩ PRED): missed predictions (false negatives), i.e. samples that are actually positive but predicted as negative by the model;
specifically, GT denotes the ground-truth annotation of the picture input to the RetinaFace face detection and recognition model, PRED denotes the prediction output by the RetinaFace face detection and recognition model, and IOU denotes the ratio of the intersection to the union of the prediction result and the ground truth for a picture of the widerface dataset input to the RetinaFace face detection and recognition model.
CN202210394800.0A 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes Active CN114724222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394800.0A CN114724222B (en) 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210394800.0A CN114724222B (en) 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes

Publications (2)

Publication Number Publication Date
CN114724222A CN114724222A (en) 2022-07-08
CN114724222B true CN114724222B (en) 2024-04-19

Family

ID=82244023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394800.0A Active CN114724222B (en) 2022-04-14 2022-04-14 AI digital human emotion analysis method based on multiple modes

Country Status (1)

Country Link
CN (1) CN114724222B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641837A (en) * 2022-12-22 2023-01-24 北京资采信息技术有限公司 Intelligent robot conversation intention recognition method and system
CN116520980A (en) * 2023-04-03 2023-08-01 湖北大学 Interaction method, system and terminal for emotion analysis of intelligent shopping guide robot in mall
CN117234369B (en) * 2023-08-21 2024-06-21 华院计算技术(上海)股份有限公司 Digital human interaction method and system, computer readable storage medium and digital human equipment
CN117576279B (en) * 2023-11-28 2024-04-19 世优(北京)科技有限公司 Digital person driving method and system based on multi-mode data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609572A (en) * 2017-08-15 2018-01-19 中国科学院自动化研究所 Multi-modal emotion identification method, system based on neutral net and transfer learning
CN109558935A (en) * 2018-11-28 2019-04-02 黄欢 Emotion recognition and exchange method and system based on deep learning
KR20190119863A (en) * 2018-04-13 2019-10-23 인하대학교 산학협력단 Video-based human emotion recognition using semi-supervised learning and multimodal networks
CN111564164A (en) * 2020-04-01 2020-08-21 中国电力科学研究院有限公司 Multi-mode emotion recognition method and device
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112766173A (en) * 2021-01-21 2021-05-07 福建天泉教育科技有限公司 Multi-mode emotion analysis method and system based on AI deep learning
CN113158828A (en) * 2021-03-30 2021-07-23 华南理工大学 Facial emotion calibration method and system based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815785A (en) * 2018-12-05 2019-05-28 四川大学 A kind of face Emotion identification method based on double-current convolutional neural networks


Also Published As

Publication number Publication date
CN114724222A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN114724222B (en) AI digital human emotion analysis method based on multiple modes
CN111523462B (en) Video sequence expression recognition system and method based on self-attention enhanced CNN
CN106599800A (en) Face micro-expression recognition method based on deep learning
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN109255289B (en) Cross-aging face recognition method based on unified generation model
CN109886153B (en) Real-time face detection method based on deep convolutional neural network
CN111582397A (en) CNN-RNN image emotion analysis method based on attention mechanism
CN112183334B (en) Video depth relation analysis method based on multi-mode feature fusion
CN112906485A (en) Visual impairment person auxiliary obstacle perception method based on improved YOLO model
CN112183468A (en) Pedestrian re-identification method based on multi-attention combined multi-level features
CN108416314B (en) Picture important face detection method
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN110738160A (en) human face quality evaluation method combining with human face detection
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN111666845A (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN117149944B (en) Multi-mode situation emotion recognition method and system based on wide time range
CN110222636A (en) The pedestrian's attribute recognition approach inhibited based on background
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN111652307A (en) Intelligent nondestructive identification method and device for redwood furniture based on convolutional neural network
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN114662605A (en) Flame detection method based on improved YOLOv5 model
KR20210011707A (en) A CNN-based Scene classifier with attention model for scene recognition in video

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
    Address after: No. 2-206, No. 1399 Liangmu Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100; Applicant after: Kangxu Technology Co.,Ltd.; Country or region after: China
    Address before: 310000 2-206, 1399 liangmu Road, Cangqian street, Yuhang District, Hangzhou City, Zhejiang Province; Applicant before: Zhejiang kangxu Technology Co.,Ltd.; Country or region before: China
GR01 Patent grant