CN115376559A - Emotion recognition method, device and equipment based on audio and video - Google Patents

Emotion recognition method, device and equipment based on audio and video

Info

Publication number
CN115376559A
Authority
CN
China
Prior art keywords
emotion recognition
video
audio
voice
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211009366.6A
Other languages
Chinese (zh)
Inventor
颜谨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211009366.6A
Publication of CN115376559A
Legal status: Pending

Classifications

    • G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
    • G06V20/40 — Scenes; scene-specific elements in video content
    • G06V40/168 — Recognition of human faces: feature extraction; face representation
    • G06V40/174 — Recognition of human faces: facial expression recognition
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an emotion recognition method based on audio and video, which is applied to the field of artificial intelligence or other fields, and comprises: collecting audio and video data; preprocessing audio and video data to obtain voice data and video data; inputting voice data into a voice emotion recognition model to obtain a first probability distribution, wherein the first probability distribution is used for expressing a voice emotion recognition result obtained by the voice emotion recognition model; inputting the video data into a video emotion recognition model to obtain a second probability distribution, wherein the second probability distribution is used for expressing a video emotion recognition result obtained by the video emotion recognition model; and performing fusion judgment according to the voice emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition and determine emotion classification. The present disclosure also provides an emotion recognition system based on audio and video, an electronic device, a storage medium, and a program product.

Description

Emotion recognition method, device and equipment based on audio and video
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, a medium, and a program product for emotion recognition based on audio and video.
Background
In large call centers, especially in banking collection scenarios, telephone operators can easily come into verbal conflict with customers. At present, supervision relies mostly on manual spot checks of recordings after the fact, which suffers from high labor cost, poor timeliness, and incomplete coverage of the inspection scope. How to efficiently identify changes in an operator's emotion as they occur, and to take intervention or de-escalation measures in time so as to improve on-site management efficiency and preserve the bank's customer-service image, is therefore a problem to be solved urgently.
Disclosure of Invention
In view of the above, the present disclosure provides an emotion recognition method, apparatus, device, medium, and program product based on audio and video.
According to a first aspect of the present disclosure, there is provided a method for emotion recognition based on audio and video, comprising: collecting audio and video data; preprocessing the audio and video data to obtain voice data and video data; inputting the voice data into a voice emotion recognition model to obtain a first probability distribution, wherein the first probability distribution is used for expressing a voice emotion recognition result obtained by the voice emotion recognition model; inputting the video data into a video emotion recognition model to obtain a second probability distribution, wherein the second probability distribution is used for expressing a video emotion recognition result obtained by the video emotion recognition model; and performing fusion judgment according to the voice emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition and determine the emotion classification.
According to an embodiment of the disclosure, inputting the speech data into the speech emotion recognition model, obtaining the first probability distribution includes: preprocessing the voice data and extracting characteristic parameters of the voice data; identifying the characteristic parameters of the voice data by using a hidden Markov model to obtain a characteristic vector of the voice data; and classifying the feature vectors of the voice data by utilizing a pre-established artificial neural network to obtain a first probability distribution of voice emotion recognition.
According to an embodiment of the present disclosure, classifying the feature vectors of the voice data using a pre-established artificial neural network includes: normalizing the feature vectors of the voice data to obtain a feature matrix to be recognized; taking the feature matrix to be recognized as the input of the artificial neural network; and calculating the matching probability between the feature matrix to be recognized and each element of the standard feature matrix corresponding to the sample voice emotions, so as to obtain the first probability distribution of voice emotion recognition.
According to an embodiment of the present disclosure, the characteristic parameters of the voice data include a pitch frequency, a short-time energy, and an amplitude.
According to an embodiment of the present disclosure, inputting the video data into the video emotion recognition model, and obtaining the second probability distribution includes: preprocessing the video data and extracting facial expression images of the video data; performing feature extraction on the facial expression image by using a local binary fitting algorithm to obtain a feature vector of the video data; and classifying the feature vectors of the video data by using a random forest algorithm to obtain a second probability distribution of the video emotion recognition.
According to the embodiment of the disclosure, the feature extraction of the facial expression image by using the local binary fitting algorithm to obtain the feature vector of the video data comprises the following steps: carrying out face detection on the facial expression image to obtain a face partial image; extracting key points of the human face by using a local binary fitting algorithm according to the partial image of the human face; and constructing a feature vector of the video data according to the face key points.
According to the embodiment of the disclosure, fusion judgment is carried out according to the voice emotion recognition result and the video emotion recognition result to obtain the comprehensive score of emotion recognition, and determining emotion classification comprises the following steps: and calculating a comprehensive score of emotion recognition by using an argmax function according to the weight parameter, the first probability distribution and the second probability distribution and determining emotion classification.
According to the embodiment of the present disclosure, the preprocessing of the audio and video data to obtain the voice data and the video data includes: carrying out voice detection on the audio and video data to obtain voice data; and carrying out video extraction on the audio and video data to obtain video data.
According to the embodiment of the present disclosure, after determining the emotion classification, the method further includes: matching the emotion classification with a preset abnormal emotion sample set; and if the matching is successful, performing intervention processing on the abnormal emotion.
A second aspect of the present disclosure provides an emotion recognition apparatus based on audio and video, including: the acquisition module is used for acquiring audio and video data; the processing module is used for preprocessing the audio and video data to obtain voice data and video data; the voice emotion recognition module is used for inputting the voice data into the voice emotion recognition model to obtain first probability distribution, and the first probability distribution is used for representing a voice emotion recognition result obtained by the voice emotion recognition model; the video emotion recognition module is used for inputting the video data into the video emotion recognition model to obtain a second probability distribution, and the second probability distribution is used for representing a video emotion recognition result obtained by the video emotion recognition model; and the fusion judgment module is used for performing fusion judgment according to the voice emotion recognition result and the video emotion recognition result to obtain comprehensive scores of emotion recognition and determine emotion classification.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above-described audio-visual based emotion recognition method.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-mentioned audio-video based emotion recognition method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described audio-video based emotion recognition method.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
fig. 1 schematically shows an application scenario diagram of a method for emotion recognition based on audio-video according to an embodiment of the present disclosure;
fig. 2 schematically shows a flow diagram of a method of audio-visual based emotion recognition according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method of inputting speech data into a speech emotion recognition model resulting in a first probability distribution, in accordance with an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a method of classifying feature vectors of speech data using a pre-established artificial neural network, in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of inputting video data into a video emotion recognition model resulting in a second probability distribution, in accordance with an embodiment of the disclosure;
FIG. 6 schematically illustrates a flowchart of a method for feature extraction of a facial expression image using a local binary fitting algorithm to obtain feature vectors of video data, according to an embodiment of the present disclosure;
fig. 7 schematically illustrates a flow chart of a method for preprocessing audio and video data to obtain voice data and video data according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow diagram of a method of intervention processing for abnormal emotions, in accordance with an embodiment of the present disclosure;
FIG. 9 schematically shows a distribution diagram of a two-dimensional vector-based emotion expression model according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram illustrating the structure of an HMM and ANN based hybrid recognition model according to an embodiment of the present disclosure;
FIG. 11 schematically illustrates a flow diagram of a method for audio-video based emotion recognition and employee management in accordance with an embodiment of the present disclosure;
fig. 12 schematically illustrates a block diagram of an audio-video based emotion recognition apparatus according to an embodiment of the present disclosure;
fig. 13 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
In the technical scheme of the present disclosure, the acquisition, storage, application, and other processing of the personal information of the users involved all comply with the provisions of the relevant laws and regulations, necessary confidentiality measures are taken, and public order and good customs are not violated.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction should be interpreted in the sense one having ordinary skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include, but not be limited to, systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.
To address the problems that, in existing call centers, changes in an operator's emotion are difficult to recognize in time and intervention or de-escalation measures are difficult to take promptly, the embodiments of the present disclosure provide an emotion recognition method, apparatus, device, medium, and program product based on audio and video, applied to the financial field. They can recognize the operator's emotion in real time and effectively warn of emotional fluctuations, reducing the management cost of the call center internally and improving the quality of customer service externally.
Fig. 1 schematically shows a diagram of an application scenario to which the audio-video based emotion recognition method according to an embodiment of the present disclosure may be applied. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a web browser application, a search-type application, an instant messaging tool, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the emotion recognition method based on audio and video provided by the embodiment of the present disclosure may be generally executed by the terminal devices 101, 102, 103 and the server 105. Accordingly, the emotion recognition system based on audio and video provided by the embodiment of the present disclosure can be generally disposed in the terminal devices 101, 102, 103 and the server 105. The emotion recognition method based on audio and video provided by the embodiment of the present disclosure may also be executed by a server or a server cluster which is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the audio-video based emotion recognition system provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The following describes in detail a method for emotion recognition based on audio and video according to the disclosed embodiment with reference to fig. 2 to 13 based on the scenario described in fig. 1.
Fig. 2 schematically shows a flow chart of a method of audio-visual based emotion recognition according to an embodiment of the present disclosure.
As shown in fig. 2, the audio-visual based emotion recognition method 200 may include operations S210 to S250.
In operation S210, audio and video data is collected.
Audio and video acquisition devices are used to collect the audio and video data. The audio acquisition device is fixed within clear sound-pickup range of the scene to be captured, which helps it record clear sound and ensures that the collected sound data is of high quality. The video acquisition device is fixed within a range where it can capture a clear image of the face, which ensures that the collected facial image data is of high quality. The audio acquisition device and the video acquisition device may be the same device.
In the embodiment of the disclosure, before the audio and video data of the user is obtained, the consent or authorization of the user can be obtained. For example, a request for obtaining user audio-video data may be issued to the user before operation S210. In a case that the user agrees or authorizes that the user audio and video data can be obtained, the operation S210 is performed.
In operation S220, the audio and video data is preprocessed to obtain voice data and video data.
The synchronously acquired audio and video data are segmented to obtain voice data and video data respectively, and the voice data and the video data are then processed separately in the next steps.
In operation S230, voice data is input into the speech emotion recognition model, and a first probability distribution is obtained, where the first probability distribution is used to represent speech emotion recognition results obtained by the speech emotion recognition model.
In operation S240, the video data is input into the video emotion recognition model to obtain a second probability distribution, where the second probability distribution is used to represent a video emotion recognition result obtained by the video emotion recognition model.
Before the collected data are input into the emotion recognition models, the method further comprises collecting voice emotion samples and facial expression emotion samples from an emotion voice database and a facial expression recognition picture library. The voice samples and the video samples are then used to train the emotion recognition models with artificial intelligence algorithms and classifiers, where the emotion recognition models comprise a model for emotion recognition on voice and a model for emotion recognition on video, and the two models output the voice emotion recognition result and the video emotion recognition result respectively. This training enables the method to identify human emotion more accurately and in closer to real time.
In operation S250, fusion judgment is performed according to the speech emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition, and emotion classification is determined.
Finally, fusion judgment is performed on the voice emotion recognition result and the video emotion recognition result to determine the emotion classification. By combining the two main approaches to human emotion recognition (speech emotion recognition and facial image emotion recognition) in a single comprehensive system, the method improves both the timeliness and the accuracy of emotion recognition.
Fig. 3 schematically shows a flowchart of a method of inputting speech data into a speech emotion recognition model, resulting in a first probability distribution, according to an embodiment of the disclosure.
As shown in fig. 3, the method of inputting voice data into the speech emotion recognition model to obtain the first probability distribution may include operations S231 to S233.
In operation S231, the voice data is preprocessed and feature parameters of the voice data are extracted.
In operation S232, the hidden markov model is used to identify the feature parameters of the speech data, so as to obtain a feature vector of the speech data.
In operation S233, feature vectors of the voice data are classified using a pre-established artificial neural network, resulting in a first probability distribution of speech emotion recognition.
A Hidden Markov Model (HMM) is a statistical model that describes transitions from one state to another with hidden, unknown parameters: the states cannot be observed directly, but are inferred from a sequence of observation vectors, each of which is generated by a state according to a certain probability density distribution, so that each state is represented by a certain probability distribution. When an HMM is used to recognize speech, the formation of the speech signal is described by the different states of a Markov chain, the corresponding output probabilities under the different states are stored, the model parameters are obtained through iterative computation, and the algorithm yields the conditional probabilities of the different models; the speech corresponding to the maximum conditional probability is the recognition result. An Artificial Neural Network (ANN) has great advantages in parallel processing and classification owing to its extremely strong input-output mapping capability, although in large-vocabulary continuous speech recognition the recognition rate drops greatly in noisy environments. Therefore, in view of the respective strengths and weaknesses of the hidden Markov model and the ANN, a hybrid model is formed by combining the dynamic time-sequence modeling capability of the HMM with the classification and decision capability of the ANN: the output of the HMM becomes the input of the ANN, which classifies and recognizes the speech emotion, improving the accuracy of speech emotion recognition.
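As an illustration of the hybrid scheme described above, the following minimal sketch trains one HMM per emotion and feeds the per-utterance HMM log-likelihood scores into a small neural network; it assumes the hmmlearn and scikit-learn libraries, and the emotion labels, feature dimensions, and placeholder training data are illustrative assumptions rather than details of the patent.

```python
# Minimal sketch of an HMM + ANN hybrid classifier (illustrative only).
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["angry", "happy", "sad", "calm"]  # hypothetical label set

def hmm_feature_vector(hmms, frames):
    """Score one utterance (n_frames x n_features) against each
    per-emotion HMM and return the vector of log-likelihoods."""
    return np.array([hmms[e].score(frames) for e in EMOTIONS])

rng = np.random.default_rng(0)

# 1) Train one HMM per emotion on that emotion's training utterances.
hmms = {}
for e in EMOTIONS:
    train_frames = rng.normal(size=(200, 3))   # placeholder acoustic frames
    hmms[e] = GaussianHMM(n_components=4, covariance_type="diag",
                          n_iter=20).fit(train_frames)

# 2) The HMM scores become the input of the ANN classifier.
X_train = np.stack([hmm_feature_vector(hmms, rng.normal(size=(50, 3)))
                    for _ in range(40)])
y_train = rng.integers(0, len(EMOTIONS), size=40)  # placeholder labels
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X_train, y_train)

# 3) For a new utterance, the ANN output is the first probability distribution.
probe = hmm_feature_vector(hmms, rng.normal(size=(60, 3))).reshape(1, -1)
first_probability_distribution = ann.predict_proba(probe)[0]
```

In practice the placeholder frames would be replaced by acoustic features such as the pitch, short-time energy, and amplitude discussed below.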
FIG. 4 schematically illustrates a flow chart of a method of classifying feature vectors of speech data using a pre-established artificial neural network, according to an embodiment of the present disclosure.
As shown in fig. 4, the method of classifying feature vectors of voice data using a pre-established artificial neural network may include operations S2331-S2333.
In operation S2331, the feature vectors of the voice data are normalized to obtain a feature matrix to be recognized.
In operation S2332, the feature matrix to be recognized is used as an input of the artificial neural network.
In operation S2333, a matching probability of each element in the feature matrix to be recognized and the standard feature matrix corresponding to the sample speech emotion, a first probability distribution of speech emotion recognition is calculated.
In order that the feature vectors of the voice data output by the hidden Markov model interface well with the artificial neural network, the feature vectors are first normalized to obtain a normalized feature matrix to be recognized, which is fed to the input layer of the artificial neural network. The feature matrix to be recognized contains a plurality of components (target speech signals of a plurality of frames), and the standard feature matrix corresponding to each sample speech emotion in the trained artificial neural network model contains a plurality of elements. The matching probability between the feature matrix to be recognized and each element of the standard feature matrix is calculated to obtain the first probability distribution of speech emotion recognition.
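The normalization and matching step could look roughly like the following sketch, assuming simple per-column min-max scaling and a softmax over negative mean-squared distances to per-emotion "standard" matrices of the same shape; the patent does not specify these particular formulas.

```python
# Illustrative normalization and matching-probability sketch (assumed formulas).
import numpy as np

def normalize(features):
    """Min-max normalize a (n_frames x n_features) matrix column-wise."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / np.maximum(hi - lo, 1e-8)

def matching_probabilities(feature_matrix, standard_matrices):
    """Score the matrix to be recognized against each emotion's standard
    matrix of the same shape (negative mean squared distance here)
    and softmax the scores into a probability distribution."""
    scores = np.array([-np.mean((feature_matrix - s) ** 2)
                       for s in standard_matrices])
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```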
Combining the hidden Markov model with the artificial neural network to recognize the speech signal avoids the low recognition accuracy that results from the limitations of a single HMM (weak classification capability and poor pattern recognition performance) or of a neural network alone (poor ability to express dynamic characteristics).
In an embodiment of the present disclosure, the characteristic parameters of the speech data include a pitch frequency, a short-time energy, and an amplitude.
The original speech signal contains many kinds of information; features that more readily reflect the emotional state are selected first, and three features are chosen here: pitch frequency, short-time energy, and amplitude. The pitch frequency is the fundamental frequency of vocal-cord vibration, and the pattern of change in pitch frequency forms the intonation, which carries a great deal of information about speech emotion arousal. Short-time energy corresponds to loudness; higher short-time energy typically appears in excited emotional states such as anger or surprise. Amplitude reflects the fluctuation of short-time energy: the amplitude of the speech signal is smaller when the speaker is sad or calm, and larger when angry or startled.
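A hedged sketch of extracting the three named features with librosa and numpy is given below; the frame and hop sizes and the pitch search range are assumptions made for illustration.

```python
# Illustrative extraction of pitch frequency, short-time energy, and amplitude.
import numpy as np
import librosa

def speech_features(path, frame_length=1024, hop_length=256):
    y, sr = librosa.load(path, sr=None)
    # Pitch (fundamental frequency) per frame.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr,
                            frame_length=frame_length, hop_length=hop_length)
    frames = librosa.util.frame(y, frame_length=frame_length,
                                hop_length=hop_length)
    short_time_energy = np.sum(frames ** 2, axis=0)   # per-frame energy
    amplitude = np.max(np.abs(frames), axis=0)        # per-frame peak amplitude
    return f0, short_time_energy, amplitude
```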
Fig. 5 schematically shows a flowchart of a method for inputting video data into a video emotion recognition model to obtain a second probability distribution according to an embodiment of the present disclosure.
As shown in fig. 5, the method of inputting video data into the video emotion recognition model to obtain the second probability distribution may include operations S241 to S243.
In operation S241, the video data is preprocessed and a facial expression image of the video data is extracted.
In operation S242, feature extraction is performed on the facial expression image by using a local binary fitting algorithm to obtain a feature vector of the video data.
In operation S243, the feature vectors of the video data are classified by using a random forest algorithm to obtain a second probability distribution of video emotion recognition.
Preprocessing the video data means performing graying and histogram equalization on the whole video image, so as to eliminate the influence of illumination and noise on face detection and face key-point detection and to improve image quality; the facial expression images of the video data are then extracted. The facial expression image is processed with Local Binary Fitting (LBF), the resulting feature values are taken as the feature vector of the emotion image, and classification and recognition are performed with the Random Forest (RF) algorithm to obtain the second probability distribution of video emotion recognition.
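A minimal sketch of this video branch, assuming OpenCV and scikit-learn, is shown below: grayscale conversion plus histogram equalization for preprocessing, and a random forest whose per-frame probability outputs are averaged into the second probability distribution. The training data here are placeholders, and the keypoint-based feature extraction is sketched separately further below.

```python
# Illustrative video preprocessing and random-forest classification sketch.
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def preprocess_frame(frame_bgr):
    """Grayscale + histogram equalization to reduce illumination noise."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.equalizeHist(gray)

rng = np.random.default_rng(0)

# Train the RF on labeled expression feature vectors (placeholders here).
X_train = rng.normal(size=(100, 136))   # e.g. 68 keypoints x (x, y)
y_train = rng.integers(0, 4, size=100)  # hypothetical emotion labels
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# For one video, average the per-frame distributions to obtain the
# second probability distribution.
frame_features = rng.normal(size=(30, 136))           # placeholder features
second_probability_distribution = rf.predict_proba(frame_features).mean(axis=0)
```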
Fig. 6 schematically shows a flowchart of a method for extracting features of a facial expression image by using a local binary fitting algorithm to obtain feature vectors of video data according to an embodiment of the present disclosure.
As shown in fig. 6, the method for extracting features of the facial expression image by using the LBF algorithm to obtain the feature vector of the video data may include operations S2421 to S2423.
In operation S2421, face detection is performed on the facial expression image to obtain a face partial image.
In operation S2422, face key points are extracted using a local binary fitting algorithm from the face partial image.
In operation S2423, a feature vector of the video data is constructed from the face key points.
Face detection is performed on the preprocessed video image by the Adaboost algorithm or another algorithm to locate the position of the face. Face key points are then obtained through training and detection with the LBF algorithm; the key points cover the cheeks, eyes, nose, and lips. When the facial expression changes, the positions of these feature points change correspondingly with the facial organs, and the relative positions of the feature points are used to express the expression in the image.
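One possible realization of this keypoint step, assuming opencv-contrib-python (whose Facemark model is trained with local binary features) and a locally available lbfmodel.yaml file, is sketched below; the detector, model file, and 68-point layout are assumptions for illustration.

```python
# Illustrative face detection + keypoint feature vector (opencv-contrib assumed).
import cv2
import numpy as np

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
facemark = cv2.face.createFacemarkLBF()
facemark.loadModel("lbfmodel.yaml")   # hypothetical local path to the LBF model

def expression_feature_vector(gray_image):
    """Detect the face, locate key points (cheeks, eyes, nose, lips), and
    build a feature vector from their positions relative to the face box."""
    faces = face_detector.detectMultiScale(gray_image, 1.1, 5)
    if len(faces) == 0:
        return None
    ok, landmarks = facemark.fit(gray_image, np.array(faces[:1]))
    if not ok:
        return None
    x, y, w, h = faces[0]
    points = landmarks[0][0]                           # (68, 2) key points
    relative = (points - np.array([x, y])) / np.array([w, h])
    return relative.ravel()                            # 136-dim feature vector
```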
On the basis of the embodiment, fusion judgment is carried out according to the voice emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition, and the emotion classification determination comprises the following steps: and calculating a comprehensive score of emotion recognition by using an argmax function according to the weight parameter, the first probability distribution and the second probability distribution and determining emotion classification.
The method performs decision-level fusion judgment on the audio single-modal emotion recognition result and the video single-modal emotion recognition result. Each modality corresponds to one classifier; the classifier outputs are combined, corresponding weights are assigned according to the classifier output probabilities, and the final score is obtained using the following rule:
[Fusion score formula: image not reproduced in the text]
where R is the number of classifiers, P(X_C) is the prior probability of the emotion class C, and Y_i is the weight function of class C at the hybrid classifier i.
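Concretely, the decision-level fusion could be sketched as below, with the two per-modality probability distributions weighted and the argmax taken over the combined score; the label set and the equal weights are illustrative assumptions, not values from the patent.

```python
# Illustrative decision-level fusion of the two probability distributions.
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "calm"]   # hypothetical label set

def fuse(first_probability_distribution, second_probability_distribution,
         weights=(0.5, 0.5)):
    """Weight each classifier's distribution and take the argmax."""
    distributions = np.stack([first_probability_distribution,
                              second_probability_distribution])
    scores = np.average(distributions, axis=0, weights=weights)  # comprehensive score
    return EMOTIONS[int(np.argmax(scores))], scores
```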
Fig. 7 schematically shows a flowchart of a method for preprocessing audio and video data to obtain voice data and video data according to an embodiment of the present disclosure.
As shown in fig. 7, the method for preprocessing audio and video data to obtain voice data and video data may include operations S221 to S222.
In operation S221, voice detection is performed on the audio and video data to obtain voice data.
In operation S222, video extraction is performed on the audio and video data to obtain video data.
The audio and video data are preprocessed to obtain voice data and video data respectively, and each is then processed with an artificial intelligence method to obtain the voice emotion recognition result and the video emotion recognition result respectively.
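A minimal preprocessing sketch, assuming the ffmpeg command-line tool is available, separates the recorded file into a speech track and an image-only track; the file names are placeholders.

```python
# Illustrative audio/video separation using the ffmpeg CLI.
import subprocess

def split_audio_video(src="recording.mp4",
                      audio_out="voice.wav", video_out="video_only.mp4"):
    # Extract the speech track (-vn drops video) as PCM WAV for analysis.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn",
                    "-acodec", "pcm_s16le", audio_out], check=True)
    # Keep only the image track (-an drops audio) without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an",
                    "-c:v", "copy", video_out], check=True)
```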
Fig. 8 schematically shows a flowchart of a method of intervention handling of an abnormal emotion according to an embodiment of the present disclosure.
As shown in fig. 8, the method of intervening in the abnormal emotion may include operations S261 to S262.
In operation S261, the emotion classification is matched with a preset abnormal emotion sample set.
In operation S262, if the matching is successful, an intervention process is performed on the abnormal emotion.
In a call center or other usage scenario, after the emotion classification is determined, the resulting classification is further matched against the abnormal emotion samples to judge whether the person's emotion is abnormal. A person with abnormal emotion can then be intervened with or guided in time, preserving the customer-service image.
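A small illustrative sketch of this matching step is given below; the abnormal emotion labels and the notification hook are assumptions.

```python
# Illustrative abnormal-emotion matching and intervention prompt.
ABNORMAL_EMOTIONS = {"angry", "agitated"}   # hypothetical preset sample set

def check_and_intervene(emotion_classification, operator_id, notify=print):
    """Raise an intervention prompt when the classification is abnormal."""
    if emotion_classification in ABNORMAL_EMOTIONS:
        notify(f"Abnormal emotion '{emotion_classification}' detected "
               f"for operator {operator_id}; please intervene.")
        return True
    return False
```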
The present disclosure is further illustrated by the following detailed description. The emotion recognition method and device based on audio and video will be specifically described in the following embodiments. However, the following examples are merely illustrative of the present disclosure, and the scope of the present disclosure is not limited thereto.
This embodiment constructs a general emotion recognition and early-warning system for a large call center, covering audio and video acquisition, audio and video separation, speech/facial emotion recognition training, speech/facial emotion judgment, and early warning of operator emotion fluctuations; it reduces the management cost of the call center internally and improves customer service quality externally.
This embodiment also provides a method for performing emotion recognition and employee management based on audio and video for a large call center, which mainly comprises the following steps:
step 1: audio and video information acquisition:
step 101: deploying audio and video acquisition equipment, deploying the audio and video acquisition equipment at the telephone operator position, binding each piece of equipment with physical station information, and mapping the physical station information to a specific telephone operator;
step 102: audio and video information is collected, and audio and video information is collected in a bypass or synchronous mode when a telephone operator handles a service and is stored in a centralized object storage center;
step 2: training an emotion recognition model:
step 201: training data collection and labeling: audio and video clip information is collected, or industry-shared data is gathered, and emotions are labeled manually. Emotional states are expressed with a two-dimensional vector model based on the degrees of arousal (activation) and valence (positivity), as shown in fig. 9: the vertical "arousal" axis reflects the speaker's degree of physiological activation or readiness to take some action, active or passive, while the horizontal "valence" axis reflects the speaker's positive or negative evaluation of things. In this expression model, each emotion can be considered part of a continuum, and different emotions can be mapped to points in the two-dimensional space;
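For illustration, discrete emotion labels can be placed at points of this two-dimensional space and mapped back by nearest distance, as in the following sketch; the coordinates are assumed values for demonstration only.

```python
# Illustrative mapping of discrete emotions onto the valence/arousal plane.
import math

EMOTION_COORDINATES = {
    #            (valence, arousal), both in [-1, 1]; assumed values
    "angry":     (-0.7,  0.8),
    "happy":     ( 0.8,  0.5),
    "sad":       (-0.6, -0.5),
    "calm":      ( 0.3, -0.6),
    "surprised": ( 0.2,  0.9),
}

def nearest_emotion(valence, arousal):
    """Map a point in the continuous 2-D space back to the closest label."""
    return min(EMOTION_COORDINATES,
               key=lambda e: math.dist(EMOTION_COORDINATES[e], (valence, arousal)))
```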
step 202: training the emotion recognition model: a hybrid emotion classifier combining an HMM and an ANN is used here. The HMM judges the speech emotion state from prior statistical probabilities and has strong modeling capability for dynamic time sequences, but weak classification capability and poor pattern recognition performance. The ANN has strong classification decision-making and adaptive learning capability, but poor ability to express dynamic characteristics. The hybrid emotion classifier of this embodiment combines the dynamic time-sequence modeling capability of the HMM with the classification decision capability of the ANN to form a hybrid model that addresses the respective strengths and weaknesses of the two models, as shown in fig. 10: the output of the HMM becomes the input of the ANN, which classifies and recognizes the speech emotion. As for the emotional features, the original speech signal contains many kinds of information; features that more readily reflect the emotional state are selected first, and in this embodiment three features are chosen: pitch frequency, short-time energy, and amplitude.
Step 3: emotion recognition.
Step 301: audio and video information separation: separating the audio and video information of the operator collected in real time into independent audio stream and video stream, as shown in fig. 11;
step 302: voice signal feature extraction: voice framing to extract voiceprint characteristics, fundamental tone frequency, energy parameters, resonance peak frequency and the like;
step 303: and (3) speech state decoding: establishing an HM model aiming at each emotion state in advance, and inputting the probability distribution of the extracted feature vector matching with the basic emotion;
step 304: video image sampling: preprocessing the facial expression in the sampling frame;
step 305: extracting facial features: image-frame features are extracted using the LBF (local binary fitting) algorithm;
step 306: facial emotion judgment: the extracted features are judged with the trained RF model, and the mean emotion value over the video frames is taken as the single-modal emotion recognition result;
step 307: emotion recognition fusion judgment: the complementarity of the audio signal and the facial expression information can improve the accuracy of emotion recognition to a certain extent. Decision-level fusion judgment is performed on the audio single-modal and video single-modal emotion recognition results; each modality corresponds to one classifier, the classifier outputs are combined, corresponding weights are assigned according to the classifier output probabilities, and the final score is obtained using the following rule:
[Fusion score formula: image not reproduced in the text]
where R is the number of classifiers, P(X_C) is the prior probability of the emotion class C, and Y_i is the weight function of class C at the hybrid classifier i.
Step 4: abnormal emotion event warning
Step 401: presetting an abnormal emotion sample matching set;
step 402: event triggering: an event information prompt is generated when abnormal emotion is recognized multiple times;
step 403: event handling by on-site management personnel: on-site managers review the audio and video to confirm the validity of the event and take timely intervention measures.
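The event trigger of step 402 could, for example, be realized with a sliding window so that a single misclassified frame does not raise an alert, as in the sketch below; the threshold and the consecutive-detection rule are assumptions.

```python
# Illustrative sliding-window trigger for abnormal-emotion events.
from collections import deque

class AbnormalEmotionTrigger:
    def __init__(self, threshold=3):
        # Only the last `threshold` recognition results are kept.
        self.recent = deque(maxlen=threshold)

    def update(self, is_abnormal):
        """Feed one recognition result; return True when an event should fire."""
        self.recent.append(is_abnormal)
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```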
According to the audio-video based emotion recognition method and apparatus of the present disclosure, multi-modal recognition over audio and video effectively improves the accuracy of emotion recognition. The event-driven abnormal-emotion early-warning system can locate problems promptly, improves on-site management efficiency, and minimizes damage to the customer-service image. Beyond abnormal emotions, an abnormal-behavior recognition model for operators can further be introduced based on body movements and other video information, extending the range of abnormal events the system can monitor. The disclosed method has strong generality in on-site management scenarios.
Fig. 12 schematically shows a block diagram of an audio-video based emotion recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 12, the audio-video based emotion recognition apparatus 1200 includes an acquisition module 1210, a processing module 1220, a voice emotion recognition module 1230, a video emotion recognition module 1240, and a fusion judgment module 1250.
The acquisition module 1210 is used for acquiring audio/video data. According to an embodiment of the present disclosure, the acquisition module 1210 may be configured to perform the step S210 described above with reference to fig. 2, for example, and is not described herein again.
The processing module 1220 is configured to pre-process the audio and video data to obtain voice data and video data. According to an embodiment of the present disclosure, the processing module 1220 may be configured to perform the step S220 described above with reference to fig. 2, for example, and is not described herein again.
The speech emotion recognition module 1230 is configured to input the speech data into the speech emotion recognition model to obtain a first probability distribution, where the first probability distribution is used to represent a speech emotion recognition result obtained by the speech emotion recognition model. According to an embodiment of the present disclosure, the speech emotion recognition module 1230 may be configured to perform the step S230 described above with reference to fig. 2, for example, and is not described herein again.
And the video emotion recognition module 1240 is used for inputting the video data into the video emotion recognition model to obtain a second probability distribution, and the second probability distribution is used for representing a video emotion recognition result obtained by the video emotion recognition model. According to an embodiment of the present disclosure, the video emotion recognition module 1240 may be used to perform the step S240 described above with reference to fig. 2, for example, and is not described herein again.
The fusion judgment module 1250 is configured to perform fusion judgment according to the speech emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition and determine the emotion classification. According to an embodiment of the present disclosure, the fusion judgment module 1250 may be configured to perform the step S250 described above with reference to fig. 2, for example, and is not described herein again.
It should be noted that any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any plurality of the acquisition module 1210, the processing module 1220, the speech emotion recognition module 1230, the video emotion recognition module 1240 and the fusion judgment module 1250 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the acquisition module 1210, the processing module 1220, the speech emotion recognition module 1230, the video emotion recognition module 1240 and the fusion judgment module 1250 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementation manners of software, hardware and firmware, or an appropriate combination of any several of them. Alternatively, at least one of the acquisition module 1210, the processing module 1220, the speech emotion recognition module 1230, the video emotion recognition module 1240 and the fusion judgment module 1250 may be at least partially implemented as a computer program module, which, when executed, may perform the corresponding functions.
Fig. 13 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure. The electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 13, the electronic apparatus 1300 described in this embodiment includes: a processor 1301, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage section 1308 into a Random Access Memory (RAM) 1303. The processor 1301 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1301 may also include on-board memory for caching purposes. Processor 1301 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1303, various programs and data necessary for the operation of the system 1300 are stored. The processor 1301, ROM1302, and RAM 1303 are connected to each other by a bus 1304. The processor 1301 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM1302 and/or the RAM 1303. Note that the programs may also be stored in one or more memories other than the ROM1302 and RAM 1303. The processor 1301 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
Electronic device 1300 may also include input/output (I/O) interface 1305, which is also connected to bus 1304, according to an embodiment of the present disclosure. The system 1300 may also include one or more of the following components connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output section 1307 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a LAN card, a modem, or the like. The communication section 1309 performs communication processing via a network such as the internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1310 as necessary, so that a computer program read out therefrom is mounted into the storage portion 1308 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications component 1309 and/or installed from removable media 1311. The computer program, when executed by the processor 1301, performs the functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The above-mentioned computer-readable storage medium carries one or more programs which, when executed, implement a method for audio-video based emotion recognition according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include one or more memories other than the ROM 1302 and/or the RAM 1303 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated by the flowchart. When the computer program product runs on a computer system, the program code causes the computer system to implement the audio-video based emotion recognition method provided by the embodiments of the present disclosure.
The computer programs, when executed by the processor 1301, perform the functions defined in the systems/apparatuses of the embodiments of the present disclosure. The above described systems, devices, modules, units, etc. may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, and downloaded and installed via the communication section 1309, and/or installed from the removable medium 1311. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, and the like, or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for implementing the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Such languages include, but are not limited to, Java, C++, Python, C, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
It should be noted that each functional module in each embodiment of the present disclosure may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solutions of the present disclosure that substantially contributes to the prior art, or the whole of the technical solutions, may be embodied in the form of a software product.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or sub-combinations are not expressly recited in the present disclosure. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (13)

1. An emotion recognition method based on audio and video is characterized by comprising the following steps:
collecting audio and video data;
preprocessing the audio and video data to obtain voice data and video data;
inputting the voice data into a voice emotion recognition model to obtain a first probability distribution, wherein the first probability distribution is used for representing a voice emotion recognition result obtained by the voice emotion recognition model;
inputting the video data into a video emotion recognition model to obtain a second probability distribution, wherein the second probability distribution is used for representing a video emotion recognition result obtained by the video emotion recognition model;
and performing fusion judgment according to the voice emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition and determine emotion classification.
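To make the flow of claim 1 concrete, the following minimal Python sketch wires the steps together. It is an illustration only: `extract_streams`, `speech_emotion_probs`, and `video_emotion_probs` are hypothetical helpers standing in for the preprocessing step and the two recognition models detailed in the dependent claims, and the emotion label set and equal fusion weights are assumptions.

```python
import numpy as np

EMOTIONS = ("neutral", "happy", "angry", "sad")   # assumed label set

def recognize_emotion(av_path, prior=(0.5, 0.5)):
    """Sketch of the overall audio-video emotion recognition flow of claim 1."""
    # Preprocessing (claim 8): split the recording into voice data and video frames.
    voice_data, video_frames = extract_streams(av_path)        # hypothetical helper

    # First and second probability distributions from the two recognition models.
    p_speech = np.asarray(speech_emotion_probs(voice_data))    # hypothetical helper
    p_video = np.asarray(video_emotion_probs(video_frames))    # hypothetical helper

    # Fusion judgment (claim 7): weight the two distributions and take the argmax.
    score = prior[0] * p_speech + prior[1] * p_video
    return EMOTIONS[int(np.argmax(score))], score
```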
2. The audio-video based emotion recognition method of claim 1, wherein said inputting the voice data into a voice emotion recognition model to obtain a first probability distribution comprises:
preprocessing the voice data and extracting characteristic parameters of the voice data;
recognizing the characteristic parameters of the voice data by using a hidden Markov model to obtain the characteristic vector of the voice data;
and classifying the feature vectors of the voice data by utilizing a pre-established artificial neural network to obtain a first probability distribution of the voice emotion recognition.
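One plausible realisation of claim 2 scores frame-level acoustic parameters with a Gaussian hidden Markov model and classifies the resulting utterance vector with a small neural network. The sketch below uses `hmmlearn` and scikit-learn purely as stand-ins; the patent names neither library, and the state count, hidden-layer size, and pooling of state posteriors are assumptions.

```python
import numpy as np
from hmmlearn import hmm                       # assumed HMM implementation
from sklearn.neural_network import MLPClassifier

def hmm_feature_vector(frame_params, n_states=5):
    """Recognize frame-level characteristic parameters with an HMM and pool the
    per-state posteriors into a fixed-length feature vector for the utterance."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(frame_params)                    # frame_params: (n_frames, n_params)
    posteriors = model.predict_proba(frame_params)
    return posteriors.mean(axis=0)             # one entry per hidden state

def first_probability_distribution(train_vectors, train_labels, query_vector):
    """Classify the HMM-derived vector with a pre-established ANN and return the
    first probability distribution over emotion classes."""
    ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    ann.fit(train_vectors, train_labels)
    return ann.predict_proba(np.asarray(query_vector).reshape(1, -1))[0]
```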
3. The audio-video based emotion recognition method of claim 2, wherein said classifying the feature vectors of the voice data by utilizing a pre-established artificial neural network comprises:
normalizing the feature vectors of the voice data to obtain a feature matrix to be recognized;
taking the feature matrix to be recognized as the input of the artificial neural network;
and calculating the matching probability of each element in the feature matrix to be recognized with the standard feature matrix corresponding to each sample voice emotion, so as to obtain the first probability distribution of the voice emotion recognition.
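Claim 3's matching step can be pictured as comparing the normalised feature matrix against one standard feature matrix per sample emotion. A plain NumPy sketch follows; mapping distances to probabilities with a softmax over negative distances is an assumption, since the claim does not fix the matching formula.

```python
import numpy as np

def match_against_standards(feature_matrix, standard_matrices):
    """feature_matrix: matrix to be recognized; standard_matrices: {emotion: matrix}.
    Returns the first probability distribution over the sample voice emotions."""
    f = np.asarray(feature_matrix, dtype=float)
    f = (f - f.min()) / (f.max() - f.min() + 1e-9)        # min-max normalization

    emotions = list(standard_matrices)
    # Element-wise matching: smaller distance to a standard matrix -> higher probability.
    dists = np.array([np.linalg.norm(f - np.asarray(standard_matrices[e]))
                      for e in emotions])
    scores = np.exp(-dists)
    return dict(zip(emotions, scores / scores.sum()))
```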
4. The audio-video based emotion recognition method of claim 2, wherein the characteristic parameters of the voice data include pitch frequency, short-time energy, and amplitude.
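The three characteristic parameters of claim 4 are all computable per analysis frame. A NumPy sketch, assuming 16 kHz audio and an autocorrelation-based pitch estimate (the patent does not prescribe a particular estimator):

```python
import numpy as np

def frame_parameters(frame, sample_rate=16000, fmin=50, fmax=500):
    """Pitch frequency, short-time energy and amplitude for one speech frame."""
    frame = np.asarray(frame, dtype=float)
    energy = float(np.sum(frame ** 2))               # short-time energy
    amplitude = float(np.max(np.abs(frame)))         # short-time (peak) amplitude

    # Autocorrelation pitch estimate, restricted to a plausible lag range.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    pitch = sample_rate / lag if ac[lag] > 0 else 0.0
    return pitch, energy, amplitude
```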
5. The audio-video based emotion recognition method as recited in claim 1, wherein said inputting the video data into a video emotion recognition model to obtain a second probability distribution comprises:
preprocessing the video data and extracting facial expression images of the video data;
performing feature extraction on the facial expression image by using a local binary fitting algorithm to obtain a feature vector of the video data;
and classifying the feature vectors of the video data by using a random forest algorithm to obtain a second probability distribution of the video emotion recognition.
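For the classification step of claim 5, an off-the-shelf random forest can turn the keypoint-based feature vectors into the second probability distribution. A scikit-learn sketch (the library choice and tree count are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def second_probability_distribution(train_vectors, train_labels, query_vector,
                                    n_trees=100):
    """Classify a facial-keypoint feature vector and return per-emotion probabilities."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(train_vectors, train_labels)
    return forest.predict_proba(np.asarray(query_vector).reshape(1, -1))[0]
```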
6. The audio-video based emotion recognition method of claim 5, wherein said performing feature extraction on the facial expression image by using a local binary fitting algorithm to obtain the feature vector of the video data comprises:
carrying out face detection on the facial expression image to obtain a face partial image;
extracting key points of the human face by using a local binary fitting algorithm according to the human face partial image;
and constructing a feature vector of the video data according to the face key points.
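Claim 6 can be prototyped with OpenCV for the face-detection step; the Haar cascade used below is only a stand-in detector, and `local_binary_fitting` is a hypothetical placeholder for the fitting algorithm that returns the face key-point coordinates.

```python
import cv2
import numpy as np

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keypoint_feature_vector(expression_image):
    """Detect the face region, extract key points and flatten them into a vector."""
    gray = cv2.cvtColor(expression_image, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                          # face partial image
    face_patch = gray[y:y + h, x:x + w]

    # Hypothetical: fit the face model and return an (N, 2) array of key points.
    keypoints = local_binary_fitting(face_patch)
    return np.asarray(keypoints, dtype=float).ravel()
```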
7. The audio-video based emotion recognition method of claim 1, wherein said performing fusion judgment according to the voice emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition and determine emotion classification comprises:
using a preset prior probability as a weight parameter, calculating the comprehensive score of emotion recognition by using an argmax function according to the weight parameter, the first probability distribution, and the second probability distribution, and determining the emotion classification.
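A minimal NumPy rendering of the fusion rule in claim 7, assuming the preset prior probabilities act as linear weights on the two distributions before the argmax:

```python
import numpy as np

def fuse_and_decide(p_speech, p_video, prior=(0.5, 0.5),
                    emotions=("neutral", "happy", "angry", "sad")):
    """Weighted fusion of the two probability distributions followed by argmax."""
    w_speech, w_video = prior                      # preset prior used as weight parameter
    score = w_speech * np.asarray(p_speech) + w_video * np.asarray(p_video)
    k = int(np.argmax(score))                      # comprehensive score -> emotion class
    return emotions[k], float(score[k])
```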
8. The audio-video based emotion recognition method as recited in claim 1, wherein said pre-processing the audio-video data to obtain voice data and video data comprises:
carrying out voice detection on the audio and video data to obtain the voice data;
and performing video extraction on the audio and video data to obtain the video data.
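The preprocessing in claim 8 amounts to demultiplexing the recording into a voice track and image frames, much like the `extract_streams` helper assumed in the claim-1 sketch above. The sketch below shells out to `ffmpeg` and samples frames with OpenCV; the flags, sampling rate, and frame step are ordinary defaults chosen for illustration, not values taken from the patent.

```python
import subprocess
import cv2

def extract_streams(av_path, wav_path="speech.wav", frame_step=5):
    """Split an audio-video recording into voice data and sampled video frames."""
    # Voice extraction: demux the audio track to a mono 16 kHz WAV file.
    subprocess.run(["ffmpeg", "-y", "-i", av_path, "-vn",
                    "-ac", "1", "-ar", "16000", wav_path], check=True)

    # Video extraction: keep every frame_step-th frame for expression analysis.
    frames, cap, idx = [], cv2.VideoCapture(av_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return wav_path, frames
```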
9. The audio-video based emotion recognition method of claim 1, wherein, after said determining the emotion classification, the method further comprises:
matching the emotion classification with a preset abnormal emotion sample set;
and if the matching is successful, performing intervention processing on the abnormal emotion.
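The post-processing of claim 9 reduces to a lookup against a preset set of abnormal emotions; a trivial sketch, with both the sample set and the intervention hook assumed:

```python
ABNORMAL_EMOTIONS = {"angry", "agitated", "distressed"}    # preset sample set (assumed)

def check_and_intervene(emotion, on_abnormal=print):
    """Match the recognized emotion against the abnormal set; intervene on a hit."""
    if emotion in ABNORMAL_EMOTIONS:
        on_abnormal(f"abnormal emotion detected: {emotion}")  # intervention processing
        return True
    return False
```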
10. An emotion recognition apparatus based on audio and video, characterized by comprising:
the acquisition module is used for acquiring audio and video data;
the processing module is used for preprocessing the audio and video data to obtain voice data and video data;
the voice emotion recognition module is used for inputting the voice data into a voice emotion recognition model to obtain a first probability distribution, and the first probability distribution is used for representing a voice emotion recognition result obtained by the voice emotion recognition model;
the video emotion recognition module is used for inputting the video data into a video emotion recognition model to obtain a second probability distribution, and the second probability distribution is used for representing a video emotion recognition result obtained by the video emotion recognition model;
and the fusion judgment module is used for performing fusion judgment according to the voice emotion recognition result and the video emotion recognition result to obtain a comprehensive score of emotion recognition and determine emotion classification.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method for audio-visual based emotion recognition according to any of claims 1-9.
12. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform a method of audio-visual based emotion recognition according to any of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements a method of audio-visual based emotion recognition according to any of claims 1 to 9.
CN202211009366.6A 2022-08-22 2022-08-22 Emotion recognition method, device and equipment based on audio and video Pending CN115376559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211009366.6A CN115376559A (en) 2022-08-22 2022-08-22 Emotion recognition method, device and equipment based on audio and video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211009366.6A CN115376559A (en) 2022-08-22 2022-08-22 Emotion recognition method, device and equipment based on audio and video

Publications (1)

Publication Number Publication Date
CN115376559A true CN115376559A (en) 2022-11-22

Family

ID=84067439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211009366.6A Pending CN115376559A (en) 2022-08-22 2022-08-22 Emotion recognition method, device and equipment based on audio and video

Country Status (1)

Country Link
CN (1) CN115376559A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953724A (en) * 2023-03-14 2023-04-11 智因科技(深圳)有限公司 User data analysis and management method, device, equipment and storage medium
CN115953724B (en) * 2023-03-14 2023-06-16 深圳市银弹科技有限公司 User data analysis and management method, device, equipment and storage medium
CN116560513A (en) * 2023-07-08 2023-08-08 世优(北京)科技有限公司 AI digital human interaction method, device and system based on emotion recognition
CN116560513B (en) * 2023-07-08 2023-09-15 世优(北京)科技有限公司 AI digital human interaction method, device and system based on emotion recognition
CN117473304A (en) * 2023-12-28 2024-01-30 天津大学 Multi-mode image labeling method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109726624B (en) Identity authentication method, terminal device and computer readable storage medium
CN110147726B (en) Service quality inspection method and device, storage medium and electronic device
CN110443692B (en) Enterprise credit auditing method, device, equipment and computer readable storage medium
CN115376559A (en) Emotion recognition method, device and equipment based on audio and video
CN109117777A (en) The method and apparatus for generating information
US11822568B2 (en) Data processing method, electronic equipment and storage medium
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN110516083B (en) Album management method, storage medium and electronic device
CN113095204B (en) Double-recording data quality inspection method, device and system
CN116955699B (en) Video cross-mode search model training method, searching method and device
WO2023078070A1 (en) Character recognition method and apparatus, device, medium, and product
CN116932919B (en) Information pushing method, device, electronic equipment and computer readable medium
CN111738199A (en) Image information verification method, image information verification device, image information verification computing device and medium
CN113128284A (en) Multi-mode emotion recognition method and device
US8954327B2 (en) Voice data analyzing device, voice data analyzing method, and voice data analyzing program
CN116935889B (en) Audio category determining method and device, electronic equipment and storage medium
CN116884149A (en) Method, device, electronic equipment and medium for multi-mode information analysis
CN111899718B (en) Method, apparatus, device and medium for recognizing synthesized speech
Kshirsagar et al. Deepfake video detection methods using deep neural networks
CN112767946A (en) Method, apparatus, device, storage medium and program product for determining user status
KR102564570B1 (en) System and method for analyzing multimodal emotion
CN112115325A (en) Scene type determination method and training method and device of scene analysis model
CN118172861B (en) Intelligent bayonet hardware linkage control system and method based on java
CN118380000A (en) Source object recognition method and device, electronic equipment, storage medium and computer program product
CN117668224A (en) Emotion recognition model training method, emotion recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination