CN113724697A - Model generation method, emotion recognition method, device, equipment and storage medium - Google Patents
- Publication number
- CN113724697A
- Authority
- CN
- China
- Prior art keywords
- data
- result
- audio data
- emotion
- gender
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Child & Adolescent Psychology (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
The disclosure provides a model generation method, an emotion recognition method, an apparatus, an electronic device, and a storage medium, and relates to artificial intelligence technical fields such as speech technology and natural language processing. The specific implementation scheme is as follows: audio data is input into a recognition model to be trained, and a gender recognition result and an emotion recognition result output for the audio data are acquired; the recognition model to be trained is then adjusted based on the gender recognition result, the emotion recognition result, and a labeling result to obtain an emotion recognition model. Embodiments of the disclosure can improve the accuracy of emotion recognition.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and further relates to the field of artificial intelligence technologies such as speech technology and natural language processing, and in particular, to a model generation method, an emotion recognition apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, people can realize more new functions through computers, and emotion recognition is one of the functions.
Emotion recognition is performed by a computer, including the determination of emotional states by detecting externally observable information. Since emotion is a complex psychological activity of human beings, improvement of the accuracy of emotion recognition remains a problem that needs long-term attention.
Disclosure of Invention
The disclosure provides a model generation method, an emotion recognition device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a model generation method including:
inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result output for the audio data;
and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain the emotion recognition model.
According to another aspect of the present disclosure, there is provided an emotion recognition method including:
inputting audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model provided by any one of the embodiments of the disclosure.
According to another aspect of the present disclosure, there is provided a model generation apparatus including:
a recognition module, configured to input audio data into a recognition model to be trained and acquire a gender recognition result and an emotion recognition result output for the audio data;
and a training module, configured to adjust the recognition model to be trained based on the gender recognition result, the emotion recognition result, and the labeling result to obtain the emotion recognition model.
According to another aspect of the present disclosure, there is provided an emotion recognition apparatus including:
a module for inputting audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model provided by any one of the embodiments of the disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.
According to the technology of the present disclosure, gender information is incorporated during emotion recognition, so that the accuracy of emotion recognition results can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a model generation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a model generation method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model generation method according to yet another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an identification method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model generation method according to an example of the present disclosure;
FIG. 6 is a schematic diagram of recognition model generation to be trained according to an example of the present disclosure;
FIG. 7 is a schematic diagram of a model generation apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a model generation apparatus according to another embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;
FIG. 13 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;
FIG. 14 is a block diagram of an electronic device for implementing a model generation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the present disclosure first provides a model generation method, as shown in fig. 1, including:
step S11: inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result output for the audio data;
step S12: and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain the emotion recognition model.
In this embodiment, the audio data may be any audio data containing human or animal sounds. For example, it may be audio data of human speech, audio data of singing, or other audio data generated by human voice.
The audio data may include sound from one speaker or may include sound from two or more different speakers.
The audio data may include audio of speech in a single language, and may also include audio of speech in multiple languages, for example, audio of a speaker speaking Chinese, audio of speech in a foreign language, or audio of speech mixing Chinese and foreign languages.
The gender recognition result and the emotion recognition result output for the audio data include, for the voice of at least one speaker in the audio data, a recognition result regarding the gender of the speaker and a recognition result regarding the emotion of the speaker.
The gender recognition result may include the probabilities that the speaker of the audio data belongs to each gender, for example, a 60% probability that the speaker is male and a 40% probability that the speaker is female. The gender recognition result may also be a single definite result, for example, that the first speaker is male and the second speaker is female.
The emotion recognition result may include the probabilities that the speaker of the audio data is in each emotional state, for example, an 80% probability of being in a happy emotional state and a 20% probability of being in a sad emotional state. The emotion recognition result may also be a single definite result, for example, that the speaker is happy, or that the speaker is both happy and nervous.
In another implementation, if two completely conflicting emotions are included in the emotion recognition result, one emotion output may be selected among the two completely conflicting emotions, and if two or more emotions that may occur together, such as distraction and relaxation, are included in the emotion recognition result, the two or more emotions that may occur together may be simultaneously output.
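The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_emotions`, the `threshold` cut-off, and the representation of conflicts as pairs of emotion names are hypothetical and do not come from the disclosure, which does not say how conflicting emotions are identified or which one is chosen. Here the more probable of two conflicting emotions is kept.

```python
def select_emotions(probs, conflicts, threshold=0.3):
    """Pick the emotions to output from a probability mapping.

    probs:     dict mapping emotion name -> probability.
    conflicts: iterable of 2-tuples naming mutually exclusive emotions
               (hypothetical representation, not from the disclosure).
    threshold: hypothetical cut-off for an emotion to count as present.
    """
    selected = {name for name, p in probs.items() if p >= threshold}
    for a, b in conflicts:
        if a in selected and b in selected:
            # Keep only the more probable of two conflicting emotions.
            selected.discard(a if probs[a] < probs[b] else b)
    # Emotions that can co-occur are all returned, most probable first.
    return sorted(selected, key=lambda name: -probs[name])


result = select_emotions(
    {"happy": 0.5, "sad": 0.35, "relaxed": 0.4},
    conflicts=[("happy", "sad")],
)
```

With these inputs, "sad" is dropped because it conflicts with the more probable "happy", while "relaxed" is kept alongside it.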
In this embodiment, acquiring the gender recognition result and the emotion recognition result output for the audio data may include: acquiring intermediate data of gender recognition and intermediate data of emotion recognition for the audio data; acquiring the emotion recognition result for the audio data based on both the intermediate data of gender recognition and the intermediate data of emotion recognition; and acquiring the gender recognition result for the audio data based on the intermediate data of gender recognition.
In this embodiment, the annotation result may include a gender annotation result and an emotion annotation result.
The gender labeling result may be male or female.
In other embodiments, the gender identification result and the gender labeling result may be divided into more sub-categories according to the information such as age under the gender category, such as male children, male adolescents, female adolescents, etc.
The emotion labeling result may be at least one of a plurality of preset emotions, such as happiness, anger, sadness, and joy.
In this embodiment, adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result, and the labeling result may include: optimizing or adjusting the recognition model to be trained based on the gender recognition result and the gender-related labeling result; and optimizing or adjusting the recognition model to be trained based on the emotion recognition result and the labeling result about the emotion.
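One common way to adjust a model against two kinds of labels at once is to sum a loss term per task. The sketch below assumes cross-entropy for each task and a hypothetical `gender_weight` factor; the disclosure does not name a loss function or a weighting scheme, so both are illustrative choices only.

```python
import math


def cross_entropy(probs, label_index):
    """Negative log-likelihood of the labeled class."""
    return -math.log(max(probs[label_index], 1e-12))


def joint_loss(gender_probs, gender_label, emotion_probs, emotion_label,
               gender_weight=1.0):
    """Sum of the emotion loss and a weighted gender loss.

    gender_weight is a hypothetical hyperparameter, not from the source.
    """
    emotion_term = cross_entropy(emotion_probs, emotion_label)
    gender_term = cross_entropy(gender_probs, gender_label)
    return emotion_term + gender_weight * gender_term


# Predicted: 60% male / 40% female, 80% happy / 20% sad.
# Labels: male (index 0) and happy (index 0).
loss = joint_loss([0.6, 0.4], 0, [0.8, 0.2], 0)
```

Minimizing a combined value like this pushes the shared layers to retain information useful for both gender and emotion.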
Speech is an important carrier of emotion in human communication. Speech recognition focuses on what the speaker says, whereas emotion recognition focuses on how the speaker says it. People express themselves differently under different emotional states, and the features present in the audio they produce differ accordingly. For example, a happy speaker may sound more cheerful, while an irritable speaker may sound more impatient. In general, the emotion carried by speech can be determined from information such as speaking rate, tone, and volume, and the development of deep learning techniques has accelerated progress in detecting emotion from speech.
However, recognizing a speaker's emotion from audio or speech still has shortcomings, because the emotion in the audio is closely tied to the speaker. Different listeners may perceive different emotions in the same audio, and factors such as cultural differences make emotion recognition from audio more difficult. Research has found that, even when information such as speaking rate, tone, and volume is the same, the emotion expressed in speech can differ between speakers of different genders. For example, anger may be conveyed through a different tone by a female speaker than by a male speaker, and the same holds for other emotions such as excitement, fear, and sadness. Specifically, an utterance with the same tone and content may convey one mood when uttered by a speaker of one gender and a different mood when uttered by a speaker of another gender.
Therefore, in this embodiment, gender information in the audio data is taken into account when designing the features and the model network. A gender recognition result and an emotion recognition result can be obtained from the audio data. The recognition model to be trained is then trained or adjusted according to the gender recognition result and the emotion recognition result it outputs, together with the labeling results for gender and emotion, to obtain the emotion recognition model. Because gender information is incorporated into the emotion recognition process, more information is available for reference during emotion recognition, which improves the accuracy of the emotion recognition result.
In one embodiment, inputting audio data into a recognition model to be trained and acquiring a gender recognition result and an emotion recognition result output for the audio data, as shown in fig. 2, includes:
step S21: using a pre-processing layer of the recognition model to be trained, acquiring, from the audio data, first data in which at least emotion information and gender information are fused;
step S22: using a gender recognition output layer of the recognition model to be trained, acquiring the gender recognition result from the first data;
step S23: using an emotion recognition output layer of the recognition model to be trained, acquiring the emotion recognition result from the first data;
wherein the pre-processing layer is a data processing layer that precedes the gender recognition output layer and the emotion recognition output layer in the recognition model to be trained.
In this embodiment, step S22 and step S23 may be executed simultaneously, or may be executed sequentially according to any order.
In the present embodiment, the portion of the recognition model for obtaining the emotion recognition result and the portion for obtaining the gender recognition result may be shared except for the final output layer. For example, the input layer, the data processing layer between the input layer and the output layer may be a common data processing layer for emotion recognition processing and gender recognition processing.
In this embodiment, the preprocessing layer may not include an input layer.
In this embodiment, the data processing layer for recognizing emotion information of the audio data and the data processing layer for recognizing gender information of the audio data may be shared, so that gender information in the audio can be fully considered when recognizing emotion.
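The shared-layer arrangement with two output layers can be sketched as below. This is a toy illustration, not the model of the disclosure: the class name `TwoHeadModel`, the single `tanh` hidden layer standing in for the pre-processing layer, the random initialization, and all sizes are assumptions made only to show one shared representation (the first data) feeding both a gender head and an emotion head.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def init_layer(rows, cols, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
               for _ in range(rows)]
    return weights, [0.0] * rows


class TwoHeadModel:
    """Shared trunk followed by separate gender and emotion output layers."""

    def __init__(self, n_in, n_hidden, n_genders, n_emotions, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(n_hidden, n_in, rng)        # stands in for the pre-processing layer
        self.gender_head = init_layer(n_genders, n_hidden, rng)
        self.emotion_head = init_layer(n_emotions, n_hidden, rng)

    def forward(self, features):
        # "First data": one representation used by both output layers.
        first_data = [math.tanh(v) for v in linear(features, *self.shared)]
        gender = softmax(linear(first_data, *self.gender_head))
        emotion = softmax(linear(first_data, *self.emotion_head))
        return gender, emotion


model = TwoHeadModel(n_in=4, n_hidden=8, n_genders=2, n_emotions=4)
gender_probs, emotion_probs = model.forward([0.1, -0.2, 0.3, 0.4])
```

Because both heads read the same `first_data`, any adjustment driven by the gender labels also changes the representation that the emotion head sees.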
In one embodiment, acquiring the first data fused with at least emotion information and gender information from the audio data, as shown in fig. 3, includes:
step S31: extracting frequency domain information of the audio data to obtain second data;
step S32: performing convolution and pooling calculation on the second data to obtain third data;
step S33: processing the third data by using a bidirectional Long Short-Term Memory (LSTM) neural network to acquire fourth data;
step S34: performing self-attention (Self-Attention) weight calculation on the fourth data to acquire the first data.
In this embodiment, the frequency domain information may be the frequency spectrum of the sound, which helps in recognizing the audio data more intuitively. "Frequency spectrum" is short for frequency spectral density and can be represented by a curve showing the distribution over frequency. A complex oscillation can be decomposed into harmonic oscillations of different amplitudes and frequencies, and the pattern of the amplitudes of these harmonic oscillations arranged by frequency is called the frequency spectrum. The frequency spectrum thus represents the audio signal in the frequency domain.
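As a minimal sketch of frequency domain extraction, the code below computes a magnitude spectrum per frame with a direct discrete Fourier transform. The function names, frame length, and hop size are illustrative assumptions; the disclosure does not specify how the spectrum is computed, and a practical system would typically use an FFT library with windowing.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins 0..N/2 of one frame (direct O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff))
    return spectrum


def spectrogram(signal, frame_len=64, hop=32):
    """Split the signal into overlapping frames and take each frame's spectrum."""
    return [magnitude_spectrum(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]


# A pure tone completing 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(128)]
frames = spectrogram(tone)
peak_bin = frames[0].index(max(frames[0]))
```

For this tone the energy concentrates in bin 8, which is the decomposition into harmonic components that the paragraph above describes.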
Performing convolution and pooling calculation on the second data to obtain the third data may include performing convolution calculation on the second data first and then performing pooling calculation to obtain the third data; it may alternatively include performing pooling calculation on the second data first and then performing convolution calculation to obtain the third data.
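The two orderings generally give different results, which a one-dimensional toy example makes concrete. The helper names, the kernel, and the use of max pooling are assumptions for illustration; the disclosure does not state the kernel shape or the pooling type. As in most deep learning frameworks, the "convolution" here is computed without flipping the kernel.

```python
def conv1d(values, kernel):
    """Valid 1-D convolution (cross-correlation form, no padding)."""
    k = len(kernel)
    return [sum(values[i + j] * kernel[j] for j in range(k))
            for i in range(len(values) - k + 1)]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


second_data = [1, 3, 2, 5, 4, 6]
kernel = [1, -1]

conv_then_pool = max_pool1d(conv1d(second_data, kernel), 2)
pool_then_conv = conv1d(max_pool1d(second_data, 2), kernel)
```

Here convolution followed by pooling yields `[1, 1]`, while pooling followed by convolution yields `[-2, -1]`.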
The frequency domain information of the audio data may be extracted by a frequency spectrum extraction layer of the recognition model to be trained.
In this embodiment, processing the third data by using the bidirectional long short-term memory neural network may include performing both forward calculation processing and backward calculation processing on the third data with that network.
Performing self-attention weight calculation on the fourth data may include processing the fourth data with a self-attention mechanism layer of the recognition model to be trained, so that more important information in the fourth data is given relatively higher weight.
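A minimal sketch of self-attention weighting is given below, assuming scaled dot-product attention in which each time step serves as its own query, key, and value. The learned projection matrices found in practical self-attention layers are omitted, and nothing here is taken from the disclosure beyond the idea of giving higher weight to more relevant positions.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity projections.

    sequence: list of equal-length feature vectors, one per time step.
    Returns one re-weighted vector per time step.
    """
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        weights = softmax(scores)  # higher weight for more similar steps
        outputs.append([sum(w * value[i] for w, value in zip(weights, sequence))
                        for i in range(dim)])
    return outputs


fourth_data = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended = self_attention(fourth_data)
```

In this example the first and third steps are identical, so each attends more strongly to the other than to the second step, and their outputs lean toward the first feature.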
In this embodiment, rich data about gender and emotion in the audio data can be obtained through frequency domain information extraction, the bidirectional long short-term memory neural network, and the self-attention mechanism, so that emotion can be judged more accurately.
Meanwhile, as the recognition model to be trained passes the audio data through the frequency spectrum extraction layer, the convolution layer, the pooling layer, the bidirectional long short-term memory neural network, and the self-attention layer, gender information and emotion information in the audio can be processed and retained together, so that the subsequent emotion recognition result is related to the gender recognition result.
In one embodiment, the model generation method further comprises obtaining audio data by:
preprocessing the acquired original audio data to obtain a preprocessing result;
performing data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;
the original audio data and the enhancement data are used as audio data.
In this embodiment, the preprocessing may include operations such as noise reduction, background sound filtering, volume adjustment, sound clipping, and the like.
The pre-processing result may be pre-processed raw audio data.
The raw audio data may be raw audio data acquired with an audio recording device. Or may be an audio portion of the video data.
The data enhancement operation performed on the preprocessing result may be data augmentation of the preprocessing result: without substantially adding new data, it produces an effect roughly equivalent to adding more training data, so that limited data yields the training effect of a larger dataset and the coverage of the data used to train the model is improved.
In this embodiment, the data enhancement may be supervised data enhancement, that is, data augmentation performed on the basis of existing data by using preset data transformation rules. The data enhancement may also be unsupervised data enhancement. In addition, data enhancement may include single-sample data enhancement and multi-sample data enhancement, and may also include enhancement by generating new data and enhancement by learning an enhancement strategy.
In this embodiment, the original audio data and the enhancement data of the original audio data are used as audio data, so that the limited original audio data can be fully utilized, and the utilization rate of the training set is improved.
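The flow from original audio to the final training audio can be sketched as a small pipeline. The function `build_training_set` and the stand-in preprocessing and enhancement callables are hypothetical; they only show that each original clip contributes itself plus one enhanced copy per enhancement operation.

```python
def build_training_set(raw_clips, preprocess, enhancements):
    """Return the original clips together with their enhanced variants."""
    dataset = []
    for clip in raw_clips:
        dataset.append(clip)                     # original audio data
        cleaned = preprocess(clip)               # preprocessing result
        for enhance in enhancements:
            dataset.append(enhance(cleaned))     # enhanced data
    return dataset


def halve_volume(clip):
    return [0.5 * x for x in clip]               # stand-in preprocessing


def reverse(clip):
    return clip[::-1]                            # stand-in enhancement


def double(clip):
    return [2 * x for x in clip]                 # stand-in enhancement


audio_data = build_training_set([[1.0, -1.0]], halve_volume, [reverse, double])
```

One original clip and two enhancement operations give three training clips in total.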
In one embodiment, in a case that the data enhancement operation is a differential enhancement operation, performing the data enhancement operation on the pre-processing result to obtain enhanced data of the original audio data includes:
extracting target audio features of original audio data;
and performing differential enhancement operation on the target audio features to obtain enhanced data of the original audio data.
In this embodiment, the target audio features of the original audio data may be primary audio features extracted by an early data processing layer of the recognition model to be trained. Specifically, they may include audio features that can be extracted directly from the original audio data, and may also include audio features extracted from processed (for example, noise-reduced) original audio data.
In this embodiment, the primary audio feature may be related to the waveform of the audio data and may be an equivalent representation of the speech data, carrying information such as emotion, gender, and language. Different primary audio features (or front-end features) may be obtained by trying different feature types or combinations, such as MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) coefficients, or features of different dimensions, such as 40 to 80.
Performing a differential enhancement operation on the target audio feature may include performing partial differential enhancement and full differential enhancement on the target audio feature.
In this embodiment, differential enhancement can be performed on the basis of the original audio data, so that the amount of audio data is increased and the limited original audio data is used more effectively in training the recognition model to be trained.
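As a rough illustration of differential enhancement, the sketch below appends first- and second-order frame differences to a sequence of feature frames. It assumes that the differential enhancement corresponds to the delta and delta-delta features commonly used in speech processing; the disclosure does not fix an exact formula, and the function names `delta` and `add_deltas` as well as the simple one-frame difference are assumptions made for this example.

```python
def delta(frames):
    """First-order difference along time; the first frame's delta is zero."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def add_deltas(frames):
    """Append first- and second-order differences to every frame."""
    d1 = delta(frames)   # first-order difference
    d2 = delta(d1)       # second-order difference
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Three frames of a toy 2-dimensional feature (hypothetical values).
features = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
enhanced = add_deltas(features)  # each frame now has 6 dimensions
```

Each enhanced frame keeps the original feature values and adds the local rate of change, which is one way of deriving additional training input from the same original audio.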
In one embodiment, preprocessing the acquired original audio data to obtain a preprocessing result includes:
performing at least one of the following operations on the raw audio data to pre-process the raw audio data:
changing the playing rate of the original audio data;
adding reverberation to the original audio data;
removing noise in the original audio data;
performing a time domain channel masking operation on the original audio data;
performing a frequency domain channel masking operation on the original audio data.
In this embodiment, the change of the playing rate of the original audio data may be to increase or decrease the rate of the original audio.
Adding reverberation to the original audio data may mean adding to the original audio the effect of sound being reflected or absorbed.
The noise in the original audio data may be removed by removing background sounds in the original audio data, or removing non-background sounds irrelevant to emotion analysis in the original audio data, or extracting sounds to be analyzed from the original audio data as preprocessed original audio data.
Performing the time domain channel masking operation on the original audio data may be a partial signal cancellation operation performed on the original audio data on a time domain channel.
Performing the frequency domain channel masking operation on the original audio data may be a partial signal cancellation operation performed on the original audio data on a frequency domain channel.
In this embodiment, the original audio is preprocessed to a certain extent, so that one piece of original audio data is converted into multiple pieces of original audio data, and the utilization rate of the original audio data is improved.
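The masking and rate-change operations above can be sketched as follows. This is a minimal illustration under stated assumptions: the spectrogram is represented as a list of frames, each a list of frequency-bin values; masking is implemented as zeroing; and the rate change uses naive nearest-sample resampling, which also shifts pitch. The function names and the representation are hypothetical and are not taken from the disclosure.

```python
def time_mask(spec, start, width):
    """Zero out `width` consecutive frames beginning at frame `start`."""
    return [[0.0] * len(frame) if start <= t < start + width else list(frame)
            for t, frame in enumerate(spec)]


def freq_mask(spec, start, width):
    """Zero out `width` consecutive frequency bins beginning at bin `start`."""
    return [[0.0 if start <= f < start + width else v
             for f, v in enumerate(frame)]
            for frame in spec]


def change_rate(samples, rate):
    """Naive speed change by nearest-sample resampling (rate > 1 speeds up)."""
    n = int(len(samples) / rate)
    return [samples[min(int(i * rate), len(samples) - 1)] for i in range(n)]


# Toy spectrogram: 4 frames x 3 frequency bins (hypothetical values).
spec = [[1.0, 1.0, 1.0] for _ in range(4)]
masked_t = time_mask(spec, start=1, width=2)    # frames 1 and 2 are cancelled
masked_f = freq_mask(spec, start=0, width=1)    # bin 0 is cancelled everywhere
faster = change_rate(list(range(8)), rate=2.0)  # half as many samples
```

Applying several such operations to one recording yields several distinct training samples, which is the effect described above.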
In one embodiment, adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result comprises:
performing loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain a loss value;
and adjusting the recognition model to be trained according to the loss value.
In this embodiment, performing the loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain the loss value may be done in at least one of the following ways:
acquiring a first loss value based on the gender recognition result and the labeling result about gender, and taking the first loss value as the loss value;
acquiring a second loss value based on the emotion recognition result and the labeling result about emotion, and taking the second loss value as the loss value;
acquiring a third loss value based on the emotion recognition result, the labeling result about emotion and the labeling result about gender, and taking the third loss value as the loss value;
acquiring a fourth loss value based on the gender recognition result, the labeling result about gender and the labeling result about emotion, and taking the fourth loss value as the loss value;
acquiring a first loss value based on the gender recognition result and the labeling result about gender, acquiring a second loss value based on the emotion recognition result and the labeling result about emotion, and obtaining the loss value based on the first loss value and the second loss value.
In the embodiment, the loss calculation can be performed according to the gender recognition result, the emotion recognition result and the labeling result, so that when the model is optimized or adjusted, the model can be optimized according to the dual recognition results of gender and emotion, and the model can learn the optimal parameters for recognizing gender. Meanwhile, according to the gender recognition optimal parameter and the emotion recognition result, the model can learn the optimal parameter for recognizing emotion, and the emotion recognition accuracy is improved.
In one embodiment, the labeling result includes a gender labeling result and an emotion labeling result, and performing the loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain the loss value includes:
performing cross entropy loss (Cross Entropy) calculation based on the gender labeling result and the gender recognition result to obtain a first loss value;
performing cross entropy loss calculation based on the emotion labeling result and the emotion recognition result to obtain a second loss value;
and carrying out weighted summation on the first loss value and the second loss value, and taking the summation result as a loss value.
In this embodiment, the cross entropy loss calculation based on the gender labeling result and the gender recognition result may apply a cross entropy loss function to the gender recognition result and the gender labeling result.
Likewise, the cross entropy loss calculation based on the emotion labeling result and the emotion recognition result may apply a cross entropy loss function to the emotion recognition result and the emotion labeling result.
In some embodiments, the cross-entropy loss may be a typical cross-entropy loss function; it may also be an extended type cross entropy loss function derived from a typical cross entropy loss function; or may be a two-class cross entropy loss function or a multi-class cross entropy loss function.
In other embodiments, the emotion recognition result and the gender recognition result can be converted into a fusion result, and cross entropy loss calculation is performed according to the fusion result and the labeling result.
In this embodiment, after the loss values are calculated for the gender identification result and the emotion identification result, the total loss value is calculated according to the respective weights, so that the emotion identification information and the gender identification information can be fused in the loss values, and the accuracy of the identification result is improved.
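The weighted combination of the two cross entropy losses can be sketched as below. This is an illustrative example only: it assumes the model outputs are already class probabilities and the labels are class indices, and the names `cross_entropy`, `multitask_loss`, `alpha` and `beta`, as well as the default weights of 0.5, are assumptions for this sketch rather than a prescribed implementation.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)


def multitask_loss(emotion_probs, emotion_label,
                   gender_probs, gender_label,
                   alpha=0.5, beta=0.5):
    """Weighted sum of the emotion loss and the gender loss."""
    first = cross_entropy(gender_probs, gender_label)     # first loss value
    second = cross_entropy(emotion_probs, emotion_label)  # second loss value
    return alpha * second + beta * first


# Hypothetical sample: emotion class 1 and gender class 0 are the labels.
loss = multitask_loss([0.0, 0.5, 0.2, 0.3], 1, [0.9, 0.1], 0)
```

Raising `beta` relative to `alpha` makes the shared layers attend more to gender cues, while setting `beta` to zero reduces the objective to plain emotion classification.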
An embodiment of the present disclosure further provides an emotion recognition method, as shown in fig. 4, including:
Step S41: inputting the audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model provided by any one of the embodiments of the disclosure.
The embodiment of the disclosure can be applied to various scenes such as intelligent robot conversation, intelligent teaching, intelligent customer service, psychological analysis, mobile service and the like. For example, a psychologist may record a patient's voice during a consultation and may subsequently design a treatment plan based on the mood expressed in the audio data obtained from the patient's voice.
In the disclosed example, gender information in the audio data is employed to assist the emotion recognition process. In particular, the emotions in the recognition result may include basic emotions and may also include other complex emotions encountered in life; they may also be determined by the audio data set used to train the model. Examples include anger, joy, neutrality, sadness, rage, irritability, annoyance, worry, fear, surprise, timidity, and the like. In a specific example, as shown in fig. 5, the model generation method may include the following steps:
Step S51: constructing an audio data set for training the recognition model, and labeling the audio data set with emotion and gender.
In this example, a certain amount (for example, 100,000 pieces) of raw audio data is collected to construct an audio data set, and the raw audio data can be labeled with emotion information and gender information.
Step S52: preprocessing the raw audio data in the audio data set. Specifically, preliminary processing such as removing noise and silence, framing, and the like may be performed first, and data enhancement, for example adding reverberation or changing the rate, may then be performed on the preliminarily processed data.
In the process of preprocessing the original audio data, clean original audio data can be obtained by removing information such as environmental noise, busy tones, ring-back tones and the like.
Step S53: performing primary audio feature extraction on the preprocessed original audio data. The primary audio features may include pitch frequency (Pitch), Mel-frequency cepstral coefficients (MFCC), PLP, FBank (Filter Bank), FFT (Fast Fourier Transform) features, and the like.
Step S54: performing differential enhancement on the primary audio features, and taking the differentially enhanced primary audio features together with the original audio data as the audio data.
Specifically, second-order differential enhancement may be performed on the primary audio features, and the data may be shuffled (Shuffle).
Step S55: audio data is input into the recognition model to be trained.
Batch (Batch) training is carried out in the recognition model to be trained: forward calculation proceeds layer by layer on the audio data, successively mapping the original audio data and the primary audio features into high-level features.
Step S56: the loss value is calculated from the output data of the output layer.
At the final output layer of the recognition model to be trained, a loss is calculated for each of the two targets, namely the emotion recognition result and the gender recognition result.
Specifically, the degree to which gender influences emotion discrimination can be controlled through the coefficients of the gender loss and the emotion loss, and the following formula may be used:
$$L = \alpha \times L_{emotion} + \beta \times L_{gender}$$
where $\alpha$ and $\beta$ are weights, which may be set to 0.5 and 0.5, respectively. $L_{emotion}$ may be the emotion recognition loss value; it can be calculated from the emotion recognition result and the emotion labeling result, or from the emotion recognition result, the emotion labeling result and the gender labeling result. $L_{gender}$ may be the gender recognition loss value, which can be calculated from the gender recognition result and the gender labeling result.
After the loss calculation is finished, the parameters of the whole recognition model to be trained are updated by back-propagation, and multiple rounds are iterated until convergence, thereby obtaining an emotion recognition model with stable performance.
Step S57: testing the model.
When the emotion recognition model is tested, the gender output need not take part in the calculation; after audio preprocessing, the audio data is input to the network to obtain the final emotion classification result.
In an example of the present disclosure, a model structure of a recognition model to be trained may be as shown in fig. 6, including: a spectrum extraction layer (Spectrum Extraction) 61, a convolutional neural network (CNN) layer 62, a max pooling layer (Max Pooling) 63, a bidirectional long-short term memory layer 64, and a self-attention layer (Self-Attention Module) 65. In addition, the recognition model to be trained may further include an emotion output layer (Emotion Output Layer) 66 and a gender output layer (Gender Output Layer) 67, which together constitute the output layers of the recognition model to be trained. The emotion recognition process and the gender recognition process share all networks in the recognition model to be trained except the output layers.
The parameters of the spectrum extraction layer 61 may include the feature dimension, feature num = $d_{spec}$. The parameters of the convolutional layer 62 may include the number of filters, filter num = $d_{cnn}$, the stride (strides), and the window size = $n_{cw}$; the stride may take a value of 1. The parameters of the max pooling layer 63 may include the stride = $n_{pw}$ and the window size, and the window size may be equal to the stride, that is, window size = $n_{pw}$. The parameters of the bidirectional long-short term memory layer 64 may include the hidden size = $d_{lstm}$. The parameters of the self-attention layer 65 may include the hidden size, the head dimension, and the number of heads, head num = $n_{attn}$. The dimension of the information output by the spectrum extraction layer 61 can be $d_{spec}$, the dimension of the information output by the convolutional layer can be $d_{cnn}$, the dimension of the information output by the pooling layer can be $d_{cnn}$, the dimension of the information output by the bidirectional long-short term memory neural network can be $2d_{lstm}$, and the dimension of the information output by the self-attention layer can be $2n_{attn} \times d_{lstm}$.
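To make the dimension bookkeeping concrete, the sketch below tabulates the per-layer output feature dimensions stated above for one set of hyperparameters. The numeric values are hypothetical, the function names are assumptions for this example, and `pooled_length` additionally assumes unpadded, non-overlapping pooling, which the disclosure does not specify.

```python
def feature_dims(d_spec, d_cnn, d_lstm, n_attn):
    """Output feature dimension of each shared layer, in processing order."""
    return {
        "spectrum_extraction": d_spec,
        "convolution": d_cnn,
        "max_pooling": d_cnn,                   # pooling keeps the channel count
        "bilstm": 2 * d_lstm,                   # forward + backward hidden states
        "self_attention": 2 * n_attn * d_lstm,  # n_attn heads of width 2 * d_lstm
    }


def pooled_length(num_frames, n_pw):
    """Frames left after pooling with window = stride = n_pw (no padding)."""
    return num_frames // n_pw


# Hypothetical hyperparameters, chosen only for illustration.
dims = feature_dims(d_spec=40, d_cnn=64, d_lstm=128, n_attn=4)
frames_after_pooling = pooled_length(num_frames=300, n_pw=3)
```

With these example values the bidirectional LSTM emits 256-dimensional vectors and the self-attention layer emits 1024-dimensional vectors, which is the width seen by both output layers.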
In this example, the output result of the emotion output layer 66 may be probabilities that the audio data contains various emotions, for example, the probabilities are {0.0, 0.5, 0.2, 0.3} for four emotions of impatience, sadness, happiness and tension, respectively. The output result of the gender output layer 67 may be probabilities that speakers of the audio data are of different genders, for example, the probabilities that the audio data are male and female are: {0.9,0.1}.
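The example probabilities above can be turned into a final label by taking the most probable class, as in the following sketch. The label lists and the helper name `top_label` are assumptions for this illustration; the disclosure does not prescribe how the final label is selected from the probabilities.

```python
EMOTIONS = ["impatience", "sadness", "happiness", "tension"]
GENDERS = ["male", "female"]


def top_label(probs, labels):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Probabilities taken from the example in the text above.
emotion = top_label([0.0, 0.5, 0.2, 0.3], EMOTIONS)  # ("sadness", 0.5)
gender = top_label([0.9, 0.1], GENDERS)              # ("male", 0.9)
```

At inference time only the emotion result needs to be reported; the gender head mainly serves as an auxiliary training signal.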
The number of each type of layer in the model structure in the disclosed example may be set with the aim of improving the accuracy of emotion recognition. For example, if experiments show that providing two bidirectional LSTM layers improves the accuracy of the recognition result, two bidirectional LSTM layers may be provided. Likewise, if practical application shows that providing two convolutional layers and/or pooling layers improves the accuracy of the recognition result, two convolutional layers and/or pooling layers may be provided. Data processed by the first bidirectional LSTM layer can then be input into the second bidirectional LSTM layer for processing. Data processed by the first convolutional layer can be input into the second convolutional layer for processing, and data processed by the first pooling layer can be input into the second pooling layer for processing.
By learning gender information, the model structure provided by the embodiment of the disclosure can distinguish and discover emotional characteristics under different genders. The convolutional neural network layer improves the mining of local information, the bidirectional LSTM better captures long-term dependencies, and the self-attention layer allows recognition to weight information according to its importance. Meanwhile, in this embodiment, emotion recognition and gender recognition share most of the network, so that gender characteristics in the audio data are used when discriminating emotion, which increases the distinction between emotion characteristics and in turn improves the accuracy of the emotion recognition model provided by the embodiment of the disclosure.
An embodiment of the present disclosure further provides a model generation apparatus, as shown in fig. 7, including:
the recognition module 71 is configured to input audio data into a recognition model to be trained, and obtain a gender recognition result and an emotion recognition result output for the audio data;
and the training module 72 is configured to adjust the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain an emotion recognition model.
In one embodiment, as shown in fig. 8, the identification module comprises:
a first data unit 81, configured to acquire, according to audio data, first data in which emotion information and gender information are at least fused, by using a pre-processing layer of a recognition model to be trained;
the gender identification unit 82 is used for adopting a gender identification output layer of the identification model to be trained and obtaining a gender identification result according to the first data;
the emotion recognition unit 83 is used for acquiring an emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained;
the pre-processing layer is a data processing layer before the gender recognition output layer and the emotion recognition output layer.
In one embodiment, the first data unit is further configured to:
extracting frequency domain information of the audio data to obtain second data;
performing convolution and pooling calculation on the second data to obtain third data;
processing the third data by adopting a bidirectional long-short term memory neural network to obtain fourth data;
and performing self-attention weight calculation on the fourth data to obtain first data.
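The final step, the self-attention weight calculation on the fourth data, can be sketched as follows. This is a simplified, single-head scaled dot-product self-attention with identity query, key and value projections; a trained model would normally use learned projections and possibly multiple heads, and the function names are assumptions for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(frames[0])
    out = []
    for q in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        weights = softmax(scores)
        # Weighted sum of all frames: the attended representation of q.
        out.append([sum(w * v[j] for w, v in zip(weights, frames))
                    for j in range(d)])
    return out


# Two toy frames standing in for the BiLSTM output (hypothetical values).
fourth_data = [[1.0, 0.0], [0.0, 1.0]]
first_data = self_attention(fourth_data)
```

Each output frame is a convex combination of all input frames, with larger weights on the frames most similar to it, so frames that matter for the decision contribute more to the data passed to the output layers.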
In one embodiment, as shown in fig. 9, the model generation apparatus further includes:
the preprocessing module 91 is configured to preprocess the acquired original audio data to obtain a preprocessing result;
the enhancement module 92 is configured to perform data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;
the obtaining module 93 is configured to use the original audio data and the enhanced data as audio data.
In one embodiment, as shown in fig. 10, in the case where the data enhancement operation is a differential enhancement operation, the enhancement module includes:
an audio feature unit 101, configured to extract a target audio feature of original audio data;
and the differential enhancement unit 102 is configured to perform a differential enhancement operation on the target audio features to obtain enhanced data of the original audio data.
In one embodiment, as shown in FIG. 11, the pre-processing module includes at least one of:
a rate unit 111 for changing a play rate of the original audio data;
a reverberation unit 112 for adding reverberation to the original audio data;
a noise unit 113 for removing noise from the original audio data;
a time domain unit 114, configured to perform a time domain channel masking operation on original audio data;
a frequency domain unit 115, configured to perform a frequency domain channel masking operation on the original audio data.
In one embodiment, as shown in fig. 12, the training module comprises:
a loss value obtaining unit 121, configured to perform loss calculation based on the gender recognition result, the emotion recognition result, and the tagging result, and obtain a loss value;
and an adjusting unit 122, configured to adjust the recognition model to be trained according to the loss value.
In one embodiment, the labeling result comprises a gender labeling result and an emotion labeling result, and the loss value obtaining unit is further configured to:
performing cross entropy loss calculation based on the gender labeling result and the gender recognition result to obtain a first loss value;
performing cross entropy loss calculation based on the emotion labeling result and the emotion recognition result to obtain a second loss value;
and carrying out weighted summation on the first loss value and the second loss value, and taking the summation result as a loss value.
An embodiment of the present disclosure further provides an emotion recognition apparatus, as shown in fig. 13, including:
the emotion recognition module 131 is configured to input audio data to be recognized into a recognition model to obtain an emotion recognition result, where the recognition model is an emotion recognition model provided in any embodiment of the present disclosure.
According to this technical scheme, gender, one of the factors that has a large influence on emotion recognition, is introduced as an auxiliary classification task by means of multi-task learning, so that the accuracy of emotion feature extraction is further improved.
The embodiment of the disclosure can be applied to the technical field of computers, and especially can be applied to the technical fields of artificial intelligence, such as automatic driving, cloud computing, internet of things, big data, voice technology, intelligent search, information flow, deep learning and the like.
The functions of each unit, module or sub-module in each apparatus in the embodiments of the present disclosure may refer to the corresponding description in the above method embodiments, and are not described herein again.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 14 shows a schematic block diagram of an example electronic device 140 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 14, the electronic device 140 includes a computing unit 141, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)142 or a computer program loaded from a storage unit 148 into a Random Access Memory (RAM) 143. In the RAM 143, various programs and data required for the operation of the electronic apparatus 140 can also be stored. The calculation unit 141, the ROM 142, and the RAM 143 are connected to each other via a bus 144. An input/output (I/O) interface 145 is also connected to bus 144.
A number of components in the electronic device 140 are connected to the I/O interface 145, including: an input unit 146 such as a keyboard, a mouse, or the like; an output unit 147 such as various types of displays, speakers, and the like; a storage unit 148 such as a magnetic disk, optical disk, or the like; and a communication unit 149 such as a network card, modem, wireless communication transceiver, etc. The communication unit 149 allows the electronic device 140 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (21)
1. A model generation method, comprising:
inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result which are output for the audio data;
and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain an emotion recognition model.
2. The method of claim 1, wherein the inputting audio data into a recognition model to be trained, and the obtaining of the gender recognition result and the emotion recognition result output for the audio data comprises:
acquiring first data at least fusing emotion information and gender information according to the audio data by adopting a pre-processing layer of the recognition model to be trained;
acquiring a gender identification result according to the first data by adopting a gender identification output layer of the identification model to be trained;
and acquiring the emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained.
3. The method according to claim 2, wherein the obtaining of the first data fused with at least emotion information and gender information from the audio data comprises:
extracting frequency domain information of the audio data to obtain second data;
performing convolution and pooling calculation on the second data to obtain third data;
processing the third data by adopting a bidirectional long-short term memory neural network to obtain fourth data;
and performing self-attention weight calculation on the fourth data to obtain the first data.
4. The method of any of claims 1-3, wherein the method further comprises obtaining the audio data by:
preprocessing the acquired original audio data to obtain a preprocessing result;
performing data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;
and taking the original audio data and the enhancement data as the audio data.
5. The method of claim 4, wherein, in the case that the data enhancement operation is a differential enhancement operation, the performing the data enhancement operation on the pre-processing result to obtain the enhancement data of the original audio data comprises:
extracting target audio features of the original audio data;
and performing differential enhancement operation on the target audio features to obtain enhanced data of the original audio data.
6. The method according to claim 4 or 5, wherein the preprocessing the acquired original audio data to obtain a preprocessing result comprises:
performing at least one of the following operations on the raw audio data to pre-process the raw audio data:
changing the playing rate of the original audio data;
adding reverberation to the original audio data;
removing noise in the original audio data;
performing a time domain channel masking operation on the original audio data;
and performing a frequency domain channel masking operation on the original audio data.
7. The method according to any one of claims 1-6, wherein the adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result comprises:
performing loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain a loss value;
and adjusting the recognition model to be trained according to the loss value.
8. The method of claim 7, wherein the labeling result comprises a gender labeling result and an emotion labeling result, and the performing the loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain the loss value comprises:
performing cross entropy loss calculation based on the gender labeling result and the gender recognition result to obtain a first loss value;
performing cross entropy loss calculation based on the emotion labeling result and the emotion recognition result to obtain a second loss value;
and carrying out weighted summation on the first loss value and the second loss value, and taking the summation result as the loss value.
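The loss calculation of claim 8 can be sketched as below for a single sample. The sketch assumes that the recognition results are probability distributions over classes and that the labeling results are class indices; the weights 0.3 and 0.7 are hypothetical, since the claim does not state the weighting.

```python
import math


def cross_entropy(probs, label):
    """Cross entropy for one sample: negative log-probability of the labeled class."""
    return -math.log(probs[label])


def multitask_loss(gender_probs, gender_label, emotion_probs, emotion_label,
                   gender_weight=0.3, emotion_weight=0.7):
    """Weighted sum of the gender loss and the emotion loss (weights are assumed)."""
    first_loss = cross_entropy(gender_probs, gender_label)
    second_loss = cross_entropy(emotion_probs, emotion_label)
    return gender_weight * first_loss + emotion_weight * second_loss


# Uniform predictions over 2 gender classes and 4 emotion classes.
loss = multitask_loss([0.5, 0.5], 0, [0.25, 0.25, 0.25, 0.25], 2)
```

With uniform predictions the two terms are ln 2 and ln 4, so the weighted sum is 0.3 ln 2 + 0.7 ln 4 = 1.7 ln 2. Weighting the emotion term more heavily treats gender as an auxiliary task, which is a design choice of this sketch and not something the claim states.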
9. A method of emotion recognition, comprising:
inputting audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model in any one of claims 1-8.
10. A model generation apparatus comprising:
the recognition module is used for inputting audio data into a recognition model to be trained and acquiring a gender recognition result and an emotion recognition result which are output aiming at the audio data;
and the training module is used for adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain an emotion recognition model.
11. The apparatus of claim 10, wherein the recognition module comprises:
the first data unit is used for acquiring first data at least fusing emotion information and gender information according to the audio data by adopting a pre-processing layer of the recognition model to be trained;
the gender recognition unit is used for acquiring the gender recognition result according to the first data by adopting a gender recognition output layer of the recognition model to be trained;
and the emotion recognition unit is used for acquiring the emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained.
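The arrangement in claim 11, a shared first data vector feeding two separate output layers, can be sketched as follows. Each output layer is assumed here to be a single linear layer followed by a softmax; the class counts, weights and names are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weights, bias):
    """Apply one linear layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def recognize(first_data, gender_head, emotion_head):
    """Run both output layers on the same shared representation."""
    gender_probs = softmax(linear(first_data, *gender_head))
    emotion_probs = softmax(linear(first_data, *emotion_head))
    return gender_probs, emotion_probs


# Hypothetical shared representation and head parameters (weights, bias).
first_data = [0.5, -1.0]
gender_head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
emotion_head = ([[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]], [0.0, 0.0, 0.0])
gender_probs, emotion_probs = recognize(first_data, gender_head, emotion_head)
```

Because both heads read the same vector, gradients from the gender loss and the emotion loss both update the shared layers during training, which is how the first data comes to fuse the two kinds of information.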
12. The apparatus of claim 11, wherein the first data unit is further to:
extracting frequency domain information of the audio data to obtain second data;
performing convolution and pooling calculation on the second data to obtain third data;
processing the third data by adopting a bidirectional long-short term memory neural network to obtain fourth data;
and performing self-attention weight calculation on the fourth data to obtain the first data.
13. The apparatus of any one of claims 10-12, wherein the apparatus further comprises:
the preprocessing module is used for preprocessing the acquired original audio data to obtain a preprocessing result;
the enhancement module is used for carrying out data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;
and the acquisition module is used for taking the original audio data and the enhanced data as the audio data.
14. The apparatus of claim 13, wherein, in a case that the data enhancement operation is a differential enhancement operation, the enhancement module comprises:
the audio characteristic unit is used for extracting target audio characteristics of the original audio data;
and the differential enhancement unit is used for performing differential enhancement operation on the target audio features to acquire enhanced data of the original audio data.
15. The apparatus of claim 13 or 14, wherein the pre-processing module comprises at least one of:
a rate unit for changing a play rate of the original audio data;
a reverberation unit for adding reverberation to the original audio data;
a noise unit for removing noise in the original audio data;
a time domain unit, configured to perform a time domain channel masking operation on the original audio data;
and a frequency domain unit, configured to perform a frequency domain channel masking operation on the original audio data.
16. The apparatus of any of claims 10-15, wherein the training module comprises:
a loss value acquisition unit for performing loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to acquire a loss value;
and the adjusting unit is used for adjusting the recognition model to be trained according to the loss value.
17. The apparatus of claim 16, wherein the labeling result comprises a gender labeling result and an emotion labeling result, and the loss value acquisition unit is further configured to:
performing cross entropy loss calculation based on the gender labeling result and the gender recognition result to obtain a first loss value;
performing cross entropy loss calculation based on the emotion labeling result and the emotion recognition result to obtain a second loss value;
and carrying out weighted summation on the first loss value and the second loss value, and taking the summation result as the loss value.
18. An emotion recognition apparatus comprising:
an emotion recognition module, configured to input audio data to be recognized into a recognition model to obtain an emotion recognition result, where the recognition model is the emotion recognition model according to any one of claims 10 to 17.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111000230.4A CN113724697A (en) | 2021-08-27 | 2021-08-27 | Model generation method, emotion recognition method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111000230.4A CN113724697A (en) | 2021-08-27 | 2021-08-27 | Model generation method, emotion recognition method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113724697A (en) | 2021-11-30 |
Family
ID=78678770
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111000230.4A Pending CN113724697A (en) | 2021-08-27 | 2021-08-27 | Model generation method, emotion recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113724697A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117473304A (en) * | 2023-12-28 | 2024-01-30 | 天津大学 | Multi-mode image labeling method and device, electronic equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110021308A (en) * | 2019-05-16 | 2019-07-16 | 北京百度网讯科技有限公司 | Voice mood recognition methods, device, computer equipment and storage medium |
CN110189754A (en) * | 2019-05-29 | 2019-08-30 | 腾讯科技(深圳)有限公司 | Voice interactive method, device, electronic equipment and storage medium |
CN110688499A (en) * | 2019-08-13 | 2020-01-14 | 深圳壹账通智能科技有限公司 | Data processing method, data processing device, computer equipment and storage medium |
CN110858234A (en) * | 2018-08-24 | 2020-03-03 | 中移(杭州)信息技术有限公司 | Method and device for pushing information according to human emotion |
CN110928997A (en) * | 2019-12-04 | 2020-03-27 | 北京文思海辉金信软件有限公司 | Intention recognition method and device, electronic equipment and readable storage medium |
CN111161733A (en) * | 2019-12-31 | 2020-05-15 | 中国银行股份有限公司 | Control method and device for intelligent voice service |
CN112216307A (en) * | 2019-07-12 | 2021-01-12 | 华为技术有限公司 | Speech emotion recognition method and device |
CN112926525A (en) * | 2021-03-30 | 2021-06-08 | 中国建设银行股份有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
CN112927723A (en) * | 2021-04-20 | 2021-06-08 | 东南大学 | High-performance anti-noise speech emotion recognition method based on deep neural network |
2021-08-27: CN application CN202111000230.4A (patent/CN113724697A/en), active, Pending
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
US20190266998A1 (en) | Speech recognition method and device, computer device and storage medium | |
Kandali et al. | Emotion recognition from Assamese speeches using MFCC features and GMM classifier | |
CN107633851B (en) | Discrete speech emotion recognition method, device and system based on emotion dimension prediction | |
Yeh et al. | Segment-based emotion recognition from continuous Mandarin Chinese speech | |
CN107871499B (en) | Speech recognition method, system, computer device and computer-readable storage medium | |
Gharavian et al. | Emotion recognition improvement using normalized formant supplementary features by hybrid of DTW-MLP-GMM model | |
CN108091340B (en) | Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium | |
Chenchah et al. | A bio-inspired emotion recognition system under real-life conditions | |
Thirumuru et al. | Novel feature representation using single frequency filtering and nonlinear energy operator for speech emotion recognition | |
CN113823323A (en) | Audio processing method and device based on convolutional neural network and related equipment | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
Nawas et al. | Speaker recognition using random forest | |
CN113724697A (en) | Model generation method, emotion recognition method, device, equipment and storage medium | |
Alshamsi et al. | Automated speech emotion recognition on smart phones | |
Raghib et al. | Emotion analysis and speech signal processing | |
CN111429919A (en) | Anti-sound crosstalk method based on conference recording system, electronic device and storage medium | |
Win et al. | Emotion recognition system of noisy speech in real world environment | |
Tyagi et al. | Emotion extraction from speech using deep learning | |
Zhu et al. | A robust and lightweight voice activity detection algorithm for speech enhancement at low signal-to-noise ratio | |
CN113327596A (en) | Training method of voice recognition model, voice recognition method and device | |
Mangalam et al. | Emotion Recognition from Mizo Speech: A Signal Processing Approach | |
Chen et al. | A new learning scheme of emotion recognition from speech by using mean fourier parameters | |
Anila et al. | Emotion recognition using continuous density HMM | |
Rashmi et al. | Training based noise removal technique for a speech-to-text representation model |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||