CN113724697A - Model generation method, emotion recognition method, device, equipment and storage medium

Info

Publication number: CN113724697A
Application number: CN202111000230.4A
Authority: CN
Inventors: 赵情恩
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2021-11-30

Abstract

The disclosure provides a model generation method, an emotion recognition method, corresponding apparatuses, an electronic device and a storage medium, and relates to artificial intelligence fields such as speech technology and natural language processing. The specific implementation scheme is as follows: inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result output for the audio data; and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and a labeling result to obtain an emotion recognition model. The embodiments of the disclosure can improve the accuracy of emotion recognition.

Description

Model generation method, emotion recognition method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, further to artificial intelligence technologies such as speech technology and natural language processing, and in particular to a model generation method, an emotion recognition method, corresponding apparatuses, an electronic device, and a storage medium.

Background

With the development of computer technology, people can realize more new functions through computers, and emotion recognition is one of the functions.

Emotion recognition by a computer includes determining an emotional state by detecting externally observable information. Because emotion is a complex human psychological activity, improving the accuracy of emotion recognition remains a problem that requires long-term attention.

Disclosure of Invention

The disclosure provides a model generation method, an emotion recognition method, corresponding apparatuses, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a model generation method including:

inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result output for the audio data;

and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain the emotion recognition model.

According to another aspect of the present disclosure, there is provided an emotion recognition method including:

inputting the audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model provided by any one of the embodiments of the disclosure.

According to another aspect of the present disclosure, there is provided a model generation apparatus including:

the recognition module is used for inputting the audio data into a recognition model to be trained and acquiring a gender recognition result and an emotion recognition result output for the audio data;

and the training module is used for adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain the emotion recognition model.

According to another aspect of the present disclosure, there is provided an emotion recognition apparatus including:

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.

According to the technology of the present disclosure, gender information is incorporated during emotion recognition, so that the accuracy of emotion recognition results can be improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram of a model generation method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a model generation method according to another embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a model generation method according to yet another embodiment of the present disclosure;

FIG. 4 is a schematic diagram of an emotion recognition method according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a model generation method according to an example of the present disclosure;

FIG. 6 is a schematic diagram of a recognition model to be trained according to an example of the present disclosure;

FIG. 7 is a schematic diagram of a model generation apparatus according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a model generation apparatus according to another embodiment of the present disclosure;

FIG. 9 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;

FIG. 11 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;

FIG. 12 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;

FIG. 13 is a schematic diagram of a model generation apparatus according to yet another embodiment of the present disclosure;

FIG. 14 is a block diagram of an electronic device for implementing a model generation method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

An embodiment of the present disclosure first provides a model generation method, as shown in FIG. 1, including:

step S11: inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result output for the audio data;

step S12: and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain the emotion recognition model.

In this embodiment, the audio data may be any audio data containing human or animal sounds. For example, it may be audio data of human speech, audio data of singing, or other audio data generated by human voice.

The audio data may include sound from one speaker, or sound from two or more different speakers.

The audio data may include audio produced in a single language, and may also include audio produced in multiple languages, such as audio of a speaker speaking Chinese, audio of speech in a foreign language, and audio of speech mixing Chinese and a foreign language.

Acquiring the gender recognition result and the emotion recognition result output for the audio data includes outputting, for the voice of at least one speaker in the audio data, a recognition result regarding the gender of the speaker and a recognition result regarding the emotion of the speaker.

The gender recognition result may include the probabilities that the speaker of the audio data belongs to each gender, for example, a probability of 60% that the speaker is male and a probability of 40% that the speaker is female. The gender recognition result may also be a single definite gender per speaker, such as the first speaker being male and the second speaker being female.

The emotion recognition result may include the probabilities that the speaker of the audio data is in each emotional state, for example, a probability of 80% of being in a happy state and a probability of 20% of being in a sad state. The emotion recognition result may also be a single definite emotion, such as the speaker being happy, or a combination of emotions, such as the speaker being both happy and nervous.

In another implementation, if the emotion recognition result includes two completely conflicting emotions, one of the two may be selected for output; if the emotion recognition result includes two or more emotions that may occur together, such as distraction and relaxation, those emotions may be output simultaneously.
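
As a non-limiting illustration, the handling of conflicting and compatible emotions described above may be sketched as follows. The conflict table `CONFLICTS`, the function name `select_emotions` and the threshold of 0.3 are assumptions introduced for this sketch only; the disclosure does not specify them.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical table of mutually exclusive emotion pairs (an assumption).
CONFLICTS: Set[FrozenSet[str]] = {frozenset({"happy", "sad"})}


def select_emotions(probs: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Keep emotions at or above the threshold, dropping the weaker of any conflicting pair."""
    # Visit candidates from most to least probable.
    candidates = sorted(
        (name for name, p in probs.items() if p >= threshold),
        key=lambda name: probs[name],
        reverse=True,
    )
    selected: List[str] = []
    for name in candidates:
        # Skip an emotion that conflicts with a stronger one already selected.
        if any(frozenset({name, kept}) in CONFLICTS for kept in selected):
            continue
        selected.append(name)
    return selected


result = select_emotions({"happy": 0.5, "sad": 0.35, "relaxed": 0.4})
print(result)  # ['happy', 'relaxed']
```

In this sketch, "sad" is dropped because it conflicts with the more probable "happy", while "relaxed" is output together with "happy" because the two are not listed as conflicting.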

In this embodiment, acquiring the gender recognition result and the emotion recognition result output for the audio data may include: acquiring intermediate data of gender recognition and intermediate data of emotion recognition for the audio data; acquiring the emotion recognition result for the audio data based on the intermediate data of gender recognition and the intermediate data of emotion recognition; and acquiring the gender recognition result for the audio data based on the intermediate data of gender recognition.

In this embodiment, the labeling result may include a gender labeling result and an emotion labeling result.

The gender labeling result may be male or female.

In other embodiments, the gender identification result and the gender labeling result may be divided into more sub-categories according to the information such as age under the gender category, such as male children, male adolescents, female adolescents, etc.

The emotion labeling result may be at least one of a plurality of preset emotions, such as joy, anger, sorrow and happiness.

In this embodiment, adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result, and the labeling result may include: optimizing or adjusting the recognition model to be trained based on the gender recognition result and the gender-related labeling result; and optimizing or adjusting the recognition model to be trained based on the emotion recognition result and the labeling result about the emotion.

Speech is an important carrier of emotion in human communication. Speech recognition focuses on what the speaker says, whereas emotion recognition focuses on how the speaker says it. People express themselves differently in different emotional states, and the features present in the audio they produce differ accordingly. For example, a happy person may speak in a more cheerful tone, and an irritable person in a duller tone. In general, the emotion carried by speech can be determined from information such as speaking rate, tone and volume, and the development of deep learning has accelerated progress in detecting emotion from speech.

However, recognizing a speaker's emotion from audio still has shortcomings, because the emotion in the audio is closely related to who the speaker is. Different listeners may perceive different emotions in the same audio, and factors such as cultural differences make emotion recognition from audio more difficult. Research shows that, even when speaking rate, tone and volume are the same, the emotion expressed in the voices of people of different genders can differ. The way speech emotion is expressed usually differs between genders: a woman's anger may differ in tone from a man's anger, and the same holds for other emotions such as excitement, fear and sadness. Specifically, an utterance with the same tone and content may convey one mood when uttered by a speaker of one gender and a different mood when uttered by a speaker of another gender.

Therefore, in this embodiment, gender information in the audio data is taken into account when designing the features and the model network. A gender recognition result and an emotion recognition result can be obtained from the audio data. The recognition model to be trained is then trained or adjusted according to the gender recognition result and the emotion recognition result it outputs and the labeling results for gender and emotion, to obtain the emotion recognition model. Because gender information is combined in the emotion recognition process, the information referenced in emotion recognition is increased, and the accuracy of the emotion recognition result is improved.

In one embodiment, inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result output for the audio data, as shown in FIG. 2, includes:

step S21: acquiring, by a pre-processing layer of the recognition model to be trained, first data that fuses at least emotion information and gender information according to the audio data;

step S22: acquiring the gender recognition result according to the first data by a gender recognition output layer of the recognition model to be trained;

step S23: obtaining an emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained;

the pre-processing layer is a data processing layer in front of a gender recognition output layer and an emotion recognition output layer in the recognition model to be trained.

In this embodiment, step S22 and step S23 may be executed simultaneously, or may be executed sequentially according to any order.

In the present embodiment, the portion of the recognition model for obtaining the emotion recognition result and the portion for obtaining the gender recognition result may be shared except for the final output layer. For example, the input layer, the data processing layer between the input layer and the output layer may be a common data processing layer for emotion recognition processing and gender recognition processing.

In this embodiment, the pre-processing layer may not include an input layer.

In this embodiment, the data processing layer for recognizing emotion information of the audio data and the data processing layer for recognizing gender information of the audio data may be shared, so that gender information in the audio can be sufficiently considered when recognizing emotion.
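
As a non-limiting illustration, a shared data processing layer followed by separate gender and emotion output layers may be sketched as follows. The single tanh layer, the weight values and the names `forward`, `shared_w`, `gender_w` and `emotion_w` are assumptions made for this sketch; the disclosure does not fix the layer types or sizes at this point.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply matrix m (rows of weights) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Sequence[float], shared_w: Matrix, gender_w: Matrix,
            emotion_w: Matrix) -> Tuple[List[float], List[float]]:
    """Shared layer produces the first data; two output layers read the same first data."""
    first_data = [math.tanh(h) for h in matvec(shared_w, x)]
    gender_probs = softmax(matvec(gender_w, first_data))    # gender recognition output layer
    emotion_probs = softmax(matvec(emotion_w, first_data))  # emotion recognition output layer
    return gender_probs, emotion_probs


# Hypothetical weights: 3 input features, 2 shared units, 2 genders, 3 emotions.
shared_w = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
gender_w = [[0.6, -0.4], [-0.6, 0.4]]
emotion_w = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.5]]
gender_probs, emotion_probs = forward([0.5, 1.0, -0.3], shared_w, gender_w, emotion_w)
print(gender_probs, emotion_probs)
```

Because both output layers read the same first data, any adjustment of the shared weights driven by the gender output also changes the features seen by the emotion output.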

In one embodiment, acquiring the first data that fuses at least emotion information and gender information according to the audio data, as shown in FIG. 3, includes:

step S31: extracting frequency domain information of the audio data to obtain second data;

step S32: performing convolution and pooling calculation on the second data to obtain third data;

step S33: processing the third data by a bidirectional Long Short-Term Memory (LSTM) neural network to obtain fourth data;

step S34: performing self-attention weight calculation on the fourth data to obtain the first data.

In this embodiment, the frequency domain information may be the frequency spectrum of the sound, which helps the audio data to be recognized more intuitively. The frequency spectrum is short for spectral density and can be represented by a distribution curve over frequency. A complex oscillation can be decomposed into harmonic oscillations of different amplitudes and frequencies, and the pattern of the amplitudes of these harmonic oscillations arranged by frequency is called the frequency spectrum. The frequency spectrum thus represents the audio signal as frequency domain information.
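
As a non-limiting illustration, the decomposition of a signal into harmonic components may be sketched with a plain discrete Fourier transform. The function name `magnitude_spectrum` and the test signal are assumptions made for this sketch; the disclosure does not specify which transform the frequency spectrum extraction uses.

```python
import cmath
import math
from typing import List


def magnitude_spectrum(signal: List[float]) -> List[float]:
    """Return |X[k]| / n for k = 0 .. n/2 using a direct (O(n^2)) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) / n)
    return spectrum


# A signal made of two harmonics: 4 cycles (amplitude 1.0) and 10 cycles (amplitude 0.5).
n = 64
signal = [math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
spec = magnitude_spectrum(signal)
peaks = sorted(range(len(spec)), key=lambda k: spec[k], reverse=True)[:2]
print(peaks)  # [4, 10]
```

The two largest spectral values fall at the bins of the two harmonics, which is the amplitude-versus-frequency pattern referred to above.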

Performing convolution and pooling calculation on the second data to obtain third data may include first performing convolution calculation on the second data and then performing pooling calculation to obtain the third data; or first performing pooling calculation on the second data and then performing convolution calculation to obtain the third data.
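
As a non-limiting illustration, the two orderings of convolution and pooling may be compared on a one-dimensional toy sequence. The one-dimensional simplification, the averaging kernel, the max pooling and the names `conv1d` and `max_pool1d` are assumptions made for this sketch; the disclosure does not specify kernel sizes or the pooling type.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode sliding dot product (cross-correlation, as used in neural networks)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]


def max_pool1d(x: List[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


second_data = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 0.0]
kernel = [0.5, 0.5]

conv_then_pool = max_pool1d(conv1d(second_data, kernel), 2)
pool_then_conv = conv1d(max_pool1d(second_data, 2), kernel)
print(conv_then_pool)  # [2.5, 4.5, 5.0]
print(pool_then_conv)  # [4.0, 5.5]
```

The two orderings generally yield third data of different length and content, so the choice between them is a design decision of the model.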

The frequency domain information of the audio data may be extracted by a frequency spectrum extraction layer of the recognition model to be trained.

In this embodiment, the processing of the third data by using the bidirectional long-short term memory neural network may include performing forward calculation processing and performing backward calculation processing on the third data by using the bidirectional long-short term memory neural network.
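
As a non-limiting illustration, forward and backward processing by a bidirectional LSTM may be sketched with a scalar hidden state. The scalar simplification, the parameter values and the names `lstm_pass` and `bilstm` are assumptions made for this sketch; a practical model would use vector-valued states and learned parameters.

```python
import math
from typing import Dict, List, Tuple

Gate = Tuple[float, float, float]  # (input weight, recurrent weight, bias)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_pass(seq: List[float], params: Dict[str, Gate]) -> List[float]:
    """Run a scalar LSTM over seq and return the hidden state at every step."""
    h, c, outputs = 0.0, 0.0, []
    for x in seq:
        pre = {name: wx * x + wh * h + b for name, (wx, wh, b) in params.items()}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = math.tanh(pre["g"])
        c = f * c + i * g        # forget part of the old cell state, add the new candidate
        h = o * math.tanh(c)     # expose a gated view of the cell state
        outputs.append(h)
    return outputs


def bilstm(seq: List[float], fwd: Dict[str, Gate], bwd: Dict[str, Gate]) -> List[Tuple[float, float]]:
    """Pair the forward state with the backward state at each time step."""
    forward = lstm_pass(seq, fwd)
    backward = lstm_pass(seq[::-1], bwd)[::-1]  # process reversed input, then realign
    return list(zip(forward, backward))


# Hypothetical parameters, reused for both directions only to keep the sketch short.
params = {"i": (0.5, 0.1, 0.0), "f": (0.3, 0.2, 1.0), "o": (0.4, 0.1, 0.0), "g": (1.0, 0.5, 0.0)}
fourth_data = bilstm([0.2, 0.9, 0.2], params, params)
print(fourth_data)
```

Each element of the output combines context from the frames before it (forward pass) and the frames after it (backward pass).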

Performing self-attention weight calculation on the fourth data may include calculating on the fourth data with a self-attention mechanism layer of the recognition model to be trained, so that relatively higher weights are given to more important information in the fourth data.
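
As a non-limiting illustration, self-attention weighting over a sequence of frames may be sketched with scaled dot-product attention. Using the frames themselves as queries, keys and values (that is, omitting learned projection matrices) and the name `self_attention` are assumptions made for this sketch; the disclosure does not specify the attention formulation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(z: Sequence[float]) -> List[float]:
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: List[List[float]]) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (weighted outputs, attention weights) for a sequence of equal-length frames."""
    dim = len(seq[0])
    outputs, weights = [], []
    for query in seq:
        # Similarity of this frame to every frame, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in seq]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all frames.
        outputs.append([sum(w * value[i] for w, value in zip(row, seq)) for i in range(dim)])
    return outputs, weights


fourth_data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first_data, attn = self_attention(fourth_data)
print(attn[0])
```

For the first frame, the frames with a larger dot product against it receive larger weights than the orthogonal second frame, which is the sense in which more relevant information is weighted more heavily.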

In this embodiment, rich data about gender and emotion in the audio data can be obtained through frequency domain information extraction, the bidirectional long short-term memory neural network and the self-attention mechanism, so that emotion can be judged more accurately.

Meanwhile, when the frequency spectrum extraction layer, the convolution layer, the pooling layer, the bidirectional long short-term memory neural network and the self-attention layer of the recognition model to be trained process the audio data, the gender information and the emotion information in the audio can be processed and retained at the same time, so that the subsequent emotion recognition result is related to the gender recognition result.

In one embodiment, the model generation method further comprises obtaining audio data by:

preprocessing the acquired original audio data to obtain a preprocessing result;

performing data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;

the original audio data and the enhancement data are used as audio data.

In this embodiment, the preprocessing may include operations such as noise reduction, background sound filtering, volume adjustment, sound clipping, and the like.

The pre-processing result may be pre-processed raw audio data.

The raw audio data may be audio data captured with an audio recording device, or may be the audio portion of video data.

Performing the data enhancement operation on the preprocessing result may be data amplification of the preprocessing result: without substantially adding new data, it produces an effect roughly equivalent to adding more training data, so that limited data yields the training effect of a larger data set and the coverage of the data used to train the model is improved.

In this embodiment, the data enhancement may be supervised data enhancement, that is, data amplification performed on the basis of existing data by using preset data transformation rules. The data enhancement may also be unsupervised data enhancement. In addition, data enhancement may include single-sample data enhancement and multi-sample data enhancement, and may also include enhancement by generating new data and enhancement by learning an enhancement strategy.

In this embodiment, the original audio data and the enhancement data of the original audio data are used as audio data, so that the limited original audio data can be fully utilized, and the utilization rate of the training set is improved.

In one embodiment, in a case that the data enhancement operation is a differential enhancement operation, performing the data enhancement operation on the pre-processing result to obtain enhanced data of the original audio data includes:

extracting target audio features of original audio data;

and performing differential enhancement operation on the target audio features to obtain enhanced data of the original audio data.

In this embodiment, the target audio features of the original audio data may be primary audio features extracted by an early data processing layer of the recognition model to be trained. Specifically, they may include audio features that can be extracted directly from the original audio data, and may also include audio features extracted from processed (for example, noise-reduced) original audio data.

In this embodiment, the primary audio features may be related to the waveform of the audio data and may be an equivalent representation of the speech data, carrying various information such as emotion, gender and language. Different primary audio features (or front-end features) may be obtained by trying different feature combinations, such as MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction coefficients), or features of different dimensionality, such as 40 to 80 dimensions.

Performing a differential enhancement operation on the target audio feature may include performing partial differential enhancement and full differential enhancement on the target audio feature.

In the embodiment, the difference enhancement can be performed according to the original audio data, so that the data volume of the audio is increased, and the use effect of the limited original audio data in training of the recognition model to be trained is improved.
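
As a non-limiting illustration, one possible reading of differential enhancement is to append first-order and second-order time differences (often called delta features) to each feature frame. This interpretation, the zero padding of the first frame and the names `delta` and `add_deltas` are assumptions made for this sketch; the disclosure does not define the operation in detail.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time frame


def delta(frames: Frames) -> Frames:
    """First-order difference along time; the first frame's difference is set to zero."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def add_deltas(frames: Frames) -> Frames:
    """Concatenate each frame with its first-order and second-order differences."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


features = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
enhanced = add_deltas(features)
print(enhanced[2])  # [4.0, 8.0, 2.0, 4.0, 1.0, 2.0]
```

Under this reading, each frame grows from its original dimensionality to three times that size, so the model also sees how the features change over time.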

In one embodiment, preprocessing the acquired original audio data to obtain a preprocessing result includes:

performing at least one of the following operations on the raw audio data to pre-process the raw audio data:

changing the playing rate of the original audio data;

adding reverberation to the original audio data;

removing noise in the original audio data;

performing a time domain channel masking operation on the original audio data;

performing a frequency domain channel masking operation on the original audio data.

In this embodiment, the change of the playing rate of the original audio data may be to increase or decrease the rate of the original audio.

Adding reverberation to the original audio data may be adding the effect of sound being reflected or absorbed to the original audio.

Removing noise in the original audio data may be removing background sounds in the original audio data, removing non-background sounds irrelevant to emotion analysis, or extracting the sounds to be analyzed from the original audio data as the preprocessed original audio data.

Performing the time domain channel masking operation on the original audio data may be canceling part of the signal of the original audio data on the time domain channel.

Performing the frequency domain channel masking operation on the original audio data may be canceling part of the signal of the original audio data on the frequency domain channel.

In this embodiment, the original audio is preprocessed to a certain extent, so that one piece of original audio data is converted into multiple pieces of original audio data, and the utilization rate of the original audio data is improved.
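
As a non-limiting illustration, three of the operations above (changing the playing rate, time domain masking and frequency domain masking) may be sketched as follows. The linear-interpolation resampling, the zero fill value, the layout of the spectrogram as a list of time frames and the function names are assumptions made for this sketch.

```python
from typing import List

Spectrogram = List[List[float]]  # rows are time frames, columns are frequency bins


def change_rate(samples: List[float], rate: float) -> List[float]:
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the audio."""
    out = []
    for i in range(int(len(samples) / rate)):
        pos = i * rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def time_mask(spec: Spectrogram, start: int, width: int) -> Spectrogram:
    """Zero out `width` consecutive time frames beginning at `start`."""
    return [[0.0] * len(f) if start <= t < start + width else list(f)
            for t, f in enumerate(spec)]


def freq_mask(spec: Spectrogram, start: int, width: int) -> Spectrogram:
    """Zero out `width` consecutive frequency bins beginning at `start` in every frame."""
    return [[0.0 if start <= k < start + width else v for k, v in enumerate(f)]
            for f in spec]


samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(change_rate(samples, 2.0))  # [0.0, 2.0, 4.0, 6.0]

spec = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(time_mask(spec, 1, 1))  # [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [7.0, 8.0, 9.0]]
print(freq_mask(spec, 0, 2))  # [[0.0, 0.0, 3.0], [0.0, 0.0, 6.0], [0.0, 0.0, 9.0]]
```

Each call returns a new variant and leaves its input unchanged, so one original recording can yield several training samples.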

In one embodiment, adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result comprises:

performing loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain a loss value;

and adjusting the recognition model to be trained according to the loss value.

In this embodiment, performing loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain the loss value may be done in at least one of the following ways:

acquiring a first loss value based on the gender recognition result and the gender labeling result, and taking the first loss value as the loss value;

acquiring a second loss value based on the emotion recognition result and the emotion labeling result, and taking the second loss value as the loss value;

acquiring a third loss value based on the emotion recognition result, the emotion labeling result and the gender labeling result, and taking the third loss value as the loss value;

acquiring a fourth loss value based on the gender recognition result, the gender labeling result and the emotion labeling result, and taking the fourth loss value as the loss value;

acquiring a first loss value based on the gender recognition result and the gender labeling result, acquiring a second loss value based on the emotion recognition result and the emotion labeling result, and obtaining the loss value based on the first loss value and the second loss value.

In the embodiment, the loss calculation can be performed according to the gender recognition result, the emotion recognition result and the labeling result, so that when the model is optimized or adjusted, the model can be optimized according to the dual recognition results of gender and emotion, and the model can learn the optimal parameters for recognizing gender. Meanwhile, according to the gender recognition optimal parameter and the emotion recognition result, the model can learn the optimal parameter for recognizing emotion, and the emotion recognition accuracy is improved.

In one embodiment, the labeling result includes a gender labeling result and an emotion labeling result, and performing loss calculation based on the gender recognition result, the emotion recognition result and the labeling result to obtain the loss value includes:

performing cross entropy loss calculation based on the gender labeling result and the gender recognition result to obtain a first loss value;

performing cross entropy loss calculation based on the emotion labeling result and the emotion recognition result to obtain a second loss value;

and performing weighted summation on the first loss value and the second loss value, and taking the summation result as the loss value.

In this embodiment, the cross entropy loss calculation based on the gender labeling result and the gender recognition result may be performed by applying a cross entropy loss function to the gender recognition result and the gender labeling result.

Likewise, the cross entropy loss calculation based on the emotion labeling result and the emotion recognition result may be performed by applying a cross entropy loss function to the emotion recognition result and the emotion labeling result.

In some embodiments, the cross-entropy loss may be a typical cross-entropy loss function; it may also be an extended type cross entropy loss function derived from a typical cross entropy loss function; or may be a two-class cross entropy loss function or a multi-class cross entropy loss function.

In other embodiments, the emotion recognition result and the gender recognition result can be converted into a fusion result, and cross entropy loss calculation is performed according to the fusion result and the labeling result.

In this embodiment, after the loss values are calculated for the gender identification result and the emotion identification result, the total loss value is calculated according to the respective weights, so that the emotion identification information and the gender identification information can be fused in the loss values, and the accuracy of the identification result is improved.
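
As a non-limiting illustration, the weighted summation of the two cross entropy losses may be sketched as follows. The function names, the integer class labels and the default weights of 0.5 are assumptions made for this sketch; the example probabilities reuse the 60%/40% and 80%/20% figures given earlier for illustration.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross entropy for one sample: negative log probability of the labeled class."""
    return -math.log(probs[label])


def total_loss(gender_probs: Sequence[float], gender_label: int,
               emotion_probs: Sequence[float], emotion_label: int,
               alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted sum of the emotion loss (weight alpha) and the gender loss (weight beta)."""
    first_loss = cross_entropy(gender_probs, gender_label)      # gender
    second_loss = cross_entropy(emotion_probs, emotion_label)   # emotion
    return alpha * second_loss + beta * first_loss


# Gender output: 60% male, 40% female, labeled male (class 0).
# Emotion output: 80% happy, 20% sad, labeled happy (class 0).
loss = total_loss([0.6, 0.4], 0, [0.8, 0.2], 0)
print(round(loss, 3))  # 0.367
```

Raising the gender weight relative to the emotion weight makes gender errors contribute more to the parameter updates, and lowering it does the opposite.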

An embodiment of the present disclosure further provides an emotion recognition method, as shown in FIG. 4, including:

step S41: inputting the audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model provided by any one of the embodiments of the disclosure.

The embodiment of the disclosure can be applied to various scenes such as intelligent robot conversation, intelligent teaching, intelligent customer service, psychological analysis, mobile service and the like. For example, a psychologist may record a patient's voice during a consultation and may subsequently design a treatment plan based on the mood expressed in the audio data obtained from the patient's voice.

In an example of the present disclosure, gender information in the audio data is used to assist the emotion recognition process. In particular, the emotions in the recognition result may include basic emotions and may also include other, more complex emotions encountered in life, such as anger, joy, neutrality, sadness, rage, irritability, annoyance, worry, fear, surprise and timidity; they may also be determined by the audio data set used to train the model. In a specific example, as shown in FIG. 5, the model generation method may include the following steps:

Step S51: constructing an audio data set for training the recognition model, and labeling the audio data set with emotion and gender.

In this example, a certain amount (for example, 100,000 items) of raw audio data is collected to construct the audio data set, and the raw audio data can be labeled with emotion information and gender information.

Step S52: preprocessing the raw audio data in the audio data set. Specifically, preliminary processing such as removing noise and silence and framing may be performed first, and data enhancement, for example adding reverberation or changing the rate, may then be performed on the preliminarily processed data.

In the process of preprocessing the raw audio data, clean raw audio data can be obtained by removing information such as environmental noise, busy tones and ring-back tones.
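
As a non-limiting illustration, framing and silence removal during the preliminary processing may be sketched as follows. The energy-threshold criterion for silence, the frame length, the hop size and the function names are assumptions made for this sketch; the disclosure does not specify how silence is detected.

```python
from typing import List


def frame_signal(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split samples into frames of frame_len, advancing by hop; a partial tail is dropped."""
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, hop)]


def drop_silence(frames: List[List[float]], threshold: float) -> List[List[float]]:
    """Keep only frames whose mean energy exceeds the threshold."""
    return [f for f in frames if sum(s * s for s in f) / len(f) > threshold]


# Toy recording: silence, a short burst of sound, silence.
recording = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.0] * 4
frames = frame_signal(recording, frame_len=4, hop=4)
voiced = drop_silence(frames, threshold=0.01)
print(len(frames), len(voiced))  # 3 1
```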

Step S53: and performing primary audio feature extraction on the preprocessed original audio data. The primary audio features may include Pitch frequency (Pitch), mel-frequency cepstrum (MFCC), PLP, FBank (Filter Bank), FFT (Fast Fourier transform) features, and the like.

Step S54: perform differential enhancement on the primary audio features, and use the differentially enhanced primary audio features together with the original audio data as the audio data.

Specifically, second-order differential enhancement may be applied to the primary audio features, and the data may be shuffled (Shuffle).
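A minimal sketch of the differential enhancement and shuffling in step S54 follows. It assumes the features are a list of per-frame vectors and uses a plain frame-to-frame difference; practical systems often use a regression-based delta over several neighbouring frames instead, and the disclosure does not say which variant is intended. The names `delta`, `add_second_order_deltas`, and `shuffled` are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")
Frame = List[float]


def delta(features: List[Frame]) -> List[Frame]:
    """First-order difference along time; the first frame's delta is zero."""
    if not features:
        return []
    out = [[0.0] * len(features[0])]
    for prev, cur in zip(features, features[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def add_second_order_deltas(features: List[Frame]) -> List[Frame]:
    """Append first- and second-order differences to every frame."""
    d1 = delta(features)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(features, d1, d2)]


def shuffled(items: Sequence[T], seed: int) -> List[T]:
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out
```

Each frame triples in width, so a feature of dimension d becomes 3d after enhancement under this sketch.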

Step S55: input the audio data into the recognition model to be trained.

Batch training is performed in the recognition model to be trained: forward computation proceeds layer by layer on the audio data, progressively mapping the original audio data and the primary audio features into high-level features.

Step S56: calculate the loss value from the output data of the output layer.

At the final output layer of the recognition model to be trained, a loss is calculated for each target, namely the emotion recognition result and the gender recognition result.

Specifically, the degree to which gender influences emotion discrimination can be controlled through the coefficients of the gender loss and the emotion loss, for example with the following formula:

L = α × L_emotion + β × L_gender

where α and β are weighting coefficients and may, for example, both be set to 0.5. L_emotion is the emotion recognition loss value; it may be calculated from the emotion recognition result and the emotion labeling result, or from the emotion recognition result, the emotion labeling result, and the gender labeling result. L_gender is the gender recognition loss value; it may be calculated from the gender recognition result and the gender labeling result.
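The weighted combination of the two losses can be sketched as follows. The sketch assumes both heads output probability vectors and that both losses are cross-entropy losses against integer labels; the disclosure names cross entropy for the gender loss, while its use for the emotion loss here is an assumption. The function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], label: int, eps: float = 1e-12) -> float:
    """Negative log-probability of the labeled class (clamped to avoid log(0))."""
    return -math.log(max(probs[label], eps))


def multitask_loss(
    emotion_probs: Sequence[float],
    emotion_label: int,
    gender_probs: Sequence[float],
    gender_label: int,
    alpha: float = 0.5,
    beta: float = 0.5,
) -> float:
    """L = alpha * L_emotion + beta * L_gender, with 0.5 / 0.5 as default weights."""
    l_emotion = cross_entropy(emotion_probs, emotion_label)
    l_gender = cross_entropy(gender_probs, gender_label)
    return alpha * l_emotion + beta * l_gender
```

Raising β relative to α makes the shared layers attend more to gender cues; setting β to 0 reduces the objective to plain emotion classification.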

After the loss calculation is finished, the parameters of the entire recognition model to be trained are updated by backpropagation, and training iterates for multiple rounds until convergence, yielding an emotion recognition model with stable performance.

Step S57: test the model.

When the emotion recognition model is tested, the gender output need not participate in the calculation; after preprocessing, the audio is input into the network to obtain the final emotion classification result.

In an example of the present disclosure, the model structure of the recognition model to be trained may be as shown in fig. 6, including: a spectrum extraction layer (Spectrum Extraction) 61, a convolutional neural network (CNN) layer 62, a max pooling layer (Max Pooling) 63, a bidirectional long short-term memory layer 64, and a self-attention layer (Self-Attention Module) 65. In addition, the recognition model to be trained may further include an emotion output layer (Emotion Output Layer) 66 and a gender output layer (Gender Output Layer) 67, which together constitute the output layers of the recognition model to be trained. The emotion recognition process and the gender recognition process share all networks in the recognition model to be trained except the output layers.

The parameters of the spectrum extraction layer 61 may include the feature dimension, feature num = dspec. The parameters of the convolutional layer 62 may include the number of filters, filter num = dcnn, the stride (strides), and the window size, window size = ncw; the stride may take a value of 1. The parameters of the max pooling layer 63 may include the stride, npw, and the window size, which may be equal to the stride, that is, window size = npw. The parameters of the bidirectional long short-term memory layer 64 may include the hidden size, dlstm. The parameters of the self-attention layer 65 may include the hidden size (head dimension) and the number of heads, head num = nattn. The dimension of the information output by the spectrum extraction layer 61 may be dspec, that of the convolutional layer may be dcnn, that of the pooling layer may be dcnn, that of the bidirectional long short-term memory layer may be 2dlstm, and that of the self-attention layer may be 2nattn × dlstm.
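The dimension bookkeeping above can be checked with a short sketch. The per-layer feature dimensions follow the values stated in the text; the time-axis calculation additionally assumes an unpadded ("valid") convolution and non-overlapping pooling, which the disclosure does not state. The function names and the example sizes are hypothetical.

```python
from typing import Dict, Tuple


def layer_output_dims(d_spec: int, d_cnn: int, d_lstm: int, n_attn: int) -> Dict[str, int]:
    """Feature dimension produced by each shared layer, as listed in the text."""
    return {
        "spectrum": d_spec,
        "cnn": d_cnn,
        "max_pool": d_cnn,                     # pooling keeps the channel count
        "bilstm": 2 * d_lstm,                  # forward and backward states concatenated
        "self_attention": 2 * n_attn * d_lstm, # n_attn heads over the 2*d_lstm output
    }


def time_steps_after_conv_pool(t: int, n_cw: int, n_pw: int, stride: int = 1) -> Tuple[int, int]:
    """Sequence length after convolution and pooling (no padding assumed)."""
    after_conv = (t - n_cw) // stride + 1  # valid convolution with window n_cw
    after_pool = after_conv // n_pw        # pooling window equals pooling stride
    return after_conv, after_pool
```

For instance, with dspec = 80, dcnn = 64, dlstm = 128 and nattn = 4, the self-attention output dimension is 2 × 4 × 128 = 1024.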

In this example, the output result of the emotion output layer 66 may be the probabilities that the audio data contains various emotions; for example, the probabilities for the four emotions of impatience, sadness, happiness and tension may be {0.0, 0.5, 0.2, 0.3}, respectively. The output result of the gender output layer 67 may be the probabilities that the speaker of the audio data is of each gender; for example, the probabilities that the speaker is male or female may be {0.9, 0.1}.
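A minimal sketch of how such probability outputs could be produced and read is given below. It assumes each output layer emits raw scores that are normalized with a softmax, which the disclosure does not state explicitly; the names `softmax` and `top_label` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalize raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest probability (first one on ties)."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]
```

With the example values from the text, `top_label([0.0, 0.5, 0.2, 0.3], ["impatience", "sadness", "happiness", "tension"])` selects "sadness", and `top_label([0.9, 0.1], ["male", "female"])` selects "male".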

The number of each type of layer in the model structure in the disclosed example may be set with the aim of improving the accuracy of emotion recognition. For example, if experiments show that two bidirectional LSTM layers improve the accuracy of the recognition result, two bidirectional LSTM layers may be provided. Likewise, if practical application shows that two convolutional layers and/or two pooling layers improve the accuracy of the recognition result, two convolutional layers and/or pooling layers may be provided. In that case, data processed by the first bidirectional LSTM layer may be input into the second bidirectional LSTM layer for processing; data processed by the first convolutional layer may be input into the second convolutional layer for processing; and data processed by the first pooling layer may be input into the second pooling layer for processing.

By learning gender information, the model structure provided by the embodiment of the disclosure can distinguish and discover emotional characteristics under different genders. The convolutional neural network layer improves the mining of local information, the bidirectional LSTM better captures long-term dependency information, and the self-attention layer allows recognition to weight information according to its importance. Meanwhile, in this embodiment, emotion recognition and gender recognition share most of the network, so the gender characteristics in the audio data are used when discriminating emotion, which increases the separation between emotional characteristics. The emotion recognition model provided by the embodiment of the disclosure thereby improves emotion recognition accuracy.

An embodiment of the present disclosure further provides a model generation apparatus, as shown in fig. 7, including:

the recognition module 71 is configured to input audio data into a recognition model to be trained, and obtain a gender recognition result and an emotion recognition result output for the audio data;

and the training module 72 is configured to adjust the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain an emotion recognition model.

In one embodiment, as shown in fig. 8, the identification module comprises:

a first data unit 81, configured to acquire, according to audio data, first data in which emotion information and gender information are at least fused, by using a pre-processing layer of a recognition model to be trained;

the gender identification unit 82 is used for adopting a gender identification output layer of the identification model to be trained and obtaining a gender identification result according to the first data;

the emotion recognition unit 83 is used for acquiring an emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained;

the pre-processing layer is a data processing layer before the gender recognition output layer and the emotion recognition output layer.

In one embodiment, the first data unit is further configured to:

extracting frequency domain information of the audio data to obtain second data;

performing convolution and pooling calculation on the second data to obtain third data;

processing the third data by adopting a bidirectional long-short term memory neural network to obtain fourth data;

and performing self-attention weight calculation on the fourth data to obtain first data.
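The final self-attention weight calculation can be illustrated with a single-head sketch. It assumes scaled dot-product attention with identity query, key and value projections over a sequence of equal-length vectors; the disclosure does not specify the attention variant, and a trained layer would use learned projections and several heads. The name `self_attention` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def _softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    if not seq:
        return []
    d = len(seq[0])
    out = []
    for q in seq:
        # Attention weights of this position over every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = _softmax(scores)
        # Weighted sum of the value vectors (here the inputs themselves).
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out
```

Each output vector is a convex combination of the input vectors, so positions that resemble the query contribute more to the result.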

In one embodiment, as shown in fig. 9, the model generation apparatus further includes:

the preprocessing module 91 is configured to preprocess the acquired original audio data to obtain a preprocessing result;

the enhancement module 92 is configured to perform data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;

the obtaining module 93 is configured to use the original audio data and the enhanced data as audio data.

In one embodiment, as shown in fig. 10, in the case where the data enhancement operation is a differential enhancement operation, the enhancement module includes:

an audio feature unit 101, configured to extract a target audio feature of original audio data;

and the differential enhancement unit 102 is configured to perform a differential enhancement operation on the target audio features to obtain enhanced data of the original audio data.

In one embodiment, as shown in FIG. 11, the pre-processing module includes at least one of:

a rate unit 111 for changing a play rate of the original audio data;

a reverberation unit 112 for adding reverberation to the original audio data;

a noise unit 113 for removing noise from the original audio data;

a time domain unit 114, configured to perform a time domain channel masking operation on original audio data;

a frequency domain unit 115, configured to perform a frequency domain channel masking operation on the original audio data.
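The time domain and frequency domain masking operations can be sketched on a time-frequency representation, in the style of spectrogram augmentation. The sketch assumes the input is a list of frames, each a list of frequency-bin values, and that masked entries are set to zero; the disclosure applies masking to the original audio data without giving these details, so the representation, the fill value, and the function names are assumptions.

```python
from typing import List

Spectrogram = List[List[float]]


def mask_time(spec: Spectrogram, start: int, width: int) -> Spectrogram:
    """Return a copy with frames start .. start+width-1 zeroed out."""
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(spec)
    ]


def mask_freq(spec: Spectrogram, start: int, width: int) -> Spectrogram:
    """Return a copy with frequency bins start .. start+width-1 zeroed in every frame."""
    return [
        [0.0 if start <= k < start + width else v for k, v in enumerate(frame)]
        for frame in spec
    ]
```

Both functions return new lists and leave the input unchanged, so the unmasked original can still be kept alongside the enhanced copy.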

In one embodiment, as shown in fig. 12, the training module comprises:

a loss value obtaining unit 121, configured to perform loss calculation based on the gender recognition result, the emotion recognition result, and the tagging result, and obtain a loss value;

and an adjusting unit 122, configured to adjust the recognition model to be trained according to the loss value.

In one embodiment, the annotation result comprises a gender annotation result and an emotion annotation result, and the loss value unit is further configured to:

performing cross entropy loss calculation based on the gender marking result and the gender identification result to obtain a first loss value;

An embodiment of the present disclosure further provides an emotion recognition apparatus, as shown in fig. 13, including:

the emotion recognition module 131 is configured to input audio data to be recognized into a recognition model to obtain an emotion recognition result, where the recognition model is an emotion recognition model provided in any embodiment of the present disclosure.

According to the technical scheme of the embodiments of the disclosure, gender, one of the factors with a large influence on emotion recognition, is introduced as an auxiliary classification task through multi-task learning, which further improves the accuracy of emotion feature extraction.

The embodiment of the disclosure can be applied to the technical field of computers, and especially can be applied to the technical fields of artificial intelligence, such as automatic driving, cloud computing, internet of things, big data, voice technology, intelligent search, information flow, deep learning and the like.

The functions of each unit, module or sub-module in each apparatus in the embodiments of the present disclosure may refer to the corresponding description in the above method embodiments, and are not described herein again.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 14 shows a schematic block diagram of an example electronic device 140 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 14, the electronic device 140 includes a computing unit 141, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)142 or a computer program loaded from a storage unit 148 into a Random Access Memory (RAM) 143. In the RAM 143, various programs and data required for the operation of the electronic apparatus 140 can also be stored. The calculation unit 141, the ROM 142, and the RAM 143 are connected to each other via a bus 144. An input/output (I/O) interface 145 is also connected to bus 144.

A number of components in the electronic device 140 are connected to the I/O interface 145, including: an input unit 146 such as a keyboard, a mouse, or the like; an output unit 147 such as various types of displays, speakers, and the like; a storage unit 148 such as a magnetic disk, optical disk, or the like; and a communication unit 149 such as a network card, modem, wireless communication transceiver, etc. The communication unit 149 allows the electronic device 140 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

Computing unit 141 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 141 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 141 performs the respective methods and processes described above, such as the model generation method. For example, in some embodiments, the model generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 148. In some embodiments, part or all of the computer program may be loaded and/or installed onto electronic device 140 via ROM 142 and/or communications unit 149. When loaded into RAM 143 and executed by computing unit 141, a computer program may perform one or more steps of the model generation method described above. Alternatively, in other embodiments, the computing unit 141 may be configured to perform the model generation method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A model generation method, comprising:

inputting audio data into a recognition model to be trained, and acquiring a gender recognition result and an emotion recognition result which are output aiming at the audio data;

and adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain an emotion recognition model.

2. The method of claim 1, wherein the inputting audio data into a recognition model to be trained, and the obtaining of the gender recognition result and the emotion recognition result output for the audio data comprises:

acquiring first data at least fusing emotion information and gender information according to the audio data by adopting a pre-processing layer of the recognition model to be trained;

acquiring a gender identification result according to the first data by adopting a gender identification output layer of the identification model to be trained;

and acquiring the emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained.

3. The method according to claim 2, wherein the obtaining of the first data fused with at least emotion information and gender information from the audio data comprises:

and performing self-attention weight calculation on the fourth data to obtain the first data.

4. The method of any of claims 1-3, wherein the method further comprises obtaining the audio data by:

and taking the original audio data and the enhancement data as the audio data.

5. The method of claim 4, wherein, in the case that the data enhancement operation is a differential enhancement operation, the performing the data enhancement operation on the pre-processing result to obtain the enhancement data of the original audio data comprises:

extracting target audio features of the original audio data;

6. The method according to claim 4 or 5, wherein the preprocessing the acquired original audio data to obtain a preprocessing result comprises:

changing the playing rate of the original audio data;

adding reverberation to the original audio data;

removing noise in the original audio data;

carrying out time domain channel masking operation on the original audio data;

and carrying out frequency domain channel masking operation on the original audio data.

7. The method according to any one of claims 1-6, wherein the adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result comprises:

and adjusting the recognition model to be trained according to the loss value.

8. The method of claim 7, wherein the labeling result comprises a gender labeling result and an emotion labeling result, and the performing the loss calculation based on the gender identification result, the emotion identification result and the labeling result to obtain the loss value comprises:

and carrying out weighted summation on the first loss value and the second loss value, and taking the summation result as the loss value.

9. A method of emotion recognition, comprising:

inputting audio data to be recognized into a recognition model to obtain an emotion recognition result, wherein the recognition model is the emotion recognition model in any one of claims 1-8.

10. A model generation apparatus comprising:

the recognition module is used for inputting audio data into a recognition model to be trained and acquiring a gender recognition result and an emotion recognition result which are output aiming at the audio data;

and the training module is used for adjusting the recognition model to be trained based on the gender recognition result, the emotion recognition result and the labeling result to obtain an emotion recognition model.

11. The apparatus of claim 10, wherein the identification module comprises:

the first data unit is used for acquiring first data at least fusing emotion information and gender information according to the audio data by adopting a pre-processing layer of the recognition model to be trained;

the gender identification unit is used for adopting a gender identification output layer of the identification model to be trained and obtaining the gender identification result according to the first data;

and the emotion recognition unit is used for acquiring the emotion recognition result according to the first data by adopting an emotion recognition output layer of the recognition model to be trained.

12. The apparatus of claim 11, wherein the first data unit is further to:

13. The apparatus of any one of claims 10-12, wherein the apparatus further comprises:

the preprocessing module is used for preprocessing the acquired original audio data to obtain a preprocessing result;

the enhancement module is used for carrying out data enhancement operation on the preprocessing result to obtain enhanced data of the original audio data;

and the acquisition module is used for taking the original audio data and the enhanced data as the audio data.

14. The apparatus of claim 13, wherein, in a case that the data enhancement operation is a differential enhancement operation, the enhancement module comprises:

the audio characteristic unit is used for extracting target audio characteristics of the original audio data;

and the differential enhancement unit is used for performing differential enhancement operation on the target audio features to acquire enhanced data of the original audio data.

15. The apparatus of claim 13 or 14, wherein the pre-processing module comprises at least one of:

a rate unit for changing a play rate of the original audio data;

a reverberation unit for adding reverberation to the original audio data;

a noise unit for removing noise in the original audio data;

a time domain unit, configured to perform a time domain channel masking operation on the original audio data;

and the frequency domain unit is used for carrying out frequency domain channel masking operation on the original audio data.

16. The apparatus of any of claims 10-15, wherein the training module comprises:

a loss value acquisition unit for performing loss calculation based on the gender identification result, the emotion identification result and the labeling result to acquire a loss value;

and the adjusting unit is used for adjusting the recognition model to be trained according to the loss value.

17. The apparatus of claim 16, wherein the annotation result comprises a gender annotation result and an emotion annotation result, and the loss value unit is further configured to:

18. An emotion recognition apparatus comprising:

an emotion recognition module, configured to input audio data to be recognized into a recognition model to obtain an emotion recognition result, where the recognition model is the emotion recognition model according to any one of claims 10 to 17.

19. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.

21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.