CN114333786A - Speech emotion recognition method and related device, electronic equipment and storage medium


Info

Publication number: CN114333786A
Application number: CN202111363984.6A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 石周 (Shi Zhou), 高天 (Gao Tian), 方昕 (Fang Xin)
Current assignee: iFlytek Co Ltd
Original assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Legal status: Pending


Abstract

The application discloses a speech emotion recognition method, a related device, electronic equipment and a storage medium. The speech emotion recognition method includes: acquiring speech to be recognized; and recognizing the speech to be recognized by using an emotion recognition network to obtain the emotion category of the speech to be recognized. The emotion recognition network is contained in a joint model that further includes a domain recognition network. The joint model is obtained through joint training based on the emotion classification loss of the emotion recognition network on a first sample speech belonging to a first data domain category and the domain classification losses of the domain recognition network on the first sample speech and on a second sample speech, where the second sample speech belongs to a second data domain category and the first sample speech is labeled with a sample emotion category. With this scheme, the accuracy of speech emotion recognition can be improved even when sample data with accurate emotion category labels is scarce.

Description

Speech emotion recognition method and related device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech emotion recognition method, a related apparatus, an electronic device, and a storage medium.
Background
Speech emotion recognition refers to recognizing the instantaneous emotional state (i.e., emotion category) of a speaker from the speaker's speech, such as, but not limited to: happy, sad, angry, frightened, disgust, fear, etc. Speech emotion recognition plays an increasingly prominent role in industries and scenarios that involve interacting with people, such as human-computer interaction, medical care, education, and call centers.
With the rapid development of deep learning, speech emotion recognition using network models has gradually become one of the mainstream techniques. In general, accurate recognition by a network model relies on large amounts of labeled sample data. However, due to factors such as the scarcity of speech emotion data in real environments and the subjectivity of emotion, sample data with accurate emotion category labels is rare, which severely limits the accuracy of the network model and thus affects speech emotion recognition. In view of this, how to improve the accuracy of speech emotion recognition when sample data with accurate emotion category labels is scarce has become an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a speech emotion recognition method, a related apparatus, an electronic device and a storage medium, which can improve the accuracy of speech emotion recognition when sample data with accurate emotion category labels is scarce.
In order to solve the above technical problem, a first aspect of the present application provides a speech emotion recognition method, including: acquiring speech to be recognized; and recognizing the speech to be recognized by using an emotion recognition network to obtain the emotion category of the speech to be recognized. The emotion recognition network is contained in a joint model that further includes a domain recognition network. The joint model is obtained through joint training based on the emotion classification loss of the emotion recognition network on a first sample speech belonging to a first data domain category and the domain classification losses of the domain recognition network on the first sample speech and on a second sample speech, where the second sample speech belongs to a second data domain category and the first sample speech is labeled with a sample emotion category.
In order to solve the above technical problem, a second aspect of the present application provides a speech emotion recognition apparatus, including a speech acquisition module and an emotion recognition module. The speech acquisition module is used for acquiring speech to be recognized; the emotion recognition module is used for recognizing the speech to be recognized by using an emotion recognition network to obtain the emotion category of the speech to be recognized. The emotion recognition network is contained in a joint model that further includes a domain recognition network. The joint model is obtained through joint training based on the emotion classification loss of the emotion recognition network on a first sample speech belonging to a first data domain category and the domain classification losses of the domain recognition network on the first sample speech and on a second sample speech, where the second sample speech belongs to a second data domain category and the first sample speech is labeled with a sample emotion category.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech emotion recognition method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being configured to implement the speech emotion recognition method in the first aspect.
In the above scheme, speech to be recognized is acquired and recognized using an emotion recognition network to obtain its emotion category. The emotion recognition network is contained in a joint model that further includes a domain recognition network, and the joint model is obtained through joint training based on the emotion classification loss of the emotion recognition network on a first sample speech belonging to a first data domain category and the domain classification losses of the domain recognition network on the first sample speech and on a second sample speech, where the second sample speech belongs to a second data domain category and the first sample speech is labeled with a sample emotion category. On the one hand, because the joint model contains the emotion recognition network and supervises the emotion classification loss of the first sample speech labeled with a sample emotion category, the first sample speech can be used for supervised training of the network model. On the other hand, by introducing the unlabeled and plentiful second sample speech, and because the joint model further includes a domain recognition network that supervises the domain classification losses of the first sample speech and the second sample speech, the network model is encouraged to process speech data from the two data domains indistinguishably, so that the second sample speech assists the first sample speech in unsupervised training of the network model. By combining the advantages of supervised and unsupervised training, the network model can be trained collaboratively, and the accuracy of speech emotion recognition can be improved even when sample data with accurate emotion category labels is scarce.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of a speech emotion recognition method of the present application;
FIG. 2 is a block diagram of an embodiment of the joint model of the present application;
FIG. 3 is a block diagram of one embodiment of a speech feature extraction subnetwork;
FIG. 4 is a block diagram of an embodiment of an image feature extraction network;
FIG. 5 is a block diagram of an embodiment of the speech emotion recognition apparatus of the present application;
FIG. 6 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship. Further, the term "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flowchart illustrating a speech emotion recognition method according to an embodiment of the present application. Specifically, the method may include the steps of:
step S11: and acquiring the voice to be recognized.
In one implementation scenario, as described below, the emotion category of the speech to be recognized is recognized using an emotion recognition network, the emotion recognition network is included in a joint model, and the joint model is jointly trained based on a first sample speech belonging to a first data domain category and a second sample speech belonging to a second data domain category. On this basis, the speech to be recognized may belong to either of the first data domain category and the second data domain category, which is not limited herein.
In a specific implementation scenario, the data domain category may be defined by the data source. For example, speech data from mobile phone calls and speech data from instant messaging software may be considered to belong to different data domain categories, and speech data from video calls and speech data from live recordings may likewise be considered to belong to different data domain categories; further examples are not enumerated herein.
In a specific implementation scenario, the specific data domain categories of the first data domain category and the second data domain category may be set according to the actual application. For example, speech data from video calls can be accurately labeled with emotion categories by referring to the video frames, while speech data from the Internet is generally plentiful and easier to obtain. Therefore, speech data from video calls may serve as the first sample speech and speech data from the Internet may serve as the second sample speech, in which case the first data domain category is video call and the second data domain category is Internet. It should be noted that the above description of the first sample speech, the second sample speech, the first data domain category and the second data domain category is only one possible case in practical applications, and practical applications are not limited to this example.
Step S12: and recognizing the voice to be recognized by utilizing the emotion recognition network to obtain the emotion type of the voice to be recognized.
In the embodiment of the disclosure, the emotion recognition network is included in a joint model, and the joint model further includes a domain recognition network. The joint model is obtained through joint training based on the emotion classification loss of the emotion recognition network on a first sample speech belonging to a first data domain category and the domain classification losses of the domain recognition network on the first sample speech and on a second sample speech, where the second sample speech belongs to a second data domain category and the first sample speech is labeled with a sample emotion category. On this basis, after the joint model converges in training, the emotion recognition network in the joint model can be obtained and used for speech emotion recognition.
In one implementation scenario, the sample emotion categories labeled on the first sample speech may include, but are not limited to, happy, sad, angry, frightened, disgust, and fear, which are not limited herein. Further, as previously described, the second sample speech need not be labeled with any emotion-category label.
In an implementation scenario, the emotion classification loss may represent the accuracy of emotion recognition performed on the voice data by the emotion recognition network, for example, the greater the emotion classification loss, the lower the accuracy of emotion recognition performed by the emotion recognition network, and conversely, the smaller the emotion classification loss, the higher the accuracy of emotion recognition performed by the emotion recognition network. In the process of joint training, emotion classification loss can be minimized, so that the accuracy of emotion recognition on voice data by an emotion recognition network is improved.
In one implementation scenario, the domain classification loss may characterize the accuracy of the domain recognition performed on the speech data by the domain recognition network; for example, the greater the domain classification loss, the lower the accuracy of the domain recognition, and conversely, the smaller the domain classification loss, the higher the accuracy. During joint training, the domain classification loss can be maximized so that the network model processes speech data from the two data domains without distinction. In this way, the unlabeled and plentiful second sample speech can compensate for the negative impact that the scarcity of labeled first sample speech has on model training, and the network model can be trained collaboratively by combining the advantages of supervised and unsupervised training.
In one implementation scenario, please refer to FIG. 2, which is a block diagram of an embodiment of the joint model of the present application. As shown in FIG. 2, the emotion recognition network and the domain recognition network share an emotion feature extraction sub-network. On this basis, the emotion feature extraction sub-network may be used to extract emotion features from the first sample speech and the second sample speech, respectively, to obtain a first emotion feature of the first sample speech and a second emotion feature of the second sample speech. Emotion category prediction is then performed based on the first emotion feature to obtain a predicted emotion category of the first sample speech, and domain category prediction is performed based on the first emotion feature and the second emotion feature, respectively, to obtain a first predicted domain category to which the first sample speech belongs and a second predicted domain category to which the second sample speech belongs. A first loss, a second loss and a third loss are obtained based on the difference between the first predicted domain category and the first data domain category, the difference between the second predicted domain category and the second data domain category, and the difference between the predicted emotion category and the sample emotion category, respectively. A total loss is then obtained based on the first loss, the second loss and the third loss, and the network parameters of the joint model are adjusted based on the total loss. In this way, the emotion classification loss of the first sample speech and the domain classification losses of the first and second sample speech are supervised simultaneously, which helps improve the model performance of the network model.
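To make the training procedure above concrete, the following is a minimal PyTorch sketch of one joint training step, assuming simple stand-in modules; the module names, feature sizes, optimizer settings and the weighting factor lambda_domain are illustrative assumptions, not values taken from this description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, EMB_DIM, NUM_EMOTIONS = 40, 128, 6  # illustrative sizes

# Stand-ins for the sub-networks described above; the real networks are
# TDNN/pooling based (see the later sketches).
emotion_extractor = nn.Sequential(nn.Linear(FEAT_DIM, EMB_DIM), nn.ReLU())
emotion_classifier = nn.Linear(EMB_DIM, NUM_EMOTIONS)
domain_classifier = nn.Linear(EMB_DIM, 2)

params = (list(emotion_extractor.parameters())
          + list(emotion_classifier.parameters())
          + list(domain_classifier.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3)

def joint_training_step(x1, y_emotion, x2, lambda_domain=1.0):
    # x1: labeled first-domain batch, x2: unlabeled second-domain batch
    f1 = emotion_extractor(x1)            # first emotion features
    f2 = emotion_extractor(x2)            # second emotion features

    # third loss: emotion classification loss on the labeled first sample speech
    third_loss = F.cross_entropy(emotion_classifier(f1), y_emotion)

    # first/second losses: domain classification losses on the two domains
    d1 = domain_classifier(f1)
    d2 = domain_classifier(f2)
    first_loss = F.cross_entropy(d1, torch.zeros(len(x1), dtype=torch.long))
    second_loss = F.cross_entropy(d2, torch.ones(len(x2), dtype=torch.long))

    # first/second losses enter negatively, third loss positively (see above)
    total_loss = third_loss - lambda_domain * (first_loss + second_loss)

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

In practice, a gradient reversal layer is a common alternative to subtracting the domain losses directly, so that only the shared feature extractor is driven to confuse the two domains; this description does not prescribe either choice.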
In a specific implementation scenario, as shown in FIG. 2, the emotion feature extraction sub-network may include a speech feature extraction sub-network and a speech emotion encoding sub-network, where the speech emotion encoding sub-network is configured to perform emotion feature encoding based on speech features to obtain emotion features. Specifically, taking the first sample speech as an example, acoustic feature extraction may first be performed on the first sample speech to obtain first sample acoustic features of several first sample audio frames. Illustratively, the acoustic features may include, but are not limited to, SIFT (Scale Invariant Feature Transform) features and the like, which are not limited herein. On this basis, the speech feature extraction sub-network may be used to extract speech features from the first sample acoustic features of the several first sample audio frames to obtain the first sample speech features, and the speech emotion encoding sub-network may then perform emotion feature encoding on the first sample speech features of the several first sample audio frames to obtain the first emotion feature. The speech feature extraction sub-network may include, but is not limited to, a Time-Delay Neural Network (TDNN), a Long Short-Term Memory network (LSTM), and the like; its network structure is not limited herein. Referring to FIG. 3, which is a schematic diagram of an embodiment of the speech feature extraction sub-network, taking a TDNN-based sub-network as an example, the speech feature extraction sub-network may include N sequentially connected TDNN layers (e.g., 5 or 6 layers), which take acoustic features such as SIFT as input and output the speech features of each audio frame. The speech emotion encoding sub-network may include, but is not limited to, a statistical pooling layer and a fully connected layer: the statistical pooling layer computes the first- and second-order statistics (i.e., mean and standard deviation) of the first sample speech features of the several first sample audio frames in the time dimension, and the fully connected layer processes the result to obtain the first emotion feature. It should be noted that the second emotion feature of the second sample speech may be extracted in the same way as the first emotion feature of the first sample speech described above, and the processing of the statistical pooling layer can refer to its technical details; neither is repeated here.
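To make the structure just described concrete, here is a minimal PyTorch sketch of an emotion feature extraction sub-network (TDNN layers modeled as dilated 1-D convolutions over time, followed by statistical pooling and a fully connected layer); all dimensions and layer counts are illustrative assumptions rather than values taken from this description.

```python
import torch
import torch.nn as nn

class EmotionFeatureExtractor(nn.Module):
    """Sketch: TDNN speech feature extraction sub-network followed by a
    statistical pooling layer and a fully connected layer (the speech
    emotion encoding sub-network)."""
    def __init__(self, acoustic_dim=40, hidden_dim=512, emotion_dim=128, num_tdnn=5):
        super().__init__()
        layers, in_dim = [], acoustic_dim
        for i in range(num_tdnn):
            # TDNN layers modeled as dilated 1-D convolutions over the frame axis
            layers += [nn.Conv1d(in_dim, hidden_dim, kernel_size=3,
                                 dilation=i + 1, padding=i + 1), nn.ReLU()]
            in_dim = hidden_dim
        self.tdnn = nn.Sequential(*layers)
        self.fc = nn.Linear(2 * hidden_dim, emotion_dim)

    def forward(self, acoustic):                    # acoustic: (batch, frames, acoustic_dim)
        h = self.tdnn(acoustic.transpose(1, 2))     # (batch, hidden_dim, frames)
        mean, std = h.mean(dim=2), h.std(dim=2)     # statistical pooling over time
        return self.fc(torch.cat([mean, std], dim=1))  # emotion feature
```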
In one specific implementation scenario, with continued reference to FIG. 2, the emotion recognition network may further include an emotion classification sub-network, and the emotion classification sub-network is used to perform emotion category prediction. The emotion classification sub-network may include, but is not limited to, a fully connected layer, which is not limited herein. On this basis, the first emotion feature can be input into the emotion classification sub-network to obtain the predicted emotion category of the first sample speech. Specifically, the emotion classification sub-network may output prediction probability values that the first sample speech belongs to a plurality of preset emotion categories (e.g., happy, sad, angry, frightened, disgust, fear, etc.), and the preset emotion category corresponding to the largest prediction probability value may be taken as the predicted emotion category of the first sample speech. On this basis, based on the sample emotion category of the first sample speech, the prediction probability values of the first sample speech over the preset emotion categories are processed with a loss function such as cross entropy to obtain the third loss; the specific calculation can refer to the technical details of loss functions such as cross entropy and is not repeated here. In this manner, the emotion recognition network further includes an emotion classification sub-network used to perform emotion category prediction, which helps improve emotion prediction efficiency.
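As a small illustration of the emotion classification sub-network and the third loss, assuming the fully connected classifier, feature size and labels below (all illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PRESET_EMOTIONS = ["happy", "sad", "angry", "frightened", "disgust", "fear"]
emotion_classifier = nn.Linear(128, len(PRESET_EMOTIONS))  # a single fully connected layer

first_emotion_feature = torch.randn(4, 128)    # batch of first emotion features
sample_emotion = torch.tensor([0, 2, 1, 5])    # labeled sample emotion categories

logits = emotion_classifier(first_emotion_feature)
probs = logits.softmax(dim=1)                  # prediction probability values per preset category
predicted_emotion = probs.argmax(dim=1)        # preset category with the largest probability
third_loss = F.cross_entropy(logits, sample_emotion)  # third loss (cross entropy)
```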
In one specific implementation scenario, with continued reference to FIG. 2, the domain recognition network may further include a domain classification sub-network, and the domain classification sub-network is configured to perform domain category prediction. The domain classification sub-network may include, but is not limited to, a fully connected layer, which is not limited herein. On this basis, the first emotion feature can be input into the domain classification sub-network to obtain the first predicted domain category of the first sample speech. Specifically, the domain classification sub-network may output the prediction probability values that the first sample speech belongs to the first data domain category and the second data domain category, respectively, and the data domain category corresponding to the largest prediction probability value may be taken as the first predicted domain category of the first sample speech. On this basis, based on the first data domain category of the first sample speech, the prediction probability values of the first sample speech over the two data domain categories are processed with a loss function such as binary cross entropy to obtain the first loss; the specific calculation can refer to the technical details of loss functions such as cross entropy and is not repeated here. The calculation of the second loss is similar to that of the first loss and is not repeated here. In this manner, the domain recognition network further includes a domain classification sub-network used to perform domain category prediction, which helps improve domain category prediction efficiency.
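A corresponding sketch for the domain classification sub-network and the first and second losses, using binary cross entropy as mentioned above; the single-logit head and batch sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

domain_classifier = nn.Linear(128, 1)   # fully connected layer emitting one domain logit

f1 = torch.randn(4, 128)    # first emotion features  (first data domain, label 0)
f2 = torch.randn(16, 128)   # second emotion features (second data domain, label 1)

first_loss = F.binary_cross_entropy_with_logits(
    domain_classifier(f1).squeeze(1), torch.zeros(f1.size(0)))
second_loss = F.binary_cross_entropy_with_logits(
    domain_classifier(f2).squeeze(1), torch.ones(f2.size(0)))
```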
In a specific implementation scenario, with continued reference to FIG. 2, the domain recognition network may include a first recognition network for performing domain recognition on the first sample speech and a second recognition network for performing domain recognition on the second sample speech. The first recognition network and the second recognition network have the same network structure and network parameters, that is, their network structures and network parameters remain consistent before and after each adjustment. In this case, the first sample speech is input into the first recognition network to obtain the first predicted domain category, and the second sample speech is input into the second recognition network to obtain the second predicted domain category. In addition, as shown in FIG. 2, the first recognition network and the emotion recognition network may share the emotion feature extraction sub-network; that is, the first recognition network and the emotion recognition network may each include the aforementioned speech feature extraction sub-network and speech emotion encoding sub-network. In this way, sharing the same network structure and even the same network parameters within the joint model can improve the data processing efficiency of the joint model.
In a specific implementation scenario, the first loss and the second loss are each negatively correlated with the total loss, and the third loss is positively correlated with the total loss. That is, the larger the first loss and the second loss, the smaller the total loss, and the smaller the third loss, the smaller the total loss. Therefore, by taking minimizing the total loss as the model optimization objective, the model becomes unable, as its recognition accuracy improves, to distinguish the first emotion feature of the first sample speech from the second emotion feature of the second sample speech; that is, the feature data extracted from speech data of different data domains tend toward the same data distribution, so that the model can accurately recognize emotion across different data domain categories.
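The sign relations above can be summarized compactly as follows, where the weighting factor $\lambda$ is an illustrative assumption not specified in this description (the fourth and fifth losses introduced later enter with positive sign):

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{3}
  - \lambda\,\bigl(\mathcal{L}_{1} + \mathcal{L}_{2}\bigr),
\qquad \lambda > 0 .
```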
In a specific implementation scenario, the specific manner of parameter adjustment may refer to technical details of an optimization manner such as gradient descent, which is not described herein again.
In one implementation scenario, to further improve model accuracy, the first sample speech may be separated from a sample video, and sample face images may also be separated from the sample video. For example, a sample image sequence containing several sample face images may be separated from the sample video. On this basis, the sample face images can assist model training so that face image information enhances the expression of emotion features. In addition, referring to FIG. 2, the joint model may further include an image feature extraction network. The image feature extraction network may be used to perform image feature extraction on the sample face images to obtain sample image features, the first emotion feature and the sample image features are fused to obtain a sample fusion feature, and emotion category prediction is performed based on the sample fusion feature to obtain the predicted emotion category. For example, the emotion classification sub-network may be used to perform emotion category prediction on the sample fusion feature; reference may be made to the foregoing description of the emotion classification sub-network, which is not repeated here. In this manner, the first sample speech is separated from a sample video from which sample face images can also be separated; the sample image features extracted from the sample face images and the first emotion feature extracted from the first sample speech are fused to obtain the sample fusion feature, and emotion prediction is performed based on the sample fusion feature, so that face image information enhances the expression of speech emotion features, which helps improve model recognition accuracy.
In a specific implementation scenario, in order to improve the training assistance effect of the sample face images, the sample video is video data verified for lip-speech consistency, that is, the audio data and the image data at any moment in the sample video are consistent in the pronunciation dimension. For example, at time t the image data shows the lips opening widely during pronunciation while the audio data is an "a" sound; or at time t+1 the image data shows the lips forming a small round opening during pronunciation while the audio data is an "o" sound, and so on; further examples are not enumerated herein.
In a specific implementation scenario, the image feature extraction network may include, but is not limited to, a convolutional neural network, and its network structure is not limited herein. Referring to FIG. 4, which is a schematic diagram of a framework of an embodiment of the image feature extraction network: the first and second layers each comprise a convolutional layer (conv) and a pooling layer (pool), the third and fourth layers each comprise a convolutional layer, the fifth layer comprises a convolutional layer and a pooling layer, the sixth layer comprises a convolutional layer, and the seventh layer comprises a fully connected layer (fc). It should be noted that the convolutional layers shown in FIG. 4 are all three-dimensional convolutional layers; that is, the input to the image feature extraction network may be N consecutive frames (e.g., 5 frames) of sample face images randomly selected from the sample image sequence. In FIG. 4, T represents the number of frames, W the image width, H the image height, and the last column the number of convolution kernels. FIG. 4 shows only one possible implementation of the image feature extraction network in practical applications, and the specific structure of the image feature extraction network is not limited thereby.
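The following is a rough PyTorch sketch of a 3-D convolutional image feature extraction network operating on a short clip of face frames; the channel counts, pooling layout and output dimension are illustrative and do not reproduce the exact FIG. 4 configuration.

```python
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    """Rough sketch of a 3-D convolutional image feature extraction network."""
    def __init__(self, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool only spatially
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # collapse time and space
        )
        self.fc = nn.Linear(64, feature_dim)              # fully connected output layer

    def forward(self, frames):                 # frames: (batch, 3, T, H, W), e.g. T=5
        h = self.conv(frames).flatten(1)
        return self.fc(h)                      # sample image feature
```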
In a specific implementation scenario, the features may be expressed in vector form. On this basis, the first emotion feature may be added to the sample image feature to obtain the sample fusion feature; of course, the sample fusion feature may also be obtained by weighting the first emotion feature and the sample image feature, which is not limited herein. In other words, the first emotion feature and the sample image feature are fused to obtain the sample fusion feature f_emo.
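A minimal sketch of the two fusion options just mentioned (the weight alpha in the second option is an assumed hyperparameter):

```python
import torch

f_emo_speech = torch.randn(4, 128)   # first emotion feature
f_img = torch.randn(4, 128)          # sample image feature

# simple additive fusion, as described above
f_emo = f_emo_speech + f_img

# alternatively, a weighted fusion (alpha is an assumed hyperparameter)
alpha = 0.7
f_emo_weighted = alpha * f_emo_speech + (1 - alpha) * f_img
```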
In an implementation scenario, factors unrelated to emotion, such as gender and age, differ between speakers, which may cause large differences in the distribution of emotion features across speakers. In order to reduce the interference of speaker information on emotion recognition as much as possible, with continued reference to FIG. 2, the joint model may further include a speaker recognition network, and the speaker recognition network may include a speaker feature extraction sub-network. On this basis, the speaker feature extraction sub-network may be used to perform speaker feature extraction on the first sample speech to obtain the speaker feature of the first sample speech, and a fourth loss is obtained based on the mutual information between the first emotion feature and the speaker feature. The total loss is then obtained based on the first loss, the second loss, the third loss and the fourth loss, and the total loss is positively correlated with the fourth loss: the larger the fourth loss, the larger the total loss, and conversely the smaller the fourth loss, the smaller the total loss. The fourth loss is likewise positively correlated with the amount of mutual information: the larger the amount of mutual information, the larger the fourth loss, and conversely the smaller the amount of mutual information, the smaller the fourth loss. The calculation of the amount of mutual information can refer to the relevant technical details of mutual information and is not repeated here. In this manner, a speaker recognition network containing a speaker feature extraction sub-network is arranged in the joint model, the speaker feature extraction sub-network is used to extract the speaker feature of the first sample speech, and the fourth loss is obtained based on the mutual information between the first emotion feature and the speaker feature. By taking minimizing the total loss as the model optimization objective, the mutual information between the first emotion feature and the speaker feature is made as small as possible, that is, the correlation between the first emotion feature and the speaker feature is reduced as much as possible, which reduces the interference of speaker information on emotion recognition and helps improve the accuracy of emotion recognition.
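The description does not specify how the mutual information between the first emotion feature and the speaker feature is estimated; one common neural approach is a MINE-style estimator based on the Donsker-Varadhan bound, sketched below as an assumption.

```python
import torch
import torch.nn as nn

class MINEEstimator(nn.Module):
    """Neural mutual-information estimator (MINE); only an assumed realization,
    the patent description does not prescribe a particular estimator."""
    def __init__(self, emo_dim=128, spk_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emo_dim + spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, f_emo, f_spk):
        joint = self.net(torch.cat([f_emo, f_spk], dim=1))        # paired samples
        shuffled = f_spk[torch.randperm(f_spk.size(0))]
        marginal = self.net(torch.cat([f_emo, shuffled], dim=1))  # product of marginals
        # Donsker-Varadhan lower bound on I(f_emo; f_spk); used here as the fourth loss
        return joint.mean() - torch.log(torch.exp(marginal).mean())
```

In practice such an estimator network is trained to maximize the bound while the feature extractors are trained to minimize it; the above is therefore only one plausible realization of the fourth loss.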
In a specific implementation scenario, referring to FIG. 2, the speaker feature extraction sub-network and the emotion feature extraction sub-network may share the speech feature extraction sub-network; as described above, the speech feature extraction sub-network is used to perform speech feature extraction, and its network structure and principle can refer to the foregoing description and are not repeated here. The speaker feature extraction sub-network further includes a speaker encoding sub-network, which is used to perform speaker encoding based on the speech features to obtain the speaker feature. Specifically, as mentioned above, the speech feature extraction sub-network extracts the first sample speech features of several first sample audio frames; the speaker encoding sub-network can then perform speaker encoding on the first sample speech features of the several first sample audio frames to obtain the speaker feature f_spk of the first sample speech. Similar to the speech emotion encoding sub-network, the speaker encoding sub-network may include, but is not limited to, a statistical pooling layer and a fully connected layer: the statistical pooling layer computes the first- and second-order statistics (i.e., mean and standard deviation) of the first sample speech features of the several first sample audio frames in the time dimension, and the fully connected layer processes the result to obtain the speaker feature f_spk. It should be noted that the processing of the statistical pooling layer can refer to its technical details, which are not repeated here. In the above manner, the speaker feature extraction sub-network and the emotion feature extraction sub-network share the speech feature extraction sub-network used for speech feature extraction, the emotion feature extraction sub-network further includes a speech emotion encoding sub-network used to perform emotion feature encoding based on speech features to obtain emotion features, and the speaker feature extraction sub-network further includes a speaker encoding sub-network used to perform speaker encoding based on speech features to obtain speaker features, which helps reduce the complexity of the joint model as much as possible and improve model efficiency.
In a specific implementation scenario, with continued reference to FIG. 2, the speaker recognition network may further include a speaker classification sub-network, and the first sample speech may further be labeled with a sample speaker (e.g., speaker A). It should be noted that the speaker classification sub-network may include, but is not limited to, a fully connected layer, which is not limited herein. On this basis, the speaker classification sub-network can perform speaker prediction on the speaker feature to obtain the predicted speaker of the first sample speech, and a fifth loss is obtained based on the difference between the sample speaker and the predicted speaker. The total loss can then be obtained based on the first loss, the second loss, the third loss, the fourth loss and the fifth loss, with the fifth loss positively correlated with the total loss. Specifically, the speaker classification sub-network can output prediction probability values that the first sample speech belongs to a plurality of preset speakers (e.g., speaker A, speaker B, speaker C, etc.), and based on the sample speaker labeled on the first sample speech, these prediction probability values are processed with a loss function such as cross entropy to obtain the fifth loss. The specific calculation of the fifth loss can refer to the technical details of loss functions such as cross entropy and is not repeated here. In this manner, the predicted speaker of the first sample speech is further predicted and the fifth loss is obtained based on the difference between the predicted speaker and the labeled sample speaker, so that the fifth loss can also be incorporated during parameter adjustment; supervising it through the total loss improves the accuracy of the speech features and helps improve the accuracy of emotion recognition.
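A minimal sketch of the speaker branch (speaker encoding sub-network plus speaker classification sub-network) and the fifth loss; the frame-level speech features are assumed to come from the shared speech feature extraction sub-network, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerBranch(nn.Module):
    """Sketch of the speaker encoding and speaker classification sub-networks."""
    def __init__(self, speech_dim=512, spk_dim=128, num_speakers=100):
        super().__init__()
        self.fc_embed = nn.Linear(2 * speech_dim, spk_dim)    # speaker encoding
        self.classifier = nn.Linear(spk_dim, num_speakers)    # speaker classification

    def forward(self, speech_feats):        # speech_feats: (batch, frames, speech_dim)
        mean = speech_feats.mean(dim=1)
        std = speech_feats.std(dim=1)       # statistical pooling over time
        f_spk = self.fc_embed(torch.cat([mean, std], dim=1))
        return f_spk, self.classifier(f_spk)

branch = SpeakerBranch()
speech_feats = torch.randn(4, 200, 512)            # frame-level first sample speech features
sample_speaker = torch.randint(0, 100, (4,))       # labeled sample speakers
f_spk, spk_logits = branch(speech_feats)
fifth_loss = F.cross_entropy(spk_logits, sample_speaker)  # fifth loss (cross entropy)
```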
In an implementation scenario, as described above and shown in FIG. 2, the emotion recognition network may include an emotion feature extraction sub-network and an emotion classification sub-network, and the emotion feature extraction sub-network may further include a speech feature extraction sub-network and a speech emotion encoding sub-network. On this basis, acoustic feature extraction may first be performed on the speech to be recognized to obtain acoustic features (e.g., SIFT features) of several audio frames. The speech feature extraction sub-network is then used to extract speech features from the acoustic features of the several audio frames, the speech emotion encoding sub-network performs emotion feature encoding on the speech features of the several audio frames to obtain the emotion feature of the speech to be recognized, and the emotion classification sub-network performs emotion category prediction on the emotion feature to obtain the emotion category of the speech to be recognized. It should be noted that the specific manners of acoustic feature extraction, speech feature extraction, emotion feature encoding and emotion category prediction can refer to the related descriptions in the foregoing embodiments and are not repeated here.
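Putting the inference path together, the following sketch reuses the EmotionFeatureExtractor and PRESET_EMOTIONS names from the earlier sketches (both illustrative stand-ins, not names from this description) to classify one utterance.

```python
import torch

# Assumes EmotionFeatureExtractor and PRESET_EMOTIONS from the sketches above.
extractor = EmotionFeatureExtractor()                         # emotion feature extraction sub-network
classifier = torch.nn.Linear(128, len(PRESET_EMOTIONS))       # emotion classification sub-network

acoustic = torch.randn(1, 300, 40)   # per-frame acoustic features of the speech to be recognized
with torch.no_grad():
    emotion_feature = extractor(acoustic)
    emotion_idx = classifier(emotion_feature).argmax(dim=1).item()
print(PRESET_EMOTIONS[emotion_idx])  # emotion category of the speech to be recognized
```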
In the above scheme, speech to be recognized is acquired and recognized using an emotion recognition network to obtain its emotion category. The emotion recognition network is contained in a joint model that further includes a domain recognition network, and the joint model is obtained through joint training based on the emotion classification loss of the emotion recognition network on a first sample speech belonging to a first data domain category and the domain classification losses of the domain recognition network on the first sample speech and on a second sample speech, where the second sample speech belongs to a second data domain category and the first sample speech is labeled with a sample emotion category. On the one hand, because the joint model contains the emotion recognition network and supervises the emotion classification loss of the first sample speech labeled with a sample emotion category, the first sample speech can be used for supervised training of the network model. On the other hand, by introducing the unlabeled and plentiful second sample speech, and because the joint model further includes a domain recognition network that supervises the domain classification losses of the first sample speech and the second sample speech, the network model is encouraged to process speech data from the two data domains indistinguishably, so that the second sample speech assists the first sample speech in unsupervised training of the network model. By combining the advantages of supervised and unsupervised training, the network model can be trained collaboratively, and the accuracy of speech emotion recognition can be improved even when sample data with accurate emotion category labels is scarce.
Referring to fig. 5, fig. 5 is a schematic block diagram of an embodiment of a speech emotion recognition apparatus 50 according to the present application. The speech emotion recognition apparatus 50 includes: the emotion recognition system comprises a voice acquisition module 51 and an emotion recognition module 52, wherein the voice acquisition module 51 is used for acquiring a voice to be recognized; the emotion recognition module 52 is configured to recognize the speech to be recognized by using an emotion recognition network, so as to obtain an emotion category of the speech to be recognized; the emotion recognition network is contained in the combined model, the combined model further comprises a domain recognition network, the combined model is obtained through combined training of emotion classification loss of the first sample voice belonging to the first data domain type based on the emotion recognition network and domain classification loss of the first sample voice and the second sample voice respectively by the domain recognition network, the second sample voice belongs to the second data domain type, and the sample emotion type is marked on the first sample voice.
In the above scheme, on the one hand, because the joint model contains the emotion recognition network and supervises the emotion classification loss of the first sample speech labeled with a sample emotion category, the first sample speech can be used for supervised training of the network model. On the other hand, by introducing the unlabeled and plentiful second sample speech, and because the joint model further includes a domain recognition network that supervises the domain classification losses of the first sample speech and the second sample speech, the network model is encouraged to process speech data from the two data domains indistinguishably, so that the second sample speech assists the first sample speech in unsupervised training of the network model. By combining the advantages of supervised and unsupervised training, the network model can be trained collaboratively, and the accuracy of speech emotion recognition can be improved even when sample data with accurate emotion category labels is scarce.
In some disclosed embodiments, the emotion recognition network and the domain recognition network share an emotion feature extraction sub-network, and the speech emotion recognition apparatus 50 includes an emotion feature extraction module, configured to perform emotion feature extraction on the first sample speech and the second sample speech respectively by using the emotion feature extraction sub-network, so as to obtain a first emotion feature of the first sample speech and a second emotion feature of the second sample speech; the speech emotion recognition device 50 comprises an emotion category prediction module, which is used for performing emotion category prediction based on the first emotion characteristics to obtain a predicted emotion category of the first sample speech; the speech emotion recognition device 50 includes a domain type prediction module, configured to perform domain type prediction based on the first emotional feature and the second emotional feature, respectively, to obtain a first prediction domain type to which the first sample speech belongs and a second prediction domain type to which the second sample speech belongs; the speech emotion recognition device 50 comprises a loss calculation module for obtaining a first loss, a second loss and a third loss based on the difference between the first prediction domain category and the first data domain category, the difference between the second prediction domain category and the second data domain category and the difference between the prediction emotion category and the sample emotion category, respectively; the speech emotion recognition apparatus 50 includes a parameter adjustment module for obtaining a total loss based on the first loss, the second loss and the third loss, and adjusting a network parameter of the joint model based on the total loss.
Therefore, by simultaneously monitoring the emotion classification loss of the first sample voice and the domain classification loss of the first sample voice and the second sample voice, the model performance of the network model is improved.
In some disclosed embodiments, the first loss and the second loss are each negatively correlated with the total loss, and the third loss is positively correlated with the total loss.
Therefore, by taking the minimum total loss as a model optimization target, the model cannot distinguish the first emotion feature of the first sample voice and the second emotion feature of the second sample voice along with the improvement of the model identification precision, that is, the feature data extracted by the voice data in different data fields tend to be the same in data distribution, and further, the model can accurately identify emotion in different data field types.
In some disclosed embodiments, the emotion recognition network further comprises an emotion classification subnetwork for performing emotion class prediction; and/or, the domain identification network further comprises a domain classification sub-network for performing domain class prediction.
Therefore, the emotion recognition network further comprises an emotion classification sub-network, and the emotion classification sub-network is used for performing emotion class prediction and is beneficial to improving emotion prediction efficiency, and the domain recognition network further comprises a domain classification sub-network, and the domain classification sub-network is used for performing domain class prediction and is beneficial to improving domain class prediction efficiency.
In some disclosed embodiments, the first sample speech is separated from the sample video, and the sample video further separates a sample face image, the combined model further includes an image feature extraction network, and the speech emotion recognition apparatus 50 includes an image feature extraction module, configured to perform image feature extraction on the sample face image by using the image feature extraction network, so as to obtain a sample image feature of the sample face image; the speech emotion recognition device 50 comprises a feature fusion module, which is used for fusing the first emotion feature with the sample image feature to obtain a sample fusion feature; the emotion category prediction module is specifically used for carrying out emotion category prediction based on the sample fusion characteristics to obtain a predicted emotion category.
Therefore, the first sample voice is set to be separated from the sample video, the sample face image can be further separated from the sample video, on the basis, the sample image features extracted through the sample face image and the first emotion features extracted through the first sample voice are fused to obtain sample fusion features, emotion prediction is carried out based on the sample fusion features, voice emotion feature expression can be enhanced through face image information, and model identification accuracy is improved.
In some disclosed embodiments, the combined model further includes a speaker recognition network, the speaker recognition network includes a speaker feature extraction sub-network, the speech emotion recognition device 50 includes a speaker feature extraction module, configured to perform speaker feature extraction on the first sample speech by using the speaker feature extraction sub-network to obtain speaker features of the first sample speech; the speech emotion recognition device 50 includes a mutual information calculation module, configured to obtain a fourth loss based on mutual information between the first emotion feature and the speaker feature; the parameter adjusting module is further used for obtaining total loss based on the first loss, the second loss, the third loss and the fourth loss; wherein the total loss is positively correlated with the fourth loss.
Therefore, the speaker recognition network is arranged in the combined model and comprises the speaker characteristic extraction sub-network, the speaker characteristic extraction sub-network is utilized to extract the speaker characteristic of the first sample voice, so that the speaker characteristic of the first sample voice is obtained, the fourth loss is obtained based on the mutual information between the first emotional characteristic and the speaker characteristic, the minimum total loss is taken as a model optimization target, the mutual information between the first emotional characteristic and the speaker characteristic is enabled to be as small as possible, namely, the correlation between the first emotional characteristic and the speaker characteristic is reduced as much as possible, the interference of the speaker information on the emotion recognition can be reduced as much as possible, and the emotion recognition accuracy is improved.
In some disclosed embodiments, the speaker recognition network further includes a speaker classification subnetwork, the first sample speech is further labeled with a sample speaker, and the speech emotion recognition device 50 includes a speaker prediction module for performing speaker prediction on speaker characteristics by using the speaker classification subnetwork to obtain a predicted speaker of the first sample speech; the speech emotion recognition device 50 comprises a speaker difference measurement module for obtaining a fifth loss based on the difference between the sample speaker and the predicted speaker; the parameter adjusting module is further used for obtaining a total loss based on the first loss, the second loss, the third loss, the fourth loss and the fifth loss; wherein the fifth loss is positively correlated to the total loss.
Therefore, the fifth loss is obtained by further predicting the predicted speaker of the first sample voice and based on the difference between the predicted speaker and the sample speaker marked by the first sample voice, so that the fifth loss can be further combined in the parameter adjustment process, the voice feature precision can be further improved by monitoring the total loss, and the emotion recognition accuracy is favorably improved.
In some disclosed embodiments, the speaker feature extraction sub-network and the emotion feature extraction sub-network share a speech feature extraction sub-network, the speech feature extraction sub-network is configured to perform speech feature extraction, the emotion feature extraction sub-network further comprises a speech emotion encoding sub-network, the speech emotion encoding sub-network is configured to perform emotion feature encoding based on speech features to obtain emotion features, and the speaker feature extraction sub-network further comprises a speaker encoding sub-network configured to perform speaker encoding based on speech features to obtain speaker features.
Therefore, the speaker feature extraction sub-network and the emotion feature extraction sub-network share the voice feature extraction sub-network, the voice feature extraction sub-network is used for performing voice feature extraction, the emotion feature extraction sub-network further comprises a voice emotion coding sub-network, the voice emotion coding sub-network is used for performing emotion feature coding based on voice features to obtain emotion features, the speaker feature extraction sub-network further comprises a speaker coding sub-network used for performing speaker coding based on the voice features to obtain speaker features, complexity of a joint model is reduced as much as possible, and model efficiency is improved.
In some disclosed embodiments, the domain recognition network includes a first recognition network for domain-recognizing the first sample speech and a second recognition network for domain-recognizing the second sample speech, the first recognition network and the second recognition network have the same network structure and network parameters, and the first recognition network and the emotion recognition network share an emotion feature extraction sub-network.
Therefore, the data processing efficiency of the combined model can be improved by sharing the same network structure and even the same network parameters in the combined model.
Referring to fig. 6, fig. 6 is a schematic diagram of a frame of an embodiment of an electronic device 60 according to the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, wherein the memory 61 stores program instructions, and the processor 62 is configured to execute the program instructions to implement the steps in any of the above embodiments of the speech emotion recognition method. Specifically, the electronic device 60 may include, but is not limited to: desktop computers, notebook computers, servers, mobile phones, tablet computers, and the like, without limitation.
In particular, the processor 62 is configured to control itself and the memory 61 to implement the steps in any of the above embodiments of the speech emotion recognition method. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip with signal processing capabilities. The processor 62 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be jointly implemented by a plurality of integrated circuit chips.
In the above scheme, on the one hand, because the joint model contains the emotion recognition network and supervises the emotion classification loss of the first sample speech labeled with a sample emotion category, the first sample speech can be used for supervised training of the network model. On the other hand, by introducing the unlabeled and plentiful second sample speech, and because the joint model further includes a domain recognition network that supervises the domain classification losses of the first sample speech and the second sample speech, the network model is encouraged to process speech data from the two data domains indistinguishably, so that the second sample speech assists the first sample speech in unsupervised training of the network model. By combining the advantages of supervised and unsupervised training, the network model can be trained collaboratively, and the accuracy of speech emotion recognition can be improved even when sample data with accurate emotion category labels is scarce.
Referring to fig. 7, fig. 7 is a block diagram of an embodiment of a computer-readable storage medium 70 according to the present application. The computer-readable storage medium 70 stores program instructions 71 capable of being executed by a processor, and the program instructions 71 are used for implementing the steps in any of the above embodiments of the speech emotion recognition method.
In the above scheme, on one hand, the emotion recognition network is contained in the joint model and is supervised, during training, by the emotion classification loss on the first sample speech labeled with sample emotion categories; on the other hand, unlabeled and plentiful second sample speech is introduced, and the joint model further comprises a domain recognition network that performs domain classification on the first sample speech and the second sample speech respectively. The resulting domain classification losses drive the network model to treat the speech data of the two data domains indiscriminately, so that the second sample speech can assist the first sample speech in unsupervised training. The network model is thus trained cooperatively, combining the advantages of supervised training and unsupervised training, and the accuracy of speech emotion recognition is improved even when sample data with accurate emotion category labels is scarce.
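A minimal sketch of such a joint objective is given below, assuming cross-entropy losses and a literal reading of the sign convention in claim 3 below (the emotion classification loss enters positively, the two domain classification losses enter negatively); in practice the negative weighting is often realized with a gradient reversal layer so that the domain classifier itself is still trained to classify correctly, and the weight lam is an illustrative hyper-parameter, not taken from this application:

import torch
import torch.nn.functional as F

def joint_loss(emo_logits, emo_labels, dom_logits_first, dom_logits_second, lam=0.1):
    # third loss: supervised emotion classification on labeled first-domain speech
    third_loss = F.cross_entropy(emo_logits, emo_labels)
    # first / second losses: domain classification on the two data domains
    first_labels = torch.zeros(dom_logits_first.size(0), dtype=torch.long,
                               device=dom_logits_first.device)
    second_labels = torch.ones(dom_logits_second.size(0), dtype=torch.long,
                               device=dom_logits_second.device)
    first_loss = F.cross_entropy(dom_logits_first, first_labels)
    second_loss = F.cross_entropy(dom_logits_second, second_labels)
    # total loss: emotion loss positively correlated, domain losses negatively
    return third_loss - lam * (first_loss + second_loss)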
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for brevity.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (12)

1. A speech emotion recognition method is characterized by comprising the following steps:
acquiring a voice to be recognized;
recognizing the voice to be recognized by utilizing an emotion recognition network to obtain the emotion type of the voice to be recognized;
the emotion recognition network is contained in a joint model, the joint model further comprises a domain recognition network, the joint model is obtained through joint training based on emotion classification loss of the emotion recognition network on a first sample voice belonging to a first data domain category and domain classification loss of the domain recognition network on the first sample voice and a second sample voice respectively, the second sample voice belongs to a second data domain category, and the first sample voice is marked with a sample emotion category.
2. The method of claim 1, wherein the emotion recognition network and the domain recognition network share an emotion feature extraction sub-network, and wherein the training step of the joint model comprises:
performing emotion feature extraction on the first sample voice and the second sample voice respectively by using the emotion feature extraction sub-network to obtain a first emotion feature of the first sample voice and a second emotion feature of the second sample voice;
performing emotion category prediction based on the first emotion feature to obtain a predicted emotion category of the first sample voice, and performing domain category prediction based on the first emotion feature and the second emotion feature respectively to obtain a first predicted domain category to which the first sample voice belongs and a second predicted domain category to which the second sample voice belongs;
deriving a first loss, a second loss, and a third loss based on a difference between the first predicted domain category and the first data domain category, a difference between the second predicted domain category and the second data domain category, and a difference between the predicted emotion category and the sample emotion category, respectively;
obtaining a total loss based on the first loss, the second loss, and the third loss, and adjusting a network parameter of the joint model based on the total loss.
3. The method of claim 2, wherein the first loss and the second loss are each negatively correlated to the total loss and the third loss is positively correlated to the total loss.
4. The method of claim 2, wherein the emotion recognition network further comprises an emotion classification sub-network, the emotion classification sub-network being configured to perform the emotion category prediction;
and/or the domain identification network further comprises a domain classification sub-network for performing the domain category prediction.
5. The method of claim 2, wherein the first sample speech is separated from a sample video, a sample face image is further separated from the sample video, and the joint model further comprises an image feature extraction network, and wherein before the emotion category prediction is performed based on the first emotion feature to obtain the predicted emotion category of the first sample speech, the method further comprises:
carrying out image feature extraction on the sample face image by using the image feature extraction network to obtain sample image features of the sample face image;
fusing the first emotion characteristics with the sample image characteristics to obtain sample fusion characteristics;
the emotion category prediction is performed based on the first emotion feature to obtain the predicted emotion category of the first sample voice, and the method comprises the following steps:
and performing emotion category prediction based on the sample fusion characteristics to obtain the predicted emotion category.
6. The method of claim 2, wherein the joint model further comprises a speaker recognition network comprising a speaker feature extraction sub-network, and wherein the method further comprises, before deriving the total loss based on the first loss, the second loss, and the third loss:
carrying out speaker characteristic extraction on the first sample voice by utilizing the speaker characteristic extraction sub-network to obtain speaker characteristics of the first sample voice;
obtaining a fourth loss based on mutual information between the first emotional characteristic and the speaker characteristic;
said deriving a total loss based on said first loss, said second loss, and said third loss comprises:
deriving the total loss based on the first loss, the second loss, the third loss, and the fourth loss; wherein the total loss is positively correlated with the fourth loss.
7. The method of claim 6, wherein the speaker recognition network further comprises a speaker classification subnetwork, wherein the first sample speech is further labeled with sample speakers, and wherein the method further comprises, before the deriving the total loss based on the first loss, the second loss, the third loss, and the fourth loss:
carrying out speaker prediction on the speaker characteristics by utilizing the speaker classification subnetwork to obtain a predicted speaker of the first sample voice;
obtaining a fifth loss based on a difference between the sample speaker and the predicted speaker;
said deriving said total loss based on said first loss, said second loss, said third loss, and said fourth loss comprises:
deriving the total loss based on the first loss, the second loss, the third loss, the fourth loss, and the fifth loss; wherein the fifth loss is positively correlated with the total loss.
8. The method of claim 6, wherein the speaker feature extraction sub-network shares a speech feature extraction sub-network with the emotion feature extraction sub-network, wherein the speech feature extraction sub-network is configured to perform speech feature extraction, wherein the emotion feature extraction sub-network further comprises a speech emotion encoding sub-network, wherein the speech emotion encoding sub-network is configured to perform emotion feature encoding based on speech features to obtain emotion features, and wherein the speaker feature extraction sub-network further comprises a speaker encoding sub-network configured to perform speaker encoding based on speech features to obtain speaker features.
9. The method of claim 1, wherein the domain recognition network comprises a first recognition network for domain-recognizing the first sample speech and a second recognition network for domain-recognizing the second sample speech, the first recognition network and the second recognition network have the same network structure and network parameters, and the first recognition network and the emotion recognition network share an emotion feature extraction sub-network.
10. A speech emotion recognition apparatus, comprising:
the voice acquisition module is used for acquiring the voice to be recognized;
the emotion recognition module is used for recognizing the voice to be recognized by utilizing an emotion recognition network to obtain the emotion type of the voice to be recognized;
the emotion recognition network is contained in a joint model, the joint model further comprises a domain recognition network, the joint model is obtained through joint training based on emotion classification loss of the emotion recognition network on a first sample voice belonging to a first data domain category and domain classification loss of the domain recognition network on the first sample voice and a second sample voice respectively, the second sample voice belongs to a second data domain category, and the first sample voice is marked with a sample emotion category.
11. An electronic device, comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech emotion recognition method according to any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that program instructions are stored which can be executed by a processor for implementing the speech emotion recognition method as claimed in any one of claims 1 to 9.
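As a hedged illustration of the fourth and fifth losses introduced in claims 6 and 7 above, the sketch below approximates the mutual-information term with a simple cross-correlation surrogate (this text does not specify which estimator is used, so the surrogate is only an assumption for illustration) and assumes a standard cross-entropy speaker classification loss; both terms would then enter the total loss with positive weights:

import torch.nn.functional as F

def fourth_and_fifth_losses(emo_feat, spk_feat, spk_logits, spk_labels):
    # Illustrative surrogate only: dependence between emotion and speaker
    # features is measured by the mean squared cross-correlation of the
    # batch-normalized features; this stands in for the mutual information.
    emo = (emo_feat - emo_feat.mean(0)) / (emo_feat.std(0) + 1e-6)
    spk = (spk_feat - spk_feat.mean(0)) / (spk_feat.std(0) + 1e-6)
    cross_corr = emo.t() @ spk / emo.size(0)   # (emo_dim, spk_dim)
    fourth_loss = cross_corr.pow(2).mean()
    # fifth loss: supervised speaker classification on the labeled samples
    fifth_loss = F.cross_entropy(spk_logits, spk_labels)
    return fourth_loss, fifth_loss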
CN202111363984.6A 2021-11-17 2021-11-17 Speech emotion recognition method and related device, electronic equipment and storage medium Pending CN114333786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111363984.6A CN114333786A (en) 2021-11-17 2021-11-17 Speech emotion recognition method and related device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111363984.6A CN114333786A (en) 2021-11-17 2021-11-17 Speech emotion recognition method and related device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114333786A true CN114333786A (en) 2022-04-12

Family

ID=81047390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111363984.6A Pending CN114333786A (en) 2021-11-17 2021-11-17 Speech emotion recognition method and related device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114333786A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974220A (en) * 2022-06-17 2022-08-30 中国电信股份有限公司 Network model training method, and voice object gender identification method and device

Similar Documents

Publication Publication Date Title
CN111164601B (en) Emotion recognition method, intelligent device and computer readable storage medium
EP3617946B1 (en) Context acquisition method and device based on voice interaction
US20200005673A1 (en) Method, apparatus, device and system for sign language translation
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
US11341770B2 (en) Facial image identification system, identifier generation device, identification device, image identification system, and identification system
CN108920640B (en) Context obtaining method and device based on voice interaction
EP3992924A1 (en) Machine learning based media content annotation
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
WO2019024083A1 (en) Artificial neural network
EP4207195A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN112804558B (en) Video splitting method, device and equipment
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN114333786A (en) Speech emotion recognition method and related device, electronic equipment and storage medium
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN113053361A (en) Speech recognition method, model training method, device, equipment and medium
CN111462762A (en) Speaker vector regularization method and device, electronic equipment and storage medium
Wang et al. An attention self-supervised contrastive learning based three-stage model for hand shape feature representation in cued speech
Gavade et al. Improved deep generative adversarial network with illuminant invariant local binary pattern features for facial expression recognition
CN111402164B (en) Training method and device for correction network model, text recognition method and device
CN113744371B (en) Method, device, terminal and storage medium for generating face animation
CN113782033B (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN110781916B (en) Fraud detection method, apparatus, computer device and storage medium for video data
CN116453023B (en) Video abstraction system, method, electronic equipment and medium for 5G rich media information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination