WO2020216064A1 - Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium
- Publication number: WO2020216064A1 (application PCT/CN2020/083751)
- Authority: WIPO (PCT)
- Prior art keywords: feature, emotion category, classifiers, emotion, sub
- Prior art date
Classifications

- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
- G10L2015/225—Feedback of the input speech
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
Definitions
- the embodiments of the present disclosure relate to speech emotion recognition methods, semantic recognition methods, question answering methods, computer equipment, and computer-readable storage media.
- the corresponding reply is only given according to the voice command issued by the user.
- corresponding responses are also given based on voice emotions.
- the method based on deep learning has higher requirements for hardware resources, and it is difficult to achieve real-time performance.
- the method based on machine learning can achieve a certain degree of real-time performance, but it needs to extract the most useful features through prior knowledge and select the most suitable classifier.
- the voice emotion recognition method may include: determining the values of the audio features in the feature set from the voice signal based on a preset feature set; and inputting the determined values of the audio features in the feature set into a classifier, and outputting the emotion category of the speech signal from the classifier.
- the classifier includes a plurality of sub-classifiers, wherein inputting the determined values of the audio features in the feature set into the classifier and outputting the emotion category of the speech signal from the classifier includes: inputting the determined values of the audio features in the feature set into the multiple sub-classifiers respectively; outputting the emotion category prediction results of the speech signal from the multiple sub-classifiers respectively; and identifying the emotion category of the speech signal based on the emotion category prediction results output from the multiple sub-classifiers.
- the voice emotion recognition method may include: providing a plurality of voice signal samples; extracting a plurality of features from each of the plurality of voice signal samples; calculating the emotional correlation between each of the plurality of features and a plurality of emotion categories; selecting, from the plurality of features, the features whose emotional correlation is greater than a preset emotional correlation threshold, to obtain a first candidate feature subset; taking the feature with the greatest emotional correlation in the first candidate feature subset as the salient feature; calculating the feature correlation between each of the remaining features in the first candidate feature subset and the salient feature; deleting, from the first candidate feature subset, the features whose feature correlation is greater than their emotional correlation, to obtain a second candidate feature subset; calculating the variance of each feature in the second candidate feature subset; and deleting, from the second candidate feature subset, the features whose variance is less than a variance threshold, to obtain the features in the preset feature set.
- the voice emotion recognition method may include: providing a plurality of voice signal samples; extracting a plurality of features from each of the plurality of voice signal samples; calculating the variance of each of the plurality of features; deleting, from the plurality of features, the features whose variance is less than the variance threshold, to obtain a third candidate feature subset; calculating the emotional correlation between each feature in the third candidate feature subset and a plurality of emotion categories; selecting, from the third candidate feature subset, the features whose emotional correlation is greater than the preset emotional correlation threshold, to obtain a fourth candidate feature subset; taking the feature with the greatest emotional correlation in the fourth candidate feature subset as the salient feature; calculating the feature correlation between each of the remaining features in the fourth candidate feature subset and the salient feature; and deleting, from the fourth candidate feature subset, the features whose feature correlation is greater than their emotional correlation, to obtain the features in the preset feature set.
- the emotional correlation is calculated by the following formula: SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)]
- X represents the feature vector
- Y represents the emotion category vector
- H(X) represents the entropy of X
- H(Y) represents the entropy of Y
- H(X|Y) represents the conditional entropy of X given Y
- the feature correlation is calculated by the same formula, SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)], where X and Y both represent feature vectors
- H(X) represents the entropy of X
- H(Y) represents the entropy of Y
- H(X|Y) represents the conditional entropy of X given Y
- identifying the emotion category of the speech signal may include: identifying the emotion category of the speech signal according to the votes of the multiple sub-classifiers on the emotion category prediction results and the weights of the multiple sub-classifiers.
- identifying the emotion category of the speech signal according to the votes of the multiple sub-classifiers on the emotion category prediction results and the weights of the multiple sub-classifiers may include: obtaining the voting result of the multiple sub-classifiers on the emotion category prediction results; in response to a unique emotion category being identified by the votes of the multiple sub-classifiers on the emotion category prediction results, taking the unique emotion category as the emotion category of the speech signal; and in response to at least two emotion categories being identified by the votes of the multiple sub-classifiers on the emotion category prediction results, determining the emotion category of the speech signal according to the weights of the multiple sub-classifiers.
- the recognizing of the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers may include: in response to the emotion category prediction results recognized by at least two of the plurality of sub-classifiers being the same, recognizing that shared emotion category prediction result as the emotion category of the speech signal.
- the plurality of sub-classifiers may include a support vector machine classifier, a decision tree classifier, and a neural network classifier.
- a semantic recognition method, which includes: converting a voice signal into text information; using the target dialogue state of the previous round of dialogue as the current dialogue state; performing semantic understanding on the text information to acquire the current intention of the user; and determining a target dialogue state according to the current dialogue state and the current intention, and using the target dialogue state as the semantics of the voice signal.
- At least one embodiment of the present disclosure also provides a question and answer method.
- the question answering method may include: receiving a voice signal; recognizing the semantic and emotional category of the voice signal; and outputting a response based on the semantic and emotional category of the voice signal.
- Recognizing the emotion category of the voice signal may include recognizing the emotion category of the voice signal according to the aforementioned voice emotion recognition method. Recognizing the semantics of the voice signal includes: recognizing the semantics of the voice signal according to the semantic recognition method described above.
- outputting a response based on the semantics and emotion category of the voice signal includes: selecting and outputting, from a plurality of preset responses, a response matching the recognized semantics and emotion category of the voice signal.
- the question answering method further includes: determining the current emotion category based on the emotion category determined in at least one previous round of question and answer.
- the computer device may include: a memory, which stores a computer program; and a processor, which is configured to, when executing the computer program, perform the aforementioned voice emotion recognition method, the aforementioned semantic recognition method, or the aforementioned question and answer method.
- At least one embodiment of the present disclosure also provides a computer-readable storage medium.
- the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the aforementioned speech emotion recognition method, the aforementioned semantic recognition method, or the aforementioned question and answer method.
- FIG. 1A shows a schematic flowchart of a question answering method according to at least one embodiment of the present disclosure;
- FIG. 1B shows an example of determining the emotion category of the current round based on the previous emotion category according to at least one embodiment of the present disclosure;
- FIG. 2 shows a schematic flowchart of a speech emotion recognition method according to at least one embodiment of the present disclosure;
- FIG. 3 shows a schematic flowchart of a feature extraction method according to at least one embodiment of the present disclosure;
- FIG. 4 shows a schematic flowchart of another feature extraction method according to at least one embodiment of the present disclosure;
- FIG. 5 is a schematic flowchart of a voice emotion recognition method according to at least one embodiment of the present disclosure;
- FIG. 6 is a schematic flowchart of a semantic recognition method according to at least one embodiment of the present disclosure;
- FIG. 7 is a schematic state transition table according to at least one embodiment of the present disclosure;
- FIG. 8 shows a schematic structural diagram of a question answering system according to at least one embodiment of the present disclosure;
- FIG. 9 shows a schematic structural diagram of a speech emotion recognition device according to at least one embodiment of the present disclosure; and
- FIG. 10 is a schematic structural diagram of a computing system suitable for implementing a voice emotion recognition method and device, a semantic recognition method, or a question answering method and system according to at least one embodiment of the present disclosure.
- a voice emotion recognition method determines the final emotion category of the speech signal through the voting results of multiple classifiers. Compared with using a single classifier to determine the emotion category of the speech signal, this can improve the accuracy and real-time performance of emotion category recognition. In addition, features are selected based on a feature selection algorithm instead of prior knowledge, which can further improve the accuracy and real-time performance of emotion category recognition of speech signals.
- Fig. 1A shows a schematic flowchart of a question answering method 100 according to at least one embodiment of the present disclosure.
- the question and answer method 100 may include step 101, receiving or acquiring a voice signal.
- the voice signal can come from the user or any other subject that can emit a voice signal.
- the voice signal may include, for example, various question information posed by the user. The voice signal may be collected in real time by a voice collection device, or a pre-stored voice signal may be obtained from a storage area.
- the question answering method 100 may further include step 102 of recognizing the semantics and emotion categories of the speech signal.
- Step 102 may include two sub-steps, namely, a step of recognizing the semantics of the voice signal and a step of recognizing the emotion category of the voice signal. These two sub-steps can be executed simultaneously or sequentially.
- the semantic recognition of the voice signal may be performed first and then the emotion category recognition of the voice signal may be performed, or the emotion category recognition of the voice signal may be performed first and then the semantic recognition of the voice signal may be performed.
- Recognizing the semantics of the voice signal may include parsing specific question information included in the voice signal, so as to output a corresponding answer from a preset database for the specific question information. Recognizing the semantics of the voice signal may be implemented by a semantic recognition method that will be described later with reference to FIGS. 6 and 7 according to an embodiment of the present disclosure. However, it should be understood that recognizing the semantics of the speech signal can also be implemented in various other methods known in the art, which are not limited in the embodiments of the present disclosure.
- recognizing the emotion category of the voice signal may be implemented by the voice emotion recognition method that will be described later with reference to FIGS. 2, 3, and 4 according to embodiments of the present disclosure.
- the emotion category may include multiple dimensions, such as negative emotions and positive emotions; negative emotions include, for example, urgency, impatience, and sadness, while positive emotions include, for example, happiness. Further, the emotion category may also include the degree of the positive or negative emotion in each dimension, such as overly happy, very happy, happy, a little happy, unhappy, very unhappy, etc.
- the question answering method 100 may further include step 103, outputting a response based on the semantics and emotion category of the speech signal.
- a preset database may be included in the memory.
- the preset database may include multiple entries. Each item can include three attributes: semantics, emotional category, and reply.
- step 103 may include retrieving from the preset database a response that matches the recognized semantic and emotional category, and then outputting it to the user.
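- As a minimal illustration of the retrieval in step 103, a sketch follows, assuming entries stored as (semantics, emotion category, reply) triples; the entry contents and function name are invented for the example and are not taken from the disclosure.

```python
# Hypothetical entries of the preset database described above; each entry
# carries the three attributes: semantics, emotion category, and reply.
PRESET_DATABASE = [
    {"semantics": "ask_weather", "emotion": "happy",
     "reply": "It's sunny today -- enjoy it!"},
    {"semantics": "ask_weather", "emotion": "unhappy",
     "reply": "It's sunny today; maybe a walk would help."},
]

def retrieve_reply(semantics, emotion):
    """Step 103 sketch: return the first reply matching both attributes."""
    for entry in PRESET_DATABASE:
        if entry["semantics"] == semantics and entry["emotion"] == emotion:
            return entry["reply"]
    return None  # no matching entry; the caller decides the fallback
```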
- the question and answer method may not directly output the response based on the semantics and emotion category of the voice signal; instead, it may first judge, based on the emotion category of the voice signal, whether the user's emotion is negative (for example, lost, depressed, unhappy, listless, etc.). If the user's emotion is judged to be negative, the question and answer method may first output positive information such as jokes (which, for example, may be completely independent of the semantics of the voice signal) to adjust the user's emotion, and then output the reply based on the semantics of the voice signal.
- the question and answer method 100 may be repeatedly executed multiple times to realize multiple rounds of question and answer.
- the semantic and emotional categories of the recognized speech signals can be stored or recorded to guide subsequent answers.
- the emotion category of the current round may be determined based on the previous emotion category (for example, that of the previous round or of several previous rounds, the change of emotion category, or the number of occurrences of each emotion category) in order to guide the current round of question and answer.
- FIG. 1B shows an example of determining the emotion category of the current round based on the previous emotion category according to at least one embodiment of the present disclosure.
- the emotional state of each round is first recorded. When the number of rounds exceeds three, the emotional state is determined by a voting strategy: if at least two of the last three rounds have the same emotional state, that state is taken as the voting result of the three rounds; otherwise, the most recently judged emotional state is taken as the voting result. The emotional state obtained from these three rounds is then used to guide the response of the next round of question and answer: a matching response is retrieved from the database according to the judged emotional state. If the user's emotion is found to be negative, the user's emotion is first relieved in some way, and then the answer is returned; different degrees of negative emotion correspond to different response content.
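- As an illustration, a minimal sketch of this three-round voting strategy follows; the emotion labels and the helper name are illustrative assumptions, not part of the disclosed embodiments.

```python
from collections import Counter

def vote_emotion(history):
    """Sketch of the three-round voting strategy described above.

    If at least two of the last three recorded emotional states agree,
    that state is the voting result; otherwise the most recently judged
    state is used.
    """
    last_three = history[-3:]
    state, count = Counter(last_three).most_common(1)[0]
    return state if count >= 2 else last_three[-1]

# Two of the last three rounds are "negative", so "negative" wins the vote.
print(vote_emotion(["happy", "negative", "negative"]))  # -> negative
```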
- the answer is output based not only on the semantics of the voice signal, but also based on the emotional category of the voice signal, thereby enabling the user to obtain a better experience.
- the current response is also output based on the previous emotion category, so that the current response can make the user more satisfied, and the user can get a better experience.
- Fig. 2 shows a schematic flowchart of a method 200 for speech emotion recognition according to at least one embodiment of the present disclosure.
- the voice emotion recognition method 200 may include step 201, preprocessing the voice signal.
- the voice signal can be received from the user.
- the preprocessing may include filtering, framing and other operations, which are known in the art, and therefore will not be repeated here.
- the voice emotion recognition method 200 may not include step 201.
- the voice signal has been processed in advance, or the voice signal has met actual requirements without preprocessing.
- the embodiments of the present disclosure do not limit this.
- the voice emotion recognition method 200 may further include step 202 of extracting the value of the feature in the feature set from the preprocessed voice signal based on the preset feature set.
- the features in the preset feature set are selected from multiple features during the training process of speech emotion category recognition, based on a fast correlation-based filter feature selection algorithm and variance. The selection process of the features in the preset feature set will be described in detail later in this article in conjunction with FIG. 3 and FIG. 4.
- the voice emotion recognition method 200 may further include step 203, in which a classifier recognizes the emotion category of the voice signal based on the extracted feature values.
- in step 203, the values of the audio features in the determined feature set are input to the classifier, and the emotion category of the speech signal is output from the classifier.
- the sub-classifiers may include various classifiers, such as a support vector machine classifier, a decision tree classifier, a neural network classifier, and so on.
- each sub-classifier can include a pre-trained speech emotion category recognition model.
- each speech emotion category recognition model is obtained in advance by training the corresponding sub-classifier on a large number of speech signal samples, based on the same preset feature set and the same emotion category set (which includes emotion categories such as happiness, urgency, impatience, sadness, etc.).
- the neural network classifier may include a back-propagation neural network; the input layer of the neural network may correspond to the features of the preset feature set, and the output layer may correspond to the emotion categories in the emotion category set described above.
- the decision tree classifier according to the present disclosure may use a pre-pruning operation.
- the support vector machine classifier according to the present disclosure may use a soft-margin support vector machine, so as to find, as far as possible, a clean separating hyperplane between two emotion categories that are not easily divided.
- when the values of the features in the preset feature set are input to a sub-classifier, the sub-classifier can output an emotion category based on its pre-trained speech emotion category recognition model. In this way, when the values of the features in the preset feature set are input into each sub-classifier, each sub-classifier will output an emotion category.
- recognizing the emotion category of the speech signal by the multiple sub-classifiers based on the feature values may include: recognizing the emotion category of the voice signal based on the votes of the multiple sub-classifiers and the weights of the multiple sub-classifiers.
- recognizing the emotion category of the voice signal according to the voting results of the multiple sub-classifiers on the emotion category prediction results and the weights of the multiple sub-classifiers may include: obtaining the voting results of the multiple sub-classifiers on the emotion category prediction results; in response to a unique emotion category being identified according to the voting results, taking the unique emotion category as the emotion category of the speech signal; and in response to at least two emotion categories being identified according to the voting results, determining the emotion category of the speech signal according to the weights of the multiple sub-classifiers.
- the recognition of the emotion category of the speech signal by the plurality of sub-classifiers based on the feature values may include: in response to the emotion category prediction results recognized by at least two of the plurality of sub-classifiers being the same, recognizing that emotion category prediction result as the emotion category of the voice signal. In practical applications, suppose that five sub-classifiers are used to identify the emotion category of a speech signal.
- in one case, suppose that three of the sub-classifiers output the same emotion category prediction result (for example, happy), one sub-classifier outputs a different emotion category prediction result (for example, impatient), and one sub-classifier outputs yet another emotion category prediction result (for example, sadness). Then, according to the votes of the five sub-classifiers on the emotion category prediction results, a unique emotion category, namely happy, is identified.
- the emotion category happy is then taken as the final emotion category recognized by the multiple sub-classifiers.
- each sub-classifier may be assigned a corresponding weight in advance.
- the vote value of each sub-classifier for the emotion category prediction result it outputs is the weight of that sub-classifier
- the number of votes for each emotion category is the sum of the weights of all sub-classifiers that output that emotion category prediction result.
- the embodiments of the present disclosure are not limited to further identifying the emotion category based only on the weights of the sub-classifiers.
- the weight of each sub-classifier may be predetermined, or may be determined according to the test accuracy of each sub-classifier on a preset test sample set; for example, a sub-classifier with higher test accuracy has a greater weight, which is not limited in the embodiments of the present disclosure.
- the aforementioned obtaining of the voting results of the multiple sub-classifiers on the emotion category prediction results may include:
- taking the emotion category prediction result with the most votes among the emotion category prediction results output by the multiple sub-classifiers as the emotion category recognized by the multiple sub-classifiers.
- determining the emotion category of the voice signal according to the weights of the multiple sub-classifiers may include:
- for each identified emotion category, summing the weights of the sub-classifiers that voted for it, and taking the emotion category corresponding to the largest sum of the calculated weights as the emotion category recognized by the plurality of sub-classifiers.
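- The voting and tie-breaking logic described above can be sketched as follows; this is a minimal illustration assuming five sub-classifiers with pre-assigned weights, and the function and label names are not taken from the disclosure.

```python
from collections import defaultdict

def ensemble_emotion(predictions, weights):
    """Majority vote over sub-classifier outputs; ties are broken by
    summing the weights of the sub-classifiers behind each tied category."""
    votes = defaultdict(int)
    weight_sums = defaultdict(float)
    for pred, w in zip(predictions, weights):
        votes[pred] += 1
        weight_sums[pred] += w
    top = max(votes.values())
    tied = [cat for cat, n in votes.items() if n == top]
    if len(tied) == 1:
        return tied[0]              # a unique emotion category won the vote
    return max(tied, key=lambda cat: weight_sums[cat])  # break tie by weight

# Three of five sub-classifiers vote "happy", so "happy" is the final category.
preds = ["happy", "happy", "happy", "impatient", "sad"]
print(ensemble_emotion(preds, [0.9, 0.8, 0.7, 0.95, 0.85]))  # -> happy
```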
- the voice emotion category recognition method determines the final emotion category of the voice signal through the voting results of multiple classifiers. Compared with only using a single classifier to determine the emotion category of the voice signal, the voice emotion category recognition method according to the present disclosure can improve the accuracy and real-time performance of the emotion category recognition of the voice signal.
- the feature of the speech signal needs to be extracted.
- the number and types of extracted features have a significant impact on the accuracy and computational complexity of emotion category recognition.
- the number and types of features of the speech signal that need to be extracted are determined, so as to form the preset feature set that needs to be used in the actual emotion category recognition of the speech signal.
- the selection process of the features in the preset feature set will be described in detail below in conjunction with FIG. 3 and FIG. 4.
- Fig. 3 shows a schematic flowchart of a feature extraction method 300 according to an embodiment of the present disclosure.
- the feature extraction method 300 may include step 301, providing multiple voice signal samples; step 302, preprocessing the multiple voice signal samples; and step 303, extracting multiple features of each voice signal sample in the multiple voice signal samples.
- the multiple voice signal samples may come from an existing voice emotion database, such as a Berlin voice emotion database, or may be various voice signal samples accumulated over time.
- the pre-processing operation may be various pre-processing operations known in the art, which will not be repeated here.
- the multiple features may be the initial features extracted for each voice signal sample by an existing feature extractor used for signal processing and machine learning, such as openSMILE (open Speech and Music Interpretation by Large Space Extraction).
- These features may include, for example, frame energy, frame intensity, critical band spectrum, cepstrum coefficient, auditory spectrum, linear prediction coefficient, fundamental frequency, zero-crossing rate, and so on.
- the extracted feature values can be organized into an N × D matrix Z = (z_ij), where z_ij represents the value of the j-th feature of the i-th voice signal sample, 1 ≤ i ≤ N, 1 ≤ j ≤ D.
- each row of the matrix represents the values of the D features of one voice signal sample, and each column of the matrix represents the values of one feature across the N samples.
- the feature extraction method 300 may further include step 304 of calculating the emotional correlation between each of the multiple features and multiple emotional categories.
- emotional correlation can be calculated by the following general formula: SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)]
- X represents the feature vector
- Y represents the emotion category vector
- H(X) represents the entropy of X, i.e., H(X) = -Σ_m p(x_m) log2 p(x_m)
- H(Y) represents the entropy of Y, i.e., H(Y) = -Σ_l p(y_l) log2 p(y_l)
- H(X|Y) represents the conditional entropy of X given Y
- x_m and y_l are the possible values of X and Y respectively
- p(x_m) and p(y_l) are the probabilities of x_m and y_l respectively.
- step 304 essentially includes, for each feature vector f_j, 1 ≤ j ≤ D, calculating the emotional correlation SU(f_j, C), where C is the emotion category vector
- after step 304, D emotional correlations will be obtained.
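- For illustration, the emotional correlation SU(f_j, C) can be computed as in the sketch below; it assumes the feature vectors have already been discretized (the entropy formulas apply to discrete values), a preprocessing step the disclosure does not spell out, and the helper names are illustrative.

```python
import numpy as np

def entropy(values):
    """H(X) = -sum_m p(x_m) * log2 p(x_m) over the discrete values of X."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x, y):
    """H(X|Y): the entropy of X that remains once Y is known."""
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)]."""
    return 2.0 * (entropy(x) - conditional_entropy(x, y)) / (entropy(x) + entropy(y))
```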
- the feature extraction method 300 may further include a step 305 of selecting features from the plurality of features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a first candidate feature subset.
- the preset emotional relevance threshold can be set according to needs or experience.
- each calculated emotional correlation is compared with a preset emotional correlation threshold. If the calculated emotional relevance is greater than the preset emotional relevance threshold, then the feature corresponding to the calculated emotional relevance is selected from D features so as to be put into the first candidate feature subset. If the calculated emotional relevance is less than or equal to the preset emotional relevance threshold, the feature corresponding to the calculated emotional relevance is deleted from the D features.
- the feature extraction method 300 may further include step 306, using the feature with the greatest emotional relevance in the first candidate feature subset as a salient feature.
- the emotional relevance corresponding to the features in the first candidate feature subset can be sorted, so that the feature corresponding to the largest emotional relevance is taken as the salient feature.
- the feature extraction method 300 may further include step 307 of calculating the feature correlation between each feature in the first candidate feature subset and the salient feature.
- feature correlation can also be calculated by the same general formula: SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)]
- here X represents one feature vector
- Y represents another feature vector
- H(X) represents the entropy of X
- H(Y) represents the entropy of Y
- H(X|Y) represents the conditional entropy of X given Y
- x_m and y_l are the possible values of X and Y respectively
- p(x_m) and p(y_l) are the probabilities of x_m and y_l respectively.
- f_a corresponds to the feature vector of the salient feature in the first candidate feature subset
- f_b corresponds to the feature vector of one of the remaining features in the first candidate feature subset other than f_a
- the feature correlation between f_a and f_b is then SU(f_a, f_b).
- the feature extraction method 300 may further include step 308 of deleting features with a feature correlation greater than emotional correlation from the first candidate feature subset to obtain a second candidate feature subset.
- correspondingly, the emotional correlation between the feature corresponding to f_b and the emotion categories is SU(f_b, C).
- in step 308, for each remaining feature f_b other than f_a in the first candidate feature subset, the feature correlation of the feature is compared with its emotional correlation, and if the feature correlation is greater than the emotional correlation (i.e., SU(f_a, f_b) > SU(f_b, C)), the feature is deleted from the first candidate feature subset.
- after performing the above operations on all the remaining features in the first candidate feature subset other than f_a, the second candidate feature subset is obtained.
- the feature extraction method 300 may further include step 309 of calculating the variance of each feature in the second candidate feature subset.
- calculating the variance of a feature means calculating the variance of the N-dimensional feature vector corresponding to that feature. For example, if the feature vector corresponding to a feature in the second candidate feature subset is f_t, then calculating the variance of the feature means calculating the variance of f_t.
- the feature extraction method 300 may further include step 310, removing features whose variance is less than a variance threshold from the second candidate feature subset to obtain the features in the preset feature set.
- the variance threshold can be set according to actual needs or experience.
- the variance of the feature is compared with a variance threshold. If the variance of the feature is less than the variance threshold, the feature is deleted from the second candidate feature subset.
- the remaining features in the second candidate feature subset are the finally selected features.
- These finally selected features constitute the features in the preset feature set described in the previous section of this article.
- the preset feature set will be used in the actual speech signal emotion category recognition and the training of the speech emotion category recognition model of the classifier.
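- Putting steps 304 to 310 together, a compact sketch of the whole selection pipeline might look as follows, reusing the symmetric_uncertainty helper from the earlier sketch; the thresholds, matrix layout, and function name are assumptions for illustration, not the disclosed implementation.

```python
import numpy as np

def select_features(Z, C, su_threshold=0.1, var_threshold=0.01):
    """Sketch of steps 304-310: FCBF-style filtering, then variance filtering.

    Z is the N x D sample/feature matrix (columns pre-discretized for the
    entropy computations), C the length-N emotion category vector.
    Returns the column indices of the retained features.
    """
    D = Z.shape[1]
    # Steps 304-305: keep features whose emotional correlation exceeds the threshold.
    su_c = {j: symmetric_uncertainty(Z[:, j], C) for j in range(D)}
    first = [j for j in range(D) if su_c[j] > su_threshold]
    # Step 306: the feature with the greatest emotional correlation is salient.
    salient = max(first, key=lambda j: su_c[j])
    # Steps 307-308: drop features more correlated with the salient feature
    # than with the emotion categories, i.e. SU(f_a, f_b) > SU(f_b, C).
    second = [j for j in first if j == salient
              or symmetric_uncertainty(Z[:, salient], Z[:, j]) <= su_c[j]]
    # Steps 309-310: drop features whose variance falls below the threshold.
    return [j for j in second if Z[:, j].var() >= var_threshold]
```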
- the feature extraction method shown in Figure 3 first uses the Fast Correlation-Based Filter Solution to filter the features, and then uses the variance to further filter the features.
- the features that are less relevant to the emotion categories are first eliminated so as to retain the features that are more relevant to the emotion categories, and then the feature most relevant to the emotion categories is used to further filter out redundant features, which can greatly reduce the time complexity of the calculation.
- the feature extraction method in FIG. 3 uses feature variance to further remove features that do not change significantly.
- the feature extraction method shown in FIG. 4 first uses variance to filter the features, and then uses the fast-filtering feature selection algorithm (Fast Correlation-Based Filter Solution) to further filter the features.
- the feature extraction method of FIG. 4 will be described in detail below.
- FIG. 4 shows a schematic flowchart of another feature extraction method 400 according to at least one embodiment of the present disclosure.
- the feature extraction method 400 may include the following steps:
- the feature extraction method 300 of FIG. 3 differs from the feature extraction method 400 of FIG. 4 only in that the order of the fast filtering feature selection algorithm and the variance algorithm is different, those skilled in the art can fully implement the feature extraction method 400 based on the feature extraction method 300. Therefore, the specific implementation of the feature extraction method 400 will not be repeated here.
- the above-mentioned feature extraction method 300 may not include step 302.
- the aforementioned feature extraction method 400 may not include step 402.
- the speech signal samples in step 301 and step 401 have been processed in advance, or have met actual requirements without preprocessing. The embodiment of the present disclosure does not limit this.
- Fig. 5 is a schematic flowchart of a method for speech emotion recognition according to at least one embodiment of the present disclosure. As shown in Figure 5, the voice emotion recognition method includes steps S510 to S550.
- step S510: select features to obtain a feature set.
- step S510 may be implemented based on the feature extraction method 300 of FIG. 3 or the feature extraction method 400 of FIG. 4.
- the feature extraction method 300 of FIG. 3 and the feature extraction method 400 of FIG. 4 please refer to the above description of the methods of FIG. 3 and FIG. 4, which will not be repeated here.
- the classifier may include multiple sub-classifiers.
- the sub-classifiers can include various classifiers, such as support vector machine classifiers, decision tree classifiers, neural network classifiers, and so on.
- Each sub-classifier can include a speech emotion category recognition model.
- Each speech emotion category recognition model uses the feature set obtained in step S510 and the same emotion category set (which includes emotion categories such as happiness, urgency, impatient, sadness, etc.) for training.
- the neural network classifier may include a back-propagation neural network; the input layer of the neural network may correspond to the features of the preset feature set, and the output layer may correspond to the emotion categories in the emotion category set as described above.
- the decision tree classifier according to the embodiment of the present disclosure may use a pre-pruning operation.
- the support vector machine classifier according to the embodiment of the present disclosure may use a soft-margin support vector machine, so as to find, as far as possible, a clean separating hyperplane between two emotion categories that are not easily divided.
- the test voice signal is a voice signal input by the user in an actual application, which is not limited in the embodiments of the present disclosure.
- step S540: based on the feature set, extract the values of the features in the feature set from the test voice signal.
- the feature set used in step S540 is the feature set obtained in step S510.
- Step S540 is basically the same as step 202 described above. Therefore, the detailed description of step S540 can refer to the description of step 202 above, which will not be repeated in the embodiment of the present disclosure.
- step S550: use a classifier to recognize the emotion category of the test speech signal.
- the classifier used in step S550 is the trained classifier obtained in step S520.
- Step S550 is basically the same as step 203 described above. Therefore, the detailed description of step S550 can refer to the description of step 203 above, which will not be repeated in the embodiment of the present disclosure.
- steps S510 and S520 can be performed in advance, and in a user's specific application only steps S530 to S550 are performed. For example, steps S510 and S520 can be performed only once, with the trained classifier obtained in step S520 stored in a remote server or in local storage of the user's client; then, in each specific application, only steps S530 to S550 need to be performed. For another example, steps S510 and S520 can be executed periodically or irregularly using new training data to update the classifier. However, it should be understood that the embodiments of the present disclosure do not limit this.
- Fig. 6 shows a schematic flowchart of a semantic recognition method according to at least one embodiment of the present disclosure.
- the semantic recognition method includes:
- S610: converting the voice signal into text information, and using the target dialogue state of the previous round of dialogue as the current dialogue state;
- S620: recognizing the named entity of the text information;
- S630: determining, according to the named entity, a vector to be recognized corresponding to the named entity;
- S640: based on the vector to be recognized, determining the intent of the standard feature vector that meets the requirements as the current intent of the text information; and
- S650: determining the target dialogue state according to the current dialogue state and the current intent.
- a dialogue state is determined according to the state transition table, and that dialogue state is taken as the target dialogue state; further, when the next text information is received, the target dialogue state can be used as the current dialogue state for determining the dialogue state of the next text message.
- the intentions corresponding to the two adjacent voice signals input by the user can be associated, so that the current intention of the user can be correctly understood.
- the state transition table may include multiple dialogue states and multiple intents, and different dialogue states can be switched to the next dialogue state according to the corresponding intent.
- for example, dialogue state 1 can be switched to dialogue state 4 when the current intent is intent 1;
- dialogue state 1 can be switched to dialogue state 2 when the current intent is intent 2;
- dialogue state 1 can be switched to dialogue state 3 when the current intent is intent 5;
- dialogue state 2 can be switched to dialogue state 4 when the current intent is intent 3;
- dialogue state 3 can be switched to dialogue state 2 when the current intent is intent 6; and
- dialogue state 3 can be switched to a corresponding dialogue state when the current intent is intent 4.
- for example, after the user inputs "how is the weather today?" and then "what about tomorrow?", the dialogue state can be switched to the target dialogue state "how is the weather tomorrow?". It can be seen that a single input of text information by the user (i.e., "tomorrow") cannot determine the user's specific meaning, but associating the two successive inputs of text information (i.e., "how is the weather today?" and "what about tomorrow?") makes it possible to understand the user's current intent correctly.
- FIG. 7 is a schematic state transition table according to at least one embodiment of the present disclosure. It should be noted that the state transition table shown in FIG. 7 is only exemplary; the embodiments of the present disclosure do not limit the number and content of the dialogue states and intents in the state transition table, nor the specific switching manner between them, and adjustments can be made according to actual needs. It is understandable that the state transition table may be stored in the server in advance. For example, the state transition table may be set by a technician based on actual experience, or it may be obtained by statistics or learning based on big data, which is not limited in the embodiments of the present disclosure.
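- By way of illustration only, such a state transition table can be encoded as a lookup keyed by (current dialogue state, current intent); the state and intent labels below follow the example transitions listed above, and the fallback of remaining in the current state on an unknown pair is an assumption, not part of the disclosure.

```python
# Illustrative encoding of a state transition table like the one in FIG. 7.
STATE_TRANSITIONS = {
    ("state_1", "intent_1"): "state_4",
    ("state_1", "intent_2"): "state_2",
    ("state_1", "intent_5"): "state_3",
    ("state_2", "intent_3"): "state_4",
    ("state_3", "intent_6"): "state_2",
}

def next_dialogue_state(current_state, current_intent):
    """Step S650 sketch: look up the target dialogue state; remain in the
    current state when no transition is defined (an assumed fallback)."""
    return STATE_TRANSITIONS.get((current_state, current_intent), current_state)

print(next_dialogue_state("state_1", "intent_2"))  # -> state_2
```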
- the voice signal may be converted into text information by any known method, which will not be repeated in the embodiment of the present disclosure.
- the named entity of the text information may be recognized through the named entity recognition model.
- Named entity recognition can refer to the recognition of entities with specific meanings in texts, such as proper nouns such as person names, organization names, and place names, and meaningful time. It is the basic task of information retrieval, question and answer systems and other technologies. For example, in “Xiao Ming is on vacation in Hawaii.”, the named entities are: “Xiao Ming-name of person", "Hawaii-name of place”. You can use language grammar-based techniques and statistical models (such as machine learning) to establish a named entity recognition system.
- the ways to use entity detection and recognition include: (1) first performing entity detection and then identifying the detected entity, or (2) combining entity detection and recognition into one model, and obtaining the position and category tags of the characters for marking.
- step S620 other models or other methods may be used to identify the named entity of the text information, which is not limited in the embodiment of the present disclosure.
- a deep learning model may be used to determine the vector to be recognized corresponding to the named entity according to the named entity. It should be noted that in step S630, other models or other methods may also be used to determine the vector to be recognized corresponding to the named entity according to the named entity, which is not limited in the embodiment of the present disclosure.
- step S640 based on the vector to be recognized, the intent of the standard feature vector with the greatest similarity to the vector to be recognized may be determined as the current intent of the text information.
- the standard feature vector that meets the requirements can also be another standard feature vector, depending on the actual situation; the embodiments of the present disclosure impose no specific restrictions on this.
- after the voice signal is converted into the text information "I want to see Monet's Woman with a Parasol" in step S610, the text information is input into the named entity recognition model, and the named entities in it are recognized through the named entity recognition model.
- the named entity recognition model performs the following operations on the received text information:
- the named entity recognition model takes a string of characters (for example, corresponding to a sentence or paragraph in the text information) as input, and recognizes the related nouns mentioned in the string (people, places, and organizations).
- the text information input into the named entity recognition model is the character-by-character sequence (a character-level gloss of the Chinese sentence): [I, think, look, Mo, Nai, 's, zhang, sun, umbrella, 's, female, person, O, ..., O];
- the named entity tags identified by the named entity recognition model are: [O, O, O, B-PER, I-PER, O, B-PIC, I-PIC, I-PIC, I-PIC, I-PIC, O, ..., O], so the named entities are: person - Monet; picture - woman with a parasol.
- the named entity recognized by the named entity recognition model can determine its corresponding to-be-recognized vector after passing through the deep learning model.
- the vector to be recognized may be a feature vector, which includes text features that are classified and extracted from named entities through a deep learning model.
- a preset corpus (such as "for PIC (picture)", “author's nationality”, “PERSON's painting”, etc.) can be input into the deep learning model to obtain multiple standard features vector.
- the standard feature vector may be a feature vector that includes text features classified and extracted from the corpus through a deep learning model.
- the action of obtaining the standard feature vector can be completed in advance, or can be completed in real time, and can be set according to a specific scenario, which is not limited in the embodiment of the present disclosure.
- sentences with different expressions but the same purpose can be classified into the same intent through the deep learning model. For example, "I want to see the Mona Lisa", "Help me change to the Mona Lisa", and "Show me, switch to the Mona Lisa" can all be classified as the same intent: "the user wants to switch to the PIC (Mona Lisa) and take a look".
- the cosine similarity between the vector to be recognized and multiple standard feature vectors can be computed, and the intent of the standard feature vector with the greatest similarity is taken as the current intent of the vector to be recognized, that is, of the text information; here, the current intent of "I want to see the Mona Lisa" is to "look at the PIC".
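- A minimal sketch of this matching step follows; the vectors are assumed to come from the deep learning model, and the helper names and toy values are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_intent(query_vec, standard_vecs):
    """Step S640 sketch: return the intent whose standard feature vector is
    most similar (by cosine similarity) to the vector to be recognized."""
    return max(standard_vecs, key=lambda intent: cosine(query_vec, standard_vecs[intent]))

# Toy example with made-up 3-dimensional vectors.
standards = {"look at PIC": np.array([0.9, 0.1, 0.0]),
             "ask author":  np.array([0.1, 0.9, 0.0])}
print(match_intent(np.array([0.8, 0.2, 0.1]), standards))  # -> look at PIC
```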
- the target dialogue state may be determined according to the current dialogue state acquired in S610 and the current intention determined in S640.
- the target conversation state can be used to determine the answer.
- a preset database may be included in the memory.
- the preset database may include multiple entries. Each item can include three attributes: semantics, emotional category, and reply.
- it may include retrieving from the preset database a response that matches both the recognized semantics (i.e., the target dialogue state) and the emotion category, and then outputting it to the user.
- FIG. 8 shows a schematic structural diagram of a question answering system 500 according to at least one embodiment of the present disclosure.
- the question answering system 500 may include a receiver 501 configured to receive voice signals.
- the receiver 501 may be configured to continuously receive multiple voice signals.
- the question answering system 500 may also include a recognition system 502, which is configured to recognize the semantic and emotional categories of the speech signal.
- the recognition system 502 may include a speech semantic recognition device 5021 and a speech emotion recognition device 5022.
- the voice semantic recognition device 5021 may be configured to recognize the semantics of a voice signal.
- the speech semantic recognition device 5021 can recognize the semantics of the speech signal in various methods known in the art.
- the voice emotion recognition device 5022 may be configured to recognize the emotion category of the voice signal. According to the present disclosure, the voice emotion recognition device 5022 can recognize the emotion category of the voice signal using the voice emotion recognition method described above. The structure of the voice emotion recognition device will be described in detail later with reference to FIG. 9.
- the question answering system 500 may further include an outputter 503, which is configured to output answers based on the semantics and emotion categories of the voice signal.
- the receiver 501, the recognition system 502, and the outputter 503 may be provided separately from one another.
- for example, the receiver 501 and the outputter 503 may be deployed on the user side, and the recognition system 502 may be deployed on a server or in the cloud.
- the question answering system 500 may include a memory, which is configured to store various information, such as voice signals, preset feature sets as described above, semantics recognized by the voice semantic recognition device 5021, and voice The emotion categories recognized by the emotion recognition device 5022, various classifiers, a preset database including semantics, emotion categories, and responses, and so on.
- FIG. 9 shows a schematic structural diagram of a speech emotion recognition device 600 according to at least one embodiment of the present disclosure.
- the speech emotion recognition device 600 may include: a pre-processor 601, configured to preprocess the speech signal; an extractor 602, configured to extract the values of the features in the preset feature set from the preprocessed speech signal; and a recognizer 603, configured to recognize the emotion category of the speech signal through the classifier based on the extracted feature values.
- the classifier may include a plurality of sub-classifiers.
- the recognizer 603 may be configured to recognize the emotion category of the voice signal based on the value of the feature by the plurality of sub-classifiers.
- the features in the preset feature set are selected from multiple features based on a fast-filtered feature selection algorithm and variance.
- the process of selecting the features in the preset feature set from multiple features based on the fast-filtering feature selection algorithm and variance may be the feature extraction method shown in FIG. 3 or the feature extraction method shown in FIG. 4.
- a computer device may include: a memory, which stores a computer program; and a processor, which is configured to, when executing the computer program, perform the voice emotion recognition method shown in FIG. 2 or the question and answer method shown in FIG. 1A.
- FIG. 10 shows a schematic structural diagram of a computing system 1100 suitable for implementing the speech emotion recognition method and device, the semantic recognition method, or the question answering method and system of at least one embodiment of the present disclosure.
- the computing system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processing based on a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage portion 1108 into a random access memory (RAM) 1103.
- in the RAM 1103, various programs and data required for the operation of the system 1100 are also stored.
- the CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other through a bus 1104.
- An input/output (I/O) interface 1105 is also connected to the bus 1104.
- the following components are connected to the I/O interface 1105: an input part 1106 including a keyboard, a mouse, and a voice input device such as a microphone; an output part 1107 including a cathode ray tube display, a liquid crystal display, a speaker, etc.; and a storage part 1108 including a hard disk, etc.
- the communication section 1109 performs communication processing via a network such as the Internet.
- a drive 1110 is also connected to the I/O interface 1105 as needed.
- a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1110 as needed, so that the computer program read from it is installed into the storage portion 1108 as needed.
- an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, and the computer program includes program code for implementing the methods and apparatuses of FIGS. 1A to 9.
- the computer program may be downloaded and installed from the network through the communication part 1109, and/or installed from the removable medium 1111.
- each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the prescribed logical functions.
- the functions marked in the blocks may also occur in an order different from that marked in the drawings; for example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
- each block in the block diagram and/or flowchart, and any combination of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units or modules involved in the embodiments described in the present application can be implemented in software or hardware.
- exemplary types of hardware include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-chip (SOCs), complex programmable logic devices (CPLDs), and so on.
- the described units or modules can also be provided in the processor.
- under certain circumstances, the names of these units or modules do not constitute a limitation on the units or modules themselves.
- a non-transitory computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the voice emotion recognition method shown in FIG. 2, the question answering method shown in FIG. 1A, or the semantic recognition method shown in FIG. 6.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- Bioinformatics & Computational Biology (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Psychiatry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Child & Adolescent Psychology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (18)
- A speech emotion recognition method, comprising: determining, based on a preset feature set, values of audio features in the feature set from a speech signal; and inputting the determined values of the audio features in the feature set into a classifier and outputting an emotion category of the speech signal from the classifier, wherein the classifier comprises a plurality of sub-classifiers, and wherein the inputting and outputting comprises: inputting the determined values of the audio features in the feature set into the plurality of sub-classifiers respectively; outputting emotion category prediction results of the speech signal from the plurality of sub-classifiers respectively; and recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers.
- The method according to claim 1, further comprising: providing a plurality of speech signal samples; extracting a plurality of features of each of the plurality of speech signal samples; calculating an emotional relevance between each of the plurality of features and a plurality of emotion categories; selecting, from the plurality of features, features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a first candidate feature subset; taking the feature with the greatest emotional relevance in the first candidate feature subset as a salient feature; calculating a feature relevance between each of the remaining features in the first candidate feature subset and the salient feature; deleting, from the first candidate feature subset, features whose feature relevance is greater than their emotional relevance to obtain a second candidate feature subset; calculating a variance of each feature in the second candidate feature subset; and deleting, from the second candidate feature subset, features whose variance is less than a variance threshold to obtain the features in the preset feature set.
- The method according to claim 1, further comprising: providing a plurality of speech signal samples; extracting a plurality of features of each of the plurality of speech signal samples; calculating a variance of each of the plurality of features; deleting, from the plurality of features, features whose variance is less than a variance threshold to obtain a third candidate feature subset; calculating an emotional relevance between each feature in the third candidate feature subset and a plurality of emotion categories; selecting, from the third candidate feature subset, features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a fourth candidate feature subset; taking the feature with the greatest emotional relevance in the fourth candidate feature subset as a salient feature; calculating a feature relevance between each of the remaining features in the fourth candidate feature subset and the salient feature; and deleting, from the fourth candidate feature subset, features whose feature relevance is greater than their emotional relevance to obtain the features in the preset feature set.
- The method according to claim 2 or 3, wherein the emotional relevance is calculated by the following formula, in which X represents a feature vector, Y represents an emotion category vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the conditional entropy of X given Y; and wherein the feature relevance is calculated by the following formula.
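The two formula images of this claim are not reproduced in this text. Given the variable definitions, they have the form of the symmetrical uncertainty used by fast-filter (FCBF-style) feature selection; as a reconstruction under that assumption, not a quotation of the original drawings:

$$\mathrm{SU}(X, Y) = \frac{2\,[H(X) - H(X \mid Y)]}{H(X) + H(Y)}$$

with the feature relevance between a feature vector $X_i$ and the salient feature vector $X_j$ computed by the same form, $\mathrm{SU}(X_i, X_j)$.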
- The method according to any one of claims 1 to 4, wherein recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers comprises: recognizing the emotion category of the speech signal according to votes of the plurality of sub-classifiers on the emotion category prediction results and weights of the plurality of sub-classifiers.
- The method according to claim 5, wherein recognizing the emotion category of the speech signal according to the votes of the plurality of sub-classifiers on the emotion category prediction results and the weights of the plurality of sub-classifiers comprises: obtaining a voting result of the plurality of sub-classifiers on the emotion category prediction results; in response to a unique emotion category being identified from the voting result, taking the unique emotion category as the emotion category of the speech signal; and in response to at least two emotion categories being identified from the voting result, determining the emotion category of the speech signal according to the weights of the plurality of sub-classifiers.
- The method according to any one of claims 1 to 4, wherein recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers comprises: in response to the emotion category prediction results recognized by at least two of the plurality of sub-classifiers being the same, taking the emotion category prediction result recognized by the at least two sub-classifiers as the emotion category of the speech signal.
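A minimal sketch of the two-stage decision of claims 5 to 7, majority vote first and classifier weights as the tie-breaker; the weight values in the example are hypothetical:

```python
from collections import Counter

def decide_emotion(predictions, weights):
    """predictions: one emotion label per sub-classifier; weights: aligned weights."""
    votes = Counter(predictions)
    top = max(votes.values())
    leaders = [label for label, c in votes.items() if c == top]
    if len(leaders) == 1:
        return leaders[0]  # a unique emotion category wins the vote
    # Tie between at least two categories: sum the weights of the
    # sub-classifiers behind each leading category.
    weighted = {label: sum(w for p, w in zip(predictions, weights) if p == label)
                for label in leaders}
    return max(weighted, key=weighted.get)

# decide_emotion(["happy", "neutral", "happy"], [0.3, 0.4, 0.3]) -> "happy"
```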
- The method according to any one of claims 1 to 7, wherein the plurality of sub-classifiers comprise a support vector machine classifier, a decision tree classifier, and a neural network classifier.
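For illustration, such a three-member ensemble could be assembled with scikit-learn's VotingClassifier; the hyper-parameters and weights below are placeholders rather than values from the disclosure:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf")),
        ("tree", DecisionTreeClassifier(max_depth=8)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
    ],
    voting="hard",      # majority vote over the predicted emotion categories
    weights=[2, 1, 2],  # hypothetical per-classifier weights for tie-breaking
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```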
- A semantic recognition method, comprising: converting a speech signal into text information; using a target dialogue state of a previous round of dialogue as a current dialogue state; performing semantic understanding on the text information to obtain a current intent of a user; and determining a target dialogue state according to the current dialogue state and the current intent, and using the target dialogue state as the semantics of the speech signal.
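One minimal reading of the state hand-over in claim 9; the state names and the transition table are invented for illustration:

```python
# Hypothetical transition table: (current dialogue state, current intent) -> target state.
TRANSITIONS = {
    ("idle", "ask_weather"): "weather_query",
    ("weather_query", "ask_time"): "time_query",
}

def update_dialogue_state(previous_target_state, current_intent):
    # The previous round's target state is this round's current state; the pair
    # (state, intent) selects the new target state, used as the semantics.
    return TRANSITIONS.get((previous_target_state, current_intent),
                           previous_target_state)
```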
- The method according to claim 9, wherein performing semantic understanding on the text information to obtain the current intent of the user comprises: recognizing a named entity of the text information; determining, according to the named entity, a vector to be recognized corresponding to the named entity; and determining, based on the vector to be recognized, an intent of a standard feature vector that meets a requirement as the current intent of the text information.
- The method according to claim 10, wherein recognizing the named entity of the text information comprises: recognizing the named entity of the text information through a named entity recognition model.
- The method according to claim 10 or 11, wherein determining, according to the named entity, the vector to be recognized corresponding to the named entity comprises: determining, through a deep learning model, the vector to be recognized corresponding to the named entity.
- The method according to any one of claims 10 to 12, wherein determining, based on the vector to be recognized, the intent of the standard feature vector that meets the requirement as the current intent of the text information comprises: determining an intent of the standard feature vector having the greatest similarity to the vector to be recognized as the current intent of the text information.
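A sketch of the similarity match of claims 10 to 13, assuming the vector to be recognized and the standard feature vectors have already been produced (for example by the deep learning model of claim 12):

```python
import numpy as np

def match_intent(query_vec, standard_vectors):
    """standard_vectors: dict of intent name -> standard feature vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    # The intent whose standard feature vector is most similar to the
    # vector to be recognized becomes the current intent.
    return max(standard_vectors,
               key=lambda intent: cosine(query_vec, standard_vectors[intent]))
```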
- A question answering method, comprising: receiving a speech signal; recognizing semantics and an emotion category of the speech signal; and outputting a response based on the semantics and the emotion category of the speech signal, wherein recognizing the emotion category of the speech signal comprises recognizing the emotion category of the speech signal according to the method of any one of claims 1 to 8, and recognizing the semantics of the speech signal comprises recognizing the semantics of the speech signal according to the method of any one of claims 9 to 13.
- The question answering method according to claim 14, wherein outputting a response based on the semantics and the emotion category of the speech signal comprises: selecting and outputting, from a plurality of preset responses, a response that matches the recognized semantics and emotion category of the speech signal.
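A sketch of selecting a preset reply keyed by the recognized semantics and emotion category; the reply table is invented for illustration:

```python
REPLIES = {
    ("weather_query", "happy"): "Glad you asked! It is sunny today.",
    ("weather_query", "angry"): "Sorry for the wait. Today's forecast is sunny.",
}

def answer(semantics, emotion, fallback="Could you rephrase that?"):
    # Look up the reply matching both the semantics and the emotion category.
    return REPLIES.get((semantics, emotion), fallback)
```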
- The question answering method according to claim 14 or 15, further comprising: determining an emotion category of a current round of question answering based on emotion categories determined in at least one previous round of question answering.
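Claim 16 leaves the combination rule open; one simple reading is a majority over a short window of recent rounds (the window size and the fallback are assumptions):

```python
from collections import Counter

def smooth_emotion(history, current, window=3):
    """history: emotion categories of previous rounds, oldest first."""
    recent = history[-(window - 1):] + [current]
    # Majority over the recent window; a tie falls back to the current prediction.
    label, count = Counter(recent).most_common(1)[0]
    return label if count > 1 else current
```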
- A computer device, comprising: a memory storing a computer program; and a processor configured to, when executing the computer program, perform at least one of: the method according to any one of claims 1 to 8; the method according to any one of claims 9 to 13; and the method according to any one of claims 14 to 16.
- A non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform at least one of: the method according to any one of claims 1 to 8; the method according to any one of claims 9 to 13; and the method according to any one of claims 14 to 16.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910333653.4 | 2019-04-24 | ||
CN201910333653.4A CN110047517A (en) | 2019-04-24 | 2019-04-24 | Speech-emotion recognition method, answering method and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020216064A1 true WO2020216064A1 (en) | 2020-10-29 |
Family
ID=67279086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/083751 WO2020216064A1 (en) | 2019-04-24 | 2020-04-08 | Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110047517A (en) |
WO (1) | WO2020216064A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047517A (en) * | 2019-04-24 | 2019-07-23 | 京东方科技集团股份有限公司 | Speech-emotion recognition method, answering method and computer equipment |
CN110619041A (en) * | 2019-09-16 | 2019-12-27 | 出门问问信息科技有限公司 | Intelligent dialogue method and device and computer readable storage medium |
CN113223498A (en) * | 2021-05-20 | 2021-08-06 | 四川大学华西医院 | Swallowing disorder identification method, device and apparatus based on throat voice information |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030110038A1 (en) * | 2001-10-16 | 2003-06-12 | Rajeev Sharma | Multi-modal gender classification using support vector machines (SVMs) |
CN103810994B (en) * | 2013-09-05 | 2016-09-14 | 江苏大学 | Speech emotional inference method based on emotion context and system |
CN104008754B (en) * | 2014-05-21 | 2017-01-18 | 华南理工大学 | Speech emotion recognition method based on semi-supervised feature selection |
CN105869657A (en) * | 2016-06-03 | 2016-08-17 | 竹间智能科技(上海)有限公司 | System and method for identifying voice emotion |
CN106254186A (en) * | 2016-08-05 | 2016-12-21 | 易晓阳 | A kind of interactive voice control system for identifying |
CN106683672B (en) * | 2016-12-21 | 2020-04-03 | 竹间智能科技(上海)有限公司 | Intelligent dialogue method and system based on emotion and semantics |
CN107609588B (en) * | 2017-09-12 | 2020-08-18 | 大连大学 | Parkinson patient UPDRS score prediction method based on voice signals |
CN107945790B (en) * | 2018-01-03 | 2021-01-26 | 京东方科技集团股份有限公司 | Emotion recognition method and emotion recognition system |
CN108319987B (en) * | 2018-02-20 | 2021-06-29 | 东北电力大学 | Filtering-packaging type combined flow characteristic selection method based on support vector machine |
CN108922512A (en) * | 2018-07-04 | 2018-11-30 | 广东猪兼强互联网科技有限公司 | A kind of personalization machine people phone customer service system |
CN109274819A (en) * | 2018-09-13 | 2019-01-25 | 广东小天才科技有限公司 | User emotion method of adjustment, device, mobile terminal and storage medium when call |
- 2019-04-24: CN CN201910333653.4A patent/CN110047517A/en active Pending
- 2020-04-08: WO PCT/CN2020/083751 patent/WO2020216064A1/en active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160155439A1 (en) * | 2001-12-07 | 2016-06-02 | At&T Intellectual Property Ii, L.P. | System and method of spoken language understanding in human computer dialogs |
CN105260416A (en) * | 2015-09-25 | 2016-01-20 | 百度在线网络技术(北京)有限公司 | Voice recognition based searching method and apparatus |
WO2018060993A1 (en) * | 2016-09-27 | 2018-04-05 | Faception Ltd. | Method and system for personality-weighted emotion analysis |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | 南京师范大学 | One kind being based on the adjustable speech-emotion recognition method of susceptibility and system |
CN109616108A (en) * | 2018-11-29 | 2019-04-12 | 北京羽扇智信息科技有限公司 | More wheel dialogue interaction processing methods, device, electronic equipment and storage medium |
CN110047517A (en) * | 2019-04-24 | 2019-07-23 | 京东方科技集团股份有限公司 | Speech-emotion recognition method, answering method and computer equipment |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112735418A (en) * | 2021-01-19 | 2021-04-30 | 腾讯科技(深圳)有限公司 | Voice interaction processing method and device, terminal and storage medium |
CN112735418B (en) * | 2021-01-19 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Voice interaction processing method, device, terminal and storage medium |
CN112784583A (en) * | 2021-01-26 | 2021-05-11 | 浙江香侬慧语科技有限责任公司 | Multi-angle emotion analysis method, system, storage medium and equipment |
CN113239799A (en) * | 2021-05-12 | 2021-08-10 | 北京沃东天骏信息技术有限公司 | Training method, recognition method, device, electronic equipment and readable storage medium |
CN113674736A (en) * | 2021-06-30 | 2021-11-19 | 国网江苏省电力有限公司电力科学研究院 | Classifier integration-based teacher classroom instruction identification method and system |
CN113539243A (en) * | 2021-07-06 | 2021-10-22 | 上海商汤智能科技有限公司 | Training method of voice classification model, voice classification method and related device |
CN113689886A (en) * | 2021-07-13 | 2021-11-23 | 北京工业大学 | Voice data emotion detection method and device, electronic equipment and storage medium |
CN113689886B (en) * | 2021-07-13 | 2023-05-30 | 北京工业大学 | Voice data emotion detection method and device, electronic equipment and storage medium |
CN115083439A (en) * | 2022-06-10 | 2022-09-20 | 北京中电慧声科技有限公司 | Vehicle whistling sound identification method, system, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110047517A (en) | 2019-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020216064A1 (en) | Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium | |
CN111046133B (en) | Question and answer method, equipment, storage medium and device based on mapping knowledge base | |
CN108829757B (en) | Intelligent service method, server and storage medium for chat robot | |
CN108319666B (en) | Power supply service assessment method based on multi-modal public opinion analysis | |
CN109493850B (en) | Growing type dialogue device | |
TWI536364B (en) | Automatic speech recognition method and system | |
US10515292B2 (en) | Joint acoustic and visual processing | |
CN111524527B (en) | Speaker separation method, speaker separation device, electronic device and storage medium | |
CN113094578B (en) | Deep learning-based content recommendation method, device, equipment and storage medium | |
WO2019179496A1 (en) | Method and system for retrieving video temporal segments | |
CN111445898B (en) | Language identification method and device, electronic equipment and storage medium | |
US11735190B2 (en) | Attentive adversarial domain-invariant training | |
WO2022252636A1 (en) | Artificial intelligence-based answer generation method and apparatus, device, and storage medium | |
CN109584865B (en) | Application program control method and device, readable storage medium and terminal equipment | |
US20230206928A1 (en) | Audio processing method and apparatus | |
US20230089308A1 (en) | Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering | |
CN110377695B (en) | Public opinion theme data clustering method and device and storage medium | |
Elshaer et al. | Transfer learning from sound representations for anger detection in speech | |
KR20200105057A (en) | Apparatus and method for extracting inquiry features for alalysis of inquery sentence | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN112632248A (en) | Question answering method, device, computer equipment and storage medium | |
CN111209367A (en) | Information searching method, information searching device, electronic equipment and storage medium | |
CN115878847B (en) | Video guiding method, system, equipment and storage medium based on natural language | |
CN116775873A (en) | Multi-mode dialogue emotion recognition method | |
CN115357720B (en) | BERT-based multitasking news classification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20794846 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20794846 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.05.2022) |
|