WO2020216064A1 - Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium - Google Patents

Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium

Info

Publication number
WO2020216064A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
emotion category
classifiers
emotion
sub-classifier
Prior art date
Application number
PCT/CN2020/083751
Other languages
French (fr)
Chinese (zh)
Inventor
贾红红
胡风硕
Original Assignee
京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
Publication of WO2020216064A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225: Feedback of the input speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Definitions

  • the embodiments of the present disclosure relate to speech emotion recognition methods, semantic recognition methods, question answering methods, computer equipment, and computer-readable storage media.
  • At present, in most intelligent question answering systems, a reply is given only according to the voice command issued by the user.
  • In a few intelligent question answering systems, in addition to the voice command, a corresponding reply is also given based on the emotion in the voice.
  • Methods based on deep learning place high demands on hardware resources and have difficulty achieving real-time performance.
  • Methods based on machine learning can achieve a degree of real-time performance, but they rely on prior knowledge to extract the most useful features and to select the most suitable classifier.
  • In at least one embodiment, the speech emotion recognition method may include: determining, from the speech signal, the values of the features in a preset feature set; inputting the determined values of the audio features in the feature set to a classifier; and outputting the emotion category of the speech signal from the classifier.
  • The classifier includes a plurality of sub-classifiers. Inputting the determined values of the audio features in the feature set into the classifier and outputting the emotion category of the speech signal from the classifier includes: inputting the determined values of the audio features in the feature set into the multiple sub-classifiers respectively; outputting emotion category prediction results of the speech signal from the multiple sub-classifiers respectively; and identifying the emotion category of the speech signal based on the emotion category prediction results output from the multiple sub-classifiers.
  • In at least one embodiment, the speech emotion recognition method may include: providing a plurality of speech signal samples; extracting a plurality of features from each of the speech signal samples; calculating the emotional relevance between each of the plurality of features and multiple emotion categories; selecting, from the plurality of features, the features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a first candidate feature subset; taking the feature with the greatest emotional relevance in the first candidate feature subset as the salient feature; calculating the feature correlation between each of the remaining features in the first candidate feature subset and the salient feature; deleting, from the first candidate feature subset, the features whose feature correlation is greater than their emotional relevance to obtain a second candidate feature subset; calculating the variance of each feature in the second candidate feature subset; and deleting, from the second candidate feature subset, the features whose variance is less than a variance threshold, to obtain the features in the preset feature set.
  • In at least one embodiment, the speech emotion recognition method may include: providing a plurality of speech signal samples; extracting a plurality of features from each of the speech signal samples; calculating the variance of each of the plurality of features; deleting, from the plurality of features, the features whose variance is less than a variance threshold to obtain a third candidate feature subset; calculating the emotional relevance between each feature in the third candidate feature subset and multiple emotion categories; selecting, from the third candidate feature subset, the features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a fourth candidate feature subset; taking the feature with the greatest emotional relevance in the fourth candidate feature subset as the salient feature; calculating the feature correlation between each of the remaining features in the fourth candidate feature subset and the salient feature; and deleting, from the fourth candidate feature subset, the features whose feature correlation is greater than their emotional relevance, to obtain the features in the preset feature set.
  • In some embodiments, the emotional relevance is calculated by the symmetric uncertainty formula SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)], where X represents the feature vector, Y represents the emotion category vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the conditional entropy of X given Y.
  • In some embodiments, the feature correlation is calculated by the same symmetric uncertainty formula, with X and Y representing the two feature vectors being compared.
  • In some embodiments, identifying the emotion category of the speech signal may include: identifying the emotion category of the speech signal according to the votes of the multiple sub-classifiers on the emotion category prediction results and the weights of the multiple sub-classifiers.
  • Identifying the emotion category of the speech signal according to the votes of the multiple sub-classifiers on the emotion category prediction results and the weights of the multiple sub-classifiers may include: obtaining the voting result of the multiple sub-classifiers on the emotion category prediction results; in response to a unique emotion category being identified by the votes of the multiple sub-classifiers on the emotion category prediction results, taking the unique emotion category as the emotion category of the speech signal; and in response to at least two emotion categories being identified by the votes of the multiple sub-classifiers on the emotion category prediction results, determining the emotion category of the speech signal according to the weights of the multiple sub-classifiers.
  • Recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers may include: in response to the emotion category prediction results recognized by at least two of the plurality of sub-classifiers being the same, recognizing that shared emotion category prediction result as the emotion category of the speech signal.
  • the plurality of sub-classifiers may include a support vector machine classifier, a decision tree classifier, and a neural network classifier.
  • At least one embodiment of the present disclosure also provides a semantic recognition method, which includes: converting a voice signal into text information; using the target dialogue state of the previous round of dialogue as the current dialogue state; performing semantic understanding on the text information to acquire the current intention of the user; and determining a target dialogue state according to the current dialogue state and the current intention, and using the target dialogue state as the semantics of the voice signal.
  • At least one embodiment of the present disclosure also provides a question and answer method.
  • the question answering method may include: receiving a voice signal; recognizing the semantic and emotional category of the voice signal; and outputting a response based on the semantic and emotional category of the voice signal.
  • Recognizing the emotion category of the voice signal may include recognizing the emotion category of the voice signal according to the aforementioned voice emotion recognition method. Recognizing the semantics of the voice signal includes: recognizing the semantics of the voice signal according to the semantic recognition method described above.
  • In some embodiments, outputting a response based on the semantics and emotion category of the voice signal includes: selecting, from a plurality of preset responses, a response matching the recognized semantics and emotion category of the voice signal, and outputting it.
  • the question answering method further includes: determining the current emotion category based on the emotion category determined in at least one previous round of question and answer.
  • In at least one embodiment, the computer device may include: a memory, which stores a computer program; and a processor, which is configured to, when executing the computer program, perform the aforementioned speech emotion recognition method, the aforementioned semantic recognition method, or the aforementioned question answering method.
  • At least one embodiment of the present disclosure also provides a computer-readable storage medium.
  • The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the aforementioned speech emotion recognition method, the aforementioned semantic recognition method, or the aforementioned question answering method.
  • Fig. 1A shows a schematic flowchart of a question answering method according to at least one embodiment of the present disclosure
  • FIG. 1B shows an example of determining the emotion category of the current round based on the previous emotion category according to at least one embodiment of the present disclosure
  • Fig. 2 shows a schematic flowchart of a method for speech emotion recognition according to at least one embodiment of the present disclosure
  • Fig. 3 shows a schematic flowchart of a feature extraction method according to at least one embodiment of the present disclosure
  • Fig. 4 shows a schematic flowchart of another feature extraction method according to at least one embodiment of the present disclosure
  • Fig. 5 is a schematic flowchart of a voice emotion recognition method according to at least one embodiment of the present disclosure
  • Fig. 6 is a schematic flowchart of a semantic recognition method according to at least one embodiment of the present disclosure
  • FIG. 7 is a schematic state transition table according to at least one embodiment of the present disclosure.
  • Fig. 8 shows a schematic structural diagram of a question answering system according to at least one embodiment of the present disclosure
  • Fig. 9 shows a schematic structural diagram of a speech emotion recognition device according to at least one embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of a computing system suitable for implementing a voice emotion recognition method and device, a semantic recognition method or a question answering method and system according to at least one embodiment of the present disclosure.
  • In the speech emotion recognition methods according to embodiments of the present disclosure, the final emotion category of the speech signal is determined through the voting results of multiple sub-classifiers. Compared with using a single classifier, this can improve the accuracy and real-time performance of emotion category recognition of the speech signal. In addition, the features are selected by a feature selection algorithm rather than by prior knowledge, which can further improve the accuracy and real-time performance of emotion category recognition of speech signals.
  • Fig. 1A shows a schematic flowchart of a question answering method 100 according to at least one embodiment of the present disclosure.
  • the question and answer method 100 may include step 101, receiving or acquiring a voice signal.
  • the voice signal can come from the user or any other subject that can emit a voice signal.
  • The voice signal may include, for example, various questions posed by the user. The voice signal may be collected in real time by a voice collection device, or a pre-stored voice signal may be obtained from a storage area.
  • the question answering method 100 may further include step 102 of recognizing the semantics and emotion categories of the speech signal.
  • Step 102 may include two sub-steps, namely, a step of recognizing the semantics of the voice signal and a step of recognizing the emotion category of the voice signal. These two sub-steps can be executed simultaneously or sequentially.
  • the semantic recognition of the voice signal may be performed first and then the emotion category recognition of the voice signal may be performed, or the emotion category recognition of the voice signal may be performed first and then the semantic recognition of the voice signal may be performed.
  • Recognizing the semantics of the voice signal may include parsing specific question information included in the voice signal, so as to output a corresponding answer from a preset database for the specific question information. Recognizing the semantics of the voice signal may be implemented by a semantic recognition method that will be described later with reference to FIGS. 6 and 7 according to an embodiment of the present disclosure. However, it should be understood that recognizing the semantics of the speech signal can also be implemented in various other methods known in the art, which are not limited in the embodiments of the present disclosure.
  • Recognizing the emotion category of the voice signal may be implemented by the speech emotion recognition method that will be described later with reference to FIGS. 2, 3, and 4 according to embodiments of the present disclosure.
  • The emotion category may include categories along multiple dimensions, for example negative emotions (such as urgency, impatience, and sadness) and positive emotions (such as happiness). Further, the emotion category may also include the degree of the positive or negative emotion in each dimension, such as overly happy, very happy, happy, a little happy, unhappy, very unhappy, and so on.
  • In at least one embodiment, the question answering method 100 may further include step 103, outputting a response based on the semantics and emotion category of the voice signal.
  • a preset database may be included in the memory.
  • the preset database may include multiple entries. Each item can include three attributes: semantics, emotional category, and reply.
  • step 103 may include retrieving from the preset database a response that matches the recognized semantic and emotional category, and then outputting it to the user.
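As a rough illustration of this kind of lookup, the following Python sketch keys a small database on (semantics, emotion category) pairs. The semantics labels, emotion labels, and reply texts are invented for the example; the disclosure does not prescribe a concrete schema.

```python
# Hypothetical sketch of the preset database described above: each entry has
# semantics, an emotion category, and a reply.  All values are illustrative.
PRESET_DATABASE = {
    ("ask_weather", "happy"):   "It's sunny today, a great day for a walk!",
    ("ask_weather", "sad"):     "It's sunny today. Some fresh air might cheer you up.",
    ("ask_weather", "neutral"): "It's sunny today, 22 degrees Celsius.",
}

def answer(semantics: str, emotion: str) -> str:
    """Return the reply whose semantics and emotion category both match."""
    reply = PRESET_DATABASE.get((semantics, emotion))
    if reply is None:
        # Fall back to a semantics-only match if no emotion-specific entry exists.
        reply = next((r for (s, _), r in PRESET_DATABASE.items() if s == semantics),
                     "Sorry, I don't have an answer for that.")
    return reply

print(answer("ask_weather", "sad"))
```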
  • In some embodiments, the question answering method may not directly output the response based on the semantics and emotion category of the voice signal, but may first determine, based on the emotion category of the voice signal, whether the user's emotion is negative (for example, lost, depressed, unhappy, listless, and so on). When the user's emotion is judged to be negative, the question answering method may first output positive information such as a joke (which, for example, may be completely unrelated to the semantics of the voice signal) to adjust the user's emotion, and then output the reply based on the semantics of the voice signal.
  • the question and answer method 100 may be repeatedly executed multiple times to realize multiple rounds of question and answer.
  • the semantic and emotional categories of the recognized speech signals can be stored or recorded to guide subsequent answers.
  • In some embodiments, the emotion category of the current round may be determined based on the previous emotion categories (for example, those of the previous round or of several previous rounds, considering the change of emotion category or the counts of the various emotion categories) in order to guide the current round of question answering.
  • FIG. 1B shows an example of determining the emotion category of the current round based on the previous emotion category according to at least one embodiment of the present disclosure.
  • As shown in FIG. 1B, the emotional state of each round is first recorded. Once the number of rounds exceeds three, the current emotional state is determined by a voting strategy: if at least two of the last three rounds share the same emotional state, that emotional state is taken as the voting result; otherwise, the most recently judged emotional state is taken as the voting result. The emotional state obtained from the last three rounds is then used to guide the response of the next round of question answering: a matching response is searched for in the database according to the judged emotional state. If the user's emotion is found to be negative, some means of easing the user's emotion is used first, and then the answer is returned. Different degrees of negative emotion correspond to different response content.
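A minimal Python sketch of this round-voting strategy is given below, under the assumption that voting over the last three rounds starts once more than three rounds have been recorded and that ties fall back to the most recent judgment, as described above.

```python
from collections import Counter

def current_emotion(history: list[str]) -> str:
    """Determine the emotional state used to guide the next reply.

    A sketch of the voting strategy described above: once more than three
    rounds have been recorded, look at the last three rounds; if at least two
    of them agree, use that emotion, otherwise use the most recent judgment.
    """
    if len(history) <= 3:
        return history[-1]                 # too few rounds: use the latest judgment
    last_three = history[-3:]
    emotion, count = Counter(last_three).most_common(1)[0]
    return emotion if count >= 2 else last_three[-1]

# Example: the last three rounds are impatient, impatient, sad -> impatient wins.
print(current_emotion(["happy", "neutral", "impatient", "impatient", "sad"]))
```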
  • the answer is output based not only on the semantics of the voice signal, but also based on the emotional category of the voice signal, thereby enabling the user to obtain a better experience.
  • the current response is also output based on the previous emotion category, so that the current response can make the user more satisfied, and the user can get a better experience.
  • Fig. 2 shows a schematic flowchart of a method 200 for speech emotion recognition according to at least one embodiment of the present disclosure.
  • the voice emotion recognition method 200 may include step 201, preprocessing the voice signal.
  • the voice signal can be received from the user.
  • the preprocessing may include filtering, framing and other operations, which are known in the art, and therefore will not be repeated here.
  • the voice emotion recognition method 200 may not include step 201.
  • the voice signal has been processed in advance, or the voice signal has met actual requirements without preprocessing.
  • the embodiments of the present disclosure do not limit this.
  • the voice emotion recognition method 200 may further include step 202 of extracting the value of the feature in the feature set from the preprocessed voice signal based on the preset feature set.
  • The features in the preset feature set are selected from a larger set of candidate features during the training stage of speech emotion category recognition, using a fast-filtering (fast correlation-based filter) feature selection algorithm combined with variance filtering. The selection process of the features in the preset feature set will be described in detail later in conjunction with FIG. 3 and FIG. 4.
  • In at least one embodiment, the speech emotion recognition method 200 may further include step 203, in which a classifier recognizes the emotion category of the voice signal based on the extracted feature values.
  • In step 203, the determined values of the audio features in the feature set are input to the classifier, and the emotion category of the speech signal is output from the classifier.
  • the sub-classifiers may include various classifiers, such as a support vector machine classifier, a decision tree classifier, a neural network classifier, and so on.
  • Each sub-classifier can include a pre-trained speech emotion category recognition model.
  • Each speech emotion category recognition model is obtained in advance by training the corresponding sub-classifier on a large number of speech signal samples, using the same preset feature set and the same emotion category set (which includes emotion categories such as happiness, urgency, impatience, sadness, and so on).
  • In some embodiments, the neural network classifier may include a back-propagation neural network; the input layer of the neural network may correspond to the features of the preset feature set, and the output layer may correspond to the emotion categories of the emotion category set described above.
  • the decision tree classifier according to the present disclosure may use a pre-pruning operation.
  • In some embodiments, the support vector machine classifier according to the present disclosure may use a soft-margin support vector machine, so as to find, as far as possible, a clean separating hyperplane between two emotion categories that are not easily divided.
  • When the values of the features in the preset feature set are input to a sub-classifier, the sub-classifier can output an emotion category based on its pre-trained speech emotion category recognition model. In this way, when the feature values are input into each sub-classifier, each sub-classifier outputs an emotion category prediction.
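As one possible concretization (not mandated by the disclosure), the three kinds of sub-classifiers could be built with scikit-learn roughly as follows; the hyperparameters and the synthetic training data are assumptions made for illustration only.

```python
# Sketch of the three kinds of sub-classifiers mentioned above, using
# scikit-learn as one possible implementation.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["happy", "urgent", "impatient", "sad"]

# Toy training data: rows are samples, columns are the preset feature set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 12))
y_train = rng.integers(0, len(EMOTIONS), size=200)

sub_classifiers = [
    SVC(C=1.0, kernel="rbf"),                                # soft-margin SVM
    DecisionTreeClassifier(max_depth=5),                     # depth limit acts as pre-pruning
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),   # back-propagation neural network
]

for clf in sub_classifiers:
    clf.fit(X_train, y_train)

# Each sub-classifier outputs one emotion category prediction for a new signal.
x_new = rng.normal(size=(1, 12))
predictions = [EMOTIONS[int(clf.predict(x_new)[0])] for clf in sub_classifiers]
print(predictions)
```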
  • In some embodiments, recognizing the emotion category of the speech signal by the multiple sub-classifiers based on the feature values may include recognizing the emotion category of the speech signal based on the votes of the multiple sub-classifiers and the weights of the multiple sub-classifiers.
  • Recognizing the emotion category of the voice signal according to the voting results of the multiple sub-classifiers and their weights may include: obtaining the voting results of the multiple sub-classifiers on the emotion category prediction results; in response to the voting results identifying a unique emotion category, taking that unique emotion category as the emotion category of the speech signal; and in response to the voting results identifying at least two emotion categories, determining the emotion category of the speech signal according to the weights of the multiple sub-classifiers.
  • In some embodiments, recognizing the emotion category of the speech signal by the plurality of sub-classifiers based on the feature values may include: in response to the emotion category prediction results of at least two of the sub-classifiers being the same, recognizing that shared prediction result as the emotion category of the speech signal. In a practical application, suppose five sub-classifiers are used to identify the emotion category of a speech signal. In one case, suppose three of the sub-classifiers output the same emotion category prediction result (for example, happy), one sub-classifier outputs a different prediction result (for example, impatient), and one sub-classifier outputs yet another prediction result (for example, sad). Then, according to the votes of the five sub-classifiers on the emotion category prediction results, a unique emotion category, namely happy, is identified, and this emotion category is taken as the final emotion category recognized by the multiple sub-classifiers.
  • each sub-classifier may be assigned a corresponding weight in advance.
  • In some embodiments, the vote cast by each sub-classifier for the emotion category prediction result it outputs is equal to the weight of that sub-classifier.
  • Accordingly, the number of votes for each emotion category is the sum of the weights of all sub-classifiers that output that emotion category as their prediction result.
  • The embodiments of the present disclosure are not limited to further identifying the emotion category based only on the weights of the sub-classifiers.
  • The weight of each sub-classifier may be predetermined, or may be determined according to the test accuracy of each sub-classifier on a preset test sample set; for example, a sub-classifier with higher test accuracy is given a greater weight. The embodiments of the present disclosure do not limit this.
  • In some embodiments, obtaining the voting results of the multiple sub-classifiers on the emotion category prediction results may include: taking the emotion category prediction result with the most votes among the prediction results output by the multiple sub-classifiers as the emotion category recognized by the multiple sub-classifiers.
  • In some embodiments, determining the emotion category of the voice signal according to the weights of the multiple sub-classifiers may include: summing, for each candidate emotion category, the weights of the sub-classifiers that output it, and taking the emotion category corresponding to the largest weight sum as the emotion category recognized by the plurality of sub-classifiers.
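The vote-then-weights rule can be sketched as follows. The reading assumed here is that a plain majority vote is tried first and that the sub-classifier weights (for example, derived from test accuracy) are summed only to break ties; the example values are invented.

```python
from collections import Counter

def vote_emotion(predictions: list[str], weights: list[float]) -> str:
    """Combine sub-classifier outputs: majority vote first, weight sums on ties."""
    counts = Counter(predictions)
    top = counts.most_common()
    best_count = top[0][1]
    tied = [emotion for emotion, c in top if c == best_count]
    if len(tied) == 1:
        return tied[0]                      # a unique emotion category won the vote
    # Tie: sum the weights of the sub-classifiers behind each tied category.
    weight_sums = {e: sum(w for p, w in zip(predictions, weights) if p == e) for e in tied}
    return max(weight_sums, key=weight_sums.get)

# Five sub-classifiers, e.g. with weights taken from their test accuracy.
print(vote_emotion(["happy", "happy", "sad", "sad", "impatient"],
                   [0.9, 0.7, 0.8, 0.75, 0.6]))   # tie between happy and sad -> happy
```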
  • the voice emotion category recognition method determines the final emotion category of the voice signal through the voting results of multiple classifiers. Compared with only using a single classifier to determine the emotion category of the voice signal, the voice emotion category recognition method according to the present disclosure can improve the accuracy and real-time performance of the emotion category recognition of the voice signal.
  • the feature of the speech signal needs to be extracted.
  • the number and types of extracted features have a significant impact on the accuracy and computational complexity of emotion category recognition.
  • During training, the number and types of features of the speech signal that need to be extracted are determined, so as to form the preset feature set used in the actual emotion category recognition of speech signals. The selection process of the features in the preset feature set will be described in detail below in conjunction with FIG. 3 and FIG. 4.
  • Fig. 3 shows a schematic flowchart of a feature extraction method 300 according to an embodiment of the present disclosure.
  • In at least one embodiment, the feature extraction method 300 may include: step 301, providing multiple speech signal samples; step 302, preprocessing the multiple speech signal samples; and step 303, extracting multiple features from each of the multiple speech signal samples.
  • the multiple voice signal samples may come from an existing voice emotion database, such as a Berlin voice emotion database, or may be various voice signal samples accumulated over time.
  • the pre-processing operation may be various pre-processing operations known in the art, which will not be repeated here.
  • the multiple features may be the initial features extracted for each voice signal sample by an existing feature extractor used for signal processing and machine learning, such as openSMILE (open Speech and Music Interpretation by Large Space Extraction).
  • These features may include, for example, frame energy, frame intensity, critical band spectrum, cepstrum coefficient, auditory spectrum, linear prediction coefficient, fundamental frequency, zero-crossing rate, and so on.
  • The extracted features can be arranged in an N × D matrix Z, in which z_ij represents the value of the j-th feature for the i-th speech signal sample, 1 ≤ i ≤ N, 1 ≤ j ≤ D.
  • Each row of the matrix represents the values of the D features of one speech signal sample, and each column of the matrix represents the values of one feature over the N samples.
  • the feature extraction method 300 may further include step 304 of calculating the emotional correlation between each of the multiple features and multiple emotional categories.
  • In some embodiments, the emotional relevance can be calculated by the following general formula (the symmetric uncertainty used in fast correlation-based filtering): SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)], where:
  • X represents the feature vector;
  • Y represents the emotion category vector;
  • H(X) = -Σ_m p(x_m) log2 p(x_m) represents the entropy of X;
  • H(Y) = -Σ_l p(y_l) log2 p(y_l) represents the entropy of Y;
  • H(X|Y) represents the conditional entropy of X given Y;
  • x_m and y_l are the possible values of X and Y, respectively;
  • p(x_m) and p(y_l) are the probabilities of x_m and y_l, respectively.
  • Step 304 essentially includes, for each feature vector f_j (1 ≤ j ≤ D), calculating the emotional relevance SU(f_j, C), where C denotes the emotion category vector.
  • After step 304, D emotional relevance values are obtained.
  • the feature extraction method 300 may further include a step 305 of selecting features from the plurality of features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a first candidate feature subset.
  • the preset emotional relevance threshold can be set according to needs or experience.
  • each calculated emotional correlation is compared with a preset emotional correlation threshold. If the calculated emotional relevance is greater than the preset emotional relevance threshold, then the feature corresponding to the calculated emotional relevance is selected from D features so as to be put into the first candidate feature subset. If the calculated emotional relevance is less than or equal to the preset emotional relevance threshold, the feature corresponding to the calculated emotional relevance is deleted from the D features.
  • the feature extraction method 300 may further include step 306, using the feature with the greatest emotional relevance in the first candidate feature subset as a salient feature.
  • the emotional relevance corresponding to the features in the first candidate feature subset can be sorted, so that the feature corresponding to the largest emotional relevance is taken as the salient feature.
  • the feature extraction method 300 may further include step 307 of calculating the feature correlation between each feature in the first candidate feature subset and the salient feature.
  • In some embodiments, the feature correlation can also be calculated by the same general formula SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)], where:
  • X represents one feature vector;
  • Y represents the other feature vector;
  • H(X) represents the entropy of X;
  • H(Y) represents the entropy of Y;
  • H(X|Y) represents the conditional entropy of X given Y;
  • x_m and y_l are the possible values of X and Y, respectively;
  • p(x_m) and p(y_l) are the probabilities of x_m and y_l, respectively.
  • Assume f_a is the feature vector of the salient feature in the first candidate feature subset, and f_b is the feature vector of one of the remaining features in the first candidate feature subset other than f_a. The feature correlation between f_a and f_b is then SU(f_a, f_b).
  • the feature extraction method 300 may further include step 308 of deleting features with a feature correlation greater than emotional correlation from the first candidate feature subset to obtain a second candidate feature subset.
  • Correspondingly, the emotional relevance between the feature corresponding to f_b and the emotion categories is SU(f_b, C).
  • In step 308, for each remaining feature f_b in the first candidate feature subset other than f_a, the feature correlation of that feature is compared with its emotional relevance, and if the feature correlation is greater than the emotional relevance (i.e., SU(f_a, f_b) > SU(f_b, C)), the feature is deleted from the first candidate feature subset. After performing this operation on all remaining features in the first candidate feature subset other than f_a, the second candidate feature subset is obtained.
  • the feature extraction method 300 may further include step 309 of calculating the variance of each feature in the second candidate feature subset.
  • Calculating the variance of a feature means calculating the variance of the N-dimensional feature vector corresponding to that feature. For example, if the feature vector corresponding to a feature in the second candidate feature subset is f_t, then calculating the variance of that feature means calculating the variance of f_t.
  • In at least one embodiment, the feature extraction method 300 may further include step 310, removing, from the second candidate feature subset, the features whose variance is less than a variance threshold, to obtain the features in the preset feature set.
  • the variance threshold can be set according to actual needs or experience.
  • the variance of the feature is compared with a variance threshold. If the variance of the feature is less than the variance threshold, the feature is deleted from the second candidate feature subset.
  • the remaining features in the second candidate feature subset are the finally selected features.
  • These finally selected features constitute the features in the preset feature set described in the previous section of this article.
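Putting steps 304 to 310 together, a self-contained Python sketch of the selection pipeline might look like the following. The symmetric uncertainty follows the fast correlation-based filter measure discussed above; the discretization, thresholds, and synthetic data are assumptions made only for the example.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x: np.ndarray, y: np.ndarray) -> float:
    total = 0.0
    for value in np.unique(y):
        mask = y == value
        total += mask.mean() * entropy(x[mask])
    return total

def symmetric_uncertainty(x: np.ndarray, y: np.ndarray) -> float:
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)

def select_features(Z, labels, su_threshold=0.05, var_threshold=1e-3, bins=10):
    """Z: (N samples, D features); labels: (N,) emotion categories."""
    N, D = Z.shape
    # Discretize each feature so the entropy-based measures are well defined.
    discrete = np.stack(
        [np.digitize(Z[:, j], np.histogram_bin_edges(Z[:, j], bins=bins)) for j in range(D)],
        axis=1)
    # Steps 304-305: keep features whose emotional relevance exceeds the threshold.
    su_c = np.array([symmetric_uncertainty(discrete[:, j], labels) for j in range(D)])
    first = [j for j in range(D) if su_c[j] > su_threshold]
    if not first:
        return []
    # Step 306: the feature with the greatest emotional relevance is the salient feature.
    salient = max(first, key=lambda j: su_c[j])
    # Steps 307-308: drop features more correlated with the salient feature
    # than with the emotion categories.
    second = [j for j in first
              if j == salient
              or symmetric_uncertainty(discrete[:, salient], discrete[:, j]) <= su_c[j]]
    # Steps 309-310: drop near-constant features by variance.
    return [j for j in second if Z[:, j].var() >= var_threshold]

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 8))
labels = rng.integers(0, 4, size=100)
print(select_features(Z, labels))
```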
  • the preset feature set will be used in the actual speech signal emotion category recognition and the training of the speech emotion category recognition model of the classifier.
  • the feature extraction method shown in Figure 3 first uses the Fast Correlation-Based Filter Solution to filter the features, and then uses the variance to further filter the features.
  • In this way, the features that are less relevant to the emotion categories are first eliminated to retain the more relevant ones, and then the feature most relevant to the emotion categories (the salient feature) is used to further filter the remaining features, which can greatly reduce the time complexity of the computation.
  • the feature extraction method in FIG. 3 uses feature variance to further remove features that do not change significantly.
  • In contrast, the feature extraction method shown in FIG. 4 first uses the variance to filter the features, and then uses the fast-filtering feature selection algorithm (fast correlation-based filter solution) to further filter the features.
  • the feature extraction method of FIG. 4 will be described in detail below.
  • FIG. 4 shows a schematic flowchart of another feature extraction method 400 according to at least one embodiment of the present disclosure.
  • In at least one embodiment, the feature extraction method 400 may include the following steps: step 401, providing multiple speech signal samples; step 402, preprocessing the multiple speech signal samples; extracting multiple features from each speech signal sample; calculating the variance of each feature; deleting the features whose variance is less than the variance threshold to obtain a third candidate feature subset; calculating the emotional relevance between each feature in the third candidate feature subset and the multiple emotion categories; selecting the features whose emotional relevance is greater than the preset emotional relevance threshold to obtain a fourth candidate feature subset; taking the feature with the greatest emotional relevance as the salient feature; calculating the feature correlation between each remaining feature and the salient feature; and deleting the features whose feature correlation is greater than their emotional relevance to obtain the features in the preset feature set.
  • the feature extraction method 300 of FIG. 3 differs from the feature extraction method 400 of FIG. 4 only in that the order of the fast filtering feature selection algorithm and the variance algorithm is different, those skilled in the art can fully implement the feature extraction method 400 based on the feature extraction method 300. Therefore, the specific implementation of the feature extraction method 400 will not be repeated here.
  • the above-mentioned feature extraction method 300 may not include step 302.
  • the aforementioned feature extraction method 400 may not include step 402.
  • the speech signal samples in step 301 and step 401 have been processed in advance, or have met actual requirements without preprocessing. The embodiment of the present disclosure does not limit this.
  • Fig. 5 is a schematic flowchart of a method for speech emotion recognition according to at least one embodiment of the present disclosure. As shown in Figure 5, the voice emotion recognition method includes steps S510 to S550.
  • step S510 Select features to obtain a feature set.
  • step S510 may be implemented based on the feature extraction method 300 of FIG. 3 or the feature extraction method 400 of FIG. 4.
  • the feature extraction method 300 of FIG. 3 and the feature extraction method 400 of FIG. 4 please refer to the above description of the methods of FIG. 3 and FIG. 4, which will not be repeated here.
  • the classifier may include multiple sub-classifiers.
  • the sub-classifiers can include various classifiers, such as support vector machine classifiers, decision tree classifiers, neural network classifiers, and so on.
  • Each sub-classifier can include a speech emotion category recognition model.
  • Each speech emotion category recognition model uses the feature set obtained in step S510 and the same emotion category set (which includes emotion categories such as happiness, urgency, impatient, sadness, etc.) for training.
  • the neural network classifier may include a back-propagation neural network, the input layer of the neural network may be the feature of the preset feature set, and the output layer may be the emotion category set as described above Emotional category.
  • the decision tree classifier according to the embodiment of the present disclosure may use a pre-pruning operation.
  • In some embodiments, the support vector machine classifier according to the embodiments of the present disclosure may use a soft-margin support vector machine, so as to find, as far as possible, a clean separating hyperplane between two emotion categories that are not easily divided.
  • For example, the test voice signal may be a voice signal input by the user in an actual application; the embodiments of the present disclosure do not limit this.
  • step S540 Based on the feature set, extract the value of the feature in the feature set from the test voice signal.
  • the feature set used in step S540 is the feature set obtained in step S510.
  • Step S540 is basically the same as step 202 described above. Therefore, the detailed description of step S540 can refer to the description of step 202 above, which will not be repeated in the embodiment of the present disclosure.
  • step S550 Use a classifier to recognize the emotion category of the test speech signal.
  • the classifier used in step S550 is the trained classifier obtained in step S520.
  • Step S550 is basically the same as step 203 described above. Therefore, the detailed description of step S550 can refer to the description of step 203 above, which will not be repeated in the embodiment of the present disclosure.
  • It should be noted that steps S510 and S520 can be performed in advance, and in a user's actual application only steps S530 to S550 are performed. For example, steps S510 and S520 may be performed only once, with the trained classifier obtained in step S520 stored on a remote server or in local storage for the user's client; then, in each actual application, only steps S530 to S550 need to be performed. For another example, steps S510 and S520 may be executed periodically or irregularly with new training data to update the classifier. However, it should be understood that the embodiments of the present disclosure do not limit this.
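A small sketch of this train-once, reuse-later pattern is shown below, using joblib persistence as one common way to store a trained scikit-learn model; the file name and toy data are arbitrary.

```python
# Sketch of performing feature selection and training once offline and reusing
# the trained classifier in later sessions.
import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# --- offline: feature selection and training, run once ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 12))          # values of the preset feature set
y_train = rng.integers(0, 4, size=200)        # emotion category labels
clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
dump(clf, "emotion_classifier.joblib")

# --- online: each time a test voice signal arrives ---
clf = load("emotion_classifier.joblib")
x_test = rng.normal(size=(1, 12))             # features extracted from the test signal
print("predicted emotion category:", int(clf.predict(x_test)[0]))
```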
  • Fig. 6 shows a schematic flowchart of a semantic recognition method according to at least one embodiment of the present disclosure.
  • the semantic recognition method includes:
  • S610: Convert the voice signal into text information, and use the target dialogue state of the previous round of dialogue as the current dialogue state;
  • S620: Identify the named entity of the text information;
  • S630: Determine, according to the named entity, a vector to be recognized corresponding to the named entity;
  • S640: Based on the vector to be recognized, determine the intent of the standard feature vector that meets the requirements as the current intent of the text information;
  • S650: Determine the target dialogue state according to the current dialogue state and the current intention.
  • In step S650, a dialogue state is determined according to the state transition table, and that dialogue state is taken as the target dialogue state; further, when the next piece of text information is received, the target dialogue state can be used as the current dialogue state for determining the dialogue state of that next text message.
  • the intentions corresponding to the two adjacent voice signals input by the user can be associated, so that the current intention of the user can be correctly understood.
  • the state transition table may include multiple dialogue states and multiple intents, and different dialogue states can be switched to the next dialogue state according to the corresponding intent.
  • For example, dialogue state 1 can be switched to dialogue state 4 when the current intent is intent 1;
  • dialogue state 1 can be switched to dialogue state 2 when the current intent is intent 2;
  • dialogue state 1 can be switched to dialogue state 3 when the current intent is intent 5;
  • dialogue state 2 can be switched to dialogue state 4 when the current intent is intent 3;
  • dialogue state 3 can be switched to dialogue state 2 when the current intent is intent 6;
  • and dialogue state 3 can be switched to a further dialogue state when the current intent is intent 4.
  • For example, the dialogue state can be switched to the target dialogue state "how is the weather tomorrow?". It can be seen that a single text input from the user (i.e., "what about tomorrow?") cannot by itself determine the user's specific meaning; by associating the two successive inputs (i.e., "how is the weather today?" and "what about tomorrow?"), the user's current intention can be understood correctly.
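A dictionary-based sketch of such a state transition table is shown below; the state and intent names are placeholders and do not reproduce the actual contents of FIG. 7.

```python
# Minimal sketch of a state transition table keyed by (current state, intent).
STATE_TRANSITIONS = {
    ("dialogue_state_1", "intent_1"): "dialogue_state_4",
    ("dialogue_state_1", "intent_2"): "dialogue_state_2",
    ("dialogue_state_1", "intent_5"): "dialogue_state_3",
    ("dialogue_state_2", "intent_3"): "dialogue_state_4",
    ("dialogue_state_3", "intent_6"): "dialogue_state_2",
}

def next_state(current_state: str, current_intent: str) -> str:
    """Return the target dialogue state, or stay put if no transition is defined."""
    return STATE_TRANSITIONS.get((current_state, current_intent), current_state)

# "how is the weather today?" puts the dialogue in some state; the follow-up
# "what about tomorrow?" reuses that state as the current state for the lookup.
print(next_state("dialogue_state_1", "intent_2"))   # -> dialogue_state_2
```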
  • Fig. 7 is a schematic state transition table according to at least one embodiment of the present disclosure. It should be noted that the state transition table shown in FIG. 7 is only exemplary; the embodiments of the present disclosure do not limit the number and content of the dialogue states and intents in the state transition table, nor the specific switching manner between them, and adjustments can be made according to actual needs. It is understood that the state transition table may be stored in the server in advance. For example, the state transition table may be set by a technician based on practical experience, or it may be obtained by statistics or learning over big data, which is not limited in the embodiments of the present disclosure.
  • the voice signal may be converted into text information by any known method, which will not be repeated in the embodiment of the present disclosure.
  • the named entity of the text information may be recognized through the named entity recognition model.
  • Named entity recognition refers to recognizing entities with specific meanings in text, such as proper nouns (person names, organization names, place names) and meaningful time expressions. It is a basic task for information retrieval, question answering systems, and other technologies. For example, in "Xiao Ming is on vacation in Hawaii.", the named entities are "Xiao Ming" (a person name) and "Hawaii" (a place name). A named entity recognition system can be built using grammar-based techniques and statistical models (such as machine learning).
  • Ways of performing entity detection and recognition include: (1) first performing entity detection, and then identifying (classifying) the detected entities; or (2) combining detection and recognition in a single model that outputs, for each character, a position marker and a category tag.
  • step S620 other models or other methods may be used to identify the named entity of the text information, which is not limited in the embodiment of the present disclosure.
  • a deep learning model may be used to determine the vector to be recognized corresponding to the named entity according to the named entity. It should be noted that in step S630, other models or other methods may also be used to determine the vector to be recognized corresponding to the named entity according to the named entity, which is not limited in the embodiment of the present disclosure.
  • step S640 based on the vector to be recognized, the intent of the standard feature vector with the greatest similarity to the vector to be recognized may be determined as the current intent of the text information.
  • It should be noted that the standard feature vector that meets the requirements can also be selected in other ways, depending on the actual situation; therefore, the embodiments of the present disclosure impose no specific restrictions on this.
  • For example, after the voice signal is converted in step S610 into the text information "I want to see Monet's Woman with a Parasol", the text information is input into the named entity recognition model, and the named entities it contains are recognized through the model.
  • the named entity recognition model performs the following operations on the received text information:
  • The named entity recognition model takes a string of characters (for example, corresponding to a sentence or paragraph in the text information) as input and recognizes the relevant nouns mentioned in the string (people, places, and organizations).
  • In this example, the text information input into the named entity recognition model is the character-by-character sequence of the sentence, padded with O tags.
  • The label sequence output by the named entity recognition model is [O, O, O, B-PER, I-PER, O, B-PIC, I-PIC, I-PIC, I-PIC, I-PIC, O, ..., O], from which the named entities are obtained: person, Monet; painting, Woman with a Parasol.
  • the named entity recognized by the named entity recognition model can determine its corresponding to-be-recognized vector after passing through the deep learning model.
  • the vector to be recognized may be a feature vector, which includes text features that are classified and extracted from named entities through a deep learning model.
  • In some embodiments, a preset corpus (such as "for PIC (picture)", "author's nationality", "PERSON's painting", etc.) can be input into the deep learning model to obtain multiple standard feature vectors.
  • the standard feature vector may be a feature vector that includes text features classified and extracted from the corpus through a deep learning model.
  • the action of obtaining the standard feature vector can be completed in advance, or can be completed in real time, and can be set according to a specific scenario, which is not limited in the embodiment of the present disclosure.
  • In some embodiments, sentences with different surface forms but the same purpose can be classified into the same intent through the deep learning model. For example, "I want to see the Mona Lisa", "Help me change to the Mona Lisa", and "Switch to the Mona Lisa for me" can all be classified as the same intent: the user wants to switch to and view the PIC (the Mona Lisa).
  • In some embodiments, the cosine similarity between the vector to be recognized and each of the multiple standard feature vectors can be computed, and the intent of the standard feature vector with the greatest similarity is taken as the current intent of the vector to be recognized, i.e., of the text information. For example, the current intent of the text information "I want to see the Mona Lisa" is determined to be "view the PIC".
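A short sketch of this similarity-based intent matching follows; the standard feature vectors, the vector to be recognized, and the intent labels are synthetic stand-ins for what the deep learning model would produce.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical standard feature vectors, one per intent in the preset corpus.
standard_vectors = {
    "view_picture":       np.array([0.9, 0.1, 0.0]),
    "author_nationality": np.array([0.1, 0.8, 0.2]),
    "person_paintings":   np.array([0.0, 0.3, 0.9]),
}

vector_to_recognize = np.array([0.85, 0.2, 0.05])   # e.g. from "I want to see the Mona Lisa"

current_intent = max(standard_vectors,
                     key=lambda intent: cosine_similarity(standard_vectors[intent],
                                                          vector_to_recognize))
print(current_intent)   # -> view_picture
```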
  • the target dialogue state may be determined according to the current dialogue state acquired in S610 and the current intention determined in S640.
  • the target conversation state can be used to determine the answer.
  • a preset database may be included in the memory.
  • the preset database may include multiple entries. Each item can include three attributes: semantics, emotional category, and reply.
  • For example, this may include retrieving from the preset database a response that matches both the recognized semantics (i.e., the target dialogue state) and the emotion category, and then outputting it to the user.
  • FIG. 8 shows a schematic structural diagram of a question answering system 500 according to at least one embodiment of the present disclosure.
  • the question answering system 500 may include a receiver 501 configured to receive voice signals.
  • the receiver 501 may be configured to continuously receive multiple voice signals.
  • the question answering system 500 may also include a recognition system 502, which is configured to recognize the semantic and emotional categories of the speech signal.
  • the recognition system 502 may include a speech semantic recognition device 5021 and a speech emotion recognition device 5022.
  • the voice semantic recognition device 5021 may be configured to recognize the semantics of a voice signal.
  • the speech semantic recognition device 5021 can recognize the semantics of the speech signal in various methods known in the art.
  • the voice emotion recognition device 5022 may be configured to recognize the emotion category of the voice signal. According to the present disclosure, the voice emotion recognition device 5022 can recognize the emotion category of the voice signal in the voice emotion recognition method as described above. The structure of the voice emotion recognition device will be described in detail later with reference to FIG. 9.
  • the question answering system 500 may further include an outputter 503, which is configured to output answers based on the semantics and emotion categories of the voice signal.
  • In some embodiments, the receiver 501, the recognition system 502, and the outputter 503 may be provided separately from one another.
  • For example, the receiver 501 and the outputter 503 may be provided on the user side, while the recognition system 502 may be deployed on a server or in the cloud.
  • the question answering system 500 may include a memory, which is configured to store various information, such as voice signals, preset feature sets as described above, semantics recognized by the voice semantic recognition device 5021, and voice The emotion categories recognized by the emotion recognition device 5022, various classifiers, a preset database including semantics, emotion categories, and responses, and so on.
  • FIG. 9 shows a schematic structural diagram of a speech emotion recognition device 600 according to at least one embodiment of the present disclosure.
  • In at least one embodiment, the speech emotion recognition device 600 may include: a pre-processor 601, configured to pre-process the speech signal; an extractor, configured to extract, from the speech signal, the values of the features in the preset feature set; and a recognizer 603, configured to recognize, by means of the classifier, the emotion category of the speech signal based on the values of the extracted features.
  • the classifier may include a plurality of sub-classifiers.
  • the recognizer 603 may be configured to recognize the emotion category of the voice signal based on the value of the feature by the plurality of sub-classifiers.
  • the features in the preset feature set are selected from multiple features based on a fast-filtered feature selection algorithm and variance.
  • the process of selecting the features in the preset feature set from multiple features based on the feature selection algorithm of fast filtering and variance may be the feature extraction method shown in FIG. 3 and the feature extraction method shown in FIG. 4 The feature extraction method.
  • In at least one embodiment, a computer device may include: a memory, which stores a computer program; and a processor, which is configured to, when executing the computer program, perform the speech emotion recognition method shown in FIG. 2 or the question answering method shown in FIG. 1A.
  • FIG. 10 which shows a schematic structural diagram of a computing system 1100 suitable for implementing the speech emotion recognition method and device, the semantic recognition method or the question answering method and system of at least one embodiment of the present disclosure.
  • In at least one embodiment, the computing system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage portion 1108 into a random access memory (RAM) 1103.
  • In the RAM 1103, various programs and data required for the operation of the system 1100 are also stored.
  • the CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other through a bus 1104.
  • An input/output (I/O) interface 1105 is also connected to the bus 1104.
  • In some embodiments, the following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and a voice input device such as a microphone; an output portion 1107 including, for example, a cathode ray tube display, a liquid crystal display, and a speaker; a storage portion 1108 including a hard disk and the like; and a communication portion 1109.
  • the communication section 1109 performs communication processing via a network such as the Internet.
  • the driver 1110 is also connected to the I/O interface 1105 as needed.
  • a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1110 as needed, so that the computer program read from it is installed into the storage portion 1108 as needed.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, and the computer program includes program code for implementing the methods and apparatuses of FIGS. 1A to 9.
  • the computer program may be downloaded and installed from the network through the communication part 1109, and/or installed from the removable medium 1111.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more logic for implementing prescribed Function executable instructions.
  • the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and the combination of the blocks in the block diagram and/or flowchart can be implemented by a dedicated hardware-based system that performs the specified functions or operations Or it can be realized by a combination of dedicated hardware and computer instructions.
  • the units or modules involved in the embodiments described in the present application can be implemented in software or hardware.
  • For example, exemplary types of hardware that can be used include field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
  • the described units or modules can also be provided in the processor.
  • the names of these units or modules do not constitute a limitation on the units or modules themselves under certain circumstances.
  • a non-transitory computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the voice emotion recognition method shown in FIG. 2, the question answering method shown in FIG. 1A, or Figure 6 shows the semantic recognition method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a speech emotion recognition method, a semantic recognition method, a question-answering method, a computer device and a computer-readable storage medium. The speech emotion recognition method comprises: on the basis of a pre-set feature set, determining, from a speech signal, values of features in the feature set; and inputting determined values of audio features in the feature set into a classifier, and outputting an emotion type of the speech signal from the classifier, wherein the classifier comprises a plurality of sub-classifiers, and the steps of inputting the determined values of the audio features in the feature set into the classifier, and outputting the emotion type of the speech signal from the classifier comprise: respectively inputting the determined values of the audio features in the feature set into the plurality of sub-classifiers; respectively outputting an emotion type prediction result of the speech signal from the plurality of sub-classifiers; and on the basis of the emotion type prediction results output from the plurality of sub-classifiers, recognizing the emotion type of the speech signal.

Description

语音情感识别方法、语义识别方法、问答方法、计算机设备及计算机可读存储介质Speech emotion recognition method, semantic recognition method, question answering method, computer equipment and computer readable storage medium
相关申请的交叉引用Cross references to related applications
本申请要求于2019年4月24日递交的第201910333653.4号中国专利申请的优先权,在此全文引用上述中国专利申请公开的内容以作为本申请的一部分。This application claims the priority of the Chinese patent application No. 201910333653.4 filed on April 24, 2019, and the contents of the above-mentioned Chinese patent application are cited here in full as a part of this application.
技术领域Technical field
本公开的实施例涉及语音情感识别方法、语义识别方法、问答方法、计算机设备及计算机可读存储介质。The embodiments of the present disclosure relate to speech emotion recognition methods, semantic recognition methods, question answering methods, computer equipment, and computer-readable storage media.
背景技术Background technique
目前,在大多数智能问答***中,仅仅根据用户发出的语音命令给出相应的回复。在少数智能问答***中,除了语音命令之外,还基于语音情感给出相应的回复。At present, in most intelligent question answering systems, the corresponding reply is only given according to the voice command issued by the user. In a few intelligent question answering systems, in addition to voice commands, corresponding responses are also given based on voice emotions.
现有的语音情感识别方法大都是基于深度学习或者机器学习。基于深度学习的方法对硬件资源有较高的要求,较难达到实时性。基于机器学习的方法可以达到一定程度的实时性,但是需要通过先验知识提取最有用的特征并选择最合适的分类器。Most of the existing speech emotion recognition methods are based on deep learning or machine learning. The method based on deep learning has higher requirements for hardware resources, and it is difficult to achieve real-time performance. The method based on machine learning can achieve a certain degree of real-time performance, but it needs to extract the most useful features through prior knowledge and select the most suitable classifier.
发明内容Summary of the invention
At least one embodiment of the present disclosure provides a speech emotion recognition method. The speech emotion recognition method may include: determining, from a speech signal and based on a preset feature set, the values of the features in the feature set; and inputting the determined values of the audio features in the feature set into a classifier and outputting the emotion category of the speech signal from the classifier. The classifier includes a plurality of sub-classifiers, and inputting the determined values of the audio features in the feature set into the classifier and outputting the emotion category of the speech signal from the classifier includes: inputting the determined values of the audio features in the feature set into the plurality of sub-classifiers respectively; outputting emotion category prediction results of the speech signal from the plurality of sub-classifiers respectively; and recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers.
In one embodiment, the speech emotion recognition method may include: providing a plurality of speech signal samples; extracting a plurality of features of each of the plurality of speech signal samples; calculating the emotional correlation between each of the plurality of features and a plurality of emotion categories; selecting, from the plurality of features, features whose emotional correlation is greater than a preset emotional correlation threshold to obtain a first candidate feature subset; taking the feature with the greatest emotional correlation in the first candidate feature subset as a salient feature; calculating the feature correlation between each of the remaining features in the first candidate feature subset and the salient feature; deleting, from the first candidate feature subset, features whose feature correlation is greater than their emotional correlation to obtain a second candidate feature subset; calculating the variance of each feature in the second candidate feature subset; and deleting, from the second candidate feature subset, features whose variance is less than a variance threshold to obtain the features in the preset feature set.
In one embodiment, the speech emotion recognition method may include: providing a plurality of speech signal samples; extracting a plurality of features of each of the plurality of speech signal samples; calculating the variance of each of the plurality of features; deleting, from the plurality of features, features whose variance is less than a variance threshold to obtain a third candidate feature subset; calculating the emotional correlation between each feature in the third candidate feature subset and a plurality of emotion categories; selecting, from the third candidate feature subset, features whose emotional correlation is greater than a preset emotional correlation threshold to obtain a fourth candidate feature subset; taking the feature with the greatest emotional correlation in the fourth candidate feature subset as a salient feature; calculating the feature correlation between each of the remaining features in the fourth candidate feature subset and the salient feature; and deleting, from the fourth candidate feature subset, features whose feature correlation is greater than their emotional correlation to obtain the features in the preset feature set.
在一个实施例中,情感相关性通过如下公式计算:In one embodiment, the emotional correlation is calculated by the following formula:
SU(X, Y) = 2 × [H(X) − H(X|Y)] / [H(X) + H(Y)]
其中,X表示特征向量,Y表示情感类别向量,H(X)表示X的熵;H(Y)表示Y的熵,H(X|Y)表示X|Y的熵。Among them, X represents the feature vector, Y represents the emotion category vector, H(X) represents the entropy of X; H(Y) represents the entropy of Y, and H(X|Y) represents the entropy of X|Y.
在一个实施例中,特征相关性通过如下公式计算:In one embodiment, the feature correlation is calculated by the following formula:
SU(X, Y) = 2 × [H(X) − H(X|Y)] / [H(X) + H(Y)]
其中X表示一个特征向量,Y表示另一个特征向量,H(X)表示X的熵,H(Y)表示Y的熵,H(X|Y)表示X|Y的熵。Where X represents a feature vector, Y represents another feature vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the entropy of X|Y.
在一个实施例中,基于从所述多个子分类器输出的所述情感类别预测结果,识别所述语音信号的所述情感类别可以包括:根据所述多个子分类器对情感类别预测结果的投票和所述多个子分类器的权重来识别所述语音信号的情感类别。In one embodiment, based on the emotion category prediction results output from the multiple sub-classifiers, identifying the emotion category of the speech signal may include: voting on the emotion category prediction results by the multiple sub-classifiers And the weights of the multiple sub-classifiers to identify the emotion category of the speech signal.
In one embodiment, recognizing the emotion category of the speech signal according to the votes of the plurality of sub-classifiers on the emotion category prediction results and the weights of the plurality of sub-classifiers may include: obtaining the voting result of the plurality of sub-classifiers on the emotion category prediction results; in response to a unique emotion category being identified from the votes of the plurality of sub-classifiers on the emotion category prediction results, taking that unique emotion category as the emotion category of the speech signal; and in response to at least two emotion categories being identified from the votes of the plurality of sub-classifiers on the emotion category prediction results, determining the emotion category of the speech signal according to the weights of the plurality of sub-classifiers.
In one embodiment, recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers may include: in response to the emotion category prediction results recognized by at least two of the plurality of sub-classifiers being the same, recognizing that emotion category prediction result as the emotion category of the speech signal.
在一个实施例中,该多个子分类器可以包括支持向量机分类器、决策树分类器和神经网络分类器。In an embodiment, the plurality of sub-classifiers may include a support vector machine classifier, a decision tree classifier, and a neural network classifier.
At least one embodiment of the present disclosure further provides a semantic recognition method, which includes: converting a speech signal into text information; using the target dialogue state of the previous round of dialogue as the current dialogue state; performing semantic understanding on the text information to acquire the current intention of the user; and determining a target dialogue state according to the current dialogue state and the current intention, and using the target dialogue state as the semantics of the speech signal.
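As an illustration only, the dialogue-state handling described in the preceding paragraph can be sketched as follows. The state names, intention labels, transition table and helper functions are hypothetical placeholders introduced for this sketch; the disclosure only requires that the target dialogue state be determined from the current dialogue state and the current intention, for example by means of a state transition table such as the one illustrated in FIG. 7.

```python
# Hypothetical sketch of the semantic recognition flow described above.
# All concrete state names, intentions and table entries are illustrative assumptions.

def speech_to_text(speech_signal):
    # Placeholder ASR step: in this sketch the "signal" is already a text string.
    return speech_signal

def understand(text):
    # Placeholder semantic understanding: map a keyword to an intention label.
    return "ask_weather" if "weather" in text else "unknown"

# Toy state transition table: (current dialogue state, current intention) -> target dialogue state.
TRANSITIONS = {
    ("idle", "ask_weather"): "weather_query",
    ("weather_query", "ask_weather"): "weather_query",
}

def recognize_semantics(speech_signal, previous_target_state):
    current_state = previous_target_state        # target state of the previous round
    text = speech_to_text(speech_signal)         # convert the speech signal into text information
    intention = understand(text)                 # semantic understanding -> current intention
    target_state = TRANSITIONS.get((current_state, intention), current_state)
    return target_state                          # used as the semantics of the speech signal

print(recognize_semantics("what is the weather like", "idle"))  # -> "weather_query"
```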
本公开至少一个实施例还提供了一种问答方法。该问答方法可以包括:接收语音信号;识别语音信号的语义和情感类别;以及基于语音信号的语义和情感类别输出答复。识别语音信号的情感类别可以包括根据如前所述的语音情感识别方法识别语音信号的情感类别。识别语音信号的语义包括:根据如上所述的语义识别方法识别语音信号的语义。At least one embodiment of the present disclosure also provides a question and answer method. The question answering method may include: receiving a voice signal; recognizing the semantic and emotional category of the voice signal; and outputting a response based on the semantic and emotional category of the voice signal. Recognizing the emotion category of the voice signal may include recognizing the emotion category of the voice signal according to the aforementioned voice emotion recognition method. Recognizing the semantics of the voice signal includes: recognizing the semantics of the voice signal according to the semantic recognition method described above.
在一个实施例中,所述基于语音信号的语义和情感类别输出答复,包括: 从预设的多个答复中选择并输出与所述语音信号的所识别的语义和情感类别相匹配的答复。In one embodiment, the output of a response based on the semantic and emotional category of the voice signal includes: selecting and outputting a response matching the recognized semantic and emotional category of the voice signal from a plurality of preset responses.
在一个实施例中,该问答方法还包括:基于先前至少一轮问答中确定出的情感类别,确定当前的情感类别。In one embodiment, the question answering method further includes: determining the current emotion category based on the emotion category determined in at least one previous round of question and answer.
At least one embodiment of the present disclosure further provides a computer device. The computer device may include: a memory storing a computer program; and a processor configured to, when executing the computer program, perform the speech emotion recognition method described above, the semantic recognition method described above, or the question answering method described above.
At least one embodiment of the present disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the speech emotion recognition method described above, the semantic recognition method described above, or the question answering method described above.
附图说明Description of the drawings
为了更清楚地说明本公开实施例的技术方案,下面将对实施例的附图作简单地介绍,显而易见地,下面描述的附图仅仅涉及本公开的一些实施例,而非对本公开的限制。In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments will be briefly introduced below. Obviously, the drawings described below only relate to some embodiments of the present disclosure, rather than limit the present disclosure.
图1A示出了根据本公开至少一个实施例的一种问答方法的示意性流程图;Fig. 1A shows a schematic flowchart of a question answering method according to at least one embodiment of the present disclosure;
图1B示出了根据本公开至少一个实施例的基于先前的情感类别来确定当前轮的情感类别的示例;FIG. 1B shows an example of determining the emotion category of the current round based on the previous emotion category according to at least one embodiment of the present disclosure;
图2示出了根据本公开至少一个实施例的一种语音情感识别方法的示意性流程图;Fig. 2 shows a schematic flowchart of a method for speech emotion recognition according to at least one embodiment of the present disclosure;
图3示出了根据本公开至少一个实施例的一种特征提取方法的示意性流程图;Fig. 3 shows a schematic flowchart of a feature extraction method according to at least one embodiment of the present disclosure;
图4示出了根据本公开至少一个实施例的另一种特征提取方法的示意性流程图;Fig. 4 shows a schematic flowchart of another feature extraction method according to at least one embodiment of the present disclosure;
图5是根据本公开至少一个实施例的语音情感识别方法的示意性流程图;Fig. 5 is a schematic flowchart of a voice emotion recognition method according to at least one embodiment of the present disclosure;
图6是根据本公开至少一个实施例的语义识别方法的示意性流程图;Fig. 6 is a schematic flowchart of a semantic recognition method according to at least one embodiment of the present disclosure;
图7是根据本公开至少一个实施例的示意性状态转换表;FIG. 7 is a schematic state transition table according to at least one embodiment of the present disclosure;
图8示出了根据本公开至少一个实施例的一种问答***的示意性结构图;Fig. 8 shows a schematic structural diagram of a question answering system according to at least one embodiment of the present disclosure;
图9示出了根据本公开至少一个实施例的一种语音情感识别设备的示意性结构图;以及Fig. 9 shows a schematic structural diagram of a speech emotion recognition device according to at least one embodiment of the present disclosure; and
图10是适于用来实现根据本公开至少一个实施例的语音情感识别方法和设备、语义识别方法或问答方法和***的计算***的结构示意图。FIG. 10 is a schematic structural diagram of a computing system suitable for implementing a voice emotion recognition method and device, a semantic recognition method or a question answering method and system according to at least one embodiment of the present disclosure.
具体实施方式Detailed ways
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合附图,对本公开实施例的技术方案进行清楚、完整地描述。显然,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。基于所描述的本公开的实施例,本领域普通技术人员在无需创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present disclosure, rather than all of the embodiments. Based on the described embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative labor are within the protection scope of the present disclosure.
根据本公开,提供了语音情感识别方法、问答方法、语音情感识别设备、问答***、计算机设备及计算机可读存储介质。这些方法设备和***通过多个分类器的投票结果来确定语音信号的最终的情感类别。与仅仅使用单个分类器来确定语音信号的情感类别相比,它们能够提高语音信号的情感类别识别的准确率和实时性。此外,它们还根据特征选择算法而不是先验知识选取特征,从而也可以提高语音信号的情感类别识别的准确率和实时性。According to the present disclosure, a voice emotion recognition method, a question answering method, a voice emotion recognition device, a question answering system, a computer device, and a computer-readable storage medium are provided. These methods, devices and systems determine the final emotion category of the speech signal through the voting results of multiple classifiers. Compared with using a single classifier to determine the emotion category of the speech signal, they can improve the accuracy and real-time performance of the emotion category recognition of the speech signal. In addition, they also select features based on feature selection algorithms instead of prior knowledge, which can also improve the accuracy and real-time performance of emotion category recognition of speech signals.
图1A示出了根据本公开至少一个实施例的一种问答方法100的示意性流程图。该问答方法100可以包括步骤101,接收或获取语音信号。该语音信号可以来自用户或者任何可以发出语音信号的其他主体。语音信号可以包括例如用户提出的各种问题信息。可以实时接收语音采集设备采集的语音信号,或从存储区域获取预存语音信号。Fig. 1A shows a schematic flowchart of a question answering method 100 according to at least one embodiment of the present disclosure. The question and answer method 100 may include step 101, receiving or acquiring a voice signal. The voice signal can come from the user or any other subject that can emit a voice signal. The voice signal may include, for example, various question information posed by the user. It can receive voice signals collected by voice collection equipment in real time, or obtain pre-stored voice signals from the storage area.
该问答方法100可以进一步包括步骤102,识别语音信号的语义和情感类别。步骤102可以包括两个子步骤,即识别语音信号的语义的步骤和识别语音信号的情感类别的步骤。这两个子步骤可以同时执行,也可以顺序执行。可以先执行语音信号的语义的识别后执行语音信号的情感类别的识别,也可以先执行语音信号的情感类别的识别后执行语音信号的语义的识别。The question answering method 100 may further include step 102 of recognizing the semantics and emotion categories of the speech signal. Step 102 may include two sub-steps, namely, a step of recognizing the semantics of the voice signal and a step of recognizing the emotion category of the voice signal. These two sub-steps can be executed simultaneously or sequentially. The semantic recognition of the voice signal may be performed first and then the emotion category recognition of the voice signal may be performed, or the emotion category recognition of the voice signal may be performed first and then the semantic recognition of the voice signal may be performed.
识别语音信号的语义可以包括,解析语音信号中包括的具体问题信息,以便针对该具体问题信息从预设的数据库中输出对应的答复。识别语音信号的语义可通过稍后将参照根据本公开实施例的图6和图7描述的语义识别方法来实现。然而,应理解,识别语音信号的语义还可以以各种本领域已知的其他方法来实现,本公开的实施例对此不作限制。Recognizing the semantics of the voice signal may include parsing specific question information included in the voice signal, so as to output a corresponding answer from a preset database for the specific question information. Recognizing the semantics of the voice signal may be implemented by a semantic recognition method that will be described later with reference to FIGS. 6 and 7 according to an embodiment of the present disclosure. However, it should be understood that recognizing the semantics of the speech signal can also be implemented in various other methods known in the art, which are not limited in the embodiments of the present disclosure.
Recognizing the emotion category of the speech signal may be implemented by the speech emotion recognition method that will be described later with reference to FIGS. 2, 3 and 4 according to embodiments of the present disclosure. According to an embodiment of the present disclosure, the emotion categories may include categories along multiple dimensions, such as negative emotions and positive emotions, where negative emotions include, for example, urgency, impatience and sadness, and positive emotions include, for example, happiness. Further, the emotion categories may also include the degree of the positive or negative emotion in each dimension, such as overly happy, very happy, happy, somewhat happy, unhappy, very unhappy, and so on.
本领域技术人员可以根据实际需求对情感类别的种类和数目进行设置。Those skilled in the art can set the type and number of emotion categories according to actual needs.
该问答方法100可以进一步包括步骤103,基于语音信号的语义和情感类别输出问答的答复。The question answering method 100 may further include step 103, outputting the answer to the question answering based on the semantics and emotional category of the speech signal.
根据本公开的实施例,在存储器中可以包括预设的数据库。预设的数据库可以包括多个条目。每个条目可以包括语义、情感类别和回复三个属性。如此,步骤103可以包括从该预设的数据库中检索出与识别出的语义和情感类别二者相匹配的答复,进而将其输出给用户。According to an embodiment of the present disclosure, a preset database may be included in the memory. The preset database may include multiple entries. Each item can include three attributes: semantics, emotional category, and reply. In this way, step 103 may include retrieving from the preset database a response that matches the recognized semantic and emotional category, and then outputting it to the user.
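Purely as an illustration, the retrieval of step 103 can be sketched as follows; the entries and field names below are fabricated for the sketch and are not part of the preset database of the disclosure.

```python
# Illustrative sketch of step 103: each entry of the preset database carries three
# attributes (semantics, emotion category, reply). The entries below are made up.
DATABASE = [
    {"semantics": "weather_query", "emotion": "impatient", "reply": "Right away: it will be sunny today."},
    {"semantics": "weather_query", "emotion": "happy", "reply": "Glad to hear it! It will be sunny today."},
]

def select_reply(semantics, emotion):
    # Return the reply of the first entry matching both the recognized semantics
    # and the recognized emotion category; None if nothing matches.
    for entry in DATABASE:
        if entry["semantics"] == semantics and entry["emotion"] == emotion:
            return entry["reply"]
    return None

print(select_reply("weather_query", "impatient"))
```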
In one embodiment, the question answering method may not output the reply directly based on the semantics and the emotion category of the speech signal; instead, it may first judge, based on the emotion category of the speech signal, whether the user's emotion is negative (for example, lost, depressed, unhappy, listless, and so on). When the user's emotion is judged to be negative, the question answering method may first output positive information such as a joke (which may, for example, be completely unrelated to the semantics of the speech signal) to adjust the user's emotion, and then output the reply based on the semantics of the speech signal.
According to the present disclosure, the question answering method 100 may be executed repeatedly so as to realize multiple rounds of question answering. In each round, the recognized semantics and emotion category of the speech signal may be stored or recorded in order to guide subsequent replies. In one embodiment, the emotion category of the current round may be determined based on previous emotion categories (for example, those of the last round or of the last several rounds, the change of the emotion category, or the numbers of the various emotion categories) in order to guide the reply to the question of the current round. For example, FIG. 1B shows an example of determining the emotion category of the current round based on previous emotion categories according to at least one embodiment of the present disclosure.
For example, when the type of the user's question belongs to a multi-round state, the emotion state of each round is first recorded. Once the number of rounds exceeds three, the emotion state is determined by a voting strategy, as sketched below: if at least two of the last three rounds have the same emotion state, that emotion state is taken as the voting result of those three rounds; otherwise, the most recently judged emotion state is taken as the voting result. The emotion state obtained from the previous three rounds is then used to guide the reply of the next round. According to the judged emotion state, a matching reply style is looked up in the database; if the user's emotion is found to lean negative, the user's emotion is first relieved in some way before the answer is returned, and different degrees of negative emotion correspond to different reply contents.
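A minimal sketch of this round-based voting strategy follows; the emotion state labels are placeholders.

```python
from collections import Counter

def emotion_state_for_next_round(emotion_history):
    """Sketch of the per-round voting strategy described above.

    emotion_history: emotion states recorded for the previous rounds, most recent last.
    Once at least three rounds are available, a state appearing in at least two of the
    last three rounds wins the vote; otherwise the most recently judged state is used.
    """
    if not emotion_history:
        return None
    if len(emotion_history) < 3:
        return emotion_history[-1]
    last_three = emotion_history[-3:]
    state, count = Counter(last_three).most_common(1)[0]
    return state if count >= 2 else last_three[-1]

# Two of the last three rounds are negative, so a soothing reply is chosen first.
print(emotion_state_for_next_round(["happy", "negative", "negative"]))  # -> "negative"
```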
根据本公开实施例的问答方法,不仅仅基于语音信号的语义,还基于语音信号的情感类别来输出答复,因而可以使得用户获得更好的体验。此外,根据本公开的问答方法,还基于先前的情感类别来输出当前的答复,因而可以使得当前的答复让用户更满意,进而使得用户获得更好的体验。According to the question answering method of the embodiment of the present disclosure, the answer is output based not only on the semantics of the voice signal, but also based on the emotional category of the voice signal, thereby enabling the user to obtain a better experience. In addition, according to the question and answer method of the present disclosure, the current response is also output based on the previous emotion category, so that the current response can make the user more satisfied, and the user can get a better experience.
图2示出了根据本公开至少一个实施例的一种语音情感识别方法200的示意性流程图。如图2中所示,该语音情感识别方法200可以包括步骤201,对语音信号进行预处理。如前所述,语音信号可以是从用户处接收到的。预处理可以包括滤波、分帧等操作,其是本领域已知的,因此在此不再赘述。Fig. 2 shows a schematic flowchart of a method 200 for speech emotion recognition according to at least one embodiment of the present disclosure. As shown in FIG. 2, the voice emotion recognition method 200 may include step 201, preprocessing the voice signal. As mentioned earlier, the voice signal can be received from the user. The preprocessing may include filtering, framing and other operations, which are known in the art, and therefore will not be repeated here.
然而,应理解,语音情感识别方法200也可不包括步骤201。例如,该语音信号是已预先进行了处理,或者该语音信号已满足实际要求而不需要进行预处理等。本公开的实施例对此不做限制。However, it should be understood that the voice emotion recognition method 200 may not include step 201. For example, the voice signal has been processed in advance, or the voice signal has met actual requirements without preprocessing. The embodiments of the present disclosure do not limit this.
如图2中所示,该语音情感识别方法200可以进一步包括步骤202,基于预设的特征集合从预处理后的语音信号中提取该特征集合中的特征的值。根据本公开,所述预设的特征集合中的特征是在语音情感类别识别的训练过程中基于快速过滤的特征选择算法和方差从多个特征中选出的。本文稍后将结合图3和图4对所述预设的特征集合中的特征的选择过程进行详细说明。As shown in FIG. 2, the voice emotion recognition method 200 may further include step 202 of extracting the value of the feature in the feature set from the preprocessed voice signal based on the preset feature set. According to the present disclosure, the features in the preset feature set are selected from multiple features based on the feature selection algorithm of fast filtering and variance during the training process of speech emotion category recognition. The selection process of the features in the preset feature set will be described in detail later in this article in conjunction with FIG. 3 and FIG. 4.
如图2中所示,该语音情感识别方法200可以进一步包括步骤203,由分类器基于所提取的音频信号的特征的值识别所述语音信号的情感类别。在步骤203中,将所确定的特征集合中的音频特征的值输入分类器,并从分类器输出语音信号的情感类别。As shown in FIG. 2, the voice emotion recognition method 200 may further include step 203, in which the classifier recognizes the emotion category of the voice signal based on the value of the feature of the extracted audio signal. In step 203, the value of the audio feature in the determined feature set is input to the classifier, and the emotion category of the speech signal is output from the classifier.
根据本公开至少一个实施例,所述分类器可以包括多个子分类器。由分类器基于所述特征的值识别所述语音信号的情感类别可以包括由所述多个子 分类器基于所述特征的值识别所述语音信号的情感类别。例如,将所确定的特征集合中的音频特征的值输入分类器,并从分类器输出语音信号的情感类别可包括:将所确定的所述特征集合中的所述音频特征的值分别输入所述多个子分类器;分别从所述多个子分类器输出所述语音信号的情感类别预测结果;以及基于从所述多个子分类器输出的情感类别预测结果,识别所述语音信号的情感类别。According to at least one embodiment of the present disclosure, the classifier may include a plurality of sub-classifiers. Recognizing the emotion category of the voice signal based on the value of the feature by the classifier may include recognizing the emotion category of the voice signal based on the value of the feature by the plurality of sub-classifiers. For example, inputting the value of the audio feature in the determined feature set into the classifier, and outputting the emotion category of the voice signal from the classifier may include: inputting the value of the audio feature in the determined feature set into each The multiple sub-classifiers; respectively output the emotion category prediction results of the voice signal from the multiple sub-classifiers; and recognize the emotion category of the voice signal based on the emotion category prediction results output from the multiple sub-classifiers.
According to at least one embodiment of the present disclosure, the sub-classifiers may include a variety of classifiers, such as a support vector machine classifier, a decision tree classifier, a neural network classifier, and so on. Each sub-classifier may include a pre-trained speech emotion category recognition model. Each speech emotion category recognition model is trained in advance by the corresponding sub-classifier, during the training process of speech emotion category recognition, on a large number of speech signal samples, based on the same preset feature set described above and the same emotion category set (which includes emotion categories such as happy, urgent, impatient, sad, and so on). In one embodiment, the neural network classifier may include a back-propagation neural network whose input layer corresponds to the features of the preset feature set and whose output layer corresponds to the emotion categories of the aforementioned emotion category set. In one embodiment, in order to keep the decision tree from becoming too complex and to prevent overfitting, the decision tree classifier according to the present disclosure may use a pre-pruning operation. In one embodiment, in order to alleviate the overfitting problem, the support vector machine classifier according to the present disclosure may use a soft-margin support vector machine, so as to find, as far as possible, a clean separating hyperplane between two emotion categories that are not easy to separate. These sub-classifiers are themselves classifiers known in the art, so the detailed principles of how they train the speech emotion category recognition models will not be repeated here.
在实际的应用中,当向一个子分类器输入预设的特征集合中的特征的值时,该子分类器可以基于预先训练好的语音情感类别识别模型输出一个情感类别。如此,当将所述预设的特征集合中的特征的值分别输入各个子分类器时,每个子分类器都将输出一个情感类别。In practical applications, when the value of the feature in the preset feature set is input to a sub-classifier, the sub-classifier can output an emotion category based on a pre-trained speech emotion category recognition model. In this way, when the value of the feature in the preset feature set is input into each sub-classifier, each sub-classifier will output an emotion category.
In one embodiment, recognizing the emotion category of the speech signal by the multiple sub-classifiers based on the values of the features may include recognizing the emotion category of the speech signal according to the votes of the multiple sub-classifiers and the weights of the multiple sub-classifiers. Recognizing the emotion category of the speech signal according to the votes of the multiple sub-classifiers on the emotion category prediction results and the weights of the multiple sub-classifiers may include: obtaining the voting result of the multiple sub-classifiers on the emotion category prediction results; in response to a unique emotion category being identified from the votes, taking that unique emotion category as the emotion category of the speech signal; and in response to at least two emotion categories being identified from the votes, determining the emotion category of the speech signal according to the weights of the multiple sub-classifiers. In one embodiment, recognizing the emotion category of the speech signal by the multiple sub-classifiers based on the values of the features may include: in response to the emotion category prediction results recognized by at least two of the multiple sub-classifiers being the same, recognizing that emotion category prediction result as the emotion category of the speech signal. In a practical application, suppose that five sub-classifiers are used to recognize the emotion category of a speech signal. In one case, suppose that three of the sub-classifiers output the same emotion category prediction result (for example, happy), one sub-classifier outputs a different emotion category prediction result (for example, impatient), and one sub-classifier outputs yet another emotion category prediction result (for example, sad). According to the votes of these five sub-classifiers on the emotion category prediction results, a unique emotion category, namely happy, is identified. In this case, the emotion category happy is taken as the final emotion category recognized by the multiple sub-classifiers. In another case, suppose that two of the sub-classifiers output the same emotion category prediction result (for example, happy), another two sub-classifiers output a different emotion category prediction result (for example, impatient), and the last sub-classifier outputs yet another emotion category prediction result (for example, sad). According to the votes of these five sub-classifiers, two emotion categories, namely happy and impatient, are identified. In this case, the identified emotion category is not unique, so further discrimination among the identified emotion categories is required. According to an embodiment of the present disclosure, each sub-classifier may be assigned a corresponding weight in advance. That is, the vote value that each sub-classifier contributes to the emotion category prediction result it outputs is the weight of that sub-classifier, and the number of votes received by each emotion category is the sum of the weights of all sub-classifiers that output that emotion category prediction result. For example, continuing the preceding example, suppose the weights of the two sub-classifiers that output happy are 1 and 2, and the weights of the two sub-classifiers that output impatient are 3 and 4. The number of votes for the emotion category impatient is then 3 + 4 = 7, while the number of votes for the emotion category happy is 1 + 2 = 3. Since 7 is greater than 3, the emotion category impatient is taken as the final emotion category recognized by the multiple sub-classifiers. Of course, the embodiments of the present disclosure are not limited to further identifying the emotion category based only on the weights of the sub-classifiers. For example, the weight of each sub-classifier may be predetermined, or the weight of each sub-classifier may be determined according to the test accuracy of that sub-classifier on a preset test sample set, for example with a sub-classifier having higher test accuracy being given a larger weight; the embodiments of the present disclosure do not limit this.
例如,上述的获得该多个子分类器对情感类别预测结果的投票结果可包括:For example, the aforementioned voting results obtained by the multiple sub-classifiers for the emotion category prediction results may include:
对于该多个子分类器输出的每种情感类别预测结果,使用该多个子分类器中输出该情感类别预测结果的子分类器的数目作为该情感类别预测结果的得票数;以及For each emotion category prediction result output by the multiple sub-classifiers, use the number of sub-classifiers outputting the emotion category prediction result in the multiple sub-classifiers as the number of votes for the emotion category prediction result; and
将该多个子分类器输出的情感类别预测结果中得票最多的情感类别预测结果作为该多个子分类器识别出的情感类别。The emotion category prediction result with the most votes among the emotion category prediction results output by the multiple sub-classifiers is used as the emotion category recognized by the multiple sub-classifiers.
例如,在由多个子分类器识别出语音信号的至少两个情感类别的情况下,根据多个子分类器的权重来确定该语音信号的情感类别可包括:For example, in a case where at least two emotion categories of the voice signal are recognized by multiple sub-classifiers, determining the emotion category of the voice signal according to the weights of the multiple sub-classifiers may include:
calculating the sum of the weights of the sub-classifiers that output each of the at least two emotion categories; and
将对应于计算出的权重之和最大的情感类别作为该多个子分类器识别出的情感类别。The emotion category corresponding to the largest sum of the calculated weights is used as the emotion category recognized by the plurality of sub-classifiers.
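For illustration, the voting and weight-based tie-breaking described above can be sketched as follows; the example categories and weights mirror the example given earlier and are otherwise arbitrary.

```python
from collections import Counter

def ensemble_emotion(predictions, weights):
    """Sketch of the voting-then-weighting rule described above.

    predictions: emotion category output by each sub-classifier, e.g. ["happy", "impatient", ...]
    weights: weight assigned to each sub-classifier, aligned with predictions.
    """
    votes = Counter(predictions)
    top = max(votes.values())
    leaders = [category for category, n in votes.items() if n == top]
    if len(leaders) == 1:
        return leaders[0]  # a unique emotion category wins the plain vote
    # Otherwise sum, for each tied category, the weights of the sub-classifiers that output it.
    weight_sums = {category: 0.0 for category in leaders}
    for prediction, weight in zip(predictions, weights):
        if prediction in weight_sums:
            weight_sums[prediction] += weight
    return max(weight_sums, key=weight_sums.get)

# 2 votes for "happy" (weights 1 and 2) against 2 votes for "impatient" (weights 3 and 4):
print(ensemble_emotion(["happy", "happy", "impatient", "impatient", "sad"], [1, 2, 3, 4, 5]))  # -> "impatient"
```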
如前所述,根据本公开的语音情感类别识别方法通过多个分类器的投票结果来确定语音信号的最终的情感类别。与仅仅使用单个分类器来确定语音信号的情感类别相比,根据本公开的语音情感类别识别方法能够提高语音信号的情感类别识别的准确率和实时性。As mentioned above, the voice emotion category recognition method according to the present disclosure determines the final emotion category of the voice signal through the voting results of multiple classifiers. Compared with only using a single classifier to determine the emotion category of the voice signal, the voice emotion category recognition method according to the present disclosure can improve the accuracy and real-time performance of the emotion category recognition of the voice signal.
根据本公开的至少一个实施例,在语音信号的情感类别识别过程中,需要对语音信号的特征进行提取。所提取的特征的数目和种类对于情感类别的识别的准确性和计算复杂度都有着显著影响。根据本公开的至少一个实施例,将在语音情感类别识别的训练过程中,对于需要提取的语音信号的特征的数目和种类进行确定,以便形成在实际的语音信号的情感类别识别中需要使用的预设的特征集合。下面将结合图3和图4对所述预设的特征集合中的特征的选择过程进行详细说明。According to at least one embodiment of the present disclosure, in the process of emotion category recognition of the speech signal, the feature of the speech signal needs to be extracted. The number and types of extracted features have a significant impact on the accuracy and computational complexity of emotion category recognition. According to at least one embodiment of the present disclosure, in the training process of speech emotion category recognition, the number and types of features of the speech signal that need to be extracted are determined, so as to form what needs to be used in the actual speech signal emotion category recognition. A set of preset features. The selection process of the features in the preset feature set will be described in detail below in conjunction with FIG. 3 and FIG. 4.
图3示出了根据本公开实施例的一种特征提取方法300的示意性流程图。Fig. 3 shows a schematic flowchart of a feature extraction method 300 according to an embodiment of the present disclosure.
如图3中所示,特征提取方法300可以包括步骤301,提供多个语音信号样本;302,对所述多个语音信号样本进行预处理;303,提取所述多个语音信号样本中的每个语音信号样本的多个特征。所述多个语音信号样本可以来自现有的语音情感数据库,例如柏林语音情感数据库,或者可以是随着时间的推移不断积累的各种语音信号样本。所述预处理操作可以是本领域中已 知的各种预处理器操作,在此不再赘述。所述多个特征可以是例如openSMILE(open Speech and Music Interpretation by Large Space Extraction)之类的用于信号处理和机器学习的现有特征提取器针对每个语音信号样本提取的初始特征。这些特征可以例如包括帧能量、帧强度、临界频带谱、倒谱系数、听觉谱、线性预测系数、基础频率、过零率等。在一个示例中,假设语音信号样本的数目为N个,提取的初始特征的数目为D个,那么针对N个语音信号样本分别提取D个初始特征的值将得到一个原始数据集的矩阵
Z = [ z_11  z_12  …  z_1D
      z_21  z_22  …  z_2D
      …
      z_N1  z_N2  …  z_ND ]    (an N × D matrix)
其中,z ij表示特征的值,1≤i≤N,1≤j≤D。矩阵的每行表示一个语音信号样本的D个特征的值,矩阵的每列表示一个特征对应的N个样本。如此,矩阵Z可以包括N个D维样本向量(s 1,s 2,…,s N) T,D个N维特征向量(f 1,f 2,…,f D),其中,s 1=[z 11,z 12,…,z 1D],s 2=[z 21,z 22,…,z 2D],s N=[z N1,z N2,…,z ND],f 1=[z 11,z 21,…,z N1] T,f 2=[z 12,z 22,…,z N2] T,f D=[z 1D,z 2D,…,z ND] T。此外,每个语音信号样本还对应一个已知的情感类别。所有这些情感类别都属于预设的情感类别集合。如此,N个样本的情感类别向量C=[c 1,c 2,…,c k,…,c N] T,其中c k表示语音信号样本的情感类别的值,1≤k≤N。
As shown in FIG. 3, the feature extraction method 300 may include step 301, providing multiple voice signal samples; 302, preprocessing the multiple voice signal samples; 303, extracting each of the multiple voice signal samples Multiple characteristics of a speech signal sample. The multiple voice signal samples may come from an existing voice emotion database, such as a Berlin voice emotion database, or may be various voice signal samples accumulated over time. The pre-processing operation may be various pre-processing operations known in the art, which will not be repeated here. The multiple features may be the initial features extracted for each voice signal sample by an existing feature extractor used for signal processing and machine learning, such as openSMILE (open Speech and Music Interpretation by Large Space Extraction). These features may include, for example, frame energy, frame intensity, critical band spectrum, cepstrum coefficient, auditory spectrum, linear prediction coefficient, fundamental frequency, zero-crossing rate, and so on. In an example, assuming that the number of speech signal samples is N and the number of extracted initial features is D, then extracting the values of D initial features for N speech signal samples will result in a matrix of the original data set
Z = [ z_11  z_12  …  z_1D
      z_21  z_22  …  z_2D
      …
      z_N1  z_N2  …  z_ND ]    (an N × D matrix)
Among them, z ij represents the value of the feature, 1≤i≤N, 1≤j≤D. Each row of the matrix represents the value of D features of a voice signal sample, and each column of the matrix represents N samples corresponding to a feature. In this way, the matrix Z may include N D-dimensional sample vectors (s 1 , s 2 ,...,s N ) T , and D N-dimensional eigenvectors (f 1 , f 2 ,..., f D ), where s 1 = [z 11 ,z 12 ,…,z 1D ],s 2 =[z 21 ,z 22 ,…,z 2D ],s N =[z N1 ,z N2 ,…,z ND ],f 1 =[z 11 ,z 21 ,…,z N1 ] T ,f 2 =[z 12 ,z 22 ,…,z N2 ] T ,f D =[z 1D ,z 2D ,…,z ND ] T. In addition, each voice signal sample also corresponds to a known emotion category. All these emotion categories belong to a preset set of emotion categories. In this way, the emotion category vector C=[c 1 ,c 2 ,...,c k ,...,c N ] T of N samples, where c k represents the value of the emotion category of the speech signal sample, and 1≤k≤N.
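For concreteness, the data layout described above can be written down as a toy sketch; the numbers are made up and only the shapes matter.

```python
import numpy as np

# Toy illustration of the original data set: N samples, D initial features.
N, D = 4, 3
Z = np.arange(N * D, dtype=float).reshape(N, D)  # Z[i, j] = z_ij; row i is the sample vector s_i
C = np.array([0, 1, 0, 2])                       # c_k: known emotion category of each sample

s_1 = Z[0, :]   # a D-dimensional sample vector
f_1 = Z[:, 0]   # an N-dimensional feature vector (one column of Z)
print(Z.shape, C.shape, s_1, f_1)
```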
如图3中所示,特征提取方法300可以进一步包括步骤304,计算所述多个特征中的每个特征与多个情感类别的情感相关性。根据本公开,情感相关性可以通过如下通用公式计算:As shown in FIG. 3, the feature extraction method 300 may further include step 304 of calculating the emotional correlation between each of the multiple features and multiple emotional categories. According to the present disclosure, emotional relevance can be calculated by the following general formula:
SU(X, Y) = 2 × [H(X) − H(X|Y)] / [H(X) + H(Y)]
其中,X表示特征向量,Y表示情感类别向量,H(X)表示X的熵,H(Y)表示Y的熵,H(X|Y)表示X|Y的熵。具体而言,Among them, X represents the feature vector, Y represents the emotion category vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the entropy of X|Y. in particular,
H(X) = −Σ_m p(x_m) log p(x_m)

H(Y) = −Σ_l p(y_l) log p(y_l)

H(X|Y) = −Σ_l p(y_l) Σ_m p(x_m|y_l) log p(x_m|y_l)
其中,x m与y l分别为X和Y的可能取值,p(x m)和p(y l)分别为x m和y l的概率。 Among them, x m and y l are the possible values of X and Y respectively, and p(x m ) and p(y l ) are the probabilities of x m and y l respectively.
继续上述示例,按照上述通用计算公式,步骤304实质上包括,对于每个特征向量f j,1≤j≤D,计算情感相关性SU(f j,C),也就是, Continuing the above example, according to the above general calculation formula, step 304 essentially includes, for each feature vector f j , 1≤j≤D, calculating the emotional correlation SU(f j ,C), that is,
SU(f_j, C) = 2 × [H(f_j) − H(f_j|C)] / [H(f_j) + H(C)]
where H(f_j), H(C) and H(f_j|C) are obtained from the general entropy formulas above by taking X = f_j and Y = C.
在步骤304完成后,将得到D个情感相关性。After step 304 is completed, D emotional correlations will be obtained.
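As a minimal sketch, the emotional correlation SU(f_j, C) defined above may be computed as follows, assuming the feature values have already been discretized; the discretization itself and the base of the logarithm are assumptions of the sketch rather than requirements of the disclosure.

```python
import numpy as np

def entropy(values):
    # H(X) = -sum_m p(x_m) log p(x_m), estimated from the empirical distribution.
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    # H(X|Y) = -sum_l p(y_l) sum_m p(x_m|y_l) log p(x_m|y_l)
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 [H(X) - H(X|Y)] / [H(X) + H(Y)]
    return 2.0 * (entropy(x) - conditional_entropy(x, y)) / (entropy(x) + entropy(y))

# Emotional correlation of one discretized feature column f_j with the label vector C.
f_j = np.array([0, 0, 1, 1, 2, 2])
C = np.array([0, 0, 1, 1, 1, 0])
print(symmetric_uncertainty(f_j, C))
```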
如图3中所示,特征提取方法300可以进一步包括步骤305,从所述多个特征中选择情感相关性大于预设的情感相关性阈值的特征以获得第一候选特征子集。As shown in FIG. 3, the feature extraction method 300 may further include a step 305 of selecting features from the plurality of features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a first candidate feature subset.
根据本公开的至少一个实施例,预设的情感相关性阈值可以根据需求或经验进行设置。在这个步骤中,将计算得到的每个情感相关性与预设的情感相关性阈值相比较。如果计算得到的情感相关性大于预设的情感相关性阈值,则将该计算得到的情感相关性所对应的特征从D个特征中选出以便放入第一候选特征子集中。如果计算得到的情感相关性小于或等于预设的情感相关性阈值,则将该计算得到的情感相关性所对应的特征从D个特征中删除。According to at least one embodiment of the present disclosure, the preset emotional relevance threshold can be set according to needs or experience. In this step, each calculated emotional correlation is compared with a preset emotional correlation threshold. If the calculated emotional relevance is greater than the preset emotional relevance threshold, then the feature corresponding to the calculated emotional relevance is selected from D features so as to be put into the first candidate feature subset. If the calculated emotional relevance is less than or equal to the preset emotional relevance threshold, the feature corresponding to the calculated emotional relevance is deleted from the D features.
如图3中所示,特征提取方法300可以进一步包括步骤306,将所述第一候选特征子集中具有最大情感相关性的特征作为显著特征。As shown in FIG. 3, the feature extraction method 300 may further include step 306, using the feature with the greatest emotional relevance in the first candidate feature subset as a salient feature.
在该步骤中,可以将所述第一候选特征子集中的特征所对应的情感相关性进行排序,从而将与最大情感相关性相对应的特征作为显著特征。In this step, the emotional relevance corresponding to the features in the first candidate feature subset can be sorted, so that the feature corresponding to the largest emotional relevance is taken as the salient feature.
如图3中所示,特征提取方法300可以进一步包括步骤307,计算所述 第一候选特征子集中的其余特征中的每个特征与所述显著特征的特征相关性。As shown in FIG. 3, the feature extraction method 300 may further include step 307 of calculating the feature correlation between each feature in the first candidate feature subset and the salient feature.
根据本公开的至少一个实施例,特征相关性也可以通过如下通用公式计算:According to at least one embodiment of the present disclosure, feature correlation can also be calculated by the following general formula:
SU(X, Y) = 2 × [H(X) − H(X|Y)] / [H(X) + H(Y)]
其中,X表示特征向量,Y表示特征向量,H(X)表示X的熵,H(Y)表示Y的熵,H(X|Y)表示X|Y的熵。具体地,Among them, X represents the feature vector, Y represents the feature vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the entropy of X|Y. specifically,
H(X) = −Σ_m p(x_m) log p(x_m)

H(Y) = −Σ_l p(y_l) log p(y_l)

H(X|Y) = −Σ_l p(y_l) Σ_m p(x_m|y_l) log p(x_m|y_l)
其中,x m与y l分别为X和Y的可能取值,p(x m)和p(y l)分别为x m和y l的概率。 Among them, x m and y l are the possible values of X and Y respectively, and p(x m ) and p(y l ) are the probabilities of x m and y l respectively.
具体而言,继续前面的示例,假设f a对应于第一候选特征子集中的显著特征的特征向量,f b对应于第一候选特征子集中除f a之外的其余特征之一的特征向量,则f a与f b之间的特征相关性可以为: Specifically, continuing the previous example, suppose that f a corresponds to the feature vector of the salient feature in the first candidate feature subset, and f b corresponds to the feature vector of one of the remaining features in the first candidate feature subset except f a , Then the feature correlation between f a and f b can be:
SU(f_a, f_b) = 2 × [H(f_a) − H(f_a|f_b)] / [H(f_a) + H(f_b)]
where H(f_a), H(f_b) and H(f_a|f_b) are obtained from the general entropy formulas above by taking X = f_a and Y = f_b.
如图3中所示,特征提取方法300可以进一步包括步骤308,从所述第 一候选特征子集中删除特征相关性大于情感相关性的特征以获得第二候选特征子集。As shown in Fig. 3, the feature extraction method 300 may further include step 308 of deleting features with a feature correlation greater than emotional correlation from the first candidate feature subset to obtain a second candidate feature subset.
具体而言,继续前面的示例,由前述内容可知,f b对应的特征与情感类别的情感类别相关性: Specifically, continuing the previous example, it can be seen from the foregoing content that the feature corresponding to f b is related to the emotional category of the emotional category:
SU(f_b, C) = 2 × [H(f_b) − H(f_b|C)] / [H(f_b) + H(C)]
where H(f_b), H(C) and H(f_b|C) are obtained from the general entropy formulas above by taking X = f_b and Y = C.
在步骤308中,对于第一候选特征子集中除f a之外的每个其余特征f b,将该特征的特征相关性与该特征的情感相关性相比较,如果特征相关性大于情感相关性(即,SU(f a,f b)>SU(f b,C)),则从所述第一候选特征子集中删除该特征。 In step 308, for each remaining feature f b except f a in the first candidate feature subset, the feature correlation of the feature is compared with the emotional correlation of the feature, and if the feature correlation is greater than the emotional correlation (Ie, SU(f a , f b )>SU(f b , C)), then the feature is deleted from the first candidate feature subset.
在对于第一候选特征子集中除f a之外的所有其余特征执行完上述操作之后,可以得到第二候选特征子集。 After performing the above operations on all the remaining features in the first candidate feature subset except f a , the second candidate feature subset can be obtained.
如图3中所示,在此之后,特征提取方法300可以进一步包括步骤309,计算所述第二候选特征子集中的每个特征的方差。As shown in FIG. 3, after that, the feature extraction method 300 may further include step 309 of calculating the variance of each feature in the second candidate feature subset.
根据本公开,计算特征的方差,也就是对于特征所对应的N维特征向量计算方差。例如,假设第二候选特征子集中的一个特征所对应的特征向量是f t,则计算该特征的方差就是计算f t的方差。 According to the present disclosure, calculating the variance of the feature, that is, calculating the variance for the N-dimensional feature vector corresponding to the feature. For example, if the feature vector corresponding to a feature in the second candidate feature subset is f t , then calculating the variance of the feature is calculating the variance of f t .
如图3中所示,在此之后,特征提取方法300可以进一步包括310,从所述第二候选特征子集中删除特征的方差小于方差阈值的特征以获得预设的特征集合中的特征。As shown in FIG. 3, after that, the feature extraction method 300 may further include 310, removing features whose variance is less than a variance threshold from the second candidate feature subset to obtain features in a preset feature set.
根据本公开的至少一个实施例,方差阈值可以根据实际需求或经验进行设置。在该步骤中,对于所述第二候选特征子集中的每个特征而言,将该特征的方差与方差阈值相比较。如果该特征的方差小于方差阈值,则将该特征从所述第二候选特征子集中删除。According to at least one embodiment of the present disclosure, the variance threshold can be set according to actual needs or experience. In this step, for each feature in the second candidate feature subset, the variance of the feature is compared with a variance threshold. If the variance of the feature is less than the variance threshold, the feature is deleted from the second candidate feature subset.
在对于所述第二候选特征子集中的每个特征执行完上述删除操作后,所述第二候选特征子集中余下的特征就是最终选择出的特征。这些最终选择出的特征构成了本文的前述部分所述的预设的特征集合中的特征。该预设的特征集合将用于实际的语音信号情感类别识别中以及分类器的语音情感类别识别模型的训练中。After the foregoing deletion operation is performed on each feature in the second candidate feature subset, the remaining features in the second candidate feature subset are the finally selected features. These finally selected features constitute the features in the preset feature set described in the previous section of this article. The preset feature set will be used in the actual speech signal emotion category recognition and the training of the speech emotion category recognition model of the classifier.
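Steps 304 to 310 can be put together as the following compact sketch; it is an illustration only, it assumes discretized feature columns, and the two thresholds are placeholders to be chosen as described above.

```python
import numpy as np

def _entropy(v):
    _, c = np.unique(v, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def _su(x, y):
    # SU(X, Y) = 2 [H(X) - H(X|Y)] / [H(X) + H(Y)], as in the earlier sketch.
    h_x_given_y = sum((y == v).mean() * _entropy(x[y == v]) for v in np.unique(y))
    return 2.0 * (_entropy(x) - h_x_given_y) / (_entropy(x) + _entropy(y))

def select_features_method_300(Z, C, su_threshold=0.1, var_threshold=1e-3):
    """Sketch of feature extraction method 300 (steps 304-310).

    Z: (N, D) matrix of discretized feature values; C: length-N emotion category labels.
    Returns the column indices forming the preset feature set.
    """
    D = Z.shape[1]
    # Steps 304-305: keep features sufficiently correlated with the emotion categories.
    su_with_c = {j: _su(Z[:, j], C) for j in range(D)}
    first_subset = [j for j in range(D) if su_with_c[j] > su_threshold]
    if not first_subset:
        return []
    # Step 306: the feature with the greatest emotional correlation is the salient feature.
    salient = max(first_subset, key=lambda j: su_with_c[j])
    # Steps 307-308: drop features more correlated with the salient feature than with the labels.
    second_subset = [salient] + [
        j for j in first_subset
        if j != salient and _su(Z[:, salient], Z[:, j]) <= su_with_c[j]
    ]
    # Steps 309-310: drop features whose variance falls below the variance threshold.
    return [j for j in second_subset if np.var(Z[:, j]) >= var_threshold]
```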
图3中所示的特征提取方法先利用快速过滤的特征选择算法(Fast Correlation-Based Filter Solution)对特征进行过滤,然后再利用方差对特征进行进一步过滤。在快速过滤的特征选择算法中,先剔除与情感类别相关性较小的特征从而保留与情感类别相关性较大的特征,然后再利用与情感类别相关性最大的特征进一步筛选特征,可以极大地减小计算的时间复杂度。此外,图3中的特征提取方法利用特征方差可以进一步去除本身变化不明显的特征。The feature extraction method shown in Figure 3 first uses the Fast Correlation-Based Filter Solution to filter the features, and then uses the variance to further filter the features. In the fast-filtering feature selection algorithm, the features that are less relevant to the emotion category are first eliminated to retain the features that are more relevant to the emotion category, and then the features that are most relevant to the emotion category are used to further filter the features, which can greatly Reduce the time complexity of calculation. In addition, the feature extraction method in FIG. 3 uses feature variance to further remove features that do not change significantly.
与图3中所示的方法不同,图4中所示的特征提取方法则是先利用方差对特征进行过滤,然后再利用快速过滤的特征选择算法(Fast Correlation-Based Filter Solution)对特征进行进一步过滤。下面将对图4的特征提取方法进行详细说明。Different from the method shown in Figure 3, the feature extraction method shown in Figure 4 first uses variance to filter the features, and then uses the fast-filtering feature selection algorithm (Fast Correlation-Based Filter Solution) to further the features. filter. The feature extraction method of FIG. 4 will be described in detail below.
图4示出了根据本公开至少一个实施例的另一特征提取方法400的示意性流程图。FIG. 4 shows a schematic flowchart of another feature extraction method 400 according to at least one embodiment of the present disclosure.
如图4中所示,特征提取方法400可以包括如下步骤:As shown in FIG. 4, the feature extraction method 400 may include the following steps:
401,提供多个语音信号样本;401. Provide multiple voice signal samples;
402,对所述多个语音信号样本进行预处理;402. Perform preprocessing on the multiple voice signal samples.
403,提取所述多个语音信号样本中的每个语音信号样本的多个特征;403. Extract multiple features of each voice signal sample in the multiple voice signal samples.
404,计算所述多个特征中的每个特征的方差;404. Calculate the variance of each of the multiple features.
405,从所述多个特征中删除特征的方差小于方差阈值的特征以获得第三候选特征子集;405. Delete the feature whose variance of the feature is less than the variance threshold from the multiple features to obtain a third candidate feature subset;
406,计算所述第三候选特征子集中的每个特征与多个情感类别的情感相关性;406. Calculate the emotional correlation between each feature in the third candidate feature subset and multiple emotional categories;
407,从所述第三候选特征子集中选择情感相关性大于预设的情感相关性阈值的特征以获得第四候选特征子集;407. Select from the third candidate feature subset the features whose emotional relevance is greater than a preset emotional relevance threshold to obtain a fourth candidate feature subset;
408,将所述第四候选特征子集中具有最大情感相关性的特征作为显著特 征;408. Use the feature with the greatest emotional relevance in the fourth candidate feature subset as a salient feature;
409,计算所述第四候选特征子集中的其余特征中的每个特征与所述显著特征的特征相关性;以及409. Calculate the feature correlation between each of the remaining features in the fourth candidate feature subset and the salient feature; and
410,从所述第四候选特征子集中删除特征相关性大于情感相关性的特征以获得所述预设的特征集合中的特征。410. Delete the features whose feature correlation is greater than the emotional correlation from the fourth candidate feature subset to obtain features in the preset feature set.
由于图3的特征提取方法300与图4的特征提取方法400的区别仅在于快速过滤的特征选择算法与方差算法的顺序不同,本领域技术人员完全可以基于特征提取方法300实现特征提取方法400,因此在此不再对特征提取方法400的具体实现进行赘述。Since the feature extraction method 300 of FIG. 3 differs from the feature extraction method 400 of FIG. 4 only in that the order of the fast filtering feature selection algorithm and the variance algorithm is different, those skilled in the art can fully implement the feature extraction method 400 based on the feature extraction method 300. Therefore, the specific implementation of the feature extraction method 400 will not be repeated here.
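By way of illustration only (not part of the original disclosure), the following Python sketch shows one way the variance filter and the fast correlation-based filter of the feature extraction method 400 could be combined, assuming the emotion correlation and the feature correlation are the symmetric-uncertainty measure commonly used with such filters; the function names, thresholds and bin count are hypothetical.

    import numpy as np

    def entropy(labels):
        # Shannon entropy of a discrete sequence
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def symmetric_uncertainty(x_discrete, y):
        # assumed form of the correlation measure: 2*(H(X) - H(X|Y)) / (H(X) + H(Y))
        h_x, h_y = entropy(x_discrete), entropy(y)
        h_x_given_y = sum(
            (np.sum(y == c) / len(y)) * entropy(x_discrete[y == c]) for c in np.unique(y)
        )
        return 2.0 * (h_x - h_x_given_y) / (h_x + h_y + 1e-12)

    def discretize(col, bins=10):
        return np.digitize(col, np.histogram_bin_edges(col, bins=bins))

    def select_features(X, y, var_th=0.01, corr_th=0.1):
        # steps 404-405: drop features whose variance is below the threshold
        keep = [j for j in range(X.shape[1]) if np.var(X[:, j]) >= var_th]
        # steps 406-407: keep features whose emotion correlation exceeds the threshold
        su = {j: symmetric_uncertainty(discretize(X[:, j]), y) for j in keep}
        cand = [j for j in keep if su[j] > corr_th]
        if not cand:
            return []
        # step 408: the feature most correlated with the emotion categories is the salient feature
        salient = max(cand, key=lambda j: su[j])
        xs = discretize(X[:, salient])
        # steps 409-410: drop features more correlated with the salient feature than with emotion
        selected = [salient]
        for j in cand:
            if j != salient and symmetric_uncertainty(discretize(X[:, j]), xs) <= su[j]:
                selected.append(j)
        return selected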
应理解,在一些实施例中,上述的特征提取方法300可不包括步骤302。类似地,在一些实施例中,上述的特征提取方法400可不包括步骤402。例如,步骤301和步骤401中语音信号样本已预先进行了处理,或已满足实际要求而不需要进行预处理。本公开的实施例对此不作限制。It should be understood that, in some embodiments, the above-mentioned feature extraction method 300 may not include step 302. Similarly, in some embodiments, the aforementioned feature extraction method 400 may not include step 402. For example, the speech signal samples in step 301 and step 401 have been processed in advance, or have met actual requirements without preprocessing. The embodiment of the present disclosure does not limit this.
图5是根据本公开至少一个实施例的语音情感识别方法的示意性流程图。如图5所示,语音情感识别方法包括步骤S510至S550。Fig. 5 is a schematic flowchart of a method for speech emotion recognition according to at least one embodiment of the present disclosure. As shown in Figure 5, the voice emotion recognition method includes steps S510 to S550.
S510、选择特征,以得到特征集合。例如,步骤S510可通过基于图3的特征提取方法300或图4的特征提取方法400来实施。对图3的特征提取方法300和图4的特征提取方法400可参见上文中关于图3和图4的方法的描述,本文中将不再赘述。S510. Select features to obtain a feature set. For example, step S510 may be implemented based on the feature extraction method 300 of FIG. 3 or the feature extraction method 400 of FIG. 4. For the feature extraction method 300 of FIG. 3 and the feature extraction method 400 of FIG. 4, please refer to the above description of the methods of FIG. 3 and FIG. 4, which will not be repeated here.
S520. Train a classifier with the selected features to obtain a trained classifier. In some embodiments, the classifier may include multiple sub-classifiers, which may be of various kinds, for example a support vector machine classifier, a decision tree classifier, a neural network classifier, and so on. Each sub-classifier may include a speech emotion category recognition model, and each speech emotion category recognition model is trained with the feature set obtained in step S510 and the same emotion category set (which includes emotion categories such as happiness, urgency, impatience and sadness). In one embodiment, the neural network classifier may include a back-propagation neural network whose input layer takes the features of the preset feature set and whose output layer outputs the emotion categories of the emotion category set described above. In one embodiment, in order to keep the decision tree from becoming overly complex and to prevent overfitting, the decision tree classifier according to an embodiment of the present disclosure may use pre-pruning. In one embodiment, in order to alleviate overfitting, the support vector machine classifier according to an embodiment of the present disclosure may use a soft-margin support vector machine, so as to find, as far as possible, a clean separating hyperplane between two emotion categories that are not easily divided. These sub-classifiers are classifiers known per se in the art, so the details of their training are not repeated here.
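As a non-limiting illustration, the following sketch shows how step S520 might train the three kinds of sub-classifiers mentioned above using scikit-learn; the library choice and every hyper-parameter value are assumptions rather than part of the disclosure.

    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    def train_sub_classifiers(X_train, y_train):
        # X_train: values of the selected features per sample; y_train: emotion category labels
        sub_classifiers = [
            # soft-margin SVM: a finite C tolerates samples inside the margin
            SVC(kernel="rbf", C=1.0),
            # pre-pruned decision tree: depth and leaf-size limits keep the tree simple
            DecisionTreeClassifier(max_depth=8, min_samples_leaf=5),
            # back-propagation neural network: inputs are the selected features,
            # outputs are the emotion categories
            MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
        ]
        for clf in sub_classifiers:
            clf.fit(X_train, y_train)
        return sub_classifiers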
S530、提供测试语音信号。例如,该测试语音信号为在实际应用中由用户输入的语音信号,本公开的实施例对此不作限制。S530. Provide a test voice signal. For example, the test voice signal is a voice signal input by the user in an actual application, which is not limited in the embodiment of the present disclosure.
S540、基于特征集合,从测试语音信号中提取该特征集合中的特征的值。步骤S540中使用的特征集合为步骤S510中得到的特征集合。步骤S540与上述的步骤202基本相同,因此步骤S540的详细描述可参照上文中关于步骤202的描述,本公开的实施例对此将不再赘述。S540: Based on the feature set, extract the value of the feature in the feature set from the test voice signal. The feature set used in step S540 is the feature set obtained in step S510. Step S540 is basically the same as step 202 described above. Therefore, the detailed description of step S540 can refer to the description of step 202 above, which will not be repeated in the embodiment of the present disclosure.
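Purely as an illustration, the following sketch shows how feature values might be computed from a test speech signal with the librosa library; the concrete features (MFCC statistics and zero-crossing rate) are placeholders for whatever features were actually selected in step S510.

    import numpy as np
    import librosa

    def extract_feature_values(wav_path):
        signal, sr = librosa.load(wav_path, sr=None)             # keep the native sampling rate
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # frame-level MFCCs
        zcr = librosa.feature.zero_crossing_rate(signal)         # frame-level zero-crossing rate
        # summarize frame-level features into one value per feature of the feature set
        return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1), [zcr.mean()]])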
S550、使用分类器识别测试语音信号的情感类别。步骤S550中使用的分类器为在步骤S520中得到的经过训练的分类器。步骤S550与上述的步骤203基本相同,因此步骤S550的详细描述可参照上文中关于步骤203的描述,本公开的实施例对此将不再赘述。S550: Use a classifier to recognize the emotion category of the test speech signal. The classifier used in step S550 is the trained classifier obtained in step S520. Step S550 is basically the same as step 203 described above. Therefore, the detailed description of step S550 can refer to the description of step 203 above, which will not be repeated in the embodiment of the present disclosure.
For example, the above steps S510 and S520 may be performed in advance, so that in a user's specific application only steps S530 to S550 are performed. For example, steps S510 and S520 may be performed only once, with the trained classifier obtained in step S520 stored on a remote server or in local storage at the user's client; thereafter, in each specific application, only steps S530 to S550 need to be performed. As another example, steps S510 and S520 may be performed periodically or from time to time with new training data in order to update the classifier. It should be understood, however, that the embodiments of the present disclosure are not limited in this respect.
图6示出了根据本公开至少一个实施例的语义识别方法的示意性流程图。Fig. 6 shows a schematic flowchart of a semantic recognition method according to at least one embodiment of the present disclosure.
如图6所示,根据本公开至少一个实施例的语义识别方法包括:As shown in FIG. 6, the semantic recognition method according to at least one embodiment of the present disclosure includes:
S610、将语音信号转换为文本信息,以及使用先前一轮对话的目标对话状态作为当前对话状态;S610: Convert the voice signal into text information, and use the target dialogue state of the previous round of dialogue as the current dialogue state;
S620、识别文本信息的命名实体;S620. Identify the named entity of the text information;
S630、根据命名实体确定命名实体对应的待识别向量;S630: Determine, according to the named entity, a vector to be recognized corresponding to the named entity;
S640、基于待识别向量,将满足要求的标准特征向量的意图确定为文本信息的当前意图;S640: Based on the vector to be recognized, determine the intent of the standard feature vector that meets the requirements as the current intent of the text information;
S650、根据当前对话状态和当前意图确定目标对话状态。S650: Determine the target dialogue state according to the current dialogue state and the current intention.
According to the semantic recognition method of the embodiments of the present disclosure, each time a piece of text information converted from a speech signal is received, a dialogue state is determined according to the state transition table and taken as the target dialogue state; when the next piece of text information is received, that target dialogue state serves as the current dialogue state used to determine the dialogue state of the next piece of text information. In this way, the intentions corresponding to two adjacent speech signals input by the user are linked, so that the user's current intention can be understood correctly.
In some embodiments of the present disclosure, the state transition table may include multiple dialogue states and multiple intents, and each dialogue state can switch to a next dialogue state according to the corresponding intent. Referring to FIG. 7, in an embodiment with four dialogue states and six intents as an example, dialogue state 1 switches to dialogue state 4 when the current intent is intent 1, to dialogue state 2 when the current intent is intent 2, and to dialogue state 3 when the current intent is intent 5. Similarly, dialogue state 2 switches to dialogue state 4 when the current intent is intent 3, dialogue state 3 switches to dialogue state 2 when the current intent is intent 6, and dialogue state 3 switches to dialogue state 4 when the current intent is intent 4.
For example, if the current dialogue state is "What is the weather like today?" and the content input by the user (that is, the intent) is "What about tomorrow?", then the dialogue can switch, according to the current dialogue state and the current intent, to the target dialogue state "What is the weather like tomorrow?". A single piece of text information input by the user (that is, "What about tomorrow?") is not enough to determine what the user specifically means, whereas linking the two adjacent pieces of text information input by the user (that is, "What is the weather like today?" and "What about tomorrow?") makes it possible to understand the user's current intention correctly.
FIG. 7 is a schematic state transition table according to at least one embodiment of the present disclosure. It should be noted that the state transition table shown in FIG. 7 is merely exemplary; the embodiments of the present disclosure place no restriction on the number or content of the dialogue states and intents in the state transition table, nor on the specific way they switch between one another, and these can be adjusted according to actual needs. It will be understood that the state transition table may be stored in the server in advance. For example, the state transition table may be set by a technician based on practical experience, or obtained by statistics or learning over big data; the embodiments of the present disclosure do not limit this.
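As an illustrative sketch only, a state transition table such as the one in FIG. 7 could be stored and queried as follows; the state and intent names are placeholders.

    # lookup keyed by (current dialogue state, current intent); entries mirror the FIG. 7 example
    TRANSITIONS = {
        ("state_1", "intent_1"): "state_4",
        ("state_1", "intent_2"): "state_2",
        ("state_1", "intent_5"): "state_3",
        ("state_2", "intent_3"): "state_4",
        ("state_3", "intent_6"): "state_2",
        ("state_3", "intent_4"): "state_4",
    }

    def next_dialogue_state(current_state, current_intent):
        # fall back to the current state when no transition is defined
        return TRANSITIONS.get((current_state, current_intent), current_state)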
在示例性实施例中,在步骤S610中,可通过任意已知的方法将语音信号转换为文本信息,本公开的实施例对此不再赘述。In an exemplary embodiment, in step S610, the voice signal may be converted into text information by any known method, which will not be repeated in the embodiment of the present disclosure.
For example, in an embodiment, in step S620 the named entities of the text information may be recognized by a named entity recognition model. Named entity recognition refers to recognizing entities with specific meanings in text, such as proper nouns (person names, organization names, place names) and meaningful time expressions, and it is a basic task of technologies such as information retrieval and question answering systems. For instance, in "Xiao Ming is on vacation in Hawaii.", the named entities are "Xiao Ming - person name" and "Hawaii - place name". A named entity recognition system can be built with grammar-based techniques, statistical models (such as machine learning) and the like. Entity detection and recognition can be approached in two ways: (1) detect the entities first and then classify the detected entities, or (2) combine detection and classification in a single model that simultaneously marks the position of each character and its category label.
需要说明的是,在步骤S620中,可以通过其他模型或者其他方法来识别文本信息的命名实体,本公开的实施例对此不作限制。It should be noted that in step S620, other models or other methods may be used to identify the named entity of the text information, which is not limited in the embodiment of the present disclosure.
例如,在一实施例中,在步骤S630中,可以通过深度学习模型来根据命名实体确定命名实体对应的待识别向量。需要说明的是,在步骤S630中,还可以通过其他模型或者其他方法来根据命名实体确定命名实体对应的待识别向量,本公开的实施例对此不作限制。For example, in one embodiment, in step S630, a deep learning model may be used to determine the vector to be recognized corresponding to the named entity according to the named entity. It should be noted that in step S630, other models or other methods may also be used to determine the vector to be recognized corresponding to the named entity according to the named entity, which is not limited in the embodiment of the present disclosure.
For example, in an embodiment, in step S640, based on the vector to be recognized, the intent of the standard feature vector with the greatest similarity to the vector to be recognized may be determined as the current intent of the text information. It should be noted that the standard feature vector that meets the requirements is not necessarily the one with the greatest similarity to the vector to be recognized; another standard feature vector may be used, depending on the actual situation, and the embodiments of the present disclosure place no specific restriction on this.
For example, in an embodiment, after the speech signal is converted in step S610 into the text information "I want to see the Mona Lisa", the text information is input into the named entity recognition model, and the named entity recognition model recognizes the named entity, yielding "I want to see PICTURE (painting)". The named entity recognition model performs the following operation on the received text information: it takes a string of characters (for example, corresponding to a sentence or paragraph of the text information) as input and recognizes the relevant nouns (persons, places and organizations) mentioned in the string.
For example, in one embodiment, the text information input into the named entity recognition model is the character sequence [我, 想, 看, 莫, 奈, 的, 撑, 阳, 伞, 的, 女, 人, O, …, O] (that is, "I want to see Monet's Woman with a Parasol"); the tag sequence recognized by the named entity recognition model is [O, O, O, B-PER, I-PER, O, B-PIC, I-PIC, I-PIC, I-PIC, I-PIC, I-PIC, O, …, O], that is, the named entities are: person - Monet, painting - Woman with a Parasol.
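As an illustration only, the tag sequence in the example above could be decoded into named entities roughly as follows; the helper name decode_bio and its exact behavior are assumptions.

    def decode_bio(chars, tags):
        # chars: the character sequence; tags: the B-/I-/O tag predicted for each character
        entities, current, label = [], [], None
        for ch, tag in zip(chars, tags):
            if tag.startswith("B-"):
                if current:
                    entities.append((label, "".join(current)))
                current, label = [ch], tag[2:]
            elif tag.startswith("I-") and label == tag[2:]:
                current.append(ch)
            else:
                if current:
                    entities.append((label, "".join(current)))
                current, label = [], None
        if current:
            entities.append((label, "".join(current)))
        return entities  # e.g. [("PER", "莫奈"), ("PIC", "撑阳伞的女人")]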
For example, in one embodiment, the named entity recognized by the named entity recognition model is passed through a deep learning model to determine its corresponding vector to be recognized. For example, the vector to be recognized may be a feature vector that includes text features classified and extracted from the named entity by the deep learning model.
For example, in an embodiment, preset corpus entries (for example "switch to another PIC (painting)", "the author's nationality", "paintings by PERSON", and so on) may be input into the deep learning model to obtain multiple standard feature vectors. For example, a standard feature vector may be a feature vector that includes text features classified and extracted from the corpus by the deep learning model.
需要说明的是,获取标准特征向量的动作可以预先完成,也可以实时完成,可根据具体场景进行设置,本公开的实施例对此不作限定。It should be noted that the action of obtaining the standard feature vector can be completed in advance, or can be completed in real time, and can be set according to a specific scenario, which is not limited in the embodiment of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, sentences that are phrased differently but serve the same purpose can be classified by the deep learning model into the same intent. For example, "I want to see the Mona Lisa", "Help me switch to the Mona Lisa" and "Switch over to the Mona Lisa for me" can all be classified into the same intent, "the user wants to switch to the PIC (Mona Lisa)".
这样,本实施例中通过将不同表示方式但具有同一目的的句子归为同一意图,有利于提高确定意图的准确性。In this way, in this embodiment, by classifying sentences with different expressions but with the same purpose as the same intention, it is beneficial to improve the accuracy of determining the intention.
For example, in an embodiment, in S640 the cosine similarities between the vector to be recognized and the multiple standard feature vectors may be computed, and the intent of the standard feature vector with the greatest similarity is taken as the current intent of the vector to be recognized; that is, the current intent of the text information "I want to see the Mona Lisa" is "switch to another PIC and have a look".
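As an illustration only, the cosine-similarity matching of step S640 might look like the following sketch; the function name and the small constant guarding against division by zero are assumptions.

    import numpy as np

    def match_intent(query_vec, standard_vecs, intents):
        # standard_vecs: one standard feature vector per row; intents: the intent of each row
        q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
        s = standard_vecs / (np.linalg.norm(standard_vecs, axis=1, keepdims=True) + 1e-12)
        sims = s @ q  # cosine similarity against every standard feature vector
        return intents[int(np.argmax(sims))]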
例如,在一实施例中,在S650中,可以根据在S610中获取的当前对话状态和在S640中确定的当前意图确定出目标对话状态。目标对话状态可用于确定答复。For example, in one embodiment, in S650, the target dialogue state may be determined according to the current dialogue state acquired in S610 and the current intention determined in S640. The target conversation state can be used to determine the answer.
根据本公开的实施例,在存储器中可以包括预设的数据库。预设的数据库可以包括多个条目。每个条目可以包括语义、情感类别和回复三个属性。如此,例如,在步骤103中,可以包括从该预设的数据库中检索出与识别出的语义(即目标对话状态)和情感类别二者相匹配的答复,进而将其输出给用户。According to an embodiment of the present disclosure, a preset database may be included in the memory. The preset database may include multiple entries. Each item can include three attributes: semantics, emotional category, and reply. In this way, for example, in step 103, it may include retrieving from the preset database a response that matches both the recognized semantics (ie, the target dialogue state) and the emotional category, and then outputting it to the user.
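By way of illustration only, a reply matching both the recognized semantics (the target dialogue state) and the emotion category could be retrieved from such a preset database roughly as follows; the entries shown are made-up examples.

    REPLY_DB = [
        {"semantics": "ask_weather_tomorrow", "emotion": "impatient",
         "reply": "Tomorrow: sunny, 22 degrees."},
        {"semantics": "ask_weather_tomorrow", "emotion": "happy",
         "reply": "Tomorrow looks lovely: sunny, 22 degrees, with a light breeze."},
    ]

    def retrieve_reply(semantics, emotion):
        for entry in REPLY_DB:
            if entry["semantics"] == semantics and entry["emotion"] == emotion:
                return entry["reply"]
        return None  # caller may fall back to a default reply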
FIG. 8 shows a schematic structural diagram of a question answering system 500 according to at least one embodiment of the present disclosure.

As shown in FIG. 8, the question answering system 500 may include a receiver 501 configured to receive a speech signal. In one embodiment, the receiver 501 may be configured to continuously receive multiple speech signals.

As shown in FIG. 8, the question answering system 500 may further include a recognition system 502 configured to recognize the semantics and the emotion category of the speech signal. Specifically, the recognition system 502 may include a speech semantic recognition device 5021 and a speech emotion recognition device 5022. The speech semantic recognition device 5021 may be configured to recognize the semantics of the speech signal, and may do so by various methods known in the art. The speech emotion recognition device 5022 may be configured to recognize the emotion category of the speech signal. According to the present disclosure, the speech emotion recognition device 5022 may recognize the emotion category of the speech signal with the speech emotion recognition method described above. The structure of the speech emotion recognition device will be described in detail later with reference to FIG. 9.

As shown in FIG. 8, the question answering system 500 may further include an outputter 503 configured to output a reply based on the semantics and the emotion category of the speech signal.

In an exemplary embodiment, the receiver 501, the recognition system 502 and the outputter 503 may be provided separately from one another. For example, the receiver 501 and the outputter 503 may be provided at the user side, and the recognition system 502 may be provided at a server or in the cloud.

In one embodiment, the question answering system 500 may include a memory configured to store various information, such as the speech signals, the preset feature set described above, the semantics recognized by the speech semantic recognition device 5021, the emotion categories recognized by the speech emotion recognition device 5022, the various classifiers, a preset database including semantics, emotion categories and replies, and so on.
图9示出了根据本公开至少一个实施例的一种语音情感识别设备600的示意性结构图。FIG. 9 shows a schematic structural diagram of a speech emotion recognition device 600 according to at least one embodiment of the present disclosure.
As shown in FIG. 9, the speech emotion recognition device 600 may include: a preprocessor 601 configured to preprocess a speech signal; a feature extractor 602 configured to extract, based on a preset feature set, the values of the features in the feature set from the preprocessed speech signal; and a recognizer 603 configured to have a classifier recognize the emotion category of the speech signal based on the values of the extracted features.
根据本公开实施例,所述分类器可以包括多个子分类器。在这种情况下,所述识别器603可以被配置为,由所述多个子分类器基于所述特征的值识别所述语音信号的情感类别。According to an embodiment of the present disclosure, the classifier may include a plurality of sub-classifiers. In this case, the recognizer 603 may be configured to recognize the emotion category of the voice signal based on the value of the feature by the plurality of sub-classifiers.
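As an illustration only, the following sketch shows one way the recognizer 603 could combine the outputs of the multiple sub-classifiers: a majority vote over the predicted emotion categories, with the sub-classifier weights breaking ties. The function names and the choice of weights (for example, validation accuracy) are assumptions.

    from collections import Counter

    def combine_predictions(predictions, weights):
        # predictions: the emotion category predicted by each sub-classifier, in a fixed order
        # weights: one weight per sub-classifier
        votes = Counter(predictions)
        top = votes.most_common()
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]  # a single category won the vote
        # tie: sum the weights of the sub-classifiers behind each tied category
        tied = {cat for cat, n in top if n == top[0][1]}
        score = {cat: 0.0 for cat in tied}
        for pred, w in zip(predictions, weights):
            if pred in tied:
                score[pred] += w
        return max(score, key=score.get)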
根据本公开实施例,所述预设的特征集合中的特征是基于快速过滤的特征选择算法和方差从多个特征中选出的。According to an embodiment of the present disclosure, the features in the preset feature set are selected from multiple features based on a fast-filtered feature selection algorithm and variance.
In one embodiment, the process of selecting the features of the preset feature set from multiple features based on the fast-filtering feature selection algorithm and the variance may be the feature extraction method shown in FIG. 3 or the feature extraction method shown in FIG. 4.
According to an embodiment of the present disclosure, a computer device is also provided. The computer device may include: a memory storing a computer program; and a processor configured to execute, when running the computer program, the speech emotion recognition method shown in FIG. 2 or the question answering method shown in FIG. 1A.
Reference is now made to FIG. 10, which shows a schematic structural diagram of a computing system 1100 suitable for implementing the speech emotion recognition method and device, the semantic recognition method, or the question answering method and system of at least one embodiment of the present disclosure.
As shown in FIG. 10, the computing system 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage portion 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the system 1100. The CPU 1101, the ROM 1102 and the RAM 1103 are connected to one another through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, a speech input device such as a microphone, and the like; an output portion 1107 including a cathode ray tube display or liquid crystal display, a speaker, and the like; a storage portion 1108 including a hard disk and the like; and a communication portion 1109 including a network interface card such as a LAN card or a modem. The communication portion 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read from it can be installed into the storage portion 1108 as needed.
特别地,根据本公开的实施例,上文参考图1A至图9描述的过程和装置可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括有形地包含在机器可读介质上的计算机程序,所述计算机程序包含用于实施图1A至图9的方法和装置的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1109从网络上被下载和安装,和/或从可拆卸介质1111被安装。In particular, according to an embodiment of the present disclosure, the processes and devices described above with reference to FIGS. 1A to 9 may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium, and the computer program includes program code for implementing the methods and apparatuses of FIGS. 1A to 9. In such an embodiment, the computer program may be downloaded and installed from the network through the communication part 1109, and/or installed from the removable medium 1111.
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules involved in the embodiments described in the present application can be implemented in software or in hardware. For example, illustrative types of hardware that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on. The described units or modules may also be provided in a processor, and in some cases the names of these units or modules do not constitute a limitation on the units or modules themselves.
根据本公开实施例,还提供了一种非瞬时计算机可读存储介质。该计算机可读存储介质存储了计算机程序,所述计算机程序在被处理器执行时使得所述处理器执行如图2中所示的语音情感识别方法、如图1A中所示的问答方法或如图6所示的语义识别方法。According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium is also provided. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the voice emotion recognition method shown in FIG. 2, the question answering method shown in FIG. 1A, or Figure 6 shows the semantic recognition method.
以上所述仅是本公开的示范性实施方式,而非用于限制本公开的保护范围,本公开的保护范围由所附的权利要求确定。The foregoing descriptions are merely exemplary implementations of the present disclosure, and are not used to limit the protection scope of the present disclosure, which is determined by the appended claims.

Claims (18)

  1. 一种语音情感识别方法,包括:A method of speech emotion recognition, including:
    基于预设的特征集合,从语音信号确定出所述特征集合中的音频特征的值;以及Based on the preset feature set, determining the value of the audio feature in the feature set from the speech signal; and
    将所确定的所述特征集合中的所述音频特征的值输入分类器,并从所述分类器输出所述语音信号的情感类别,Inputting the determined value of the audio feature in the feature set to a classifier, and outputting the emotion category of the speech signal from the classifier,
    其中,所述分类器包括多个子分类器,Wherein, the classifier includes multiple sub-classifiers,
    所述将所确定的所述特征集合中的所述音频特征的值输入分类器,并从所述分类器输出所述语音信号的情感类别,包括:The inputting the determined value of the audio feature in the feature set into a classifier and outputting the emotion category of the speech signal from the classifier includes:
    将所确定的所述特征集合中的所述音频特征的值分别输入所述多个子分类器;Inputting the determined values of the audio features in the feature set into the multiple sub-classifiers;
    分别从所述多个子分类器输出所述语音信号的情感类别预测结果;以及Respectively outputting the emotion category prediction results of the speech signal from the multiple sub-classifiers; and
    基于从所述多个子分类器输出的情感类别预测结果,识别所述语音信号的情感类别。Based on the emotion category prediction results output from the plurality of sub-classifiers, the emotion category of the speech signal is recognized.
  2. 根据权利要求1所述的方法,还包括:The method according to claim 1, further comprising:
    提供多个语音信号样本;Provide multiple voice signal samples;
    提取所述多个语音信号样本中的每个语音信号样本的多个特征;Extracting multiple features of each voice signal sample in the multiple voice signal samples;
    计算所述多个特征中的每个特征与多个情感类别的情感相关性;Calculating the emotional correlation between each of the multiple features and multiple emotional categories;
    从所述多个特征中选择情感相关性大于预设的情感相关性阈值的特征以获得第一候选特征子集;Selecting a feature with an emotional relevance greater than a preset emotional relevance threshold from the multiple features to obtain a first candidate feature subset;
    将所述第一候选特征子集中具有最大情感相关性的特征作为显著特征;Taking the feature with the greatest emotional relevance in the first candidate feature subset as a salient feature;
    计算所述第一候选特征子集中的其余特征中的每个特征与所述显著特征的特征相关性;Calculating the feature correlation between each of the remaining features in the first candidate feature subset and the salient feature;
    从所述第一候选特征子集中删除特征相关性大于情感相关性的特征以获得第二候选特征子集;Deleting features whose feature relevance is greater than emotional relevance from the first candidate feature subset to obtain a second candidate feature subset;
    计算所述第二候选特征子集中的每个特征的方差;以及Calculating the variance of each feature in the second candidate feature subset; and
    从所述第二候选特征子集中删除特征的方差小于方差阈值的特征以获得所述预设的特征集合中的特征。The feature whose variance of the feature is less than the variance threshold is deleted from the second candidate feature subset to obtain the feature in the preset feature set.
  3. 根据权利要求1所述的方法,还包括:The method according to claim 1, further comprising:
    提供多个语音信号样本;Provide multiple voice signal samples;
    提取所述多个语音信号样本中的每个语音信号样本的多个特征;Extracting multiple features of each voice signal sample in the multiple voice signal samples;
    计算所述多个特征中的每个特征的方差;Calculating the variance of each of the multiple features;
    从所述多个特征中删除特征的方差小于方差阈值的特征以获得第三候选特征子集;Deleting features whose variance of features is less than a variance threshold from the multiple features to obtain a third candidate feature subset;
    计算所述第三候选特征子集中的每个特征与多个情感类别的情感相关性;Calculating the emotional correlation between each feature in the third candidate feature subset and multiple emotional categories;
    从所述第三候选特征子集中选择情感相关性大于预设的情感相关性阈值的特征以获得第四候选特征子集;Selecting, from the third candidate feature subset, features with emotional relevance greater than a preset emotional relevance threshold to obtain a fourth candidate feature subset;
    将所述第四候选特征子集中具有最大情感相关性的特征作为显著特征;Taking the feature with the greatest emotional relevance in the fourth candidate feature subset as the salient feature;
    计算所述第四候选特征子集中的其余特征中的每个特征与所述显著特征的特征相关性;以及Calculating the feature correlation between each feature in the fourth candidate feature subset and the salient feature; and
    从所述第四候选特征子集中删除特征相关性大于情感相关性的特征以获得所述预设的特征集合中的特征。The features whose feature relevance is greater than the emotional relevance are deleted from the fourth candidate feature subset to obtain features in the preset feature set.
  4. 根据权利要求2或3所述的方法,其中,情感相关性通过如下公式计算:The method according to claim 2 or 3, wherein the emotional correlation is calculated by the following formula:
    2[H(X) - H(X|Y)] / [H(X) + H(Y)]
    X表示特征向量,Y表示情感类别向量,H(X)表示X的熵,H(Y)表示Y的熵,H(X|Y)表示X|Y的熵;以及X represents the feature vector, Y represents the emotion category vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the entropy of X|Y; and
    其中,特征相关性通过如下公式计算:Among them, the feature correlation is calculated by the following formula:
    2[H(X) - H(X|Y)] / [H(X) + H(Y)]
    X表示一个特征向量,Y表示另一个特征向量,H(X)表示X的熵,H(Y)表示Y的熵,H(X|Y)表示X|Y的熵。X represents one feature vector, Y represents another feature vector, H(X) represents the entropy of X, H(Y) represents the entropy of Y, and H(X|Y) represents the entropy of X|Y.
  5. 根据权利要求1至4中任一项所述的方法,其中,所述基于从所述多个子分类器输出的所述情感类别预测结果,识别所述语音信号的所述情感类别,包括:根据所述多个子分类器对所述情感类别预测结果的投票和所述多个子分类器的权重来识别所述语音信号的情感类别。The method according to any one of claims 1 to 4, wherein the recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers comprises: The multiple sub-classifiers vote on the emotion category prediction result and the weight of the multiple sub-classifiers to identify the emotion category of the speech signal.
  6. 根据权利要求5所述的方法,其中,所述根据所述多个子分类器对所述情感类别预测结果的投票和所述多个子分类器的权重来识别所述语音信号 的情感类别,包括:The method according to claim 5, wherein the identifying the emotion category of the speech signal according to the votes of the emotion category prediction results of the multiple sub-classifiers and the weight of the multiple sub-classifiers comprises:
    获得所述多个子分类器对所述情感类别预测结果的投票结果;Obtaining voting results of the multiple sub-classifiers on the emotion category prediction results;
    响应于根据所述多个子分类器对所述情感类别预测结果的投票结果识别出唯一情感类别,将该唯一的情感类别作为所述语音信号的情感类别;以及In response to identifying a unique emotion category according to the voting results of the emotion category prediction results of the plurality of sub-classifiers, use the unique emotion category as the emotion category of the speech signal; and
    响应于根据所述多个子分类器对所述情感类别预测结果的投票结果识别出至少两个情感类别,根据所述多个子分类器的权重来确定所述语音信号的情感类别。In response to identifying at least two emotion categories according to voting results of the emotion category prediction results of the multiple sub-classifiers, the emotion category of the speech signal is determined according to the weights of the multiple sub-classifiers.
  7. 根据权利要求1至4中任一项所述的方法,其中,所述基于从所述多个子分类器输出的所述情感类别预测结果,识别所述语音信号的所述情感类别,包括:The method according to any one of claims 1 to 4, wherein the recognizing the emotion category of the speech signal based on the emotion category prediction results output from the plurality of sub-classifiers comprises:
    响应于所述多个子分类器中的至少两个子分类器识别出的情感类别预测结果相同,将所述至少两个子分类器识别出的所述情感类别预测结果作为所述语音信号的情感类别。In response to the emotion category prediction results recognized by at least two sub-classifiers of the plurality of sub-classifiers being the same, the emotion category prediction results recognized by the at least two sub-classifiers are used as the emotion category of the speech signal.
  8. 根据权利要求1至7中任一项所述的方法,其中,所述多个子分类器包括支持向量机分类器、决策树分类器和神经网络分类器。The method according to any one of claims 1 to 7, wherein the plurality of sub-classifiers include a support vector machine classifier, a decision tree classifier, and a neural network classifier.
  9. 一种语义识别方法,包括:A semantic recognition method includes:
    将语音信号转换为文本信息;Convert the voice signal into text information;
    使用先前一轮对话的目标对话状态作为当前对话状态;Use the target dialogue state of the previous round of dialogue as the current dialogue state;
    对所述文本信息进行语义理解以获取用户的当前意图;以及Semantic understanding of the text information to obtain the current intention of the user; and
    根据所述当前对话状态和所述当前意图确定目标对话状态,以及使用所述目标对话状态作为所述语音信号的语义。Determine a target dialogue state according to the current dialogue state and the current intention, and use the target dialogue state as the semantics of the voice signal.
  10. 根据权利要求9所述的方法,其中,所述对所述文本信息进行语义理解以获取用户的当前意图,包括:The method according to claim 9, wherein the semantic understanding of the text information to obtain the current intention of the user comprises:
    识别所述文本信息的命名实体;Identifying the named entity of the text information;
    根据所述命名实体确定所述命名实体对应的待识别向量;Determining, according to the named entity, the vector to be recognized corresponding to the named entity;
    基于所述待识别向量,将满足要求的标准特征向量的意图确定为所述文本信息的当前意图。Based on the to-be-recognized vector, the intent of the standard feature vector that meets the requirements is determined as the current intent of the text information.
  11. 根据权利要求10所述的方法,其中,所述识别所述文本信息的命名实体,包括:The method according to claim 10, wherein said identifying the named entity of the text information comprises:
    通过命名实体识别模型识别所述文本信息的命名实体。The named entity of the text information is recognized through the named entity recognition model.
  12. 根据权利要求10或11所述的方法,其中,所述根据所述命名实体确定所述命名实体对应的待识别向量,包括:The method according to claim 10 or 11, wherein the determining a vector to be recognized corresponding to the named entity according to the named entity comprises:
    通过深度学习模型,根据所述命名实体确定所述命名实体对应的待识别向量。Through the deep learning model, the to-be-recognized vector corresponding to the named entity is determined according to the named entity.
  13. 根据权利要求10-12中任一所述的方法,其中,所述基于所述待识别向量,将满足要求的标准特征向量的意图确定为所述文本信息的当前意图,包括:The method according to any one of claims 10-12, wherein the determining the intent of the standard feature vector that meets the requirements as the current intent of the text information based on the vector to be recognized comprises:
    将与所述待识别向量的相似度最大的标准特征向量的意图确定为所述文本信息的当前意图。The intent of the standard feature vector with the greatest similarity to the vector to be recognized is determined as the current intent of the text information.
  14. 一种问答方法,包括:A question and answer method, including:
    接收语音信号;Receive voice signals;
    识别语音信号的语义和情感类别;以及Recognize the semantic and emotional categories of speech signals; and
    基于语音信号的语义和情感类别输出答复,Output responses based on the semantic and emotional categories of the speech signal,
    其中,所述识别语音信号的情感类别包括:根据权利要求1至8中任一项所述的方法识别所述语音信号的情感类别;以及Wherein, recognizing the emotion category of the voice signal includes: recognizing the emotion category of the voice signal according to the method according to any one of claims 1 to 8; and
    所述识别语音信号的语义包括:根据权利要求9至13中任一项所述的方法识别所述语音信号的语义。The recognizing the semantics of the voice signal comprises: recognizing the semantics of the voice signal according to the method of any one of claims 9-13.
  15. 根据权利要求14所述的问答方法,其中,所述基于语音信号的语义和情感类别输出答复,包括:The question answering method according to claim 14, wherein said outputting a reply based on the semantic and emotional category of the speech signal comprises:
    从预设的多个答复中选择并输出与所述语音信号的所识别的语义和情感类别相匹配的答复。Select and output a reply matching the recognized semantic and emotional category of the voice signal from a plurality of preset replies.
  16. 根据权利要求14或15所述的问答方法,还包括:The question answering method according to claim 14 or 15, further comprising:
    基于先前至少一轮问答中确定出的情感类别,确定当前轮问答的情感类别。Based on the emotion categories determined in at least one previous round of question and answer, the emotion category of the current round of question and answer is determined.
  17. 一种计算机设备,包括:A computer device including:
    存储器,其存储了计算机程序;以及Memory, which stores computer programs; and
    处理器,其被配置为,在执行所述计算机程序时,执行以下中至少之一:The processor is configured to execute at least one of the following when executing the computer program:
    根据权利要求1至8中任一项所述的方法;The method according to any one of claims 1 to 8;
    根据权利要求9至13中任一项所述的方法;以及The method according to any one of claims 9 to 13; and
    根据权利要求14至16中任一项所述的方法。The method according to any one of claims 14 to 16.
  18. 一种非瞬时计算机可读存储介质,其存储了计算机程序,所述计算机程序在被处理器执行时使得所述处理器执行以下中至少之一:A non-transitory computer-readable storage medium that stores a computer program that, when executed by a processor, causes the processor to perform at least one of the following:
    根据权利要求1至8中任一项所述的方法;The method according to any one of claims 1 to 8;
    根据权利要求9至13中任一项所述的方法;以及The method according to any one of claims 9 to 13; and
    根据权利要求14至16中任一项所述的方法。The method according to any one of claims 14 to 16.
PCT/CN2020/083751 2019-04-24 2020-04-08 Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium WO2020216064A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910333653.4 2019-04-24
CN201910333653.4A CN110047517A (en) 2019-04-24 2019-04-24 Speech-emotion recognition method, answering method and computer equipment

Publications (1)

Publication Number Publication Date
WO2020216064A1 true WO2020216064A1 (en) 2020-10-29

Family

ID=67279086

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083751 WO2020216064A1 (en) 2019-04-24 2020-04-08 Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110047517A (en)
WO (1) WO2020216064A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047517A (en) * 2019-04-24 2019-07-23 京东方科技集团股份有限公司 Speech-emotion recognition method, answering method and computer equipment
CN110619041A (en) * 2019-09-16 2019-12-27 出门问问信息科技有限公司 Intelligent dialogue method and device and computer readable storage medium
CN113223498A (en) * 2021-05-20 2021-08-06 四川大学华西医院 Swallowing disorder identification method, device and apparatus based on throat voice information

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260416A (en) * 2015-09-25 2016-01-20 百度在线网络技术(北京)有限公司 Voice recognition based searching method and apparatus
US20160155439A1 (en) * 2001-12-07 2016-06-02 At&T Intellectual Property Ii, L.P. System and method of spoken language understanding in human computer dialogs
WO2018060993A1 (en) * 2016-09-27 2018-04-05 Faception Ltd. Method and system for personality-weighted emotion analysis
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN109616108A (en) * 2018-11-29 2019-04-12 北京羽扇智信息科技有限公司 More wheel dialogue interaction processing methods, device, electronic equipment and storage medium
CN110047517A (en) * 2019-04-24 2019-07-23 京东方科技集团股份有限公司 Speech-emotion recognition method, answering method and computer equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030110038A1 (en) * 2001-10-16 2003-06-12 Rajeev Sharma Multi-modal gender classification using support vector machines (SVMs)
CN103810994B (en) * 2013-09-05 2016-09-14 江苏大学 Speech emotional inference method based on emotion context and system
CN104008754B (en) * 2014-05-21 2017-01-18 华南理工大学 Speech emotion recognition method based on semi-supervised feature selection
CN105869657A (en) * 2016-06-03 2016-08-17 竹间智能科技(上海)有限公司 System and method for identifying voice emotion
CN106254186A (en) * 2016-08-05 2016-12-21 易晓阳 A kind of interactive voice control system for identifying
CN106683672B (en) * 2016-12-21 2020-04-03 竹间智能科技(上海)有限公司 Intelligent dialogue method and system based on emotion and semantics
CN107609588B (en) * 2017-09-12 2020-08-18 大连大学 Parkinson patient UPDRS score prediction method based on voice signals
CN107945790B (en) * 2018-01-03 2021-01-26 京东方科技集团股份有限公司 Emotion recognition method and emotion recognition system
CN108319987B (en) * 2018-02-20 2021-06-29 东北电力大学 Filtering-packaging type combined flow characteristic selection method based on support vector machine
CN108922512A (en) * 2018-07-04 2018-11-30 广东猪兼强互联网科技有限公司 A kind of personalization machine people phone customer service system
CN109274819A (en) * 2018-09-13 2019-01-25 广东小天才科技有限公司 User emotion method of adjustment, device, mobile terminal and storage medium when call

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735418A (en) * 2021-01-19 2021-04-30 腾讯科技(深圳)有限公司 Voice interaction processing method and device, terminal and storage medium
CN112735418B (en) * 2021-01-19 2023-11-14 腾讯科技(深圳)有限公司 Voice interaction processing method, device, terminal and storage medium
CN112784583A (en) * 2021-01-26 2021-05-11 浙江香侬慧语科技有限责任公司 Multi-angle emotion analysis method, system, storage medium and equipment
CN113239799A (en) * 2021-05-12 2021-08-10 北京沃东天骏信息技术有限公司 Training method, recognition method, device, electronic equipment and readable storage medium
CN113674736A (en) * 2021-06-30 2021-11-19 国网江苏省电力有限公司电力科学研究院 Classifier integration-based teacher classroom instruction identification method and system
CN113539243A (en) * 2021-07-06 2021-10-22 上海商汤智能科技有限公司 Training method of voice classification model, voice classification method and related device
CN113689886A (en) * 2021-07-13 2021-11-23 北京工业大学 Voice data emotion detection method and device, electronic equipment and storage medium
CN113689886B (en) * 2021-07-13 2023-05-30 北京工业大学 Voice data emotion detection method and device, electronic equipment and storage medium
CN115083439A (en) * 2022-06-10 2022-09-20 北京中电慧声科技有限公司 Vehicle whistling sound identification method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN110047517A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
WO2020216064A1 (en) Speech emotion recognition method, semantic recognition method, question-answering method, computer device and computer-readable storage medium
CN111046133B (en) Question and answer method, equipment, storage medium and device based on mapping knowledge base
CN108829757B (en) Intelligent service method, server and storage medium for chat robot
CN108319666B (en) Power supply service assessment method based on multi-modal public opinion analysis
CN109493850B (en) Growing type dialogue device
TWI536364B (en) Automatic speech recognition method and system
US10515292B2 (en) Joint acoustic and visual processing
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
WO2019179496A1 (en) Method and system for retrieving video temporal segments
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
US11735190B2 (en) Attentive adversarial domain-invariant training
WO2022252636A1 (en) Artificial intelligence-based answer generation method and apparatus, device, and storage medium
CN109584865B (en) Application program control method and device, readable storage medium and terminal equipment
US20230206928A1 (en) Audio processing method and apparatus
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN110377695B (en) Public opinion theme data clustering method and device and storage medium
Elshaer et al. Transfer learning from sound representations for anger detection in speech
KR20200105057A (en) Apparatus and method for extracting inquiry features for alalysis of inquery sentence
CN111159405B (en) Irony detection method based on background knowledge
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN111209367A (en) Information searching method, information searching device, electronic equipment and storage medium
CN115878847B (en) Video guiding method, system, equipment and storage medium based on natural language
CN116775873A (en) Multi-mode dialogue emotion recognition method
CN115357720B (en) BERT-based multitasking news classification method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20794846

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20794846

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.05.2022)
