CN102142253B - Voice emotion identification equipment and method - Google Patents

Voice emotion identification equipment and method

Info

Publication number
CN102142253B
Authority
CN
China
Prior art keywords
preliminary
affective state
confidence degree
voice
Legal status
Active
Application number
CN2010101047793A
Other languages
Chinese (zh)
Other versions
CN102142253A (en)
Inventor
郭庆
王彬
陆应亮
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Application filed by Fujitsu Ltd
Priority to CN2010101047793A
Publication of CN102142253A
Application granted
Publication of CN102142253B

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides voice emotion identification equipment and a voice emotion identification method. The voice emotion identification equipment comprises an emotion recognition unit and a confidence judging unit. The emotion recognition unit identifies the current emotional state of a speaker's voice as a preliminary emotional state. The confidence judging unit calculates the confidence of the preliminary emotional state and uses that confidence to judge whether the preliminary emotional state is credible; if so, the preliminary emotional state is determined as the final emotional state and output. Because a confidence judgment is performed on the recognition result of the speech emotional state and the final emotional state is determined according to the judgment result, the accuracy of speech emotional state recognition can be improved.

Description

Voice emotion identification equipment and method
Technical field
The present invention relates to speech recognition technology and, more specifically, to voice emotion identification equipment and a speech emotion recognition method.
Background technology
Emotional ability is an important hallmark of human intelligence. Emotion is essential in person-to-person interaction and plays an important role in processes such as human perception and decision-making. For a long time, research on emotional intelligence existed only in the fields of psychology and cognitive science. In recent years, with the development of artificial intelligence, the combination of emotional intelligence and computer technology has produced the research topic of affective computing, which will greatly promote the development of computer technology.
Speech is an important means of human communication and the most convenient, fundamental, and direct channel for exchanging information. While conveying semantic information, a speech signal also carries rich emotional information. Therefore, with the rapid development of human-computer interaction technology, the emotional information in speech signals is receiving more and more attention from researchers. As an important research direction in the affective information processing of speech signals, speech emotion recognition is the key to a computer's understanding of human emotion and a prerequisite for intelligent human-machine interaction.
The first problem that speech emotion recognition must solve is the division of emotional states. Two division methods are relatively common at present: one treats emotional states as a continuous distribution, and the other treats them as a discrete distribution.
The continuous-distribution method generally tends to express human emotion along several continuously varying dimensions. So far, however, only two dimensions have gained universal acceptance, namely arousal and valence, and they are not sufficient to express all basic emotions. In addition, because only a quantitative representation of high reliability makes the space computationally meaningful, another problem of the dimensional representation is how to quantify the position of an emotion and the distance between emotions. Current dimensional models do not accomplish this, which also limits emotion computation based on a dimensional space.
The discrete-distribution method divides human emotion into a number of discrete states. Because it simplifies the emotion model and its computation is relatively simple, most research to date has adopted this method.
Regarding how to automatically identify a speaker's emotional state from speech, various methods have been disclosed in existing patents, patent applications, and papers. For example:
Patent Document 1 takes the fundamental frequency contour, amplitude, and formant frequency contour of speech as features, applies gender-dependent normalization of the features, and finally trains a Support Vector Machine (SVM) model for each emotion; the emotional state of input speech is computed by the SVM models.
Patent Document 2 first evaluates the performance of speech features such as fundamental frequency, energy, speaking rate, formants, and formant bandwidths, and uses a feature selection algorithm to screen out the feature set with the larger influence on emotion recognition; a total of 12 features related to fundamental frequency, speaking rate, energy, formants, and formant bandwidths were selected. These features are then extracted from the input speech and compared with the features of each emotion pre-stored in a database, and the nearest emotion template is taken as the emotional state of the input speech.
Patent Document 3 uses three kinds of prosodic information of speech, namely fundamental frequency, speech duration, and energy, as features for emotion computation.
Non-Patent Document 4 adopts the SVM method to perform emotion recognition on real call-center data. Besides prosodic and spectral features, it also introduces some further features from the speech, such as phoneme boundary information and whether a certain segment of speech is disfluent; these features may be obtained with relatively high confidence by speech recognition methods.
Non-Patent Documents 5 and 6 adopt the Gaussian Mixture Model (GMM) method to model the temporal spectral features.
Non-Patent Documents 7, 8, and 9 adopt the Hidden Markov Model (HMM) method to model the temporal spectral features. Among them, Non-Patent Document 9 further exploits the differences in how the acoustic features of different phoneme types (such as vowels, plosives, fricatives, and nasals) change under different emotional states, and trains the HMM of each emotional state separately for each phoneme type; in recognition, phoneme recognition is first performed on the input speech, and the HMMs under the different emotional states are then used to recognize the emotion.
Non-Patent Documents 10 and 11 adopt the Linear Discriminant Analysis (LDA) method to recognize emotion from prosodic features.
However, as can be seen from the many articles, patents, and patent applications disclosed so far, most speech emotion recognition schemes focus only on recognizing the emotional state of speech by means of speech features and emotional state models, do not consider the influence of factors such as the speech features and the emotional state models on the accuracy of the recognition result, and cannot dynamically adjust that accuracy, which leads to an unstable recognition accuracy in need of further improvement.
Therefore, there is still a need for voice emotion identification equipment and/or a method that can improve the accuracy of the recognition result.
Patent Document 1: Chinese Patent 200610097301.6, "A speech emotion recognition method based on support vector machines", inventors Zhao Li et al.;
Patent Document 2: U.S. Patent Application 09/387,037, "System, method and article of manufacture for an emotion detection system", inventor Valery A. Petrushin;
Patent Document 3: U.S. Patent Application US 2006/0122834 A1, "Emotion detection device & method for use in distributed systems", inventor Ian M. Bennett;
Non-Patent Document 4: Laurence Vidrascu and Laurence Devillers, "Five emotion classes detection in real-world call center data: the use of various types of paralinguistic features", International Workshop on Paralinguistic Speech - Between Models and Data, Saarbrücken, Germany, 2007;
Non-Patent Document 5: Daniel Neiberg, Kjell Elenius, et al., "Emotion recognition in spontaneous speech", Lund University, Centre for Language & Literature, Dept. of Linguistics & Phonetics, Working Papers 52 (2006), 101-104;
Non-Patent Document 6: Daniel Neiberg, Kjell Elenius, et al., "Emotion recognition in spontaneous speech using GMMs", Interspeech 2006, Pittsburgh;
Non-Patent Document 7: B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition", Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, & Signal Processing, Hong Kong, 2003: 401-404;
Non-Patent Document 8: A. Nogueiras, A. Moreno, A. Bonafonte, et al., "Speech emotion recognition using hidden Markov models", Proceedings of Eurospeech, Aalborg, 2001: 2679-2682;
Non-Patent Document 9: Chul Min Lee, Serdar Yildirim, et al., "Emotion recognition based on phoneme classes", Proceedings of the International Conference on Spoken Language Processing (ICSLP 2004), Jeju Island, Korea;
Non-Patent Document 10: Chul Min Lee, Shrikanth S. Narayanan, and Roberto Pieraccini, "Combining acoustic and language information for emotion recognition", 7th International Conference on Spoken Language Processing, Denver, USA, 2002;
Non-Patent Document 11: C. Blouin and V. Maffiolo, "A study on the automatic detection and characterization of emotion in a voice service context", Proc. Interspeech, Lisbon, 2005, 469-472.
Summary of the invention
A brief summary of the present invention is provided below in order to give a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to delimit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description discussed later.
At least one object of the present invention is to provide voice emotion identification equipment and a method that can overcome at least some of the above-mentioned shortcomings and defects of the prior art, so as to improve the accuracy of the recognition result of the speech emotional state.
Another object of the present invention is to provide a corresponding computer program and/or computer-readable storage medium.
To achieve these objects, according to one embodiment of the present invention, voice emotion identification equipment is provided, comprising: an emotion recognition unit configured to identify the current emotional state of a speaker's voice as a preliminary emotional state; and a confidence judging unit configured to calculate the confidence of the preliminary emotional state and to use the confidence to judge whether the preliminary emotional state is credible, and, if the preliminary emotional state is judged credible, to determine the preliminary emotional state as the final emotional state and output the final emotional state.
The voice emotion identification equipment may further comprise an amending unit configured to, in the case where the confidence judging unit judges the preliminary emotional state not credible, revise the preliminary emotional state to obtain the final emotional state of the voice and output the final emotional state.
In the voice emotion identification equipment, the emotion recognition unit may match the voice against a plurality of emotional state models to determine the preliminary emotional state of the voice, and the confidence judging unit may calculate the confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state and/or the emotional state of preceding voice.
In the voice emotion identification equipment, the confidence judging unit may be configured to calculate a first preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state; if the first preliminary confidence is within a preset range, the first preliminary confidence is determined as the confidence; otherwise, a second preliminary confidence of the preliminary emotional state is calculated using the emotional state of preceding voice, and the first preliminary confidence and the second preliminary confidence are combined to obtain the confidence.
In the voice emotion identification equipment, the confidence judging unit may alternatively be configured to calculate a first preliminary confidence of the preliminary emotional state using the emotional state of preceding voice; if the first preliminary confidence is within a preset range, the first preliminary confidence is determined as the confidence; otherwise, a second preliminary confidence of the preliminary emotional state is calculated using the emotional state inverse model corresponding to the preliminary emotional state, and the first preliminary confidence and the second preliminary confidence are combined to obtain the confidence.
To achieve these objects, according to another embodiment of the present invention, a speech emotion recognition method is provided, comprising: identifying a preliminary emotional state of a speaker's voice; and calculating the confidence of the preliminary emotional state and using the confidence to judge whether the preliminary emotional state is credible, and, if it is credible, determining the preliminary emotional state as the final emotional state and outputting the final emotional state.
The speech emotion recognition method may further comprise: in the case where the preliminary emotional state is judged not credible, revising the preliminary emotional state to obtain the final emotional state of the voice, and outputting the final emotional state.
In the speech emotion recognition method, the step of identifying the preliminary emotional state of the voice may comprise matching the voice against a plurality of emotional state models to determine the preliminary emotional state of the voice, and the step of calculating the confidence may comprise calculating the confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state and/or the emotional state of preceding voice.
In the speech emotion recognition method, the step of calculating the confidence of the preliminary emotional state may comprise: calculating a first preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state; if the first preliminary confidence is within a preset range, determining the first preliminary confidence as the confidence; otherwise, calculating a second preliminary confidence of the preliminary emotional state using the emotional state of preceding voice, and combining the first preliminary confidence and the second preliminary confidence to obtain the confidence.
In the speech emotion recognition method, the step of calculating the confidence of the preliminary emotional state may alternatively comprise: calculating a first preliminary confidence of the preliminary emotional state using the emotional state of preceding voice; if the first preliminary confidence is within a preset range, determining the first preliminary confidence as the confidence; otherwise, calculating a second preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state, and combining the first preliminary confidence and the second preliminary confidence to obtain the confidence.
According to other embodiments of the invention, a corresponding computer-readable storage medium and computer program are also provided.
According to the embodiments of the invention, by performing a confidence judgment on the recognition result of the speech emotional state and determining the final emotional state according to the judgment result, the accuracy of the recognition result of the speech emotional state can be improved.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the invention in conjunction with the accompanying drawings.
Description of drawings
The present invention can be better understood with reference to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote identical or similar parts. The accompanying drawings, together with the following detailed description, are included in and form part of this specification and serve to further illustrate the preferred embodiments of the present invention and to explain the principles and advantages of the present invention. In the drawings:
Fig. 1 is a schematic diagram of voice emotion identification equipment according to a first embodiment of the invention;
Fig. 2 is a schematic diagram of voice emotion identification equipment according to a second embodiment of the invention;
Fig. 3 is a schematic diagram of voice emotion identification equipment according to a third embodiment of the invention;
Fig. 4 is a schematic diagram of voice emotion identification equipment according to a fourth embodiment of the invention;
Fig. 5 is a schematic diagram of voice emotion identification equipment according to a fifth embodiment of the invention;
Fig. 6 is a schematic diagram of voice emotion identification equipment according to a sixth embodiment of the invention;
Fig. 7 is a schematic diagram of equipment for training the emotional state models used in the embodiments of the invention;
Fig. 8 is a flowchart of a speech emotion recognition method according to a seventh embodiment of the invention;
Fig. 9 is a flowchart of a speech emotion recognition method according to an eighth embodiment of the invention;
Fig. 10 is a flowchart of a speech emotion recognition method according to a ninth embodiment of the invention;
Fig. 11 is a flowchart of a speech emotion recognition method according to a tenth embodiment of the invention;
Fig. 12 is a flowchart of a speech emotion recognition method according to an eleventh embodiment of the invention; and
Fig. 13 is a block diagram of an example configuration of a computer that implements the embodiments of the invention.
Those skilled in the art will appreciate that the elements in the drawings are illustrated for simplicity and clarity only and are not necessarily drawn to scale. For example, the size of some elements may be exaggerated relative to other elements in order to help improve the understanding of the embodiments of the present invention.
Embodiment
Example embodiments of the present invention are described in detail below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that in the development of any such practical embodiment many implementation-specific decisions must be made in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should be appreciated that although such development work might be complex and time-consuming, it is merely a routine undertaking for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, only the device structures and/or processing steps closely related to the solution of the present invention are shown in the drawings and described herein, while expressions and descriptions of parts and processing that have little relation to the present invention and are known to those of ordinary skill in the art are omitted.
In general, a speech emotion recognition scheme involves speech emotion training equipment and voice emotion identification equipment.
The speech emotion training equipment usually adopts a statistical modeling method to train, for each emotional state and for the selected features, a statistical model of that emotional state, thereby obtaining a statistical model of every emotional state.
The voice emotion identification equipment usually extracts the features of input speech and matches them against the model of each emotional state, and the emotional state with the maximum matching probability is taken as the emotion recognition result.
Fig. 1 is a schematic diagram of speech emotional state recognition equipment 100 according to the first embodiment of the invention.
As shown in Fig. 1, the speech emotional state recognition equipment 100 according to the first embodiment of the invention comprises an emotion recognition unit 101 and a confidence judging unit 102.
The emotion recognition unit 101 is configured to identify the current emotional state of the speaker's voice as a preliminary emotional state.
Generally speaking, speech signals under different emotional states have different structural characteristics and distribution regularities in aspects such as their temporal structure, amplitude structure, fundamental frequency structure, and formant structure. Accordingly, by computing and analyzing the structural characteristics and distribution regularities of speech signals of various specific patterns in these aspects and modeling on that basis, the emotional content implied in a speech signal can be recognized.
Those skilled in the art should understand that the known emotional state classification methods, speech feature selection methods, emotional state modeling methods, and so on used by the emotion recognition unit 101 when identifying the speech emotional state can be chosen flexibly according to the needs of the practical application, and all of them should fall within the spirit and scope claimed by the present invention.
For example, related research classifies the basic emotional states into four to eight categories. The influential classifications of Cornelius and Ekman divide the basic emotional states into six categories: happiness, sadness, fear, disgust, anger, and surprise. In addition, in some specific applications, for example call-center applications, the emotion categories can be further simplified; for example, emotions can be divided into three classes: positive, negative, and neutral.
In addition, the speech features currently in wide use are mainly the prosodic information and the spectral information of speech. The prosodic information mainly includes pitch, speaking rate, energy, and pauses; the spectral information in wide use includes Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), formants and their related features, and so on.
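As a rough illustration of how such frame-level prosodic and spectral features might be extracted, the following sketch uses the open-source librosa library; the library choice, the sampling rate, and the exact feature set are assumptions for illustration rather than requirements of the invention.

```python
# A minimal feature-extraction sketch, assuming librosa is available.
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)                  # mono speech signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # spectral envelope (MFCC)
    energy = librosa.feature.rms(y=y)                         # frame energy
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)             # pitch (fundamental frequency) track
    # Frame counts of the three streams can differ slightly; trim to the shortest.
    n = min(mfcc.shape[1], energy.shape[1], f0.shape[0])
    return np.vstack([mfcc[:, :n], energy[:, :n], f0[None, :n]]).T   # shape (frames, dims)
```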
In addition, for example, the following modeling methods can be used: Support Vector Machine (SVM), Gaussian Mixture Model (GMM), Linear Discriminant Analysis (LDA), Hidden Markov Model (HMM), decision tree, and so on.
Therefore, for the sake of brevity of the specification, the specific recognition methods used by the emotion recognition unit 101 when identifying the speech emotional state are not described in detail here.
The confidence judging unit 102 is configured to calculate the confidence of the preliminary emotional state, use the confidence to judge whether the preliminary emotional state is credible, and, if the preliminary emotional state is judged credible, determine the preliminary emotional state as the final emotional state and output the final emotional state.
Those skilled in the art should understand that the specific confidence calculation method adopted by the confidence judging unit 102 can be chosen flexibly according to the needs of the practical application, for example determining the confidence according to the magnitude of the matching probability between the preliminary emotional state and its corresponding emotional state model, or according to the difference or ratio between the matching probabilities of the preliminary emotional state and the plurality of emotional state models; all such methods should fall within the spirit and scope claimed by the present invention.
As can be seen from the above, the voice emotion identification equipment according to the first embodiment of the invention adds, on the basis of processing and analyzing the voice to identify its emotional state, the calculation of the confidence of the identified emotional state. If the identified emotional state is judged credible according to its confidence (for example when the confidence is sufficiently high, e.g. higher than a predetermined threshold or within a predetermined range), the emotional state of the input voice can be determined to be the identified emotional state, that is, the preliminary emotional state identified by the emotion recognition unit 101 is determined as the final emotional state.
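The following minimal sketch illustrates this recognize-then-verify flow; the per-state models with a log-likelihood score() method, the confidence function, and the threshold are assumptions for illustration, not elements prescribed by the invention.

```python
# Sketch of embodiment one: recognize a preliminary emotional state, then accept it
# as the final state only if its confidence clears a threshold.
def recognize_emotion(features, emotion_models, confidence_fn, threshold=0.0):
    # Emotion recognition unit: pick the state whose model matches the features best.
    scores = {state: model.score(features) for state, model in emotion_models.items()}
    preliminary = max(scores, key=scores.get)
    # Confidence judging unit: accept only if the preliminary state is credible.
    confidence = confidence_fn(features, preliminary, scores)
    if confidence >= threshold:
        return preliminary        # output as the final emotional state
    return None                   # not credible; embodiment two adds a correction step
```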
Therefore, by introducing the calculation of the confidence of the identified emotional state, the voice emotion identification equipment according to the first embodiment of the invention can reduce false alarms on the one hand and, on the other hand, because some emotional states are easily confused with one another, improve the accuracy of the recognition result.
In addition, it should be noted that although the voice emotion identification equipment according to the present embodiment has been described above in conjunction with the schematic diagram shown in Fig. 1, those skilled in the art should understand that the schematic diagram shown in Fig. 1 is only exemplary rather than a limitation on the scope of the present invention, and those skilled in the art can modify it according to actual needs.
Fig. 2 is a schematic diagram of voice emotion identification equipment 200 according to the second embodiment of the invention.
As shown in Fig. 2, the speech emotional state recognition equipment 200 according to the second embodiment of the invention comprises an emotion recognition unit 201, a confidence judging unit 202, and an amending unit 203.
The emotion recognition unit 201 and the confidence judging unit 202 are similar to the emotion recognition unit 101 and the confidence judging unit 102 in Fig. 1, respectively, and are therefore not described in detail again here.
The amending unit 203 is configured to, in the case where the confidence judging unit 202 judges the preliminary emotional state not credible, revise the preliminary emotional state to obtain the final emotional state of the voice, and output the final emotional state.
Specifically, if the confidence judging unit 202 judges, according to the confidence of the identified emotional state, that the identified emotional state is credible, for example when the confidence is sufficiently high (e.g. higher than a predetermined threshold or within a predetermined range), the emotional state of the input voice can be determined to be the identified emotional state.
Otherwise, if the confidence judging unit 202 judges, according to the confidence of the identified emotional state, that the identified emotional state is not credible, for example when the confidence is relatively low (e.g. lower than a predetermined threshold or outside a predetermined range), the amending unit 203 can apply correction processing to the preliminary emotional state according to the specific application.
For example, from an application point of view, targeted correction processing can be carried out for a pending emotional state whose confidence is not high enough.
For example, in a call-center application, what matters most are the three emotional states of positive, negative, and neutral; if the pending emotional state is a negative emotion but its confidence is not high enough, the amending unit 203 can revise the emotional state of the input voice to the neutral emotional state.
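One possible correction rule for the call-center scenario just described is sketched below, assuming the three-class positive/negative/neutral scheme; the patent leaves the concrete correction strategy to the specific application, so this is only an illustrative choice.

```python
# Sketch of an amending-unit rule: soften a low-confidence "negative" result to "neutral".
def amend_state(preliminary_state, credible):
    if credible:
        return preliminary_state        # the confidence judgment already accepted the state
    if preliminary_state == "negative":
        return "neutral"                # low-confidence negative is reported as neutral
    return preliminary_state            # other states could have their own rules
```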
As can be seen from the above, the voice emotion identification equipment according to the second embodiment of the invention adds, on the basis of processing and analyzing the voice to identify its emotional state, the calculation of the confidence of the identified emotional state, judges whether the identified emotional state is credible according to the calculated confidence, and performs corresponding processing depending on whether the identified emotional state is credible or not.
Therefore, by introducing the calculation of the confidence of the identified emotional state, the voice emotion identification equipment according to the second embodiment of the invention can reduce false alarms and improve the accuracy of the recognition result.
Those skilled in the art should understand that the specific manner in which the amending unit 203 revises the preliminary emotional state when the identified emotional state is determined not credible can be chosen and configured flexibly according to the needs of the specific application, and all such manners should fall within the spirit and scope claimed by the present invention.
In addition, it should be noted that although the voice emotion identification equipment according to the present embodiment has been described above in conjunction with the schematic diagram shown in Fig. 2, those skilled in the art should understand that the schematic diagram shown in Fig. 2 is only exemplary rather than a limitation on the scope of the present invention, and those skilled in the art can modify it according to actual needs.
Fig. 3 is a schematic diagram of voice emotion identification equipment 300 according to the third embodiment of the invention.
As shown in Fig. 3, the speech emotional state recognition equipment 300 according to the third embodiment of the invention comprises an emotion recognition unit 301 and a confidence judging unit 302.
The emotion recognition unit 301 is configured to match the voice against a plurality of emotional state models to determine the preliminary emotional state of the voice.
As indicated above, the known emotional state classification methods, speech feature selection methods, emotional state modeling methods, and so on used by the emotion recognition unit 301 when identifying the speech emotional state can be chosen flexibly according to the needs of the practical application, and all of them should fall within the spirit and scope claimed by the present invention.
For example, the emotion recognition unit 301 matches the speech features against the various emotional state models respectively, so as to obtain the probability that the current speech features belong to each emotional state model. Taking as an example HMM models that use the temporal spectral feature Mel-frequency cepstral coefficients (MFCC) as the affective feature, an input speech segment is represented after feature extraction as a sequence of 39-dimensional feature vectors o = {o_1, o_2, ..., o_N}, where N is the number of frames of the speech segment. This feature vector sequence is matched against, for example, 5 pre-stored HMM emotional state models, and for each HMM emotional state model the probability p(o | e_t = i) that the feature sequence belongs to that emotional state model is calculated. Since the use of HMM models is very common in this field and their computation methods are described in detail in many documents, they are not detailed here.
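A sketch of this matching step is given below, under the assumption that the per-emotion HMMs have been trained with the hmmlearn library; the patent does not name a toolkit, and any statistical model exposing a sequence log-likelihood would serve the same role.

```python
# Sketch of matching an utterance's feature sequence against per-emotion HMMs.
def match_emotion_models(features, emotion_hmms):
    """features: array of shape (N_frames, 39); emotion_hmms: dict state -> trained hmmlearn model."""
    # score() returns the sequence log-likelihood log p(o | e_t = i).
    log_likelihoods = {state: model.score(features) for state, model in emotion_hmms.items()}
    preliminary_state = max(log_likelihoods, key=log_likelihoods.get)
    return preliminary_state, log_likelihoods
```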
Therefore, for the sake of brevity of the specification, the specific recognition methods used by the emotion recognition unit 301 when identifying the speech emotional state are not described in detail here.
The confidence judging unit 302 is configured to calculate the confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state determined by the emotion recognition unit 301.
Here, the emotional state models used by the emotion recognition unit 301 and the emotional state inverse models used by the confidence judging unit 302 are trained on the same emotional speech corpus. See, for example, Fig. 7, which is a schematic diagram of equipment 700 for training the emotional state models used in the embodiments of the invention. The equipment 700 comprises an emotional state model training unit 702 and an emotional state inverse model training unit 703. The emotional state model training unit 702 trains emotional state models 704 based on a labeled emotional speech corpus 701. The emotional state inverse model training unit 703 trains emotional state inverse models 705 based on the labeled emotional speech corpus 701.
In general, the data of an emotional speech database can be obtained in two ways. One is to design a recording script in advance and then have people act out the various emotions to obtain speech data of the corresponding emotions. The other is to extract the data from real recordings, for example from the large number of actual recorded conversations between customers and operators in a call-center application.
The data of the emotional speech database are segmented and emotion-labeled: the emotional states in the dialogues are classified, for example into happiness, anger, worry, surprise, and neutral; sentence segmentation and further labeling are then performed, and sentences whose emotional state or wording is defective are removed, finally forming a labeled emotional speech corpus 701 that contains a considerable amount of well-expressed speech rich in various emotions.
In addition, although Fig. 7 only shows the equipment 700 being trained on the labeled emotional speech corpus 701, the invention is not limited to this. For example, the equipment 700 may also be trained on both general speech data and the labeled emotional speech corpus: a general model can first be trained on the general speech data, adaptive learning can then be carried out based on the labeled emotional speech corpus, and an emotion model knowledge base is finally obtained. The emotional state model training unit 702 and the emotional state inverse model training unit 703 can then train the emotional state models and the emotional state inverse models based on this emotion model knowledge base.
The emotional state model training unit 702 can adopt a statistical modeling method (for example the HMM method or the GMM method) and, for each emotional state, train the emotional state model corresponding to that emotional state from the speech data under the corresponding emotional state stored in the labeled emotional speech corpus 701, so as to obtain the emotional state model 704 corresponding to each emotional state.
Alternatively, the emotional state model training unit 702 can first train a general model from general speech data and then carry out adaptive learning based on the labeled emotional speech corpus, finally obtaining the emotional state model 704 corresponding to each emotional state. The algorithms used for adaptive learning can be algorithms known in the field of speech recognition, for example the maximum a posteriori (MAP) estimation algorithm or the Maximum Likelihood Linear Regression (MLLR) algorithm.
The emotional state inverse model training unit 703 trains, for each emotional state, an emotional state inverse model from the speech data stored in the labeled emotional speech corpus 701, so as to obtain the emotional state inverse model 705 corresponding to each emotional state.
The emotional state inverse models 705 can be trained with the same method as that used for the aforementioned emotional state models; only the training data differ. For the inverse-model training of a given emotional state, statistical computation is carried out with all the emotional data except the data belonging to that emotional state as the training data, so as to obtain the emotional state inverse model 705 corresponding to that emotional state.
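The complementary data split between a state's model and its inverse model is illustrated in the sketch below, again assuming hmmlearn GaussianHMM as the statistical model; GMMs or other statistical models could be substituted without changing the idea.

```python
# Sketch of training each emotional state's model and its inverse (anti-) model.
import numpy as np
from hmmlearn import hmm

def train_models(corpus, n_states=5):
    """corpus: dict emotion -> list of (N_frames, dim) feature arrays from the labeled corpus."""
    def fit(segments):
        X = np.vstack(segments)
        lengths = [len(s) for s in segments]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        return model

    models, anti_models = {}, {}
    for emotion, segments in corpus.items():
        models[emotion] = fit(segments)                    # trained on this emotion's data
        others = [s for e, segs in corpus.items() if e != emotion for s in segs]
        anti_models[emotion] = fit(others)                 # trained on all other emotions' data
    return models, anti_models
```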
Returning now to Fig. 3. As can be seen from the above, after the emotion recognition unit 301 matches the voice against the plurality of emotional state models to determine the preliminary emotional state of the voice, the confidence judging unit 302 calculates the confidence of the preliminary emotional state using the emotional state inverse model 304 corresponding to the preliminary emotional state determined by the emotion recognition unit 301.
For example, the confidence calculation of the confidence judging unit 302 can be based on statistical information from the recognition process. A threshold can then be set; if the confidence of the current emotional state is higher than this threshold, the input voice can be judged to be of this emotional state.
For example, the confidence calculation of the confidence judging unit 302 can use a threshold on the ratio between two probabilities, that is, the matching probability between the input voice and the emotional state model determined as the preliminary emotional state, divided by the matching probability between the input voice and the emotional state inverse model of that preliminary emotional state.
Since probability values can be computed in logarithmic form in a computer program implementation, the confidence can be expressed as follows:
CM(e_t = i) = log(P(o | e_t = i)) - log(P(o | e_t = ī))
where CM(e_t = i) is the confidence that the current emotional state is i, P(o | e_t = i) is the probability that the voice belongs to emotional state i, and P(o | e_t = ī) is the probability that the voice belongs to the inverse model ī of emotional state i.
Since the emotional state inverse model is used to judge the probability that the voice belongs to emotional states other than the preliminary emotional state, the confidence judging unit 302 can judge the confidence of the preliminary emotional state determined by the emotion recognition unit 301 by using the emotional state inverse model.
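A minimal sketch of this log-likelihood-ratio confidence follows, reusing the models and inverse models from the training sketch above; the decision threshold is an assumed placeholder rather than a value prescribed by the invention.

```python
# Sketch of the inverse-model confidence CM(e_t = i) = log P(o | i) - log P(o | anti-i).
def anti_model_confidence(features, preliminary_state, models, anti_models):
    return (models[preliminary_state].score(features)
            - anti_models[preliminary_state].score(features))

def is_credible(cm, threshold=0.0):
    # Accept the preliminary state only when the log-likelihood ratio clears the threshold.
    return cm > threshold
```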
Therefore, the speech emotional state recognition equipment 300 according to the third embodiment of the invention judges the confidence that the input voice belongs to the determined preliminary emotional state according to the degree of matching between the determined preliminary emotional state and the corresponding emotional state inverse model, and can thus reduce false alarms and improve the accuracy of the recognition result.
In addition, it should be noted that although the voice emotion identification equipment according to the present embodiment has been described above in conjunction with the schematic diagram shown in Fig. 3, those skilled in the art should understand that the schematic diagram shown in Fig. 3 is only exemplary rather than a limitation on the scope of the present invention, and those skilled in the art can modify it according to actual needs. For example, those skilled in the art can flexibly choose the specific manner of the confidence calculation of the confidence judging unit 302 according to the needs of the practical application, and all such manners should fall within the spirit and scope claimed by the present invention.
Fig. 4 is a schematic diagram of voice emotion identification equipment 400 according to the fourth embodiment of the invention.
As shown in Fig. 4, the voice emotion identification equipment 400 according to the fourth embodiment of the invention comprises an emotion recognition unit 401 and a confidence judging unit 402.
The emotion recognition unit 401 and the emotional state models 406 it uses are similar to the emotion recognition unit 301 and the emotional state models 303 in Fig. 3, respectively, and are therefore not described in detail again here.
The confidence judging unit 402 comprises a first preliminary confidence calculation unit 403, a second preliminary confidence calculation unit 404, and a synthesis unit 405.
The first preliminary confidence calculation unit 403 calculates a first preliminary confidence of the preliminary emotional state using the emotional state inverse model 407 corresponding to the preliminary emotional state determined by the emotion recognition unit 401.
For example, the first preliminary confidence calculation unit 403 can adopt a confidence calculation method similar to that of the confidence judging unit 302 in Fig. 3. Similarly, those skilled in the art can flexibly choose the specific manner of its confidence calculation according to the needs of the practical application, and all such manners should fall within the spirit and scope claimed by the present invention.
The second preliminary confidence calculation unit 404 calculates a second preliminary confidence of the preliminary emotional state determined by the emotion recognition unit 401 using the preceding speech emotional states 408.
Specifically, during a dialogue the emotional state of the same person is generally stable over a short time: if the emotional state of the previous sentence is happiness, the probability that the emotional state of the current sentence is anger is lower than the probability that it is happiness or neutral. Similarly, for three consecutive sentences spoken by one person, a combination with a large emotional swing, such as anger-happiness-anger, is very unlikely, because for most people a change of emotional state is a gradual process. The different transitions between emotional states therefore have different likelihoods. Thus, the second preliminary confidence calculation unit 404 can calculate the second preliminary confidence of the preliminary emotional state determined by the emotion recognition unit 401 from the emotional states of the preceding voice. The conditional probability of the determined preliminary emotional state of the current voice, given the determined emotional states of the preceding voice, or a function of this conditional probability, can be used as the second preliminary confidence: the larger the conditional probability, the larger the second preliminary confidence, and vice versa. In one example, the second preliminary confidence calculation unit 404 can calculate the conditional probability using conditional probability knowledge obtained in advance by statistics on the labeled emotional speech corpus. The conditional probability knowledge can be regarded as the probability of the emotional state of the N-th sentence given the emotional states of the preceding N-1 sentences. The specific calculation method is as follows.
Suppose there are M emotional states, i ∈ {e_1, e_2, ..., e_i, ..., e_M}. Then define:
(1) The probability that the emotional state of the current sentence is j given that the emotional state of the previous sentence is i (bigram (Bi-gram) model):
P_bi(e_t = j | e_{t-1} = i) = C(e_{t-1} = i, e_t = j) / C(e_{t-1} = i),
where C(e_{t-1} = i) is the number of sentences in the labeled emotional speech corpus whose emotional state is i, and C(e_{t-1} = i, e_t = j) is the number of times in the labeled emotional speech corpus that, within the same conversation, the emotional states of two consecutive sentences of the same speaker are i and j respectively.
(2) The probability that the emotional state of the current sentence is k given that the emotional states of the two preceding sentences are i and j (trigram (Tri-gram) model):
P_tri(e_t = k | e_{t-2} = i, e_{t-1} = j) = C(e_{t-2} = i, e_{t-1} = j, e_t = k) / C(e_{t-2} = i, e_{t-1} = j),
where C(e_{t-2} = i, e_{t-1} = j) is the number of times in the labeled emotional speech corpus that, within the same conversation, the emotional states of two consecutive sentences of the same speaker are i and j respectively, and C(e_{t-2} = i, e_{t-1} = j, e_t = k) is the number of times that the emotional states of three consecutive sentences of the same speaker are i, j, and k respectively.
Similarly, the probability that the emotional state of the current sentence is i_N given that the emotional states of the preceding N-1 sentences are i_1, i_2, ..., i_{N-1} (N-gram model) is:
P_N(e_t = i_N | e_{t-N+1} = i_1, e_{t-N+2} = i_2, ..., e_{t-1} = i_{N-1}) = C(e_{t-N+1} = i_1, e_{t-N+2} = i_2, ..., e_{t-1} = i_{N-1}, e_t = i_N) / C(e_{t-N+1} = i_1, e_{t-N+2} = i_2, ..., e_{t-1} = i_{N-1}),
where C(e_{t-N+1} = i_1, e_{t-N+2} = i_2, ..., e_{t-1} = i_{N-1}) is the number of times in the labeled emotional speech corpus that, within the same conversation, the emotional states of N-1 consecutive sentences of the same speaker are i_1, i_2, ..., i_{N-1} respectively, and C(e_{t-N+1} = i_1, e_{t-N+2} = i_2, ..., e_{t-1} = i_{N-1}, e_t = i_N) is the number of times that the emotional states of N consecutive sentences of the same speaker are i_1, i_2, ..., i_{N-1}, and i_N respectively.
The conditional probability knowledge may comprise conditional probabilities based on only one kind of model, or conditional probabilities based on different kinds of models. During a conversation, the emotional state of the current sentence is influenced most strongly by the emotions of the one or two preceding sentences, and the influence of earlier sentences is smaller. Therefore, the conditional probability knowledge preferably comprises the conditional probabilities based on the trigram model, the conditional probabilities based on the bigram model, or a combination of the two. That is to say, the conditional probability knowledge preferably includes knowledge of the conditional probabilities of transitions between the emotional states of two and/or three consecutive sentences of a speaker in a conversation.
For the calculation of the conditional probabilities, reference can also be made to Chinese patent application 200910150458.4 (entitled "Voice emotion identification equipment and method for performing speech emotion recognition").
The second preliminary confidence calculation unit 404 can look up, from the conditional probability knowledge, the conditional probability between the determined emotional states of the preceding voice and the determined preliminary emotional state of the current voice, and calculate the second preliminary confidence based on this conditional probability.
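The sketch below shows a count-based estimate of the bigram transition probabilities over a labeled corpus and their use as the context-based second preliminary confidence; the corpus format and the smoothing constant are assumptions for illustration.

```python
# Sketch of the context-based (second) preliminary confidence from bigram emotion transitions.
from collections import Counter

def train_bigram_probs(dialogues):
    """dialogues: list of per-speaker emotion-label sequences, e.g. [["neutral", "angry", ...], ...]."""
    uni, bi = Counter(), Counter()
    for seq in dialogues:
        for prev, cur in zip(seq, seq[1:]):
            uni[prev] += 1
            bi[(prev, cur)] += 1
    def p_bi(prev, cur, eps=1e-6):
        # P_bi(e_t = cur | e_{t-1} = prev), with a small constant to avoid division by zero.
        return (bi[(prev, cur)] + eps) / (uni[prev] + eps)
    return p_bi

def context_confidence(previous_state, preliminary_state, p_bi):
    # The larger the transition probability, the larger the second preliminary confidence.
    return p_bi(previous_state, preliminary_state)
```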
The synthesis unit 405 combines the first preliminary confidence calculated by the first preliminary confidence calculation unit 403 and the second preliminary confidence calculated by the second preliminary confidence calculation unit 404 to determine the final confidence. For example, a weighted sum of the first preliminary confidence and the second preliminary confidence can be used as the final confidence.
In one example, two thresholds θ1 and θ2 can be set for the first preliminary confidence of the input voice, where θ1 is set to a higher value and θ2 to a lower value. If the first preliminary confidence obtained by the first preliminary confidence calculation unit 403 is greater than θ1, the input voice can be directly judged to be of the pending emotional state; if the first preliminary confidence is less than θ2, the input voice is directly judged not to be of the pending emotional state. That is to say, when the first preliminary confidence is greater than θ1 or less than θ2, the second preliminary confidence calculation unit 404 does not need to calculate the second preliminary confidence. Otherwise, if the first preliminary confidence is greater than or equal to θ2 and less than or equal to θ1, the second preliminary confidence calculation unit 404 calculates the second preliminary confidence, and the synthesis unit 405 combines the first preliminary confidence and the second preliminary confidence to determine the final confidence.
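The two-threshold scheme just described is sketched below; the threshold values, the weighting of the two confidences, and the final decision threshold are assumed placeholders, since the patent leaves their concrete values to the application.

```python
# Sketch of the two-threshold combination of the first and second preliminary confidences.
def judge_with_two_thresholds(cm1, compute_cm2, theta1, theta2, weight=0.5, final_threshold=0.0):
    """cm1: first preliminary confidence; compute_cm2: callable computing the second one lazily."""
    if cm1 > theta1:
        return True                   # directly judged credible, cm2 never computed
    if cm1 < theta2:
        return False                  # directly judged not credible
    cm2 = compute_cm2()               # intermediate band: bring in the context-based confidence
    final_confidence = weight * cm1 + (1.0 - weight) * cm2   # e.g. a weighted sum
    return final_confidence > final_threshold
```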
As can be seen from the above, the confidence calculation of the speech emotional state recognition equipment 400 according to the fourth embodiment of the invention, on the one hand, calculates the first preliminary confidence based on the current segment of input voice and, on the other hand, can also be adjusted with reference to the second preliminary confidence based on the emotional states of the contextual voice, so that the accuracy of the speech emotional state recognition result can be further improved and false alarms can be reduced.
In addition, it should be noted that although the voice emotion identification equipment according to the present embodiment has been described above in conjunction with the schematic diagram shown in Fig. 4, those skilled in the art should understand that the schematic diagram shown in Fig. 4 is only exemplary rather than a limitation on the scope of the present invention, and those skilled in the art can modify it according to actual needs. For example, those skilled in the art can flexibly choose the specific implementations of the first preliminary confidence calculation unit 403, the second preliminary confidence calculation unit 404, and the synthesis unit 405 according to the needs of the practical application, and all of them should fall within the spirit and scope claimed by the present invention.
Fig. 5 is a schematic diagram of voice emotion identification equipment 500 according to the fifth embodiment of the invention.
As shown in Fig. 5, the voice emotion identification equipment 500 according to the fifth embodiment of the invention comprises an emotion recognition unit 501 and a confidence judging unit 502.
The emotion recognition unit 501 and the emotional state models 503 it uses are similar to the emotion recognition unit 301 and the emotional state models 303 in Fig. 3, respectively, and are therefore not described in detail again here.
The confidence judging unit 502 is configured to calculate the confidence of the preliminary emotional state determined by the emotion recognition unit 501 using the preceding speech emotional states 504.
The functions and implementations of the confidence judging unit 502 and the preceding speech emotional states 504 it uses are similar to those of the second preliminary confidence calculation unit 404 and the preceding speech emotional states 408 in Fig. 4, and are therefore not described in detail again here.
As can be seen from the above, the speech emotional state recognition equipment 500 according to the fifth embodiment of the invention judges the confidence that the input voice belongs to the determined preliminary emotional state according to the emotional states of the contextual voice, and can thus reduce false alarms and improve the accuracy of the recognition result.
In addition, it should be noted that although the voice emotion identification equipment according to the present embodiment has been described above in conjunction with the schematic diagram shown in Fig. 5, those skilled in the art should understand that the schematic diagram shown in Fig. 5 is only exemplary rather than a limitation on the scope of the present invention, and those skilled in the art can modify it according to actual needs.
Fig. 6 is a schematic diagram of the voice emotion identification equipment 600 according to the sixth embodiment of the invention.
As shown in Fig. 6, the voice emotion identification equipment 600 according to the sixth embodiment of the invention comprises an emotion recognition unit 601 and a confidence judgment unit 602.
The emotion recognition unit 601 and the emotional state models 606 that it uses are similar to the emotion recognition unit 301 and the emotional state models 303 in Fig. 3, respectively, so the emotion recognition unit 601 and the emotional state models 606 are not described in detail again here.
The confidence judgment unit 602 comprises a first preliminary confidence computation unit 603, a second preliminary confidence computation unit 604 and a synthesis unit 605.
The first preliminary confidence computation unit 603 uses the prior speech emotional states 607 to calculate the first preliminary confidence of the preliminary emotional state determined by the emotion recognition unit 601.
The second preliminary confidence computation unit 604 uses the emotional state inverse model 608 corresponding to the preliminary emotional state determined by the emotion recognition unit 601 to calculate the second preliminary confidence of that preliminary emotional state.
The functions and implementations of the first preliminary confidence computation unit 603 and the second preliminary confidence computation unit 604 are similar to those of the second preliminary confidence computation unit 404 and the first preliminary confidence computation unit 403 in Fig. 4, respectively, so they are not described in detail again here.
The synthesis unit 605 synthesizes the first preliminary confidence calculated by the first preliminary confidence computation unit 603 and the second preliminary confidence calculated by the second preliminary confidence computation unit 604 to determine the final confidence.
For example, the synthesis unit 605 may directly judge the preliminary emotional state of the input speech to be not credible when the first preliminary confidence, calculated by the first preliminary confidence computation unit 603 based on the emotional states of prior speech, is lower than a predetermined threshold; may directly judge the preliminary emotional state to be credible when the first preliminary confidence is higher than another predetermined threshold; and, when the first preliminary confidence lies between these two thresholds, may determine the final confidence in combination with the second preliminary confidence calculated by the second preliminary confidence computation unit 604 based on the corresponding emotional state inverse model 608.
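The following sketch illustrates this two-threshold strategy in Python, assuming, as Remark 11 allows for the general case, that the borderline case is resolved by a weighted sum of the two preliminary confidences; the threshold values and the weight are illustrative assumptions only.

    def synthesize_confidence(first_conf, second_conf_fn,
                              low=0.3, high=0.7, weight_first=0.5):
        """Final confidence of the preliminary emotional state (sketch of synthesis unit 605)."""
        if first_conf < low:
            return first_conf          # context alone already marks the state as not credible
        if first_conf > high:
            return first_conf          # context alone already marks the state as credible
        # Borderline case: consult the inverse-model-based confidence as well.
        second_conf = second_conf_fn()
        return weight_first * first_conf + (1.0 - weight_first) * second_conf

Passing the second confidence as a callable reflects the fact that the inverse-model computation only needs to be carried out in the borderline case.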
As described above, in the voice emotion identification equipment 600 according to the sixth embodiment of the invention, the confidence calculation is, on the one hand, based on the emotional states of the contextual (preceding) speech to obtain the first preliminary confidence and, on the other hand, can be adjusted with reference to the second preliminary confidence based on the current segment of input speech, so that the accuracy of the speech emotion recognition result can be further improved and false alarms reduced.
In addition, it should be noted that although the voice emotion identification equipment according to the present embodiment has been described above with reference to the schematic diagram shown in Fig. 6, those skilled in the art should understand that the schematic diagram shown in Fig. 6 is merely exemplary and does not limit the scope of the invention, and that it may be modified as actual needs require. For example, those skilled in the art may flexibly select the specific implementations of the first preliminary confidence computation unit 603, the second preliminary confidence computation unit 604 and the synthesis unit 605 according to the needs of the practical application, and all such implementations fall within the spirit and scope of the invention as claimed.
According to embodiments of the invention, a speech emotion recognition method is also provided.
Fig. 8 is a flowchart of the speech emotion recognition method according to the seventh embodiment of the invention.
As shown in Fig. 8, the speech emotion recognition method according to the seventh embodiment of the invention starts from step S801. In step S801, the preliminary emotional state of the speaker's speech is recognized. In step S802, the confidence of this preliminary emotional state is calculated and used to judge whether the preliminary emotional state is credible; if it is credible, the preliminary emotional state is determined as the final emotional state, and the final emotional state is output.
As described above, the speech emotion recognition method according to the seventh embodiment of the invention adds the calculation of the confidence of the recognized emotional state on top of processing and analyzing the speech to recognize its emotional state. If the recognized emotional state is judged to be credible according to its confidence (for example, when the confidence is sufficiently high, such as above a predetermined threshold or within a predetermined range), the emotional state to which the input speech belongs can be determined to be the recognized emotional state, that is, the preliminary emotional state is determined as the final emotional state.
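As a rough illustration, the following Python sketch captures the flow of steps S801 and S802 under the assumption that credibility is a simple threshold test on the confidence; the classifier and the confidence measure are left abstract, since the embodiment does not fix them.

    def recognize_final_emotion(speech, recognize_emotion, compute_confidence, threshold=0.6):
        # Step S801: recognize the preliminary emotional state of the speaker's speech.
        preliminary_state = recognize_emotion(speech)
        # Step S802: calculate its confidence and judge whether it is credible.
        conf = compute_confidence(speech, preliminary_state)
        if conf >= threshold:
            return preliminary_state   # credible: the preliminary state becomes the final state
        return None                    # not credible: handled differently in later embodiments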
Therefore, by introducing the calculation of the confidence of the recognized emotional state, the speech emotion recognition method according to the seventh embodiment of the invention can, on the one hand, reduce false alarms and, on the other hand, improve the accuracy of the recognition result, since some emotional states are easily misrecognized as one another.
Fig. 9 is a flowchart of the speech emotion recognition method according to the eighth embodiment of the invention.
As shown in Fig. 9, the speech emotion recognition method according to the eighth embodiment of the invention starts from step S901. In step S901, the preliminary emotional state of the speaker's speech is recognized. In step S902, the confidence of this preliminary emotional state is calculated. In step S903, the confidence is used to judge whether the preliminary emotional state is credible. If the preliminary emotional state is judged to be credible in step S903 ("Yes" in step S903), the method proceeds to step S904. In step S904, the preliminary emotional state is determined as the final emotional state. If the preliminary emotional state is judged to be not credible in step S903 ("No" in step S903), the method proceeds to step S905. In step S905, the preliminary emotional state is revised to obtain the final emotional state of the speech, and the final emotional state is output.
As described above, the speech emotion recognition method according to the eighth embodiment of the invention adds the calculation of the confidence of the recognized emotional state on top of processing and analyzing the speech to recognize its emotional state, judges whether the recognized emotional state is credible according to the calculated confidence, and performs the corresponding processing depending on whether the recognized emotional state is credible.
Therefore, by introducing the calculation of the confidence of the recognized emotional state, the speech emotion recognition method according to the eighth embodiment of the invention can reduce false alarms and improve the accuracy of the recognition result.
In addition, those skilled in the art should understand that, in the eighth embodiment of the invention, the specific manner of revising the preliminary emotional state when it is judged to be not credible according to its confidence can be selected and configured flexibly according to the demands of the specific application, and all such manners fall within the spirit and scope of the invention as claimed.
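Purely as an illustration of one such choice, the following Python sketch revises a non-credible preliminary emotional state by falling back to the emotional state of the preceding speech, or to a neutral default when no context is available; this particular revision strategy is an assumption for illustration, not one mandated by the embodiment.

    def revise_preliminary_state(preliminary_state, prior_states, default_state="neutral"):
        """Sketch of step S905: revise a preliminary emotional state judged not credible."""
        if prior_states:
            return prior_states[-1]   # assume emotional continuity with the preceding speech
        return default_state          # no context available: fall back to a neutral default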
Fig. 10 is a flowchart of the speech emotion recognition method according to the ninth embodiment of the invention.
As shown in Fig. 10, the speech emotion recognition method according to the ninth embodiment of the invention starts from step S1001. In step S1001, the speaker's speech is matched against a plurality of emotional state models to determine the preliminary emotional state of the speech. In step S1002, the inverse model corresponding to the preliminary emotional state and/or the emotional states of prior speech are used to calculate the confidence of the preliminary emotional state, and the confidence is used to judge whether the preliminary emotional state is credible; if it is credible, the preliminary emotional state is determined as the final emotional state, and the final emotional state is output.
As described above, the speech emotion recognition method according to the ninth embodiment of the invention judges the confidence that the input speech belongs to the determined preliminary emotional state according to the degree of match between the determined preliminary emotional state and the corresponding emotional state inverse model and/or according to the emotional states of prior speech, so that false alarms can be reduced and the accuracy of the recognition result improved.
Fig. 11 is a flowchart of the speech emotion recognition method according to the tenth embodiment of the invention.
As shown in Fig. 11, the speech emotion recognition method according to the tenth embodiment of the invention starts from step S1101. In step S1101, the speaker's speech is matched against a plurality of emotional state models to determine the preliminary emotional state of the speech. In step S1102, the inverse model corresponding to the preliminary emotional state is used to calculate the first preliminary confidence of the preliminary emotional state. In step S1103, it is judged whether the first preliminary confidence is within a preset range. If the first preliminary confidence is judged to be within the preset range in step S1103 ("Yes" in step S1103), the method proceeds to step S1104. In step S1104, the first preliminary confidence is determined as the confidence, and the method then proceeds to step S1107. If the first preliminary confidence is judged not to be within the preset range in step S1103 ("No" in step S1103), the method proceeds to step S1105. In step S1105, the emotional states of prior speech are used to calculate the second preliminary confidence, and the method then proceeds to step S1106. In step S1106, the first preliminary confidence and the second preliminary confidence are synthesized to obtain the confidence, and the method then proceeds to step S1107. In step S1107, the confidence is used to judge whether the preliminary emotional state is credible; if it is credible, the preliminary emotional state is determined as the final emotional state, and the final emotional state is output.
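The following Python sketch traces steps S1101 through S1107, assuming that the first preliminary confidence is derived from a log-likelihood ratio between the matched emotional state model and its inverse model, and that the synthesis of step S1106 is the weighted sum mentioned in Remark 11; the preset range, the weight, the credibility threshold and the model interface (a log_likelihood method) are illustrative assumptions.

    import math

    def recognize_with_inverse_model(features, state_models, inverse_models,
                                     prior_state_confidence,
                                     preset_range=(0.2, 0.8), weight=0.5,
                                     credible_threshold=0.6):
        # Step S1101: match the speech features against the emotional state models.
        preliminary = max(state_models,
                          key=lambda s: state_models[s].log_likelihood(features))
        # Step S1102: first preliminary confidence from the corresponding inverse model,
        # here a log-likelihood ratio squashed into [0, 1].
        ratio = (state_models[preliminary].log_likelihood(features)
                 - inverse_models[preliminary].log_likelihood(features))
        first_conf = 1.0 / (1.0 + math.exp(-ratio))
        # Steps S1103-S1106: use the first confidence directly if it lies within the
        # preset range; otherwise calculate the second confidence from the emotional
        # states of prior speech and synthesize the two by a weighted sum.
        low, high = preset_range
        if low <= first_conf <= high:
            conf = first_conf
        else:
            second_conf = prior_state_confidence(preliminary)
            conf = weight * first_conf + (1.0 - weight) * second_conf
        # Step S1107: a credible preliminary state becomes the final emotional state.
        return (preliminary, conf) if conf >= credible_threshold else (None, conf)

Here state_models and inverse_models are dictionaries mapping each emotional state name to its model, and prior_state_confidence is a callable returning the context-based second preliminary confidence for a given state.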
As described above, in the speech emotion recognition method according to the tenth embodiment of the invention, the confidence calculation is, on the one hand, based on the current segment of input speech to obtain the first preliminary confidence and, on the other hand, can be adjusted with reference to the second preliminary confidence based on the emotional states of the contextual (preceding) speech, so that the accuracy of the speech emotion recognition result can be further improved and false alarms reduced.
Fig. 12 is a flowchart of the speech emotion recognition method according to the eleventh embodiment of the invention.
As shown in Fig. 12, the speech emotion recognition method according to the eleventh embodiment of the invention starts from step S1201. In step S1201, the speaker's speech is matched against a plurality of emotional state models to determine the preliminary emotional state of the speech. In step S1202, the emotional states of prior speech are used to calculate the first preliminary confidence of the preliminary emotional state. In step S1203, it is judged whether the first preliminary confidence is within a preset range. If the first preliminary confidence is judged to be within the preset range in step S1203 ("Yes" in step S1203), the method proceeds to step S1204. In step S1204, the first preliminary confidence is determined as the confidence, and the method then proceeds to step S1207. If the first preliminary confidence is judged not to be within the preset range in step S1203 ("No" in step S1203), the method proceeds to step S1205. In step S1205, the inverse model corresponding to the preliminary emotional state is used to calculate the second preliminary confidence, and the method then proceeds to step S1206. In step S1206, the first preliminary confidence and the second preliminary confidence are synthesized to obtain the confidence, and the method then proceeds to step S1207. In step S1207, the confidence is used to judge whether the preliminary emotional state is credible; if it is credible, the preliminary emotional state is determined as the final emotional state, and the final emotional state is output.
As described above, in the speech emotion recognition method according to the eleventh embodiment of the invention, the confidence calculation is, on the one hand, based on the emotional states of the contextual (preceding) speech to obtain the first preliminary confidence and, on the other hand, can be adjusted with reference to the second preliminary confidence based on the current segment of input speech, so that the accuracy of the speech emotion recognition result can be further improved and false alarms reduced.
For the specific implementation of each step in the seventh to eleventh embodiments of the invention, reference may be made to the structure of the voice emotion identification equipment according to the first to sixth embodiments of the invention described above and to the functions of its components. For brevity of the description, the specific implementation of each step is not described in detail again here.
In addition, it should be noted that although the speech emotion recognition method according to the present embodiments has been described above with reference to the flowcharts shown in Figs. 8-12, those skilled in the art should understand that the flowcharts shown in Figs. 8-12 are merely exemplary and do not limit the scope of the invention, and that they may be modified as actual needs require. For example, the known emotional state classification methods, speech feature selection methods and emotional state modeling methods employed when recognizing the speech emotional state, the specific confidence calculation method adopted when calculating the confidence of the preliminary emotional state, the specific manner of revising the preliminary emotional state, and so on, can all be chosen flexibly according to the demands of the practical application, and all of them fall within the spirit and scope of the invention as claimed.
It should also be pointed out that the steps of the series of processes in the flowcharts shown in Figs. 8-12 above may naturally be performed in chronological order in the order described, but need not necessarily be performed in chronological order; some steps may be performed in parallel or independently of one another.
The voice emotion identification equipment and method according to the embodiments of the invention can be widely applied in fields such as education, entertainment and art. In the field of electronic games, game software can perceive the player's emotional state to understand the player's preferences, and can then reflect this emotion through certain actions and changes, making the game more lifelike and increasing the interaction with the player.
The voice emotion identification equipment and method according to the embodiments of the invention can also help to enhance the safety of using equipment. For example, an intelligent monitoring system for a driver's safe driving can dynamically monitor the driver's emotional state and issue timely warnings. Speech signals (such as speaking rate, pitch variation, volume intensity, voice quality, articulation clarity, etc.) can be used to identify sluggishness in the driver's answers to questions; when the system detects that the driver's attention is not focused or that the driver's emotion has changed, it can remind the driver at any time, and can perform automatic control to change the state and response of the car in time.
The voice emotion identification equipment and method according to the embodiments of the invention also have great application prospects in the call center field. For example, British Telecom has used affective computing technology to improve its call centers and make them more humanized. When an especially impolite user is encountered, a speech recognition system with emotion awareness can remind the operator to remain calm. After such a call has been handled, the system comforts and encourages the operator and helps them regulate their mood.
In addition, the voice emotion identification equipment and method according to the embodiments of the invention can also be applied in the call center field in the following ways: real-time monitoring, including monitoring of the operator's mood and of sudden changes in the customer's mood, so that, for example, when an operator is detected to be listless, a manager can be notified to arrange a rest for the operator to adjust their mood; and off-line batch processing, which extracts typical cases for managers to analyze, such as automatically extracting typical dialogues from recorded data over a period of time, for instance a case in which, under the operator's correct handling, the customer calms down from anger.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims.
Moreover, the method and equipment according to the invention may be implemented by hardware, or by software and firmware. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose computer 1300 shown in Fig. 13, which can perform various functions when various programs are installed on it.
In Fig. 13, a central processing unit (CPU) 1301 performs various kinds of processing according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage section 1308 into a random access memory (RAM) 1303. Data required when the CPU 1301 performs various kinds of processing is also stored in the RAM 1303 as needed. The CPU 1301, the ROM 1302 and the RAM 1303 are connected to one another via a bus 1304. An input/output interface 1305 is also connected to the bus 1304. The following components are connected to the input/output interface 1305: an input section 1306 including a keyboard, a mouse and the like; an output section 1307 including a display, such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; a storage section 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a LAN card, a modem and the like. The communication section 1309 performs communication processing via a network such as the Internet.
A drive 1310 is also connected to the input/output interface 1305 as needed. A removable medium 1311, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 1310 as needed, so that a computer program read from it is installed into the storage section 1308 as needed.
In the case where the above-described series of processes is implemented by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1311.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 1311 shown in Fig. 13, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308, or the like, in which the program is stored and which is distributed to the user together with the equipment containing it.
In this case, as long as the system or equipment has the function of executing a program, the embodiments of the invention are not limited to a particular kind of program; the program may take any form, for example an object program, a program executed by an interpreter, a shell script provided to an operating system, or the like.
In addition, the invention may also be realized by a computer that connects to a corresponding website on the Internet, downloads and installs the computer program code according to the invention onto the computer, and then executes the program.
Furthermore, the program implementing the invention may also take the form of one or more signals, for example a data signal that can be downloaded from an Internet site, a data signal provided on a carrier signal, or a data signal in any other form.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or piece of equipment that includes a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or piece of equipment. In the absence of further limitations, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or piece of equipment that includes the element.
Although the embodiments of the invention have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above are merely used to illustrate the invention and are not to be construed as limiting the invention. For those skilled in the art, various changes and modifications can be made to the above embodiments without departing from the spirit and scope of the invention. Therefore, the scope of the invention is defined only by the appended claims and their equivalents.
Remarks
Remark 1. A voice emotion identification equipment, comprising:
an emotion recognition unit configured to recognize the current emotional state of a speaker's speech as a preliminary emotional state; and
a confidence judgment unit configured to calculate the confidence of the preliminary emotional state and to use the confidence to judge whether the preliminary emotional state is credible, and, if the preliminary emotional state is judged to be credible, to determine the preliminary emotional state as the final emotional state and output the final emotional state.
Remark 2. The voice emotion identification equipment according to Remark 1, further comprising:
an amending unit configured to, when the confidence judgment unit judges the preliminary emotional state to be not credible, revise the preliminary emotional state to obtain the final emotional state of the speech and output the final emotional state.
Remark 3. The voice emotion identification equipment according to Remark 1, wherein the emotion recognition unit is configured to match the speech against a plurality of emotional state models to determine the preliminary emotional state of the speech, and the confidence judgment unit is configured to calculate the confidence of the preliminary emotional state using the inverse model corresponding to the preliminary emotional state and/or the emotional states of prior speech.
Remark 4. The voice emotion identification equipment according to Remark 3, wherein the confidence judgment unit is configured to calculate a first preliminary confidence of the preliminary emotional state using the inverse model corresponding to the preliminary emotional state, to determine the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range, and otherwise to calculate a second preliminary confidence of the preliminary emotional state using the emotional states of prior speech and to synthesize the first preliminary confidence and the second preliminary confidence to obtain the confidence.
Remark 5. The voice emotion identification equipment according to Remark 3, wherein the confidence judgment unit is configured to calculate a first preliminary confidence of the preliminary emotional state using the emotional states of prior speech, to determine the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range, and otherwise to calculate a second preliminary confidence of the preliminary emotional state using the inverse model corresponding to the preliminary emotional state and to synthesize the first preliminary confidence and the second preliminary confidence to obtain the confidence.
Remark 6. A speech emotion recognition method, comprising:
determining the preliminary emotional state of a speaker's speech; and
calculating the confidence of the preliminary emotional state and using the confidence to judge whether the preliminary emotional state is credible, and, if it is credible, determining the preliminary emotional state as the final emotional state and outputting the final emotional state.
Remark 7. The speech emotion recognition method according to Remark 6, further comprising:
when the preliminary emotional state is judged to be not credible, revising the preliminary emotional state to obtain the final emotional state of the speech, and outputting the final emotional state.
Remark 8. The speech emotion recognition method according to Remark 6, wherein the step of determining the preliminary emotional state of the speech comprises matching the speech against a plurality of emotional state models to determine the preliminary emotional state of the speech, and the step of calculating the confidence comprises calculating the confidence of the preliminary emotional state using the inverse model corresponding to the preliminary emotional state and/or the emotional states of prior speech.
Remark 9. The speech emotion recognition method according to Remark 8, wherein the step of calculating the confidence of the preliminary emotional state comprises: calculating a first preliminary confidence of the preliminary emotional state using the inverse model corresponding to the preliminary emotional state; determining the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range; and otherwise calculating a second preliminary confidence of the preliminary emotional state using the emotional states of prior speech and synthesizing the first preliminary confidence and the second preliminary confidence to obtain the confidence.
Remark 10. The speech emotion recognition method according to Remark 8, wherein the step of calculating the confidence of the preliminary emotional state comprises: calculating a first preliminary confidence of the preliminary emotional state using the emotional states of prior speech; determining the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range; and otherwise calculating a second preliminary confidence of the preliminary emotional state using the inverse model corresponding to the preliminary emotional state and synthesizing the first preliminary confidence and the second preliminary confidence to obtain the confidence.
Remark 11. The speech emotion recognition method according to Remark 9 or 10, wherein synthesizing the first preliminary confidence and the second preliminary confidence to obtain the confidence comprises taking a weighted sum of the first preliminary confidence and the second preliminary confidence as the confidence.

Claims (8)

1. A voice emotion identification equipment, comprising:
an emotion recognition unit configured to recognize the current emotional state of a speaker's speech as a preliminary emotional state;
a confidence judgment unit configured to calculate the confidence of the preliminary emotional state and to use the confidence to judge whether the preliminary emotional state is credible, and, if the preliminary emotional state is judged to be credible, to determine the preliminary emotional state as the final emotional state and output the final emotional state; and
an amending unit configured to, when the confidence judgment unit judges the preliminary emotional state to be not credible, revise the preliminary emotional state to obtain the final emotional state of the speech and output the final emotional state.
2. The voice emotion identification equipment as claimed in claim 1, wherein the emotion recognition unit is configured to match the speech against a plurality of emotional state models to determine the preliminary emotional state of the speech, and the confidence judgment unit is configured to calculate the confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state and/or the emotional states of prior speech.
3. The voice emotion identification equipment as claimed in claim 2, wherein the confidence judgment unit is configured to calculate a first preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state, to determine the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range, and otherwise to calculate a second preliminary confidence of the preliminary emotional state using the emotional states of prior speech and to synthesize the first preliminary confidence and the second preliminary confidence to obtain the confidence.
4. The voice emotion identification equipment as claimed in claim 2, wherein the confidence judgment unit is configured to calculate a first preliminary confidence of the preliminary emotional state using the emotional states of prior speech, to determine the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range, and otherwise to calculate a second preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state and to synthesize the first preliminary confidence and the second preliminary confidence to obtain the confidence.
5. A speech emotion recognition method, comprising:
recognizing the preliminary emotional state of a speaker's speech;
calculating the confidence of the preliminary emotional state and using the confidence to judge whether the preliminary emotional state is credible, and, if it is credible, determining the preliminary emotional state as the final emotional state and outputting the final emotional state; and
when the preliminary emotional state is judged to be not credible, revising the preliminary emotional state to obtain the final emotional state of the speech, and outputting the final emotional state.
6. The speech emotion recognition method as claimed in claim 5, wherein the step of recognizing the preliminary emotional state of the speech comprises matching the speech against a plurality of emotional state models to determine the preliminary emotional state of the speech, and the step of calculating the confidence comprises calculating the confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state and/or the emotional states of prior speech.
7. The speech emotion recognition method as claimed in claim 6, wherein the step of calculating the confidence of the preliminary emotional state comprises: calculating a first preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state; determining the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range; and otherwise calculating a second preliminary confidence of the preliminary emotional state using the emotional states of prior speech and synthesizing the first preliminary confidence and the second preliminary confidence to obtain the confidence.
8. The speech emotion recognition method as claimed in claim 6, wherein the step of calculating the confidence of the preliminary emotional state comprises: calculating a first preliminary confidence of the preliminary emotional state using the emotional states of prior speech; determining the first preliminary confidence as the confidence when the first preliminary confidence is within a preset range; and otherwise calculating a second preliminary confidence of the preliminary emotional state using the emotional state inverse model corresponding to the preliminary emotional state and synthesizing the first preliminary confidence and the second preliminary confidence to obtain the confidence.
CN2010101047793A 2010-01-29 2010-01-29 Voice emotion identification equipment and method Active CN102142253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101047793A CN102142253B (en) 2010-01-29 2010-01-29 Voice emotion identification equipment and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101047793A CN102142253B (en) 2010-01-29 2010-01-29 Voice emotion identification equipment and method

Publications (2)

Publication Number Publication Date
CN102142253A CN102142253A (en) 2011-08-03
CN102142253B true CN102142253B (en) 2013-05-29

Family

ID=44409711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101047793A Active CN102142253B (en) 2010-01-29 2010-01-29 Voice emotion identification equipment and method

Country Status (1)

Country Link
CN (1) CN102142253B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258532B (en) * 2012-11-28 2015-10-28 河海大学常州校区 A kind of Chinese speech sensibility recognition methods based on fuzzy support vector machine
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN104008132B (en) * 2014-05-04 2018-09-25 深圳市北科瑞声科技股份有限公司 Voice map searching method and system
CN105744090A (en) * 2014-12-09 2016-07-06 阿里巴巴集团控股有限公司 Voice information processing method and device
CN105118518B (en) * 2015-07-15 2019-05-10 百度在线网络技术(北京)有限公司 A kind of semantic analysis and device of sound
CN106683688B (en) * 2015-11-05 2020-10-13 ***通信集团公司 Emotion detection method and device
CN106096717B (en) * 2016-06-03 2018-08-14 北京光年无限科技有限公司 Information processing method towards intelligent robot and system
JP6639444B2 (en) 2017-06-07 2020-02-05 本田技研工業株式会社 Information providing apparatus and information providing method
CN107507629B (en) * 2017-08-16 2020-08-25 重庆科技学院 Thermal touch music perception system and control method thereof
CN107545905B (en) * 2017-08-21 2021-01-05 北京合光人工智能机器人技术有限公司 Emotion recognition method based on sound characteristics
CN108346436B (en) 2017-08-22 2020-06-23 腾讯科技(深圳)有限公司 Voice emotion detection method and device, computer equipment and storage medium
CN109727599A (en) * 2017-10-31 2019-05-07 苏州傲儒塑胶有限公司 The children amusement facility and control method of interactive voice based on internet communication
CN110085211B (en) * 2018-01-26 2021-06-29 上海智臻智能网络科技股份有限公司 Voice recognition interaction method and device, computer equipment and storage medium
CN108597541B (en) * 2018-04-28 2020-10-02 南京师范大学 Speech emotion recognition method and system for enhancing anger and happiness recognition
CN109976513B (en) * 2019-02-20 2020-03-03 方科峰 System interface design method
JP6816247B2 (en) * 2019-12-24 2021-01-20 本田技研工業株式会社 Information provider
CN113611326B (en) * 2021-08-26 2023-05-12 中国地质大学(武汉) Real-time voice emotion recognition method and device
CN113611286B (en) * 2021-10-08 2022-01-18 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN116528438B (en) * 2023-04-28 2023-10-10 广州力铭光电科技有限公司 Intelligent dimming method and device for lamp
CN117079673B (en) * 2023-10-17 2023-12-19 青岛铭威软创信息技术有限公司 Intelligent emotion recognition method based on multi-mode artificial intelligence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101118745A (en) * 2006-08-04 2008-02-06 中国科学院声学研究所 Confidence degree quick acquiring method in speech identification system
EP2063416A1 (en) * 2006-09-13 2009-05-27 Nippon Telegraph and Telephone Corporation Feeling detection method, feeling detection device, feeling detection program containing the method, and recording medium containing the program
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101447183A (en) * 2007-11-28 2009-06-03 中国科学院声学研究所 Processing method of high-performance confidence level applied to speech recognition system
CN101452701A (en) * 2007-12-05 2009-06-10 株式会社东芝 Confidence degree estimation method and device based on inverse model

Also Published As

Publication number Publication date
CN102142253A (en) 2011-08-03

Similar Documents

Publication Publication Date Title
CN102142253B (en) Voice emotion identification equipment and method
CN106228977B (en) Multi-mode fusion song emotion recognition method based on deep learning
Singh et al. A multimodal hierarchical approach to speech emotion recognition from audio and text
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
El Ayadi et al. Survey on speech emotion recognition: Features, classification schemes, and databases
CN101346758B (en) Emotion recognizer
Batliner et al. The automatic recognition of emotions in speech
CN102194454B (en) Equipment and method for detecting key word in continuous speech
CN108564942A (en) One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN108428446A (en) Audio recognition method and device
CN102800314A (en) English sentence recognizing and evaluating system with feedback guidance and method of system
CN105654940B (en) Speech synthesis method and device
Gajšek et al. Speaker state recognition using an HMM-based feature extraction method
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
Yücesoy et al. A new approach with score-level fusion for the classification of a speaker age and gender
Rajoo et al. Influences of languages in speech emotion recognition: A comparative study using malay, english and mandarin languages
Shahin et al. Talking condition recognition in stressful and emotional talking environments based on CSPHMM2s
López-Cózar et al. Enhancement of emotion detection in spoken dialogue systems by combining several information sources
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Scherer et al. Real-time emotion recognition from speech using echo state networks
CN106448660A (en) Natural language fuzzy boundary determining method with introduction of big data analysis
Farooq et al. Mispronunciation detection in articulation points of Arabic letters using machine learning
Patil et al. Emotion detection from speech using Mfcc & GMM
US11887583B1 (en) Updating models with trained model update objects

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant