CN102737629A - Embedded type speech emotion recognition method and device - Google Patents


Info

Publication number
CN102737629A
CN102737629A, CN2011103586726A, CN201110358672A
Authority
CN
China
Prior art keywords
speech
speaker
emotion
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103586726A
Other languages
Chinese (zh)
Other versions
CN102737629B (en)
Inventor
黄永明
章国宝
董飞
祖晖
刘海彬
倪道宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201110358672.6A
Publication of CN102737629A
Application granted
Publication of CN102737629B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an embedded speech emotion recognition method and device. The method comprises feature extraction, emotion model training, Gaussian mixture modeling and emotion recognition. The parameters of the speech emotion recognition model are adjusted adaptively according to the recognition result of a speaker-identification module, so that the speaker-independent speech emotion recognition problem is converted into a speaker-dependent one. The device comprises a central processor, a power supply, a clock generator, a NAND flash memory, a NOR flash memory, an audio codec chip, a microphone, a loudspeaker, a keyboard, an LCD (liquid crystal display) and a USB (Universal Serial Bus) mass-storage device. Because a speaker recognition model is added to the speech emotion recognition, the sharp drop in recognition accuracy under speaker-independent conditions is avoided, and the device additionally gains an identity recognition function.

Description

Embedded speech emotion recognition method and device
Technical field
The present invention relates to speech emotion recognition technology, and in particular to an embedded speech emotion recognition method and device, belonging to the technical field of speech emotion recognition.
Background technology
Automatic speech emotion recognition is still a relatively peripheral technology in the IT industry. As the medium of interpersonal communication, speech carries rich emotional information. Emotion plays an important role in human perception, decision-making and everyday communication. With the development of science and technology, human-machine communication is becoming more and more important in daily life, and natural, harmonious voice-based interaction has long been a goal. Speech emotion recognition is an important part of harmonious human-machine interaction: it can change the rigid machine-interaction services of the past and improve the friendliness and accuracy of human-machine interaction. As a complement to speech recognition, it strengthens the emotional side of human-machine interaction and has broad application prospects in distance education, lie-detection assistance, automated telephone service centers, clinical medicine, intelligent toys, smart phones and similar areas.
Embedded speech emotion recognition refers specifically to emotion recognition running on stand-alone devices other than computers, in particular on voice toys, intelligent pets and other embedded products. Traditional voice products require the user to issue voice commands in a near-neutral tone; speech with strong emotional coloring actually degrades the recognition results. Such harsh, unfriendly requirements can dampen the user's enthusiasm, which is a major defect of current voice products. Integrating emotion into voice products gives the user more freedom and improves the user experience, and this is a general direction for the development of intelligent interactive voice products. Taking an intelligent interactive voice toy as an example: if the toy can recognize the emotion in the user's speech and respond differently to different emotions, it can to some extent overcome the impersonal character of electronic toys and make them more engaging. In the same way, embedded speech emotion recognition enables better communication and interaction between people and machines. Such a demand clearly exists, yet no embedded product with an emotion recognition function has appeared on the domestic market, which is regrettable.
Summary of the invention
The problem solved by the present invention is as follows. To overcome the low recognition rate of traditional speech emotion recognition for unspecified speakers, and to address the lack of speech emotion recognition devices with good human-computer interaction on the market, the present invention provides an embedded speech emotion recognition method and a corresponding device. The system can recognize the speaker's emotions, such as calm, happiness, anger, sadness and fear, on a small embedded device, and take different actions according to the emotion carried by the speaker's voice.
The technical solution of the present invention is as follows:
1. An embedded speech emotion recognition method comprises the following steps:
Step 1: receive an emotional speech segment to be recognized;
Step 2: digitize the emotional speech segment to be recognized to provide a digital speech signal;
Step 3: pre-process the digital emotional speech signal X(n) to be recognized, including pre-emphasis, framing, windowing and endpoint detection:
Step 3.1: apply pre-emphasis to the digital emotional speech signal X(n) to be recognized as follows:
$\overline{X(n)} = X(n) - \alpha X(n-1) \qquad (1)$
where $\alpha = 0.9375$ and n is the sample index of the emotional digital speech to be recognized;
Step 3.2: divide the signal into frames by overlapping segmentation; the overlap between one frame and the next is called the frame shift. Here the frame shift is 7 ms, i.e. 80 samples at the 11.025 kHz sampling rate, and each frame is 23 ms long, i.e. 256 samples;
Step 3.3: apply a Hamming window to each frame of the speech signal; the window function is
$w(n') = 0.54 - 0.46\cos\!\left(\frac{2\pi n'}{N-1}\right), \quad 0 \le n' \le N-1 \qquad (2)$
where n' is the sample index within a frame and N is the number of samples per frame, here N = 256;
Step 3.4: complete endpoint detection with the known energy/zero-crossing-rate double-threshold method, based on the principle that both the energy and the zero-crossing rate of background noise are lower than the short-time energy and short-time zero-crossing rate of speech: a first-level decision is made on the short-time energy, a second-level decision is then made on the short-time zero-crossing rate, the upper and lower short-time energy limits and the zero-crossing-rate threshold are computed, every frame is judged against them, and the endpoint-detected digital speech frames X(n') are obtained (a code sketch of this pre-processing is given after step 7 below);
Step 4: extract the speech feature parameters from the pre-processed digital speech; the feature parameters are 12-dimensional Mel-frequency cepstral coefficients (MFCC);
Step 5: input the speech feature parameters extracted in step 4 into each trained speaker recognition submodel, determine which speaker recognition submodel best matches the speech segment, and select the speaker corresponding to the matching submodel;
Step 6: according to the speaker decision of step 5, select that speaker's speech emotion recognition model from the trained library of per-speaker speech emotion recognition models;
Step 7: input the speech feature parameters extracted in step 4 into the speech emotion recognition submodels selected in step 6; the speech emotion recognition model comprises five trained emotion submodels for happiness, anger, sadness, fear and calm, and the emotion that best matches the speech segment is determined from the outputs of the speech emotion recognition model.
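The pre-processing of steps 3.1-3.4 can be summarized in a short sketch. The following Python/NumPy code is only an illustrative reading of those steps, not the patented implementation; the function names and the endpoint-detection thresholds (e_hi, e_lo, z_thr) are assumptions introduced here.

```python
import numpy as np

def pre_emphasis(x, alpha=0.9375):
    """Formula (1): X'(n) = X(n) - alpha * X(n-1)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=256, frame_shift=80):
    """Overlapping frames of 256 samples (about 23 ms) with an 80-sample
    (about 7 ms) shift at 11.025 kHz, each multiplied by a Hamming window.
    Assumes len(x) >= frame_len."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def endpoint_detect(frames, e_hi=0.25, e_lo=0.05, z_thr=0.3):
    """Frame-wise simplification of the energy / zero-crossing double-threshold
    decision of step 3.4: high-energy frames are speech, and lower-energy frames
    are also kept when their zero-crossing rate is high. Threshold values are
    placeholders, not values taken from the patent."""
    energy = (frames ** 2).sum(axis=1)
    energy = energy / (energy.max() + 1e-12)          # normalize to [0, 1]
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    keep = (energy > e_hi) | ((energy > e_lo) & (zcr > z_thr))
    return frames[keep]
```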
An apparatus for running the embedded speech emotion recognition method mainly comprises: a central processor, a power supply, a clock generator, a NAND flash memory, a NOR flash memory, an audio codec chip, a microphone, a loudspeaker, a keyboard, an LCD and a USB mass-storage device. The NOR flash stores the operating system, file system and boot loader of the apparatus; the central processor is a 32-bit embedded microprocessor based on the ARM architecture; the NAND flash stores the software implementation of the recognition method, including the speech pre-processing, feature extraction, the emotion model training module and the Gaussian-mixture-model emotion recognition models; and the USB mass-storage device stores resource files such as music and pictures.
The beneficial effects of the present invention include:
(1) The parameters of the speech emotion recognition model are adjusted adaptively according to the recognition result of the speaker module, converting the speaker-independent speech emotion recognition problem into a speaker-dependent one and solving the low recognition rate of speaker-independent speech emotion recognition in practical applications;
(2) Because a speaker recognition model is added to the speech emotion recognition, the device can authenticate the user while extracting the emotional information from the user's speech, which gives it greater engineering value;
(3) The speaker recognition model and the speech emotion recognition models use the same speech feature parameters (12-dimensional MFCC), and both classifiers use Gaussian mixture models (GMM), trained and applied separately; the benefit of point (2) is therefore obtained without adding much computational complexity;
(4) The speech emotion recognition device takes speech input through a microphone, switches modes through a keyboard, and outputs the emotion through the LCD and loudspeaker, so the human-machine interaction is friendly;
(5) The speech emotion recognition device has two training modes, switched with the keyboard: bulk training and instant training. If the user selects the instant training mode, the speaker recognition model and the parameters of the speech emotion recognition models are updated immediately according to the recognition result for the user, imitating the human learning process.
Description of drawings
Fig. 1 is a structural block diagram of the apparatus of the present invention.
Fig. 2 is a flow diagram of the feature extraction of the present invention.
Fig. 3 is a block diagram of the working principle of the present invention.
Fig. 4 is a flow diagram of the emotion model training and recognition process of the present invention.
Fig. 5 is a flow diagram of the interaction between the speech emotion recognition device and the user.
Specific embodiments
Embodiment 1
An embedded speech emotion recognition method comprises the following steps:
Step 1: receive an emotional speech segment to be recognized;
Step 2: digitize the emotional speech segment to be recognized to provide a digital speech signal;
Step 3: pre-process the digital emotional speech signal X(n) to be recognized, including pre-emphasis, framing, windowing and endpoint detection:
Step 3.1: apply pre-emphasis to the digital emotional speech signal X(n) to be recognized as follows:
$\overline{X(n)} = X(n) - \alpha X(n-1) \qquad (1)$
where $\alpha = 0.9375$ and n is the sample index of the emotional digital speech to be recognized;
Step 3.2: divide the signal into frames by overlapping segmentation; the overlap between one frame and the next is called the frame shift. Here the frame shift is 7 ms, i.e. 80 samples at the 11.025 kHz sampling rate, and each frame is 23 ms long, i.e. 256 samples;
Step 3.3: apply a Hamming window to each frame of the speech signal; the window function is
$w(n') = 0.54 - 0.46\cos\!\left(\frac{2\pi n'}{N-1}\right), \quad 0 \le n' \le N-1 \qquad (2)$
where n' is the sample index within a frame and N is the number of samples per frame, here N = 256;
Step 3.4: complete endpoint detection with the known energy/zero-crossing-rate double-threshold method, based on the principle that both the energy and the zero-crossing rate of background noise are lower than the short-time energy and short-time zero-crossing rate of speech: a first-level decision is made on the short-time energy, a second-level decision is then made on the short-time zero-crossing rate, the upper and lower short-time energy limits and the zero-crossing-rate threshold are computed, every frame is judged against them, and the endpoint-detected digital speech frames X(n') are obtained;
Step 4: extract the speech feature parameters from the pre-processed digital speech; the feature parameters are 12-dimensional Mel-frequency cepstral coefficients (MFCC);
Step 5: input the speech feature parameters extracted in step 4 into each trained speaker recognition submodel, determine which speaker recognition submodel best matches the speech segment, and select the speaker corresponding to the matching submodel;
Step 6: according to the speaker decision of step 5, select that speaker's speech emotion recognition model from the trained library of per-speaker speech emotion recognition models;
Step 7: input the speech feature parameters extracted in step 4 into the speech emotion recognition submodels selected in step 6; the speech emotion recognition model comprises five trained emotion submodels for happiness, anger, sadness, fear and calm, and the emotion that best matches the speech segment is determined from the outputs of the speech emotion recognition model (a code sketch of steps 5-7 is given below).
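The decision chain of steps 5-7 can be illustrated with a minimal sketch. It assumes the trained models are held in two containers, `speaker_gmms` (speaker name -> GMM) and `emotion_banks` (speaker name -> emotion name -> GMM), and that each model exposes a `score()` method returning the average log-likelihood of the feature frames, as scikit-learn's GaussianMixture does; these names and the container layout are assumptions, not part of the patent.

```python
def recognize(features, speaker_gmms, emotion_banks):
    """features: (T, 12) matrix of MFCC frames for one utterance."""
    # Step 5: the speaker whose GMM gives the highest log-likelihood.
    speaker = max(speaker_gmms, key=lambda s: speaker_gmms[s].score(features))
    # Step 6: select that speaker's emotion model bank from the library.
    bank = emotion_banks[speaker]
    # Step 7: the best-matching emotion among happy, angry, sad, fear, calm.
    emotion = max(bank, key=lambda e: bank[e].score(features))
    return speaker, emotion
```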
In the present embodiment,
In step 4, the speech feature parameters are extracted from the pre-processed digital speech by the following method (a code sketch follows step 4.3):
Step 4.1: append zeros to the time-domain signal X(n') so that the padded sequence has length N', where N' is an integer power of 2, and obtain the linear spectrum X(k) through the discrete Fourier transform (DFT):
$X(k) = \sum_{n'=0}^{N'-1} x(n') \exp(-j 2\pi n' k / N'), \quad 0 \le n', k \le N'-1 \qquad (3)$
Step 4.2: pass the linear spectrum X(k) through the Mel-frequency filter bank $H_m(k)$ to obtain the Mel spectrum, then take the logarithm of the energy to obtain the log spectrum S(m); the overall transfer from the linear spectrum X(k) to the log spectrum S(m) is:
$S(m) = \ln\!\left(\sum_{k=0}^{N'-1} |X(k)|^2 H_m(k)\right), \quad 0 \le m \le M \qquad (4)$
where the filter bank consists of M band-pass filters, m = 1, 2, ..., M, and the transfer function of each band-pass filter is:
$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) < k \le f(m+1) \\ 0 & k > f(m+1) \end{cases} \quad (0 \le m < M) \qquad (5)$
Step 4.3: apply the discrete cosine transform to the log spectrum S(m) to move to the cepstral domain and obtain the Mel-frequency cepstral coefficients c(n'):
$c(n') = \sum_{m=1}^{M-1} S(m) \cos\!\left(\frac{\pi n'(m+1/2)}{M}\right), \quad 0 \le m < M \qquad (6)$
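The following NumPy sketch walks through formulas (3)-(6) for a single frame. It is illustrative only: the number of Mel filters (24) is an assumed value the patent does not state, the filter placement is the common textbook construction on the Mel scale, and the DCT uses the usual (m - 1/2) phase rather than the (m + 1/2) printed in formula (6).

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular band-pass filters H_m(k), cf. formula (5), spaced on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mfcc(frame, sr=11025, n_filters=24, n_ceps=12):
    """12-dimensional MFCC of one windowed frame (steps 4.1-4.3)."""
    n_fft = int(2 ** np.ceil(np.log2(len(frame))))        # zero-pad to a power of 2
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # |X(k)|^2, formula (3)
    S = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-12)   # formula (4)
    m = np.arange(1, n_filters + 1)
    return np.array([np.sum(S * np.cos(np.pi * n * (m - 0.5) / n_filters))
                     for n in range(1, n_ceps + 1)])      # DCT, cf. formula (6)
```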
The speaker recognition model training method comprises the following steps:
Step 5.1: receive each speaker's training speech segments;
Step 5.2: digitize the speaker training speech segments to provide a digital speech signal X(n₁), where n₁ is the sample index of the speaker training digital speech;
Step 5.3: apply the pre-processing described in step 3 (pre-emphasis, framing, windowing and endpoint detection) to the digital speech signal X(n₁) to obtain the speaker training digital speech X(n₁');
Step 5.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech X(n₁');
Step 5.5: train the speaker recognition model with the extracted speech feature parameters; the concrete steps are as follows:
Step 5.5.1: set the order of the Gaussian mixture model of the speaker recognition model to 4;
Step 5.5.2: initialize the speaker recognition model with the K-means method (kmeans) to obtain the initial parameters of each Gaussian component: the mean vector $\mu_k$, the covariance matrix $\Sigma_k$ and the mixture weight $c_k$, representing the initial submodel parameters corresponding to the k-th speaker;
Step 5.5.3: let the t-th feature parameter of the c-th speaker's training speech be $x_t^c$, with $\{x_t^c \mid t = 1, \ldots, T_c;\ c = 1, \ldots, C\}$, where $T_c$ is the number of frames of the c-th speaker's training speech and C is the total number of training samples. Re-estimate the initial parameters of the Gaussian components according to the following formulas, letting $\bar{k} = 1, \ldots, K$, where $\bar{k}$ denotes the corresponding speaker, to obtain the parameters of each speaker recognition submodel:
$\gamma_t^c(\bar{k}) = \dfrac{N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})}{\sum_{\bar{k}=1}^{K} N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})} \qquad (7)$
$\bar{c}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})}{\sum_{\bar{k}=1}^{K} \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (8)$
$\bar{\mu}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\, x_t^c}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (9)$
$\bar{\Sigma}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\,(x_t^c - \mu_{\bar{k}})(x_t^c - \mu_{\bar{k}})^T}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (10)$
Step 5.5.4: the speaker recognition model is a Gaussian mixture model. Substituting the speaker recognition submodel parameters obtained above into the formulas below forms each trained speaker recognition submodel, and the set of these trained submodels is the final speaker recognition model (a training sketch follows):
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 4 single Gaussian distributions, specifically:
$p(x) = \sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}}\, b_{\bar{k}}(x) \qquad (11)$
where
$b_{\bar{k}}(x) = N(x, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}_{\bar{k}}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \bar{\mu}_{\bar{k}})^T \bar{\Sigma}_{\bar{k}}^{-1} (x - \bar{\mu}_{\bar{k}})\right) \qquad (12)$
where D is the feature dimension (here D = 12) and $b_{\bar{k}}(x)$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}_{\bar{k}}$ and covariance matrix $\bar{\Sigma}_{\bar{k}}$; the weighting coefficients $\bar{c}_{\bar{k}}$ of the Gaussian mixture satisfy:
$\sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}} = 1 \qquad (13)$
The speaker recognition Gaussian mixture model parameter set $\lambda_1$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_1 = \{\bar{c}_{\bar{k}}, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}};\ \bar{k} = 1, \ldots, K\} \qquad (14)$
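Step 5.5 can be approximated with an off-the-shelf GMM implementation. The sketch below assumes scikit-learn is available and fits one 4-component full-covariance GMM per speaker, which is one natural reading of the translated text (the alternative reading, in which the K components of a single mixture act as per-speaker submodels, would fit one joint GMM on the pooled data instead). Note also that scikit-learn's EM weights the responsibilities by the mixture weights, which differs slightly from formula (7) as printed; the data layout and names below are assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(training_data, n_components=4):
    """training_data: dict speaker_name -> (T, 12) array of MFCC frames pooled
    from that speaker's training utterances (steps 5.1-5.4). Returns a dict of
    fitted GMMs; each fitted triple (weights_, means_, covariances_) plays the
    role of the parameter set lambda_1 of formula (14)."""
    models = {}
    for name, feats in training_data.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                              init_params="kmeans", max_iter=100)  # step 5.5.2 + EM (7)-(10)
        gmm.fit(feats)
        models[name] = gmm
    return models
```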
The method for training the speech emotion recognition model library comprises the following steps:
Step 7.1: receive one speaker's emotion training speech segments;
Step 7.2: digitize the emotion training speech segments to provide a digital speech signal X(n₂), where n₂ is the sample index of the emotion training digital speech;
Step 7.3: apply the pre-processing described in step 3 to the emotion training digital speech signal X(n₂) to obtain the emotion training digital speech X(n₂');
Step 7.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 7.5: train the speech emotion models with the extracted speech emotion feature parameters; the concrete steps are as follows:
Step 7.5.1: set the order of the Gaussian mixture model of the speech emotion recognition model to 10;
Step 7.5.2: initialize the mean vector $\mu'_{k'}$, the covariance matrix $\Sigma'_{k'}$ and the mixture weight $c'_{k'}$ of each Gaussian component of the speech emotion recognition model with the K-means method (kmeans);
Step 7.5.3: using the emotion training speech pre-processed as described above, let the t'-th feature parameter of the c'-th emotion training utterance be $x_{t'}^{c'}$, with $\{x_{t'}^{c'} \mid t' = 1, \ldots, T'_{c'};\ c' = 1, \ldots, C'\}$, where $T'_{c'}$ is the number of frames of the c'-th emotion training utterance and C' is the total number of emotion training samples. Re-estimate the Gaussian mixture model parameters according to the following formulas, letting k' = 1, ..., K', to form the trained speech emotion recognition model corresponding to this speaker, and create a folder corresponding to this speaker; k' denotes the emotion corresponding to the emotional speech, i.e. this speaker's emotion recognition model comprises K' emotion submodels:
$\gamma_{t'}^{c'}(k') = \dfrac{N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})}{\sum_{k'=1}^{K'} N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})} \qquad (15)$
$\bar{c}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')}{\sum_{k'=1}^{K'} \sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (16)$
$\bar{\mu}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\, x_{t'}^{c'}}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (17)$
$\bar{\Sigma}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\,(x_{t'}^{c'} - \mu'_{k'})(x_{t'}^{c'} - \mu'_{k'})^T}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (18)$
Step 7.5.4: the speech emotion recognition model corresponding to a speaker is a Gaussian mixture model. Substituting the speech emotion recognition model parameters obtained above for that speaker into the formulas below forms the trained speech emotion recognition model corresponding to this speaker:
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 10 single Gaussian distributions, specifically:
$p'(x') = \sum_{k'=1}^{10} \bar{c}'_{k'}\, b'_{k'}(x') \qquad (19)$
where
$b'_{k'}(x') = N(x', \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}'_{k'}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x' - \bar{\mu}'_{k'})^T \bar{\Sigma}'^{-1}_{k'} (x' - \bar{\mu}'_{k'})\right) \qquad (20)$
where D is the feature dimension (here D = 12) and $b'_{k'}(x')$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}'_{k'}$ and covariance matrix $\bar{\Sigma}'_{k'}$; the weighting coefficients $\bar{c}'_{k'}$ of the Gaussian mixture satisfy:
$\sum_{k'=1}^{10} \bar{c}'_{k'} = 1 \qquad (21)$
The speech emotion recognition Gaussian mixture model parameter set $\lambda_2$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_2 = \{\bar{c}'_{k'}, \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'};\ k' = 1, \ldots, K'\} \qquad (22)$
Step 7.6: receive the emotion training speech segments of the other speakers and train each speaker's emotion training speech according to steps 7.2 to 7.5 above, obtaining the speech emotion recognition model corresponding to each speaker, including each speaker's emotion submodels; the set of per-speaker speech emotion recognition models obtained by this training constitutes the speech emotion recognition model library (a code sketch follows).
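The per-speaker emotion model library of steps 7.1-7.6 can be sketched in the same way, under the reading that each of the five emotions (happy, angry, sad, fear, calm) gets its own 10-component GMM for every speaker. The data layout, names and the use of scikit-learn are assumptions introduced here.

```python
from sklearn.mixture import GaussianMixture

EMOTIONS = ("happy", "angry", "sad", "fear", "calm")

def train_emotion_bank(emotion_data, n_components=10):
    """emotion_data: dict emotion -> (T, 12) MFCC frames from one speaker's
    emotion training utterances. Returns dict emotion -> fitted GMM, i.e. the
    parameter sets lambda_2 of formula (22) for that speaker."""
    bank = {}
    for emo in EMOTIONS:
        gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                              init_params="kmeans", max_iter=100)  # EM, formulas (15)-(18)
        gmm.fit(emotion_data[emo])
        bank[emo] = gmm
    return bank

def train_emotion_library(per_speaker_data):
    """Step 7.6: repeat for every speaker to build the model library."""
    return {spk: train_emotion_bank(data) for spk, data in per_speaker_data.items()}
```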
Embodiment 2
An apparatus for running the embedded speech emotion recognition method mainly comprises: a central processor 101, a power supply 102, a clock generator 103, a NAND flash memory 104, a NOR flash memory 105, an audio codec chip 106, a microphone 107, a loudspeaker 108, a keyboard 109, an LCD 110 and a USB mass-storage device 111. It is characterized in that the NOR flash memory 105 stores the operating system, file system and boot loader of the apparatus; the central processor 101 is a 32-bit embedded microprocessor based on the ARM architecture; the NAND flash memory 104 stores the software implementation of the recognition method, including the speech pre-processing, feature extraction, the emotion model training module and the Gaussian-mixture-model emotion recognition models; and the USB mass-storage device 111 stores resource files such as music and pictures.
In the present embodiment,
The NAND flash memory 104 and the NOR flash memory 105 are connected to the central processor 101 through the external bus interface; the clock generator 103 is connected to the central processor 101 and provides the clock frequency; the audio codec chip 106 is connected to the central processor 101 through the audio interface; the LCD 110 is connected to the central processor 101 through the LCD control interface; the keyboard 109 is connected to the central processor 101 through an input interface; the USB mass-storage device 111 is connected to the central processor 101 through USB; and the microphone 107 and the loudspeaker 108 are connected to the audio codec chip 106 through their respective interfaces.
The device has two working modes, a training mode and a recognition mode, selected with the keyboard 109; the overall procedure is as follows (a workflow sketch is given after step 14.5):
Step 1: receive a key press from the keyboard 109 and judge whether the input selects the recognition mode; if it is the recognition mode, go to step 2; if it is the training mode, go to step 13;
Step 2: receive a speech segment through the microphone 107;
Step 2: digitize the speech segment with the audio codec chip 106 to provide a digital speech signal;
Step 3: pre-process the digital speech signal, including pre-emphasis, framing, windowing and endpoint detection;
Step 4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 5: input the extracted speech feature parameters into the trained speaker recognition model and determine which speaker best matches the speech segment;
Step 6: according to the decision result, determine which emotion best matches the speech segment;
Step 7: if the recognition result is calm, first display the picture and the Chinese characters for "calm" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 8: if the recognition result is happy, first display the picture and the Chinese characters for "happy" representing the result on the LCD 110, then play the corresponding audio file stored in the USB mass-storage device 111 through the loudspeaker 108;
Step 9: if the recognition result is sad, first display the picture and the Chinese characters for "sad" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 10: if the recognition result is angry, first display the picture and the Chinese characters for "angry" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 11: if the recognition result is fear, first display the picture and the Chinese characters for "fear" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 12: receive a key press from the keyboard 109 and judge which training mode is selected; for the bulk training mode go to step 13, for the instant training mode go to step 14;
Step 13: the device enters the bulk training procedure:
Step 13.1: receive speech segment inputs and judge whether the preset number of bulk-training segments has been reached; if so, go to step 13.2, otherwise return to step 13.1;
Step 13.2: pre-process the input speech;
Step 13.3: extract the speech feature parameters from the pre-processed speech;
Step 13.4: train the speaker recognition model;
Step 13.5: train the speech emotion recognition model library;
Step 14: the device enters the instant training procedure:
Step 14.1: receive one speech segment input;
Step 14.2: pre-process the input speech segment;
Step 14.3: extract the speech feature parameters from the pre-processed speech segment;
Step 14.4: train the speaker recognition model;
Step 14.5: train the speech emotion recognition model library.
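The mode dispatch of this workflow can be summarized in a short sketch. All hardware interactions (read_key, record, lcd_show, play_audio) and the model callbacks are placeholders passed in by the caller, standing in for the device drivers and the training and recognition routines described above; none of these names come from the patent.

```python
def main_loop(read_key, record, lcd_show, play_audio,
              recognize_fn, bulk_train_fn, instant_train_fn):
    """recognize_fn(pcm) -> (speaker, emotion); the two training callbacks
    update the stored speaker and emotion models in place."""
    responses = {"calm": "calm.wav", "happy": "happy.wav", "sad": "sad.wav",
                 "angry": "angry.wav", "fear": "fear.wav"}   # files on USB storage
    while True:
        mode = read_key()                              # step 1 / step 12
        if mode == "recognize":
            speaker, emotion = recognize_fn(record())  # steps 2-6
            lcd_show(emotion)                          # steps 7-11: picture + text on the LCD
            play_audio(responses[emotion])             # matching audio response
        elif mode == "bulk_train":
            bulk_train_fn()                            # step 13: collect the preset number of clips
        elif mode == "instant_train":
            instant_train_fn()                         # step 14: update from a single clip
```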
The above embodiment is only one effective way of realizing the present invention; ordinary variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall all fall within the protection scope of the present invention.

Claims (7)

1. An embedded speech emotion recognition method, characterized by comprising the following steps:
Step 1: receive an emotional speech segment to be recognized;
Step 2: digitize the emotional speech segment to be recognized to provide a digital speech signal;
Step 3: pre-process the digital emotional speech signal X(n) to be recognized, including pre-emphasis, framing, windowing and endpoint detection:
Step 3.1: apply pre-emphasis to the digital emotional speech signal X(n) to be recognized as follows:
$\overline{X(n)} = X(n) - \alpha X(n-1) \qquad (1)$
where $\alpha = 0.9375$ and n is the sample index of the emotional digital speech to be recognized;
Step 3.2: divide the signal into frames by overlapping segmentation; the overlap between one frame and the next is called the frame shift. Here the frame shift is 7 ms, i.e. 80 samples at the 11.025 kHz sampling rate, and each frame is 23 ms long, i.e. 256 samples;
Step 3.3: apply a Hamming window to each frame of the speech signal; the window function is
$w(n') = 0.54 - 0.46\cos\!\left(\frac{2\pi n'}{N-1}\right), \quad 0 \le n' \le N-1 \qquad (2)$
where n' is the sample index within a frame and N is the number of samples per frame, here N = 256;
Step 3.4: complete endpoint detection with the known energy/zero-crossing-rate double-threshold method, based on the principle that both the energy and the zero-crossing rate of background noise are lower than the short-time energy and short-time zero-crossing rate of speech: a first-level decision is made on the short-time energy, a second-level decision is then made on the short-time zero-crossing rate, the upper and lower short-time energy limits and the zero-crossing-rate threshold are computed, every frame is judged against them, and the endpoint-detected digital speech frames X(n') are obtained;
Step 4: extract the speech feature parameters from the pre-processed digital speech; the feature parameters are 12-dimensional Mel-frequency cepstral coefficients (MFCC);
Step 5: input the speech feature parameters extracted in step 4 into each trained speaker recognition submodel, determine which speaker recognition submodel best matches the speech segment, and select the speaker corresponding to the matching submodel;
Step 6: according to the speaker decision of step 5, select that speaker's speech emotion recognition model from the trained library of per-speaker speech emotion recognition models;
Step 7: input the speech feature parameters extracted in step 4 into the speech emotion recognition submodels selected in step 6; the speech emotion recognition model comprises five trained emotion submodels for happiness, anger, sadness, fear and calm, and the emotion that best matches the speech segment is determined from the outputs of the speech emotion recognition model.
2. The embedded speech emotion recognition method according to claim 1, characterized in that in step 4 the speech feature parameters are extracted from the pre-processed digital speech by the following method:
Step 4.1: append zeros to the time-domain signal X(n') so that the padded sequence has length N', where N' is an integer power of 2, and obtain the linear spectrum X(k) through the discrete Fourier transform (DFT):
$X(k) = \sum_{n'=0}^{N'-1} x(n') \exp(-j 2\pi n' k / N'), \quad 0 \le n', k \le N'-1 \qquad (3)$
Step 4.2: pass the linear spectrum X(k) through the Mel-frequency filter bank $H_m(k)$ to obtain the Mel spectrum, then take the logarithm of the energy to obtain the log spectrum S(m); the overall transfer from the linear spectrum X(k) to the log spectrum S(m) is:
$S(m) = \ln\!\left(\sum_{k=0}^{N'-1} |X(k)|^2 H_m(k)\right), \quad 0 \le m \le M \qquad (4)$
where the filter bank consists of M band-pass filters, m = 1, 2, ..., M, and the transfer function of each band-pass filter is:
$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) < k \le f(m+1) \\ 0 & k > f(m+1) \end{cases} \quad (0 \le m < M) \qquad (5)$
Step 4.3: apply the discrete cosine transform to the log spectrum S(m) to move to the cepstral domain and obtain the Mel-frequency cepstral coefficients c(n'):
$c(n') = \sum_{m=1}^{M-1} S(m) \cos\!\left(\frac{\pi n'(m+1/2)}{M}\right), \quad 0 \le m < M \qquad (6)$
3. The embedded speech emotion recognition method according to claim 1, characterized in that the speaker recognition model training method comprises the following steps:
Step 5.1: receive each speaker's training speech segments;
Step 5.2: digitize the speaker training speech segments to provide a digital speech signal X(n₁), where n₁ is the sample index of the speaker training digital speech;
Step 5.3: apply the pre-processing described in step 3 (pre-emphasis, framing, windowing and endpoint detection) to the digital speech signal X(n₁) to obtain the speaker training digital speech X(n₁');
Step 5.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech X(n₁');
Step 5.5: train the speaker recognition model with the extracted speech feature parameters; the concrete steps are as follows:
Step 5.5.1: set the order of the Gaussian mixture model of the speaker recognition model to 4;
Step 5.5.2: initialize the speaker recognition model with the K-means method (kmeans) to obtain the initial parameters of each Gaussian component: the mean vector $\mu_k$, the covariance matrix $\Sigma_k$ and the mixture weight $c_k$, representing the initial submodel parameters corresponding to the k-th speaker;
Step 5.5.3: let the t-th feature parameter of the c-th speaker's training speech be $x_t^c$, with $\{x_t^c \mid t = 1, \ldots, T_c;\ c = 1, \ldots, C\}$, where $T_c$ is the number of frames of the c-th speaker's training speech and C is the total number of training samples. Re-estimate the initial parameters of the Gaussian components according to the following formulas, letting $\bar{k} = 1, \ldots, K$, where $\bar{k}$ denotes the corresponding speaker, to obtain the parameters of each speaker recognition submodel:
$\gamma_t^c(\bar{k}) = \dfrac{N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})}{\sum_{\bar{k}=1}^{K} N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})} \qquad (7)$
$\bar{c}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})}{\sum_{\bar{k}=1}^{K} \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (8)$
$\bar{\mu}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\, x_t^c}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (9)$
$\bar{\Sigma}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\,(x_t^c - \mu_{\bar{k}})(x_t^c - \mu_{\bar{k}})^T}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (10)$
Step 5.5.4: the speaker recognition model is a Gaussian mixture model. Substituting the speaker recognition submodel parameters obtained above into the formulas below forms each trained speaker recognition submodel, and the set of these trained submodels is the final speaker recognition model:
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 4 single Gaussian distributions, specifically:
$p(x) = \sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}}\, b_{\bar{k}}(x) \qquad (11)$
where
$b_{\bar{k}}(x) = N(x, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}_{\bar{k}}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \bar{\mu}_{\bar{k}})^T \bar{\Sigma}_{\bar{k}}^{-1} (x - \bar{\mu}_{\bar{k}})\right) \qquad (12)$
where D is the feature dimension (here D = 12) and $b_{\bar{k}}(x)$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}_{\bar{k}}$ and covariance matrix $\bar{\Sigma}_{\bar{k}}$; the weighting coefficients $\bar{c}_{\bar{k}}$ of the Gaussian mixture satisfy:
$\sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}} = 1 \qquad (13)$
The speaker recognition Gaussian mixture model parameter set $\lambda_1$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_1 = \{\bar{c}_{\bar{k}}, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}};\ \bar{k} = 1, \ldots, K\} \qquad (14)$
4. The embedded speech emotion recognition method according to claim 1, characterized in that the method for training the speech emotion recognition model library comprises the following steps:
Step 7.1: receive one speaker's emotion training speech segments;
Step 7.2: digitize the emotion training speech segments to provide a digital speech signal X(n₂), where n₂ is the sample index of the emotion training digital speech;
Step 7.3: apply the pre-processing described in step 3 to the emotion training digital speech signal X(n₂) to obtain the emotion training digital speech X(n₂');
Step 7.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 7.5: train the speech emotion models with the extracted speech emotion feature parameters; the concrete steps are as follows:
Step 7.5.1: set the order of the Gaussian mixture model of the speech emotion recognition model to 10;
Step 7.5.2: initialize the mean vector $\mu'_{k'}$, the covariance matrix $\Sigma'_{k'}$ and the mixture weight $c'_{k'}$ of each Gaussian component of the speech emotion recognition model with the K-means method (kmeans);
Step 7.5.3: using the emotion training speech pre-processed as described above, let the t'-th feature parameter of the c'-th emotion training utterance be $x_{t'}^{c'}$, with $\{x_{t'}^{c'} \mid t' = 1, \ldots, T'_{c'};\ c' = 1, \ldots, C'\}$, where $T'_{c'}$ is the number of frames of the c'-th emotion training utterance and C' is the total number of emotion training samples. Re-estimate the Gaussian mixture model parameters according to the following formulas, letting k' = 1, ..., K', to form the trained speech emotion recognition model corresponding to this speaker, and create a folder corresponding to this speaker; k' denotes the emotion corresponding to the emotional speech, i.e. this speaker's emotion recognition model comprises K' emotion submodels:
$\gamma_{t'}^{c'}(k') = \dfrac{N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})}{\sum_{k'=1}^{K'} N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})} \qquad (15)$
$\bar{c}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')}{\sum_{k'=1}^{K'} \sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (16)$
$\bar{\mu}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\, x_{t'}^{c'}}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (17)$
$\bar{\Sigma}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\,(x_{t'}^{c'} - \mu'_{k'})(x_{t'}^{c'} - \mu'_{k'})^T}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (18)$
Step 7.5.4: the speech emotion recognition model corresponding to a speaker is a Gaussian mixture model. Substituting the speech emotion recognition model parameters obtained above for that speaker into the formulas below forms the trained speech emotion recognition model corresponding to this speaker:
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 10 single Gaussian distributions, specifically:
$p'(x') = \sum_{k'=1}^{10} \bar{c}'_{k'}\, b'_{k'}(x') \qquad (19)$
where
$b'_{k'}(x') = N(x', \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}'_{k'}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x' - \bar{\mu}'_{k'})^T \bar{\Sigma}'^{-1}_{k'} (x' - \bar{\mu}'_{k'})\right) \qquad (20)$
where D is the feature dimension (here D = 12) and $b'_{k'}(x')$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}'_{k'}$ and covariance matrix $\bar{\Sigma}'_{k'}$; the weighting coefficients $\bar{c}'_{k'}$ of the Gaussian mixture satisfy:
$\sum_{k'=1}^{10} \bar{c}'_{k'} = 1 \qquad (21)$
The speech emotion recognition Gaussian mixture model parameter set $\lambda_2$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_2 = \{\bar{c}'_{k'}, \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'};\ k' = 1, \ldots, K'\} \qquad (22)$
Step 7.6: receive the emotion training speech segments of the other speakers and train each speaker's emotion training speech according to steps 7.2 to 7.5 above, obtaining the speech emotion recognition model corresponding to each speaker, including each speaker's emotion submodels; the set of per-speaker speech emotion recognition models obtained by this training constitutes the speech emotion recognition model library.
5. An apparatus for running the embedded speech emotion recognition method of claim 1, the apparatus mainly comprising: a central processor (101), a power supply (102), a clock generator (103), a NAND flash memory (104), a NOR flash memory (105), an audio codec chip (106), a microphone (107), a loudspeaker (108), a keyboard (109), an LCD (110) and a USB mass-storage device (111), characterized in that the NOR flash memory (105) stores the operating system, file system and boot loader of the apparatus; the central processor (101) is a 32-bit embedded microprocessor based on the ARM architecture; the NAND flash memory (104) stores the software implementation of the recognition method, including the speech pre-processing, feature extraction, the emotion model training module and the Gaussian-mixture-model emotion recognition models; and the USB mass-storage device (111) stores resource files including music and pictures.
6. The embedded speech emotion recognition device according to claim 5, characterized in that the NAND flash memory (104) and the NOR flash memory (105) are connected to the central processor (101) through the external bus interface; the clock generator (103) is connected to the central processor (101) and provides the clock frequency; the audio codec chip (106) is connected to the central processor (101) through the audio interface; the LCD (110) is connected to the central processor (101) through the LCD control interface; the keyboard (109) is connected to the central processor (101) through an input interface; the USB mass-storage device (111) is connected to the central processor (101) through USB; and the microphone (107) and the loudspeaker (108) are connected to the audio codec chip (106) through their respective interfaces.
7. The embedded speech emotion recognition device according to claim 5, characterized in that the device has two working modes, a training mode and a recognition mode, selected with the said keyboard (109), and the overall procedure is as follows:
Step 1: receive a key press from the keyboard (109) and judge whether the input selects the recognition mode; if it is the recognition mode, go to step 2; if it is the training mode, go to step 13;
Step 2: receive a speech segment through the microphone (107);
Step 2: digitize the speech segment with the audio codec chip (106) to provide a digital speech signal;
Step 3: pre-process the digital speech signal, including pre-emphasis, framing, windowing and endpoint detection;
Step 4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 5: input the extracted speech feature parameters into the trained speaker recognition model and determine which speaker best matches the speech segment;
Step 6: according to the decision result, determine which emotion best matches the speech segment;
Step 7: if the recognition result is calm, first display the picture and the Chinese characters for "calm" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 8: if the recognition result is happy, first display the picture and the Chinese characters for "happy" representing the result on the LCD (110), then play the corresponding audio file stored in the USB mass-storage device (111) through the loudspeaker (108);
Step 9: if the recognition result is sad, first display the picture and the Chinese characters for "sad" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 10: if the recognition result is angry, first display the picture and the Chinese characters for "angry" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 11: if the recognition result is fear, first display the picture and the Chinese characters for "fear" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 12: receive a key press from the keyboard (109) and judge which training mode is selected; for the bulk training mode go to step 13, for the instant training mode go to step 14;
Step 13: the device enters the bulk training procedure;
Step 13.1: receive speech segment inputs and judge whether the preset number of bulk-training segments has been reached; if so, go to step 13.2, otherwise return to step 13.1;
Step 13.2: pre-process the input speech;
Step 13.3: extract the speech feature parameters from the pre-processed speech;
Step 13.4: train the speaker recognition model;
Step 13.5: train the speech emotion recognition model library;
Step 14: the device enters the instant training procedure;
Step 14.1: receive one speech segment input;
Step 14.2: pre-process the input speech segment;
Step 14.3: extract the speech feature parameters from the pre-processed speech segment;
Step 14.4: train the speaker recognition model;
Step 14.5: train the speech emotion recognition model library.
CN201110358672.6A 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device Expired - Fee Related CN102737629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110358672.6A CN102737629B (en) 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110358672.6A CN102737629B (en) 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device

Publications (2)

Publication Number Publication Date
CN102737629A true CN102737629A (en) 2012-10-17
CN102737629B CN102737629B (en) 2014-12-03

Family

ID=46993004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110358672.6A Expired - Fee Related CN102737629B (en) 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN102737629B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN103236258A (en) * 2013-05-06 2013-08-07 东南大学 Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN103295573A (en) * 2013-05-06 2013-09-11 东南大学 Voice emotional characteristic extraction method based on Fisher ratio optimal wavelet packet decomposition
CN103730113A (en) * 2013-12-27 2014-04-16 黄伟 Method for removing emotion voice interference during voiceprint identification and system for removing emotion voice interference during voiceprint identification
CN103778914A (en) * 2014-01-27 2014-05-07 华南理工大学 Anti-noise voice identification method and device based on signal-to-noise ratio weighing template characteristic matching
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN104700829A (en) * 2015-03-30 2015-06-10 中南民族大学 System and method for recognizing voice emotion of animal
CN104731548A (en) * 2013-12-24 2015-06-24 财团法人工业技术研究院 Identification network generating device and method thereof
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 A kind of speech emotional feature selection approach based on Standard of Environmental Noiseization conversion
WO2018023518A1 (en) * 2016-08-04 2018-02-08 易晓阳 Smart terminal for voice interaction and recognition
WO2018023517A1 (en) * 2016-08-04 2018-02-08 易晓阳 Voice interactive recognition control system
WO2018027506A1 (en) * 2016-08-09 2018-02-15 曹鸿鹏 Emotion recognition-based lighting control method
CN107895582A (en) * 2017-10-16 2018-04-10 中国电子科技集团公司第二十八研究所 Towards the speaker adaptation speech-emotion recognition method in multi-source information field
CN108010516A (en) * 2017-12-04 2018-05-08 广州势必可赢网络科技有限公司 Semantic independent speech emotion feature recognition method and device
CN108593282A (en) * 2018-07-05 2018-09-28 国网安徽省电力有限公司培训中心 A kind of breaker on-line monitoring and fault diagonosing device and its working method
CN109063670A (en) * 2018-08-16 2018-12-21 大连民族大学 Block letter language of the Manchus word recognition methods based on prefix grouping
CN109389182A (en) * 2018-10-31 2019-02-26 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112006697A (en) * 2020-06-02 2020-12-01 东南大学 Gradient boosting decision tree depression recognition method based on voice signals
CN112037762A (en) * 2020-09-10 2020-12-04 中航华东光电(上海)有限公司 Chinese-English mixed speech recognition method
CN112489688A (en) * 2020-11-09 2021-03-12 浪潮通用软件有限公司 Neural network-based emotion recognition method, device and medium
CN112639963A (en) * 2020-03-19 2021-04-09 深圳市大疆创新科技有限公司 Audio acquisition device, audio receiving device and audio processing method
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2720359C1 (en) * 2019-04-16 2020-04-29 Хуавэй Текнолоджиз Ко., Лтд. Method and equipment for recognizing emotions in speech

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030154076A1 (en) * 2002-02-13 2003-08-14 Thomas Kemp Method for recognizing speech/speaker using emotional change to govern unsupervised adaptation
WO2007148493A1 (en) * 2006-06-23 2007-12-27 Panasonic Corporation Emotion recognizer
CN101261832A (en) * 2008-04-21 2008-09-10 北京航空航天大学 Extraction and modeling method for Chinese speech sensibility information
CN101937678A (en) * 2010-07-19 2011-01-05 东南大学 Judgment-deniable automatic speech emotion recognition method for fidget

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
WO2014153800A1 (en) * 2013-03-29 2014-10-02 京东方科技集团股份有限公司 Voice recognition system
CN103236258B (en) * 2013-05-06 2015-09-16 东南大学 Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN103236258A (en) * 2013-05-06 2013-08-07 东南大学 Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN103295573A (en) * 2013-05-06 2013-09-11 东南大学 Voice emotional characteristic extraction method based on Fisher ratio optimal wavelet packet decomposition
CN103295573B (en) * 2013-05-06 2015-07-01 东南大学 Voice emotional characteristic extraction method based on Fisher ratio optimal wavelet packet decomposition
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN104731548B (en) * 2013-12-24 2017-09-29 财团法人工业技术研究院 Identification network generating device and method thereof
US10002609B2 (en) 2013-12-24 2018-06-19 Industrial Technology Research Institute Device and method for generating recognition network by adjusting recognition vocabulary weights based on a number of times they appear in operation contents
CN104731548A (en) * 2013-12-24 2015-06-24 财团法人工业技术研究院 Identification network generating device and method thereof
CN103730113A (en) * 2013-12-27 2014-04-16 黄伟 Method and system for removing emotional voice interference during voiceprint identification
CN103778914B (en) * 2014-01-27 2017-02-15 华南理工大学 Anti-noise voice identification method and device based on signal-to-noise-ratio-weighted template feature matching
CN103778914A (en) * 2014-01-27 2014-05-07 华南理工大学 Anti-noise voice identification method and device based on signal-to-noise-ratio-weighted template feature matching
CN104700829A (en) * 2015-03-30 2015-06-10 中南民族大学 System and method for recognizing voice emotion of animal
CN104700829B (en) * 2015-03-30 2018-05-01 中南民族大学 Animal voice emotion recognition system and method
WO2018023517A1 (en) * 2016-08-04 2018-02-08 易晓阳 Voice interactive recognition control system
WO2018023518A1 (en) * 2016-08-04 2018-02-08 易晓阳 Smart terminal for voice interaction and recognition
WO2018027506A1 (en) * 2016-08-09 2018-02-15 曹鸿鹏 Emotion recognition-based lighting control method
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 Speech emotion feature selection method based on environmental noise standardization transformation
CN107895582A (en) * 2017-10-16 2018-04-10 中国电子科技集团公司第二十八研究所 Speaker-adaptive speech emotion recognition method for the multi-source information field
CN108010516A (en) * 2017-12-04 2018-05-08 广州势必可赢网络科技有限公司 Semantics-independent speech emotion feature recognition method and device
CN108593282A (en) * 2018-07-05 2018-09-28 国网安徽省电力有限公司培训中心 Circuit breaker online monitoring and fault diagnosis device and working method thereof
CN109063670A (en) * 2018-08-16 2018-12-21 大连民族大学 Printed Manchu word recognition method based on prefix grouping
CN109389182A (en) * 2018-10-31 2019-02-26 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN112639963A (en) * 2020-03-19 2021-04-09 深圳市大疆创新科技有限公司 Audio acquisition device, audio receiving device and audio processing method
WO2021184315A1 (en) * 2020-03-19 2021-09-23 深圳市大疆创新科技有限公司 Audio acquisition apparatus, audio receiving apparatus, and audio processing method
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device
CN112006697A (en) * 2020-06-02 2020-12-01 东南大学 Gradient boosting decision tree depression recognition method based on voice signals
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112002348B (en) * 2020-09-07 2021-12-28 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112037762A (en) * 2020-09-10 2020-12-04 中航华东光电(上海)有限公司 Chinese-English mixed speech recognition method
CN112489688A (en) * 2020-11-09 2021-03-12 浪潮通用软件有限公司 Neural network-based emotion recognition method, device and medium

Also Published As

Publication number Publication date
CN102737629B (en) 2014-12-03

Similar Documents

Publication Publication Date Title
CN102737629B (en) Embedded type speech emotion recognition method and device
CN110853618B (en) Language identification method, model training method, device and equipment
WO2021051544A1 (en) Voice recognition method and device
CN102332263B (en) Speaker recognition method based on an emotion model synthesized using the nearest-neighbor principle
CN103871426A (en) Method and system for comparing similarity between user audio and original audio
CN102723078A (en) Emotion speech recognition method based on natural language comprehension
CN108564942A (en) Sensitivity-adjustable speech emotion recognition method and system
CN105938399B (en) Acoustics-based text input recognition method for smart devices
CN105244042B (en) Speech emotion interaction device and method based on finite-state automata
CN101923855A (en) Test-irrelevant voiceprint identification system
CN102820033A (en) Voiceprint identification method
CN109508402A (en) Violation term detection method and device
CN102855875B (en) Network voice call control system and method based on externally open control of voice input
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN107705802A (en) Voice conversion method and device, electronic equipment, and readable storage medium
CN103236258B (en) Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN113539232B (en) Speech synthesis method based on a MOOC speech data set
CN109243492A (en) Speech emotion recognition system and recognition method
CN108564965A (en) Anti-noise speech recognition system
CN106504772A (en) Speech emotion recognition method based on an importance-weighted support vector machine classifier
CN102664010A (en) Robust speaker identification method based on multifactor frequency-shift invariant features
CN107221344A (en) Speech emotion transfer method
CN108899033A (en) Method and device for determining speaker characteristics
CN104952446A (en) Digital building presentation system based on voice interaction
CN105070300A (en) Speech emotion feature selection method based on speaker standardization transformation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: No. 18 Tang Road, Tangshan Street, Jiangning District, Nanjing, Jiangsu Province, 211131

Applicant after: Southeast University

Address before: No. 2 Southeast University Road, Jiangning Development Zone, Jiangsu Province, 211189

Applicant before: Southeast University

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141203

Termination date: 20171111