CN102737629A - Embedded type speech emotion recognition method and device - Google Patents


Info

Publication number
CN102737629A
CN102737629A, CN2011103586726A, CN201110358672A
Authority
CN
China
Prior art keywords
speech
speaker
emotion
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103586726A
Other languages
Chinese (zh)
Other versions
CN102737629B (en)
Inventor
黄永明
章国宝
董飞
祖晖
刘海彬
倪道宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201110358672.6A
Publication of CN102737629A
Application granted
Publication of CN102737629B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an embedded speech emotion recognition method and device. The method comprises feature extraction, emotion model training, Gaussian mixture modeling and emotion recognition. The parameters of the speech emotion recognition model are adjusted adaptively according to the recognition result of a speaker-identification module, so that the speaker-independent speech emotion recognition problem is converted into a speaker-dependent one. The device comprises a central processor, a power supply, a clock generator, a NAND flash memory, a NOR flash memory, an audio codec chip, a microphone, a loudspeaker, a keyboard, an LCD (liquid crystal display) and a USB (Universal Serial Bus) mass-storage device. Because a speaker recognition model is added to the speech emotion recognition, the sharp drop in recognition accuracy under speaker-independent conditions is avoided, and the device additionally gains an identity recognition function.

Description

Embedded speech emotion recognition method and device
Technical field
The present invention relates to speech emotion recognition technology, and in particular to an embedded speech emotion recognition method and device, belonging to the technical field of speech emotion recognition.
Background technology
Automatic speech emotion recognition is still a relatively peripheral technology in the IT industry. As the medium of interpersonal communication, speech carries rich emotional information. Emotion plays an important role in human perception, decision-making and everyday communication. With the development of science and technology, human-machine communication is becoming more and more important in daily life, and natural, harmonious voice-based interaction has long been a goal. Speech emotion recognition is an important part of harmonious human-machine interaction: it can change the rigid machine-interaction services of the past and improve the friendliness and accuracy of human-machine interaction. As a complement to speech recognition, it strengthens the emotional side of human-machine interaction and has broad application prospects in distance education, lie-detection assistance, automated telephone service centers, clinical medicine, intelligent toys, smart phones and similar areas.
Embedded speech emotion recognition refers specifically to emotion recognition running on stand-alone devices other than computers, in particular on voice toys, intelligent pets and other embedded products. Traditional voice products require the user to issue voice commands in a near-neutral tone; speech with strong emotional coloring actually degrades the recognition results. Such harsh, unfriendly requirements can dampen the user's enthusiasm, which is a major defect of current voice products. Integrating emotion into voice products gives the user more freedom and improves the user experience, and this is a general direction for the development of intelligent interactive voice products. Taking an intelligent interactive voice toy as an example: if the toy can recognize the emotion in the user's speech and respond differently to different emotions, it can to some extent overcome the impersonal character of electronic toys and make them more engaging. In the same way, embedded speech emotion recognition enables better communication and interaction between people and machines. Such a demand clearly exists, yet no embedded product with an emotion recognition function has appeared on the domestic market, which is regrettable.
Summary of the invention
The problem solved by the present invention is as follows. To overcome the low recognition rate of traditional speech emotion recognition for unspecified speakers, and to address the lack of speech emotion recognition devices with good human-computer interaction on the market, the present invention provides an embedded speech emotion recognition method and a corresponding device. The system can recognize the speaker's emotions, such as calm, happiness, anger, sadness and fear, on a small embedded device, and take different actions according to the emotion carried by the speaker's voice.
The technical solution of the present invention is as follows:
1. An embedded speech emotion recognition method comprises the following steps:
Step 1: receive an emotional speech segment to be recognized;
Step 2: digitize the emotional speech segment to be recognized to provide a digital speech signal;
Step 3: pre-process the digital emotional speech signal X(n) to be recognized, including pre-emphasis, framing, windowing and endpoint detection:
Step 3.1: apply pre-emphasis to the digital emotional speech signal X(n) to be recognized as follows:
$\overline{X(n)} = X(n) - \alpha X(n-1) \qquad (1)$
where $\alpha = 0.9375$ and n is the sample index of the emotional digital speech to be recognized;
Step 3.2: divide the signal into frames by overlapping segmentation; the overlap between one frame and the next is called the frame shift. Here the frame shift is 7 ms, i.e. 80 samples at the 11.025 kHz sampling rate, and each frame is 23 ms long, i.e. 256 samples;
Step 3.3: apply a Hamming window to each frame of the speech signal; the window function is
$w(n') = 0.54 - 0.46\cos\!\left(\frac{2\pi n'}{N-1}\right), \quad 0 \le n' \le N-1 \qquad (2)$
where n' is the sample index within a frame and N is the number of samples per frame, here N = 256;
Step 3.4: complete endpoint detection with the known energy/zero-crossing-rate double-threshold method, based on the principle that both the energy and the zero-crossing rate of background noise are lower than the short-time energy and short-time zero-crossing rate of speech: a first-level decision is made on the short-time energy, a second-level decision is then made on the short-time zero-crossing rate, the upper and lower short-time energy limits and the zero-crossing-rate threshold are computed, every frame is judged against them, and the endpoint-detected digital speech frames X(n') are obtained (a code sketch of this pre-processing is given after step 7 below);
Step 4: extract the speech feature parameters from the pre-processed digital speech; the feature parameters are 12-dimensional Mel-frequency cepstral coefficients (MFCC);
Step 5: input the speech feature parameters extracted in step 4 into each trained speaker recognition submodel, determine which speaker recognition submodel best matches the speech segment, and select the speaker corresponding to the matching submodel;
Step 6: according to the speaker decision of step 5, select that speaker's speech emotion recognition model from the trained library of per-speaker speech emotion recognition models;
Step 7: input the speech feature parameters extracted in step 4 into the speech emotion recognition submodels selected in step 6; the speech emotion recognition model comprises five trained emotion submodels for happiness, anger, sadness, fear and calm, and the emotion that best matches the speech segment is determined from the outputs of the speech emotion recognition model.
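The pre-processing of steps 3.1-3.4 can be summarized in a short sketch. The following Python/NumPy code is only an illustrative reading of those steps, not the patented implementation; the function names and the endpoint-detection thresholds (e_hi, e_lo, z_thr) are assumptions introduced here.

```python
import numpy as np

def pre_emphasis(x, alpha=0.9375):
    """Formula (1): X'(n) = X(n) - alpha * X(n-1)."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=256, frame_shift=80):
    """Overlapping frames of 256 samples (about 23 ms) with an 80-sample
    (about 7 ms) shift at 11.025 kHz, each multiplied by a Hamming window.
    Assumes len(x) >= frame_len."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def endpoint_detect(frames, e_hi=0.25, e_lo=0.05, z_thr=0.3):
    """Frame-wise simplification of the energy / zero-crossing double-threshold
    decision of step 3.4: high-energy frames are speech, and lower-energy frames
    are also kept when their zero-crossing rate is high. Threshold values are
    placeholders, not values taken from the patent."""
    energy = (frames ** 2).sum(axis=1)
    energy = energy / (energy.max() + 1e-12)          # normalize to [0, 1]
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    keep = (energy > e_hi) | ((energy > e_lo) & (zcr > z_thr))
    return frames[keep]
```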
An apparatus for running the embedded speech emotion recognition method mainly comprises: a central processor, a power supply, a clock generator, a NAND flash memory, a NOR flash memory, an audio codec chip, a microphone, a loudspeaker, a keyboard, an LCD and a USB mass-storage device. The NOR flash stores the operating system, file system and boot loader of the apparatus; the central processor is a 32-bit embedded microprocessor based on the ARM architecture; the NAND flash stores the software implementation of the recognition method, including the speech pre-processing, feature extraction, the emotion model training module and the Gaussian-mixture-model emotion recognition models; and the USB mass-storage device stores resource files such as music and pictures.
The beneficial effects of the present invention include:
(1) The parameters of the speech emotion recognition model are adjusted adaptively according to the recognition result of the speaker module, converting the speaker-independent speech emotion recognition problem into a speaker-dependent one and solving the low recognition rate of speaker-independent speech emotion recognition in practical applications;
(2) Because a speaker recognition model is added to the speech emotion recognition, the device can authenticate the user while extracting the emotional information from the user's speech, which gives it greater engineering value;
(3) The speaker recognition model and the speech emotion recognition models use the same speech feature parameters (12-dimensional MFCC), and both classifiers use Gaussian mixture models (GMM), trained and applied separately; the benefit of point (2) is therefore obtained without adding much computational complexity;
(4) The speech emotion recognition device takes speech input through a microphone, switches modes through a keyboard, and outputs the emotion through the LCD and loudspeaker, so the human-machine interaction is friendly;
(5) The speech emotion recognition device has two training modes, switched with the keyboard: bulk training and instant training. If the user selects the instant training mode, the speaker recognition model and the parameters of the speech emotion recognition models are updated immediately according to the recognition result for the user, imitating the human learning process.
Description of drawings
Fig. 1 is a structural block diagram of the apparatus of the present invention.
Fig. 2 is a flow diagram of the feature extraction of the present invention.
Fig. 3 is a block diagram of the working principle of the present invention.
Fig. 4 is a flow diagram of the emotion model training and recognition process of the present invention.
Fig. 5 is a flow diagram of the interaction between the speech emotion recognition device and the user.
Specific embodiments
Embodiment 1
An embedded speech emotion recognition method comprises the following steps:
Step 1: receive an emotional speech segment to be recognized;
Step 2: digitize the emotional speech segment to be recognized to provide a digital speech signal;
Step 3: pre-process the digital emotional speech signal X(n) to be recognized, including pre-emphasis, framing, windowing and endpoint detection:
Step 3.1: apply pre-emphasis to the digital emotional speech signal X(n) to be recognized as follows:
$\overline{X(n)} = X(n) - \alpha X(n-1) \qquad (1)$
where $\alpha = 0.9375$ and n is the sample index of the emotional digital speech to be recognized;
Step 3.2: divide the signal into frames by overlapping segmentation; the overlap between one frame and the next is called the frame shift. Here the frame shift is 7 ms, i.e. 80 samples at the 11.025 kHz sampling rate, and each frame is 23 ms long, i.e. 256 samples;
Step 3.3: apply a Hamming window to each frame of the speech signal; the window function is
$w(n') = 0.54 - 0.46\cos\!\left(\frac{2\pi n'}{N-1}\right), \quad 0 \le n' \le N-1 \qquad (2)$
where n' is the sample index within a frame and N is the number of samples per frame, here N = 256;
Step 3.4: complete endpoint detection with the known energy/zero-crossing-rate double-threshold method, based on the principle that both the energy and the zero-crossing rate of background noise are lower than the short-time energy and short-time zero-crossing rate of speech: a first-level decision is made on the short-time energy, a second-level decision is then made on the short-time zero-crossing rate, the upper and lower short-time energy limits and the zero-crossing-rate threshold are computed, every frame is judged against them, and the endpoint-detected digital speech frames X(n') are obtained;
Step 4: extract the speech feature parameters from the pre-processed digital speech; the feature parameters are 12-dimensional Mel-frequency cepstral coefficients (MFCC);
Step 5: input the speech feature parameters extracted in step 4 into each trained speaker recognition submodel, determine which speaker recognition submodel best matches the speech segment, and select the speaker corresponding to the matching submodel;
Step 6: according to the speaker decision of step 5, select that speaker's speech emotion recognition model from the trained library of per-speaker speech emotion recognition models;
Step 7: input the speech feature parameters extracted in step 4 into the speech emotion recognition submodels selected in step 6; the speech emotion recognition model comprises five trained emotion submodels for happiness, anger, sadness, fear and calm, and the emotion that best matches the speech segment is determined from the outputs of the speech emotion recognition model (a code sketch of steps 5-7 is given below).
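The decision chain of steps 5-7 can be illustrated with a minimal sketch. It assumes the trained models are held in two containers, `speaker_gmms` (speaker name -> GMM) and `emotion_banks` (speaker name -> emotion name -> GMM), and that each model exposes a `score()` method returning the average log-likelihood of the feature frames, as scikit-learn's GaussianMixture does; these names and the container layout are assumptions, not part of the patent.

```python
def recognize(features, speaker_gmms, emotion_banks):
    """features: (T, 12) matrix of MFCC frames for one utterance."""
    # Step 5: the speaker whose GMM gives the highest log-likelihood.
    speaker = max(speaker_gmms, key=lambda s: speaker_gmms[s].score(features))
    # Step 6: select that speaker's emotion model bank from the library.
    bank = emotion_banks[speaker]
    # Step 7: the best-matching emotion among happy, angry, sad, fear, calm.
    emotion = max(bank, key=lambda e: bank[e].score(features))
    return speaker, emotion
```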
In the present embodiment,
In step 4, the speech feature parameters are extracted from the pre-processed digital speech by the following method (a code sketch follows step 4.3):
Step 4.1: append zeros to the time-domain signal X(n') so that the padded sequence has length N', where N' is an integer power of 2, and obtain the linear spectrum X(k) through the discrete Fourier transform (DFT):
$X(k) = \sum_{n'=0}^{N'-1} x(n') \exp(-j 2\pi n' k / N'), \quad 0 \le n', k \le N'-1 \qquad (3)$
Step 4.2: pass the linear spectrum X(k) through the Mel-frequency filter bank $H_m(k)$ to obtain the Mel spectrum, then take the logarithm of the energy to obtain the log spectrum S(m); the overall transfer from the linear spectrum X(k) to the log spectrum S(m) is:
$S(m) = \ln\!\left(\sum_{k=0}^{N'-1} |X(k)|^2 H_m(k)\right), \quad 0 \le m \le M \qquad (4)$
where the filter bank consists of M band-pass filters, m = 1, 2, ..., M, and the transfer function of each band-pass filter is:
$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) < k \le f(m+1) \\ 0 & k > f(m+1) \end{cases} \quad (0 \le m < M) \qquad (5)$
Step 4.3: apply the discrete cosine transform to the log spectrum S(m) to move to the cepstral domain and obtain the Mel-frequency cepstral coefficients c(n'):
$c(n') = \sum_{m=1}^{M-1} S(m) \cos\!\left(\frac{\pi n'(m+1/2)}{M}\right), \quad 0 \le m < M \qquad (6)$
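The following NumPy sketch walks through formulas (3)-(6) for a single frame. It is illustrative only: the number of Mel filters (24) is an assumed value the patent does not state, the filter placement is the common textbook construction on the Mel scale, and the DCT uses the usual (m - 1/2) phase rather than the (m + 1/2) printed in formula (6).

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular band-pass filters H_m(k), cf. formula (5), spaced on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mfcc(frame, sr=11025, n_filters=24, n_ceps=12):
    """12-dimensional MFCC of one windowed frame (steps 4.1-4.3)."""
    n_fft = int(2 ** np.ceil(np.log2(len(frame))))        # zero-pad to a power of 2
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # |X(k)|^2, formula (3)
    S = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-12)   # formula (4)
    m = np.arange(1, n_filters + 1)
    return np.array([np.sum(S * np.cos(np.pi * n * (m - 0.5) / n_filters))
                     for n in range(1, n_ceps + 1)])      # DCT, cf. formula (6)
```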
The speaker recognition model training method comprises the following steps:
Step 5.1: receive each speaker's training speech segments;
Step 5.2: digitize the speaker training speech segments to provide a digital speech signal X(n₁), where n₁ is the sample index of the speaker training digital speech;
Step 5.3: apply the pre-processing described in step 3 (pre-emphasis, framing, windowing and endpoint detection) to the digital speech signal X(n₁) to obtain the speaker training digital speech X(n₁');
Step 5.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech X(n₁');
Step 5.5: train the speaker recognition model with the extracted speech feature parameters; the concrete steps are as follows:
Step 5.5.1: set the order of the Gaussian mixture model of the speaker recognition model to 4;
Step 5.5.2: initialize the speaker recognition model with the K-means method (kmeans) to obtain the initial parameters of each Gaussian component: the mean vector $\mu_k$, the covariance matrix $\Sigma_k$ and the mixture weight $c_k$, representing the initial submodel parameters corresponding to the k-th speaker;
Step 5.5.3: let the t-th feature parameter of the c-th speaker's training speech be $x_t^c$, with $\{x_t^c \mid t = 1, \ldots, T_c;\ c = 1, \ldots, C\}$, where $T_c$ is the number of frames of the c-th speaker's training speech and C is the total number of training samples. Re-estimate the initial parameters of the Gaussian components according to the following formulas, letting $\bar{k} = 1, \ldots, K$, where $\bar{k}$ denotes the corresponding speaker, to obtain the parameters of each speaker recognition submodel:
$\gamma_t^c(\bar{k}) = \dfrac{N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})}{\sum_{\bar{k}=1}^{K} N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})} \qquad (7)$
$\bar{c}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})}{\sum_{\bar{k}=1}^{K} \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (8)$
$\bar{\mu}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\, x_t^c}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (9)$
$\bar{\Sigma}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\,(x_t^c - \mu_{\bar{k}})(x_t^c - \mu_{\bar{k}})^T}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (10)$
Step 5.5.4: the speaker recognition model is a Gaussian mixture model. Substituting the speaker recognition submodel parameters obtained above into the formulas below forms each trained speaker recognition submodel, and the set of these trained submodels is the final speaker recognition model (a training sketch follows):
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 4 single Gaussian distributions, specifically:
$p(x) = \sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}}\, b_{\bar{k}}(x) \qquad (11)$
where
$b_{\bar{k}}(x) = N(x, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}_{\bar{k}}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \bar{\mu}_{\bar{k}})^T \bar{\Sigma}_{\bar{k}}^{-1} (x - \bar{\mu}_{\bar{k}})\right) \qquad (12)$
where D is the feature dimension (here D = 12) and $b_{\bar{k}}(x)$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}_{\bar{k}}$ and covariance matrix $\bar{\Sigma}_{\bar{k}}$; the weighting coefficients $\bar{c}_{\bar{k}}$ of the Gaussian mixture satisfy:
$\sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}} = 1 \qquad (13)$
The speaker recognition Gaussian mixture model parameter set $\lambda_1$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_1 = \{\bar{c}_{\bar{k}}, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}};\ \bar{k} = 1, \ldots, K\} \qquad (14)$
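Step 5.5 can be approximated with an off-the-shelf GMM implementation. The sketch below assumes scikit-learn is available and fits one 4-component full-covariance GMM per speaker, which is one natural reading of the translated text (the alternative reading, in which the K components of a single mixture act as per-speaker submodels, would fit one joint GMM on the pooled data instead). Note also that scikit-learn's EM weights the responsibilities by the mixture weights, which differs slightly from formula (7) as printed; the data layout and names below are assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(training_data, n_components=4):
    """training_data: dict speaker_name -> (T, 12) array of MFCC frames pooled
    from that speaker's training utterances (steps 5.1-5.4). Returns a dict of
    fitted GMMs; each fitted triple (weights_, means_, covariances_) plays the
    role of the parameter set lambda_1 of formula (14)."""
    models = {}
    for name, feats in training_data.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                              init_params="kmeans", max_iter=100)  # step 5.5.2 + EM (7)-(10)
        gmm.fit(feats)
        models[name] = gmm
    return models
```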
The method for training the speech emotion recognition model library comprises the following steps:
Step 7.1: receive one speaker's emotion training speech segments;
Step 7.2: digitize the emotion training speech segments to provide a digital speech signal X(n₂), where n₂ is the sample index of the emotion training digital speech;
Step 7.3: apply the pre-processing described in step 3 to the emotion training digital speech signal X(n₂) to obtain the emotion training digital speech X(n₂');
Step 7.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 7.5: train the speech emotion models with the extracted speech emotion feature parameters; the concrete steps are as follows:
Step 7.5.1: set the order of the Gaussian mixture model of the speech emotion recognition model to 10;
Step 7.5.2: initialize the mean vector $\mu'_{k'}$, the covariance matrix $\Sigma'_{k'}$ and the mixture weight $c'_{k'}$ of each Gaussian component of the speech emotion recognition model with the K-means method (kmeans);
Step 7.5.3: using the emotion training speech pre-processed as described above, let the t'-th feature parameter of the c'-th emotion training utterance be $x_{t'}^{c'}$, with $\{x_{t'}^{c'} \mid t' = 1, \ldots, T'_{c'};\ c' = 1, \ldots, C'\}$, where $T'_{c'}$ is the number of frames of the c'-th emotion training utterance and C' is the total number of emotion training samples. Re-estimate the Gaussian mixture model parameters according to the following formulas, letting k' = 1, ..., K', to form the trained speech emotion recognition model corresponding to this speaker, and create a folder corresponding to this speaker; k' denotes the emotion corresponding to the emotional speech, i.e. this speaker's emotion recognition model comprises K' emotion submodels:
$\gamma_{t'}^{c'}(k') = \dfrac{N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})}{\sum_{k'=1}^{K'} N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})} \qquad (15)$
$\bar{c}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')}{\sum_{k'=1}^{K'} \sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (16)$
$\bar{\mu}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\, x_{t'}^{c'}}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (17)$
$\bar{\Sigma}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\,(x_{t'}^{c'} - \mu'_{k'})(x_{t'}^{c'} - \mu'_{k'})^T}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (18)$
Step 7.5.4: the speech emotion recognition model corresponding to a speaker is a Gaussian mixture model. Substituting the speech emotion recognition model parameters obtained above for that speaker into the formulas below forms the trained speech emotion recognition model corresponding to this speaker:
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 10 single Gaussian distributions, specifically:
$p'(x') = \sum_{k'=1}^{10} \bar{c}'_{k'}\, b'_{k'}(x') \qquad (19)$
where
$b'_{k'}(x') = N(x', \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}'_{k'}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x' - \bar{\mu}'_{k'})^T \bar{\Sigma}'^{-1}_{k'} (x' - \bar{\mu}'_{k'})\right) \qquad (20)$
where D is the feature dimension (here D = 12) and $b'_{k'}(x')$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}'_{k'}$ and covariance matrix $\bar{\Sigma}'_{k'}$; the weighting coefficients $\bar{c}'_{k'}$ of the Gaussian mixture satisfy:
$\sum_{k'=1}^{10} \bar{c}'_{k'} = 1 \qquad (21)$
The speech emotion recognition Gaussian mixture model parameter set $\lambda_2$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_2 = \{\bar{c}'_{k'}, \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'};\ k' = 1, \ldots, K'\} \qquad (22)$
Step 7.6: receive the emotion training speech segments of the other speakers and train each speaker's emotion training speech according to steps 7.2 to 7.5 above, obtaining the speech emotion recognition model corresponding to each speaker, including each speaker's emotion submodels; the set of per-speaker speech emotion recognition models obtained by this training constitutes the speech emotion recognition model library (a code sketch follows).
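The per-speaker emotion model library of steps 7.1-7.6 can be sketched in the same way, under the reading that each of the five emotions (happy, angry, sad, fear, calm) gets its own 10-component GMM for every speaker. The data layout, names and the use of scikit-learn are assumptions introduced here.

```python
from sklearn.mixture import GaussianMixture

EMOTIONS = ("happy", "angry", "sad", "fear", "calm")

def train_emotion_bank(emotion_data, n_components=10):
    """emotion_data: dict emotion -> (T, 12) MFCC frames from one speaker's
    emotion training utterances. Returns dict emotion -> fitted GMM, i.e. the
    parameter sets lambda_2 of formula (22) for that speaker."""
    bank = {}
    for emo in EMOTIONS:
        gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                              init_params="kmeans", max_iter=100)  # EM, formulas (15)-(18)
        gmm.fit(emotion_data[emo])
        bank[emo] = gmm
    return bank

def train_emotion_library(per_speaker_data):
    """Step 7.6: repeat for every speaker to build the model library."""
    return {spk: train_emotion_bank(data) for spk, data in per_speaker_data.items()}
```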
Embodiment 2
An apparatus for running the embedded speech emotion recognition method mainly comprises: a central processor 101, a power supply 102, a clock generator 103, a NAND flash memory 104, a NOR flash memory 105, an audio codec chip 106, a microphone 107, a loudspeaker 108, a keyboard 109, an LCD 110 and a USB mass-storage device 111. It is characterized in that the NOR flash memory 105 stores the operating system, file system and boot loader of the apparatus; the central processor 101 is a 32-bit embedded microprocessor based on the ARM architecture; the NAND flash memory 104 stores the software implementation of the recognition method, including the speech pre-processing, feature extraction, the emotion model training module and the Gaussian-mixture-model emotion recognition models; and the USB mass-storage device 111 stores resource files such as music and pictures.
In the present embodiment,
The NAND flash memory 104 and the NOR flash memory 105 are connected to the central processor 101 through the external bus interface; the clock generator 103 is connected to the central processor 101 and provides the clock frequency; the audio codec chip 106 is connected to the central processor 101 through the audio interface; the LCD 110 is connected to the central processor 101 through the LCD control interface; the keyboard 109 is connected to the central processor 101 through an input interface; the USB mass-storage device 111 is connected to the central processor 101 through USB; and the microphone 107 and the loudspeaker 108 are connected to the audio codec chip 106 through their respective interfaces.
The device has two working modes, a training mode and a recognition mode, selected with the keyboard 109; the overall procedure is as follows (a workflow sketch is given after step 14.5):
Step 1: receive a key press from the keyboard 109 and judge whether the input selects the recognition mode; if it is the recognition mode, go to step 2; if it is the training mode, go to step 13;
Step 2: receive a speech segment through the microphone 107;
Step 2: digitize the speech segment with the audio codec chip 106 to provide a digital speech signal;
Step 3: pre-process the digital speech signal, including pre-emphasis, framing, windowing and endpoint detection;
Step 4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 5: input the extracted speech feature parameters into the trained speaker recognition model and determine which speaker best matches the speech segment;
Step 6: according to the decision result, determine which emotion best matches the speech segment;
Step 7: if the recognition result is calm, first display the picture and the Chinese characters for "calm" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 8: if the recognition result is happy, first display the picture and the Chinese characters for "happy" representing the result on the LCD 110, then play the corresponding audio file stored in the USB mass-storage device 111 through the loudspeaker 108;
Step 9: if the recognition result is sad, first display the picture and the Chinese characters for "sad" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 10: if the recognition result is angry, first display the picture and the Chinese characters for "angry" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 11: if the recognition result is fear, first display the picture and the Chinese characters for "fear" representing the result on the LCD 110, then play the corresponding audio file in the USB mass-storage device 111 through the loudspeaker 108;
Step 12: receive a key press from the keyboard 109 and judge which training mode is selected; for the bulk training mode go to step 13, for the instant training mode go to step 14;
Step 13: the device enters the bulk training procedure:
Step 13.1: receive speech segment inputs and judge whether the preset number of bulk-training segments has been reached; if so, go to step 13.2, otherwise return to step 13.1;
Step 13.2: pre-process the input speech;
Step 13.3: extract the speech feature parameters from the pre-processed speech;
Step 13.4: train the speaker recognition model;
Step 13.5: train the speech emotion recognition model library;
Step 14: the device enters the instant training procedure:
Step 14.1: receive one speech segment input;
Step 14.2: pre-process the input speech segment;
Step 14.3: extract the speech feature parameters from the pre-processed speech segment;
Step 14.4: train the speaker recognition model;
Step 14.5: train the speech emotion recognition model library.
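The mode dispatch of this workflow can be summarized in a short sketch. All hardware interactions (read_key, record, lcd_show, play_audio) and the model callbacks are placeholders passed in by the caller, standing in for the device drivers and the training and recognition routines described above; none of these names come from the patent.

```python
def main_loop(read_key, record, lcd_show, play_audio,
              recognize_fn, bulk_train_fn, instant_train_fn):
    """recognize_fn(pcm) -> (speaker, emotion); the two training callbacks
    update the stored speaker and emotion models in place."""
    responses = {"calm": "calm.wav", "happy": "happy.wav", "sad": "sad.wav",
                 "angry": "angry.wav", "fear": "fear.wav"}   # files on USB storage
    while True:
        mode = read_key()                              # step 1 / step 12
        if mode == "recognize":
            speaker, emotion = recognize_fn(record())  # steps 2-6
            lcd_show(emotion)                          # steps 7-11: picture + text on the LCD
            play_audio(responses[emotion])             # matching audio response
        elif mode == "bulk_train":
            bulk_train_fn()                            # step 13: collect the preset number of clips
        elif mode == "instant_train":
            instant_train_fn()                         # step 14: update from a single clip
```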
The above embodiment is only one effective way of realizing the present invention; ordinary variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall all fall within the protection scope of the present invention.

Claims (7)

1. An embedded speech emotion recognition method, characterized by comprising the following steps:
Step 1: receive an emotional speech segment to be recognized;
Step 2: digitize the emotional speech segment to be recognized to provide a digital speech signal;
Step 3: pre-process the digital emotional speech signal X(n) to be recognized, including pre-emphasis, framing, windowing and endpoint detection:
Step 3.1: apply pre-emphasis to the digital emotional speech signal X(n) to be recognized as follows:
$\overline{X(n)} = X(n) - \alpha X(n-1) \qquad (1)$
where $\alpha = 0.9375$ and n is the sample index of the emotional digital speech to be recognized;
Step 3.2: divide the signal into frames by overlapping segmentation; the overlap between one frame and the next is called the frame shift. Here the frame shift is 7 ms, i.e. 80 samples at the 11.025 kHz sampling rate, and each frame is 23 ms long, i.e. 256 samples;
Step 3.3: apply a Hamming window to each frame of the speech signal; the window function is
$w(n') = 0.54 - 0.46\cos\!\left(\frac{2\pi n'}{N-1}\right), \quad 0 \le n' \le N-1 \qquad (2)$
where n' is the sample index within a frame and N is the number of samples per frame, here N = 256;
Step 3.4: complete endpoint detection with the known energy/zero-crossing-rate double-threshold method, based on the principle that both the energy and the zero-crossing rate of background noise are lower than the short-time energy and short-time zero-crossing rate of speech: a first-level decision is made on the short-time energy, a second-level decision is then made on the short-time zero-crossing rate, the upper and lower short-time energy limits and the zero-crossing-rate threshold are computed, every frame is judged against them, and the endpoint-detected digital speech frames X(n') are obtained;
Step 4: extract the speech feature parameters from the pre-processed digital speech; the feature parameters are 12-dimensional Mel-frequency cepstral coefficients (MFCC);
Step 5: input the speech feature parameters extracted in step 4 into each trained speaker recognition submodel, determine which speaker recognition submodel best matches the speech segment, and select the speaker corresponding to the matching submodel;
Step 6: according to the speaker decision of step 5, select that speaker's speech emotion recognition model from the trained library of per-speaker speech emotion recognition models;
Step 7: input the speech feature parameters extracted in step 4 into the speech emotion recognition submodels selected in step 6; the speech emotion recognition model comprises five trained emotion submodels for happiness, anger, sadness, fear and calm, and the emotion that best matches the speech segment is determined from the outputs of the speech emotion recognition model.
2. The embedded speech emotion recognition method according to claim 1, characterized in that in step 4 the speech feature parameters are extracted from the pre-processed digital speech by the following method:
Step 4.1: append zeros to the time-domain signal X(n') so that the padded sequence has length N', where N' is an integer power of 2, and obtain the linear spectrum X(k) through the discrete Fourier transform (DFT):
$X(k) = \sum_{n'=0}^{N'-1} x(n') \exp(-j 2\pi n' k / N'), \quad 0 \le n', k \le N'-1 \qquad (3)$
Step 4.2: pass the linear spectrum X(k) through the Mel-frequency filter bank $H_m(k)$ to obtain the Mel spectrum, then take the logarithm of the energy to obtain the log spectrum S(m); the overall transfer from the linear spectrum X(k) to the log spectrum S(m) is:
$S(m) = \ln\!\left(\sum_{k=0}^{N'-1} |X(k)|^2 H_m(k)\right), \quad 0 \le m \le M \qquad (4)$
where the filter bank consists of M band-pass filters, m = 1, 2, ..., M, and the transfer function of each band-pass filter is:
$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) < k \le f(m+1) \\ 0 & k > f(m+1) \end{cases} \quad (0 \le m < M) \qquad (5)$
Step 4.3: apply the discrete cosine transform to the log spectrum S(m) to move to the cepstral domain and obtain the Mel-frequency cepstral coefficients c(n'):
$c(n') = \sum_{m=1}^{M-1} S(m) \cos\!\left(\frac{\pi n'(m+1/2)}{M}\right), \quad 0 \le m < M \qquad (6)$
3. The embedded speech emotion recognition method according to claim 1, characterized in that the speaker recognition model training method comprises the following steps:
Step 5.1: receive each speaker's training speech segments;
Step 5.2: digitize the speaker training speech segments to provide a digital speech signal X(n₁), where n₁ is the sample index of the speaker training digital speech;
Step 5.3: apply the pre-processing described in step 3 (pre-emphasis, framing, windowing and endpoint detection) to the digital speech signal X(n₁) to obtain the speaker training digital speech X(n₁');
Step 5.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech X(n₁');
Step 5.5: train the speaker recognition model with the extracted speech feature parameters; the concrete steps are as follows:
Step 5.5.1: set the order of the Gaussian mixture model of the speaker recognition model to 4;
Step 5.5.2: initialize the speaker recognition model with the K-means method (kmeans) to obtain the initial parameters of each Gaussian component: the mean vector $\mu_k$, the covariance matrix $\Sigma_k$ and the mixture weight $c_k$, representing the initial submodel parameters corresponding to the k-th speaker;
Step 5.5.3: let the t-th feature parameter of the c-th speaker's training speech be $x_t^c$, with $\{x_t^c \mid t = 1, \ldots, T_c;\ c = 1, \ldots, C\}$, where $T_c$ is the number of frames of the c-th speaker's training speech and C is the total number of training samples. Re-estimate the initial parameters of the Gaussian components according to the following formulas, letting $\bar{k} = 1, \ldots, K$, where $\bar{k}$ denotes the corresponding speaker, to obtain the parameters of each speaker recognition submodel:
$\gamma_t^c(\bar{k}) = \dfrac{N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})}{\sum_{\bar{k}=1}^{K} N(x_t^c, \mu_{\bar{k}}, \Sigma_{\bar{k}})} \qquad (7)$
$\bar{c}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})}{\sum_{\bar{k}=1}^{K} \sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (8)$
$\bar{\mu}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\, x_t^c}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (9)$
$\bar{\Sigma}_{\bar{k}} = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})\,(x_t^c - \mu_{\bar{k}})(x_t^c - \mu_{\bar{k}})^T}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(\bar{k})} \qquad (10)$
Step 5.5.4: the speaker recognition model is a Gaussian mixture model. Substituting the speaker recognition submodel parameters obtained above into the formulas below forms each trained speaker recognition submodel, and the set of these trained submodels is the final speaker recognition model:
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 4 single Gaussian distributions, specifically:
$p(x) = \sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}}\, b_{\bar{k}}(x) \qquad (11)$
where
$b_{\bar{k}}(x) = N(x, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}_{\bar{k}}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \bar{\mu}_{\bar{k}})^T \bar{\Sigma}_{\bar{k}}^{-1} (x - \bar{\mu}_{\bar{k}})\right) \qquad (12)$
where D is the feature dimension (here D = 12) and $b_{\bar{k}}(x)$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}_{\bar{k}}$ and covariance matrix $\bar{\Sigma}_{\bar{k}}$; the weighting coefficients $\bar{c}_{\bar{k}}$ of the Gaussian mixture satisfy:
$\sum_{\bar{k}=1}^{4} \bar{c}_{\bar{k}} = 1 \qquad (13)$
The speaker recognition Gaussian mixture model parameter set $\lambda_1$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_1 = \{\bar{c}_{\bar{k}}, \bar{\mu}_{\bar{k}}, \bar{\Sigma}_{\bar{k}};\ \bar{k} = 1, \ldots, K\} \qquad (14)$
4. The embedded speech emotion recognition method according to claim 1, characterized in that the method for training the speech emotion recognition model library comprises the following steps:
Step 7.1: receive one speaker's emotion training speech segments;
Step 7.2: digitize the emotion training speech segments to provide a digital speech signal X(n₂), where n₂ is the sample index of the emotion training digital speech;
Step 7.3: apply the pre-processing described in step 3 to the emotion training digital speech signal X(n₂) to obtain the emotion training digital speech X(n₂');
Step 7.4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 7.5: train the speech emotion models with the extracted speech emotion feature parameters; the concrete steps are as follows:
Step 7.5.1: set the order of the Gaussian mixture model of the speech emotion recognition model to 10;
Step 7.5.2: initialize the mean vector $\mu'_{k'}$, the covariance matrix $\Sigma'_{k'}$ and the mixture weight $c'_{k'}$ of each Gaussian component of the speech emotion recognition model with the K-means method (kmeans);
Step 7.5.3: using the emotion training speech pre-processed as described above, let the t'-th feature parameter of the c'-th emotion training utterance be $x_{t'}^{c'}$, with $\{x_{t'}^{c'} \mid t' = 1, \ldots, T'_{c'};\ c' = 1, \ldots, C'\}$, where $T'_{c'}$ is the number of frames of the c'-th emotion training utterance and C' is the total number of emotion training samples. Re-estimate the Gaussian mixture model parameters according to the following formulas, letting k' = 1, ..., K', to form the trained speech emotion recognition model corresponding to this speaker, and create a folder corresponding to this speaker; k' denotes the emotion corresponding to the emotional speech, i.e. this speaker's emotion recognition model comprises K' emotion submodels:
$\gamma_{t'}^{c'}(k') = \dfrac{N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})}{\sum_{k'=1}^{K'} N(x_{t'}^{c'}, \mu'_{k'}, \Sigma'_{k'})} \qquad (15)$
$\bar{c}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')}{\sum_{k'=1}^{K'} \sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (16)$
$\bar{\mu}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\, x_{t'}^{c'}}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (17)$
$\bar{\Sigma}'_{k'} = \dfrac{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')\,(x_{t'}^{c'} - \mu'_{k'})(x_{t'}^{c'} - \mu'_{k'})^T}{\sum_{c'=1}^{C'} \sum_{t'=1}^{T'_{c'}} \gamma_{t'}^{c'}(k')} \qquad (18)$
Step 7.5.4: the speech emotion recognition model corresponding to a speaker is a Gaussian mixture model. Substituting the speech emotion recognition model parameters obtained above for that speaker into the formulas below forms the trained speech emotion recognition model corresponding to this speaker:
This Gaussian mixture model describes the distribution of the frame features in the feature space with a linear combination of 10 single Gaussian distributions, specifically:
$p'(x') = \sum_{k'=1}^{10} \bar{c}'_{k'}\, b'_{k'}(x') \qquad (19)$
where
$b'_{k'}(x') = N(x', \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'}) = \dfrac{1}{(2\pi)^{D/2} |\bar{\Sigma}'_{k'}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x' - \bar{\mu}'_{k'})^T \bar{\Sigma}'^{-1}_{k'} (x' - \bar{\mu}'_{k'})\right) \qquad (20)$
where D is the feature dimension (here D = 12) and $b'_{k'}(x')$ is the kernel function, a Gaussian distribution function with mean vector $\bar{\mu}'_{k'}$ and covariance matrix $\bar{\Sigma}'_{k'}$; the weighting coefficients $\bar{c}'_{k'}$ of the Gaussian mixture satisfy:
$\sum_{k'=1}^{10} \bar{c}'_{k'} = 1 \qquad (21)$
The speech emotion recognition Gaussian mixture model parameter set $\lambda_2$ consists of the mean components, covariance matrices and mixture weights above, expressed as the triple:
$\lambda_2 = \{\bar{c}'_{k'}, \bar{\mu}'_{k'}, \bar{\Sigma}'_{k'};\ k' = 1, \ldots, K'\} \qquad (22)$
Step 7.6: receive the emotion training speech segments of the other speakers and train each speaker's emotion training speech according to steps 7.2 to 7.5 above, obtaining the speech emotion recognition model corresponding to each speaker, including each speaker's emotion submodels; the set of per-speaker speech emotion recognition models obtained by this training constitutes the speech emotion recognition model library.
5. An apparatus for running the embedded speech emotion recognition method of claim 1, the apparatus mainly comprising: a central processor (101), a power supply (102), a clock generator (103), a NAND flash memory (104), a NOR flash memory (105), an audio codec chip (106), a microphone (107), a loudspeaker (108), a keyboard (109), an LCD (110) and a USB mass-storage device (111), characterized in that the NOR flash memory (105) stores the operating system, file system and boot loader of the apparatus; the central processor (101) is a 32-bit embedded microprocessor based on the ARM architecture; the NAND flash memory (104) stores the software implementation of the recognition method, including the speech pre-processing, feature extraction, the emotion model training module and the Gaussian-mixture-model emotion recognition models; and the USB mass-storage device (111) stores resource files including music and pictures.
6. The embedded speech emotion recognition device according to claim 5, characterized in that the NAND flash memory (104) and the NOR flash memory (105) are connected to the central processor (101) through the external bus interface; the clock generator (103) is connected to the central processor (101) and provides the clock frequency; the audio codec chip (106) is connected to the central processor (101) through the audio interface; the LCD (110) is connected to the central processor (101) through the LCD control interface; the keyboard (109) is connected to the central processor (101) through an input interface; the USB mass-storage device (111) is connected to the central processor (101) through USB; and the microphone (107) and the loudspeaker (108) are connected to the audio codec chip (106) through their respective interfaces.
7. The embedded speech emotion recognition device according to claim 5, characterized in that the device has two working modes, a training mode and a recognition mode, selected with the said keyboard (109), and the overall procedure is as follows:
Step 1: receive a key press from the keyboard (109) and judge whether the input selects the recognition mode; if it is the recognition mode, go to step 2; if it is the training mode, go to step 13;
Step 2: receive a speech segment through the microphone (107);
Step 2: digitize the speech segment with the audio codec chip (106) to provide a digital speech signal;
Step 3: pre-process the digital speech signal, including pre-emphasis, framing, windowing and endpoint detection;
Step 4: extract the speech feature parameters, the 12-dimensional MFCC, from the pre-processed digital speech;
Step 5: input the extracted speech feature parameters into the trained speaker recognition model and determine which speaker best matches the speech segment;
Step 6: according to the decision result, determine which emotion best matches the speech segment;
Step 7: if the recognition result is calm, first display the picture and the Chinese characters for "calm" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 8: if the recognition result is happy, first display the picture and the Chinese characters for "happy" representing the result on the LCD (110), then play the corresponding audio file stored in the USB mass-storage device (111) through the loudspeaker (108);
Step 9: if the recognition result is sad, first display the picture and the Chinese characters for "sad" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 10: if the recognition result is angry, first display the picture and the Chinese characters for "angry" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 11: if the recognition result is fear, first display the picture and the Chinese characters for "fear" representing the result on the LCD (110), then play the corresponding audio file in the USB mass-storage device (111) through the loudspeaker (108);
Step 12: receive a key press from the keyboard (109) and judge which training mode is selected; for the bulk training mode go to step 13, for the instant training mode go to step 14;
Step 13: the device enters the bulk training procedure;
Step 13.1: receive speech segment inputs and judge whether the preset number of bulk-training segments has been reached; if so, go to step 13.2, otherwise return to step 13.1;
Step 13.2: pre-process the input speech;
Step 13.3: extract the speech feature parameters from the pre-processed speech;
Step 13.4: train the speaker recognition model;
Step 13.5: train the speech emotion recognition model library;
Step 14: the device enters the instant training procedure;
Step 14.1: receive one speech segment input;
Step 14.2: pre-process the input speech segment;
Step 14.3: extract the speech feature parameters from the pre-processed speech segment;
Step 14.4: train the speaker recognition model;
Step 14.5: train the speech emotion recognition model library.
CN201110358672.6A 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device Expired - Fee Related CN102737629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110358672.6A CN102737629B (en) 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110358672.6A CN102737629B (en) 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device

Publications (2)

Publication Number Publication Date
CN102737629A true CN102737629A (en) 2012-10-17
CN102737629B CN102737629B (en) 2014-12-03

Family

ID=46993004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110358672.6A Expired - Fee Related CN102737629B (en) 2011-11-11 2011-11-11 Embedded type speech emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN102737629B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN103236258A (en) * 2013-05-06 2013-08-07 东南大学 Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN103295573A (en) * 2013-05-06 2013-09-11 东南大学 Voice emotional characteristic extraction method based on Fisher ratio optimal wavelet packet decomposition
CN103730113A (en) * 2013-12-27 2014-04-16 黄伟 Method for removing emotion voice interference during voiceprint identification and system for removing emotion voice interference during voiceprint identification
CN103778914A (en) * 2014-01-27 2014-05-07 华南理工大学 Anti-noise voice identification method and device based on signal-to-noise ratio weighing template characteristic matching
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN104700829A (en) * 2015-03-30 2015-06-10 中南民族大学 System and method for recognizing voice emotion of animal
CN104731548A (en) * 2013-12-24 2015-06-24 财团法人工业技术研究院 Identification network generating device and method thereof
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 A kind of speech emotional feature selection approach based on Standard of Environmental Noiseization conversion
WO2018023518A1 (en) * 2016-08-04 2018-02-08 易晓阳 Smart terminal for voice interaction and recognition
WO2018023517A1 (en) * 2016-08-04 2018-02-08 易晓阳 Voice interactive recognition control system
WO2018027506A1 (en) * 2016-08-09 2018-02-15 曹鸿鹏 Emotion recognition-based lighting control method
CN107895582A (en) * 2017-10-16 2018-04-10 中国电子科技集团公司第二十八研究所 Towards the speaker adaptation speech-emotion recognition method in multi-source information field
CN108010516A (en) * 2017-12-04 2018-05-08 广州势必可赢网络科技有限公司 Semantic independent speech emotion feature recognition method and device
CN108593282A (en) * 2018-07-05 2018-09-28 国网安徽省电力有限公司培训中心 A kind of breaker on-line monitoring and fault diagonosing device and its working method
CN109063670A (en) * 2018-08-16 2018-12-21 大连民族大学 Block letter language of the Manchus word recognition methods based on prefix grouping
CN109389182A (en) * 2018-10-31 2019-02-26 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112006697A (en) * 2020-06-02 2020-12-01 东南大学 Gradient boosting decision tree depression recognition method based on voice signals
CN112037762A (en) * 2020-09-10 2020-12-04 中航华东光电(上海)有限公司 Chinese-English mixed speech recognition method
CN112489688A (en) * 2020-11-09 2021-03-12 浪潮通用软件有限公司 Neural network-based emotion recognition method, device and medium
CN112639963A (en) * 2020-03-19 2021-04-09 深圳市大疆创新科技有限公司 Audio acquisition device, audio receiving device and audio processing method
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2720359C1 (en) * 2019-04-16 2020-04-29 Хуавэй Текнолоджиз Ко., Лтд. Method and equipment for recognizing emotions in speech

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030154076A1 (en) * 2002-02-13 2003-08-14 Thomas Kemp Method for recognizing speech/speaker using emotional change to govern unsupervised adaptation
WO2007148493A1 (en) * 2006-06-23 2007-12-27 Panasonic Corporation Emotion recognizer
CN101261832A (en) * 2008-04-21 2008-09-10 北京航空航天大学 Extraction and modeling method for Chinese speech sensibility information
CN101937678A (en) * 2010-07-19 2011-01-05 东南大学 Judgment-deniable automatic speech emotion recognition method for fidget

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
WO2014153800A1 (en) * 2013-03-29 2014-10-02 京东方科技集团股份有限公司 Voice recognition system
CN103236258B (en) * 2013-05-06 2015-09-16 东南大学 Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN103236258A (en) * 2013-05-06 2013-08-07 东南大学 Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN103295573A (en) * 2013-05-06 2013-09-11 东南大学 Voice emotional characteristic extraction method based on Fisher ratio optimal wavelet packet decomposition
CN103295573B (en) * 2013-05-06 2015-07-01 东南大学 Voice emotional characteristic extraction method based on Fisher ratio optimal wavelet packet decomposition
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN104731548B (en) * 2013-12-24 2017-09-29 财团法人工业技术研究院 Identification network generating device and method thereof
US10002609B2 (en) 2013-12-24 2018-06-19 Industrial Technology Research Institute Device and method for generating recognition network by adjusting recognition vocabulary weights based on a number of times they appear in operation contents
CN104731548A (en) * 2013-12-24 2015-06-24 财团法人工业技术研究院 Identification network generating device and method thereof
CN103730113A (en) * 2013-12-27 2014-04-16 黄伟 Method and system for removing emotional voice interference during voiceprint identification
CN103778914B (en) * 2014-01-27 2017-02-15 华南理工大学 Anti-noise voice identification method and device based on signal-to-noise-ratio-weighted template feature matching
CN103778914A (en) * 2014-01-27 2014-05-07 华南理工大学 Anti-noise voice identification method and device based on signal-to-noise-ratio-weighted template feature matching
CN104700829A (en) * 2015-03-30 2015-06-10 中南民族大学 System and method for recognizing voice emotion of animal
CN104700829B (en) * 2015-03-30 2018-05-01 中南民族大学 Animal voice emotion recognition system and method
WO2018023517A1 (en) * 2016-08-04 2018-02-08 易晓阳 Voice interactive recognition control system
WO2018023518A1 (en) * 2016-08-04 2018-02-08 易晓阳 Smart terminal for voice interaction and recognition
WO2018027506A1 (en) * 2016-08-09 2018-02-15 曹鸿鹏 Emotion recognition-based lighting control method
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 Speech emotion feature selection method based on environmental noise standardization transformation
CN107895582A (en) * 2017-10-16 2018-04-10 中国电子科技集团公司第二十八研究所 Speaker-adaptive speech emotion recognition method for the multi-source information field
CN108010516A (en) * 2017-12-04 2018-05-08 广州势必可赢网络科技有限公司 Semantics-independent speech emotion feature recognition method and device
CN108593282A (en) * 2018-07-05 2018-09-28 国网安徽省电力有限公司培训中心 Circuit breaker online monitoring and fault diagnosis device and working method thereof
CN109063670A (en) * 2018-08-16 2018-12-21 大连民族大学 Printed Manchu word recognition method based on prefix grouping
CN109389182A (en) * 2018-10-31 2019-02-26 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN112639963A (en) * 2020-03-19 2021-04-09 深圳市大疆创新科技有限公司 Audio acquisition device, audio receiving device and audio processing method
WO2021184315A1 (en) * 2020-03-19 2021-09-23 深圳市大疆创新科技有限公司 Audio acquisition apparatus, audio receiving apparatus, and audio processing method
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device
CN112006697A (en) * 2020-06-02 2020-12-01 东南大学 Gradient boosting decision tree depression recognition method based on voice signals
CN112002348A (en) * 2020-09-07 2020-11-27 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112002348B (en) * 2020-09-07 2021-12-28 复旦大学 Method and system for recognizing speech anger emotion of patient
CN112037762A (en) * 2020-09-10 2020-12-04 中航华东光电(上海)有限公司 Chinese-English mixed speech recognition method
CN112489688A (en) * 2020-11-09 2021-03-12 浪潮通用软件有限公司 Neural network-based emotion recognition method, device and medium

Also Published As

Publication number Publication date
CN102737629B (en) 2014-12-03

Similar Documents

Publication Publication Date Title
CN102737629B (en) Embedded type speech emotion recognition method and device
CN110853618B (en) Language identification method, model training method, device and equipment
WO2021051544A1 (en) Voice recognition method and device
CN102332263B (en) Speaker recognition method based on an emotion model synthesized using the nearest-neighbor principle
CN103871426A (en) Method and system for comparing similarity between user audio and original audio
CN102723078A (en) Emotion speech recognition method based on natural language comprehension
CN108564942A (en) Sensitivity-adjustable speech emotion recognition method and system
CN105938399B (en) Acoustics-based text input recognition method for smart devices
CN105244042B (en) Speech emotion interaction device and method based on finite-state automata
CN101923855A (en) Test-irrelevant voiceprint identification system
CN102820033A (en) Voiceprint identification method
CN109508402A (en) Violation term detection method and device
CN102855875B (en) Network voice call control system and method based on externally open control of voice input
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN107705802A (en) Voice conversion method and device, electronic equipment, and readable storage medium
CN103236258B (en) Bhattacharyya distance optimal wavelet packet decomposition-based speech emotion feature extraction method
CN113539232B (en) Speech synthesis method based on a MOOC speech data set
CN109243492A (en) Speech emotion recognition system and recognition method
CN108564965A (en) Anti-noise speech recognition system
CN106504772A (en) Speech emotion recognition method based on an importance-weighted support vector machine classifier
CN102664010A (en) Robust speaker identification method based on multifactor frequency-shift invariant features
CN107221344A (en) Speech emotion transfer method
CN108899033A (en) Method and device for determining speaker characteristics
CN104952446A (en) Digital building presentation system based on voice interaction
CN105070300A (en) Speech emotion feature selection method based on speaker standardization transformation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: No. 18 Tang Road, Tangshan Street, Jiangning District, Nanjing, Jiangsu Province, 211131

Applicant after: Southeast University

Address before: No. 2 Southeast University Road, Jiangning Development Zone, Jiangsu Province, 211189

Applicant before: Southeast University

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141203

Termination date: 20171111