A harmonic feature extraction method for speaker-independent speech emotion recognition
Technical field
The present invention relates to an audio signal processing method, and in particular to a harmonic feature extraction method for speaker-independent speech emotion recognition.
Background technology
With the continuous development of pattern recognition and affective computing, using computers to automatically identify a speaker's emotional state and its changes from the speech signal, namely speech emotion recognition, has attracted the attention of many scholars. Speech is an important medium of human communication and the basic way of transmitting information between people. The speech signal conveys not only semantic content but also rich emotional information.
Research on speech emotion recognition has important practical significance for making computers more intelligent and humane, for developing new human-machine environments, and for promoting disciplines such as psychology. At present, speech emotion recognition technology is bringing notable changes to people's work, study and life. In education, it is applied to real-time online teaching to strengthen teaching effectiveness and improve teaching quality; in entertainment, emotion interaction techniques can construct anthropomorphic styles and lifelike game scenes; in industry, intelligent household appliances, mobile phones, automobiles and the like can understand our emotions and respond to them, providing quality services for our work and life; in medicine, changes in the emotions of patients with certain illnesses (such as mental disorders like depression and anxiety) and of elderly people living alone can be detected so that help can be offered. In addition, speech emotion recognition can play a significant role in information retrieval, network communication and other fields, so its range of application is very wide.
Feature extraction is the basis of speech emotion recognition: it extracts from the speech signal the essential characteristics that express speech emotion. Speech features can be divided into two classes, phonetic features and prosodic features, and researchers have tried many affective features. A large number of studies show that characteristic parameters commonly used in speech emotion recognition, such as fundamental frequency, formant coefficients, linear prediction coefficients and cepstral coefficients, are effective. How to find new speech features that are more expressive of personal characteristics and more robust remains a major unresolved issue in the field of speaker-independent speech emotion recognition.
In summary, acoustic features are the basis of speech signal analysis, and good acoustic features can reveal the essence of the speech signal. Although research on speech emotion recognition has made progress, it falls far short of society's requirements for practical use, mainly in the following respects:
(1) no simple acoustic characteristic parameters have yet been found that identify emotion reliably;
(2) at present, most speech emotion feature extraction methods exploit the short-time stationarity of the speech signal and assume that adjacent segments of the signal are mutually independent. Such feature extraction methods lose the dynamic characteristics of the speech signal.
Summary of the invention
In view of the above problems, the present invention proposes a speech emotion characteristic parameter extraction method based on a harmonic coefficient model, used for speaker-independent speech emotion recognition; the harmonic coefficient features can improve the recognition performance of speech emotion recognition.
The present invention is achieved as follows: a harmonic feature extraction method for speaker-independent speech emotion recognition, comprising the following steps:
Step 1: construct the harmonic coefficient model based on the Fourier series. A speech signal x(m) satisfies the Fourier series of formula (1),

x(m) = (1/N) Σ_{k=0}^{N-1} X(k) e^{j2πkm/N}    (1)

When the speech signal x(m) is stationary within a predetermined time period, for a speech signal x(m) of finite length N, i.e. an N-point discrete signal [x(0), ..., x(N-1)], the discrete Fourier transform produces the spectral signal [X(0), ..., X(N-1)]. The discrete Fourier transform is defined as formula (2):

X(k) = Σ_{m=0}^{N-1} x(m) e^{-j2πkm/N},  k = 0, 1, 2, ..., N-1    (2)

Expressing the discrete Fourier transform as the linear system of formula (3),

X = Wx    (3)

constitutes the harmonic coefficient model of the speech signal, where the transition matrix is

W = [e^{-j2πkm/N}],  k, m = 0, 1, ..., N-1,

X(k), k = 0, ..., N-1, are the harmonic coefficients, and K is the harmonic order;
Step 2: extract the characteristic parameters based on the harmonic coefficient model;
First, harmonic coefficient feature extraction: the speech signal x(m) is divided into frames with a frame length of 16 ms and a frame shift of 8 ms; according to the harmonic coefficient model, the harmonic coefficients of each frame are calculated, and the harmonic coefficients of the speech signal are arranged as

X(N, I) = [X(0,1) X(1,1) ... X(N-1,1); ...; X(0,i) X(1,i) ... X(N-1,i)]    (4)

where i is the frame index. According to formula (4), each harmonic coefficient of the speech signal x(m) is accumulated over all frames, and its maximum, minimum, median, mean and variance are calculated, yielding the global feature vector of the speech signal as formula (5);
Second, harmonic coefficient difference feature extraction: the harmonic coefficients obtained in the harmonic coefficient feature extraction step are subjected to first-order and second-order difference operations according to formula (6),

ΔX(k, i) = X(k, i+1) - X(k, i),  Δ²X(k, i) = ΔX(k, i+1) - ΔX(k, i)    (6)

yielding the dynamic harmonic coefficient sequences of the speech signal; likewise, the statistics of the first-order and second-order differences are calculated according to formula (5), yielding the global dynamic feature vector of the speech signal;
Step 3: the feature vectors extracted in step 2 are fed as input data to a support vector machine (SVM) classifier, and speaker-independent speech emotion recognition experiments are carried out;
Step 4: through training and testing, the effect of the harmonic coefficient features on speaker-independent speech emotion recognition is output.
As a further improvement of the above scheme, in step 3, for a given training set (x_i, y_i), i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {+1, -1}, the optimal hyperplane ω·x + b = 0 is obtained by minimizing formula (7),

min_{ω,b,ξ} (1/2)||ω||² + C Σ_{i=1}^{n} ξ_i,  subject to y_i(ω·x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0    (7)

where ξ_i are slack variables and the parameter C is introduced to balance the complexity of the system against the misclassification rate.
Preferably, in step 3, the decision function of the quadratic optimization is defined as

f(x) = sgn( Σ_{i=1}^{n} α_i y_i K(x_i, x) + b* )

where K is the kernel function, x_i are the support vectors corresponding to the Lagrange multipliers α_i, n is the number of support vectors, and b* is the bias parameter.
As a further improvement of the above scheme, in step 2, the extraction of characteristic parameters based on the harmonic coefficient model is divided into four stages: sampling and quantization, pre-emphasis, windowing, and harmonic feature extraction. First the speech signal is sampled and quantized, converting the analog signal into a digital signal; then the high-frequency part of the speech signal is boosted to smooth its spectrum, realizing pre-emphasis of the speech signal, where the pre-emphasis adopts a digital filter with the Z transfer function H(z) = 1 - 0.95 z^{-1}, and the window function adopts a Hamming window.
Applying the harmonic coefficient features of speech to speaker-independent speech emotion recognition greatly improves the recognition rate. Compared with the prior art, the beneficial effects of the present invention are as follows: the present invention proposes a harmonic coefficient model of speech and extracts the harmonic coefficient features of speech, including local features and global features, which are applied to speaker-independent speech emotion recognition; compared with traditional features, these speech features greatly improve the performance of speech emotion recognition. Applied to intelligent appliances, medical assistance, safety detection and the like, this technology can provide humanized, emotion-aware services and products.
Description of drawings
Fig. 1 is a structural diagram of the speaker-independent speech emotion recognition module provided by a preferred embodiment of the present invention.
Fig. 2 shows the basic flow of feature extraction according to the present invention.
Embodiment
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further described below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
The flow of the harmonic feature extraction method for speaker-independent speech emotion recognition of the present invention comprises the main modules of harmonic-coefficient-model-based feature extraction, support-vector-machine-based model training, and recognition output, as shown in Fig. 1. The method comprises the following steps: (1) constructing a harmonic coefficient model based on the Fourier series; (2) extracting the harmonic coefficient characteristic parameters of the speech signal according to the model constructed in step (1); (3) forming feature vectors from the characteristic parameters extracted in step (2), feeding the feature vectors as input data to a support vector machine (SVM) classifier, and carrying out speaker-independent speech emotion recognition experiments; (4) outputting, through training and testing, the effect of the harmonic coefficient features on speaker-independent speech emotion recognition.
Feature extraction is divided into four stages: sampling and quantization, pre-emphasis, windowing, and harmonic feature extraction, as shown in Fig. 2. First the speech signal is sampled and quantized, converting the analog signal into a digital signal. Because the speech signal is affected by glottal excitation and mouth-nose radiation, its high-frequency part needs to be boosted so that the spectrum becomes smooth; this is the pre-emphasis of the speech signal. Pre-emphasis adopts a digital filter with the Z transfer function H(z) = 1 - 0.95 z^{-1}, and the window function adopts a Hamming window.
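The preprocessing stages above can be sketched in a few lines; the sampling rate and the helper names below are assumptions chosen for the illustration, not part of the invention:

```python
import math

def preemphasis(x, a=0.95):
    # y(m) = x(m) - a*x(m-1), i.e. the filter H(z) = 1 - 0.95 z^-1
    return [x[0]] + [x[m] - a * x[m - 1] for m in range(1, len(x))]

def hamming(n):
    # Hamming window: w(i) = 0.54 - 0.46*cos(2*pi*i/(n-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(x, fs, frame_ms=16, shift_ms=8):
    # Split into 16 ms frames with an 8 ms shift, applying a Hamming window
    flen = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    win = hamming(flen)
    return [[x[s + i] * win[i] for i in range(flen)]
            for s in range(0, len(x) - flen + 1, shift)]
```

At a sampling rate of 1000 Hz, for instance, a 40-sample signal yields four overlapping 16-sample frames.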
(1) The construction of the harmonic coefficient model based on the Fourier series is specifically implemented as follows:
A speech signal x(m) can be written in the mathematical form of formula (1), which is called the Fourier series of x(m):

x(m) = (1/N) Σ_{k=0}^{N-1} X(k) e^{j2πkm/N}    (1)

Assuming that the speech signal is stationary within a 10-30 ms time period, the N-point discrete signal [x(0), ..., x(N-1)] yields, after the discrete Fourier transform, the spectral signal [X(0), ..., X(N-1)]. For a speech signal x(m) of finite length N, the discrete Fourier transform is defined as follows:

X(k) = Σ_{m=0}^{N-1} x(m) e^{-j2πkm/N},  k = 0, 1, ..., N-1    (2)

The discrete Fourier transform can be expressed as the linear system

X = Wx    (3)

where the transition matrix is

W = [e^{-j2πkm/N}],  k, m = 0, 1, ..., N-1.

Here X(k), k = 0, ..., N-1, are the harmonic coefficients, and K is the harmonic order.
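As a sketch, the linear system X = Wx of formula (3) can be computed directly from the definition; the function names are chosen for this illustration:

```python
import cmath

def dft_matrix(n):
    # Transition matrix W with entries W[k][m] = exp(-j*2*pi*k*m/n)
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def harmonic_coefficients(x):
    # Harmonic coefficients of one frame via the linear system X = W x
    n = len(x)
    w = dft_matrix(n)
    return [sum(w[k][m] * x[m] for m in range(n)) for k in range(n)]
```

For a constant frame all energy lands in X(0), matching the DFT of formula (2) term by term.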
(2) Extracting characteristic parameters based on the harmonic coefficient model
1. Harmonic coefficient feature extraction
Based on the short-time stationarity of the speech signal, the speech signal is divided into frames with a frame length of 16 ms and a frame shift of 8 ms. From the harmonic coefficient model of (1), the harmonic coefficients of each frame are calculated, and the harmonic coefficients of the speech signal are arranged as

X(N, I) = [X(0,1) X(1,1) ... X(N-1,1); ...; X(0,i) X(1,i) ... X(N-1,i)]    (4)

where i is the frame index and N is the number of harmonic coefficients. According to formula (4), each harmonic coefficient of the speech signal is accumulated over all frames, and its maximum, minimum, median, mean and variance are calculated, yielding the global feature vector of the speech signal, formula (5).
2. Harmonic coefficient difference feature extraction
Feature vector differences are used to capture the continuous dynamic trajectory of the speech feature vectors: the first-order difference of a feature vector gives its rate of change, and the second-order difference gives the acceleration of that change. The harmonic coefficients obtained in 1. are subjected to first-order and second-order difference operations according to formula (6),

ΔX(k, i) = X(k, i+1) - X(k, i),  Δ²X(k, i) = ΔX(k, i+1) - ΔX(k, i)    (6)

yielding the dynamic harmonic coefficient sequences of the speech signal. Likewise, the statistics of the first-order and second-order differences are calculated according to formula (5), yielding the global dynamic feature vector of the speech signal.
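A minimal sketch of the statistics of formula (5) and the differences of formula (6), assuming the per-frame harmonic magnitudes are already available; the helper names are illustrative:

```python
import statistics

def deltas(seq):
    # First-order differences, formula (6): d(i) = X(i+1) - X(i)
    return [seq[i + 1] - seq[i] for i in range(len(seq) - 1)]

def global_stats(seq):
    # Formula (5): maximum, minimum, median, mean and variance over all frames
    return [max(seq), min(seq), statistics.median(seq),
            statistics.mean(seq), statistics.pvariance(seq)]

def harmonic_features(coeff_frames):
    # coeff_frames[i][k]: magnitude of harmonic k in frame i.
    # Returns static, delta and delta-delta statistics for every harmonic.
    feats = []
    for k in range(len(coeff_frames[0])):
        traj = [frame[k] for frame in coeff_frames]
        d1 = deltas(traj)
        feats += global_stats(traj) + global_stats(d1) + global_stats(deltas(d1))
    return feats
```

Each harmonic thus contributes fifteen values: five static statistics, five delta statistics and five delta-delta statistics.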
(3) Speaker-independent speech emotion recognition based on harmonic coefficient features
The feature vectors obtained in 1. and 2. are used as the input of the support vector machine; the support vector machine model is trained and established, and the recognition results are output. The specific procedure is as follows:
For a given training set (x_i, y_i), i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {+1, -1}, the optimal hyperplane ω·x + b = 0 can be obtained by minimizing formula (7),

min_{ω,b,ξ} (1/2)||ω||² + C Σ_{i=1}^{n} ξ_i,  subject to y_i(ω·x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0    (7)

where ξ_i are slack variables and the parameter C is introduced to balance the complexity of the system against the misclassification rate. Solving this quadratic optimization problem, the decision function is defined as

f(x) = sgn( Σ_{i=1}^{n} α_i y_i K(x_i, x) + b* )

where K is the kernel function, x_i are the support vectors corresponding to the Lagrange multipliers α_i, n is the number of support vectors, and b* is the bias parameter. For a linear support vector machine a linear kernel suffices; for a nonlinear support vector machine, a nonlinear kernel maps the data into a high-dimensional feature space in which the optimal hyperplane exists.
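The decision function above can be sketched as follows; the toy support vectors, multipliers and bias in the test are invented for illustration (in practice they are produced by training), and an RBF kernel stands in for K:

```python
import math

def rbf(u, v, gamma=0.5):
    # RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf):
    # Decision function f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b* )
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1
```

A point near a positive support vector is classified +1, and a point near a negative one is classified -1, since the RBF kernel weights nearby support vectors most heavily.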
(4) Experimental results
The experiments of the present invention were conducted on the Berlin emotion corpus. The Berlin emotional speech database was recorded by the working group of Professor W. Sendlmeier at the Technical University of Berlin; it contains emotional utterances from 5 male and 5 female speakers, and the emotional states comprise seven categories: sadness, anger, fear, disgust, happiness, boredom and neutral. The experiments adopt a cross-validation method. Taking 40 harmonic coefficients of the speech as characteristic parameters, speaker-independent emotion recognition achieves a recognition rate of 76.3%, an improvement of 5.4% over traditional features such as energy, formants, zero-crossing rate and fundamental frequency. The confusion matrix is given in Table 1.
Table 1: emotion recognition confusion matrix
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.