CN103730121A - Method and device for recognizing disguised sounds - Google Patents


Info

Publication number
CN103730121A
CN103730121A
Authority
CN
China
Prior art keywords
speaker
coefficient
probability
conversion
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310728591.XA
Other languages
Chinese (zh)
Other versions
CN103730121B (en)
Inventor
王泳
黄继武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Sun Yat Sen University
Original Assignee
Shenzhen University
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University, Sun Yat Sen University filed Critical Shenzhen University
Priority to CN201310728591.XA priority Critical patent/CN103730121B/en
Publication of CN103730121A publication Critical patent/CN103730121A/en
Application granted granted Critical
Publication of CN103730121B publication Critical patent/CN103730121B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method and device for recognizing disguised voices. In the method, the voice conversion coefficient is estimated from the fundamental frequency of the speech, and the Mel-frequency cepstral coefficient (MFCC) extraction algorithm is improved: the estimated coefficient is incorporated into the MFCC extraction through linear-interpolation stretching, so that the MFCC of the speech before conversion can be approximately computed. This is then embedded in a GMM-UBM recognition framework to compute the similarity between voices; at the same time, the converted voice can be restored to the original voice using the estimated conversion coefficient. Compared with conventional recognition and forensics methods, the method and device achieve a large improvement in recognition performance, with both the miss rate and the false-alarm rate lower than those of conventional schemes.

Description

A method and device for recognizing disguised voice
Technical field
The present invention relates to the field of multimedia information security, and more particularly to a method and device for recognizing disguised voice.
Background technology
Voice transformation (speech conversion) is one of the most commonly used speech-processing operations. Its function is to change one voice into a completely different yet natural-sounding voice. Voice transformation is generally used for music production or for protecting a speaker's safety and privacy, but it may also be used by criminals to disguise their voice so that their identity cannot be recognized. Speaker identification of transformed speech therefore has important application value.
The general steps of voice transformation are as follows:
1) Divide the signal x(n) into frames and apply a window:
F(k) = Σ_{n=0}^{N-1} x(n)·w(n)·e^{-j(2π/N)·k·n},  0 ≤ n < N   (1)
2) Compute the instantaneous amplitude:
|F(k)| = |Σ_{n=0}^{N-1} x(n)·w(n)·e^{-j(2π/N)·k·n}|,  0 ≤ n < N   (2)
3) Compute the instantaneous frequency from the phase relation between this frame and the previous frame:
ω(k) = (k + Δ)·F_s/N   (3)
where F_s is the sampling frequency and Δ is the frequency deviation relative to the bin's center frequency.
4) Spectrum stretching. First, the instantaneous amplitude is linearly interpolated:
|F(k′)|=μ|F(k)|+(1-μ)|F(k+1)| 0≤k<N/2 0≤k′<N/2 (4)
k = ⌊k′/α⌋   (5)
μ=k′/α-k (6)
Where this causes no confusion, the interpolated instantaneous amplitude is still denoted |F(k)|.
Then the frequency lines are shifted:
ω′(k*α)=ω(k)*α 0≤k<N/2 0≤k*α<N/2 (7)
Where this causes no confusion, the shifted instantaneous frequency is still denoted ω(k).
5) Compute the instantaneous phase φ(k) from the instantaneous frequency, and obtain the FFT coefficients after voice transformation:
F(k) = |F(k)|·e^{jφ(k)}   (8)
6) Apply the inverse FFT to F(k) to obtain the transformed speech signal. A short sketch of the stretching step is given below.
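For concreteness, the spectrum-stretching step of equations (4)-(6) can be written as a short sketch operating on the magnitude spectrum of one frame. This is an illustration only; the function name and the example semitone shift are not part of the original description.

```python
import numpy as np

def stretch_magnitude(mag, alpha):
    """Stretch one frame's FFT magnitude spectrum by factor alpha
    using the linear interpolation of equations (4)-(6)."""
    half = len(mag) // 2
    out = np.zeros_like(mag)
    for k_prime in range(half):
        k = int(k_prime / alpha)           # eq. (5): k = floor(k'/alpha)
        mu = k_prime / alpha - k           # eq. (6)
        if k + 1 < half:
            out[k_prime] = mu * mag[k] + (1 - mu) * mag[k + 1]   # eq. (4)
        elif k < half:
            out[k_prime] = mag[k]
    return out

# usage on one windowed frame x_w of 1024 samples:
# mag = np.abs(np.fft.fft(x_w))
# mag_up = stretch_magnitude(mag, alpha=2 ** (4 / 12))   # shift up by 4 semitones
```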
The MFCC extraction process is shown in Fig. 1. The specific steps are as follows:
1) Windowing and computing the spectrum.
MFCC extraction uses a Hamming window of N = 1024 points:
w(n) = 0.53836 − 0.46164·cos(2πn/(N−1)),  0 ≤ n < N   (9)
The FFT is applied to the windowed source signal x(n):
F(k) = Σ_{n=0}^{N-1} x(n)·w(n)·e^{-j(2π/N)·k·n},  0 ≤ n < N   (10)
2) Mel segmentation (triangular filtering) and log transform:
Triangular weighting windows are used, given by:
H_m(k) = 0,                                  k < k_{m−1}
       = (k − k_{m−1}) / (k_m − k_{m−1}),    k_{m−1} ≤ k ≤ k_m
       = (k_{m+1} − k) / (k_{m+1} − k_m),    k_m < k ≤ k_{m+1}
       = 0,                                  k > k_{m+1}        (11)
where k_m = f(m)·N/F_s and F_s is the sampling frequency.
After weighting the FFT energy spectrum with the triangular windows, the logarithm is taken:
Y(m) = log[Σ_{k=0}^{N-1} |F(k)|²·H_m(k)],  1 ≤ m ≤ M   (12)
3) Inverse cosine transform
Finally, the inverse cosine transform yields the Mel cepstral coefficients, i.e., the MFCC:
MFCC(n) = (1/M)·Σ_{m=1}^{M} Y(m)·cos(n·(m − 0.5)·π/M),  1 ≤ m ≤ M, 0 ≤ n ≤ N−1   (13)
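The steps (9)-(13) can be condensed into a per-frame sketch. The placement of the Mel filter edges, the clamping of the edges to the half-spectrum, and the small constant inside the logarithm are implementation assumptions not stated above.

```python
import numpy as np

def mfcc_frame(frame, fs, n_mel=24, n_coef=12):
    """Per-frame MFCC following equations (9)-(13)."""
    N = len(frame)
    n = np.arange(N)
    w = 0.53836 - 0.46164 * np.cos(2 * np.pi * n / (N - 1))    # eq. (9)
    power = np.abs(np.fft.fft(frame * w)[:N // 2]) ** 2        # eq. (10)

    # Mel-spaced triangular filter bank, eq. (11)
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = mel_inv(np.linspace(0, mel(fs / 2), n_mel + 2))
    k_edges = np.minimum(np.floor(edges * N / fs).astype(int), N // 2 - 1)

    Y = np.zeros(n_mel)
    for m in range(1, n_mel + 1):
        lo, cen, hi = k_edges[m - 1], k_edges[m], k_edges[m + 1]
        H = np.zeros(N // 2)
        H[lo:cen + 1] = (np.arange(lo, cen + 1) - lo) / max(cen - lo, 1)
        H[cen:hi + 1] = (hi - np.arange(cen, hi + 1)) / max(hi - cen, 1)
        Y[m - 1] = np.log(np.sum(power * H) + 1e-12)           # eq. (12)

    m_idx = np.arange(1, n_mel + 1)                            # eq. (13): DCT
    return np.array([np.sum(Y * np.cos(c * (m_idx - 0.5) * np.pi / n_mel))
                     for c in range(n_coef)]) / n_mel
```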
GMM-UBM
Speaker recognition can be regarded as a test between two hypotheses:
H_0: Y comes from speaker S;
H_1: Y does not come from speaker S.
Mathematically, H_0 is represented by the model λ_hyp of speaker S and H_1 by the universal background model λ_bkg. The probability is computed as in equation (14):
Λ(X) = log p(X | λ_hyp) − log p(X | λ_bkg),  accept H_0 if Λ(X) ≥ θ, otherwise accept H_1   (14)
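As a rough illustration of GMM-UBM scoring, the sketch below trains a background model and a speaker model with scikit-learn and computes the average log-likelihood ratio of equation (14). The component count, the use of plain EM instead of MAP adaptation for the speaker model, and the random placeholder feature matrices are simplifications made only to keep the example self-contained.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
bkg_mfcc = rng.normal(size=(5000, 24))   # placeholder pooled background features
spk_mfcc = rng.normal(size=(1000, 24))   # placeholder features of speaker j

ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(bkg_mfcc)
gmm_j = GaussianMixture(n_components=64, covariance_type="diag").fit(spk_mfcc)

def llr_score(X, gmm_spk, gmm_ubm):
    """Equation (14): average log-likelihood ratio of the test features X."""
    return np.mean(gmm_spk.score_samples(X) - gmm_ubm.score_samples(X))

# accept the hypothesis "X was spoken by speaker j" when the score reaches theta:
# accept = llr_score(X, gmm_j, ubm) >= theta
```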
With the widespread use of audio technology, the protection of audio products has become a research hotspot in the field of information security, and audio forensics is one of its important branches. Speaker identification of transformed speech has important application value in judicial, commercial, and other settings. Experimental results show that, after a voice has undergone a large transformation, conventional speaker recognition schemes suffer from high miss and false-alarm rates and recognition fails completely.
Summary of the invention
The primary purpose of the present invention is to propose a method for recognizing disguised voice. The method can identify the speaker of a transformed audio product, which is of great value for speaker identification after voice transformation.
Another object of the present invention is to propose a device for recognizing disguised voice.
To overcome the deficiencies of the prior art, the technical solution adopted by the present invention is as follows:
A method for recognizing disguised voice, the method comprising:
in the training stage, computing a universal background model (UBM) λ_bkg from a background speech corpus using the expectation-maximization (EM) algorithm;
in the training stage, extracting the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, computing the Gaussian mixture model (GMM) λ_j of speaker j using the maximum a posteriori (MAP) algorithm, and computing the mean fundamental frequency f_j; building the model V_j = (λ_j, f_j) of speaker j and storing it in a model database;
in the training stage, obtaining a threshold θ. The threshold θ is obtained as follows: client scores and impostor scores are computed, and the threshold θ is selected from the distributions of these two classes of scores so as to achieve the miss rate and false-alarm rate required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under other speakers' models;
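One common way to pick θ from the two score distributions is the equal-error-rate operating point; a sketch is given below, assuming the client and impostor scores are available as arrays. The invention only requires that θ meet the application's miss and false-alarm targets, so any other operating point can be chosen the same way.

```python
import numpy as np

def choose_threshold(client_scores, impostor_scores):
    """Pick the threshold at which the miss rate (clients rejected) is
    closest to the false-alarm rate (impostors accepted), i.e. the EER point."""
    candidates = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_theta, best_gap = candidates[0], np.inf
    for theta in candidates:
        frr = np.mean(client_scores < theta)      # miss rate
        far = np.mean(impostor_scores >= theta)   # false-alarm rate
        if abs(frr - far) < best_gap:
            best_theta, best_gap = theta, abs(frr - far)
    return best_theta
```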
at the test stage, the speech Y being transformed speech, extracting the mean fundamental frequency f_Y of the speech Y; computing the conversion coefficient from f_Y/f_j; computing the original MFCC coefficients X of Y before conversion using the modified MFCC extraction algorithm; and obtaining, through the GMM-UBM based probability estimation algorithm, the probability Λ(X) that Y matches the model V_j;
comparing the probability Λ(X) with the threshold θ: if the probability is greater than the threshold θ, the speech Y is a segment spoken by speaker j; otherwise, the speech Y was not spoken by j;
wherein the modified MFCC extraction algorithm is specifically: after the windowing and the FFT in the MFCC extraction algorithm, the amplitude |F(k)| of the FFT coefficients is stretched by linear interpolation to obtain |F(k′)|, as shown in the following formulas:
|F(k′)|=μ|F(k)|+(1-μ)|F(k+1)| 0≤k<N/2 0≤k′<N/2
k = ⌊k′/(1/α′)⌋
μ=k′/(1/α′)-k
where 1/α′ is the inverse of the estimated conversion coefficient, α′ is the estimated conversion coefficient, and α′ = f_Y/f_j.
In a preferred scheme, the fundamental frequency is extracted as follows:
(1) the signal is windowed, and the signal of a predetermined length before and after an arbitrary time t_mid is taken;
(2) the autocorrelation function of the signal of said predetermined length and the autocorrelation function of the window function are computed;
(3) the two autocorrelation functions are divided; the location of the maximum gives the period T, from which the fundamental frequency F at time t_mid is obtained.
In a preferred scheme, the mean fundamental frequency is mean(F), where mean() denotes averaging.
In a preferred scheme, when α′ > 1, spectrum compensation is required. Let the Nyquist frequency be F_N. The compensation consists of symmetrically copying the spectrum between F_N/α′ − F_N/2 and F_N/(2α′) into the range from F_N/(2α′) to F_N/2. The effect of this compensation is to approximately restore the amplitudes in the frequency band from F_N/(2α′) to F_N/2, so that the restored MFCC is closer to the original MFCC.
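A sketch of one possible reading of this compensation follows: the band just below the cut-off bin is mirrored into the empty band above it. Working with bin indices instead of frequencies and mirroring about the cut-off bin are this sketch's assumptions.

```python
import numpy as np

def compensate_spectrum(mag, alpha_prime):
    """For alpha' > 1, fill the empty band [F_N/(2*alpha'), F_N/2] left by the
    restoring stretch by mirroring the band just below the cut-off bin."""
    half = len(mag) // 2
    cut = int(half / alpha_prime)      # bin corresponding to F_N/(2*alpha')
    for k in range(cut, half):
        src = 2 * cut - k              # mirror image about the cut-off bin
        if 0 <= src < cut:
            mag[k] = mag[src]
    return mag
```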
A device for recognizing disguised voice, comprising:
a training module, configured to compute a universal background model (UBM) λ_bkg from a background speech corpus using the expectation-maximization (EM) algorithm; to extract the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, compute the Gaussian mixture model (GMM) λ_j of speaker j using the maximum a posteriori (MAP) algorithm, and compute the mean fundamental frequency f_j; to build the model V_j = (λ_j, f_j) of speaker j and store it in a model database; and to obtain a threshold θ in the training stage;
wherein the threshold θ is obtained as follows: client scores and impostor scores are computed, and the threshold θ is selected from the distributions of these two classes of scores so as to achieve the miss rate and false-alarm rate required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under other speakers' models;
a test module, configured to, when the speech Y is transformed speech, extract its mean fundamental frequency f_Y, compute the conversion coefficient from f_Y/f_j, compute the original MFCC coefficients X of Y before conversion using the modified MFCC extraction algorithm, and obtain, through the GMM-UBM based probability estimation algorithm, the probability Λ(X) that Y matches the model V_j;
an identification module, configured to compare the probability Λ(X) with the threshold θ: if the probability is greater than the threshold θ, the speech Y is a segment spoken by speaker j; otherwise, the speech Y was not spoken by j;
wherein the modified MFCC extraction algorithm in the test module is specifically implemented as follows:
after the windowing and the FFT in the MFCC extraction algorithm, the amplitude |F(k)| of the FFT coefficients is stretched by linear interpolation to obtain |F(k′)|, as shown in the following formulas:
|F(k′)|=μ|F(k)|+(1-μ)|F(k+1)| 0≤k<N/2 0≤k′<N/2
μ=k′/(1/α′)-k
where 1/α′ is the inverse of the estimated conversion coefficient, α′ is the estimated conversion coefficient, and α′ = f_Y/f_j.
Compared with the prior art, the beneficial effects of the present invention are: the invention achieves a large improvement in recognition performance over conventional recognition and forensics methods. The conversion coefficient is estimated from the mean fundamental frequency, the MFCC extraction algorithm is improved so that the MFCC features of the speech before conversion are computed directly, and the GMM-UBM based probability calculation is used to decide whether the test speech was spoken by a given target speaker; both the miss rate and the false-alarm rate are lower than those of conventional schemes.
Brief description of the drawings
Fig. 1 is a schematic diagram of the extraction process of the Mel frequency cepstral coefficients.
Fig. 2 is a schematic comparison of the estimated conversion coefficients with the true conversion coefficients (true coefficient α(k) = 2^{k/12}, estimated coefficient α′(y) = 2^{y/12}).
Fig. 3 shows the EER curves for the four frequency-domain disguise methods.
Fig. 4 shows the DET curves for the four frequency-domain methods.
Fig. 5 shows the EER curve for TD-PSOLA.
Fig. 6 shows the DET curve for TD-PSOLA.
Embodiment
As shown in Figs. 3-6, in the training stage the present invention computes the UBM (universal background model) λ_bkg from a background speech corpus using the EM (Expectation Maximization) algorithm. Also in the training stage, the MFCC coefficients and the fundamental frequency of the test speech S_j of speaker j are extracted, the GMM (Gaussian Mixture Model) λ_j of speaker j is computed using the MAP (Maximum A Posteriori) algorithm, and the mean fundamental frequency f_j is computed. The model V_j = (λ_j, f_j) of speaker j is built and stored in a model database, and a threshold θ is obtained. At the test stage, the speech Y is the transformed speech; its mean fundamental frequency f_Y is extracted, the conversion coefficient is computed from f_Y/f_j, and the original MFCC coefficients X of Y before conversion are computed with the modified MFCC extraction algorithm. The probability Λ(X) that Y matches the model V_j is then obtained through the GMM-UBM based probability estimation algorithm. If the probability is greater than the threshold θ, the speech Y is identified as a segment spoken by j; otherwise, Y is identified as not spoken by j.
The conversion coefficient is estimated as α′ = f_Y/f_j, where α′ is the estimated conversion coefficient and the mean fundamental frequency is obtained by averaging the fundamental frequency.
The fundamental frequency is extracted in the following steps (a sketch follows the list):
(1) the signal is windowed, and the signal of a predetermined length before and after an arbitrary time t_mid is taken;
(2) the autocorrelation function of the signal of said predetermined length and the autocorrelation function of the window function are computed;
(3) the two autocorrelation functions are divided; the location of the maximum gives the period T, from which the fundamental frequency F at time t_mid is obtained.
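A sketch of these three steps is given below; the Hamming window, the window length, and the F0 search range are assumptions added only for the example.

```python
import numpy as np

def estimate_f0(x, fs, t_mid, win_len=1024, f_min=60, f_max=400):
    """Autocorrelation-based F0 estimate around sample index t_mid."""
    half = win_len // 2
    seg = x[max(0, t_mid - half): t_mid + half]
    w = np.hamming(len(seg))
    seg = seg * w

    r_x = np.correlate(seg, seg, mode="full")[len(seg) - 1:]   # signal ACF
    r_w = np.correlate(w, w, mode="full")[len(w) - 1:]         # window ACF
    r = r_x / np.maximum(r_w, 1e-12)                           # step (3): divide

    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    lag = lag_min + np.argmax(r[lag_min:lag_max])              # period T in samples
    return fs / lag

# mean fundamental frequency over several analysis instants:
# f_mean = np.mean([estimate_f0(x, fs, t) for t in range(1024, len(x) - 1024, 512)])
```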
The modified extraction algorithm works as follows: after the windowing and FFT steps of the Mel frequency cepstral coefficient extraction algorithm, the amplitude |F(k)| of the FFT coefficients is stretched by linear interpolation to obtain |F(k′)|, as given by the following formulas:
|F(k′)| = μ|F(k)| + (1−μ)|F(k+1)|  0 ≤ k < N/2  0 ≤ k′ < N/2
k = ⌊k′/(1/α′)⌋
μ=k′/(1/α′)-k
where the linear-interpolation stretch factor 1/α′ is the inverse of the estimated conversion coefficient.
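This restoring interpolation is the inverse of the conversion-stage stretch shown earlier; a minimal sketch with illustrative names follows. The MFCC of the original voice is then approximated by feeding the restored magnitudes into the remaining MFCC steps.

```python
import numpy as np

def restore_magnitude(mag, alpha_est):
    """Approximate the pre-conversion magnitude spectrum by stretching the
    converted spectrum with factor 1/alpha_est (alpha_est = f_Y / f_j)."""
    half = len(mag) // 2
    restored = np.zeros_like(mag)
    for k_prime in range(half):
        k = int(k_prime * alpha_est)          # k = floor(k' / (1/alpha'))
        mu = k_prime * alpha_est - k
        if k + 1 < half:
            restored[k_prime] = mu * mag[k] + (1 - mu) * mag[k + 1]
        elif k < half:
            restored[k_prime] = mag[k]
    return restored
```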
The matching calculation uses the GMM-UBM based probability calculation method. Matching calculation refers to computing the probability of a speech segment under a given model; this probability reflects how likely it is that the segment was spoken by the speaker represented by that model.
The speech corpus and some experimental results obtained with the method of the invention are given below.
The speech corpus is TIMIT, the most commonly used corpus in speech and speaker recognition. It contains 630 speakers from 8 different regions, 192 female and 438 male. Each speaker reads 10 different utterances, for a total of 6300 utterances. All speech is in WAV format with 8 kHz sampling rate and 16-bit quantization. In this experiment TIMIT is divided into three sub-corpora:
1) UBM corpus: all speech segments of 60 male and 60 female speakers are concatenated to train the UBM.
2) Score-normalization corpus: speech segments of 40 female and 90 male speakers are used for score normalization (TNorm).
3) Development-evaluation corpus: 92 female and 288 male speakers. For each speaker j, 5 of the segments are concatenated into one segment used to train the 2048-component GMM model λ_j and to compute the mean fundamental frequency f_j; the remaining 5 segments are concatenated into another segment, to which disguise with different conversion coefficients is applied. The corpus used to train the speaker models is called the development corpus; the corpus used for disguise is called the evaluation corpus.
Five transformation tools (methods) are considered: the frequency-domain tools Adobe Audition, Audacity, GoldWave and RSITI, and the time-domain method TD-PSOLA. The conversion strength is specified in semitones (12 semitones per octave), and the conversion coefficient is related to the semitone shift k as follows:
α(k) = 2^{k/12}
In the experiments, only voice transformations with −11 ≤ k ≤ 11 are considered, because practical audio (voice) tools generally provide only −11 ≤ k ≤ 11.
The speech signal is pre-emphasized with the transfer function H(z) = 1 − 0.97z^{−1}.
The frame length is 1024 samples; the 24-dimensional MFCC feature consists of 12 MFCC coefficients and 12 ΔMFCC coefficients.
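A minimal preprocessing sketch combining the pre-emphasis filter and the 1024-sample framing is given below; the hop size is an assumption, since it is not specified above.

```python
import numpy as np

def preprocess(x, frame_len=1024, hop=512):
    """Pre-emphasis H(z) = 1 - 0.97 z^-1 followed by framing."""
    emphasized = np.append(x[0], x[1:] - 0.97 * x[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    return np.stack([emphasized[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])
```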
An experimental example of estimating the voice conversion coefficient is given below. The estimated conversion coefficients for the same speaker are averaged and compared with the true conversion coefficients in Fig. 2.
The recognition performance is given next. Fig. 3 shows the EER (Equal Error Rate), the operating point at which the miss rate (False Reject Rate, FRR) equals the false-alarm rate (False Alarm Rate, FAR). The overall EERs are listed in Table 1 and Table 2.
Table 1. Overall EER, |k| ≤ 11 (%)
Table 2. Overall EER, |k| ≤ 8 (%)
Fig. 4 shows the DET (Detection Error Tradeoff) curves. It can be seen that the performance of the conventional scheme (baseline) is completely destroyed by the various disguise methods; that is, a conventional speaker recognition system cannot correctly identify the speaker of disguised speech. The method of the present invention (proposed, with estimated scaling factor) greatly reduces the error probability and can to a large extent identify the speaker, reaching a level acceptable for many applications. The performance of the method when the conversion coefficient is exactly known is also given (this is the best performance the present invention can reach). As the charts show, the performance achieved by the present invention is very close to this optimum.
The method of the present invention also covers the identification of TD-PSOLA disguise. The results are shown in Fig. 5 and Fig. 6. Here the conventional scheme performs slightly better than the proposed method. However, because TD-PSOLA cannot preserve the auditory naturalness of the speech even at small conversion strengths, its range of application is small, and current application software in practice no longer uses this method.

Claims (5)

1. A method for recognizing disguised voice, characterized in that the method comprises:
in the training stage, computing a universal background model (UBM) λ_bkg from a background speech corpus using the expectation-maximization (EM) algorithm;
in the training stage, extracting the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, computing the Gaussian mixture model (GMM) λ_j of speaker j using the maximum a posteriori (MAP) algorithm, and computing the mean fundamental frequency f_j; building the model V_j = (λ_j, f_j) of speaker j and storing it in a model database;
in the training stage, obtaining a threshold θ, the threshold θ being obtained as follows: computing client scores and impostor scores, and selecting the threshold θ from the distributions of these two classes of scores so as to achieve the miss rate and false-alarm rate required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under other speakers' models;
at the test stage, the speech Y being transformed speech, extracting the mean fundamental frequency f_Y of the speech Y; computing the conversion coefficient from f_Y/f_j; computing the original MFCC coefficients X of Y before conversion using the modified MFCC extraction algorithm; and obtaining, through the GMM-UBM based probability estimation algorithm, the probability Λ(X) that Y matches the model V_j;
comparing the probability Λ(X) with the threshold θ: if the probability is greater than the threshold θ, the speech Y is a segment spoken by speaker j; otherwise, the speech Y was not spoken by j;
wherein the modified MFCC extraction algorithm is specifically: after the windowing and the FFT in the MFCC extraction algorithm, the amplitude |F(k)| of the FFT coefficients is stretched by linear interpolation to obtain |F(k′)|, as shown in the following formulas:
|F(k′)|=μ|F(k)|+(1-μ)|F(k+1)| 0≤k<N/2 0≤k′<N/2
k = ⌊k′/(1/α′)⌋
μ=k′/(1/α′)-k
where 1/α′ is the inverse of the estimated conversion coefficient, α′ is the estimated conversion coefficient, and α′ = f_Y/f_j.
2. The method for recognizing disguised voice according to claim 1, characterized in that the fundamental frequency is extracted as follows:
(1) the signal is windowed, and the signal of a predetermined length before and after an arbitrary time t_mid is taken;
(2) the autocorrelation function of the signal of said predetermined length and the autocorrelation function of the window function are computed;
(3) the two autocorrelation functions are divided; the location of the maximum gives the period T, from which the fundamental frequency F at time t_mid is obtained.
3. The method for recognizing disguised voice according to claim 2, characterized in that the mean fundamental frequency is mean(F), where mean() denotes averaging.
4. The method for recognizing disguised voice according to claim 1, characterized in that, when α′ > 1, spectrum compensation is performed; letting the Nyquist frequency be F_N, the compensation consists of symmetrically copying the spectrum between F_N/α′ − F_N/2 and F_N/(2α′) into the range from F_N/(2α′) to F_N/2.
5. A device for recognizing disguised voice, characterized by comprising:
a training module, configured to compute a universal background model (UBM) λ_bkg from a background speech corpus using the expectation-maximization (EM) algorithm; to extract the Mel cepstral coefficients (MFCC) and the fundamental frequency of the test speech S_j of speaker j, compute the Gaussian mixture model (GMM) λ_j of speaker j using the maximum a posteriori (MAP) algorithm, and compute the mean fundamental frequency f_j; to build the model V_j = (λ_j, f_j) of speaker j and store it in a model database; and to obtain a threshold θ in the training stage;
wherein the threshold θ is obtained as follows: computing client scores and impostor scores, and selecting the threshold θ from the distributions of these two classes of scores so as to achieve the miss rate and false-alarm rate required by the application, where a client score (Client Score) is the probability of a speaker's speech segment under that speaker's own model, and an impostor score (Imposter Score) is the probability of a speaker's speech segment under other speakers' models;
a test module, configured to, when the speech Y is transformed speech, extract its mean fundamental frequency f_Y, compute the conversion coefficient from f_Y/f_j, compute the original MFCC coefficients X of Y before conversion using the modified MFCC extraction algorithm, and obtain, through the GMM-UBM based probability estimation algorithm, the probability Λ(X) that Y matches the model V_j;
an identification module, configured to compare the probability Λ(X) with the threshold θ: if the probability is greater than the threshold θ, the speech Y is a segment spoken by speaker j; otherwise, the speech Y was not spoken by j;
wherein the modified MFCC extraction algorithm adopted in the test module is specifically: after the windowing and the FFT in the MFCC extraction algorithm, the amplitude |F(k)| of the FFT coefficients is stretched by linear interpolation to obtain |F(k′)|, as shown in the following formulas:
|F(k′)|=μ|F(k)|+(1-μ)|F(k+1)| 0≤k<N/2 0≤k′<N/2
k = ⌊k′/(1/α′)⌋
μ=k′/(1/α′)-k
where 1/α′ is the inverse of the estimated conversion coefficient, α′ is the estimated conversion coefficient, and α′ = f_Y/f_j.
CN201310728591.XA 2013-12-24 2013-12-24 Method and device for recognizing disguised sounds Expired - Fee Related CN103730121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310728591.XA CN103730121B (en) 2013-12-24 2013-12-24 Method and device for recognizing disguised sounds

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310728591.XA CN103730121B (en) 2013-12-24 2013-12-24 Method and device for recognizing disguised sounds

Publications (2)

Publication Number Publication Date
CN103730121A true CN103730121A (en) 2014-04-16
CN103730121B CN103730121B (en) 2016-08-24

Family

ID=50454168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310728591.XA Expired - Fee Related CN103730121B (en) 2013-12-24 2013-12-24 Method and device for recognizing disguised sounds

Country Status (1)

Country Link
CN (1) CN103730121B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005345599A (en) * 2004-06-01 2005-12-15 Toshiba Tec Corp Speaker-recognizing device, program, and speaker-recognizing method
CN1914667A (en) * 2004-06-01 2007-02-14 东芝泰格有限公司 Speaker recognizing device, program, and speaker recognizing method
CN1967657A (en) * 2005-11-18 2007-05-23 成都索贝数码科技股份有限公司 Automatic tracking and tonal modification system of speaker in program execution and method thereof
CN101399044A (en) * 2007-09-29 2009-04-01 国际商业机器公司 Voice conversion method and system
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104183245A (en) * 2014-09-04 2014-12-03 福建星网视易信息***有限公司 Method and device for recommending music stars with tones similar to those of singers
CN105976819A (en) * 2016-03-23 2016-09-28 广州势必可赢网络科技有限公司 Rnorm score normalization based speaker verification method
CN109215680B (en) * 2018-08-16 2020-06-30 公安部第三研究所 Voice restoration method based on convolutional neural network
CN109215680A (en) * 2018-08-16 2019-01-15 公安部第三研究所 A kind of voice restoration method based on convolutional neural networks
CN109741761A (en) * 2019-03-13 2019-05-10 百度在线网络技术(北京)有限公司 Sound processing method and device
CN109920435A (en) * 2019-04-09 2019-06-21 厦门快商通信息咨询有限公司 A kind of method for recognizing sound-groove and voice print identification device
CN110363406A (en) * 2019-06-27 2019-10-22 上海淇馥信息技术有限公司 Appraisal procedure, device and the electronic equipment of a kind of client intermediary risk
CN111462763A (en) * 2019-09-21 2020-07-28 美律电子(深圳)有限公司 Computer-implemented voice command verification method and electronic device
CN111462763B (en) * 2019-09-21 2024-02-27 美律电子(深圳)有限公司 Voice command verification method implemented by computer and electronic device
CN111739547A (en) * 2020-07-24 2020-10-02 深圳市声扬科技有限公司 Voice matching method and device, computer equipment and storage medium
CN112967712A (en) * 2021-02-25 2021-06-15 中山大学 Synthetic speech detection method based on autoregressive model coefficient
CN113270112A (en) * 2021-04-29 2021-08-17 中国人民解放军陆军工程大学 Electronic camouflage voice automatic distinguishing and restoring method and system
CN116013323A (en) * 2022-12-27 2023-04-25 浙江大学 Active evidence obtaining method oriented to voice conversion

Also Published As

Publication number Publication date
CN103730121B (en) 2016-08-24

Similar Documents

Publication Publication Date Title
CN103730121A (en) Method and device for recognizing disguised sounds
CN103236260B (en) Speech recognition system
CN106847292B (en) Method for recognizing sound-groove and device
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
Villalba et al. Detecting replay attacks from far-field recordings on speaker verification systems
CN102968990B (en) Speaker identifying method and system
Yu et al. Uncertainty propagation in front end factor analysis for noise robust speaker recognition
CN102354496B (en) PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN104464724A (en) Speaker recognition method for deliberately pretended voices
CN104900229A (en) Method for extracting mixed characteristic parameters of voice signals
CN103077728B (en) A kind of patient&#39;s weak voice endpoint detection method
CN106409298A (en) Identification method of sound rerecording attack
Zhang et al. Joint information from nonlinear and linear features for spoofing detection: An i-vector/DNN based approach
CA2492204A1 (en) Similar speaking recognition method and system using linear and nonlinear feature extraction
Shchemelinin et al. Examining vulnerability of voice verification systems to spoofing attacks by means of a TTS system
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
CN101887722A (en) Rapid voiceprint authentication method
CN106782508A (en) The cutting method of speech audio and the cutting device of speech audio
CN105280181A (en) Training method for language recognition model and language recognition method
Alam et al. Tandem Features for Text-Dependent Speaker Verification on the RedDots Corpus.
CN104464738A (en) Vocal print recognition method oriented to smart mobile device
Mohammadi et al. Robust features fusion for text independent speaker verification enhancement in noisy environments
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
Zhang The algorithm of voiceprint recognition model based DNN-RELIANCE
Hanilci et al. VQ-UBM based speaker verification through dimension reduction using local PCA

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160824

Termination date: 20211224

CF01 Termination of patent right due to non-payment of annual fee