CN103280224B - Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm - Google Patents

Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm

Info

Publication number
CN103280224B
CN103280224B CN201310146293.XA CN201310146293A CN103280224B
Authority
CN
China
Prior art keywords
speaker
formula
average
model
gaussian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310146293.XA
Other languages
Chinese (zh)
Other versions
CN103280224A (en)
Inventor
宋鹏
包永强
赵力
刘健刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Taiyu Information Technology Co., Ltd.
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201310146293.XA priority Critical patent/CN103280224B/en
Publication of CN103280224A publication Critical patent/CN103280224A/en
Application granted granted Critical
Publication of CN103280224B publication Critical patent/CN103280224B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a voice conversion method under asymmetric corpus conditions based on an adaptive algorithm. First, the MAP algorithm is used to adapt a reference (background) speaker model with a small number of training sentences, yielding models for the source speaker and the target speaker respectively. Then, using the parameters of the adapted speaker models, two conversion methods are proposed: Gaussian normalization and mean transformation. To further improve the conversion effect, a method fusing Gaussian normalization with mean transformation is also proposed. Meanwhile, because the training sentences are limited, the accuracy of the adapted models is inevitably affected; the invention therefore proposes a KL-divergence-based method to optimize the speaker models during conversion. Subjective and objective experimental results show that, whether measured by spectral distortion or by the quality of the converted speech and its similarity to the target speech, the proposed methods achieve results comparable to the classical GMM method under symmetric corpus conditions.

Description

Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm
Technical field
The present invention relates to a voice conversion method under asymmetric corpus conditions based on an adaptive algorithm, and belongs to the field of speech processing technology.
Background technology
Voice conversion is a technique that converts the speaker characteristics of one person into those of another while keeping the semantic content unchanged. It has a wide range of applications, such as personalized speech synthesis, low-bit-rate voice communication, and the medical restoration of impaired speech. Over the past few decades, voice conversion technology has made significant progress, and a series of methods represented by codebook mapping, Gaussian mixture models, and neural networks have appeared. These methods achieve, to a large extent, the conversion of a speaker's personal voice characteristics. However, they focus mainly on voice conversion under symmetric corpus conditions, where source and target utter the same sentences, and ignore the asymmetric case, where the sentences differ. In other words, although voice conversion under symmetric corpus conditions has achieved fairly satisfactory converted-speech quality and is widely used, it cannot be applied directly to the more common asymmetric corpora found in real environments. Therefore, voice conversion methods under asymmetric corpus conditions require further research.
In the foreign literature, some voice conversion methods have been proposed for asymmetric corpora, such as the method based on maximum-likelihood bilinear regression, the training method based on separating text content from the bilinear conversion, and the transfer function based on nearest-neighbor iterative alignment. However, these methods have many defects: the maximum-likelihood bilinear regression method depends on a pre-prepared transfer function obtained by symmetric training; the bilinear conversion technique needs a large number of training sentences from the source and target speakers to guarantee conversion accuracy; and the nearest-neighbor iterative method assumes that the nearest spectral features correspond to the same phoneme and likewise needs many training sentences. These methods are therefore difficult to realize and inconvenient to operate in practical applications.
Summary of the invention
Objective of the invention: to overcome the defects of existing voice conversion methods under asymmetric corpora, the invention provides a voice conversion method under asymmetric corpus conditions based on an adaptive algorithm.
Technical scheme: in the proposed voice conversion method under asymmetric corpus conditions based on an adaptive algorithm, a background speaker model is first trained from pre-prepared reference-speaker sentences. Then, by MAP (maximum a posteriori) adaptation, the sentences of the source and target speakers are used to obtain the source and target speaker models respectively. Next, the voice conversion function is trained from the means and variances of the adapted source and target speaker models; the Gaussian normalization method and the mean transformation method are proposed, and, to further improve the conversion effect, a method fusing Gaussian normalization with mean transformation is proposed. In addition, because the training sentences of the source and target speakers are limited, it is difficult to train accurate speaker models; the invention solves this problem with the Kullback-Leibler (KL) divergence.
1) Adaptation of the speaker models
In the described adaptive voice conversion method, the background speaker model is described by a GMM (Gaussian mixture model), as follows:

p(z) = Σ_{i=1}^{M} ω_i N(z; μ_i^B, Σ_i^B)    Formula (1)

where N(·) denotes the Gaussian distribution, z is the speech spectral feature vector, M is the number of Gaussian components, ω_i is the weight of the i-th Gaussian component, satisfying Σ_{i=1}^{M} ω_i = 1, and μ_i^B and Σ_i^B denote the mean vector and covariance matrix of the i-th Gaussian component respectively. Given a sequence of observed spectral feature vectors O = [o_1, o_2, ..., o_T], the MAP (maximum a posteriori) adaptive algorithm updates the means and variances as follows:

μ̂_i^B = γ_i E_i(o) + (1 − γ_i) μ_i^B    Formula (2)

Σ̂_i^B = γ_i E_i(o²) + (1 − γ_i)[(μ_i^B)² + Σ_i^B] − (μ̂_i^B)²    Formula (3)

where μ̂_i^B and Σ̂_i^B denote the updated mean and variance of the i-th Gaussian component, E_i(o) and E_i(o²) denote the mean and variance statistics of the i-th component, and γ_i is the adaptation factor balancing the old and new statistics, satisfying

γ_i = n_i / (n_i + ρ)    Formula (4)

where ρ is the relevance factor between the adapted speaker model and the reference model, and n_i denotes the weight statistic. Finally the weights, means and variances of the source speaker x model and the target speaker y model are obtained: ω_i^x, μ_i^x, Σ_i^x and ω_i^y, μ_i^y, Σ_i^y.
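The MAP update of formulas (2)-(4) can be sketched in NumPy as follows. This is an illustrative implementation, not the patented code: the diagonal-covariance assumption, the default relevance factor, and the numerical floors are assumptions of this sketch.

```python
import numpy as np

def map_adapt(weights, means, variances, obs, rho=16.0):
    """MAP adaptation of GMM means/variances (formulas (2)-(4)).
    weights: (M,), means/variances: (M, D) diagonal model, obs: (T, D)."""
    M, D = means.shape
    # Gaussian log-densities of each frame under each component
    log_det = np.sum(np.log(variances), axis=1)              # (M,)
    diff2 = (obs[:, None, :] - means[None, :, :]) ** 2       # (T, M, D)
    log_p = -0.5 * (np.sum(diff2 / variances, axis=2)
                    + log_det + D * np.log(2 * np.pi))       # (T, M)
    log_p += np.log(weights)
    # posteriors p(i | o_t)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                  # (T, M)
    # sufficient statistics n_i, E_i(o), E_i(o^2)
    n = post.sum(axis=0)                                     # (M,)
    E_o = post.T @ obs / np.maximum(n[:, None], 1e-10)       # (M, D)
    E_o2 = post.T @ (obs ** 2) / np.maximum(n[:, None], 1e-10)
    gamma = (n / (n + rho))[:, None]                         # formula (4)
    new_means = gamma * E_o + (1 - gamma) * means            # formula (2)
    new_vars = (gamma * E_o2
                + (1 - gamma) * (means ** 2 + variances)
                - new_means ** 2)                            # formula (3)
    return new_means, np.maximum(new_vars, 1e-6), n
```

Note how the adaptation factor works: components with many assigned frames (large n_i) move toward the new statistics, while unobserved components keep their background parameters, which is exactly the behavior the model-optimization step 5) later has to compensate for.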
2) Voice conversion method based on Gaussian normalization
The invention first proposes a voice conversion method based on Gaussian normalization. At the conversion stage, for each frame of source-speaker spectral feature x_t, the Gaussian component with the largest posterior probability on the source speaker model is selected:

m = argmax_i p(i | x_t),  i = 1, 2, ..., M    Formula (5)

where p(i | x_t) denotes the posterior probability that x_t belongs to the i-th Gaussian component, satisfying p(i | x_t) = ω_i N(x_t; μ_i^x, Σ_i^xx) / Σ_{j=1}^{M} ω_j N(x_t; μ_j^x, Σ_j^xx). According to the clustering property of the GMM, the same Gaussian component of the source and target speakers can be considered to belong to the same phoneme, satisfying:

(x − μ_m^x) / σ_m^x = (ŷ − μ_m^y) / σ_m^y    Formula (6)

where μ_m^x, σ_m^x and μ_m^y, σ_m^y denote the mean and standard deviation of the m-th Gaussian component of the source and target speakers respectively. The transfer function is then obtained as:

F(x) = ŷ = (σ_m^y / σ_m^x) x + μ_m^y − (σ_m^y / σ_m^x) μ_m^x    Formula (7)
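Formulas (5)-(7) amount to selecting the best source component and matching its first two moments to the corresponding target component. A one-dimensional sketch (the multi-dimensional case applies the same per-coefficient normalization) could look like:

```python
import numpy as np

def gaussian_normalize_convert(x, weights, mu_x, var_x, mu_y, var_y):
    """Per-frame conversion by Gaussian normalization (formulas (5)-(7));
    1-D illustrative sketch. mu_x/var_x: source model, mu_y/var_y: target."""
    # posterior-proportional densities for frame x (formula (5), 1-D case)
    dens = (weights * np.exp(-0.5 * (x - mu_x) ** 2 / var_x)
            / np.sqrt(2 * np.pi * var_x))
    m = int(np.argmax(dens))                 # component with max posterior
    sx, sy = np.sqrt(var_x[m]), np.sqrt(var_y[m])
    # formula (7): align mean and variance of the selected component
    return (sy / sx) * x + mu_y[m] - (sy / sx) * mu_x[m]
```

Since the denominator of the posterior is the same for all components, taking the argmax of the weighted densities is equivalent to taking the argmax of the posteriors themselves.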
3) Voice conversion method based on mean transformation
The invention also proposes another voice conversion method based on mean transformation. Given the model mean vector sequences of the source and target speakers, μ^x = [μ_1^x, ..., μ_M^x] and μ^y = [μ_1^y, ..., μ_M^y], the mapping function between μ^x and μ^y is:

μ^y = F(μ^x) = A μ^x + b    Formula (8)

Setting μ̂^x = μ^x − μ̄^x and μ̂^y = μ^y − μ̄^y, where μ̄^x and μ̄^y are the overall means, the unknown parameters A and b are obtained by least squares:

A = μ̂^y (μ̂^x)^T (μ̂^x (μ̂^x)^T)^{−1},  b = μ̄^y − A μ̄^x    Formula (9)

The transfer function of formula (8) can be applied directly to the conversion of the spectral features:

F(x) = A x + b    Formula (10)
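The least-squares fit of formula (9) can be sketched as follows; the reading of the hatted means as mean-removed sequences is an assumption consistent with the closed form for b, and the function names are illustrative.

```python
import numpy as np

def fit_mean_transform(mu_x, mu_y):
    """Least-squares fit of F(mu) = A mu + b over paired component means
    (formulas (8)-(9)). mu_x, mu_y: (D, M) matrices of the M component
    mean vectors of the source and target models."""
    mx = mu_x.mean(axis=1, keepdims=True)        # overall source mean
    my = mu_y.mean(axis=1, keepdims=True)        # overall target mean
    hx, hy = mu_x - mx, mu_y - my                # mean-removed sequences
    A = hy @ hx.T @ np.linalg.inv(hx @ hx.T)     # formula (9)
    b = my - A @ mx
    return A, b
```

Because A and b are estimated from the M adapted component means rather than from frame-aligned data, this global mapping needs no parallel sentences, which is the point of the method.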
4) Voice conversion method fusing Gaussian normalization and mean transformation
Parts 2) and 3) set forth the voice conversion methods based on Gaussian normalization and mean transformation respectively. The Gaussian normalization method can be regarded as a local linear transformation, while the mean transformation method can be regarded as a global mapping. To further improve the conversion effect, the invention proposes a conversion method that fuses the two. The transfer function is:

F(x) = θ F_g(x) + (1 − θ) F_m(x)    Formula (11)

where F_g(x) and F_m(x) denote the transfer functions obtained by the Gaussian normalization and mean transformation methods respectively, and θ is a weighting coefficient satisfying 0 ≤ θ ≤ 1.
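Formula (11) is a plain convex combination of the two converters. Assuming F_g and F_m are available as callables, the fusion can be sketched as:

```python
def fuse(theta, F_g, F_m):
    """Fused transfer function of formula (11): a convex combination of the
    local (Gaussian normalization) converter F_g and the global (mean
    transformation) converter F_m, weighted by theta in [0, 1]."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must satisfy 0 <= theta <= 1")
    return lambda x: theta * F_g(x) + (1.0 - theta) * F_m(x)
```

At theta = 1 the fusion reduces to pure Gaussian normalization and at theta = 0 to pure mean transformation, so the weighting coefficient directly trades local detail against global consistency.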
5) Model optimization
The invention uses the MAP adaptive algorithm to model the speakers, but because the adaptation sentences are limited, not every Gaussian component of a speaker model gets updated, which inevitably affects the conversion effect. The invention introduces the KL divergence to reduce the impact of this problem. The KL divergence describes the distance between distributions; supposing f_i(x) and f_j(x) denote the distributions of two Gaussian components, the KL divergence between them is:

D(f_i(x) || f_j(x)) = Σ_x f_i(x) log( f_i(x) / f_j(x) )    Formula (12)

Since formula (12) is asymmetric, the KL divergence is redefined symmetrically as:

D_ij(x) = (1/2)[ D(f_i(x) || f_j(x)) + D(f_j(x) || f_i(x)) ]    Formula (13)

During conversion, if the mean or variance of the current component has not been updated, the mean or variance of the nearest Gaussian component under this distance is used instead.
Beneficial effects: compared with the prior art, the voice conversion method under asymmetric corpus conditions based on an adaptive algorithm provided by the invention has the following advantages:
1) It realizes voice conversion based on an asymmetric corpus, effectively avoiding the requirement of corpus symmetry.
2) It uses the MAP adaptive algorithm to model the speakers, so speaker models can be obtained from a very small number of training sentences, reducing the quantity of training sentences required per speaker.
3) It proposes voice conversion methods based on Gaussian normalization and mean transformation respectively, and further proposes a method fusing the two, which on the one hand avoids the need for a symmetric corpus and on the other hand greatly reduces the computation required to train the transfer function.
4) It optimizes the adapted speaker models by the KL-divergence method: by substituting the parameters of Gaussian components that were not updated, the conversion effect can be improved to a certain extent.
Brief description of the drawings
Fig. 1 is the flowchart of obtaining the transfer function by the Gaussian normalization method in an embodiment of the invention;
Fig. 2 is the flowchart of obtaining the transfer function by the mean mapping method in an embodiment of the invention;
Fig. 3 is the flowchart of obtaining the fused transfer function in an embodiment of the invention;
Fig. 4 compares the embodiment with the prior art for male-to-female conversion;
Fig. 5 compares the embodiment with the prior art for female-to-male conversion;
Fig. 6 compares the mean opinion score results of the embodiment with those of the classical GMM method under symmetric corpus conditions;
Fig. 7 compares the similarity test results of the embodiment with those of the classical GMM method under symmetric corpus conditions.
Embodiment
The invention is further illustrated below with specific embodiments. It should be understood that these embodiments are only for illustration and do not limit the scope of the invention; after reading the invention, modifications by those skilled in the art of its various equivalent forms all fall within the scope defined by the appended claims.
The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm comprises the following steps:
1) Use the STRAIGHT model to extract features from the sentences of all speakers: Mel cepstral coefficients (MCC) and the fundamental frequency (F0).
2) Train a background model obeying a GMM distribution from the spectral features (MCC) extracted from the pre-prepared third-party reference speakers' training sentences. The background model is described as follows:

p(z) = Σ_{i=1}^{M} ω_i N(z; μ_i^B, Σ_i^B)    Formula (1)

where N(·) denotes the Gaussian distribution, z is the speech spectral feature vector, M is the number of Gaussian components, ω_i is the weight of the i-th Gaussian component, satisfying Σ_{i=1}^{M} ω_i = 1, and μ_i^B and Σ_i^B denote the mean vector and covariance matrix of the i-th Gaussian component respectively.
3) Similar to speaker adaptation in speaker recognition, the MAP algorithm is used to adaptively train the models of the source and target speakers respectively.
Given a sequence of observed spectral feature vectors O = [o_1, o_2, ..., o_T], the MAP adaptive algorithm updates the means and variances as follows:

μ̂_i^B = γ_i E_i(o) + (1 − γ_i) μ_i^B    Formula (2)

Σ̂_i^B = γ_i E_i(o²) + (1 − γ_i)[(μ_i^B)² + Σ_i^B] − (μ̂_i^B)²    Formula (3)

where μ̂_i^B and Σ̂_i^B denote the updated mean and variance of the i-th Gaussian component, E_i(o) and E_i(o²) denote the mean and variance statistics of the i-th component, and γ_i is the adaptation factor balancing the old and new statistics, satisfying

γ_i = n_i / (n_i + ρ)    Formula (4)

where ρ is the relevance factor between the adapted speaker model and the reference model, and n_i denotes the weight statistic. Finally the weights, means and variances of the source speaker x and target speaker y models are obtained: ω_i^x, μ_i^x, Σ_i^x and ω_i^y, μ_i^y, Σ_i^y.
4) Use the KL divergence to compute the distance between different components in each speaker model.
Supposing f_i(x) and f_j(x) denote the distributions of two Gaussian components, the KL divergence between them is:

D(f_i(x) || f_j(x)) = Σ_x f_i(x) log( f_i(x) / f_j(x) )    Formula (12)

Since formula (12) is asymmetric, the KL divergence is redefined symmetrically as:

D_ij(x) = (1/2)[ D(f_i(x) || f_j(x)) + D(f_j(x) || f_i(x)) ]    Formula (13)
5) For each frame of test-speech spectral feature vector, compute its posterior probability over the Gaussian components of the source speaker model and select the component with the largest posterior:

m = argmax_i p(i | x_t),  i = 1, 2, ..., M    Formula (5)

where p(i | x_t) denotes the posterior probability, satisfying p(i | x_t) = ω_i N(x_t; μ_i^x, Σ_i^xx) / Σ_{j=1}^{M} ω_j N(x_t; μ_j^x, Σ_j^xx).
According to the clustering property of the GMM, the same Gaussian component of the source and target speakers can be considered to belong to the same phoneme, satisfying:

(x − μ_m^x) / σ_m^x = (ŷ − μ_m^y) / σ_m^y    Formula (6)

where μ_m^x, σ_m^x and μ_m^y, σ_m^y denote the mean and standard deviation of the m-th Gaussian component of the source and target speakers respectively. Gaussian normalization within the current component then yields the transfer function F_g(x). Meanwhile, during transfer-function training, if the mean or variance of the current component has not been updated, the mean or variance of the component nearest in KL divergence is used instead. Fig. 1 gives the flow of obtaining the transfer function by the Gaussian normalization method.
6) Using the mean vectors of the adapted speaker models, obtain the spectral-feature transfer function F_m(x) by the least-squares method. Likewise, during transfer-function training, if the mean or variance of the current component has not been updated, the mean or variance of the component nearest in KL divergence is used instead. Fig. 2 gives the flow of obtaining the transfer function by the mean mapping method.
7) The Gaussian normalization method can be regarded as a local linear transformation, and the mean transformation method as a global mapping. To further improve the conversion effect, the invention fuses the two methods; the transfer function is then F(x) = θ F_g(x) + (1 − θ) F_m(x). Fig. 3 gives the process of obtaining the fused transfer function.
8) F0 conversion: the classical method based on Gaussian normalization is used to convert F0.
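The "classical method based on Gaussian normalization" for F0 is commonly read as matching the mean and standard deviation of log-F0; the log-domain choice and the pass-through of unvoiced frames are assumptions of this sketch, as the patent does not spell them out.

```python
import numpy as np

def convert_f0(f0_src, mean_src, std_src, mean_tgt, std_tgt):
    """Gaussian normalization of F0 in the log domain: shift and scale the
    source speaker's log-F0 statistics to the target speaker's.
    Unvoiced frames (f0 == 0) pass through unchanged."""
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    # match the target speaker's log-F0 mean and standard deviation
    f0_conv[voiced] = np.exp((log_f0 - mean_src) / std_src * std_tgt + mean_tgt)
    return f0_conv
```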
9) The converted spectral features obtained from the transfer function and the converted F0 are fed into the STRAIGHT model to synthesize the final converted speech.
Performance evaluation:
The embodiment uses the CMU ARCTIC English speech database to evaluate the conversion effect. 500 sentences each from the two speakers BDL (male) and CLB (female) are used to train the background model. The two speakers RMS (male) and SLT (female) each provide 120 sentences: 50 symmetric (parallel) sentences are used for the baseline GMM method, 50 asymmetric sentences are used for the method of the invention, and the remaining 20 sentences are used for evaluation. The number of mixture components M of the background model is optimized and set to 256, the number of Gaussian components of the baseline GMM method is optimized and set to 16, and the MCC order is set to 24.
First, the Mel cepstral distance (MCD) is used for objective evaluation of the converted spectral features:

MCD = (10 / ln 10) √( 2 Σ_{j=1}^{D} (mc_j^c − mc_j^t)² )    Formula (14)

where mc_j^c and mc_j^t are the MCC of the converted speech and the target speech respectively, D is the MCC order, and a smaller MCD value indicates a better conversion effect.
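Formula (14) is a per-frame distance; in practice it is averaged over all evaluation frames. A minimal sketch, assuming (T, D) arrays of MCC with the energy coefficient already excluded (a common convention not stated in the text):

```python
import numpy as np

def mcd(mc_converted, mc_target):
    """Mel cepstral distance of formula (14), averaged over frames.
    Inputs: (T, D) arrays of Mel cepstral coefficients."""
    diff2 = np.sum((mc_converted - mc_target) ** 2, axis=1)          # (T,)
    return float(np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * diff2)))
```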
Figs. 4 and 5 give the MCD results of the several methods proposed by the invention compared with the classical GMM method under symmetric corpus conditions; Fig. 4 shows male-to-female conversion and Fig. 5 female-to-male conversion. GN denotes the Gaussian normalization method, MT the mean transformation method, and GNMT the fusion method. As the number of training sentences increases, the MCD curves of the proposed methods show the same trend, gradually approaching the result of the baseline GMM method, and the GNMT method consistently obtains better results than GN or MT alone. This shows that the fusion method can effectively improve on the Gaussian normalization and mean transformation methods.
Then the mean opinion score (MOS) and a similarity test are used for subjective evaluation of the converted-speech quality and of the similarity between converted and target speech respectively. Fig. 6 gives the MOS results of the proposed methods and of the classical GMM method under symmetric corpus conditions, scored on a 5-point scale for speech quality (1 = "bad", 5 = "excellent"). Fig. 7 gives the similarity test results of the inventive methods and the classical GMM method, likewise judged on a 5-point scale (1 = "completely different", 5 = "identical"). Both tests use 5 asymmetric sentences for speaker adaptation, with 6 professional researchers taking part in the scoring; the I-shaped bars in the figures denote variance. The results of Figs. 6 and 7 show that the proposed methods achieve an effect comparable to the GMM method, corroborating to a certain extent the objective MCD evaluation.

Claims (5)

1. A voice conversion method under asymmetric corpus conditions based on an adaptive algorithm, characterized in that: first a background speaker model is obtained by training from pre-prepared reference-speaker sentences; then, by MAP adaptation, the sentences of the source and target speakers are trained respectively to obtain the source and target speaker models; then the voice conversion function is obtained by training from the means and variances of the adapted source and target speaker models, using in the conversion process the Gaussian normalization method, the mean transformation method, and the method fusing Gaussian normalization with mean transformation; in addition, accurate speaker models are obtained from the limited source- and target-speaker training sentences by means of the KL divergence.
2. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm as claimed in claim 1, characterized in that:
Adaptation of the speaker models:
In the described adaptive voice conversion method, the background speaker model is described by a GMM, as follows:

p(z) = Σ_{i=1}^{M} ω_i N(z; μ_i^B, Σ_i^B)    Formula (1)

where N(·) denotes the Gaussian distribution, z is the speech spectral feature vector, M is the number of Gaussian components, ω_i is the weight of the i-th Gaussian component, satisfying Σ_{i=1}^{M} ω_i = 1, and μ_i^B and Σ_i^B denote the mean vector and covariance matrix of the i-th Gaussian component respectively; given a sequence of observed spectral feature vectors O = [o_1, o_2, ..., o_T], the MAP adaptive algorithm updates the means and variances as follows:

μ̂_i^B = γ_i E_i(o) + (1 − γ_i) μ_i^B    Formula (2)

Σ̂_i^B = γ_i E_i(o²) + (1 − γ_i)[(μ_i^B)² + Σ_i^B] − (μ̂_i^B)²    Formula (3)

where μ̂_i^B and Σ̂_i^B denote the updated mean and variance of the i-th Gaussian component; E_i(o) and E_i(o²) denote the mean and variance statistics of the i-th component, and γ_i is the adaptation factor balancing the old and new statistics, satisfying

γ_i = n_i / (n_i + ρ)    Formula (4)

where ρ is the relevance factor between the adapted speaker model and the reference model, and n_i denotes the weight statistic; finally the weights, means and variances of the source speaker x and target speaker y models are obtained: ω_i^x, μ_i^x, Σ_i^x and ω_i^y, μ_i^y, Σ_i^y.
3. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm as claimed in claim 2, characterized in that:
Voice conversion method based on Gaussian normalization:
At the conversion stage, for each frame of source-speaker spectral feature x_t, the Gaussian component with the largest posterior probability on the source speaker model is selected:

m = argmax_i p(i | x_t),  i = 1, 2, ..., M    Formula (5)

where p(i | x_t) denotes the posterior probability that x_t belongs to the i-th Gaussian component, satisfying p(i | x_t) = ω_i N(x_t; μ_i^x, Σ_i^xx) / Σ_{j=1}^{M} ω_j N(x_t; μ_j^x, Σ_j^xx); according to the clustering property of the GMM, the same Gaussian component of the source and target speakers can be considered to belong to the same phoneme, satisfying:

(x − μ_m^x) / σ_m^x = (ŷ − μ_m^y) / σ_m^y    Formula (6)

where μ_m^x, σ_m^x and μ_m^y, σ_m^y denote the mean and standard deviation of the m-th Gaussian component of the source and target speakers respectively; the transfer function is then obtained as:

F(x) = ŷ = (σ_m^y / σ_m^x) x + μ_m^y − (σ_m^y / σ_m^x) μ_m^x    Formula (7)

Voice conversion method based on mean transformation:
Given the model mean vector sequences of the source and target speakers, μ^x = [μ_1^x, ..., μ_M^x] and μ^y = [μ_1^y, ..., μ_M^y], the mapping function between μ^x and μ^y is:

μ^y = F(μ^x) = A μ^x + b    Formula (8)

Setting μ̂^x = μ^x − μ̄^x and μ̂^y = μ^y − μ̄^y, where μ̄^x and μ̄^y are the overall means, the unknown parameters A and b are obtained by least squares:

A = μ̂^y (μ̂^x)^T (μ̂^x (μ̂^x)^T)^{−1},  b = μ̄^y − A μ̄^x    Formula (9)

The transfer function of formula (8) can be applied directly to the conversion of the spectral features:

F(x) = A x + b    Formula (10)
4. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm as claimed in claim 3, characterized in that:
Voice conversion method fusing Gaussian normalization and mean transformation:
The transfer function is:

F(x) = θ F_g(x) + (1 − θ) F_m(x)    Formula (11)

where F_g(x) and F_m(x) denote the transfer functions obtained by the Gaussian normalization and mean transformation methods respectively, and θ is a weighting coefficient satisfying 0 ≤ θ ≤ 1;
Model optimization:
The KL divergence describes the distance between distributions; supposing f_i(x) and f_j(x) denote the distributions of two Gaussian components, the KL divergence between them is:

D(f_i(x) || f_j(x)) = Σ_x f_i(x) log( f_i(x) / f_j(x) )    Formula (12)

Since formula (12) is asymmetric, the KL divergence is redefined symmetrically as:

D_ij(x) = (1/2)[ D(f_i(x) || f_j(x)) + D(f_j(x) || f_i(x)) ]    Formula (13)

During conversion, if the mean or variance of the current component has not been updated, the mean or variance of the nearest Gaussian component is used instead.
5. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm as claimed in claim 1, characterized in that: the number M of GMM components of the background speaker model is selected according to the scale of the training corpus and is chosen as a power of 2, i.e. 2^N, where N is a positive integer.
CN201310146293.XA 2013-04-24 2013-04-24 Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm Active CN103280224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310146293.XA CN103280224B (en) 2013-04-24 2013-04-24 Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310146293.XA CN103280224B (en) 2013-04-24 2013-04-24 Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm

Publications (2)

Publication Number Publication Date
CN103280224A CN103280224A (en) 2013-09-04
CN103280224B true CN103280224B (en) 2015-09-16

Family

ID=49062718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310146293.XA Active CN103280224B (en) 2013-04-24 2013-04-24 Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm

Country Status (1)

Country Link
CN (1) CN103280224B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103531205B (en) * 2013-10-09 2016-08-31 常州工学院 The asymmetrical voice conversion method mapped based on deep neural network feature
CN104123933A (en) * 2014-08-01 2014-10-29 中国科学院自动化研究所 Self-adaptive non-parallel training based voice conversion method
CN104217721B (en) * 2014-08-14 2017-03-08 东南大学 Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns
CN106205623B (en) * 2016-06-17 2019-05-21 福建星网视易信息***有限公司 A kind of sound converting method and device
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN106504741B (en) * 2016-09-18 2019-10-25 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of phonetics transfer method based on deep neural network phoneme information
CN107301859B (en) * 2017-06-21 2020-02-21 南京邮电大学 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering
CN107945795B (en) * 2017-11-13 2021-06-25 河海大学 Rapid model self-adaption method based on Gaussian classification
WO2019116889A1 (en) * 2017-12-12 2019-06-20 ソニー株式会社 Signal processing device and method, learning device and method, and program
CN112331181B (en) * 2019-07-30 2024-07-05 中国科学院声学研究所 Target speaker voice extraction method based on multi-speaker condition
CN110544466A (en) * 2019-08-19 2019-12-06 广州九四智能科技有限公司 Speech synthesis method under condition of small amount of recording samples
CN112767942B (en) * 2020-12-31 2023-04-07 北京云迹科技股份有限公司 Speech recognition engine adaptation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007103520A2 (en) * 2006-03-08 2007-09-13 Voxonic, Inc. Codebook-less speech conversion method and system
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007103520A2 (en) * 2006-03-08 2007-09-13 Voxonic, Inc. Codebook-less speech conversion method and system
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Efficient fundamental frequency transformation for voice conversion;Song Peng ET AL;《Journal of southeast university》;20120630;第28卷(第2期);140-144 *

Also Published As

Publication number Publication date
CN103280224A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
CN103280224B (en) Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm
CN101833951B (en) Multi-background modeling method for speaker recognition
CN101178896B (en) Unit selection voice synthetic method based on acoustics statistical model
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN108831445A (en) Sichuan dialect recognition methods, acoustic training model method, device and equipment
CN105261367B (en) A kind of method for distinguishing speek person
CN104217721A (en) Speech conversion method based on asymmetric speech database conditions of speaker model alignment
CN103198827A (en) Voice emotion correction method based on relevance of prosodic feature parameter and emotion parameter
CN105469784A (en) Generation method for probabilistic linear discriminant analysis (PLDA) model and speaker clustering method and system
CN105261358A (en) N-gram grammar model constructing method for voice identification and voice identification system
CN105404621A (en) Method and system for blind people to read Chinese character
CN103456302B (en) A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
CN104361894A (en) Output-based objective voice quality evaluation method
CN102982799A (en) Speech recognition optimization decoding method integrating guide probability
CN104462409A (en) Cross-language emotional resource data identification method based on AdaBoost
CN106782609A (en) A kind of spoken comparison method
CN107093422A (en) A kind of audio recognition method and speech recognition system
CN105280181A (en) Training method for language recognition model and language recognition method
CN102930863B (en) Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
CN105488098A (en) Field difference based new word extraction method
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
CN107785030A (en) A kind of phonetics transfer method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180614

Address after: No. 408 Heyan Road, Qixia District, Nanjing, Jiangsu 210037

Patentee after: Nanjing Boke Electronic Technology Co., Ltd.

Address before: No. 2 Sipailou, Xuanwu District, Nanjing, Jiangsu

Patentee before: Southeast University

TR01 Transfer of patent right

Effective date of registration: 20180709

Address after: 211103 No. 1009 Tianyuan East Road, Jiangning District, Nanjing, Jiangsu.

Patentee after: LIXIN Wireless Electronic Technology Co., Ltd.

Address before: No. 408 Heyan Road, Qixia District, Nanjing, Jiangsu 210037

Patentee before: Nanjing Boke Electronic Technology Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190109

Address after: Room 125, Building 1, Building 6, 4299 Jindu Road, Minhang District, Shanghai 201100

Patentee after: Shanghai Taiyu Information Technology Co., Ltd.

Address before: 211103 No. 1009 Tianyuan East Road, Jiangning District, Nanjing, Jiangsu.

Patentee before: LIXIN Wireless Electronic Technology Co., Ltd.

TR01 Transfer of patent right