CN103280224A - Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm - Google Patents

Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Info

Publication number
CN103280224A
Authority
CN
China
Prior art keywords: speaker, formula, average, gaussian, model
Prior art date
Legal status: Granted
Application number
CN201310146293XA
Other languages
Chinese (zh)
Other versions
CN103280224B (en)
Inventor
Song Peng (宋鹏)
Bao Yongqiang (包永强)
Zhao Li (赵力)
Liu Jiangang (刘健刚)
Current Assignee
Shanghai Taiyu Information Technology Co., Ltd.
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201310146293.XA priority Critical patent/CN103280224B/en
Publication of CN103280224A publication Critical patent/CN103280224A/en
Application granted granted Critical
Publication of CN103280224B publication Critical patent/CN103280224B/en
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a voice conversion method for the asymmetric-corpus condition based on an adaptive algorithm. First, source-speaker and target-speaker models are trained from a reference speaker model with only a few training sentences, using the MAP (maximum a posteriori) algorithm. Gaussian-normalization and mean-transformation conversion methods are then derived from the parameters of the adapted speaker models and, to further improve conversion quality, a method fusing the two is also proposed. Because the limited number of training sentences inevitably affects the accuracy of the adapted models, a KL (Kullback-Leibler) divergence method is introduced to optimize the speaker models during conversion. Subjective and objective experiments show improvements in spectral distortion, converted-speech quality, and similarity to the target voice; all of the proposed methods achieve performance comparable to the classical GMM (Gaussian mixture model) method trained on a symmetric corpus.

Description

Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm
Technical field
The present invention relates to a voice conversion method under asymmetric corpus conditions based on an adaptive algorithm, and belongs to the field of speech processing technology.
Background technology
Voice conversion is a technique that converts one person's speaker characteristics into another's while keeping the semantic content unchanged. It has a wide range of applications, such as personalized speech synthesis, low-bit-rate voice communication, and the medical restoration of impaired speech. Over the past few decades, voice conversion technology has made significant progress, producing a series of methods represented by codebook mapping, Gaussian mixture models, and neural networks. These methods largely realize the conversion of a speaker's individual voice characteristics. However, they focus mainly on voice conversion under the symmetric-corpus (parallel, same-sentence) condition and ignore the asymmetric-corpus (non-parallel, different-sentence) case. In other words, although voice conversion under the symmetric-corpus condition has achieved fairly satisfactory converted-speech quality and wide use, it cannot be applied directly to the more common asymmetric-corpus situations found in real environments. Further research on voice conversion under asymmetric corpus conditions is therefore needed.
In the literature, several voice conversion methods for asymmetric corpora have been proposed, such as maximum-likelihood bilinear regression, bilinear-transform-based separation of text and content, and transfer-function training based on nearest-neighbor iterative alignment. These methods have notable drawbacks: the maximum-likelihood bilinear regression method depends on a pre-prepared transfer function trained on a symmetric corpus; the bilinear transform technique needs large numbers of source- and target-speaker training sentences to guarantee conversion accuracy; and the nearest-neighbor iterative method is built on nearest-neighbor correspondence between spectral features of identical phonemes and likewise requires many training sentences. These methods are therefore difficult to apply and operate in practice.
Summary of the invention
Object of the invention: to overcome the defects of existing voice conversion methods for asymmetric corpora, the invention provides a voice conversion method under asymmetric corpus conditions based on an adaptive algorithm.
Technical scheme: in the proposed method, a background speaker model is first trained from pre-prepared reference-speaker sentences. Source- and target-speaker models are then obtained by adapting this background model to the source and target speakers' sentences with the MAP (maximum a posteriori) technique. A speech conversion function is then trained from the means and variances of the adapted source- and target-speaker models: Gaussian-normalization and mean-transformation conversion methods are proposed and, to further improve conversion quality, a method fusing the two is proposed as well. In addition, because the source and target speakers' training sentences are limited, it is difficult to train accurate speaker models; the invention addresses this problem with the KL divergence (Kullback-Leibler divergence).
1) Adaptation of the speaker models
In the described adaptive voice conversion method, the background speaker model is described by a GMM (Gaussian mixture model), as follows:
p(z) = \sum_{i=1}^{M} \omega_i\, N(z;\, \mu_i^B, \Sigma_i^B)    Formula (1)
where N(·) denotes a Gaussian distribution, z is the speech spectral feature vector, M is the number of Gaussian components, ω_i is the weight of the i-th Gaussian component, satisfying \sum_{i=1}^{M} \omega_i = 1, and μ_i^B and Σ_i^B are the mean vector and covariance matrix of the i-th component. Given the sequence of observed spectral feature vectors O = [o_1, o_2, \ldots, o_T], the MAP (maximum a posteriori) adaptation algorithm updates the means and variances as follows:
\hat{\mu}_i^B = \gamma_i E_i(o) + (1-\gamma_i)\,\mu_i^B    Formula (2)
\hat{\Sigma}_i^B = \gamma_i E_i(o^2) + (1-\gamma_i)\left[(\mu_i^B)^2 + \Sigma_i^B\right] - (\hat{\mu}_i^B)^2    Formula (3)
where \hat{\mu}_i^B and \hat{\Sigma}_i^B are the updated mean and variance of the i-th Gaussian component, E_i(o) and E_i(o^2) are its mean and variance statistics, and γ_i is the adaptation factor balancing the adaptation between the old and new statistics, satisfying
\gamma_i = \frac{n_i}{n_i + \rho}    Formula (4)
where ρ is the relevance factor coupling the adapted speaker model to the reference model and n_i is the weight (occupancy) statistic. The weights, means, and variances of the source-speaker model x and the target-speaker model y are finally obtained: \{\omega_i^x, \mu_i^x, \Sigma_i^x\} and \{\omega_i^y, \mu_i^y, \Sigma_i^y\}.
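As an illustration, the MAP mean/variance update of Formulas (2)-(4) for a single diagonal-covariance Gaussian component can be sketched in NumPy as follows; the function and variable names are illustrative, not from the patent, and the sufficient statistics are assumed to be accumulated beforehand:

```python
import numpy as np

def map_adapt(mu_b, var_b, E_o, E_o2, n_i, rho=16.0):
    """MAP update of one Gaussian component (Formulas 2-4), diagonal covariance.

    mu_b, var_b : background mean / variance vectors
    E_o, E_o2   : first- and second-order statistics E_i(o), E_i(o^2)
    n_i         : occupancy (weight) statistic
    rho         : relevance factor coupling the adapted model to the background
    """
    gamma = n_i / (n_i + rho)                        # adaptation factor, Formula (4)
    mu_hat = gamma * E_o + (1.0 - gamma) * mu_b      # Formula (2)
    var_hat = (gamma * E_o2
               + (1.0 - gamma) * (mu_b ** 2 + var_b)
               - mu_hat ** 2)                        # Formula (3)
    return mu_hat, var_hat

mu_b, var_b = np.zeros(3), np.ones(3)
# with no adaptation data (n_i = 0) the background parameters are kept unchanged
mu0, var0 = map_adapt(mu_b, var_b, E_o=np.zeros(3), E_o2=np.zeros(3), n_i=0.0)
# with abundant data (n_i >> rho) the update moves fully to the new statistics
mu1, var1 = map_adapt(mu_b, var_b, E_o=np.full(3, 2.0), E_o2=np.full(3, 5.0), n_i=1e9)
```

This shows the balancing role of γ_i: it interpolates between the background parameters and the new statistics according to how much adaptation data each component received.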
2) Voice conversion based on Gaussian normalization
A voice conversion method based on Gaussian normalization is proposed first. At the conversion stage, for each source-speaker spectral feature frame x_t, the Gaussian component with the maximum posterior probability under the source-speaker model is selected:
m = \arg\max_i\, p(i \mid x_t), \quad i = 1, 2, \ldots, M    Formula (5)
where p(i|x_t), the posterior probability that x_t belongs to the i-th Gaussian component, satisfies
p(i \mid x_t) = \frac{\omega_i\, N(x_t;\, \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{M} \omega_j\, N(x_t;\, \mu_j^x, \Sigma_j^{xx})}
By the clustering property of the GMM, the same Gaussian component of the source and target speakers can be regarded as belonging to the same phoneme, so that
\frac{x - \mu_m^x}{\sigma_m^x} = \frac{\hat{y} - \mu_m^y}{\sigma_m^y}    Formula (6)
where \mu_m^x, \sigma_m^x, \mu_m^y, and \sigma_m^y are the means and standard deviations of the m-th Gaussian component of the source and target speakers, respectively. The transfer function then follows:
F(x) = \hat{y} = \frac{\sigma_m^y}{\sigma_m^x}\, x + \mu_m^y - \frac{\sigma_m^y}{\sigma_m^x}\,\mu_m^x    Formula (7)
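The per-frame conversion of Formulas (5)-(7) can be sketched for a scalar feature as follows; this is a minimal illustration, with illustrative names, and the constant denominator of Formula (5) is dropped since it does not affect the argmax:

```python
import numpy as np

def gn_convert(x, w, mu_x, var_x, mu_y, var_y):
    """Convert one scalar source feature x via Gaussian normalization.

    The component m with maximum posterior p(i|x) is picked (Formula 5),
    then x is mapped by Formula (7). Each model array has one entry per
    Gaussian component.
    """
    # unnormalized posteriors: w_i * N(x; mu_i, var_i)
    lik = w * np.exp(-(x - mu_x) ** 2 / (2.0 * var_x)) / np.sqrt(2.0 * np.pi * var_x)
    m = int(np.argmax(lik))
    ratio = np.sqrt(var_y[m]) / np.sqrt(var_x[m])    # sigma_m^y / sigma_m^x
    return ratio * x + mu_y[m] - ratio * mu_x[m]     # Formula (7)

# two well-separated source components mapped to shifted and scaled targets
w = np.array([0.5, 0.5])
mu_x, var_x = np.array([0.0, 10.0]), np.array([1.0, 1.0])
mu_y, var_y = np.array([2.0, 20.0]), np.array([4.0, 4.0])
y0 = gn_convert(0.0, w, mu_x, var_x, mu_y, var_y)    # component 0 selected -> 2.0
y1 = gn_convert(10.0, w, mu_x, var_x, mu_y, var_y)   # component 1 selected -> 20.0
```

A frame near a source component's mean is mapped onto the corresponding target component's mean, which is exactly the local, per-component behavior that makes this method a local linear regression.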
3) Voice conversion based on mean transformation
Another voice conversion method, based on mean transformation, is also proposed. Given the mean-vector sequences of the source- and target-speaker models,
\mu^x = [\mu_1^x, \mu_2^x, \ldots, \mu_M^x] and \mu^y = [\mu_1^y, \mu_2^y, \ldots, \mu_M^y],
the mapping function between μ^x and μ^y is
\mu^y = F(\mu^x) = A\mu^x + b    Formula (8)
Setting \hat{\mu}^x = \mu^x - \bar{\mu}^x and \hat{\mu}^y = \mu^y - \bar{\mu}^y, the unknown parameters A and b are obtained by the least-squares method:
A = \hat{\mu}^y (\hat{\mu}^x)^T \left(\hat{\mu}^x (\hat{\mu}^x)^T\right)^{-1}, \quad b = \bar{\mu}^y - A\bar{\mu}^x    Formula (9)
where \bar{\mu}^x and \bar{\mu}^y are the means of the two sequences. The transfer function of Formula (8) can be applied directly to the spectral features, giving
F(x) = Ax + b    Formula (10)
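The least-squares fit of Formula (9) can be sketched in NumPy as follows; the names are illustrative, and a small exact affine example stands in for real adapted model means:

```python
import numpy as np

def fit_mean_transform(mu_x, mu_y):
    """Least-squares fit of mu_y ≈ A mu_x + b (Formulas 8-9).

    mu_x, mu_y: (D, M) matrices whose columns are the component mean
    vectors of the source and target models.
    """
    mx = mu_x.mean(axis=1, keepdims=True)            # \bar{mu}^x
    my = mu_y.mean(axis=1, keepdims=True)            # \bar{mu}^y
    cx, cy = mu_x - mx, mu_y - my                    # mean-removed sequences
    A = cy @ cx.T @ np.linalg.inv(cx @ cx.T)         # Formula (9)
    b = my - A @ mx
    return A, b

# sanity check: an exact affine relation between the means is recovered exactly
A_true = np.array([[2.0, 0.0], [0.0, 3.0]])
b_true = np.array([[1.0], [-1.0]])
mu_x = np.array([[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 2.0, 1.0]])
mu_y = A_true @ mu_x + b_true
A, b = fit_mean_transform(mu_x, mu_y)
# the fitted F(x) = A x + b (Formula 10) then applies to any spectral frame
```

Because A and b are estimated from the component means alone, no frame alignment between source and target sentences is needed, which is what makes this a global mapping usable on an asymmetric corpus.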
4) Voice conversion based on the fusion of Gaussian normalization and mean transformation
Parts 2) and 3) presented voice conversion methods based on Gaussian normalization and on mean transformation, respectively. The Gaussian-normalization method can be regarded as a local linear regression, and the mean-transformation method as a global mapping. To further improve conversion quality, the invention proposes a method that fuses the two. The transfer function is
F(x) = \theta F_g(x) + (1-\theta) F_m(x)    Formula (11)
where F_g(x) and F_m(x) are the transfer functions trained by the Gaussian-normalization and mean-transformation methods, respectively, and θ is a weighting coefficient satisfying 0 ≤ θ ≤ 1.
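The fusion of Formula (11) is a convex combination of the two trained transfer functions; a minimal sketch, with toy affine functions standing in for the trained F_g and F_m:

```python
def fuse(F_g, F_m, theta):
    """Fused transfer function of Formula (11): theta*F_g + (1-theta)*F_m."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return lambda x: theta * F_g(x) + (1.0 - theta) * F_m(x)

# two toy affine transfer functions in place of the trained ones
F = fuse(lambda x: 2.0 * x, lambda x: x + 4.0, theta=0.25)
y = F(8.0)   # 0.25 * 16 + 0.75 * 12 = 13.0
```

At θ = 1 the fused function reduces to the local Gaussian-normalization mapping and at θ = 0 to the global mean transformation, so θ trades local accuracy against global smoothness.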
5) Model optimization
The MAP adaptation algorithm is used to model the speakers, but because the adaptation sentences are limited, not every Gaussian component of a speaker model gets its parameters updated, which inevitably degrades conversion quality. The invention introduces the KL divergence to reduce this effect. The KL divergence describes the distance between different distributions: if f_i(x) and f_j(x) are the distributions of two Gaussian components, the KL divergence between them is
D(f_i(x) \,\|\, f_j(x)) = \sum_x f_i(x) \log \frac{f_i(x)}{f_j(x)}    Formula (12)
Since Formula (12) is asymmetric, the KL divergence is redefined here as
D_{ij}(x) = \frac{1}{2}\left[D(f_i(x) \,\|\, f_j(x)) + D(f_j(x) \,\|\, f_i(x))\right]    Formula (13)
During conversion, if the mean or variance of the current component has not been updated, the mean or variance of the nearest Gaussian component under this distance is used instead.
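Formulas (12)-(13) over discrete distributions can be sketched directly; a minimal illustration with illustrative names:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q) of Formula (12)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def sym_kl(p, q):
    """Symmetrised KL divergence of Formula (13)."""
    return 0.5 * (kl(p, q) + kl(q, p))

p, q = [0.5, 0.5], [0.9, 0.1]
d_pq = sym_kl(p, q)   # equals sym_kl(q, p); zero only when p == q
```

The symmetrisation matters here because the divergence is used as a distance for nearest-component lookup, and a distance should not depend on the order of its arguments.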
Beneficial effects: compared with the prior art, the proposed voice conversion method under asymmetric corpus conditions based on an adaptive algorithm has the following advantages:
1) It realizes voice conversion on an asymmetric corpus, effectively avoiding the requirement of a symmetric (parallel) corpus.
2) It models the speakers with the MAP adaptation algorithm, so speaker models can be obtained from a very small number of training sentences, reducing the amount of training data required per speaker.
3) It proposes voice conversion methods based on Gaussian normalization and mean transformation, and further a fusion of the two, which both avoids the need for a symmetric corpus and greatly reduces the computation required to train the transfer function.
4) It optimizes the adapted speaker models with the KL-divergence method: optimizing the parameters of the Gaussian components that were not updated improves conversion quality to a certain extent.
Description of drawings
Fig. 1 is a flowchart of obtaining the transfer function by the Gaussian-normalization method in an embodiment of the invention;
Fig. 2 is a flowchart of obtaining the transfer function by the mean-mapping method in an embodiment of the invention;
Fig. 3 is a flowchart of obtaining the fused transfer function in an embodiment of the invention;
Fig. 4 compares the embodiment of the invention with the prior art for male-to-female voice conversion;
Fig. 5 compares the embodiment of the invention with the prior art for female-to-male voice conversion;
Fig. 6 compares the mean-opinion-score results of the embodiment with those of the classical GMM method under the symmetric-corpus condition;
Fig. 7 compares the similarity-test results of the embodiment with those of the classical GMM method under the symmetric-corpus condition.
Embodiment
The invention is further illustrated below with reference to a specific embodiment. It should be understood that this embodiment is only intended to illustrate the invention, not to limit its scope; after reading the invention, modifications of its various equivalent forms by those skilled in the art all fall within the scope defined by the appended claims.
The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm comprises the following steps:
1) Perform feature extraction on all speakers' sentences with the STRAIGHT model, extracting the Mel-cepstral coefficients (MCC) and the fundamental frequency (F0).
2) Train a GMM background model from the spectral features (MCC) extracted from the pre-prepared third-party reference speakers' training sentences. The background model is described as follows:
p(z) = \sum_{i=1}^{M} \omega_i\, N(z;\, \mu_i^B, \Sigma_i^B)    Formula (1)
where N(·) denotes a Gaussian distribution, z is the speech spectral feature vector, M is the number of Gaussian components, ω_i is the weight of the i-th Gaussian component, satisfying \sum_{i=1}^{M} \omega_i = 1, and μ_i^B and Σ_i^B are the mean vector and covariance matrix of the i-th component.
3) Analogously to speaker adaptation in speaker recognition, use the MAP algorithm to adaptively train the source- and target-speaker models.
Given the sequence of observed spectral feature vectors O = [o_1, o_2, \ldots, o_T], the MAP adaptation algorithm updates the means and variances as follows:
\hat{\mu}_i^B = \gamma_i E_i(o) + (1-\gamma_i)\,\mu_i^B    Formula (2)
\hat{\Sigma}_i^B = \gamma_i E_i(o^2) + (1-\gamma_i)\left[(\mu_i^B)^2 + \Sigma_i^B\right] - (\hat{\mu}_i^B)^2    Formula (3)
where \hat{\mu}_i^B and \hat{\Sigma}_i^B are the updated mean and variance of the i-th Gaussian component, E_i(o) and E_i(o^2) are its mean and variance statistics, and γ_i is the adaptation factor balancing the adaptation between the old and new statistics, satisfying
\gamma_i = \frac{n_i}{n_i + \rho}    Formula (4)
where ρ is the relevance factor coupling the adapted speaker model to the reference model and n_i is the weight statistic. The weights, means, and variances of the source-speaker model x and the target-speaker model y are finally obtained: \{\omega_i^x, \mu_i^x, \Sigma_i^x\} and \{\omega_i^y, \mu_i^y, \Sigma_i^y\}.
4) Use the KL divergence to compute the distance between the different components within each speaker model.
If f_i(x) and f_j(x) are the distributions of two Gaussian components, the KL divergence between them is
D(f_i(x) \,\|\, f_j(x)) = \sum_x f_i(x) \log \frac{f_i(x)}{f_j(x)}    Formula (12)
Since Formula (12) is asymmetric, the KL divergence is redefined here as
D_{ij}(x) = \frac{1}{2}\left[D(f_i(x) \,\|\, f_j(x)) + D(f_j(x) \,\|\, f_i(x))\right]    Formula (13)
5) For each frame of the test speech's spectral feature vector, compute its posterior probability on each Gaussian component of the source-speaker model and select the component with the maximum posterior probability:
m = \arg\max_i\, p(i \mid x_t), \quad i = 1, 2, \ldots, M    Formula (5)
where the posterior probability p(i|x_t) satisfies p(i \mid x_t) = \frac{\omega_i\, N(x_t;\, \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{M} \omega_j\, N(x_t;\, \mu_j^x, \Sigma_j^{xx})}.
By the clustering property of the GMM, the same Gaussian component of the source and target speakers can be regarded as belonging to the same phoneme, satisfying
\frac{x - \mu_m^x}{\sigma_m^x} = \frac{\hat{y} - \mu_m^y}{\sigma_m^y}    Formula (6)
where \mu_m^x, \sigma_m^x, \mu_m^y, and \sigma_m^y are the means and standard deviations of the source and target speakers' m-th Gaussian component, from which the Gaussian-normalization transfer function F_g(x) is obtained. During training of the transfer function, if the mean or variance of the current component was not updated, the mean or variance of the KL-nearest Gaussian component is used instead. Fig. 1 shows the flow of obtaining the transfer function by the Gaussian-normalization method.
6) Using the mean vectors of the adapted speaker models, obtain the spectral-feature transfer function F_m(x) by the least-squares method. As above, during training of the transfer function, if the mean or variance of the current component was not updated, the mean or variance of the KL-nearest Gaussian component is used instead. Fig. 2 shows the flow of obtaining the transfer function by the mean-mapping method.
7) The Gaussian-normalization method can be regarded as a local linear regression, and the mean-transformation method as a global mapping. To further improve conversion quality, the invention fuses the two: the transfer function is F(x) = θF_g(x) + (1−θ)F_m(x). Fig. 3 shows the process of obtaining the fused transfer function.
8) F0 conversion: convert F0 with the classical Gaussian-normalization method.
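The patent only names the classical Gaussian-normalization method for F0; the widely used single-Gaussian log-F0 variant of that method can be sketched as follows (that this exact variant is used here is an assumption, and all names are illustrative):

```python
import numpy as np

def convert_f0(f0, mu_x, sigma_x, mu_y, sigma_y):
    """Single-Gaussian normalization of log F0.

    Voiced frames (f0 > 0) are mapped by
        log f0' = mu_y + (sigma_y / sigma_x) * (log f0 - mu_x),
    where mu/sigma are the log-F0 mean and standard deviation of the
    source (x) and target (y) speakers; unvoiced frames (f0 == 0) stay 0.
    """
    f0 = np.asarray(f0, float)
    out = np.zeros_like(f0)
    v = f0 > 0
    out[v] = np.exp(mu_y + (sigma_y / sigma_x) * (np.log(f0[v]) - mu_x))
    return out

# identical source and target statistics leave voiced frames unchanged
f0 = np.array([0.0, 100.0, 200.0])
same = convert_f0(f0, np.log(150.0), 0.2, np.log(150.0), 0.2)
```

The statistics are computed per speaker from the (non-parallel) training sentences, so this step also needs no frame alignment.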
9) Synthesize speech from the converted spectral features obtained by the transfer function and the converted F0 with the STRAIGHT model, finally obtaining the converted speech.
Performance evaluation:
This embodiment evaluates the conversion quality on the CMU ARCTIC English speech database. 500 sentences each from the speakers BDL (male) and CLB (female) are used to train the background model. The speakers RMS (male) and SLT (female) each contribute 120 sentences: 50 parallel (symmetric) sentences are used for the GMM baseline, 50 non-parallel (asymmetric) sentences for the proposed method, and another 20 sentences for the evaluation tests. The number of mixture components M of the background model is optimized and set to 256, the number of Gaussian components of the GMM baseline is optimized and set to 16, and the MCC order is set to 24.
The Mel-cepstral distance (MCD) is first used to evaluate the converted spectral features objectively:
MCD = \frac{10}{\ln 10}\sqrt{2\sum_{j=1}^{D}\left(mc_j^c - mc_j^t\right)^2}    Formula (14)
where mc_j^c and mc_j^t are the MCCs of the converted and target speech, respectively, and D is the MCC order; a smaller MCD value indicates a better conversion.
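Formula (14) for a single frame can be sketched as follows; the names are illustrative:

```python
import numpy as np

def mcd(mc_c, mc_t):
    """Per-frame Mel-cepstral distortion of Formula (14), in dB.

    mc_c, mc_t: MCC vectors of the converted and target speech
    (the 0th, energy, coefficient is conventionally excluded).
    """
    diff = np.asarray(mc_c, float) - np.asarray(mc_t, float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))

d0 = mcd([1.0, 2.0], [1.0, 2.0])   # identical frames -> 0.0
d1 = mcd([1.0, 0.0], [0.0, 0.0])   # (10 / ln 10) * sqrt(2) ≈ 6.14 dB
```

In practice the per-frame values are averaged over the aligned frames of the evaluation sentences to give the curves reported in Figs. 4 and 5.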
Figs. 4 and 5 give the MCD results comparing the proposed methods with the classical GMM method under the symmetric-corpus condition: Fig. 4 shows male-to-female conversion and Fig. 5 female-to-male conversion. GN denotes the Gaussian-normalization method, MT the mean-transformation method, and GNMT the fusion method. As the number of training sentences increases, the MCD curves of the proposed methods show the same trend, all gradually approaching the result of the GMM baseline, and GNMT consistently outperforms GN and MT alone. This shows that the fusion method effectively improves on the Gaussian-normalization and mean-transformation methods.
Mean opinion score (MOS) and similarity tests are then used to subjectively evaluate the quality of the converted speech and its similarity to the target speech. Fig. 6 gives the MOS results of the proposed methods and of the classical GMM method under the symmetric-corpus condition, using a 5-point scale (1 = "poor", 5 = "very good") to rate the quality of the converted speech. Fig. 7 gives the similarity-test results of the proposed methods and the symmetric-corpus GMM baseline, likewise on a 5-point scale (1 = "completely different", 5 = "identical"), judging the similarity of the converted and target speech. Both tests use 5 asymmetric sentences for speaker adaptation and 6 professional researchers as raters; the I-shaped marks in the figures denote the variance. The results in Figs. 6 and 7 show that the proposed methods achieve performance comparable to the GMM method, corroborating the objective MCD evaluation to a certain extent.

Claims (5)

1. A voice conversion method under asymmetric corpus conditions based on an adaptive algorithm, characterized in that: a background speaker model is first obtained by training on pre-prepared reference-speaker sentences; the source- and target-speaker models are then obtained by adapting to the source and target speakers' sentences with the MAP adaptation technique; the speech conversion function is then obtained by training on the means and variances of the adapted source- and target-speaker models, using in the conversion process the Gaussian-normalization and mean-transformation methods as well as the method fusing the two; and, in addition, accurate speaker models are obtained from the limited source- and target-speaker training sentences by means of the KL divergence.
2. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm according to claim 1, characterized in that:
Adaptation of the speaker models
In the described adaptive voice conversion method, the background speaker model is described by a GMM, as follows:
p(z) = \sum_{i=1}^{M} \omega_i\, N(z;\, \mu_i^B, \Sigma_i^B)    Formula (1)
where N(·) denotes a Gaussian distribution, z is the speech spectral feature vector, M is the number of Gaussian components, ω_i is the weight of the i-th Gaussian component, satisfying \sum_{i=1}^{M} \omega_i = 1, and μ_i^B and Σ_i^B are the mean vector and covariance matrix of the i-th component; given the sequence of observed spectral feature vectors O = [o_1, o_2, \ldots, o_T], the MAP adaptation algorithm updates the means and variances as follows:
\hat{\mu}_i^B = \gamma_i E_i(o) + (1-\gamma_i)\,\mu_i^B    Formula (2)
\hat{\Sigma}_i^B = \gamma_i E_i(o^2) + (1-\gamma_i)\left[(\mu_i^B)^2 + \Sigma_i^B\right] - (\hat{\mu}_i^B)^2    Formula (3)
where \hat{\mu}_i^B and \hat{\Sigma}_i^B are the updated mean and variance of the i-th Gaussian component; E_i(o) and E_i(o^2) are its mean and variance statistics, and γ_i is the adaptation factor balancing the adaptation between the old and new statistics, satisfying
\gamma_i = \frac{n_i}{n_i + \rho}    Formula (4)
where ρ is the relevance factor coupling the adapted speaker model to the reference model and n_i is the weight statistic; the weights, means, and variances of the source-speaker model x and the target-speaker model y are finally obtained: \{\omega_i^x, \mu_i^x, \Sigma_i^x\} and \{\omega_i^y, \mu_i^y, \Sigma_i^y\}.
3. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm according to claim 2, characterized in that:
Voice conversion based on Gaussian normalization
A voice conversion method based on Gaussian normalization is proposed first: at the conversion stage, for each source-speaker spectral feature frame x_t, the Gaussian component with the maximum posterior probability under the source-speaker model is selected:
m = \arg\max_i\, p(i \mid x_t), \quad i = 1, 2, \ldots, M    Formula (5)
where p(i|x_t), the posterior probability that x_t belongs to the i-th Gaussian component, satisfies
p(i \mid x_t) = \frac{\omega_i\, N(x_t;\, \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{M} \omega_j\, N(x_t;\, \mu_j^x, \Sigma_j^{xx})};
by the clustering property of the GMM, the same Gaussian component of the source and target speakers can be regarded as belonging to the same phoneme, satisfying
\frac{x - \mu_m^x}{\sigma_m^x} = \frac{\hat{y} - \mu_m^y}{\sigma_m^y}    Formula (6)
where \mu_m^x, \sigma_m^x, \mu_m^y, and \sigma_m^y are the means and standard deviations of the m-th Gaussian component of the source and target speakers, respectively; the transfer function then follows:
F(x) = \hat{y} = \frac{\sigma_m^y}{\sigma_m^x}\, x + \mu_m^y - \frac{\sigma_m^y}{\sigma_m^x}\,\mu_m^x    Formula (7)
Voice conversion based on mean transformation
Given the mean-vector sequences of the source- and target-speaker models, \mu^x = [\mu_1^x, \ldots, \mu_M^x] and \mu^y = [\mu_1^y, \ldots, \mu_M^y], the mapping function between μ^x and μ^y is
\mu^y = F(\mu^x) = A\mu^x + b    Formula (8)
setting \hat{\mu}^x = \mu^x - \bar{\mu}^x and \hat{\mu}^y = \mu^y - \bar{\mu}^y, the unknown parameters A and b are obtained by the least-squares method:
A = \hat{\mu}^y (\hat{\mu}^x)^T \left(\hat{\mu}^x (\hat{\mu}^x)^T\right)^{-1}, \quad b = \bar{\mu}^y - A\bar{\mu}^x    Formula (9)
where \bar{\mu}^x and \bar{\mu}^y are the means of the two sequences; the transfer function of Formula (8) can be applied directly to the spectral features, giving
F(x) = Ax + b    Formula (10).
4. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm according to claim 3, characterized in that:
Voice conversion based on the fusion of Gaussian normalization and mean transformation
The transfer function is
F(x) = \theta F_g(x) + (1-\theta) F_m(x)    Formula (11)
where F_g(x) and F_m(x) are the transfer functions trained by the Gaussian-normalization and mean-transformation methods, respectively, and θ is a weighting coefficient satisfying 0 ≤ θ ≤ 1;
Model optimization
The KL divergence describes the distance between different distributions: if f_i(x) and f_j(x) are the distributions of two Gaussian components, the KL divergence between them is
D(f_i(x) \,\|\, f_j(x)) = \sum_x f_i(x) \log \frac{f_i(x)}{f_j(x)}    Formula (12)
since Formula (12) is asymmetric, the KL divergence is redefined as
D_{ij}(x) = \frac{1}{2}\left[D(f_i(x) \,\|\, f_j(x)) + D(f_j(x) \,\|\, f_i(x))\right]    Formula (13)
during conversion, if the mean or variance of the current component has not been updated, the mean or variance of the nearest Gaussian component is used instead.
5. The voice conversion method under asymmetric corpus conditions based on an adaptive algorithm according to claim 1, characterized in that: the number M of Gaussian components of the background speaker model GMM is selected according to the size of the training corpus and is chosen as a power of 2, i.e. 2^N with N a positive integer.
CN201310146293.XA 2013-04-24 2013-04-24 Voice conversion method under asymmetric corpus conditions based on an adaptive algorithm Active CN103280224B (en)


Publications (2)

Publication Number | Publication Date
CN103280224A | 2013-09-04
CN103280224B | 2015-09-16

Family

ID=49062718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310146293.XA Active CN103280224B (en) 2013-04-24 2013-04-24 Based on the phonetics transfer method under the asymmetric corpus condition of adaptive algorithm

Country Status (1)

Country Link
CN (1) CN103280224B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103531205A (en) * 2013-10-09 2014-01-22 Changzhou Institute of Technology Asymmetric voice conversion method based on deep neural network feature mapping
CN104123933A (en) * 2014-08-01 2014-10-29 Institute of Automation, Chinese Academy of Sciences Voice conversion method based on adaptive non-parallel training
CN104217721A (en) * 2014-08-14 2014-12-17 Southeast University Voice conversion method under asymmetric corpus conditions based on speaker model alignment
CN106205623A (en) * 2016-06-17 2016-12-07 Fujian Star-net eVideo Information *** Co., Ltd. Sound conversion method and device
CN106504741A (en) * 2016-09-18 2017-03-15 Guangdong Shunde SYSU-CMU International Joint Research Institute Voice conversion method based on deep neural network phoneme information
CN107301859A (en) * 2017-06-21 2017-10-27 Nanjing University of Posts and Telecommunications Voice conversion method under non-parallel text condition based on adaptive Gaussian clustering
CN107610717A (en) * 2016-07-11 2018-01-19 The Chinese University of Hong Kong Many-to-one voice conversion method based on phonetic posterior probabilities
CN107945795A (en) * 2017-11-13 2018-04-20 Hohai University Rapid model adaptation method based on Gaussian classification
CN110544466A (en) * 2019-08-19 2019-12-06 Guangzhou Jiusi Intelligent Technology Co., Ltd. Speech synthesis method under the condition of a small number of recording samples
CN111465982A (en) * 2017-12-12 2020-07-28 Sony Corporation Signal processing device and method, training device and method, and program
CN112331181A (en) * 2019-07-30 2021-02-05 Institute of Acoustics, Chinese Academy of Sciences Target speaker voice extraction method based on multi-speaker condition
CN112767942A (en) * 2020-12-31 2021-05-07 Beijing Yunji Technology Co., Ltd. Speech recognition engine adaptation method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007103520A2 (en) * 2006-03-08 2007-09-13 Voxonic, Inc. Codebook-less speech conversion method and system
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007103520A2 (en) * 2006-03-08 2007-09-13 Voxonic, Inc. Codebook-less speech conversion method and system
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG PENG ET AL: "Efficient fundamental frequency transformation for voice conversion", JOURNAL OF SOUTHEAST UNIVERSITY *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103531205A (en) * 2013-10-09 2014-01-22 Changzhou Institute of Technology Asymmetric voice conversion method based on deep neural network feature mapping
CN103531205B (en) * 2013-10-09 2016-08-31 Changzhou Institute of Technology Asymmetric voice conversion method based on deep neural network feature mapping
CN104123933A (en) * 2014-08-01 2014-10-29 Institute of Automation, Chinese Academy of Sciences Voice conversion method based on adaptive non-parallel training
CN104217721A (en) * 2014-08-14 2014-12-17 Southeast University Voice conversion method under asymmetric corpus conditions based on speaker model alignment
CN104217721B (en) * 2014-08-14 2017-03-08 Southeast University Voice conversion method under asymmetric corpus conditions based on speaker model alignment
CN106205623A (en) * 2016-06-17 2016-12-07 Fujian Star-net eVideo Information *** Co., Ltd. Sound conversion method and device
CN107610717A (en) * 2016-07-11 2018-01-19 The Chinese University of Hong Kong Many-to-one voice conversion method based on phonetic posterior probabilities
CN107610717B (en) * 2016-07-11 2021-07-06 The Chinese University of Hong Kong Many-to-one voice conversion method based on phonetic posterior probabilities
CN106504741A (en) * 2016-09-18 2017-03-15 Guangdong Shunde SYSU-CMU International Joint Research Institute Voice conversion method based on deep neural network phoneme information
CN107301859A (en) * 2017-06-21 2017-10-27 Nanjing University of Posts and Telecommunications Voice conversion method under non-parallel text condition based on adaptive Gaussian clustering
CN107301859B (en) * 2017-06-21 2020-02-21 Nanjing University of Posts and Telecommunications Voice conversion method under non-parallel text condition based on adaptive Gaussian clustering
CN107945795A (en) * 2017-11-13 2018-04-20 Hohai University Rapid model adaptation method based on Gaussian classification
CN107945795B (en) * 2017-11-13 2021-06-25 Hohai University Rapid model adaptation method based on Gaussian classification
CN111465982A (en) * 2017-12-12 2020-07-28 Sony Corporation Signal processing device and method, training device and method, and program
US11894008B2 (en) 2017-12-12 2024-02-06 Sony Corporation Signal processing apparatus, training apparatus, and method
CN112331181A (en) * 2019-07-30 2021-02-05 Institute of Acoustics, Chinese Academy of Sciences Target speaker voice extraction method based on multi-speaker condition
CN110544466A (en) * 2019-08-19 2019-12-06 Guangzhou Jiusi Intelligent Technology Co., Ltd. Speech synthesis method under the condition of a small number of recording samples
CN112767942A (en) * 2020-12-31 2021-05-07 Beijing Yunji Technology Co., Ltd. Speech recognition engine adaptation method and device, electronic equipment and storage medium
CN112767942B (en) * 2020-12-31 2023-04-07 Beijing Yunji Technology Co., Ltd. Speech recognition engine adaptation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103280224B (en) 2015-09-16

Similar Documents

Publication Publication Date Title
CN103280224A (en) Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm
CN101833951B (en) Multi-background modeling method for speaker recognition
CN101000765B (en) Speech synthesis method based on prosodic features
CN101178896B (en) Unit selection speech synthesis method based on acoustic statistical model
CN103065620B (en) Method for receiving text input by a user on a mobile phone or web page and synthesizing it into personalized voice in real time
CN105469784B (en) Speaker clustering method and system based on probabilistic linear discriminant analysis model
CN110060701B (en) Many-to-many voice conversion method based on VAWGAN-AC
Xie et al. Sequence error (SE) minimization training of neural network for voice conversion.
CN109599091B (en) Many-to-many voice conversion method based on STARWGAN-GP and x-vector
CN105139864A (en) Voice recognition method and voice recognition device
CN104123933A (en) Voice conversion method based on adaptive non-parallel training
CN104217721A (en) Voice conversion method under asymmetric corpus conditions based on speaker model alignment
CN103531196A (en) Sound selection method for waveform concatenation speech synthesis
CN102332263B (en) Speaker recognition method based on emotion model synthesis using the nearest-neighbor principle
CN107195299B (en) Method and apparatus for training neural network acoustic model, and speech recognition method and device
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
CN102779510B (en) Speech emotion recognition method based on adaptive feature-space projection
CN103456302B (en) Emotional speaker recognition method based on emotion GMM model weight synthesis
CN105261367B (en) Speaker identification method
CN110992988B (en) Speech emotion recognition method and device based on domain adversarial training
CN104462409A (en) Cross-language emotional resource data identification method based on AdaBoost
CN109584893B (en) Many-to-many voice conversion system based on VAE and i-vector under non-parallel text condition
CN102930863B (en) Voice conversion and reconstruction method based on simplified adaptive interpolation weighted spectrum model
CN105280181A (en) Training method for language recognition model and language recognition method
CN105488098B (en) New word extraction method based on domain difference

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180614

Address after: No. 408 Heyan Road, Qixia District, Nanjing, Jiangsu 210037

Patentee after: Nanjing Boke Electronic Technology Co., Ltd.

Address before: No. 2 Sipailou, Xuanwu District, Nanjing, Jiangsu

Patentee before: Southeast University

TR01 Transfer of patent right

Effective date of registration: 20180709

Address after: 211103 No. 1009 Tianyuan East Road, Jiangning District, Nanjing, Jiangsu.

Patentee after: LIXIN Wireless Electronic Technology Co., Ltd.

Address before: No. 408 Heyan Road, Qixia District, Nanjing, Jiangsu 210037

Patentee before: Nanjing Boke Electronic Technology Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190109

Address after: Room 125, Building 1, Building 6, 4299 Jindu Road, Minhang District, Shanghai 201100

Patentee after: Shanghai Taiyu Information Technology Co., Ltd.

Address before: 211103 No. 1009 Tianyuan East Road, Jiangning District, Nanjing, Jiangsu.

Patentee before: LIXIN Wireless Electronic Technology Co., Ltd.