CN102306492B - Voice conversion method based on convolutive nonnegative matrix factorization - Google Patents

Voice conversion method based on convolutive nonnegative matrix factorization

Info

Publication number: CN102306492B
Application number: CN201110267425A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN102306492A
Inventors: 张雄伟, 孙健, 曹铁勇, 孙新建, 黄建军, 杨吉斌, 邹霞, 贾冲
Original and current assignee: PLA University of Science and Technology
Application filed by PLA University of Science and Technology
Priority to CN201110267425A
Filing date: 2011-09-09
Publication date (grant): 2012-09-12
Publication of CN102306492A (application) and CN102306492B (grant)
Legal status: Expired - Fee Related

Abstract

The invention discloses a voice conversion method based on convolutive nonnegative matrix factorization. The method comprises the following steps: (1) training a conversion model on training data: performing time alignment and parameter decomposition of the training speech, analyzing the STRAIGHT spectra with a convolutive nonnegative matrix factorization method, and analyzing the pitch frequencies of the source and target speech; (2) converting newly input speech with the trained model: decomposing the source speech $A_c$ to be converted into parameters with the STRAIGHT model, converting the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, converting the pitch frequency with the mean and variance obtained in the training phase, and synthesizing the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the original aperiodic component $ap_{A_c}$. The invention improves the training effectiveness of voice conversion and the voice quality of the converted speech.

Description

Voice conversion method based on convolutive nonnegative matrix factorization
Technical field
The invention belongs to the field of speech processing technology, and particularly relates to a voice conversion method based on convolutive nonnegative matrix factorization.
Background art
Voice conversion is a technique that modifies the personal characteristic information in a source speaker's speech signal so that it carries the personal characteristics of a target speaker's voice. Voice conversion has broad application prospects in personalized human-computer interaction, military operations, information security, and multimedia entertainment. For example, combining voice conversion with a speech synthesis system enables personalized speech synthesis; through voice conversion, an enemy commander's voice can be forged to send false information or orders and disrupt the enemy's command; and voice conversion can be used to reproduce the speech of historical figures.
Voice conversion (Voice Conversion/Transformation) has been studied for over twenty years (Li Bo, Wang Chengyou, Cai Xuanping, et al., "A survey of voice conversion and related techniques," Journal on Communications, 2004(05): 109-118); the earliest method was proposed by Abe et al. in 1988. Existing voice conversion methods mainly include: methods based on vector quantization codebook mapping (1. M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," ICASSP-88, 1988, pp. 655-658), on Gaussian mixture models (2. Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, pp. 131-142, 1998), on hidden Markov models (3. E. K. Kim, S. Lee, and Y. H. Oh, "Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2519-2522), on frequency warping (4. D. Erro and A. Moreno, "Weighted Frequency Warping for Voice Conversion," in InterSpeech 2007 - EuroSpeech, Antwerp, Belgium, 2007), and on artificial neural network models (5. S. Desai, A. Black, B. Yegnanarayana, and K. Prahallad, "Spectral Mapping Using Artificial Neural Networks for Voice Conversion," IEEE Transactions on Audio, Speech, and Language Processing, 2010).
Although a variety of voice conversion methods have been proposed, the conversion effect is still far from practical requirements. The main problems of existing voice conversion methods are:
1. Many voice conversion methods operate within a framework in which the speech signal is first divided into frames and each frame is then processed independently. Under this framework the correlation between speech frames is ignored, so discontinuities appear in the converted speech and its quality is reduced. Methods based on vector quantization codebook mapping, on Gaussian mixture models, and on artificial neural network models are examples.
2. The goal of voice conversion is to convert exactly the speaker's personal characteristic information in the speech; existing voice conversion methods, however, do not separate the speaker's personal characteristic information from the speech signal before conversion, but process the speech signal directly. This not only makes the conversion effect unsatisfactory, but also degrades the quality of the converted speech, because other components of the speech signal are altered as well.
Convolutive Nonnegative Matrix Factorization (CNMF) is a nonnegative matrix factorization method proposed for speech signal processing. While guaranteeing the nonnegativity of the factorization result, it replaces the original one-dimensional basis vectors with two-dimensional time-frequency bases and therefore better captures the temporal correlation of the speech signal. The method has been applied quite successfully to multi-speaker speech separation (6. P. Smaragdis, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1-12, 2007). With this method a speech signal can be decomposed into a set of nonnegative time-frequency bases and the coding matrix of those bases. The time-frequency bases obtained by the decomposition can be regarded as subspaces carrying speaker characteristics, and the coding matrix as the projection of the speech onto those subspaces. The factorization therefore largely separates the speaker characteristic information from the speech signal. In addition, compared with conventional nonnegative matrix factorization, convolutive nonnegative matrix factorization takes the temporal correlation of the speech signal into account and thus ensures the continuity of the reconstructed speech.
However, the factorization result of this method is not unique: decomposing the same speech data under different initial conditions yields different basis matrices. Although this non-uniqueness can be regarded as different representations of the feature space, it limits the application of the method to voice conversion.
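As a minimal illustration of the factorization model (not code from the patent), the reconstruction $S \approx \sum_t W(t) \cdot \overset{t\to}{H}$ can be written in a few lines of numpy; the array layout (a stack of $T_b$ basis slices) is our assumption:

```python
import numpy as np

def cnmf_reconstruct(W, H):
    """Reconstruct S from CNMF factors.

    W: (Tb, F, T) stack of basis slices -- T two-dimensional time-frequency
       bases, each spanning Tb frames; H: (T, N) coding matrix.
    Returns S_hat (F, N) = sum_t W[t] @ (H shifted right by t columns).
    """
    Tb, F, T = W.shape
    N = H.shape[1]
    S_hat = np.zeros((F, N))
    for t in range(Tb):
        S_hat[:, t:] += W[t] @ H[:, :N - t]   # right-shift of H by t columns
    return S_hat
```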
Summary of the invention
The object of the present invention is to provide a voice conversion method based on convolutive nonnegative matrix factorization. Convolutive nonnegative matrix factorization is used to separate the personal characteristic information in the speech vocal-tract spectrum, while the temporal correlation of the speech is effectively preserved during the separation. Under the premise that the convolutive nonnegative matrix factorization results of the source speaker's and target speaker's speech spectra are consistent, the conversion of the vocal-tract spectrum is accomplished by replacing the time-frequency bases. On this basis full voice conversion is realized, so that the converted speech has high quality and strong similarity to the target speaker's personal voice characteristics.
The technical solution that realizes the object of the invention is a voice conversion method based on convolutive nonnegative matrix factorization with the following steps:

First, train the conversion model on training data:

Step 1: time alignment and parameter decomposition of the training speech. The training data are parallel speech, i.e. pairs of utterances of identical content from the source speaker and the target speaker, where the source speech is denoted A and the target speech B. First extract the pitch period envelopes $p_A$ and $p_B$ of both with the STRAIGHT model; then compute from the pitch period envelopes and the original speech signals the pitch mark points $pm_A$ and $pm_B$ used for pitch-synchronous overlap-add (PSOLA) processing. According to the phoneme segmentation information, match the pitch mark points of corresponding phonemes of A and B; then, taking the phoneme as the basic unit, time-align A and B by pitch-synchronous overlap-add based on the matched pitch marks, obtaining the time-aligned speech A' and B'. Analyze A' and B' with the STRAIGHT model to obtain three groups of parameters:

(1) the STRAIGHT spectra $S_{A'}$ and $S_{B'}$ characterizing the vocal tract;
(2) the pitch frequencies $f_{A'}$ and $f_{B'}$;
(3) the aperiodic components $ap_{A'}$ and $ap_{B'}$.

Step 2: analyze the STRAIGHT spectra with the convolutive nonnegative matrix factorization method: first decompose the STRAIGHT spectrum $S_{A'}$ of A' to obtain its time-frequency bases $W_{A'}(t)$ and coding matrix $H_{A'}$; then decompose the STRAIGHT spectrum $S_{B'}$ of B' with the coding matrix fixed at $H_{A'}$, which yields its time-frequency bases $W_{B'}(t)$.

Step 3: analyze the pitch frequencies of the source and target speech, i.e. analyze the pitch information $f_{A'}$ and $f_{B'}$ of A' and B' to obtain their means and variances: $\mu_{A'}$, $\sigma^2_{A'}$ and $\mu_{B'}$, $\sigma^2_{B'}$.
Second, convert newly input speech with the trained model:

Step 1: decompose the source speech $A_c$ to be converted with the STRAIGHT model, obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, pitch frequency $f_{A_c}$, and aperiodic component $ap_{A_c}$;

Step 2: convert the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, i.e. decompose $S_{A_c}$ with the time-frequency bases fixed at $W_{A'}$, obtain the corresponding coding matrix $H_{A_c}$, and then obtain the converted STRAIGHT spectrum by

$$S_{B_c} = W_{B'} \otimes H_{A_c}$$

where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation;

Step 3: convert the pitch frequency using the pitch mean and variance obtained in the training phase:

$$f_{B_c} = (f_{A_c} - \mu_{A'}) \frac{\sigma_{B'}}{\sigma_{A'}} + \mu_{B'}$$

where $f_{B_c}$ denotes the converted pitch frequency;

Step 4: synthesize the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the original aperiodic component $ap_{A_c}$.
Compared with the prior art, the present invention has notable advantages: (1) in the training phase, the source speaker's and target speaker's speech is matched by pitch-synchronous overlap-add based on phoneme information, so the matched speech has higher temporal matching precision and voice quality, which improves the training of the voice conversion model; (2) convolutive nonnegative matrix factorization effectively separates the personal characteristic information in the vocal-tract spectrum, so the conversion process can act specifically on the personal characteristic information, which improves the conversion effect. In addition, convolutive nonnegative matrix factorization effectively preserves the temporal correlation of the vocal-tract spectrum parameters, so the reconstructed speech is more continuous and the voice quality of the converted speech is improved.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the voice conversion method based on convolutive nonnegative matrix factorization of the present invention.
Fig. 2 is a schematic diagram of the phoneme-based time alignment of the training speech.
Fig. 3 is a schematic diagram of speech pitch mark points.
Fig. 4 is a schematic flow diagram of computing the time-frequency bases of the training speech by convolutive nonnegative matrix factorization.
Fig. 5 is a schematic diagram of a STRAIGHT-spectrum time-frequency basis set composed of 40 sub-bases.
Fig. 6 is a schematic flow diagram of the spectral conversion based on convolutive nonnegative matrix factorization.
Embodiment
With reference to Fig. 1, the voice conversion method based on convolutive nonnegative matrix factorization of the present invention proceeds as follows:

Training phase: train the conversion model on the training data.
Step 1: time alignment and parameter decomposition of the training speech.

(1) Time alignment of the speech data, as shown in Fig. 2. First analyze the source speaker's speech A and the target speaker's speech B in the training set with the STRAIGHT model to obtain the pitch of each sample point, i.e. the pitch period envelopes $p_A$ and $p_B$:

$$p_A = [p_{A1}, \ldots, p_{Ai}, \ldots, p_{A l_A}]$$
$$p_B = [p_{B1}, \ldots, p_{Bi}, \ldots, p_{B l_B}]$$

where $l_A$ and $l_B$ denote the number of sample points in the source speaker's speech A and the target speaker's speech B, respectively.

The pitch period here is expressed as a number of sample points, with the fractional part rounded. Because unvoiced and silent segments have no distinct pitch period, their pitch period is fixed at $[0.01 f_s]$, where $f_s$ is the speech sampling frequency and $[x]$ denotes the largest integer not exceeding $x$. Based on the pitch contour, speech A and B are divided into frames whose length equals the local pitch period. Taking speech A as an example, the framing procedure is as follows:
Starting from the 1st sample point $a_{s(1)}$ of the speech, the first frame $F_{A1}$ is determined with its corresponding pitch period length $p_{A s(1)}$ as the frame length, where $s(1) = 1$. Then the sample point $a_{s(2)}$, where $s(2) = s(1) + p_{A s(1)}$, is taken as the start of the second frame, and the second frame $F_{A2}$ is determined with its corresponding pitch period as the frame length. By analogy, for the $i$-th frame the starting point $a_{s(i)}$ is obtained from the previous framing result, and the frame $F_{Ai}$ of the current speech is determined with the corresponding pitch period length $p_{A s(i)}$ as the frame length, where $s(i) = s(i-1) + p_{A s(i-1)}$. This process is repeated until the end of the speech; suppose $N_A$ frames are obtained in total.
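A minimal numpy sketch of this pitch-synchronous framing, under the convention stated above that each frame starts one pitch period after the previous one (function and variable names are ours, not the patent's):

```python
import numpy as np

def pitch_sync_frames(x, pitch_period):
    """Split x into frames; frame i starts at s(i) and has length equal to the
    pitch period at s(i), so s(i) = s(i-1) + p[s(i-1)] as described above.

    x: 1-D speech signal; pitch_period: per-sample pitch period in samples
    (unvoiced/silent samples already fixed to int(0.01 * fs)).
    """
    starts, frames = [], []
    s = 0
    while s < len(x):
        p = max(1, int(pitch_period[s]))   # frame length = local pitch period
        starts.append(s)
        frames.append(x[s:s + p])
        s += p                             # next frame starts one period later
    return starts, frames
```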
After framing, with each frame centered on its midpoint, a speech data matrix $D_A$ of size $l_{A\max} \times N_A$ is constructed, where $l_{A\max}$ is the longest frame length; each column holds one frame of speech, and each column is windowed with a Hanning window. When the matrix is built, frames at the beginning or end of the speech that are too short are padded with the first or last sample point of the speech, respectively.

The matrix $D_A$ is then searched column by column: one point is selected in each column such that the selected points form a pitch mark trajectory $pm_A$ running through all columns and the sum of the selected point values is maximal. During the search, the row difference between the points selected in adjacent columns is restricted to at most 6 rows. This method yields the pitch mark points used for PSOLA processing; in voiced segments these marks lie at the positions of the amplitude maxima. Fig. 3 shows the pitch mark points obtained with this method for a segment of speech. The pitch mark points $pm_B$ of speech B are obtained in the same way.
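The column-by-column search with the 6-row jump constraint is a longest-path problem and can be solved exactly by dynamic programming; a sketch of the search the text describes (our formulation):

```python
import numpy as np

def pitch_mark_track(D, max_jump=6):
    """Pick one row per column of D maximizing the sum of picked values,
    with rows picked in adjacent columns differing by at most max_jump."""
    M, N = D.shape
    score = D[:, 0].astype(float).copy()
    back = np.zeros((M, N), dtype=int)
    for j in range(1, N):
        new_score = np.empty(M)
        for r in range(M):
            lo, hi = max(0, r - max_jump), min(M, r + max_jump + 1)
            k = lo + int(np.argmax(score[lo:hi]))   # best predecessor row
            back[r, j] = k
            new_score[r] = score[k] + D[r, j]
        score = new_score
    track = np.empty(N, dtype=int)
    track[-1] = int(np.argmax(score))
    for j in range(N - 1, 0, -1):                   # backtrack the trajectory
        track[j - 1] = back[track[j], j]
    return track   # row index of the pitch mark in each column
```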
According to the phoneme segmentation information, a matching correspondence between the pitch mark points of the source and target speakers' phonemes is established: $pm_A^n \leftrightarrow pm_B^n$, where $pm_A^n$ and $pm_B^n$ denote the pitch mark information contained in the $n$-th phoneme of the source and target speech respectively, in the form:

$$pm_A^n = [pm_A^{n1}, pm_A^{n2}, \ldots, pm_A^{ni}, \ldots, pm_A^{nI_n}]$$
$$pm_B^n = [pm_B^{n1}, pm_B^{n2}, \ldots, pm_B^{nj}, \ldots, pm_B^{nJ_n}]$$

Here $pm_A^{ni}$ and $pm_B^{nj}$ are the $i$-th and $j$-th pitch mark points in the $n$-th phoneme of the source and target speech, and $I_n$ and $J_n$ are the numbers of pitch marks the two contain in the $n$-th phoneme.
Based on the pitch mark information of the matched phonemes in training speech A and B, the PSOLA method is used to align the durations of corresponding phonemes of the source and target speakers. The frame length of the PSOLA processing is taken as three times the pitch period corresponding to the current pitch mark. During alignment, the phoneme of a matched pair with the shorter duration is taken as the reference, and the other phoneme is compressed by PSOLA to achieve alignment. Because the PSOLA method adjusts duration in units of pitch periods, the adjustment precision is only guaranteed within one pitch period; the residual duration difference of the current matched phoneme pair is therefore carried over into the duration alignment of the next matched pair. Silent segments between phonemes are aligned by truncation.

After each phoneme and the silence between phonemes in speech A and B have been processed by the above steps, the time-aligned source speech A' and target speech B' are obtained.
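The patent applies standard PSOLA; as a rough illustration only, the following sketch compresses a phoneme to a target number of pitch marks by dropping grains and overlap-adding Hann-windowed segments three pitch periods long. The uniform grain selection and the padding details are our simplifications, not the patent's procedure:

```python
import numpy as np

def psola_compress(x, marks, n_target):
    """Very simplified PSOLA duration change: keep n_target of the given pitch
    marks (uniformly chosen), re-stack their grains at pitch-period spacing,
    and overlap-add. Each grain is Hann-windowed, 3 pitch periods long."""
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)
    periods = np.append(periods, periods[-1])          # period at each mark
    keep = np.round(np.linspace(0, len(marks) - 1, n_target)).astype(int)
    new_marks = np.concatenate(([0], np.cumsum(periods[keep][:-1])))
    y = np.zeros(int(new_marks[-1] + 2 * periods[keep[-1]]) + 1)
    for nm, im in zip(new_marks, keep):
        half = int(1.5 * periods[im])                  # 3 pitch periods total
        lo, hi = max(0, marks[im] - half), min(len(x), marks[im] + half)
        grain = x[lo:hi] * np.hanning(hi - lo)
        pos = int(nm) - (marks[im] - lo)               # align grain on new mark
        a, b = max(0, pos), min(len(y), pos + len(grain))
        y[a:b] += grain[a - pos:b - pos]
    return y
```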
(2) Speech parameter decomposition. The time-aligned training speech is decomposed with the STRAIGHT model, yielding three groups of parameters for the source speaker's speech A' and the target speaker's speech B' respectively:

a) The STRAIGHT spectrum characterizing the vocal-tract spectrum, a two-dimensional matrix:

$$S = \begin{bmatrix} s_{11} & \cdots & s_{1j} & \cdots & s_{1N} \\ \vdots & & \vdots & & \vdots \\ s_{i1} & \cdots & s_{ij} & \cdots & s_{iN} \\ \vdots & & \vdots & & \vdots \\ s_{M1} & \cdots & s_{Mj} & \cdots & s_{MN} \end{bmatrix}$$

Each column holds the STRAIGHT spectrum values of one frame of speech, with $M$ spectral values per frame; here $M = 256$. The whole utterance is analyzed in $N$ frames whose centers are 10 ms apart. $S_{A'}$ denotes the source speaker's STRAIGHT spectrum and $S_{B'}$ the target speaker's STRAIGHT spectrum.

b) The pitch frequency of the training speech, $f = [f_1, \ldots, f_j, \ldots, f_N]$, where $f_j$ is the pitch frequency of the $j$-th frame of speech, corresponding to the $j$-th column of the STRAIGHT spectrum. $f_{A'}$ denotes the source speaker's pitch frequency and $f_{B'}$ the target speaker's pitch frequency.

c) The aperiodic component $ap$, a matrix characterizing the aperiodic part of the speech information. Its influence on the converted speech is small, so it is not converted.
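STRAIGHT itself is distributed separately; as a stand-in for experimentation, the freely available WORLD vocoder (the pyworld package) produces the same three parameter groups. A sketch of the decomposition with 10 ms frame spacing follows; the use of WORLD instead of STRAIGHT is our substitution, and the number of spectral bins depends on the FFT size rather than being fixed to M = 256:

```python
import numpy as np
import pyworld as pw   # WORLD vocoder, used here as a stand-in for STRAIGHT

def analyze(x, fs):
    """Decompose speech into the three parameter groups: spectral envelope S
    (columns = frames, as in the patent), pitch contour f, aperiodicity ap."""
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs, frame_period=10.0)   # pitch per frame, 10 ms apart
    sp = pw.cheaptrick(x, f0, t, fs)               # smooth spectral envelope
    ap = pw.d4c(x, f0, t, fs)                      # aperiodic component
    return sp.T, f0, ap.T                          # (M, N): one column per frame
```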
Step 2: analyze the STRAIGHT spectra of the speech by the convolutive nonnegative matrix factorization method to obtain the time-frequency bases of the source and target speakers' STRAIGHT spectra, as shown in Fig. 4. The analysis proceeds as follows:

(1) Decompose the source speaker's STRAIGHT spectrum $S_{A'}$ with convolutive nonnegative matrix factorization, obtaining the decomposition result:

$$S_{A'} \approx \hat{S}_{A'} = \sum_{t=0}^{T_b-1} W_{A'}(t) \cdot \overset{t\to}{H_{A'}}$$

where the $W_{A'}(t)$ are the time-frequency bases of $S_{A'}$, each an $F \times T$ matrix whose columns are frequency-domain basis vectors of the STRAIGHT spectrum; the $T_b$ basis vectors at the same column index (one per shift $t$) constitute one time-frequency base, so $T$ such time-frequency bases are obtained in total. Here $T_b = 8$ and $T = 40$.
$\overset{t\to}{H_{A'}}$ denotes the coding matrix $H_{A'}$ with its column vectors shifted right by $t$ positions. Specifically, let

$$H_{A'} = [h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}]$$

where $h_{A'}^i$ is the $i$-th column vector of $H_{A'}$ and $H_{A'}$ contains $I_H$ column vectors in total. Then for $t = 2$:

$$\overset{2\to}{H_{A'}} = [\vec{0}, \vec{0}, h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H-2}]$$

and for $t = -2$:

$$\overset{-2\to}{H_{A'}} = [h_{A'}^3, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}, \vec{0}, \vec{0}]$$

where $\vec{0}$ is the all-zero column vector.
$W_{A'}(t)$ and $H_{A'}$ are obtained by the following iterative process:

a) Randomly initialize $W_{A'}(t)$ and $H_{A'}$;

b) Compute the reconstruction of $S_{A'}$:

$$\hat{S}_{A'} = \sum_{t=0}^{T_b-1} W_{A'}(t) \cdot \overset{t\to}{H_{A'}}$$

c) Based on $\hat{S}_{A'}$, update the time-frequency bases $W_{A'}(t)$, computing successively for $t = 0, \ldots, T_b - 1$:
$$W_{A'}(t) \leftarrow W_{A'}(t) \odot \frac{\left( S_{A'} \oslash \hat{S}_{A'} \right) \cdot \left( \overset{t\to}{H_{A'}} \right)^{\mathsf T}}{I_{M \times N} \cdot \left( \overset{t\to}{H_{A'}} \right)^{\mathsf T}}$$

where "$\odot$" denotes element-wise multiplication of two matrices, "$\oslash$" element-wise division, and $I_{M \times N}$ is the $M \times N$ matrix whose elements are all 1. After the update of the time-frequency bases is complete, the coding matrix is updated by

$$H_{A'} \leftarrow H_{A'} \odot \frac{\sum_{t=0}^{T_b-1} W_{A'}(t)^{\mathsf T} \cdot \overset{\leftarrow t}{\left( S_{A'} \oslash \hat{S}_{A'} \right)}}{\sum_{t=0}^{T_b-1} W_{A'}(t)^{\mathsf T} \cdot I_{M \times N}}$$

where $\overset{\leftarrow t}{(\cdot)}$ denotes a left shift by $t$ columns (these are the standard multiplicative update rules of convolutive NMF, cf. reference 6).
d) Check whether the number of iterations has reached the maximum of 300, or whether the speech reconstruction error has fallen below $10^{-5}$, the reconstruction error being

$$e_A = \sum_{ij} \left( s_{A'ij} - \hat{s}_{A'ij} \right)^2$$

If neither condition is satisfied, return to step b) and continue iterating; otherwise terminate the iteration and proceed to step e);

e) Obtain the final decomposition result: $W_{A'}(t)$ and $H_{A'}$.
Fig. 5 shows the STRAIGHT-spectrum time-frequency bases obtained by decomposing a segment of speech with this method.
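Putting the iteration together, a compact numpy sketch of the decomposition (multiplicative updates in the form given above; the flags for holding W or H fixed anticipate the constrained decompositions used below, and all names, as well as the simplification of using one common ratio matrix per sweep, are ours):

```python
import numpy as np  # uses shift() defined above

def cnmf(S, T=40, Tb=8, n_iter=300, tol=1e-5, W=None, H=None,
         update_W=True, update_H=True, seed=0):
    """Convolutive NMF: S (F, N) ~= sum_t W[t] @ shift(H, t).
    W: (Tb, F, T) basis slices, H: (T, N) coding matrix; pass W or H together
    with update_W/update_H=False to keep that factor fixed."""
    rng = np.random.default_rng(seed)
    F, N = S.shape
    eps = 1e-12
    if W is not None:
        Tb, _, T = W.shape
    if H is not None:
        T = H.shape[0]
    if W is None:
        W = rng.random((Tb, F, T)) + eps
    if H is None:
        H = rng.random((T, N)) + eps
    ones = np.ones((F, N))
    recon = lambda: sum(W[t] @ shift(H, t) for t in range(Tb))
    for _ in range(n_iter):
        R = S / (recon() + eps)                    # element-wise ratio S / S_hat
        if update_W:
            for t in range(Tb):
                Ht = shift(H, t).T
                W[t] *= (R @ Ht) / (ones @ Ht + eps)
            R = S / (recon() + eps)                # refresh after basis update
        if update_H:
            num = sum(W[t].T @ shift(R, -t) for t in range(Tb))
            den = sum(W[t].T @ ones for t in range(Tb))
            H *= num / (den + eps)
        if np.sum((S - recon()) ** 2) < tol:       # reconstruction error e_A
            break
    return W, H
```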
The time-frequency bases $W_{A'}$ obtained from the decomposition can be regarded as a feature subspace of the source speaker's STRAIGHT spectrum that carries the personal characteristic information of the source speaker's vocal-tract spectrum, while the coding matrix $H_{A'}$ is the projection of the spectrum onto the subspace $W_{A'}$ and carries its evolution over time. Because the source and target speech in the training set have been precisely time-aligned, they can be considered to differ only in speaker characteristic information; hence, after the nonnegative matrix factorization, both share the same coding matrix.
(2) Decompose the target speaker's STRAIGHT spectrum $S_{B'}$ with the convolutive nonnegative matrix factorization method in the same way as in (1), but with the coding matrix fixed at $H_{A'}$; this yields the time-frequency bases $W_{B'}$ of $S_{B'}$.
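With the cnmf() sketch above, fixing the coding matrix is just a matter of disabling its update, e.g.:

```python
W_A, H_A = cnmf(S_A)                                 # free decomposition of S_A'
W_B, _ = cnmf(S_B, H=H_A.copy(), update_H=False)     # coding matrix fixed at H_A'
```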
Step 3: analyze the pitch frequency parameters of the source and target speech. The STRAIGHT analysis yields the first- and second-order statistics of the pitch frequency in the source and target training speech, i.e. the means and variances of $f_{A'}$ and $f_{B'}$: $\mu_{A'}$, $\sigma^2_{A'}$ and $\mu_{B'}$, $\sigma^2_{B'}$.
Conversion phase: convert newly input speech with the trained model.
Step 1: decompose the source speech $A_c$ to be converted with the STRAIGHT model (the parameter decomposition is the same as in the training phase), obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, pitch frequency $f_{A_c}$, and aperiodic component $ap_{A_c}$.
Step 2: convert the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, as shown in Fig. 6. First decompose $S_{A_c}$ with convolutive nonnegative matrix factorization. The analysis is identical to that in step (1) of the second training step, except that the time-frequency bases are now fixed to the $W_{A'}$ obtained in training, which yields the corresponding coding matrix $H_{A_c}$. As the preceding analysis showed, the time-frequency bases carry the speaker's personal characteristic information; the conversion is therefore realized by replacing $W_{A'}$ with $W_{B'}$ and convolving with the coding matrix $H_{A_c}$, giving the converted STRAIGHT spectrum:

$$S_{B_c} = W_{B'} \otimes H_{A_c} = \sum_{t=0}^{T_b-1} W_{B'}(t) \cdot \overset{t\to}{H_{A_c}}$$

where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation.
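In terms of the training sketch above, this step fixes the bases instead of the coding matrix and then swaps in the target bases (variable names ours):

```python
# encode the input spectrum on the fixed source bases W_A
_, H_Ac = cnmf(S_Ac, W=W_A.copy(), update_W=False)
# replace W_A by the target bases W_B and convolve with the coding matrix
S_Bc = sum(W_B[t] @ shift(H_Ac, t) for t in range(W_B.shape[0]))
```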
Step 3: convert the pitch frequency. The pitch frequency $f_{A_c}$ to be converted is transformed using the means and variances of the source and target speakers' pitch obtained in the training phase:

$$f_{B_c} = (f_{A_c} - \mu_{A'}) \frac{\sigma_{B'}}{\sigma_{A'}} + \mu_{B'}$$

where $f_{B_c}$ denotes the converted pitch frequency.
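As a sketch, with the statistics from training (applying the transform to voiced frames only is our assumption; the patent does not discuss unvoiced frames, whose pitch value is conventionally zero):

```python
import numpy as np

def convert_f0(f0_src, mu_A, sigma_A, mu_B, sigma_B):
    """Mean/variance transform of the pitch contour."""
    out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    out[voiced] = (f0_src[voiced] - mu_A) * (sigma_B / sigma_A) + mu_B
    return out
```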
Step 4: synthesize the converted speech. Using the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the aperiodic component $ap_{A_c}$ obtained during signal decomposition, the converted speech is obtained with the STRAIGHT speech synthesis algorithm:

$$B_c = f_{\mathrm{STRAIGHT}}(S_{B_c}, f_{B_c}, ap_{A_c})$$

where $f_{\mathrm{STRAIGHT}}$ denotes the STRAIGHT speech synthesis algorithm and $B_c$ is the converted speech data.
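A matching synthesis sketch with the WORLD stand-in used earlier (again our substitution for the STRAIGHT synthesis algorithm $f_{\mathrm{STRAIGHT}}$):

```python
import numpy as np
import pyworld as pw

def synthesize(S_Bc, f_Bc, ap_Ac, fs):
    """B_c = f_STRAIGHT(S_Bc, f_Bc, ap_Ac): render the converted speech."""
    sp = np.ascontiguousarray(S_Bc.T, dtype=np.float64)   # back to (frames, bins)
    ap = np.ascontiguousarray(ap_Ac.T, dtype=np.float64)
    f0 = np.ascontiguousarray(f_Bc, dtype=np.float64)
    return pw.synthesize(f0, sp, ap, fs, frame_period=10.0)
```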

Claims (4)

1. A voice conversion method based on convolutive nonnegative matrix factorization, characterized by the following steps:

First, train the conversion model on training data:

Step 1: time alignment and parameter decomposition of the training speech. The training data are parallel speech, i.e. pairs of utterances of identical content from the source speaker and the target speaker, where the source speech is denoted A and the target speech B. First extract the pitch period envelopes $p_A$ and $p_B$ of both with the STRAIGHT model; then compute from the pitch period envelopes and the original speech signals the pitch mark points $pm_A$ and $pm_B$ used for pitch-synchronous overlap-add processing. According to the phoneme segmentation information, match the pitch mark points of corresponding phonemes of A and B; then, taking the phoneme as the basic unit, time-align A and B by pitch-synchronous overlap-add based on the matched pitch marks, obtaining the time-aligned speech A' and B'. Analyze A' and B' with the STRAIGHT model to obtain three groups of parameters:

(1) the STRAIGHT spectra $S_{A'}$ and $S_{B'}$ characterizing the vocal tract;
(2) the pitch frequencies $f_{A'}$ and $f_{B'}$;
(3) the aperiodic components $ap_{A'}$ and $ap_{B'}$.

Step 2: analyze the STRAIGHT spectra with the convolutive nonnegative matrix factorization method: first decompose the STRAIGHT spectrum $S_{A'}$ of A' to obtain its time-frequency bases $W_{A'}(t)$ and coding matrix $H_{A'}$; then decompose the STRAIGHT spectrum $S_{B'}$ of B' with the coding matrix fixed at $H_{A'}$, obtaining its time-frequency bases $W_{B'}(t)$.

Step 3: analyze the pitch frequencies of the source and target speech, i.e. analyze the pitch information $f_{A'}$ and $f_{B'}$ of A' and B' to obtain their means and variances: $\mu_{A'}$, $\sigma^2_{A'}$ and $\mu_{B'}$, $\sigma^2_{B'}$.

Second, convert newly input speech with the trained model:

Step 1: decompose the source speech $A_c$ to be converted with the STRAIGHT model, obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, pitch frequency $f_{A_c}$, and aperiodic component $ap_{A_c}$;

Step 2: convert the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, i.e. decompose $S_{A_c}$ with the time-frequency bases fixed at $W_{A'}$, obtain the corresponding coding matrix $H_{A_c}$, and then obtain the converted STRAIGHT spectrum by

$$S_{B_c} = W_{B'} \otimes H_{A_c}$$

where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation;

Step 3: convert the pitch frequency using the pitch mean and variance obtained in the training phase:

$$f_{B_c} = (f_{A_c} - \mu_{A'}) \frac{\sigma_{B'}}{\sigma_{A'}} + \mu_{B'}$$

where $f_{B_c}$ denotes the converted pitch frequency;

Step 4: synthesize the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the original aperiodic component $ap_{A_c}$.
2. The voice conversion method based on convolutive nonnegative matrix factorization according to claim 1, characterized in that, based on the pitch contour, speech A and B are time-aligned with the pitch period length as the frame length:

1) Framing stage

Starting from the 1st sample point $a_{s(1)}$ of the speech, the first frame $F_{A1}$ is determined with its corresponding pitch period length $p_{A s(1)}$ as the frame length, where $s(1) = 1$; then the sample point $a_{s(2)}$, where $s(2) = s(1) + p_{A s(1)}$, is taken as the start of the second frame, and the second frame $F_{A2}$ is determined with its corresponding pitch period as the frame length; by analogy, for the $i$-th frame the starting point $a_{s(i)}$ is obtained from the previous framing result, and the frame $F_{Ai}$ of the current speech is determined with the corresponding pitch period length $p_{A s(i)}$ as the frame length, where $s(i) = s(i-1) + p_{A s(i-1)}$; this process is repeated until the end of the speech, yielding $N_A$ frames in total.

After framing, with each frame centered on its midpoint, a speech data matrix $D_A$ of size $l_{A\max} \times N_A$ is constructed, where $l_{A\max}$ is the longest frame length; each column holds one frame of speech and is windowed with a Hanning window; when the matrix is built, frames at the beginning or end of the speech that are too short are padded with the first or last sample point of the speech, respectively.

The matrix $D_A$ is then searched column by column: one point is selected in each column such that the selected points form a pitch mark trajectory $pm_A$ running through all columns and the sum of the selected point values is maximal; during the search, the row difference between the points selected in adjacent columns is restricted to at most 6 rows. This method yields the pitch mark points used for pitch-synchronous overlap-add processing; in voiced segments these marks lie at the positions of the amplitude maxima. The pitch mark points $pm_B$ of speech B are obtained in the same way.

2) Matching stage

According to the phoneme segmentation information, a matching correspondence between the pitch mark points of the source and target speakers' phonemes is established: $pm_A^n \leftrightarrow pm_B^n$, where $pm_A^n$ and $pm_B^n$ denote the pitch mark information contained in the $n$-th phoneme of the source and target speech respectively, in the form:

$$pm_A^n = [pm_A^{n1}, pm_A^{n2}, \ldots, pm_A^{ni}, \ldots, pm_A^{nI_n}]$$
$$pm_B^n = [pm_B^{n1}, pm_B^{n2}, \ldots, pm_B^{nj}, \ldots, pm_B^{nJ_n}]$$

Here $pm_A^{ni}$ and $pm_B^{nj}$ are the $i$-th and $j$-th pitch mark points in the $n$-th phoneme of the source and target speech, and $I_n$ and $J_n$ are the numbers of pitch marks the two contain in the $n$-th phoneme.

3) Alignment stage

Based on the pitch mark information of the matched phonemes in training speech A and B, pitch-synchronous overlap-add is used to align the durations of corresponding phonemes of the source and target speakers; the frame length of the pitch-synchronous overlap-add processing is taken as three times the pitch period corresponding to the current pitch mark. During alignment, the phoneme of a matched pair with the shorter duration is taken as the reference and the other phoneme is compressed by pitch-synchronous overlap-add. Because the PSOLA method adjusts duration in units of pitch periods, the adjustment precision is only guaranteed within one pitch period, so the residual duration difference of the current matched phoneme pair is carried over into the duration alignment of the next matched pair; silent segments between phonemes are aligned by truncation.

After each phoneme and the silence between phonemes in speech A and B have been processed by the above steps, the time-aligned source speech A' and target speech B' are obtained.
3. The voice conversion method based on convolutive nonnegative matrix factorization according to claim 1, characterized in that, in the speech parameter decomposition of the training phase, the time-aligned training speech is decomposed with the STRAIGHT model, yielding three groups of parameters for the source speaker's speech A' and the target speaker's speech B' respectively:

1) the STRAIGHT spectrum characterizing the vocal-tract spectrum, a two-dimensional matrix:

$$S = \begin{bmatrix} s_{11} & \cdots & s_{1j} & \cdots & s_{1N} \\ \vdots & & \vdots & & \vdots \\ s_{i1} & \cdots & s_{ij} & \cdots & s_{iN} \\ \vdots & & \vdots & & \vdots \\ s_{M1} & \cdots & s_{Mj} & \cdots & s_{MN} \end{bmatrix}$$

where each column holds the STRAIGHT spectrum values of one frame of speech, with $M = 256$ spectral values per frame; the whole utterance is analyzed in $N$ frames whose centers are 10 ms apart; $S_{A'}$ denotes the source speaker's STRAIGHT spectrum and $S_{B'}$ the target speaker's STRAIGHT spectrum;

2) the pitch frequency of the training speech, $f = [f_1, \ldots, f_j, \ldots, f_N]$, where $f_j$ is the pitch frequency of the $j$-th frame of speech, corresponding to the $j$-th column of the STRAIGHT spectrum; $f_{A'}$ denotes the source speaker's pitch frequency and $f_{B'}$ the target speaker's pitch frequency;

3) the aperiodic component $ap$, a matrix characterizing the aperiodic part of the speech information; its influence on the converted speech is small, so it is not converted.
4. The voice conversion method based on convolutive nonnegative matrix factorization according to claim 1, characterized in that the analysis steps of the training phase are as follows:

1) the source speaker's STRAIGHT spectrum $S_{A'}$ is decomposed with the convolutive nonnegative matrix factorization method, giving the decomposition result:

$$S_{A'} \approx \hat{S}_{A'} = \sum_{t=0}^{T_b-1} W_{A'}(t) \cdot \overset{t\to}{H_{A'}}$$

where the $W_{A'}(t)$ are the time-frequency bases of $S_{A'}$, each an $F \times T$ matrix whose columns are frequency-domain basis vectors of the STRAIGHT spectrum; $T_b$ such basis vectors constitute one time-frequency base, and $T$ such time-frequency bases are obtained in total, with $T_b = 8$ and $T = 40$; $\overset{t\to}{H_{A'}}$ denotes the coding matrix $H_{A'}$ with its column vectors shifted right by $t$ positions, specifically: let

$$H_{A'} = [h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}]$$

where $h_{A'}^i$ is the $i$-th column vector of $H_{A'}$ and $H_{A'}$ contains $I_H$ column vectors in total; then for $t = 2$:

$$\overset{2\to}{H_{A'}} = [\vec{0}, \vec{0}, h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H-2}]$$

and for $t = -2$:

$$\overset{-2\to}{H_{A'}} = [h_{A'}^3, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}, \vec{0}, \vec{0}]$$

where $\vec{0}$ is the all-zero column vector;

2) the target speaker's STRAIGHT spectrum $S_{B'}$ is analyzed in the same way as in 1), but with the coding matrix fixed at $H_{A'}$, which yields the time-frequency bases $W_{B'}$ of $S_{B'}$.
CN201110267425A, filed 2011-09-09 (priority 2011-09-09): Voice conversion method based on convolutive nonnegative matrix factorization; granted as CN102306492B (en); status: Expired - Fee Related

Priority Applications (1)

Application Number: CN201110267425A; Priority Date: 2011-09-09; Filing Date: 2011-09-09; Title: Voice conversion method based on convolutive nonnegative matrix factorization

Publications (2)

CN102306492A (application), published 2012-01-04
CN102306492B (grant), published 2012-09-12

Family ID: 45380342

Family Applications (1)

CN201110267425A: Voice conversion method based on convolutive nonnegative matrix factorization (Expired - Fee Related), priority 2011-09-09, filed 2011-09-09

Country Status (1)

CN: CN102306492B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102610236A (en) * 2012-02-29 2012-07-25 山东大学 Method for improving voice quality of throat microphone
CN102855884B (en) * 2012-09-11 2014-08-13 中国人民解放军理工大学 Speech time scale modification method based on short-term continuous nonnegative matrix decomposition
CN103020017A (en) * 2012-12-05 2013-04-03 湖州师范学院 Non-negative matrix factorization method of popular regularization and authentication information maximization
US10055479B2 (en) 2015-01-12 2018-08-21 Xerox Corporation Joint approach to feature and document labeling
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 A kind of sound converting method and device
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN105930308B (en) * 2016-04-14 2019-01-15 中国科学院西安光学精密机械研究所 The non-negative matrix factorization method restored based on low-rank
CN107221321A (en) * 2017-03-27 2017-09-29 杭州电子科技大学 A kind of phonetics transfer method being used between any source and target voice
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107785030B (en) * 2017-10-18 2021-04-30 杭州电子科技大学 Voice conversion method
CN109712634A (en) * 2018-12-24 2019-05-03 东北大学 A kind of automatic sound conversion method
CN109767778B (en) * 2018-12-27 2020-07-31 中国人民解放军陆军工程大学 Bi-L STM and WaveNet fused voice conversion method
CN110148424B (en) * 2019-05-08 2021-05-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN111899716B (en) * 2020-08-03 2021-03-12 北京帝派智能科技有限公司 Speech synthesis method and system
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Citations (2)

* Cited by examiner, † Cited by third party

Patent Citations:
CN101278282A *, priority 2004-12-23, published 2008-10-01, Cambridge Display Technology Limited: Digital signal processing methods and apparatus
CN101441872A *, priority 2007-11-19, published 2009-05-27, Mitsubishi Electric Corporation: Denoising acoustic signals using constrained non-negative matrix factorization

Non-Patent Citations (3):
Zhang Ye et al., "Blind Separation of Convolutive Mixed Source Signals by Using Robust Nonnegative Matrix Factorization," 2009 Fifth International Conference on Natural Computation, 2009. *
Liu Boquan et al., "Blind speech separation using nonnegative matrix factorization" (刘伯权 等, 采用非负矩阵分解的语音盲分离), Computer Engineering and Design (计算机工程与设计), 2011, No. 1. *
Min Gang et al., "Research on speech segmentation algorithms in segment vocoders" (闵刚 等, 分段声码器中的语音分段算法研究), Signal Processing (信号处理), 2007, vol. 23, No. 4A. *

Also Published As

CN102306492A, published 2012-01-04

Similar Documents

Publication Publication Date Title
CN102306492B (en) Voice conversion method based on convolutive nonnegative matrix factorization
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN108847249A (en) Sound converts optimization method and system
Jemine Real-time voice cloning
EP2109096B1 (en) Speech synthesis with dynamic constraints
Hashimoto et al. Trajectory training considering global variance for speech synthesis based on neural networks
CN102436807A (en) Method and system for automatically generating voice with stressed syllables
US20230317056A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
Nirmal et al. Voice conversion using general regression neural network
Denisov et al. End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning
CN106782599A (en) The phonetics transfer method of post filtering is exported based on Gaussian process
Nakamura et al. A mel-cepstral analysis technique restoring high frequency components from low-sampling-rate speech.
CN102436815A (en) Voice identifying device applied to on-line test system of spoken English
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
EP3149727B1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
Aihara et al. Noise-robust voice conversion based on sparse spectral mapping using non-negative matrix factorization
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
CN103226946B (en) Voice synthesis method based on limited Boltzmann machine
Kanagawa et al. Speaker-independent style conversion for HMM-based expressive speech synthesis
Asakawa et al. Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics
Wen et al. Pitch-scaled spectrum based excitation model for HMM-based speech synthesis
Sailor et al. Unsupervised learning of temporal receptive fields using convolutional RBM for ASR task
Aroon et al. Statistical parametric speech synthesis: A review
Dinh et al. Quality improvement of HMM-based synthesized speech based on decomposition of naturalness and intelligibility using non-negative matrix factorization
Aihara et al. Semi-non-negative matrix factorization using alternating direction method of multipliers for voice conversion

Legal Events

PB01 (C06): Publication
SE01 (C10): Entry into force of request for substantive examination
GR01 (C14): Grant of patent or utility model
CF01 / EXPY: Termination of patent right due to non-payment of annual fee (granted publication date: 2012-09-12; termination date: 2014-09-09)