CN102306492B - Voice conversion method based on convolutive nonnegative matrix factorization - Google Patents

Voice conversion method based on convolutive nonnegative matrix factorization

Info

Publication number: CN102306492B
Application number: CN201110267425A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN102306492A
Inventors: 张雄伟, 孙健, 曹铁勇, 孙新建, 黄建军, 杨吉斌, 邹霞, 贾冲
Original and current assignee: PLA University of Science and Technology
Application filed by PLA University of Science and Technology
Priority to CN201110267425A
Filing date: 2011-09-09
Publication date (grant): 2012-09-12
Publication of CN102306492A (application) and CN102306492B (grant)
Legal status: Expired - Fee Related

Abstract

The invention discloses a voice conversion method based on convolutive nonnegative matrix factorization. The method comprises the following steps: (1) training a conversion model on training data: performing time alignment and parameter decomposition of the training speech, analyzing the STRAIGHT spectra with a convolutive nonnegative matrix factorization method, and analyzing the pitch frequencies of the source and target speech; (2) converting newly input speech with the trained model: decomposing the source speech $A_c$ to be converted into parameters with the STRAIGHT model, converting the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, converting the pitch frequency with the mean and variance obtained in the training phase, and synthesizing the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the original aperiodic component $ap_{A_c}$. The invention improves the training effectiveness of voice conversion and the voice quality of the converted speech.

Description

Voice conversion method based on convolutive nonnegative matrix factorization
Technical field
The invention belongs to the field of speech processing technology, and particularly relates to a voice conversion method based on convolutive nonnegative matrix factorization.
Background art
Voice conversion is a technique that modifies the personal characteristic information in a source speaker's speech signal so that it carries the personal characteristics of a target speaker's voice. Voice conversion has broad application prospects in personalized human-computer interaction, military operations, information security, and multimedia entertainment. For example, combining voice conversion with a speech synthesis system enables personalized speech synthesis; through voice conversion, an enemy commander's voice can be forged to send false information or orders and disrupt the enemy's command; and voice conversion can be used to reproduce the speech of historical figures.
Voice conversion (Voice Conversion/Transformation) has been studied for over twenty years (Li Bo, Wang Chengyou, Cai Xuanping, et al., "A survey of voice conversion and related techniques," Journal on Communications, 2004(05): 109-118); the earliest method was proposed by Abe et al. in 1988. Existing voice conversion methods mainly include: methods based on vector quantization codebook mapping (1. M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," ICASSP-88, 1988, pp. 655-658), on Gaussian mixture models (2. Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, pp. 131-142, 1998), on hidden Markov models (3. E. K. Kim, S. Lee, and Y. H. Oh, "Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker," in Proc. Eurospeech, Rhodes, Greece, 1997, pp. 2519-2522), on frequency warping (4. D. Erro and A. Moreno, "Weighted Frequency Warping for Voice Conversion," in InterSpeech 2007 - EuroSpeech, Antwerp, Belgium, 2007), and on artificial neural network models (5. S. Desai, A. Black, B. Yegnanarayana, and K. Prahallad, "Spectral Mapping Using Artificial Neural Networks for Voice Conversion," IEEE Transactions on Audio, Speech, and Language Processing, 2010).
Although a variety of voice conversion methods have been proposed, the conversion effect is still far from practical requirements. The main problems of existing voice conversion methods are:
1. Many voice conversion methods operate within a framework in which the speech signal is first divided into frames and each frame is then processed independently. Under this framework the correlation between speech frames is ignored, so discontinuities appear in the converted speech and its quality is reduced. Methods based on vector quantization codebook mapping, on Gaussian mixture models, and on artificial neural network models are examples.
2. The goal of voice conversion is to convert exactly the speaker's personal characteristic information in the speech; existing voice conversion methods, however, do not separate the speaker's personal characteristic information from the speech signal before conversion, but process the speech signal directly. This not only makes the conversion effect unsatisfactory, but also degrades the quality of the converted speech, because other components of the speech signal are altered as well.
Convolutive Nonnegative Matrix Factorization (CNMF) is a nonnegative matrix factorization method proposed for speech signal processing. While guaranteeing the nonnegativity of the factorization result, it replaces the original one-dimensional basis vectors with two-dimensional time-frequency bases and therefore better captures the temporal correlation of the speech signal. The method has been applied quite successfully to multi-speaker speech separation (6. P. Smaragdis, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1-12, 2007). With this method a speech signal can be decomposed into a set of nonnegative time-frequency bases and the coding matrix of those bases. The time-frequency bases obtained by the decomposition can be regarded as subspaces carrying speaker characteristics, and the coding matrix as the projection of the speech onto those subspaces. The factorization therefore largely separates the speaker characteristic information from the speech signal. In addition, compared with conventional nonnegative matrix factorization, convolutive nonnegative matrix factorization takes the temporal correlation of the speech signal into account and thus ensures the continuity of the reconstructed speech.
However, the factorization result of this method is not unique: decomposing the same speech data under different initial conditions yields different basis matrices. Although this non-uniqueness can be regarded as different representations of the feature space, it limits the application of the method to voice conversion.
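As a minimal illustration of the factorization model (not code from the patent), the reconstruction $S \approx \sum_t W(t) \cdot \overset{t\to}{H}$ can be written in a few lines of numpy; the array layout (a stack of $T_b$ basis slices) is our assumption:

```python
import numpy as np

def cnmf_reconstruct(W, H):
    """Reconstruct S from CNMF factors.

    W: (Tb, F, T) stack of basis slices -- T two-dimensional time-frequency
       bases, each spanning Tb frames; H: (T, N) coding matrix.
    Returns S_hat (F, N) = sum_t W[t] @ (H shifted right by t columns).
    """
    Tb, F, T = W.shape
    N = H.shape[1]
    S_hat = np.zeros((F, N))
    for t in range(Tb):
        S_hat[:, t:] += W[t] @ H[:, :N - t]   # right-shift of H by t columns
    return S_hat
```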
Summary of the invention
The object of the present invention is to provide a voice conversion method based on convolutive nonnegative matrix factorization. Convolutive nonnegative matrix factorization is used to separate the personal characteristic information in the speech vocal-tract spectrum, while the temporal correlation of the speech is effectively preserved during the separation. Under the premise that the convolutive nonnegative matrix factorization results of the source speaker's and target speaker's speech spectra are consistent, the conversion of the vocal-tract spectrum is accomplished by replacing the time-frequency bases. On this basis full voice conversion is realized, so that the converted speech has high quality and strong similarity to the target speaker's personal voice characteristics.
The technical solution that realizes the object of the invention is a voice conversion method based on convolutive nonnegative matrix factorization with the following steps:

First, train the conversion model on training data:

Step 1: time alignment and parameter decomposition of the training speech. The training data are parallel speech, i.e. pairs of utterances of identical content from the source speaker and the target speaker, where the source speech is denoted A and the target speech B. First extract the pitch period envelopes $p_A$ and $p_B$ of both with the STRAIGHT model; then compute from the pitch period envelopes and the original speech signals the pitch mark points $pm_A$ and $pm_B$ used for pitch-synchronous overlap-add (PSOLA) processing. According to the phoneme segmentation information, match the pitch mark points of corresponding phonemes of A and B; then, taking the phoneme as the basic unit, time-align A and B by pitch-synchronous overlap-add based on the matched pitch marks, obtaining the time-aligned speech A' and B'. Analyze A' and B' with the STRAIGHT model to obtain three groups of parameters:

(1) the STRAIGHT spectra $S_{A'}$ and $S_{B'}$ characterizing the vocal tract;
(2) the pitch frequencies $f_{A'}$ and $f_{B'}$;
(3) the aperiodic components $ap_{A'}$ and $ap_{B'}$.

Step 2: analyze the STRAIGHT spectra with the convolutive nonnegative matrix factorization method: first decompose the STRAIGHT spectrum $S_{A'}$ of A' to obtain its time-frequency bases $W_{A'}(t)$ and coding matrix $H_{A'}$; then decompose the STRAIGHT spectrum $S_{B'}$ of B' with the coding matrix fixed at $H_{A'}$, which yields its time-frequency bases $W_{B'}(t)$.

Step 3: analyze the pitch frequencies of the source and target speech, i.e. analyze the pitch information $f_{A'}$ and $f_{B'}$ of A' and B' to obtain their means and variances: $\mu_{A'}$, $\sigma^2_{A'}$ and $\mu_{B'}$, $\sigma^2_{B'}$.
Second, convert newly input speech with the trained model:

Step 1: decompose the source speech $A_c$ to be converted with the STRAIGHT model, obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, pitch frequency $f_{A_c}$, and aperiodic component $ap_{A_c}$;

Step 2: convert the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, i.e. decompose $S_{A_c}$ with the time-frequency bases fixed at $W_{A'}$, obtain the corresponding coding matrix $H_{A_c}$, and then obtain the converted STRAIGHT spectrum by

$$S_{B_c} = W_{B'} \otimes H_{A_c}$$

where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation;

Step 3: convert the pitch frequency using the pitch mean and variance obtained in the training phase:

$$f_{B_c} = (f_{A_c} - \mu_{A'}) \frac{\sigma_{B'}}{\sigma_{A'}} + \mu_{B'}$$

where $f_{B_c}$ denotes the converted pitch frequency;

Step 4: synthesize the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the original aperiodic component $ap_{A_c}$.
Compared with the prior art, the present invention has notable advantages: (1) in the training phase, the source speaker's and target speaker's speech is matched by pitch-synchronous overlap-add based on phoneme information, so the matched speech has higher temporal matching precision and voice quality, which improves the training of the voice conversion model; (2) convolutive nonnegative matrix factorization effectively separates the personal characteristic information in the vocal-tract spectrum, so the conversion process can act specifically on the personal characteristic information, which improves the conversion effect. In addition, convolutive nonnegative matrix factorization effectively preserves the temporal correlation of the vocal-tract spectrum parameters, so the reconstructed speech is more continuous and the voice quality of the converted speech is improved.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the voice conversion method based on convolutive nonnegative matrix factorization of the present invention.
Fig. 2 is a schematic diagram of the phoneme-based time alignment of the training speech.
Fig. 3 is a schematic diagram of speech pitch mark points.
Fig. 4 is a schematic flow diagram of computing the time-frequency bases of the training speech by convolutive nonnegative matrix factorization.
Fig. 5 is a schematic diagram of a STRAIGHT-spectrum time-frequency basis set composed of 40 sub-bases.
Fig. 6 is a schematic flow diagram of the spectral conversion based on convolutive nonnegative matrix factorization.
Embodiment
With reference to Fig. 1, the voice conversion method based on convolutive nonnegative matrix factorization of the present invention proceeds as follows:

Training phase: train the conversion model on the training data.
Step 1: time alignment and parameter decomposition of the training speech.

(1) Time alignment of the speech data, as shown in Fig. 2. First analyze the source speaker's speech A and the target speaker's speech B in the training set with the STRAIGHT model to obtain the pitch of each sample point, i.e. the pitch period envelopes $p_A$ and $p_B$:

$$p_A = [p_{A1}, \ldots, p_{Ai}, \ldots, p_{A l_A}]$$
$$p_B = [p_{B1}, \ldots, p_{Bi}, \ldots, p_{B l_B}]$$

where $l_A$ and $l_B$ denote the number of sample points in the source speaker's speech A and the target speaker's speech B, respectively.

The pitch period here is expressed as a number of sample points, with the fractional part rounded. Because unvoiced and silent segments have no distinct pitch period, their pitch period is fixed at $[0.01 f_s]$, where $f_s$ is the speech sampling frequency and $[x]$ denotes the largest integer not exceeding $x$. Based on the pitch contour, speech A and B are divided into frames whose length equals the local pitch period. Taking speech A as an example, the framing procedure is as follows:
Starting from the 1st sample point $a_{s(1)}$ of the speech, the first frame $F_{A1}$ is determined with its corresponding pitch period length $p_{A s(1)}$ as the frame length, where $s(1) = 1$. Then the sample point $a_{s(2)}$, where $s(2) = s(1) + p_{A s(1)}$, is taken as the start of the second frame, and the second frame $F_{A2}$ is determined with its corresponding pitch period as the frame length. By analogy, for the $i$-th frame the starting point $a_{s(i)}$ is obtained from the previous framing result, and the frame $F_{Ai}$ of the current speech is determined with the corresponding pitch period length $p_{A s(i)}$ as the frame length, where $s(i) = s(i-1) + p_{A s(i-1)}$. This process is repeated until the end of the speech; suppose $N_A$ frames are obtained in total.
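A minimal numpy sketch of this pitch-synchronous framing, under the convention stated above that each frame starts one pitch period after the previous one (function and variable names are ours, not the patent's):

```python
import numpy as np

def pitch_sync_frames(x, pitch_period):
    """Split x into frames; frame i starts at s(i) and has length equal to the
    pitch period at s(i), so s(i) = s(i-1) + p[s(i-1)] as described above.

    x: 1-D speech signal; pitch_period: per-sample pitch period in samples
    (unvoiced/silent samples already fixed to int(0.01 * fs)).
    """
    starts, frames = [], []
    s = 0
    while s < len(x):
        p = max(1, int(pitch_period[s]))   # frame length = local pitch period
        starts.append(s)
        frames.append(x[s:s + p])
        s += p                             # next frame starts one period later
    return starts, frames
```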
After framing, with each frame centered on its midpoint, a speech data matrix $D_A$ of size $l_{A\max} \times N_A$ is constructed, where $l_{A\max}$ is the longest frame length; each column holds one frame of speech, and each column is windowed with a Hanning window. When the matrix is built, frames at the beginning or end of the speech that are too short are padded with the first or last sample point of the speech, respectively.

The matrix $D_A$ is then searched column by column: one point is selected in each column such that the selected points form a pitch mark trajectory $pm_A$ running through all columns and the sum of the selected point values is maximal. During the search, the row difference between the points selected in adjacent columns is restricted to at most 6 rows. This method yields the pitch mark points used for PSOLA processing; in voiced segments these marks lie at the positions of the amplitude maxima. Fig. 3 shows the pitch mark points obtained with this method for a segment of speech. The pitch mark points $pm_B$ of speech B are obtained in the same way.
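The column-by-column search with the 6-row jump constraint is a longest-path problem and can be solved exactly by dynamic programming; a sketch of the search the text describes (our formulation):

```python
import numpy as np

def pitch_mark_track(D, max_jump=6):
    """Pick one row per column of D maximizing the sum of picked values,
    with rows picked in adjacent columns differing by at most max_jump."""
    M, N = D.shape
    score = D[:, 0].astype(float).copy()
    back = np.zeros((M, N), dtype=int)
    for j in range(1, N):
        new_score = np.empty(M)
        for r in range(M):
            lo, hi = max(0, r - max_jump), min(M, r + max_jump + 1)
            k = lo + int(np.argmax(score[lo:hi]))   # best predecessor row
            back[r, j] = k
            new_score[r] = score[k] + D[r, j]
        score = new_score
    track = np.empty(N, dtype=int)
    track[-1] = int(np.argmax(score))
    for j in range(N - 1, 0, -1):                   # backtrack the trajectory
        track[j - 1] = back[track[j], j]
    return track   # row index of the pitch mark in each column
```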
According to the phoneme segmentation information, a matching correspondence between the pitch mark points of the source and target speakers' phonemes is established: $pm_A^n \leftrightarrow pm_B^n$, where $pm_A^n$ and $pm_B^n$ denote the pitch mark information contained in the $n$-th phoneme of the source and target speech respectively, in the form:

$$pm_A^n = [pm_A^{n1}, pm_A^{n2}, \ldots, pm_A^{ni}, \ldots, pm_A^{nI_n}]$$
$$pm_B^n = [pm_B^{n1}, pm_B^{n2}, \ldots, pm_B^{nj}, \ldots, pm_B^{nJ_n}]$$

Here $pm_A^{ni}$ and $pm_B^{nj}$ are the $i$-th and $j$-th pitch mark points in the $n$-th phoneme of the source and target speech, and $I_n$ and $J_n$ are the numbers of pitch marks the two contain in the $n$-th phoneme.
Based on the pitch mark information of the matched phonemes in training speech A and B, the PSOLA method is used to align the durations of corresponding phonemes of the source and target speakers. The frame length of the PSOLA processing is taken as three times the pitch period corresponding to the current pitch mark. During alignment, the phoneme of a matched pair with the shorter duration is taken as the reference, and the other phoneme is compressed by PSOLA to achieve alignment. Because the PSOLA method adjusts duration in units of pitch periods, the adjustment precision is only guaranteed within one pitch period; the residual duration difference of the current matched phoneme pair is therefore carried over into the duration alignment of the next matched pair. Silent segments between phonemes are aligned by truncation.

After each phoneme and the silence between phonemes in speech A and B have been processed by the above steps, the time-aligned source speech A' and target speech B' are obtained.
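The patent applies standard PSOLA; as a rough illustration only, the following sketch compresses a phoneme to a target number of pitch marks by dropping grains and overlap-adding Hann-windowed segments three pitch periods long. The uniform grain selection and the padding details are our simplifications, not the patent's procedure:

```python
import numpy as np

def psola_compress(x, marks, n_target):
    """Very simplified PSOLA duration change: keep n_target of the given pitch
    marks (uniformly chosen), re-stack their grains at pitch-period spacing,
    and overlap-add. Each grain is Hann-windowed, 3 pitch periods long."""
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)
    periods = np.append(periods, periods[-1])          # period at each mark
    keep = np.round(np.linspace(0, len(marks) - 1, n_target)).astype(int)
    new_marks = np.concatenate(([0], np.cumsum(periods[keep][:-1])))
    y = np.zeros(int(new_marks[-1] + 2 * periods[keep[-1]]) + 1)
    for nm, im in zip(new_marks, keep):
        half = int(1.5 * periods[im])                  # 3 pitch periods total
        lo, hi = max(0, marks[im] - half), min(len(x), marks[im] + half)
        grain = x[lo:hi] * np.hanning(hi - lo)
        pos = int(nm) - (marks[im] - lo)               # align grain on new mark
        a, b = max(0, pos), min(len(y), pos + len(grain))
        y[a:b] += grain[a - pos:b - pos]
    return y
```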
(2) Speech parameter decomposition. The time-aligned training speech is decomposed with the STRAIGHT model, yielding three groups of parameters for the source speaker's speech A' and the target speaker's speech B' respectively:

a) The STRAIGHT spectrum characterizing the vocal-tract spectrum, a two-dimensional matrix:

$$S = \begin{bmatrix} s_{11} & \cdots & s_{1j} & \cdots & s_{1N} \\ \vdots & & \vdots & & \vdots \\ s_{i1} & \cdots & s_{ij} & \cdots & s_{iN} \\ \vdots & & \vdots & & \vdots \\ s_{M1} & \cdots & s_{Mj} & \cdots & s_{MN} \end{bmatrix}$$

Each column holds the STRAIGHT spectrum values of one frame of speech, with $M$ spectral values per frame; here $M = 256$. The whole utterance is analyzed in $N$ frames whose centers are 10 ms apart. $S_{A'}$ denotes the source speaker's STRAIGHT spectrum and $S_{B'}$ the target speaker's STRAIGHT spectrum.

b) The pitch frequency of the training speech, $f = [f_1, \ldots, f_j, \ldots, f_N]$, where $f_j$ is the pitch frequency of the $j$-th frame of speech, corresponding to the $j$-th column of the STRAIGHT spectrum. $f_{A'}$ denotes the source speaker's pitch frequency and $f_{B'}$ the target speaker's pitch frequency.

c) The aperiodic component $ap$, a matrix characterizing the aperiodic part of the speech information. Its influence on the converted speech is small, so it is not converted.
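STRAIGHT itself is distributed separately; as a stand-in for experimentation, the freely available WORLD vocoder (the pyworld package) produces the same three parameter groups. A sketch of the decomposition with 10 ms frame spacing follows; the use of WORLD instead of STRAIGHT is our substitution, and the number of spectral bins depends on the FFT size rather than being fixed to M = 256:

```python
import numpy as np
import pyworld as pw   # WORLD vocoder, used here as a stand-in for STRAIGHT

def analyze(x, fs):
    """Decompose speech into the three parameter groups: spectral envelope S
    (columns = frames, as in the patent), pitch contour f, aperiodicity ap."""
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs, frame_period=10.0)   # pitch per frame, 10 ms apart
    sp = pw.cheaptrick(x, f0, t, fs)               # smooth spectral envelope
    ap = pw.d4c(x, f0, t, fs)                      # aperiodic component
    return sp.T, f0, ap.T                          # (M, N): one column per frame
```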
Step 2: analyze the STRAIGHT spectra of the speech by the convolutive nonnegative matrix factorization method to obtain the time-frequency bases of the source and target speakers' STRAIGHT spectra, as shown in Fig. 4. The analysis proceeds as follows:

(1) Decompose the source speaker's STRAIGHT spectrum $S_{A'}$ with convolutive nonnegative matrix factorization, obtaining the decomposition result:

$$S_{A'} \approx \hat{S}_{A'} = \sum_{t=0}^{T_b-1} W_{A'}(t) \cdot \overset{t\to}{H_{A'}}$$

where the $W_{A'}(t)$ are the time-frequency bases of $S_{A'}$, each an $F \times T$ matrix whose columns are frequency-domain basis vectors of the STRAIGHT spectrum; the $T_b$ basis vectors at the same column index (one per shift $t$) constitute one time-frequency base, so $T$ such time-frequency bases are obtained in total. Here $T_b = 8$ and $T = 40$.
$\overset{t\to}{H_{A'}}$ denotes the coding matrix $H_{A'}$ with its column vectors shifted right by $t$ positions. Specifically, let

$$H_{A'} = [h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}]$$

where $h_{A'}^i$ is the $i$-th column vector of $H_{A'}$ and $H_{A'}$ contains $I_H$ column vectors in total. Then for $t = 2$:

$$\overset{2\to}{H_{A'}} = [\vec{0}, \vec{0}, h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H-2}]$$

and for $t = -2$:

$$\overset{-2\to}{H_{A'}} = [h_{A'}^3, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}, \vec{0}, \vec{0}]$$

where $\vec{0}$ is the all-zero column vector.
$W_{A'}(t)$ and $H_{A'}$ are obtained by the following iterative process:

a) Randomly initialize $W_{A'}(t)$ and $H_{A'}$;

b) Compute the reconstruction of $S_{A'}$:

$$\hat{S}_{A'} = \sum_{t=0}^{T_b-1} W_{A'}(t) \cdot \overset{t\to}{H_{A'}}$$

c) Based on $\hat{S}_{A'}$, update the time-frequency bases $W_{A'}(t)$, computing successively for $t = 0, \ldots, T_b - 1$:
$$W_{A'}(t) \leftarrow W_{A'}(t) \odot \frac{\left( S_{A'} \oslash \hat{S}_{A'} \right) \cdot \left( \overset{t\to}{H_{A'}} \right)^{\mathsf T}}{I_{M \times N} \cdot \left( \overset{t\to}{H_{A'}} \right)^{\mathsf T}}$$

where "$\odot$" denotes element-wise multiplication of two matrices, "$\oslash$" element-wise division, and $I_{M \times N}$ is the $M \times N$ matrix whose elements are all 1. After the update of the time-frequency bases is complete, the coding matrix is updated by

$$H_{A'} \leftarrow H_{A'} \odot \frac{\sum_{t=0}^{T_b-1} W_{A'}(t)^{\mathsf T} \cdot \overset{\leftarrow t}{\left( S_{A'} \oslash \hat{S}_{A'} \right)}}{\sum_{t=0}^{T_b-1} W_{A'}(t)^{\mathsf T} \cdot I_{M \times N}}$$

where $\overset{\leftarrow t}{(\cdot)}$ denotes a left shift by $t$ columns (these are the standard multiplicative update rules of convolutive NMF, cf. reference 6).
d) Check whether the number of iterations has reached the maximum of 300, or whether the speech reconstruction error has fallen below $10^{-5}$, the reconstruction error being

$$e_A = \sum_{ij} \left( s_{A'ij} - \hat{s}_{A'ij} \right)^2$$

If neither condition is satisfied, return to step b) and continue iterating; otherwise terminate the iteration and proceed to step e);

e) Obtain the final decomposition result: $W_{A'}(t)$ and $H_{A'}$.
Fig. 5 shows the STRAIGHT-spectrum time-frequency bases obtained by decomposing a segment of speech with this method.
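Putting the iteration together, a compact numpy sketch of the decomposition (multiplicative updates in the form given above; the flags for holding W or H fixed anticipate the constrained decompositions used below, and all names, as well as the simplification of using one common ratio matrix per sweep, are ours):

```python
import numpy as np  # uses shift() defined above

def cnmf(S, T=40, Tb=8, n_iter=300, tol=1e-5, W=None, H=None,
         update_W=True, update_H=True, seed=0):
    """Convolutive NMF: S (F, N) ~= sum_t W[t] @ shift(H, t).
    W: (Tb, F, T) basis slices, H: (T, N) coding matrix; pass W or H together
    with update_W/update_H=False to keep that factor fixed."""
    rng = np.random.default_rng(seed)
    F, N = S.shape
    eps = 1e-12
    if W is not None:
        Tb, _, T = W.shape
    if H is not None:
        T = H.shape[0]
    if W is None:
        W = rng.random((Tb, F, T)) + eps
    if H is None:
        H = rng.random((T, N)) + eps
    ones = np.ones((F, N))
    recon = lambda: sum(W[t] @ shift(H, t) for t in range(Tb))
    for _ in range(n_iter):
        R = S / (recon() + eps)                    # element-wise ratio S / S_hat
        if update_W:
            for t in range(Tb):
                Ht = shift(H, t).T
                W[t] *= (R @ Ht) / (ones @ Ht + eps)
            R = S / (recon() + eps)                # refresh after basis update
        if update_H:
            num = sum(W[t].T @ shift(R, -t) for t in range(Tb))
            den = sum(W[t].T @ ones for t in range(Tb))
            H *= num / (den + eps)
        if np.sum((S - recon()) ** 2) < tol:       # reconstruction error e_A
            break
    return W, H
```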
The time-frequency bases $W_{A'}$ obtained from the decomposition can be regarded as a feature subspace of the source speaker's STRAIGHT spectrum that carries the personal characteristic information of the source speaker's vocal-tract spectrum, while the coding matrix $H_{A'}$ is the projection of the spectrum onto the subspace $W_{A'}$ and carries its evolution over time. Because the source and target speech in the training set have been precisely time-aligned, they can be considered to differ only in speaker characteristic information; hence, after the nonnegative matrix factorization, both share the same coding matrix.
(2) Decompose the target speaker's STRAIGHT spectrum $S_{B'}$ with the convolutive nonnegative matrix factorization method in the same way as in (1), but with the coding matrix fixed at $H_{A'}$; this yields the time-frequency bases $W_{B'}$ of $S_{B'}$.
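With the cnmf() sketch above, fixing the coding matrix is just a matter of disabling its update, e.g.:

```python
W_A, H_A = cnmf(S_A)                                 # free decomposition of S_A'
W_B, _ = cnmf(S_B, H=H_A.copy(), update_H=False)     # coding matrix fixed at H_A'
```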
Step 3: analyze the pitch frequency parameters of the source and target speech. The STRAIGHT analysis yields the first- and second-order statistics of the pitch frequency in the source and target training speech, i.e. the means and variances of $f_{A'}$ and $f_{B'}$: $\mu_{A'}$, $\sigma^2_{A'}$ and $\mu_{B'}$, $\sigma^2_{B'}$.
Conversion phase: convert newly input speech with the trained model.
Step 1: decompose the source speech $A_c$ to be converted with the STRAIGHT model (the parameter decomposition is the same as in the training phase), obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, pitch frequency $f_{A_c}$, and aperiodic component $ap_{A_c}$.
Step 2: convert the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, as shown in Fig. 6. First decompose $S_{A_c}$ with convolutive nonnegative matrix factorization. The analysis is identical to that in step (1) of the second training step, except that the time-frequency bases are now fixed to the $W_{A'}$ obtained in training, which yields the corresponding coding matrix $H_{A_c}$. As the preceding analysis showed, the time-frequency bases carry the speaker's personal characteristic information; the conversion is therefore realized by replacing $W_{A'}$ with $W_{B'}$ and convolving with the coding matrix $H_{A_c}$, giving the converted STRAIGHT spectrum:

$$S_{B_c} = W_{B'} \otimes H_{A_c} = \sum_{t=0}^{T_b-1} W_{B'}(t) \cdot \overset{t\to}{H_{A_c}}$$

where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation.
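In terms of the training sketch above, this step fixes the bases instead of the coding matrix and then swaps in the target bases (variable names ours):

```python
# encode the input spectrum on the fixed source bases W_A
_, H_Ac = cnmf(S_Ac, W=W_A.copy(), update_W=False)
# replace W_A by the target bases W_B and convolve with the coding matrix
S_Bc = sum(W_B[t] @ shift(H_Ac, t) for t in range(W_B.shape[0]))
```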
Step 3: convert the pitch frequency. The pitch frequency $f_{A_c}$ to be converted is transformed using the means and variances of the source and target speakers' pitch obtained in the training phase:

$$f_{B_c} = (f_{A_c} - \mu_{A'}) \frac{\sigma_{B'}}{\sigma_{A'}} + \mu_{B'}$$

where $f_{B_c}$ denotes the converted pitch frequency.
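As a sketch, with the statistics from training (applying the transform to voiced frames only is our assumption; the patent does not discuss unvoiced frames, whose pitch value is conventionally zero):

```python
import numpy as np

def convert_f0(f0_src, mu_A, sigma_A, mu_B, sigma_B):
    """Mean/variance transform of the pitch contour."""
    out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    out[voiced] = (f0_src[voiced] - mu_A) * (sigma_B / sigma_A) + mu_B
    return out
```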
Step 4: synthesize the converted speech. Using the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the aperiodic component $ap_{A_c}$ obtained during signal decomposition, the converted speech is obtained with the STRAIGHT speech synthesis algorithm:

$$B_c = f_{\mathrm{STRAIGHT}}(S_{B_c}, f_{B_c}, ap_{A_c})$$

where $f_{\mathrm{STRAIGHT}}$ denotes the STRAIGHT speech synthesis algorithm and $B_c$ is the converted speech data.
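A matching synthesis sketch with the WORLD stand-in used earlier (again our substitution for the STRAIGHT synthesis algorithm $f_{\mathrm{STRAIGHT}}$):

```python
import numpy as np
import pyworld as pw

def synthesize(S_Bc, f_Bc, ap_Ac, fs):
    """B_c = f_STRAIGHT(S_Bc, f_Bc, ap_Ac): render the converted speech."""
    sp = np.ascontiguousarray(S_Bc.T, dtype=np.float64)   # back to (frames, bins)
    ap = np.ascontiguousarray(ap_Ac.T, dtype=np.float64)
    f0 = np.ascontiguousarray(f_Bc, dtype=np.float64)
    return pw.synthesize(f0, sp, ap, fs, frame_period=10.0)
```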

Claims (4)

1. A voice conversion method based on convolutive nonnegative matrix factorization, characterized by the following steps:

First, train the conversion model on training data:

Step 1: time alignment and parameter decomposition of the training speech. The training data are parallel speech, i.e. pairs of utterances of identical content from the source speaker and the target speaker, where the source speech is denoted A and the target speech B. First extract the pitch period envelopes $p_A$ and $p_B$ of both with the STRAIGHT model; then compute from the pitch period envelopes and the original speech signals the pitch mark points $pm_A$ and $pm_B$ used for pitch-synchronous overlap-add processing. According to the phoneme segmentation information, match the pitch mark points of corresponding phonemes of A and B; then, taking the phoneme as the basic unit, time-align A and B by pitch-synchronous overlap-add based on the matched pitch marks, obtaining the time-aligned speech A' and B'. Analyze A' and B' with the STRAIGHT model to obtain three groups of parameters:

(1) the STRAIGHT spectra $S_{A'}$ and $S_{B'}$ characterizing the vocal tract;
(2) the pitch frequencies $f_{A'}$ and $f_{B'}$;
(3) the aperiodic components $ap_{A'}$ and $ap_{B'}$.

Step 2: analyze the STRAIGHT spectra with the convolutive nonnegative matrix factorization method: first decompose the STRAIGHT spectrum $S_{A'}$ of A' to obtain its time-frequency bases $W_{A'}(t)$ and coding matrix $H_{A'}$; then decompose the STRAIGHT spectrum $S_{B'}$ of B' with the coding matrix fixed at $H_{A'}$, obtaining its time-frequency bases $W_{B'}(t)$.

Step 3: analyze the pitch frequencies of the source and target speech, i.e. analyze the pitch information $f_{A'}$ and $f_{B'}$ of A' and B' to obtain their means and variances: $\mu_{A'}$, $\sigma^2_{A'}$ and $\mu_{B'}$, $\sigma^2_{B'}$.

Second, convert newly input speech with the trained model:

Step 1: decompose the source speech $A_c$ to be converted with the STRAIGHT model, obtaining three groups of parameters: its STRAIGHT spectrum $S_{A_c}$, pitch frequency $f_{A_c}$, and aperiodic component $ap_{A_c}$;

Step 2: convert the vocal-tract spectrum parameters based on convolutive nonnegative matrix factorization, i.e. decompose $S_{A_c}$ with the time-frequency bases fixed at $W_{A'}$, obtain the corresponding coding matrix $H_{A_c}$, and then obtain the converted STRAIGHT spectrum by

$$S_{B_c} = W_{B'} \otimes H_{A_c}$$

where $S_{B_c}$ denotes the converted STRAIGHT spectrum and "$\otimes$" is the convolution operation;

Step 3: convert the pitch frequency using the pitch mean and variance obtained in the training phase:

$$f_{B_c} = (f_{A_c} - \mu_{A'}) \frac{\sigma_{B'}}{\sigma_{A'}} + \mu_{B'}$$

where $f_{B_c}$ denotes the converted pitch frequency;

Step 4: synthesize the converted speech from the converted STRAIGHT spectrum $S_{B_c}$, the converted pitch frequency $f_{B_c}$, and the original aperiodic component $ap_{A_c}$.
2. The voice conversion method based on convolutive nonnegative matrix factorization according to claim 1, characterized in that, based on the pitch contour, speech A and B are time-aligned with the pitch period length as the frame length:

1) Framing stage

Starting from the 1st sample point $a_{s(1)}$ of the speech, the first frame $F_{A1}$ is determined with its corresponding pitch period length $p_{A s(1)}$ as the frame length, where $s(1) = 1$; then the sample point $a_{s(2)}$, where $s(2) = s(1) + p_{A s(1)}$, is taken as the start of the second frame, and the second frame $F_{A2}$ is determined with its corresponding pitch period as the frame length; by analogy, for the $i$-th frame the starting point $a_{s(i)}$ is obtained from the previous framing result, and the frame $F_{Ai}$ of the current speech is determined with the corresponding pitch period length $p_{A s(i)}$ as the frame length, where $s(i) = s(i-1) + p_{A s(i-1)}$; this process is repeated until the end of the speech, yielding $N_A$ frames in total.

After framing, with each frame centered on its midpoint, a speech data matrix $D_A$ of size $l_{A\max} \times N_A$ is constructed, where $l_{A\max}$ is the longest frame length; each column holds one frame of speech and is windowed with a Hanning window; when the matrix is built, frames at the beginning or end of the speech that are too short are padded with the first or last sample point of the speech, respectively.

The matrix $D_A$ is then searched column by column: one point is selected in each column such that the selected points form a pitch mark trajectory $pm_A$ running through all columns and the sum of the selected point values is maximal; during the search, the row difference between the points selected in adjacent columns is restricted to at most 6 rows. This method yields the pitch mark points used for pitch-synchronous overlap-add processing; in voiced segments these marks lie at the positions of the amplitude maxima. The pitch mark points $pm_B$ of speech B are obtained in the same way.

2) Matching stage

According to the phoneme segmentation information, a matching correspondence between the pitch mark points of the source and target speakers' phonemes is established: $pm_A^n \leftrightarrow pm_B^n$, where $pm_A^n$ and $pm_B^n$ denote the pitch mark information contained in the $n$-th phoneme of the source and target speech respectively, in the form:

$$pm_A^n = [pm_A^{n1}, pm_A^{n2}, \ldots, pm_A^{ni}, \ldots, pm_A^{nI_n}]$$
$$pm_B^n = [pm_B^{n1}, pm_B^{n2}, \ldots, pm_B^{nj}, \ldots, pm_B^{nJ_n}]$$

Here $pm_A^{ni}$ and $pm_B^{nj}$ are the $i$-th and $j$-th pitch mark points in the $n$-th phoneme of the source and target speech, and $I_n$ and $J_n$ are the numbers of pitch marks the two contain in the $n$-th phoneme.

3) Alignment stage

Based on the pitch mark information of the matched phonemes in training speech A and B, pitch-synchronous overlap-add is used to align the durations of corresponding phonemes of the source and target speakers; the frame length of the pitch-synchronous overlap-add processing is taken as three times the pitch period corresponding to the current pitch mark. During alignment, the phoneme of a matched pair with the shorter duration is taken as the reference and the other phoneme is compressed by pitch-synchronous overlap-add. Because the PSOLA method adjusts duration in units of pitch periods, the adjustment precision is only guaranteed within one pitch period, so the residual duration difference of the current matched phoneme pair is carried over into the duration alignment of the next matched pair; silent segments between phonemes are aligned by truncation.

After each phoneme and the silence between phonemes in speech A and B have been processed by the above steps, the time-aligned source speech A' and target speech B' are obtained.
3. The voice conversion method based on convolutive nonnegative matrix factorization according to claim 1, characterized in that, in the speech parameter decomposition of the training phase, the time-aligned training speech is decomposed with the STRAIGHT model, yielding three groups of parameters for the source speaker's speech A' and the target speaker's speech B' respectively:

1) the STRAIGHT spectrum characterizing the vocal-tract spectrum, a two-dimensional matrix:

$$S = \begin{bmatrix} s_{11} & \cdots & s_{1j} & \cdots & s_{1N} \\ \vdots & & \vdots & & \vdots \\ s_{i1} & \cdots & s_{ij} & \cdots & s_{iN} \\ \vdots & & \vdots & & \vdots \\ s_{M1} & \cdots & s_{Mj} & \cdots & s_{MN} \end{bmatrix}$$

where each column holds the STRAIGHT spectrum values of one frame of speech, with $M = 256$ spectral values per frame; the whole utterance is analyzed in $N$ frames whose centers are 10 ms apart; $S_{A'}$ denotes the source speaker's STRAIGHT spectrum and $S_{B'}$ the target speaker's STRAIGHT spectrum;

2) the pitch frequency of the training speech, $f = [f_1, \ldots, f_j, \ldots, f_N]$, where $f_j$ is the pitch frequency of the $j$-th frame of speech, corresponding to the $j$-th column of the STRAIGHT spectrum; $f_{A'}$ denotes the source speaker's pitch frequency and $f_{B'}$ the target speaker's pitch frequency;

3) the aperiodic component $ap$, a matrix characterizing the aperiodic part of the speech information; its influence on the converted speech is small, so it is not converted.
4. The voice conversion method based on convolutive nonnegative matrix factorization according to claim 1, characterized in that the analysis steps of the training phase are as follows:

1) the source speaker's STRAIGHT spectrum $S_{A'}$ is decomposed with the convolutive nonnegative matrix factorization method, giving the decomposition result:

$$S_{A'} \approx \hat{S}_{A'} = \sum_{t=0}^{T_b-1} W_{A'}(t) \cdot \overset{t\to}{H_{A'}}$$

where the $W_{A'}(t)$ are the time-frequency bases of $S_{A'}$, each an $F \times T$ matrix whose columns are frequency-domain basis vectors of the STRAIGHT spectrum; $T_b$ such basis vectors constitute one time-frequency base, and $T$ such time-frequency bases are obtained in total, with $T_b = 8$ and $T = 40$; $\overset{t\to}{H_{A'}}$ denotes the coding matrix $H_{A'}$ with its column vectors shifted right by $t$ positions, specifically: let

$$H_{A'} = [h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}]$$

where $h_{A'}^i$ is the $i$-th column vector of $H_{A'}$ and $H_{A'}$ contains $I_H$ column vectors in total; then for $t = 2$:

$$\overset{2\to}{H_{A'}} = [\vec{0}, \vec{0}, h_{A'}^1, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H-2}]$$

and for $t = -2$:

$$\overset{-2\to}{H_{A'}} = [h_{A'}^3, \ldots, h_{A'}^i, \ldots, h_{A'}^{I_H}, \vec{0}, \vec{0}]$$

where $\vec{0}$ is the all-zero column vector;

2) the target speaker's STRAIGHT spectrum $S_{B'}$ is analyzed in the same way as in 1), but with the coding matrix fixed at $H_{A'}$, which yields the time-frequency bases $W_{B'}$ of $S_{B'}$.
CN201110267425A, filed 2011-09-09 (priority 2011-09-09): Voice conversion method based on convolutive nonnegative matrix factorization; granted as CN102306492B (en); status: Expired - Fee Related

Priority Applications (1)

Application Number: CN201110267425A; Priority Date: 2011-09-09; Filing Date: 2011-09-09; Title: Voice conversion method based on convolutive nonnegative matrix factorization

Publications (2)

CN102306492A (application), published 2012-01-04
CN102306492B (grant), published 2012-09-12

Family ID: 45380342

Family Applications (1)

CN201110267425A: Voice conversion method based on convolutive nonnegative matrix factorization (Expired - Fee Related), priority 2011-09-09, filed 2011-09-09

Country Status (1)

CN: CN102306492B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102610236A (en) * 2012-02-29 2012-07-25 山东大学 Method for improving voice quality of throat microphone
CN102855884B (en) * 2012-09-11 2014-08-13 中国人民解放军理工大学 Speech time scale modification method based on short-term continuous nonnegative matrix decomposition
CN103020017A (en) * 2012-12-05 2013-04-03 湖州师范学院 Non-negative matrix factorization method of popular regularization and authentication information maximization
US10055479B2 (en) 2015-01-12 2018-08-21 Xerox Corporation Joint approach to feature and document labeling
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 A kind of sound converting method and device
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN105930308B (en) * 2016-04-14 2019-01-15 中国科学院西安光学精密机械研究所 The non-negative matrix factorization method restored based on low-rank
CN107221321A (en) * 2017-03-27 2017-09-29 杭州电子科技大学 A kind of phonetics transfer method being used between any source and target voice
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107785030B (en) * 2017-10-18 2021-04-30 杭州电子科技大学 Voice conversion method
CN109712634A (en) * 2018-12-24 2019-05-03 东北大学 A kind of automatic sound conversion method
CN109767778B (en) * 2018-12-27 2020-07-31 中国人民解放军陆军工程大学 Bi-L STM and WaveNet fused voice conversion method
CN110148424B (en) * 2019-05-08 2021-05-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN111899716B (en) * 2020-08-03 2021-03-12 北京帝派智能科技有限公司 Speech synthesis method and system
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Citations (2)

* Cited by examiner, † Cited by third party

Patent Citations:
CN101278282A *, priority 2004-12-23, published 2008-10-01, Cambridge Display Technology Limited: Digital signal processing methods and apparatus
CN101441872A *, priority 2007-11-19, published 2009-05-27, Mitsubishi Electric Corporation: Denoising acoustic signals using constrained non-negative matrix factorization

Non-Patent Citations (3):
Zhang Ye et al., "Blind Separation of Convolutive Mixed Source Signals by Using Robust Nonnegative Matrix Factorization," 2009 Fifth International Conference on Natural Computation, 2009. *
Liu Boquan et al., "Blind speech separation using nonnegative matrix factorization" (刘伯权 等, 采用非负矩阵分解的语音盲分离), Computer Engineering and Design (计算机工程与设计), 2011, No. 1. *
Min Gang et al., "Research on speech segmentation algorithms in segment vocoders" (闵刚 等, 分段声码器中的语音分段算法研究), Signal Processing (信号处理), 2007, vol. 23, No. 4A. *

Also Published As

CN102306492A, published 2012-01-04

Similar Documents

Publication Publication Date Title
CN102306492B (en) Voice conversion method based on convolutive nonnegative matrix factorization
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN108847249A (en) Sound converts optimization method and system
Jemine Real-time voice cloning
EP2109096B1 (en) Speech synthesis with dynamic constraints
Hashimoto et al. Trajectory training considering global variance for speech synthesis based on neural networks
CN102436807A (en) Method and system for automatically generating voice with stressed syllables
US20230317056A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
Nirmal et al. Voice conversion using general regression neural network
Denisov et al. End-to-end multi-speaker speech recognition using speaker embeddings and transfer learning
CN106782599A (en) The phonetics transfer method of post filtering is exported based on Gaussian process
Nakamura et al. A mel-cepstral analysis technique restoring high frequency components from low-sampling-rate speech.
CN102436815A (en) Voice identifying device applied to on-line test system of spoken English
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
EP3149727B1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
Aihara et al. Noise-robust voice conversion based on sparse spectral mapping using non-negative matrix factorization
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
CN103226946B (en) Voice synthesis method based on limited Boltzmann machine
Kanagawa et al. Speaker-independent style conversion for HMM-based expressive speech synthesis
Asakawa et al. Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics
Wen et al. Pitch-scaled spectrum based excitation model for HMM-based speech synthesis
Sailor et al. Unsupervised learning of temporal receptive fields using convolutional RBM for ASR task
Aroon et al. Statistical parametric speech synthesis: A review
Dinh et al. Quality improvement of HMM-based synthesized speech based on decomposition of naturalness and intelligibility using non-negative matrix factorization
Aihara et al. Semi-non-negative matrix factorization using alternating direction method of multipliers for voice conversion

Legal Events

PB01 (C06): Publication
SE01 (C10): Entry into force of request for substantive examination
GR01 (C14): Grant of patent or utility model
CF01 / EXPY: Termination of patent right due to non-payment of annual fee (granted publication date: 2012-09-12; termination date: 2014-09-09)