CN106971703A - A kind of song synthetic method and device based on HMM - Google Patents
- Publication number
- CN106971703A CN106971703A CN201710160104.2A CN201710160104A CN106971703A CN 106971703 A CN106971703 A CN 106971703A CN 201710160104 A CN201710160104 A CN 201710160104A CN 106971703 A CN106971703 A CN 106971703A
- Authority
- CN
- China
- Prior art keywords
- song
- hmm
- model
- speaker
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000010189 synthetic method Methods 0.000 title claims abstract description 24
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 77
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 75
- 238000012549 training Methods 0.000 claims abstract description 38
- 230000006978 adaptation Effects 0.000 claims abstract description 28
- 238000005516 engineering process Methods 0.000 claims abstract description 18
- 238000000034 method Methods 0.000 claims description 32
- 230000003044 adaptive effect Effects 0.000 claims description 29
- 238000001228 spectrum Methods 0.000 claims description 26
- 238000004458 analytical method Methods 0.000 claims description 22
- 238000009826 distribution Methods 0.000 claims description 20
- 230000009466 transformation Effects 0.000 claims description 13
- 239000011159 matrix material Substances 0.000 claims description 12
- 230000008569 process Effects 0.000 claims description 11
- 238000006243 chemical reaction Methods 0.000 claims description 10
- 230000002194 synthesizing effect Effects 0.000 claims description 7
- 238000012417 linear regression Methods 0.000 claims description 6
- 230000008859 change Effects 0.000 claims description 5
- 239000000463 material Substances 0.000 claims description 5
- 238000010835 comparative analysis Methods 0.000 claims description 3
- 238000003066 decision tree Methods 0.000 claims description 3
- 230000005284 excitation Effects 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 230000003595 spectral effect Effects 0.000 claims description 3
- 238000012731 temporal analysis Methods 0.000 claims description 3
- 230000000737 periodic effect Effects 0.000 claims 1
- 238000011160 research Methods 0.000 abstract description 5
- 230000008451 emotion Effects 0.000 abstract description 2
- 238000010586 diagram Methods 0.000 description 5
- 238000010606 normalization Methods 0.000 description 4
- 230000003993 interaction Effects 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- 238000003860 storage Methods 0.000 description 3
- 238000007476 Maximum Likelihood Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000008092 positive effect Effects 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
- G10H1/0025—Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0033—Recording/reproducing or transmission of music for electrophonic musical instruments
- G10H1/0041—Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
- G10H1/0058—Transmission between separate instruments or between individual components of a musical system
- G10H1/0066—Transmission between separate instruments or between individual components of a musical system using a MIDI interface
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/148—Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/005—Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
- G10H2250/015—Markov chains, e.g. hidden Markov models [HMM], for musical processing, e.g. musical analysis or musical composition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/471—General musical sound synthesis principles, i.e. sound category-independent synthesis methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Probability & Statistics with Applications (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
The invention discloses a song synthesis method and device based on HMM. Using TTS (text-to-speech) technology, the HTS system (an HMM-based speech synthesis system) and the STRAIGHT algorithm, it establishes an HMM-based speaker-dependent acoustic model for song synthesis and a melody control model for the song, and performs speaker adaptive training, realizing a personalized speech synthesis device that converts lyrics to song in real time based on HMM. The device enriches the research content of speech synthesis and gives the synthesized voice greater expressiveness and emotion; in particular, it offers music lovers an opportunity to learn technical operations such as song production and music processing, and it adds to the social resources available to people, giving it practical value and significance.
Description
Technical field
The present invention relates to the fields of human-computer interaction, text-to-speech conversion and speech synthesis, and in particular to a song synthesis method and device based on HMM.
Background technology
With the continuous innovation and improvement of information technology, music multimedia applications involving human-computer interaction have gradually entered our daily lives, such as requesting songs, composing and arranging music on a computer, and song recognition on mobile phones. How to make computers more human, able to "sing" like people do, that is, to automatically produce a beautiful, pleasant-sounding song given only the numbered musical notation and the lyrics, has become a new demand. The rapid development of multimedia technology in the entertainment field also provides a wider application space for this technology.
At present, the overwhelming majority of music is recorded and distributed in digital formats such as WAV, MP3 and MIDI, as well as various streaming formats. Compared with traditional modes of music, digital music has unmatched advantages in production, storage and distribution. With a computer, a creator can hear the result of a musical work while composing it; any modification to the score is fed back to the creator immediately, without the complex traditional cycle of rehearsal, performance, recording and editing. This greatly reduces the production cycle and labor cost of music making, and also prevents the composer from losing a momentary flash of inspiration during a lengthy production process.
Speech synthesis is an important research topic in human-computer interaction and an important component of embedded research. Song synthesis has now also gradually become a hot topic. Before song synthesis technology emerged, however, speech synthesis technology had already matured considerably. Some scholars have tried to synthesize songs with speech synthesis methods, but songs and speech differ to a certain degree: speech emphasizes content (and can of course express the speaker's intent and emotion), while song emphasizes the interpretation and rise and fall of the melody. This makes it difficult to apply speech synthesis methods directly to song synthesis.
Over a long period of research at home and abroad, song synthesis, like speech synthesis, has gradually formed three mainstream synthesis approaches: 1. waveform concatenation synthesis; 2. parametric synthesis; 3. speech modification synthesis. Concatenative synthesis and parametric synthesis are both corpus-based and their synthesis quality is not high, while speech modification is more flexible: it changes the acoustic parameters of the speech signal according to melody information to synthesize the song. Personalized real-time lyrics-to-song conversion has been proposed at home and abroad. One system produces a song immediately from the score information of the song: it receives continuous speech of the lyrics, uses the Viterbi algorithm to select continuous synthesis units after the speech corresponding to the lyrics is recorded, and realizes real-time conversion of pitch, duration, energy and spectrum through the pitch-synchronous overlap-add (PSOLA) method to synthesize the song. Because that system does not account for the differences between speech and song in acoustic attributes such as pitch and duration, its synthesis quality is unsatisfactory. On this basis, a large-corpus lyrics-to-song system was proposed that achieved fairly good results in both naturalness and sound quality; it built three Mandarin corpora and used the Viterbi algorithm to determine the optimal combination of synthesis units. The drawback of this method is that building the corpora consumes a great deal of time and human effort.
Therefore, those skilled in the art are committed to developing a new HMM-based personalized song synthesis method and device for users with music processing needs.
Summary of the invention
In view of the above drawbacks of the prior art, the invention addresses the problems raised in the background: Chinese song synthesis is under-researched, synthesis quality is not high, and operation is time-consuming and laborious. It provides an HMM-based personalized song synthesis method and device for users with music processing needs.
In order to solve the above technical problems, the technical solution provided by the present invention is as follows:
A song synthesis method based on HMM comprises the following steps:
A. Analyzing the differences between speech and song in acoustic features, and establishing a melody control model for the song;
B. Establishing an HMM-based speaker-dependent acoustic model for song synthesis;
C. Synthesizing the song with an HMM-based speech synthesis system.
Further, the specific steps of analyzing the differences between speech and song in acoustic features in step A are as follows:
a. Performing spectrum analysis on the speech signal with time-domain and frequency-domain analysis methods, and comparing the fundamental frequency of the speech signal and the song signal;
b. Extracting the required score information from the MIDI system using MIDI technology;
c. Reading the melody information of the score extracted from the MIDI file, analyzing the structural features of the score file, and thereby obtaining music parameter information, including channel number, note pitch, key velocity, note onset time and note duration.
Further, the melody control model of the song in step A includes a fundamental frequency control model and a duration control model; the fundamental frequency control model converts the discrete notes in the score into a continuous fundamental frequency curve, and the duration control model obtains the pronunciation duration of each sung note.
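A minimal sketch of the first half of such a fundamental frequency control model: mapping discrete MIDI note numbers to frequencies (equal temperament, A4 = 440 Hz) and expanding them into a frame-level F0 track. A real implementation would also smooth the note boundaries into a truly continuous curve; this piecewise-constant version, with an assumed 5 ms frame shift, only illustrates the mapping.

```python
import math

def midi_to_hz(note):
    # equal-temperament mapping, A4 (MIDI note 69) = 440 Hz
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def note_sequence_to_f0(notes, frame_shift=0.005):
    """notes: list of (midi_note, duration_seconds);
    returns a per-frame F0 track in Hz (piecewise constant)."""
    f0 = []
    for note, dur in notes:
        f0.extend([midi_to_hz(note)] * int(round(dur / frame_shift)))
    return f0

curve = note_sequence_to_f0([(69, 0.05), (71, 0.05)])
print(len(curve), curve[0])
```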
Further, establishing the HMM-based speaker-dependent acoustic model for song synthesis in step B comprises the following steps:
a. Using the speech corpus of the speakers, analyzing the speech data to obtain acoustic parameters including the fundamental frequency F0, duration, spectrum SP and aperiodic index AP; and using the HMM-based speaker adaptive training technique to train an average voice model of the mixed speech;
b. Using a small amount of speech data from the target speaker to be synthesized, obtaining the adaptive acoustic model of the target speaker through speaker adaptation transformation, and revising and updating the adaptive model.
Further, training the average voice model of the mixed speech through HMM-based speaker adaptive training comprises the following steps:
a. Performing speech analysis on the speaker corpora and the target speaker's corpus data, extracting the acoustic parameters (mel-cepstral coefficients) and computing their first-order and second-order differences;
b. Combining the context attribute set, performing HMM model training: training the HMM models of the spectrum and fundamental frequency parameters and the multi-space distribution hidden semi-Markov models (MSD-HSMM) of the state duration parameters;
c. Using a small amount of the target speaker's speech, performing speaker adaptive training to obtain the average voice model of the mixed speech, thereby obtaining context-dependent MSD-HSMM models.
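The first-order and second-order differences of step a can be sketched as follows. The 3-point central-difference window and the toy one-dimensional "mel-cepstrum" track are illustrative assumptions; practical systems compute the deltas with the regression windows of their HMM training toolchain.

```python
def deltas(frames):
    """First-order differences with a 3-point central window:
    delta[t] = 0.5 * (c[t+1] - c[t-1]); edge frames are replicated."""
    n = len(frames)
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [[0.5 * (padded[t + 2][d] - padded[t][d])
             for d in range(len(frames[0]))]
            for t in range(n)]

mcep = [[1.0], [2.0], [4.0], [4.0]]   # toy 1-dim "mel-cepstrum" track
d1 = deltas(mcep)        # first-order differences
d2 = deltas(d1)          # second-order differences
print(d1, d2)
```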
Further, using a small amount of speech data from the target speaker to be synthesized, obtaining the adaptive acoustic model of the target speaker through speaker adaptation transformation, and revising and updating the adaptive model, comprises the following steps:
a. After speaker adaptation training, the HSMM-based CMLLR adaptation algorithm is used to compute the transformed state output probability distributions and the mean vectors and covariance matrices of the duration probability distributions. Under state i, the transformation equations for the feature vector o and the state duration d are:
b_i(o) = N(o; A*mu_i - b, A*Sigma_i*A^T) = |A^(-1)| N(W*xi; mu_i, Sigma_i)
p_i(d) = N(d; alpha*m_i - beta, alpha*sigma_i^2*alpha) = |alpha^(-1)| N(X*psi; m_i, sigma_i^2)
where xi = [o^T, 1]^T, psi = [d, 1]^T, mu_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Sigma_i is the diagonal covariance matrix, sigma_i^2 is the variance, W = [A^(-1), b^(-1)] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [alpha^(-1), beta^(-1)] is the transformation matrix of the state duration probability density distribution;
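A toy numeric illustration of the state-output mean transform mu' = A*mu - b appearing in the transformation equations above; the matrix A, bias b, and mean mu are invented example values, and the full CMLLR estimation of the transform itself is not shown.

```python
def transform_mean(A, b, mu):
    """Apply mu' = A*mu - b with a plain matrix-vector product."""
    return [sum(A[i][j] * mu[j] for j in range(len(mu))) - b[i]
            for i in range(len(A))]

A = [[1.1, 0.0], [0.0, 0.9]]   # assumed regression matrix
b = [0.2, -0.1]                # assumed bias vector
mu = [1.0, 2.0]                # average-voice mean (assumed)
print(transform_mean(A, b, mu))
```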
b. Through the HSMM-based adaptive transformation algorithm, the spectrum, fundamental frequency and duration parameters of the speech data can be normalized and transformed; for adaptation data O of length T, a maximum likelihood estimate of the transform Lambda = (W, X) can be obtained;
c. The adaptive model of the voice is revised and updated using the maximum a posteriori (MAP) algorithm. For a given HSMM parameter set lambda, let its forward and backward probabilities be alpha_t(i) and beta_t(i) respectively; the generating probability of the continuous observation sequence o_{t-d+1}...o_t under state i is then:
The MAP estimation is described as follows:
where the transformed mean vectors are obtained by linear regression, omega and tau are the MAP estimation parameters of the state output and duration distributions respectively, and the weighted-average MAP estimates of the adaptive mean vectors serve as the updated means.
Further, the speech analysis and synthesis method used when synthesizing the song with the HMM-based speech synthesis system in step C is based on the STRAIGHT algorithm.
Further, synthesizing the song with the HMM-based speech synthesis system in step C comprises the following steps:
a. Analyzing the input lyrics text with a text analysis tool: the text analysis program converts the given lyrics text into an acoustic label sequence containing context description information; the decision trees obtained by clustering in the training process predict the context-dependent HMM models for each pronunciation and its context, which are then concatenated into a sentence HMM model;
b. Obtaining the pitch and note length of each note of the lyrics from the MIDI file, obtaining the corresponding fundamental frequency and duration through the melody control model, and converting the durations of each syllable's spectrum SP, aperiodic index AP and fundamental frequency F0 according to the note durations;
c. Generating the spectrum SP, aperiodic index AP, duration and fundamental frequency F0 parameter sequences of the sentence HMM model using the speaker-dependent acoustic model and the STRAIGHT algorithm, synthesizing the voice, and adding the musical accompaniment to realize the synthesis of the song.
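The following stand-in sketches only the last link of step c, turning a generated frame-level F0 sequence into a waveform, using simple phase-accumulation sinusoids instead of the STRAIGHT excitation/filter model; the sample rate and frame shift are assumed values, and no spectral envelope is applied.

```python
import math

SR = 16000       # assumed sample rate
FRAME = 0.005    # assumed 5 ms frame shift

def f0_to_waveform(f0_track):
    """Phase-accumulation sinusoid: a crude stand-in for the STRAIGHT
    excitation/filter stage, enough to audition a melody contour."""
    samples, phase = [], 0.0
    per_frame = int(SR * FRAME)
    for f0 in f0_track:
        for _ in range(per_frame):
            phase += 2.0 * math.pi * f0 / SR   # advance phase by f0
            samples.append(math.sin(phase))
    return samples

# 50 ms of A4 followed by 50 ms of B4
wav = f0_to_waveform([440.0] * 10 + [494.0] * 10)
print(len(wav))
```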
Further, the STRAIGHT-based speech analysis and synthesis used in step C proceeds as follows:
The speaker's speech signal is input first, and the fundamental frequency F0 and the spectral envelope of the speech are extracted with the STRAIGHT algorithm. The acoustic parameters are then modulated to generate a new sound source and a time-varying filter, and the speech is synthesized according to the original filter model using the following formula:
where Q represents a set of sampling positions in the synthesis excitation, and G(.) represents the pitch modulation, which can arbitrarily match the F0 of the original speech to the modulated F0. An all-pass filter is used to control the fine pitch and the time structure of the original signal, for example a linear phase shift proportional to frequency for controlling the fine structure of F0. From the modulated amplitude spectrum A(S(u(w), r(t)), u(w), r(t)) in the following formula, the Fourier transform V(w, t_i) of the corresponding minimum-phase pulse can be calculated, where A(.), u(.) and r(.) represent the modulation of the amplitude, frequency and time dimensions respectively:
where q represents frequency.
A song synthesis device based on HMM, characterized by including:
a melody control module for establishing the melody control model of the song;
an HMM-based speaker-dependent acoustic module for establishing the speaker-dependent acoustic model for song synthesis;
an HMM-based song synthesis module for synthesizing the singing voice to be synthesized.
Further, the melody control module includes:
a MIDI analysis unit for analyzing the score information extracted from the MIDI file and obtaining the corresponding music parameter information;
a melody control unit for establishing the melody control model of the song according to the differences between speech and song in acoustic features.
Further, the HMM-based speaker-dependent acoustic module includes:
an acoustic model unit for obtaining the acoustic model of the target speaker;
an acoustic parameter subunit for HMM-based parametric speech synthesis.
Further, the HMM-based song synthesis module includes:
a text analysis unit, which performs text analysis on the input lyrics text to obtain context-dependent labels;
an HMM model training subunit for establishing the HMM model library of the speech data;
a speaker adaptation subunit, which normalizes and transforms the speaker's characteristic parameters during training to obtain the adaptive model;
a speech synthesis unit for synthesizing the singing voice to be synthesized;
a song synthesis unit, which adds the musical accompaniment to the synthesized singing voice to complete the synthesis of the song.
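The module/unit decomposition of the claimed device could be mirrored by a skeleton like the one below; all class and method names are hypothetical, invented purely to show how the three modules compose, and the method bodies are intentionally empty.

```python
class MelodyControlModule:
    """MIDI analysis unit + melody control unit."""
    def parse_midi(self, midi_events): ...
    def build_melody_model(self, score_params): ...

class SpeakerAcousticModule:
    """Acoustic model unit + acoustic parameter subunit."""
    def train_average_voice(self, corpora): ...
    def adapt_to_target(self, target_utterances): ...

class SongSynthesisModule:
    """Text analysis, HMM training, adaptation, synthesis units."""
    def analyze_lyrics(self, text): ...
    def synthesize(self, sentence_hmm, melody): ...

class HmmSongSynthesizer:
    """Top-level device composing the three claimed modules."""
    def __init__(self):
        self.melody = MelodyControlModule()
        self.acoustic = SpeakerAcousticModule()
        self.synth = SongSynthesisModule()

print(isinstance(HmmSongSynthesizer().melody, MelodyControlModule))
```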
The invention has the following advantages and positive effects. The HMM-based song synthesis method and device uses TTS (text-to-speech) technology, HTS (an HMM-based speech synthesis system) and the STRAIGHT algorithm; it establishes the HMM-based speaker-dependent acoustic model for song synthesis and the melody control model of the song, and performs speaker adaptive training, realizing a personalized speech synthesis device that converts lyrics to song in real time based on HMM. Compared with traditional singing synthesis systems, this system bases its speech analysis and synthesis on the STRAIGHT algorithm while adding a speaker adaptive training process in the training stage to obtain the average voice model of the mixed speech. This training process reduces the influence of speaker differences in the speech library, thereby improving the speech quality of the synthesized song. On the basis of the average voice model, the speaker adaptation transformation synthesizes singing voices with good naturalness and melodiousness from only a small amount of the speaker's material. The device enriches the research content of speech synthesis and gives the synthesized voice greater expressiveness and emotion; in particular, it offers music lovers an opportunity to learn technical operations such as song production and music processing, and it adds to the social resources available to people, giving it practical value and significance.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a system flow block diagram of an HMM-based song synthesis method according to a preferred embodiment of the present invention;
Fig. 2 is the MIDI system block diagram of the preferred embodiment;
Fig. 3 is the speaker-adaptive speech synthesis system block diagram of the preferred embodiment;
Fig. 4 is the STRAIGHT analysis-modulation-synthesis system block diagram of the preferred embodiment;
Fig. 5 is a schematic structural diagram of the device realizing HMM-based song synthesis in the preferred embodiment.
Embodiment
The technical solutions of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, what is described is only a part of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the invention.
As shown in Fig. 1, a preferred embodiment of the invention discloses an HMM-based song synthesis method. Using TTS (text-to-speech) technology, HTS (an HMM-based speech synthesis system) and the STRAIGHT algorithm, it establishes the HMM-based speaker-dependent acoustic model for song synthesis and the melody control model of the song, and performs speaker adaptive training, realizing a personalized speech synthesis method that converts lyrics to song in real time based on HMM. It comprises the following steps.
A. Analyzing the differences between speech and song in acoustic features, and establishing the melody control model of the song.
The specific steps of analyzing the differences between speech and song in acoustic features in step A are as follows:
a. Performing spectrum analysis on the speech signal with time-domain and frequency-domain analysis methods, and comparing the fundamental frequency of the speech signal and the song signal;
b. Extracting the required score information from the MIDI system using MIDI technology;
c. Reading the melody information of the score extracted from the MIDI file, analyzing the structural features of the score file, and thereby obtaining music parameter information, including channel number, note pitch, key velocity, note onset time and note duration.
Fig. 2 shows the MIDI system block diagram.
The melody control model of the song in step A includes a fundamental frequency control model and a duration control model; the fundamental frequency control model converts the discrete notes in the score into a continuous fundamental frequency curve, and the duration control model obtains the pronunciation duration of each sung note.
B. Establishing the speaker-dependent acoustic model for song synthesis based on HMM.
As shown in Fig. 3, establishing the speaker-dependent acoustic model for song synthesis based on HMM described in step B comprises the following steps:
a. Using the speech corpus of the speakers, the speech data are analyzed to obtain the acoustic parameters, including the fundamental frequency F0, duration, spectrum SP, and aperiodic index AP; the speaker-adaptive training technique based on HMM is then used to train the average voice model of the mixed speech.
b. Using a small amount of speech data from the target speaker to be synthesized, the adaptive acoustic model of the target speaker is obtained through the speaker-adaptive transformation technique, and the adaptive model is corrected and updated, so that speech with the target speaker's timbre can be synthesized.
As shown in Fig. 3, training the average voice model of the mixed speech by the HMM-based speaker-adaptive training comprises the following steps:
a. Speech analysis is performed on the corpus of the speakers and the corpus data of the target speaker, and the acoustic parameters are extracted: the mel-cepstral coefficients, together with their first-order and second-order differences;
b. Combined with the context attribute set, HMM training is carried out to train the multi-space distribution hidden semi-Markov models (MSD-HSMM) of the spectrum and fundamental-frequency parameters and of the state duration parameters;
c. Using a small speech corpus of the target speaker, speaker-adaptive training is performed to obtain the average voice model of the mixed speech, and thereby the context-dependent MSD-HSMM models, including:
1. Using the constrained maximum-likelihood linear regression (CMLLR) algorithm, the differences between the speech data of the speakers in training and the average voice are represented by linear regression functions;
2. The differences between speakers are trained with a set of linear-regression equations that normalize the state output distributions and the state duration distributions;
3. Training yields the average voice model of the mixed speech, and thereby the context-dependent MSD-HSMM models.
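The first- and second-order differences of step a are computed by sliding a regression window over the per-frame mel-cepstral coefficients. A minimal sketch, assuming simple three-tap windows (HTS-style tools use configurable regression windows that may differ):

```python
def delta(features, weights):
    """Apply a regression window across time; features is a per-frame sequence
    of one coefficient. Edge frames are replicated, as delta tools commonly do."""
    half = len(weights) // 2
    T = len(features)
    out = []
    for t in range(T):
        acc = 0.0
        for j, w in enumerate(weights, start=-half):
            idx = min(max(t + j, 0), T - 1)   # clamp at sequence edges
            acc += w * features[idx]
        out.append(acc)
    return out

# assumed window coefficients; actual training configs may differ
DELTA1 = [-0.5, 0.0, 0.5]   # first-order difference
DELTA2 = [1.0, -2.0, 1.0]   # second-order difference

print(delta([1.0, 2.0, 3.0, 4.0], DELTA1))
```

The static coefficients and both delta streams are stacked into one observation vector per frame before HMM training.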
Obtaining the adaptive acoustic model of the target speaker from a small amount of speech data of the target speaker to be synthesized through the speaker-adaptive transformation technique, and correcting and updating the adaptive model, comprises the following steps:
a. After the speaker-adaptive training, the CMLLR adaptive algorithm based on HSMM is used to compute the mean vectors and covariance matrices of the state output probability distributions and the state duration probability distributions for the voice conversion. Under state i, the transformation equations of the feature vector o and the state duration d are:
b_i(o) = N(o; Aμ_i − b, AΣ_i A^T) = |A^-1| N(Wξ; μ_i, Σ_i)
p_i(d) = N(d; αm_i − β, ασ_i^2 α) = |α^-1| N(αψ; m_i, σ_i^2)
where ξ = [o^T, 1]^T, ψ = [d, 1]^T, μ_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Σ_i is the diagonal covariance matrix, σ_i^2 is the variance, W = [A^-1, b^-1] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α^-1, β^-1] is the transformation matrix of the state duration probability density distribution;
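The two sides of the state-output transformation equation can be checked numerically. The scalar sketch below verifies that the model-space Gaussian N(o; aμ − b, a·σ²·a) equals the Jacobian-weighted feature-space Gaussian; the feature-space argument is expanded here as (o + b)/a, which is an assumption about the patent's compact W·ξ notation:

```python
import math

def gauss(x, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def output_density_model_space(o, mu, var, a, b):
    """Model-space form of the state-output equation: N(o; a*mu - b, a*var*a)."""
    return gauss(o, a * mu - b, a * var * a)

def output_density_feature_space(o, mu, var, a, b):
    """Equivalent feature-space form: |a^-1| * N((o + b)/a; mu, var).
    The same identity holds for the duration equation with (alpha, beta)."""
    return abs(1.0 / a) * gauss((o + b) / a, mu, var)
```

Both forms give the same likelihood for any observation, which is what lets CMLLR estimate one shared transform instead of re-estimating every Gaussian.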
b. Through the adaptive transformation algorithm based on HSMM, the spectrum, fundamental-frequency, and duration parameters of the speech data can be normalized and transformed; for adaptation data O of length T, a maximum-likelihood estimate of the transform Λ = (W, X) can be carried out.
c. The adaptive model of the voice is corrected and updated using the maximum a posteriori (MAP) algorithm. For a given HSMM parameter set λ, let its forward and backward probabilities be α_t(i) and β_t(i), respectively; then the generation probability κ_t^d(i) of its continuous observation sequence o_{t−d+1}…o_t under state i is:
The MAP estimation is described as follows:
where ω and τ are the MAP estimation parameters of the state output and duration distributions, respectively; the remaining terms are the mean vectors after the linear-regression transformation and the weighted-average MAP estimates of the adaptive mean vectors.
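The essence of the MAP update is a weighted average between the transformed prior mean and the adaptation-data statistics. A simplified scalar sketch (the state-level HSMM update uses occupancy-weighted sufficient statistics; tau here plays the role of the prior-weight parameter):

```python
def map_update_mean(prior_mean, tau, data):
    """MAP estimate of a Gaussian mean: the adaptation-data mean shrunk toward
    the prior (linear-regression-transformed) mean. With little data the prior
    dominates; with much data the estimate approaches the sample mean."""
    n = len(data)
    sample_mean = sum(data) / n
    return (tau * prior_mean + n * sample_mean) / (tau + n)
```

This is why MAP correction is useful after CMLLR: the few target-speaker utterances refine the transformed means without letting sparse data pull them arbitrarily far.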
C. Synthesizing the song using the HMM-based speech synthesis system.
The speech analysis and synthesis method used for song synthesis with the HMM-based speech synthesis system described in step C is based on the STRAIGHT algorithm.
Synthesizing the song using the HMM-based speech synthesis system described in step C comprises the following steps:
a. The input lyrics text is analyzed with a text-analysis tool: the text-analysis program converts the given lyrics text into an acoustic label sequence containing context description information; the context-dependent HMMs for each pronunciation unit and its context are predicted with the decision trees obtained by clustering during training, and are then concatenated into a sentence HMM;
b. According to the MIDI file, the pitch and duration of each note in the lyrics are obtained, the corresponding fundamental frequency and duration are obtained through the melody control model, and the spectrum SP, aperiodic index AP, and fundamental frequency F0 durations of the syllables are modified using the note durations;
c. The parameter sequences of spectrum SP, aperiodic index AP, duration, and fundamental frequency F0 in the sentence HMM are generated using the speaker-dependent acoustic model and the STRAIGHT algorithm, the voice is synthesized, and the musical accompaniment is added to realize the synthesis of the song.
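Step b above requires stretching or compressing each syllable's parameter tracks so they last exactly as long as the corresponding note. A minimal sketch by linear resampling of frame indices (the frame shift and interpolation scheme are assumptions for illustration):

```python
def scale_to_note_duration(frames, note_dur_sec, frame_shift=0.005):
    """Resample a per-phone parameter track (e.g. the F0 contour or one spectral
    coefficient) so the phone lasts exactly note_dur_sec seconds."""
    target = max(1, round(note_dur_sec / frame_shift))
    out = []
    for k in range(target):
        pos = k * (len(frames) - 1) / max(1, target - 1)  # fractional source index
        i = int(pos)
        frac = pos - i
        j = min(i + 1, len(frames) - 1)
        out.append((1 - frac) * frames[i] + frac * frames[j])
    return out
```

In practice the HSMM duration model distributes the note duration over the states of the syllable rather than resampling uniformly; the sketch shows only the time-scaling idea.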
As shown in Fig. 4, during singing-voice synthesis, the STRAIGHT analysis-modification-synthesis system is used to accurately extract the fundamental-frequency information and to remove the periodic interference from the spectral envelope. The speech analysis and synthesis method based on the STRAIGHT algorithm described in step C proceeds as follows:
First, the speech signal of the speaker is input, and the fundamental frequency F0 and the spectral envelope of the speech are extracted with the STRAIGHT algorithm. The acoustic parameters are then modified to generate a new excitation source and time-varying filter, and the speech is synthesized with the original filter model according to the following formula:
where Q denotes the set of pulse positions in the synthesis excitation and G(·) denotes the pitch modulation, which can arbitrarily match the F0 of the original speech to the modulated F0. An all-pass filter controls the fine pitch structure and the time structure of the original signal; for example, a linear phase shift proportional to frequency controls the fine structure of F0. From the modulated amplitude spectrum A(S(u(w), r(t)), u(w), r(t)), given by the following formula, the Fourier transform V(w, t_i) of the corresponding minimum-phase pulse can be calculated, where A(·), u(·), and r(·) denote the modulations in the amplitude, frequency, and time dimensions, respectively;
where q denotes frequency.
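The pulse-position set Q above can be illustrated by accumulating phase along the per-frame F0 contour: a pulse is emitted each time the accumulated phase crosses an integer number of cycles. A minimal sketch (frame shift and sample rate are assumed values; real STRAIGHT synthesis also shapes each pulse with the minimum-phase response and adds the aperiodic component):

```python
def pulse_positions(f0_curve, frame_shift=0.005, sr=16000):
    """Place excitation pulses so the local pulse spacing follows the per-frame
    F0 contour. Returns sample indices of the pulses."""
    positions, phase = [], 0.0
    for k, f0 in enumerate(f0_curve):
        if f0 <= 0:            # unvoiced frame: no pulses, reset phase
            phase = 0.0
            continue
        phase += frame_shift * f0          # cycles accumulated over this frame
        while phase >= 1.0:
            phase -= 1.0
            # the crossing happened phase/f0 seconds before the frame end
            positions.append(round(((k + 1) * frame_shift - phase / f0) * sr))
    return positions
```

With a constant 100 Hz contour the pulses land every 10 ms, i.e. 160 samples apart at 16 kHz, so raising the F0 curve directly tightens the pulse spacing, which is how the melody control model's contour drives the synthesized pitch.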
Corresponding to the above method, another preferred embodiment of the present invention also discloses an HMM-based song synthesis device. The device is used to establish the speaker-dependent acoustic model for song synthesis based on HMM and the melody control model of the song, to perform speaker-adaptive training, and, by using HTS (the HMM-based speech synthesis system) with the STRAIGHT algorithm in combination with TTS (text-to-speech) technology, to realize the personalized real-time conversion of lyrics into song. In implementation, the functions of the device may be realized by software, by hardware, or by a combination of software and hardware.
As shown in Fig. 5, the song synthesis device includes: a melody control module, an HMM-based speaker-dependent acoustic module, and an HMM-based song synthesis module.
The melody control module is used to establish the melody control model of the song.
The melody control module includes:
a MIDI analysis unit, for analyzing the score information extracted from the MIDI file and obtaining the corresponding music parameter information;
a prosody control unit, for establishing the melody control model of the song according to the differences between speech and song in acoustic features.
Through the MIDI analysis unit, the score information extracted from the MIDI file is analyzed and the corresponding music parameter information is obtained; then, in the melody control module, the melody control model of the song is established according to the differences between speech and song in acoustic features.
The HMM-based speaker-dependent acoustic module is used to establish the speaker-dependent acoustic model for song synthesis.
The HMM-based speaker-dependent acoustic module includes:
an acoustic model unit, for obtaining the acoustic model of the target speaker;
an acoustic parameter subunit, for HMM-based parametric speech synthesis.
The HMM-based song synthesis module is used to synthesize the singing voice to be synthesized.
The HMM-based song synthesis module includes:
a text analysis unit, which performs text analysis on the input lyrics text to obtain context-dependent labels;
an HMM training subunit, which establishes the HMM model library of the speech data: the speaker's acoustic parameters, mainly the fundamental-frequency, spectrum, and duration parameters, are extracted from the speech data in the corpus and, combined with the context label information of the corpus, the statistical acoustic models are trained; the fundamental-frequency, spectrum, and duration parameters are then determined according to the context attribute set;
a speaker adaptation subunit, which normalizes and transforms the characteristic parameters of the speakers in training to obtain the adaptive model: through speaker training, the differences between the speakers and the average voice model in the state output distributions and the state duration distributions are normalized, the average voice model of the multi-speaker mixed speech is determined with the constrained maximum-likelihood linear regression algorithm, and then, using the adaptation data, the mean vectors and covariance matrices of the speaker's state output probability distributions and duration probability distributions are calculated and transformed to the target speaker model, thereby establishing the MSD-HSMM adaptive model of the target speaker;
a speech synthesis unit, which synthesizes the singing voice to be synthesized: using the corrected adaptive model, the speech parameters of the input lyrics text are predicted and the speech acoustic parameters are extracted, and the singing voice is then synthesized by the vocoder based on the STRAIGHT algorithm;
a song synthesis unit, which adds the musical accompaniment to the synthesized singing voice to complete the synthesis of the song.
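The module composition of Fig. 5 can be sketched as a pipeline object; the three callables below stand in for the melody control module, the speaker-dependent acoustic module, and the song synthesis module, and all names and signatures are illustrative rather than the patent's interfaces:

```python
class SongSynthesizer:
    """Composition of the device's three modules: melody control produces the
    F0 curve and durations, the acoustic model predicts parameter tracks from
    the lyrics, and song synthesis renders the singing voice."""

    def __init__(self, melody_control, acoustic_model, song_synthesis):
        self.melody_control = melody_control
        self.acoustic_model = acoustic_model
        self.song_synthesis = song_synthesis

    def synthesize(self, lyrics, midi_notes):
        f0_curve, durations = self.melody_control(midi_notes)
        params = self.acoustic_model(lyrics, f0_curve, durations)
        return self.song_synthesis(params)

# stub modules, purely for demonstrating the data flow
synth = SongSynthesizer(
    melody_control=lambda notes: ([440.0] * len(notes), [0.5] * len(notes)),
    acoustic_model=lambda lyrics, f0, dur: {"f0": f0, "dur": dur, "text": lyrics},
    song_synthesis=lambda p: f"voice({p['text']}, {len(p['f0'])} notes)",
)
print(synth.synthesize("la la", [69, 71]))  # → voice(la la, 2 notes)
```

Keeping the modules as swappable callables mirrors the statement that the device may be realized in software, hardware, or a combination of both.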
The processes described above can be completed by hardware related to program instructions; the program can be stored in a readable storage medium and, when executed, performs the corresponding steps of the above method.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and it cannot be concluded that the specific implementation of the present invention is limited to these descriptions. For those of ordinary skill in the technical field of the present invention, several simple deductions or substitutions may be made without departing from the concept of the present invention, all of which shall be regarded as falling within the protection scope of the present invention.
Claims (13)
1. An HMM-based song synthesis method, characterized by comprising the following steps:
A. analyzing the differences between speech and song in acoustic features, and establishing the melody control model of the song;
B. establishing the speaker-dependent acoustic model for song synthesis based on HMM;
C. synthesizing the song using the HMM-based speech synthesis system.
2. The HMM-based song synthesis method according to claim 1, characterized in that analyzing the differences between speech and song in acoustic features described in step A comprises the following steps:
a. performing spectrum analysis on the speech signal with time-domain and frequency-domain analysis methods, and performing a comparative fundamental-frequency analysis of the speech signal and the singing signal;
b. extracting the required music-score information from the MIDI system using MIDI technology;
c. reading the melody information of the score extracted from the MIDI file, analyzing the structural features of the score file, and thereby obtaining the music parameter information, the music parameter information including the channel number, note pitch, key velocity, note onset time, and note duration.
3. The HMM-based song synthesis method according to claim 2, characterized in that the melody control model of the song described in step A includes a fundamental-frequency control model and a duration control model; the fundamental-frequency control model converts the discrete pitches of the score into a continuous fundamental-frequency curve, and the duration control model yields the pronunciation duration of each sung note.
4. The HMM-based song synthesis method according to claim 1, characterized in that establishing the speaker-dependent acoustic model for song synthesis based on HMM described in step B comprises the following steps:
a. using the speech corpus of the speakers, analyzing the speech data to obtain the acoustic parameters, including the fundamental frequency F0, duration, spectrum SP, and aperiodic index AP, and training the average voice model of the mixed speech using the speaker-adaptive training technique based on HMM;
b. using a small amount of speech data from the target speaker to be synthesized, obtaining the adaptive acoustic model of the target speaker through the speaker-adaptive transformation technique, and correcting and updating the adaptive model.
5. The HMM-based song synthesis method according to claim 4, characterized in that training the average voice model of the mixed speech by the HMM-based speaker-adaptive training comprises the following steps:
a. performing speech analysis on the corpus of the speakers and the corpus data of the target speaker, and extracting the acoustic parameters: the mel-cepstral coefficients, together with their first-order and second-order differences;
b. combined with the context attribute set, performing HMM training to train the multi-space distribution hidden semi-Markov models (MSD-HSMM) of the spectrum and fundamental-frequency parameters and of the state duration parameters;
c. using a small speech corpus of the target speaker, performing speaker-adaptive training to obtain the average voice model of the mixed speech, and thereby the context-dependent MSD-HSMM models.
6. The HMM-based song synthesis method according to claim 4, characterized in that obtaining the adaptive acoustic model of the target speaker from a small amount of speech data of the target speaker to be synthesized through the speaker-adaptive transformation technique, and correcting and updating the adaptive model, comprises the following steps:
a. after the speaker-adaptive training, using the CMLLR adaptive algorithm based on HSMM to compute the mean vectors and covariance matrices of the state output probability distributions and the state duration probability distributions for the voice conversion, the transformation equations of the feature vector o and the state duration d under state i being:
b_i(o) = N(o; Aμ_i − b, AΣ_i A^T) = |A^-1| N(Wξ; μ_i, Σ_i)
p_i(d) = N(d; αm_i − β, ασ_i^2 α) = |α^-1| N(αψ; m_i, σ_i^2)
where ξ = [o^T, 1]^T, ψ = [d, 1]^T, μ_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Σ_i is the diagonal covariance matrix, σ_i^2 is the variance, W = [A^-1, b^-1] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α^-1, β^-1] is the transformation matrix of the state duration probability density distribution;
b. through the adaptive transformation algorithm based on HSMM, the spectrum, fundamental-frequency, and duration parameters of the speech data can be normalized and transformed; for adaptation data O of length T, a maximum-likelihood estimate of the transform Λ = (W, X) can be carried out;
c. the adaptive model of the voice is corrected and updated using the maximum a posteriori (MAP) algorithm: for a given HSMM parameter set λ, let its forward and backward probabilities be α_t(i) and β_t(i), respectively; then the generation probability κ_t^d(i) of its continuous observation sequence o_{t−d+1}…o_t under state i is:
The MAP estimation is described as follows:
where ω and τ are the MAP estimation parameters of the state output and duration distributions, respectively; the remaining terms are the mean vectors after the linear-regression transformation and the weighted-average MAP estimates of the adaptive mean vectors.
7. The HMM-based song synthesis method according to claim 1, characterized in that the speech analysis and synthesis method used for song synthesis with the HMM-based speech synthesis system described in step C is based on the STRAIGHT algorithm.
8. The HMM-based song synthesis method according to claim 7, characterized in that synthesizing the song using the HMM-based speech synthesis system described in step C comprises the following steps:
a. analyzing the input lyrics text with a text-analysis tool: the text-analysis program converts the given lyrics text into an acoustic label sequence containing context description information, the context-dependent HMMs for each pronunciation unit and its context are predicted with the decision trees obtained by clustering during training, and the models are then concatenated into a sentence HMM;
b. according to the MIDI file, obtaining the pitch and duration of each note in the lyrics, obtaining the corresponding fundamental frequency and duration through the melody control model, and modifying the spectrum SP, aperiodic index AP, and fundamental frequency F0 durations of the syllables using the note durations;
c. generating the parameter sequences of spectrum SP, aperiodic index AP, duration, and fundamental frequency F0 in the sentence HMM using the speaker-dependent acoustic model and the STRAIGHT algorithm, synthesizing the voice, and adding the musical accompaniment to realize the synthesis of the song.
9. The HMM-based song synthesis method according to claim 7 or 8, characterized in that the speech analysis and synthesis method used for song synthesis with the HMM-based speech synthesis system described in step C, based on the STRAIGHT algorithm, comprises the following steps:
first, the speech signal of the speaker is input, and the fundamental frequency F0 and the spectral envelope of the speech are extracted with the STRAIGHT algorithm; the acoustic parameters are then modified to generate a new excitation source and time-varying filter, and the speech is synthesized with the original filter model according to the following formula:
where Q denotes the set of pulse positions in the synthesis excitation and G(·) denotes the pitch modulation, which can arbitrarily match the F0 of the original speech to the modulated F0; an all-pass filter controls the fine pitch structure and the time structure of the original signal, e.g. a linear phase shift proportional to frequency controls the fine structure of F0; from the modulated amplitude spectrum A(S(u(w), r(t)), u(w), r(t)), given by the following formula, the Fourier transform V(w, t_i) of the corresponding minimum-phase pulse can be calculated, where A(·), u(·), and r(·) denote the modulations in the amplitude, frequency, and time dimensions, respectively;
where q denotes frequency.
10. An HMM-based song synthesis device, characterized by comprising:
a melody control module, for establishing the melody control model of the song;
an HMM-based speaker-dependent acoustic module, for establishing the speaker-dependent acoustic model for song synthesis;
an HMM-based song synthesis module, for synthesizing the singing voice to be synthesized.
11. The HMM-based song synthesis device according to claim 10, characterized in that the melody control module includes:
a MIDI analysis unit, for analyzing the score information extracted from the MIDI file and obtaining the corresponding music parameter information;
a prosody control unit, for establishing the melody control model of the song according to the differences between speech and song in acoustic features.
12. The HMM-based song synthesis device according to claim 10, characterized in that the HMM-based speaker-dependent acoustic module includes:
an acoustic model unit, for obtaining the acoustic model of the target speaker;
an acoustic parameter subunit, for HMM-based parametric speech synthesis.
13. The HMM-based song synthesis device according to claim 10, characterized in that the HMM-based song synthesis module includes:
a text analysis unit, which performs text analysis on the input lyrics text to obtain context-dependent labels;
an HMM training subunit, for establishing the HMM model library of the speech data;
a speaker adaptation subunit, for normalizing and transforming the characteristic parameters of the speakers in training to obtain the adaptive model;
a speech synthesis unit, for synthesizing the singing voice to be synthesized;
a song synthesis unit, for adding the musical accompaniment to the synthesized singing voice to complete the synthesis of the song.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710160104.2A CN106971703A (en) | 2017-03-17 | 2017-03-17 | A kind of song synthetic method and device based on HMM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106971703A true CN106971703A (en) | 2017-07-21 |
Family
ID=59329007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710160104.2A Pending CN106971703A (en) | 2017-03-17 | 2017-03-17 | A kind of song synthetic method and device based on HMM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106971703A (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108831437A (en) * | 2018-06-15 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | A kind of song generation method, device, terminal and storage medium |
CN108831435A (en) * | 2018-06-06 | 2018-11-16 | 安徽继远软件有限公司 | A kind of emotional speech synthesizing method based on susceptible sense speaker adaptation |
CN109036370A (en) * | 2018-06-06 | 2018-12-18 | 安徽继远软件有限公司 | A kind of speaker's voice adaptive training method |
CN109068439A (en) * | 2018-07-30 | 2018-12-21 | 上海应用技术大学 | A kind of light coloring control method and its control device based on MIDI theme |
CN109147757A (en) * | 2018-09-11 | 2019-01-04 | 广州酷狗计算机科技有限公司 | Song synthetic method and device |
CN109147809A (en) * | 2018-09-20 | 2019-01-04 | 广州酷狗计算机科技有限公司 | Acoustic signal processing method, device, terminal and storage medium |
CN109192218A (en) * | 2018-09-13 | 2019-01-11 | 广州酷狗计算机科技有限公司 | The method and apparatus of audio processing |
CN109326280A (en) * | 2017-07-31 | 2019-02-12 | 科大讯飞股份有限公司 | Singing synthesis method and device and electronic equipment |
CN109801608A (en) * | 2018-12-18 | 2019-05-24 | 武汉西山艺创文化有限公司 | A kind of song generation method neural network based and system |
CN110164412A (en) * | 2019-04-26 | 2019-08-23 | 吉林大学珠海学院 | A kind of music automatic synthesis method and system based on LSTM |
CN110189741A (en) * | 2018-07-05 | 2019-08-30 | 腾讯数码(天津)有限公司 | Audio synthetic method, device, storage medium and computer equipment |
CN110264984A (en) * | 2019-05-13 | 2019-09-20 | 北京奇艺世纪科技有限公司 | Model training method, music generating method, device and electronic equipment |
CN110364140A (en) * | 2019-06-11 | 2019-10-22 | 平安科技(深圳)有限公司 | Training method, device, computer equipment and the storage medium of song synthetic model |
CN110634460A (en) * | 2018-06-21 | 2019-12-31 | 卡西欧计算机株式会社 | Electronic musical instrument, control method for electronic musical instrument, and storage medium |
CN110634461A (en) * | 2018-06-21 | 2019-12-31 | 卡西欧计算机株式会社 | Electronic musical instrument, control method for electronic musical instrument, and storage medium |
CN110838286A (en) * | 2019-11-19 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Model training method, language identification method, device and equipment |
WO2020140390A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Vibrato modeling method, device, computer apparatus and storage medium |
CN111402843A (en) * | 2020-03-23 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Rap music generation method and device, readable medium and electronic equipment |
CN111445892A (en) * | 2020-03-23 | 2020-07-24 | 北京字节跳动网络技术有限公司 | Song generation method and device, readable medium and electronic equipment |
CN112037757A (en) * | 2020-09-04 | 2020-12-04 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice synthesis method and device and computer readable storage medium |
CN112309410A (en) * | 2020-10-30 | 2021-02-02 | 北京有竹居网络技术有限公司 | Song sound repairing method and device, electronic equipment and storage medium |
CN112420004A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Method and device for generating songs, electronic equipment and computer readable storage medium |
CN113035163A (en) * | 2021-05-11 | 2021-06-25 | 杭州网易云音乐科技有限公司 | Automatic generation method and device of musical composition, storage medium and electronic equipment |
CN113506554A (en) * | 2020-03-23 | 2021-10-15 | 卡西欧计算机株式会社 | Electronic musical instrument and control method for electronic musical instrument |
CN116001664A (en) * | 2022-12-12 | 2023-04-25 | 瑞声声学科技(深圳)有限公司 | Somatosensory type in-vehicle reminding method, system and related equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101030369A (en) * | 2007-03-30 | 2007-09-05 | 清华大学 | Built-in speech discriminating method based on sub-word hidden Markov model |
CN101246685A (en) * | 2008-03-17 | 2008-08-20 | 清华大学 | Pronunciation quality evaluation method of computer auxiliary language learning system |
CN101436403A (en) * | 2007-11-16 | 2009-05-20 | 创新未来科技有限公司 | Method and system for recognizing tone |
CN101516005A (en) * | 2008-02-23 | 2009-08-26 | 华为技术有限公司 | Speech recognition channel selecting system, method and channel switching device |
CN102982803A (en) * | 2012-12-11 | 2013-03-20 | 华南师范大学 | Isolated word speech recognition method based on HRSF and improved DTW algorithm |
CN102982799A (en) * | 2012-12-20 | 2013-03-20 | 中国科学院自动化研究所 | Speech recognition optimization decoding method integrating guide probability |
CN104217713A (en) * | 2014-07-15 | 2014-12-17 | 西北师范大学 | Tibetan-Chinese speech synthesis method and device |
CN105390133A (en) * | 2015-10-09 | 2016-03-09 | 西北师范大学 | Tibetan TTVS system realization method |
CN106128450A (en) * | 2016-08-31 | 2016-11-16 | 西北师范大学 | The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese |
Non-Patent Citations (3)
Title |
---|
FENG Huan: "Research on HMM-Based Lyrics-to-Singing Conversion", CNKI China Master's Theses Full-text Database *
WU Yijian et al.: "HMM-Based Trainable Chinese Speech Synthesis", Journal of Chinese Information Processing *
ZHANG Youwei et al.: "Human-Machine Natural Interaction", 30 September 2004 *
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326280A (en) * | 2017-07-31 | 2019-02-12 | 科大讯飞股份有限公司 | Singing synthesis method and device and electronic equipment |
CN109326280B (en) * | 2017-07-31 | 2022-10-04 | 科大讯飞股份有限公司 | Singing synthesis method and device and electronic equipment |
CN108831435A (en) * | 2018-06-06 | 2018-11-16 | 安徽继远软件有限公司 | A kind of emotional speech synthesizing method based on susceptible sense speaker adaptation |
CN109036370A (en) * | 2018-06-06 | 2018-12-18 | 安徽继远软件有限公司 | A kind of speaker's voice adaptive training method |
CN108831437A (en) * | 2018-06-15 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | A kind of song generation method, device, terminal and storage medium |
CN110634461A (en) * | 2018-06-21 | 2019-12-31 | 卡西欧计算机株式会社 | Electronic musical instrument, control method for electronic musical instrument, and storage medium |
CN110634460A (en) * | 2018-06-21 | 2019-12-31 | 卡西欧计算机株式会社 | Electronic musical instrument, control method for electronic musical instrument, and storage medium |
WO2020007148A1 (en) * | 2018-07-05 | 2020-01-09 | 腾讯科技(深圳)有限公司 | Audio synthesizing method, storage medium and computer equipment |
CN110189741A (en) * | 2018-07-05 | 2019-08-30 | 腾讯数码(天津)有限公司 | Audio synthetic method, device, storage medium and computer equipment |
CN109068439A (en) * | 2018-07-30 | 2018-12-21 | 上海应用技术大学 | A kind of light coloring control method and its control device based on MIDI theme |
CN109147757A (en) * | 2018-09-11 | 2019-01-04 | 广州酷狗计算机科技有限公司 | Song synthetic method and device |
CN109192218B (en) * | 2018-09-13 | 2021-05-07 | 广州酷狗计算机科技有限公司 | Method and apparatus for audio processing |
CN109192218A (en) * | 2018-09-13 | 2019-01-11 | 广州酷狗计算机科技有限公司 | The method and apparatus of audio processing |
CN109147809A (en) * | 2018-09-20 | 2019-01-04 | 广州酷狗计算机科技有限公司 | Acoustic signal processing method, device, terminal and storage medium |
CN109801608A (en) * | 2018-12-18 | 2019-05-24 | 武汉西山艺创文化有限公司 | A kind of song generation method neural network based and system |
WO2020140390A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Vibrato modeling method, device, computer apparatus and storage medium |
CN110164412A (en) * | 2019-04-26 | 2019-08-23 | 吉林大学珠海学院 | A kind of music automatic synthesis method and system based on LSTM |
CN110264984A (en) * | 2019-05-13 | 2019-09-20 | 北京奇艺世纪科技有限公司 | Model training method, music generating method, device and electronic equipment |
CN110264984B (en) * | 2019-05-13 | 2021-07-06 | 北京奇艺世纪科技有限公司 | Model training method, music generation method and device and electronic equipment |
CN110364140A (en) * | 2019-06-11 | 2019-10-22 | 平安科技(深圳)有限公司 | Training method and device for a song synthesis model, computer equipment and storage medium |
CN110364140B (en) * | 2019-06-11 | 2024-02-06 | 平安科技(深圳)有限公司 | Singing voice synthesis model training method, singing voice synthesis model training device, computer equipment and storage medium |
CN112420004A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Method and device for generating songs, electronic equipment and computer readable storage medium |
CN110838286A (en) * | 2019-11-19 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Model training method, language identification method, device and equipment |
CN110838286B (en) * | 2019-11-19 | 2024-05-03 | 腾讯科技(深圳)有限公司 | Model training method, language identification method, device and equipment |
CN111402843B (en) * | 2020-03-23 | 2021-06-11 | 北京字节跳动网络技术有限公司 | Rap music generation method and device, readable medium and electronic equipment |
CN113506554A (en) * | 2020-03-23 | 2021-10-15 | 卡西欧计算机株式会社 | Electronic musical instrument and control method for electronic musical instrument |
CN111445892A (en) * | 2020-03-23 | 2020-07-24 | 北京字节跳动网络技术有限公司 | Song generation method and device, readable medium and electronic equipment |
CN111402843A (en) * | 2020-03-23 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Rap music generation method and device, readable medium and electronic equipment |
CN112037757A (en) * | 2020-09-04 | 2020-12-04 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice synthesis method and device and computer readable storage medium |
CN112037757B (en) * | 2020-09-04 | 2024-03-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice synthesizing method, singing voice synthesizing equipment and computer readable storage medium |
CN112309410A (en) * | 2020-10-30 | 2021-02-02 | 北京有竹居网络技术有限公司 | Song pitch correction method and device, electronic equipment and storage medium |
CN113035163A (en) * | 2021-05-11 | 2021-06-25 | 杭州网易云音乐科技有限公司 | Automatic generation method and device of musical composition, storage medium and electronic equipment |
CN113035163B (en) * | 2021-05-11 | 2021-08-10 | 杭州网易云音乐科技有限公司 | Automatic generation method and device of musical composition, storage medium and electronic equipment |
CN116001664A (en) * | 2022-12-12 | 2023-04-25 | 瑞声声学科技(深圳)有限公司 | Somatosensory type in-vehicle reminding method, system and related equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106971703A (en) | Song synthesis method and device based on HMM | |
CN101308652B (en) | Synthesizing method of personalized singing voice | |
CN101399036B (en) | Device and method for converting voice into rap music | |
US6804649B2 (en) | Expressivity of voice synthesis by emphasizing source signal features | |
CN101178896B (en) | Unit selection voice synthetic method based on acoustics statistical model | |
Kim et al. | Korean singing voice synthesis system based on an LSTM recurrent neural network | |
Umbert et al. | Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges | |
Hono et al. | Recent development of the DNN-based singing voice synthesis system—sinsy | |
Hono et al. | Sinsy: A deep neural network-based singing voice synthesis system | |
CN103915093A (en) | Method and device for realizing voice singing | |
CN102201234A (en) | Speech synthesizing method based on tone automatic tagging and prediction | |
CN105654942A (en) | Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter | |
Cho et al. | A survey on recent deep learning-driven singing voice synthesis systems | |
Gupta et al. | Deep learning approaches in topics of singing information processing | |
Liu et al. | Vibrato learning in multi-singer singing voice synthesis | |
Bonada et al. | Singing voice synthesis combining excitation plus resonance and sinusoidal plus residual models | |
Wada et al. | Sequential generation of singing f0 contours from musical note sequences based on wavenet | |
Chu et al. | MPop600: A Mandarin popular song database with aligned audio, lyrics, and musical scores for singing voice synthesis | |
Bonada et al. | Spectral approach to the modeling of the singing voice | |
Li et al. | A lyrics to singing voice synthesis system with variable timbre | |
Pitrelli et al. | Expressive speech synthesis using American English ToBI: questions and contrastive emphasis | |
Gu et al. | Singing-voice synthesis using demi-syllable unit selection | |
Nose et al. | A style control technique for singing voice synthesis based on multiple-regression HSMM. | |
Khan et al. | Singing Voice Synthesis Using HMM Based TTS and MusicXML | |
Lee et al. | A study of F0 modelling and generation with lyrics and shape characterization for singing voice synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170721 |