TW201001396A - Method for synthesizing speech - Google Patents

Method for synthesizing speech

Info

Publication number
TW201001396A
Authority
TW
Taiwan
Prior art keywords
syllable
syllables
signal
parameters
sound
Prior art date
Application number
TW97123982A
Other languages
Chinese (zh)
Other versions
TWI360108B (en)
Inventor
Hung-Yan Gu
Original Assignee
Univ Nat Taiwan Science Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Taiwan Science Tech filed Critical Univ Nat Taiwan Science Tech
Priority to TW97123982A priority Critical patent/TWI360108B/en
Publication of TW201001396A publication Critical patent/TW201001396A/en
Application granted granted Critical
Publication of TWI360108B publication Critical patent/TWI360108B/en

Landscapes

  • Machine Translation (AREA)

Abstract

A method for synthesizing speech is provided. First, a spectrum progression model and prosody models are constructed using artificial neural networks (ANN). For the training of the spectrum progression model, a dynamic time warping (DTW) based technique is developed to compute the spectrum progression parameters of the training speech. Afterwards, the spectrum progression model and the prosody models are used to predict the spectrum progression parameters and prosody parameters for each of the syllables in an input sentence. Finally, synthetic speech signals are generated by using the spectrum progression parameters and prosody parameters of each syllable to control an improved harmonic plus noise model (HNM).

Description

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates to a method of speech signal processing, and more particularly to a method of speech synthesis that combines a spectrum progression model and prosody models to control an improved harmonic-plus-noise model (HNM).

[Prior Art]

In recent years, owing to the rapid advance of technology, the interaction between people and computers can no longer be satisfied by typing commands into the computer and having it respond with text. How to develop a more direct and more natural way for humans and machines to communicate by voice has therefore become an important topic.

For a computer to use speech as its medium of communication with humans, the two required technologies are speech recognition and speech synthesis. Among them, a text-to-speech (TTS) system is an extension of a speech synthesis system: a technology for converting input text into human speech output.

Traditional speech signal synthesis techniques include formant synthesis, linear predictive coding (LPC), pitch synchronous overlap and add (PSOLA), and corpus-based re-sequencing of synthesis units. However, these techniques still exhibit one or more of the following drawbacks:

1. The synthesized speech signal is not clear enough; that is, the signal contains echo or noise.

2. Changes to the speaking rate or the pitch frequency of the synthesized speech are not supported.

3. The fluency of the synthesized speech is poor, with an obvious gap from human pronunciation.

Regarding the first drawback above, Y. Stylianou proposed the harmonic-plus-noise model (HNM) [1] to improve the clarity of synthesized speech signals, but some shortcomings remain in the original HNM. Regarding the third drawback, T. Yoshimura et al. proposed using a hidden Markov model (HMM) to capture the process by which the speech spectrum gradually changes over time [2], referred to herein as spectrum progression. However, an HMM uses only a small number of states and does not consider the correlation between the feature vectors of adjacent frames, so it is not fine-grained enough for speech synthesis.

[1] Stylianou, Yannis, Harmonic plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification, Ph.D. thesis, École Nationale Supérieure des Télécommunications, Paris, France, 1996.

[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi and T. Kitamura, "Duration Modeling in HMM-based Speech Synthesis System," Proc. of ICSLP, Vol. 2, pp. 29-32, 1998.

[Summary of the Invention]

The present invention provides a method of speech synthesis in which a spectrum progression model and prosody models, each constructed with artificial neural networks (ANN), predict the spectrum progression parameters (hereinafter, progression parameters) and the prosodic parameters of each syllable, which are then used to control an improved harmonic-plus-noise model (HNM) to carry out speech signal synthesis.

The present invention also provides a method of training the spectrum progression model: on the basis of dynamic time warping (DTW), a technique is developed for computing the progression (spectrum progression) parameters of the training syllables, and the computed progression parameters are then used to train the spectrum progression model constructed with an ANN.

The proposed speech synthesis method proceeds as follows. First, the spectrum progression model and the prosody models are constructed with ANNs and trained; for the training of the spectrum progression model, a DTW-based technique is developed to compute the progression parameters of the training syllables. After the parameter models have been trained, in the synthesis stage an input sentence is first analyzed (text analysis) to obtain the contextual data of each syllable in the sentence. The contextual data are then input to the prosody models and to the spectrum progression model to obtain each syllable's prosodic parameters and progression parameters, respectively. Finally, the progression parameters and prosodic parameters of each syllable are used to control the improved harmonic-plus-noise model to generate the synthetic speech signal of the sentence.
In an embodiment of the present invention, the step of training the spectrum progression model first performs spectral matching with a DTW-based method so as to extract the progression parameters of the training syllables; afterwards, the contextual data are taken as the input of the ANN and the extracted progression parameters as the output of the ANN, and the spectrum progression model is trained. In the step of training a prosody model, the contextual data are likewise taken as the input of the ANN, and the prosodic parameters analyzed from the training syllables as the output of the ANN.

In an embodiment of the present invention, the contextual data include the tone class, the syllable-initial class, and the syllable-final class of the current syllable, the tone class and the final class of the preceding syllable, the tone class and the initial class of the following syllable, and a value for the position of the current syllable within the sentence.

In an embodiment of the present invention, one step of the synthesis process inputs the contextual data to the spectrum progression model to obtain the progression parameters of a syllable; according to the progression parameters, how the spectrum of the synthetic syllable signal evolves along the time axis can be decided, as well as the duration ratio of the initial and the final. In addition, inputting the contextual data to the prosody models yields the prosodic parameters, according to which the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of the synthetic syllable signal can be decided; that is, the prosodic parameters include the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio.
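The contextual data enumerated above are what all of these models consume as input. As a rough illustration only, the sketch below assembles one syllable's context vector, assuming binary-coded class indices and a position ratio in [0, 1]; the bit widths and helper names are assumptions for illustration, not values fixed by the patent (the embodiment later states only that, for example, a tone class is fed with 3 input units).

```python
import numpy as np

def binary_code(index, n_bits):
    """Encode a class index as n_bits binary input units."""
    return [(index >> b) & 1 for b in range(n_bits)]

def context_vector(cur_tone, cur_initial, cur_final,
                   prev_tone, prev_final_coarse,
                   next_tone, next_initial_coarse,
                   position_ratio):
    """Assemble one syllable's ANN input from its contextual data.

    Bit widths are illustrative: 3 bits cover the 5 tone classes, 5 bits
    the 22 initial classes, 4 bits the 9 coarse final classes, 3 bits the
    6 coarse initial classes; 6 bits for the final class is a guess.
    """
    vec = []
    vec += binary_code(cur_tone, 3)
    vec += binary_code(cur_initial, 5)
    vec += binary_code(cur_final, 6)
    vec += binary_code(prev_tone, 3)
    vec += binary_code(prev_final_coarse, 4)
    vec += binary_code(next_tone, 3)
    vec += binary_code(next_initial_coarse, 3)
    vec.append(position_ratio)  # position of the syllable in the sentence, 0..1
    return np.array(vec, dtype=np.float32)
```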

In an embodiment of the present invention, the step of synthesizing a syllable signal with the harmonic-plus-noise model is as follows. In the training stage, the original signal recorded for each Mandarin syllable is first cut into a sequence of frames, and the HNM parameters (harmonic and noise parameters) of each frame are analyzed. Then, in the synthesis stage, the syllable to be synthesized is first divided into an unvoiced part and a voiced part, and a number of control points are placed on their respective time axes. The HNM parameters of the original frames corresponding to the unvoiced part are copied one by one onto the corresponding control points. For the voiced part, the progression parameters are used to establish a nonlinear correspondence of the time axes; for each control point, the corresponding time position in the original syllable determines two adjacent original frames, whose HNM parameters are interpolated (linear interpolation is used here, whereas the original HNM synthesis method does not state which interpolation to use), and the interpolated HNM parameters are copied onto that control point. Moreover, the HNM parameters on the control points of the voiced part must further comply with the prescribed pitch-contour values; on the premise that the timbre remains consistent, another adjustment by third-order Lagrange interpolation is made (again left unspecified by the original HNM synthesis method). Once the HNM parameters on all control points have been decided, the speech signal waveform can be generated accordingly.

In an embodiment of the present invention, in the step of synthesizing a syllable signal with the harmonic-plus-noise model, when the unvoiced part is a short one, the signal of the unvoiced part of the original syllable is copied directly onto the synthetic syllable; when the unvoiced part is a long one, it is synthesized from the noise parameters of the unvoiced part. For the voiced part, a harmonic signal is synthesized from the harmonic parameters and a noise signal from the noise parameters, and the two are added to give the signal of the voiced part.
In addition, the method of training the spectrum progression model is detailed as follows. First, a syllable pronunciation corpus and a sentence pronunciation corpus are provided; the syllable corpus comprises recordings of the isolated pronunciation of each Mandarin syllable, while the sentence corpus comprises recordings of the continuous pronunciation of a number of sentences. Next, the sentence recordings are cut to obtain the signals of the individual syllables. Afterwards, each cut-out syllable is taken in turn as a target syllable, and dynamic time warping is used to compare the spectra of the target syllable and the isolated pronunciation of its corresponding (same-pinyin) reference syllable, so as to obtain a spectrum matching path. According to this path, the time axes of the two syllables are both normalized to the range between 0 and 1; sampling points are placed uniformly on the normalized time axis of the target syllable, and for each sampling point the matching path yields the corresponding normalized time value on the time axis of the reference syllable. The values so obtained are the progression parameters. Finally, the contextual data of each target syllable are taken as the input of the ANN model and the progression parameters as its output, and the spectrum progression model is trained.

In an embodiment, the step of comparing the spectra of a target syllable and its corresponding reference syllable may first cut the target syllable and the reference syllable into frames and compute a feature vector for each frame; the distance between a target frame and a reference frame is then computed as the distance between their feature vectors.

In an embodiment, the comparison step may further segment the target syllable and the reference syllable each into a non-periodic part and a periodic part, perform the spectral comparison separately for the pair of non-periodic parts and the pair of periodic parts, and then join the two resulting paths into a single spectrum matching path.
By means of the models above, the improved HNM model can be used to synthesize clear speech without echo or noise, and large changes of the speaking rate and the pitch frequency are supported. Applying the spectrum progression model to the improved HNM makes the fluency of the synthesized speech signal improve markedly, and after the prosody models have been constructed, predicting the prosodic parameter values with them gives the synthesized speech considerable naturalness. Therefore, with the method of the present invention, which combines the spectrum progression model and the prosody models to control the improved HNM model for speech signal synthesis, the synthesized speech not only has a clear signal but is also, to the ear, very close to the way ordinary people speak.

To make the above features and advantages of the present invention more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.

[Embodiments]

To make the content of the present invention clearer, the following embodiments are given as examples by which the present invention can indeed be carried out.

FIG. 1 is a flowchart of the method of training the spectrum progression model according to an embodiment of the present invention. Referring to FIG. 1, first, in step S105, a syllable pronunciation corpus and a sentence pronunciation corpus are provided. The syllable corpus contains recordings of the isolated pronunciation of each Mandarin syllable; in this embodiment, the isolated pronunciation of each syllable is recorded in the first tone. The sentence corpus contains recordings of the continuous pronunciation of a number of sentences, and the syllables contained in these sentences serve as the target syllables.

Next, in step S110, each sentence in the sentence corpus is cut so as to obtain the target syllables. For example, the syllable boundary points of each sentence in the corpus are marked, and each syllable is labeled with its pinyin symbols and its tone; the signal of each syllable is then extracted according to these recorded label files.

Afterwards, in step S115, the spectra of each target syllable and its corresponding reference syllable are compared one by one with dynamic time warping (DTW) to obtain the spectrum matching path. In practice, DTW first cuts the signals of the two syllables into sequences of frames and computes a feature vector for each frame; the target syllable and the reference syllable are then viewed as two sequences of vectors, and, subject to (a) the constraint of endpoint correspondence and (b) the constraint on local paths, the path with the shortest distance between the two vector sequences is found and taken as the spectrum matching path. The local-path constraint used in this embodiment is of the form shown in FIG. 2C.
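A minimal sketch of this DTW step, assuming the frame feature sequences have already been computed. Since the exact local-path constraint of FIG. 2C is not reproduced in the text, the common horizontal, vertical, and diagonal steps are used here as an assumption:

```python
import numpy as np

def dtw_path(target_feats, ref_feats):
    """DTW between two frame-feature arrays of shape (T, D) and (R, D).
    Returns the minimum-distance matching path as (target, reference) pairs."""
    T, R = len(target_feats), len(ref_feats)
    # frame-to-frame distance: Euclidean distance between feature vectors
    dist = np.linalg.norm(target_feats[:, None, :] - ref_feats[None, :, :], axis=2)

    acc = np.full((T, R), np.inf)
    acc[0, 0] = dist[0, 0]                  # endpoint constraint: start at (0, 0)
    for i in range(T):
        for j in range(R):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,        # vertical step
                       acc[i, j - 1] if j > 0 else np.inf,        # horizontal step
                       acc[i - 1, j - 1] if i and j else np.inf)  # diagonal step
            acc[i, j] = dist[i, j] + prev

    i, j = T - 1, R - 1                     # endpoint constraint: end at (T-1, R-1)
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: acc[s])
        path.append((i, j))
    return path[::-1]
```

Because of the endpoint constraint, the path always connects the first frame pair to the last frame pair, which is what allows both time axes to be normalized consistently in step S120 below.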
For the frame features, the Mel-frequency cepstral coefficients (MFCC) are adopted. In this embodiment, each frame of the speech signal is converted into MFCCs together with the differences of these coefficients between adjacent frames, so that there are 26 coefficients in total in the feature vector.

In addition, when DTW is used for spectral matching between two syllables, the matching can be performed separately for the periodic and the non-periodic portions. In FIG. 2A, for example, the vertical line segment in region A marks, on the waveform of the target syllable, the boundary point between the periodic portion and the non-periodic portion, while the horizontal line segment B' in region B marks the corresponding boundary point on the waveform of the reference syllable. Region C shows the matching between the target syllable and the reference syllable, where the grid lines on the horizontal axis denote the frames of the target syllable and the grid lines on the vertical axis denote the frames of the reference syllable. The curve in region D is the spectrum matching path of the non-periodic portion, and the curve in region E is the spectrum matching path of the periodic portion.
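As one way to realize the 26-coefficient frame features described above, the sketch below computes 13 MFCCs per frame plus their adjacent-frame differences; the 13/13 split and the frame and hop sizes are assumptions (the text states only that the MFCCs and their adjacent-frame differences total 26 coefficients), and librosa is used merely as a convenient stand-in for the analysis front end:

```python
import numpy as np
import librosa

def frame_features(signal, sr):
    """26-coefficient frame features: 13 MFCCs plus the differences of
    those coefficients between adjacent frames (13 + 13 = 26)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=160)
    # coefficient difference between each frame and the preceding frame
    delta = np.diff(mfcc, axis=1, prepend=mfcc[:, :1])
    return np.vstack([mfcc, delta]).T  # shape: (num_frames, 26)
```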

Next, returning to step S120 of FIG. 1, the matching path is normalized into a progression parameter vector of fixed dimension; that is, time-normalized progression parameters are obtained from the matching path. FIG. 2B is a schematic diagram of obtaining the progression parameters from the spectrum matching path according to an embodiment of the present invention. Referring to FIG. 2B, the time lengths of the target syllable and of the reference syllable are both normalized to 1. The target syllable is then cut uniformly into 32 normalized time points, and, according to the spectrum matching path, each of the 32 normalized time points is mapped onto the reference syllable so as to find the corresponding normalized time value on the reference syllable; 32 values between 0 and 1 are thus obtained as the progression parameters.

Finally, after the progression parameters have been obtained, the contextual data of each target syllable are taken as the input of the ANN model and the corresponding progression parameters as the output of the ANN model, so as to train the spectrum progression model. Training the so-called ANN model means presenting it with pairs of example inputs and outputs, so that the ANN learns the correlation between input data and output data and can later be used to predict the output data that should correspond to a given input.
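Putting steps S120 and S125 together, the sketch below turns a DTW matching path into the 32-dimensional progression parameter vector that serves as the ANN's training target; the piecewise-linear reading of the path is an implementation assumption:

```python
import numpy as np

def progression_parameters(path, num_points=32):
    """Turn a matching path [(target_frame, ref_frame), ...] into num_points
    progression parameters: for uniformly spaced normalized times on the
    target axis, the corresponding normalized times on the reference axis."""
    ref_for_target = {}
    for t, r in path:  # a vertical DTW step repeats a target frame;
        ref_for_target[t] = max(r, ref_for_target.get(t, r))  # keep one ref frame each
    t_idx = np.array(sorted(ref_for_target), dtype=float)
    r_idx = np.array([ref_for_target[t] for t in sorted(ref_for_target)], dtype=float)

    t_norm = t_idx / t_idx[-1]  # both time axes normalized to [0, 1]
    r_norm = r_idx / r_idx[-1]
    sample_times = np.linspace(0.0, 1.0, num_points)
    return np.interp(sample_times, t_norm, r_norm)  # 32 values, each in [0, 1]
```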

In this embodiment, the contextual data comprise the tone class (5 classes), the initial class (22 classes), and the final class of the current syllable, the tone class and the coarse final class (9 classes) of the preceding syllable, and the tone class and the coarse initial class (6 classes) of the following syllable. This is because speech is a sequential transfer of information, so in addition to the tone, initial, and final classes of the current syllable, the tone class and final class of the preceding syllable, as well as the tone class and initial class of the following syllable, are also of considerable importance. Moreover, the position of a syllable within the sentence (for example, sentence-final position) also influences the prosodic state of the current syllable, so a time-proportion value of the position within the sentence is used in addition.

In this embodiment, the ANN is composed of an input layer, a hidden layer, a recurrent hidden layer, and an output layer. Each item of the contextual data is encoded before being fed to the input layer; for example, a tone class is fed with 3 input units. The numbers of units of the hidden layer and of the recurrent hidden layer can be decided according to experimental results; in this embodiment, the number of units of both layers is set to 16. The number of units of the output layer is decided directly by the dimension of the target parameters.

In addition, since the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of a syllable also have a definite influence on the naturalness of the synthesized speech, the values of these prosodic parameters are also analyzed for the constituent syllables of each sentence in the sentence corpus, and the resulting pitch-contour, duration, volume, and amplitude-ratio values are used to train their respective ANN models. That is to say, after step S110 the prosodic parameters of each target syllable can further be analyzed; then, as in step S125, the contextual data are taken as the input of the ANN and the values of the respective prosodic parameters as the outputs, so as to train a separate model for each of the prosodic parameters, namely the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio.

After the training of the spectrum progression model and the prosody models is completed, these models can be used to carry out the synthesis of speech signals. FIG. 3 is a flowchart of the speech synthesis method according to an embodiment of the present invention. Referring to FIG. 3, in step S305, an input sentence is first analyzed to obtain the pronunciation information of each character; the pronunciation information includes the syllable pinyin, the tone, the initial class, and the final class.

Next, in step S310, the contextual data of each syllable are assembled according to the pronunciation information, to serve as the input of the prosody models and of the spectrum progression model.

Afterwards, in step S315, the contextual data are input to the prosody models and the spectrum progression model, so that the prosodic parameters and the progression parameters are obtained. In detail, the contextual data are taken as the input of the spectrum progression model so that it outputs the progression parameters corresponding to this context, and as the input of the prosody models so that they output the corresponding prosodic parameters.

Finally, in step S320, the prosodic parameters and the progression parameters are used to control the improved harmonic-plus-noise model, so as to synthesize the speech signal of the input sentence. That is, the progression parameters can be used to decide the time correspondence between the spectrum of each synthetic syllable and the spectrum of the original syllable, and the duration ratio of the initial and the final within the synthetic syllable, while the prosodic parameters can be used to decide the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of the synthetic syllable signal.

As to deciding the initial/final duration ratio from the progression parameters: the progression parameters define a piecewise-linear curve that maps the time axis of the syllable to be synthesized to the time axis of the original syllable. From the voiced/unvoiced boundary point B' of the original syllable, the curve leads back to the corresponding boundary point A' on the syllable to be synthesized, and the position of A' decides the duration ratio of the initial and the final. Likewise, for each control point of the voiced part (on the horizontal axis), the piecewise-linear curve, like the one in region E of FIG. 2A, gives the corresponding time position in the original syllable and thereby the two adjacent frames of the original syllable.
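A minimal sketch of this time mapping on the synthesis side, assuming the progression parameters predicted by the ANN are a non-decreasing sequence of normalized reference times (as they are when read off a DTW path); the function and variable names are illustrative:

```python
import numpy as np

def map_to_original_time(u, progression):
    """Normalized synthetic-syllable time u in [0, 1] -> normalized
    original-syllable time, via the piecewise-linear progression curve."""
    grid = np.linspace(0.0, 1.0, len(progression))
    return float(np.interp(u, grid, progression))

def synthetic_boundary(b_orig_norm, progression):
    """Invert the curve: the synthetic-time point A' that maps onto the
    original syllable's voiced/unvoiced boundary B' (normalized)."""
    grid = np.linspace(0.0, 1.0, len(progression))
    return float(np.interp(b_orig_norm, progression, grid))

def adjacent_original_frames(u, progression, num_orig_frames):
    """For a voiced control point at normalized time u, return the indices
    of the two adjacent original frames and the interpolation weight."""
    pos = map_to_original_time(u, progression) * (num_orig_frames - 1)
    k = min(int(pos), num_orig_frames - 2)
    return k, k + 1, pos - k
```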
The following details step S320 with an example. FIG. 4 is a flowchart of syllable signal synthesis with the improved HNM model according to an embodiment of the present invention.

In the analysis stage of the HNM model, the signal of each syllable is first cut into frames so that the HNM parameters of each frame can be analyzed. When a frame carries a voiced signal, the spectrum of the frame signal is divided into a low-frequency harmonic part and a high-frequency noise part, and the frequency value of the dividing point is called the maximum voiced frequency (MVF): the portion of the spectrum below the MVF is regarded as being composed of harmonic signals, and the portion above the MVF as being composed of noise signals. When a frame carries an unvoiced signal, its entire spectrum is regarded as being composed of noise signals.

Referring to FIG. 4, in step S405 the MVF of each frame is detected. Because the syllables are not recorded at a uniform volume, the original HNM parameter analysis, which detects the MVF with a fixed amplitude threshold, would give a quiet syllable a small MVF (that is, a small number of harmonics), which leads to a distorted synthetic sound. In this embodiment, the detection is therefore improved by using a dynamic threshold. In detail, the maximum harmonic amplitude across the frames of a syllable is found first, and 1/512 of this maximum harmonic amplitude is taken as the threshold for MVF detection. Then, the amplitude of each harmonic in each frame is compared against the threshold; a harmonic whose amplitude is smaller than the threshold is regarded as an unvoiced harmonic, and when 5 consecutive harmonics all have amplitudes smaller than the threshold, the frequency of the last harmonic larger than the threshold is taken as the MVF of that frame. Accordingly, after the MVF has been decided, the frequency, amplitude, and phase of each harmonic are recorded as the harmonic parameters, and 30 cepstrum coefficients are recorded to represent the spectral envelope of the noise part.

Next, in step S410, the syllable signal to be synthesized is divided into an unvoiced part and a voiced part, on which control points are placed respectively. For example, the number of control points of the unvoiced part is made the same as the number of frames of the unvoiced part of the original syllable, that is, one control point corresponds to one frame. The control points of the voiced part, on the other hand, are placed at a fixed interval of signal samples; in other words, the number of control points is decided by the duration of the voiced part.
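A minimal sketch of this dynamic-threshold MVF detection, assuming each frame's harmonics are given, from the lowest harmonic upward, as parallel arrays of frequencies and amplitudes:

```python
import numpy as np

def detect_mvf(frames, ratio=1.0 / 512.0, run_len=5):
    """Dynamic-threshold MVF detection for one syllable.

    frames: list of (freqs, amps) array pairs, one pair per voiced frame.
    Returns one MVF value per frame.
    """
    # threshold = 1/512 of the maximum harmonic amplitude across all frames
    peak = max(float(np.max(amps)) for _, amps in frames)
    threshold = ratio * peak

    mvfs = []
    for freqs, amps in frames:
        mvf = freqs[-1]               # default: all harmonics voiced
        below = 0
        for k, a in enumerate(amps):
            below = below + 1 if a < threshold else 0
            if below == run_len:      # 5 consecutive weak harmonics found
                # MVF = frequency of the last harmonic above the threshold
                mvf = freqs[k - run_len] if k >= run_len else 0.0
                break
        mvfs.append(mvf)
    return mvfs
```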

在步驟S415中,將無聲部分所對應的各音框的HNM 參數,逐一複製到對應的控制點上。而在步驟S42〇中, ° 將有聲部分各個控制點所對應的兩相鄰音框的HNM炱數 進行線性内插,不過相位值要先作反包裹(_rapping)的處 理。之後再助插所求得的HNM參數複製到該個控制點 上。 在步驟S425中,對於有聲部分控制點上的HNM參 數,進一步配合韻律模型輸出的基週執跡參數值的規定, 在音色保持一致性的條件下,作另一種三階的Lagrange内 插的調整。也就是頻率值被調到晷的第A個諧波的振幅4, 17 282〇〇twf.d〇c/n 201001396 之物建的頻譜包絡曲 線中去内差出來,作妓,先從舊的猶頻率 們對應的振=始音=觀頻率值和它 值馬上所應對應的g財出頻率 Ο Ο 數,ίί,、在步驟S430中,依據這些控制點上的H_泉 〜成出語音信號。在本實施射 成 ;==r理,三種型態分別為二ΐ 長衿間的無聲信號以及有聲的信號。 母(如1於$%=無聲信號’也就是對於短時間的無聲聲 音的h击’號片段是直接由原始音裡複製到合成级 ;Γ:ΐ" ° ^ t s/;; 鮮二t聲部分的職雜音參數來進行合成。此外, 參數八以11的㈣’我們以控繼上的諧波參數與雜音 力口 ίΓΓΓ波信號和雜音信號,然後將兩者: 據此,便可產生出合成的語音信號。 去例中’我們也依據韻律模型所產生的輸出, 成音節信號的音長、音量、聲韻母振幅比例出鱼 長、ίϊ,:提升合成語音信號的自然度。以下則針對ϊ 說明:置、聲頭母振幅比例、與基週軌跡的調整分別舉例 18 201001396^ 28200twf.doc/n 然後,藉由音長數值來控制HNM合成出正確的音節時間 長度。 而在音量調整方面,其步驟與音長調整相似。同樣地, 先取得欲合成音節的音量平均值,再配合韻律模型所產生 出的音量比例值去計算出音量數值。然後,藉由音量數值 來控制HNM合成出正確的音節音量。 士關於聲韻母振幅比例的調整,在取得韻律模型所輸出 Γ 的聲韻母振幅比例值之後,將其乘上音量數值而得到聲母 部分的振幅值。 另外,更可調整基週軌跡之曲線。在取得韻律模型所 輸出的16點音高數值之後,將這16點音高值均勻佈放於 音節内有聲部分的時間軸上。有聲部分各控制點上的基頻 值的《又疋,則以找出16點音高數值中的四個相鄰的對應點 來作Lagrange内插而求得。 綜上所述,基於DTW頻譜匹配取得的頻演參數,拿 … 去訓練以ANN建構的頻譜演進模型,然後於合成階段再 U 據此,型以預測出頻演參數,接著利用頻演參數來決定合 ^音節和原始音節之間的時間軸對應關係,如此就可以顯 著地提升合成語音信號的流暢性。另外,使用韻律模型來 出韻律參數,再據以控制合成音節的基週轨跡、音長、 音量、與聲韻母振幅比例,可使合成的語音聽起來更為自 然。此外,使用改進後的諧波加雜音模型來進行語音信號 的合成,則可以大幅度提升合成語音的清晰度,並且可^ 援合成語音的說話速度及音高頻率的改變。 19 201001396w 282〇〇twfd〇c/n =然本發明已以較佳實施觸露如上,然其並非用以 =本發明’任何所屬技術領域中具有通常知識者,在不 脫離本發明之精神和範圍内,當可作些許之更動與潤飾, 口此本發明之保護範圍當視後附之申請專利範圍所界定者 為準。 【圖式簡單說明】 圖1疋依照本發明一實施例所鳍示之訓練頻譜演進槿 型的方法流程圖。 ' ,圖2 A是依照本發明一實施例所繪示之以DTW求取頻 譜匹配路徑的示意圖。 、 一圖2B是依照本發明一實施例所繪示之依據頻譜匹配 路徑來求取頻演參數的示意圖。 圖2 C是依照本發明一實施例所繪示之用於〇 T w頻譜 匹配之局部路徑限制的示意圖。 圖3是依照本發明一實施例所繪示之語音合成方法的 流程圖。 圖4是依照本發明一實施例所繪示之以改進的HNM 模型作音節信號合成的流程圖。 【主要元件符號說明】 S105〜S125 :本發明一實施例之訓練頻譜演進模型的 方法各步驟 S305〜S320 :本發明一實施例之語音合成方法各步驟 S405〜S430 :本發明一實施例之基於改進的諧波加雜 音模型的語音合成方法各步驟 20In step S415, the HNM parameters of the respective sound boxes corresponding to the silent portion are copied one by one to the corresponding control points. In step S42, °, the HNM parameters of the two adjacent frames corresponding to the control points of the voiced portion are linearly interpolated, but the phase values are first processed by _rapping. Then the HNM parameters obtained by the interpolation are copied to the control point. In step S425, for the HNM parameter on the voiced part control point, further matching with the specification of the base circumference tracking parameter value output by the prosody model, and adjusting the third-order Lagrange interpolation under the condition that the tone color remains consistent. . That is, the frequency value is adjusted to the amplitude of the fourth harmonic of 晷4, 17 282〇〇twf.d〇c/n 201001396, the spectral envelope curve of the material is deviated, as the first, from the old The vibration frequency corresponding to the frequency of the jujube = the initial frequency = the frequency value of the observation and the value of the g-currency corresponding to its value Ο ,, ίί, in step S430, according to the H_spring on these control points signal. In this embodiment, the three types are the silent signal between the two long turns and the sound signal. The mother (such as 1 in the $% = silent signal 'that is the h hit ' for the short-term silent sound is copied directly from the original sound to the synthesis level; Γ: ΐ " ° ^ ts /;; Part of the job noise parameters are used for synthesis. In addition, the parameter eight is 11 (four) 'we control the harmonic parameters and the noise signal and the noise signal, and then both: according to this, can be generated The synthesized speech signal. 
In the example, we also use the output generated by the prosody model. The length, volume, and vowel amplitude of the syllable signal are proportional to the length of the sound, and the naturalness of the synthesized speech signal is improved. For ϊ Description: Set the amplitude ratio of the head and the head and the adjustment of the base track. 18 201001396^ 28200twf.doc/n Then, the length of the sound is used to control the HNM to synthesize the correct syllable time length. In aspect, the steps are similar to the sound length adjustment. Similarly, the volume average of the syllables to be synthesized is first obtained, and then the volume ratio value generated by the prosody model is used to calculate the volume value. Then, by the volume The value is used to control the HNM to synthesize the correct syllable volume. The adjustment of the amplitude ratio of the vowel is obtained by multiplying the volume value by the volume value after obtaining the ratio of the amplitude of the vowel amplitude output from the prosody model. The curve of the base circumference trajectory can be adjusted. After the 16-point pitch value output by the prosody model is obtained, the 16-point pitch value is evenly distributed on the time axis of the vocal part in the syllable. The fundamental frequency value of the "others" is obtained by finding the four adjacent corresponding points in the 16-point pitch value for Lagrange interpolation. In summary, the frequency performance parameters obtained based on DTW spectral matching To take the spectrum evolution model constructed by ANN, and then to predict the frequency parameters in the synthesis stage, and then use the frequency parameters to determine the time axis correspondence between the syllables and the original syllables. Thus, the fluency of the synthesized speech signal can be significantly improved. In addition, the prosody model is used to derive the prosody parameters, and then the base trajectory, the length of the sound, and the volume of the synthesized syllable are controlled. The ratio of the amplitude of the vowel to the sound can make the synthesized speech sound more natural. In addition, using the improved harmonic plus murmur model to synthesize the speech signal can greatly improve the definition of the synthesized speech, and can support The speech speed of the synthesized speech and the change of the pitch frequency. 19 201001396w 282〇〇twfd〇c/n = However, the present invention has been exposed to the above preferred embodiment, but it is not used in the present invention. In general, the scope of protection of the present invention is subject to the definition of the scope of the appended patent application, without departing from the spirit and scope of the invention. 1 is a flow chart of a method for training a spectrum evolution type according to an embodiment of the present invention. 2A is a schematic diagram of obtaining a spectrum matching path by DTW according to an embodiment of the invention. FIG. 2B is a schematic diagram of obtaining frequency performance parameters according to a spectrum matching path according to an embodiment of the invention. 2C is a schematic diagram of local path limitation for 〇 T w spectral matching, in accordance with an embodiment of the invention. FIG. 3 is a flow chart of a speech synthesis method according to an embodiment of the invention. 4 is a flow chart of synthesizing a syllable signal with a modified HNM model, in accordance with an embodiment of the invention. 
In summary, the progression parameters obtained on the basis of DTW spectrum matching are used to train the spectrum progression model constructed with an ANN; in the synthesis stage, the model predicts the progression parameters, which are then used to decide the time-axis correspondence between each synthetic syllable and its original syllable, so that the fluency of the synthesized speech signal can be raised significantly. In addition, the prosody models are used to produce the prosodic parameters, according to which the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of the synthetic syllables are controlled, making the synthesized speech sound more natural. Moreover, using the improved harmonic-plus-noise model for the synthesis of the speech signal greatly raises the clarity of the synthesized speech, and changes of the speaking rate and of the pitch frequency of the synthesized speech are supported.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the technical field to which the invention belongs may make some changes and refinements without departing from the spirit and scope of the present invention, so the scope of protection of the present invention shall be as defined by the appended claims.

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the method of training the spectrum progression model according to an embodiment of the present invention.

FIG. 2A is a schematic diagram of obtaining the spectrum matching path with DTW according to an embodiment of the present invention.

FIG. 2B is a schematic diagram of obtaining the progression parameters according to the spectrum matching path according to an embodiment of the present invention.

FIG. 2C is a schematic diagram of the local-path constraint used for DTW spectrum matching according to an embodiment of the present invention.

FIG. 3 is a flowchart of the speech synthesis method according to an embodiment of the present invention.

FIG. 4 is a flowchart of syllable signal synthesis with the improved HNM model according to an embodiment of the present invention.

[Description of Main Element Symbols]

S105-S125: steps of the method of training the spectrum progression model according to an embodiment of the present invention

S305-S320: steps of the speech synthesis method according to an embodiment of the present invention

S405-S430: steps of the speech synthesis method based on the improved harmonic-plus-noise model according to an embodiment of the present invention

Claims (1)

X. Claims

1. A method of speech synthesis, comprising:
constructing a spectrum progression model and a plurality of prosody models with artificial neural networks (ANN);
obtaining the pronunciation information of each of the syllables composing a sentence, and analyzing the contextual data of each of the syllables according to the pronunciation information;
inputting the contextual data of each of the syllables to the prosody models and to the spectrum progression model, so as to obtain the prosodic parameters and the progression parameters of each of the syllables, respectively; and
controlling an improved harmonic-plus-noise model (HNM) according to the progression parameters and the prosodic parameters of each of the syllables to synthesize the signal waveform of each syllable, and concatenating the signal waveforms in sequence to obtain a synthetic speech signal of the sentence.

2. The method of speech synthesis according to claim 1, wherein the step of constructing the spectrum progression model comprises:
providing a syllable pronunciation corpus and a sentence pronunciation corpus, the syllable corpus comprising a plurality of reference syllables pronounced in isolation, and the sentence corpus comprising target syllables cut from a plurality of continuously pronounced sentences;
performing spectral comparison between the target syllables and the reference syllables based on dynamic time warping (DTW), so as to obtain the progression parameters of each of the target syllables; and
taking the contextual data of each target syllable as the input of the ANN and the progression parameters obtained for that target syllable as the output of the ANN, so as to train the spectrum progression model.
3. The method of speech synthesis according to claim 2, wherein the step of performing the spectral comparison between each target syllable and its corresponding reference syllable based on dynamic time warping comprises:
finding, among the isolated pronunciations, the reference syllable having the same pinyin as the target syllable, and comparing the spectra of the target syllable and the corresponding reference syllable based on dynamic time warping to obtain a spectrum matching path; and
normalizing the matching path into a progression parameter vector of fixed dimension.

4. The method of speech synthesis according to claim 3, wherein the step of comparing the spectra of a target syllable and its corresponding reference syllable comprises:
cutting the signals of the two syllables into sequences of frames;
computing a feature vector for each frame, namely the Mel-frequency cepstral coefficients (MFCC) and the coefficient differences between adjacent frames;
viewing the target syllable and the reference syllable as two sequences of vectors; and
under the constraint of endpoint correspondence and the constraint on local paths, dynamically finding the path with the shortest walking distance between the two vector sequences as the spectrum matching path.

5. The method of speech synthesis according to claim 3, wherein the step of comparing the spectra of a target syllable and its corresponding reference syllable comprises:
segmenting the target syllable and the reference syllable each into a non-periodic part and a periodic part;
performing the spectral comparison separately for the non-periodic parts and for the periodic parts; and
joining the paths obtained from the non-periodic parts and from the periodic parts into a single spectrum matching path.
6. The method of speech synthesis according to claim 1, wherein the step of obtaining the synthetic speech signal of the sentence comprises:
determining, according to the respective progression parameters of the syllables to be synthesized, the time-axis correspondence between each synthetic syllable signal and its corresponding original syllable signal, and the duration ratio of the initial and the final within each synthetic syllable; and
determining, according to the respective prosodic parameters of the syllables to be synthesized, the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of each synthetic syllable signal.

7. The method of speech synthesis according to claim 1, further comprising, in an analysis stage, cutting the original signals recorded for the syllables to be synthesized into sequences of frames and analyzing the harmonic parameters and noise parameters of each frame.

8. The method of speech synthesis according to claim 1, wherein the step of synthesizing the signal of each syllable with the harmonic-plus-noise model comprises:
dividing the syllable to be synthesized into an unvoiced part and a voiced part, and placing a plurality of control points on the two parts respectively;
for a short unvoiced part, copying the signal of the unvoiced part of the original syllable directly for the synthesis;
for a long unvoiced part, performing the synthesis with the noise parameters on the control points of the unvoiced part; and
for the voiced part, synthesizing a harmonic signal and a noise signal respectively from the harmonic parameters and noise parameters on the control points of the voiced part, and adding the two signals.

9. The method of speech synthesis according to claim 8, wherein the step of obtaining the parameters on the respective control points comprises:
for each control point of a long unvoiced part, copying the harmonic and noise parameters of the corresponding original syllable frame onto that control point;
for each control point of the voiced part, interpolating the harmonic and noise parameters of the two adjacent frames corresponding to that control point, and copying the interpolated parameters onto that control point;
adjusting the harmonic parameters on the control points of the voiced part according to the prescribed pitch-contour prosodic parameter values; and
generating the signal of the synthetic syllable from the harmonic and noise parameters on the control points.

10. The method of speech synthesis according to claim 1, wherein the contextual data include the tone class, the initial class, and the final class of the current syllable, the tone class and the coarse final class of the preceding syllable, the tone class and the coarse initial class of the following syllable, and the position of the current syllable within the sentence.
TW97123982A 2008-06-26 2008-06-26 Method for synthesizing speech TWI360108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97123982A TWI360108B (en) 2008-06-26 2008-06-26 Method for synthesizing speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW97123982A TWI360108B (en) 2008-06-26 2008-06-26 Method for synthesizing speech

Publications (2)

Publication Number Publication Date
TW201001396A true TW201001396A (en) 2010-01-01
TWI360108B TWI360108B (en) 2012-03-11

Family

ID=44824867

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97123982A TWI360108B (en) 2008-06-26 2008-06-26 Method for synthesizing speech

Country Status (1)

Country Link
TW (1) TWI360108B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI573129B (en) * 2013-02-05 2017-03-01 國立交通大學 Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102664003A (en) * 2012-04-24 2012-09-12 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN102664003B (en) * 2012-04-24 2013-12-04 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN108766450A (en) * 2018-04-16 2018-11-06 杭州电子科技大学 A kind of phonetics transfer method decomposed based on harmonic wave impulse
CN108766450B (en) * 2018-04-16 2023-02-17 杭州电子科技大学 Voice conversion method based on harmonic impulse decomposition
TWI749709B (en) * 2020-08-14 2021-12-11 國立雲林科技大學 A method of speaker identification

Also Published As

Publication number Publication date
TWI360108B (en) 2012-03-11

Similar Documents

Publication Publication Date Title
Gold et al. Speech and audio signal processing: processing and perception of speech and music
Hono et al. Recent development of the DNN-based singing voice synthesis system—sinsy
Zhao et al. Using phonetic posteriorgram based frame pairing for segmental accent conversion
Aryal et al. Reduction of non-native accents through statistical parametric articulatory synthesis
López et al. Speaking style conversion from normal to Lombard speech using a glottal vocoder and Bayesian GMMs
Nakamura et al. Fast and high-quality singing voice synthesis system based on convolutional neural networks
Bak et al. Fastpitchformant: Source-filter based decomposed modeling for speech synthesis
Bollepalli et al. Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks
Kim Singing voice analysis/synthesis
Shahnawazuddin et al. Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition
TW201001396A (en) Method for synthesizing speech
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Hlaing et al. Enhancing Myanmar speech synthesis with linguistic information and LSTM-RNN
Richards et al. Deriving articulatory representations from speech with various excitation modes
Waghmare et al. Analysis of pitch and duration in speech synthesis using PSOLA
Juvela et al. Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system
Murphy Controlling the voice quality dimension of prosody in synthetic speech using an acoustic glottal model
Fu et al. Transfer Learning Based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis.
Tao et al. The NLPR Speech Synthesis entry for Blizzard Challenge 2017
Wu et al. VStyclone: Real-time Chinese voice style clone
i Barrobes Voice Conversion applied to Text-to-Speech systems
Lin et al. New refinement schemes for voice conversion
Shitov Computational speech acquisition for articulatory synthesis
Wang et al. Combining extreme learning machine and decision tree for duration prediction in HMM based speech synthesis.

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees