TW201001396A - Method for synthesizing speech - Google Patents

Method for synthesizing speech

Info

Publication number
TW201001396A
Authority
TW
Taiwan
Prior art keywords
syllable
syllables
signal
parameters
sound
Prior art date
Application number
TW97123982A
Other languages
Chinese (zh)
Other versions
TWI360108B (en)
Inventor
Hung-Yan Gu
Original Assignee
Univ Nat Taiwan Science Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Taiwan Science Tech filed Critical Univ Nat Taiwan Science Tech
Priority to TW97123982A priority Critical patent/TWI360108B/en
Publication of TW201001396A publication Critical patent/TW201001396A/en
Application granted granted Critical
Publication of TWI360108B publication Critical patent/TWI360108B/en

Landscapes

  • Machine Translation (AREA)

Abstract

A method for synthesizing speech is provided. First, a spectrum progression model and prosody models are constructed using artificial neural networks (ANN). For the training of the spectrum progression model, a dynamic time warping (DTW) based technique is developed to compute the spectrum progression parameters of the training speech. Afterwards, the spectrum progression model and the prosody models are used to predict the spectrum progression parameters and prosody parameters for each of the syllables in an input sentence. Finally, synthetic speech signals are generated by using the spectrum progression parameters and prosody parameters of each syllable to control an improved harmonic plus noise model (HNM).

Description

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates to a method of speech signal processing, and more particularly to a method of speech synthesis that combines a spectrum progression model and prosody models to control an improved harmonic-plus-noise model (HNM).

[Prior Art]

In recent years, owing to the rapid advance of technology, the interaction between people and computers can no longer be satisfied by typing commands into the computer and having it respond with text. How to develop a more direct and more natural way for humans and machines to communicate by voice has therefore become an important topic.

For a computer to use speech as its medium of communication with humans, the two required technologies are speech recognition and speech synthesis. Among them, a text-to-speech (TTS) system is an extension of a speech synthesis system: a technology for converting input text into human speech output.

Traditional speech signal synthesis techniques include formant synthesis, linear predictive coding (LPC), pitch synchronous overlap and add (PSOLA), and corpus-based re-sequencing of synthesis units. However, these techniques still exhibit one or more of the following drawbacks:

1. The synthesized speech signal is not clear enough; that is, the signal contains echo or noise.

2. Changes to the speaking rate or the pitch frequency of the synthesized speech are not supported.

3. The fluency of the synthesized speech is poor, with an obvious gap from human pronunciation.

Regarding the first drawback above, Y. Stylianou proposed the harmonic-plus-noise model (HNM) [1] to improve the clarity of synthesized speech signals, but some shortcomings remain in the original HNM. Regarding the third drawback, T. Yoshimura et al. proposed using a hidden Markov model (HMM) to capture the process by which the speech spectrum gradually changes over time [2], referred to herein as spectrum progression. However, an HMM uses only a small number of states and does not consider the correlation between the feature vectors of adjacent frames, so it is not fine-grained enough for speech synthesis.

[1] Stylianou, Yannis, Harmonic plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification, Ph.D. thesis, École Nationale Supérieure des Télécommunications, Paris, France, 1996.

[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi and T. Kitamura, "Duration Modeling in HMM-based Speech Synthesis System," Proc. of ICSLP, Vol. 2, pp. 29-32, 1998.

[Summary of the Invention]

The present invention provides a method of speech synthesis in which a spectrum progression model and prosody models, each constructed with artificial neural networks (ANN), predict the spectrum progression parameters (hereinafter, progression parameters) and the prosodic parameters of each syllable, which are then used to control an improved harmonic-plus-noise model (HNM) to carry out speech signal synthesis.

The present invention also provides a method of training the spectrum progression model: on the basis of dynamic time warping (DTW), a technique is developed for computing the progression (spectrum progression) parameters of the training syllables, and the computed progression parameters are then used to train the spectrum progression model constructed with an ANN.

The proposed speech synthesis method proceeds as follows. First, the spectrum progression model and the prosody models are constructed with ANNs and trained; for the training of the spectrum progression model, a DTW-based technique is developed to compute the progression parameters of the training syllables. After the parameter models have been trained, in the synthesis stage an input sentence is first analyzed (text analysis) to obtain the contextual data of each syllable in the sentence. The contextual data are then input to the prosody models and to the spectrum progression model to obtain each syllable's prosodic parameters and progression parameters, respectively. Finally, the progression parameters and prosodic parameters of each syllable are used to control the improved harmonic-plus-noise model to generate the synthetic speech signal of the sentence.
In an embodiment of the present invention, the step of training the spectrum progression model first performs spectral matching with a DTW-based method so as to extract the progression parameters of the training syllables; afterwards, the contextual data are taken as the input of the ANN and the extracted progression parameters as the output of the ANN, and the spectrum progression model is trained. In the step of training a prosody model, the contextual data are likewise taken as the input of the ANN, and the prosodic parameters analyzed from the training syllables as the output of the ANN.

In an embodiment of the present invention, the contextual data include the tone class, the syllable-initial class, and the syllable-final class of the current syllable, the tone class and the final class of the preceding syllable, the tone class and the initial class of the following syllable, and a value for the position of the current syllable within the sentence.

In an embodiment of the present invention, one step of the synthesis process inputs the contextual data to the spectrum progression model to obtain the progression parameters of a syllable; according to the progression parameters, how the spectrum of the synthetic syllable signal evolves along the time axis can be decided, as well as the duration ratio of the initial and the final. In addition, inputting the contextual data to the prosody models yields the prosodic parameters, according to which the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of the synthetic syllable signal can be decided; that is, the prosodic parameters include the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio.
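The contextual data enumerated above are what all of these models consume as input. As a rough illustration only, the sketch below assembles one syllable's context vector, assuming binary-coded class indices and a position ratio in [0, 1]; the bit widths and helper names are assumptions for illustration, not values fixed by the patent (the embodiment later states only that, for example, a tone class is fed with 3 input units).

```python
import numpy as np

def binary_code(index, n_bits):
    """Encode a class index as n_bits binary input units."""
    return [(index >> b) & 1 for b in range(n_bits)]

def context_vector(cur_tone, cur_initial, cur_final,
                   prev_tone, prev_final_coarse,
                   next_tone, next_initial_coarse,
                   position_ratio):
    """Assemble one syllable's ANN input from its contextual data.

    Bit widths are illustrative: 3 bits cover the 5 tone classes, 5 bits
    the 22 initial classes, 4 bits the 9 coarse final classes, 3 bits the
    6 coarse initial classes; 6 bits for the final class is a guess.
    """
    vec = []
    vec += binary_code(cur_tone, 3)
    vec += binary_code(cur_initial, 5)
    vec += binary_code(cur_final, 6)
    vec += binary_code(prev_tone, 3)
    vec += binary_code(prev_final_coarse, 4)
    vec += binary_code(next_tone, 3)
    vec += binary_code(next_initial_coarse, 3)
    vec.append(position_ratio)  # position of the syllable in the sentence, 0..1
    return np.array(vec, dtype=np.float32)
```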

In an embodiment of the present invention, the step of synthesizing a syllable signal with the harmonic-plus-noise model is as follows. In the training stage, the original signal recorded for each Mandarin syllable is first cut into a sequence of frames, and the HNM parameters (harmonic and noise parameters) of each frame are analyzed. Then, in the synthesis stage, the syllable to be synthesized is first divided into an unvoiced part and a voiced part, and a number of control points are placed on their respective time axes. The HNM parameters of the original frames corresponding to the unvoiced part are copied one by one onto the corresponding control points. For the voiced part, the progression parameters are used to establish a nonlinear correspondence of the time axes; for each control point, the corresponding time position in the original syllable determines two adjacent original frames, whose HNM parameters are interpolated (linear interpolation is used here, whereas the original HNM synthesis method does not state which interpolation to use), and the interpolated HNM parameters are copied onto that control point. Moreover, the HNM parameters on the control points of the voiced part must further comply with the prescribed pitch-contour values; on the premise that the timbre remains consistent, another adjustment by third-order Lagrange interpolation is made (again left unspecified by the original HNM synthesis method). Once the HNM parameters on all control points have been decided, the speech signal waveform can be generated accordingly.

In an embodiment of the present invention, in the step of synthesizing a syllable signal with the harmonic-plus-noise model, when the unvoiced part is a short one, the signal of the unvoiced part of the original syllable is copied directly onto the synthetic syllable; when the unvoiced part is a long one, it is synthesized from the noise parameters of the unvoiced part. For the voiced part, a harmonic signal is synthesized from the harmonic parameters and a noise signal from the noise parameters, and the two are added to give the signal of the voiced part.
In addition, the method of training the spectrum progression model is detailed as follows. First, a syllable pronunciation corpus and a sentence pronunciation corpus are provided; the syllable corpus comprises recordings of the isolated pronunciation of each Mandarin syllable, while the sentence corpus comprises recordings of the continuous pronunciation of a number of sentences. Next, the sentence recordings are cut to obtain the signals of the individual syllables. Afterwards, each cut-out syllable is taken in turn as a target syllable, and dynamic time warping is used to compare the spectra of the target syllable and the isolated pronunciation of its corresponding (same-pinyin) reference syllable, so as to obtain a spectrum matching path. According to this path, the time axes of the two syllables are both normalized to the range between 0 and 1; sampling points are placed uniformly on the normalized time axis of the target syllable, and for each sampling point the matching path yields the corresponding normalized time value on the time axis of the reference syllable. The values so obtained are the progression parameters. Finally, the contextual data of each target syllable are taken as the input of the ANN model and the progression parameters as its output, and the spectrum progression model is trained.

In an embodiment, the step of comparing the spectra of a target syllable and its corresponding reference syllable may first cut the target syllable and the reference syllable into frames and compute a feature vector for each frame; the distance between a target frame and a reference frame is then computed as the distance between their feature vectors.

In an embodiment, the comparison step may further segment the target syllable and the reference syllable each into a non-periodic part and a periodic part, perform the spectral comparison separately for the pair of non-periodic parts and the pair of periodic parts, and then join the two resulting paths into a single spectrum matching path.
By means of the models above, the improved HNM model can be used to synthesize clear speech without echo or noise, and large changes of the speaking rate and the pitch frequency are supported. Applying the spectrum progression model to the improved HNM makes the fluency of the synthesized speech signal improve markedly, and after the prosody models have been constructed, predicting the prosodic parameter values with them gives the synthesized speech considerable naturalness. Therefore, with the method of the present invention, which combines the spectrum progression model and the prosody models to control the improved HNM model for speech signal synthesis, the synthesized speech not only has a clear signal but is also, to the ear, very close to the way ordinary people speak.

To make the above features and advantages of the present invention more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.

[Embodiments]

To make the content of the present invention clearer, the following embodiments are given as examples by which the present invention can indeed be carried out.

FIG. 1 is a flowchart of the method of training the spectrum progression model according to an embodiment of the present invention. Referring to FIG. 1, first, in step S105, a syllable pronunciation corpus and a sentence pronunciation corpus are provided. The syllable corpus contains recordings of the isolated pronunciation of each Mandarin syllable; in this embodiment, the isolated pronunciation of each syllable is recorded in the first tone. The sentence corpus contains recordings of the continuous pronunciation of a number of sentences, and the syllables contained in these sentences serve as the target syllables.

Next, in step S110, each sentence in the sentence corpus is cut so as to obtain the target syllables. For example, the syllable boundary points of each sentence in the corpus are marked, and each syllable is labeled with its pinyin symbols and its tone; the signal of each syllable is then extracted according to these recorded label files.

Afterwards, in step S115, the spectra of each target syllable and its corresponding reference syllable are compared one by one with dynamic time warping (DTW) to obtain the spectrum matching path. In practice, DTW first cuts the signals of the two syllables into sequences of frames and computes a feature vector for each frame; the target syllable and the reference syllable are then viewed as two sequences of vectors, and, subject to (a) the constraint of endpoint correspondence and (b) the constraint on local paths, the path with the shortest distance between the two vector sequences is found and taken as the spectrum matching path. The local-path constraint used in this embodiment is of the form shown in FIG. 2C.
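A minimal sketch of this DTW step, assuming the frame feature sequences have already been computed. Since the exact local-path constraint of FIG. 2C is not reproduced in the text, the common horizontal, vertical, and diagonal steps are used here as an assumption:

```python
import numpy as np

def dtw_path(target_feats, ref_feats):
    """DTW between two frame-feature arrays of shape (T, D) and (R, D).
    Returns the minimum-distance matching path as (target, reference) pairs."""
    T, R = len(target_feats), len(ref_feats)
    # frame-to-frame distance: Euclidean distance between feature vectors
    dist = np.linalg.norm(target_feats[:, None, :] - ref_feats[None, :, :], axis=2)

    acc = np.full((T, R), np.inf)
    acc[0, 0] = dist[0, 0]                  # endpoint constraint: start at (0, 0)
    for i in range(T):
        for j in range(R):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,        # vertical step
                       acc[i, j - 1] if j > 0 else np.inf,        # horizontal step
                       acc[i - 1, j - 1] if i and j else np.inf)  # diagonal step
            acc[i, j] = dist[i, j] + prev

    i, j = T - 1, R - 1                     # endpoint constraint: end at (T-1, R-1)
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: acc[s])
        path.append((i, j))
    return path[::-1]
```

Because of the endpoint constraint, the path always connects the first frame pair to the last frame pair, which is what allows both time axes to be normalized consistently in step S120 below.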
For the frame features, the Mel-frequency cepstral coefficients (MFCC) are adopted. In this embodiment, each frame of the speech signal is converted into MFCCs together with the differences of these coefficients between adjacent frames, so that there are 26 coefficients in total in the feature vector.

In addition, when DTW is used for spectral matching between two syllables, the matching can be performed separately for the periodic and the non-periodic portions. In FIG. 2A, for example, the vertical line segment in region A marks, on the waveform of the target syllable, the boundary point between the periodic portion and the non-periodic portion, while the horizontal line segment B' in region B marks the corresponding boundary point on the waveform of the reference syllable. Region C shows the matching between the target syllable and the reference syllable, where the grid lines on the horizontal axis denote the frames of the target syllable and the grid lines on the vertical axis denote the frames of the reference syllable. The curve in region D is the spectrum matching path of the non-periodic portion, and the curve in region E is the spectrum matching path of the periodic portion.
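As one way to realize the 26-coefficient frame features described above, the sketch below computes 13 MFCCs per frame plus their adjacent-frame differences; the 13/13 split and the frame and hop sizes are assumptions (the text states only that the MFCCs and their adjacent-frame differences total 26 coefficients), and librosa is used merely as a convenient stand-in for the analysis front end:

```python
import numpy as np
import librosa

def frame_features(signal, sr):
    """26-coefficient frame features: 13 MFCCs plus the differences of
    those coefficients between adjacent frames (13 + 13 = 26)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=160)
    # coefficient difference between each frame and the preceding frame
    delta = np.diff(mfcc, axis=1, prepend=mfcc[:, :1])
    return np.vstack([mfcc, delta]).T  # shape: (num_frames, 26)
```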

Next, returning to step S120 of FIG. 1, the matching path is normalized into a progression parameter vector of fixed dimension; that is, time-normalized progression parameters are obtained from the matching path. FIG. 2B is a schematic diagram of obtaining the progression parameters from the spectrum matching path according to an embodiment of the present invention. Referring to FIG. 2B, the time lengths of the target syllable and of the reference syllable are both normalized to 1. The target syllable is then cut uniformly into 32 normalized time points, and, according to the spectrum matching path, each of the 32 normalized time points is mapped onto the reference syllable so as to find the corresponding normalized time value on the reference syllable; 32 values between 0 and 1 are thus obtained as the progression parameters.

Finally, after the progression parameters have been obtained, the contextual data of each target syllable are taken as the input of the ANN model and the corresponding progression parameters as the output of the ANN model, so as to train the spectrum progression model. Training the so-called ANN model means presenting it with pairs of example inputs and outputs, so that the ANN learns the correlation between input data and output data and can later be used to predict the output data that should correspond to a given input.
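Putting steps S120 and S125 together, the sketch below turns a DTW matching path into the 32-dimensional progression parameter vector that serves as the ANN's training target; the piecewise-linear reading of the path is an implementation assumption:

```python
import numpy as np

def progression_parameters(path, num_points=32):
    """Turn a matching path [(target_frame, ref_frame), ...] into num_points
    progression parameters: for uniformly spaced normalized times on the
    target axis, the corresponding normalized times on the reference axis."""
    ref_for_target = {}
    for t, r in path:  # a vertical DTW step repeats a target frame;
        ref_for_target[t] = max(r, ref_for_target.get(t, r))  # keep one ref frame each
    t_idx = np.array(sorted(ref_for_target), dtype=float)
    r_idx = np.array([ref_for_target[t] for t in sorted(ref_for_target)], dtype=float)

    t_norm = t_idx / t_idx[-1]  # both time axes normalized to [0, 1]
    r_norm = r_idx / r_idx[-1]
    sample_times = np.linspace(0.0, 1.0, num_points)
    return np.interp(sample_times, t_norm, r_norm)  # 32 values, each in [0, 1]
```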

In this embodiment, the contextual data comprise the tone class (5 classes), the initial class (22 classes), and the final class of the current syllable, the tone class and the coarse final class (9 classes) of the preceding syllable, and the tone class and the coarse initial class (6 classes) of the following syllable. This is because speech is a sequential transfer of information, so in addition to the tone, initial, and final classes of the current syllable, the tone class and final class of the preceding syllable, as well as the tone class and initial class of the following syllable, are also of considerable importance. Moreover, the position of a syllable within the sentence (for example, sentence-final position) also influences the prosodic state of the current syllable, so a time-proportion value of the position within the sentence is used in addition.

In this embodiment, the ANN is composed of an input layer, a hidden layer, a recurrent hidden layer, and an output layer. Each item of the contextual data is encoded before being fed to the input layer; for example, a tone class is fed with 3 input units. The numbers of units of the hidden layer and of the recurrent hidden layer can be decided according to experimental results; in this embodiment, the number of units of both layers is set to 16. The number of units of the output layer is decided directly by the dimension of the target parameters.

In addition, since the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of a syllable also have a definite influence on the naturalness of the synthesized speech, the values of these prosodic parameters are also analyzed for the constituent syllables of each sentence in the sentence corpus, and the resulting pitch-contour, duration, volume, and amplitude-ratio values are used to train their respective ANN models. That is to say, after step S110 the prosodic parameters of each target syllable can further be analyzed; then, as in step S125, the contextual data are taken as the input of the ANN and the values of the respective prosodic parameters as the outputs, so as to train a separate model for each of the prosodic parameters, namely the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio.

After the training of the spectrum progression model and the prosody models is completed, these models can be used to carry out the synthesis of speech signals. FIG. 3 is a flowchart of the speech synthesis method according to an embodiment of the present invention. Referring to FIG. 3, in step S305, an input sentence is first analyzed to obtain the pronunciation information of each character; the pronunciation information includes the syllable pinyin, the tone, the initial class, and the final class.

Next, in step S310, the contextual data of each syllable are assembled according to the pronunciation information, to serve as the input of the prosody models and of the spectrum progression model.

Afterwards, in step S315, the contextual data are input to the prosody models and the spectrum progression model, so that the prosodic parameters and the progression parameters are obtained. In detail, the contextual data are taken as the input of the spectrum progression model so that it outputs the progression parameters corresponding to this context, and as the input of the prosody models so that they output the corresponding prosodic parameters.

Finally, in step S320, the prosodic parameters and the progression parameters are used to control the improved harmonic-plus-noise model, so as to synthesize the speech signal of the input sentence. That is, the progression parameters can be used to decide the time correspondence between the spectrum of each synthetic syllable and the spectrum of the original syllable, and the duration ratio of the initial and the final within the synthetic syllable, while the prosodic parameters can be used to decide the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of the synthetic syllable signal.

As to deciding the initial/final duration ratio from the progression parameters: the progression parameters define a piecewise-linear curve that maps the time axis of the syllable to be synthesized to the time axis of the original syllable. From the voiced/unvoiced boundary point B' of the original syllable, the curve leads back to the corresponding boundary point A' on the syllable to be synthesized, and the position of A' decides the duration ratio of the initial and the final. Likewise, for each control point of the voiced part (on the horizontal axis), the piecewise-linear curve, like the one in region E of FIG. 2A, gives the corresponding time position in the original syllable and thereby the two adjacent frames of the original syllable.
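A minimal sketch of this time mapping on the synthesis side, assuming the progression parameters predicted by the ANN are a non-decreasing sequence of normalized reference times (as they are when read off a DTW path); the function and variable names are illustrative:

```python
import numpy as np

def map_to_original_time(u, progression):
    """Normalized synthetic-syllable time u in [0, 1] -> normalized
    original-syllable time, via the piecewise-linear progression curve."""
    grid = np.linspace(0.0, 1.0, len(progression))
    return float(np.interp(u, grid, progression))

def synthetic_boundary(b_orig_norm, progression):
    """Invert the curve: the synthetic-time point A' that maps onto the
    original syllable's voiced/unvoiced boundary B' (normalized)."""
    grid = np.linspace(0.0, 1.0, len(progression))
    return float(np.interp(b_orig_norm, progression, grid))

def adjacent_original_frames(u, progression, num_orig_frames):
    """For a voiced control point at normalized time u, return the indices
    of the two adjacent original frames and the interpolation weight."""
    pos = map_to_original_time(u, progression) * (num_orig_frames - 1)
    k = min(int(pos), num_orig_frames - 2)
    return k, k + 1, pos - k
```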
The following details step S320 with an example. FIG. 4 is a flowchart of syllable signal synthesis with the improved HNM model according to an embodiment of the present invention.

In the analysis stage of the HNM model, the signal of each syllable is first cut into frames so that the HNM parameters of each frame can be analyzed. When a frame carries a voiced signal, the spectrum of the frame signal is divided into a low-frequency harmonic part and a high-frequency noise part, and the frequency value of the dividing point is called the maximum voiced frequency (MVF): the portion of the spectrum below the MVF is regarded as being composed of harmonic signals, and the portion above the MVF as being composed of noise signals. When a frame carries an unvoiced signal, its entire spectrum is regarded as being composed of noise signals.

Referring to FIG. 4, in step S405 the MVF of each frame is detected. Because the syllables are not recorded at a uniform volume, the original HNM parameter analysis, which detects the MVF with a fixed amplitude threshold, would give a quiet syllable a small MVF (that is, a small number of harmonics), which leads to a distorted synthetic sound. In this embodiment, the detection is therefore improved by using a dynamic threshold. In detail, the maximum harmonic amplitude across the frames of a syllable is found first, and 1/512 of this maximum harmonic amplitude is taken as the threshold for MVF detection. Then, the amplitude of each harmonic in each frame is compared against the threshold; a harmonic whose amplitude is smaller than the threshold is regarded as an unvoiced harmonic, and when 5 consecutive harmonics all have amplitudes smaller than the threshold, the frequency of the last harmonic larger than the threshold is taken as the MVF of that frame. Accordingly, after the MVF has been decided, the frequency, amplitude, and phase of each harmonic are recorded as the harmonic parameters, and 30 cepstrum coefficients are recorded to represent the spectral envelope of the noise part.

Next, in step S410, the syllable signal to be synthesized is divided into an unvoiced part and a voiced part, on which control points are placed respectively. For example, the number of control points of the unvoiced part is made the same as the number of frames of the unvoiced part of the original syllable, that is, one control point corresponds to one frame. The control points of the voiced part, on the other hand, are placed at a fixed interval of signal samples; in other words, the number of control points is decided by the duration of the voiced part.
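A minimal sketch of this dynamic-threshold MVF detection, assuming each frame's harmonics are given, from the lowest harmonic upward, as parallel arrays of frequencies and amplitudes:

```python
import numpy as np

def detect_mvf(frames, ratio=1.0 / 512.0, run_len=5):
    """Dynamic-threshold MVF detection for one syllable.

    frames: list of (freqs, amps) array pairs, one pair per voiced frame.
    Returns one MVF value per frame.
    """
    # threshold = 1/512 of the maximum harmonic amplitude across all frames
    peak = max(float(np.max(amps)) for _, amps in frames)
    threshold = ratio * peak

    mvfs = []
    for freqs, amps in frames:
        mvf = freqs[-1]               # default: all harmonics voiced
        below = 0
        for k, a in enumerate(amps):
            below = below + 1 if a < threshold else 0
            if below == run_len:      # 5 consecutive weak harmonics found
                # MVF = frequency of the last harmonic above the threshold
                mvf = freqs[k - run_len] if k >= run_len else 0.0
                break
        mvfs.append(mvf)
    return mvfs
```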

在步驟S415中,將無聲部分所對應的各音框的HNM 參數,逐一複製到對應的控制點上。而在步驟S42〇中, ° 將有聲部分各個控制點所對應的兩相鄰音框的HNM炱數 進行線性内插,不過相位值要先作反包裹(_rapping)的處 理。之後再助插所求得的HNM參數複製到該個控制點 上。 在步驟S425中,對於有聲部分控制點上的HNM參 數,進一步配合韻律模型輸出的基週執跡參數值的規定, 在音色保持一致性的條件下,作另一種三階的Lagrange内 插的調整。也就是頻率值被調到晷的第A個諧波的振幅4, 17 282〇〇twf.d〇c/n 201001396 之物建的頻譜包絡曲 線中去内差出來,作妓,先從舊的猶頻率 們對應的振=始音=觀頻率值和它 值馬上所應對應的g財出頻率 Ο Ο 數,ίί,、在步驟S430中,依據這些控制點上的H_泉 〜成出語音信號。在本實施射 成 ;==r理,三種型態分別為二ΐ 長衿間的無聲信號以及有聲的信號。 母(如1於$%=無聲信號’也就是對於短時間的無聲聲 音的h击’號片段是直接由原始音裡複製到合成级 ;Γ:ΐ" ° ^ t s/;; 鮮二t聲部分的職雜音參數來進行合成。此外, 參數八以11的㈣’我們以控繼上的諧波參數與雜音 力口 ίΓΓΓ波信號和雜音信號,然後將兩者: 據此,便可產生出合成的語音信號。 去例中’我們也依據韻律模型所產生的輸出, 成音節信號的音長、音量、聲韻母振幅比例出鱼 長、ίϊ,:提升合成語音信號的自然度。以下則針對ϊ 說明:置、聲頭母振幅比例、與基週軌跡的調整分別舉例 18 201001396^ 28200twf.doc/n 然後,藉由音長數值來控制HNM合成出正確的音節時間 長度。 而在音量調整方面,其步驟與音長調整相似。同樣地, 先取得欲合成音節的音量平均值,再配合韻律模型所產生 出的音量比例值去計算出音量數值。然後,藉由音量數值 來控制HNM合成出正確的音節音量。 士關於聲韻母振幅比例的調整,在取得韻律模型所輸出 Γ 的聲韻母振幅比例值之後,將其乘上音量數值而得到聲母 部分的振幅值。 另外,更可調整基週軌跡之曲線。在取得韻律模型所 輸出的16點音高數值之後,將這16點音高值均勻佈放於 音節内有聲部分的時間軸上。有聲部分各控制點上的基頻 值的《又疋,則以找出16點音高數值中的四個相鄰的對應點 來作Lagrange内插而求得。 綜上所述,基於DTW頻譜匹配取得的頻演參數,拿 … 去訓練以ANN建構的頻譜演進模型,然後於合成階段再 U 據此,型以預測出頻演參數,接著利用頻演參數來決定合 ^音節和原始音節之間的時間軸對應關係,如此就可以顯 著地提升合成語音信號的流暢性。另外,使用韻律模型來 出韻律參數,再據以控制合成音節的基週轨跡、音長、 音量、與聲韻母振幅比例,可使合成的語音聽起來更為自 然。此外,使用改進後的諧波加雜音模型來進行語音信號 的合成,則可以大幅度提升合成語音的清晰度,並且可^ 援合成語音的說話速度及音高頻率的改變。 19 201001396w 282〇〇twfd〇c/n =然本發明已以較佳實施觸露如上,然其並非用以 =本發明’任何所屬技術領域中具有通常知識者,在不 脫離本發明之精神和範圍内,當可作些許之更動與潤飾, 口此本發明之保護範圍當視後附之申請專利範圍所界定者 為準。 【圖式簡單說明】 圖1疋依照本發明一實施例所鳍示之訓練頻譜演進槿 型的方法流程圖。 ' ,圖2 A是依照本發明一實施例所繪示之以DTW求取頻 譜匹配路徑的示意圖。 、 一圖2B是依照本發明一實施例所繪示之依據頻譜匹配 路徑來求取頻演參數的示意圖。 圖2 C是依照本發明一實施例所繪示之用於〇 T w頻譜 匹配之局部路徑限制的示意圖。 圖3是依照本發明一實施例所繪示之語音合成方法的 流程圖。 圖4是依照本發明一實施例所繪示之以改進的HNM 模型作音節信號合成的流程圖。 【主要元件符號說明】 S105〜S125 :本發明一實施例之訓練頻譜演進模型的 方法各步驟 S305〜S320 :本發明一實施例之語音合成方法各步驟 S405〜S430 :本發明一實施例之基於改進的諧波加雜 音模型的語音合成方法各步驟 20In step S415, the HNM parameters of the respective sound boxes corresponding to the silent portion are copied one by one to the corresponding control points. In step S42, °, the HNM parameters of the two adjacent frames corresponding to the control points of the voiced portion are linearly interpolated, but the phase values are first processed by _rapping. Then the HNM parameters obtained by the interpolation are copied to the control point. In step S425, for the HNM parameter on the voiced part control point, further matching with the specification of the base circumference tracking parameter value output by the prosody model, and adjusting the third-order Lagrange interpolation under the condition that the tone color remains consistent. . That is, the frequency value is adjusted to the amplitude of the fourth harmonic of 晷4, 17 282〇〇twf.d〇c/n 201001396, the spectral envelope curve of the material is deviated, as the first, from the old The vibration frequency corresponding to the frequency of the jujube = the initial frequency = the frequency value of the observation and the value of the g-currency corresponding to its value Ο ,, ίί, in step S430, according to the H_spring on these control points signal. In this embodiment, the three types are the silent signal between the two long turns and the sound signal. The mother (such as 1 in the $% = silent signal 'that is the h hit ' for the short-term silent sound is copied directly from the original sound to the synthesis level; Γ: ΐ " ° ^ ts /;; Part of the job noise parameters are used for synthesis. In addition, the parameter eight is 11 (four) 'we control the harmonic parameters and the noise signal and the noise signal, and then both: according to this, can be generated The synthesized speech signal. 
In the example, we also use the output generated by the prosody model. The length, volume, and vowel amplitude of the syllable signal are proportional to the length of the sound, and the naturalness of the synthesized speech signal is improved. For ϊ Description: Set the amplitude ratio of the head and the head and the adjustment of the base track. 18 201001396^ 28200twf.doc/n Then, the length of the sound is used to control the HNM to synthesize the correct syllable time length. In aspect, the steps are similar to the sound length adjustment. Similarly, the volume average of the syllables to be synthesized is first obtained, and then the volume ratio value generated by the prosody model is used to calculate the volume value. Then, by the volume The value is used to control the HNM to synthesize the correct syllable volume. The adjustment of the amplitude ratio of the vowel is obtained by multiplying the volume value by the volume value after obtaining the ratio of the amplitude of the vowel amplitude output from the prosody model. The curve of the base circumference trajectory can be adjusted. After the 16-point pitch value output by the prosody model is obtained, the 16-point pitch value is evenly distributed on the time axis of the vocal part in the syllable. The fundamental frequency value of the "others" is obtained by finding the four adjacent corresponding points in the 16-point pitch value for Lagrange interpolation. In summary, the frequency performance parameters obtained based on DTW spectral matching To take the spectrum evolution model constructed by ANN, and then to predict the frequency parameters in the synthesis stage, and then use the frequency parameters to determine the time axis correspondence between the syllables and the original syllables. Thus, the fluency of the synthesized speech signal can be significantly improved. In addition, the prosody model is used to derive the prosody parameters, and then the base trajectory, the length of the sound, and the volume of the synthesized syllable are controlled. The ratio of the amplitude of the vowel to the sound can make the synthesized speech sound more natural. In addition, using the improved harmonic plus murmur model to synthesize the speech signal can greatly improve the definition of the synthesized speech, and can support The speech speed of the synthesized speech and the change of the pitch frequency. 19 201001396w 282〇〇twfd〇c/n = However, the present invention has been exposed to the above preferred embodiment, but it is not used in the present invention. In general, the scope of protection of the present invention is subject to the definition of the scope of the appended patent application, without departing from the spirit and scope of the invention. 1 is a flow chart of a method for training a spectrum evolution type according to an embodiment of the present invention. 2A is a schematic diagram of obtaining a spectrum matching path by DTW according to an embodiment of the invention. FIG. 2B is a schematic diagram of obtaining frequency performance parameters according to a spectrum matching path according to an embodiment of the invention. 2C is a schematic diagram of local path limitation for 〇 T w spectral matching, in accordance with an embodiment of the invention. FIG. 3 is a flow chart of a speech synthesis method according to an embodiment of the invention. 4 is a flow chart of synthesizing a syllable signal with a modified HNM model, in accordance with an embodiment of the invention. 
In summary, the progression parameters obtained on the basis of DTW spectrum matching are used to train the spectrum progression model constructed with an ANN; in the synthesis stage, the model predicts the progression parameters, which are then used to decide the time-axis correspondence between each synthetic syllable and its original syllable, so that the fluency of the synthesized speech signal can be raised significantly. In addition, the prosody models are used to produce the prosodic parameters, according to which the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of the synthetic syllables are controlled, making the synthesized speech sound more natural. Moreover, using the improved harmonic-plus-noise model for the synthesis of the speech signal greatly raises the clarity of the synthesized speech, and changes of the speaking rate and of the pitch frequency of the synthesized speech are supported.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the technical field to which the invention belongs may make some changes and refinements without departing from the spirit and scope of the present invention, so the scope of protection of the present invention shall be as defined by the appended claims.

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the method of training the spectrum progression model according to an embodiment of the present invention.

FIG. 2A is a schematic diagram of obtaining the spectrum matching path with DTW according to an embodiment of the present invention.

FIG. 2B is a schematic diagram of obtaining the progression parameters according to the spectrum matching path according to an embodiment of the present invention.

FIG. 2C is a schematic diagram of the local-path constraint used for DTW spectrum matching according to an embodiment of the present invention.

FIG. 3 is a flowchart of the speech synthesis method according to an embodiment of the present invention.

FIG. 4 is a flowchart of syllable signal synthesis with the improved HNM model according to an embodiment of the present invention.

[Description of Main Element Symbols]

S105-S125: steps of the method of training the spectrum progression model according to an embodiment of the present invention

S305-S320: steps of the speech synthesis method according to an embodiment of the present invention

S405-S430: steps of the speech synthesis method based on the improved harmonic-plus-noise model according to an embodiment of the present invention

Claims (1)

X. Claims

1. A method of speech synthesis, comprising:
constructing a spectrum progression model and a plurality of prosody models with artificial neural networks (ANN);
obtaining the pronunciation information of each of the syllables composing a sentence, and analyzing the contextual data of each of the syllables according to the pronunciation information;
inputting the contextual data of each of the syllables to the prosody models and to the spectrum progression model, so as to obtain the prosodic parameters and the progression parameters of each of the syllables, respectively; and
controlling an improved harmonic-plus-noise model (HNM) according to the progression parameters and the prosodic parameters of each of the syllables to synthesize the signal waveform of each syllable, and concatenating the signal waveforms in sequence to obtain a synthetic speech signal of the sentence.

2. The method of speech synthesis according to claim 1, wherein the step of constructing the spectrum progression model comprises:
providing a syllable pronunciation corpus and a sentence pronunciation corpus, the syllable corpus comprising a plurality of reference syllables pronounced in isolation, and the sentence corpus comprising target syllables cut from a plurality of continuously pronounced sentences;
performing spectral comparison between the target syllables and the reference syllables based on dynamic time warping (DTW), so as to obtain the progression parameters of each of the target syllables; and
taking the contextual data of each target syllable as the input of the ANN and the progression parameters obtained for that target syllable as the output of the ANN, so as to train the spectrum progression model.
3. The method of speech synthesis according to claim 2, wherein the step of performing the spectral comparison between each target syllable and its corresponding reference syllable based on dynamic time warping comprises:
finding, among the isolated pronunciations, the reference syllable having the same pinyin as the target syllable, and comparing the spectra of the target syllable and the corresponding reference syllable based on dynamic time warping to obtain a spectrum matching path; and
normalizing the matching path into a progression parameter vector of fixed dimension.

4. The method of speech synthesis according to claim 3, wherein the step of comparing the spectra of a target syllable and its corresponding reference syllable comprises:
cutting the signals of the two syllables into sequences of frames;
computing a feature vector for each frame, namely the Mel-frequency cepstral coefficients (MFCC) and the coefficient differences between adjacent frames;
viewing the target syllable and the reference syllable as two sequences of vectors; and
under the constraint of endpoint correspondence and the constraint on local paths, dynamically finding the path with the shortest walking distance between the two vector sequences as the spectrum matching path.

5. The method of speech synthesis according to claim 3, wherein the step of comparing the spectra of a target syllable and its corresponding reference syllable comprises:
segmenting the target syllable and the reference syllable each into a non-periodic part and a periodic part;
performing the spectral comparison separately for the non-periodic parts and for the periodic parts; and
joining the paths obtained from the non-periodic parts and from the periodic parts into a single spectrum matching path.
6. The method of speech synthesis according to claim 1, wherein the step of obtaining the synthetic speech signal of the sentence comprises:
determining, according to the respective progression parameters of the syllables to be synthesized, the time-axis correspondence between each synthetic syllable signal and its corresponding original syllable signal, and the duration ratio of the initial and the final within each synthetic syllable; and
determining, according to the respective prosodic parameters of the syllables to be synthesized, the pitch contour, the duration, the volume, and the initial-to-final amplitude ratio of each synthetic syllable signal.

7. The method of speech synthesis according to claim 1, further comprising, in an analysis stage, cutting the original signals recorded for the syllables to be synthesized into sequences of frames and analyzing the harmonic parameters and noise parameters of each frame.

8. The method of speech synthesis according to claim 1, wherein the step of synthesizing the signal of each syllable with the harmonic-plus-noise model comprises:
dividing the syllable to be synthesized into an unvoiced part and a voiced part, and placing a plurality of control points on the two parts respectively;
for a short unvoiced part, copying the signal of the unvoiced part of the original syllable directly for the synthesis;
for a long unvoiced part, performing the synthesis with the noise parameters on the control points of the unvoiced part; and
for the voiced part, synthesizing a harmonic signal and a noise signal respectively from the harmonic parameters and noise parameters on the control points of the voiced part, and adding the two signals.

9. The method of speech synthesis according to claim 8, wherein the step of obtaining the parameters on the respective control points comprises:
for each control point of a long unvoiced part, copying the harmonic and noise parameters of the corresponding original syllable frame onto that control point;
for each control point of the voiced part, interpolating the harmonic and noise parameters of the two adjacent frames corresponding to that control point, and copying the interpolated parameters onto that control point;
adjusting the harmonic parameters on the control points of the voiced part according to the prescribed pitch-contour prosodic parameter values; and
generating the signal of the synthetic syllable from the harmonic and noise parameters on the control points.

10. The method of speech synthesis according to claim 1, wherein the contextual data include the tone class, the initial class, and the final class of the current syllable, the tone class and the coarse final class of the preceding syllable, the tone class and the coarse initial class of the following syllable, and the position of the current syllable within the sentence.
TW97123982A 2008-06-26 2008-06-26 Method for synthesizing speech TWI360108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97123982A TWI360108B (en) 2008-06-26 2008-06-26 Method for synthesizing speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW97123982A TWI360108B (en) 2008-06-26 2008-06-26 Method for synthesizing speech

Publications (2)

Publication Number Publication Date
TW201001396A true TW201001396A (en) 2010-01-01
TWI360108B TWI360108B (en) 2012-03-11

Family

ID=44824867

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97123982A TWI360108B (en) 2008-06-26 2008-06-26 Method for synthesizing speech

Country Status (1)

Country Link
TW (1) TWI360108B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI573129B (en) * 2013-02-05 2017-03-01 國立交通大學 Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102664003A (en) * 2012-04-24 2012-09-12 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN102664003B (en) * 2012-04-24 2013-12-04 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN108766450A (en) * 2018-04-16 2018-11-06 杭州电子科技大学 A kind of phonetics transfer method decomposed based on harmonic wave impulse
CN108766450B (en) * 2018-04-16 2023-02-17 杭州电子科技大学 Voice conversion method based on harmonic impulse decomposition
TWI749709B (en) * 2020-08-14 2021-12-11 國立雲林科技大學 A method of speaker identification

Also Published As

Publication number Publication date
TWI360108B (en) 2012-03-11

Similar Documents

Publication Publication Date Title
Gold et al. Speech and audio signal processing: processing and perception of speech and music
Hono et al. Recent development of the DNN-based singing voice synthesis system—sinsy
Zhao et al. Using phonetic posteriorgram based frame pairing for segmental accent conversion
Aryal et al. Reduction of non-native accents through statistical parametric articulatory synthesis
López et al. Speaking style conversion from normal to Lombard speech using a glottal vocoder and Bayesian GMMs
Nakamura et al. Fast and high-quality singing voice synthesis system based on convolutional neural networks
Bak et al. Fastpitchformant: Source-filter based decomposed modeling for speech synthesis
Bollepalli et al. Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks
Kim Singing voice analysis/synthesis
Shahnawazuddin et al. Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition
TW201001396A (en) Method for synthesizing speech
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Hlaing et al. Enhancing Myanmar speech synthesis with linguistic information and LSTM-RNN
Richards et al. Deriving articulatory representations from speech with various excitation modes
Waghmare et al. Analysis of pitch and duration in speech synthesis using PSOLA
Juvela et al. Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system
Murphy Controlling the voice quality dimension of prosody in synthetic speech using an acoustic glottal model
Fu et al. Transfer Learning Based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis.
Tao et al. The NLPR Speech Synthesis entry for Blizzard Challenge 2017
Wu et al. VStyclone: Real-time Chinese voice style clone
i Barrobes Voice Conversion applied to Text-to-Speech systems
Lin et al. New refinement schemes for voice conversion
Shitov Computational speech acquisition for articulatory synthesis
Wang et al. Combining extreme learning machine and decision tree for duration prediction in HMM based speech synthesis.

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees