TW201113869A - Pronunciation variation generation method for spontaneous Chinese speech synthesis - Google Patents


Info

Publication number
TW201113869A
Authority
TW
Taiwan
Prior art keywords
pronunciation
model
variation
conversion
spectrum
Prior art date
Application number
TW98134883A
Other languages
Chinese (zh)
Other versions
TWI402824B (en)
Inventor
Zong-Xian Wu
Chong-Han Li
Jun-Cheng Guo
Original Assignee
Univ Nat Cheng Kung
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Cheng Kung filed Critical Univ Nat Cheng Kung
Priority to TW98134883A priority Critical patent/TWI402824B/en
Publication of TW201113869A publication Critical patent/TW201113869A/en
Application granted granted Critical
Publication of TWI402824B publication Critical patent/TWI402824B/en


Landscapes

  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A pronunciation variation generation method for spontaneous Chinese speech synthesis is provided. A transformation function is applied within a Hidden Markov Model (HMM) to establish a pronunciation variation model, and a Classification and Regression Tree (CART) is employed to predict the type of pronunciation variation, so that new phonological models can be generated through the transformation functions. This remedies the limitation of synthesizing with only a fixed number of phonological models. Furthermore, the pronunciation variations are classified by their acoustic characteristics using articulatory feature parameters, which compensates for the shortage of training corpus. The generated pronunciation variation phenomena are then used to increase the naturalness of HMM-based synthesized speech.

Description

VI. Description of the Invention

[Technical Field of the Invention]
The present invention relates to a method for generating pronunciation variations in spontaneous Chinese speech synthesis, and in particular to a method that introduces transformation functions into a Hidden Markov Model (HMM) to establish pronunciation variation models and uses a classification and regression tree driven by articulatory feature parameters to predict the type of pronunciation variation, so that pronunciation variation phenomena can be generated to improve the naturalness of HMM-based synthesized speech.

[Prior Art]
With the advance of technology, computers have not only become part of human life but are also developing toward humanization and automation, which makes human-machine interaction a very important topic. Since speech is the most direct medium of human communication, using speech as the medium for human-machine interaction is essential, and many products based on speech synthesis technology have appeared, such as voice-controlled dialing on mobile phones, Microsoft's text-to-speech (TTS) system, and real-time voice navigation systems. Most current speech synthesis systems are applied to read-style (read speech) synthesis, for example reading newspaper news and electronic books aloud, where they perform well, with good sound quality, clear pronunciation, and fluent delivery. When a synthesizer is used for interactive human-machine communication, however, the synthesized speech, though clear, can only produce an invariant, mechanical pronunciation, which still differs considerably from spontaneous speech in naturalness. Read speech is read from a script, so its speaking rate is governed by the reading speed and is relatively smooth and steady, and its pronunciation is comparatively clear. Spontaneous speech, in contrast, follows the speaker's intention, is subject to fewer constraints in its delivery, and is therefore often inconsistent in speaking rate and less fixed in its pronunciation patterns; these pronunciation variation phenomena are a major factor affecting the naturalness of speech.

Linguists have defined four special phonological phenomena in the spoken pronunciation of spontaneous Chinese speech: syllable contraction, nasalization, syllable assimilation, and syllable lengthening. In the Mandarin Conversational Dialogue Corpus (MCDC), syllable contraction accounts for the largest share, about 84%, followed by syllable assimilation at about 11%, while nasalization and syllable lengthening are comparatively rare; syllable contraction is therefore the variation phenomenon that most distinguishes spontaneous speech from read speech. In research on natural speech synthesis, studies that improve the naturalness of synthesized speech by handling pronunciation variation fall into two directions: pronunciation dictionary extension and acoustic model extension. Pronunciation dictionary extension includes:
(1) using recognition results to add word-level variant pronunciations to the pronunciation dictionary, with a classification and regression tree (CART) selecting the appropriate pronunciation at synthesis time;
(2) building a pronunciation network from recognition results to decide the pronunciation; and
(3) using free state transitions within a Hidden Markov Model to attempt to describe the pronunciation variation phenomena and thereby improve the naturalness of synthesized speech.
In the direction of acoustic model extension, additional syllable acoustic models (SPAM) are trained for the segments labeled as syllable contraction, and these models describe the acoustics of the contracted syllables; the colloquial speech is treated as a speaker-specific corpus, and the models are adjusted through adaptation techniques.

However, the methods above are almost all built on top of speech recognition results: the variant pronunciations of the synthesized speech must be defined beforehand in the pronunciation dictionary. In ordinary spontaneous spoken Chinese, the words in which pronunciation variation occurs form a very large number of combinations, so it is impossible to handle every such word separately, and a corpus containing all the variation phenomena cannot be collected in the manner described above. Consequently, when text outside the training corpus is to be synthesized, no corresponding pronunciation variation model can be found for the synthesis. Such systems therefore cannot meet the needs of users in actual use.

[Summary of the Invention]
The main object of the present invention is to overcome the problems encountered in the prior art by providing a method that introduces transformation functions into a Hidden Markov Model to establish pronunciation variation models and uses a classification and regression tree with articulatory feature parameters to predict the type of pronunciation variation, so that pronunciation variation phenomena can be generated to improve the naturalness of HMM-based synthesized speech.

To achieve this object, the pronunciation variation generation method for spontaneous Chinese speech synthesis of the present invention first derives the conversion relationship between spontaneous speech and read speech and classifies the transformation functions according to the manner of articulation. Models of pronunciation variation are produced by linear transformation functions, and the newly produced models synthesize speech carrying the variation phenomena. The articulatory feature parameters of speech classify the pronunciation variations, and a classification and regression tree model generalizes the variation characteristics under different manners of articulation so that variations beyond the training corpus can be predicted. In the analysis of the speech signal for spectrum and pitch parameters, the STRAIGHT analysis and synthesis algorithm is used to obtain accurate fundamental frequency and spectrum parameters. The correspondence between variant phonemes and normal phonemes is first found in the parallel corpus; the transformation functions of the variations are then trained on the paired phonemes under a linear assumption; the parameters and duration information of the transformation functions are recorded in a spectrum conversion model and classified by the articulatory pronunciation parameters. Finally, at synthesis time, the pronunciation parameters of the text to be synthesized predict the required transformation function, which is combined with the result of an HMM-based Speech Synthesis System (HTS): the synthesized speech parameters are converted and a Mel Log Spectrum Approximation (MLSA) filter synthesizes the final spontaneous speech output.

[Embodiments]
The present invention builds a Chinese text-to-speech system around the phenomena and articulation of pronunciation variation in spontaneous speech, providing integrated applications for two-way human-machine communication, computer-assisted instruction, and one-way computer-to-human messaging, so as to improve the communication environment between humans and machines, let computers produce speech that is closer to real and more varied human speech, and run on any system platform. In view of this, the invention uses natural speech synthesis that generates pronunciation variation phenomena to improve the naturalness and fluency of synthesized speech; combined with human-machine communication interface systems featuring portability and flexibility, it can create more valuable human-machine communication environments and information-education content with greater elasticity and room for development.

The method extracts the conversion relationship between read speech and spontaneous speech, classifies the transformation functions by manner of articulation, and generalizes the variation characteristics with a classification and regression tree so that variations beyond the training corpus can be predicted. The aim is that, after the conversion relationship between ordinary text-to-speech (TTS) output and spontaneous speech has been derived, voice conversion techniques based on linear transformation functions convert and synthesize speech exhibiting the special phonological phenomena of pronunciation variation. Furthermore, the invention uses articulatory feature parameters with a classification and regression tree to predict the pronunciation variation phenomena: new phonological models can be generated through the transformation functions, which remedies the limitation of conventional synthesis methods that use only a fixed number of phonological models, and classifying the variations acoustically by articulatory parameters compensates for the shortage of training corpus and makes the phonological conversion more accurate. Generating the pronunciation variation phenomena then improves the naturalness of HMM-based synthesized speech. The method is suitable for multi-speaker or emotional computer speech synthesis; predicting the variation types with a classification and regression tree reduces the amount of training corpus that must be collected, and the method can be combined with e-learning, information exchange, and mobile devices to create information products of greater commercial value.

Please refer to Figures 1 to 3, which are, respectively, a schematic diagram of the basic flow of the invention, of the flow of the training phase, and of the flow of the synthesis phase. As shown, the invention is a pronunciation variation generation method for spontaneous Chinese speech synthesis comprising a training phase 1 and a synthesis phase 2.

The training phase 1 includes the following steps:
(A) Pronunciation variation transformation function model building step 11: A parallel corpus 111 and the corresponding text are first preprocessed at the front end. The parallel speech is passed through STRAIGHT analysis 112 to obtain smoothed spectrum parameters and prosodic parameters, and dynamic time warping (DTW) 113 aligns the two recordings to a common length; the warping path establishes the correspondence between the variant phonemes and the normal phoneme data, yielding phone pairs. The text is processed by text analysis together with manually pre-labeled prosodic boundaries to obtain the corresponding text labels. Then, for the segments labeled as pronunciation variations, the transformation function model is trained: the spectrum parameters are converted through the mel-cepstrum to extract 25th-order mel-cepstral coefficients, which, combined with the prosodic parameters and the text labels, serve as the training data of a Hidden Markov Model (HMM training) 114. Training the linear transformation functions produces a spectrum conversion model (HMM models) 115, and the newly produced model 115 synthesizes speech carrying the pronunciation variation phenomena, yielding the spectrum transformation functions and the duration information.
(B) Pronunciation variation prediction model regression tree classification step 12: The articulatory feature parameters 121 obtained at the front end classify the pronunciation variations, and the transformation functions are grouped and trained according to the linguistic and acoustic pronunciation parameters. Using a classification and regression tree model (CART), the spectrum transformation functions and duration information above, together with the linguistic information derived from the text labels, train the regression tree of the transformation functions (F-CART training) 122 and the regression tree of duration (D-CART training) 123, yielding a transformation function prediction model 124 and a duration prediction model 125, respectively.

The synthesis phase 2 includes the following steps:
(C) HTS synthesis step 21: The pronunciation parameters of the text to be synthesized are input 211 at the front end. Through text analysis and the articulatory feature parameters 212, the linguistic information is obtained and the label file is generated, after which the pronunciation variation phenomena are predicted. An HMM-based speech synthesis system (HTS) takes the label file and, through the acoustic models 213 and state selection 214, predicts the spectrum, duration, and pitch parameters.
(D) Variation conversion step 22: For the segments predicted to undergo pronunciation variation, a suitable spectrum transformation function 221 and duration transformation function 222 are selected from the transformation function prediction model 124 and the duration prediction model 125 according to the label file information, and the spectrum and duration are converted 223 to produce new spectrum and duration parameters. Finally, at the back end, the newly produced parameters are synthesized into spontaneous speech by a Mel Log Spectrum Approximation (MLSA) filter 224 and output.
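The decision flow of steps (C) and (D) can be summarized in code. The following is a minimal sketch in Python, not the patented implementation: the objects hts, f_cart, d_cart, and mlsa and all of their methods are hypothetical stand-ins, and only the control flow (HMM-based prediction, CART-based selection of a transformation, conversion, waveform generation) mirrors the description above.

import numpy as np

def synthesize_utterance(labels, hts, f_cart, d_cart, mlsa):
    # labels: per-phone label entries from the front-end text analysis (step C)
    spec_track, f0_track = [], []
    for lab in labels:
        spec, f0 = hts.predict(lab)               # step (C): spectrum/pitch prediction
        if lab.get("variation_predicted"):        # step (D): only variant segments
            A, R = f_cart.select(lab)             # spectrum transformation function 221
            spec = spec @ A.T + R                 # apply Y = AX + R frame by frame
            lx, ly = d_cart.select(lab)           # duration transformation function 222
            n_out = max(1, round(len(spec) * ly / lx))
            idx = np.linspace(0, len(spec) - 1, n_out).astype(int)
            spec = spec[idx]                      # rescale duration by L_Y / L_X
        spec_track.append(spec)
        f0_track.append(f0)
    return mlsa(np.vstack(spec_track), np.concatenate(f0_track))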
Regarding the establishment of the data correspondence in the transformation function model of step (A), please refer to Figure 4, which shows the syllable boundary positions corresponding to the dynamic time warping result. Because the recordings differ in length, the invention uses dynamic time warping to find the correspondence between speech data exhibiting syllable contraction and ordinary read speech data. In Figure 4, the vertical axis is the frame sequence of the read speech spectrum and the horizontal axis is the frame sequence of the contracted speech spectrum; the gray level represents the Euclidean distance between the two recordings, with darker shades indicating larger differences, and the plotted line is the dynamic time warping path 3 giving the optimal correspondence between the two recordings. Along segment 31 the two recordings correspond well and can be regarded as nearly identical; along segment 32 the path 3 is nearly vertical, meaning that many frames of the read speech map to only a few frames of the variant speech, which, taking the read speech as the reference, can be regarded as the deletion of some segments; along segment 33 the path 3 is nearly horizontal, meaning that a few frames of the read speech map to many frames of the variant speech, which can be regarded as the insertion of segments. In this way the correspondence between the two recordings is found. Moreover, the invention maps the syllable boundary positions of the normal speech through the dynamic time warping result to locate the syllable boundaries in the spontaneous colloquial speech: through the boundary correspondence of the warping path in the figure, the correspondence between normal segments and variant segments is obtained.
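A minimal sketch of this alignment is given below, assuming the two recordings are numpy matrices of mel-cepstral frames; the function name and array layout are illustrative assumptions. The cost is the frame-wise Euclidean distance of Figure 4, and near-vertical or near-horizontal runs in the returned path correspond to the deletion and insertion regions of segments 32 and 33.

import numpy as np

def dtw_align(read_frames, spont_frames):
    # Align read-speech frames (rows of X) to spontaneous frames (rows of Y);
    # returns the optimal warping path as a list of (i, j) frame pairs.
    n, m = len(read_frames), len(spont_frames)
    dist = np.linalg.norm(read_frames[:, None, :] - spont_frames[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # deletion step
                                                 acc[i, j - 1],      # insertion step
                                                 acc[i - 1, j - 1])  # match step
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda p: acc[p])
    return path[::-1]

Given the path, a syllable boundary at read-speech frame i maps to the spontaneous frames paired with i, which is how the syllable boundaries of the spontaneous recording are located.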

素轉換成該gg·者之變異音素。如是,利用正常語音資料X 透過凝轉矩陣A之轉換後,以R作為旋轉誤差,其線性轉換 函式表示為: 、The prime is converted into the variant phoneme of the gg. If yes, after the normal speech data X is converted by the condensation matrix A, R is used as the rotation error, and the linear conversion function is expressed as:

V=AX+R (公式1)V=AX+R (Equation 1)
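Before the HMM/EM formulation below, Formula 1 can be illustrated in its simplest single-transform form: a least-squares fit of A and R on one aligned phone pair. This is a deliberate simplification for illustration (the patent estimates the parameters state by state with EM, as derived next); the code and toy data are assumptions, not the actual implementation.

import numpy as np

def fit_linear_transform(X, Y):
    # Least-squares fit of Y ~= A @ x + R for aligned frame matrices
    # X, Y of shape (frames, dim); a bias column folds R into the solve.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)        # solves Xb @ W ~= Y
    A, R = W[:-1].T, W[-1]                            # split rotation and offset
    return A, R

def apply_linear_transform(X, A, R):
    return X @ A.T + R                                # Formula 1: Y = AX + R

# toy usage: recover a known transform from noisy 25-dimensional frames
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))
A_true = np.eye(25) * 0.9
R_true = rng.normal(size=25)
Y = X @ A_true.T + R_true + 0.01 * rng.normal(size=X.shape)
A, R = fit_linear_transform(X, Y)
assert np.allclose(A, A_true, atol=0.05)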

藉由隱藏式馬可夫模型,利用STRAIGHT分析演算法所 取出之聲學參數,可在_軸上之變錄有效之贿。在發音 變異之轉麵型之構上,_賴式馬可夫_,藉由其時 間軸上可考慮前後關聯之雜,使描述出來之聲學模型更且有 連慣性。在此為紐細地贿轉換之函式,除了狀隱献 可夫模型,朗時考慮正常語音與變異語音資糊之關聯性, 亦即最大化之機率。其定義為: 准⑽,棒以^抽⑻, ' (公式2) 其中為λ為初始機率;a為轉移機率;以及b為觀察機率。 在此將雜讎_核紗,I妹賴柯夫㈣來數之 同時,亦考慮在模型中同-個轉移狀態下之最佳轉換結果。將 觀察機率疋義成兩項,分別為正常語音χ之 語音γ之高斯分佈,其中γ之分佈係姻上述公式丨將平均 t y=Ajx+R取代’將此式帶入公式2,則原本b可以重新定 義為: 201113869 (公式3) 接著利用最大化期望值估計(Expectati〇n_Maximizati〇n, EM)演算法求解,首先將預估(E-step)中之期望值之輔助函 數(Q-fUnction)定義為·· β(Ί '\λ) = Eq {log P(0, ^ μ ο I Ο, Λ} = I Ο,Λ) log />(〇, q\X') q (公式4)By using the hidden Markov model, the acoustic parameters extracted by the STRAIGHT analysis algorithm can be used to effectively mark the changes on the _ axis. In the structure of the morphological variation, _ Lai Makov _, by considering the correlation between the front and the back on the time axis, makes the described acoustic model more and more inertia. Here, as a function of the conversion of the bribes of the New Zealand, in addition to the model of the hidden secrets, Langshi considers the correlation between the normal speech and the mutated speech, which is the probability of maximization. It is defined as: quasi (10), rod by ^ pumping (8), ' (formula 2) where λ is the initial probability; a is the transfer probability; and b is the observation probability. In this case, we will consider the best conversion results in the same transfer state in the model, while weaving the 雠_Nuclear yarn, I sister Lai Kefu (4). The probability of observing the probability is two, which are the Gaussian distribution of the γ of the normal phonetic χ, where the distribution of γ is the above formula, and the average ty=Ajx+R is substituted for 'this formula is brought into the formula 2, then the original b can Redefining as: 201113869 (Equation 3) Then using the Maximization Expectation Estimate (Expectati〇n_Maximizati〇n, EM) algorithm to solve, firstly define the auxiliary function (Q-fUnction) of the expected value in the E-step as ··β(Ί '\λ) = Eq {log P(0, ^ μ ο I Ο, Λ} = I Ο,Λ) log />(〇, q\X') q (Equation 4)

其中可將公式四視為初始機率(Initial probability)、轉移 機率(Transition probability )與觀察機率(Observation probability)三部分。重新整理為: Q(r\X) -QA^)+Qa(^)+Qb(^) ^ 其中初始機率部分為: Μ Ν α(乂»Σ#_)1〇δν,Σν=ι I i=l 轉移機率部分為: Μ Μ τ Ν α(^1^)=ΣΣΣ^(/^')1〇δα/ »Σ ,=1 7=1 Μ j=\ 觀察機率部分為: Q^)=ttr.(J^bAx^ 7=1 /=1 =ΣΣ^〇')1〇8&/(χ^/<γΊχ») y=i /=1 =Σ t ά·)4ΗςγΊ-吾(χ, -Κ ·) )=1/=1 ζ Δ (公式6) α “ (公式7) 12 8) 201113869 然後使用最佳化(M-step)估算模型參數,以期得到最大 化期望值,即估測參數讓Q-fimction最大化 利用多項式内插(Lagrange)方法得到各參數估測之式子,其 中的要估測之參數分別為初始狀態為i之初始機率π,.';由第^ 狀態轉移到j狀態之轉移機率V ;來源資料X之平均數^,丨1 來源資料X之變異數义;線性轉換矩陣V ;線性轉換後與目 標資料Y之殘差平均數R’ ;以及目標資料Y之變異數2),。利 用EM演算法得出最後所估測出最後之參數為: Ν Σ^.(〇 /=1 πί :Ί(0 (公式9) ΣΣ^(^) Σ^(〇 —1 —1 /=1 >1 /»1 Σ)’ -μ力(Ύ)7 (公式10) (公式11) aj = ί Σ < - r7 κγ Υς r> V /—1八,=1 τ Sr<〇'Xy( ~A;X() =~ f ~~!>'(·/) t^\ (公式12) (公式13) Σ;’ (公式14) ~Α. χ, - R,.’ )(y, - 、R,)r r Is! 0) (公式15) 其中由最均卿演算法估料之vw即為發音; 13 201113869 替奐函式中所需之參數。每組語音段經過上述隱藏式馬可夫 甘^之訓喊’可制如第6圖之麟轉換模型之狀態形式。 、次各狀有各自之線性轉換函式Y=AX+R、正常語音段長 與自發性口語語音段長度資訊lY。藉由透過此多線 、、函式之頻譜轉細型,本發明可以賴讀式語音框,透 過轉換函式轉換成發音類之語音段。晴式語音與自發性钮 音間之差異’可以_Lx與ly資訊對音長做輕,達到音: 長度變異之效果。在音長難之動作上,輸人-個正常音素、, 將音素依照音長轉換模型中各個狀態中LX之比例做切割 個區塊透過所屬之線性轉換函式轉換,接著利用LX與LY之比 例去增加或縮減原始音長之長度。 、 上述步驟(B)發音變異預測模型回歸樹分類,為本發明 另-個重點,為有助於發音現象之删,乃_發音^數 之摘取,先將發音變異現象進行分類。繼之,翻分類二 =發音變異特性做分類之_ ’將具有囉之發音特性變化之 資料點’分到同-個類別中’於其中,在同—個_中之資料 點帶有相同之發音雜變化。本發明額分_物作 模型之優點在於,以樹狀之結構來表現資料之分佈,其建 來之模型容易瞭解’並且能追蹤每節點上使用之變數進而瞭 資料真正之雜。發音㈣縣主要在骑音鱗上之變化與 音素長度上之變化。因此發音變異現象之預測,可分成頻譜轉 換預測與音長删^部分,^者酬翻之建立,蚊義^類 回歸樹所使用之問題集將被定義,請進一步參閱『/第7圖刀戶 示,係本發明分類回歸樹之架構示意圖。如圖所示:首:^以 隱藏式馬可夫模型為基礎之發音變異轉換函式模型,分別透^ 201113869 f換函式之分_歸雛音長之分__,將_ 與音長資訊依照上述所提之發音參數作分函式 分類。其中’最後分類得到之每-個樹葉節點矣—置與 之轉換模^,翻其删正常音素 1類別 與音素長壯之差異。 、4_社之變化 上述步驟(B )發音變異預測模型回歸樹分類 預測模型之建立,請進一步參閲『第R 貝》曰轉換 頻譜轉換F-CART預測模型示意圖。如圖; 素經由頻譜轉換後變成發音變異之音素, 曰之曰 回歸樹執行删時’係測之結果與目標之音錄 差=小越好。亦即來源音素經過分類後,根據所在^ 之轉換函式來進行頻料性之轉換,㈣換後之 (C〇讀〇與變異之目標音素越相似越好。本發明模;係 採用分_騎,在分社條件上係設定騎織之轉換誤差 (GTationError)小於***前之轉換誤差呈轉換之 算公式表示為: 、' 4Formula 4 can be considered as the initial probability, the transition probability and the observation probability. Rearranged as: Q(r\X) -QA^)+Qa(^)+Qb(^) ^ where the initial probability is: Μ Ν α(乂»Σ#_)1〇δν,Σν=ι I i =l The probability of transfer is: Μ Μ τ Ν α(^1^)=ΣΣΣ^(/^')1〇δα/ »Σ ,=1 7=1 Μ j=\ The probability of observation is: Q^)= Ttr.(J^bAx^ 7=1 /=1=ΣΣ^〇')1〇8&/(χ^/<γΊχ») y=i /=1 =Σ t ά·)4ΗςγΊ-吾(χ , -Κ ·) )=1/=1 ζ Δ (Equation 6) α “(Formula 7) 12 8) 201113869 Then use the optimization (M-step) to estimate the model parameters in order to maximize the expected value, ie estimate The parameter maximizes Q-fimction by using the polynomial interpolation method (Lagrange) method to obtain the formula of each parameter estimation, wherein the parameters to be estimated are the initial probability of initial state i, π, . '; The probability of transfer to the j state V; the average of the source data X, 丨1 the variation of the source data X; the linear transformation matrix V; the mean residual R of the target data Y after linear conversion; and the target data Y The variation is 2). The last parameter estimated by the EM algorithm is: Ν Σ^.(〇/=1 πί :Ί(0 (Equation 9) ΣΣ^(^) Σ^(〇—1 —1 /=1 >1 /»1 Σ)' -μ力(Ύ)7 (Equation 10) (Equation 11) aj = ί Σ < - r7 κγ Υς r> V / -1 eight, = 1 τ Sr <〇'Xy( ~A;X() =~ f ~~!>'(·/) t^\ (Equation 12) (Equation 13) Σ;' (Formula 14) ~Α. χ, - R,.' )(y, -, R,) rr Is! 0) (Equation 15) where vw is estimated by the most uniform algorithm; 13 201113869 The parameters required in the function of the 。. Each group of speech segments can be made into the state form of the lining conversion model as shown in Fig. 
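The occupancy-weighted updates of Formulas 11 to 15 can be sketched for a single state as follows. This is an illustrative restatement under stated assumptions: the state occupancies γ_t(j) are assumed to have been computed by a standard forward-backward pass (not shown), and because Formulas 13 and 14 depend on each other, the previous value of R is used here, with A and R refined over EM iterations.

import numpy as np

def m_step_state(X, Y, gamma, R_prev):
    # Occupancy-weighted updates for one HMM state (Formulas 11-15).
    # X, Y: aligned source/target frames of shape (T, dim); gamma: shape (T,).
    g = gamma / gamma.sum()
    mu_x = g @ X                                               # Formula 11
    Xc = X - mu_x
    sigma_x = (g[:, None] * Xc).T @ Xc                         # Formula 12
    A = ((g[:, None] * (Y - R_prev)).T @ X) @ np.linalg.inv(
        (g[:, None] * X).T @ X)                                # Formula 13
    R = g @ (Y - X @ A.T)                                      # Formula 14
    E = Y - X @ A.T - R
    sigma_y = (g[:, None] * E).T @ E                           # Formula 15
    return mu_x, sigma_x, A, R, sigma_y

In practice A and R can also be solved jointly, for example with the bias-augmented least-squares solve shown earlier.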
After the Hidden Markov Model training above, each group of speech segments can be represented in the state form of the spectrum conversion model shown in Figure 6. Each state has its own linear transformation function Y = AX + R together with the length information L_X of the normal speech segment and L_Y of the spontaneous colloquial speech segment. Through this spectrum conversion model composed of multiple linear transformation functions, the invention can convert read-style speech frames, state by state through the transformation functions, into variant speech segments. The difference in duration between read speech and spontaneous speech is handled with the L_X and L_Y information, achieving the effect of duration variation: for an input normal phoneme, the phoneme is cut into blocks according to the proportion of L_X in each state of the duration conversion model, each block is converted by the linear transformation function of its state, and the ratio of L_X to L_Y then lengthens or shortens the original duration, as sketched below.
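A minimal sketch of this state-wise conversion and duration scaling, assuming the per-state transforms (A, R) and the state lengths L_X, L_Y are given; array shapes and names are assumptions, and frame indices are simply resampled to stretch or shrink each block.

import numpy as np

def convert_phone(frames, transforms, lx, ly):
    # frames: (n_frames, dim) mel-cepstra of one normal phone
    # transforms: per-state (A, R); lx, ly: per-state source/target lengths
    bounds = np.round(np.cumsum(lx) / np.sum(lx) * len(frames)).astype(int)
    out, start = [], 0
    for (A, R), end, sx, sy in zip(transforms, bounds, lx, ly):
        if end <= start:
            continue                                     # skip empty state blocks
        block = frames[start:end] @ A.T + R              # per-state Y = AX + R
        n_out = max(1, round(len(block) * sy / sx))      # scale by L_Y / L_X
        idx = np.linspace(0, len(block) - 1, n_out).astype(int)
        out.append(block[idx])                           # lengthen or shorten
        start = end
    return np.vstack(out)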
The regression tree classification of the pronunciation variation prediction model in step (B) is another focus of the invention. To assist the prediction of the variation phenomena, the articulatory pronunciation parameters are extracted and the pronunciation variation phenomena are first classified. The classification then groups the data points by their variation characteristics, assigning data points with similar changes in pronunciation characteristics to the same class, so that the data points within one class carry the same variation behavior. The advantage of using a classification and regression tree as the classification model is that the distribution of the data is represented by a tree structure, the resulting model is easy to understand, and the variables used at each node can be traced to reveal the true characteristics of the data. Pronunciation variation mainly consists of changes on the spectrum and changes in phoneme duration, so the prediction of the variation phenomena is divided into a spectrum conversion prediction part and a duration prediction part; before the two prediction models are built, the question sets used by the regression trees are defined. Please refer to Figure 7, a schematic diagram of the architecture of the classification regression tree of the invention. As shown, starting from the HMM-based pronunciation variation transformation function models, the transformation functions and the duration information are classified according to the pronunciation parameters mentioned above, through the regression tree of the transformation functions and the regression tree of duration, respectively. Each leaf node obtained by the final classification holds a conversion model describing the spectral change between the normal phonemes and the variant phonemes of that class and the difference in phoneme duration.

Regarding the building of the spectrum conversion prediction model in the regression tree classification of step (B), please refer to Figure 8, a schematic diagram of the spectrum conversion F-CART prediction model of the invention. As shown, a source phoneme becomes a variant phoneme after the spectrum conversion, and when the regression tree performs the prediction, the smaller the difference between the predicted result and the target variant phoneme, the better. That is, after the source phoneme is classified, the spectral conversion is carried out according to the transformation function of the node it falls into, and the converted result should be as similar as possible to the target variant phoneme. The model is grown by top-down splitting, and the splitting condition requires that the generation error after splitting be smaller than the conversion error before splitting. The conversion error is computed as:

GenErr_i = Σ_{m=1}^{M} ‖ y_m − (A_i x_m + R_i) ‖²    (Formula 16)

where y_m is the m-th frame of the target phoneme Y, x_m is the m-th frame of the source phoneme X, A_i x_m + R_i is the linear transformation function of the i-th node, and M is the total number of frames. To obtain the best splitting result, the split that maximizes the reduction of the conversion error is sought; the reduced error is computed as the conversion error of the parent node minus the conversion error of the child nodes after splitting. It is defined as:

Reduced Generation Error (RGE) = GenErr_p − Σ_i w_i · GenErr_i    (Formula 17)

where GenErr_p is the conversion error of the parent node, GenErr_i is the conversion error of the i-th child node, and w_i is the data weight of child node i. The data classified into the same node recompute the rotation matrix and the offset of the transformation function by Formulas 18 and 19:

Â = [ Σ_{m=1}^{M} (y_m − R̂) x_m^T ] [ Σ_{m=1}^{M} x_m x_m^T ]^{−1}    (Formula 18)

R̂ = (1/M) Σ_{m=1}^{M} (y_m − Â x_m)    (Formula 19)

Accordingly, the building of the spectrum conversion prediction model includes the following steps:
(a) generate the root node s_0 containing all the data, and initialize the candidate node set U = {s_0} and the leaf node set V = ∅;
(b) take a node s_m out of U, generate from its data points the set Q = {q_1, ..., q_n} of all possible questions, perform a trial split for each question, keep the question whose split maximizes the RGE, and record that RGE;
(c) if the recorded RGE > 0, split the candidate node: the data of s_m are divided into the left and right child nodes s_ml and s_mr, which are added to the set U; if RGE ≤ 0, add s_m to the set V; and
(d) repeat steps (b) and (c) until U is empty, and compute the transformation function of every leaf node (that is, of every element of V).

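Steps (a) to (d) can be sketched as a greedy loop. In this illustrative sketch the question set is simplified to boolean tests on the pronunciation features, GenErr is the summed error of a single least-squares transform per node (so the data-count weighting of Formula 17 is implicit in the sums), and the leaf transforms are refit on each leaf's data as in step (d); none of this is the patented implementation.

import numpy as np

def gen_err(pairs):
    # Formula 16 summed over a node: fit one (A, R) on all pairs, return the error
    X = np.vstack([x for x, _ in pairs])
    Y = np.vstack([y for _, y in pairs])
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return float(np.sum((Y - Xb @ W) ** 2))

def grow_f_cart(data, questions):
    # data: list of (features, (X, Y)) phone pairs; questions: boolean tests on features
    root = {"data": data}
    candidates, leaves = [root], []
    while candidates:                                    # steps (b)-(d)
        node = candidates.pop()
        parent_err = gen_err([d[1] for d in node["data"]])
        best_rge, best_split = 0.0, None
        for q in questions:
            left = [d for d in node["data"] if q(d[0])]
            right = [d for d in node["data"] if not q(d[0])]
            if not left or not right:
                continue
            child_err = gen_err([d[1] for d in left]) + gen_err([d[1] for d in right])
            rge = parent_err - child_err                 # Formula 17, weights implicit
            if rge > best_rge:
                best_rge, best_split = rge, (q, left, right)
        if best_split is None:                           # RGE <= 0: node becomes a leaf
            leaves.append(node)
        else:
            q, left, right = best_split
            node["question"] = q
            node["children"] = ({"data": left}, {"data": right})
            candidates.extend(node["children"])
    return leaves                                        # step (d): refit (A, R) per leaf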
Each data point contains the linguistic pronunciation parameters together with the parameters of the transformation function, namely the source spectrum parameters X and the target spectrum parameters Y of the transformation function, both 25-dimensional spectrum parameters. All data points start from the root node, and at each split the question in the box decides whether a data point goes to the left or the right child node. Taking the hatched data points in the figure as an example, after the splitting condition is considered, the question at the root split is "Previous LW length = 4"; the pronunciation parameter of the data point matches, the answer is Yes, and it goes to the left child node; had the answer been No, it would go to the right child node. Finally, at the leaf node, the X and Y gathered in that leaf give the parameters of the transformation function.

Regarding the building of the duration prediction model in the regression tree classification of step (B), please refer to Figure 9, a schematic diagram of the duration D-CART prediction model of the invention. As shown, duration prediction also uses a classification and regression tree with the pronunciation parameters as the splitting conditions, taking the change of segment length within each state as the information for changing the duration and considering at the same time the duration of the source data and of the target data. The invention uses the data length vector of each phoneme, that is, the duration information of every state combined, as the data of the regression tree.

In the duration tree, the pronunciation parameters govern the change of duration. To let data points with similar duration-change information fall into the same cluster, the splitting criterion uses the mean square error (MSE), the sum of squared deviations of the data points from their mean, expressed as:

MSE_i = (1/n_i) Σ_{l=1}^{n_i} (x_l − x̄_i)²    (Formula 20)

where MSE_i is the mean square error of the i-th leaf node, x_l is the data in the node, n_i is the amount of data in the node, and x̄_i is the mean of all the data in the node. The mean square errors before and after splitting are compared as the splitting condition, requiring the reduced mean square error (Reduced MSE) to be greater than zero. It is expressed as:

Reduced MSE = MSE_parent − Σ_i w_i · MSE_i    (Formula 21)

where MSE_parent is the MSE value of the parent node before splitting, MSE_i is the MSE value of the i-th child node after splitting, and w_i is the data-count weight of child node i, computed as:

w_i = n_i / n_p    (Formula 22)

where n_p is the amount of data in the parent node p and n_i is the amount of data in the i-th child node. The building steps of the duration prediction model are thus similar to those of the spectrum conversion prediction model; the difference is that the splitting condition is based on the MSE, with the goal of letting the nodes after splitting achieve the maximum reduced MSE, and the data points are the duration information, that is, the original duration and the duration after variation. Finally, the statistical parameters of each leaf node of the resulting tree are taken as the predicted duration information. Every data point contains the duration parameter L and the pronunciation parameters, where the duration parameter L = (L_1^X, ..., L_N^X, L_1^Y, ..., L_N^Y) has dimension 2N, N being the number of states in the HMM; L_i^X is the duration parameter of the source phoneme X in state i and L_i^Y is the duration parameter of the target phoneme Y in state i. The statistics (mean and variance) computed in the leaf nodes represent the duration of that class.

Regarding the verification of the transformation functions in the regression tree classification of step (B): to ensure that only the phonemes that truly require conversion are converted with their corresponding transformation functions, the invention also verifies the transformation functions. Because parallel read speech and spontaneous speech are not easy to collect, the training corpus used to train the transformation functions is the small amount of parallel corpus collected; to avoid unrepresentative classification results caused by the small corpus, the invention further uses another set of parallel corpus outside the training corpus to verify the transformation functions obtained after the classification training. The verification method is as follows: after the most appropriate transformation function has been selected via the F-CART for the parallel corpus outside the training corpus, the normal phoneme data are passed through the selected transformation function to obtain the predicted conversion result, and it is observed whether the converted result is closer to the variant phoneme model (Target) than the unconverted read-speech phoneme model, that is, the source phoneme model (Source); the distance is estimated by the Euclidean distance of the spectrum. If the verification finds that the distance after conversion is larger than without conversion, the verification of that transformation function fails, meaning that the result obtained through this transformation function would not be closer to the variation model than the unconverted source phoneme model, so the conversion action of the selected transformation function must be ignored, that is, when this transformation function is selected, the data are not converted. In this way, selection errors of the transformation functions caused by the prediction errors of the classification regression tree are corrected through the verification, so that the most suitable transformation function is selected for the conversion.
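The validation check can be sketched directly: a held-out normal phoneme is converted with the F-CART-selected function, and the conversion is kept only if it moves the spectrum closer, in Euclidean distance, to the variant target than the unconverted source was. The function name and frame alignment are assumptions; the distance test follows the description above.

import numpy as np

def validate_transform(X_src, Y_tgt, A, R):
    # Keep the selected (A, R) only if conversion reduces the spectral
    # Euclidean distance to the variant target (frames assumed aligned).
    converted = X_src @ A.T + R
    d_before = np.linalg.norm(X_src - Y_tgt)
    d_after = np.linalg.norm(converted - Y_tgt)
    return d_after < d_before          # False: ignore this transform at synthesis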
To evaluate the proposed method, in a preferred embodiment of the invention, as shown in Figure 1, the pronunciation variation conversion models are estimated from the parallel training corpus based on the linear transformation functions; in this embodiment the spectrum conversion model is also built to handle the conversion of the spectrum, and the duration model is adjusted by proportional scaling. The input sentence is analyzed by text analysis for its content and syllable boundaries, the prosodic parameters, pronunciation parameters, and text information are extracted, the pronunciation variations are classified by the articulatory features of the speech, the classification regression trees generalize the variation characteristics under the different manners of articulation to predict the variations beyond the training corpus, and, according to the prediction, the suitable spectrum transformation function and duration transformation function are selected from the spectrum conversion prediction model and the duration prediction model to convert the spectrum and the duration, respectively. The development platform of the invention is a PC environment with 2 GB RAM and the Windows XP operating system, and the system development tool is Microsoft Visual C++ 6.0.

Please refer to Figures 10 and 11, which are, respectively, a schematic diagram of the 25 words that most often produce pronunciation variation according to the statistics of the MCDC, and a schematic diagram of the word-length distribution of the corpora of the invention. As shown, the invention uses two corpora in the speech database. The first, used to train the linear transformation functions, is a self-collected parallel corpus of read Chinese speech and ordinary spontaneous colloquial speech. The parallel corpus was recorded by three speakers who, for the designed sentences, separately recorded ordinary colloquial speech in a manner imitating actual dialogue and speech read from the script. One design criterion of this corpus is to cover the 25 words that most frequently exhibit pronunciation variation according to the statistics of the Mandarin Conversational Dialogue Corpus, as shown in Figure 10, with the count for each designed word shown in parentheses; the other is the balance of the corpus, ensuring that every Chinese phoneme appears at least once for the 107 phoneme models. The second corpus, used to train the Chinese synthesizer, is the Tsinghua Corpus of Speech Synthesis (TH-CoSS) of Tsinghua University in Beijing, a Chinese corpus designed mainly for the research, development, and evaluation of Mandarin speech synthesis and for phonetics research, whose texts are mainly selected from news. Figure 11 shows the characteristic statistics of the speech database, where the grid-patterned bars are TH-CoSS, the diagonal bars are the ordinary read corpus, and the blank bars are the spontaneous speech: the average word length of the read corpora falls roughly between 350 and 400 milliseconds (ms), while the average word length of the spontaneous speech falls around 200 ms.

Please refer to Figures 12 to 14, which are, respectively, schematic diagrams of the MOS test results of the pronunciation variation model, of the objective evaluation results of the pronunciation variation model, and of the MOS test results of the naturalness evaluation of the invention. According to the method of the invention described above, the embodiment is examined through experiments on the evaluation of the pronunciation variation model, the objective evaluation of the pronunciation variation model, and the evaluation of the overall naturalness, as follows.

Evaluation of the pronunciation variation model: a method using only duration information for adjustment (Duration), a phoneme-based method using one GMM model to describe a phoneme (Phone-base), and the method of the invention using one HMM model to describe a phoneme (State-base) are compared. The comparison of the three shows that the proposed method of describing a phoneme with an HMM gives better results: as shown in Figure 12, when an HMM describes the variation model of a phoneme (State-base), the temporal relationship and the continuity between preceding and following pronunciations are considered, so the similarity performs better while the synthesized quality remains acceptable, proving that using the HMM to describe the pronunciation variation indeed gives a better effect.

Objective evaluation of the pronunciation variation model: the invention uses the mean square error between the prosodic parameters converted by the pronunciation variation and the target prosodic parameters as the criterion of the objective evaluation; the mean square error is computed as:

MSE = (1/N) Σ_{n=0}^{N−1} (y_n − ŷ_n)²    (Formula 23)

S 21 201113869 . 轉數上與目標麵之資料間,抑得聰相近之結果,而利 用較小单位SMe-base之轉換函式所得之結果比利用 P—e之轉換結果來的好,故此結果可與m〇s測試結果 相呼應。 該整體自然度之評估··比較在傳統HTS系統合成與利用 本發明職之方賊良之纽合奴實聽果,討論包含使用 傳統HTS合成器、使用_之方法_ 用本發明所得到之合成結果。如第工4圖所示,本發明提出之 φ 方法所建置之系統’雖然在經過線性轉換之過程中損失一些語 音上之品質’但在語音之流暢度達到與傳統hts系統差不多 之表現,而在π語錄度之評估上,更在大部分受測之結果 中’達到最佳之表現。 至此顯現以本發明之方法具體整合為一中文自發性語音 合成系統具實用性與穩定性。 本發明基於隱藏式馬可夫模型之語音合成器並加以改 良,已經可合成出流暢及清晰之語音,其系統之可攜性及適應 • 性更是其發展優勢,並且在合成語音之自然度上可達到大幅改 善之效果。藉此,本發明可具體整合各式人機雙向溝通系統、 行動裝置、資訊查詢服務系統及資訊教育系統,應用在各種大 眾服務窗口、手機及PDA上;或整合其他資訊傳播技術,應 用在各式服務系統、導覽系統或建構於居家看護環境等,例如 電子地圖有聲導覽系統、隨身電子故事書、即時語音教學、線 上航空訂票系統、火車查詢服務與氣象查詢服務之資訊檢索查 詢系統、股市電子交易系統、及居家看護系統等。 矣t、上所述,本發明係一種中文自發性語音合成中發音變異 [ 22 201113869 產生之方法,可有效改善㈣之種種缺點,係導人轉換函式於 隱藏式馬可夫模型建立發音變異模型,並運用分類迴歸樹預測 發音變異種類,可透過轉換函式產生新之音韻模型, 僅利用固定數量音韻模型合成之不足,並以構音特性參數對發 曰篗異作聲學特性上之分類,以彌補訓練語料不足之問題,再 藉由產生發音變異現象,心謂進基於隱藏式馬可夫模型之合 成語音之自然度者,進而使本發明之産生能更進步、更實用、 更符合使用者之所須,確已符合發明專利申請之要件,爰依法 Φ 提出專利申請。 惟以上所述者,僅為本發明之較佳實施例而已,當不能以 匕限疋本發明實施之範圍;故,凡依本發明申請專利範圍及發 月說明書内谷所作之簡單的等效變化與修飾’皆應仍屬本發明 專利涵蓋之範圍内。 【圖式簡單說明】 第1圖’係本發明之基本流程示意圖。 • 第2圖’係本發明於訓練階段之流程示意圖。 第3圖,係本發明於合成階段之流程示意圖。 第4圖’係本發明以動態時間校正結果對應之音節斷點位 置示意圖。 第5圖’係本發明之線性轉換關係示意圖。 第6圖,係本發明之頻譜轉換模型示意圖。 第7圖’係本發明分類回歸樹之架構示意圖。 第8圖,係本發明之頻譜轉換F_CART預測模型示意圖。 第9圖,係本發明之音長D_CART預測模型示意圖。 23 201113869 第1 0圖,係本發明以MCDC中統計前25常產生發音變 異現象之詞之示意圖。 第11圖,係本發明之語料字長度分布示意圖。 第1 2圖,係本發明發音變異模型M〇s之測試結果示意 圖。 ' 第13圖,係本發明發音變異模型之客觀評估結果示意 圖。 第1 4圖’係本發明自然度評比MOS之測試結果示音圖。 【主要元件符號說明】 訓練階段1 步驟(A)發音變異轉換函式模型建立工工 平行語料111 直行分析112 動態時間校正113 隱藏式馬可夫模型之訓練資料114 • 頻譜轉換模型;I 15 步驟(B)發音變異預測模型回歸樹分類 構音特徵參數121 轉換函式之分類回歸樹之訓練12 2 音長之分類回歸樹之訓練12 3 頻譜轉換預測模型12 4 音長預測模型12 5 合成階段2 步驟(C)HTS合成2 1 24 201113869 輸入欲合成文字之發音參數211 構音特性參數212 聲學模型213 狀態選擇214 步驟(D)變異轉換2 2 頻譜轉換函式2 21 音長轉換函式2 2 2 轉換2 2 3 梅爾對數頻譜近似濾波器2 2 4 動態時間校正路徑3 線段31〜3 3S 21 201113869 . Between the data on the number of revolutions and the target surface, the result is similar to that of Cong, and the result obtained by using the conversion function of the smaller unit SMe-base is better than the result of conversion using P-e, so the result Can echo the m〇s test results. The evaluation of the overall naturalness··Compared with the traditional HTS system synthesis and the use of the singer of the invention, the discussion includes the use of the traditional HTS synthesizer, the method of using _ the synthesis obtained by the invention result. As shown in Fig. 4, the system constructed by the φ method proposed by the present invention 'saves some voice quality during the linear conversion process', but the fluency of the speech reaches a performance similar to that of the conventional hts system. In the evaluation of π vocabulary, it is the best performance in most of the tested results. At this point, it appears that the method of the present invention is specifically integrated into a Chinese spontaneous speech synthesis system with practicality and stability. The invention is based on the speech synthesizer of the hidden Markov model and is improved, and the smooth and clear voice can be synthesized, and the portability and adaptability of the system are its development advantages, and the naturalness of the synthesized speech can be Achieve a substantial improvement. 
Therefore, the present invention can specifically integrate various human-machine two-way communication systems, mobile devices, information inquiry service systems, and information education systems, and can be applied to various public service windows, mobile phones, and PDAs; or integrate other information communication technologies and apply them in various Service system, navigation system or information retrieval query system built in home care environment, such as electronic map audio navigation system, portable electronic storybook, instant voice teaching, online airline booking system, train inquiry service and weather inquiry service , the stock market electronic trading system, and home care system.矣t, above, the present invention is a method for generating pronunciation variation in Chinese spontaneous speech synthesis [22 201113869, which can effectively improve various shortcomings of (4), and a derivative conversion function to establish a pronunciation variation model in a hidden Markov model, The classification regression tree is used to predict the type of pronunciation variation. The new phonological model can be generated through the conversion function. Only the deficiencies of the fixed number of phonological models are used, and the acoustic characteristics of the vocal characteristics are classified to compensate the acoustic characteristics. The problem of insufficient training corpus, and by the phenomenon of pronunciation variation, the heart is based on the naturalness of the synthesized speech based on the hidden Markov model, so that the invention can be more advanced, more practical and more suitable for the user. It must have met the requirements of the invention patent application and 提出 filed a patent application according to law. However, the above is only the preferred embodiment of the present invention, and the scope of the present invention cannot be limited to the scope of the present invention; therefore, the simple equivalent of the patent scope and the valley in the monthly specification of the present invention. Variations and modifications are still within the scope of the invention. BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 is a schematic view showing the basic flow of the present invention. • Figure 2 is a schematic diagram of the flow of the invention in the training phase. Figure 3 is a schematic flow diagram of the present invention in the synthesis stage. Fig. 4 is a schematic diagram showing the position of a syllable breakpoint corresponding to the dynamic time correction result of the present invention. Fig. 5 is a schematic diagram showing the linear conversion relationship of the present invention. Figure 6 is a schematic diagram of the spectrum conversion model of the present invention. Figure 7 is a schematic diagram of the structure of the classification regression tree of the present invention. Figure 8 is a schematic diagram of the spectrum conversion F_CART prediction model of the present invention. Figure 9 is a schematic diagram of the sound length D_CART prediction model of the present invention. 23 201113869 Fig. 10 is a schematic diagram showing the words of the first 25 frequent occurrences of pronunciation variation in the MCDC. Figure 11 is a schematic diagram showing the distribution of the length of the corpus of the present invention. Fig. 12 is a schematic diagram showing the test results of the pronunciation variation model M〇s of the present invention. Fig. 
Fig. 13 is a schematic diagram of the objective evaluation results of the pronunciation variation model of the present invention.
Fig. 14 is a schematic diagram of the test results of the naturalness MOS evaluation of the present invention.

DESCRIPTION OF MAIN COMPONENT SYMBOLS

Training phase 1
Step (A) pronunciation variation conversion function model establishment 11
Parallel corpus 111
STRAIGHT analysis 112
Dynamic time warping 113
Training data of the hidden Markov model 114
Spectrum conversion model 115
Step (B) pronunciation variation prediction model regression tree classification 12
Articulatory feature parameters 121
Training of the conversion function CART 122
Training of the duration CART 123
Spectrum conversion prediction model 124
Duration prediction model 125
Synthesis phase 2
Step (C) HTS synthesis 21
Input of the pronunciation parameters of the text to be synthesized 211
Articulatory feature parameters 212
Acoustic models 213
State selection 214
Step (D) variation conversion 22
Spectrum conversion function 221
Duration conversion function 222
Conversion 223
Mel log spectrum approximation filter 224
Dynamic time warping path 3
Line segments 31-33
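Step (D) in the component list above sits between the HTS parameter prediction (21) and the mel log spectrum approximation filter (224). The following is a minimal sketch of that interception for one variant segment, assuming the transform matrix A, bias R, and duration ratio have already been fetched from the predicted CART leaves; the MLSA filtering itself is treated as an external vocoder call, and all names here are illustrative.

```python
import numpy as np

def apply_variation(mcep_frames, A, R, duration_ratio):
    """Convert the HTS-predicted mel-cepstral frames of a segment
    predicted to undergo pronunciation variation.
    mcep_frames: (T, D) array; A: (D, D); R: (D,); duration_ratio > 0."""
    converted = mcep_frames @ A.T + R            # spectrum conversion (221)
    new_len = max(1, int(round(len(converted) * duration_ratio)))
    pos = np.linspace(0.0, len(converted) - 1.0, new_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(converted) - 1)
    w = (pos - lo)[:, None]
    stretched = (1.0 - w) * converted[lo] + w * converted[hi]   # duration conversion (222)
    return stretched  # pass, with the predicted pitch, to the MLSA filter (224)
```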


Claims (10)

1. A pronunciation variation generation method for spontaneous Chinese speech synthesis, provided for integrated application in various two-way human-machine communication and digital teaching services, the method comprising a training phase and a synthesis phase, the training phase comprising the following steps:
(A) a pronunciation variation conversion function model establishment step: a parallel corpus and the corresponding text are first pre-processed; for the parallel speech part, the spectrum parameters and prosodic parameters obtained through STRAIGHT analysis are aligned using the path result of dynamic time warping (DTW) to establish the correspondence between pronunciation variation phones and normal phones, yielding paired phone units (phone pairs), while the text part undergoes text analysis and, according to manually pre-labeled prosodic boundaries, yields the corresponding text labels; thereafter, a pronunciation variation conversion function model is trained for the portions labeled as pronunciation variation: the extracted spectrum parameters, combined with the prosodic parameters and the text labels, serve as training data for a hidden Markov model (HMM) to train linear conversion functions, obtaining a spectrum conversion model (HMM models) containing spectrum conversion functions and duration information;
(B) a pronunciation variation prediction model regression tree classification step: the pronunciation variations are classified by the articulatory features of speech, the conversion functions are grouped and trained according to linguistic and acoustic pronunciation parameters, and a classification and regression tree (CART) model is then applied: the spectrum conversion functions and duration information of the above spectrum conversion model, together with the corresponding linguistic information obtained from the text labels, are used to train the CART model, yielding a spectrum conversion prediction model (transformation function model) and a duration prediction model;
the synthesis phase comprising the following steps:
(C) an HTS synthesis step: the pronunciation parameters of the text to be synthesized are input and processed by front-end text analysis to obtain linguistic information and generate a text label file, the pronunciation variation phenomena are predicted, and an HMM-based speech synthesis system (HTS) is used with the text label file to predict the spectrum, duration, and pitch parameters; and
(D) a variation conversion step: for the portions predicted to exhibit pronunciation variation, suitable spectrum conversion functions and duration conversion functions are selected from the above spectrum conversion prediction model and duration prediction model according to the information in the text label file, the spectrum and the duration are converted respectively, and the converted parameters are passed through a mel log spectrum approximation filter (MLSA filter) to synthesize natural speech output.
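Step (A) of claim 1 hinges on frame-level alignment of the variant and normal utterances before the phone pairs can be cut out. The following is a minimal DTW sketch over spectral feature sequences; the Euclidean frame distance is an assumption, as the claim does not fix the local distance measure.

```python
import numpy as np

def dtw_path(src, tgt):
    """Dynamic time warping between two feature sequences src (N, D)
    and tgt (M, D); returns the frame-alignment path as (i, j) pairs."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [(n - 1, m - 1)], n, m
    while (i, j) != (1, 1):
        moves = []
        if i > 1 and j > 1:
            moves.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 1:
            moves.append((cost[i - 1, j], i - 1, j))
        if j > 1:
            moves.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i - 1, j - 1))
    return path[::-1]
```

The syllable breakpoints along the resulting path (Fig. 4) delimit the phone pairs from which the conversion functions are then trained.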
2. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the spectrum parameters extracted in step (A) are converted through mel-cepstral analysis, extracting mel-cepstral coefficients of order 25.
3. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein step (A) adopts a linearity assumption to train the conversion functions for pronunciation variation: a speech segment in which variation occurs is regarded as a linear combination and transformation of the normal speech segment, and the relationship between parallel normal and variant segments is described by linearly transforming the paired phone units.
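The 25th-order mel-cepstral analysis of claim 2 corresponds to a standard vocoder front end. The following sketch shows one way to compute such coefficients; the pysptk package, the Blackman window, the 25 ms/5 ms framing, and the all-pass constant 0.42 (a common choice for 16 kHz speech) are assumptions here, not part of the patent.

```python
import numpy as np
import pysptk

def extract_mcep(wave, frame_len=400, hop=80, order=25, alpha=0.42):
    """Frame a 16 kHz waveform and extract 25th-order mel-cepstra,
    i.e. order + 1 coefficients per frame."""
    window = np.blackman(frame_len)
    frames = []
    for start in range(0, len(wave) - frame_len, hop):
        frame = wave[start:start + frame_len] * window
        frames.append(pysptk.mcep(frame, order=order, alpha=alpha))
    return np.array(frames)  # shape (T, order + 1); silent frames may need a noise floor
```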
4. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the linear conversion function of step (A) is y = Ax + R.
5. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the splitting condition of the classification and regression tree model of step (B) is set such that the conversion error (GenErr) after a split must be smaller than the conversion error before the split, the conversion error being computed as

GenErr_i = Σ_{m=1}^{M} ‖ y_m − (A_i x_m + R_i) ‖²,

where y_m is the m-th frame of the target phone Y, x_m is the m-th frame of the source phone X, A_i x_m + R_i is the linear conversion function of the i-th state, and M is the total number of frames.
6. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the classification and regression tree models of step (B) comprise a classification and regression tree for the conversion functions (Transformation Function CART) and a classification and regression tree for duration (Duration CART), the spectrum conversion functions and the duration information being used to build and split the trees according to the pronunciation parameters; each leaf node obtained from the final classification represents one class of conversion model, describing the spectral change and the phone-length difference between the normal and variant phones.
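Claims 4 and 5 define the per-state transform and the error by which candidate splits are scored. The following sketch estimates A and R by ordinary least squares over the DTW-paired frames and evaluates GenErr; the closed-form estimator is an assumption of this sketch, since the patent trains the functions within the HMM framework.

```python
import numpy as np

def fit_linear_transform(X, Y):
    """Least-squares fit of y ≈ A x + R over paired frames.
    X, Y: (M, D) source (normal) and target (variant) frames."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # W has shape (D + 1, D)
    A, R = W[:-1].T, W[-1]
    return A, R

def gen_err(X, Y, A, R):
    """GenErr = sum over m of ||y_m - (A x_m + R)||^2, as in claim 5."""
    resid = Y - (X @ A.T + R)
    return float(np.sum(resid ** 2))
```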
7. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the building of the spectrum conversion prediction model of step (B) comprises the following steps:
(a) generating a root node s_root containing all of the data, a node set U = {s_root}, and a leaf node set V = ∅;
(b) taking a node s_m out of U, generating from the data points of s_m the set Q = {q_1, ...} of all candidate questions, and evaluating every question once;
(c) selecting the question from step (b) that maximizes the RGE as the splitting question, and recording the RGE;
(d) if the RGE of step (c) is greater than 0, splitting this candidate node, the data of s_m being divided into left and right child nodes s_l and s_r according to the selected question q_t, and s_l and s_r being added to the set U; if the RGE is less than 0, adding s_m to the set V; and
(e) removing s_m from U; if U ≠ ∅, returning to step (b); if U = ∅, the building of the classification and regression tree is complete, and models are trained for all leaf nodes so as to compute the conversion function in each node.
8. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the duration prediction model of step (B) adopts the mean square error (MSE).
9. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the method is applicable to multilingual or emotion-bearing computer natural speech synthesis and can be combined with digital learning, information exchange, and mobile devices.
10. The pronunciation variation generation method for spontaneous Chinese speech synthesis as described in claim 1, wherein the method can be deployed and used on a platform consisting of a Pentium IV 3.2 GHz personal computer, 2 GB of RAM, and the Windows XP operating system.
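Steps (a) through (e) of claim 7 describe a greedy tree-growing loop. The following compact sketch makes two assumptions: candidate questions are supplied as predicates over the linguistic context of each sample, and the RGE is taken as the drop in GenErr achieved by a split, since the claim uses RGE without spelling out its formula.

```python
def grow_cart(root_data, questions, fit, err):
    """Greedy CART growth per claim 7.
    root_data: list of (x, y, context) samples; questions: predicates
    over a context; fit(data) -> leaf model; err(data, model) -> GenErr."""
    U, V = [root_data], []                              # step (a)
    while U:                                            # step (e): loop until U is empty
        node = U.pop()                                  # step (b): take a node from U
        base = err(node, fit(node))
        best_gain, best_split = 0.0, None
        for q in questions:                             # step (b): try every question
            left = [s for s in node if q(s[2])]
            right = [s for s in node if not q(s[2])]
            if not left or not right:
                continue
            gain = base - err(left, fit(left)) - err(right, fit(right))
            if gain > best_gain:                        # step (c): keep the max-RGE question
                best_gain, best_split = gain, (left, right)
        if best_split is not None:                      # step (d): split if RGE > 0
            U.extend(best_split)
        else:
            V.append(node)                              # otherwise the node becomes a leaf
    return [(leaf, fit(leaf)) for leaf in V]            # step (e): train all leaf models
```

The duration tree of claim 8 follows the same loop, with the mean square error of the phone durations in place of GenErr.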

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98134883A TWI402824B (en) 2009-10-15 2009-10-15 A pronunciation variation generation method for spontaneous speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98134883A TWI402824B (en) 2009-10-15 2009-10-15 A pronunciation variation generation method for spontaneous speech synthesis

Publications (2)

Publication Number Publication Date
TW201113869A true TW201113869A (en) 2011-04-16
TWI402824B TWI402824B (en) 2013-07-21

Family

ID=44909831

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98134883A TWI402824B (en) 2009-10-15 2009-10-15 A pronunciation variation generation method for spontaneous speech synthesis

Country Status (1)

Country Link
TW (1) TWI402824B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI475558B (en) * 2012-11-08 2015-03-01 Ind Tech Res Inst Method and apparatus for utterance verification
CN111128122A (en) * 2019-12-31 2020-05-08 苏州思必驰信息科技有限公司 Method and system for optimizing rhythm prediction model
TWI746138B (en) * 2020-08-31 2021-11-11 國立中正大學 System for clarifying a dysarthria voice and method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7263484B1 (en) * 2000-03-04 2007-08-28 Georgia Tech Research Corporation Phonetic searching
TWI269191B (en) * 2005-07-27 2006-12-21 Ren-Yuan Lyu Method of synchronizing speech waveform playback and text display
TWI297487B (en) * 2005-11-18 2008-06-01 Tze Fen Li A method for speech recognition

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI475558B (en) * 2012-11-08 2015-03-01 Ind Tech Res Inst Method and apparatus for utterance verification
US8972264B2 (en) 2012-11-08 2015-03-03 Industrial Technology Research Institute Method and apparatus for utterance verification
CN111128122A (en) * 2019-12-31 2020-05-08 苏州思必驰信息科技有限公司 Method and system for optimizing rhythm prediction model
TWI746138B (en) * 2020-08-31 2021-11-11 國立中正大學 System for clarifying a dysarthria voice and method thereof

Also Published As

Publication number Publication date
TWI402824B (en) 2013-07-21

Similar Documents

Publication Publication Date Title
CN108447486B (en) Voice translation method and device
TW504663B (en) Spelling speech recognition apparatus and method for mobile communication
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
US20220013106A1 (en) Multi-speaker neural text-to-speech synthesis
CN104272382B (en) Personalized singing synthetic method based on template and system
CN101178896B (en) Unit selection voice synthetic method based on acoustics statistical model
CN101751922B (en) Text-independent speech conversion system based on HMM model state mapping
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
WO2013040981A1 (en) Speaker recognition method for combining emotion model based on near neighbour principles
CN102122507A (en) Speech error detection method by front-end processing using artificial neural network (ANN)
CN101901598A (en) Humming synthesis method and system
Agrawal et al. Analysis and modeling of acoustic information for automatic dialect classification
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Heeren The effect of word class on speaker-dependent information in the Standard Dutch vowel/aː
CN111370001A (en) Pronunciation correction method, intelligent terminal and storage medium
TW201113869A (en) Pronunciation variation generation method for spontaneous Chinese speech synthesis
Hsia et al. Conversion function clustering and selection using linguistic and spectral information for emotional voice conversion
Furui Robust methods in automatic speech recognition and understanding.
Liu et al. A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin
Hsu et al. Speaker-dependent model interpolation for statistical emotional speech synthesis
Sisman Machine learning for limited data voice conversion
Gibson Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models
Shih et al. Speech-driven talking face using embedded confusable system for real time mobile multimedia

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees