JP3270668B2

JP3270668B2 - Prosody synthesizer based on artificial neural network from text to speech

Info

Publication number: JP3270668B2
Application number: JP28357395A
Authority: JP
Inventors: Sin-Horng Chen; Shaw-Hwa Hwang
Original assignee: National Science Council
Priority date: 1995-10-31
Filing date: 1995-10-31
Publication date: 2002-04-02
Anticipated expiration: 2015-10-31
Also published as: JPH09146576A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION The present invention relates to an artificial-neural-network-based synthesizer of prosodic information for text-to-speech conversion.

[0002]

BACKGROUND OF THE INVENTION Besides the actual words spoken, continuous speech carries suprasegmental information such as stress, timing structure, and fundamental frequency (F0) contour patterns. This information is generally referred to as the prosody of speech, and it is affected by sentence type, syntactic structure, semantics, the emotional state of the speaker, and so on. Speakers normally use prosody to convey the rhythm of speech, the emphasis of phrases, pauses for breathing, and the like. Without prosody, speech would have a flat tone and would sound boring, unpleasant, or barely intelligible. Generating appropriate prosodic information is therefore the most important problem in synthesizing natural speech in a text-to-speech (TTS) system.

[0003]

PROBLEMS TO BE SOLVED BY THE INVENTION For a general TTS system, the prosodic information that must be generated includes the fundamental frequency (F0) contour, the energy level, the word durations, and the pause durations between words. This prosodic information is generally generated according to linguistic features extracted from the input text. Basically, linguistic features of different levels are used, ranging from low-level lexical features, such as the phonetic structure of a word, to high-level features, such as syntactic boundaries.

[0004] Many approaches to prosody generation have been proposed in the past for TTS in various languages (R. Carlson and B. Granstrom (1979), "A text-to-speech system based entirely on rules," Proc. ICASSP, pp. 686-688, 1976; L. S. Lee, C. Y. Tseng, and M. Ouh-Young, "The synthesis rules in a Chinese text-to-speech system," IEEE Trans. ASSP, Vol. 37, pp. 1309-1320; T. J. Sejnowski and C. R. Rosenberg, "NETtalk: a parallel network that learns to read aloud," Johns Hopkins University EECS Technical Report, 1986). Nevertheless, it generally remains difficult to invoke high-level linguistic features elegantly in prosody generation so as to exploit the high-level prosodic structure of speech.

[0005] As a result, the synthesized prosodic information is not good enough to produce natural and fluent speech. This is especially true of F0 synthesis, because F0 is the most important prosodic parameter affecting the naturalness of synthesized speech. Conventional rule-based F0 synthesis approaches treat a few simple intonation patterns as the high-level prosodic structure through which the influence of high-level linguistic features is taken into account. These methods first parse the input text to detect a number of syntactic boundaries, and then apply various rules that determine the F0 contour by considering the influence of each syntactic constituent separately.

[0006] Only a few predetermined intonation patterns are used to model the sentence-level prosodic structure. For example, the well-known declination effect suggests using a falling F0 contour for declarative sentences. Clearly, this approach is not good enough for a high-performance TTS system.

[0007] A similar situation is encountered in synthesizing prosodic information for Mandarin TTS. Mandarin is a tonal language in which each character is pronounced as one syllable. There are only about 1,300 phonologically permissible syllables, namely the set of all legal combinations of 411 base syllables and 5 tones. Each base syllable consists of an optional consonant initial and a vowel final. A word, the smallest syntactically meaningful unit, consists of one or more syllables.

[0008] Because the syllable is the basic pronunciation unit of Mandarin speech, it is also commonly chosen as the basic synthesis unit in Mandarin TTS. The prosodic information that must be synthesized therefore includes the pitch (F0) contour, the energy level, the duration of the syllable final, and the pause duration between syllables. Several approaches have been proposed in the past to synthesize some or all of these prosodic parameters.

[0009] These include a rule-based approach (J. Zhang, "Acoustic parameters and phonological rules of a text-to-speech system for Chinese," Proc. ICASSP, pp. 2023-2026, 1986), a statistical method (S. H. Chen, S. G. Chang, and S. M. Lee, "A statistical model based fundamental frequency synthesizer for Mandarin speech," J. Acoust. Soc. Am. 92(1), pp. 114-120, July 1992), a linear regression method (S. H. Hwang and S. H. Chen, "Neural network based F0 synthesizer for Mandarin text-to-speech system," IEE Proc. Vis. Image Signal Process., Vol. 141, No. 6, Dec. 1994), and a multilayer perceptron (MLP) based approach (Y. R. Wang and S. H. Chen, "Tone recognition of continuous Mandarin speech assisted with prosodic information," J. Acoust. Soc. Am., Vol. 96, No. 5, Pt. 1, pp. 2637-2645, Nov. 1994), among others.

[0010] Although some improvements have been achieved, these approaches are still far from the goal of generating prosodic information adequate for synthesizing natural speech from arbitrary text. Their main drawback is that they cannot elegantly invoke high-level linguistic features for the synthesis of prosodic information.

[0011]

MEANS FOR SOLVING THE PROBLEMS The present invention provides an artificial-neural-network-based synthesizer of prosodic information for text-to-speech conversion of a human language. The synthesizer comprises: a prosodic model that receives high-level linguistic features of the human language, operates on a clock synchronized with the words of the human language, and supplies outputs representing the prosodic state of the prosodic structure at each word; and a prosodic parameter generator that receives low-level linguistic features of the human language together with the prosodic-state outputs supplied by the prosodic model, operates on a clock synchronized with the syllables, and supplies prosodic parameters. The synthesizer uses an artificial neural network to simulate the human prosody pronunciation mechanism, and the network is then trained on real speech to model the prosodic structure of human speech.

[0012] Preferably, in the artificial-neural-network-based prosodic information synthesizer according to the present invention, the prosodic model comprises: (a) a first part of an input layer, containing a plurality of storage units that store the high-level linguistic features and generate normalized input values of those features; and (b) a first hidden layer, containing a plurality of storage units, that receives the normalized high-level linguistic feature input values from the first part of the input layer and supplies the prosodic-state outputs, each storage unit of the first part of the input layer being connected to each storage unit of the first hidden layer by artificial neural network fibers linking the storage units. The prosodic parameter generator comprises: (c) a second hidden layer, containing a plurality of storage units, that receives the low-level linguistic features from a second part of the input layer and the prosodic-state outputs from the first hidden layer and supplies outputs, each storage unit of the second part of the input layer and each storage unit of the first hidden layer being connected to each storage unit of the second hidden layer by artificial neural network fibers linking the storage units; and (d) an output layer that receives the outputs from the second hidden layer and supplies the prosodic parameters, the output layer containing a plurality of storage units, each storage unit of the second hidden layer being connected to each storage unit of the output layer by artificial neural network fibers linking the storage units.

[0013] Preferably, the prosodic model further includes a first recurrent layer containing the same number of storage units as the first hidden layer. Each storage unit of the first recurrent layer is connected to each storage unit of the first hidden layer by artificial neural network fibers linking the storage units, and all of the outputs that the first hidden layer generates by its nonlinear transformation are fed back to the first hidden layer itself as inputs through the first recurrent layer.

[0014] Preferably, the prosodic parameter generator further includes a second recurrent layer containing the same number of storage units as the second hidden layer. Each storage unit of the second recurrent layer is connected to each storage unit of the second hidden layer by artificial neural network fibers linking the storage units, and all of the outputs that the second hidden layer generates by its nonlinear transformation are fed back to the second hidden layer itself as inputs through the second recurrent layer.

[0015] Preferably, the prosodic parameter generator further includes a third recurrent layer containing a plurality of storage units. Each storage unit of the second hidden layer is connected to each storage unit of the third recurrent layer by artificial neural network fibers linking the storage units; part of the outputs generated by the output layer is nonlinearly transformed, and the transformed signals are fed back as inputs through the third recurrent layer.

[0016] Preferably, the artificial neural network fibers connecting each storage unit of the first part of the input layer to each storage unit of the first hidden layer, together with those connecting each storage unit of the first recurrent layer to each storage unit of the first hidden layer, constitute a simple recurrent neural network.

[0017]

DESCRIPTION OF THE PREFERRED EMBODIMENTS The present invention proposes a new approach to prosody generation for TTS that properly takes the influence of high-level linguistic features into account. The basic idea is to use an artificial neural network (ANN) to simulate the human prosody pronunciation mechanism and then to train it on real speech so that it models the prosodic structure of the human language. By taking several word-level features extracted from the training texts as inputs, and setting the prosodic parameters extracted from the speech associated with those texts as output targets, the ANN can be trained to learn automatically the relationship between the prosodic structure of the utterances and the linguistic feature sequences of the associated texts. A well-trained ANN can therefore be regarded as a prosody synthesizer that generates appropriate prosodic information for a given input text.

[0018] FIG. 1 shows a conceptual model of the prosody pronunciation process in the human brain. In this model, the input text is first preprocessed by text analysis to extract a number of linguistic features. The prosody pronunciation mechanism then interprets these linguistic features to generate appropriate prosodic information. Basically, the pronunciation mechanism can use a variety of linguistic features to influence prosody generation, ranging from low-level lexical features, such as syllable tones, to high-level features, such as syntactic boundaries.

[0019] Because the influence of high-level linguistic features is more global, we propose to refine the model by dividing it into two parts, as shown in FIG. 2. The first part is a prosodic model that accounts for the influence of the high-level linguistic features of human speech. It explores a state, referred to as the prosodic state, which controls the global trend of prosody generation. The second part is the actual prosodic parameter generator. It uses several low-level lexical features and, with the help of the prosodic model, generates the prosodic information.

[0020] To simulate this refined model of the human pronunciation mechanism, the present invention employs a multilayer recurrent neural network (RNN) prosody synthesizer. FIG. 3 shows a block diagram of the RNN. As shown in FIG. 3, the RNN is a four-layer network with one input layer, two hidden layers, and one output layer. Its detailed structure is shown in FIG. 4. It can be functionally divided into two parts.

[0021] The first part consists of the first part of the input layer and the first hidden layer, all of whose outputs are fed back to its own input. It is regarded as a prosodic model that explores the high-level prosodic structure of speech by using several word-level linguistic features of the input text. It operates on a clock synchronized with the words and generates several outputs representing the prosodic state of the prosodic structure at the current word.

[0022] The input features include, for both the current word W_i and the following word W_{i+1}, the parts of speech POS(W_i) and POS(W_{i+1}), the lengths Len(W_i) and Len(W_{i+1}), and an indicator PM(W_i, W_{i+1}) giving the type of punctuation mark (PM) between the two words. For Mandarin, we used 42 POS types and 4 PM types in our experiments; they are listed in Tables 1 and 2, respectively.
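The word-level input described in this paragraph can be pictured as a fixed-length numeric vector. The following Python sketch is illustrative only: the one-hot layout, the scaling of word length by a hypothetical `MAX_WORD_LEN`, the all-zero coding for "no punctuation mark", and the function name `encode_word_pair` are assumptions of the sketch, not the encoding disclosed in the patent. Because Tables 1 and 2 are not reproduced in this text, POS and PM types are referred to by index.

```python
NUM_POS = 42      # POS types used in the experiments (Table 1)
NUM_PM = 4        # punctuation-mark types (Table 2)
MAX_WORD_LEN = 5  # assumed scaling constant for word length in syllables


def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_word_pair(pos_cur, len_cur, pos_next, len_next, pm_type=None):
    """Encode POS(W_i), Len(W_i), POS(W_i+1), Len(W_i+1), PM(W_i, W_i+1).

    `pm_type` is None when no punctuation mark separates the two words;
    the PM part is then all zeros (an assumption of this sketch).
    """
    pm = [0.0] * NUM_PM if pm_type is None else one_hot(pm_type, NUM_PM)
    return (
        one_hot(pos_cur, NUM_POS)
        + [len_cur / MAX_WORD_LEN]
        + one_hot(pos_next, NUM_POS)
        + [len_next / MAX_WORD_LEN]
        + pm
    )


features = encode_word_pair(pos_cur=3, len_cur=2, pos_next=17, len_next=1, pm_type=0)
print(len(features))  # 42 + 1 + 42 + 1 + 4 = 90
```

With this layout, one such vector would be presented to the prosodic model once per word.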

[Table 1]

[Table 2]

[0023] There are several reasons for using this type of recurrent neural network, with several word-level linguistic features as inputs, to realize the prosodic model. First, using high-level syntactic features directly as input features is impractical, because it is generally not easy to obtain such features accurately for unrestricted natural Chinese text, and because the syntactic structure of Mandarin speech is not isomorphic to its prosodic structure. Second, since the word is the smallest meaningful unit of pronunciation, the word should also be the basic building block of the prosodic structure of Mandarin speech. Third, the prosodic structure of a Mandarin utterance can be regarded as a model describing the relationships among its constituent words.

[0024] Fourth, the first part of the RNN used in this work is a simple RNN, and simple RNNs have been used to simulate finite-state machines in the work of R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., pp. 4-22, 1987. It is therefore a dynamic system suitable for use as a model describing the word relationships in Mandarin utterances. From the above observations, we believe that the first part of the RNN, with several word-level linguistic features as inputs, can be used to realize the prosodic model.

[0025] The second part of the RNN consists of the second part of the input layer, the second hidden layer, and the output layer. It is the actual prosodic parameter generator. It operates on a clock synchronized with the syllables and generates all of the prosodic parameters required by a Mandarin TTS system, using several low-level linguistic features, which are fed directly to the second hidden layer, together with the prosodic state generated by the prosodic model. All of the outputs of the second hidden layer are fed back to its own input. In addition, two of the output prosodic parameters, the pitch mean and the energy level, are fed back to the input of the output layer. With this arrangement, the prosodic parameter generator becomes a dynamic system capable of predicting these time-varying prosodic parameters of real speech.
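The two-clock operation of the RNN (prosodic model updated once per word, prosodic parameter generator run once per syllable) can be sketched as a forward pass in plain Python. This is a minimal sketch under stated assumptions, not the patented implementation: the class name `ProsodyRNN`, the layer sizes, the tanh hidden units, the linear output layer, the random initial weights, and the ordering of the outputs are all hypothetical. In particular, the sketch routes the two fed-back outputs into the second hidden layer, which is one possible reading of the third recurrent layer described in the text.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix with `rows` x `cols` entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def layer(weights, inputs, activation=math.tanh):
    """Fully connected layer: activation applied to each weighted sum."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


class ProsodyRNN:
    def __init__(self, n_word, n_syl, n_h1, n_h2, n_out, n_fb=2, seed=0):
        rng = random.Random(seed)
        # Part 1 (prosodic model): word features + own previous state.
        self.w_h1 = make_matrix(n_h1, n_word + n_h1, rng)
        # Part 2 (generator): syllable features + prosodic state
        # + own previous state + fed-back outputs (assumed routing).
        self.w_h2 = make_matrix(n_h2, n_syl + n_h1 + n_h2 + n_fb, rng)
        self.w_out = make_matrix(n_out, n_h2, rng)
        self.h1 = [0.0] * n_h1
        self.h2 = [0.0] * n_h2
        self.fb = [0.0] * n_fb
        self.n_fb = n_fb

    def step_word(self, word_features):
        """Word-synchronous clock: update the prosodic state once per word."""
        self.h1 = layer(self.w_h1, word_features + self.h1)
        return self.h1

    def step_syllable(self, syl_features):
        """Syllable-synchronous clock: emit one prosodic parameter vector."""
        self.h2 = layer(self.w_h2, syl_features + self.h1 + self.h2 + self.fb)
        out = layer(self.w_out, self.h2, activation=lambda v: v)  # linear output
        # The first n_fb outputs stand in for pitch mean and energy level;
        # they are nonlinearly transformed and fed back for the next syllable.
        self.fb = [math.tanh(v) for v in out[: self.n_fb]]
        return out


def synthesize(model, words):
    """`words` is a list of (word_features, [syllable_features, ...]) pairs."""
    outputs = []
    for word_features, syllables in words:
        model.step_word(word_features)
        for syl_features in syllables:
            outputs.append(model.step_syllable(syl_features))
    return outputs


model = ProsodyRNN(n_word=90, n_syl=31, n_h1=8, n_h2=12, n_out=6)
words = [([0.0] * 90, [[0.0] * 31, [0.0] * 31]), ([1.0] * 90, [[1.0] * 31])]
params = synthesize(model, words)
print(len(params), len(params[0]))  # 3 6
```

The six outputs per syllable correspond in count to four pitch-contour coefficients, one energy level, and one final duration; the weights here are untrained, so the values themselves carry no meaning.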

[0026] For Mandarin, the low-level linguistic input features used in the present invention are, for the syllable S_j being processed, the tone T(S_j), the initial type I(S_j), the final type F(S_j), and an indicator L(S_j/W_i) showing whether the syllable is the first, last, or an intermediate syllable of the current word W_i. In our experiments, we used 6 broad initial types, determined by the manner of articulation of the consonant, and 17 final types, classified by the constituent vowels and the nasal ending. Tables 3 and 4 list these initial and final types.
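The syllable-level input can be coded in the same spirit as a concatenation of one-hot groups. The sketch below is hypothetical: the group order, the index conventions, and the name `encode_syllable` are assumptions, the initial and final types are referred to by index because Tables 3 and 4 are not reproduced in this text, and the coding of a one-syllable word (treated here as whatever position label the caller passes) is not specified by the source.

```python
NUM_TONES = 5      # Mandarin tones, including the neutral tone
NUM_INITIALS = 6   # broad initial types (Table 3)
NUM_FINALS = 17    # final types (Table 4)
POSITIONS = ("first", "middle", "last")  # position of S_j within W_i


def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_syllable(tone, initial_type, final_type, position):
    """Encode T(S_j), I(S_j), F(S_j) and L(S_j/W_i) as one vector.

    `tone` is 1..5; `initial_type` and `final_type` are 0-based indices.
    """
    return (
        one_hot(tone - 1, NUM_TONES)
        + one_hot(initial_type, NUM_INITIALS)
        + one_hot(final_type, NUM_FINALS)
        + one_hot(POSITIONS.index(position), len(POSITIONS))
    )


vec = encode_syllable(tone=3, initial_type=2, final_type=10, position="last")
print(len(vec))  # 5 + 6 + 17 + 3 = 31
```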

[0027] The output prosodic parameters are, for the syllable being processed, four low-order orthogonally transformed coefficients of the pitch contour, the energy level (i.e., the maximum log energy), and the final duration. Here, the energy level and the final duration are normalized with respect to the final type of the syllable being processed. These parameters are normalized to reduce the system complexity that would otherwise result from the variability of these prosodic parameters caused by lexical features. The basis functions used for the orthogonal transform of the pitch contour are given below.

(Equation 1)

(Equation 2)

(Equation 3)

(Equation 4)

[0028] Accordingly, of the four low-order orthogonally transformed coefficients of the pitch contour, the first represents the mean and the other three represent the shape of the pitch contour of the syllable being processed.
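The idea that the first coefficient carries the mean and the remaining three carry the shape can be illustrated with any discrete orthonormal polynomial basis whose zeroth function is constant. The patent's own basis functions (Equations 1 to 4) are not reproduced in this text, so the Python sketch below builds a stand-in basis by Gram-Schmidt orthogonalization of 1, x, x^2, x^3 on the normalized frame positions i/N; the function names and the sample contour values are hypothetical.

```python
def orthonormal_basis(num_points, order=4):
    """Discrete polynomials on x = i/N (i = 0..N), orthonormal under
    the inner product <u, v> = mean(u * v), built by Gram-Schmidt."""
    n = num_points - 1
    xs = [i / n for i in range(num_points)]
    basis = []
    for degree in range(order):
        vec = [x ** degree for x in xs]
        for b in basis:
            proj = sum(v * e for v, e in zip(vec, b)) / num_points
            vec = [v - proj * e for v, e in zip(vec, b)]
        norm = (sum(v * v for v in vec) / num_points) ** 0.5
        basis.append([v / norm for v in vec])
    return basis


def analyze(contour, order=4):
    """Project a pitch contour onto the first `order` basis functions."""
    basis = orthonormal_basis(len(contour), order)
    return [sum(f * e for f, e in zip(contour, b)) / len(contour) for b in basis]


def reconstruct(coeffs, num_points):
    """Rebuild a smoothed contour from its transform coefficients."""
    basis = orthonormal_basis(num_points, len(coeffs))
    return [sum(a * b[i] for a, b in zip(coeffs, basis)) for i in range(num_points)]


# Hypothetical pitch-period contour of one syllable, in ms per frame.
contour = [7.0, 6.8, 6.5, 6.3, 6.2, 6.3, 6.6, 7.0]
coeffs = analyze(contour)
smoothed = reconstruct(coeffs, len(contour))
mean = sum(contour) / len(contour)
print(abs(coeffs[0] - mean) < 1e-9)  # True: the first coefficient is the mean
```

Because the zeroth basis function is identically 1, the first coefficient equals the contour mean by construction, and the other three coefficients are unaffected by adding a constant offset to the contour.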

[Table 3]

[Table 4]

[0029] The RNN prosody synthesizer can be trained by the error back-propagation (EBP) algorithm using a large set of real speech utterances. By supplying the linguistic features extracted from the input texts as inputs and setting the prosodic parameters extracted from the corresponding utterances as the desired output targets, the RNN can automatically learn the relationship between the prosodic structure of the utterances and the word-level linguistic features of the input texts. After proper training, the RNN can be regarded as a prosody synthesizer for Mandarin TTS. Those skilled in the art will readily appreciate that the RNN-based prosody synthesizer disclosed in the present invention is suitable not only for Mandarin but also for several other human languages.

[0030] The performance of the new approach of the present invention to synthesizing prosodic information for Mandarin was examined by simulation. A continuous Mandarin speech database provided by the Telecommunication Laboratories, MOTC, ROC, was used. The database contains 655 sentential and paragraphic utterances, all produced by a single male speaker. All utterances were spoken naturally at a rate of 3.5 to 4.5 syllables per second. The database was divided into two parts, a training set and an outside test set, containing 28,191 syllables and 7,051 syllables, respectively.

[0031] All speech signals were digitally recorded at a sampling rate of 20 kHz. They were then divided into 10 ms frames and manually segmented into silence, unvoiced, and voiced parts on the basis of several acoustic features, including the waveform, energy, zero-crossing rate, LPC coefficients, cepstrum, and delta-cepstrum. The prosodic parameters for synthesis were then extracted from the speech signals down-sampled to 10 kHz. For each syllable, the prosodic parameters comprise the four orthogonally transformed coefficients of the pitch contour, the normalized maximum log energy, and the normalized final duration.

[0032] The pitch period was detected by the SIFT algorithm (J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. on Audio and Electroacoustics, Vol. AU-20, No. 5, pp. 367-377, Dec. 1972) with manual error correction. The frame length for pitch detection was 40 ms with a frame shift of 10 ms. The frame length for log-energy analysis was 20 ms with a frame shift of 10 ms.
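The frame lengths and frame shifts quoted here translate directly into sample counts. The following Python sketch shows overlapping framing and a per-frame log energy; it assumes the analysis runs on the 10 kHz down-sampled signal, and the function names, the energy floor, and the synthetic test tone are illustrative choices rather than details from the patent. Pitch detection by SIFT itself is not sketched.

```python
import math


def frame_signal(samples, sample_rate_hz, frame_ms, shift_ms):
    """Split `samples` into overlapping frames of `frame_ms` every `shift_ms`."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    shift = int(sample_rate_hz * shift_ms / 1000)
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, shift)]


def log_energy_db(frame, floor=1e-10):
    """Log energy of one frame in dB; `floor` avoids log(0) on silence."""
    energy = sum(x * x for x in frame)
    return 10.0 * math.log10(max(energy, floor))


SAMPLE_RATE_HZ = 10_000  # assumed: the down-sampled signal
# One second of a synthetic 100 Hz tone standing in for speech.
signal = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE_HZ) for t in range(SAMPLE_RATE_HZ)]

pitch_frames = frame_signal(signal, SAMPLE_RATE_HZ, frame_ms=40, shift_ms=10)
energy_frames = frame_signal(signal, SAMPLE_RATE_HZ, frame_ms=20, shift_ms=10)
print(len(pitch_frames), len(pitch_frames[0]))    # 97 400
print(len(energy_frames), len(energy_frames[0]))  # 99 200

energy_track = [log_energy_db(f) for f in energy_frames]
syllable_energy_level = max(energy_track)  # maximum log energy over the frames
```

Taking the maximum of the per-frame log energies over the frames of one syllable gives the kind of "energy level" value that the text uses as a prosodic parameter, before any normalization.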

[0033] Next, automatic text analysis with a Chinese lexicon of about 80,000 words was used to segment all of the texts associated with the utterances in the speech database into sequences of lexical words. Each lexical word in the lexicon contains one to five syllables. After all of the proper word sequences had been obtained, the POS of every word was determined manually. In the simulation, the set of 42 POS types shown in Table 1 was used. All of the high-level linguistic features used were then extracted.

[0034] Table 5 shows the root-mean-square errors (RMSEs) of the synthesized prosodic parameters. As the table shows, RMSEs of 0.82 ms/frame and 1.08 ms/frame were achieved for pitch contour synthesis in the inside and outside tests, respectively. A typical example of pitch mean synthesis is shown in FIG. 5. It can be seen from this figure that the synthesized pitch mean trajectory matches its original counterpart very well for most syllables. For energy level synthesis, RMSEs of 3.12 dB and 4.88 dB were obtained in the inside and outside tests, respectively. FIG. 6 shows the result of energy level synthesis for the same input text as used in FIG. 5. As this figure shows, the synthesized energy level trajectory again matches its original counterpart for most syllables.
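For reference, the RMSE figures quoted for each prosodic parameter follow the usual definition: the square root of the mean squared difference between the original and synthesized values. A minimal Python sketch is given below; the sample values are made up for illustration and are not taken from Table 5.

```python
import math


def rmse(targets, predictions):
    """Root-mean-square error between two equal-length sequences."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical energy levels in dB for four syllables.
original = [60.0, 62.0, 58.0, 61.0]
synthesized = [63.0, 58.0, 58.0, 61.0]
print(rmse(original, synthesized))  # 2.5
```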

[0035] For final duration synthesis, RMSEs of 28.7 ms and 38.2 ms were obtained in the inside and outside tests, respectively. FIG. 7 shows the synthesized final durations of the syllables for the same input text as before. Here again, the synthesized final duration trajectory matches its original counterpart very well for most syllables.

[0036] Two typical examples of pitch contour synthesis are shown in FIGS. 13 and 14. From these two figures, we find that most of the synthesized syllable pitch contours resemble their original counterparts in both shape and level. It is worth noting that in FIG. 13 the synthesized pitch contours of the fourth through seventh syllables, whose lexical tones are all tone 3, look like the standard shapes of tone 2, tone 3, tone 2, and tone 3. Similarly, in FIG. 14 the synthesized pitch contours of the last two syllables, whose lexical tones are both tone 3, look like the standard patterns of tone 2 and tone 3. This shows that the well-known sandhi rule, which changes a tone 3 to tone 2 when it is followed by another tone 3, is correctly realized here.
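For illustration only, the tone patterns reported for FIGS. 13 and 14 can be reproduced by an explicit, simplified version of the tone 3 sandhi rule. The sketch below pairs adjacent tone 3 syllables from left to right; the function name `apply_tone3_sandhi` and this pairing strategy are assumptions made for the example. The RNN in this document learns the behavior implicitly and contains no such rule, and real speech groups longer runs of tone 3 according to phrasing.

```python
def apply_tone3_sandhi(tones):
    """Simplified tone 3 sandhi: in each left-to-right pair of adjacent
    tone 3 syllables, the first is changed to tone 2."""
    result = list(tones)
    i = 0
    while i < len(result) - 1:
        if result[i] == 3 and result[i + 1] == 3:
            result[i] = 2
            i += 2  # both syllables of the pair are consumed
        else:
            i += 1
    return result


# Four consecutive tone 3 syllables, as in the fourth to seventh syllables of FIG. 13.
print(apply_tone3_sandhi([3, 3, 3, 3]))
# Two final tone 3 syllables, as in FIG. 14.
print(apply_tone3_sandhi([3, 3]))
```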

[0037] By carefully listening to and checking both the original pitch contours and the synthesized pitch intervals for all syllable sequences in the database with 3-3 tone pairs and 3-3-3 tone sequences, we labeled the tones actually pronounced and counted the number of tone changes. Tables 6 and 7 show the experimental results. As shown in the tables, the rate of correct synthesis is 86% for 3-3 tone pairs and 77.4% for 3-3-3 tone sequences.

[0038] Further error analysis showed that most of the errors occurred at syllables that may acceptably be pronounced as either tone 2 or tone 3. Consequently, no annoying mispronunciation of tone 3 was perceived. This confirms that the sandhi rule for tone 3 changes was automatically learned and implicitly memorized by the RNN prosody synthesizer. Based on the observations above, we can conclude that the proposed RNN prosody synthesizer performs very well.

[Table 5]

[Table 6]

[Table 7]

[0039] To further examine the characteristics of the tonal model, we vector-quantized the output of the first part of the RNN prosody synthesizer into eight classes and assigned each class to one state of an eight-state finite state machine (FSM). The FSM operates on a clock synchronized with the words of the input text. Tables 8 to 11 show some statistics of the FSM, including the state transition probabilities, the distribution of the beginning and ending words of sentences and paragraphs, the distribution of words before and after a PM, and the distribution of words of various lengths. FIG. 15 shows the topology of the FSM, in which only some of the most significant state transitions are drawn.
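The analysis above can be sketched in two steps: map each word-synchronous output vector to its nearest codeword (the FSM state), then estimate state transition probabilities from the resulting state sequence. The code below is a minimal sketch under stated assumptions: the codebook is taken as already trained (the text does not say which vector quantization algorithm was used), the example uses three two-dimensional codewords rather than eight classes for brevity, and all names and values are hypothetical.

```python
def nearest_codeword(vector, codebook):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vector, codebook[k]))


def transition_probabilities(states, num_states):
    """Row-normalized counts of transitions between consecutive states."""
    counts = [[0] * num_states for _ in range(num_states)]
    for prev, nxt in zip(states[:-1], states[1:]):
        counts[prev][nxt] += 1
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Hypothetical codebook and per-word output vectors (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
outputs = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.9], [0.8, 0.9], [0.0, 0.1]]

states = [nearest_codeword(v, codebook) for v in outputs]
probs = transition_probabilities(states, num_states=len(codebook))
print(states)
print(probs)
```

Each row of `probs` corresponds to one departure state, which is the layout a state transition table such as Table 8 would naturally take.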

[0040] From Tables 9 and 10, it can be seen that states 1 and 2 are ending states of a sentence or a paragraph. State 4 is a beginning state of a sentence. From Table 11, we find that state 7 is associated with non-ending monosyllabic words. State 0 is associated with polysyllabic words of three or more syllables. Pronouns are also mostly associated with state 0. Some three-syllable words are associated with state 6. From FIG. 15, we can also see that states 5 and 7 follow state 4 with high probability, so they appear very frequently at the beginning of a sentence. State 4 follows states 1 and 2, usually with a PM in between. State 7 always follows state 3 to form an adjective phrase.

[0041] More interpretations of the FSM can be obtained by further exploring the relationship between the states and sentence structure (and prosodic structure), using more text (and speech) and the corresponding encoded state sequences. Based on the observations above, the FSM is confirmed to be linguistically meaningful. Thus, although the proposed prosodic model takes only word-level linguistic features as input, it is an effective model for evaluating the effect of the high-level linguistic features of Chinese on the generation of prosodic information. It is therefore very useful in assisting the synthesis of prosodic information.

[Table 8]

[Table 9]

[Table 10]

[Table 11]

[0042] Finally, a Mandarin TTS system based on pitch-synchronous overlap-add (PSOLA) was also implemented to test the present RNN-based prosody synthesizer subjectively. It uses a set of 411 syllable waveforms extracted from the training data set as its basic synthesis units. Three prosodic parameters, namely the syllable pitch contour, the energy level, and the final duration, are generated by the present RNN-based prosody synthesizer. The remaining prosodic parameter, the inter-syllable pause duration, is set by a few simple rules. Informal listening tests by many native Chinese speakers living in Taiwan confirmed that all of the synthesized speech sounds very natural. Based on this test, we can conclude that the present RNN-based prosody synthesizer performs very well.
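The two-clock structure of the RNN-based prosody synthesizer (a word-synchronous prosodic model whose state output feeds a syllable-synchronous prosodic parameter generator, each with its hidden outputs fed back through a recurrent layer) can be sketched as a forward pass. This is an illustrative sketch, not the patented implementation: the layer sizes, the `tanh` nonlinearity, the random untrained weights, the linear output layer, and all names are assumptions. Training and the additional recurrent layer that feeds back part of the output are omitted.

```python
import math
import random


class RecurrentLayer:
    """Hidden layer whose previous outputs are fed back as extra inputs."""

    def __init__(self, n_in, n_hidden, rng):
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                     for _ in range(n_hidden)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                      for _ in range(n_hidden)]
        self.context = [0.0] * n_hidden  # copy of the previous hidden outputs

    def step(self, x):
        out = []
        for row_in, row_rec in zip(self.w_in, self.w_rec):
            total = sum(w * v for w, v in zip(row_in, x))
            total += sum(w * c for w, c in zip(row_rec, self.context))
            out.append(math.tanh(total))
        self.context = out
        return out


def synthesize(words, word_layer, syllable_layer, w_out):
    """words: list of (word_features, [syllable_features, ...]) pairs."""
    params = []
    for word_features, syllables in words:
        # Prosodic model: advances once per word.
        prosodic_state = word_layer.step(word_features)
        for syl_features in syllables:
            # Parameter generator: advances once per syllable.
            hidden = syllable_layer.step(syl_features + prosodic_state)
            params.append([sum(w * h for w, h in zip(row, hidden))
                           for row in w_out])
    return params


# Hypothetical sizes: 4 word features, 3 syllable features, 3 output parameters.
rng = random.Random(0)
word_layer = RecurrentLayer(4, 5, rng)
syllable_layer = RecurrentLayer(3 + 5, 6, rng)
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(3)]

words = [
    ([1.0, 0.0, 0.0, 1.0], [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]),
    ([0.0, 1.0, 0.0, 0.0], [[0.5, 0.5, 0.5]]),
]
params = synthesize(words, word_layer, syllable_layer, w_out)
print(len(params), "syllables,", len(params[0]), "parameters each")
```

Holding the prosodic state fixed across the syllables of one word is what lets word-level information shape every syllable-level output in this sketch.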

[Brief description of the drawings]

FIG. 1 is a block diagram showing a conceptual model of the process of prosody pronunciation in the human brain.

FIG. 2 is a block diagram showing a more refined version of the conceptual model of FIG. 1.

FIG. 3 is a block diagram showing the RNN prosody synthesizer of the present invention.

FIG. 4 is a block diagram showing the detailed structure of the RNN prosody synthesizer of the present invention shown in FIG. 3.

FIG. 5 is a diagram showing the pitch period as an example of a synthesized sequence. The dotted line represents the synthesized sequence, and the solid line represents the original sequence.

FIG. 6 is a diagram showing the energy level as an example of a synthesized sequence.

FIG. 7 is a diagram showing the final duration of the syllables as an example of a synthesized sequence.

FIG. 8 is a diagram showing the combination of FIGS. 5, 6, and 7.

FIG. 9 is a diagram showing the pitch period as an example of a synthesized sequence. The dotted line represents the synthesized sequence, and the solid line represents the original sequence.

FIG. 10 is a diagram showing the energy level as an example of a synthesized sequence.

FIG. 11 is a diagram showing the final duration of the syllables as an example of a synthesized sequence.

FIG. 12 is a diagram showing the combination of FIGS. 9, 10, and 11.

FIG. 13 is a diagram showing a speech waveform, the synthesized pitch contours of the syllables, and the original pitch contours. The dotted lines represent the synthesized pitch contours, and the solid lines represent the original pitch contours.

FIG. 14 is a diagram, similar to FIG. 11, showing a speech waveform, the synthesized pitch contours of the syllables, and the original pitch contours.

FIG. 15 is a diagram showing the topology of the FSM of the prosodic model.

Continuation of the front page
(72) Inventor: Hwang Shaw-Hwa, Hsinchu, Taiwan, グオリチャオチュンダクスディアンシンゴンチェンダジ (no street number)
(56) References: JP-A-2-5098 (JP, A); JP-A-2-72399 (JP, A); JP-A-2-304493 (JP, A); JP-A-4-298794 (JP, A)

Claims

(57) [Claims]

1. An apparatus for synthesizing prosodic information, based on an artificial neural network, for text-to-speech conversion of a human language, the apparatus comprising an artificial neural network, the neural network comprising:
a prosodic model that receives high-level linguistic features of the human language, operates on a clock synchronized with the words of the human language, and supplies an output representing the prosodic state of the prosodic structure of the words; and
a generator of speech prosodic parameters that receives low-level linguistic features of the human language and the prosodic state output supplied by the prosodic model, operates on a clock synchronized with syllables, and supplies prosodic parameters,
wherein the artificial neural network is used to simulate the mechanism of human prosody pronunciation and is then trained to model the prosodic structure of human speech from real speech.

2. The apparatus for synthesizing prosodic information according to claim 1, wherein the prosodic model comprises:
(a) a first portion of an input layer, including a plurality of storage units that store the high-level linguistic features and generate normalized input values of the high-level linguistic features; and
(b) a first hidden layer, including a plurality of storage units, that receives the normalized input values of the high-level linguistic features from the first portion of the input layer and supplies the prosodic state output, each storage unit of the first portion of the input layer being connected to each storage unit of the first hidden layer by an artificial neural network fiber composed of storage units,
and wherein the generator of speech prosodic parameters comprises:
(c) a second hidden layer, including a plurality of storage units, that receives the low-level linguistic features from a second portion of the input layer and the prosodic state output from the first hidden layer and supplies an output, each storage unit of the second portion of the input layer and each storage unit of the first hidden layer being connected to each storage unit of the second hidden layer by an artificial neural network fiber composed of storage units; and
(d) an output layer that receives the output from the second hidden layer and supplies the prosodic parameters of the speech, the output layer including a plurality of storage units, each storage unit of the second hidden layer being connected to each storage unit of the output layer by an artificial neural network fiber composed of storage units.

3. The apparatus for synthesizing prosodic information according to claim 2, further comprising a first recurrent layer including the same number of storage units as the first hidden layer, wherein each storage unit of the first recurrent layer is connected to each storage unit of the first hidden layer by an artificial neural network fiber composed of storage units, and all of the outputs generated by the first hidden layer through a nonlinear transformation are fed back to the first hidden layer itself as inputs through the first recurrent layer.

4. The apparatus for synthesizing prosodic information according to claim 2, further comprising a second recurrent layer including the same number of storage units as the second hidden layer, wherein each storage unit of the second recurrent layer is connected to each storage unit of the second hidden layer by an artificial neural network fiber composed of storage units, and all of the outputs generated by the second hidden layer through a nonlinear transformation are fed back to the second hidden layer itself as inputs through the second recurrent layer.

5. The apparatus for synthesizing prosodic information according to claim 2, further comprising a third recurrent layer including a plurality of storage units, wherein each storage unit of the second hidden layer is connected to each storage unit of the third recurrent layer by an artificial neural network fiber composed of storage units, part of the output generated by the output layer is nonlinearly transformed, and the transformed signal is fed back to itself as an input through the third recurrent layer.

6. The apparatus for synthesizing prosodic information according to claim 3, wherein the artificial neural network fibers connecting each storage unit of the first portion of the input layer to each storage unit of the first hidden layer, and the artificial neural network fibers connecting each storage unit of the first recurrent layer to each storage unit of the first hidden layer, constitute a simple recurrent neural network.