JP2006227589A

JP2006227589A - Device and method for speech synthesis

Info

Publication number: JP2006227589A
Application number: JP2005376598A
Authority: JP
Inventors: Yumiko Kato; 弓子加藤; Natsuki Saito; 夏樹齋藤; Eiichi Naito; 栄一内藤
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2005-01-20
Filing date: 2005-12-27
Publication date: 2006-08-31

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis device that precisely and naturally reproduces an intended temporal variation in utterance style.

SOLUTION: A text input section 201 receives text with commands, including an utterance style command. The utterance style command specifies temporal variation in utterance style in units of phonemes, morae, or syllables. A markup language analysis section 202 separates the text with commands into the text to be synthesized and the utterance style command. A language processing section 203 performs linguistic analysis of the separated text, specifies the phoneme section for which temporal variation in utterance style is specified, and outputs a phoneme string corresponding to the text. A deformation position/deformation weight determination section 207 selects a conversion function for deforming the standard speech units of the synthesis parameters according to the temporal variation in utterance style. A waveform generation section 214 generates the waveform of the synthesized speech from the synthesis parameters converted by the selected conversion function.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speech synthesizer and a speech synthesis method that realize temporal changes in utterance style based on a text structure for speech synthesis that describes those changes.

Conventional speech synthesis methods that realize temporal changes in utterance style include methods that express partial emphasis, emotional expression, and the like by changing only the prosody of the speech (see, for example, Patent Documents 1, 2, and 3). Demand for synthesized speech with rich expression that changes smoothly over time, including emphasis and emotion, has grown as synthesized speech has come to be used in content production as a means of expression rather than merely a means of conveying information. As one way of adding emotional expression that changes over time, some methods extract the type and intensity of emotion by semantic analysis of the text to be synthesized, set how they change over time, and deform the prosody and speech parameters according to predetermined functions and mixtures of those functions to match the specified change (see, for example, Patent Documents 4 and 5).

Furthermore, because content is distributed and played back on devices that differ in function, method, and capability, markup languages have been created to guarantee a certain level of reproduction accuracy regardless of the playback device. For speech, VoiceXML Ver. 2.0 and SSML (Speech Synthesis Markup Language) Ver. 1.0 have been standardized. Among methods that use a markup language to describe changes in speech over time, some insert tags between the words or characters of the natural-language text to be converted into speech, marking the start and end points of the text range subject to the change and describing the speech attribute, its intensity, the manner of change, and so on (see, for example, Patent Documents 2 and 3). FIG. 20 shows the conventional text structure for speech synthesis described in Patent Document 3, which describes temporal changes in speech, and FIG. 21 is a block diagram showing the configuration of a speech synthesizer that takes text such as that of FIG. 20 and generates speech from it. FIG. 20 instructs that, when the text "On the 9th, when a winter pressure pattern set in, transportation in western Japan, including the Kinki region, was disrupted by snow" is synthesized, the sentence should begin in a "happy" voice, change gradually through the sentence, and end in an "angry" voice. This is indicated by the tag <morphing type = "express" start = "happy" end = "angry">. Here, morphing type = "express" indicates that the deformation applied to the voice concerns expression, start = "happy" indicates that the sentence begins in a "happy" voice, and end = "angry" indicates that it ends in an "angry" voice. In FIG. 21, the text input unit 104 receives such tagged text 103, and the text analysis unit 105 analyzes which parts of the tagged text 103 are instructions and which are not. The tag analysis unit 106 analyzes the markup-language description in the tagged text 103, and the tag attribute analysis unit 107 checks the consistency between tags and then interprets the instructions they give. Next, the language processing unit 108 refers to the language dictionary 110 and performs linguistic analysis of the text to be converted into speech, with the tags removed. The speech synthesis unit 109 then refers to the prosody/waveform dictionary and, based on the tag instructions interpreted by the tag attribute analysis unit 107, deforms the prosody (pitch, duration, and intensity) and synthesizes the speech. Between the start and end points, the amount of deformation is set by an interpolation function defined as a function of the actual time length of the speech.

Patent Document 1: JP H11-202884 A (pages 13-14, FIG. 10)
Patent Document 2: JP 2002-91474 A (pages 6-7, FIG. 2)
Patent Document 3: JP 2003-295882 A (pages 3-6, FIGS. 1 to 5)
Patent Document 4: JP H05-307396 A (pages 3-4, FIG. 3)
Patent Document 5: JP 2003-233388 A (pages 6-8, FIGS. 1, 3, and 4)

However, such markup languages describe temporal change as a relationship between time and prosodic parameters, and the time is given in real time. FIG. 22(a) shows an example of a markup language that specifies changes in prosody against the real time of the speech produced from the target text range. In a real-time description such as that of VoiceXML Ver. 2.0 in FIG. 22(a), the prosody is instructed to raise the pitch of the voice by 20 Hz after 0.1 seconds and by a further 10 Hz after 0.75 seconds. However, because the duration of the synthesized speech depends on the internal operation and standard data of each speech synthesizer, the phonemes being uttered at 0.10 seconds and 0.75 seconds from the start of speech differ between speech synthesizer A and speech synthesizer B in FIG. 23. For example, at 0.10 seconds synthesizer A has just begun the sound "ha" (は), whereas synthesizer B is still near the end of the sound "o" (お). The relationship between a time position specified in the markup language and positions in the phoneme string or word string thus differs from synthesizer to synthesizer, and in some cases the specified time position falls after the speech has ended. Reproducibility across speech synthesizers is therefore poor.

FIG. 22(b) shows an example in which, to describe temporal change in prosody, each time position is given as a ratio, measured from the beginning, of the duration of the speech produced from the target text range. However, as FIG. 23 shows, the duration of each phoneme differs from synthesizer to synthesizer just as in the real-time case, so the phoneme being uttered at the 40% position differs between speech synthesizer A and speech synthesizer B. For example, at the 40% position synthesizer A has just begun the sound "go" (ご), whereas synthesizer B is in the middle of the long-vowel mark "ー" of "yō" (よー). When prosody changes are specified in a markup language in this way, the utterance style cannot be controlled at the time position corresponding to the intended phoneme.
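The mismatch described for FIG. 22 and FIG. 23 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the utterance is taken to be おはよーございます (inferred from the morae the text mentions), the per-mora durations are invented so that two hypothetical synthesizers behave as the text describes, and `unit_at` and `unit_at_ratio` are hypothetical helper names.

```python
from itertools import accumulate

def unit_at(units, durations_ms, t_ms):
    """Return the unit sounding t_ms after speech onset, or None if speech has ended."""
    for unit, end in zip(units, accumulate(durations_ms)):
        if t_ms < end:
            return unit
    return None

def unit_at_ratio(units, durations_ms, ratio):
    """Same lookup, with the position given as a fraction of the total duration."""
    return unit_at(units, durations_ms, ratio * sum(durations_ms))

# Assumed utterance, split into morae.
MORAE = ["お", "は", "よ", "ー", "ご", "ざ", "い", "ま", "す"]
# Invented per-mora durations in milliseconds for two hypothetical synthesizers.
DUR_A = [90, 100, 100, 100, 100, 120, 120, 120, 150]   # total 1000 ms
DUR_B = [120, 120, 120, 120, 110, 110, 110, 110, 130]  # total 1050 ms

for name, dur in (("A", DUR_A), ("B", DUR_B)):
    print(name, unit_at(MORAE, dur, 100), unit_at_ratio(MORAE, dur, 0.40))
# prints "A は ご" then "B お ー"
```

With these numbers the same 0.10-second and 40% positions land on different morae in the two synthesizers, whereas an index into `MORAE` (for example `MORAE[4]`, which is ご) names the same mora in both.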

Similarly, if a time position of 30% is specified with the intention of marking the beginning of a particular morpheme in the sentence, the phoneme duration data of some synthesizers will differ from what was assumed when the markup was written, and the 30% position may fall at the end of the intended morpheme instead; again, reproducibility across synthesizers is low. In other words, there has been no way for the person writing the markup to describe the intended temporal change in utterance style in terms that are natural to the writer, such as phonemes and words, as a reproduction of the writer's own mental image of the utterance, and no speech synthesizer that reproduces such a description. In addition, markup languages place tags between words in languages such as European languages, and between characters in languages written like Japanese or Chinese. It is therefore hard to set the start or end point of utterance style control in the middle of a long polysyllabic word in a European language, or in the middle of a single kanji read with several syllables or morae in Japanese, so control at units smaller than a word or character is difficult to specify. For example, for the character 掌 read "tanagokoro" (たなごころ), it was not possible to instruct that only "gokoro" (ごころ), a part of the reading, be emphasized, raised in fundamental frequency, or changed in voice quality.

Furthermore, in some European languages the relationship between the written text and its pronunciation is complex, which makes it difficult to place a tag at the character position corresponding to a phoneme or syllable boundary within a word. Thus the person writing the markup cannot accurately describe the intended temporal change in utterance style in natural terms such as phonemes and words, as a reproduction of the writer's own image of the utterance, and when the speech synthesizer differs, the intended temporal change cannot be reproduced.

The present invention solves the conventional problems described above. Its object is to provide a text structure for speech synthesis that allows temporal changes in the utterance style of synthesized speech to be reproduced accurately regardless of differences in the standard data and operation of individual speech synthesizers, and a speech synthesizer and speech synthesis method that faithfully reproduce the temporal changes in utterance style specified with that text structure.

To solve the conventional problems, the speech synthesizer of the present invention takes text with commands as input and synthesizes speech that reads the text aloud. It comprises: separation means for separating the text with commands into (1) the text to be synthesized into speech and (2) an utterance style command that specifies, in units of one of phonemes, morae, or syllables, a temporal change in utterance style, that is, in the manner of expression of the speech synthesized from the text; language processing means for performing linguistic analysis of the separated text and outputting a phoneme string representing the text, expressed at least in whichever of the phoneme, mora, or syllable units the utterance style command uses to specify the temporal change; section specifying means for identifying the unit specified by the utterance style command and specifying, in the identified unit, the section of the output phoneme string for which the utterance style command specifies the temporal change in utterance style; and speech synthesis means for synthesizing, in the specified section, speech uttered in accordance with the temporal change in utterance style.

With this configuration, by using a markup language, the phoneme section for which the utterance style command specifies a temporal change in utterance style can be identified exactly, regardless of the phoneme durations that each speech synthesizer sets as its standard, so synthesized speech that accurately reproduces the temporal change in utterance style can be generated. Moreover, because the utterance style command can specify the temporal change in units of phonemes, morae, or syllables, the speech synthesizer of the present invention can generate natural synthesized speech in which the utterance style changes more smoothly.

The present invention can be realized not only as such a speech synthesizer but also as a speech synthesis method whose steps correspond to the characteristic means of the speech synthesizer, or as a program that causes a computer to execute those steps. Needless to say, such a program can be distributed via a recording medium such as a CD-ROM or a transmission medium such as the Internet.

According to the text structure for speech synthesis of the present invention, time positions in a description of temporal change in utterance style are specified in phonetic or linguistic units. This prevents the utterance style from being controlled at a phonetic or linguistic position (a phoneme, word, or the like) other than the one intended when the markup was written, which would otherwise occur because phoneme durations differ between speech synthesizers as a result of differences in their standard data and operation. The result is a way of describing temporal changes in utterance style that is independent of the data and operation of individual synthesizers and is reproduced with high accuracy. Specifying temporal change in phonetic or linguistic units also defines a clear correspondence with the natural-language character string, which is difficult with real-time or time-ratio specifications, so the writer can define the intended temporal change in terms of his or her own image of the utterance and specify it more intuitively. In addition, the start and end points of a temporal change can be specified in units of phonemes, morae, or syllables, which are smaller than the written words or characters between which tags can be inserted. For markup written in this way, the speech synthesizer and speech synthesis method of the present invention reproduce the utterance style and control its temporal change by selecting and deforming utterance style conversion functions, by converting the utterance style space through statistical learning, and by mixing (morphing) speech synthesis parameters or speech analysis parameters using corresponding points prepared for each segment. This allows fine control over the characteristics of the speech, including spectral information as well as prosody, so that the described temporal change in utterance style is reproduced with high accuracy.

Embodiments of the present invention are described below with reference to the drawings.

(Embodiment 1)
FIG. 1 is a functional block diagram of the speech synthesizer according to Embodiment 1 of the present invention. FIGS. 2(a) and 2(b) show an example of markup language tags attached to text. FIG. 2(a) shows markup instructing that the text "あらゆる現実をすべて自分の方へ捻じ曲げたのだ" (roughly, "twisted every reality around toward oneself") be synthesized so that, from the beginning of the text through the 14th mora, the emotion of the voice changes from anger level "5" to anger level "1". FIG. 2(b) writes the mixed kanji-kana text of FIG. 2(a) out in morae and shows concretely how portions of the synthesized speech correspond to the deformation of emotional expression. FIG. 3 is a flowchart showing the operation of the speech synthesizer of Embodiment 1. FIG. 4 schematically shows the processing performed by the speech synthesizer of Embodiment 1.

In FIG. 1, the speech synthesizer of Embodiment 1 specifies, in units of morae, the positions of the phonemes whose prosody is to be changed. It comprises a text input unit 201, a markup language analysis unit 202, a language processing unit 203, a dictionary 204, a prosody control unit 205, a standard prosody pattern database 206, a deformation position/deformation weight determination unit 207, a conversion function selection unit 208, a conversion function database 209, a conversion function parameter setting unit 210, a segment selection unit 211, a standard segment database 212, a synthesis parameter generation unit 213, and a waveform generation unit 214.

The text input unit 201 accepts tagged text written according to the rules of a defined markup language.

The markup language analysis unit 202 analyzes the tagged text received by the text input unit 201 and separates it into the tag portions and the natural-language portion outside the tags (for example, mixed kana-kanji text). It outputs the instruction information carried by the tags to the deformation position/deformation weight determination unit 207. It also outputs to the language processing unit 203, together with the natural-language portion, information on where each tag was inserted (for example, information that a tag stood before the "あ" of the text "あらゆる現実をすべて自分の方へ捻じ曲げたのだ" shown in FIG. 2(a)).
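As a rough illustration of this separation step, the following Python sketch strips tags from a tagged string and records, for each tag, the character offset in the tag-free text at which it stood. The tag name and attributes used here (`style`, `unit`, `attr`, `from`, `to`, `length`) are hypothetical, since the actual tag syntax of FIG. 2(a) is not reproduced in this text, and `separate` and `TagInfo` are illustrative names.

```python
import re
from dataclasses import dataclass

TAG = re.compile(r"<(/?)(\w+)([^<>]*)>")      # opening or closing tag
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')   # name="value" pairs

@dataclass
class TagInfo:
    name: str
    closing: bool
    attrs: dict
    char_pos: int  # offset in the tag-free text where the tag stood

def separate(tagged):
    """Split tagged text into (plain text, list of TagInfo with positions)."""
    plain, tags, last, offset = [], [], 0, 0
    for m in TAG.finditer(tagged):
        chunk = tagged[last:m.start()]
        plain.append(chunk)
        offset += len(chunk)
        tags.append(TagInfo(m.group(2), m.group(1) == "/",
                            dict(ATTR.findall(m.group(3))), offset))
        last = m.end()
    plain.append(tagged[last:])
    return "".join(plain), tags

# Hypothetical markup: anger weight 5 -> 1 over the first 14 morae.
tagged = ('<style unit="mora" attr="anger" from="5" to="1" length="14">'
          'あらゆる現実をすべて自分の方へ捻じ曲げたのだ</style>')
text, tags = separate(tagged)
print(text)
print(tags[0].char_pos, tags[0].attrs)
```

The opening tag is reported at offset 0, that is, before the "あ", which is the kind of position information the unit passes on to the language processing unit 203.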

The language processing unit 203 receives (1) the natural-language portion outside the tags and (2) the tag insertion position information, both generated by the markup language analysis unit 202, and performs linguistic analysis of the natural-language portion with reference to the dictionary 204. This yields the number of morae in the natural-language portion. The unit then outputs a phoneme string representing the reading, with the tag insertion positions mapped onto it; prosody designation information representing accents, accent phrase boundaries, pause positions, and the like; and linguistic information representing the part of speech of each morpheme, the dependency relations between phrases, and the like.
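Once a kana reading is available, counting morae is mechanical. The sketch below is a simplified illustration, not the dictionary-based analysis of the language processing unit 203: it assumes the reading has already been obtained (the hiragana string is an assumed reading of the FIG. 2(a) text) and uses the common rule that small ya/yu/yo and small vowels attach to the preceding kana, while ん, っ, and ー each count as one mora. `split_morae` is a hypothetical helper name.

```python
# Small kana that merge with the preceding kana into a single mora.
SMALL = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")

def split_morae(kana):
    """Split a kana reading into morae (ん, っ and ー count as one mora each)."""
    morae = []
    for ch in kana:
        if ch in SMALL and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae

# Assumed reading of "あらゆる現実をすべて自分の方へ捻じ曲げたのだ".
reading = "あらゆるげんじつをすべてじぶんのほうへねじまげたのだ"
morae = split_morae(reading)
print(len(morae), morae[13])  # prints "26 ぶ"
```

Under this assumed reading the 14th mora is "ぶ", which agrees with the description of FIG. 2 elsewhere in this document.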

The dictionary 204 is a database that stores morpheme readings, parts of speech, accents, accent combination rules, and the like.

The prosody control unit 205 receives the phoneme string with its associated tag insertion positions, the prosody designation information, and the linguistic information generated by the language processing unit 203. Based on these and with reference to the standard prosody pattern database 206, it generates the fundamental frequency, amplitude, phoneme durations, and pause durations for the phoneme string and outputs them in association with the tag insertion positions.

The deformation position/deformation weight determination unit 207 receives the tag insertion positions associated with the phoneme string and the phoneme and pause durations generated by the prosody control unit 205, together with the tag instruction information output by the markup language analysis unit 202. It analyzes the utterance style specification from the tag instructions, the tag insertion positions, and the phoneme string, and determines which utterance style applies to which section of the phoneme string. It further determines the section of the phoneme string over which the utterance style is to be deformed and the deformation weights for that section.
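One way to picture the weight determination is the sketch below, which follows the outline given in the operation description (steps S2007 to S2009): the section is chosen by mora index, its extent in real time is computed from the durations, and the weight is interpolated linearly over that time. Several details are assumptions made for illustration only: the section is taken to run from the onset of the first mora to the end of the last, each mora is treated as one segment selection unit, and the function name and signature are hypothetical.

```python
def interpolate_weights(durations, first, last, w_start, w_end):
    """Linearly interpolate a style weight over real time.

    durations: per-mora durations in seconds (all positive).
    first, last: 0-based inclusive mora indices of the deformation section.
    Returns (weights at mora centers, weights at mora boundaries).
    """
    # Onset time of every mora, plus the end time of the last one.
    starts = [0.0]
    for d in durations:
        starts.append(starts[-1] + d)
    t0, t1 = starts[first], starts[last + 1]

    def weight(t):
        return w_start + (w_end - w_start) * (t - t0) / (t1 - t0)

    centers = [weight((starts[i] + starts[i + 1]) / 2)
               for i in range(first, last + 1)]
    joins = [weight(starts[i]) for i in range(first, last + 2)]
    return centers, joins

# Anger weight 5 -> 1 over the first 14 morae, with made-up equal durations.
centers, joins = interpolate_weights([0.1] * 26, 0, 13, 5.0, 1.0)
print([round(w, 2) for w in centers])
```

Because the section boundaries are given as mora indices, a synthesizer with different durations would compute different times `t0` and `t1` but would still apply the change to the same morae.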

The conversion function selection unit 208 selects conversion functions from the conversion function database 209 according to the utterance style assigned to the phoneme string by the deformation position/deformation weight determination unit 207. The conversion functions convert prosody and spectral information for representative or basic utterance styles, that is, for attributes that change paralinguistic expression according to the circumstances of the utterance, such as the speaker, voice quality, emotion, and the relationship with the conversation partner. They are generated in advance from real speech by taking, segment by segment, the differences between corresponding points of prosody and spectrum. More specifically, the unit extracts the conversion functions corresponding to the phoneme string from the conversion function database 209, where they are stored together with phonemes, morphemes, prosody designation information, and the like. For example, it extracts a conversion function that, with the parameters set by the conversion function parameter setting unit 210 described below, gives the sound "あ" at the start of "あらゆる現実を..." in FIG. 2(a) the expression anger "5".

The conversion function parameter setting unit 210 sets the parameters of the utterance style conversion function that the conversion function selection unit 208 extracted for each phoneme, using the deformation section and the deformation weights within that section generated by the deformation position/deformation weight determination unit 207.

Meanwhile, the segment selection unit 211 extracts the speech synthesis parameter segments corresponding to the phoneme string from the standard segment database 212 described below, based on the fundamental frequency, amplitude, and phoneme durations generated for the phoneme string by the prosody control unit 205 and on the linguistic information generated by the language processing unit 203.

The standard segment database 212 stores, for each phoneme, speech synthesis parameters generated from real speech, along with attributes such as phoneme environment, fundamental frequency, amplitude, phoneme duration, and linguistic information.

The synthesis parameter generation unit 213 concatenates the speech synthesis parameter segments extracted by the segment selection unit 211 and converts the speech synthesis parameters of each phoneme with that phoneme's conversion function, whose parameters were set by the conversion function parameter setting unit 210. It thereby generates a continuous sequence of speech synthesis parameters to which the utterance style deformation has been applied.
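The per-phoneme conversion and concatenation can be pictured with the toy sketch below. The form of the conversion is an assumption made purely for illustration: it treats a conversion function as an additive offset (a difference between styled and standard parameters) scaled linearly by weight divided by a maximum weight of 5. The document does not specify the functional form, and all names and numbers here are hypothetical.

```python
def convert_segment(params, delta, weight, max_weight=5.0):
    """Shift standard parameters toward the styled ones in proportion to the weight.

    Assumed form (not from the source): converted = params + (weight / max_weight) * delta.
    """
    scale = weight / max_weight
    return [p + scale * d for p, d in zip(params, delta)]

def generate_sequence(segments, deltas, weights):
    """Convert each segment with its own weight and concatenate the results."""
    sequence = []
    for params, delta, weight in zip(segments, deltas, weights):
        sequence.extend(convert_segment(params, delta, weight))
    return sequence

# Two made-up segments of two parameter values each (for example, F0 in Hz).
segments = [[100.0, 200.0], [110.0, 210.0]]
deltas = [[50.0, -20.0], [50.0, -20.0]]
print(generate_sequence(segments, deltas, [5.0, 2.5]))
# prints [150.0, 180.0, 135.0, 200.0]
```

In this toy model a weight of 5 applies the full offset and a weight of 0 leaves the standard parameters unchanged, so a weight that falls over the section produces a gradual return toward the standard voice.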

The waveform generation unit 214 generates and outputs a speech waveform based on the sequence of speech synthesis parameters generated by the synthesis parameter generation unit 213.

Next, the operation of the speech synthesizer configured as described above is described in detail. The text input unit 201 accepts, as input text, tagged text written in a markup language as shown in FIG. 2(a). The tagged text of FIG. 2(a) applies to the text 「あらゆる現実をすべて自分の方へ捻じ曲げたのだ」 (arayuru genjitsu o subete jibun no hō e nejimageta no da, roughly "he twisted every reality toward himself"). As shown in FIG. 2(b), it specifies that the text begins in an angry utterance style of weight 5 and that the style changes gradually over 14 morae, starting from the first phoneme, so that the angry utterance style has weight 1 at the 14th mora, that is, at the 「ぶ」 (bu) of 「自分の」 (jibun no).

First, the text input unit 201 receives the tagged text of FIG. 2(a) (S2001). The markup language analysis unit 202 identifies the tags in the input text (S2002) and separates the input into natural-language text with tag position information and the utterance style instructions given by the tags (S2003). The language processing unit 203 performs morphological analysis with reference to the dictionary 204, which stores morpheme readings, parts of speech, accents, accent combination rules, and the like. It then performs syntactic analysis on the morpheme structure and generates the phoneme string and linguistic information corresponding to the input natural-language text. Based on the phoneme string and linguistic information, it further generates prosody designation information such as accents, accent phrase boundary probabilities, and pause probabilities (S2004). The language processing unit 203 also records each tag position at the position in the phoneme string that corresponds to the tag position in the input text (S2005).

The prosody control unit 205 uses the phoneme string, the prosody designation information, and the linguistic information as attributes, and sets the standard phoneme durations and pause durations for the input phoneme string by means of functions whose parameters are set in advance for each attribute. It then extracts standard prosody patterns of fundamental frequency and amplitude from the standard prosody pattern database 204 according to those attributes, and modifies them based on the attributes to generate the standard fundamental frequency pattern and standard amplitude pattern corresponding to the input phoneme string (S2006).

FIG. 4 shows an example of the mora-unit deformation section determined by the deformation position/deformation weight determination unit 207 from the utterance style instructions given by the tags. Based on the standard phoneme durations and pause durations generated by the prosody control unit 205 for the phoneme string, the tag positions, and the tag instruction information separated by the markup language analysis unit 202, the deformation position/deformation weight determination unit 207 sets the deformation section in units of morae as shown in FIG. 4, and calculates the real-time length of the deformation section from its phoneme durations (S2007). It then linearly interpolates the utterance style weight in real time (S2008). From the weights interpolated in step S2008, it uses the phoneme durations to calculate the utterance style weight at the center point of each segment selection unit and at each segment connection point (S2009).

The conversion function selection unit 208 extracts from the conversion function database 209, for each segment selection unit, the conversion function best suited to converting the segments used to generate standard speech. The selection is based on the fundamental frequency pattern, amplitude pattern, and phoneme durations generated by the prosody control unit 205 in step S2006, and on the deformation sections set by the deformation position/deformation weight determination unit 207 in step S2007 together with the utterance style specified for each section. Meanwhile, the unit selection unit 211 extracts the speech synthesis parameter segments of the speech to be synthesized from the standard unit database 212 according to the fundamental frequency pattern, amplitude pattern, phoneme durations, and phoneme string generated in step S2006 (S2010).

The conversion function parameter setting unit 210 takes the conversion function selected for each segment by the conversion function selection unit 208 in step S2010 and sets its parameters at the center of the segment selection unit and at the segment connection points, based on the utterance style weights calculated for those points in step S2009 (S2011). The speech parameter generation unit 213 then applies the per-segment conversion functions parameterized in step S2011 to two inputs: the speech synthesis parameters selected by the unit selection unit 211 in step S2010, which generate standard speech for the phoneme string, and the fundamental frequency pattern, amplitude pattern, phoneme durations, and pause durations generated in step S2006. This yields a continuous sequence of speech synthesis parameters that realizes the temporal change of utterance style specified by the tags (S2012). The waveform generation unit 214 synthesizes a speech waveform according to the speech synthesis parameters generated in step S2012 (S2013).
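Steps S2007 to S2009 can be illustrated with a minimal sketch. It assumes that the deformation section starts at time 0, that it ends at the end of the designated mora, and that each mora is treated as one segment; the function names and the uniform 0.25 s durations are hypothetical and are not taken from the specification.

```python
def section_end_time(mora_durations, end_mora):
    """S2007 (sketch): real-time length of a section ending with the
    end_mora-th mora (1-indexed), assuming the section starts at t = 0."""
    return sum(mora_durations[:end_mora])


def interpolate_weight(t, t_end, w_start, w_end):
    """S2008 (sketch): linear interpolation of the style weight in real time.
    After the section ends, the final weight is held."""
    if t <= 0.0:
        return float(w_start)
    if t >= t_end:
        return float(w_end)
    return w_start + (w_end - w_start) * (t / t_end)


def segment_weights(segment_durations, t_end, w_start, w_end):
    """S2009 (sketch): weight at the center and at the trailing connection
    point of each segment, read off the interpolated real-time curve."""
    weights = []
    t = 0.0
    for d in segment_durations:
        center = interpolate_weight(t + d / 2, t_end, w_start, w_end)
        boundary = interpolate_weight(t + d, t_end, w_start, w_end)
        weights.append((center, boundary))
        t += d
    return weights


# Hypothetical durations: 16 morae of 0.25 s each; anger weight 5 -> 1 over 14 morae.
mora_durations = [0.25] * 16
t_end = section_end_time(mora_durations, 14)
weights = segment_weights(mora_durations, t_end, 5, 1)
```

Because the interpolation runs over real time rather than over the mora index, morae with longer durations would receive a larger share of the weight change, which is the behavior the description attributes to steps S2007 and S2008.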

With this configuration, the temporal change of utterance style given by the tags is set after the reading, that is, the phoneme string and the prosody, has been determined. Tagged text whose time positions are written in speech units such as morae is therefore handled as follows. In steps S2008 and S2009, the deformation position/deformation weight determination unit 207 obtains the duration of each phoneme corresponding to the reading of the text generated by the prosody control unit 205, maps the utterance style deformation instructions written in mora units onto the phoneme string laid out on the real-time axis, and interpolates the deformation weights in real time. A deformation weight is thus set for each phoneme. In step S2011, the conversion function parameter setting unit 210 sets the parameters of the conversion function selected for each segment by the conversion function selection unit 208 in accordance with the per-phoneme deformation weights, so that the change of deformation weight in real time is realized.

As a result, a time position written in speech units such as morae is assigned to exactly the designated speech unit, even when the phoneme duration data differ from one speech synthesizer to another. The time positions are converted to the real-time axis according to each synthesizer's phoneme duration settings and then interpolated. The utterance style weights set in real time are looked up according to the phoneme-based unit of function selection, a deformation weight is set for each function selection unit, and the parameters of the conversion function that converts the utterance style are set according to that weight and applied. In this way, the speech synthesis parameters of standard speech are converted into smooth speech synthesis parameters whose utterance style changes gradually in real time.

A temporal change of utterance style written as a tag can therefore accurately reproduce the phoneme position intended when it was written, even though that position is specified in a speech unit smaller than one character, where a tag cannot be inserted. Moreover, the conversion functions, which are generated in advance per speech segment for the utterance style specified in the tag, include conversion of spectral information. By controlling their parameters, the gradual change of utterance style is controlled in the spectral information as well as in the prosody, so the temporal change of utterance style specified by the tag is reproduced with high accuracy.

Thus, according to the speech synthesizer of Embodiment 1 of the present invention, the position in the phoneme string at which the prosody is to change is indicated in terms of morae, so exactly the same phoneme string can be designated for a prosody change regardless of the speech synthesizer. Because the change can be specified in morae, fine adjustment is possible even when the phoneme string is short. Furthermore, the range of the phoneme string whose prosody is to change is identified precisely, the real time needed to utter that range is measured, and the change is spread evenly over the measured time. The prosody change therefore sounds smoother than when the prosody is changed phoneme by phoneme.

In Embodiment 1 above, the position in the phoneme string at which the prosody changes is specified in terms of morae, but it may instead be specified using a phonological unit other than the mora. For example, the range of the phoneme string whose prosody changes may be specified in units of syllables or phonemes. FIGS. 5(a), 5(b), and 5(c) are schematic diagrams of a different form of the text structure for speech synthesis in Embodiment 1 and of the instructions given by its tags. FIG. 5(a) shows an example of tagged text that gives a prosody change instruction similar to that of the tagged text in FIG. 2(a) but counts the change range in syllables. FIG. 5(b) represents the kanji-kana text of FIG. 5(a) in syllables and shows concretely how the phoneme portions of the synthesized speech correspond to the deformation of the emotional expression. FIG. 5(c) graphs the temporal change in the weight of the changing emotional expression (utterance style).

In syllables, the text 「あらゆる現実をすべて自分の方へ捻じ曲げたのだ」 of FIG. 5(a) is divided as shown in katakana in FIG. 5(b), because 「ん」 (n) and 「を」 (o) are each pronounced as a single sound together with the preceding sound. The sixth syllable is therefore not the 「ン」 (n) shown in FIG. 2(b) but the 「ジ」 (ji) of 「現実」 (genjitsu). Consequently, when the change range is specified as six syllables, the utterance style weight is linearly interpolated in real time over the speech up to 「アラユルゲンジ」 (arayurugenji).
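The difference between counting in morae and counting in syllables can be checked with a small sketch. The mora segmentation below covers only the beginning of the example text, and the merging rule (attach 「ン」 and 「ヲ」 to the preceding mora) simply follows the description above; the function name and the `attaching` parameter are hypothetical.

```python
# Morae for the start of the example text (arayuru genjitsu o subete jibun).
MORAE = ["ア", "ラ", "ユ", "ル", "ゲ", "ン", "ジ", "ツ", "ヲ",
         "ス", "ベ", "テ", "ジ", "ブ", "ン"]


def to_syllables(morae, attaching=frozenset({"ン", "ヲ"})):
    """Group morae into syllables by attaching the listed morae
    to the mora that precedes them (assumption based on the text)."""
    syllables = []
    for mora in morae:
        if mora in attaching and syllables:
            syllables[-1] += mora
        else:
            syllables.append(mora)
    return syllables


syllables = to_syllables(MORAE)
sixth_mora = MORAE[5]                      # "ン"
sixth_syllable = syllables[5]              # "ジ"
range_in_syllables = "".join(syllables[:6])
```

The same count of six therefore ends at different sounds depending on the unit named in the tag, which is why the tag has to state the unit explicitly.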

The range of the phoneme string whose prosody changes may also be specified in units of phonemes. FIGS. 6(a), 6(b), and 6(c) are schematic diagrams showing a different form of the text structure for speech synthesis in Embodiment 1 and the instructions given by its tags. FIG. 6(a) shows an example of tagged text that gives a prosody change instruction similar to that of the tagged text in FIG. 2(a) but counts the change range in phonemes. In FIG. 6(a), however, the range of the phoneme string whose prosody changes extends to the 13th phoneme. FIG. 6(b) represents the kanji-kana text of FIG. 6(a) in phonemes and shows concretely how the phoneme portions of the synthesized speech correspond to the deformation of the emotional expression. FIG. 6(c) graphs the temporal change of the emotional expression. As shown in FIGS. 6(b) and 6(c), when the change range ends at the 13th phoneme, the "ts" of "arayurugenjits" is counted as the 13th phoneme. As a result, the prosody of the speech up to "arayurugenjits" in 「あらゆる現実」 (arayuru genjitsu) is synthesized so that it changes smoothly in real time from anger weight 5 to weight 1, and weight 1 is maintained thereafter.
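The phoneme count in this example can be reproduced with a short sketch. It assumes a romanized reading and treats "ts" as a single phoneme, as the description does; the digraph list and the function name are hypothetical.

```python
def split_phonemes(romaji, digraphs=("ts", "sh", "ch")):
    """Split a romanized reading into phonemes, treating the listed
    two-letter sequences as single phonemes (illustrative assumption)."""
    phonemes = []
    i = 0
    while i < len(romaji):
        if romaji[i:i + 2] in digraphs:
            phonemes.append(romaji[i:i + 2])
            i += 2
        else:
            phonemes.append(romaji[i])
            i += 1
    return phonemes


phonemes = split_phonemes("arayurugenjitsu")
thirteenth = phonemes[12]                  # 13th phoneme, 1-indexed
covered = "".join(phonemes[:13])           # range covered by a 13-phoneme tag
```

Under these assumptions the 13th phoneme is "ts" and the covered range is "arayurugenjits", matching FIG. 6(b) as described.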

As described above, according to the speech synthesizer of Embodiment 1, the temporal change of utterance style is written as a tag, and the phoneme position intended at the time of writing can be specified in a speech unit smaller than one character (for example, a phoneme, mora, or syllable), where a tag itself cannot be inserted. Compared with a conventional speech synthesizer that expresses the temporal change of utterance style in units such as words, this has two effects: (1) the phoneme position used to specify the temporal change of utterance style matches the position intended by the user more accurately, and (2) the temporal change of utterance style can be expressed more smoothly and naturally.

(Embodiment 2)
In Embodiment 1 above, the phoneme string is deformed for only one emotional expression, "anger". Embodiment 2 of the present invention describes an example in which the deformations for two emotional expressions are mixed to deform the phoneme string and synthesize speech. In addition, whereas Embodiment 1 described specifying the prosody change range by mora, syllable, or phoneme, Embodiment 2 describes specifying the range of the phoneme string in units of accent phrases. FIG. 7 shows an example of markup language tags attached to text in Embodiment 2. FIG. 7(a) shows an example in which one tagged text carries instructions for two prosody changes and the change ranges are counted in accent phrases. FIG. 7(b) represents the kanji-kana text of FIG. 7(a) in accent phrases (acphrase) and shows concretely how the phoneme portions of the synthesized speech correspond to the deformation of the emotional expressions. FIG. 7(c) graphs the temporal change of each weight when the weights are interpolated as in FIG. 7(b) and the two emotional deformations are mixed. FIG. 8 is a flowchart showing the operation of the speech synthesizer of Embodiment 2. In FIG. 8, operation steps identical to those in FIG. 3 are given the same reference symbols and their description is omitted.

The configuration of the speech synthesizer of Embodiment 2 is the same as that shown in FIG. 1, so its description is omitted.
The operation of the speech synthesizer of Embodiment 2 is described in detail below. The text input unit 201 accepts tagged text in the markup language shown in FIG. 7(a) as input text. For the text 「あらゆる現実をすべて自分の方へ捻じ曲げたのだ」, the tagged text of FIG. 7(a) specifies the following, as shown in FIG. 7(b). The start of the first accent phrase, 「あらゆる」 (arayuru), is rendered in an angry utterance style of weight 5. The anger weight changes gradually from the first accent phrase through the third accent phrase, 「すべて」 (subete), and is 2 at the end of the third accent phrase. In addition, the start of the second accent phrase, 「現実を」 (genjitsu o), is rendered in a laughing utterance style of weight 0. The laughter weight changes gradually from the second accent phrase through the fifth, and is 3 at the end of the fifth accent phrase, 「捻じ曲げたのだ」 (nejimageta no da). In other words, between the second and fifth accent phrases, a complex change of expression is specified in which the angry and laughing utterance styles are mixed with changing weights.

For such input, the processing up to step S2006 in the flowchart of FIG. 8 is the same as in Embodiment 1, so only the subsequent operation is described.

The deformation position/deformation weight determination unit 207 works from the standard phoneme durations and pause durations generated by the prosody control unit 205 in step S2006 for the phoneme string, the tag positions, and the tag instruction information separated by the markup language analysis unit 202. As shown in FIG. 7(b), for the temporal changes of utterance style specified in accent phrase units, it sets the start of each change to the first mora of the designated accent phrase and the end of the change to the last mora of the designated accent phrase. That is, the anger weight starts at 5 on the 「あ」 (a) of 「あらゆる」, changes to 2 at the 「て」 (te) of 「すべて」, and then stays at 2 until the end of the sentence, where the closing tag is located. The laughter weight, meanwhile, changes from 0 at the 「げ」 (ge) of 「現実を」 to 3 at the 「だ」 (da) of 「捻じ曲げたのだ」. From the 「げ」 of 「現実を」 to the 「て」 of 「すべて」, both the anger weight and the laughter weight change over time. From the 「じ」 (ji) of 「自分の方へ」 onward, the deformation sections and utterance styles are set so that a laughing style of gradually changing weight is mixed with an angry style of constant weight. The real-time length of each deformation section is calculated from the phoneme durations within it (S2107), and the utterance style weight is linearly interpolated in real time for each utterance style (S2108). From the weights interpolated in step S2108, the utterance style weight at the center point of each segment selection unit and at each segment connection point is calculated for each utterance style using the phoneme durations (S2109).

The conversion function selection unit 208 extracts from the conversion function database 209, for each segment selection unit, the conversion function best suited to converting the segments used to generate standard speech. The selection is based on the fundamental frequency pattern, amplitude pattern, and phoneme durations generated by the prosody control unit 205 in step S2006, and on the deformation sections set by the deformation position/deformation weight determination unit 207 in step S2107 together with the utterance style specified for each section. For segments corresponding to phonemes in a section where multiple utterance styles are specified, the applicable conversion function is extracted for every specified utterance style. Meanwhile, the unit selection unit 211 extracts the speech synthesis parameter segments of the speech to be synthesized from the standard unit database 212 according to the fundamental frequency pattern, amplitude pattern, phoneme durations, and phoneme string generated in step S2006 (S2110).

The conversion function parameter setting unit 210 combines the conversion functions based on the utterance style weights at the center point of each segment selection unit and at each segment connection point, calculated for each utterance style by the deformation position/deformation weight determination unit 207 in step S2109. Depending on the combination of conversion functions selected for the multiple utterance styles, the combined conversion function may produce a speech synthesis parameter sequence that cannot be perceived as speech, for example one whose fundamental frequency is so high that it exceeds the frequencies of the first and second formants, which are important for identifying phonemes. The conversion functions are therefore combined within a permissible parameter space that is prepared in advance and takes the relationships between parameters into account. This prevents the speech produced from parameters converted by the combined function from breaking down. In this way, the conversion functions selected in step S2110 by the conversion function selection unit 208 from the conversion function database 209 for the multiple utterance styles of each segment are combined into one conversion function, and the conversion function parameters at the center of the segment selection unit and at the segment connection points are set (S2111).

The speech parameter generation unit 213 then applies the per-segment conversion functions combined and parameterized in step S2111 to the speech synthesis parameters selected by the unit selection unit 211 in step S2110, which generate standard speech for the phoneme string, and to the fundamental frequency pattern, amplitude pattern, phoneme durations, and pause durations generated in step S2006. This yields a continuous sequence of speech synthesis parameters that realizes a temporal change in which the multiple utterance styles specified by the tags are mixed (S2012). The waveform generation unit 214 synthesizes a speech waveform according to the speech synthesis parameters generated in step S2012 (S2013).
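The combination step S2111 and its safeguard can be pictured with the following sketch. The specification does not disclose the form of the conversion functions or of the permissible parameter space, so everything here is a stand-in chosen for illustration: each style is modeled as a set of additive parameter offsets, weights are scaled against a maximum of 5, and the constraint simply keeps the fundamental frequency below a fraction of the first formant.

```python
def combine_offsets(offsets_by_style, weights, max_weight=5):
    """Weighted sum of per-style parameter offsets (hypothetical model
    of combining conversion functions for one segment)."""
    combined = {}
    for style, offsets in offsets_by_style.items():
        scale = weights.get(style, 0) / max_weight
        for name, offset in offsets.items():
            combined[name] = combined.get(name, 0.0) + scale * offset
    return combined


def constrain_f0(f0_hz, f1_hz, max_ratio=0.5):
    """Keep F0 below a fraction of the first formant F1 so the result stays
    intelligible (the ratio is an assumed stand-in for the patent's
    permissible parameter space)."""
    return min(f0_hz, max_ratio * f1_hz)


# Hypothetical offsets and weights for one segment.
offsets = {"anger": {"f0_shift_hz": 40.0}, "laughter": {"f0_shift_hz": 60.0}}
combined = combine_offsets(offsets, {"anger": 2.5, "laughter": 5})
safe_f0 = constrain_f0(220.0 + combined["f0_shift_hz"], f1_hz=500.0)
```

The point of the sketch is the order of operations: the per-style contributions are merged first, and the merged result is then checked against a relation between parameters rather than against independent per-parameter limits.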

FIGS. 9(a), 9(b), and 9(c) show a modification of the markup language tags attached to text in Embodiment 2. FIG. 9(a) shows an example in which, instead of enclosing text between a pair of tags to describe a prosody change instruction, the prosody change range is counted in morae from the start of the text that follows a single tag. FIG. 9(b) represents the kanji-kana text of FIG. 9(a) in morae and shows concretely how the phoneme portions of the synthesized speech correspond to the two deformations of expression to be mixed (male voice and female voice). FIG. 9(c) graphs the temporal change of each weight when the weights are interpolated as in FIG. 9(b) and the two deformations are mixed.

The operation of the speech synthesizer in this modification is also described in detail with reference to FIG. 8.
The tagged text of FIG. 9(a) uses a scheme in which the end of the range covered by a tag is not marked explicitly. The last specified value is kept until the next instruction that changes the same parameter, or a reset instruction, is input. For example, the tag <voice gender=male[5,0]14mora/> does not delimit the utterance style range by enclosing text between two paired tags. Instead, it designates as its range the text from the start of the text immediately following the tag up to the 14th mora. Within that range, the male voice is to be deformed smoothly from weight 5 to weight 0.

That is, for the text 「あらゆる現実をすべて自分の方へ捻じ曲げたのだ」, the tags specify the speaker gender aspect of the utterance style as shown in FIG. 9(b). The tag on the first line of FIG. 9(a) specifies male-voice weight 5 at the start of the text, changing to male-voice weight 0 at the 14th mora counted from the 「あ」 (a) of 「あらゆる」. The tag on the third line specifies female-voice weight 0 at the 「す」 (su) of 「すべて」, changing to female-voice weight 5 at the 15th mora counted from that 「す」. In other words, the utterance starts in a masculine style at the initial 「あ」 and gradually shifts toward a neutral style through the 「を」 (o) of 「現実を」. From the 「す」 of 「すべて」 the masculinity decreases further while femininity is added, and at the 「ぶ」 (bu) of 「自分」 the masculinity finally disappears. Femininity then continues to increase gradually up to the 「た」 (ta) of 「捻じ曲げた」, and 「のだ」 (no da) is spoken in that feminine style. The tags thus specify a complex change in speaker characteristics. For such input, the processing up to step S2006 in FIG. 8 is the same as in Embodiment 1, so only the subsequent operation is described.
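A single self-closing tag of this form can be read with a regular expression, as in the sketch below. The grammar is inferred from the one tag quoted in the text, and the set of unit keywords (`mora`, `syllable`, `phoneme`, `acphrase`) is an assumption based on the unit names that appear elsewhere in the description; the actual tag syntax is defined by the figures, which are not reproduced here.

```python
import re

# Assumed grammar: <voice ATTR=VALUE[START,END]COUNTunit/>
TAG_RE = re.compile(
    r"<voice\s+(?P<attr>\w+)\s*=\s*(?P<value>\w+)\s*"
    r"\[\s*(?P<w_start>\d+)\s*,\s*(?P<w_end>\d+)\s*\]\s*"
    r"(?P<count>\d+)\s*(?P<unit>mora|syllable|phoneme|acphrase)\s*/>"
)


def parse_style_tag(tag):
    """Parse one utterance style tag into a dictionary of its fields."""
    match = TAG_RE.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a recognized style tag: {tag!r}")
    return {
        "attr": match["attr"],
        "value": match["value"],
        "w_start": int(match["w_start"]),
        "w_end": int(match["w_end"]),
        "count": int(match["count"]),
        "unit": match["unit"],
    }


male = parse_style_tag("<voice gender=male[5,0]14mora/>")
```

Here `male` holds the start weight 5, the end weight 0, and the range of 14 morae, which is the information the deformation position/deformation weight determination unit needs in order to place the change on the phoneme string.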

The deformation position/deformation weight determination unit 207 works from the standard phoneme durations and pause durations generated by the prosody control unit 205 for the phoneme string, the tag positions, and the tag instruction information separated by the markup language analysis unit 202. As shown in FIG. 9(b), for the temporal changes of utterance style specified in mora units, it sets the start of each change to the mora corresponding to the start of the text immediately after the tag, and the end of the change to the mora reached by counting the specified number of morae from the start. That is, for the tag on the first line, the male-voice weight starts at 5 on the 「あ」 (a) of 「あらゆる」, changes to 0 at the 「ぶ」 (bu) of 「自分」, and then stays at 0 unless a new value is specified. The female-voice weight changes from 0 at the 「す」 (su) of 「すべて」 to 5 at the 「た」 (ta) of 「捻じ曲げた」, and then stays at 5 unless a new value is specified. The deformation sections and utterance styles are set so that both the male-voice weight and the female-voice weight change over time from the 「す」 of 「すべて」 to the 「ぶ」 of 「自分」.

The real-time length of each deformation section is calculated (S2107), and the weight of each utterance style is linearly interpolated (S2108). In step S2109, the utterance style weight at the center point of each segment selection unit and at each segment connection point is calculated for each utterance style. In step S2110, the conversion function selection unit 208 extracts the conversion functions for converting segments from the conversion function database 209 for each segment selection unit, while the unit selection unit 211 extracts the speech synthesis parameter segments from the standard unit database 212. In step S2111, the conversion function parameter setting unit 210 combines the conversion functions based on the utterance style weights and sets their parameters. In step S2012, the speech parameter generation unit 213 generates a continuous sequence of speech synthesis parameters, and in step S2013 the waveform generation unit 214 synthesizes a speech waveform according to those parameters.
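The placement of the two change ranges in this modification can be verified with a short sketch. The katakana mora segmentation of the example text is an assumption (in particular, 「方」 is read as the two morae ホ and ウ), and the function name is hypothetical.

```python
# Assumed mora segmentation of the full example text.
MORAE = ["ア", "ラ", "ユ", "ル", "ゲ", "ン", "ジ", "ツ", "ヲ",
         "ス", "ベ", "テ", "ジ", "ブ", "ン", "ノ", "ホ", "ウ", "ヘ",
         "ネ", "ジ", "マ", "ゲ", "タ", "ノ", "ダ"]


def resolve_range(start_index, count):
    """Return the 0-based (start, end) indices of a change that begins at
    start_index and ends on the count-th mora counted from there."""
    return start_index, start_index + count - 1


male_start, male_end = resolve_range(0, 14)                     # tag at text start
female_start, female_end = resolve_range(MORAE.index("ス"), 15)  # tag before "すべて"

# Section in which both weights are changing at the same time.
overlap = (max(male_start, female_start), min(male_end, female_end))
```

With this segmentation the male-voice change ends on 「ブ」, the female-voice change ends on 「タ」, and the two changes overlap from 「ス」 to 「ブ」, in agreement with the description above.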

With this configuration, in which the deformation position/deformation weight determination unit 207 sets a temporal change realized by mixing the utterance styles specified by tags, the input is tagged text whose time positions are described in speech units such as accent phrases and breath groups. The deformation position/deformation weight determination unit 207 associates the tags, which specify the temporal change of utterance style in those speech units, with the phoneme durations expressed in real time that the prosody control unit 205 generated in step S2006, and the conversion function parameter setting unit 210 combines the conversion functions and sets their parameters so as to realize the temporal change set according to the phoneme durations. As a result, even though phoneme durations differ from one speech synthesizer to another, a time position described in speech units such as accent phrases can be placed accurately on the specified speech unit on the real-time axis. A composite conversion function that mixes utterance styles according to the weights interpolated on the real-time axis can therefore be created, parameterized, and applied. By converting the speech synthesis parameters of the standard voice in this way, the device generates speech synthesis parameters that change and transition gradually in real time, moving from one utterance style toward another, and can thus express a smooth change of utterance style. In other words, the temporal change of utterance style described in the tags accurately reproduces the phoneme positions, specified in speech units, that were intended when the tags were written. In addition, the parameters of the conversion functions corresponding to the utterance styles specified in the tags, which are generated in advance for each speech segment and include conversion of spectral information, are combined and controlled at waveform generation time within a range in which the speech does not break down. The gradually changing utterance style is therefore controlled through spectral information as well as prosody, and the temporal change specified in the tags is reproduced with high accuracy.

In this embodiment, the conversion functions are combined within a region of the parameter space in which the speech waveform does not break down when utterance styles are mixed. Alternatively, as in Japanese Unexamined Patent Application Publication No. 2003-233388, the utterance-style weights may be normalized to set the mixing ratio of the utterance styles, so that the composite conversion function is kept from performing extreme conversions that would cause the synthesized speech to break down.
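One simple way to read "normalizing the utterance-style weights" is to scale them so that the mixing ratios sum to one. The sketch below assumes exactly that; it is not taken from the cited publication, whose actual normalization may differ, and the function name is hypothetical.

```python
def mixing_ratios(weights):
    """Scale non-negative style weights so the ratios sum to 1.

    Assumption: sum-to-one normalization. If every weight is zero,
    all ratios are zero, i.e. the standard voice is left unconverted.
    """
    total = sum(weights.values())
    if total == 0:
        return {style: 0.0 for style in weights}
    return {style: w / total for style, w in weights.items()}


ratios = mixing_ratios({"male": 3, "female": 1})
```

With ratios bounded this way, no combination of tag weights can push the composite conversion beyond the strongest single-style conversion.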

(Embodiment 3)
Embodiments 1 and 2 described the case where the expression of tagged text is deformed smoothly over time. Embodiment 3 of the present invention describes a method for temporarily deforming the expression of only part of the text. FIGS. 10(a), 10(b), and 10(c) show an example of the markup-language tags attached to the text in Embodiment 3. FIG. 10(a) shows an example in which, in addition to the description shown in FIG. 9(a), a part of the text whose emotional expression is to be changed temporarily is enclosed in a pair of tags that describe the prosody change instruction. FIG. 10(b) represents the kanji-kana text of FIG. 10(a) in morae and shows concretely how the phonemes of the two synthesized voices being mixed correspond to the weight deformation of the emotional expression. FIG. 10(c) graphs the temporal change of the "anger" weight when the weights are interpolated and temporarily processed as in FIG. 10(b) and the two deformations are mixed. FIG. 11 is part of a flowchart showing the operation of the speech synthesizer of Embodiment 3. In FIG. 11, the operation up to step S2009 is the same as in FIG. 3, so its illustration and description are omitted; for steps after S2009 that are the same as in FIG. 3, the same reference numerals are used and their description is omitted.

The configuration of the speech synthesizer is the same as in FIG. 1, so its description is omitted.

The text input unit 201 accepts as input the tagged markup-language text shown in FIG. 10(a). For the text "あらゆる現実をすべて自分の方へ捻じ曲げたのだ" (roughly, "twisted every reality entirely toward oneself"), the tagged text of FIG. 10(a) specifies, as shown in FIG. 10(b), that the text begins in an utterance style with an anger weight of 5 and that the style changes gradually over 17 morae from the first phoneme, reaching an anger weight of 0 at the 17th mora, that is, the "ー" (long-vowel part) of "方". It further specifies, partway through that temporal change, temporary processing that invalidates the set temporal change only within a designated section: only in the section "すべて", an utterance style with an anger weight of 5 is specified regardless of the utterance style specified by the tag on the first line.

Steps S2001 to S2008 are the same as in Embodiment 1, except that the end of the temporal change of utterance style is the 17th mora from the beginning instead of the 14th, so their description is omitted and the details are described from step S2009. As in Embodiment 1, in step S2009 the utterance-style weights at the center point of each unit of segment selection and at each segment junction are calculated from the weights interpolated in real time, using the phoneme durations. If the deformation section contains no tag specifying temporary processing (S2201), then, as in Embodiment 1, steps S2010 to S2013 select the segments and conversion functions, convert the standard prosody and standard segments according to the conversion functions, generate the speech synthesis parameters, and synthesize a speech waveform according to those parameters.

If the deformation section contains a tag specifying temporary processing (S2201), the deformation position/deformation weight determination unit 207 identifies the phonemes of the temporary processing section (S2202). If no temporal change of utterance style is specified within the temporary processing section (S2203), then, among the utterance styles and weights set in step S2009, those of the speech segments corresponding to the phonemes identified in step S2202 are replaced with the ones specified by the temporary processing (S2206). If a temporal change of utterance style is specified within the temporary processing section (S2203), the utterance-style weights at real-time positions are set from the phoneme durations of the temporary processing section (S2204), and the utterance styles and weights interpolated on the real-time axis in step S2204 are converted to units of segment selection using the phoneme durations (S2205). Based on the utterance styles and weights converted in step S2205, those set in step S2009 for the speech segments corresponding to the phonemes of the temporary processing section are replaced with the ones specified by the temporary processing (S2206). In the example of FIG. 10, the temporary processing specified within the deformation section of the first-line tag of FIG. 10(a) contains no temporal change. The utterance styles and weights are therefore set per unit of segment selection so that, while the anger weight gradually decreases from "あ" of "あらゆる" to "ー" of "方へ", the tags on the third and fifth lines hold the anger weight at 5 only for "すべて", and at "自分の方へ" the weight returns to the temporal change specified by the first-line tag.

Based on the utterance-style weights at the center point of each unit of segment selection and at each segment junction, calculated by the deformation position/deformation weight determination unit 207 in steps S2009 to S2206 as described above, the conversion function parameter setting unit 210 sets, as in Embodiment 1, the parameters at the center of each unit of segment selection and at each segment junction for the conversion functions that the conversion function selection unit 208 selected for each segment from the conversion function database 209 in step S2010. The speech parameter generation unit 213 takes the speech synthesis parameters that generate the standard voice for the phoneme string, selected by the segment selection unit 211 in step S2010, together with the fundamental frequency pattern, amplitude pattern, phoneme durations, and pause durations generated in step S2006, and converts them using the per-segment conversion functions parameterized in step S2011. It thereby generates a speech synthesis parameter sequence for continuous speech that realizes the temporal change of utterance style specified by the tags (S2012). The waveform generation unit 214 synthesizes a speech waveform according to the speech synthesis parameters generated in step S2012 (S2013).
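The replacement performed in steps S2201 to S2206 amounts to overwriting a slice of the background weights. The following sketch assumes one weight per mora (rather than per unit of segment selection), 26 morae in total, and hand-counted 0-based indices for "すべて"; the function names are hypothetical.

```python
def linear_ramp(w_start, w_end, n):
    """n weights moving linearly from w_start to w_end inclusive."""
    if n == 1:
        return [float(w_start)]
    step = (w_end - w_start) / (n - 1)
    return [w_start + i * step for i in range(n)]


def apply_temporary_section(background, first, last, section_weights):
    """Return a copy of background with units first..last replaced.

    section_weights may be constant (no temporal change inside the
    temporary section) or itself a ramp (temporal change specified).
    """
    if len(section_weights) != last - first + 1:
        raise ValueError("section length does not match its weights")
    result = list(background)
    result[first:last + 1] = section_weights
    return result


# Background: anger weight 5 -> 0 over the first 17 morae, then held at 0.
background = linear_ramp(5, 0, 17) + [0.0] * 9

# Temporary processing: hold anger at 5 for "すべて" (assumed morae 9..11).
anger = apply_temporary_section(background, 9, 11, [5.0, 5.0, 5.0])
```

Outside the overwritten slice the result equals the background, so the weight resumes the first-line ramp immediately after the temporary section, as in FIG. 10(c).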

With this configuration, in which the deformation position/deformation weight determination unit 207 handles tags describing temporary processing of the utterance style, the input is tagged text in which time positions are described in speech units such as morae and in which an instruction to suspend a continuous change and perform temporary processing is written partway through a section of continuous temporal change. The deformation position/deformation weight determination unit 207 determines whether the deformation section of the background utterance style contains temporary processing, sets the utterance-style weights of the temporary processing section, and substitutes them for the background utterance-style weights. The device thus generates speech synthesis parameters in which, while the utterance style is gradually changing in real time, an utterance style specified independently of the background change is inserted. Providing a way to specify temporary processing within a section for which a temporal change is specified makes it possible to specify a local change inside a change that spans a relatively wide range, which increases the variety of vocal expression. Such temporary processing cannot be specified unless the relationship between the text to be synthesized and the time positions in the synthesized speech is clear. With the method of the present invention, which specifies time positions in speech or linguistic units, the author can confirm while writing the markup whether a wide-range temporal change specification includes the character string that is the target of the local expression, so expression through temporary processing becomes possible.

In this embodiment, the utterance style specified by the temporary processing is the same as the one specified for the section containing it, and the temporary processing changes only the weight. However, an utterance style entirely different from the one specified for the enclosing section may be specified, with the utterance style and weights specified for the enclosing section being invalidated only within the temporary processing section. Alternatively, a different utterance style may be specified and mixed, only within the temporary processing section, with the utterance style and weights specified for the enclosing section.

(Embodiment 4)
Embodiments 1 to 3 described cases where the expression of synthesized speech is deformed into various utterance styles using the same configuration. The following describes a case where various deformations of expression are performed using a different configuration.

FIG. 12 is a functional block diagram of the speech synthesizer according to Embodiment 4 of the present invention, and FIG. 13 is a flowchart showing its operation.

In FIG. 12, parts common to FIG. 1 are given the same reference numerals and their description is omitted. Likewise, in FIG. 13, operation steps common to FIG. 3 are given the same reference numerals and their description is omitted.

In FIG. 12, the speech synthesizer of this embodiment has two sets of components, each consisting of a prosodic pattern database, a prosody control unit, a segment database with mixing reference points, and a synthesis parameter generation unit. The two sets correspond to two different utterance styles A and B and are switched or mixed by the deformation position/weight determination unit; the speech synthesis parameters output from the two sets are morphed according to mixing-weight change information from the deformation position/weight determination unit. The device comprises a text input unit 201, a markup language analysis unit 202, a language processing unit 203, a dictionary 204, a deformation position/weight determination unit 305, a switch 306a, a switch 306b, a prosody control unit A 307a, a prosody control unit B 307b, a prosodic pattern database A 308a, a prosodic pattern database B 308b, a synthesis parameter generation unit A 309a, a synthesis parameter generation unit B 309b, a segment database with mixing reference points A 310a, a segment database with mixing reference points B 310b, a morphing unit 311, and a waveform generation unit 214.

The deformation position/weight determination unit 305 receives the tag insertion positions associated with the phoneme string generated by the language processing unit 203 and the tag instruction information output from the markup language analysis unit 202, and analyzes the utterance-style specification based on the tag instructions, the tag insertion positions, and the phoneme string. It then analyzes the correspondence between sections of the phoneme string and utterance styles and controls the switches 306a and 306b. It also analyzes the sections of the phoneme string in which utterance styles are to be mixed and sets the mixing weights of the utterance styles.

The switches 306a and 306b control the operation of the prosody control unit A 307a and the prosody control unit B 307b according to the control signal output from the deformation position/weight determination unit 305.

The switches 306a and 306b control whether the prosody control unit A 307a and the prosody control unit B 307b receive the phoneme string with associated tag insertion positions, the prosody specification information, and the linguistic information generated by the language processing unit 203. Each prosody control unit operates only when its switch is closed and it receives input from the language processing unit 203. Based on the phoneme string, the prosody specification information, and the linguistic information, it refers to the prosodic pattern database A 308a or the prosodic pattern database B 308b, generated from real speech in utterance style A or B respectively, generates the fundamental frequency, amplitude, phoneme durations, and pause durations corresponding to the phoneme string, and outputs them in association with the tag insertion positions.

The segment database with mixing reference points A 310a and the segment database with mixing reference points B 310b each store, for the corresponding utterance style A or B, the per-phoneme speech synthesis parameters generated from real speech in that style. They also store attributes such as phoneme environment, fundamental frequency, amplitude, phoneme duration, and linguistic information, together with mixing position information that indicates, by frequency and by time position within the segment, the points that serve as references for mixing parameters when speech is morphed.
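A database entry of this kind can be pictured as a record that carries the synthesis parameters, the attributes used for selection, and the mixing reference points. The field names and the nearest-fundamental-frequency selection rule below are assumptions chosen for illustration; the specification does not define a storage format or a selection cost.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One stored speech segment of a single utterance style."""
    phoneme: str
    f0_hz: float
    duration_ms: float
    parameters: list  # per-frame synthesis parameters, opaque here
    # Mixing reference points as (time position in segment, frequency).
    mixing_points: list = field(default_factory=list)


def select_segment(database, phoneme, target_f0_hz):
    """Pick the stored segment of this phoneme closest in F0.

    Assumed criterion: a real selector would also weigh phoneme
    environment, amplitude, duration, and linguistic attributes.
    """
    candidates = [s for s in database if s.phoneme == phoneme]
    if not candidates:
        raise KeyError(phoneme)
    return min(candidates, key=lambda s: abs(s.f0_hz - target_f0_hz))


database_a = [
    Segment("a", 110.0, 90.0, [], [(45.0, 700.0), (45.0, 1200.0)]),
    Segment("a", 150.0, 80.0, [], [(40.0, 750.0), (40.0, 1250.0)]),
    Segment("i", 120.0, 70.0, [], [(35.0, 300.0), (35.0, 2300.0)]),
]
chosen = select_segment(database_a, "a", 140.0)
```

Storing the reference points in the same order for every style is what later allows point-by-point alignment between style A and style B segments.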

The synthesis parameter generation unit A 309a and the synthesis parameter generation unit B 309b refer to the segment database with mixing reference points A 310a and B 310b, respectively, and extract the speech synthesis parameter segments corresponding to the phoneme string. The extraction is based on the fundamental frequency, amplitude, and phoneme durations generated for the phoneme string by the prosody control unit A 307a or the prosody control unit B 307b and on the linguistic information generated by the language processing unit 203. Each unit then concatenates the segments to generate the speech synthesis parameter sequence of utterance style A or utterance style B, attaches the mixing reference points of each segment, and outputs the result.

The morphing unit 311 acquires the speech synthesis parameter sequences with mixing reference points, in different utterance styles, generated by the synthesis parameter generation unit A 309a and the synthesis parameter generation unit B 309b. Based on the mixing sections and mixing-weight information set by the deformation position/weight determination unit 305, it morphs the utterance style A sequence and the utterance style B sequence with their mixing reference points aligned, and generates a speech synthesis parameter sequence in which the utterance styles are mixed. The waveform generation unit 214 generates and outputs a speech waveform based on the series of speech synthesis parameters generated by the morphing unit 311.

Next, the operation of the speech synthesizer configured as above is described in detail. As in Embodiment 1, the input is the one shown in FIG. 2(a); the process up to step S2005 is common to Embodiment 1, so only the subsequent operation is described.

From the utterance-style instructions separated by the markup language analysis unit 202 and the phoneme string with recorded tag positions generated in step S2005, the deformation position/weight determination unit 305 maps the time positions specified by the tags onto the phoneme string and sets, on the phoneme string, the utterance styles to be mixed and their weights (S3006). It then closes the switches 306a and 306b according to the utterance styles to be mixed that were set in step S3006 (S3007). Only two utterance styles, A and B, are illustrated here, but the device is assumed to have many utterance styles, and the switches leading to the prosody control units that generate the prosody of the utterance styles set in step S3006 are selected and closed. For the input text of FIG. 2(a), utterance style A is the standard style and utterance style B is the anger style.

The prosody control unit A 307a uses the phoneme string, the prosody specification information, and the linguistic information generated by the language processing unit 203 in step S2004 as attributes. It sets the phoneme durations and pause durations of utterance style A (standard style) for the input phoneme string with a function whose parameters are set in advance for each attribute, extracts the utterance style A prosodic patterns of fundamental frequency and amplitude from the prosodic pattern database A 308a according to those attributes, and modifies them based on the attributes to generate the utterance style A fundamental frequency pattern and amplitude pattern for the input phoneme string. Similarly, for utterance style B (anger style), the prosody control unit B 307b sets the phoneme durations and pause durations, extracts the prosodic patterns from the prosodic pattern database B 308b, and generates the utterance style B fundamental frequency pattern and amplitude pattern for the input phoneme string (S3008).

The synthesis parameter generation unit A 309a extracts the speech synthesis parameter segments of the speech to be synthesized from the segment database with mixing reference points A 310a, which stores the utterance style A segments, according to the phoneme string and the utterance style A fundamental frequency pattern, amplitude pattern, and phoneme durations generated by the prosody control unit A 307a in step S3008, and concatenates them. Similarly, the synthesis parameter generation unit B 309b extracts the speech synthesis parameter segments from the segment database with mixing reference points B 310b, which stores the utterance style B segments, according to the phoneme string and the utterance style B fundamental frequency pattern, amplitude pattern, and phoneme durations generated by the prosody control unit B 307b in step S3008, and concatenates them (S3009).

In the mixing sections set by the deformation position/weight determination unit 305 in step S3006, the morphing unit 311 aligns the per-segment mixing reference points of the utterance style A (standard style) and utterance style B (anger style) speech synthesis parameter sequences generated in step S3009 and interpolates linearly between the mixing reference points in the time direction and the spectral direction. It morphs the sequences according to the mixing weight set for each phoneme by the deformation position/weight determination unit 305 in step S3006 and generates a speech synthesis parameter sequence whose utterance style changes gradually within the mixing section (S3010). The waveform generation unit 214 synthesizes a speech waveform according to the speech synthesis parameters generated in step S3010 (S2013).
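The morphing of step S3010 can be illustrated with a small sketch that interpolates aligned reference points and uses them to warp the frequency axis before blending. Everything here is an assumption made for illustration: the spectral envelopes are plain functions of frequency, the reference frequencies are strictly increasing, and the function names do not come from the specification.

```python
def lerp(a, b, w):
    """Linear interpolation: w = 0 gives a, w = 1 gives b."""
    return (1.0 - w) * a + w * b


def morph_points(points_a, points_b, w):
    """Interpolate aligned (time, frequency) reference points."""
    if len(points_a) != len(points_b):
        raise ValueError("reference points must correspond one to one")
    return [(lerp(ta, tb, w), lerp(fa, fb, w))
            for (ta, fa), (tb, fb) in zip(points_a, points_b)]


def piecewise_linear_map(x, src, dst):
    """Map x so that src[i] lands on dst[i], linear in between.

    src must be strictly increasing; values outside are clamped.
    """
    if x <= src[0]:
        return dst[0]
    if x >= src[-1]:
        return dst[-1]
    for i in range(1, len(src)):
        if x <= src[i]:
            r = (x - src[i - 1]) / (src[i] - src[i - 1])
            return lerp(dst[i - 1], dst[i], r)


def morph_envelope(env_a, env_b, anchors_a, anchors_b, w):
    """Blend two envelopes after aligning their reference frequencies."""
    anchors_m = [lerp(a, b, w) for a, b in zip(anchors_a, anchors_b)]

    def env(f):
        # Look up each style at the frequency that corresponds to f.
        fa = piecewise_linear_map(f, anchors_m, anchors_a)
        fb = piecewise_linear_map(f, anchors_m, anchors_b)
        return lerp(env_a(fa), env_b(fb), w)

    return env


# Toy envelopes with a single peak at 500 Hz (style A) and 700 Hz (style B).
def env_a(f):
    return max(0.0, 1.0 - abs(f - 500.0) / 500.0)


def env_b(f):
    return max(0.0, 1.0 - abs(f - 700.0) / 500.0)


half = morph_envelope(env_a, env_b, [0, 500, 4000], [0, 700, 4000], 0.5)
duration_ms = lerp(80, 120, 0.25)  # durations interpolate the same way
```

Aligning on reference points first means the blended envelope has one peak that moves from 500 Hz toward 700 Hz as the weight grows, instead of two fixed peaks whose heights cross-fade.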

With this configuration, which includes a prosodic pattern database 308 and a segment database with mixing reference points 310 for each utterance style together with the morphing unit 311, the prosody control unit and prosodic pattern database, and the synthesis parameter generation unit and segment database with mixing reference points, prepared for each utterance style are selected to perform prosody generation, segment selection, and segment concatenation, producing the speech synthesis parameter sequences of the source utterance styles. The morphing unit 311 then morphs these sequences according to the mixing weights with their mixing reference points aligned, so that the phoneme durations set for each utterance style can be interpolated segment by segment using the mixing reference points. The mixing reference points are also used in the spectral direction to align the speech parameters with each other before interpolating them. The speech synthesis parameters of multiple utterance styles can thus be mixed with their real-time variation and spectral variation extracted as characteristics of each utterance style, and new utterance styles can be generated freely.

In the present embodiment, the case where the weight of a single utterance style changes, as in FIG. 2(a), has been described. When a plurality of utterance styles are mixed as in FIG. 7(a), utterance style A, which is the standard style in the present embodiment, may instead be, for example, a laughing utterance style, and may be mixed with the angry utterance style of utterance style B.

In the present embodiment, two sets of a prosody parameter database, a prosody control unit, a segment database with mixing reference points, and a synthesis parameter generation unit are provided, corresponding to the two utterance styles A and B. Alternatively, such sets may be provided for three or more utterance styles and selected with a switch.

In the present embodiment, a prosody control unit and a synthesis parameter generation unit are provided for each utterance style in order to generate the speech synthesis parameter sequence of each source utterance style. Alternatively, as shown in FIG. 14, a single prosody control unit and a single synthesis parameter generation unit may be used together with a plurality of prosody pattern databases and segment databases with mixing reference points, one per utterance style. In that case the prosody control unit and the synthesis parameter generation unit switch between these databases to generate the prosody information and the synthesis parameter sequence of each utterance style and store them temporarily in the synthesis parameter storage unit 320, and the stored speech synthesis parameter sequences of the plurality of utterance styles are then morphed.

(Embodiment 5)
The speech synthesizer of the present embodiment differs from the preceding embodiments in that, instead of storing parameter segments in a segment database with mixing reference points, it holds the waveforms of the speech to be mixed in a segment waveform database with mixing reference points. FIG. 15 is a functional block diagram of the speech synthesizer according to Embodiment 5 of the present invention, and FIG. 16 is a flowchart showing the operation of the speech synthesizer according to Embodiment 5 of the present invention.

In FIG. 15, parts common to FIG. 1 and FIG. 12 are given the same reference numerals, and their description is omitted. Likewise, in FIG. 16, operation steps common to FIG. 3 and FIG. 13 are given the same step numbers, and their description is omitted.

In FIG. 15, the waveform superimposing unit A 409a and the waveform superimposing unit B 409b each receive the fundamental frequency, amplitude, and phoneme durations corresponding to the phoneme string, generated by the prosody control unit A 307a and the prosody control unit B 307b respectively, together with the linguistic information generated by the language processing unit 203. Each unit refers to its own database, namely the segment waveform database with mixing reference points A 410a or the segment waveform database with mixing reference points B 410b. For utterance style A and utterance style B respectively, these databases store the segment waveform of each phoneme generated from real speech; attributes such as phoneme environment, fundamental frequency, amplitude, phoneme duration, and linguistic information; and mixing position information, which indicates, by frequency and by time position within the segment, the points that serve as the reference for mixing parameters when speech is morphed. Each unit extracts the segment waveforms corresponding to the phoneme string, concatenates them to generate the speech waveform of utterance style A or utterance style B, and outputs the waveform with the mixing reference points of each segment attached.

The spectrum analysis unit 411 performs spectral analysis on the speech waveforms with mixing reference points generated by the waveform superimposing unit A 409a and the waveform superimposing unit B 409b, and converts them into speech synthesis parameter sequences that can be morphed.

For the speech waveform of utterance style A generated by the waveform superimposing unit A 409a and the speech waveform of utterance style B generated by the waveform superimposing unit B 409b, the morphing unit 311 takes the speech parameter sequence of utterance style A and the speech parameter sequence of utterance style B produced by the spectrum analysis unit 411 and morphs them, with their mixing reference points placed in correspondence, based on the mixing section and mixing weight information set by the deformation position/weight determination unit 305. It thereby generates a speech parameter sequence in which a plurality of utterance styles are mixed.

The waveform generation unit 214 generates and outputs a speech waveform based on the sequence of speech parameters generated by the morphing unit 311.

Next, the operation of the speech synthesizer configured as described above is explained in detail. As in Embodiment 1, for the input shown in FIG. 2(a), the processing up to step S2005 is the same as in Embodiment 1, and the processing up to step S3008 is the same as in Embodiment 2, so only the subsequent operation is described.

Following the fundamental frequency pattern, amplitude pattern, phoneme durations, and phoneme string of utterance style A (standard style) generated by the prosody control unit A 307a in step S3008, the waveform superimposing unit A 409a extracts the segment waveforms of the speech to be synthesized from the segment waveform database with mixing reference points A 410a, which stores the segments of utterance style A, and concatenates them to generate a speech waveform.

Similarly, following the fundamental frequency pattern, amplitude pattern, phoneme durations, and phoneme string of utterance style B (anger style) generated by the prosody control unit B 307b in step S3008, the waveform superimposing unit B 409b extracts the segment waveforms of the speech to be synthesized from the segment waveform database with mixing reference points B 410b, which stores the segments of utterance style B, and concatenates them to generate a speech waveform (S4009).
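
The concatenation in step S4009 can be pictured with a minimal sketch. It assumes each segment is a list of samples plus a list of mixing reference points given as (sample index within the segment, frequency in Hz); the function name `concatenate_segments` and this data layout are illustrative assumptions, not taken from the patent, and plain joining stands in for the actual waveform superposition.

```python
def concatenate_segments(segments):
    """Join segment waveforms and carry their mixing reference points along.

    segments: list of (samples, ref_points), where ref_points is a list of
    (sample_index_within_segment, frequency_hz) tuples (hypothetical layout).
    Returns the joined waveform and, per segment, the reference points
    re-expressed on the time axis of the joined waveform.
    """
    waveform = []
    refs_per_segment = []
    for samples, ref_points in segments:
        offset = len(waveform)  # where this segment starts in the joined waveform
        refs_per_segment.append([(offset + t, f) for t, f in ref_points])
        waveform.extend(samples)
    return waveform, refs_per_segment
```

Keeping the reference points grouped per segment preserves the segment-by-segment correspondence that the later morphing step relies on.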

The spectrum analysis unit 411 analyzes the speech waveform of utterance style A (standard style) and the speech waveform of utterance style B (anger style) generated in step S4009 by the waveform superimposing unit A and the waveform superimposing unit B, and converts them into the speech parameter sequence of utterance style A (standard style) and the speech parameter sequence of utterance style B (anger style), respectively (S4010).
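
The patent does not state which analysis method the spectrum analysis unit 411 uses. Purely as an illustration of turning a waveform into a frame-by-frame parameter sequence, the sketch below assumes a short-time log-magnitude spectrum computed with a naive discrete Fourier transform; the function names, frame length, and hop size are all hypothetical.

```python
import cmath
import math


def log_magnitude_spectrum(frame):
    """Log-magnitude DFT of one frame (bins 0 .. n/2), computed naively."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(math.log(abs(acc) + 1e-12))  # small floor avoids log(0)
    return spectrum


def analyze(waveform, frame_len, hop):
    """Slide a window over the waveform and return one spectrum per frame."""
    last_start = len(waveform) - frame_len
    return [log_magnitude_spectrum(waveform[s:s + frame_len])
            for s in range(0, last_start + 1, hop)]
```

A practical system would more likely use a windowed FFT or a vocoder-style parameterization; the naive transform is used here only to keep the sketch self-contained.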

In the mixing section set by the deformation position/weight determination unit 305 in step S3006, the morphing unit 311 places the mixing reference points of each segment in correspondence between the speech parameter sequences of utterance style A (standard style) and utterance style B (anger style) generated in step S4010, interpolates linearly between the mixing reference points in the time direction and the spectral direction, and morphs the sequences according to the mixing weight set for each phoneme by the deformation position/weight determination unit 305 in step S3006. It thereby generates a speech synthesis parameter sequence in which the utterance style changes gradually within the mixing section.
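
The time-direction part of this morphing can be sketched as follows. The sketch assumes that each style's segment is a list of parameter frames (at least two), that the mixing reference points are frame indices starting at 0 and ending at the last frame, and that both styles have the same number of reference points; the names `sample` and `morph_segment` are illustrative. Warping along the frequency axis, which the patent also describes, is omitted here and would apply the same piecewise-linear mapping to frequency positions.

```python
def sample(frames, t):
    """Linearly interpolate a frame at fractional frame index t."""
    i = min(int(t), len(frames) - 2)
    frac = t - i
    return [(1 - frac) * a + frac * b for a, b in zip(frames[i], frames[i + 1])]


def morph_segment(frames_a, refs_a, frames_b, refs_b, weight_b):
    """Morph two parameter sequences whose reference points correspond pairwise.

    weight_b is the mixing weight of style B (0.0 gives style A, 1.0 gives style B).
    """
    # Reference-point positions of the morphed segment: a weighted mean of both styles.
    refs_out = [(1 - weight_b) * ta + weight_b * tb for ta, tb in zip(refs_a, refs_b)]
    n_out = int(round(refs_out[-1])) + 1
    out = []
    k = 0  # index of the current interval between reference points
    for n in range(n_out):
        t = min(float(n), refs_out[-1])
        while k < len(refs_out) - 2 and t > refs_out[k + 1]:
            k += 1
        span = refs_out[k + 1] - refs_out[k]
        u = (t - refs_out[k]) / span if span > 0 else 0.0
        # Map the relative position u back into each source style's own timing.
        ta = refs_a[k] + u * (refs_a[k + 1] - refs_a[k])
        tb = refs_b[k] + u * (refs_b[k + 1] - refs_b[k])
        fa, fb = sample(frames_a, ta), sample(frames_b, tb)
        out.append([(1 - weight_b) * x + weight_b * y for x, y in zip(fa, fb)])
    return out
```

Because the duration of the output is itself interpolated between the two styles, a weight that moves from 0 to 1 across successive segments changes both timing and parameter values gradually, which corresponds to the gradual style change described above.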

The waveform generation unit 214 synthesizes a speech waveform according to the speech parameters generated in step S3010 (S2013).

With this configuration, which includes a segment waveform database with mixing reference points for each utterance style, the waveform superimposing units, and the spectrum analysis unit, the segment waveform database with mixing reference points prepared for each utterance style is selected, segment waveforms are selected and concatenated to generate the speech waveform of each source utterance style, and the waveforms are analyzed to generate speech parameter sequences. The morphing unit 311 then morphs the speech synthesis parameter sequences of the utterance styles according to the mixing weights, with their mixing reference points placed in correspondence. Because the speech waveforms are spectrally analyzed and the mixing reference points are also used in the spectral direction to align and interpolate the speech parameters, even a speech synthesizer that uses the waveform superposition method can extract the real-time variation and the spectral variation of a plurality of utterance styles as characteristics of those styles, mix them, and freely generate new utterance styles.

In the present embodiment, the case where the weight of a single utterance style changes, as in FIG. 2(a), has been described. When a plurality of utterance styles are mixed as in FIG. 7(a), utterance style A, which is the standard style in the present embodiment, may instead be, for example, a laughing utterance style, and may be mixed with the angry utterance style of utterance style B.

In the present embodiment, two sets of a prosody parameter database, a prosody control unit, a segment database with mixing reference points, and a synthesis parameter generation unit are provided, corresponding to the two utterance styles A and B. Alternatively, such sets may be provided for three or more utterance styles and selected with a switch.

In the present embodiment, a prosody control unit and a waveform superimposing unit are provided for each utterance style in order to generate the speech waveform data of each source utterance style. Alternatively, just as FIG. 14 uses a single prosody control unit and a single synthesis parameter generation unit with a plurality of prosody pattern databases 308a, 308b, 308c, ... and segment databases with mixing reference points 310a, 310b, 310c, ..., one per utterance style, a single prosody control unit and a single waveform superimposing unit may be used with a plurality of prosody pattern databases and segment waveform databases with mixing reference points, one per utterance style. In that case the prosody control unit and the waveform superimposing unit switch between these databases to generate the prosody information and speech waveform data of each utterance style and store them temporarily in a speech waveform storage unit corresponding to the synthesis parameter storage unit 320. The stored speech waveforms of the plurality of utterance styles are then analyzed and converted into speech parameter sequences, and those sequences are morphed.

(Embodiment 6)
The present embodiment describes an example in which the speech specified by an utterance style command is synthesized using conversion functions that transform and rotate a voice quality space, represented as a high-dimensional vector space of speech synthesis parameters.

FIG. 17 is a functional block diagram of the speech synthesizer according to Embodiment 6 of the present invention, and FIG. 18 is a flowchart showing the operation of the speech synthesizer according to Embodiment 6 of the present invention.

In FIG. 17, parts common to FIG. 1 are given the same reference numerals, and their description is omitted. Likewise, in FIG. 18, operation steps common to FIG. 3 are given the same step numbers, and their description is omitted.

In FIG. 17, the deformation position/weight determination unit 505 determines the deformation positions on the phoneme string and the mixing weights of the utterance styles from the instruction information that was written as tags and separated from the input text by the markup language analysis unit 202, and from the phoneme string with tag positions generated by the language processing unit 203.

The mixed voice quality space calculation unit 506 refers to the basic conversion formula database 507, which stores conversion functions that deform and rotate the voice quality space, expressed as a vector space of speech synthesis parameters, in order to convert the standard voice quality into each basic utterance style. Starting from these basic conversion formulas, it mixes and combines them according to the utterance style mixing weights determined by the deformation position/weight determination unit 505, and generates a conversion formula for each control unit, that is, for each unit of time at which synthesis is controlled.

Each basic conversion formula stored in the basic conversion formula database 507 converts the vector space of the standard utterance style into the vector space of one basic utterance style, where the vector space of each basic utterance style is built from real speech.

The basic conversion formulas are created in advance as conversions between models, with the real speech in the vector space of each utterance style represented by a probabilistic statistical model. The voice quality space conversion unit 508 converts the standard voice quality space data 509 with the conversion formula for each control unit generated by the mixed voice quality space calculation unit 506, and creates voice quality space data for each control unit.

The converted voice quality space data storage unit 510 stores the voice quality space data for each control unit generated by the voice quality space conversion unit 508. The prosody generation unit 511 sets the fundamental frequency, amplitude, and phoneme durations in the voice quality space of the relevant control unit, taken from the voice quality space data stored in the converted voice quality space data storage unit 510, using a probabilistic statistical model whose attributes are the phoneme string, prosody instruction information, and linguistic information generated by the language processing unit 203.

The spectrum parameter generation unit 512 generates spectrum information in the voice quality space of the relevant control unit, taken from the voice quality space data stored in the converted voice quality space data storage unit 510, using a probabilistic statistical model whose attributes are the phoneme string and linguistic information generated by the language processing unit 203 and the fundamental frequency, amplitude, and phoneme durations generated by the prosody generation unit 511. The waveform generation unit 513 synthesizes a speech waveform based on the prosody information generated by the prosody generation unit 511 and the spectrum information generated by the spectrum parameter generation unit 512.

Next, the operation of the speech synthesizer configured as described above is explained in detail. As in Embodiment 1 and Embodiment 2, for the input shown in FIG. 2(a) or FIG. 7(a), the processing up to step S2005 is the same as in Embodiment 1, so only the subsequent operation is described. The deformation position/weight determination unit 505 sets the positions on the phoneme string at which the utterance style is deformed, using the instruction information on utterance style change separated from the input text by the markup language analysis unit 202 and the phoneme string with tag positions generated by the language processing unit 203. It then linearly interpolates the utterance style change or weight along the phoneme string and sets the mixing weights, or the change in weight, on the phoneme string (S5006).
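
The linear interpolation of step S5006 can be sketched as below. The sketch assumes that the tags have already been resolved into a mapping from phoneme index to weight and that the weight is held constant before the first and after the last tagged position; both the data layout and the name `interpolate_weights` are illustrative assumptions.

```python
def interpolate_weights(num_phonemes, anchors):
    """Return one style weight per phoneme.

    anchors: dict {phoneme_index: weight} for the tagged positions
    (hypothetical representation of the parsed tags).
    """
    points = sorted(anchors.items())
    weights = []
    for i in range(num_phonemes):
        if i <= points[0][0]:
            weights.append(points[0][1])      # before the first tag: hold
        elif i >= points[-1][0]:
            weights.append(points[-1][1])     # after the last tag: hold
        else:
            for (i0, w0), (i1, w1) in zip(points, points[1:]):
                if i0 <= i <= i1:
                    # Linear interpolation on the phoneme-index axis.
                    weights.append(w0 + (w1 - w0) * (i - i0) / (i1 - i0))
                    break
    return weights
```

Here the interpolation axis is the phoneme index itself, so a long phoneme and a short one advance the weight by the same amount.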

The mixed voice quality space calculation unit 506 extracts from the basic conversion formula database 507 the basic conversion formulas that convert the standard voice quality space into the source utterance styles set by the deformation position/weight determination unit 505 in step S5006. It converts the time unit of the utterance style deformation or weight change set on the phoneme string in step S5006 into the control unit used in speech synthesis. Then, for each control unit, it mixes, combines, and adjusts the parameters of the basic conversion formulas extracted from the basic conversion formula database 507 according to the utterance style mixing weights determined in step S5006, and generates a conversion formula for each control unit (S5007).
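
The patent does not give the mathematical form of the basic conversion formulas or of their mixing. As one possible reading, the sketch below assumes each basic conversion is an affine map (a matrix and an offset) on parameter vectors, and that the conversion for one control unit is a weighted blend of those maps with the identity, so that zero total weight leaves the standard voice quality unchanged. The names `mix_transforms` and `apply_transform` and the blending rule are assumptions for illustration.

```python
def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mix_transforms(basic, weights):
    """Blend per-style affine maps into one map for a single control unit.

    basic:   {style: (matrix, offset)} basic conversions from the standard space
    weights: {style: weight}, with the weights summing to at most 1.0
    """
    n = len(next(iter(basic.values()))[1])
    rest = 1.0 - sum(weights.values())        # share left to the standard style
    matrix = [[rest * v for v in row] for row in identity(n)]
    offset = [0.0] * n
    for style, w in weights.items():
        a, b = basic[style]
        for i in range(n):
            offset[i] += w * b[i]
            for j in range(n):
                matrix[i][j] += w * a[i][j]
    return matrix, offset


def apply_transform(transform, vector):
    """Apply an affine map (matrix, offset) to one parameter vector."""
    matrix, offset = transform
    return [sum(m * x for m, x in zip(row, vector)) + o
            for row, o in zip(matrix, offset)]
```

Calling `mix_transforms` once per control unit with that unit's weights yields the per-unit conversion formulas; a linear blend of matrices does not preserve pure rotations, so a real system might blend rotation angles instead.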

The voice quality space conversion unit 508 converts the standard voice quality space data 509 with the conversion formula for each control unit generated by the mixed voice quality space calculation unit 506, and creates voice quality space data for each control unit. The created voice quality space data is stored in the converted voice quality space data storage unit 510 (S5008).

The prosody generation unit 511 extracts the voice quality space of the relevant control unit from the voice quality space data stored in the converted voice quality space data storage unit 510 in step S5008. In that voice quality space, it generates the fundamental frequency, amplitude, and phoneme durations based on a probabilistic statistical model, using as attributes the phoneme string, prosody instruction information, and linguistic information generated by the language processing unit 203 in step S2004 (S5009). Next, in the voice quality space corresponding to the relevant control unit, taken from the voice quality space data stored in the converted voice quality space data storage unit 510 in step S5008, the spectrum parameter generation unit 512 generates spectrum information based on a probabilistic statistical model, using as attributes the phoneme string and linguistic information generated in step S2004 and the fundamental frequency, amplitude, and phoneme durations generated by the prosody generation unit 511 in step S5009 (S5010).

The waveform generation unit 513 generates a speech waveform using, as speech synthesis parameters, the prosody information generated in step S5009 and the spectrum information generated in step S5010 for each control unit (S5011).

With this configuration, which includes the mixed voice quality space calculation unit 506, the basic conversion formula database 507, the voice quality space conversion unit 508, the standard voice quality space data 509, and the converted voice quality space data storage unit, the basic conversion formulas for converting the standard voice quality space into the voice quality spaces of the source utterance styles are extracted, and these basic conversion formulas are mixed and combined for each control unit of speech synthesis in accordance with the temporal change of the utterance style.

The voice quality space conversion unit 508 converts the standard voice quality space with the utterance style conversion formula obtained by this mixing and combining. A different voice quality space is generated for each control unit of speech synthesis and stored in the converted voice quality space data storage unit 510. When the prosody generation unit 511 and the spectrum parameter generation unit 512 generate prosody information and spectrum information respectively, they refer to the voice quality space of the relevant control unit and generate the speech synthesis parameters control unit by control unit. Because the voice quality space that is converted contains both prosody information and spectrum information, changes in utterance style are controlled not only through prosody but also through spectrum information, and the temporal change of the utterance style specified by the tags can be reproduced accurately.

In the present embodiment, prosody and spectrum parameters are represented in a single voice quality space, but the space may be divided into a plurality of spaces, such as a prosody space and a spectrum space, which are converted separately.

In the present embodiment, the utterance style mixing ratio is interpolated on a time axis whose unit is the phoneme string. Alternatively, as in Embodiment 1, the time positions set on the phoneme string may be converted to real time and the utterance style mixing ratio interpolated in real time.
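
The real-time alternative can be illustrated with a short sketch. It assumes that phoneme durations in milliseconds are already known and that each phoneme is represented by its center time; the function names and the choice of the center as the representative time are assumptions made for this example.

```python
def phoneme_centers(durations_ms):
    """Center time (ms) of each phoneme, given consecutive durations."""
    centers, start = [], 0.0
    for d in durations_ms:
        centers.append(start + d / 2.0)
        start += d
    return centers


def interpolate_in_real_time(durations_ms, start_index, end_index, w_start, w_end):
    """Interpolate a mixing ratio linearly over real time between two phonemes."""
    centers = phoneme_centers(durations_ms)
    t0, t1 = centers[start_index], centers[end_index]
    weights = []
    for t in centers:
        if t <= t0:
            weights.append(w_start)
        elif t >= t1:
            weights.append(w_end)
        else:
            weights.append(w_start + (w_end - w_start) * (t - t0) / (t1 - t0))
    return weights
```

With durations of 100, 100, 300, and 100 ms, the long third phoneme sits much later in real time than its index suggests, so its ratio comes out larger than it would under interpolation on the phoneme-index axis.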

In Embodiments 1, 2, 3, 4, and 5, the prosody control unit determines phoneme durations and pause durations with a duration model for each phoneme, and generates the fundamental frequency pattern and amplitude pattern by selecting and deforming patterns with reference to the prosody pattern database. Alternatively, the fundamental frequency pattern, amplitude pattern, phoneme durations, and pause durations may be generated by a probabilistic statistical model that uses linguistic information and the phoneme string as attributes.

In Embodiments 1, 2, 3, 4, and 5, the time positions set on the phoneme string are converted to real time, the utterance style weight, utterance style mixing ratio, or conversion function parameter mixing ratio is interpolated in real time, and the result is then converted to segment units to control the utterance style. Alternatively, as in Embodiment 6, the utterance style weight, utterance style mixing ratio, or conversion function parameter mixing ratio may be interpolated on a time axis whose unit is the phoneme string in order to control the utterance style.

FIG. 19 is a diagram showing an example of a configuration in which a processing unit for creating tagged text is provided inside the speech synthesizer. In Embodiments 1, 2, 3, 4, 5, and 6, as shown in FIG. 19, a text creation unit 601 that creates the body text to be synthesized and a tag creation unit 602 that creates tagged text by inserting predetermined tags and tag attributes at desired positions in the text may be added. The text creation unit 601 and the tag creation unit 602 may be external to the speech synthesizer or may be part of the speech synthesizer itself.

In Embodiments 1, 2, and 3, segment selection is performed in step S2010 and the conversion function parameters are then set in step S2011. However, segment selection may be performed at any time after the prosody generation in step S2006 and before step S2012, in which the speech synthesis parameter segments are converted to generate the speech synthesis parameters.

The time unit used to specify the temporal change of the utterance style is the mora in Embodiments 1, 3, 4, and 5 and the accent phrase in Embodiment 2; in Embodiment 6 it is left unspecified and referred to simply as the control unit. The time unit may instead be a speech unit such as a phoneme, mora, syllable, accent phrase, stress phrase, phrase, or breath group, or a linguistic unit such as a character, morpheme, word, bunsetsu, clause, or sentence. In particular, the syllable may be used as the time unit as shown in FIG. 5, or the phoneme may be used as the time unit as shown in FIG. 6.

In Embodiments 1, 2, 3, 4, 5, and 6, the utterance style weight is interpolated linearly, but another monotonically increasing or monotonically decreasing function, such as an exponential function, a logarithmic function, or a sigmoid curve, may be used.
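
To make these alternatives concrete, the sketch below defines a few monotonic curves on the normalized position u in [0, 1], each scaled so that it runs from 0 at u = 0 to 1 at u = 1, and uses them to move a weight from a start value to an end value. The steepness and rate constants and all function names are arbitrary choices for illustration and do not come from the patent.

```python
import math


def linear(u):
    return u


def sigmoid(u, steepness=10.0):
    """Logistic curve centered at u = 0.5, rescaled to pass through (0,0) and (1,1)."""
    def raw(x):
        return 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))
    lo, hi = raw(0.0), raw(1.0)
    return (raw(u) - lo) / (hi - lo)


def exponential(u, rate=3.0):
    """Slow start, fast finish."""
    return (math.exp(rate * u) - 1.0) / (math.exp(rate) - 1.0)


def logarithmic(u, rate=9.0):
    """Fast start, slow finish."""
    return math.log1p(rate * u) / math.log1p(rate)


def weight_at(u, w_start, w_end, curve=linear):
    """Weight at normalized position u; decreasing when w_end < w_start."""
    return w_start + (w_end - w_start) * curve(u)
```

Because every curve is increasing on [0, 1], the resulting weight is monotonically increasing when the end weight is larger than the start weight and monotonically decreasing otherwise.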

In Embodiments 1, 2, and 3, the time position at which the conversion function parameters are set is the center point of each speech segment, which is taken as the connection point. However, the conversion function parameters may be set at any other time position that is suitable as a control point of the speech synthesizer.

The text structure for speech synthesis, the speech synthesis method, and the speech synthesizer according to the present invention are useful for devices such as spoken dialogue devices that have the function of specifying and reproducing synthesized speech whose utterance style changes gradually during an utterance. They can also be applied to car navigation systems, telephone answering systems, e-mail reading devices, script reading devices, and similar uses.

FIG. 1 is a functional block diagram of the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 2 is a schematic diagram of the text structure for speech synthesis and the content specified by its tags in Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the operation of the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 4 is a schematic diagram of control time axis conversion in the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 5 is a schematic diagram of a different form of the text structure for speech synthesis and the content specified by its tags in Embodiment 1 of the present invention.
FIG. 6 is a schematic diagram of a different form of the text structure for speech synthesis and the content specified by its tags in Embodiment 1 of the present invention.
FIG. 7 is a schematic diagram of the text structure for speech synthesis and the content specified by its tags in Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing the operation of the speech synthesizer according to Embodiment 2 of the present invention.
FIG. 9 is a schematic diagram of the text structure for speech synthesis and the content specified by its tags in Embodiment 3 of the present invention.
FIG. 10 is a schematic diagram of a different form of the text structure for speech synthesis and the content specified by its tags in Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing the operation of the speech synthesizer according to Embodiment 3 of the present invention.
FIG. 12 is a functional block diagram of the speech synthesizer according to Embodiment 4 of the present invention.
FIG. 13 is a flowchart showing the operation of the speech synthesizer according to Embodiment 4 of the present invention.
FIG. 14 is a functional block diagram of a speech synthesizer with a different configuration according to Embodiment 4 of the present invention.
FIG. 15 is a functional block diagram of the speech synthesizer according to Embodiment 5 of the present invention.
FIG. 16 is a flowchart showing the operation of the speech synthesizer according to Embodiment 5 of the present invention.
FIG. 17 is a functional block diagram of the speech synthesizer according to Embodiment 6 of the present invention.
FIG. 18 is a flowchart showing the operation of the speech synthesizer according to Embodiment 6 of the present invention.
FIG. 19 is a diagram showing an example of a configuration in which a processing unit for creating tagged text is provided inside the speech synthesizer.
FIG. 20 is a schematic diagram of a conventional text structure for speech synthesis.
FIG. 21 is a functional block diagram of a conventional speech synthesizer.
FIG. 22 is a schematic diagram of a conventional text structure for speech synthesis.
FIG. 23 is a schematic diagram showing the relationship between the real time of the speech and the phoneme string when conventional text for speech synthesis is synthesized by a conventional speech synthesizer.

Explanation of symbols

101, 601 Text creation unit
102, 602 Tag creation unit
103 Tagged text
104 Text input unit
105 Text analysis unit
106 Tag analysis unit
107 Tag attribute analysis unit
108 Language processing unit
109 Speech synthesis unit
110 Language dictionary
111 Prosody / waveform dictionary
201 Text input unit
202 Markup language analysis unit
203 Language processing unit
204 Dictionary
205, 307a, 307b Prosody control unit
206 Standard prosody pattern data
207 Deformation position / deformation weight determination unit
208 Conversion function selection unit
209 Conversion function database
210 Conversion function parameter setting unit
211 Segment selection unit
212 Standard segment database
213, 309a, 309b Synthesis parameter generation unit
214, 513 Waveform generation unit
305, 505 Deformation position / weight determination unit
306a, 306b Switch
308a, 308b, 308c Prosody pattern database
310a, 310b, 310c Segment database with mixing reference points
311 Morphing unit
320 Synthesis parameter storage unit
409a, 409b Waveform superimposition unit
410a, 410b Segment waveform database with mixing reference points
411 Spectrum analysis unit
506 Mixed voice quality space calculation unit
507 Basic conversion formula database
508 Voice quality space conversion unit
509 Standard voice quality space data
510 Converted voice quality space data storage unit
511 Prosody generation unit
512 Spectral parameter generation unit

Claims

A speech synthesizer that takes text with commands as input and synthesizes speech that reads out the text, the speech synthesizer comprising:
separation means for separating the text with commands into (1) the text to be synthesized into speech and (2) an utterance style command that specifies, in units of any one of phonemes, morae, and syllables, a temporal change in utterance style, the utterance style being the manner of vocal expression of the speech synthesized from the text;
language processing means for linguistically analyzing the separated text and outputting at least a phonological sequence that represents the text and is written, from among a phoneme sequence, a mora sequence, and a syllable sequence, in the unit used in the utterance style command to specify the temporal change;
section identifying means for identifying the unit specified in the utterance style command and identifying, in the identified unit, the phonological section of the output phonological sequence for which the utterance style command specifies the temporal change in utterance style; and
speech synthesis means for synthesizing, in the identified phonological section, speech uttered in accordance with the temporal change in utterance style.
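The separation and section-identification elements recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `<style name="..." unit="...">` tag syntax, the `StyleCommand` record, and the functions `separate` and `locate_section` are all hypothetical names introduced here, and the language processing step is assumed to have already produced units annotated with character offsets.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax (an assumption, not the markup used in the patent):
#   <style name="anger" unit="mora">...</style>
# Nested tags are not handled by this sketch.
TAG = re.compile(
    r'<style name="(\w+)" unit="(phoneme|mora|syllable)">(.*?)</style>'
)

@dataclass
class StyleCommand:
    name: str   # utterance style, e.g. "anger"
    unit: str   # "phoneme", "mora" or "syllable"
    start: int  # character offset into the plain text (inclusive)
    end: int    # character offset into the plain text (exclusive)

def separate(tagged):
    """Split tagged text into plain text and a list of style commands."""
    plain_parts = []
    commands = []
    pos = 0      # read position in the tagged input
    length = 0   # length of the plain text built so far
    for m in TAG.finditer(tagged):
        before = tagged[pos:m.start()]
        plain_parts.append(before)
        length += len(before)
        body = m.group(3)
        commands.append(
            StyleCommand(m.group(1), m.group(2), length, length + len(body))
        )
        plain_parts.append(body)
        length += len(body)
        pos = m.end()
    plain_parts.append(tagged[pos:])
    return "".join(plain_parts), commands

def locate_section(units, command):
    """Return (first, last + 1) indices of the units covered by the command.

    `units` is a list of (symbol, char_start, char_end) tuples, assumed to be
    expressed in the same unit that the command names.
    """
    hits = [
        i for i, (_, start, end) in enumerate(units)
        if start < command.end and end > command.start
    ]
    return (hits[0], hits[-1] + 1) if hits else None
```

For example, with the romanized input `ko<style name="anger" unit="mora">nniti</style>wa` and a mora sequence ko / n / ni / ti / wa, `separate` yields the plain text `konnitiwa` and one command, and `locate_section` maps that command to the morae n, ni, and ti.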

The speech synthesizer according to claim 1, wherein the speech synthesis means synthesizes, in accordance with the utterance style command, speech whose utterance style changes in correspondence with the real time required to read out the identified phonological section.

The speech synthesizer according to claim 2, wherein
the temporal change in utterance style is expressed as a temporal change in a weight of the utterance style, and
the speech synthesis means synthesizes, in accordance with the utterance style command, speech in which the weight of the utterance style changes in correspondence with the real time required to read out the phonological section.

The speech synthesizer according to claim 1, wherein the speech synthesis means comprises:
a speech synthesis parameter segment database storing speech synthesis parameter segments, each being a group of speech synthesis parameters that represent the speech to be synthesized and that forms one speech synthesis processing unit;
a speech synthesis parameter generation unit that selects a plurality of speech synthesis parameter segments from the speech synthesis parameter segment database, connects them, generates a predetermined prosody, and generates speech synthesis parameters corresponding to the phonological sequence output by the language processing means;
a conversion rule database storing conversion rules for converting, in accordance with the utterance style command, the speech synthesis parameters corresponding to the identified phonological section;
a conversion rule selection unit that selects, from the conversion rule database, the conversion rule corresponding to the utterance style command;
a speech synthesis parameter conversion unit that converts the speech synthesis parameters corresponding to the identified phonological section based on the selected conversion rule; and
a speech waveform generation unit that generates a speech waveform based on the speech synthesis parameters.
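As a rough illustration of the conversion rule database, rule selection, and parameter conversion recited above, the following Python sketch keeps the rules in a dictionary keyed by style name and applies the selected rule only to the frames inside the identified section. Everything in it is an assumption made for illustration: the frame representation (a dictionary with `f0` and `power`), the rule names, and in particular the scale factors, which are not taken from the patent.

```python
from typing import Callable, Dict, List, Tuple

Frame = Dict[str, float]          # one frame of synthesis parameters
Rule = Callable[[Frame, float], Frame]

# Hypothetical conversion rule database. The scale factors are illustrative
# placeholders only; they do not come from the patent.
CONVERSION_RULES: Dict[str, Rule] = {
    "anger": lambda f, w: {**f, "f0": f["f0"] * (1.0 + 0.2 * w)},
    "whisper": lambda f, w: {**f, "power": f["power"] * (1.0 - 0.5 * w)},
}

def select_rule(style: str) -> Rule:
    """Pick the conversion rule that corresponds to the style command."""
    return CONVERSION_RULES[style]

def convert_section(
    frames: List[Frame],
    section: Tuple[int, int],
    style: str,
    weight: float = 1.0,
) -> List[Frame]:
    """Apply the selected rule to frames with index in [start, end)."""
    start, end = section
    rule = select_rule(style)
    return [
        rule(f, weight) if start <= i < end else dict(f)
        for i, f in enumerate(frames)
    ]
```

Frames outside the section are copied unchanged, so the standard parameters are preserved wherever no style change is specified.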

The speech synthesizer according to claim 1, wherein the speech synthesis means comprises:
a speech synthesis parameter segment database storing speech synthesis parameter segments, each being a group of speech synthesis parameters that represent the speech to be synthesized and that forms one speech synthesis processing unit, together with reference points, each defined by a frequency and a time, for identifying correspondences between a plurality of speech sounds within a speech synthesis parameter segment when the speech sounds are mixed;
a speech synthesis parameter generation unit that selects a plurality of speech synthesis parameter segments from the speech synthesis parameter segment database, connects them, generates a predetermined prosody, and generates speech synthesis parameters corresponding to the phonological sequence output by the language processing means;
a speech mixing unit that mixes the speech synthesis parameters generated by the speech synthesis parameter generation unit in the phonological section identified by the section identifying means, using the reference points selected from the speech synthesis parameter segment database together with the speech synthesis parameter segments as corresponding points between the plurality of speech sounds; and
a speech waveform generation unit that generates a speech waveform based on the mixed speech synthesis parameters.

The speech synthesizer according to claim 5, wherein, when mixing speech synthesis parameters corresponding to mutually different utterance styles, the speech mixing unit generates the temporal change in utterance style by changing the weight of each of the speech synthesis parameters as the real time required to read out the phonological section elapses.
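The time-dependent weighting described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the weight is assumed to change linearly over the real-time span of the section (the claim does not fix the shape of the change), the two parameter tracks are assumed to be already frame-aligned so that the frequency and time reference points are not modeled, and the function names `ramp_weights` and `mix` are hypothetical.

```python
def ramp_weights(frame_times, t_start, t_end, w_start=0.0, w_end=1.0):
    """Weight of the second style at each frame time (seconds).

    The weight is held at w_start before the section, moves linearly to
    w_end across [t_start, t_end], and is held at w_end afterwards.
    The linear shape is an assumption of this sketch.
    """
    weights = []
    for t in frame_times:
        if t <= t_start:
            w = w_start
        elif t >= t_end:
            w = w_end
        else:
            w = w_start + (w_end - w_start) * (t - t_start) / (t_end - t_start)
        weights.append(w)
    return weights

def mix(params_a, params_b, weights):
    """Frame-by-frame weighted mix of two aligned parameter tracks.

    params_a and params_b are lists of frames; each frame is a list of
    floats (for example F0 and spectral coefficients).
    """
    return [
        [(1.0 - w) * a + w * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b, w in zip(params_a, params_b, weights)
    ]
```

Because the weights are indexed by frame time rather than by unit count, a long mora and a short mora inside the same section receive proportionally different shares of the change, which is the point of tying the weight to real time.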

The speech synthesizer according to claim 1, wherein the speech synthesis means comprises:
a speech waveform segment database storing speech waveform segments, each representing the waveform of a portion of speech and forming one speech synthesis processing unit, together with reference points, each defined by a frequency and a time, for identifying correspondences between a plurality of speech sounds within a speech waveform segment when the speech sounds are mixed;
a speech waveform generation unit that selects a plurality of speech waveform segments from the speech waveform segment database, connects them, and generates speech waveforms corresponding to the phonological sequence output by the language processing means and to a predetermined prosody;
a speech waveform analysis unit that analyzes the speech waveforms generated by the speech waveform generation unit and extracts prosodic information and spectral information;
a speech mixing unit that mixes the prosodic information and the spectral information extracted by the speech waveform analysis unit in the phonological section identified by the section identifying means, using the reference points selected from the speech waveform segment database together with the speech waveform segments as corresponding points between the plurality of speech sounds; and
a speech waveform generation unit that generates a speech waveform based on the prosodic information and the spectral information mixed by the speech mixing unit.

The speech synthesizer according to claim 7, wherein, when mixing speech waveform segments corresponding to mutually different utterance styles, the speech mixing unit generates the temporal change in utterance style by changing the weights with which the plurality of pieces of prosodic information and spectral information are mixed as the real time required to read out the phonological section elapses.

The speech synthesizer according to any one of claims 1 to 8, wherein the utterance style command is a command or tag that controls the expression of the synthesized speech with respect to any of the following: the speaker's gender, age, the speaker, personality, physical condition at the time of utterance, mood at the time of utterance, the relationship between speakers, the physical distance between speakers, the communication conditions between speakers, the place of utterance, the time of day of utterance, the environment in which the speaker is placed, ambient noise, or emotion.

The speech synthesizer according to claim 1, wherein the speech synthesis means comprises:
a prosody generation unit that generates prosody based on a first statistical model obtained in advance by statistically learning prosodic information, spectral information, phonological sequences, and linguistic information of natural speech;
a spectral information generation unit that generates spectral information based on a second statistical model, different from the first statistical model, obtained in advance by statistically learning prosodic information, spectral information, phonological sequences, and linguistic information of natural speech;
a conversion rule database storing conversion rules for converting, in accordance with the utterance style command, the first statistical model held by the prosody generation unit and the second statistical model held by the spectral information generation unit;
a conversion rule selection unit that selects, from the conversion rule database, the conversion rule corresponding to the utterance style command;
a statistical model conversion unit that deforms or mixes the conversion rules selected by the conversion rule selection unit and converts the first statistical model and the second statistical model based on the deformed or mixed conversion rules; and
a speech waveform generation unit that generates a speech waveform based on the prosody generated by the prosody generation unit from the converted first statistical model and the spectral information generated by the spectral information generation unit from the converted second statistical model.

A speech synthesis method for receiving text with commands as input and synthesizing speech that reads out the text, the method comprising:
a separation step of separating the text with commands into (1) the text to be synthesized into speech and (2) an utterance style command that specifies, in units of any one of phonemes, morae, and syllables, a temporal change in utterance style, the utterance style being the manner of vocal expression of the speech synthesized from the text;
a language processing step of linguistically analyzing the separated text and outputting at least a phonological sequence that represents the text and is written, from among a phoneme sequence, a mora sequence, and a syllable sequence, in the unit used in the utterance style command to specify the temporal change;
a section identifying step of identifying the unit specified in the utterance style command and identifying, in the identified unit, the phonological section of the output phonological sequence for which the utterance style command specifies the temporal change in utterance style; and
a speech synthesis step of synthesizing, in the identified phonological section, speech uttered in accordance with the temporal change in utterance style.

A program for a speech synthesizer that receives text with commands as input and synthesizes speech that reads out the text, the program causing a computer to execute:
a separation step of separating the text with commands into (1) the text to be synthesized into speech and (2) an utterance style command that specifies, in units of any one of phonemes, morae, and syllables, a temporal change in utterance style, the utterance style being the manner of vocal expression of the speech synthesized from the text;
a language processing step of linguistically analyzing the separated text and outputting at least a phonological sequence that represents the text and is written, from among a phoneme sequence, a mora sequence, and a syllable sequence, in the unit used in the utterance style command to specify the temporal change;
a section identifying step of identifying the unit specified in the utterance style command and identifying, in the identified unit, the phonological section of the output phonological sequence for which the utterance style command specifies the temporal change in utterance style; and
a speech synthesis step of synthesizing, in the identified phonological section, speech uttered in accordance with the temporal change in utterance style.