JP6163454B2 - Speech synthesis apparatus, method and program thereof

Info

Publication number: JP6163454B2
Application number: JP2014103961A
Authority: JP
Inventors: 井島 勇祐; 水野 秀之; 宮崎 昇
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-05-20
Filing date: 2014-05-20
Publication date: 2017-07-12
Anticipated expiration: 2034-05-20
Also published as: JP2015219430A

Description

The present invention relates to a speech synthesizer that synthesizes speech data corresponding to a target text, and to a method and program therefor.

In recent years, technology for reading content text aloud using speech synthesis has been introduced into electronic book readers and similar devices. If the prosody pattern of the synthesized speech can be changed while content is read aloud, it should be possible to provide content that listeners are less likely to tire of. Patent Document 1 proposes a technique that lets the user freely specify and switch the voice quality of the synthesized speech when an electronic book or the like is read aloud.

Patent Document 1: JP 2005-321706 A

However, Patent Document 1 does not contemplate switching the voice quality while synthesized speech is being played back. If the voice quality were switched during playback, the pitch, intonation, rhythm, and so on of the voice would change abruptly, so abnormal sounds would likely occur or the listener would likely find the result unnatural.

An object of the present invention is to provide a speech synthesizer, and a method and program therefor, that can suppress the abnormal sounds and unnaturalness that arise when the prosody pattern is switched during playback.

To solve the above problem, according to one aspect of the present invention, phoneme segmentation information includes the phonemes corresponding to recorded speech data, which is speech data of the target text for speech synthesis read aloud, and information corresponding to the durations of those phonemes; and a prosody pattern includes the pattern of temporal change of the fundamental frequency corresponding to the recorded speech data and the phoneme segmentation information. The speech synthesizer includes a prosody pattern DB storage unit in which a plurality of prosody patterns are stored, and a prosody pattern generation unit. When the feature of a pre-switching prosody pattern, which is one of the prosody patterns stored in the prosody pattern DB storage unit and corresponds to the speech data being played back, differs from the feature of a post-switching prosody pattern, which is also one of the prosody patterns stored in the prosody pattern DB storage unit, the prosody pattern generation unit uses speech playback position information about the playback position of the speech data being played back to interpolate between the pre-switching prosody pattern and the post-switching prosody pattern, and generates a synthesis prosody pattern used to generate synthesized speech data.
The prosody pattern generation unit includes: a matching point estimation unit that uses the pre-switching prosody pattern, the post-switching prosody pattern, and the speech playback position information to estimate matching point information indicating which position in the post-switching prosody pattern corresponds to the playback position indicated by the speech playback position information; and a prosody pattern interpolation unit that uses the pre-switching prosody pattern, the post-switching prosody pattern, and the matching point information to perform interpolation so that, over a predetermined time starting from the position indicated by the matching point information, the temporal change of the fundamental frequency of the pre-switching prosody pattern is converted into the temporal change of the fundamental frequency of the post-switching prosody pattern.

To solve the above problem, according to another aspect of the present invention, phoneme segmentation information includes the phonemes corresponding to recorded speech data, which is speech data of the target text for speech synthesis read aloud, and information corresponding to the durations of those phonemes; a prosody pattern includes the pattern of temporal change of the fundamental frequency corresponding to the recorded speech data and the phoneme segmentation information; and a plurality of prosody patterns are stored in a prosody pattern DB storage unit. The speech synthesis method includes a prosody pattern generation step. When the feature of a pre-switching prosody pattern, which is one of the prosody patterns stored in the prosody pattern DB storage unit and corresponds to the speech data being played back, differs from the feature of a post-switching prosody pattern, which is also one of the prosody patterns stored in the prosody pattern DB storage unit, the prosody pattern generation step uses speech playback position information about the playback position of the speech data being played back to interpolate between the pre-switching prosody pattern and the post-switching prosody pattern, and generates a synthesis prosody pattern used to generate synthesized speech data.
The prosody pattern generation step includes: a matching point estimation step that uses the pre-switching prosody pattern, the post-switching prosody pattern, and the speech playback position information to estimate matching point information indicating which position in the post-switching prosody pattern corresponds to the playback position indicated by the speech playback position information; and a prosody pattern interpolation step that uses the pre-switching prosody pattern, the post-switching prosody pattern, and the matching point information to perform interpolation so that, over a predetermined time starting from the position indicated by the matching point information, the temporal change of the fundamental frequency of the pre-switching prosody pattern is converted into the temporal change of the fundamental frequency of the post-switching prosody pattern.

The present invention has the effect of suppressing the abnormal sounds and unnaturalness that arise when the prosody pattern is switched during playback.

FIG. 1 is a functional block diagram of the speech synthesizer according to the first embodiment.
FIG. 2 shows an example of the processing flow of the speech synthesizer according to the first embodiment.
FIG. 3A shows an example of the phoneme segmentation information S_a of the pre-switching prosody pattern P_a. FIG. 3B shows an example of the phoneme segmentation information S_b of the post-switching prosody pattern P_b.
FIG. 4 shows an example of the information stored in the prosody pattern DB storage unit.
FIG. 5 shows an example of a screen on which the user specifies the dialect and the voice quality.
FIG. 6 shows an example of the user control information C.
FIG. 7 shows an example of the processing flow of the prosody pattern selection unit.
FIG. 8 is a functional block diagram of the prosody pattern generation unit.
FIG. 9 shows an example of the processing flow of the prosody pattern generation unit.
FIG. 10 shows an example of the synthesized speech playback position information A_pos.
FIG. 11 is a functional block diagram of the prosody pattern interpolation unit.
FIG. 12 shows an example of the processing flow of the prosody pattern interpolation unit.
FIG. 13 is a functional block diagram of the speech waveform generation unit 130.
FIG. 14 shows an example of the processing flow of the speech waveform generation unit 130.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. Processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise noted.

<First embodiment>
FIG. 1 is a functional block diagram of the speech synthesizer 100 according to the first embodiment, and FIG. 2 shows an example of its processing flow.

The speech synthesizer 100 includes a prosody pattern selection unit 110, a prosody pattern generation unit 120, a speech waveform generation unit 130, a prosody pattern DB storage unit 140, and a speech synthesis DB storage unit 150. The speech synthesizer 100 may further include a recorded speech DB storage unit 101 and a prosody pattern extraction unit 103.

In this embodiment, the speech synthesizer 100 receives user control information C specified by the user, and generates and outputs synthesized speech data X_c based on the prosody pattern corresponding to the user control information C. The user control information C includes information specifying the feature of the post-switching prosody pattern, which is one of the plurality of prosody patterns stored in the prosody pattern DB storage unit 140. The user control information C further includes information specifying the post-switching speech database, which is one of the plurality of speech databases stored in the speech synthesis DB storage unit 150, and information specifying the playback speed.

In this embodiment, each prosody pattern stored in the prosody pattern DB storage unit 140 corresponds to a given dialect. In advance, prosody patterns are extracted from speech data in which the desired content (hereinafter also called the "target text") was read aloud in various dialects, and are stored in the prosody pattern DB storage unit 140. When the content is read aloud, an appropriate prosody pattern is selected from the prosody pattern DB storage unit 140 based on information such as the dialect specified by the user and the current playback position, and the synthesized speech data X_c is generated and output using the selected prosody pattern and the speech database specified by the user.

The synthesized speech data X_c is input to a playback device such as a speaker and played back.

<Recorded speech DB storage unit 101 and prosody pattern extraction unit 103>
The recorded speech DB storage unit 101 stores speech data Y_j obtained in advance by having people read aloud, in various dialects, the content to be read (a book or the like). This data is hereinafter also called "recorded speech data"; it can also be described as speech data of the target text for speech synthesis read aloud. Here j is the index of the recorded speech data and is assigned to each combination of sentence or passage, dialect, and so on. Letting J be the total number of recorded speech data Y_j stored in the recorded speech DB storage unit 101, j = 1, 2, ..., J, where J is an integer of 2 or more. If K is the total number of sentences or passages in the content to be read and L is the total number of dialects, then J = K × L.

The prosody pattern extraction unit 103 retrieves the recorded speech data Y_j from the recorded speech DB storage unit 101, extracts the prosody pattern P_j from the recorded speech data Y_j, and stores it in the prosody pattern DB storage unit 140.

If a prosody pattern DB storage unit 140 that already stores the J prosody patterns P_j is prepared in advance, the speech synthesizer 100 need not include the recorded speech DB storage unit 101 or the prosody pattern extraction unit 103.

The prosody pattern P_j includes a pattern f0_j of the temporal change of the fundamental frequency (hereinafter also called "F0") corresponding to the recorded speech data Y_j, and phoneme segmentation information S_j. The phoneme segmentation information S_j includes the phonemes corresponding to the recorded speech data Y_j and information corresponding to the durations of those phonemes (for example, the start time and end time of each phoneme). FIG. 3A and FIG. 3B show examples of the phoneme segmentation information S_a of the pre-switching prosody pattern P_a and the phoneme segmentation information S_b of the post-switching prosody pattern P_b, respectively. Here a and b are the indices of the pre-switching and post-switching prosody patterns, each being one of 1, 2, ..., J.
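As an illustration only, a prosody pattern as described above could be held in a structure like the following. This is a minimal sketch: the class and field names (`PhonemeSegment`, `ProsodyPattern`, `start_ms`, `end_ms`) are hypothetical and not taken from the patent, and the F0 values are placeholders. Only the "o" segment boundaries (350 ms to 600 ms) come from the worked example given later in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeSegment:
    """One entry of phoneme segmentation information S_j (hypothetical layout)."""
    phoneme: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


@dataclass
class ProsodyPattern:
    """Prosody pattern P_j: an F0 time series plus phoneme segmentation."""
    f0: List[float]                 # one F0 value per analysis frame (f0_j)
    segments: List[PhonemeSegment]  # phoneme segmentation information (S_j)

    def total_duration_ms(self) -> float:
        # End time of the last phoneme, or 0 for an empty pattern.
        return self.segments[-1].end_ms if self.segments else 0.0


# Placeholder data: only the "o" boundaries are taken from the description.
pattern = ProsodyPattern(
    f0=[120.0] * 120,
    segments=[PhonemeSegment("k", 200.0, 350.0), PhonemeSegment("o", 350.0, 600.0)],
)
```

Keeping the start and end time per phoneme, rather than only a duration, matches the example of "start time and end time of each phoneme" given in the text.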

The F0 time change pattern is the time series of F0 information extracted from the recorded speech data at fixed intervals (frames). For example, when F0 is extracted from 1 s of speech data with a frame shift of 5 ms, the F0 time change pattern is a time series of 200 F0 values.
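The frame count in this example follows from dividing the duration by the frame shift. The sketch below shows the arithmetic; the function names are hypothetical, and placing the first frame at 0 ms is an assumption, since the description does not say how frame boundaries are handled.

```python
from typing import List


def num_f0_frames(duration_ms: int, frame_shift_ms: int) -> int:
    """Number of F0 values for a signal of the given length (assumes frames at 0, shift, 2*shift, ...)."""
    return duration_ms // frame_shift_ms


def frame_times_ms(duration_ms: int, frame_shift_ms: int) -> List[int]:
    """Time stamp of each frame under the same assumption."""
    return [i * frame_shift_ms for i in range(num_f0_frames(duration_ms, frame_shift_ms))]


# 1 s of speech with a 5 ms frame shift gives 200 F0 points.
print(num_f0_frames(1000, 5))
```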

<Prosody pattern DB storage unit 140>
The prosody pattern DB storage unit 140 stores the J prosody patterns P_j.

In addition to the prosody pattern P_j extracted from each recorded speech data Y_j, the prosody pattern DB storage unit 140 stores an identifier of the prosody pattern (for example, a prosody pattern No.; in this embodiment, prosody pattern No. = j), an identifier of the sentence or passage of the content that was read (for example, a sentence number), the feature of the prosody pattern, and so on. Possible prosody pattern features include dialect (for example, Osaka, Kyoto, Kagoshima, Kochi), gender (for example, male, female), age group (for example, child, young adult, middle-aged, or teens, twenties, ...), and utterance situation (for example, "conversation", "speech"). In short, the feature of a prosody pattern may be anything that characterizes a prosody pattern showing some tendency relative to other prosody patterns, and the speakers need not differ.

FIG. 4 shows an example of the information stored in the prosody pattern DB storage unit 140. It is an example for speech in which content consisting of 30 sentences was recorded in several dialects; each entry carries the prosody pattern identifier (prosody pattern No.), a sentence number indicating which sentence of the content it is, and the dialect. In FIG. 4, the dialect is used as the feature of the prosody pattern.
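The table of FIG. 4 can be pictured as a list of records keyed by prosody pattern No., sentence number, and dialect. The following is a minimal sketch under stated assumptions: the record type, the function names, and the numbering order (all sentences of one dialect, then the next dialect) are hypothetical; the description only states that each pattern has an identifier, a sentence number, and a feature, and that J = K × L.

```python
from typing import List, NamedTuple, Optional


class PatternRecord(NamedTuple):
    pattern_no: int   # prosody pattern No. (j)
    sentence_no: int  # which sentence of the content
    dialect: str      # feature of the prosody pattern


def build_records(num_sentences: int, dialects: List[str]) -> List[PatternRecord]:
    """Create J = K x L records; the dialect-major numbering is an assumption."""
    records = []
    j = 1
    for dialect in dialects:
        for sentence_no in range(1, num_sentences + 1):
            records.append(PatternRecord(j, sentence_no, dialect))
            j += 1
    return records


def find_pattern(records: List[PatternRecord], sentence_no: int,
                 dialect: str) -> Optional[PatternRecord]:
    """Return the record for a sentence and dialect, or None if absent."""
    for record in records:
        if record.sentence_no == sentence_no and record.dialect == dialect:
            return record
    return None


records = build_records(30, ["Osaka", "Kyoto"])  # K = 30, L = 2, so J = 60
```

A linear scan is enough for a sketch; a real store would more likely index on the (sentence number, feature) pair.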

In this embodiment, "dialect" means the prosody pattern (pitch, intonation, rhythm, and so on) characteristic of a region, relative to a given common or standard language, and does not include grammar or vocabulary peculiar to that region. Therefore, when the sentence numbers are the same, the prosody patterns are those of the "same sentence (same target text)" being uttered, even if the dialects differ. The common or standard language may itself be treated as one of the dialects.

<Speech synthesis DB storage unit 150>
The speech synthesis DB storage unit 150 stores a plurality of speech databases. A speech database is used for speech synthesis and can be specified by the user control information C. In this embodiment, HMM-based speech synthesis is used as the speech synthesis method, and the speech synthesis DB storage unit 150 stores an HMM for each selectable speech database. In the example of FIG. 5, HMMs corresponding to four voice qualities ("genki" (cheerful), "shonbori" (dejected), "takai" (high), "hikui" (low)) are stored in the speech synthesis DB storage unit 150. A speech database may also be prepared for each combination of gender (for example, male, female), age group (for example, child, young adult, middle-aged, or teens, twenties, ...), utterance situation (for example, "conversation", "speech"), and so on.

The user control information C is now described. It includes information specifying the feature of the post-switching prosody pattern, which is one of the prosody patterns stored in the prosody pattern DB storage unit 140; information specifying the post-switching speech database, which is one of the speech databases stored in the speech synthesis DB storage unit 150; and information specifying the playback speed.

In this embodiment, the user control information C includes information indicating the dialect or the like specified by the user. In other words, in this embodiment the feature of the prosody pattern is specified by dialect. The user control information C also includes information indicating the speech database specified by the user; in this embodiment the speech database is specified by voice quality. The user control information C further includes information indicating the playback speed specified by the user. FIG. 5 shows an example of a screen on which the user specifies the dialect, the speech database, and the playback speed, and FIG. 6 shows an example of the user control information C. When reading of the content starts, initial values for the dialect, voice quality, and speed may be set in advance, or reading may start after the user has selected the dialect, voice quality, and speed. The user control information C may be acquired at fixed intervals, or only when at least one of the prosody pattern feature, the speech database, and the speed has been changed.

<Prosody pattern selection unit 110>
The prosody pattern selection unit 110 receives the user control information C and the synthesized speech playback position information A_pos, and uses them to select and output the prosody pattern P_b to be used for the synthesized speech data X_c to be played back (S110).

FIG. 7 shows an example of the processing flow of the prosody pattern selection unit 110.

The synthesized speech playback position information A_pos is output by the speech waveform generation unit 130. It is information about the playback position of the synthesized speech data X_c being played back, and includes the prosody pattern No. being played back, the playback position (time information, and whether playback has finished), and so on. As shown in FIG. 4, the prosody pattern DB storage unit 140 stores the sentence number and prosody pattern feature corresponding to each prosody pattern No., so the sentence number and prosody pattern feature being played back can be identified from the prosody pattern No.

First, the prosody pattern selection unit 110 uses the synthesized speech playback position information A_pos to determine whether playback of the current synthesized speech data X_c (in this embodiment, synthesized speech data X_c in units of sentences) has finished (S111). If playback has finished, the prosody pattern selection unit 110 increments the sentence number (S112) and determines whether there is a prosody pattern that corresponds to the incremented sentence number and to the prosody pattern feature (dialect) specified by the user control information C (S113). If there is, the prosody pattern selection unit 110 retrieves that prosody pattern P_d from the prosody pattern DB storage unit 140 and outputs it (S114). Here d is the index of the prosody pattern that corresponds to the incremented sentence number and to the prosody pattern feature specified by the user control information C, and is one of 1, 2, ..., J. At this time, the prosody pattern selection unit 110 also outputs a control signal C_sys indicating that playback of the synthesized speech data X_c has finished.

If there is no prosody pattern for the next sentence number, the prosody pattern selection unit 110 judges that all of the content has been read, ends the playback processing, and outputs to each unit a control signal indicating that the playback processing has ended (S115; this control signal is omitted from FIG. 1).

If the synthesized speech data X_c is still being played back (playback has not finished), the prosody pattern selection unit 110 determines whether the prosody pattern feature specified by the user control information C has been changed (S116). For example, the prosody pattern selection unit 110 judges that the feature has been changed when the prosody pattern feature specified by the user control information C differs from the feature of the prosody pattern identified by the prosody pattern No. contained in the synthesized speech playback position information A_pos, and judges that it has not been changed when the two match. If the feature has been changed, the prosody pattern selection unit 110 retrieves from the prosody pattern DB storage unit 140 the prosody pattern P_b that has the changed feature and the sentence number currently being played back, and outputs it (S117). At this time, the prosody pattern selection unit 110 also outputs a control signal C_sys indicating that the prosody pattern feature was switched during playback of the synthesized speech data X_c.

If the prosody pattern feature in the user control information C has not been changed, the prosody pattern selection unit 110 outputs nothing.
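The branching of steps S111 to S117 can be sketched as a single function. This is an illustrative reading of the flow, not the patented implementation: the record and position types, the function name `select_pattern`, and the string labels standing in for the control signal C_sys are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class PatternRecord(NamedTuple):
    pattern_no: int
    sentence_no: int
    dialect: str


class PlaybackPosition(NamedTuple):
    """Stand-in for A_pos: pattern being played, time position, finished flag."""
    pattern_no: int
    position_ms: float
    finished: bool


def _find(records: List[PatternRecord], sentence_no: int, dialect: str) -> Optional[PatternRecord]:
    return next((r for r in records
                 if r.sentence_no == sentence_no and r.dialect == dialect), None)


def select_pattern(records: List[PatternRecord], requested_dialect: str,
                   pos: PlaybackPosition) -> Tuple[str, Optional[PatternRecord]]:
    """Return (control label, selected record); the labels are hypothetical stand-ins for C_sys."""
    current = next(r for r in records if r.pattern_no == pos.pattern_no)
    if pos.finished:                                      # S111: playback finished?
        next_sentence = current.sentence_no + 1           # S112: increment sentence number
        candidate = _find(records, next_sentence, requested_dialect)  # S113
        if candidate is None:
            return ("end_of_content", None)               # S115: nothing left to read
        return ("sentence_finished", candidate)           # S114: output P_d
    if requested_dialect != current.dialect:              # S116: feature changed?
        # S117: same sentence, new feature (None if no such pattern is stored).
        return ("switched_during_playback",
                _find(records, current.sentence_no, requested_dialect))
    return ("no_output", None)                            # feature unchanged


example_records = [
    PatternRecord(1, 1, "Osaka"), PatternRecord(2, 2, "Osaka"),
    PatternRecord(3, 1, "Kyoto"), PatternRecord(4, 2, "Kyoto"),
]
```

For instance, with pattern No. 1 (sentence 1, Osaka) still playing and Kyoto requested, the function returns the Kyoto pattern for sentence 1 together with the "switched during playback" label, which is the case that triggers interpolation downstream.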

<Prosody pattern generation unit 120>
FIG. 8 is a functional block diagram of the prosody pattern generation unit 120, and FIG. 9 shows an example of its processing flow.

The prosody pattern generation unit 120 includes a matching point estimation unit 122 and a prosody pattern interpolation unit 123.

The prosody pattern generation unit 120 receives the control signal C_sys and the prosody pattern P_b or P_d.

When the control signal C_sys indicates that playback of the synthesized speech data X_c has finished, the prosody pattern generation unit 120 does not interpolate between prosody patterns and outputs the received prosody pattern P_d as is to the speech waveform generation unit 130 as the synthesis prosody pattern P_c. The matching point estimation unit 122 and the prosody pattern interpolation unit 123 therefore perform no processing. In this case, even if the prosody pattern feature specified by the user control information C has been switched, the sentences before and after the switch are different, so no abnormal sound occurs and no unnaturalness arises. The prosody pattern interpolation processing (S120) is therefore unnecessary.

When the control signal C_sys indicates that the prosody pattern feature was switched during playback of the synthesized speech data X_c (in other words, when the pre-switching prosody pattern P_a and the post-switching prosody pattern P_b differ), the prosody pattern generation unit 120 receives at least the synthesized speech playback position information A_pos as of when the prosody pattern P_b was received (that is, the A_pos output at the same timing, for example the same frame, as the prosody pattern P_b). The prosody pattern generation unit 120 further retrieves from the prosody pattern DB storage unit 140 the prosody pattern P_a (hereinafter also called the "pre-switching prosody pattern") corresponding to the prosody pattern No. a contained in the synthesized speech playback position information A_pos.

Using the synthesized speech playback position information A_pos, the prosody pattern generation unit 120 interpolates between the pre-switching prosody pattern P_a and the prosody pattern P_b (hereinafter also called the "post-switching prosody pattern P_b"), creates a new prosody pattern, and outputs it as the synthesis prosody pattern P_c (S120). This suppresses abnormal sounds when the prosody pattern is switched and reduces unnaturalness. The processing of each unit is described below.

(Matching point estimation unit 122)
The matching point estimation unit 122 receives the synthesized speech playback position information A_pos, the pre-switching prosody pattern P_a, and the post-switching prosody pattern P_b. Using A_pos, the pre-switching prosody pattern P_a (more precisely, the phoneme segmentation information S_a contained in P_a), and the post-switching prosody pattern P_b (more precisely, the phoneme segmentation information S_b contained in P_b), the matching point estimation unit 122 estimates and outputs information T_m (hereinafter also called "matching point information") indicating which position in the post-switching prosody pattern P_b corresponds to the playback position indicated by A_pos (S122).

For example, the matching point estimation unit 122 first uses the synthesized speech playback position information A_pos and the phoneme segmentation information S_a to estimate the phoneme currently being played. When the synthesized speech playback position information A_pos of FIG. 10 and the pre-switching phoneme segmentation information S_a of FIG. 3A are input, the phoneme corresponding to the playback position (450 ms) (hereinafter also referred to as the "playback phoneme"), namely "o", is obtained as the estimation result. The relative position within that phoneme is also obtained: 0.4 = position within the current phoneme (100 ms = 450 ms - 350 ms) / phoneme duration (250 ms = 600 ms - 350 ms).

Next, the matching point estimation unit 122 estimates where the current playback position given by A_pos falls in the post-switching prosody pattern. For example, when the post-switching phoneme segmentation information S_b of FIG. 3B is input, the previously estimated phoneme ("o") and relative position (0.4) yield 380 ms (= 300 ms + 0.4 * (500 ms - 300 ms)) as the matching point information T_m.
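The two estimation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `(label, start_ms, end_ms)` segment representation and the function name are assumptions, and only the boundaries of the phoneme "o" (350-600 ms before switching, 300-500 ms after) come from the text; the preceding "k" segment is invented so that the lists are complete.

```python
from typing import List, Tuple

# Hypothetical representation of one phoneme segment: (label, start_ms, end_ms).
Segment = Tuple[str, float, float]


def estimate_matching_point(pos_ms: float,
                            seg_a: List[Segment],
                            seg_b: List[Segment]) -> float:
    """Map a playback position in pattern A onto the timeline of pattern B.

    Both segmentations are assumed to describe the same sentence, so the
    k-th phoneme of seg_a corresponds to the k-th phoneme of seg_b.
    """
    for index, (label, start, end) in enumerate(seg_a):
        if start <= pos_ms < end:
            # Relative position inside the phoneme currently being played.
            relative = (pos_ms - start) / (end - start)
            label_b, start_b, end_b = seg_b[index]
            if label_b != label:
                raise ValueError("segmentations do not describe the same text")
            # Same relative position inside the corresponding phoneme of B.
            return start_b + relative * (end_b - start_b)
    raise ValueError("playback position lies outside the segmentation")


# Only the "o" boundaries are taken from the text; "k" is a made-up filler.
S_a = [("k", 0.0, 350.0), ("o", 350.0, 600.0)]
S_b = [("k", 0.0, 300.0), ("o", 300.0, 500.0)]
T_m = estimate_matching_point(450.0, S_a, S_b)  # approximately 380 ms
```

Indexing the two segmentations by position rather than searching by label avoids ambiguity when the same phoneme occurs more than once in the sentence.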

<Prosody pattern interpolation unit 123>
FIG. 11 is a functional block diagram of the prosody pattern interpolation unit 123, and FIG. 12 shows an example of its processing flow. The prosody pattern interpolation unit 123 includes a time information correction unit 123a and an F0 time change pattern interpolation unit 123b.

The prosody pattern interpolation unit 123 receives the pre-switching prosody pattern P_a, the post-switching prosody pattern P_b, and the matching point information T_m. Using these, it performs interpolation so that, over a predetermined time T_i starting from the position indicated by T_m, the fundamental frequency time change pattern F0_a of the pre-switching prosody pattern P_a is converted into the fundamental frequency time change pattern F0_b of the post-switching prosody pattern P_b, and it outputs the interpolated prosody pattern as the synthesis prosody pattern P_c (S123). The time change pattern F0_a is assumed to consist of N points of F0 information, and the time change pattern F0_b of M points of F0 information.

(Time information correction unit 123a)
The time information correction unit 123a receives the pre-switching prosody pattern P_a, the post-switching prosody pattern P_b, and the matching point information T_m.

Because the pre-switching phoneme segmentation information S_a included in the pre-switching prosody pattern P_a differs from the post-switching phoneme segmentation information S_b included in the post-switching prosody pattern P_b, the patterns cannot be interpolated as they are. The time information correction unit 123a therefore first corrects the time information so that S_a matches S_b, and outputs the corrected phoneme segmentation information S_C (S123a). Because the pre-switching prosody pattern P_a and the post-switching prosody pattern P_b correspond to the same sentence, this correction is performed by copying S_b over S_a. In other words, the post-switching phoneme segmentation information S_b is used as the corrected phoneme segmentation information S_C.

Next, the time information correction unit 123a derives, from the pre-switching time change pattern F0_a, a time change pattern F0_aw that has the same number of frames (M points) as the post-switching time change pattern F0_b, and outputs it (S123a). DP matching (dynamic time warping; DTW) or a similar technique is used for this correction.
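One way to obtain an M-point pattern from an N-point pattern with DTW is sketched below. The text names only DP matching (DTW); the local cost (absolute F0 difference), the averaging of frames aligned to the same target frame, and the function name `dtw_warp` are assumptions made for illustration.

```python
from typing import List


def dtw_warp(f0_a: List[float], f0_b: List[float]) -> List[float]:
    """Warp f0_a (N points) onto the time axis of f0_b (M points) with DTW."""
    n, m = len(f0_a), len(f0_b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning f0_a[:i] with f0_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(f0_a[i - 1] - f0_b[j - 1])  # assumed local distance
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])

    # Backtrack the optimal path and collect the A-frames aligned to each B-frame.
    aligned: List[List[float]] = [[] for _ in range(m)]
    i, j = n, m
    while i > 0 and j > 0:
        aligned[j - 1].append(f0_a[i - 1])
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))

    # Every B-frame receives at least one A-frame; average if there are several.
    return [sum(values) / len(values) for values in aligned]


# Illustrative values only: a 3-point pattern stretched to 5 points.
f0_aw = dtw_warp([100.0, 110.0, 120.0], [100.0, 100.0, 110.0, 120.0, 120.0])
```

Since both patterns describe the same sentence and the phoneme boundaries are known, a piecewise-linear time mapping per phoneme would be a simpler alternative; the DTW form is shown because it is the technique the text mentions.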

(F0 time change pattern interpolation unit 123b)
The F0 time change pattern interpolation unit 123b receives the post-switching prosody pattern P_b, the matching point information T_m, the time change pattern F0_aw, and the phoneme segmentation information S_C. It interpolates between the time change pattern F0_b included in the post-switching prosody pattern P_b and the time-corrected time change pattern F0_aw to obtain the interpolated time change pattern F0_c (S123b). The F0 time change pattern interpolation unit 123b outputs the synthesis prosody pattern P_c, which includes the phoneme segmentation information S_C and the time change pattern F0_c.

The portion of the time change pattern subject to interpolation spans a fixed time length T_i [ms] starting from the matching point information T_m estimated by the matching point estimation unit 122 (380 ms in the example above). Various interpolation methods are available; when the simplest one, linear interpolation, is used, the time change pattern F0_c of the i-th frame after interpolation is obtained by the following equation.
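The equation itself is not reproduced in this text. One form consistent with the surrounding description (a weight that grows linearly with time, with the i-th frame located at time i·T) would be the following; the exact expression and indexing are assumptions.

```latex
F0_c[i] = \bigl(1 - w[i]\bigr)\, F0_{aw}[i] + w[i]\, F0_b[i],
\qquad
w[i] = \frac{i\,T - T_m}{T_i},
\qquad
T_m \le i\,T \le T_m + T_i
```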

Here, T [ms] denotes the frame shift length, and w[i] is the interpolation weight. It suffices that the weight of the F0 time change pattern F0_aw of the pre-switching prosody pattern P_a is large when the frame being interpolated (the i-th frame) is close to the matching point T_m, and that the weight of the F0 time change pattern F0_b of the post-switching prosody pattern P_b grows as the frame approaches the end of the interpolation interval of length T_i; the weight is not limited to the form given by the above equation. For example, in the above equation the weight increases linearly with time, but the shape of the weight transition can be changed by using a sigmoid function or the like.
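The weighting behaviour described above can be sketched as follows. The function names, the sigmoid steepness, and the handling of frames outside the interval (F0_aw before T_m, F0_b after T_m + T_i) are assumptions for illustration.

```python
import math
from typing import Callable, List


def linear_weight(t_ms: float, t_m: float, t_i: float) -> float:
    """Weight of the post-switching pattern, rising linearly from 0 to 1."""
    return min(1.0, max(0.0, (t_ms - t_m) / t_i))


def sigmoid_weight(t_ms: float, t_m: float, t_i: float,
                   steepness: float = 10.0) -> float:
    """Smoother alternative centred on the middle of the interval."""
    x = (t_ms - t_m) / t_i - 0.5
    return 1.0 / (1.0 + math.exp(-steepness * x))


def interpolate_f0(f0_aw: List[float], f0_b: List[float],
                   t_m: float, t_i: float, frame_shift: float,
                   weight: Callable[[float, float, float], float] = linear_weight
                   ) -> List[float]:
    """Blend two equal-length F0 patterns over [t_m, t_m + t_i]."""
    result = []
    for i, (a, b) in enumerate(zip(f0_aw, f0_b)):
        t_ms = i * frame_shift  # the i-th frame is assumed to sit at i * T
        if t_ms < t_m:
            result.append(a)            # before the matching point
        elif t_ms > t_m + t_i:
            result.append(b)            # after the interpolation interval
        else:
            w = weight(t_ms, t_m, t_i)
            result.append((1.0 - w) * a + w * b)
    return result


# Illustrative values only: five frames, 10 ms frame shift.
f0_c = interpolate_f0([100.0] * 5, [200.0] * 5,
                      t_m=10.0, t_i=20.0, frame_shift=10.0)
```

Passing `weight=sigmoid_weight` keeps the same interval but concentrates the change around its midpoint.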

<Speech waveform generation unit 130>
FIG. 13 is a functional block diagram of the speech waveform generation unit 130, and FIG. 14 shows an example of its processing flow.

The speech waveform generation unit 130 includes a spectrum parameter generation unit 131, a spectrum parameter interpolation unit 132, and a waveform generation unit 133.

The speech waveform generation unit 130 receives the user control information C and the synthesis prosody pattern P_c. In addition, when the pre-switching prosody pattern P_a differs from the post-switching prosody pattern P_b and the pre-switching speech database Z_A differs from the post-switching speech database Z_B (in other words, when the prosody pattern and the speech database are switched at the same time), the speech waveform generation unit 130 receives the matching point information T_m. Whether P_a and P_b differ may be determined by receiving the control signal C_sys output by the prosody pattern selection unit 110, or it may be determined inside the speech waveform generation unit 130 from the control signal C and the synthesized speech playback position information A_pos, using the same determination method as the prosody pattern selection unit 110. Whether Z_A and Z_B differ can be determined by checking whether the speech database in use at the time the user control information C is received (the pre-switching speech database Z_A) matches the speech database specified by the user control information C (the post-switching speech database Z_B).

When the pre-switching speech database Z_A and the post-switching speech database Z_B differ, the speech waveform generation unit 130 generates the pre-switching spectrum parameter SP_a using Z_A and the synthesis prosody pattern P_c, and generates the post-switching spectrum parameter SP_b using Z_B and P_c. It then interpolates between SP_a and SP_b using the synthesized speech playback position information A_pos or the matching point information T_m, and generates and outputs the synthesis spectrum parameter SP_c used to generate the synthesized speech data (S130).

When the pre-switching speech database Z_A and the post-switching speech database Z_B are the same (S131A), the speech waveform generation unit 130 additionally retrieves the pre-switching speech database Z_A from the speech synthesis DB storage unit 150.

When Z_A and Z_B differ (S131A), the speech waveform generation unit 130 additionally retrieves both the pre-switching speech database Z_A and the post-switching speech database Z_B from the speech synthesis DB storage unit 150.
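The branching described for the speech waveform generation unit 130 can be summarized in a small decision function. This is only a sketch: the function name, argument names, and return shape are hypothetical, and the rule that T_m anchors the interpolation when the prosody pattern also changed, while A_pos anchors it when only the database changed, follows the description of the spectrum parameter interpolation unit 132.

```python
from typing import List, Optional, Tuple


def plan_spectrum_processing(prosody_changed: bool,
                             db_before: str,
                             db_after: str,
                             a_pos_ms: float,
                             t_m_ms: Optional[float]
                             ) -> Tuple[List[str], Optional[float]]:
    """Return (databases to load, start of spectrum interpolation in ms).

    A start of None means no spectrum interpolation is needed.
    """
    if db_before == db_after:
        # Same database: a single set of spectrum parameters is generated.
        return [db_before], None
    # Different databases: both are loaded and SP_a / SP_b are interpolated.
    if prosody_changed:
        if t_m_ms is None:
            raise ValueError("matching point T_m is required")
        return [db_before, db_after], t_m_ms
    return [db_before, db_after], a_pos_ms


# Illustrative database names; they are not taken from the text.
plan = plan_spectrum_processing(True, "Z_A", "Z_B", a_pos_ms=450.0, t_m_ms=380.0)
```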

<Spectrum parameter generation unit 131>
(When the pre-switching speech database Z_A and the post-switching speech database Z_B are the same)
The spectrum parameter generation unit 131 generates a spectrum parameter using the synthesis prosody pattern P_c (more precisely, the phoneme segmentation information S_c included in P_c) and the pre-switching speech database Z_A, and outputs it to the waveform generation unit 133 as the synthesis spectrum parameter SP_c (S131d). In this embodiment, hidden Markov models (HMMs) are used as the speech database, and the spectrum parameter can be generated from the phoneme segmentation information and the HMM by, for example, the method of Reference 1.
(Reference 1) Masuko et al., "HMM-based speech synthesis using dynamic features", IEICE Transactions, 1996, vol. J79-D-II, no. 12, pp. 2184-2190.

(When the pre-switching speech database Z_A and the post-switching speech database Z_B are different)
The spectrum parameter generation unit 131 generates the pre-switching spectrum parameter SP_a using the synthesis prosody pattern P_c (more precisely, the phoneme segmentation information S_c included in P_c) and the pre-switching speech database Z_A (see Reference 1), and outputs it to the spectrum parameter interpolation unit 132 (S131). Likewise, it generates the post-switching spectrum parameter SP_b using P_c (more precisely, S_c) and the post-switching speech database Z_B (see Reference 1), and outputs it to the spectrum parameter interpolation unit 132 (S131).

<Spectrum parameter interpolation unit 132>
This unit operates only when the pre-switching speech database Z_A and the post-switching speech database Z_B differ.

The spectrum parameter interpolation unit 132 receives the pre-switching spectrum parameter SP_a and the post-switching spectrum parameter SP_b. When the pre-switching prosody pattern P_a and the post-switching prosody pattern P_b differ, it also receives the matching point information T_m. When P_a and P_b are the same (in other words, when only the speech database has been switched), it instead receives the synthesized speech playback position information A_pos. Using SP_a, SP_b, and either T_m or A_pos, the spectrum parameter interpolation unit 132 performs interpolation so that, over a predetermined time T_i starting from the position indicated by T_m or A_pos, the pre-switching spectrum parameter SP_a is converted into the post-switching spectrum parameter SP_b, and it creates and outputs the synthesis spectrum parameter SP_c used to generate the synthesized speech data X_c (S132). The pre-switching spectrum parameter SP_a and the post-switching spectrum parameter SP_b each hold an F0 time change pattern (M points of F0 information).

The interpolated synthesis spectrum parameter SP_c is obtained by interpolating between the post-switching spectrum parameter SP_b and the pre-switching spectrum parameter SP_a. The spectrum parameters subject to interpolation span a fixed time length T_i [ms] starting from the matching point information T_m (380 ms in the example above) or from the synthesized speech playback position information A_pos. Various interpolation methods are available; when the simplest one, linear interpolation, is used, the synthesis spectrum parameter SP_c[i] of the i-th frame after interpolation is obtained by the following equation (shown for the case in which the matching point information T_m is used).

The interpolation weight w[i] can be set as described for the F0 time change pattern interpolation unit 123b. When the synthesized speech playback position information A_pos is used, A_pos simply takes the place of the matching point information T_m.
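A frame-wise linear interpolation of spectrum parameters can be sketched as below. Representing each frame as a list of coefficients, clamping the weight outside the interval, and the function name are assumptions; the `start_ms` argument stands for T_m or A_pos, whichever applies.

```python
from typing import List

Frame = List[float]  # one spectrum parameter vector per frame (assumed layout)


def interpolate_spectra(sp_a: List[Frame], sp_b: List[Frame],
                        start_ms: float, t_i: float,
                        frame_shift: float) -> List[Frame]:
    """Blend SP_a into SP_b over [start_ms, start_ms + t_i]."""
    result = []
    for i, (frame_a, frame_b) in enumerate(zip(sp_a, sp_b)):
        # Weight of SP_b: 0 before the interval, 1 after it, linear in between.
        w = min(1.0, max(0.0, (i * frame_shift - start_ms) / t_i))
        result.append([(1.0 - w) * a + w * b
                       for a, b in zip(frame_a, frame_b)])
    return result


# Illustrative two-coefficient frames, 5 ms frame shift.
sp_c = interpolate_spectra([[0.0, 0.0]] * 3, [[1.0, 2.0]] * 3,
                           start_ms=0.0, t_i=10.0, frame_shift=5.0)
```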

<Waveform generation unit 133>
The waveform generation unit 133 receives the synthesis spectrum parameter SP_c, the synthesis prosody pattern P_c, and the user control signal C. Using the synthesis prosody pattern P_c (more precisely, the F0 time change pattern F0_c included in P_c) and the synthesis spectrum parameter SP_c, it generates a speech waveform with a speech synthesis filter (for example, that of Reference 2) (S133).
(Reference 2) Imai et al., "Mel log spectrum approximation (MLSA) filter for speech synthesis", IEICE Transactions A, 1983, vol. J66-A, no. 2, pp. 122-129.

Furthermore, when the speed included in the user control information is the same as the speed of the generated speech waveform, the waveform generation unit 133 outputs the generated speech waveform as the synthesized speech data X_c. When the speed included in the user control information differs from the speed of the generated speech waveform, it changes the playback speed of the speech waveform according to the speed included in the user control information (for example, by the method of Reference 3) and outputs the speed-changed speech waveform as the synthesized speech data X_c.
(Reference 3) Naotaka Morita and Fumitada Itakura, "Time-axis expansion and compression of speech using an overlap-add method with pointer movement control (PICOLA) and its evaluation", Proceedings of the Acoustical Society of Japan, 1986, pp. 149-150.

When outputting the synthesized speech data X_c, the waveform generation unit 133 also outputs the synthesized speech playback position information A_pos, which is information about the playback position (including the number of the prosody pattern being played and the playback position, that is, time information and whether playback has finished).

<Effect>
With this configuration, the feature (dialect) of the prosody pattern can be changed freely as the user specifies, so synthesized speech that is less tiring and easy to understand can be provided. Furthermore, unlike the prior art, the user can switch the prosody pattern while the synthesized speech is being played, which enables highly interactive speech synthesis. The abnormal sounds and sense of incongruity that arise when the prosody pattern is switched during playback can also be suppressed. This effect is particularly strong when the prosody pattern feature includes dialect. When dialects differ, the pitch, intonation, rhythm, and so on can be completely different even for the same word, so simply switching the prosody pattern is likely to produce abnormal sounds or a sense of incongruity. In this embodiment, however, the pre-switching and post-switching prosody patterns are interpolated, so differences in pitch, intonation, rhythm, and so on are bridged smoothly; abnormal sounds are suppressed, the listener's sense of incongruity is reduced, and natural synthesized speech can be read aloud. In addition, this embodiment also suppresses the abnormal sounds and sense of incongruity that arise when the speech database is switched.

<Modification>
In this embodiment, an HMM-based speech synthesis method is used, but a concatenative (waveform concatenation) method may be used instead. In that case, the speech synthesis DB storage unit 150 stores a speech unit database for each selectable speech database. When a concatenative method is used, the spectrum parameter generation unit 131 first creates speech data using the speech unit database and the prosody pattern, and then generates spectrum parameters from that speech data. The spectrum parameter interpolation processing is then performed on the generated spectrum parameters.

In this embodiment, all of the content is read aloud using synthesized speech, but recorded speech stored in the recorded speech DB storage unit 101 may be used outside the interval in which interpolation is performed. That is, speech synthesis may be performed only for the predetermined time T_i after switching. In that case, the prosody pattern selection unit 110 has each unit perform interpolation only when it outputs the control signal C_sys indicating that the prosody pattern feature was switched during playback of the synthesized speech data X_c; for all other time intervals, it outputs a control signal so that the recorded speech is output as is.

In the example of FIG. 5, four HMMs identified by four voice qualities ("genki" (energetic), "shonbori" (dejected), "takai" (high), and "hikui" (low)) are stored in the speech synthesis DB storage unit 150, but it is not necessary to prepare every corresponding HMM. For example, a single voice quality of standard pitch may be prepared, and when the voice quality "takai" or "hikui" is specified, the fundamental frequency corresponding to the standard-pitch voice quality may be raised or lowered. Alternatively, a function that converts one speech database into another may be prepared and used. In short, the same effect can be obtained as long as the spectrum parameters are interpolated when the speech database is switched.
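Raising or lowering the fundamental frequency of a standard-pitch voice could look like the sketch below. The semitone parameterization, the convention that unvoiced frames carry an F0 of 0, and the function name are assumptions; the text states only that the fundamental frequency may be raised or lowered.

```python
from typing import List


def shift_f0(f0_hz: List[float], semitones: float) -> List[float]:
    """Scale voiced F0 values by a musical interval; leave unvoiced frames at 0."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


# Illustrative contour with one unvoiced frame; +12 semitones doubles the pitch.
takai = shift_f0([100.0, 0.0, 120.0], semitones=12.0)
hikui = shift_f0([100.0, 0.0, 120.0], semitones=-12.0)
```

Scaling by a ratio rather than adding a fixed offset in hertz preserves the relative shape of the intonation contour.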

In this embodiment, the content to be read aloud consists of a plurality of sentences, but it may consist of a single sentence (a single piece of synthesized speech data). In that case, the sentence number need not be used.

In this embodiment, the feature of the post-switching prosody pattern is specified by the user control information C, but control information specifying that feature may instead be generated inside the speech synthesizer 100. For example, control information may be generated so that the feature of the post-switching prosody pattern is changed at predetermined intervals. Control information may also be generated so that the feature of the post-switching prosody pattern is chosen at random at random intervals. The same applies to the post-switching speech database.
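Internally generated control information of this kind could be produced by a small scheduler such as the one below. Everything here is hypothetical: the function name, the `(time_ms, feature)` output format, the 1-5 s range used for random intervals, and the feature labels.

```python
import random
from typing import List, Optional, Tuple


def generate_control_schedule(features: List[str], total_ms: int,
                              interval_ms: Optional[int] = None,
                              seed: Optional[int] = None
                              ) -> List[Tuple[int, str]]:
    """Return (time_ms, feature) switch events for content lasting total_ms.

    With interval_ms set, switches occur at fixed intervals; otherwise the
    interval is drawn at random. At least two features are required so that
    every event actually changes the feature.
    """
    if len(features) < 2:
        raise ValueError("at least two features are needed to switch")
    rng = random.Random(seed)
    schedule: List[Tuple[int, str]] = []
    current: Optional[str] = None
    t_ms = 0
    while True:
        t_ms += interval_ms if interval_ms is not None else rng.randint(1000, 5000)
        if t_ms >= total_ms:
            return schedule
        # Pick a feature different from the one currently in use.
        current = rng.choice([f for f in features if f != current])
        schedule.append((t_ms, current))


# Illustrative labels; fixed 3 s interval over 10 s of content.
events = generate_control_schedule(["standard", "dialect_1"], 10000,
                                   interval_ms=3000, seed=0)
```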

In this embodiment, the prosody pattern, the speech database, and the playback speed can all be switched, but it is sufficient that at least the prosody pattern can be switched. Further items may also be made switchable.

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the apparatus that executes them, or as required. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each apparatus described in the above embodiment and modifications may be realized by a computer. In that case, the processing content of the functions that each apparatus should have is described by a program. By executing this program on a computer, the various processing functions of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer to this computer, the computer may successively execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and retrieval of results. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each apparatus is configured here by executing a predetermined program on a computer, at least part of the processing content may instead be realized in hardware.

Claims

A speech synthesizer, wherein phoneme segmentation information includes phonemes corresponding to recorded speech data, which is speech data obtained by reading aloud a target text for speech synthesis, and information corresponding to the durations of the phonemes, and a prosodic pattern includes a pattern of the temporal change of the fundamental frequency corresponding to the recorded speech data and the phoneme segmentation information, the speech synthesizer comprising:
a prosodic pattern DB storage unit in which a plurality of prosodic patterns are stored; and
a prosodic pattern generation unit that, when the characteristics of a pre-switching prosodic pattern, which corresponds to the speech data being reproduced and is one of the prosodic patterns stored in the prosodic pattern DB storage unit, differ from the characteristics of a post-switching prosodic pattern, which is one of the prosodic patterns stored in the prosodic pattern DB storage unit, performs interpolation processing between the pre-switching prosodic pattern and the post-switching prosodic pattern using audio reproduction position information concerning the reproduction position of the speech data being reproduced, and generates a synthesis prosodic pattern used to generate synthesized speech data,
wherein the prosodic pattern generation unit includes:
a matching point estimation unit that, using the pre-switching prosodic pattern, the post-switching prosodic pattern, and the audio reproduction position information, estimates matching point information indicating which position in the post-switching prosodic pattern the reproduction position indicated by the audio reproduction position information corresponds to; and
a prosodic pattern interpolation unit that, using the pre-switching prosodic pattern, the post-switching prosodic pattern, and the matching point information, performs interpolation processing so as to convert the temporal change of the fundamental frequency of the pre-switching prosodic pattern into the temporal change of the fundamental frequency of the post-switching prosodic pattern over a predetermined time from the position indicated by the matching point information.
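
For illustration only, the two units recited in the preceding claim can be sketched as follows. The claim does not fix how the matching point is estimated or which interpolation is used, so this sketch assumes that both prosodic patterns share the same phoneme sequence, that the relative position inside the current phoneme is preserved, and that the fundamental frequency is cross-faded linearly frame by frame. The names `estimate_match_point` and `interpolate_f0` and all parameters are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def estimate_match_point(pre_durations, post_durations, playback_pos):
    """Map a playback time in the pre-switching pattern to a time in the
    post-switching pattern.

    Assumption (not from the claim): both patterns have the same phonemes in
    the same order, all durations are positive, and the relative position
    inside the current phoneme carries over unchanged.
    """
    pre_ends = list(accumulate(pre_durations))    # end time of each phoneme
    post_ends = list(accumulate(post_durations))
    # Index of the phoneme being played, clamped to the last phoneme.
    i = min(bisect_right(pre_ends, playback_pos), len(pre_durations) - 1)
    pre_start = pre_ends[i] - pre_durations[i]
    post_start = post_ends[i] - post_durations[i]
    # Fraction of the current phoneme already played, kept within [0, 1].
    frac = min(max((playback_pos - pre_start) / pre_durations[i], 0.0), 1.0)
    return post_start + frac * post_durations[i]


def interpolate_f0(pre_f0, post_f0, n_transition):
    """Cross-fade frame-wise F0 values over n_transition frames.

    pre_f0 holds the pre-switching F0 frames from the playback position
    onward; post_f0 holds the post-switching F0 frames from the matching
    point onward. Assumption: linear weighting, voiced frames only.
    """
    out = []
    for k, f_post in enumerate(post_f0):
        w = min((k + 1) / n_transition, 1.0)      # weight of the new pattern
        f_pre = pre_f0[k] if k < len(pre_f0) else f_post
        out.append((1.0 - w) * f_pre + w * f_post)
    return out


if __name__ == "__main__":
    # Durations in milliseconds for three phonemes in each pattern.
    print(estimate_match_point([100, 200, 100], [200, 100, 300], 200))
    print(interpolate_f0([100.0] * 4, [200.0] * 4, 4))
```

In this sketch the transition length `n_transition` plays the role of the "predetermined time" in the claim. A practical system would also need to handle unvoiced frames, which this example ignores.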

The speech synthesizer of claim 1, further comprising:
a speech synthesis DB storage unit in which a plurality of speech databases are stored; and
a speech waveform generation unit that, when a pre-switching speech database, which corresponds to the speech data being reproduced and is one of the speech databases stored in the speech synthesis DB storage unit, differs from a post-switching speech database, which is one of the speech databases stored in the speech synthesis DB storage unit, generates pre-switching spectral parameters using the pre-switching speech database and a prosodic pattern, generates post-switching spectral parameters using the post-switching speech database and the prosodic pattern, performs interpolation processing between the pre-switching spectral parameters and the post-switching spectral parameters, and generates synthesis spectral parameters used to generate synthesized speech data,
wherein the speech waveform generation unit includes:
a spectral parameter interpolation unit that, using the pre-switching spectral parameters, the post-switching spectral parameters, and the audio reproduction position information, performs interpolation processing so as to convert the pre-switching spectral parameters into the post-switching spectral parameters over a predetermined time from the position indicated by the matching point information or the audio reproduction position information.
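
As a non-authoritative illustration of the spectral parameter interpolation recited in the preceding claim, the sketch below blends two frame-aligned sequences of spectral parameter vectors. Because the claim generates both sequences from the same prosodic pattern, the sketch assumes they have the same number of frames. The linear weighting, the function name `interpolate_spectral`, and its parameters are assumptions introduced here, not details taken from the claim.

```python
def interpolate_spectral(pre_frames, post_frames, start, n_transition):
    """Blend pre-switching into post-switching spectral parameters.

    pre_frames, post_frames: lists of equal-length parameter vectors, one
    vector per frame (assumed frame-aligned).
    start: frame index at which the switch begins (hypothetical stand-in for
    the position given by the matching point or playback position).
    n_transition: number of frames over which the change is spread.
    """
    out = []
    for t, (pre, post) in enumerate(zip(pre_frames, post_frames)):
        if t < start:
            w = 0.0                                   # before the switch
        else:
            w = min((t - start + 1) / n_transition, 1.0)
        out.append([(1.0 - w) * a + w * b for a, b in zip(pre, post)])
    return out


if __name__ == "__main__":
    pre = [[0.0, 0.0] for _ in range(4)]
    post = [[1.0, 2.0] for _ in range(4)]
    for frame in interpolate_spectral(pre, post, start=1, n_transition=2):
        print(frame)
```

Frames before `start` keep the pre-switching parameters, and frames after the transition use the post-switching parameters unchanged, so the voice quality changes gradually rather than abruptly.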

A speech synthesis method, wherein phoneme segmentation information includes phonemes corresponding to recorded speech data, which is speech data obtained by reading aloud a target text for speech synthesis, and information corresponding to the durations of the phonemes, a prosodic pattern includes a pattern of the temporal change of the fundamental frequency corresponding to the recorded speech data and the phoneme segmentation information, and a plurality of prosodic patterns are stored in a prosodic pattern DB storage unit, the method comprising:
a prosodic pattern generation step of, when the characteristics of a pre-switching prosodic pattern, which corresponds to the speech data being reproduced and is one of the prosodic patterns stored in the prosodic pattern DB storage unit, differ from the characteristics of a post-switching prosodic pattern, which is one of the prosodic patterns stored in the prosodic pattern DB storage unit, performing interpolation processing between the pre-switching prosodic pattern and the post-switching prosodic pattern using audio reproduction position information concerning the reproduction position of the speech data being reproduced, and generating a synthesis prosodic pattern used to generate synthesized speech data,
wherein the prosodic pattern generation step includes:
a matching point estimation step of, using the pre-switching prosodic pattern, the post-switching prosodic pattern, and the audio reproduction position information, estimating matching point information indicating which position in the post-switching prosodic pattern the reproduction position indicated by the audio reproduction position information corresponds to; and
a prosodic pattern interpolation step of, using the pre-switching prosodic pattern, the post-switching prosodic pattern, and the matching point information, performing interpolation processing so as to convert the temporal change of the fundamental frequency of the pre-switching prosodic pattern into the temporal change of the fundamental frequency of the post-switching prosodic pattern over a predetermined time from the position indicated by the matching point information.

A program for causing a computer to function as the speech synthesizer of claim 1 or 2.