JP6469035B2

JP6469035B2 - Speech synthesis apparatus and speech synthesis method

Info

Publication number: JP6469035B2
Application number: JP2016034224A
Authority: JP
Inventors: 啓吾川島; 貴弘大塚
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2016-02-25
Filing date: 2016-02-25
Publication date: 2019-02-13
Anticipated expiration: 2036-02-25
Also published as: JP2017151291A

Description

The present invention relates to speech synthesis technology, and more particularly to speech synthesis technology for controlling acoustic features, such as prosody, near the boundary between synthesis units.

In conventional speech synthesis technology, speech was synthesized based on, for example, the phoneme information of words and rules tied to accent types, but the resulting synthesized speech was monotonous and unnatural. In recent years, several speech synthesis techniques have been proposed that achieve flexible prosody control without making the synthesized speech monotonous.

For example, Patent Document 1 (Japanese Unexamined Patent Application Publication No. H10-83192) discloses a speech synthesizer that determines the coupling strength between adjacent prosodic words, a prosodic word being the minimum unit that determines one accent type, and thereby obtains the boundary strength between prosodic phrases. The coupling strength is determined using at least one of the dependency strength between prosodic words, the dependency structure, and the number of dependencies received. The prosodic information generation unit of this speech synthesizer can use the boundary strength to set a prosody control command, such as a pause insertion command or a phrase onset command, at the boundary between consecutive prosodic phrases.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H10-83192 (for example, FIG. 1 and paragraphs 0016 to 0044)

However, the prosody control commands disclosed in Patent Document 1 make it difficult to perform diverse prosody control. As a result, the prosody near the boundary tends to become monotonous, or the quality of the synthesized speech tends to degrade near the boundary.

In view of the above, an object of the present invention is to provide a speech synthesizer and a speech synthesis method that can perform diverse control of acoustic features such as prosody near the boundary between synthesis units such as prosodic phrases, and that can reduce degradation of the sound quality of synthesized speech.

A speech synthesizer according to one aspect of the present invention generates a synthesized speech waveform from an input pair of language units that form a boundary of acoustic features. The speech synthesizer comprises: a data storage unit storing a segment dictionary that defines the correspondence among the language information of each speech segment candidate, the acoustic feature quantity of that candidate, and a category that classifies the acoustic features of that candidate; a segment candidate selection unit that uses the segment dictionary to select a first segment candidate group, consisting of a plurality of speech segment candidates corresponding to one of the pair of language units, together with a second segment candidate group, consisting of a plurality of speech segment candidates corresponding to the other of the pair; a segment selection unit that extracts combination candidates, consisting of at least one pair of speech segment candidates, from the combinations of the candidates in the first segment candidate group with the candidates in the second segment candidate group; and a speech waveform generation unit that generates the synthesized speech waveform based on the acoustic feature information of the pair of speech segments selected from the combination candidates. The segment selection unit extracts the combination candidates by selecting, for each speech segment candidate in the first segment candidate group, at least one speech segment candidate from the second segment candidate group, based on a similarity measure between the categories of the speech segment candidates in the second segment candidate group.

A speech synthesis method according to another aspect of the present invention is executed in an information processing apparatus having a data storage unit that stores a segment dictionary defining the correspondence among the language information of each speech segment candidate, the acoustic feature quantity of that candidate, and a category that classifies the acoustic features of that candidate. The method comprises the steps of: receiving as input a pair of language units that form a boundary of acoustic features; using the segment dictionary to select a first segment candidate group, consisting of a plurality of speech segment candidates corresponding to one of the pair of language units, together with a second segment candidate group, consisting of a plurality of speech segment candidates corresponding to the other of the pair; selecting, for each speech segment candidate in the first segment candidate group, at least one speech segment candidate from the second segment candidate group, based on a similarity measure between the categories of the speech segment candidates in the second segment candidate group; extracting, according to the result of that selection, combination candidates consisting of at least one pair of speech segment candidates from the combinations of the candidates in the first segment candidate group with the candidates in the second segment candidate group; and generating a synthesized speech waveform based on the acoustic feature information of the pair of speech segments selected from the combination candidates.

According to the present invention, diverse control of acoustic features can be performed, and degradation of the sound quality of synthesized speech can be reduced.

FIG. 1 is a block diagram showing the schematic configuration of a speech synthesizer according to Embodiment 1 of the present invention.
FIGS. 2A and 2B are diagrams showing examples of the structure of dictionary data according to Embodiment 1.
FIGS. 3A and 3B are graphs schematically showing examples of acoustic feature quantities.
FIG. 4 is a flowchart schematically showing an example of the procedure of the speech synthesis process according to Embodiment 1.
FIG. 5 is a flowchart schematically showing an example of the segment selection process according to Embodiment 1.
FIGS. 6A and 6B are diagrams showing examples of combinations of phrase categories.
FIGS. 7A and 7B are diagrams showing other examples of combinations of phrase categories.
FIGS. 8A and 8B are diagrams showing still other examples of combinations of phrase categories.
FIG. 9 is a diagram conceptually showing the arrangement of categories in the space formed by acoustic feature parameters.
FIG. 10 is a block diagram showing an example of the hardware configuration of the speech synthesizer of Embodiment 1.
FIG. 11 is a block diagram showing the schematic configuration of a speech synthesizer according to Embodiment 2 of the present invention.
FIG. 12 is a flowchart schematically showing an example of the procedure of the speech synthesis process according to Embodiment 2.
FIG. 13 is a flowchart schematically showing an example of the segment selection process according to Embodiment 2.
FIG. 14A shows an example of input text information, FIG. 14B shows the pitch pattern corresponding to that input text information, and FIG. 14C shows an example of intermediate language information including phrase boundary symbols.

Various embodiments of the present invention are described in detail below with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the schematic configuration of a speech synthesizer 1 according to Embodiment 1 of the present invention. When a boundary caused by a difference in acoustic features exists between language units that are adjacent in the time domain, the speech synthesizer 1 selects speech segments (hereinafter also simply "segments") in consideration of the acoustic features of each of these language units, and connects the speech waveforms of the selected segments to generate a synthesized speech waveform.

Here, an acoustic feature can be expressed by one or more elements selected from the pitch of a sound (frequency), its intensity, its duration, and its timbre (also called voice quality). A prosodic feature, which is one kind of acoustic feature, can be expressed by one or more elements selected from pitch, intensity, and duration. Examples of prosodic features include pitch, accent position, accent strength, pause, and intonation. Timbre depends on the spectral information (frequency characteristics) of the speech and is characterized by, for example, resonance frequencies (formant frequencies). Tone, which is a pattern of change in pitch, is also a kind of acoustic feature that depends on the spectral information of the speech.

In this specification, a segment is one phoneme, a chain of two or more phonemes, one divided phoneme, a chain of two or more divided phonemes, or a chain of one or more phonemes and one or more divided phonemes. A phoneme is the smallest unit of speech used in a language to distinguish the meanings of words. In standard Japanese, each vowel and each consonant is a phoneme. For example, the vowel /a/ of "あ" (a), the vowel /i/ of "い" (i), and the consonant /m/ of "ま" (ma) are phonemes. A divided phoneme is each of the two or more pieces obtained by dividing one phoneme phonetically. When one phoneme is divided phonetically into two pieces, each piece is called a half-phoneme.

The unit of a segment can be any unit, such as a phoneme, a divided phoneme, a syllable, a morpheme, an accent phrase, a breath phrase, or a sentence. An accent phrase is a language unit having one accent nucleus. A breath phrase is a language unit that a person can utter in one breath. A phrase can also be described as a collection of one or more accent phrases uttered in a single breath.

As shown in FIG. 1, the speech synthesizer 1 includes: an input unit 10 that receives text information indicating the content to be read aloud; a language analysis unit 11 that performs language analysis on the text information to generate intermediate language information including phonetic symbols; a segment candidate selection unit 14 that selects a plurality of speech segment candidates (hereinafter also simply "segment candidates") based on the intermediate language information; a segment selection unit 15 that selects, from the plurality of segment candidates, the segments to be used in speech synthesis; a speech waveform generation unit 16 that applies prosody processing and waveform concatenation to the acoustic feature information of the selected segments to generate speech data representing a synthesized speech waveform; an output unit 17 that outputs the speech data to the outside; and a data storage unit 20 in which dictionaries 21 to 23 are stored in advance.

The text information input to the input unit 10 may be any data representing a sequence of language elements selected from multiple kinds of readable language elements, such as kanji, alphabetic characters, numerals, and symbols. The input unit 10 may be any input device capable of entering codes such as characters or symbols; for example, a keyboard or a touch panel can be used. The input unit 10 may also be configured to selectively accept file data containing text data. Alternatively, the input unit 10 may be configured to receive stream data transferred from an external device, such as a communication device, via a communication medium such as a communication network or a cable.

Furthermore, the input unit 10 may include an acoustic conversion unit that directly receives sound waves uttered by a person and converts them into an electrical signal, and an information processing unit that converts the electrical signal into text information. In this case, the input unit 10 may be configured to generate the text information using a known speech recognition technique, such as DP (Dynamic Programming) matching, an HMM (Hidden Markov Model) method, or a neural network method. Speech recognition techniques of this kind are disclosed, for example, in Non-Patent Document 1 (Sadaoki Furui, "音声情報処理" (Speech Information Processing), Morikita Publishing Co., Ltd., September 10, 2009, pp. 91-106).

The language analysis unit 11 generates intermediate language information including phonetic symbols by applying language analysis, such as morphological analysis, syntactic analysis, and prosodic processing, to the input text information using the language dictionary 21. The language dictionary 21 stores the language information needed for language analysis. The language information includes the readings of words and may further include one or more features such as a word's accent boundary, accent position, accent strength, part of speech, and frequency of occurrence. The intermediate language is information containing phonetic symbols that express how the text is to be uttered, such as the reading, accent breaks, accent position (accent nucleus), accent strength, pause position, or pause length. For example, when the following Japanese text information is input, the language analysis unit 11 can generate an intermediate language in which the reading is written in katakana, as shown below.

Input text information: 「今日は寒いが、晴れていて、明日は暑いが、雨である。」 ("It is cold today but sunny, and tomorrow will be hot but rainy.")
Intermediate language: 「キョ’−ワ／サム’イガ，ハ’レテイテ，アス’ワ／アツ’イガ／ア’メデアル．」

Here, 「’」 is a phonetic symbol indicating an accent position, and 「／」 is a phonetic symbol indicating an accent break. The language analysis techniques disclosed in, for example, Japanese Unexamined Patent Application Publication No. 2002-333896 or No. 2003-44073 may be used.
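The structure of such an intermediate language string can be illustrated with a small parser. The following is only a sketch under stated assumptions: the ASCII characters `'`, `/`, `,` and `.` stand in for the accent-position symbol, the accent-break symbol, and the phrase-final punctuation, the reading is romanized, and the names `AccentPhrase` and `parse_intermediate` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List

ACCENT_MARK = "'"    # assumed ASCII stand-in for the accent-position symbol
ACCENT_BREAK = "/"   # assumed ASCII stand-in for the accent-break symbol
PHRASE_ENDS = ",."   # assumed punctuation that closes a phrase


@dataclass
class AccentPhrase:
    reading: str       # reading with the accent mark removed
    accent_index: int  # characters preceding the accent mark, -1 if unmarked


def _parse_phrase(phrase: str) -> List[AccentPhrase]:
    """Split one phrase at accent breaks and locate each accent mark."""
    result = []
    for chunk in phrase.split(ACCENT_BREAK):
        index = chunk.find(ACCENT_MARK)  # -1 when the chunk has no accent mark
        result.append(AccentPhrase(chunk.replace(ACCENT_MARK, ""), index))
    return result


def parse_intermediate(text: str) -> List[List[AccentPhrase]]:
    """Split an intermediate-language string into phrases of accent phrases."""
    phrases = []
    current = ""
    for ch in text:
        if ch in PHRASE_ENDS:
            if current:
                phrases.append(_parse_phrase(current))
            current = ""
        else:
            current += ch
    if current:
        phrases.append(_parse_phrase(current))
    return phrases


# Romanized stand-in for the example intermediate language above.
parsed = parse_intermediate("kyo'-wa/samu'iga,ha'reteite,asu'wa/atsu'iga/a'medearu.")
```

In this sketch each phrase becomes a list of accent phrases, which is one possible input representation for the segment candidate selection described next.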

The segment candidate selection unit 14 refers to the segment dictionary 22 and, for each language unit (for example, a phrase) included in the intermediate language information, selects a segment candidate group consisting of one or more segment candidates from among the many segment candidates registered in the segment dictionary 22. Segment information on a large number of segment candidates is stored in the segment dictionary 22 in advance. The segment information is data that defines the correspondence among at least the language information of each segment candidate (for example, phoneme information indicating its reading), the acoustic feature information of each segment candidate, and the category that classifies the acoustic features of each segment candidate. The acoustic feature information of a segment candidate may be either waveform data representing its speech waveform or a set of acoustic feature parameters characterizing that waveform. The segment information may contain the acoustic feature information itself, or may instead contain information designating the storage area in which the acoustic feature information is stored.
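One possible in-memory layout for the segment information can be sketched as follows. This is a minimal illustration, not the dictionary format of the embodiment: the class `SegmentCandidate`, the function `select_candidates`, the romanized readings, the category labels, and all numeric feature values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SegmentCandidate:
    reading: str                  # language information (reading of the segment)
    features: Tuple[float, ...]   # acoustic feature parameters (hypothetical values)
    categories: Tuple[str, ...]   # one or more categories assigned to the candidate


def select_candidates(dictionary: List[SegmentCandidate],
                      reading: str) -> List[SegmentCandidate]:
    """Return the candidate group whose reading matches one language unit."""
    return [c for c in dictionary if c.reading == reading]


# Hypothetical dictionary contents for a preceding and a following phrase.
SEGMENT_DICTIONARY = [
    SegmentCandidate("asuwaatsuiga", (0.8, -0.3, 40.0, 1.2), ("adversative",)),
    SegmentCandidate("asuwaatsuiga", (0.5, -0.1, 25.0, 1.0), ("limitation",)),
    SegmentCandidate("amedearu", (0.2, -0.6, 30.0, 0.9), ("adversative", "consequential")),
]

first_group = select_candidates(SEGMENT_DICTIONARY, "asuwaatsuiga")
second_group = select_candidates(SEGMENT_DICTIONARY, "amedearu")
```

Storing the categories as a tuple reflects the statement that a single segment candidate may be assigned more than one category.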

Categories can be set manually. Examples include categories that express the meaning of a sentence or its syntactic structure. Specific examples are "parallel", "emphasis", "address", "uncertainty", "selection", "analogy", "limitation", "exemplification", "degree", "range", "belittlement", "prohibition", "exclamation", "question", "doubt", "rhetorical question", "conviction", "assertion", "supposition", "confirmation", "consequential connection" (順接), "adversative connection" (逆接), "addition", "contrast", "topic change", "apposition", "supplement", "independent" (a category that is unaffected by the preceding or following phrase because there is no dependency relation), and "non-independent". Categories may also be defined based on speaking style. For example, paralinguistic information such as emotions (joy, anger, sorrow, pleasure), whispering, or worry may be set as categories.

In this specification, among the categories, the category of each of a pair of phrases that form a boundary of acoustic features is called a "phrase boundary pattern". For example, when the text information 「明日は暑いが、雨である。」 ("Tomorrow will be hot, but rainy.") is input, a difference in prosody exists between the preceding phrase 「明日は暑いが」 and the following phrase 「雨である」, so a boundary of acoustic features exists between them. A boundary of acoustic features may exist not only between phrases but also between language units other than phrases.

In the segment dictionary 22, one segment candidate may be associated with either a preceding phrase or a following phrase, or with a combination of a preceding phrase and a following phrase. When one segment candidate is judged to belong to multiple categories, multiple categories may be assigned to that segment candidate.

The phrase boundary patterns may also be created automatically by a known clustering method, such as k-means, using the acoustic feature parameters that represent the acoustic features of the segment candidates having those phrase boundary patterns. This eliminates the effort of assigning a category to each segment candidate by hand. Furthermore, the phrase boundary pattern categories may be subdivided using phoneme information or accent information.
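As an illustration of such automatic category creation, the following sketch runs a plain k-means over acoustic feature vectors. It is a toy implementation under assumptions: the initial centroids are simply the first k points (a real system would likely use a better initialization), and the function names and the two-dimensional feature values are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points: List[Sequence[float]], k: int,
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Cluster feature vectors; returns (cluster label per point, centroids)."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (slope, duration) feature vectors of four segment candidates.
features = [(1.0, 0.2), (5.0, 0.9), (1.2, 0.25), (4.8, 1.0)]
labels, centroids = kmeans(features, k=2)
```

Each resulting cluster label could then play the role of an automatically generated phrase boundary pattern for the candidates assigned to it.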

Next, the segment selection unit 15 uses the segment dictionary 22 and a phrase boundary pattern dictionary 23 (hereinafter also simply "pattern dictionary 23") to select, from the segment candidate groups selected by the segment candidate selection unit 14, the segments to be concatenated in the speech synthesis process, and supplies the acoustic feature information of the selected segments to the speech waveform generation unit 16. The functions of the segment candidate selection unit 14 and the segment selection unit 15 are described in detail later.

The pattern dictionary 23 stores dictionary data that defines multiple similarity measures (hereinafter also simply "measures"), each between two phrase boundary patterns (candidate categories). FIG. 2A shows an example of the structure of dictionary data DTa, which stores measures between two phrase boundary patterns (candidate categories) assigned to a segment candidate corresponding to a certain preceding phrase. FIG. 2B shows an example of the structure of dictionary data DTb, which stores measures between two phrase boundary patterns (candidate categories) assigned to a segment candidate corresponding to a certain following phrase. The dictionary data DTa and DTb shown in FIGS. 2A and 2B are stored in advance in the data storage unit 20 as part of the pattern dictionary 23. For example, for the text information 「晴れですが、見えないでしょう。」 ("It is sunny, but it will probably not be visible."), two phrase boundary patterns, "consequential connection" (順接) and "adversative connection" (逆接), can be assigned to the following phrase 「見えないでしょう」. In this case, the measure between "consequential connection" and "adversative connection" is stored in the pattern dictionary 23.
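A lookup table of this kind might be sketched as below. The layout is an assumption made for illustration and is not the structure shown in FIGS. 2A and 2B: the measure is treated as symmetric and as zero between identical categories, and the names `pair_key`, `lookup_measure`, the category labels, and the stored values are all hypothetical.

```python
from typing import Dict, FrozenSet


def pair_key(a: str, b: str) -> FrozenSet[str]:
    """Order-independent key for a pair of phrase boundary patterns."""
    return frozenset((a, b))


# Hypothetical measures between pairs of phrase boundary patterns.
PATTERN_DICTIONARY: Dict[FrozenSet[str], float] = {
    pair_key("consequential", "adversative"): 2.5,
    pair_key("adversative", "limitation"): 1.0,
}


def lookup_measure(dictionary: Dict[FrozenSet[str], float],
                   a: str, b: str) -> float:
    """Return the stored measure between two categories."""
    if a == b:
        return 0.0  # assumption: identical categories have zero distance
    return dictionary[pair_key(a, b)]


measure = lookup_measure(PATTERN_DICTIONARY, "adversative", "consequential")
```

Keying on an unordered pair avoids storing each measure twice, which is reasonable only if the measure is symmetric, as a distance is.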

Next, the measures stored in the pattern dictionary 23 are described. A measure is an index value representing the degree of similarity between two categories of a segment candidate and is, for example, a value calculated using the acoustic feature quantities of the segment candidate. Suppose that the acoustic feature quantity F_a of a segment candidate belonging to one category consists of N acoustic feature parameters a_1, a_2, ..., a_N (N is a positive integer), and that the acoustic feature quantity F_b of the same segment candidate belonging to another category consists of N acoustic feature parameters b_1, b_2, ..., b_N. The measure Dt is then given by the following equation (1).

$$ D_t = \sqrt{\sum_{k=1}^{N} w_k \,(a_k - b_k)^2} \tag{1} $$

Here, w_1, ..., w_N are weighting coefficients. The measure Dt is calculated as a weighted Euclidean distance in N-dimensional Euclidean space, in which each difference between acoustic feature parameters (a_k − b_k) is weighted by w_k^(1/2).
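The weighted Euclidean distance described here can be sketched in a few lines. Weighting each difference by w_k^(1/2) before squaring is the same as multiplying each squared difference by w_k, which is the form used below. The function name `weighted_distance` and the sample vectors are hypothetical.

```python
import math
from typing import Sequence


def weighted_distance(a: Sequence[float], b: Sequence[float],
                      w: Sequence[float]) -> float:
    """Weighted Euclidean distance between two acoustic feature vectors."""
    if not (len(a) == len(b) == len(w)):
        raise ValueError("a, b and w must all have the same length N")
    # Each squared difference (a_k - b_k)^2 is scaled by its weight w_k.
    return math.sqrt(sum(wk * (ak - bk) ** 2 for ak, bk, wk in zip(a, b, w)))


# With unit weights the measure reduces to the ordinary Euclidean distance.
d_plain = weighted_distance([0.0, 0.0], [3.0, 4.0], [1.0, 1.0])
# A zero weight removes a parameter from the measure entirely.
d_weighted = weighted_distance([0.0, 0.0], [3.0, 4.0], [4.0, 0.0])
```

The weights make it possible to emphasize parameters that matter more perceptually, for example a pitch slope over a duration, although how the weights are chosen is not specified in this passage.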

FIG. 3A is a graph schematically showing an example of the acoustic feature quantity F_a when a certain preceding phrase belongs to the category "adversative connection" (逆接), and FIG. 3B is a graph schematically showing an example of the acoustic feature quantity F_b when the same preceding phrase belongs to the category "limitation" (限定). In the graphs of FIGS. 3A and 3B, the horizontal axis represents time and the vertical axis represents pitch (fundamental frequency). In FIGS. 3A and 3B, the acoustic feature parameters a_1 and b_1 indicate the first slope of the waveform, and a_2 and b_2 indicate the second slope. The parameters a_3 and b_3 indicate the range of the waveform, and a_4 and b_4 indicate its duration. Another similarity measure (for example, a cost) may be adopted instead of the weighted Euclidean distance.

The acoustic feature parameters a_1 to a_N and b_1 to b_N used to calculate the measure are not limited to the examples shown in FIGS. 3A and 3B; it suffices to include at least one acoustic feature parameter representing an acoustic feature. For pitch, for example, statistics within a waveform section, such as the mean, quartiles, median, range, or variance, as well as the slope connecting extrema, may be used as acoustic feature parameters. For speech rate, for example, the change in duration within a waveform section, the change in average duration per mora, or the change in average per-mora duration across multiple sections within a phrase may be used. For power, for example, the amplitude or energy of the waveform, or energy restricted to a frequency band such as the human audible range, can be used. Statistics of power within the section, such as the mean, quartiles, median, range, or variance, or its change across multiple sections within a phrase, may also be used. For pauses, the pause duration may be used as an acoustic feature parameter.
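The pitch statistics listed above could be gathered as in the following sketch, which uses only the Python standard library. The function name `pitch_parameters`, the dictionary keys, and the sample contour are hypothetical, and the use of the population variance is an assumption, since the text does not say which variance is meant.

```python
import statistics
from typing import Dict, Sequence


def pitch_parameters(f0: Sequence[float]) -> Dict[str, float]:
    """Summarize a pitch (fundamental frequency) contour of one section in Hz."""
    q1, _, q3 = statistics.quantiles(f0, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(f0),
        "median": statistics.median(f0),
        "q1": q1,
        "q3": q3,
        "range": max(f0) - min(f0),
        "variance": statistics.pvariance(f0),  # assumption: population variance
    }


# Hypothetical pitch contour sampled at five points within a section.
params = pitch_parameters([100.0, 120.0, 140.0, 160.0, 180.0])
```

Any subset of these values could serve as the parameters a_1 to a_N of one category, provided the same subset and ordering are used for b_1 to b_N.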

As the unit over which the measure is calculated, one or more sections that characterize the phrase boundary pattern may be used. Examples are the section spanning the entire phrase, a section of fixed duration at the end of a sentence where acoustic features tend to appear, a section of fixed duration at the beginning of a sentence where acoustic features tend to appear, a section corresponding to a morpheme such as a particle or auxiliary verb, and a section connecting extrema of an acoustic feature parameter.

The measure may be calculated using the centroid (representative pattern) of the segment candidates classified into each category. Here, one or more representative segment candidates close to the centroid may be selected to calculate the measure between categories, or one or more randomly selected segment candidates may be used instead. Alternatively, the acoustic feature parameters of a certain category may be set in advance, and the measures of the other categories relative to that category may be calculated. A single measure may also be calculated from multiple segment candidates. In that case, a statistic such as the sum or mean of the multiple measures calculated from those segment candidates may be used as the single measure.
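The centroid-based variant can be sketched as follows: the feature vectors of the candidates in each category are averaged, and the weighted Euclidean distance between the two centroids is taken as the measure between the categories. The function names and the sample vectors are hypothetical, and this shows only one of the alternatives mentioned above.

```python
import math
from typing import List, Sequence


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the feature vectors of one category."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def category_measure(members_a: List[Sequence[float]],
                     members_b: List[Sequence[float]],
                     weights: Sequence[float]) -> float:
    """Weighted Euclidean distance between the centroids of two categories."""
    ca, cb = centroid(members_a), centroid(members_b)
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(ca, cb, weights)))


# Hypothetical feature vectors of candidates in two categories.
adversative = [[0.0, 0.0], [2.0, 0.0]]   # centroid is [1.0, 0.0]
limitation = [[4.0, 4.0], [4.0, 4.0]]    # centroid is [4.0, 4.0]
measure_ab = category_measure(adversative, limitation, [1.0, 1.0])
```

Values computed this way offline could be stored in the pattern dictionary 23 so that no distance calculation is needed at synthesis time, although the text does not state when the measures are computed.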

次に、図１に示される音声波形生成部１６は、素片選択部１５で選択された素片に韻律加工及び波形接続処理を施して合成音声波形を示す音声データを生成する。ここで、音声データは、所定フォーマットの音声ファイルまたはストリームデータとして生成されればよい。また、音声データは、符号化されたデータであってもよい。出力部１７は、その音声データを外部に出力する。ここで、出力部１７は、アナログ信号に変換された音声データを出力してもよい。 Next, the speech waveform generator 16 shown in FIG. 1 performs prosodic processing and waveform connection processing on the segments selected by the segment selector 15 to generate speech data indicating a synthesized speech waveform. Here, the audio data may be generated as an audio file or stream data in a predetermined format. The audio data may be encoded data. The output unit 17 outputs the audio data to the outside. Here, the output unit 17 may output audio data converted into an analog signal.

次に、図４及び図５を参照しつつ、上記音声合成装置１の動作について説明する。図４は、実施の形態１に係る音声合成処理の手順の一例を概略的に示すフローチャートであり、図５は、図４に示される素片選択処理（ステップＳＴ１２）の手順の一例を概略的に示すフローチャートである。図４及び図５は、入力された中間言語情報が、音響特徴の境界を形成する一対の先行フレーズ及び後続フレーズを含む場合の処理手順を示している。 Next, the operation of the speech synthesizer 1 will be described with reference to FIGS. 4 and 5. FIG. 4 is a flowchart schematically showing an example of the procedure of the speech synthesis process according to Embodiment 1, and FIG. 5 is a flowchart schematically showing an example of the procedure of the segment selection process (step ST12) shown in FIG. 4. FIGS. 4 and 5 show the processing procedure for the case where the input intermediate language information includes a pair of a preceding phrase and a subsequent phrase that form a boundary of acoustic features.

図４に示されるように、音声合成装置１は、外部からテキスト情報が入力されるまで待機している（ステップＳＴ１０のＮＯ）。外部からテキスト情報が入力されると（ステップＳＴ１０のＹＥＳ）、言語解析部１１は、入力部１０を介して入力されたテキスト情報に対し、言語辞書２１を用いた言語解析を実行して中間言語情報を生成する（ステップＳＴ１１）。その後、素片選択処理（ステップＳＴ１２）が実行される。 As shown in FIG. 4, the speech synthesizer 1 stands by until text information is input from the outside (NO in step ST10). When text information is input from the outside (YES in step ST10), the language analysis unit 11 performs language analysis, using the language dictionary 21, on the text information input via the input unit 10 and generates intermediate language information (step ST11). Thereafter, the segment selection process (step ST12) is executed.

なお、音声合成装置１の入力部１０は、外部から中間処理情報が直接入力される構成を有していてもよい。この場合には、音声合成装置１は、言語解析部１１を有する必要はない。また、図４のフローチャートにおいては、ステップＳＴ１１は不要である。 Note that the input unit 10 of the speech synthesizer 1 may have a configuration in which intermediate processing information is directly input from the outside. In this case, the speech synthesizer 1 does not need to have the language analysis unit 11. Further, step ST11 is unnecessary in the flowchart of FIG.

次に、図５を参照すると、素片候補選択部１４は、素片辞書２２を参照し、入力された中間言語情報に基づいて素片候補群を選択する（ステップＳＴ２１）。ここで、入力された中間言語情報は、音響特徴の境界を形成する一対のフレーズを含むので、素片候補選択部１４は、素片辞書２２を用いて、当該一対のフレーズのうちの先行フレーズに対応する素片候補α_１，α_２，…，α_Ｍ（以下、先行素片候補α_１〜α_Ｍともいう。）からなる素片候補群｛α_ｍ｝を選択するとともに、当該一対のフレーズのうちの後続フレーズに対応する素片候補β_１，β_２，…，β_Ｋ（以下、後続素片候補β_１〜β_Ｋともいう。）からなる素片候補群｛β_ｋ｝を選択する。ここで、Ｍ，Ｋは、正整数である。素片辞書２２では、素片候補α_１〜α_Ｍの各々に少なくとも１つのカテゴリが割り当てられており、素片候補β_１〜β_Ｋの各々に少なくとも１つのカテゴリが割り当てられている。 Next, referring to FIG. 5, the segment candidate selection unit 14 refers to the segment dictionary 22 and selects segment candidate groups based on the input intermediate language information (step ST21). Since the input intermediate language information includes a pair of phrases that form a boundary of acoustic features, the segment candidate selection unit 14 uses the segment dictionary 22 to select a segment candidate group {α_m} consisting of segment candidates α_1, α_2, ..., α_M corresponding to the preceding phrase of the pair (hereinafter also referred to as preceding segment candidates α_1 to α_M), and a segment candidate group {β_k} consisting of segment candidates β_1, β_2, ..., β_K corresponding to the subsequent phrase of the pair (hereinafter also referred to as subsequent segment candidates β_1 to β_K). Here, M and K are positive integers. In the segment dictionary 22, at least one category is assigned to each of the segment candidates α_1 to α_M, and at least one category is assigned to each of the segment candidates β_1 to β_K.

次に、素片選択部１５は、先行素片候補α_１〜α_Ｍの中から一の先行素片候補α_ｉを選択する（ステップＳＴ２２）。ｉは、先行素片候補α_ｉの識別番号であり、今の場合、たとえばｉ＝１である。続けて、素片選択部１５は、当該先行素片候補α_ｉのカテゴリを目標カテゴリに設定する（ステップＳＴ２３）。これにより、後続素片候補β_１〜β_Ｋのそれぞれのカテゴリが候補カテゴリに設定される。 Next, the segment selection unit 15 selects one preceding segment candidate α_i from among the preceding segment candidates α_1 to α_M (step ST22). Here, i is the identification number of the preceding segment candidate α_i; in this case, for example, i = 1. The segment selection unit 15 then sets the category of the preceding segment candidate α_i as the target category (step ST23). As a result, the category of each of the subsequent segment candidates β_1 to β_K is set as a candidate category.

次に、素片選択部１５は、パターン辞書２３を参照して、後続素片候補β_１〜β_Ｋの候補カテゴリ同士の尺度に基づき、先行素片候補α_ｉについて、後続素片候補β_１〜β_Ｋの中から少なくとも１個の後続素片候補β_ｊを抽出する（ステップＳＴ２４）。ここでは、素片選択部１５は、図２Ａに示した辞書データＤＴａにおける候補カテゴリ同士の尺度を参照して、少なくとも１個の後続素片候補β_ｊを抽出すればよい。たとえば、１０個の後続素片候補β_１〜β_１０が存在する場合、少なくとも、後続素片候補の組み合わせの数（＝_１０Ｃ_２＝４５個）の辞書データＤＴａが用意されている。素片選択部１５は、後続素片候補β_１〜β_Ｋの中から、最も高い尺度を有する候補カテゴリを有する後続素片候補を抽出してもよいし、あるいは、後続素片候補β_１〜β_Ｋの中から、尺度の大きい方から順に所定個数の後続素片候補を選択してもよい。これにより、先行素片候補α_ｉと後続素片候補β_ｊとの組み合わせ候補（α_ｉ，β_ｊ）が少なくとも１組生成される。当該組み合わせ候補（α_ｉ，β_ｊ）と当該組み合わせ候補（α_ｉ，β_ｊ）の音響特徴情報とは、音声波形生成部１６に供給される。 Next, the segment selection unit 15 refers to the pattern dictionary 23 and, based on the scales between the candidate categories of the subsequent segment candidates β_1 to β_K, extracts at least one subsequent segment candidate β_j from among the subsequent segment candidates β_1 to β_K for the preceding segment candidate α_i (step ST24). Here, the segment selection unit 15 may extract the at least one subsequent segment candidate β_j by referring to the scales between candidate categories in the dictionary data DTa shown in FIG. 2A. For example, when there are ten subsequent segment candidates β_1 to β_10, dictionary data DTa is prepared for at least the number of combinations of subsequent segment candidates (C(10, 2) = 45). The segment selection unit 15 may extract, from among the subsequent segment candidates β_1 to β_K, the subsequent segment candidate whose candidate category has the highest scale, or may select a predetermined number of subsequent segment candidates in descending order of scale. As a result, at least one combination candidate (α_i, β_j) of the preceding segment candidate α_i and a subsequent segment candidate β_j is generated. The combination candidate (α_i, β_j) and its acoustic feature information are supplied to the speech waveform generation unit 16.

その後、未選択の先行素片候補が有ると判定した場合には（ステップＳＴ２５のＹＥＳ）、素片選択部１５は、識別番号ｉを変更して（たとえば、ｉを１から２に変更して）新たな先行素片候補α_ｉを選択し（ステップＳＴ２６）、この先行素片候補α_ｉについて上記ステップＳＴ２３〜ＳＴ２４を実行する。この結果、先行素片候補α_ｉと後続素片候補β_ｊとの新たな組み合わせの候補（α_ｉ，β_ｊ）が少なくとも１組生成される。当該新たな組み合わせ候補（α_ｉ，β_ｊ）と当該新たな組み合わせ候補（α_ｉ，β_ｊ）の音響特徴情報とは、音声波形生成部１６に供給される。 Thereafter, when it is determined that an unselected preceding segment candidate remains (YES in step ST25), the segment selection unit 15 changes the identification number i (for example, from 1 to 2), selects a new preceding segment candidate α_i (step ST26), and executes steps ST23 to ST24 for this preceding segment candidate α_i. As a result, at least one new combination candidate (α_i, β_j) of the preceding segment candidate α_i and a subsequent segment candidate β_j is generated. The new combination candidate (α_i, β_j) and its acoustic feature information are supplied to the speech waveform generation unit 16.
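The loop over preceding segment candidates (steps ST22 to ST26) can be sketched as follows. This is a simplified illustration under stated assumptions: the scale table is modelled as a plain dictionary keyed by (target category, candidate category), which may not match the actual layout of the dictionary data DTa in FIG. 2A, and the identifiers, category labels, and scale values are invented for the example.

```python
def extract_combinations(preceding, following, scale_table, top_n=1):
    """Return combination candidates (preceding_id, following_id).

    preceding / following: lists of (segment_id, category).
    scale_table: ASSUMED lookup {(target_category, candidate_category): scale}.
    """
    combinations = []
    for seg_a, target_category in preceding:  # ST22 / ST26, target set in ST23
        # ST24: rank the following candidates by scale, largest first.
        ranked = sorted(
            following,
            key=lambda cand: scale_table[(target_category, cand[1])],
            reverse=True,
        )
        for seg_b, _category in ranked[:top_n]:
            combinations.append((seg_a, seg_b))
    return combinations


preceding = [("a1", "adversative"), ("a2", "resultative")]
following = [("b1", "resultative"), ("b2", "adversative"), ("b3", "emphasis")]
scale_table = {
    ("adversative", "resultative"): 0.2,
    ("adversative", "adversative"): 0.9,
    ("adversative", "emphasis"): 0.5,
    ("resultative", "resultative"): 0.8,
    ("resultative", "adversative"): 0.1,
    ("resultative", "emphasis"): 0.6,
}
best_only = extract_combinations(preceding, following, scale_table)
top_two = extract_combinations(preceding, following, scale_table, top_n=2)
```

With `top_n=1` the loop yields exactly one combination per preceding candidate, that is M combinations in total; a larger `top_n` corresponds to selecting a predetermined number of following candidates in descending order of scale.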

最終的に、未選択の先行素片候補が無いと判定した場合には（ステップＳＴ２５のＮＯ）、少なくともＭ個の組み合わせ候補が音声波形生成部１６に供給されることとなる。音声波形生成部１６は、所定の波形接続アルゴリズムに従い、基本周波数、パワー、継続長及び短時間振幅スペクトルなどの音響特徴パラメータを用いて、少なくともＭ個の組み合わせ候補（先行素片候補と後続素片候補との組の候補）の中から、音声波形の接続性を良好にする最適な組み合わせを選択する（ステップＳＴ２７）。波形接続アルゴリズムは、特に制限されるものではないが、たとえば、特開２０１３−１５６４７２号公報に開示されている公知の波形接続アルゴリズムが使用されればよい。 Finally, when it is determined that there is no unselected preceding segment candidate (NO in step ST25), at least M combination candidates are supplied to the speech waveform generation unit 16. The speech waveform generation unit 16 uses at least M combination candidates (preceding segment candidate and subsequent segment) using acoustic feature parameters such as fundamental frequency, power, duration, and short-time amplitude spectrum according to a predetermined waveform connection algorithm. The optimum combination that improves the connectivity of the speech waveform is selected from the candidates of the pair with the candidate) (step ST27). The waveform connection algorithm is not particularly limited. For example, a known waveform connection algorithm disclosed in Japanese Patent Application Laid-Open No. 2013-156472 may be used.
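The final choice among the combination candidates (step ST27) might be sketched as below. The join cost used here, a weighted sum of the fundamental-frequency and power mismatches at the boundary, is an assumption chosen for illustration; it is not the waveform connection algorithm of JP 2013-156472, and the feature layout and all values are hypothetical.

```python
def join_cost(prev_end, next_start, w_f0=1.0, w_power=1.0):
    """ASSUMED cost: mismatch of F0 (Hz) and power (dB) across the boundary."""
    return (w_f0 * abs(prev_end["f0"] - next_start["f0"])
            + w_power * abs(prev_end["power"] - next_start["power"]))


def select_best(combinations, features):
    """Pick the combination whose boundary mismatch is smallest."""
    return min(
        combinations,
        key=lambda c: join_cost(features[c[0]]["end"], features[c[1]]["start"]),
    )


# Boundary features of each segment (illustrative values only).
features = {
    "a1": {"end": {"f0": 120.0, "power": 60.0}},
    "a2": {"end": {"f0": 100.0, "power": 55.0}},
    "b1": {"start": {"f0": 118.0, "power": 61.0}},
    "b2": {"start": {"f0": 140.0, "power": 50.0}},
}
combinations = [("a1", "b2"), ("a2", "b1"), ("a1", "b1")]
best = select_best(combinations, features)
```

A practical cost would presumably also account for duration and the short-time amplitude spectrum mentioned in the text; they are omitted here to keep the sketch short.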

その後、図４を参照すると、音声波形生成部１６は、素片選択部１５で選択された組み合わせの素片の音響特徴情報に韻律加工を実行し（ステップＳＴ１３）、更に、その韻律加工の結果得られた、先行素片及び後続素片の音響特徴情報に基づいて合成音声波形を示す音声データを生成する（ステップＳＴ１４）。その後、出力部１７は、その音声データを外部に出力する（ステップＳＴ１５）。 Thereafter, referring to FIG. 4, the speech waveform generation unit 16 performs prosodic processing on the acoustic feature information of the segments of the combination selected by the segment selection unit 15 (step ST13), and then generates speech data representing a synthesized speech waveform based on the acoustic feature information of the preceding segment and the subsequent segment obtained as a result of the prosodic processing (step ST14). The output unit 17 then outputs the speech data to the outside (step ST15).

ステップＳＴ１３の韻律加工は、たとえば、公知のＰＳＯＬＡ（Ｐｉｔｃｈ−ＳｙｎｃｈｒｏｎｏｕｓＯｖｅｒｌａｐａｎｄＡｄｄ）法を用いて素片の基本周波数及び継続時間長を変形させることで実行されればよい。ＰＳＯＬＡ法は、たとえば、非特許文献２（F.J.Charpentier and M.G.Stella, ICASSP86, pp.2015-2018, Tokyo, 1986）に開示されている。 The prosody processing in step ST13 may be executed by changing the fundamental frequency and duration length of the segment using, for example, a known PSOLA (Pitch-Synchronous Overlap and Add) method. The PSOLA method is disclosed, for example, in Non-Patent Document 2 (F.J.Charpentier and M.G.Stella, ICASSP86, pp.2015-2018, Tokyo, 1986).

また、合成音声波形の生成法については、音声波形生成部１６は、先ず、先行素片の音声波形の端の形状と後続素片の音声波形の端の形状とを考慮して、時間領域におけるこれらの端の配置位置（たとえば、ピッチ単位の相関値が高くなる位置）を決定する。次に、音声波形生成部１６は、先行素片及び後続素片の音声波形を重ね合わせして合成音声波形を生成する。このとき、それら音声波形の端同士は加算されて平均化される。 As for the method of generating a synthesized speech waveform, the speech waveform generation unit 16 first considers the shape of the end of the speech waveform of the preceding unit and the shape of the end of the speech waveform of the subsequent unit in the time domain. The arrangement position of these ends (for example, the position where the correlation value in pitch units becomes high) is determined. Next, the speech waveform generation unit 16 generates a synthesized speech waveform by superimposing the speech waveforms of the preceding unit and the subsequent unit. At this time, the ends of the speech waveforms are added and averaged.
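The two steps described above, positioning the ends where they correlate well and then averaging the overlapping samples, can be sketched in plain Python. The function names, the search over overlap lengths, and the normalisation of the correlation by the overlap length are assumptions for this example; the text only states that a position with a high pitch-unit correlation is chosen and that the overlapping ends are added and averaged.

```python
def correlation(x, y):
    """Unnormalised correlation of two equally long sample sequences."""
    return sum(a * b for a, b in zip(x, y))


def find_overlap(prev, nxt, min_overlap, max_overlap):
    """Overlap length (>= 1 sample) whose overlapping ends correlate best."""
    return max(
        range(min_overlap, max_overlap + 1),
        key=lambda n: correlation(prev[-n:], nxt[:n]) / n,
    )


def overlap_add(prev, nxt, n):
    """Join two waveforms, averaging the n overlapping samples (n >= 1)."""
    mixed = [(a + b) / 2.0 for a, b in zip(prev[-n:], nxt[:n])]
    return prev[:-n] + mixed + nxt[n:]


# Toy periodic waveforms with a period of four samples.
prev_wave = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
next_wave = [0.0, 1.0, 0.0, -1.0]
n = find_overlap(prev_wave, next_wave, min_overlap=1, max_overlap=4)
joined = overlap_add(prev_wave, next_wave, n)
```

In this toy case the search settles on a two-sample overlap, at which the two ends are in phase, so the joined waveform continues the same period without a discontinuity.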

以上に説明した図５の素片選択処理では、素片選択部１５は、先行素片候補α_１〜α_Ｍと後続素片候補β_１〜β_Ｋとの組み合わせの中から、カテゴリ間の尺度に基づいて１組以上の組み合わせ候補を抽出する（ステップＳＴ２１〜ＳＴ２６）。よって、音声波形生成部１６は、多種多様な組み合わせ候補の中から接続性の良好な１組を選択することができる（ステップＳＴ２７）。 In the segment selection process of FIG. 5 described above, the segment selection unit 15 extracts one or more combination candidates, based on the scales between categories, from among the combinations of the preceding segment candidates α_1 to α_M and the subsequent segment candidates β_1 to β_K (steps ST21 to ST26). The speech waveform generation unit 16 can therefore select one set with good connectivity from among a wide variety of combination candidates (step ST27).

図６Ａ及び図６Ｂは、「晴れですが、見えないでしょう。」とのテキスト情報及びピッチパターンの例を示す図である。このテキスト情報では、「晴れですが」との先行フレーズに対応する素片候補のカテゴリが「逆接」に分類される一方で、「見えないでしょう。」との後続フレーズに対応する素片候補のフレーズ境界パターンとしては、図６Ａに示す「順接」と図６Ｂに示す「逆接」という２種類のカテゴリが存在し得る。また、図７Ａ及び図７Ｂは、「晴れでしか、見えないでしょう。」とのテキスト情報及びピッチパターンの例を示す図である。このテキスト情報では、「晴れでしか」との先行フレーズに対応する素片候補のカテゴリが「限定」に分類される一方で、「見えないでしょう。」との後続フレーズに対応する素片候補のフレーズ境界パターンとしては、図７Ａに示す「順接」と図７Ｂに示す「限定」という２種類のカテゴリが存在し得る。更に、図８Ａ及び図８Ｂは、「晴れなので、見えないでしょう。」とのテキスト情報及びピッチパターンの例を示す図である。このテキスト情報では、「晴れなので」との先行フレーズに対応する素片候補のカテゴリが「順接」に分類される一方で、「見えないでしょう。」との後続フレーズに対応する素片候補のフレーズ境界パターンとしては、図８Ａに示す「順接」と図８Ｂに示す「強調」という２種類のカテゴリが存在し得る。 FIGS. 6A and 6B are diagrams showing examples of the text information “晴れですが、見えないでしょう。” (roughly, “It is sunny, but it probably will not be visible.”) and its pitch patterns. In this text information, the category of the segment candidate corresponding to the preceding phrase “晴れですが” (“It is sunny, but”) is classified as “adversative conjunction” (逆接), whereas two categories can exist as the phrase boundary pattern of the segment candidate corresponding to the subsequent phrase “見えないでしょう。”: “resultative conjunction” (順接), shown in FIG. 6A, and “adversative conjunction”, shown in FIG. 6B. FIGS. 7A and 7B are diagrams showing examples of the text information “晴れでしか、見えないでしょう。” (roughly, “It will probably be visible only when it is sunny.”) and its pitch patterns. In this text information, the category of the segment candidate corresponding to the preceding phrase “晴れでしか” (“only when it is sunny”) is classified as “limitation” (限定), whereas two categories can exist as the phrase boundary pattern of the segment candidate corresponding to the subsequent phrase “見えないでしょう。”: “resultative conjunction”, shown in FIG. 7A, and “limitation”, shown in FIG. 7B. Furthermore, FIGS. 8A and 8B are diagrams showing examples of the text information “晴れなので、見えないでしょう。” (roughly, “Because it is sunny, it probably will not be visible.”) and its pitch patterns. In this text information, the category of the segment candidate corresponding to the preceding phrase “晴れなので” (“because it is sunny”) is classified as “resultative conjunction”, whereas two categories can exist as the phrase boundary pattern of the segment candidate corresponding to the subsequent phrase “見えないでしょう。”: “resultative conjunction”, shown in FIG. 8A, and “emphasis” (強調), shown in FIG. 8B.

このように先行フレーズと後続フレーズとの間の音響特徴の境界付近では、カテゴリの組み合わせ候補が複数存在する場合がある。図９は、ピッチ平均値及びポーズ時間長という２つの音響特徴パラメータで形成される空間におけるカテゴリＣ１，Ｃ２，Ｄ１〜Ｄ４の配置を概念的に示す図である。図９の例では、「並列」のカテゴリＣ１を持つ先行素片は、「強調」，「限定」及び「逆接」のカテゴリＤ１〜Ｄ３を持つ後続素片とともに組み合わせ候補を構成している。また、「順接」のカテゴリＣ２を持つ先行素片は、「単独」のカテゴリＤ４を持つ後続素片とともに組み合わせ候補を構成している。本実施の形態の音声合成装置１は、カテゴリ間の尺度に基づいて、そのような組み合わせ候補を抽出することができる。したがって、それら組み合わせ候補の中から接続性の良好な最適な１組を選択することが可能である。 As described above, a plurality of category combination candidates may exist near the boundary of acoustic features between the preceding phrase and the subsequent phrase. FIG. 9 is a diagram conceptually showing the arrangement of the categories C1, C2, and D1 to D4 in a space formed by two acoustic feature parameters, namely the average pitch and the pause duration. In the example of FIG. 9, a preceding segment having the “parallel” (並列) category C1 forms combination candidates together with subsequent segments having the categories D1 to D3 of “emphasis” (強調), “limitation” (限定), and “adversative conjunction” (逆接). A preceding segment having the “resultative conjunction” (順接) category C2 forms a combination candidate together with a subsequent segment having the “standalone” (単独) category D4. The speech synthesizer 1 of this embodiment can extract such combination candidates based on the scales between categories. It is therefore possible to select, from among these combination candidates, the optimal set with good connectivity.

上記音声合成装置１のハードウェア構成は、たとえば、パーソナルコンピュータ、ワークステーションまたはメインフレームなどのＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）内蔵のコンピュータで実現可能である。あるいは、上記音声合成装置１のハードウェア構成は、ＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）、ＡＳＩＣ（ＡｐｐｌｉｃａｔｉｏｎＳｐｅｃｉｆｉｃＩｎｔｅｇｒａｔｅｄＣｉｒｃｕｉｔ）またはＦＰＧＡ（Ｆｉｅｌｄ−ＰｒｏｇｒａｍｍａｂｌｅＧａｔｅＡｒｒａｙ）などのＬＳＩ（ＬａｒｇｅＳｃａｌｅＩｎｔｅｇｒａｔｅｄｃｉｒｃｕｉｔ）により実現されてもよい。 The hardware configuration of the speech synthesizer 1 can be realized by, for example, a computer with a built-in CPU (Central Processing Unit), such as a personal computer, a workstation, or a mainframe. Alternatively, the hardware configuration of the speech synthesizer 1 may be realized by an LSI (Large Scale Integrated circuit) such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).

図１０は、音声合成装置１のハードウェア構成例である情報処理装置３の構成を概略的に示すブロック図である。図１０の例では、情報処理装置３は、ＣＰＵまたはＤＳＰなどのＬＳＩを含むプロセッサ３０、メモリ３１、センサインタフェース部３２、テキスト入力インタフェース部３３、表示インタフェース部３４及び音声インタフェース部３５を備えている。プロセッサ３０、センサインタフェース部３２、テキスト入力インタフェース部３３、表示インタフェース部３４及び音声インタフェース部３５は、信号路３６を介して相互に接続されている。 FIG. 10 is a block diagram schematically showing the configuration of an information processing apparatus 3, which is an example of the hardware configuration of the speech synthesizer 1. In the example of FIG. 10, the information processing apparatus 3 includes a processor 30 including an LSI such as a CPU or a DSP, a memory 31, a sensor interface unit 32, a text input interface unit 33, a display interface unit 34, and an audio interface unit 35. The processor 30, the sensor interface unit 32, the text input interface unit 33, the display interface unit 34, and the audio interface unit 35 are connected to one another via a signal path 36.

プロセッサ３０は、上記した言語解析部１１、素片候補選択部１４、素片選択部１５及び音声波形生成部１６の機能を実現するハードウェアである。また、センサインタフェース部３２及びテキスト入力インタフェース部３３は、入力部１０に相当し、マウス、キーボード、タッチパネル、加速度センサ及び撮像カメラなどの外部デバイスに対するインタフェースを構成することができる。表示インタフェース部３４は、入力部１０に入力されたテキスト情報及び出力部１７の出力内容をディスプレイ装置に表示させるために使用することができる。また、音声インタフェース部３５は、入力部１０が音声入力機能を有する場合にマイクに対するインタフェースとして機能することができ、また、合成音声波形を示す音声データをスピーカから出力する場合の当該スピーカに対するインタフェースとして機能することができる。 The processor 30 is hardware that realizes the functions of the language analysis unit 11, the segment candidate selection unit 14, the segment selection unit 15, and the speech waveform generation unit 16 described above. The sensor interface unit 32 and the text input interface unit 33 correspond to the input unit 10 and can serve as interfaces to external devices such as a mouse, a keyboard, a touch panel, an acceleration sensor, and an imaging camera. The display interface unit 34 can be used to display, on a display device, the text information input to the input unit 10 and the output of the output unit 17. The audio interface unit 35 can function as an interface to a microphone when the input unit 10 has a voice input function, and as an interface to a speaker when speech data representing the synthesized speech waveform is output from that speaker.

メモリ３１は、データ記憶部２０として使用可能である。メモリ３１としては、たとえば、ＳＤＲＡＭ（ＳｙｎｃｈｒｏｎｏｕｓＤＲＡＭ）などの揮発性メモリ、及び、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＨＤＤ（ハードディスクドライブ）もしくはＳＳＤ（ソリッドステートドライブ）などの不揮発性メモリを含む記録媒体を使用すればよい。プロセッサ３０は、メモリ３１に記録されているコンピュータ・プログラムを読み出し、このコンピュータ・プログラムに従って動作することにより、言語解析部１１、素片候補選択部１４、素片選択部１５及び音声波形生成部１６の機能を実現することができる。 The memory 31 can be used as the data storage unit 20. As the memory 31, a recording medium may be used that includes, for example, a volatile memory such as an SDRAM (Synchronous DRAM) and a non-volatile memory such as a ROM (Read Only Memory), an HDD (hard disk drive), or an SSD (solid state drive). The processor 30 reads out a computer program recorded in the memory 31 and operates according to that program, thereby realizing the functions of the language analysis unit 11, the segment candidate selection unit 14, the segment selection unit 15, and the speech waveform generation unit 16.

また、入力部１０における入力方法が、音声認識、ジェスチャ認識または視線認識などのセンサ認識を使用する場合には、メモリ３１は、そのセンサ認識用の処理プログラムを記憶することが可能である。出力部１７における出力方法が、スピーカーによる音声再生またはストリームデータによる音声データ配信を行う場合には、メモリ３１は、そのための処理プログラムを記憶することが可能である。 When the input method in the input unit 10 uses sensor recognition such as voice recognition, gesture recognition, or line-of-sight recognition, the memory 31 can store a processing program for the sensor recognition. When the output method in the output unit 17 performs audio reproduction by a speaker or audio data distribution by stream data, the memory 31 can store a processing program therefor.

なお、図１０において、メモリ３１は、情報処理装置３の内部に配置されている。この代わりに、情報処理装置３は、ＵＳＢ（ユニバーサルシリアルバス）メモリなどの可搬性メモリを着脱自在に接続可能なメモリ・インタフェースを有していてもよい。これにより、言語辞書２１、素片辞書２２及びパターン辞書２３が記憶された可搬性メモリを情報処理装置３に接続することができる。あるいは、情報処理装置３は、メモリ３１に加えて、そのメモリ・インタフェースを有していてもよい。 In FIG. 10, the memory 31 is disposed inside the information processing apparatus 3. Instead of this, the information processing apparatus 3 may have a memory interface to which a portable memory such as a USB (Universal Serial Bus) memory can be detachably connected. Thereby, the portable memory in which the language dictionary 21, the segment dictionary 22, and the pattern dictionary 23 are stored can be connected to the information processing apparatus 3. Alternatively, the information processing apparatus 3 may have the memory interface in addition to the memory 31.

以上に説明したように実施の形態１の音声合成装置１では、素片候補選択部１４は、中間言語情報に基づいて先行素片候補α_１〜α_Ｍと後続素片候補β_１〜β_Ｋとを選択する（図５のステップＳＴ２１）。素片選択部１５は、先行素片候補α_１〜α_Ｍと後続素片候補β_１〜β_Ｋとの組み合わせの中から、尺度に基づいて１組以上の組み合わせ候補を絞り込んでいる（ステップＳＴ２１〜ＳＴ２６）。そして、音声波形生成部１６は、それらの多様な組み合わせ候補の中から接続性の良好な１組を選択することができる（ステップＳＴ２７）。これにより、音響特徴の接続性を良好にすることが可能となる。また、音響特徴の接続性が良好になるので、韻律加工（ステップＳＴ１３）による音質劣化が抑制されるという効果が得られる。したがって、韻律などの音響特徴の多様な制御を実行することができ、合成音声の音質劣化を軽減することができる。 As described above, in the speech synthesizer 1 of Embodiment 1, the segment candidate selection unit 14 selects the preceding segment candidates α_1 to α_M and the subsequent segment candidates β_1 to β_K based on the intermediate language information (step ST21 in FIG. 5). The segment selection unit 15 narrows the combinations of the preceding segment candidates α_1 to α_M and the subsequent segment candidates β_1 to β_K down to one or more combination candidates based on the scale (steps ST21 to ST26). The speech waveform generation unit 16 can then select one set with good connectivity from among these diverse combination candidates (step ST27). This makes it possible to improve the connectivity of the acoustic features. Because the connectivity of the acoustic features is improved, the deterioration of sound quality caused by prosodic processing (step ST13) is also suppressed. Consequently, diverse control of acoustic features such as prosody can be performed while reducing the deterioration of the sound quality of the synthesized speech.

ところで、図５の素片選択処理では、図２Ａに示した辞書データＤＴａ内の候補カテゴリ同士の尺度を参照し、先行素片候補α_１〜α_Ｍの各先行素片候補α_ｉについて、後続素片候補β_１〜β_Ｋの中から後続素片候補β_ｊを選択することによって組み合わせ候補（α_ｉ，β_ｊ）が抽出されている（ステップＳＴ２２〜ＳＴ２４）。この場合、図２Ｂに示した辞書データＤＴｂは不要である。この代わりに、図２Ｂに示した辞書データＤＴｂ内の候補カテゴリ同士の尺度を参照し、後続素片候補β_１〜β_Ｋの各後続素片候補β_ｉについて、先行素片候補α_１〜α_Ｍの中から先行素片候補α_ｊを選択することによって組み合わせ候補（β_ｉ，α_ｊ）が抽出されてもよい。この場合、図２Ａに示した辞書データＤＴａは不要となる。 In the segment selection process of FIG. 5, the combination candidates (α_i, β_j) are extracted by referring to the scales between candidate categories in the dictionary data DTa shown in FIG. 2A and, for each preceding segment candidate α_i of the preceding segment candidates α_1 to α_M, selecting a subsequent segment candidate β_j from among the subsequent segment candidates β_1 to β_K (steps ST22 to ST24). In this case, the dictionary data DTb shown in FIG. 2B is not necessary. Alternatively, combination candidates (β_i, α_j) may be extracted by referring to the scales between candidate categories in the dictionary data DTb shown in FIG. 2B and, for each subsequent segment candidate β_i of the subsequent segment candidates β_1 to β_K, selecting a preceding segment candidate α_j from among the preceding segment candidates α_1 to α_M. In that case, the dictionary data DTa shown in FIG. 2A is not necessary.

また、素片辞書２２には、人間の発声データを基に生成された韻律パターンが音響特徴情報の一部として予め格納されていることが望ましい。この場合、音声波形生成部１６は、素片辞書２２内の韻律パターンを合成音声波形に反映させることができるので、自然性の高い合成音声を得ることができる。ここで、韻律パターンは、たとえば、各素片候補の基本周波数（ピッチ）、継続時間及びパワーなどの音響特徴パラメータである。 Further, it is desirable that the segment dictionary 22 stores in advance a prosodic pattern generated based on human utterance data as part of acoustic feature information. In this case, since the speech waveform generation unit 16 can reflect the prosodic pattern in the unit dictionary 22 in the synthesized speech waveform, it is possible to obtain highly natural synthesized speech. Here, the prosodic pattern is, for example, acoustic feature parameters such as the fundamental frequency (pitch), duration, and power of each segment candidate.

実施の形態２． Embodiment 2.
次に、本発明に係る実施の形態２について説明する。図１１は、実施の形態２である音声合成装置２の概略構成を示すブロック図である。 Next, a second embodiment according to the present invention will be described. FIG. 11 is a block diagram illustrating a schematic configuration of the speech synthesizer 2 according to the second embodiment.

図１１に示されるように、この音声合成装置２は、読み上げられるべき内容を示すテキスト情報が入力される入力部１０と、そのテキスト情報の言語解析を実行して表音記号を含む中間言語情報を生成する言語解析部１１と、その中間言語情報に含まれる各言語単位の韻律的特徴などの音響特徴を示す音響特徴パターンを生成する韻律生成部１２と、その音響特徴パターンに基づいて各言語単位の音響特徴を分類するカテゴリを決定するフレーズ境界パターン決定部１３と、中間言語情報及び音響特徴パターンに基づいて複数の素片候補を選択する素片候補選択部１４Ａと、当該複数の素片候補の中から音声合成で使用される素片を選択する素片選択部１５Ａと、当該選択された素片の音響特徴情報に韻律加工及び波形接続処理を施して合成音声波形を示す音声データを生成する音声波形生成部１６と、その音声データを外部に出力する出力部１７と、各種辞書２１〜２５が予め記憶されたデータ記憶部２０とを備えている。 As shown in FIG. 11, the speech synthesizer 2 includes: an input unit 10 to which text information indicating the content to be read aloud is input; a language analysis unit 11 that performs language analysis of the text information and generates intermediate language information including phonetic symbols; a prosody generation unit 12 that generates an acoustic feature pattern indicating acoustic features, such as prosodic features, of each language unit included in the intermediate language information; a phrase boundary pattern determination unit 13 that determines, based on the acoustic feature pattern, a category for classifying the acoustic features of each language unit; a segment candidate selection unit 14A that selects a plurality of segment candidates based on the intermediate language information and the acoustic feature pattern; a segment selection unit 15A that selects, from among the plurality of segment candidates, the segments to be used for speech synthesis; a speech waveform generation unit 16 that applies prosodic processing and waveform connection processing to the acoustic feature information of the selected segments and generates speech data representing a synthesized speech waveform; an output unit 17 that outputs the speech data to the outside; and a data storage unit 20 in which various dictionaries 21 to 25 are stored in advance.

図１１に示される入力部１０、言語解析部１１、音声波形生成部１６及び出力部１７の構成は、それぞれ、上記実施の形態１の入力部１０、言語解析部１１、音声波形生成部１６及び出力部１７の構成と同じである。また、図１１に示される言語辞書２１及び素片辞書２２の内容も、それぞれ、上記実施の形態１の言語辞書２１及び素片辞書２２の内容と同じである。 The configurations of the input unit 10, the language analysis unit 11, the speech waveform generation unit 16, and the output unit 17 shown in FIG. 11 are the same as those of the input unit 10, the language analysis unit 11, the speech waveform generation unit 16, and the output unit 17 of Embodiment 1, respectively. Likewise, the contents of the language dictionary 21 and the segment dictionary 22 shown in FIG. 11 are the same as the contents of the language dictionary 21 and the segment dictionary 22 of Embodiment 1, respectively.

韻律生成部１２は、韻律辞書２４を参照して各言語単位の韻律的特徴などの音響特徴を示す音響特徴パターンを推定することができる。韻律辞書２４には、複数の言語単位の音響特徴パターンが記憶されている。その音響特徴パターンは、音韻情報、アクセント型及び呼気内モーラ数などの言語情報と、当該言語情報に対応するピッチパターン、時間長及び振幅などの音響特徴を示す音響特徴パラメータとで構成されている。後述するように、音響特徴の境界を形成する一対の先行フレーズ及び後続フレーズが入力された場合には、韻律生成部１２は、韻律辞書２４を参照して先行フレーズ及び後続フレーズのそれぞれの音響特徴パターンを推定する。これら音響特徴パターンはターゲット韻律情報として利用される。この種のターゲット韻律情報は、たとえば、特開平１−２８４８９８号公報に開示されている技術を利用して生成されてもよい。 The prosody generation unit 12 can estimate, with reference to the prosody dictionary 24, an acoustic feature pattern indicating acoustic features such as prosodic features of each language unit. The prosody dictionary 24 stores acoustic feature patterns of a plurality of language units. Each acoustic feature pattern consists of linguistic information, such as phonological information, accent type, and the number of morae in a breath group, and acoustic feature parameters indicating acoustic features, such as the pitch pattern, duration, and amplitude, that correspond to the linguistic information. As will be described later, when a pair of a preceding phrase and a subsequent phrase that form a boundary of acoustic features is input, the prosody generation unit 12 refers to the prosody dictionary 24 and estimates the acoustic feature pattern of each of the preceding phrase and the subsequent phrase. These acoustic feature patterns are used as target prosody information. Target prosody information of this kind may be generated using, for example, the technique disclosed in Japanese Patent Laid-Open No. 1-284898.

なお、本実施の形態の韻律生成部１２によって本発明の音響特徴生成部を構成することが可能である。また、韻律辞書２４は、上記の韻律的特徴に限らず、音色または声調などの音響特徴を示す情報を音響特徴パターンとして記憶していてもよい。 It should be noted that the prosody generation unit 12 of the present embodiment can constitute the acoustic feature generation unit of the present invention. The prosodic dictionary 24 is not limited to the prosodic features described above, and may store information indicating acoustic features such as timbre or tone as acoustic feature patterns.

フレーズ境界パターン決定部１３は、フレーズ境界パターン辞書２５及びターゲット韻律情報を用いて、当該先行フレーズ及び当該後続フレーズのそれぞれの音響特徴を分類するカテゴリを決定する。フレーズ境界パターン辞書２５には、ターゲット韻律情報とカテゴリとの対応関係を定めるデータが格納されているので、フレーズ境界パターン決定部１３は、フレーズ境界パターン辞書２５を参照することで、ターゲット韻律情報（音響特徴パターン）に対応するカテゴリを決定することができる。 The phrase boundary pattern determination unit 13 uses the phrase boundary pattern dictionary 25 and the target prosody information to determine a category for classifying the acoustic features of the preceding phrase and the subsequent phrase. Since the phrase boundary pattern dictionary 25 stores data that defines the correspondence between the target prosodic information and the category, the phrase boundary pattern determining unit 13 refers to the phrase boundary pattern dictionary 25 so that the target prosodic information ( The category corresponding to the acoustic feature pattern) can be determined.

素片候補選択部１４Ａは、フレーズ境界パターン決定部１３で決定されたカテゴリを目標カテゴリとして利用し、素片辞書２２及びフレーズ境界パターン辞書２３Ａを参照して、当該中間言語情報に含まれる各言語単位（たとえば、フレーズ）について単数または複数の素片候補からなる素片候補群を選択する。図１１の素片辞書２２の構成は、上記実施の形態１の素片辞書２２の構成と同じである。また、フレーズ境界パターン辞書２３Ａ（以下、単に「パターン辞書２３Ａ」ともいう。）の構成は、上記実施の形態１のパターン辞書２３の構成と実質的に同じである。素片候補選択部１４Ａの機能の詳細については後述する。 The segment candidate selection unit 14A uses the category determined by the phrase boundary pattern determination unit 13 as a target category and, referring to the segment dictionary 22 and the phrase boundary pattern dictionary 23A, selects a segment candidate group consisting of one or more segment candidates for each language unit (for example, each phrase) included in the intermediate language information. The configuration of the segment dictionary 22 in FIG. 11 is the same as that of the segment dictionary 22 of Embodiment 1. The configuration of the phrase boundary pattern dictionary 23A (hereinafter also simply referred to as the “pattern dictionary 23A”) is substantially the same as that of the pattern dictionary 23 of Embodiment 1. Details of the function of the segment candidate selection unit 14A will be described later.

素片選択部１５Ａは、素片辞書２２及びパターン辞書２３Ａを用いて、素片候補選択部１４Ａで選択された素片候補群の中から、音声合成処理で接続されるべき素片を選択し、当該選択された素片の音響特徴情報を音声波形生成部１６に供給する機能を有する。素片選択部１５Ａの機能の詳細については後述する。 The segment selection unit 15A uses the segment dictionary 22 and the pattern dictionary 23A to select a segment to be connected in the speech synthesis process from the segment candidate group selected by the segment candidate selection unit 14A. The function of supplying the acoustic feature information of the selected segment to the speech waveform generation unit 16 is provided. Details of the function of the segment selection unit 15A will be described later.

音声波形生成部１６は、素片選択部１５Ａで選択された素片に韻律加工及び波形接続処理を施して合成音声波形を示す音声データを生成する。出力部１７は、その音声データを外部に出力する。 The speech waveform generation unit 16 performs prosodic processing and waveform connection processing on the segment selected by the segment selection unit 15A to generate speech data indicating a synthesized speech waveform. The output unit 17 outputs the audio data to the outside.

次に、図１２及び図１３を参照しつつ、音声合成装置２の動作について説明する。図１２は、実施の形態２に係る音声合成処理の手順の一例を概略的に示すフローチャートであり、図１３は、図１２に示される素片選択処理（ステップＳＴ３４）の手順の一例を概略的に示すフローチャートである。図１２及び図１３は、入力された中間言語情報が、音響特徴の境界を形成する一対の先行フレーズ及び後続フレーズを含む場合の処理手順を示している。 Next, the operation of the speech synthesizer 2 will be described with reference to FIGS. 12 and 13. FIG. 12 is a flowchart schematically showing an example of the procedure of the speech synthesis process according to Embodiment 2, and FIG. 13 is a flowchart schematically showing an example of the procedure of the segment selection process (step ST34) shown in FIG. 12. FIGS. 12 and 13 show the processing procedure for the case where the input intermediate language information includes a pair of a preceding phrase and a subsequent phrase that form a boundary of acoustic features.

図１２に示されるように、音声合成装置２は、外部からテキスト情報が入力されるまで待機している（ステップＳＴ３０のＮＯ）。外部からテキスト情報が入力されると（ステップＳＴ３０のＹＥＳ）、言語解析部１１は、入力部１０を介して入力されたテキスト情報に対し、言語辞書２１を用いた言語解析を実行して中間言語情報を生成する（ステップＳＴ３１）。 As shown in FIG. 12, the speech synthesizer 2 stands by until text information is input from the outside (NO in step ST30). When text information is input from the outside (YES in step ST30), the language analysis unit 11 performs language analysis using the language dictionary 21 on the text information input via the input unit 10 and generates intermediate language information (step ST31).

次に、韻律生成部１２は、上述したように、韻律辞書２４を参照して、先行フレーズ及び後続フレーズのそれぞれの音響特徴パターンをターゲット韻律情報として生成する（ステップＳＴ３２）。そして、フレーズ境界パターン決定部１３は、フレーズ境界パターン辞書２５を参照し、先行フレーズの音響特徴パターンに基づいて、当該先行フレーズのフレーズ境界パターンであるカテゴリを決定するとともに、後続フレーズの音響特徴パターンに基づいて、当該後続フレーズのフレーズ境界パターンであるカテゴリを決定する（ステップＳＴ３３）。 Next, as described above, the prosody generation unit 12 refers to the prosody dictionary 24 and generates each acoustic feature pattern of the preceding phrase and the subsequent phrase as target prosodic information (step ST32). Then, the phrase boundary pattern determination unit 13 refers to the phrase boundary pattern dictionary 25 and determines the category that is the phrase boundary pattern of the preceding phrase based on the acoustic feature pattern of the preceding phrase, and the acoustic feature pattern of the subsequent phrase. Based on the above, the category that is the phrase boundary pattern of the subsequent phrase is determined (step ST33).

ステップＳＴ３３の具体的な処理内容は、以下の通りである。フレーズ境界パターン決定部１３は、韻律生成部１２から、当該先行フレーズのターゲット韻律情報と当該後続フレーズのターゲット韻律情報との供給を受けている。また、フレーズ境界パターン辞書２５には、参照用韻律情報とカテゴリとの対応関係を定めるデータが格納されている。よって、韻律生成部１２は、ターゲット韻律情報に含まれる音響特徴パラメータと参照用韻律情報に含まれる音響特徴パラメータとを用いて尺度を算出することができる。この尺度の算出は、たとえば、上式（１）を用いて実行することが可能である。韻律生成部１２は、尺度の最も大きい参照用韻律情報に対応するカテゴリを、当該先行フレーズまたは後続フレーズのフレーズ境界パターンとして決定することができる（ステップＳＴ３３）。 The specific processing content of step ST33 is as follows. The phrase boundary pattern determination unit 13 receives the target prosody information of the preceding phrase and the target prosody information of the subsequent phrase from the prosody generation unit 12. The phrase boundary pattern dictionary 25 stores data that defines the correspondence between reference prosodic information and categories. Therefore, the prosody generation unit 12 can calculate the scale using the acoustic feature parameter included in the target prosody information and the acoustic feature parameter included in the reference prosody information. The calculation of this scale can be executed using, for example, the above equation (1). The prosodic generation unit 12 can determine the category corresponding to the reference prosodic information having the largest scale as the phrase boundary pattern of the preceding phrase or the succeeding phrase (step ST33).
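The category decision of step ST33 can be sketched as follows. This is only an illustration under stated assumptions: equation (1) is not reproduced in this section, so the scale (measure) is taken here to be a negative weighted Euclidean distance, chosen so that a larger value means a closer match. The names `similarity` and `determine_category`, the `references` list standing in for the phrase boundary pattern dictionary 25, and all parameter values are hypothetical.

```python
import math

def similarity(target, reference, weights):
    """Assumed measure: negative weighted Euclidean distance (larger = more similar)."""
    return -math.sqrt(sum(w * (t - r) ** 2 for w, t, r in zip(weights, target, reference)))

def determine_category(target, references, weights):
    """Return the category whose reference prosody gives the largest measure.

    references: list of (category, acoustic_feature_parameters) pairs,
    a hypothetical stand-in for the phrase boundary pattern dictionary 25.
    """
    best_category, _ = max(references, key=lambda entry: similarity(target, entry[1], weights))
    return best_category

# Hypothetical parameters: (boundary F0 in Hz, power in dB, final-mora duration in ms)
references = [
    ("rise", (220.0, 62.0, 140.0)),
    ("fall", (150.0, 58.0, 110.0)),
    ("flat", (180.0, 60.0, 120.0)),
]
weights = (1.0, 4.0, 0.5)

# Target prosody of one phrase; the same call is made for the preceding and the subsequent phrase.
print(determine_category((155.0, 57.0, 115.0), references, weights))  # fall
```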

なお、音声合成装置２の入力部１０は、外部から中間処理情報が直接入力される構成を有していてもよい。この場合には、音声合成装置２は、言語解析部１１を有する必要はない。また、図１２のフローチャートにおいては、ステップＳＴ３１は不要である。 Note that the input unit 10 of the speech synthesizer 2 may have a configuration in which intermediate processing information is directly input from the outside. In this case, the speech synthesizer 2 does not need to have the language analysis unit 11. In the flowchart of FIG. 12, step ST31 is not necessary.

その後、素片選択処理（ステップＳＴ３４）が実行される。図１３を参照すると、素片候補選択部１４Ａは、素片辞書２２を参照し、入力された中間言語情報に基づいて素片候補群を選択する（ステップＳＴ４１）。ここで、入力された中間言語情報は、音響特徴の境界を形成する一対のフレーズを含むので、素片候補選択部１４Ａは、素片辞書２２を用いて、当該一対のフレーズのうちの先行フレーズに対応する素片候補α_１，α_２，…，α_Ｍ（以下、先行素片候補α_１〜α_Ｍともいう。）からなる素片候補群｛α_ｍ｝を選択するとともに、その後続フレーズに対応する素片候補β_１，β_２，…，β_Ｋ（以下、後続素片候補β_１〜β_Ｋともいう。）からなる素片候補群｛β_ｋ｝を選択する。ここで、Ｍ，Ｋは、正整数である。素片辞書２２では、素片候補α_１〜α_Ｍの各々に少なくとも１つのカテゴリが割り当てられており、素片候補β_１〜β_Ｋの各々に少なくとも１つのカテゴリが割り当てられている。 Thereafter, the segment selection process (step ST34) is executed. Referring to FIG. 13, the segment candidate selection unit 14A refers to the segment dictionary 22 and selects segment candidate groups based on the input intermediate language information (step ST41). Here, since the input intermediate language information includes a pair of phrases that form a boundary of acoustic features, the segment candidate selection unit 14A uses the segment dictionary 22 to select a segment candidate group {α_m} consisting of segment candidates α_1, α_2, ..., α_M corresponding to the preceding phrase of the pair (hereinafter also referred to as preceding segment candidates α_1 to α_M), and a segment candidate group {β_k} consisting of segment candidates β_1, β_2, ..., β_K corresponding to the subsequent phrase (hereinafter also referred to as subsequent segment candidates β_1 to β_K). Here, M and K are positive integers. In the segment dictionary 22, at least one category is assigned to each of the segment candidates α_1 to α_M, and at least one category is assigned to each of the segment candidates β_1 to β_K.

次に、素片候補選択部１４Ａは、ステップＳＴ３３で決定されたフレーズ境界パターン（ターゲット韻律情報に対応するフレーズ境界パターン）をそれぞれ目標カテゴリに設定する（ステップＳＴ４２）。 Next, the segment candidate selection unit 14A sets the phrase boundary pattern (phrase boundary pattern corresponding to the target prosodic information) determined in step ST33 as a target category (step ST42).

続いて、素片候補選択部１４Ａは、当該先行フレーズの目標カテゴリと先行素片候補α_１〜α_Ｍのカテゴリとの間の尺度に基づいて、先行素片候補α_１〜α_Ｍの中から少なくとも１個の素片候補からなる中間素片候補群｛α_ｐ｝を抽出するとともに、当該後続フレーズの目標カテゴリと後続素片候補β_１〜β_Ｋのカテゴリとの間の尺度に基づいて、後続素片候補β_１〜β_Ｋの中から少なくとも１個の素片候補からなる中間素片候補群｛β_ｑ｝を抽出する（ステップＳＴ４３）。ここで、実施の形態１のパターン辞書２３と同様に、パターン辞書２３Ａは、２種類のフレーズ境界パターン（候補カテゴリ）同士の尺度を複数定める辞書データが格納されている。素片候補選択部１４Ａは、パターン辞書２３Ａを参照して、たとえば、先行素片候補α_１〜α_Ｍの中から、尺度の大きい方から順に所定個数の素片候補を選択することにより中間素片候補群｛α_ｐ｝を抽出することができる。後続素片候補β_１〜β_Ｋについても、素片候補選択部１４Ａは、パターン辞書２３Ａを参照して、これら後続素片候補β_１〜β_Ｋの中から、尺度の大きい方から順に所定個数の素片候補を選択することにより中間素片候補群｛β_ｑ｝を抽出することができる。 Subsequently, the segment candidate selection unit 14A extracts an intermediate segment candidate group {α_p} consisting of at least one segment candidate from the preceding segment candidates α_1 to α_M, based on the measure between the target category of the preceding phrase and the categories of the preceding segment candidates α_1 to α_M. It likewise extracts an intermediate segment candidate group {β_q} consisting of at least one segment candidate from the subsequent segment candidates β_1 to β_K, based on the measure between the target category of the subsequent phrase and the categories of the subsequent segment candidates β_1 to β_K (step ST43). Here, as with the pattern dictionary 23 of the first embodiment, the pattern dictionary 23A stores dictionary data that defines a plurality of measures between pairs of phrase boundary patterns (candidate categories). The segment candidate selection unit 14A can refer to the pattern dictionary 23A and extract the intermediate segment candidate group {α_p} by, for example, selecting a predetermined number of segment candidates from the preceding segment candidates α_1 to α_M in descending order of the measure. For the subsequent segment candidates β_1 to β_K as well, the segment candidate selection unit 14A can refer to the pattern dictionary 23A and extract the intermediate segment candidate group {β_q} by selecting a predetermined number of segment candidates from the subsequent segment candidates β_1 to β_K in descending order of the measure.
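The extraction of step ST43 amounts to a top-N selection by the measure. The following is a minimal sketch: the function name `extract_intermediate_group`, the category labels, and the `pattern_measure` table (standing in for the dictionary data of the pattern dictionary 23A, whose actual layout is not given in this section) are assumptions for illustration only.

```python
def extract_intermediate_group(target_category, candidates, pattern_measure, n):
    """Keep the n candidates whose category scores highest against the target category.

    candidates: list of (candidate_id, category) pairs taken from the segment dictionary.
    pattern_measure: dict mapping (category_a, category_b) -> measure; a hypothetical
    stand-in for the dictionary data held in the pattern dictionary 23A.
    """
    ranked = sorted(
        candidates,
        key=lambda cand: pattern_measure[(target_category, cand[1])],
        reverse=True,  # descending order of the measure
    )
    return ranked[:max(1, n)]  # the group always holds at least one candidate

# Hypothetical measures between the target category "C1" and each candidate category.
pattern_measure = {("C1", "C1"): 1.0, ("C1", "C2"): 0.6, ("C1", "C3"): 0.1}
preceding = [("a1", "C3"), ("a2", "C1"), ("a3", "C2"), ("a4", "C1")]

print(extract_intermediate_group("C1", preceding, pattern_measure, 3))
# [('a2', 'C1'), ('a4', 'C1'), ('a3', 'C2')]
```

The same function would be called a second time with the target category of the subsequent phrase and the subsequent segment candidates to obtain {β_q}.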

その後、素片選択部１５Ａは、中間素片候補群｛α_ｐ｝の中から一の先行素片候補α_ｉを選択する（ステップＳＴ４４）。ｉは、先行素片候補α_ｉの識別番号であり、今の場合、たとえばｉ＝１である。続けて、素片選択部１５Ａは、当該先行素片候補α_ｉのカテゴリを目標カテゴリに設定する（ステップＳＴ４５）。これにより、中間素片候補群｛β_ｑ｝を構成する後続素片候補のそれぞれのカテゴリが候補カテゴリに設定される。 Thereafter, the segment selection unit 15A selects one preceding segment candidate α _i from the intermediate segment candidate group {α _p } (step ST44). i is the identification number of the preceding segment candidate α _i , and in this case, for example, i = 1. Subsequently, the segment selection unit 15A sets the category of the preceding segment candidate α _i as a target category (step ST45). Thereby, each category of the subsequent segment candidate which comprises intermediate segment candidate group {(beta) _q } is set to a candidate category.

次に、素片選択部１５Ａは、パターン辞書２３Ａを参照して、中間素片候補群｛β_ｑ｝を構成する後続素片候補の候補カテゴリ同士の尺度に基づき、先行素片候補α_ｉについて、中間素片候補群｛β_ｑ｝の中から少なくとも１個の後続素片候補β_ｊを抽出する（ステップＳＴ４６）。ここでは、素片選択部１５Ａは、上記ステップＳＴ２４（図５）の場合と同様に、図２Ａに示した辞書データＤＴａにおける候補カテゴリ同士の尺度を参照して、少なくとも１個の後続素片候補β_ｊを抽出すればよい。これにより、先行素片候補α_ｉと後続素片候補β_ｊとの組み合わせ候補（α_ｉ，β_ｊ）が少なくとも１組生成される。この組み合わせ候補（α_ｉ，β_ｊ）と当該組み合わせ候補（α_ｉ，β_ｊ）の音響特徴情報とは、音声波形生成部１６に供給される。 Next, the segment selection unit 15A refers to the pattern dictionary 23A and, based on the measures between the candidate categories of the subsequent segment candidates constituting the intermediate segment candidate group {β_q}, extracts at least one subsequent segment candidate β_j from the intermediate segment candidate group {β_q} for the preceding segment candidate α_i (step ST46). Here, as in step ST24 (FIG. 5), the segment selection unit 15A may extract at least one subsequent segment candidate β_j by referring to the measures between candidate categories in the dictionary data DTa shown in FIG. 2A. As a result, at least one combination candidate (α_i, β_j) of the preceding segment candidate α_i and the subsequent segment candidate β_j is generated. The combination candidate (α_i, β_j) and its acoustic feature information are supplied to the speech waveform generation unit 16.

その後、未選択の先行素片候補が有ると判定した場合には（ステップＳＴ４７のＹＥＳ）、素片選択部１５Ａは、識別番号ｉを変更して（たとえば、ｉを１から２に変更して）新たな先行素片候補α_ｉを選択し（ステップＳＴ４８）、この先行素片候補α_ｉについて上記ステップＳＴ４５〜ＳＴ４６を実行する。この結果、先行素片候補α_ｉと後続素片候補β_ｊとの新たな組み合わせの候補（α_ｉ，β_ｊ）が少なくとも１組生成される。この新たな組み合わせ候補（α_ｉ，β_ｊ）と当該新たな組み合わせ候補（α_ｉ，β_ｊ）の音響特徴情報とは、音声波形生成部１６に供給される。 Thereafter, when it is determined that there is an unselected preceding segment candidate (YES in step ST47), the segment selection unit 15A changes the identification number i (for example, changes i from 1 to 2). ) A new preceding element candidate α _i is selected (step ST48), and the above steps ST45 to ST46 are executed for the preceding element candidate α _i . As a result, at least one set (α _i , β _j ) of a new combination of the leading element candidate α _i and the succeeding element candidate β _j is generated. The new combination candidate (α _i , β _j ) and the acoustic feature information of the new combination candidate (α _i , β _j ) are supplied to the speech waveform generation unit 16.
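The loop over steps ST44 to ST48 can be sketched as below. This is a hedged illustration, not the patented implementation: it assumes that the dictionary data DTa can be read as a table from (target category, candidate category) to a measure, and the names `build_combination_candidates`, `pattern_measure`, and `per_alpha` are hypothetical.

```python
def build_combination_candidates(alpha_group, beta_group, pattern_measure, per_alpha=1):
    """Pair every preceding candidate with its best-matching subsequent candidates.

    alpha_group, beta_group: lists of (candidate_id, category) pairs, i.e. the
    intermediate segment candidate groups produced in step ST43.
    pattern_measure: dict mapping (target_category, candidate_category) -> measure
    (an assumed reading of the dictionary data DTa).
    """
    combinations = []
    for alpha_id, alpha_category in alpha_group:      # ST44 / ST48: take each preceding candidate in turn
        target_category = alpha_category              # ST45: its category becomes the target category
        ranked = sorted(
            beta_group,
            key=lambda beta: pattern_measure[(target_category, beta[1])],
            reverse=True,
        )
        for beta_id, _ in ranked[:per_alpha]:         # ST46: keep at least one subsequent candidate
            combinations.append((alpha_id, beta_id))
    return combinations                               # reached when ST47 finds no unselected candidate

pattern_measure = {
    ("C1", "C1"): 1.0, ("C1", "C2"): 0.6,
    ("C2", "C1"): 0.6, ("C2", "C2"): 1.0,
}
alpha_group = [("a2", "C1"), ("a3", "C2")]
beta_group = [("b1", "C2"), ("b5", "C1")]

print(build_combination_candidates(alpha_group, beta_group, pattern_measure))
# [('a2', 'b5'), ('a3', 'b1')]
```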

最終的に、未選択の先行素片候補が無いと判定した場合には（ステップＳＴ４７のＮＯ）、音声波形生成部１６は、所定の波形接続アルゴリズムに従い、基本周波数、パワー、継続長及び短時間振幅スペクトルなどの音響特徴パラメータを用いて、上記した組み合わせ候補（先行素片候補と後続素片候補との組の候補）の中から、音声波形の接続性を良好にする最適な組み合わせを選択する（ステップＳＴ４９）。波形接続アルゴリズムは、特に制限されるものではないが、たとえば、特開２０１３−１５６４７２号公報に開示されている公知の波形接続アルゴリズムが使用されればよい。 Finally, when it is determined that there is no unselected preceding segment candidate (NO in step ST47), the speech waveform generator 16 follows the predetermined waveform connection algorithm, and the fundamental frequency, power, duration, and short time Using the acoustic feature parameters such as the amplitude spectrum, the optimum combination that improves the connectivity of the speech waveform is selected from the above-described combination candidates (candidates for the combination of the preceding unit candidate and the subsequent unit candidate). (Step ST49). The waveform connection algorithm is not particularly limited. For example, a known waveform connection algorithm disclosed in Japanese Patent Application Laid-Open No. 2013-156472 may be used.
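Step ST49 picks, from the combination candidates, the pair that joins most smoothly. The waveform connection algorithm of JP 2013-156472 A is not described in this section, so the sketch below substitutes a simple stand-in: a weighted sum of absolute differences in fundamental frequency and power across the boundary. The function names, feature keys, and numbers are all hypothetical.

```python
def join_cost(preceding_end, subsequent_start, weights):
    """Stand-in join cost: weighted absolute mismatch of boundary features."""
    return sum(w * abs(preceding_end[name] - subsequent_start[name]) for name, w in weights.items())

def select_best_combination(combinations, boundary_features, weights):
    """Return the (preceding, subsequent) pair with the smallest join cost."""
    return min(
        combinations,
        key=lambda pair: join_cost(
            boundary_features[pair[0]]["end"],
            boundary_features[pair[1]]["start"],
            weights,
        ),
    )

# Hypothetical boundary features: F0 in Hz and power in dB at the joining edge of each segment.
boundary_features = {
    "a2": {"end": {"f0": 180.0, "power": 60.0}},
    "a3": {"end": {"f0": 150.0, "power": 58.0}},
    "b5": {"start": {"f0": 175.0, "power": 61.0}},
    "b1": {"start": {"f0": 200.0, "power": 55.0}},
}
weights = {"f0": 1.0, "power": 2.0}
combinations = [("a2", "b5"), ("a3", "b1")]

print(select_best_combination(combinations, boundary_features, weights))  # ('a2', 'b5')
```

A real system would also compare duration and the short-time amplitude spectrum, which the text lists among the acoustic feature parameters.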

その後、図１２を参照すると、音声波形生成部１６は、素片選択部１５Ａで選択された組み合わせの素片の音響特徴情報に韻律加工を実行し（ステップＳＴ３５）、更に、その韻律加工の結果得られた、先行素片及び後続素片の音響特徴情報に基づいて合成音声波形を示す音声データを生成する（ステップＳＴ３６）。その後、出力部１７は、その音声データを外部に出力する（ステップＳＴ３７）。 Then, referring to FIG. 12, the speech waveform generation unit 16 performs prosodic processing on the acoustic feature information of the combination of segments selected by the segment selection unit 15A (step ST35), and further results of the prosody processing Based on the obtained acoustic feature information of the preceding and subsequent segments, speech data indicating a synthesized speech waveform is generated (step ST36). Thereafter, the output unit 17 outputs the audio data to the outside (step ST37).

以上に説明したように実施の形態２の音声合成装置２では、素片候補選択部１４Ａは、ターゲット韻律情報に対応するフレーズ境界パターンを用いて、先行素片候補α_１〜α_Ｍの中から中間素片候補群｛α_ｐ｝を抽出するとともに、後続素片候補β_１〜β_Ｋの中から中間素片候補群｛β_ｑ｝を抽出している（図１３のステップＳＴ４２，ＳＴ４３）。また、素片選択部１５Ａは、当該中間素片候補群｛α_ｐ｝を構成する先行素片候補と当該中間素片候補群｛β_ｑ｝を構成する後続素片候補との組み合わせの中から、尺度に基づいて１組以上の組み合わせ候補を絞り込んでいる（ステップＳＴ４４〜ＳＴ４８）。そして、音声波形生成部１６は、それらの多様な組み合わせ候補の中から接続性の良好な１組を選択することができる（ステップＳＴ４９）。これにより、上記実施の形態１の場合と比べると、音響特徴の接続性を更に良好にすることが可能となる。また、音響特徴の接続性が良好になるので、韻律加工（図１２のステップＳＴ３５）による音質劣化が抑制されるという効果が得られる。したがって、音響特徴の多様な制御を実行することができ、合成音声の音質劣化を軽減することができる。 As described above, in the speech synthesizer 2 according to the second embodiment, the segment candidate selection unit 14A uses the phrase boundary pattern corresponding to the target prosodic information from the preceding segment candidates α _{1 to} α _M. The intermediate element candidate group {α _p } is extracted, and the intermediate element candidate group {β _q } is extracted from the subsequent element candidates β _{1 to} β _K (steps ST42 and ST43 in FIG. 13). Further, the element selection unit 15A selects from among combinations of the preceding element candidates constituting the intermediate element candidate group {α _p } and the subsequent element candidates constituting the intermediate element candidate group {β _q }. One or more combination candidates are narrowed down based on the scale (steps ST44 to ST48). Then, the speech waveform generation unit 16 can select one set with good connectivity from among these various combination candidates (step ST49). Thereby, compared with the case of the said Embodiment 1, it becomes possible to make the connectivity of an acoustic feature still more favorable. In addition, since the connectivity of the acoustic features is improved, an effect is obtained that deterioration in sound quality due to prosodic processing (step ST35 in FIG. 12) is suppressed. Therefore, various control of the acoustic features can be executed, and deterioration of the sound quality of the synthesized speech can be reduced.

ところで、図１２の素片選択処理では、中間素片候補群｛α_ｐ｝を構成する各先行素片候補α_ｉについて、中間素片候補群｛β_ｑ｝の中から後続素片候補β_ｊを選択することによって組み合わせ候補（α_ｉ，β_ｊ）が抽出されている（ステップＳＴ４４〜ＳＴ４６）。この場合、図２Ａに示した辞書データＤＴａが参照されるので、図２Ｂに示した辞書データＤＴｂは不要である。この代わりに、図２Ｂに示した辞書データＤＴｂ内の候補カテゴリ同士の尺度を参照し、中間素片候補群｛β_ｑ｝を構成する各後続素片候補β_ｉについて、中間素片候補群｛α_ｐ｝の中から先行素片候補α_ｊを選択することによって組み合わせ候補（β_ｉ，α_ｊ）が抽出されてもよい。 By the way, in the segment selection process of FIG. 12, for each preceding segment candidate α _i constituting the intermediate segment candidate group {α _p }, the subsequent segment candidate β _j from the intermediate segment candidate group {β _q }. The combination candidate (α _i , β _j ) is extracted by selecting (steps ST44 to ST46). In this case, since the dictionary data DTa shown in FIG. 2A is referred to, the dictionary data DTb shown in FIG. 2B is unnecessary. Instead, referring to the scale of candidate categories in the dictionary data DTb shown in FIG. 2B, for each subsequent segment candidate β _i constituting the intermediate segment candidate group {β _q }, the intermediate segment candidate group { A combination candidate (β _i , α _j ) may be extracted by selecting a preceding segment candidate α _j from α _p }.

なお、言語解析部１１で生成される中間言語情報がフレーズ境界パターンを指定する記号（以下「フレーズ境界記号」ともいう。）を含んでいてもよい。この場合、フレーズ境界パターンごとに異なるフレーズ境界記号を設けることが可能である。図１４Ａは、入力テキスト情報の例を示し、図１４Ｂは、この入力テキスト情報に対応するピッチパターンを示し、図１４Ｃは、フレーズ境界記号を含む中間言語情報の例を示す図である。図１４Ｃに示されるように、フレーズ境界記号”＠ｊｕｎ”，”＠ｇｙａｋｕ”が挿入されている。すなわち、「寒いが」と「晴れていて」との間において、呼気フレーズ区切りを表す記号”,”の前に「順接」を示す”＠ｊｕｎ”が挿入されている。また、「暑いが」と「雨である」との間において、アクセントフレーズ区切りを表す記号”/”の前に「逆接」を示す”＠ｇｙａｋｕ”が挿入されている。フレーズ境界記号”＠ｊｕｎ”は、フレーズ「今日は寒いが」の音響特徴情報を示し、フレーズ境界記号”＠ｇｙａｋｕ”は、フレーズ「明日は暑いが」の音響特徴情報を示している。フレーズ境界パターン決定部１３は、これらフレーズ境界記号で示される音響特徴情報をターゲット韻律（目標韻律）情報として利用してカテゴリを決定することができる。 Note that the intermediate language information generated by the language analysis unit 11 may include a symbol designating a phrase boundary pattern (hereinafter also referred to as a “phrase boundary symbol”). In this case, a different phrase boundary symbol can be provided for each phrase boundary pattern. FIG. 14A shows an example of input text information, FIG. 14B shows a pitch pattern corresponding to this input text information, and FIG. 14C shows an example of intermediate language information including phrase boundary symbols. As shown in FIG. 14C, the phrase boundary symbols “@jun” and “@gyaku” are inserted. Specifically, between 「寒いが」 (“it is cold, but”) and 「晴れていて」 (“it is sunny, and”), “@jun”, which indicates a non-contrastive connection (順接, junsetsu), is inserted before the symbol “,” representing a breath phrase break. Likewise, between 「暑いが」 (“it is hot, but”) and 「雨である」 (“it is rainy”), “@gyaku”, which indicates a contrastive connection (逆接, gyakusetsu), is inserted before the symbol “/” representing an accent phrase break. The phrase boundary symbol “@jun” indicates the acoustic feature information of the phrase 「今日は寒いが」 (“today it is cold, but”), and the phrase boundary symbol “@gyaku” indicates the acoustic feature information of the phrase 「明日は暑いが」 (“tomorrow it is hot, but”). The phrase boundary pattern determination unit 13 can determine a category by using the acoustic feature information indicated by these phrase boundary symbols as target prosody information.
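A reader of such intermediate language information has to locate each phrase boundary symbol and the delimiter it precedes. The sketch below shows one way to do that with a regular expression. FIG. 14C is not reproduced in this section, so the romanized sample string, the ASCII spelling of the symbols, and the names `BOUNDARY`, `DELIMITER_KIND`, and `find_phrase_boundaries` are assumptions made for illustration.

```python
import re

# A phrase boundary symbol ("@" plus lowercase letters) followed by a phrase delimiter.
BOUNDARY = re.compile(r"(@[a-z]+)\s*([,/])")

DELIMITER_KIND = {",": "breath phrase break", "/": "accent phrase break"}

def find_phrase_boundaries(intermediate):
    """Return (symbol, delimiter kind) for every annotated boundary, in order."""
    return [(m.group(1), DELIMITER_KIND[m.group(2)]) for m in BOUNDARY.finditer(intermediate)]

# Invented sample; delimiters without a preceding symbol are left unannotated.
sample = "kyo'owa samu'iga @jun, ha'reteite / ashitawa atsu'iga @gyaku/ a'mede aru"

print(find_phrase_boundaries(sample))
# [('@jun', 'breath phrase break'), ('@gyaku', 'accent phrase break')]
```

Each returned symbol could then be mapped to the acoustic feature information that serves as target prosody information for category determination.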

また、本実施の形態の音声合成装置２のハードウェア構成は、上記実施の形態１の場合と同様に、図１０に示した情報処理装置３によって実現可能である。 Further, the hardware configuration of the speech synthesizer 2 of the present embodiment can be realized by the information processing apparatus 3 shown in FIG. 10 as in the case of the first embodiment.

以上、図面を参照して本発明に係る種々の実施の形態について述べたが、これら実施の形態は本発明の例示であり、これら実施の形態以外の様々な形態を採用することもできる。 Although various embodiments according to the present invention have been described above with reference to the drawings, these embodiments are examples of the present invention, and various forms other than these embodiments can be adopted.
なお、本発明の範囲内において、実施の形態１，２の自由な組み合わせ、各実施の形態の任意の構成要素の変形、または各実施の形態の任意の構成要素の省略が可能である。 In addition, within the scope of the present invention, the free combination of the first and second embodiments, the modification of any component in each embodiment, or the omission of any component in each embodiment is possible.

１，２音声合成装置、３情報処理装置、１０入力部、１１言語解析部、１２韻律生成部、１３フレーズ境界パターン決定部、１４，１４Ａ素片候補選択部、１５，１５Ａ素片選択部、１６音声波形生成部、１７出力部、２０データ記憶部、２１言語辞書、２２素片辞書、２３，２３Ａフレーズ境界パターン辞書、２４韻律辞書、２５フレーズ境界パターン辞書、３０プロセッサ、３１メモリ、３２センサインタフェース部、３３テキスト入力インタフェース部、３４表示インタフェース部、３５音声インタフェース部、３６信号路。 1, 2 speech synthesis device, 3 information processing device, 10 input unit, 11 language analysis unit, 12 prosody generation unit, 13 phrase boundary pattern determination unit, 14, 14A segment candidate selection unit, 15, 15A segment selection unit, 16 speech waveform generation unit, 17 output unit, 20 data storage unit, 21 language dictionary, 22 segment dictionary, 23, 23A phrase boundary pattern dictionary, 24 prosodic dictionary, 25 phrase boundary pattern dictionary, 30 processor, 31 memory, 32 sensor Interface unit, 33 Text input interface unit, 34 Display interface unit, 35 Audio interface unit, 36 Signal path.

Claims

音響特徴の境界を形成する一対の言語単位を入力として合成音声波形を生成する音声合成装置であって、
音声素片候補の言語情報と当該音声素片候補の音響特徴量と当該音声素片候補の音響特徴を分類するカテゴリとの間の対応関係を定める素片辞書が記憶されているデータ記憶部と、
前記素片辞書を用いて、前記一対の言語単位の一方に対応する複数の音声素片候補からなる第１素片候補群とともに、前記一対の言語単位の他方に対応する複数の音声素片候補からなる第２素片候補群を選択する素片候補選択部と、
前記第１素片候補群を構成する音声素片候補と前記第２素片候補群を構成する音声素片候補との組み合わせの中から、少なくとも１組の音声素片候補からなる組み合わせ候補を抽出する素片選択部と、
前記組み合わせ候補の中から選択された組の音声素片の音響特徴情報に基づいて前記合成音声波形を生成する音声波形生成部と
を備え、
前記素片選択部は、前記第２素片候補群を構成する音声素片候補のカテゴリ同士の類似尺度に基づき、前記第１素片候補群を構成する各音声素片候補について前記第２素片候補群の中から少なくとも１個の音声素片候補を選択することによって前記組み合わせ候補を抽出することを特徴とする音声合成装置。 A speech synthesizer that generates a synthesized speech waveform by inputting a pair of language units forming a boundary of acoustic features,
comprising: a data storage unit storing a segment dictionary that defines a correspondence among language information of speech segment candidates, acoustic feature quantities of the speech segment candidates, and categories that classify the acoustic features of the speech segment candidates;
a segment candidate selection unit that uses the segment dictionary to select a first segment candidate group consisting of a plurality of speech segment candidates corresponding to one of the pair of language units, together with a second segment candidate group consisting of a plurality of speech segment candidates corresponding to the other of the pair of language units;
a segment selection unit that extracts combination candidates consisting of at least one pair of speech segment candidates from among combinations of the speech segment candidates constituting the first segment candidate group and the speech segment candidates constituting the second segment candidate group; and
a speech waveform generation unit that generates the synthesized speech waveform based on acoustic feature information of the speech segments of a pair selected from the combination candidates,
wherein the segment selection unit extracts the combination candidates by selecting, for each speech segment candidate constituting the first segment candidate group, at least one speech segment candidate from the second segment candidate group, based on a similarity measure between categories of the speech segment candidates constituting the second segment candidate group.

請求項１記載の音声合成装置であって、
前記データ記憶部には、音声素片候補のカテゴリ同士の類似尺度を格納するパターン辞書が予め記憶されており、
前記素片選択部は、前記パターン辞書を用いて前記組み合わせ候補を抽出することを特徴とする音声合成装置。 The speech synthesizer according to claim 1,
In the data storage unit, a pattern dictionary for storing similarity measures between categories of speech unit candidates is stored in advance,
The speech synthesizer characterized in that the segment selection unit extracts the combination candidates using the pattern dictionary.

請求項２記載の音声合成装置であって、前記類似尺度は、重み付きユークリッド距離であることを特徴とする音声合成装置。 3. The speech synthesizer according to claim 2, wherein the similarity measure is a weighted Euclidean distance.

請求項１から請求項３のうちのいずれか１項記載の音声合成装置であって、入力されたテキスト情報の言語解析を実行して、前記一対の言語単位を含む中間言語情報を生成する言語解析部を更に備え、
前記素片候補選択部は、前記中間言語情報に基づいて前記第１素片候補群及び前記第２素片候補群を選択することを特徴とする音声合成装置。 The speech synthesizer according to any one of claims 1 to 3, further comprising a language analysis unit that performs language analysis of input text information and generates intermediate language information including the pair of language units,
wherein the segment candidate selection unit selects the first segment candidate group and the second segment candidate group based on the intermediate language information.

請求項１記載の音声合成装置であって、
前記一対の言語単位のうちの一方の言語単位の音響特徴を示す第１の音響特徴パターンを推定するとともに、前記一対の言語単位のうちの他方の言語単位の音響特徴を示す第２の音響特徴パターンを推定する音響特徴生成部と、
前記第１の音響特徴パターンに基づいて、当該一方の言語単位の音響特徴を分類するカテゴリである第１の目標カテゴリを決定するとともに、前記第２の音響特徴パターンに基づいて、当該他方の言語単位の音響特徴を分類するカテゴリである第２の目標カテゴリを決定するパターン決定部と
を備え、
前記素片候補選択部は、前記第１の目標カテゴリと前記第１素片候補群を構成する各音声素片候補のカテゴリとの間の類似尺度に基いて、前記第１素片候補群の中から少なくとも１個の音声素片候補からなる第１の中間素片候補群を抽出するとともに、前記第２の目標カテゴリと前記第２素片候補群を構成する各音声素片候補のカテゴリとの間の類似尺度に基づいて、前記第２素片候補群の中から少なくとも１個の音声素片候補からなる第２の中間素片候補群を抽出し、
前記素片選択部は、前記第１の中間素片候補群を構成する音声素片候補と前記第２の中間素片候補群を構成する音声素片候補との組み合わせの中から前記組み合わせ候補を抽出することを特徴とする音声合成装置。 The speech synthesizer according to claim 1,
further comprising: an acoustic feature generation unit that estimates a first acoustic feature pattern indicating an acoustic feature of one language unit of the pair of language units, and estimates a second acoustic feature pattern indicating an acoustic feature of the other language unit of the pair of language units; and
a pattern determination unit that determines, based on the first acoustic feature pattern, a first target category, which is a category for classifying the acoustic features of the one language unit, and determines, based on the second acoustic feature pattern, a second target category, which is a category for classifying the acoustic features of the other language unit,
wherein the segment candidate selection unit extracts, from the first segment candidate group, a first intermediate segment candidate group consisting of at least one speech segment candidate, based on a similarity measure between the first target category and the category of each speech segment candidate constituting the first segment candidate group, and extracts, from the second segment candidate group, a second intermediate segment candidate group consisting of at least one speech segment candidate, based on a similarity measure between the second target category and the category of each speech segment candidate constituting the second segment candidate group, and
the segment selection unit extracts the combination candidates from among combinations of the speech segment candidates constituting the first intermediate segment candidate group and the speech segment candidates constituting the second intermediate segment candidate group.

請求項５記載の音声合成装置であって、
前記データ記憶部には、音声素片候補のカテゴリ同士の類似尺度を格納するパターン辞書が予め記憶されており、
前記素片候補選択部は、前記パターン辞書を用いて前記第１の中間素片候補群及び前記第２の中間素片候補群を抽出することを特徴とする音声合成装置。 The speech synthesizer according to claim 5,
In the data storage unit, a pattern dictionary for storing similarity measures between categories of speech unit candidates is stored in advance,
The speech synthesizer characterized in that the segment candidate selection unit extracts the first intermediate segment candidate group and the second intermediate segment candidate group using the pattern dictionary.

請求項５または請求項６記載の音声合成装置であって、前記類似尺度は、重み付きユークリッド距離であることを特徴とする音声合成装置。 7. The speech synthesizer according to claim 5 or claim 6, wherein the similarity measure is a weighted Euclidean distance.

請求項５から請求項７のうちのいずれか１項記載の音声合成装置であって、入力されたテキスト情報の言語解析を実行して、前記一対の言語単位を含む中間言語情報を生成する言語解析部を更に備え、
前記素片候補選択部は、前記中間言語情報に基づいて前記第１の中間素片候補群及び前記第２の中間素片候補群を抽出することを特徴とする音声合成装置。 The speech synthesizer according to any one of claims 5 to 7, further comprising a language analysis unit that performs language analysis of input text information and generates intermediate language information including the pair of language units,
wherein the segment candidate selection unit extracts the first intermediate segment candidate group and the second intermediate segment candidate group based on the intermediate language information.

音声素片候補の言語情報と当該音声素片候補の音響特徴量と当該音声素片候補の音響特徴を分類するカテゴリとの間の対応関係を定める素片辞書が記憶されているデータ記憶部を備えた情報処理装置において実行される音声合成方法であって、
音響特徴の境界を形成する一対の言語単位を入力とするステップと、
前記素片辞書を用いて、前記一対の言語単位の一方に対応する複数の音声素片候補からなる第１素片候補群とともに、前記一対の言語単位の他方に対応する複数の音声素片候補からなる第２素片候補群を選択するステップと、
前記第２素片候補群を構成する複数の音声素片候補のカテゴリ同士の類似尺度に基づき、前記第１素片候補群を構成する各音声素片候補に対して前記第２素片候補群の中から少なくとも１個の音声素片候補を選択するステップと、
当該選択結果により、前記第１素片候補群を構成する複数の音声素片候補と前記第２素片候補群を構成する複数の音声素片候補との組み合わせの中から、少なくとも１組の音声素片候補からなる組み合わせ候補を抽出するステップと、
前記組み合わせ候補の中から選択された組の音声素片の音響特徴情報に基づいて合成音声波形を生成するステップと
を備えることを特徴とする音声合成方法。 A speech synthesis method executed in an information processing apparatus including a data storage unit storing a segment dictionary that defines a correspondence among language information of speech segment candidates, acoustic feature quantities of the speech segment candidates, and categories that classify the acoustic features of the speech segment candidates, the method comprising:
receiving as input a pair of language units forming a boundary of acoustic features;
selecting, using the segment dictionary, a first segment candidate group consisting of a plurality of speech segment candidates corresponding to one of the pair of language units, together with a second segment candidate group consisting of a plurality of speech segment candidates corresponding to the other of the pair of language units;
selecting, for each speech segment candidate constituting the first segment candidate group, at least one speech segment candidate from the second segment candidate group, based on a similarity measure between categories of the plurality of speech segment candidates constituting the second segment candidate group;
extracting, according to the selection result, combination candidates consisting of at least one pair of speech segment candidates from among combinations of the plurality of speech segment candidates constituting the first segment candidate group and the plurality of speech segment candidates constituting the second segment candidate group; and
generating a synthesized speech waveform based on acoustic feature information of the speech segments of a pair selected from the combination candidates.

請求項９記載の音声合成方法であって、
前記一対の言語単位のうちの一方の言語単位の音響特徴を示す第１の音響特徴パターンを検出するとともに、前記一対の言語単位のうちの他方の言語単位の音響特徴を示す第２の音響特徴パターンを検出するステップと、
前記第１の音響特徴パターンに基づいて、当該一方の言語単位の音響特徴を分類するカテゴリである第１の目標カテゴリを決定するステップと、
前記第２の音響特徴パターンに基づいて、当該他方の言語単位の音響特徴を分類するカテゴリである第２の目標カテゴリを決定するステップと、
前記第１の目標カテゴリと前記第１素片候補群を構成する各音声素片候補のカテゴリとの間の類似尺度に基いて、前記第１素片候補群の中から少なくとも１個の音声素片候補からなる第１の中間素片候補群を抽出するステップと、
前記第２の目標カテゴリと前記第２素片候補群を構成する各音声素片候補のカテゴリとの間の類似尺度に基づいて、前記第２素片候補群の中から少なくとも１個の音声素片候補からなる第２の中間素片候補群を抽出するステップと
を更に備え、
前記組み合わせ候補は、前記第１の中間素片候補群を構成する音声素片候補と前記第２の中間素片候補群を構成する音声素片候補との組み合わせの中から抽出されることを特徴とする音声合成方法。 The speech synthesis method according to claim 9, comprising:
detecting a first acoustic feature pattern indicating an acoustic feature of one language unit of the pair of language units, and detecting a second acoustic feature pattern indicating an acoustic feature of the other language unit of the pair of language units;
determining, based on the first acoustic feature pattern, a first target category, which is a category for classifying the acoustic features of the one language unit;
determining, based on the second acoustic feature pattern, a second target category, which is a category for classifying the acoustic features of the other language unit;
extracting, from the first segment candidate group, a first intermediate segment candidate group consisting of at least one speech segment candidate, based on a similarity measure between the first target category and the category of each speech segment candidate constituting the first segment candidate group; and
extracting, from the second segment candidate group, a second intermediate segment candidate group consisting of at least one speech segment candidate, based on a similarity measure between the second target category and the category of each speech segment candidate constituting the second segment candidate group,
wherein the combination candidates are extracted from among combinations of the speech segment candidates constituting the first intermediate segment candidate group and the speech segment candidates constituting the second intermediate segment candidate group.