JP6479637B2

JP6479637B2 - Sentence set generation device, sentence set generation method, program

Info

Publication number: JP6479637B2
Application number: JP2015236629A
Authority: JP
Inventors: 井島 勇祐 (Yusuke Ijima); 吉田 明弘 (Akihiro Yoshida)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-12-03
Filing date: 2015-12-03
Publication date: 2019-03-06
Anticipated expiration: 2035-12-03
Also published as: JP2017102328A

Description

The present invention relates to a technique for selecting the sentences used to train a speech synthesis model, and in particular to a technique for selecting the sentences used when another speech synthesis model is trained, by model adaptation, from a conversion source speech synthesis model.

In corpus-based speech synthesis, a speech synthesis model is trained from speech uttering a training sentence set (corpus), and synthesized speech is generated with that model. In recent years, model adaptation has been proposed as one such technique (Non-Patent Document 1): using a small amount of speech data from an arbitrary speaker, a speech synthesis model for that speaker (the converted model) is generated from a previously trained speech synthesis model (the conversion source model), and the speaker's synthesized speech is generated from the converted model. A feature of model adaptation is that only a small amount of speech data is needed to synthesize the target speaker's voice.

To generate high-quality synthesized speech at the lowest possible cost with corpus-based speech synthesis, a training sentence set that is balanced in phonology, prosody, and so on must be prepared. Methods proposed for generating such a balanced set from a large collection of sentences prepared in advance (hereinafter, the population sentence set) include a method that builds the training sentence set so as to cover as much phonological information as possible, such as phonemes (monophones, diphones, triphones, etc.), syllables, and words (Patent Document 1), and a method that takes accent and similar features into account in addition to the utterance units (Non-Patent Document 2).

Patent Document 1: JP 2004-246140 A

Non-Patent Document 1: Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, "Speaker Adaptation of Pitch and Spectrum for HMM-Based Speech Synthesis", IEICE Transactions, IEICE, April 2002, Vol. J85-D-II, No. 4, pp. 545-553.
Non-Patent Document 2: Yusuke Arao, Takashi Nose, Tomoki Koriyama, Takahiro Shinozaki, Takao Kobayashi, "Evaluation of a Sentence Selection Algorithm Considering Phonetic and Prosodic Contexts for Speech Synthesis", Proceedings of the Acoustical Society of Japan 2014 Spring Meeting, Acoustical Society of Japan, March 2014, pp. 405-406.

With the methods described in these documents, the elements used for sentence selection, such as the phonemes and prosody to be covered by speech uttering the training sentence set (hereinafter, sentence selection elements), must be narrowed down manually in advance so that their number does not become enormous. For example, the method of Non-Patent Document 2 narrows down the sentence selection elements to be covered by manually grouping phonemes when phoneme sequences are considered. However, such manual narrowing may discard information that is used in training the speech synthesis model, so the resulting training sentence set is not necessarily optimal.

Accordingly, an object of the present invention is to provide a sentence set generation device that uses the conversion source model to generate, from the population sentence set, the training sentence set underlying the speech data used to train the converted model, and thereby produces a training sentence set that is balanced in the elements used for sentence selection, such as phonology and prosody.

One aspect of the present invention is a sentence set generation device that generates, as a subset of a population sentence set, a training sentence set used when a converted model, which is another speech synthesis model, is trained from a conversion source model, which is a speech synthesis model represented as a decision tree. Each internal node of the decision tree, that is, each node other than a leaf node, is assigned a question about a sentence selection element used for sentence selection. The device has a sentence selection unit that generates the training sentence set from a set of analyzed sentences recorded in an analyzed sentence recording unit, each analyzed sentence being a pair of a selection source sentence of the population sentence set and contexts expressed using the sentence selection elements. The sentence selection unit does so by using the decision tree and the contexts to decide whether to add the selection source sentence corresponding to each analyzed sentence to the training sentence set.

According to the present invention, generating the training sentence set from the population sentence set by using the conversion source model makes it possible to obtain a training sentence set that is balanced in the elements used for sentence selection, such as phonology and prosody.

FIG. 1 is a diagram showing an outline of the model adaptation technique.
FIG. 2 is a diagram showing an example of a conversion source model represented by decision trees.
FIG. 3 is a block diagram showing the configuration of the sentence set generation device 100 of Embodiment 1.
FIG. 4 is a flowchart showing the operation of the sentence set generation device 100 of Embodiment 1.
FIG. 5 is a diagram showing an example of the sentence selection information table of Embodiment 1.
FIG. 6 is a diagram showing an example of a context.
FIG. 7 is a diagram showing the configuration of the sentence selection unit 130 of Embodiment 1.
FIG. 8 is a flowchart showing the operation of the sentence selection unit 130 of Embodiment 1.
FIG. 9 is a block diagram showing the configuration of the sentence set generation device 300 of Embodiment 3.
FIG. 10 is a flowchart showing the operation of the sentence set generation device 300 of Embodiment 3.
FIG. 11 is a diagram showing an example of the sentence selection information table of Embodiment 3.
FIG. 12 is a diagram showing the configuration of the sentence selection unit 330 of Embodiment 3.
FIG. 13 is a flowchart showing the operation of the sentence selection unit 330 of Embodiment 3.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

First, model adaptation is outlined with reference to FIG. 1. Model adaptation involves a process of training the conversion source model from speech data for conversion source model training (conversion source model training process) and a process of converting the conversion source model into the converted model using speech data for converted model training (converted model training process). The speech data for converted model training is speech uttering the sentences of the training sentence set, and those sentences are selected from the population sentence set. The sentence set generation device of the present invention performs this selection, that is, it generates the training sentence set.

Next, several terms are defined. A speech synthesis model is a statistical model, built for each speech synthesis unit, of the speech features needed for synthesis (for example, pitch parameters such as the fundamental frequency (F0) and spectral parameters such as the cepstrum and mel-cepstrum) and of the duration. Examples of speech synthesis units are phonemes, syllables, and words.

Among speech synthesis models, the present invention deals with models that have a decision tree structure. A representative example is HMM (Hidden Markov Model) speech synthesis (Reference Non-Patent Document 3). In HMM speech synthesis, the speech synthesis model is represented by a decision tree for each parameter. As an example, FIG. 2 shows the binary decision trees of an HMM speech synthesis model that uses three parameters: the spectral parameter, the pitch parameter, and the duration. The information used to represent the decision trees is the information that was used to train the conversion source model, and it constitutes the sentence selection elements used for sentence selection. In other words, a decision tree is a tree expressed using sentence selection elements. Specifically, each internal node (each node other than a leaf node) is assigned a question about a sentence selection element, and the traversal proceeds to a node at the next depth (an internal node or a leaf node) according to whether the answer is YES (applies) or NO (does not apply).
(Reference Non-Patent Document 3) Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai, "HMM-Based Speech Synthesis Using Dynamic Features", IEICE Transactions, IEICE, December 1996, Vol. J79-D-II, No. 12, pp. 2184-2190.
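The YES/NO traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `find_leaf` function, the dictionary form of a context, and the two questions in the toy tree are all hypothetical and are not taken from FIG. 2.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical context format: sentence selection element name -> value.
Context = dict


@dataclass
class Node:
    # An internal node carries a yes/no question; a leaf node carries a leaf id.
    question: Optional[Callable[[Context], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf_id: Optional[int] = None


def find_leaf(root: Node, context: Context) -> int:
    """Follow YES/NO answers from the root until a leaf node is reached."""
    node = root
    while node.leaf_id is None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_id


# Toy tree with two invented questions (vowel? high accent?).
tree = Node(
    question=lambda c: c["phoneme"] in {"a", "i", "u", "e", "o"},
    yes=Node(
        question=lambda c: c["accent"] == "high",
        yes=Node(leaf_id=1),
        no=Node(leaf_id=2),
    ),
    no=Node(leaf_id=3),
)

leaf_for_a = find_leaf(tree, {"phoneme": "a", "accent": "high"})
leaf_for_k = find_leaf(tree, {"phoneme": "k", "accent": "low"})
```

In an actual HMM speech synthesis model there would be one such tree per parameter (spectrum, pitch, duration), each with its own set of questions.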

The sentence set generation device 100 of Embodiment 1 is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the sentence set generation device 100 of Embodiment 1, and FIG. 4 is a flowchart showing its operation. As shown in FIG. 3, the sentence set generation device 100 includes a sentence selection information table creation unit 110, a sentence selection information table recording unit 103, a text analysis unit 120, an analyzed sentence recording unit 104, and a sentence selection unit 130. The sentence set generation device 100 is connected to a conversion source model recording unit 101 and a selection source sentence recording unit 102.

The conversion source model recording unit 101 stores the conversion source model trained in the conversion source model training process of model adaptation. The selection source sentence recording unit 102 stores the selection source sentences, which are the elements of the population sentence set.

The sentence selection information table creation unit 110 creates sentence selection information tables from the conversion source model represented by decision trees and initializes them (S110). One table is created for the decision tree of each parameter. For example, for a speech synthesis model that uses three parameters (the spectral parameter, the pitch parameter, and the duration), three sentence selection information tables (see FIG. 5) are created from the three decision trees (see FIG. 2). Let N_sp, N_F0, and N_dur be the numbers of leaf nodes in the decision trees for the spectral parameter, the pitch parameter, and the duration, and let the leaf nodes be identified by 1 to N_sp, 1 to N_F0, and 1 to N_dur. The sentence selection information tables for the three parameters then hold the selection frequencies of leaf nodes 1 to N_sp, 1 to N_F0, and 1 to N_dur, respectively. The selection frequency is the number of times a leaf node has been selected while the selection source sentences are processed; details are given in the description of the sentence selection unit 130. The sentence selection information table creation unit 110 initializes the selection frequency of every leaf node in every table to 0.
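As a rough illustration of step S110, the tables can be modelled as one dictionary per parameter that maps a leaf node id to its selection frequency. The function name `create_selection_tables`, the parameter keys, and the leaf counts below are assumptions made for this sketch; the layout of FIG. 5 may differ.

```python
# Hypothetical leaf counts N_sp, N_F0, N_dur for the three decision trees.
leaf_counts = {"sp": 4, "F0": 3, "dur": 2}


def create_selection_tables(leaf_counts):
    """Return one table per parameter: leaf id (1..N) -> selection frequency 0."""
    return {
        param: {leaf_id: 0 for leaf_id in range(1, n_leaves + 1)}
        for param, n_leaves in leaf_counts.items()
    }


tables = create_selection_tables(leaf_counts)
```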

The sentence selection information tables are recorded in the sentence selection information table recording unit 103, which may be either a temporary or a permanent recording medium, such as RAM or a hard disk.

For each selection source sentence in the population sentence set, the text analysis unit 120 first attaches information such as the reading and accent, and then generates an analyzed sentence, that is, the sentence with a context attached for each speech synthesis unit, where a context is expressed using sentence selection elements such as the reading, accent, and morphemes (S120). In other words, an analyzed sentence is a pair of a selection source sentence and one or more contexts. The context is the information that the sentence selection unit 130 needs for its processing and is used to traverse the decision trees. A context can be expressed, for example, with the context label of FIG. 6 (Reference Non-Patent Document 4).
(Reference Non-Patent Document 4) Mototaka Yoshioka, Masatsune Tamura, Takashi Masuko, Takao Kobayashi, Keiichi Tokuda, "A Study of Prosodic Variation Factors in HMM Speech Synthesis", IEICE Technical Report, SP (Speech), IEICE, October 2001, Vol. 101, No. 352, pp. 51-56.

Therefore, the types of sentence selection elements used to express the contexts (reading, accent, morphemes, etc.) must be the same as those used to represent the decision trees. For example, if readings and accents were used to train the conversion source model, the contexts obtained by the text analysis unit 120 need contain only readings and accents.

The analyzed sentences are recorded in the analyzed sentence recording unit 104. Like the sentence selection information table recording unit 103, the analyzed sentence recording unit 104 may be either a temporary or a permanent recording medium, such as RAM or a hard disk.

The contexts are assumed to be attached by some text analysis method. Instead of running text analysis, they may be attached manually in advance, in which case the text analysis unit 120 is unnecessary. The sentence set generation device 100 then has no text analysis unit 120 and, instead of being connected to the selection source sentence recording unit 102, is connected to an analyzed sentence recording unit 104 that holds the analyzed sentences obtained by analyzing the selection source sentences of the population sentence set. In this configuration the analyzed sentence recording unit 104 is implemented with a permanent recording medium.

The sentence selection unit 130 uses the decision trees of the conversion source model to generate the training sentence set from the set of analyzed sentences (S130). Its processing is described in detail below with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the sentence selection unit 130, and FIG. 8 is a flowchart showing its operation. As shown in FIG. 7, the sentence selection unit 130 includes a leaf node determination unit 131, a score calculation unit 132, a highest-score sentence selection unit 133, a sentence selection information table update unit 134, and an end determination unit 135.

Let S be the number of selection source sentences in the population sentence set, and denote them B_1, B_2, ..., B_S. Denote by A_i the analyzed sentence that results when the text analysis unit 120 processes the selection source sentence B_i (i = 1, 2, ..., S). Let k(i) be the number of contexts (that is, the number of speech synthesis units) in the analyzed sentence A_i, and denote the contexts C_j (j = 1, ..., k(i)); then A_i = (B_i, C_1, ..., C_k(i)). Each context C_j contains one piece of information for each type of sentence selection element. For example, the context label of FIG. 6 contains three types of sentence selection elements: phoneme, morpheme, and accent phrase.
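The pair A_i = (B_i, C_1, ..., C_k(i)) maps naturally onto a small record type. The sketch below is illustrative only: the class name `AnalyzedSentence`, the dictionary form of a context, and the toy values are assumptions, and a real context label such as the one in FIG. 6 carries far more fields.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical context: one value per type of sentence selection element.
Context = Dict[str, str]


@dataclass
class AnalyzedSentence:
    source: str              # selection source sentence B_i
    contexts: List[Context]  # contexts C_1 ... C_k(i), one per synthesis unit


# Toy analyzed sentence with phoneme-level units and invented accent values.
a_1 = AnalyzedSentence(
    source="aka",
    contexts=[
        {"phoneme": "a", "accent": "low"},
        {"phoneme": "k", "accent": "high"},
        {"phoneme": "a", "accent": "high"},
    ],
)

k_1 = len(a_1.contexts)  # k(1), the number of speech synthesis units
```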

For each context C_j (j = 1, ..., k(i)) in the analyzed sentence A_i corresponding to a selection source sentence B_i of the population sentence set, the leaf node determination unit 131 traverses the decision tree of each parameter and determines the leaf node corresponding to C_j (S131). Here i = 1, 2, ..., S, excluding sentences that are already elements of the training sentence set; the set of such sentences is hereinafter called the generated training sentence set. The unit also updates the sentence selection information table by incrementing the selection frequency of that leaf node by 1. Traversing a decision tree means answering YES or NO to the question assigned to each internal node, using the sentence selection elements that express the context, and proceeding from the root toward a leaf node. The sequence of nodes visited from the root to the leaf node is called a path, and the internal nodes and the leaf node on the path are called the covered internal nodes and the covered leaf node.
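Step S131 can be sketched as follows. To keep the example short, a tree is written as a nested tuple `(question, yes_branch, no_branch)` with an integer leaf id at each leaf; this encoding, the function names, and the toy questions are assumptions for illustration. The function returns updated copies of the tables rather than modifying them in place, which is one convenient way to keep the pre-update tables available for the later restore step.

```python
from collections import Counter

VOWELS = {"a", "i", "u", "e", "o"}


def find_leaf(tree, context):
    """Walk a (question, yes, no) tuple tree until an integer leaf id is reached."""
    while not isinstance(tree, int):
        question, yes_branch, no_branch = tree
        tree = yes_branch if question(context) else no_branch
    return tree


def count_leaves(trees, contexts, tables):
    """Return copies of the tables with one increment per context and parameter."""
    updated = {param: Counter(table) for param, table in tables.items()}
    for context in contexts:
        for param, tree in trees.items():
            updated[param][find_leaf(tree, context)] += 1
    return updated


def is_vowel(c):
    return c["phoneme"] in VOWELS


def is_high(c):
    return c["accent"] == "high"


# Two toy trees (hypothetical questions) and their zero-initialised tables.
trees = {"sp": (is_vowel, 1, 2), "F0": (is_high, (is_vowel, 1, 2), 3)}
tables = {"sp": {1: 0, 2: 0}, "F0": {1: 0, 2: 0, 3: 0}}
contexts = [{"phoneme": "a", "accent": "low"}, {"phoneme": "k", "accent": "high"}]

updated = count_leaves(trees, contexts, tables)
```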

The score calculation unit 132 calculates the score SC_i using the sentence selection information table updated in S131, that is, the table corresponding to the set obtained by adding the selection source sentence B_i to the generated training sentence set (S132). The generated training sentence set is assumed to have been initialized to the empty set before the first execution of S131.

Representative scores for sentence selection are the coverage rate (Patent Document 1) and entropy (Non-Patent Document 2). The following description uses the coverage rate as the score. In sentence selection, the coverage rate is an index of how much of the sentence selection elements (such as phoneme environments) is covered by the one or more sentences selected so far. In this embodiment the coverage rate is calculated from the leaf nodes of the decision trees, using the following formula.
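The formula itself is not reproduced in this text. Judging from the symbol definitions in the next paragraph and from the statement that the score is the proportion of leaf nodes of the three decision trees that are selected, it presumably has the following form; this is a reconstruction under that assumption, not a copy of the original expression.

```latex
SC_i = \frac{\displaystyle\sum_{k=1}^{N_{sp}} o_{sp}(k)
           + \sum_{k=1}^{N_{F0}} o_{F0}(k)
           + \sum_{k=1}^{N_{dur}} o_{dur}(k)}
            {N_{sp} + N_{F0} + N_{dur}}
```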

Here, N_sp, N_F0, and N_dur are the numbers of leaf nodes in the decision trees for the spectral parameter, the pitch parameter, and the duration. The functions o_sp(k), o_F0(k), and o_dur(k) indicate whether the k-th leaf node of the respective decision tree is covered (selected): each outputs 1 if the selection frequency of that leaf node is 1 or more, and 0 if it is 0. The score therefore expresses the proportion of leaf nodes in the three decision trees that are selected by the sentences in the training sentence set (during the repeated processing of S131 and S132, the set obtained by adding the selection source sentence B_i to the generated training sentence set).
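A minimal sketch of the coverage calculation, assuming the score pools the leaf nodes of all three trees as the paragraph above describes. The function name `coverage` and the table contents are hypothetical.

```python
def coverage(tables):
    """Fraction of leaf nodes, pooled over all trees, with frequency >= 1."""
    covered = sum(
        1 for table in tables.values() for freq in table.values() if freq >= 1
    )
    total = sum(len(table) for table in tables.values())
    return covered / total


# Hypothetical selection frequencies: leaf id -> number of times selected.
tables = {
    "sp": {1: 2, 2: 0, 3: 1, 4: 0},
    "F0": {1: 0, 2: 1, 3: 1},
    "dur": {1: 5, 2: 0, 3: 0},
}

score = coverage(tables)  # 5 covered leaves out of 10
```

Note that a leaf selected five times counts exactly as much as a leaf selected once: only the o(k) indicator enters the score.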

After the score is calculated, the sentence selection information table updated in S131 is restored to its state before the update. However, the updated table information is used later by the sentence selection information table update unit 134, so before the table is restored that information is saved, for example by copying it, so that the update unit 134 can use it afterwards. The processing of the leaf node determination unit 131 (S131) and of the score calculation unit 132 (S132) is executed for each of the selection source sentences B_1, B_2, ..., B_S of the population sentence set, excluding those already in the generated training sentence set.

From the scores SC_1, SC_2, ..., SC_S calculated by the score calculation unit 132 (excluding those corresponding to sentences in the generated training sentence set), the highest-score sentence selection unit 133 selects the selection source sentence B_k corresponding to the highest score and adds it to the generated training sentence set (S133).

The sentence selection information table update unit 134 updates the sentence selection information table with the table information that the leaf node determination unit 131 generated when it processed the selection source sentence B_k (S134).

The end determination unit 135 determines whether an end condition is satisfied and, if it is, outputs the generated training sentence set as the training sentence set (S135); otherwise the processing from S131 is repeated. Examples of end conditions are that the number of selected sentences has reached a predetermined value (for example, the cardinality of the generated training sentence set exceeds S/2) or that the highest score calculated in S133 has reached a predetermined value (for example, the highest score exceeds 0.8). A combination of these may also be used. If the highest score never reaches the predetermined value, all selection source sentences may end up being added to the generated training sentence set.
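Steps S131 to S135 together form a greedy selection loop, sketched below. Several simplifications are assumed: the leaf nodes reached by each sentence are taken as precomputed (`leaf_hits`), the table is reduced to the set of covered leaves because the coverage score depends only on whether a frequency is at least 1, and ties are broken by the lowest index. The function and argument names are hypothetical.

```python
def select_sentences(leaf_hits, leaf_counts, max_sentences, target_score):
    """Greedily add the sentence that yields the highest coverage score.

    leaf_hits[i] lists the (parameter, leaf id) pairs reached by sentence i.
    leaf_counts maps each parameter to its number of leaf nodes.
    """
    total_leaves = sum(leaf_counts.values())
    covered = set()    # leaves whose selection frequency is >= 1
    selected = []      # generated training sentence set (sentence indices)
    remaining = set(range(len(leaf_hits)))
    while remaining:
        # S131/S132: tentative score for each sentence not yet selected.
        scores = {
            i: len(covered | set(leaf_hits[i])) / total_leaves for i in remaining
        }
        # S133: highest score wins; ties go to the lowest index (assumption).
        best = min(remaining, key=lambda i: (-scores[i], i))
        selected.append(best)
        remaining.remove(best)
        # S134: commit the winning sentence's updates.
        covered |= set(leaf_hits[best])
        # S135: stop on either end condition.
        if len(selected) >= max_sentences or scores[best] >= target_score:
            break
    return selected, len(covered) / total_leaves


leaf_counts = {"sp": 3, "F0": 2}
leaf_hits = [
    [("sp", 1), ("F0", 1)],
    [("sp", 1), ("sp", 2), ("F0", 1)],
    [("sp", 3), ("F0", 2)],
]
selected, final_score = select_sentences(
    leaf_hits, leaf_counts, max_sentences=3, target_score=0.9
)
```

In this toy run sentence 1 is chosen first because it covers three of the five leaves, and sentence 2 is chosen next because it covers the remaining two.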

Because the decision trees of the conversion source model are used, the sentence selection elements (phonemes, accents, etc.) that serve as the criteria for generating the training sentence set are set dynamically in the course of model adaptation, which makes it possible to generate a training sentence set that is balanced in those elements. As a result, the quality of the synthesized speech is higher than that of synthesized speech generated from speech data uttering a sentence set with a comparable amount of text.

In Embodiment 1, the leaf nodes of all parameters are treated equally when the score is calculated, so sentences cannot be selected with emphasis on a particular parameter. Embodiment 2 therefore weights the parameters when calculating the score. This makes it possible to generate a training sentence set in which a particular parameter is emphasized as a selection criterion.

The following describes the calculation formula of the score calculation unit 132, which is where Embodiment 2 differs from Embodiment 1. The score calculation unit 132 calculates a sub-score for each parameter as in the following formula and calculates the score SC_i as their weighted sum (S132).
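The formula is not reproduced in this text. Assuming each sub-score is the coverage rate of one parameter's decision tree, normalized by that tree's leaf count, the weighted sum presumably looks like the following; the per-tree normalization is an assumption of this reconstruction.

```latex
SC_i = w_{sp}\,\frac{1}{N_{sp}}\sum_{k=1}^{N_{sp}} o_{sp}(k)
     + w_{F0}\,\frac{1}{N_{F0}}\sum_{k=1}^{N_{F0}} o_{F0}(k)
     + w_{dur}\,\frac{1}{N_{dur}}\sum_{k=1}^{N_{dur}} o_{dur}(k)
```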

Here, w_sp, w_F0, and w_dur are the weights of the spectral parameter, the pitch parameter, and the duration, respectively.

Adjusting these weights makes it possible to select sentences with emphasis on a particular parameter.
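A sketch of the weighted score, assuming each sub-score is the coverage rate of a single parameter's tree. The function name `weighted_score`, the weights, and the table contents are hypothetical.

```python
def weighted_score(tables, weights):
    """Weighted sum of per-parameter coverage rates (sub-scores)."""
    score = 0.0
    for param, table in tables.items():
        sub_score = sum(1 for freq in table.values() if freq >= 1) / len(table)
        score += weights[param] * sub_score
    return score


tables = {
    "sp": {1: 2, 2: 0, 3: 1, 4: 0},  # sub-score 2/4
    "F0": {1: 0, 2: 1},              # sub-score 1/2
    "dur": {1: 5, 2: 3},             # sub-score 2/2
}
weights = {"sp": 0.5, "F0": 0.25, "dur": 0.25}  # emphasise the spectrum

score = weighted_score(tables, weights)
```

With these values the score is 0.5 * 0.5 + 0.25 * 0.5 + 0.25 * 1.0 = 0.625; raising `weights["sp"]` makes sentences that cover new spectral leaves more attractive during selection.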

Embodiments 1 and 2 use only leaf node information when calculating the score. Because relatively more leaf nodes lie at deep positions in the tree, selecting sentences with leaf nodes alone can bias the training sentence set, and as a result appropriate sentences may not be selected in a balanced way. Embodiment 3 therefore uses internal nodes as well as leaf nodes in the score calculation, which allows sentences to be selected with the overall balance taken into account.

The sentence set generation device 300 of Embodiment 3 is described below with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the sentence set generation device 300 of Embodiment 3, and FIG. 10 is a flowchart showing its operation. As shown in FIG. 9, the sentence set generation device 300 includes a sentence selection information table creation unit 310, a sentence selection information table recording unit 103, a text analysis unit 120, and a sentence selection unit 330. The sentence set generation device 300 is connected to the conversion source model recording unit 101 and the selection source sentence recording unit 102.

The differences from Embodiment 1 are the sentence selection information table creation unit 310 and the sentence selection unit 330.

The sentence selection information table creation unit 310 creates sentence selection information tables from the conversion source model, which is represented as decision trees, and initializes them (S310). FIG. 11 shows the sentence selection information tables created here. As in Embodiment 1, a table is created for each of the three parameters (spectrum parameter, pitch parameter, and duration); the difference is that records are created for each depth of the decision tree rather than for the leaf nodes. Let D_sp be the depth of the spectrum-parameter decision tree, and let N_{sp,1}, N_{sp,2}, …, N_{sp,D_sp} be the numbers of nodes at depths 1, 2, …, D_sp. If the nodes at depth 1 are identified by 1 to N_{sp,1}, the nodes at depth 2 by 1 to N_{sp,2}, …, and the nodes at depth D_sp by 1 to N_{sp,D_sp}, then the sentence selection information table for the spectrum parameter holds the selection frequencies of nodes 1 to N_{sp,1} at depth 1, nodes 1 to N_{sp,2} at depth 2, …, and nodes 1 to N_{sp,D_sp} at depth D_sp. The same applies to the sentence selection information tables for the pitch parameter and the duration. For the pitch parameter, the depth of the decision tree is D_F0 and the numbers of nodes at depths 1, 2, …, D_F0 are N_{F0,1}, N_{F0,2}, …, N_{F0,D_F0}. For the duration, the depth of the decision tree is D_dur and the numbers of nodes at depths 1, 2, …, D_dur are N_{dur,1}, N_{dur,2}, …, N_{dur,D_dur}. As in Embodiment 1, the sentence selection information table creation unit 310 initializes the selection frequency of every node in every table to 0.
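As a minimal sketch of the table layout described above, the following Python builds one depth-indexed table per parameter and sets every selection frequency to 0. The `Node` class, the function name `create_table`, the toy tree, and the convention that the root is the single node at depth 1 are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary decision-tree node; a leaf has no children."""
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def create_table(root: Node) -> list[list[int]]:
    """Return table[d-1][k-1] = selection frequency of node k at depth d, all 0."""
    table = []
    level = [root]  # assumption: the root is the single node at depth 1
    while level:
        table.append([0] * len(level))  # one record per node at this depth
        level = [c for n in level for c in (n.yes, n.no) if c is not None]
    return table


# Toy tree: the root asks one question and its "yes" child asks another.
toy_root = Node(yes=Node(yes=Node(), no=Node()), no=Node())
# One table per parameter (the same toy tree is reused for brevity).
tables = {name: create_table(toy_root) for name in ("sp", "F0", "dur")}
print(tables["sp"])  # [[0], [0, 0], [0, 0]]
```

Here `len(table)` plays the role of D_sp and `len(table[d-1])` the role of N_{sp,d}.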

The sentence selection unit 330 generates the learning sentence set from the set of analyzed sentences using the decision tree of the conversion source model (S330). Its processing is described in detail below with reference to FIGS. 12 and 13. FIG. 12 is a block diagram showing the configuration of the sentence selection unit 330, and FIG. 13 is a flowchart showing its operation. As shown in FIG. 12, the sentence selection unit 330 includes a node determination unit 331, a score calculation unit 332, a highest-score sentence selection unit 133, a sentence selection information table update unit 134, and an end determination unit 135.

For each context C_j (j = 1, …, k(i)) contained in the analyzed sentence A_i corresponding to a selection source sentence B_i of the population sentence set (i = 1, 2, …, S, excluding sentences already in the generated learning sentence set), the node determination unit 331 follows the decision tree of each parameter and determines the internal nodes and the leaf node on the path from the root to the leaf node (S331). It also updates the sentence selection information table by incrementing the selection frequency of each of those nodes by 1.
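The path tracing and frequency update of S331 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Node` fields, the representation of a context as a `dict`, the example questions, and the use of `(depth, index)` keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Context = dict  # hypothetical: a context maps sentence selection elements to values


@dataclass
class Node:
    depth: int                      # 1 for the root (assumed convention)
    index: int                      # 1..N_d within its depth
    question: Optional[Callable[[Context], bool]] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def trace_path(root: Node, context: Context) -> list[tuple[int, int]]:
    """Return (depth, index) for every node from the root down to the leaf."""
    path, node = [], root
    while node is not None:
        path.append((node.depth, node.index))
        if node.question is None:   # reached a leaf
            break
        node = node.yes if node.question(context) else node.no
    return path


def update_table(table: dict, root: Node, contexts: list[Context]) -> None:
    """Increment the selection frequency of each node on each context's path."""
    for c in contexts:
        for key in trace_path(root, c):
            table[key] = table.get(key, 0) + 1


leaf_a, leaf_b, leaf_c = Node(3, 1), Node(3, 2), Node(2, 2)
inner = Node(2, 1, lambda c: c["accented"], leaf_a, leaf_b)
root = Node(1, 1, lambda c: c["phoneme"] == "a", inner, leaf_c)

table = {}
update_table(table, root, [{"phoneme": "a", "accented": True},
                           {"phoneme": "k", "accented": False}])
print(table)  # {(1, 1): 2, (2, 1): 1, (3, 1): 1, (2, 2): 1}
```

Both contexts pass through the root, so its frequency is 2, while each remaining node on the two paths is counted once.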

The score calculation unit 332 calculates the score SC_i using the sentence selection information table updated in S331 (that is, the table corresponding to the set obtained by adding the selection source sentence B_i to the generated learning sentence set) (S332). As in Embodiment 1, the generated learning sentence set is initialized to the empty set before the first execution of S331.

The score SC_i is calculated as a weighted sum of sub-scores that use the internal nodes and leaf nodes of each decision tree.

Here, SC_{sp,i}, SC_{F0,i}, and SC_{dur,i} are the sub-scores for the spectrum parameter, the pitch parameter, and the duration, respectively. N_{sp,d}, N_{F0,d}, and N_{dur,d} are the numbers of nodes at depth d of the respective decision trees. The functions o_{sp,d}(k), o_{F0,d}(k), and o_{dur,d}(k) indicate whether the k-th node at depth d of the respective decision tree is covered (selected): each outputs 1 if the selection frequency of the node is 1 or more, and 0 if it is 0. w_{sp,d}, w_{F0,d}, and w_{dur,d} are the weights for depth d of the respective decision trees. The sub-scores thus indicate what proportion of the nodes of each parameter's decision tree has been selected. Finally, SC_i is calculated from these sub-scores using w_sp, w_F0, and w_dur as the weights of the respective parameters.
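The equations to which these definitions refer are not reproduced in this text. One form consistent with the definitions above, offered as a reconstruction rather than as the original formulas, treats each sub-score as a depth-weighted coverage rate. The summation limits (d from 1 to the tree depth, k from 1 to the number of nodes at depth d) and the normalization by the node count are assumptions drawn from the table definitions and from the statement that the sub-scores express a proportion of selected nodes.

```latex
\begin{align}
SC_{sp,i}  &= \sum_{d=1}^{D_{sp}}  w_{sp,d}\,  \frac{1}{N_{sp,d}}  \sum_{k=1}^{N_{sp,d}}  o_{sp,d}(k) \\
SC_{F0,i}  &= \sum_{d=1}^{D_{F0}}  w_{F0,d}\,  \frac{1}{N_{F0,d}}  \sum_{k=1}^{N_{F0,d}}  o_{F0,d}(k) \\
SC_{dur,i} &= \sum_{d=1}^{D_{dur}} w_{dur,d}\, \frac{1}{N_{dur,d}} \sum_{k=1}^{N_{dur,d}} o_{dur,d}(k) \\
SC_i       &= w_{sp}\, SC_{sp,i} + w_{F0}\, SC_{F0,i} + w_{dur}\, SC_{dur,i}
\end{align}
```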

Using a coverage rate that includes the internal nodes as the score makes it possible to select sentences while taking the selection frequency of the internal nodes into account. This avoids selecting sentences that are biased toward only some of the subtrees. As a result, the elements used for sentence selection are chosen in a balanced manner, and the quality of the synthesized speech improves over Embodiment 1.
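The effect of scoring internal nodes can be illustrated with a small greedy selection sketch. Everything here is an assumption made for illustration: the coverage-rate form of the sub-score, the 0-based `(parameter, depth, index)` triples that stand for the nodes each candidate sentence covers, the toy weights, and the end condition (a maximum number of sentences), which this passage does not specify.

```python
def sub_score(freq, weights):
    """Depth-weighted coverage: sum over depths of w_d * covered_d / N_d.

    freq[d][k] is the selection frequency of node k at depth d (0-based).
    """
    return sum(w * sum(1 for f in row if f >= 1) / len(row)
               for w, row in zip(weights, freq))


def score(tables, depth_weights, param_weights):
    """Weighted sum of the per-parameter sub-scores."""
    return sum(param_weights[p] * sub_score(tables[p], depth_weights[p])
               for p in tables)


def with_sentence(tables, covered):
    """Copy of the tables with the frequency of each covered node incremented."""
    new = {p: [row[:] for row in rows] for p, rows in tables.items()}
    for p, d, k in covered:
        new[p][d][k] += 1
    return new


def select_sentences(candidates, tables, depth_weights, param_weights, max_sentences):
    """Greedy loop: repeatedly add the candidate whose trial score is highest."""
    selected, remaining = [], dict(candidates)
    while remaining and len(selected) < max_sentences:  # assumed end condition
        best = max(remaining, key=lambda s: score(
            with_sentence(tables, remaining[s]), depth_weights, param_weights))
        tables = with_sentence(tables, remaining.pop(best))
        selected.append(best)
    return selected


# One parameter ("sp") whose toy tree has 1, 2, and 2 nodes at its three depths.
tables = {"sp": [[0], [0, 0], [0, 0]]}
candidates = {
    "s1": [("sp", 0, 0), ("sp", 1, 0), ("sp", 2, 0)],
    "s2": [("sp", 0, 0), ("sp", 1, 0), ("sp", 2, 1)],  # same subtree as s1
    "s3": [("sp", 0, 0), ("sp", 1, 1)],                # the other subtree
}
picked = select_sentences(candidates, tables, {"sp": [1, 2, 1]}, {"sp": 1}, 2)
print(picked)  # ['s1', 's3']
```

After "s1" is chosen, "s3" scores higher than "s2" because it covers a depth-2 node in the subtree that has not yet been visited, which is the balancing behavior the paragraph above describes.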

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include a cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to implement the functions described above and the data required for the processing of that program. (The program need not be stored in the external storage device; it may, for example, be stored in ROM, which is a read-only storage device.) Data obtained through the processing of the program is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data required for the processing of that program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU implements the prescribed functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments may be executed not only in time series in the order described, but also in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are implemented by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on the computer implements the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As other modes of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are implemented solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a prescribed program on a computer; however, at least part of the processing content may instead be implemented in hardware.

Claims

A sentence set generation device that generates, as a subset of a population sentence set, a learning sentence set used when learning a converted model, which is another speech synthesis model, from a conversion source model, which is a speech synthesis model represented as a decision tree, wherein
each internal node of the decision tree, other than the leaf nodes, is assigned a question concerning a sentence selection element used for sentence selection,
the device comprising a sentence selection unit that generates the learning sentence set from a set of analyzed sentences recorded in an analyzed sentence recording unit, each analyzed sentence being a pair of a selection source sentence of the population sentence set and contexts expressed using the sentence selection elements, by determining, using the decision tree and the contexts, whether to add the selection source sentence corresponding to the analyzed sentence to the learning sentence set.

The sentence set generation device according to claim 1, wherein
the sentence selection unit comprises:
a leaf node determination unit that, for the analyzed sentence, determines the leaf nodes covered by following the decision tree for each context of the analyzed sentence;
a score calculation unit that calculates a score of the analyzed sentence using the number of leaf nodes covered by the sentences of the generated learning sentence set and the analyzed sentence;
a highest-score sentence selection unit that updates the generated learning sentence set by adding to it the selection source sentence corresponding to the analyzed sentence with the highest score; and
an end determination unit that outputs the generated learning sentence set as the learning sentence set when an end condition is satisfied.

The sentence set generation device according to claim 1, wherein
the sentence selection unit comprises:
a node determination unit that, for the analyzed sentence, determines the internal nodes and leaf nodes covered by following the decision tree for each context of the analyzed sentence;
a score calculation unit that calculates a score of the analyzed sentence using a number calculated based on the internal nodes and leaf nodes covered by the sentences of the generated learning sentence set and the analyzed sentence;
a highest-score sentence selection unit that updates the generated learning sentence set by adding to it the selection source sentence corresponding to the analyzed sentence with the highest score; and
an end determination unit that outputs the generated learning sentence set as the learning sentence set when an end condition is satisfied.

A sentence set generation method in which a sentence set generation device generates, as a subset of a population sentence set, a learning sentence set used when learning a converted model, which is another speech synthesis model, from a conversion source model, which is a speech synthesis model represented as a decision tree, wherein
each internal node of the decision tree, other than the leaf nodes, is assigned a question concerning a sentence selection element used for sentence selection,
the method executing a sentence selection step of generating the learning sentence set from a set of analyzed sentences recorded in an analyzed sentence recording unit of the sentence set generation device, each analyzed sentence being a pair of a selection source sentence of the population sentence set and contexts expressed using the sentence selection elements, by determining, using the decision tree and the contexts, whether to add the selection source sentence corresponding to the analyzed sentence to the learning sentence set.

A program for causing a computer to function as the sentence set generation device according to any one of claims 1 to 3.