JP2021012351A - Voice synthesis processing device, voice synthesis processing method, and program

Info

Publication number: JP2021012351A
Application number: JP2019200440A
Authority: JP
Inventors: Takuma Okamoto; Tomoki Toda; Yoshinori Shiga; Hisashi Kawai
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2019-07-05
Filing date: 2019-11-05
Publication date: 2021-02-04
Anticipated expiration: 2039-11-05
Also published as: JP7432199B2

Abstract

PROBLEM TO BE SOLVED: To realize a speech synthesis processing device that achieves high-quality speech synthesis by training and optimizing a neural network model for text-to-speech synthesis using a sequence-to-sequence method, where the processing target language can be any language.
SOLUTION: A speech synthesis processing device 100 performs text analysis processing according to the language to be processed, acquires, from the full context label data obtained by the text analysis processing, optimized full context label data suitable for processing with a neural network model, and performs processing using the acquired optimized full context label data, so that highly accurate speech synthesis processing can be performed for any processing target language.
SELECTED DRAWING: Figure 1

Description

The present invention relates to speech synthesis processing technology, and in particular to text-to-speech (TTS) technology for converting text into speech.

In text-to-speech (TTS) technology, which synthesizes natural speech from text, the introduction of neural networks has in recent years made high-quality speech synthesis possible. When a system using such technology synthesizes English speech, it estimates a mel spectrogram from English text by text-to-speech synthesis using a sequence-to-sequence method, which learns and optimizes phoneme durations and the acoustic model simultaneously, and then obtains a speech waveform from the estimated mel spectrogram with a neural vocoder. With this processing, when the processing target language is English, such a system can synthesize speech of a quality equivalent to human speech (see, for example, Non-Patent Document 1).

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," Proc. ICASSP, Apr. 2018, pp. 4779-4783.

However, it is difficult to apply text-to-speech synthesis using the above sequence-to-sequence method to Japanese. Japanese uses kanji, the number of kanji is enormous, and a kanji can have many different readings. It is therefore difficult to feed Japanese text directly into a sequence-to-sequence text-to-speech model and to train and optimize that model in the same way as when the processing language is English.

In view of the above problem, an object of the present invention is to provide a speech synthesis processing device, a speech synthesis processing method, and a program that achieve high-quality speech synthesis by training and optimizing a neural network model for text-to-speech synthesis using a sequence-to-sequence method, even when the processing target language is a language other than English, such as Japanese (that is, the processing target language can be any language).

The first invention for solving the above problem is a speech synthesis processing device that takes an arbitrary language as the processing target language and executes speech synthesis processing using an encoder-decoder neural network. The device includes a text analysis unit, a full context label vector processing unit, an encoder unit, and a decoder unit.

The text analysis unit executes text analysis processing on text data in the processing target language and acquires context label data.

The full context label vector processing unit acquires, from the context label data acquired by the text analysis unit, the context label for the single phoneme, that is, the phoneme that was the processing target when the context label data was acquired, and thereby acquires optimized full context label data suitable for the training processing of the neural network.

The encoder unit acquires hidden state data by executing the encoding processing of the neural network based on the optimized full context label data.

The decoder unit acquires acoustic feature data corresponding to the optimized full context label data by executing the decoding processing of the neural network based on the hidden state data.

The vocoder acquires speech waveform data from the acoustic features acquired by the decoder unit.

This speech synthesis processing device executes neural network processing (training and prediction) using optimized full context label data suited to processing by the neural network model, and can therefore perform highly accurate speech synthesis. Specifically, unlike the prior art, the device acquires, as the optimized full context label data, context label data that does not include data about the phonemes preceding or following the phoneme being processed, and uses this data in the processing of the neural network model. A neural network (in particular, a sequence-to-sequence neural network) processes time-series data, so the data about preceding and following phonemes, which had to be included in the context label data used for conventional speech synthesis, becomes redundant in the processing of the neural network model and lowers processing efficiency. Because the speech synthesis processing device 100 uses the optimized full context label data (context label data composed of data about single phonemes), the processing of the neural network model can be executed very effectively. As a result, the device can perform highly accurate speech synthesis.
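As a rough illustration of the reduction described above, the following sketch drops the fields that describe neighboring phonemes from each full context label and keeps only the fields for the phoneme being processed. The field names (`prev_phoneme`, `next_phoneme`, `accent_type`, and so on) and the helpers `optimize_label` and `optimize_sequence` are hypothetical and are not taken from the patent, whose actual label parameters are given in its drawings.

```python
# Hypothetical names for the neighbor-related fields of a full context label.
NEIGHBOR_FIELDS = {
    "prev_prev_phoneme", "prev_phoneme", "next_phoneme", "next_next_phoneme",
}


def optimize_label(full_label: dict) -> dict:
    """Keep only the fields that describe the phoneme being processed."""
    return {k: v for k, v in full_label.items() if k not in NEIGHBOR_FIELDS}


def optimize_sequence(full_labels: list) -> list:
    """Apply the reduction to every label; the sequence order is preserved."""
    return [optimize_label(label) for label in full_labels]


# Toy full context labels for two phonemes (values are made up).
full_labels = [
    {"prev_phoneme": "sil", "phoneme": "k", "next_phoneme": "o", "accent_type": 1},
    {"prev_phoneme": "k", "phoneme": "o", "next_phoneme": "sil", "accent_type": 1},
]
optimized = optimize_sequence(full_labels)
```

Because the reduced labels are still fed to the model in order, the neighbor information remains available from the sequence itself, which is why the explicit neighbor fields are redundant for a sequence-to-sequence model.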

In addition, this speech synthesis processing device performs text analysis processing according to the processing target language, acquires, from the full context label data obtained by that text analysis processing, optimized full context label data suitable for processing by a neural network model (for example, a neural network using a sequence-to-sequence method), and performs processing using the acquired data. This enables highly accurate speech synthesis for any processing target language.

Therefore, even when the processing target language is a language other than English, such as Japanese (that is, the processing target language can be any language), this device can achieve high-quality speech synthesis by training and optimizing, for example, a neural network model for text-to-speech synthesis using a sequence-to-sequence method.

Note that a "single phoneme" is the phoneme that was the processing target when the context label data was acquired in the text analysis processing.

"Optimization" is a concept that includes not only optimization in the strict sense but also bringing the result within a predetermined allowable error range.

The second invention is the first invention, wherein the acoustic features are mel spectrogram data.

This allows the device to perform speech synthesis using mel spectrogram data corresponding to the input text.

The third invention is the first or second invention, wherein the vocoder acquires speech waveform data from the acoustic features by executing processing using a neural network model.

This allows the device to perform speech synthesis using a vocoder capable of neural network processing.

The fourth invention is the third invention, wherein the vocoder acquires speech waveform data from the acoustic features by executing processing using a neural network model composed of an invertible transformation network.

In this device, the vocoder performs processing using a neural network model composed of an invertible transformation network, so the vocoder configuration can be kept simple. As a result, processing in the vocoder can be sped up, and speech synthesis can be executed in real time.
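To make the idea of an invertible (reversible) transformation concrete, here is a minimal sketch of an affine coupling layer, one common building block of invertible networks. This passage of the patent does not specify the vocoder's layers, so the coupling layer, the stand-in function `_shift_and_log_scale`, and all numeric values are assumptions chosen only for illustration.

```python
import math


def _shift_and_log_scale(xa):
    # Stand-in for a learned sub-network (hypothetical). Any function of xa
    # works, because xa passes through the layer unchanged.
    log_scale = [0.5 * math.tanh(v) for v in xa]
    shift = [0.1 * v for v in xa]
    return log_scale, shift


def coupling_forward(x):
    """Split x in half and transform the second half conditioned on the first."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    s, t = _shift_and_log_scale(xa)
    yb = [b * math.exp(si) + ti for b, si, ti in zip(xb, s, t)]
    return xa + yb


def coupling_inverse(y):
    """Exactly undo coupling_forward using the unchanged first half."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    s, t = _shift_and_log_scale(ya)
    xb = [(b - ti) * math.exp(-si) for b, si, ti in zip(yb, s, t)]
    return ya + xb


x = [0.3, -1.2, 0.8, 2.0]  # even length assumed
y = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Because the inverse is available in closed form, the same parameters can be used in both directions, which is one reason such models can stay structurally simple.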

The fifth invention is any one of the first to fourth inventions, further including a phoneme duration estimation unit that estimates phoneme durations from context label data in phoneme units.

For the period corresponding to the estimated phoneme duration, which is the phoneme duration estimated by the phoneme duration estimation unit, the full context label vector processing unit continuously outputs the optimized full context label data of the corresponding phoneme to the encoder unit.

This speech synthesis processing device stretches the input data to the encoder unit (the optimized full context label data) based on the duration of each phoneme acquired (estimated) by the phoneme duration estimation unit. That is, for a period corresponding to the duration dur(ph_k) of a phoneme ph_k, the optimized full context label data of phoneme ph_k is repeatedly input to the encoder unit 3. In other words, the device executes prediction processing using phoneme durations obtained by estimation with a model, such as a hidden Markov model, that can estimate phoneme durations stably and appropriately. Problems caused by failed attention-mechanism predictions, such as synthesized speech stopping partway or the same phrase being repeated many times, therefore do not occur.

That is, in this device, (1) phoneme durations are obtained by estimation with a model, such as a hidden Markov model, that can estimate phoneme durations stably and appropriately (processing by the phoneme duration estimation unit), and (2) acoustic features are obtained by processing with a neural network model using a sequence-to-sequence method.

Therefore, this device appropriately prevents problems caused by failed attention-mechanism predictions, such as synthesized speech stopping partway or the same phrase being repeated many times, and can execute highly accurate speech synthesis.
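The stretching step described for the fifth invention can be sketched as follows: each phoneme's label vector is repeated for as many frames as its estimated duration. The helper name `expand_by_duration`, the toy label vectors, and the frame counts are hypothetical; the patent does not prescribe this particular representation.

```python
def expand_by_duration(labels, durations):
    """Repeat each phoneme's label vector for its estimated number of frames."""
    if len(labels) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames = []
    for label, dur in zip(labels, durations):
        frames.extend([label] * dur)
    return frames


# Three phonemes ph_1..ph_3 with toy label vectors and durations in frames.
labels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
durations = [2, 3, 1]  # dur(ph_k), e.g. from an HMM-based estimator
encoder_input = expand_by_duration(labels, durations)
```

The encoder input then has one entry per output frame, so the alignment between input and output no longer depends solely on what the attention mechanism predicts.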

The sixth invention is a speech synthesis processing method that takes an arbitrary language as the processing target language and executes speech synthesis processing using an encoder-decoder neural network. The method includes a text analysis step, a full context label vector processing step, an encoding processing step, a decoding processing step, and a vocoder processing step.

The text analysis step executes text analysis processing on text data in the processing target language and acquires context label data.

The full context label vector processing step acquires, from the context label data acquired in the text analysis step, the context label for the single phoneme, that is, the phoneme that was the processing target when the context label data was acquired, and thereby acquires optimized full context label data suitable for the training processing of the neural network.

The encoding processing step acquires hidden state data by executing the encoding processing of the neural network based on the optimized full context label data.

The decoding processing step acquires acoustic feature data corresponding to the optimized full context label data by executing the decoding processing of the neural network based on the hidden state data.

The vocoder processing step acquires speech waveform data from the acoustic features acquired in the decoding processing step.

This makes it possible to realize a speech synthesis processing method that has the same effects as the first invention.

The seventh invention is a program for causing a computer to execute the speech synthesis processing method of the sixth invention.

This makes it possible to realize a program for causing a computer to execute a speech synthesis processing method that has the same effects as the first invention.

The eighth invention is a speech synthesis processing device that takes an arbitrary language as the processing target language and executes speech synthesis processing using an encoder-decoder neural network. The device includes a text analysis unit, a full context label vector processing unit, an encoder unit, a phoneme duration estimation unit, a forced attention unit, an internal division processing unit, a context calculation unit, a decoder unit, and a vocoder.

The phoneme duration estimation unit estimates phoneme durations from context label data in phoneme units.

The forced attention unit acquires first weighting coefficient data based on the phoneme durations estimated by the phoneme duration estimation unit.

The attention unit acquires second weighting coefficient data based on the hidden state data acquired by the encoder unit.

The internal division processing unit acquires combined weighting coefficient data by performing internal division processing on the first weighting coefficient data and the second weighting coefficient data.

The context calculation unit acquires context state data by applying weighted combination processing, using the combined weighting coefficient data, to the hidden state data acquired by the encoder unit.

The decoder unit acquires acoustic feature data corresponding to the optimized full context label data by executing the decoding processing of the neural network based on the context state data.

In this speech synthesis processing device, phoneme durations are obtained by estimation with a model, such as a hidden Markov model, that can estimate them stably and appropriately (processing by the phoneme duration estimation unit), and processing with these durations ensures the prediction accuracy of the phoneme durations. That is, the device executes prediction processing using context state data generated from weighting coefficient data that appropriately combines the weighting coefficient data acquired by the forced attention unit, using the phoneme durations obtained by that estimation, with the weighting coefficient data acquired by the attention unit. Therefore, even when the prediction of the attention mechanism fails (when the attention unit cannot acquire appropriate weighting coefficient data), the weighting coefficient data acquired by the forced attention unit still contributes its share of the combined weights, so a failure of the attention mechanism's prediction can be kept from affecting the speech synthesis processing.

Furthermore, in this device the acoustic features are obtained by processing with a neural network model using a sequence-to-sequence method, so highly accurate prediction of the acoustic features can be realized.

Note that the internal division ratio used in the internal division processing may be a fixed value or a value that changes dynamically (is updated).
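The combination of the two sets of weights can be sketched as a convex blend followed by a weighted sum of the encoder hidden states. In this sketch, the one-hot form of the duration-based weights, the fixed ratio `alpha`, and the names `forced_weights`, `blend`, and `context` are assumptions made for illustration; the text above only states that the first weights are derived from the estimated phoneme durations and that the ratio may be fixed or dynamic.

```python
def forced_weights(durations, t):
    """One-hot weights selecting the phoneme active at frame t (assumed form)."""
    w = [0.0] * len(durations)
    end = 0
    for i, dur in enumerate(durations):
        end += dur
        if t < end:
            w[i] = 1.0
            return w
    w[-1] = 1.0  # past the last frame: stay on the final phoneme
    return w


def blend(w_f, w_att, alpha):
    """Internal division: alpha * w_f + (1 - alpha) * w_att, element-wise."""
    return [alpha * f + (1.0 - alpha) * a for f, a in zip(w_f, w_att)]


def context(weights, hidden):
    """Weighted sum of the encoder hidden states."""
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]


durations = [2, 3]                 # frames per phoneme (toy values)
hidden = [[1.0, 0.0], [0.0, 1.0]]  # one hidden state per phoneme (toy values)
w_att = [0.9, 0.1]                 # attention mistakenly focused on phoneme 0
w_f = forced_weights(durations, t=3)
w = blend(w_f, w_att, alpha=0.5)
c = context(w, hidden)
```

Even though the attention weights point at the wrong phoneme here, the blended weights still favor the phoneme that the durations say is active, which is the robustness property described above.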

According to the present invention, it is possible to realize a speech synthesis processing device, a speech synthesis processing method, and a program that achieve high-quality speech synthesis by training and optimizing a neural network model for text-to-speech synthesis using a sequence-to-sequence method, even when the processing target language is a language other than English, such as Japanese (that is, the processing target language can be any language).

Schematic configuration diagram of the speech synthesis processing device 100 according to the first embodiment.
Diagram showing an example of the information (parameters) included in the full context label data acquired by text analysis processing when the processing target language is Japanese.
Diagram showing an example of the information (parameters) included in the optimized full context label data.
Diagram showing the schematic configuration of the vocoder 6 of the speech synthesis processing device according to the first modification of the first embodiment.
Diagram showing the schematic configuration of the vocoder 6 of the speech synthesis processing device according to the first modification of the first embodiment.
Diagram showing the mel spectrogram (predicted data) of the speech waveform data acquired by executing TTS processing (processing target language: Japanese) with the speech synthesis processing device according to the first modification of the first embodiment, together with the mel spectrogram (original data) of the actual speech waveform data for the input text.
Schematic configuration diagram of the speech synthesis processing device 200 according to the second embodiment.
Diagram for explaining the process of generating the data Dx2 to be input to the encoder unit 3 based on the estimated phoneme durations.
Schematic configuration diagram of the speech synthesis processing device 300 according to the third embodiment.
Diagram for explaining the process of acquiring the context state data c(t) using the combined weighting coefficient data w(t) obtained from the weighting coefficient data w_att(t) acquired by the attention unit 4A and the weighting coefficient data w_f(t) acquired by the forced attention unit 8.
Diagram for explaining the process of acquiring the context state data c(t) using the combined weighting coefficient data w(t) obtained from the weighting coefficient data w_att(t) acquired by the attention unit 4A and the weighting coefficient data w_f(t) acquired by the forced attention unit 8 (processing at time t2).
Diagram for explaining the process of acquiring the context state data c(t) using the combined weighting coefficient data w(t) obtained from the weighting coefficient data w_att(t) acquired by the attention unit 4A and the weighting coefficient data w_f(t) acquired by the forced attention unit 8 (processing at time t3).
Diagram for explaining a case where the prediction of the attention mechanism fails in the processing at time t2.
Block diagram showing the hardware configuration of a computer that implements the speech synthesis processing device according to the present invention.

[First Embodiment]
The first embodiment will be described below with reference to the drawings.

<1.1: Configuration of the speech synthesis processing device>
FIG. 1 is a schematic configuration diagram of the speech synthesis processing device 100 according to the first embodiment.

As shown in FIG. 1, the speech synthesis processing device 100 includes a text analysis unit 1, a full context label vector processing unit 2, an encoder unit 3, an attention unit 4, a decoder unit 5, and a vocoder 6.

The text analysis unit 1 receives text data Din in the processing target language, executes text analysis processing on the input text data Din, and acquires a sequence of context labels, which are phoneme labels that include context made up of various kinds of linguistic information. In a language such as Japanese, where the same character (for example, a kanji) can produce a different speech waveform when pronounced depending on accent and pitch, the context label must also include linguistic information about the phonemes before and after the phoneme being processed. The text analysis unit 1 outputs the context labels for specifying the speech waveform produced when the text is pronounced (context labels including data for the preceding and/or succeeding phonemes, as required by the processing target language) to the full context label vector processing unit 2 as full context label data Dx1.

The full context label vector processing unit 2 receives the data Dx1 (full context label data) output from the text analysis unit 1. From the input full context label data Dx1, it executes full context label vector processing to acquire full context label data suitable for training a sequence-to-sequence neural network model. It then outputs the data acquired by this processing, as data Dx2 (optimized full context label data Dx2), to the encoder-side prenet processing unit 31 of the encoder unit 3.

As shown in FIG. 1, the encoder unit 3 includes an encoder-side prenet processing unit 31 and an encoder-side LSTM (long short-term memory) layer 32.

The encoder-side prenet processing unit 31 receives the data Dx2 output from the full context label vector processing unit 2. It applies convolution processing (processing with a convolution filter), data normalization processing, and processing with an activation function (for example, the ReLU (rectified linear unit) function) to the input data Dx2 to obtain data that can be input to the encoder-side LSTM layer 32. It then outputs the data acquired by this processing (prenet processing) to the encoder-side LSTM layer 32 as data Dx3.
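The three prenet operations named above (convolution, normalization, ReLU) can be sketched on a single-channel sequence as follows. The kernel values, the zero padding, the whole-sequence normalization, and the function names are assumptions for illustration only; the patent does not give filter sizes or the normalization variant in this passage, and real inputs would be multi-dimensional.

```python
import math


def conv1d_same(x, kernel):
    """1-D convolution filter with zero padding; the kernel length is assumed odd."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def normalize(x, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def prenet(x, kernel):
    """Convolution, then normalization, then ReLU, in the order described."""
    return relu(normalize(conv1d_same(x, kernel)))


dx2 = [0.0, 1.0, 0.0, -1.0, 2.0]             # toy single-channel input
dx3 = prenet(dx2, kernel=[0.25, 0.5, 0.25])  # hypothetical smoothing kernel
```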

The encoder-side LSTM layer 32 corresponds to a hidden layer (LSTM layer) of a recurrent neural network. It receives the data Dx3 output from the encoder-side prenet processing unit 31 at the current time t (denoted Dx3(t)) and the data Dx4 output from the encoder-side LSTM layer 32 at the previous time step (denoted Dx4(t-1)). It executes LSTM-layer processing on the input data Dx3(t) and Dx4(t-1) and outputs the processed data to the attention unit 4 as data Dx4 (data Dx4(t)).
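The recurrence described above, in which the layer's output at the previous step is fed back together with the current input, can be sketched with a single-unit LSTM cell. The gate parameters are arbitrary toy values, and the explicit cell state `c` is part of the standard LSTM formulation rather than something this passage mentions; both are assumptions of the sketch.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    i = gate("input", sigmoid)    # how much new information to write
    f = gate("forget", sigmoid)   # how much of the old cell state to keep
    o = gate("output", sigmoid)   # how much of the cell state to expose
    g = gate("cand", math.tanh)   # candidate cell update
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t


# Toy parameters; a trained layer would have learned weight matrices instead.
params = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "cand")}

h, c = 0.0, 0.0
dx4 = []
for dx3_t in [1.0, -0.5, 0.25]:   # Dx3(1), Dx3(2), Dx3(3)
    h, c = lstm_step(dx3_t, h, c, params)  # h plays the role of Dx4(t)
    dx4.append(h)
```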

The attention unit 4 receives the data Dx4 output from the encoder unit 3 and the data ho (output-side hidden state data ho) output from the decoder-side LSTM layer 52 of the decoder unit 5. The attention unit 4 stores, for a predetermined number of time steps, the data Dx4 output from the encoder unit 3, that is, the input-side hidden state data (referred to as data hi; the input-side hidden state data at time t is written as hi(t)). The set of data Dx4 (= hi) acquired by the encoder unit 3 and output to the attention unit 4 during the time steps t = 1 to t = S (S: a natural number) is written as hi_{1...S}. That is, the attention unit 4 stores the data hi_{1...S} given by
hi_{1...S} = {Dx4(1), Dx4(2), ..., Dx4(S)}
Likewise, the attention unit 4 stores, for a predetermined number of time steps, the data Dy3 output from the decoder-side LSTM layer 52 of the decoder unit 5, that is, the output-side hidden state data (referred to as data ho). The set of data Dy3 (= ho) acquired by the decoder-side LSTM layer 52 and output to the attention unit 4 during the time steps t = 1 to t = T (T: a natural number) is written as ho_{1...T}. That is, the attention unit 4 stores the data ho_{1...T} given by
ho_{1...T} = {Dy3(1), Dy3(2), ..., Dy3(T)}
Then, based on the set hi_{1...S} of input-side hidden state data and the set ho_{1...T} of output-side hidden state data, the attention unit 4 executes processing corresponding to, for example,
c(t) = f1_attn(hi_{1...S}, ho_{1...T})
f1_attn(): a function that acquires the context state data
to acquire the context state data c(t) at the current time t. The attention unit 4 then outputs the acquired context state data c(t) to the decoder-side LSTM layer 52.
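
The description leaves the form of f1_attn() open. One common choice, shown here purely as an assumption, is dot-product attention: each stored encoder state is scored against the most recent decoder state, the scores are turned into weights with a softmax, and c(t) is the weighted sum of the encoder states. This sketch uses only the latest decoder state ho(t) instead of the whole set ho_{1...T}, which is a simplification.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: the returned weights are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hi_seq: List[Vector], ho_t: Vector) -> Tuple[Vector, Vector]:
    """Dot-product attention (assumed form of f1_attn).

    hi_seq: stored encoder states hi(1)..hi(S); ho_t: latest decoder state ho(t).
    Returns (c_t, weights), where weights[s] is the attention paid to hi(s+1).
    """
    scores = [sum(a * b for a, b in zip(h, ho_t)) for h in hi_seq]
    weights = softmax(scores)
    dim = len(hi_seq[0])
    c_t = [sum(w * h[d] for w, h in zip(weights, hi_seq)) for d in range(dim)]
    return c_t, weights
```

Because the weights sum to 1, c(t) always lies inside the range spanned by the stored encoder states, and a decoder state that aligns strongly with one encoder state pulls c(t) toward that state.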
As shown in FIG. 1, the decoder unit 5 includes a decoder-side prenet processing unit 51, a decoder-side LSTM layer 52, a linear prediction unit 53, a post-net processing unit 54, and an adder 55.

The decoder-side prenet processing unit 51 receives the data Dy4 output from the linear prediction unit 53 one time step earlier (referred to as Dy4(t-1)). The decoder-side prenet processing unit 51 has, for example, multiple (for example, two) fully connected layers. It performs data normalization and processing by an activation function (for example, the ReLU (Rectified Linear Unit) function) to obtain data that can be input to the decoder-side LSTM layer 52. The normalization includes, for example, dropout processing that brings the number of dimensions to N when the data (vector data) output from the linear prediction unit 53 has 2N dimensions and the data (vector data) input to the decoder-side LSTM layer has N dimensions. The decoder-side prenet processing unit 51 then outputs the data obtained by this processing (prenet processing) to the decoder-side LSTM layer 52 as data Dy2.
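
A minimal sketch of a two-layer fully connected prenet with ReLU and dropout is given below. It rests on assumptions that go beyond the text: here the reduction from 2N to N dimensions comes from the shape of the first weight matrix, and dropout is the usual random zeroing with rescaling. The names `dense`, `dropout`, and `decoder_prenet` are illustrative only.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight matrix, bias)


def dense(w: List[Vector], b: Vector, x: Vector) -> Vector:
    """Fully connected layer: w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def dropout(v: Vector, rate: float, rng: random.Random) -> Vector:
    """Zero each element with probability `rate`; rescale the survivors."""
    if rate <= 0.0:
        return list(v)
    keep = 1.0 - rate
    return [x / keep if rng.random() >= rate else 0.0 for x in v]


def decoder_prenet(
    dy4_prev: Vector,
    layers: Sequence[Layer],
    rate: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Dy4(t-1) -> Dy2(t): each layer applies dense, ReLU, then dropout."""
    rng = rng or random.Random(0)
    h = list(dy4_prev)
    for w, b in layers:
        h = dropout(relu(dense(w, b, h)), rate, rng)
    return h
```

With a first weight matrix of shape N x 2N, a 2N-dimensional Dy4(t-1) yields an N-dimensional Dy2(t), which matches the dimension relationship stated in the paragraph above.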

The decoder-side LSTM layer 52 corresponds to a hidden layer (LSTM layer) of a recurrent neural network. It receives the data Dy2 output from the decoder-side prenet processing unit 51 at the current time t (written as data Dy2(t)), the data Dy3 that the decoder-side LSTM layer 52 itself output at the previous time step (written as data Dy3(t-1)), and the context state data c(t) at time t output from the attention unit 4.

The decoder-side LSTM layer 52 executes LSTM processing using the input data Dy2(t), the data Dy3(t-1), and the context state data c(t), and outputs the result to the linear prediction unit 53 as data Dy3 (data Dy3(t)). The decoder-side LSTM layer 52 also outputs the data Dy3(t), that is, the output-side hidden state data ho(t) at time t, to the attention unit 4.

The linear prediction unit 53 receives the data Dy3 output from the decoder-side LSTM layer 52. It stores the data Dy3 (a plurality of data items Dy3) output from the decoder-side LSTM layer 52 within a predetermined period (for example, a period corresponding to one frame period for acquiring a mel spectrogram) and applies a linear transformation to them to acquire the mel spectrogram prediction data Dy4 for that period. The linear prediction unit 53 then outputs the acquired data Dy4 to the postnet processing unit 54, the adder 55, and the decoder-side prenet processing unit 51.

The postnet processing unit 54 has, for example, multiple (for example, five) convolution layers. It executes convolution processing (processing by a convolution filter), data normalization, and processing by an activation function (for example, the ReLU (Rectified Linear Unit) function or the tanh function) to acquire the residual data of the prediction data (predicted mel spectrogram), and outputs the acquired residual data to the adder 55 as data Dy5.

The adder 55 receives the prediction data Dy4 (predicted mel spectrogram data) output from the linear prediction unit 53 and the residual data Dy5 (residual data of the predicted mel spectrogram) output from the postnet processing unit 54. The adder 55 adds the prediction data Dy4 and the residual data Dy5 and outputs the sum (predicted mel spectrogram data) to the vocoder 6 as data Dy6.
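
The data flow from Dy3 through Dy4 and Dy5 to Dy6 can be sketched as follows. The postnet itself is passed in as an arbitrary callable, because its convolution layers are not reproduced here. The function names and the single-frame view are assumptions made for illustration.

```python
from typing import Callable, List

Frame = List[float]  # one mel spectrogram frame


def linear_projection(dy3: Frame, w: List[Frame], b: Frame) -> Frame:
    """Dy3 -> Dy4: linear transformation of the decoder LSTM output."""
    return [sum(wi * xi for wi, xi in zip(row, dy3)) + bi for row, bi in zip(w, b)]


def refine(dy4: Frame, postnet: Callable[[Frame], Frame]) -> Frame:
    """Dy6 = Dy4 + Dy5, where Dy5 = postnet(Dy4) is the predicted residual."""
    dy5 = postnet(dy4)
    if len(dy5) != len(dy4):
        raise ValueError("residual must have the same dimension as the prediction")
    return [p + r for p, r in zip(dy4, dy5)]
```

Predicting a residual means the postnet only has to learn a correction to Dy4. If it outputs zeros, Dy6 equals Dy4.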

The vocoder 6 takes acoustic feature data as input and outputs the speech signal waveform corresponding to those acoustic features. In the present embodiment, a vocoder based on a neural network model is adopted as the vocoder 6. The vocoder 6 takes mel spectrogram data as the input acoustic features and outputs the speech signal waveform corresponding to that mel spectrogram. During training, the vocoder 6 trains the neural network model using a mel spectrogram and the speech signal waveform realized by that mel spectrogram as teacher data, and optimizes the model by acquiring optimized parameters for the neural network. During prediction, the vocoder 6 uses the optimized neural network model to predict, from the input mel spectrogram data (for example, the data Dy6 output from the decoder unit 5), the speech signal waveform corresponding to that mel spectrogram, and outputs the predicted waveform data as data Dout.

<1.2: Operation of the speech synthesis processing device>
The operation of the speech synthesis processing device 100 configured as described above is described below.

In the following, the operation of the speech synthesis processing device 100 is described in two parts: (1) learning processing (processing at training time) and (2) prediction processing (processing at prediction time).

(1.2.1: Learning processing)
First, the learning processing performed by the speech synthesis processing device 100 is described. For convenience of explanation, the processing target language is assumed to be Japanese.

Text data Din in Japanese, the processing target language, is input to the text analysis unit 1. In addition, mel spectrogram (acoustic feature) data corresponding to the text data Din is prepared as teacher data.

The text analysis unit 1 executes text analysis processing on the input text data Din and acquires a sequence of context labels, which are phoneme labels that include context made up of various kinds of linguistic information.

In Japanese, the speech waveform produced when text is pronounced differs with accent and pitch even for the same characters (for example, kanji), so linguistic information about the phonemes before and after the phoneme being processed (the current phoneme) must also be included in the context label. When the processing target is Japanese, the text analysis unit 1 executes text analysis processing for Japanese on the text data Din. For the parameters needed to identify the speech waveform produced when the text is pronounced, it acquires, as necessary, (1) data for the current phoneme only and (2) data for the preceding phoneme and/or the succeeding phoneme, and combines the acquired data into full context label data.

FIG. 2 is a diagram showing an example of the information (parameters) included in the full context label data acquired by the text analysis processing when the processing target language is Japanese.

In the case shown in FIG. 2, each parameter of the full context label data is data for specifying the content given under "Overview" in FIG. 2, and consists of data for the number of dimensions and the number of phonemes given in the table of FIG. 2.

As shown in FIG. 2, the text analysis unit 1 combines the data of all the parameters in the table of FIG. 2 and acquires the result as full context label data (vector data). In the case of FIG. 2, the full context label data is 478-dimensional vector data.

The full context label data Dx1 acquired in this way is output from the text analysis unit 1 to the full context label vector processing unit 2.

The full context label vector processing unit 2 executes full context label vector processing, which acquires, from the input full context label data Dx1, full context label data suited to training a sequence-to-sequence neural network model. Specifically, the full context label vector processing unit 2 acquires the optimized full context label data Dx2 by deleting the parameters (data) for the preceding phoneme and the parameters (data) for the succeeding phoneme. For example, when the full context label data Dx1 contains the parameters shown in FIG. 2, deleting the parameters for the preceding and succeeding phonemes yields the optimized full context label data Dx2.

FIG. 3 is a diagram showing an example of the information (parameters) included in the optimized full context label data acquired as described above.

In the case of FIG. 3, the optimized full context label data is 130-dimensional vector data, a marked reduction in the number of dimensions compared with the 478-dimensional full context label data Dx1.
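
The pruning performed by the full context label vector processing can be pictured with a toy layout. The field names and widths in `TOY_LAYOUT` are hypothetical and far smaller than the 478-dimensional layout of FIG. 2. The sketch only shows the mechanism: the slices that describe the current phoneme are kept, and those that describe the preceding or succeeding phoneme are dropped.

```python
from typing import List, Tuple

Field = Tuple[str, str, int]  # (parameter name, phoneme position, width in dimensions)

# Hypothetical layout for illustration only; not the parameter table of FIG. 2.
TOY_LAYOUT: List[Field] = [
    ("phoneme", "preceding", 2),
    ("phoneme", "current", 2),
    ("phoneme", "succeeding", 2),
    ("accent", "current", 1),
    ("accent", "succeeding", 1),
]


def prune_label(dx1: List[float], layout: List[Field]) -> List[float]:
    """Dx1 -> Dx2: keep only the slices whose position is 'current'."""
    expected = sum(width for _, _, width in layout)
    if len(dx1) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(dx1)}")
    dx2: List[float] = []
    offset = 0
    for _name, position, width in layout:
        if position == "current":
            dx2.extend(dx1[offset:offset + width])
        offset += width  # advance past this field whether or not it was kept
    return dx2
```

The same layout has to be used at training time and at prediction time so that Dx2 has the same dimensions and meaning in both.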

The neural network model used in the speech synthesis processing device 100 is a sequence-to-sequence neural network (recurrent neural network) model that has the encoder-side LSTM layer 32 and the decoder-side LSTM layer 52, so it can perform learning and prediction on the input data sequence while taking time-series relationships into account. The preceding-phoneme and succeeding-phoneme data required by the prior art therefore becomes redundant, and it degrades the efficiency of the learning processing and the accuracy of the prediction processing. For this reason, the speech synthesis processing device 100 acquires the optimized full context label data Dx2 by keeping only the parameters (data) for the current phoneme, as described above, and uses it for learning and prediction, which allows the processing to be executed at high speed and with high accuracy.

The data Dx2 acquired in this way (the optimized full context label data Dx2) is output from the full context label vector processing unit 2 to the encoder-side prenet processing unit 31 of the encoder unit 3.

The encoder-side prenet processing unit 31 applies convolution processing (processing by a convolution filter), data normalization, and processing by an activation function (for example, the ReLU (Rectified Linear Unit) function) to the data Dx2 input from the full context label vector processing unit 2 to obtain data that can be input to the encoder-side LSTM layer 32. The encoder-side prenet processing unit 31 then outputs the data obtained by this processing (prenet processing) to the encoder-side LSTM layer 32 as data Dx3.

The encoder-side LSTM layer 32 receives the data Dx3(t) output from the encoder-side prenet processing unit 31 at the current time t and the data Dx4(t-1) that the encoder-side LSTM layer 32 itself output at the previous time step. It applies LSTM processing to the input data Dx3(t) and Dx4(t-1) and outputs the result to the attention unit 4 as data Dx4 (data Dx4(t)).

The attention unit 4 receives the data Dx4 output from the encoder unit 3 and the data ho (output-side hidden state data ho) output from the decoder-side LSTM layer 52 of the decoder unit 5. The attention unit 4 stores, for a predetermined number of time steps, the data Dx4 output from the encoder unit 3, that is, the input-side hidden state data hi. For example, the attention unit 4 stores the set of data Dx4 (= hi) acquired by the encoder unit 3 and output to the attention unit 4 during the time steps t = 1 to t = S (S: a natural number) as hi_{1...S} (= {Dx4(1), Dx4(2), ..., Dx4(S)}).

The attention unit 4 also stores, for a predetermined number of time steps, the data Dy3 output from the decoder-side LSTM layer 52 of the decoder unit 5, that is, the output-side hidden state data ho. For example, the attention unit 4 stores the set of data Dy3 (= ho) acquired by the decoder-side LSTM layer 52 and output to the attention unit 4 during the time steps t = 1 to t = T (T: a natural number) as ho_{1...T} (= {Dy3(1), Dy3(2), ..., Dy3(T)}).

Then, based on the set hi_{1...S} of input-side hidden state data and the set ho_{1...T} of output-side hidden state data, the attention unit 4 executes processing corresponding to, for example,
c(t) = f1_attn(hi_{1...S}, ho_{1...T})
f1_attn(): a function that acquires the context state data
to acquire the context state data c(t) at the current time t.

The attention unit 4 then outputs the acquired context state data c(t) to the decoder-side LSTM layer 52.

The decoder-side prenet processing unit 51 receives the data Dy4(t-1) output from the linear prediction unit 53 one time step earlier. The decoder-side prenet processing unit 51 has, for example, multiple (for example, two) fully connected layers. It performs data normalization and processing by an activation function (for example, the ReLU (Rectified Linear Unit) function) to obtain data that can be input to the decoder-side LSTM layer 52. The normalization includes, for example, dropout processing that brings the number of dimensions to N when the data (vector data) output from the linear prediction unit 53 has 2N dimensions and the data (vector data) input to the decoder-side LSTM layer has N dimensions. The decoder-side prenet processing unit 51 then outputs the data obtained by this processing (prenet processing) to the decoder-side LSTM layer 52 as data Dy2.

The decoder-side LSTM layer 52 receives the data Dy2(t) output from the decoder-side prenet processing unit 51 at the current time t, the data Dy3(t-1) that the decoder-side LSTM layer 52 itself output at the previous time step, and the context state data c(t) at time t output from the attention unit 4.

The decoder-side LSTM layer 52 executes LSTM processing using the input data Dy2(t), the data Dy3(t-1), and the context state data c(t), and outputs the result to the linear prediction unit 53 as data Dy3(t). The decoder-side LSTM layer 52 also outputs the data Dy3(t), that is, the output-side hidden state data ho(t) at time t, to the attention unit 4.

The postnet processing unit 54 executes, for example, convolution processing (processing by a convolution filter), data normalization, and processing by an activation function (for example, the ReLU (Rectified Linear Unit) function or the tanh function) to acquire the residual data of the prediction data (predicted mel spectrogram), and outputs the acquired residual data to the adder 55 as data Dy5.

The adder 55 receives the prediction data Dy4 (predicted mel spectrogram data) output from the linear prediction unit 53 and the residual data Dy5 (residual data of the predicted mel spectrogram) output from the postnet processing unit 54. The adder 55 adds the prediction data Dy4 and the residual data Dy5 and outputs the sum (predicted mel spectrogram data) as data Dy6.

The speech synthesis processing device 100 then compares the data Dy6 acquired as described above (predicted mel spectrogram data) with the teacher data for the mel spectrogram (acoustic features) corresponding to the text data Din (the correct mel spectrogram), and updates the parameters of the neural network models of the encoder unit 3 and the decoder unit 5 so that the difference between the two (the comparison result, expressed for example by the norm of the difference vector or by the Euclidean distance) becomes smaller. The speech synthesis processing device 100 repeats this parameter update and acquires, as the optimized parameters, the neural network model parameters for which the difference between the data Dy6 and the teacher data becomes sufficiently small (falls within a predetermined error range).
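
The update rule described above (reduce the distance between prediction and teacher data, and repeat until the error falls within a tolerance) can be illustrated with a deliberately tiny stand-in model. In this sketch a single linear map replaces the encoder and decoder networks, the loss is the squared Euclidean distance, and the update is plain gradient descent. The learning rate, tolerance, and function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def predict(w: Matrix, x: Vector) -> Vector:
    """Stand-in model: a single linear map from input features to a 'mel frame'."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between prediction and teacher data."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train(
    w: Matrix,
    x: Vector,
    target: Vector,
    lr: float = 0.05,
    tol: float = 1e-6,
    max_steps: int = 10_000,
) -> Tuple[Matrix, float]:
    """Repeat the parameter update until the loss falls below `tol`."""
    for _ in range(max_steps):
        pred = predict(w, x)
        if squared_distance(pred, target) < tol:
            break
        # d(loss)/d(w[o][i]) = 2 * (pred[o] - target[o]) * x[i]
        w = [
            [wi - lr * 2.0 * (p - t) * xi for wi, xi in zip(row, x)]
            for row, p, t in zip(w, pred, target)
        ]
    return w, squared_distance(predict(w, x), target)
```

The parameters returned once the loop stops play the role of the optimized parameters. In a real system the gradient would be obtained by backpropagation through the whole network and averaged over many training utterances.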

Based on the optimized parameters acquired in this way, the speech synthesis processing device 100 sets the coupling coefficients (weight coefficients) between synapses in each layer of the neural network models of the encoder unit 3 and the decoder unit 5, which turns those models into optimized models (trained models).

In this way, the speech synthesis processing device 100 can build a trained model (optimized model) of a neural network that takes text data as input and outputs a mel spectrogram.

When a vocoder based on a neural network model is adopted as the vocoder 6, the learning processing is executed with mel spectrogram data as the input acoustic features and the speech signal waveform corresponding to that mel spectrogram as the output. That is, mel spectrogram data is input to the vocoder 6, speech synthesis is executed by processing with the neural network model, and speech waveform data is output. The speech waveform data output from the vocoder 6 is compared with the speech waveform data corresponding to the mel spectrogram input to the vocoder (the correct speech waveform data), and the parameters of the neural network model of the vocoder 6 are updated so that the difference between the two (the comparison result, expressed for example by the norm of the difference vector or by the Euclidean distance) becomes smaller. The vocoder 6 repeats this parameter update and acquires, as the optimized parameters, the neural network model parameters for which the difference between the speech waveform data output from the vocoder 6 and the correct speech waveform data corresponding to the input mel spectrogram becomes sufficiently small (falls within a predetermined error range).

Based on the optimized parameters acquired in this way, the vocoder 6 sets the coupling coefficients (weight coefficients) between synapses in each layer of its neural network model, which turns that model into an optimized model (trained model).

In this way, the vocoder 6 can build a trained model (optimized model) of a neural network that takes a mel spectrogram as input and outputs the corresponding speech signal waveform.

In the speech synthesis processing device 100, (1) the learning processing of the encoder unit 3 and the decoder unit 5 and (2) the learning processing of the vocoder 6 may be executed jointly, or they may be executed separately as described above. When the two are executed jointly, the learning processing may be carried out by taking text data as the input and using the speech waveform data corresponding to that text data (the correct speech waveform data) to acquire the optimized parameters of (1) the neural network models of the encoder unit 3 and the decoder unit 5 and (2) the neural network model of the vocoder 6.

(1.2.2: Prediction processing)
Next, the prediction processing performed by the speech synthesis processing device 100 is described. As with the learning processing, the processing target language is assumed to be Japanese for convenience of explanation.

When the prediction processing is executed, the trained models acquired by the learning processing described above have already been built in the speech synthesis processing device 100: the optimized model of the neural networks of the encoder unit 3 and the decoder unit 5 (the model in which the optimized parameters are set) and the optimized model of the neural network of the vocoder 6 (the model in which the optimized parameters are set). The speech synthesis processing device 100 executes the prediction processing using these trained models.

The Japanese text data Din to be synthesized into speech is input to the text analysis unit 1.

The text analysis unit 1 executes text analysis processing for Japanese on the input text data Din and acquires the full context label data Dx1, for example as 478-dimensional vector data containing the parameters shown in FIG. 2.

そして、取得されたフルコンテキストラベルデータＤｘ１は、テキスト解析部１からフルコンテキストラベルベクトル処理部２に出力される。 Then, the acquired full context label data Dx1 is output from the text analysis unit 1 to the full context label vector processing unit 2.

フルコンテキストラベルベクトル処理部２は、入力されたフルコンテキストラベルデータＤｘ１に対して、フルコンテキストラベルベクトル処理を実行し、最適化フルコンテキストラベルＤｘ２を取得する。なお、ここで取得される最適化フルコンテキストラベルＤｘ２は、エンコーダ部３、デコーダ部５のsequence-to-sequence方式のニューラルネットワークのモデルの学習処理を行うときに設定した最適化フルコンテキストラベルデータＤｘ２と同じ次元数を有し、かつ、同じパラメータ（情報）を有するデータである。 The full context label vector processing unit 2 executes full context label vector processing on the input full context label data Dx1 and acquires the optimized full context label data Dx2. The optimized full context label data Dx2 acquired here has the same number of dimensions and the same parameters (information) as the optimized full context label data Dx2 that was set when the learning process of the sequence-to-sequence neural network model of the encoder unit 3 and the decoder unit 5 was performed.

上記により取得されたデータＤｘ２（最適化フルコンテキストラベルデータＤｘ２）は、フルコンテキストラベルベクトル処理部２からエンコーダ部３のエンコーダ側プレネット処理部３１に出力される。 The data Dx2 (optimized full context label data Dx2) acquired as described above is output from the full context label vector processing unit 2 to the encoder-side prenet processing unit 31 of the encoder unit 3.

エンコーダ側ＬＳＴＭ層３２は、エンコーダ側プレネット処理部３１から、現時刻ｔにおいて出力されるデータＤｘ３（ｔ）と、１つ前の時間ステップにおいて、エンコーダ側ＬＳＴＭ層３２から出力されたデータＤｘ４（ｔ−１）とを入力する。そして、エンコーダ側ＬＳＴＭ層３２は、入力されたデータＤｘ３（ｔ）、データＤｘ４（ｔ−１）に対して、ＬＳＴＭ層による処理（ニューラルネットワーク処理）を実行し、処理後のデータをデータＤｘ４（データＤｘ４（ｔ））としてアテンション部４に出力する。 The encoder-side LSTM layer 32 receives the data Dx3(t) output from the encoder-side prenet processing unit 31 at the current time t and the data Dx4(t-1) output from the encoder-side LSTM layer 32 at the previous time step. The encoder-side LSTM layer 32 then executes LSTM-layer processing (neural network processing) on the input data Dx3(t) and Dx4(t-1), and outputs the processed data to the attention unit 4 as data Dx4 (data Dx4(t)).

加算器５５は、線形予測部５３から出力される予測データＤｙ４（予測メルスペクトログラムのデータ）と、ポストネット処理部５４から出力される残差データＤｙ５（予測メルスペクトログラムの残差データ）とを入力する。加算器５５は、予測データＤｙ４（予測メルスペクトログラムのデータ）と、残差データＤｙ５（予測メルスペクトログラムの残差データ）とに対して加算処理を実行し、加算処理後のデータ（予測メルスペクトログラムのデータ）をデータＤｙ６として、ボコーダ６に出力する。 The adder 55 receives the prediction data Dy4 (predicted mel spectrogram data) output from the linear prediction unit 53 and the residual data Dy5 (residual data of the predicted mel spectrogram) output from the postnet processing unit 54. The adder 55 adds the prediction data Dy4 and the residual data Dy5, and outputs the sum (predicted mel spectrogram data) to the vocoder 6 as data Dy6.
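The addition performed by the adder 55 can be illustrated with a minimal sketch. It assumes, for illustration only, that Dy4 and Dy5 are arrays of the same shape (frames × mel bins) and that the addition is element-wise; the function name and the sample values are hypothetical.

```python
def add_residual(dy4, dy5):
    """Element-wise sum of a predicted mel spectrogram and its residual.

    Both arguments are lists of frames, each frame a list of mel-bin values
    (an assumed layout, not one stated in the specification).
    """
    return [[a + b for a, b in zip(row4, row5)] for row4, row5 in zip(dy4, dy5)]


# Hypothetical 2-frame, 2-bin example values.
dy4 = [[0.5, 1.0], [0.25, -0.5]]   # stand-in for prediction data Dy4
dy5 = [[0.25, -0.5], [0.5, 0.25]]  # stand-in for residual data Dy5
dy6 = add_residual(dy4, dy5)       # stand-in for data Dy6 sent to the vocoder
```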

ボコーダ６は、デコーダ部５の加算器５５から出力されるデータＤｙ６（予測メルスペクトログラムのデータ（音響特徴量のデータ））を入力とし、入力されたデータＤｙ６に対して、学習済みモデルを用いたニューラルネットワーク処理による音声合成処理を実行し、データＤｙ６（予測メルスペクトログラム）に対応する音声信号波形データを取得する。そして、ボコーダ６は、取得した音声信号波形データを、データＤｏｕｔとして出力する。 The vocoder 6 inputs data Dy6 (data of predicted mel spectrogram (data of acoustic feature amount)) output from the adder 55 of the decoder unit 5, and uses a trained model for the input data Dy6. The speech synthesis process by the neural network process is executed, and the speech signal waveform data corresponding to the data Dy6 (predicted mel spectrogram) is acquired. Then, the vocoder 6 outputs the acquired voice signal waveform data as data Dout.

このように、音声合成処理装置１００では、入力されたテキストデータＤｉｎに対応する音声波形データＤｏｕｔを取得することができる。 In this way, the voice synthesis processing device 100 can acquire the voice waveform data Dout corresponding to the input text data Din.

以上のように、音声合成処理装置１００では、処理対象言語（上記では日本語）のテキストを入力とし、当該処理対象言語に応じたテキスト解析処理により、フルコンテキストラベルデータを取得し、取得したフルコンテキストラベルデータからsequence-to-sequence方式を用いたニューラルネットワークのモデルで処理（学習処理、および／または、予測処理）を実行するのに適したデータである最適化フルコンテキストラベルデータを取得する。そして、音声合成処理装置１００では、入力を最適化フルコンテキストラベルデータとし、出力をメルスペクトログラム（音響特徴量の一例）として、エンコーダ部３、アテンション部４、および、デコーダ部５において、ニューラルネットワークのモデルを用いた処理（学習処理、予測処理）を実行することで、高精度な処理を実現できる。さらに、音声合成処理装置１００では、ボコーダ６により、上記により取得したメルスペクトログラム（音響特徴量の一例）から、当該メルスペクトログラムに対応する音声信号波形データを取得し、取得したデータを出力することで、音声波形データ（データＤｏｕｔ）を取得する。これにより、音声合成処理装置１００では、入力されたテキストに相当する音声波形データを取得することができる。 As described above, the speech synthesis processing device 100 takes text in the processing target language (Japanese in the above description) as input, acquires full context label data by text analysis processing suited to that language, and from the acquired full context label data acquires optimized full context label data, that is, data suitable for processing (learning processing and/or prediction processing) with a neural network model using the sequence-to-sequence method. The speech synthesis processing device 100 then executes processing (learning processing, prediction processing) using the neural network model in the encoder unit 3, the attention unit 4, and the decoder unit 5, with the optimized full context label data as input and a mel spectrogram (an example of an acoustic feature) as output, which enables highly accurate processing. Further, the vocoder 6 acquires, from the mel spectrogram acquired as above, the speech signal waveform data corresponding to that mel spectrogram and outputs it as speech waveform data (data Dout). As a result, the speech synthesis processing device 100 can acquire speech waveform data corresponding to the input text.

つまり、音声合成処理装置１００では、sequence-to-sequence方式を用いたニューラルネットワークのモデルで処理するのに適した最適化フルコンテキストラベルデータを用いて、ニューラルネットワークによる処理が実行されるため、高精度な音声合成処理を実行することができる。また、音声合成処理装置１００では、処理対象言語に応じたテキスト解析処理を行い、当該テキスト解析処理で取得されたフルコンテキストラベルデータから、sequence-to-sequence方式を用いたニューラルネットワークのモデルで処理するのに適した最適化フルコンテキストラベルデータを取得し、取得した最適化フルコンテキストラベルデータを用いて処理を行うことで、任意の処理対象言語について、高精度な音声合成処理を行うことができる。 That is, because the speech synthesis processing device 100 executes neural network processing using optimized full context label data suited to a neural network model using the sequence-to-sequence method, it can execute highly accurate speech synthesis processing. In addition, the speech synthesis processing device 100 performs text analysis processing according to the processing target language, acquires from the resulting full context label data the optimized full context label data suited to a neural network model using the sequence-to-sequence method, and performs processing using that optimized data, so that highly accurate speech synthesis processing can be performed for any processing target language.

したがって、音声合成処理装置１００では、日本語等の英語以外の言語を処理対象言語とする場合においても（処理対象言語を任意の言語にできる）、sequence-to-sequence方式を用いたテキスト音声合成用のニューラルネットワークのモデルにより、学習・最適化を行い、高品質な音声合成処理を実現することができる。 Therefore, even when a language other than English, such as Japanese, is the processing target language (the processing target language can be any language), the speech synthesis processing device 100 can perform learning and optimization with a neural network model for text-to-speech synthesis using the sequence-to-sequence method and thereby realize high-quality speech synthesis processing.

≪第１変形例≫ ≪First modification≫
次に、第１実施形態の第１変形例について、説明する。なお、上記実施形態と同様の部分については、同一符号を付し、詳細な説明を省略する。 Next, a first modification of the first embodiment will be described. Parts that are the same as in the above embodiment are given the same reference numerals, and their detailed description is omitted.

本変形例の音声合成処理装置では、ボコーダ６が、例えば、下記先行技術文献に開示されているような、可逆変換が可能なニューラルネットワークのモデルを用いた処理を行う。この点が第１実施形態と相違し、それ以外については、本変形例の音声合成処理装置は、第１実施形態の音声合成処理装置１００と同様である。 In the speech synthesis processing device of this modification, the vocoder 6 performs processing using a neural network model capable of reversible transformation, for example as disclosed in the prior art document below. This is the difference from the first embodiment; in all other respects, the speech synthesis processing device of this modification is the same as the speech synthesis processing device 100 of the first embodiment.
（先行技術文献Ａ）： (Prior Art Document A):
R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, May 2019.
図４は、第１実施形態の第１変形例の音声合成処理装置のボコーダ６の概略構成を示す図であり、学習処理時におけるデータの流れを明示した図である。 FIG. 4 is a diagram showing a schematic configuration of the vocoder 6 of the speech synthesis processing device of the first modification of the first embodiment, and shows the flow of data during the learning process.

図５は、第１実施形態の第１変形例の音声合成処理装置のボコーダ６の概略構成を示す図であり、予測処理時におけるデータの流れを明示した図である。 FIG. 5 is a diagram showing a schematic configuration of the vocoder 6 of the speech synthesis processing device of the first modification of the first embodiment, and shows the flow of data during the prediction process.

本変形例のボコーダ６は、図４に示すように、ベクトル処理部６１と、アップサンプリング処理部６２と、ｍ個（ｍ：自然数）の可逆処理部６３ａ〜６３ｘとを備える。 As shown in FIG. 4, the vocoder 6 of this modification includes a vector processing unit 61, an upsampling processing unit 62, and m (m: natural number) reversible processing units 63a to 63x.

まず、本変形例のボコーダ６の学習処理について、説明する。 First, the learning process of the vocoder 6 of this modification will be described.

本変形例のボコーダ６は、学習処理において、音響特徴量としてメルスペクトログラム（これをデータｈとする）と、当該メルスペクトログラムに対応する音声信号波形データ（正解データ）（これをデータｘとする）とを入力し、ガウス白色ノイズ（これをデータｚとする）を出力する。 In the learning process, the vocoder 6 of this modification receives as input a mel spectrogram as the acoustic feature (referred to as data h) and the speech signal waveform data corresponding to that mel spectrogram (the correct data; referred to as data x), and outputs Gaussian white noise (referred to as data z).

ベクトル処理部６１は、学習処理時において、音声信号波形データｘを入力し、入力したデータｘに対して、例えば、コンボリューション処理を施して、可逆処理部６３ａ（学習処理時において最初にデータ入力される可逆処理部）に入力可能な次元数のベクトルデータＤｘ_１に変換する。そして、ベクトル処理部６１は、変換したベクトルデータＤｘ_１を可逆処理部６３ａに出力する。 During the learning process, the vector processing unit 61 receives the speech signal waveform data x and applies, for example, convolution processing to it, converting it into vector data Dx_1 with a number of dimensions that can be input to the reversible processing unit 63a (the reversible processing unit that receives data first during the learning process). The vector processing unit 61 then outputs the converted vector data Dx_1 to the reversible processing unit 63a.

アップサンプリング処理部６２は、音響特徴量としてメルスペクトログラムのデータｈを入力し、入力されたメルスペクトログラムのデータｈに対して、アップサンプリング処理を実行し、処理後のデータ（アップサンプリングされたメルスペクトログラムのデータ）をデータｈ１として、可逆処理部６３ａ〜６３ｘのそれぞれのＷＮ変換部６３２に出力する。 The upsampling processing unit 62 inputs the mel spectrogram data h as the acoustic feature amount, executes the upsampling process on the input mel spectrogram data h, and performs the processed data (upsampled mel spectrogram). Data) is output as data h1 to each WN conversion unit 632 of the reversible processing units 63a to 63x.

可逆処理部６３ａは、図４に示すように、可逆１×１畳み込み層と、アフィンカップリング層とを備える。 As shown in FIG. 4, the reversible processing unit 63a includes a reversible 1 × 1 convolution layer and an affine coupling layer.

可逆１×１畳み込み層は、ベクトル処理部６１から出力されるデータＤｘ_１を入力とし、入力されたデータに対して、重み係数行列Ｗ_ｋ（ｋ＝１）（シナプス間の結合係数（重み係数）を規定する行列）により、ニューラルネットワーク処理を実行する、つまり、
ＤｘＡ_１＝Ｗ_１×Ｄｘ_１
に相当する処理を実行して、データＤｘＡ_１を取得する。 The reversible 1 × 1 convolution layer receives the data Dx_1 output from the vector processing unit 61 and executes neural network processing on it using the weight coefficient matrix W_k (k = 1) (a matrix defining the coupling coefficients (weight coefficients) between synapses); that is, it executes processing corresponding to
DxA_1 = W_1 × Dx_1
to acquire the data DxA_1.

なお、重み係数行列Ｗ_ｋは、直交行列となるように設定されており、したがって、逆変換が可能となる。 The weight coefficient matrix W_k is set to be an orthogonal matrix, so the inverse transformation is possible.
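As a minimal sketch of why an orthogonal weight matrix makes the 1 × 1 convolution reversible: for an orthogonal W, the inverse equals the transpose, so the input can be recovered without any matrix inversion. The 2 × 2 rotation matrix, the sample vector, and the helper names below are illustrative assumptions, not values from this specification.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of matrix W."""
    return [list(col) for col in zip(*W)]


# Hypothetical 2x2 orthogonal matrix (a rotation by 30 degrees).
theta = math.pi / 6
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

dx = [0.5, -1.25]                    # stand-in for one column of Dx_1
dxa = matvec(W, dx)                  # forward: DxA_1 = W_1 x Dx_1
dx_rec = matvec(transpose(W), dxa)   # inverse: W^T equals W^-1 when W is orthogonal
```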

このようにして取得されたデータＤｘＡ_１は、可逆１×１畳み込み層からアフィンカップリング層に出力される。 The data DxA_1 acquired in this way is output from the reversible 1 × 1 convolution layer to the affine coupling layer.

アフィンカップリング層では、データ分割部６３１により、
ｘ＝ＤｘＡ_１
ｘ_ａ，ｘ_ｂ＝ｓｐｌｉｔ（ｘ）
ｓｐｌｉｔ（）：データ分割をする関数
に相当する処理を実行し、入力データｘを２分割し、分割データｘ_ａとｘ_ｂを取得する。例えば、ｘがｎ１×２（ｎ１：自然数）のビット数のデータである場合、ｘ_ａは、ｘの上位ｎ１ビット分のデータであり、ｘ_ｂは、ｘの下位ｎ１ビット分のデータである。 In the affine coupling layer, the data division unit 631 executes processing corresponding to
x = DxA_1
x_a, x_b = split(x)
split(): function that divides data
to divide the input data x into two and acquire the divided data x_a and x_b. For example, when x is data of n1 × 2 bits (n1: natural number), x_a is the upper n1 bits of x and x_b is the lower n1 bits of x.

そして、データｘ_ａは、ＷＮ変換部６３２およびデータ合成部６３４に出力される。また、データｘ_ｂは、アフィン変換部６３３に出力される。 The data x_a is then output to the WN conversion unit 632 and the data synthesis unit 634, and the data x_b is output to the affine transformation unit 633.

ＷＮ変換部６３２は、データ分割部６３１から出力されるデータｘ_ａと、アップサンプリング処理部６２から出力されるアップサンプリングされたメルスペクトログラムのデータｈ１とを入力する。そして、ＷＮ変換部６３２は、データｘ_ａと、データｈ１とに対して、任意の変換であるＷＮ変換（例えば、ＷａｖｅＮｅｔによる変換）を実行し、アフィン変換のパラメータとするデータｓ_ｊ，ｔ_ｊ（ｓ_ｊ：アフィン変換用の行列、ｔ_ｊ：アフィン変換用のオフセット）を取得する。取得されたアフィン変換のパラメータとするデータｓ_ｊ，ｔ_ｊは、ＷＮ変換部６３２からアフィン変換部６３３に出力される。 The WN conversion unit 632 receives the data x_a output from the data division unit 631 and the upsampled mel spectrogram data h1 output from the upsampling processing unit 62. The WN conversion unit 632 then applies a WN conversion, which may be an arbitrary transformation (for example, a transformation by WaveNet), to the data x_a and h1, and acquires the data s_j and t_j used as parameters of the affine transformation (s_j: matrix for the affine transformation, t_j: offset for the affine transformation). The acquired data s_j and t_j are output from the WN conversion unit 632 to the affine transformation unit 633.

アフィン変換部６３３は、ＷＮ変換部６３２により取得されたデータｓ_ｊ，ｔ_ｊを用いて、データ分割部６３１から入力されるデータｘ_ｂに対して、アフィン変換を行う。つまり、アフィン変換部６３３は、
ｘ_ｂ’＝Ａｆｆｉｎ（ｓ_ｊ，ｔ_ｊ，ｘ_ｂ）
＝ｓ_ｊ×ｘ_ｂ＋ｔ_ｊ
に相当する処理を実行することで、データｘ_ｂのアフィン変換後のデータｘ_ｂ’を取得し、取得したデータｘ_ｂ’をデータ合成部６３４に出力する。 The affine transformation unit 633 uses the data s_j and t_j acquired by the WN conversion unit 632 to perform an affine transformation on the data x_b input from the data division unit 631. That is, the affine transformation unit 633 executes processing corresponding to
x_b' = Affin(s_j, t_j, x_b)
= s_j × x_b + t_j
to acquire the data x_b', which is the data x_b after the affine transformation, and outputs the acquired data x_b' to the data synthesis unit 634.

データ合成部６３４では、データ分割部６３１から出力されるデータｘ_ａと、アフィン変換部６３３から出力されるデータｘ_ｂ’とを入力し、データｘ_ａと、データｘ_ｂ’とを合成する処理、すなわち、
Ｄｘ_２＝ｃｏｎｃａｔ（ｘ_ａ，ｘ_ｂ’）
に相当する処理を実行し、データＤｘ_２を取得する。なお、データ合成部６３４でのデータ合成処理は、例えば、ｘ_ａ、ｘ_ｂ’が、それぞれ、ｎ１ビットのデータである場合、上位ｎ１ビットがｘ_ａとなり、下位ｎ１ビットがｘ_ｂ’となるｎ１×２ビットのデータを取得する処理である。 The data synthesis unit 634 receives the data x_a output from the data division unit 631 and the data x_b' output from the affine transformation unit 633, and combines them; that is, it executes processing corresponding to
Dx_2 = concat(x_a, x_b')
to acquire the data Dx_2. For example, when x_a and x_b' are each n1-bit data, the data synthesis processing in the data synthesis unit 634 produces n1 × 2-bit data whose upper n1 bits are x_a and whose lower n1 bits are x_b'.
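The split, WN conversion, affine transformation, and concatenation described above can be sketched as a single training-direction coupling step. This is an illustration under stated assumptions, not the implementation of this specification: `toy_wn` is a hypothetical stand-in for the WN conversion, the data are modeled as lists split into a first and a second half rather than as bit fields, and s_j is applied as an element-wise positive scale (as in WaveGlow) rather than as a general matrix.

```python
import math


def split(x):
    """Divide x into its first half x_a and second half x_b."""
    n1 = len(x) // 2
    return x[:n1], x[n1:]


def toy_wn(x_a, h1):
    """Hypothetical stand-in for the WN conversion: maps (x_a, h1) to (s, t).

    s is kept strictly positive so that the affine transformation is invertible.
    """
    s = [math.exp(0.1 * a + 0.05 * h) for a, h in zip(x_a, h1)]
    t = [0.5 * a - 0.2 * h for a, h in zip(x_a, h1)]
    return s, t


def coupling_forward(x, h1):
    """Training direction: x_a passes through unchanged, x_b is transformed."""
    x_a, x_b = split(x)
    s, t = toy_wn(x_a, h1)
    x_b_new = [si * b + ti for si, b, ti in zip(s, x_b, t)]  # x_b' = s * x_b + t
    return x_a + x_b_new                                     # concat(x_a, x_b')


dxa_1 = [0.3, -0.7, 1.1, 0.4]   # stand-in for DxA_1
h1 = [0.2, 0.9]                 # stand-in for the upsampled mel data h1
dx_2 = coupling_forward(dxa_1, h1)
```

Because x_a is passed through unchanged, the same s and t can be recomputed from the output alone, which is what makes the step reversible.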

このようにして取得されたデータＤｘ_２は、可逆処理部６３ａから、可逆処理部６３ｂ（２番目の可逆処理部）に出力される。 The data Dx_2 acquired in this way is output from the reversible processing unit 63a to the reversible processing unit 63b (the second reversible processing unit).

可逆処理部６３ｂ〜６３ｘでは、可逆処理部６３ａと同様の処理が実行される。つまり、本変形例のボコーダ６では、図４に示すように、可逆処理部６３ａの処理がｍ回繰り返し実行される。そして、最終段の可逆処理部６３ｘからのデータｚが出力される。なお、本変形例のボコーダ６は、ｍ個の可逆処理部を備えるものとする。 In the reversible processing units 63b to 63x, the same processing as in the reversible processing unit 63a is executed. That is, in the vocoder 6 of the present modification, as shown in FIG. 4, the processing of the reversible processing unit 63a is repeatedly executed m times. Then, the data z from the reversible processing unit 63x in the final stage is output. The vocoder 6 of this modification is provided with m reversible processing units.

そして、本変形例のボコーダ６では、出力データｚが、ガウス白色ノイズとなるように、ニューラルネットワークのモデルの学習を行う。つまり、ｘを入力としたときのｚをｚ（ｘ）とすると、ｚ（ｘ）がガウス分布Ｎ（μ，σ）（μは平均値でありμ＝０、σは標準偏差）に従うガウス確率変数となるように、本変形例のボコーダ６のニューラルネットワークのモデルのパラメータを設定する。なお、σは、例えば、入力される音響特徴量としてメルスペクトログラムのデータの情報量Ｉに相関のあるデータとする。 Then, in the vocoder 6 of this modification, the model of the neural network is trained so that the output data z becomes Gauss white noise. That is, if z is z (x) when x is input, the Gaussian probability that z (x) follows the Gaussian distribution N (μ, σ) (μ is the mean value, μ = 0, σ is the standard deviation). The parameters of the model of the neural network of the vocoder 6 of this modification are set so as to be variables. Note that σ is, for example, data that correlates with the information amount I of the mel spectrogram data as the input acoustic feature amount.

つまり、本変形例のボコーダ６では、ｘが入力されたときの尤度（θ：ニューラルネットワークのパラメータ）ｐ_θ（ｘ）を、下記数式により規定することができ、当該尤度ｐ_θ（ｘ）を最大にするパラメータθ_ｏｐｔを取得することで、学習処理を実行する。

ｐ_θ（ｘ）：ｘが入力されたときの尤度（θ：ニューラルネットワークのパラメータ）
ｓ_ｊ（ｘ，ｈ）：ｘ、ｈが入力されたときのｊ番目のアフィンカップリング層の出力係数ベクトル
Ｗ_ｋ：ｋ番目の可逆１×１畳み込み層の係数行列（重み付け係数の行列）
ｚ（ｘ）：ｘが入力されたときの出力値（出力ベクトル）。
ｈ：音響特徴量（ここでは、メルスペクトログラム）
σ_ＷＧ ^２：ガウス分布の予測分散値
なお、ｚ（ｘ）は、ガウス分布Ｎ（μ，σ）（μは平均値でありμ＝０、σは標準偏差）に従うガウス確率変数に相当するものである。すなわち、ｚ〜Ｎ（μ，σ）＝Ｎ（０，σ）である。また、ｍ１は、アフィンカップリング層の処理の回数、ｍ２は、可逆１×１畳み込み層の処理の回数であり、本変形例のボコーダ６では、ｍ１＝ｍ２＝ｍである。 That is, in the vocoder 6 of this modification, the likelihood p_θ(x) when x is input (θ: parameters of the neural network) can be defined by the following formula, and the learning process is executed by acquiring the parameter θ_opt that maximizes the likelihood p_θ(x).

p_θ(x): likelihood when x is input (θ: parameters of the neural network)
s_j(x, h): output coefficient vector of the j-th affine coupling layer when x and h are input
W_k: coefficient matrix of the k-th reversible 1 × 1 convolution layer (matrix of weighting coefficients)
z(x): output value (output vector) when x is input
h: acoustic feature (here, the mel spectrogram)
σ_WG^2: predicted variance of the Gaussian distribution
Note that z(x) corresponds to a Gaussian random variable following the Gaussian distribution N(μ, σ) (μ is the mean, with μ = 0, and σ is the standard deviation). That is, z ~ N(μ, σ) = N(0, σ). Further, m1 is the number of times the affine coupling layer processing is performed and m2 is the number of times the reversible 1 × 1 convolution layer processing is performed; in the vocoder 6 of this modification, m1 = m2 = m.
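The formula referred to in the paragraph above is not reproduced in this text. Given the symbols defined for it (s_j, W_k, z(x), h, σ_WG^2, m1, m2), it is presumably the log-likelihood used in WaveGlow (Prior Art Document A). The following is a reconstruction under that assumption, not the original figure; the summation indices starting at 1 are chosen here for illustration.

```latex
\log p_\theta(x) =
  -\frac{z(x)^{\top} z(x)}{2\sigma_{WG}^{2}}
  + \sum_{j=1}^{m_1} \log s_j(x, h)
  + \sum_{k=1}^{m_2} \log \det \lvert W_k \rvert
```

In this form, if each W_k is exactly orthogonal, then |det W_k| = 1 and the last term vanishes.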

本変形例のボコーダ６では、下記数式に相当する処理を実行することで、本変形例のボコーダ６のニューラルネットワークのモデルの最適化パラメータθ_ｏｐｔを取得する。 In the vocoder 6 of this modification, the optimized parameter θ_opt of the neural network model of the vocoder 6 is acquired by executing processing corresponding to the following formula.

本変形例のボコーダ６では、上記の学習処理により取得した最適化パラメータθ_ｏｐｔにより、ニューラルネットワークのモデルのパラメータが設定され（各可逆処理部６３ｂ〜６３ｘのアフィンカップリング層、可逆１×１畳み込み層のパラメータが設定され）、学習済みモデルが構築される。 In the vocoder 6 of this modification, the parameters of the neural network model are set using the optimized parameter θ_opt acquired by the above learning process (the parameters of the affine coupling layer and the reversible 1 × 1 convolution layer of each of the reversible processing units 63b to 63x are set), and the trained model is constructed.
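The formula for θ_opt mentioned above is likewise not reproduced in this text. Since the description states that θ_opt is the parameter that maximizes the likelihood p_θ(x), it can be written as follows. This is a reconstruction, not the original figure; maximizing the log-likelihood gives the same θ_opt because the logarithm is monotonic.

```latex
\theta_{opt} = \operatorname*{arg\,max}_{\theta} \; \log p_\theta(x)
```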

次に、本変形例のボコーダ６の予測処理について、説明する。 Next, the prediction processing of the vocoder 6 of this modification will be described.

本変形例のボコーダ６は、予測処理において、音響特徴量としてメルスペクトログラム（これをデータｈとする）と、当該メルスペクトログラムの情報量Ｉに相関のあるデータを標準偏差σとし、平均値を「０」とするガウス白色ノイズｚとを入力とする。 In the prediction process, the vocoder 6 of this modification receives as input a mel spectrogram as the acoustic feature (referred to as data h) and Gaussian white noise z whose mean is 0 and whose standard deviation σ is data correlated with the information amount I of that mel spectrogram.
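A minimal sketch of generating the zero-mean Gaussian white noise z used as the prediction-time input follows. How σ is derived from the information amount I is not given in this passage, so the value 0.6, the sample count, and the function name are purely hypothetical.

```python
import random


def gaussian_white_noise(n, sigma, seed=0):
    """Return n independent samples drawn from N(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(0.0, sigma) for _ in range(n)]


# Hypothetical values: 16000 samples, sigma = 0.6.
z = gaussian_white_noise(16000, sigma=0.6)
```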

本変形例のボコーダ６では、予測処理時において、図５に示すように、学習処理時とは、逆の処理が実行される。 In the vocoder 6 of this modification, as shown in FIG. 5, the process opposite to that during the learning process is executed during the prediction process.

メルスペクトログラムのデータ（例えば、デコーダ部５から出力されるデータＤｙ６）がアップサンプリング処理部６２に入力される。 The mel spectrogram data (for example, the data Dy6 output from the decoder unit 5) is input to the upsampling processing unit 62.

また、ガウス白色ノイズｚ（データｚという）が可逆処理部６３ｘに入力される。 In addition, the Gaussian white noise z (referred to as data z) is input to the reversible processing unit 63x.

そして、可逆処理部６３ｘにおいて、入力されたデータｚに対して、アフィンカップリング層の処理、可逆１×１畳み込み層の層の処理が実行される。この処理が、図５に示すように、ｍ回繰り返し実行される。各処理は、同様であるので、可逆処理部６３ａでの処理について、説明する。 Then, in the reversible processing unit 63x, the processing of the affine coupling layer and the processing of the layer of the reversible 1 × 1 convolution layer are executed for the input data z. As shown in FIG. 5, this process is repeatedly executed m times. Since each process is the same, the process in the reversible processing unit 63a will be described.

データ合成部６３４では、可逆処理部６３ｂから出力されるデータＤｘ’_２を入力し、学習処理時とは逆の処理、すなわち、データ分割処理を実行する。つまり、データ合成部６３４では、
ｘ＝Ｄｘ’_２
ｘ_ａ，ｘ_ｂ’＝ｓｐｌｉｔ（ｘ）
ｓｐｌｉｔ（）：データ分割をする関数
に相当する処理を実行し、入力データｘを２分割し、分割データｘ_ａとｘ_ｂ’を取得する。 The data synthesis unit 634 receives the data Dx'_2 output from the reversible processing unit 63b and executes the reverse of its learning-time processing, that is, data division processing. Specifically, the data synthesis unit 634 executes processing corresponding to
x = Dx'_2
x_a, x_b' = split(x)
split(): function that divides data
to divide the input data x into two and acquire the divided data x_a and x_b'.

そして、データ合成部６３４は、取得したデータｘ_ａをＷＮ変換部６３２およびデータ分割部６３１に出力し、データｘ_ｂ’をアフィン変換部６３３に出力する。 The data synthesis unit 634 then outputs the acquired data x_a to the WN conversion unit 632 and the data division unit 631, and outputs the data x_b' to the affine transformation unit 633.

ＷＮ変換部６３２は、データ合成部６３４から出力されるデータｘ_ａと、アップサンプリング処理部６２から出力されるアップサンプリングされたメルスペクトログラムのデータｈ１とを入力する。そして、ＷＮ変換部６３２は、データｘ_ａと、データｈ１とに対して、任意の変換であるＷＮ変換（例えば、ＷａｖｅＮｅｔによる変換）を実行し、アフィン変換のパラメータとするデータｓ_ｊ，ｔ_ｊ（ｓ_ｊ：アフィン変換用の行列、ｔ_ｊ：アフィン変換用のオフセット）を取得する。取得されたアフィン変換のパラメータとするデータｓ_ｊ，ｔ_ｊは、ＷＮ変換部６３２からアフィン変換部６３３に出力される。 The WN conversion unit 632 receives the data x_a output from the data synthesis unit 634 and the upsampled mel spectrogram data h1 output from the upsampling processing unit 62. The WN conversion unit 632 then applies a WN conversion, which may be an arbitrary transformation (for example, a transformation by WaveNet), to the data x_a and h1, and acquires the data s_j and t_j used as parameters of the affine transformation (s_j: matrix for the affine transformation, t_j: offset for the affine transformation). The acquired data s_j and t_j are output from the WN conversion unit 632 to the affine transformation unit 633.

アフィン変換部６３３は、ＷＮ変換部６３２により取得されたデータｓ_ｊ，ｔ_ｊを用いて、データ合成部６３４から入力されるデータｘ_ｂ’に対して、アフィン逆変換（学習処理時に行ったアフィン変換の逆変換）を行う。つまり、アフィン変換部６３３は、
ｘ_ｂ＝Ａｆｆｉｎ^−１（ｓ_ｊ，ｔ_ｊ，ｘ_ｂ’）
に相当する処理を実行することで、データｘ_ｂ’のアフィン逆変換後のデータｘ_ｂを取得し、取得したデータｘ_ｂをデータ分割部６３１に出力する。 The affine transformation unit 633 uses the data s_j and t_j acquired by the WN conversion unit 632 to perform the inverse affine transformation (the inverse of the affine transformation performed during the learning process) on the data x_b' input from the data synthesis unit 634. That is, the affine transformation unit 633 executes processing corresponding to
x_b = Affin^-1(s_j, t_j, x_b')
to acquire the data x_b, which is the data x_b' after the inverse affine transformation, and outputs the acquired data x_b to the data division unit 631.

データ分割部６３１は、データ合成部６３４から出力されるデータｘ_ａと、アフィン変換部６３３から出力されるデータｘ_ｂとを入力し、データｘ_ａと、データｘ_ｂとを合成する処理、すなわち、
Ｄｘ’_１＝ｃｏｎｃａｔ（ｘ_ａ，ｘ_ｂ）
に相当する処理を実行し、データＤｘ’_１を取得する。そして、データ分割部６３１は、取得したデータＤｘ’_１を出力する。 The data division unit 631 receives the data x_a output from the data synthesis unit 634 and the data x_b output from the affine transformation unit 633, and combines them; that is, it executes processing corresponding to
Dx'_1 = concat(x_a, x_b)
to acquire the data Dx'_1. The data division unit 631 then outputs the acquired data Dx'_1.
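The prediction-direction steps above (split, WN conversion, inverse affine transformation, concatenation) undo the training-direction coupling step exactly. The sketch below checks this round trip under illustrative assumptions: `toy_wn` is a hypothetical stand-in for the WN conversion, s_j is an element-wise positive scale, and the data are modeled as lists split into halves rather than as bit fields.

```python
import math


def toy_wn(x_a, h1):
    """Hypothetical stand-in for the WN conversion; depends only on x_a and h1."""
    s = [math.exp(0.1 * a + 0.05 * h) for a, h in zip(x_a, h1)]
    t = [0.5 * a - 0.2 * h for a, h in zip(x_a, h1)]
    return s, t


def coupling_forward(x, h1):
    """Training direction: x_b' = s * x_b + t, with x_a unchanged."""
    n1 = len(x) // 2
    x_a, x_b = x[:n1], x[n1:]
    s, t = toy_wn(x_a, h1)
    return x_a + [si * b + ti for si, b, ti in zip(s, x_b, t)]


def coupling_inverse(y, h1):
    """Prediction direction: x_b = (x_b' - t) / s, with x_a unchanged."""
    n1 = len(y) // 2
    x_a, x_b_new = y[:n1], y[n1:]
    s, t = toy_wn(x_a, h1)  # same (s, t) as in training, since x_a is unchanged
    return x_a + [(b - ti) / si for si, b, ti in zip(s, x_b_new, t)]


h1 = [0.2, 0.9]             # stand-in for the upsampled mel data h1
x = [0.3, -0.7, 1.1, 0.4]   # stand-in for the data entering the coupling layer
x_rec = coupling_inverse(coupling_forward(x, h1), h1)
```

The inverse never needs to invert the WN conversion itself, only the affine map, which is why the WN conversion can be an arbitrary transformation.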

上記のようにして可逆処理部６３ｘ〜６３ａにより処理されることで取得されたデータＤｘ’_１が、ベクトル処理部６１に入力される。 The data Dx'_1 acquired through the processing by the reversible processing units 63x to 63a as described above is input to the vector processing unit 61.

ベクトル処理部６１は、学習処理時と逆の処理を実行することで、データＤｘ’_１から、予測音声信号波形データｘを取得し、出力する。 The vector processing unit 61 executes the reverse of its learning-time processing to acquire the predicted speech signal waveform data x from the data Dx'_1, and outputs it.

以上のように処理することで、本変形例のボコーダ６では、入力ｚ（ガウス白色ノイズｚ）と、メルスペクトログラムのデータｈから、予測音声信号波形データｘを取得することができる。 By processing as described above, the vocoder 6 of this modification can acquire the predicted speech signal waveform data x from the input z (Gaussian white noise z) and the mel spectrogram data h.

本変形例のボコーダ６では、ニューラルネットワークを可逆変換できる構成を採用している。このため、本変形例のボコーダ６では、（１）ガウス白色ノイズが入力されたときに出力される音声波形データの尤度と、（２）音声波形データが入力されたときに出力されるガウス白色ノイズの尤度とを等価にし、学習処理を行いやすい（計算が容易である）後者（音声波形データが入力されたときに出力されるガウス白色ノイズの尤度）により、学習処理を行うことで、効率良く学習処理を行うことができる。 The vocoder 6 of this modification adopts a configuration in which the neural network can be reversibly transformed. This makes (1) the likelihood of the speech waveform data output when Gaussian white noise is input equivalent to (2) the likelihood of the Gaussian white noise output when speech waveform data is input. The learning process can therefore be performed using the latter (the likelihood of the Gaussian white noise output when speech waveform data is input), which is easier to compute, so the learning process can be performed efficiently.

そして、本変形例のボコーダ６では、ニューラルネットワークを可逆変換できる構成を有しているので、上記学習処理により取得した学習済みモデルにより、予測処理を、学習処理時とは逆の処理（逆変換）により実現できる。 Further, because the vocoder 6 of this modification has a configuration in which the neural network can be reversibly transformed, the prediction process can be realized with the trained model acquired by the above learning process by running the processing in the reverse direction of the learning process (inverse transformation).

このように、本変形例のボコーダ６では、音響特徴量としてメルスペクトログラムのデータから音声波形データを直接予測（取得）できる構成をシンプルな構成で実現できる。そして、本変形例のボコーダ６では、このようなシンプルな構成を有しているので、処理精度を保ちながら、予測処理を高速に行うことができ、音声合成処理をリアルタイムで実行することが可能になる。 As described above, the vocoder 6 of this modification realizes, with a simple configuration, direct prediction (acquisition) of speech waveform data from mel spectrogram data used as the acoustic feature. Because of this simple configuration, the vocoder 6 of this modification can perform the prediction process at high speed while maintaining processing accuracy, which makes it possible to execute speech synthesis processing in real time.

図６は、本変形例の音声合成処理装置によりＴＴＳ処理（処理対象言語：日本語）実行し、取得した音声波形データのメルスペクトログラム（予測データ）と、入力テキストの実際の音声波形データのメルスペクトログラム（オリジナルデータ）とを示す図である。 FIG. 6 shows a mel spectrogram (prediction data) of the voice waveform data acquired by executing TTS processing (processing target language: Japanese) by the voice synthesis processing device of this modified example, and a mel of the actual voice waveform data of the input text. It is a figure which shows the spectrogram (original data).

図６から分かるように、本変形例の音声合成処理装置によりＴＴＳ処理では、非常に高精度な音声波形データが予測（取得）できる。 As can be seen from FIG. 6, the voice synthesis processing apparatus of this modified example can predict (acquire) very accurate voice waveform data in the TTS processing.

［第２実施形態］ [Second Embodiment]
次に、第２実施形態について、説明する。なお、上記実施形態と同様の部分については、同一符号を付し、詳細な説明を省略する。 Next, the second embodiment will be described. Parts that are the same as in the above embodiment are given the same reference numerals, and their detailed description is omitted.

第１実施形態では、エンコーダ・デコーダ方式（sequence-to-sequence方式）を用いた音声合成処理装置１００について、説明した。第１実施形態の音声合成処理装置１００は、注意機構（アテンション部４）を備えており、音素継続長と音響モデルとを注意機構を用いて同時に最適化するニューラル音声合成処理を実現することができる。これにより、第１実施形態の音声合成処理装置１００では、自然音声クラスの高音質なテキスト音声合成を実現できる。しかしながら、第１実施形態の音声合成処理装置１００では、推論時（予測処理時）に、まれに注意機構予測が失敗することがあり、これにより合成発話が途中で止まってしまう、同じフレーズを何回も繰り返してしまう、等の問題がある。 In the first embodiment, the speech synthesis processing device 100 using the encoder-decoder method (sequence-to-sequence method) was described. The speech synthesis processing device 100 of the first embodiment includes an attention mechanism (the attention unit 4) and can realize neural speech synthesis processing that simultaneously optimizes the phoneme continuation length and the acoustic model using the attention mechanism. As a result, it can realize high-quality text-to-speech synthesis comparable to natural speech. However, in the speech synthesis processing device 100 of the first embodiment, the attention mechanism prediction occasionally fails at inference time (during the prediction process), which causes problems such as the synthesized utterance stopping partway through or the same phrase being repeated many times.

第２実施形態では、上記問題を解決するための技術について、説明する。 In the second embodiment, a technique for solving the above problem will be described.

＜２．１：音声合成処理装置の構成＞ <2.1: Configuration of speech synthesis processing device>
図７は、第２実施形態に係る音声合成処理装置２００の概略構成図である。 FIG. 7 is a schematic configuration diagram of the speech synthesis processing device 200 according to the second embodiment.

第２実施形態に係る音声合成処理装置２００は、第１実施形態の音声合成処理装置１００において、アテンション部４を削除し、音素継続長推定部７を追加した構成を有している。そして、第２実施形態に係る音声合成処理装置２００は、第１実施形態の音声合成処理装置１００において、テキスト解析部１をテキスト解析部１Ａに置換し、フルコンテキストラベルベクトル処理部２をフルコンテキストラベルベクトル処理部２Ａに置換し、デコーダ部５をデコーダ部５Ａに置換した構成を有している。 The speech synthesis processing device 200 according to the second embodiment has a configuration in which the attention unit 4 is deleted and the phoneme continuation length estimation unit 7 is added in the speech synthesis processing device 100 of the first embodiment. Then, in the speech synthesis processing device 200 according to the second embodiment, in the speech synthesis processing device 100 of the first embodiment, the text analysis unit 1 is replaced with the text analysis unit 1A, and the full context label vector processing unit 2 is replaced with the full context. It has a configuration in which the label vector processing unit 2A is replaced and the decoder unit 5 is replaced with the decoder unit 5A.

テキスト解析部１Ａは、第１実施形態のテキスト解析部１と同様の機能を有しており、さらに、音素のコンテキストラベルを取得する機能を有している。テキスト解析部１Ａは、処理対象言語のテキストデータＤｉｎから音素のコンテキストラベルを取得し、取得した音素のコンテキストラベルのデータをデータＤｘ０１として、音素継続長推定部７に出力する。 The text analysis unit 1A has the same functions as the text analysis unit 1 of the first embodiment and additionally has a function of acquiring phoneme context labels. The text analysis unit 1A acquires phoneme context labels from the text data Din in the processing target language and outputs the acquired phoneme context label data to the phoneme continuation length estimation unit 7 as data Dx01.

音素継続長推定部７は、テキスト解析部１Ａから出力されるデータＤｘ０１（音素のコンテキストラベルのデータ）を入力する。音素継続長推定部７は、データＤｘ０１（音素のコンテキストラベルのデータ）から、データＤｘ０１に対応する音素の音素継続長を推定（取得）する音素継続長推定処理を実行する。具体的には、音素継続長推定部７は、例えば、隠れマルコフモデル（ＨＭＭ：ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）、ニューラルネットワークモデル等を用いた、音素のコンテキストラベルから当該音素の音素継続長を推定（予測）するモデル（処理システム）により、音素継続長推定処理を実行する。 The phoneme continuation length estimation unit 7 inputs data Dx01 (phoneme context label data) output from the text analysis unit 1A. The phoneme continuation length estimation unit 7 executes a phoneme continuation length estimation process that estimates (acquires) the phoneme continuation length of the phoneme corresponding to the data Dx01 from the data Dx01 (phoneme context label data). Specifically, the phoneme continuation length estimation unit 7 estimates (predicts) the phoneme continuation length of the phoneme from the context label of the phoneme using, for example, a hidden Markov model (HMM: Hidden Markov Model), a neural network model, or the like. Phoneme continuation length estimation processing is executed by the model (processing system).

そして、音素継続長推定部７は、音素継続長推定処理により取得（推定）した音素継続長のデータをデータＤｘ０２として、フルコンテキストラベルベクトル処理部２Ａに出力する。 Then, the phoneme continuation length estimation unit 7 outputs the phoneme continuation length data acquired (estimated) by the phoneme continuation length estimation process to the full context label vector processing unit 2A as data Dx02.

The full-context label vector processing unit 2A has the same functions as the full-context label vector processing unit 2 of the first embodiment. In addition, for a period corresponding to the phoneme duration estimated by the phoneme duration estimation unit 7, it continuously outputs to the encoder unit 3 the optimized full-context label data for the phoneme to which that duration applies.

The full-context label vector processing unit 2A receives the data Dx1 (full-context label data) output from the text analysis unit 1A and the data Dx02 (phoneme duration data) output from the phoneme duration estimation unit 7.

From the input full-context label data Dx1, the full-context label vector processing unit 2A executes full-context label vector processing to acquire full-context label data suited to the learning processing of a sequence-to-sequence neural network model. It outputs the data acquired by this processing as data Dx2 (optimized full-context label data Dx2) to the encoder-side prenet processing unit 31 of the encoder unit 3. In doing so, it outputs the optimized full-context label data for each phoneme continuously for the period corresponding to that phoneme's estimated duration.

The decoder unit 5A has the configuration of the decoder unit 5 of the first embodiment with the decoder-side LSTM layer 52 replaced with a decoder-side LSTM layer 52A. In all other respects, the decoder unit 5A is the same as the decoder unit 5 of the first embodiment.

The decoder-side LSTM layer 52A has the same functions as the decoder-side LSTM layer 52. It receives (1) the data Dy2 output from the decoder-side prenet processing unit 51 at the current time t (denoted Dy2(t)), (2) the data Dy3 output from the decoder-side LSTM layer 52A at the previous time step (denoted Dy3(t-1)), and (3) the input-side hidden state data hi(t) for time t output from the encoder unit 3.

Using the input data Dy2(t), Dy3(t-1), and hi(t), the decoder-side LSTM layer 52A executes LSTM processing and outputs the processed data as data Dy3 (data Dy3(t)) to the linear prediction unit 53.
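As a rough illustration of the data flow just described, the following Python sketch shows one way the three inputs Dy2(t), Dy3(t-1), and hi(t) could be combined at each time step when no attention unit is present. The names `toy_cell` and `run_decoder_layer` are hypothetical, and `toy_cell` is a simple tanh recurrence standing in for the LSTM layer; it is not an LSTM. The Dy2 values are supplied directly for brevity, whereas in the device they come from the decoder-side prenet processing unit 51.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def toy_cell(x: Vector, prev: Vector) -> Vector:
    """Stand-in for the LSTM layer: an element-wise tanh recurrence (NOT an LSTM)."""
    size = len(prev)
    # Fold the concatenated input down to the state size by summing strided slices.
    folded = [sum(x[i::size]) for i in range(size)]
    return [math.tanh(f + p) for f, p in zip(folded, prev)]


def run_decoder_layer(
    dy2_seq: Sequence[Vector],
    hi_seq: Sequence[Vector],
    state_size: int,
    cell: Callable[[Vector, Vector], Vector] = toy_cell,
) -> List[Vector]:
    """Feed Dy2(t), Dy3(t-1) and hi(t) to the recurrent cell, one frame at a time."""
    if len(dy2_seq) != len(hi_seq):
        # Without attention, encoder and decoder must run frame-synchronously.
        raise ValueError("encoder and decoder sequences must have equal length")
    dy3_prev: Vector = [0.0] * state_size  # assumed zero initial state
    outputs: List[Vector] = []
    for dy2_t, hi_t in zip(dy2_seq, hi_seq):
        # hi(t) of the same time index is used directly; no attention weights.
        dy3_t = cell(list(dy2_t) + list(hi_t), dy3_prev)
        outputs.append(dy3_t)
        dy3_prev = dy3_t
    return outputs
```

The point of the sketch is only the indexing: because the encoder input has already been stretched to frame resolution, the encoder output at time t can be paired with the decoder step at the same t.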

<2.2: Operation of the speech synthesis processing device>
The operation of the speech synthesis processing device 200 configured as described above is described below.

FIG. 8 is a diagram for explaining the processing that generates the data Dx2 input to the encoder unit 3 on the basis of the estimated phoneme durations.

In the following, the operation of the speech synthesis processing device 200 is described separately for (1) learning processing (processing at training time) and (2) prediction processing (processing at prediction time).

(2.2.1: Learning processing)
First, the learning processing performed by the speech synthesis processing device 200 is described. For convenience of explanation, the processing target language is assumed to be Japanese.

Japanese text data Din (Japanese being the processing target language) is input to the text analysis unit 1A. Mel spectrogram (acoustic feature) data corresponding to the text data Din is prepared as teacher data.

As in the first embodiment, the text analysis unit 1A executes text analysis processing on the input text data Din and acquires a sequence of context labels, which are phoneme labels that include context made up of various kinds of linguistic information.

As in the first embodiment, the text analysis unit 1A outputs the acquired full-context label data as data Dx1 to the full-context label vector processing unit 2A.

The text analysis unit 1A also acquires the phoneme context labels from the text data Din and outputs them as data Dx01 to the phoneme duration estimation unit 7.

The phoneme duration estimation unit 7 executes the phoneme duration estimation processing, estimating (acquiring) from the data Dx01 (phoneme context label data) the duration of each corresponding phoneme. Specifically, this processing uses a model (processing system) that estimates (predicts) the duration of a phoneme from its context label, for example, a hidden Markov model (HMM) or a neural network model.

From the data Dx1 (full-context label data) output from the text analysis unit 1A, the full-context label vector processing unit 2A executes full-context label vector processing (the same as in the first embodiment) to acquire full-context label data suited to the learning processing of a sequence-to-sequence neural network model. For the period corresponding to each phoneme duration estimated by the phoneme duration estimation unit 7, it continuously outputs the optimized full-context label data for the corresponding phoneme.

The data Dx2 (optimized full-context label data Dx2) acquired in this way is output from the full-context label vector processing unit 2A to the encoder-side prenet processing unit 31 of the encoder unit 3.

The encoder-side LSTM layer 32 receives the data Dx3(t) output from the encoder-side prenet processing unit 31 at the current time t and the data Dx4(t-1) output from the encoder-side LSTM layer 32 at the previous time step. It executes LSTM processing on Dx3(t) and Dx4(t-1) and outputs the processed data as data Dx4 (data Dx4(t), which is the input-side hidden state data hi(t)) to the decoder-side LSTM layer 52A of the decoder unit 5A.

The decoder-side prenet processing unit 51 receives the data Dy4(t-1) output from the linear prediction unit 53 one time step earlier. The decoder-side prenet processing unit 51 has, for example, multiple (for example, two) fully connected layers. It executes data normalization processing and activation function processing (for example, processing with a ReLU (Rectified Linear Unit) function) to acquire data that can be input to the decoder-side LSTM layer 52A. The normalization processing includes, for example, dropout processing so that, when the data (vector data) output from the linear prediction unit 53 has 2N dimensions and the data (vector data) input to the decoder-side LSTM layer has N dimensions, the dimensionality of the data becomes N. The decoder-side prenet processing unit 51 outputs the data acquired by this processing (prenet processing) as data Dy2 to the decoder-side LSTM layer 52A.

The decoder-side LSTM layer 52A receives (1) the data Dy2(t) output from the decoder-side prenet processing unit 51 at the current time t, (2) the data Dy3(t-1) output from the decoder-side LSTM layer 52A at the previous time step, and (3) the input-side hidden state data hi(t) (= Dx4(t)) for time t output from the encoder unit 3.

Using the input data Dy2(t), Dy3(t-1), and hi(t), the decoder-side LSTM layer 52A executes LSTM processing and outputs the processed data as data Dy3(t) to the linear prediction unit 53.
The linear prediction unit 53, the postnet processing unit 54, and the adder 55 execute the same processing as in the first embodiment.

The speech synthesis processing device 200 then compares the data Dy6 (predicted mel spectrogram data) acquired as described above with the teacher data (the correct mel spectrogram) of the mel spectrogram (acoustic feature) corresponding to the text data Din, and updates the parameters of the neural network models of the encoder unit 3 and the decoder unit 5A so that the difference between the two (the comparison result; for example, a difference expressed as the norm of the difference vector or as a Euclidean distance) becomes smaller. The speech synthesis processing device 200 repeats this parameter update processing and acquires, as the optimized parameters, the parameters of the neural network models at which the difference between the data Dy6 and the teacher data becomes sufficiently small (falls within a predetermined error range).
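The comparison step can be sketched as follows, assuming the difference is measured as the Euclidean norm of the difference between the predicted and correct mel spectrograms, which is one of the examples the text gives. The names `mel_distance` and `has_converged`, the tolerance argument, and the toy 2x2 spectrograms are hypothetical; the parameter update itself (for example, by gradient descent) is outside this sketch.

```python
import math
from typing import List

Mel = List[List[float]]  # frames x mel bins


def mel_distance(predicted: Mel, target: Mel) -> float:
    """Euclidean norm of the difference between two mel spectrograms."""
    if len(predicted) != len(target):
        raise ValueError("frame counts differ")
    total = 0.0
    for p_frame, t_frame in zip(predicted, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("mel dimensions differ")
        total += sum((p - t) ** 2 for p, t in zip(p_frame, t_frame))
    return math.sqrt(total)


def has_converged(predicted: Mel, target: Mel, tolerance: float) -> bool:
    """True once the difference falls within the predetermined error range."""
    return mel_distance(predicted, target) <= tolerance


# Toy values only; real spectrograms have many frames and mel bins.
target = [[0.0, 1.0], [2.0, 3.0]]
predicted = [[0.0, 1.0], [2.0, 0.0]]
loss = mel_distance(predicted, target)  # sqrt(3.0 ** 2) = 3.0
```

A training loop would call `has_converged` after each parameter update and stop once it returns true.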

Based on the optimized parameters acquired as described above, the speech synthesis processing device 200 sets the coupling coefficients (weight coefficients) between synapses in each layer of the neural network models of the encoder unit 3 and the decoder unit 5A, thereby making these models optimized models (trained models).

In this way, the speech synthesis processing device 200 can build a trained model (optimized model) of a neural network that takes text data as input and outputs a mel spectrogram.

Note that the speech synthesis processing device 200 may use the trained model (optimized model) of the neural network acquired by the learning processing in the speech synthesis processing device 100 of the first embodiment. That is, the trained model may be built in the speech synthesis processing device 200 by setting the parameters of its encoder unit 3 and decoder unit 5A using the optimal parameters of the encoder unit 3 and the decoder unit 5 of the trained model acquired in the speech synthesis processing device 100.

When a vocoder using a neural network model is adopted as the vocoder 6, its learning processing is the same as in the first embodiment.

As in the first embodiment, this allows a trained model (optimized model) of a neural network that takes a mel spectrogram as input and outputs speech waveform data to be built in the vocoder 6.

Note that the speech synthesis processing device 200 may execute (1) the learning processing of the encoder unit 3 and the decoder unit 5A and (2) the learning processing of the vocoder 6 jointly, or may execute them individually as described above. When they are executed jointly, the learning processing uses text data as the input together with the speech waveform data corresponding to that text data (the correct speech waveform data) to acquire the optimized parameters of (1) the neural network models of the encoder unit 3 and the decoder unit 5A and (2) the neural network model of the vocoder 6.

(2.2.2: Prediction processing)
Next, the prediction processing performed by the speech synthesis processing device 200 is described. For convenience of explanation, the processing target language is again assumed to be Japanese.

When prediction processing is executed, the trained models acquired by the above learning processing have been built in the speech synthesis processing device 200: namely, the optimized models (models in which the optimized parameters are set) of the neural networks of the encoder unit 3 and the decoder unit 5A, and the optimized model (model in which the optimized parameters are set) of the neural network of the vocoder 6. The speech synthesis processing device 200 executes the prediction processing using these trained models.

Japanese text data Din to be subjected to speech synthesis processing is input to the text analysis unit 1A.

The text analysis unit 1A executes Japanese text analysis processing on the input text data Din and acquires the full-context label data Dx1, for example, as 478-dimensional vector data including the parameters shown in FIG. 2.

The acquired full-context label data Dx1 is output from the text analysis unit 1A to the full-context label vector processing unit 2A.

For example, as shown in FIG. 8, when the input data Din is 「今日の天気は...」 (kyou no tenki wa ..., "Today's weather is ..."), let the phonemes included in the data Dx01 be
(1) ph_0 = "k", (2) ph_1 = "y", (3) ph_2 = "ou", (4) ph_3 = "n", (5) ph_4 = "o", (6) ph_sil = silence, (7) ph_5 = "t", (8) ph_6 = "e", (9) ph_7 = "n", ...
and let dur(ph_k) denote the estimated duration of phoneme ph_k (k: integer). The phoneme duration estimation unit 7 executes the phoneme duration estimation processing using the context label of phoneme ph_k to acquire its estimated duration dur(ph_k). Assume, for example, that the durations dur(ph_k) acquired (estimated) by the phoneme duration estimation unit 7 for the above phonemes have the time lengths shown in FIG. 8.

The phoneme duration estimation unit 7 then outputs the phoneme duration data acquired (estimated) by the phoneme duration estimation processing (dur(ph_k) in the case of FIG. 8) as data Dx02 to the full-context label vector processing unit 2A.
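To make the shape of the data Dx02 concrete, the sketch below represents it as a list of (phoneme, duration) pairs and converts each duration to a whole number of frames. The text only says that each duration is a length of time, so the millisecond unit, the 5 ms frame shift, the rounding rule, and the duration values themselves are assumptions for illustration; the actual values in FIG. 8 are not reproduced here. `durations_to_frames` is a hypothetical helper name.

```python
from typing import List, Tuple

FRAME_SHIFT_MS = 5.0  # assumed frame shift; not specified in the text


def durations_to_frames(
    durations_ms: List[Tuple[str, float]],
    frame_shift_ms: float = FRAME_SHIFT_MS,
) -> List[Tuple[str, int]]:
    """Convert per-phoneme durations in milliseconds to whole frame counts."""
    result: List[Tuple[str, int]] = []
    for phoneme, ms in durations_ms:
        # Round to the nearest frame, but keep at least one frame per phoneme.
        frames = max(1, int(ms / frame_shift_ms + 0.5))
        result.append((phoneme, frames))
    return result


# Hypothetical durations for the first phonemes of the example utterance.
dx02 = [("k", 60.0), ("y", 40.0), ("ou", 150.0),
        ("n", 55.0), ("o", 90.0), ("sil", 200.0)]
frame_counts = durations_to_frames(dx02)
```

Keeping at least one frame per phoneme is a design choice of this sketch, so that a very short estimate never removes a phoneme from the encoder input.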

The full-context label vector processing unit 2A executes full-context label vector processing on the input full-context label data Dx1 and acquires the optimized full-context label data Dx2. The data Dx2 acquired here has the same number of dimensions and the same parameters (information) as the optimized full-context label data Dx2 that was set when the learning processing of the sequence-to-sequence neural network models of the encoder unit 3 and the decoder unit 5A was performed.

The data Dx2 (optimized full-context label data Dx2) acquired as described above is output from the full-context label vector processing unit 2A to the encoder-side prenet processing unit 31 of the encoder unit 3. In doing so, for the period corresponding to each phoneme duration estimated by the phoneme duration estimation unit 7, the full-context label vector processing unit 2A continuously outputs to the encoder unit 3 the optimized full-context label data for the corresponding phoneme. For example, as shown in FIG. 8, letting Dx2(ph_k) denote the optimized full-context label data for phoneme ph_k, the full-context label vector processing unit 2A continuously outputs Dx2(ph_k) to the encoder unit 3 for the period corresponding to the estimated duration dur(ph_k) of that phoneme.

In other words, the optimized full-context label data Dx2(ph_k) for phoneme ph_k is output repeatedly to the encoder unit 3 for the period corresponding to the estimated duration dur(ph_k). That is, the full-context label vector processing unit 2A executes time-stretching processing on the data input to the encoder unit 3 (the optimized full-context label data Dx2(ph_k)) on the basis of the estimated duration dur(ph_k).
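The time-stretching step described above amounts to repeating each phoneme's label vector once per frame of its estimated duration. A minimal Python sketch follows, assuming durations have already been expressed as frame counts; the name `expand_by_duration` and the toy 2-dimensional vectors are hypothetical (the text mentions 478-dimensional full-context label data before optimization).

```python
from typing import List, Sequence


def expand_by_duration(
    label_vectors: Sequence[Sequence[float]],
    frame_counts: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme's label vector for its estimated number of frames."""
    if len(label_vectors) != len(frame_counts):
        raise ValueError("one frame count is required per phoneme")
    frames: List[List[float]] = []
    for vector, count in zip(label_vectors, frame_counts):
        if count < 0:
            raise ValueError("frame counts must be non-negative")
        # Dx2(ph_k) is emitted once per frame for dur(ph_k) frames.
        frames.extend(list(vector) for _ in range(count))
    return frames


# Toy 2-dimensional vectors stand in for the real label vectors.
dx2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dur_frames = [2, 1, 3]
encoder_input = expand_by_duration(dx2, dur_frames)
```

The resulting sequence has one entry per output frame (here 2 + 1 + 3 = 6), so the encoder and decoder can run frame-synchronously without an attention mechanism to align them.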

The encoder-side prenet processing unit 31 executes convolution processing (processing with a convolution filter), data normalization processing, and activation function processing (for example, processing with a ReLU (Rectified Linear Unit) function) on the data Dx2 input from the full-context label vector processing unit 2A, and acquires data that can be input to the encoder-side LSTM layer 32. The encoder-side prenet processing unit 31 outputs the data acquired by this processing (prenet processing) as data Dx3 to the encoder-side LSTM layer 32.

The encoder-side LSTM layer 32 receives the data Dx3(t) output from the encoder-side prenet processing unit 31 at the current time t and the data Dx4(t-1) output from the encoder-side LSTM layer 32 at the previous time step. It executes LSTM processing (neural network processing) on Dx3(t) and Dx4(t-1) and outputs the processed data as data Dx4 (data Dx4(t), which is the input-side hidden state data hi(t)) to the decoder-side LSTM layer 52A of the decoder unit 5A.

The vocoder 6 receives the data Dy6 (predicted mel spectrogram data (acoustic feature data)) output from the adder 55 of the decoder unit 5A, executes speech synthesis processing on the data Dy6 by neural network processing using the trained model, and acquires the speech signal waveform data corresponding to the data Dy6 (predicted mel spectrogram). The vocoder 6 outputs the acquired speech signal waveform data as data Dout.

In this way, the speech synthesis processing device 200 can acquire the speech waveform data Dout corresponding to the input text data Din.

As described above, the speech synthesis processing device 200 takes text in the processing target language (Japanese in the above description) as input and acquires full-context label data through text analysis processing suited to that language. From the acquired full-context label data, it acquires optimized full-context label data, which is data suited to processing (learning processing and/or prediction processing) with a neural network model that uses the sequence-to-sequence method. With the optimized full-context label data as input and a mel spectrogram (an example of an acoustic feature) as output, the encoder unit 3 and the decoder unit 5A execute processing (learning processing and prediction processing) using the neural network models, which enables highly accurate processing. Furthermore, the vocoder 6 acquires, from the mel spectrogram acquired as described above, the speech signal waveform data corresponding to that mel spectrogram and outputs it as the speech waveform data (data Dout). The speech synthesis processing device 200 can thereby acquire speech waveform data corresponding to the input text.

Furthermore, the speech synthesis processing device 200 stretches the input data to the encoder unit 3 (the optimized full-context label data) on the basis of the per-phoneme durations acquired (estimated) by the phoneme duration estimation unit 7; that is, it repeatedly inputs the optimized full-context label data of phoneme ph_k to the encoder unit 3 for the period corresponding to the duration dur(ph_k) of phoneme ph_k. In other words, the speech synthesis processing device 200 executes prediction processing using phoneme durations acquired by estimation processing with a model, such as a hidden Markov model, that can estimate phoneme durations stably and appropriately. As a result, problems caused by failed attention-mechanism predictions, such as synthesized speech stopping partway or the same phrase being repeated many times, do not occur.

That is, in the speech synthesis processing device 200, (1) phoneme durations are acquired by estimation processing (processing by the phoneme duration estimation unit 7) using a model, such as a hidden Markov model, that can estimate phoneme durations stably and appropriately, and (2) acoustic features are acquired by processing with a neural network model that uses the sequence-to-sequence method.
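The division of labor summarized above can be sketched as a small pipeline in which every stage is passed in as a callable. This is only an outline under stated assumptions: the function name `synthesize` and all five stage parameters are hypothetical stand-ins for the text analysis unit 1A, the phoneme duration estimation unit 7, the full-context label vector processing unit 2A, the encoder unit 3 with the decoder unit 5A, and the vocoder 6. None of the real models is implemented here.

```python
from typing import Callable, List

Vector = List[float]


def synthesize(
    text: str,
    analyze: Callable[[str], List[str]],                     # text -> context labels
    estimate_frames: Callable[[List[str]], List[int]],       # labels -> frames per phoneme
    vectorize: Callable[[str], Vector],                      # label -> optimized label vector
    acoustic_model: Callable[[List[Vector]], List[Vector]],  # stretched input -> mel frames
    vocoder: Callable[[List[Vector]], List[float]],          # mel frames -> waveform samples
) -> List[float]:
    """Text to waveform, with durations coming from a separate duration model."""
    labels = analyze(text)
    frames_per_phoneme = estimate_frames(labels)
    if len(frames_per_phoneme) != len(labels):
        raise ValueError("the duration model must return one count per label")
    encoder_input: List[Vector] = []
    for label, count in zip(labels, frames_per_phoneme):
        # (1) Durations fix the alignment: repeat each vector count times.
        encoder_input.extend(vectorize(label) for _ in range(count))
    # (2) The sequence-to-sequence model only has to predict acoustic features.
    mel = acoustic_model(encoder_input)
    return vocoder(mel)
```

Passing the stages as arguments keeps the sketch independent of any particular duration model (HMM or neural network) or vocoder, which matches the text's statement that either kind of duration model may be used.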

Therefore, the speech synthesis processing device 200 can appropriately prevent problems caused by failed attention-mechanism predictions, such as synthesized speech stopping partway or the same phrase being repeated many times, while executing highly accurate speech synthesis processing.

[Third Embodiment]
Next, the third embodiment is described. Parts that are the same as in the above embodiments are given the same reference numerals, and detailed description of them is omitted.

<3.1: Configuration of the speech synthesis processing device>
FIG. 9 is a schematic configuration diagram of the speech synthesis processing device 300 according to the third embodiment.

第３実施形態に係る音声合成処理装置３００は、第１実施形態の音声合成処理装置１００において、テキスト解析部１をテキスト解析部１Ａに置換し、アテンション部４をアテンション部４Ａに置換し、デコーダ部５をデコーダ部５Ｂに置換した構成を有している。そして、音声合成処理装置３００は、音声合成処理装置１００において、音素継続長推定部７と、強制アテンション部８と、内分処理部９と、コンテキスト算出部１０とを追加した構成を有している。 The speech synthesis processing device 300 according to the third embodiment replaces the text analysis unit 1 with the text analysis unit 1A, the attention unit 4 with the attention unit 4A, and the decoder in the speech synthesis processing device 100 of the first embodiment. It has a configuration in which unit 5 is replaced with a decoder unit 5B. The speech synthesis processing device 300 has a configuration in which the phoneme continuation length estimation unit 7, the forced attention unit 8, the internal division processing unit 9, and the context calculation unit 10 are added to the speech synthesis processing device 100. There is.

テキスト解析部１Ａ、および、音素継続長推定部７は、第２実施形態のテキスト解析部１Ａと同様の構成、機能を有している。 The text analysis unit 1A and the phoneme continuation length estimation unit 7 have the same configuration and functions as the text analysis unit 1A of the second embodiment.

The phoneme continuation length estimation unit 7 outputs the phoneme continuation length data acquired (estimated) by the phoneme continuation length estimation process to the forced attention unit 8 as data Dx02.

The attention unit 4A receives the data Dx4 output from the encoder unit 3 and the data ho (output-side hidden state data ho) output from the decoder-side LSTM layer 52B of the decoder unit 5B. The attention unit 4A stores and holds the data Dx4 output from the encoder unit 3, that is, the input-side hidden state data hi, for a predetermined number of time steps. The set of data Dx4 (= hi) acquired by the encoder unit 3 and output to the attention unit 4A during the period from time step t = 1 to t = S (S: natural number) is denoted hi_{1...S}. That is, the attention unit 4A stores and holds the following data hi_{1...S}.
hi_{1...S} = {Dx4(1), Dx4(2), ..., Dx4(S)}
Further, the attention unit 4A stores and holds the data Dy3 output from the decoder-side LSTM layer 52B of the decoder unit 5B, that is, the output-side hidden state data ho, for a predetermined number of time steps. The set of data Dy3 (= ho) acquired by the decoder-side LSTM layer 52B and output to the attention unit 4A during the period from time step t = 1 to t = T (T: natural number) is denoted ho_{1...T}. That is, the attention unit 4A stores and holds the following data ho_{1...T}.
ho_{1...T} = {Dy3(1), Dy3(2), ..., Dy3(T)}
Then, based on the set data hi_{1...S} of the input-side hidden state data and the set data ho_{1...T} of the output-side hidden state data, the attention unit 4A executes processing corresponding to, for example,
w_att(t)_{1...S} = f2_attn(hi_{1...S}, ho_{1...T})
f2_attn(): function that acquires weighting coefficient data
to acquire the weighting coefficient data w_att(t)_{1...S} at the current time t. The attention unit 4A then outputs the acquired weighting coefficient data w_att(t)_{1...S} to the internal division processing unit 9. Here, the set of weighting coefficient data for the individual elements of the set data hi_{1...S} of the input-side hidden state data is denoted as the weighting coefficient data w_att(t)_{1...S}.

Further, the attention unit 4A outputs the set data hi_{1...S} of the data Dx4 (= hi) to the context calculation unit 10.

The forced attention unit 8 receives the estimated phoneme continuation length data Dx02 output from the phoneme continuation length estimation unit 7. When the data processed by the encoder unit 3 for the phoneme corresponding to the phoneme continuation length data Dx02 is output, the forced attention unit 8 generates weighting coefficient data w_f(t) in which the weighting coefficient is forcibly set to a predetermined value (for example, "1") for a period corresponding to the estimated phoneme continuation length of that phoneme (phoneme continuation length data Dx02). To associate it with the weighting coefficient data for each element of the set data hi_{1...S} of the input-side hidden state data, the weighting coefficient data w_f(t) expanded to S entries centered on time t (by duplicating the same data) is denoted as the weighting coefficient data w_f(t)_{1...S}.

The forced attention unit 8 outputs the weighting coefficient data w_f(t)_{1...S} generated as described above to the internal division processing unit 9.

The internal division processing unit 9 receives the weighting coefficient data w_att(t)_{1...S} output from the attention unit 4A and the weighting coefficient data w_f(t)_{1...S} output from the forced attention unit 8. The internal division processing unit 9 then acquires composite weighting coefficient data w(t) by executing internal division processing on the weighting coefficient data w_att(t)_{1...S} and the weighting coefficient data w_f(t)_{1...S}. Specifically, the internal division processing unit 9 executes processing corresponding to
w(t)_{1...S} = (1 - α) × w_att(t)_{1...S} + α × w_f(t)_{1...S}
0 ≤ α ≤ 1
to acquire the composite weighting coefficient data w(t). The above formula (internal division processing) indicates that the internal division processing is executed for each pair of corresponding elements. That is, for the j-th (1 ≤ j ≤ S) data, processing corresponding to
w(t)_j = (1 - α) × w_att(t)_j + α × w_f(t)_j
is executed to acquire the j-th composite weighting coefficient data w(t)_j.
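The element-wise internal division above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `internal_division` and the use of plain Python lists are assumptions made here, not names taken from the source.

```python
from typing import List


def internal_division(w_att: List[float], w_f: List[float], alpha: float) -> List[float]:
    """Element-wise w_j = (1 - alpha) * w_att_j + alpha * w_f_j for j = 1..S."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    if len(w_att) != len(w_f):
        raise ValueError("both weight vectors must have the same length S")
    # Blend each attention weight with the forced weight at the same index j.
    return [(1.0 - alpha) * a + alpha * f for a, f in zip(w_att, w_f)]


if __name__ == "__main__":
    # Hypothetical values with S = 3.
    print(internal_division([0.0, 0.5, 1.0], [1.0, 1.0, 1.0], alpha=0.25))
```

A larger α pulls the composite weights toward the forced attention, and a smaller α pulls them toward the learned attention.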

Then, the internal division processing unit 9 outputs the acquired composite weighting coefficient data w(t)_{1...S} to the context calculation unit 10.

The context calculation unit 10 receives the set data hi_{1...S} of the data Dx4 (= hi) output from the attention unit 4A and the composite weighting coefficient data w(t)_{1...S} output from the internal division processing unit 9. The context calculation unit 10 then acquires context state data c(t) by executing weighted addition processing on the set data hi_{1...S} of the data Dx4 (= hi) based on the composite weighting coefficient data w(t)_{1...S}. The context calculation unit 10 outputs the acquired context state data c(t) to the decoder-side LSTM layer 52B of the decoder unit 5B.
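As a minimal sketch of the weighted addition described above, the following Python function sums the S stored encoder states, each scaled by its composite weight. The function name `context_state`, the list-of-lists layout for the hidden states, and the absence of any normalization are assumptions made for illustration; the source states only that a weighted addition is performed.

```python
from typing import List, Sequence


def context_state(weights: Sequence[float],
                  hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Return c(t) as the sum over j of w_j * hi_j (plain weighted sum, an assumption)."""
    if not hidden_states:
        raise ValueError("at least one hidden state is required")
    if len(weights) != len(hidden_states):
        raise ValueError("need exactly one weight per stored hidden state")
    dim = len(hidden_states[0])
    context = [0.0] * dim
    for w, h in zip(weights, hidden_states):
        for d in range(dim):
            context[d] += w * h[d]  # accumulate the weighted contribution of hi_j
    return context


if __name__ == "__main__":
    # Hypothetical 2-dimensional hidden states with S = 2.
    print(context_state([0.5, 1.0], [[2.0, 4.0], [1.0, 0.0]]))
```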

The decoder unit 5B has a configuration in which the decoder-side LSTM layer 52 of the decoder unit 5 of the first embodiment is replaced with a decoder-side LSTM layer 52B. In all other respects, the decoder unit 5B is the same as the decoder unit 5 of the first embodiment.

デコーダ側ＬＳＴＭ層５２Ｂは、デコーダ側ＬＳＴＭ層５２と同様の機能を有している。デコーダ側ＬＳＴＭ層５２Ｂは、デコーダ側プレネット処理部５１から、現時刻ｔにおいて出力されるデータＤｙ２（これをデータＤｙ２（ｔ）と表記する）と、１つ前の時間ステップにおいて、デコーダ側ＬＳＴＭ層５２Ｂから出力されたデータＤｙ３（これをデータＤｙ３（ｔ−１）と表記する）と、コンテキスト算出部１０から出力される時刻ｔのコンテキスト状態データｃ（ｔ）とを入力する。 The decoder-side LSTM layer 52B has the same function as the decoder-side LSTM layer 52. The decoder side LSTM layer 52B has data Dy2 (this is referred to as data Dy2 (t)) output from the decoder side prenet processing unit 51 at the current time t, and the decoder side LSTM in the previous time step. The data Dy3 output from the layer 52B (this is referred to as data Dy3 (t-1)) and the context state data c (t) at time t output from the context calculation unit 10 are input.

The decoder-side LSTM layer 52B executes processing by the LSTM layer using the input data Dy2(t), the data Dy3(t-1), and the context state data c(t), and outputs the processed data to the linear prediction unit 53 as data Dy3 (data Dy3(t)). The decoder-side LSTM layer 52B also outputs the data Dy3(t), that is, the output-side hidden state data ho(t) at time t, to the attention unit 4A.

<3.2: Operation of speech synthesis processing device>
The operation of the speech synthesis processing device 300 configured as described above will be described below.

FIGS. 10 to 12 are diagrams for explaining the process of acquiring the context state data c(t) using the composite weighting coefficient data w(t) obtained from the weighting coefficient data w_att(t) acquired by the attention unit 4A and the weighting coefficient data w_f(t) acquired by the forced attention unit 8.

(3.2.1: Learning process)
First, the learning process performed by the speech synthesis processing device 300 will be described. For convenience of explanation, the following description assumes that the processing target language is Japanese.

そして、音素継続長推定部７は、音素継続長推定処理により取得（推定）した音素継続長のデータをデータＤｘ０２として、強制アテンション部８に出力する。 Then, the phoneme continuation length estimation unit 7 outputs the phoneme continuation length data acquired (estimated) by the phoneme continuation length estimation process to the forced attention unit 8 as data Dx02.

The full context label vector processing unit 2A executes full context label vector processing (the same full context label vector processing as in the first embodiment) to acquire, from the data Dx1 (full context label data) output from the text analysis unit 1A, full context label data suitable for the learning process of a sequence-to-sequence neural network model. The full context label vector processing unit 2A then outputs the data acquired by the full context label vector processing, as data Dx2 (optimized full context label data Dx2), to the encoder-side prenet processing unit 31 of the encoder unit 3.

The encoder-side LSTM layer 32 receives the data Dx3(t) output from the encoder-side prenet processing unit 31 at the current time t and the data Dx4(t-1) output from the encoder-side LSTM layer 32 in the previous time step. The encoder-side LSTM layer 32 then executes processing by the LSTM layer on the input data Dx3(t) and Dx4(t-1), and outputs the processed data to the attention unit 4A as data Dx4 (data Dx4(t) (= input-side hidden state data hi(t))).

The attention unit 4A receives the data Dx4 output from the encoder unit 3 and the data ho (output-side hidden state data ho) output from the decoder-side LSTM layer 52B of the decoder unit 5B. The attention unit 4A stores and holds the data Dx4 output from the encoder unit 3, that is, the input-side hidden state data hi, for a predetermined number of time steps. For example, the attention unit 4A stores and holds the set of data Dx4 (= hi) acquired by the encoder unit 3 and output to the attention unit 4A during the period from time step t = 1 to t = S (S: natural number) as hi_{1...S} (= {Dx4(1), Dx4(2), ..., Dx4(S)}).

Further, the attention unit 4A stores and holds the data Dy3 output from the decoder-side LSTM layer 52B of the decoder unit 5B, that is, the output-side hidden state data ho, for a predetermined number of time steps. For example, the attention unit 4A stores and holds the set of data Dy3 (= ho) acquired by the decoder-side LSTM layer 52B and output to the attention unit 4A during the period from time step t = 1 to t = T (T: natural number) as ho_{1...T} (= {Dy3(1), Dy3(2), ..., Dy3(T)}).

Then, based on the set data hi_{1...S} of the input-side hidden state data and the set data ho_{1...T} of the output-side hidden state data, the attention unit 4A executes processing corresponding to, for example,
w_att(t)_{1...S} = f2_attn(hi_{1...S}, ho_{1...T})
f2_attn(): function that acquires weighting coefficient data
to acquire the weighting coefficient data w_att(t)_{1...S} at the current time t.

The attention unit 4A then outputs the acquired weighting coefficient data w_att(t)_{1...S} to the internal division processing unit 9. The attention unit 4A also outputs the set data hi_{1...S} of the data Dx4 (= hi) to the context calculation unit 10.

When the data processed by the encoder unit 3 for the phoneme corresponding to the phoneme continuation length data Dx02 is output, the forced attention unit 8 generates weighting coefficient data w_f(t) in which the weighting coefficient is forcibly set to a predetermined value (for example, "1") for a period corresponding to the estimated phoneme continuation length of that phoneme (phoneme continuation length data Dx02). Then, to associate it with the weighting coefficient data for each element of the set data hi_{1...S} of the input-side hidden state data (so that the internal division processing can be performed), the forced attention unit 8 generates weighting coefficient data w_f(t)_{1...S} expanded to S entries centered on time t (by duplicating the same data).

The internal division processing unit 9 receives the weighting coefficient data w_att(t)_{1...S} output from the attention unit 4A and the weighting coefficient data w_f(t)_{1...S} output from the forced attention unit 8. The internal division processing unit 9 then acquires composite weighting coefficient data w(t) by executing internal division processing on the weighting coefficient data w_att(t)_{1...S} and the weighting coefficient data w_f(t)_{1...S}. Specifically, the internal division processing unit 9 executes processing corresponding to
w(t)_{1...S} = (1 - α) × w_att(t)_{1...S} + α × w_f(t)_{1...S}
0 ≤ α ≤ 1
to acquire the composite weighting coefficient data w(t)_{1...S}. The internal division processing unit 9 then outputs the acquired composite weighting coefficient data w(t)_{1...S} to the context calculation unit 10.

なお、学習処理時において、内分比αを「０」に固定してもよい。この場合（内分比αを「０」に固定した場合）、音声合成処理装置３００では、第１実施形態と同様の構成により学習処理が実行されることになる。また、学習処理時において、内分比αを所定の値（例えば、０．５）に固定して、音声合成処理装置３００において、学習処理を実行してもよい。 The internal division ratio α may be fixed to “0” during the learning process. In this case (when the internal division ratio α is fixed to “0”), the speech synthesis processing device 300 executes the learning process with the same configuration as that of the first embodiment. Further, at the time of learning processing, the internal division ratio α may be fixed to a predetermined value (for example, 0.5), and the learning processing may be executed in the speech synthesis processing apparatus 300.
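To illustrate why fixing the internal division ratio α to "0" leaves only the learned attention in effect, the short check below evaluates the internal division formula at α = 0 and at α = 0.5. The helper name `blend` and the sample weight values are hypothetical and chosen only for illustration.

```python
def blend(w_att, w_f, alpha):
    """Internal division: (1 - alpha) * w_att_j + alpha * w_f_j, element by element."""
    return [(1.0 - alpha) * a + alpha * f for a, f in zip(w_att, w_f)]


# Hypothetical weights for S = 3.
w_att = [0.0, 0.2, 0.4]   # weights from the attention unit
w_f = [1.0, 1.0, 1.0]     # weights from the forced attention unit

attention_only = blend(w_att, w_f, 0.0)   # alpha = 0: forced attention has no effect
half_and_half = blend(w_att, w_f, 0.5)    # alpha = 0.5: average of the two weight sets

# With alpha = 0 the composite weights are exactly the attention weights.
assert attention_only == w_att
print(attention_only, half_and_half)
```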

ここで、学習処理時において、内分比αを所定の値に固定する場合について、図１０〜図１２を用いて説明する。なお、説明便宜のため、内分比αを「０．５」に固定する場合について、説明する。以下では、（１）音素に対応する音声が出力される期間内の処理（図１１の場合）と、（２）無音状態である期間内の処理（図１２の場合）とについて説明する。 Here, a case where the internal division ratio α is fixed to a predetermined value during the learning process will be described with reference to FIGS. 10 to 12. For convenience of explanation, a case where the internal division ratio α is fixed to “0.5” will be described. In the following, (1) processing within the period in which the voice corresponding to the phoneme is output (in the case of FIG. 11) and (2) processing in the period of silence (in the case of FIG. 12) will be described.

まず、「（１）音素に対応する音声が出力される期間内の処理（図１１の場合）」について、説明する。 First, "(1) Processing within the period in which the voice corresponding to the phoneme is output (in the case of FIG. 11)" will be described.

For example, as shown in FIG. 10, suppose that the input data Din is "今日の天気は．．．" ("Today's weather is ..."), and let the data of the phonemes included in the data Dx01 be
(1) ph_0 = "k", (2) ph_1 = "y", (3) ph_2 = "ou", (4) ph_3 = "n", (5) ph_4 = "o", (6) ph_sil = silence, (7) ph_5 = "t", (8) ph_6 = "e", (9) ph_7 = "n", ...
and let the estimated phoneme continuation length of the phoneme ph_k (k: integer) be dur(ph_k). The phoneme continuation length estimation unit 7 executes the phoneme continuation length estimation process using the context label of the phoneme ph_k (k: integer) to acquire the estimated phoneme continuation length dur(ph_k) of the phoneme ph_k. For example, assume that the phoneme continuation length dur(ph_k) acquired (estimated) by the phoneme continuation length estimation unit 7 for each of the above phonemes (phoneme ph_k) has the length of time (continuation length) shown in FIG. 10.

When the data processed by the encoder unit 3 for the phoneme corresponding to the phoneme continuation length data Dx02 is output, the forced attention unit 8 generates weighting coefficient data w_f(t) in which the weighting coefficient is forcibly set to a predetermined value (for example, "1") for a period corresponding to the estimated phoneme continuation length of that phoneme (phoneme continuation length data Dx02). In the case of FIG. 10, when the data processed by the encoder unit 3 for the phoneme ph_k is output, the forced attention unit 8 continues to output to the internal division processing unit 9, for a period corresponding to the phoneme continuation length dur(ph_k) of the phoneme ph_k, weighting coefficient data w_f(t) in which the weighting coefficient is forcibly set to a predetermined value (for example, "1") (this corresponds to the portions labeled w_f(t)[ph_k] in FIG. 10).

FIG. 10 also shows the weighting coefficient data w_att(t) acquired by the attention unit 4A in association with the phoneme being processed. Specifically, in FIG. 10, the period during which the weighting coefficient data w_att(t) acquired by the attention unit 4A for the phoneme ph_k is output is shown as "w_att(t)[ph_k]". For convenience of explanation, FIG. 10 shows a case in which the attention unit 4A has predicted the phoneme continuation lengths correctly.

Further, in FIG. 10, the composite weighting coefficient data w(t) corresponding to the phoneme ph_k is shown as "w(t)[ph_k]".
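The behavior of the forced attention unit 8 over time can be sketched as follows: each phoneme contributes a run of forced weights whose length equals its estimated continuation length. The durations below are hypothetical values in time steps (the text does not give the actual lengths shown in FIG. 10), and the function names, the `"sil"` label, and the tuple layout are assumptions made for illustration. The value 0 for silent periods follows the description given for FIG. 12.

```python
from typing import List, Tuple


def forced_attention_timeline(segments: List[Tuple[str, int]],
                              silence_label: str = "sil",
                              forced_value: float = 1.0) -> List[Tuple[str, float]]:
    """Expand (phoneme, duration) pairs into one (phoneme, w_f(t)) entry per time step."""
    timeline: List[Tuple[str, float]] = []
    for label, duration in segments:
        # Voiced phonemes get the forced value; silent periods get 0.
        value = 0.0 if label == silence_label else forced_value
        timeline.extend([(label, value)] * duration)
    return timeline


def expand_to_s(value: float, s: int) -> List[float]:
    """Duplicate w_f(t) into S identical entries, giving w_f(t)_{1...S}."""
    return [value] * s


# Hypothetical durations (in time steps) for the first phonemes of the example.
segments = [("k", 2), ("y", 1), ("ou", 3), ("n", 2), ("o", 2), ("sil", 2), ("t", 1)]
timeline = forced_attention_timeline(segments)

label, w_f_t = timeline[3]          # a time step inside the phoneme "ou"
print(label, expand_to_s(w_f_t, 9))
```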

FIG. 11 is a diagram for explaining the processing at time t2 (time step t2), showing part of the period in FIG. 10 during which the phoneme being processed is "ou", enlarged in the time axis direction. For convenience of explanation, it is assumed that, in the speech synthesis processing device 300, the set data hi_{1...S} of the data Dx4 (= hi) consists of nine pieces of data (that is, S = 9) (the data acquired and stored in the period T(t2) in FIG. 11) (the same applies hereinafter).

ここで、時刻ｔ２における処理について、説明する。 Here, the processing at time t2 will be described.

At time t2, the forced attention unit 8 recognizes from the phoneme continuation length data Dx02 that the current period is one in which the voice corresponding to the phoneme "ou" continues to be output, and sets the weighting coefficient data w_f(t) at time t2 to "1". Further, to associate it with the weighting coefficient data for each element of the set data hi_{1...S} of the input-side hidden state data (so that the internal division processing can be performed), the forced attention unit 8 generates weighting coefficient data w_f(t)_{1...S} expanded to S (= 9) entries centered on time t2 (by duplicating the same data). Here, letting
w_f(t)_{1...S} = {w_01, w_02, w_03, w_04, w_05, w_06, w_07, w_08, w_09}
0 ≤ w_0j ≤ 1 (1 ≤ j ≤ S)
t = t2
all of w_01 to w_09 in w_f(t2)_{1...S} are set to "1" (see FIG. 11).

The forced attention unit 8 outputs the weighting coefficient data w_f(t2)_{1...S} generated as described above to the internal division processing unit 9.

Based on the set data hi_{1...S} of the input-side hidden state data and the set data ho_{1...T} of the output-side hidden state data, the attention unit 4A executes processing corresponding to, for example,
w_att(t)_{1...S} = f2_attn(hi_{1...S}, ho_{1...T})
f2_attn(): function that acquires weighting coefficient data
to acquire the weighting coefficient data w_att(t2)_{1...S} at time t2. Assume that the weighting coefficient data w_att(t)_{1...S} at time t2 is the data shown in FIG. 11 (an example). Here, letting
w_att(t)_{1...S} = {w_11, w_12, w_13, w_14, w_15, w_16, w_17, w_18, w_19}
0 ≤ w_1j ≤ 1 (1 ≤ j ≤ S)
t = t2
assume that w_11 to w_19 have been acquired by the attention unit 4A as, for example, the following values (see FIG. 11).
w_11 = 0.0, w_12 = 0.2, w_13 = 0.4, w_14 = 0.8, w_15 = 1.0
w_16 = 0.8, w_17 = 0.4, w_18 = 0.2, w_19 = 0.0
The attention unit 4A outputs the weighting coefficient data w_att(t2)_{1...S} acquired as described above to the internal division processing unit 9.

The internal division processing unit 9 receives the weighting coefficient data w_att(t2)_{1...S} output from the attention unit 4A and the weighting coefficient data w_f(t2)_{1...S} output from the forced attention unit 8. The internal division processing unit 9 then acquires composite weighting coefficient data w(t2)_{1...S} by executing internal division processing on the weighting coefficient data w_att(t2)_{1...S} and the weighting coefficient data w_f(t2)_{1...S}. Specifically, the internal division processing unit 9 executes processing corresponding to
w(t2)_{1...S} = (1 - α) × w_att(t2)_{1...S} + α × w_f(t2)_{1...S}
0 ≤ α ≤ 1
to acquire the composite weighting coefficient data w(t2)_{1...S}.

Here, since α = 0.5, the composite weighting coefficient data w(t)_{1...S} is the average of w_att(t2)_{1...S} and w_f(t2)_{1...S}. Letting
w(t)_{1...S} = {w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9}
0 ≤ w_j ≤ 1 (1 ≤ j ≤ S)
t = t2
the internal division processing unit 9 acquires w_1 to w_9 as the following values (see FIG. 11).
w_1 = 0.5 × w_01 + 0.5 × w_11 = 0.5 + 0 = 0.5
w_2 = 0.5 × w_02 + 0.5 × w_12 = 0.5 + 0.1 = 0.6
w_3 = 0.5 × w_03 + 0.5 × w_13 = 0.5 + 0.2 = 0.7
w_4 = 0.5 × w_04 + 0.5 × w_14 = 0.5 + 0.4 = 0.9
w_5 = 0.5 × w_05 + 0.5 × w_15 = 0.5 + 0.5 = 1.0
w_6 = 0.5 × w_06 + 0.5 × w_16 = 0.5 + 0.4 = 0.9
w_7 = 0.5 × w_07 + 0.5 × w_17 = 0.5 + 0.2 = 0.7
w_8 = 0.5 × w_08 + 0.5 × w_18 = 0.5 + 0.1 = 0.6
w_9 = 0.5 × w_09 + 0.5 × w_19 = 0.5 + 0 = 0.5
Then, the internal division processing unit 9 outputs the acquired composite weighting coefficient data w(t2)_{1...S} to the context calculation unit 10.
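The nine values above can be reproduced with a few lines of Python. The variable names are chosen here for illustration, and the rounding to ten decimal places is only there to suppress floating-point noise in the printed result.

```python
# Values at time t2 taken from the example for FIG. 11 (S = 9).
w_f_t2 = [1.0] * 9                                             # w_01 ... w_09
w_att_t2 = [0.0, 0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2, 0.0]       # w_11 ... w_19
alpha = 0.5

# Internal division, element by element; rounding hides floating-point noise.
w_t2 = [round((1 - alpha) * a + alpha * f, 10) for a, f in zip(w_att_t2, w_f_t2)]

print(w_t2)  # [0.5, 0.6, 0.7, 0.9, 1.0, 0.9, 0.7, 0.6, 0.5]
```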

The context calculation unit 10 receives the set data hi_{1...S} of the data Dx4 (= hi) output from the attention unit 4A and the composite weighting coefficient data w(t2)_{1...S} output from the internal division processing unit 9. The context calculation unit 10 then acquires the context state data c(t) by executing weighted addition processing on the set data hi_{1...S} of the data Dx4 (= hi) based on the composite weighting coefficient data w(t2)_{1...S}. That is, the context calculation unit 10 acquires the context state data c(t) by executing processing corresponding to the following formula.
c(t) = Σ_{j=1}^{S} w_j × hi_j
t = t2
w_j: the j-th element data (1 ≤ j ≤ S) of the composite weighting coefficient data w(t2)_{1...S}
hi_j: the j-th element data (1 ≤ j ≤ S) of the set data hi_{1...S}
Then, the context calculation unit 10 outputs the acquired context state data c(t2) to the decoder-side LSTM layer 52B of the decoder unit 5B.

次に、「（２）無音状態である期間内の処理（図１２の場合）」について、説明する。 Next, "(2) Processing within the period of silence (in the case of FIG. 12)" will be described.

FIG. 12 is a diagram for explaining the processing at time t3 (time step t3), showing part of the silent period in FIG. 10 (the period labeled "silent" in FIG. 10), enlarged in the time axis direction.

ここで、時刻ｔ３における処理について、説明する。 Here, the processing at time t3 will be described.

At time t3, the forced attention unit 8 recognizes from the phoneme continuation length Dx02 that the current period is a silent period (a state in which there is no phoneme to be uttered), and sets the weighting coefficient data w_f(t) at time t3 to "0". Further, so that this data can be associated with the weighting coefficient data for each element of the set data hi_{1...S} of the input-side hidden state data (that is, so that internal division processing can be performed), the forced attention unit 8 generates the weighting coefficient data w_f(t)_{1...S} by expanding the data to S (= 9) elements centered on time t3 (expanding by duplicating the same data). Here, w_f(t)_{1...S} is defined as
w_f(t)_{1...S} = {w_01, w_02, w_03, w_04, w_05, w_06, w_07, w_08, w_09}
0 ≦ w_0j ≦ 1 (1 ≦ j ≦ S)
t = t3
and in w_f(t3)_{1...S}, w_01 to w_09 are all set to "0" (see FIG. 12).
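As a rough sketch of how the forced weighting coefficient data could be derived from phoneme continuation lengths, the following assumes that the durations are given as a list of (label, number of frames) pairs, that silence is marked with the hypothetical label "sil", and that the forced weight is 1 while a phoneme is being uttered. These representations are illustrative assumptions, not details taken from the patent.

```python
from typing import List, Sequence, Tuple


def forced_weights(durations: Sequence[Tuple[str, int]], t: int, S: int,
                   silence_label: str = "sil") -> List[float]:
    """Return w_f(t)_{1...S}: S copies of 0.0 in silence, S copies of 1.0 otherwise."""
    end = 0
    for label, n_frames in durations:
        end += n_frames
        if t < end:  # time step t falls inside this segment
            value = 0.0 if label == silence_label else 1.0
            return [value] * S  # duplicate the same value S times
    raise ValueError("time step t lies beyond the total duration")


# Hypothetical durations: 3 frames of silence followed by 2 frames of "a".
durations = [("sil", 3), ("a", 2)]
w_f_silent = forced_weights(durations, t=1, S=9)
w_f_voiced = forced_weights(durations, t=3, S=9)
```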

The forced attention unit 8 outputs the weighting coefficient data w_f(t3)_{1...S} generated as described above to the internal division processing unit 9.

Based on the set data hi_{1...S} of the input-side hidden state data and the set data ho_{1...T} of the output-side hidden state data, the attention unit 4A executes processing corresponding to, for example,
w_att(t)_{1...S} = f2_attn(hi_{1...S}, ho_{1...T})
f2_attn(): a function that acquires weighting coefficient data
to acquire the weighting coefficient data w_att(t3)_{1...S} at time t3. The weighting coefficient data w_att(t)_{1...S} at time t3 is assumed to be the data shown in FIG. 12 (an example). Here, w_att(t)_{1...S} is defined as
w_att(t)_{1...S} = {w_11, w_12, w_13, w_14, w_15, w_16, w_17, w_18, w_19}
0 ≦ w_1j ≦ 1 (1 ≦ j ≦ S)
t = t3
and w_11 to w_19 are assumed, for example, to have all been acquired by the attention unit 4A with the value "0" (see FIG. 12).

The attention unit 4A outputs the weighting coefficient data w_att(t3)_{1...S} acquired as described above to the internal division processing unit 9.

The internal division processing unit 9 receives the weighting coefficient data w_att(t3)_{1...S} output from the attention unit 4A and the weighting coefficient data w_f(t3)_{1...S} output from the forced attention unit 8. It then acquires the composite weighting coefficient data w(t3)_{1...S} by executing internal division processing on these two sets of weighting coefficient data. Specifically, the internal division processing unit 9 executes processing corresponding to
w(t3)_{1...S} = (1 - α) × w_att(t3)_{1...S} + α × w_f(t3)_{1...S}
0 ≦ α ≦ 1
to acquire the composite weighting coefficient data w(t3)_{1...S}.

Here, α = 0.5, so the composite weighting coefficient data w(t)_{1...S} is the average of w_att(t3)_{1...S} and w_f(t3)_{1...S}. With w(t)_{1...S} defined as
w(t)_{1...S} = {w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9}
0 ≦ w_j ≦ 1 (1 ≦ j ≦ S)
t = t3
the internal division processing unit 9 acquires w_1 to w_9 all with the value "0" (see FIG. 12).
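The internal division processing is an element-wise linear interpolation between the two weight vectors. The following minimal sketch assumes both vectors are plain lists of length S; the function name `internal_division` is hypothetical.

```python
from typing import List, Sequence


def internal_division(w_att: Sequence[float], w_f: Sequence[float],
                      alpha: float = 0.5) -> List[float]:
    """Return (1 - alpha) * w_att + alpha * w_f, element by element."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    if len(w_att) != len(w_f):
        raise ValueError("w_att and w_f must have the same length S")
    return [(1.0 - alpha) * a + alpha * f for a, f in zip(w_att, w_f)]


# Silent period of FIG. 12: both inputs are all zeros, so the result is too.
S = 9
w_t3 = internal_division([0.0] * S, [0.0] * S, alpha=0.5)
```

With alpha = 0 the result equals the learned attention weights, and with alpha = 1 it equals the forced weights, so alpha controls how strongly the duration-based alignment overrides the attention mechanism.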

Then, the internal division processing unit 9 outputs the acquired composite weighting coefficient data w(t3)_{1...S} to the context calculation unit 10.

The context calculation unit 10 receives the set data hi_{1...S} of the data Dx4 (= hi) output from the attention unit 4A and the composite weighting coefficient data w(t3)_{1...S} output from the internal division processing unit 9. Based on the composite weighting coefficient data w(t3)_{1...S}, the context calculation unit 10 executes weighted addition processing on the set data hi_{1...S} to acquire the context state data c(t). That is, the context calculation unit 10 acquires the context state data c(t) by executing processing corresponding to the following expression.

c(t) = Σ_{j=1}^{S} w_j × hi_j
t = t3
w_j: j-th element data of the composite weighting coefficient data w(t3)_{1...S} (1 ≦ j ≦ S)
hi_j: j-th element data of the set data hi_{1...S}
Then, the context calculation unit 10 outputs the acquired context state data c(t3) to the decoder-side LSTM layer 52B of the decoder unit 5B.

図１２の場合、無音状態であるので、アテンション部４Ａ、および、強制アテンション部８により取得される重み付け係数データがすべて０であるので、コンテキスト状態データｃ（ｔ３）も「０」となる。つまり、上記により、無音状態であることを適切に示すコンテキスト状態データｃ（ｔ３）が取得される。 In the case of FIG. 12, since there is no sound, the weighting coefficient data acquired by the attention unit 4A and the forced attention unit 8 are all 0, so the context state data c (t3) is also “0”. That is, as described above, the context state data c (t3) that appropriately indicates that the state is silent is acquired.

上記のように取得されたコンテキスト状態データｃ（ｔ）がデコーダ部５Ｂのデコーダ側ＬＳＴＭ層５２Ｂに出力される。 The context state data c (t) acquired as described above is output to the decoder-side LSTM layer 52B of the decoder unit 5B.

デコーダ側プレネット処理部５１での処理は、第１実施形態と同様である。 The processing by the decoder-side prenet processing unit 51 is the same as that of the first embodiment.

The decoder-side LSTM layer 52B receives the data Dy2(t) output from the decoder-side prenet processing unit 51 at the current time t, the data Dy3(t-1) output from the decoder-side LSTM layer 52B at the previous time step, and the context state data c(t) for time t output from the context calculation unit 10.

The decoder-side LSTM layer 52B executes LSTM-layer processing using the input data Dy2(t), the data Dy3(t-1), and the context state data c(t), and outputs the processed data to the linear prediction unit 53 as data Dy3(t).

In the linear prediction unit 53, the postnet processing unit 54, and the adder 55, the same processing as in the first embodiment is executed.

Then, the speech synthesis processing device 200 compares the data Dy6 acquired as described above (the predicted mel spectrogram data) with the teacher data of the mel spectrogram (acoustic features) corresponding to the text data Din (the correct mel spectrogram), and updates the parameters of the neural network models of the encoder unit 3 and the decoder unit 5B so that the difference between the two (the comparison result; for example, a difference expressed by the norm of the difference vector or by the Euclidean distance) becomes smaller. The speech synthesis processing device 100 repeats this parameter update process, and acquires as the optimized parameters the neural network model parameters for which the difference between the data Dy6 and the teacher data becomes sufficiently small (falls within a predetermined error range).
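The comparison between the predicted and correct mel spectrograms can be sketched as a Euclidean distance with a stopping test. This is only an illustration of the criterion described above: the function names, the list-of-frames representation, and the tolerance value are assumptions, and the actual parameter update (for example, by backpropagation) is not shown.

```python
import math
from typing import Sequence


def mel_distance(pred: Sequence[Sequence[float]],
                 target: Sequence[Sequence[float]]) -> float:
    """Euclidean norm of the difference between two mel spectrograms (lists of frames)."""
    if len(pred) != len(target):
        raise ValueError("spectrograms must have the same number of frames")
    total = 0.0
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("frames must have the same number of mel bins")
        total += sum((p - t) ** 2 for p, t in zip(p_frame, t_frame))
    return math.sqrt(total)


def has_converged(pred: Sequence[Sequence[float]],
                  target: Sequence[Sequence[float]],
                  tolerance: float) -> bool:
    """True when the difference falls within the predetermined error range."""
    return mel_distance(pred, target) <= tolerance


# Toy one-frame, two-bin spectrograms (values are made up).
distance = mel_distance([[0.0, 0.0]], [[3.0, 4.0]])
```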

In the speech synthesis processing device 300, the coupling coefficients (weight coefficients) between synapses in each layer of the neural network models of the encoder unit 3 and the decoder unit 5B are set based on the optimized parameters acquired as described above, which makes the neural network models of the encoder unit 3 and the decoder unit 5B optimized models (trained models).

以上により、音声合成処理装置３００において、入力をテキストデータとし、出力をメルスペクトログラムとするニューラルネットワークの学習済みモデル（最適化モデル）を構築できる。 As described above, in the speech synthesis processing device 300, a trained model (optimized model) of a neural network having an input as text data and an output as a mel spectrogram can be constructed.

Note that the speech synthesis processing device 300 may use the trained model (optimized model) of the neural network acquired by the learning process in the speech synthesis processing device 100 of the first embodiment. That is, in the speech synthesis processing device 200, the parameters of the encoder unit 3 and the decoder unit 5B of the speech synthesis processing device 200 may be set using the optimum parameters of the encoder unit 3 and the decoder unit 5 of the trained model acquired by the learning process in the speech synthesis processing device 100 of the first embodiment, so that the trained model is constructed in the speech synthesis processing device 300.

Note that, in the speech synthesis processing device 300, (1) the learning process of the encoder unit 3 and the decoder unit 5B and (2) the learning process of the vocoder 6 may be executed jointly, or they may be executed individually as described above. When the two learning processes are executed jointly, the learning process may be carried out by taking text data as input and using the speech waveform data corresponding to that text data (the correct speech waveform data) to acquire the optimized parameters of (1) the neural network models of the encoder unit 3 and the decoder unit 5B and (2) the neural network model of the vocoder 6.

(3.2.2: Prediction processing)
Next, the prediction processing by the speech synthesis processing device 300 will be described. For convenience of explanation, the following description of the prediction processing also assumes that the processing target language is Japanese.

When prediction processing is executed, the trained models acquired by the above learning process have already been constructed in the speech synthesis processing device 300: the optimized models of the neural networks of the encoder unit 3 and the decoder unit 5B (models in which the optimized parameters are set) and the optimized model of the neural network of the vocoder 6 (a model in which the optimized parameters are set). The speech synthesis processing device 300 then executes the prediction processing using these trained models.

そして、取得されたフルコンテキストラベルデータＤｘ１は、テキスト解析部１Ａからフルコンテキストラベルベクトル処理部２に出力される。 Then, the acquired full context label data Dx1 is output from the text analysis unit 1A to the full context label vector processing unit 2.

Then, the phoneme continuation length estimation unit 7 outputs the phoneme continuation length data acquired (estimated) by the phoneme continuation length estimation process (dur(ph_k) in the case of FIG. 8) to the forced attention unit 8 as data Dx02.

エンコーダ部３では、第１実施形態と同様の処理が実行される。 In the encoder unit 3, the same processing as in the first embodiment is executed.

The decoder-side LSTM layer 52B executes LSTM-layer processing using the input data Dy2(t), the data Dy3(t-1), and the context state data c(t), and outputs the processed data to the linear prediction unit 53 as data Dy3(t).

線形予測部５３、ポストネット処理部５４、および、加算器５５では、第１実施形態と同様の処理が実行される。 In the linear prediction unit 53, the postnet processing unit 54, and the adder 55, the same processing as in the first embodiment is executed.

ボコーダ６は、デコーダ部５Ｂの加算器５５から出力されるデータＤｙ６（予測メルスペクトログラムのデータ（音響特徴量のデータ））を入力とし、入力されたデータＤｙ６に対して、学習済みモデルを用いたニューラルネットワーク処理による音声合成処理を実行し、データＤｙ６（予測メルスペクトログラム）に対応する音声信号波形データを取得する。そして、ボコーダ６は、取得した音声信号波形データを、データＤｏｕｔとして出力する。 The vocoder 6 inputs data Dy6 (data of predicted mel spectrogram (data of acoustic feature amount)) output from the adder 55 of the decoder unit 5B, and uses a trained model for the input data Dy6. The speech synthesis process by the neural network process is executed, and the speech signal waveform data corresponding to the data Dy6 (predicted mel spectrogram) is acquired. Then, the vocoder 6 outputs the acquired voice signal waveform data as data Dout.

このように、音声合成処理装置３００では、入力されたテキストデータＤｉｎに対応する音声波形データＤｏｕｔを取得することができる。 In this way, the voice synthesis processing device 300 can acquire the voice waveform data Dout corresponding to the input text data Din.

In the speech synthesis processing device 300, as described with reference to FIGS. 10 to 12, the context state data c(t) is generated during prediction processing as well, using weighting coefficient data obtained by combining, through internal division processing, the weighting coefficient data w_att(t) acquired by the attention unit 4A and the weighting coefficient data w_f(t) acquired by the forced attention unit 8. Because the decoder unit 5B and the vocoder 6 then perform their processing using the context state data c(t) generated in this way, the device can appropriately prevent problems caused by failed attention mechanism prediction, such as the synthesized utterance stopping partway or the same phrase being repeated many times.

For example, as shown in FIG. 13, suppose that the attention mechanism prediction has failed in the processing at time t2, that is, the weighting coefficient data acquired by the attention unit 4 is "0" (or at most a predetermined value), meaning that all element data of w_att(t)_{1...S} have the value "0" (or at most a predetermined value). Even in this case, owing to the weight of the weighting coefficient data w_f(t) acquired by the forced attention unit 8, the speech synthesis processing device 300 can acquire composite weighting coefficient data w(t)_{1...S} that keeps the failed prediction from affecting the speech synthesis processing (in the case of FIG. 13, every element of the composite weighting coefficient data w(t)_{1...S} has the value "0.5").
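The FIG. 13 situation can be worked through numerically. In the sketch below, the attention weights are all 0 (failed prediction), the forced weights are assumed to be all 1, and α = 0.5; the scalar hidden states 1 to 9 are made-up values used only to show the effect on the context.

```python
S = 9
alpha = 0.5
w_att = [0.0] * S  # attention prediction has failed: all weights are 0
w_f = [1.0] * S    # assumed forced weights while a phoneme is being uttered

# Composite weights by internal division: every element becomes 0.5.
w = [(1.0 - alpha) * a + alpha * f for a, f in zip(w_att, w_f)]

# Hypothetical scalar hidden states hi_1..hi_9 = 1.0..9.0.
hi = [float(j) for j in range(1, S + 1)]

# With attention alone the context collapses to 0; with the composite
# weights the hidden states still contribute to the context.
c_attention_only = sum(a * h for a, h in zip(w_att, hi))
c_composite = sum(w_j * h for w_j, h in zip(w, hi))
```

Here `c_attention_only` is 0.0 while `c_composite` is 22.5, so the decoder still receives a non-degenerate context despite the failed attention prediction.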

As described above, the speech synthesis processing device 300 guarantees the prediction accuracy of the phoneme continuation length by using a phoneme continuation length obtained through estimation processing (processing by the phoneme continuation length estimation unit 7) that uses a model, such as a hidden Markov model, capable of estimating the phoneme continuation length stably and appropriately. That is, the device executes prediction processing using context state data c(t) generated from weighting coefficient data that suitably combines (1) the weighting coefficient data acquired by the forced attention unit 8 from that phoneme continuation length and (2) the weighting coefficient data acquired by the attention unit 4A. Therefore, even when the attention mechanism prediction fails (when the attention unit 4 cannot acquire appropriate weighting coefficient data), the device still obtains weighting coefficient data carrying the share of weight contributed by the forced attention unit 8, so the failed prediction can be kept from affecting the speech synthesis processing.

さらに、音声合成処理装置３００では、音響特徴量については、sequence-to-sequence方式を用いたニューラルネットワークのモデルで処理することにより取得できるので、高精度な音響特徴量の予測処理が実現できる。 Further, in the speech synthesis processing device 300, the acoustic features can be obtained by processing with a neural network model using the sequence-to-sequence method, so that highly accurate prediction processing of the acoustic features can be realized.

Therefore, the speech synthesis processing device 300 can execute highly accurate speech synthesis processing while appropriately preventing problems caused by failed attention mechanism prediction, such as the synthesized utterance stopping partway or the same phrase being repeated many times.

The above describes the case where the internal division ratio α is set to a fixed value (for example, 0.5), but the configuration is not limited to this, and the internal division ratio α may be updated dynamically. For example, when the weighting coefficient data w_att(t)_{1...S} input from the attention unit 4A remains smaller than a predetermined value, or substantially 0, for a predetermined period while the weighting coefficient data w_f(t)_{1...S} input from the forced attention unit 8 is "1", the internal division processing unit 9 may determine that the processing by the attention unit 4 has failed (the attention mechanism prediction has failed) and adjust (update) the value of α to a larger value (a value that increases the weight of the weighting coefficient data w_f(t)_{1...S}).
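One possible way to update α dynamically is sketched below. The class name `AlphaController`, the threshold, the number of consecutive steps (`patience`), and the fixed increment (`step`) are all hypothetical choices; the patent only states that α may be increased when the attention weights stay near 0 for a predetermined period while the forced weights are 1.

```python
from typing import Sequence


class AlphaController:
    """Raise alpha when attention weights stay near zero while forced weights are 1."""

    def __init__(self, alpha: float = 0.5, threshold: float = 0.05,
                 patience: int = 3, step: float = 0.1) -> None:
        self.alpha = alpha
        self.threshold = threshold  # the "predetermined value" for w_att
        self.patience = patience    # the "predetermined period" in time steps
        self.step = step            # assumed increment applied to alpha
        self.low_count = 0

    def update(self, w_att: Sequence[float], w_f: Sequence[float]) -> float:
        attention_low = max(w_att) < self.threshold
        forced_on = all(f == 1.0 for f in w_f)
        if attention_low and forced_on:
            self.low_count += 1
        else:
            self.low_count = 0  # attention recovered or silence: reset the counter
        if self.low_count >= self.patience:
            # Attention is judged to have failed: weight w_f more heavily.
            self.alpha = min(1.0, self.alpha + self.step)
        return self.alpha


ctrl = AlphaController(alpha=0.5, threshold=0.05, patience=2, step=0.25)
a1 = ctrl.update([0.0, 0.0], [1.0, 1.0])
a2 = ctrl.update([0.0, 0.0], [1.0, 1.0])
a3 = ctrl.update([0.9, 0.0], [1.0, 1.0])
```

In this sketch α never decreases once raised; whether and how α should return to its initial value is a design decision the source leaves open.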

Further, in the speech synthesis processing device 300, the encoder unit 3 and the decoder unit 5 are not limited to the above configuration and may have other configurations. For example, the encoder unit 3 and the decoder unit 5 may be configured by adopting the encoder and decoder configurations of the Transformer model architecture disclosed in Reference A below. In this case, the attention mechanism placed between the encoder and the decoder in the Transformer model architecture may be replaced with the mechanism described in this embodiment, that is, a mechanism in which the attention unit 4, the forced attention unit 8, the internal division processing unit 9, and the context calculation unit 10 combine, by internal division processing, the weighting coefficient data acquired by the attention mechanism with the weighting coefficient data acquired by the forced attention unit 8, and acquire the context state data from the combined weighting coefficient data.
(Reference A): A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, "Attention is all you need," 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
[Other Embodiments]
In the speech synthesis processing device of the above embodiments (including modifications), the encoder-side LSTM layer 32 and the decoder-side LSTM layer 52 may each include a plurality of LSTM layers. Further, the encoder-side LSTM layer 32 and the decoder-side LSTM layer 52 may each be composed of bidirectional LSTM layers (LSTM layers for the forward and backward directions).

In the above embodiments (including modifications), the speech synthesis processing device includes the text analysis unit 1 and the full context label vector processing unit 2, and the full context label vector processing unit 2 acquires the optimized full context label data from the full context label data acquired by the text analysis unit 1. However, the configuration is not limited to this. For example, the speech synthesis processing device may be provided with a text analysis unit that acquires the optimized full context label data, and the full context label vector processing unit may be omitted.

The above embodiments (including modifications) may also be combined as appropriate.

In the speech synthesis processing device described in the above embodiments (including modifications), each block may be individually implemented as a single chip by a semiconductor device such as an LSI, or some or all of the blocks may be integrated into a single chip.

なおここではＬＳＩとしたが、集積度の違いにより、ＩＣ、システムＬＳＩ、スーパーＬＳＩ、ウルトラＬＳＩと呼称されることもある。 Although it is referred to as LSI here, it may be referred to as IC, system LSI, super LSI, or ultra LSI depending on the degree of integration.

また集積回路化の手法はＬＳＩに限るものではなく、専用回路または汎用プロセサで実現してもよい。ＬＳＩ製造後にプログラムすることが可能なＦＰＧＡ（ＦｉｅｌｄＰｒｏｇｒａｍｍａｂｌｅＧａｔｅＡｒｒａｙ）や、ＬＳＩ内部の回路セルの接続や設定を再構成可能なリコンフィギュラブル・プロセッサーを利用しても良い。 Further, the method of making an integrated circuit is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after the LSI is manufactured, or a reconfigurable processor that can reconfigure the connection and settings of circuit cells inside the LSI may be used.

また上記各実施形態の各機能ブロックの処理の一部または全部は、プログラムにより実現されるものであってもよい。そして上記各実施形態の各機能ブロックの処理の一部または全部は、コンピュータにおいて、中央演算装置（ＣＰＵ）により行われる。また、それぞれの処理を行うためのプログラムは、ハードディスク、ＲＯＭなどの記憶装置に格納されており、ＲＯＭにおいて、あるいはＲＡＭに読み出されて実行される。 Further, a part or all of the processing of each functional block of each of the above embodiments may be realized by a program. Then, a part or all of the processing of each functional block of each of the above embodiments is performed by the central processing unit (CPU) in the computer. Further, the program for performing each process is stored in a storage device such as a hard disk or a ROM, and is read and executed in the ROM or the RAM.

また上記実施形態の各処理をハードウェアにより実現してもよいし、ソフトウェア（ＯＳ（オペレーティングシステム）、ミドルウェア、あるいは所定のライブラリとともに実現される場合を含む。）により実現してもよい。さらにソフトウェアおよびハードウェアの混在処理により実現しても良い。 Further, each process of the above embodiment may be realized by hardware, or may be realized by software (including a case where it is realized together with an OS (operating system), middleware, or a predetermined library). Further, it may be realized by mixed processing of software and hardware.

For example, when each functional unit of the above embodiments is realized by software, each functional unit may be realized by software processing using the hardware configuration shown in FIG. 14 (for example, a hardware configuration in which a CPU, a GPU, a ROM, a RAM, an input unit, an output unit, a communication unit, a storage unit (for example, a storage unit realized by an HDD, an SSD, or the like), an external media drive, and the like are connected by a bus Bus).

また上記実施形態の各機能部をソフトウェアにより実現する場合、当該ソフトウェアは、図１４に示したハードウェア構成を有する単独のコンピュータを用いて実現されるものであってもよいし、複数のコンピュータを用いて分散処理により実現されるものであってもよい。 Further, when each functional unit of the above embodiment is realized by software, the software may be realized by using a single computer having the hardware configuration shown in FIG. 14, or a plurality of computers. It may be realized by using and distributed processing.

また上記実施形態における処理方法の実行順序は、必ずしも上記実施形態の記載に制限されるものではなく、発明の要旨を逸脱しない範囲で、実行順序を入れ替えることができるものである。 Further, the execution order of the processing method in the above embodiment is not necessarily limited to the description of the above embodiment, and the execution order can be changed without departing from the gist of the invention.

前述した方法をコンピュータに実行させるコンピュータプログラム、及びそのプログラムを記録したコンピュータ読み取り可能な記録媒体は、本発明の範囲に含まれる。ここでコンピュータ読み取り可能な記録媒体としては、例えば、フレキシブルディスク、ハードディスク、ＣＤ−ＲＯＭ、ＭＯ、ＤＶＤ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、大容量ＤＶＤ、次世代ＤＶＤ、半導体メモリを挙げることができる。 A computer program that causes a computer to perform the above-mentioned method, and a computer-readable recording medium that records the program are included in the scope of the present invention. Examples of computer-readable recording media include flexible disks, hard disks, CD-ROMs, MOs, DVDs, DVD-ROMs, DVD-RAMs, large-capacity DVDs, next-generation DVDs, and semiconductor memories.

上記コンピュータプログラムは、上記記録媒体に記録されたものに限らず、電気通信回線、無線または有線通信回線、インターネットを代表とするネットワーク等を経由して伝送されるものであってもよい。 The computer program is not limited to the one recorded on the recording medium, and may be transmitted via a telecommunication line, a wireless or wired communication line, a network typified by the Internet, or the like.

なお本発明の具体的な構成は、前述の実施形態に限られるものではなく、発明の要旨を逸脱しない範囲で種々の変更および修正が可能である。 The specific configuration of the present invention is not limited to the above-described embodiment, and various changes and modifications can be made without departing from the gist of the invention.

100, 200, 300 Speech synthesis processing device
1 Text analysis unit
2, 2A Full context label vector processing unit
3 Encoder unit
4, 4A Attention unit
5 Decoder unit
6 Vocoder
7 Phoneme continuation length estimation unit
8 Forced attention unit
9 Internal division processing unit
10 Context calculation unit

Claims

A speech synthesis processing device that executes speech synthesis processing using an encoder-decoder neural network with an arbitrary language as the processing target language, the device comprising:
a text analysis unit that executes text analysis processing on text data of the processing target language and acquires context label data;
a full context label vector processing unit that acquires, from the context label data acquired by the text analysis unit, a context label for a single phoneme, which is the phoneme targeted in the processing of acquiring the context label data, thereby acquiring optimized full context label data suitable for learning processing of the neural network;
an encoder unit that acquires hidden state data by executing neural network encoding processing based on the optimized full context label data;
a decoder unit that acquires acoustic feature data corresponding to the optimized full context label data by executing neural network decoding processing based on the hidden state data; and
a vocoder that acquires speech waveform data from the acoustic features acquired by the decoder unit.

The speech synthesis processing device according to claim 1, wherein the acoustic features are mel spectrogram data.

The speech synthesis processing device according to claim 1 or 2, wherein the vocoder acquires speech waveform data from the acoustic features by executing processing using a neural network model.

The speech synthesis processing device according to claim 3, wherein
the vocoder acquires speech waveform data from the acoustic features by executing processing using a neural network model composed of an invertible transformation network.
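
The claim does not state how the invertible transformation network is built. As one possible reading, the sketch below shows an affine coupling layer, a common building block of invertible networks, and checks that its inverse recovers the input exactly. The function `toy_conditioner` is a hypothetical stand-in for a trained sub-network; it is not taken from the source.

```python
import math
from typing import List, Tuple

Vector = List[float]


def toy_conditioner(x_a: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the sub-network predicting log-scale and shift."""
    log_scale = [0.5 * math.tanh(v) for v in x_a]
    shift = [2.0 * v + 1.0 for v in x_a]
    return log_scale, shift


def coupling_forward(x: Vector) -> Vector:
    """Leave the first half unchanged; affinely transform the second half."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = toy_conditioner(x_a)
    y_b = [b * math.exp(s) + t for b, s, t in zip(x_b, log_scale, shift)]
    return x_a + y_b


def coupling_inverse(y: Vector) -> Vector:
    """Undo coupling_forward; possible because the first half passed through unchanged."""
    if len(y) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = toy_conditioner(y_a)
    x_b = [(b - t) * math.exp(-s) for b, s, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


if __name__ == "__main__":
    x = [0.3, -1.2, 0.7, 2.0]
    print(coupling_inverse(coupling_forward(x)))
```

The conditioner itself never needs to be inverted, which is why an arbitrary neural network can take its place while the layer as a whole remains invertible.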

The speech synthesis processing device according to any one of claims 1 to 4, further comprising
a phoneme duration estimation unit that estimates phoneme durations from phoneme-level context label data, wherein
the full context label vector processing unit continuously outputs to the encoder unit, for the period corresponding to the estimated phoneme duration (the phoneme duration estimated by the phoneme duration estimation unit), the optimized full context label data of the phoneme corresponding to that estimated phoneme duration.
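
The continuous output of a phoneme's label for the length of its estimated duration can be pictured as repeating each label once per time step. The sketch below assumes durations are expressed as integer frame counts; that unit, and the function name `expand_by_duration`, are assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_duration(labels: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level label for its estimated number of frames."""
    if len(labels) != len(durations):
        raise ValueError("labels and durations must have the same length")
    expanded: List[T] = []
    for label, frames in zip(labels, durations):
        if frames < 0:
            raise ValueError("durations must be non-negative")
        expanded.extend([label] * frames)
    return expanded


if __name__ == "__main__":
    print(expand_by_duration(["k", "o"], [2, 3]))
```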

A speech synthesis processing method that takes an arbitrary language as the processing target language and executes speech synthesis processing using an encoder-decoder neural network, the method comprising:
a text analysis step of executing text analysis processing on text data in the processing target language and acquiring context label data;
a full context label vector processing step of acquiring, from the context label data acquired in the text analysis step, the context labels for a single phoneme, namely the phoneme that was the processing target when the context label data was acquired, and thereby acquiring optimized full context label data suited to the training of the neural network;
an encoding step of acquiring hidden state data by executing neural network encoding processing based on the optimized full context label data;
a decoding step of acquiring acoustic feature data corresponding to the optimized full context label data by executing neural network decoding processing based on the hidden state data; and
a vocoder processing step of acquiring speech waveform data from the acoustic features acquired in the decoding step.

A program for causing a computer to execute the speech synthesis processing method according to claim 6.

A speech synthesis processing device that takes an arbitrary language as the processing target language and executes speech synthesis processing using an encoder-decoder neural network, the device comprising:
a text analysis unit that executes text analysis processing on text data in the processing target language and acquires context label data;
a full context label vector processing unit that acquires, from the context label data acquired by the text analysis unit, the context labels for a single phoneme, namely the phoneme that was the processing target when the context label data was acquired, and thereby acquires optimized full context label data suited to the training of the neural network;
an encoder unit that acquires hidden state data by executing neural network encoding processing based on the optimized full context label data;
a phoneme duration estimation unit that estimates phoneme durations from phoneme-level context label data;
a forced attention unit that acquires first weighting coefficient data based on the phoneme durations estimated by the phoneme duration estimation unit;
an attention unit that acquires second weighting coefficient data based on the hidden state data acquired by the encoder unit;
an internal division processing unit that acquires composite weighting coefficient data by performing internal division processing on the first weighting coefficient data and the second weighting coefficient data;
a context calculation unit that acquires context state data by executing, using the composite weighting coefficient data, weighted combination processing on the hidden state data acquired by the encoder unit;
a decoder unit that acquires acoustic feature data corresponding to the optimized full context label data by executing neural network decoding processing based on the context state data; and
a vocoder that acquires speech waveform data from the acoustic features acquired by the decoder unit.
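
The interaction of the forced attention unit, the internal division processing unit, and the context calculation unit can be sketched as follows. Several details are assumptions not stated in the claim: the forced weights are taken to be one-hot over phonemes, durations are integer frame counts, and the internal division ratio `alpha` is a hypothetical parameter applied to the first (forced) weights.

```python
from typing import List, Sequence


def forced_weights(durations: Sequence[int], step: int) -> List[float]:
    """First weights: one-hot on the phoneme whose duration span covers `step` (assumed form)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    weights = [0.0] * len(durations)
    end = 0
    for index, frames in enumerate(durations):
        end += frames
        if step < end:
            weights[index] = 1.0
            return weights
    raise ValueError("step lies beyond the total estimated duration")


def internal_division(first: Sequence[float], second: Sequence[float],
                      alpha: float) -> List[float]:
    """Composite weights: alpha * first + (1 - alpha) * second, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(first) != len(second):
        raise ValueError("weight vectors must have the same length")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(first, second)]


def context_vector(weights: Sequence[float],
                   hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Context state: weighted sum of the encoder hidden states."""
    if len(weights) != len(hidden_states):
        raise ValueError("need one weight per hidden state")
    dim = len(hidden_states[0])
    return [sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)]


if __name__ == "__main__":
    durations = [2, 3]                      # estimated frames per phoneme
    hidden = [[1.0, 0.0], [0.0, 1.0]]       # toy encoder hidden states
    learned = [0.5, 0.5]                    # toy second (attention) weights
    mixed = internal_division(forced_weights(durations, 0), learned, alpha=0.5)
    print(mixed, context_vector(mixed, hidden))
```

If both weight vectors sum to one, so does the composite, because the internal division is a convex combination of the two.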