JP7079455B1

JP7079455B1 - Acoustic model learning devices, methods and programs, as well as speech synthesizers, methods and programs

Info

Publication number: JP7079455B1
Application number: JP2021163711A
Authority: JP
Inventors: 悟行松永; 大和大谷
Original assignee: AI Inc
Current assignee: AI Inc
Priority date: 2021-10-04
Filing date: 2021-10-04
Publication date: 2022-06-02
Anticipated expiration: 2041-10-04
Also published as: JP2023054702A

Abstract

[Problem] To provide an acoustic model learning device, method, and program that learn an acoustic model in consideration of the temporal structure of speech parameters and the relationships between their dimensions. [Solution] A model learning device 100 includes a generative model learning device 200 and a discriminative model learning device 300. The generative model learning device includes a speech parameter sequence prediction unit 210 that predicts synthetic speech parameters, a Gram matrix conversion unit 240 that converts the synthetic speech parameters into a Gram matrix, a discrimination unit 250 that obtains a first discrimination value, an error calculation unit 260 that calculates a first error, and an update unit 270 that updates a generative model 220. The discriminative model learning device includes a speech parameter sequence prediction unit 310 that predicts synthetic speech parameters, a Gram matrix conversion unit 340 that converts the synthetic speech parameters or natural linguistic features into a Gram matrix, a discrimination unit 350 that obtains a second discrimination value, an error calculation unit 370 that calculates a second error, and an update unit 380 that updates a discriminative model 360. [Selected drawing] FIG. 1

Description

Application of Article 30, Paragraph 2 of the Patent Act. Meeting name: Toyama Prefectural University Doctoral Dissertation Examination Committee. Date held: January 27, 2021 (Reiwa 3). Date published on website: March 20, 2021 (Reiwa 3). Publication address: http://id.nii.ac.jp/1254/00000389/

Embodiments of the present invention relate to a speech synthesis technique for synthesizing speech according to input text.

One method of generating the synthetic speech of a target speaker from that speaker's recorded utterances is speech synthesis based on a DNN (Deep Neural Network). This technique consists of a DNN acoustic model learning device and a speech synthesizer. The DNN acoustic model learning device learns a DNN acoustic model from a natural linguistic feature sequence and a natural speech parameter sequence extracted from the utterance data. The DNN acoustic model models the synthetic speech parameter sequence frame by frame. Based on the learned DNN acoustic model, the speech synthesizer generates a synthetic speech parameter sequence from the natural linguistic feature sequence of the text that the user wants to synthesize, and outputs a synthetic speech waveform.

In Patent Document 1, an acoustic model learning device uses natural speech data of a plurality of speakers, a plurality of text data corresponding to that natural speech data, a plurality of speaker data, and the like as training data, and trains an acoustic model and a discriminant model alternately using a generative adversarial network (GAN).

In Non-Patent Document 1, when a DNN acoustic model is trained without a GAN, the synthetic speech parameter sequence output by the DNN acoustic model is converted into a Gram matrix to obtain the loss function. That is, the Gram matrix is used for the generation error between the synthetic speech parameter sequence generated by the DNN acoustic model and the natural speech parameter sequence.

Japanese Unexamined Patent Application Publication No. 2020-060633

Shinnosuke TAKAMICHI, et al., Sampling-based speech parameter generation using moment-matching networks, DOI: 10.21437/INTERSPEECH.2017-362, https://www.semanticscholar.org/paper/Sampling-Based-Speech-Parameter-Generation-Using-Takamichi-Koriyama/a7d8dca8380f771d1617d619c8c877df8f90d849

In a generative adversarial network, the DNN acoustic model is trained so as to minimize both the generation error between the synthetic speech parameter sequence it generates and the natural speech parameter sequence, and the discrimination error of a DNN discriminative model that distinguishes natural speech parameter sequences from synthetic ones. However, both the generation error and the discrimination error consider only the information of a single time frame, or of a few frames before and after it, so the information of the natural speech parameter sequence as a whole cannot be taken into account.

The present invention was completed through intensive research focused on this problem. Its object is to learn an acoustic model that takes into account the temporal structure of the speech parameter sequence and the relationships between its dimensions, and to provide a speech synthesis technique that uses the appropriately modeled acoustic model.

To solve the above problem, a first invention is an acoustic model learning device comprising: a corpus storage unit that stores, in utterance units, natural linguistic feature sequences and natural speech parameter sequences extracted from a plurality of utterances; a generative model learning device having a generative model storage unit that stores a generative model for predicting a synthetic speech parameter sequence from a natural linguistic feature sequence, a first speech parameter sequence prediction unit that receives the natural linguistic feature sequence as input and predicts a synthetic speech parameter sequence using the generative model, a first Gram matrix conversion unit that converts the synthetic speech parameter sequence into a Gram matrix, a first discrimination unit that receives the Gram matrix as input and obtains a first discrimination value using a discriminative model, a first calculation unit that calculates a first error relating to the natural linguistic feature sequence, the synthetic speech parameter sequence, and the first discrimination value, and a first update unit that updates the generative model based on the first error; and a discriminative model learning device having a second speech parameter sequence prediction unit that receives the natural linguistic feature sequence as input and predicts a synthetic speech parameter sequence using the generative model, a second Gram matrix conversion unit that converts the synthetic speech parameter sequence or the natural linguistic feature sequence into a Gram matrix, a discriminative model storage unit that stores the discriminative model, a second discrimination unit that receives the Gram matrix as input and obtains a second discrimination value using the discriminative model, a second calculation unit that calculates a second error relating to the second discrimination value, and a second update unit that updates the discriminative model based on the second error.

A second invention is the acoustic model learning device according to the first invention, wherein the generative model is a feedforward neural network type model.

A third invention is the acoustic model learning device according to the first invention, wherein the discriminative model is a convolutional neural network type model.

A fourth invention is an acoustic model learning method comprising: a generative model learning method that takes as input a natural linguistic feature sequence from a corpus storing, in utterance units, natural linguistic feature sequences and natural speech parameter sequences extracted from a plurality of utterances, predicts a synthetic speech parameter sequence using a generative model for predicting a synthetic speech parameter sequence from a natural linguistic feature sequence, converts the synthetic speech parameter sequence into a Gram matrix, inputs the Gram matrix and obtains a first discrimination value using a discriminative model, calculates a first error relating to the natural linguistic feature sequence, the synthetic speech parameter sequence, and the first discrimination value, and updates the generative model based on the first error; and
a discriminative model learning method that takes the natural linguistic feature sequence as input, predicts a synthetic speech parameter sequence using the generative model, converts the synthetic speech parameter sequence or the natural linguistic feature sequence into a Gram matrix, inputs the Gram matrix and obtains a second discrimination value using the discriminative model, calculates a second error relating to the second discrimination value, and updates the discriminative model based on the second error.

A fifth invention is an acoustic model learning program comprising: a generative model learning program that causes a computer to execute a step of taking as input a natural linguistic feature sequence from a corpus storing, in utterance units, natural linguistic feature sequences and natural speech parameter sequences extracted from a plurality of utterances, and predicting a synthetic speech parameter sequence using a generative model for predicting a synthetic speech parameter sequence from a natural linguistic feature sequence, a step of converting the synthetic speech parameter sequence into a Gram matrix, a step of inputting the Gram matrix and obtaining a first discrimination value using a discriminative model, a step of calculating a first error relating to the natural linguistic feature sequence, the synthetic speech parameter sequence, and the first discrimination value, and a step of updating the generative model based on the first error; and a discriminative model learning program that causes a computer to execute a step of taking the natural linguistic feature sequence as input and predicting a synthetic speech parameter sequence using the generative model, a step of converting the synthetic speech parameter sequence or the natural linguistic feature sequence into a Gram matrix, a step of inputting the Gram matrix and obtaining a second discrimination value using the discriminative model, a step of calculating a second error relating to the second discrimination value, and a step of updating the discriminative model based on the second error.

A sixth invention is a speech synthesizer comprising: a corpus storage unit that stores the linguistic feature sequence of a text to be synthesized; a generative model storage unit that stores a generative model, trained by the acoustic model learning device according to the first invention, for predicting a synthetic speech parameter sequence from a linguistic feature sequence; a vocoder storage unit that stores a vocoder for generating a speech waveform; a speech parameter sequence prediction unit that receives the linguistic feature sequence as input and predicts a synthetic speech parameter sequence using the generative model; and a waveform synthesis processing unit that receives the synthetic speech parameter sequence as input and generates a synthetic speech waveform using the vocoder.

A seventh invention is a speech synthesis method that takes as input the linguistic feature sequence of a text to be synthesized, predicts a synthetic speech parameter sequence using a generative model, trained by the acoustic model learning method according to the fourth invention, for predicting a synthetic speech parameter sequence from a linguistic feature sequence, and then takes the synthetic speech parameter sequence as input and generates a synthetic speech waveform using a vocoder for generating a speech waveform.

An eighth invention is a speech synthesis program that causes a computer to execute: a step of taking as input the linguistic feature sequence of a text to be synthesized and predicting a synthetic speech parameter sequence using a generative model, trained by the acoustic model learning program according to the fifth invention, for predicting a synthetic speech parameter sequence from a linguistic feature sequence; and a step of taking the synthetic speech parameter sequence as input and generating a synthetic speech waveform using a vocoder for generating a speech waveform.

According to the present invention, it is possible to learn an acoustic model that takes into account the temporal structure of the speech parameter sequence and the relationships between its dimensions, and to provide a speech synthesis technique that uses the appropriately modeled acoustic model.

A functional block diagram of the model learning device according to the embodiment of the present invention.
A schematic explanatory diagram of the GDC-GAN according to the embodiment of the present invention.
A functional block diagram of the speech synthesizer according to the embodiment of the present invention.
A spectrogram of the original speech (reference data).
A spectrogram obtained without using a GAN (anchor data).
A spectrogram of Comparative Example 1 (FFNN-GAN data).
A spectrogram of Comparative Example 2 (CNN-GAN data).
A spectrogram according to the embodiment of the present invention (GDC-GAN data).
A spectral envelope obtained without using a GAN (anchor data).
A spectral envelope of Comparative Example 1 (FFNN-GAN data).
A spectral envelope of Comparative Example 2 (CNN-GAN data).
A spectral envelope according to the embodiment of the present invention (GDC-GAN data).

Embodiments of the present invention will be described with reference to the drawings. Parts common to the figures are given the same reference numerals, and duplicate descriptions are omitted. In the figures, rectangles represent processing units, parallelograms represent data, and cylinders represent databases. Solid arrows represent the flow of processing, and dotted arrows represent database input and output.

The processing units and databases are functional blocks. They are not limited to hardware implementations and may be implemented on a computer as software; the form of implementation is not limited. For example, they may be installed on a dedicated server connected to a client terminal such as a personal computer via a wired or wireless communication line (such as the Internet), or they may be implemented using a so-called cloud service.

In this embodiment, "natural speech" means natural speech uttered by a speaker. A "natural linguistic feature sequence" means linguistic features based on natural speech data, and a "natural speech parameter sequence" means acoustic features based on natural speech data. "Synthetic speech", on the other hand, means artificial speech generated by a speech synthesizer. A "synthetic speech parameter sequence" means acoustic features based on synthetic speech data.

[A. Overview of this embodiment]
In this embodiment, a DNN prediction model for predicting a synthetic speech parameter sequence is trained using a generative adversarial network (GAN). The DNN that predicts (or generates) a synthetic speech parameter sequence from a natural linguistic feature sequence is called the generative model (also called the "acoustic model"), and the DNN that distinguishes a natural speech parameter sequence from a predicted synthetic speech parameter sequence is called the discriminative model. The discriminative model judges whether its input is genuine (a natural speech parameter sequence) or fake (a synthetic speech parameter sequence).

That is, in GAN-based training, the generative model is trained to deceive the discriminative model (that is, to predict synthetic speech parameter sequences close to natural speech parameter sequences), and the discriminative model is trained to judge accurately between true (a natural speech parameter sequence) and false (a synthetic speech parameter sequence).

In this embodiment, both the natural and the synthetic speech parameter sequences are converted into Gram matrices, and these Gram matrices are input alternately to the discriminative model for training. This makes it possible to learn a generative model that takes into account the temporal structure of the speech parameter sequence and the relationships between its dimensions, which in turn enables speech synthesis with an appropriately modeled generative model.

(a1. Model learning process)
The GAN-based model learning process alternates between two learning processes. One trains the generative model, and the other trains the discriminative model. Each of these learning processes requires both the generative model and the discriminative model.

In the GAN training stage, the speech parameter sequence prediction process based on the generative model takes a natural linguistic feature sequence as input and outputs a synthetic speech parameter sequence. The discrimination process based on the discriminative model, in turn, is trained to judge the input as true when a natural speech parameter sequence is input and as false when a synthetic speech parameter sequence is input.
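The true/false training targets described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the loss of the embodiment: binary cross-entropy is used here only as a common choice for GAN training, and the function names `bce`, `discriminator_loss`, and `generator_adversarial_loss` are hypothetical. The discrimination values are assumed to be probabilities in the range 0 to 1.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one discrimination value (assumed loss)."""
    eps = 1e-12
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_natural, d_synthetic):
    """Natural input should be judged true (1), synthetic input false (0)."""
    return bce(d_natural, 1.0) + bce(d_synthetic, 0.0)


def generator_adversarial_loss(d_synthetic):
    """The generative model is rewarded when synthetic input is judged true."""
    return bce(d_synthetic, 1.0)
```

With this formulation, a discriminative model that outputs 0.9 for natural input and 0.1 for synthetic input has a lower loss than one that outputs 0.5 for both, while the generative model's loss falls as the discrimination value of its output approaches 1.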

In this embodiment, the natural speech parameter sequence and the synthetic speech parameter sequence are each converted into a Gram matrix before being passed to the discriminative model. The discriminative model receives the resulting Gram matrix as its input.
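The Gram matrix conversion can be illustrated with a short sketch. The formula used in the embodiment is not given in this passage, so the definition below is an assumption: for a sequence of T frames with D dimensions each, the Gram matrix is taken to be the D x D sum, over all frames, of the outer product of each frame with itself. The function name `gram_matrix` is hypothetical.

```python
def gram_matrix(sequence):
    """Convert a T x D parameter sequence into a D x D Gram matrix.

    sequence: list of T frames, each a list of D floats.
    Entry (i, j) is the sum over all frames of frame[i] * frame[j]
    (assumed definition, not taken from the source).
    """
    if not sequence:
        raise ValueError("sequence must contain at least one frame")
    dim = len(sequence[0])
    gram = [[0.0] * dim for _ in range(dim)]
    for frame in sequence:
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimension")
        for i in range(dim):
            for j in range(dim):
                gram[i][j] += frame[i] * frame[j]
    return gram
```

Under this assumed definition, every frame of the utterance contributes to every entry, and the size of the result depends only on D, so utterances of different lengths yield inputs of the same shape for the discriminative model.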

(a2. Speech synthesis process)
In the speech synthesis process, the generative model obtained by the GAN-based model learning of this embodiment is used to predict a synthetic speech parameter sequence from a given linguistic feature sequence, and a vocoder is used to generate a synthetic speech waveform.

[B. Specific configuration of the model learning device]
(b1. Overall configuration of the model learning device 100)
FIG. 1 is a functional block diagram of the model learning device according to this embodiment. The GAN-based model learning device 100 includes a generative model learning device 200, a discriminative model learning device 300, and a corpus storage unit 110. The model learning device 100 is also called an acoustic model learning device.

As databases, the model learning device 100 includes the corpus storage unit 110, a generative model storage unit 220, and a discriminative model storage unit 360. As processing units, the generative model learning device 200 includes a speech parameter sequence prediction unit 210, a Gram matrix conversion unit 240, a discrimination unit 250, an error calculation unit 260, an update unit 270, and a normalization unit 280. As processing units, the discriminative model learning device 300 includes a speech parameter sequence prediction unit 310, a normalization unit 330, a Gram matrix conversion unit 340, a discrimination unit 350, an error calculation unit 370, and an update unit 380.

First, the speech of one or more speakers is recorded in advance. Here, about 2000 sentences are read aloud (uttered), the utterances are recorded, and a speech dictionary (also called a speech corpus) is created for each speaker. Each speech corpus is assigned a speaker ID (speaker identification information). In this embodiment, the speech corpus of one female speaker was used. This speaker is a professional announcer whose native language is Japanese. The speech corpus of this embodiment uses speech in a read-aloud style.

The speech corpus stores, in utterance units, the contexts, speech waveforms, and natural acoustic features extracted from the utterances. An utterance unit means one sentence. A context (also called a "linguistic feature") is the result of text analysis of each sentence and represents factors that affect the speech waveform (phoneme sequence, accent, intonation, and so on). A speech waveform is the waveform captured by a microphone when a person reads each sentence aloud.

The DNN models the acoustic features frame by frame. The acoustic features needed to synthesize speech are duration, fundamental frequency, spectral envelope, and aperiodicity index. Duration is a phoneme-level acoustic feature. Fundamental frequency, spectral envelope, and aperiodicity index are frame-level acoustic features.

Predicting frame-level acoustic features requires time frame information, and obtaining time frame information requires durations. Therefore, durations are first predicted from the phoneme-level linguistic features. Next, the fundamental frequency, spectral envelope, and aperiodicity index are predicted from frame-level linguistic features to which the time frame information derived from the durations has been added. In this way, a speech parameter sequence can be represented as a structure consisting of time frames and the relationships between the dimensions of each acoustic feature.
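The second step above, turning phoneme-level linguistic features into frame-level features with time frame information attached, can be sketched as follows. The source does not specify what the added time frame information looks like, so the two appended values (relative position within the phoneme and the phoneme's duration in frames) are assumptions, as is the function name `expand_to_frames`.

```python
def expand_to_frames(phoneme_features, durations):
    """Repeat each phoneme's features once per frame of its duration.

    phoneme_features: list of per-phoneme feature lists.
    durations: list of durations in frames, one per phoneme.
    Each output frame gets two extra values (assumed format): the relative
    position of the frame within its phoneme and the phoneme duration.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames = []
    for feats, n_frames in zip(phoneme_features, durations):
        for k in range(n_frames):
            position = k / n_frames  # relative position in [0, 1)
            frames.append(list(feats) + [position, float(n_frames)])
    return frames
```

For example, two phonemes with durations of 2 and 3 frames produce 5 frame-level feature vectors, which then line up one-to-one with the frame-level acoustic features.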

Because a DNN is expressed as sums of products of matrices, the elements of the input data that take large values become dominant. In addition, because a DNN is trained based on the error between the output data and the teacher data, the errors of the teacher data elements that take large values become dominant. Data normalization is needed to prevent these problems. The aperiodicity index in this embodiment ranges from 0 to 1 and does not need to be normalized, so normalization and its inverse (denormalization) may be omitted for it. Because the data are time-series data, they are also called sequences.

When a DNN is trained on normalized teacher data, the scale of the DNN's output data is the same as the scale of the normalized teacher data. To restore the original scale of the output data, the inverse of the normalization applied to the teacher data (denormalization) must be applied to the output data.

Min-Max normalization is used to normalize the linguistic features. Min-Max normalization rescales the data so that the minimum value is 0 and the maximum value is 1. Mean-Var normalization is used to normalize the acoustic features. Mean-Var normalization rescales the data so that the mean is 0 and the standard deviation is 1, matching the standard normal distribution.
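The two normalization methods, together with the denormalization needed to restore the scale of the DNN output, can be sketched as follows for a single feature dimension. This is a minimal illustration rather than the embodiment's implementation; the function names and the handling of a constant dimension (mapped to 0) are assumptions.

```python
import math


def min_max_fit(values):
    """Return the (minimum, maximum) of one feature dimension."""
    return min(values), max(values)


def min_max_normalize(values, lo, hi):
    """Rescale so that lo maps to 0 and hi maps to 1."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant dimension (assumed handling)
    return [(v - lo) / span for v in values]


def mean_var_fit(values):
    """Return the (mean, standard deviation) of one feature dimension."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def mean_var_normalize(values, mean, std):
    """Rescale to mean 0 and standard deviation 1."""
    if std == 0:
        return [0.0 for _ in values]  # constant dimension (assumed handling)
    return [(v - mean) / std for v in values]


def mean_var_denormalize(values, mean, std):
    """Inverse of mean_var_normalize: restore the original scale."""
    return [v * std + mean for v in values]
```

In use, the statistics would be computed once from the training data and reused: the teacher data are normalized with `mean_var_normalize` before training, and the same `mean` and `std` are passed to `mean_var_denormalize` to bring the model output back to the original scale.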

A DNN is a model that represents a one-to-one correspondence between input and output. For DNN speech synthesis, the correspondence (phoneme boundaries) between the frame-level acoustic feature sequence and the phoneme-level linguistic feature sequence must therefore be set in advance, and pairs of frame-level acoustic features and linguistic features must be prepared. These pairs correspond to the speech parameter sequence and the linguistic feature sequence of this embodiment.

In this embodiment, the natural linguistic feature sequence and the natural speech parameter sequence from the speech dictionary described above are used as the linguistic feature sequence and the speech parameter sequence. The corpus storage unit 110 stores, in utterance units, the natural linguistic feature sequence 120 and the natural speech parameter sequence 130 extracted from the speech corpus of the one female speaker described above. In the model learning device 100, the teacher data sequence is the natural speech parameter sequence 130, and the input data sequence is the natural linguistic feature sequence 120. The output data sequences are the synthetic speech parameter sequences 230 and 320 described later.

(b2. Functional blocks of the generative model learning device 200)
The speech parameter sequence prediction unit 210 predicts the synthetic speech parameter sequence 230 from the natural linguistic feature sequence 120 using the generative model stored in the generative model storage unit 220. The natural linguistic feature sequence 120 is normalized by the Min-Max method before the synthetic speech parameter sequence 230 is predicted. For brevity, the synthetic speech parameter sequence 230 is hereinafter treated as already normalized.

The synthetic speech parameter sequence 230 is used both to obtain L_G (the loss function for the generation error) and to obtain L_D (the loss function for the discrimination error), both described later. For L_G, the sequence is input directly to the error calculation unit 260 described later. For L_D, the Gram matrix conversion unit 240 converts the sequence into a Gram matrix, the discrimination unit 250 obtains the discrimination value sequence described later using the discriminative model stored in the discriminative model storage unit 360, and that sequence is input to the error calculation unit 260.

The normalization unit 280 normalizes the natural speech parameter sequence 130 by the Mean-Var method and outputs it to the error calculation unit 260. For brevity, the natural speech parameter sequence 130 is hereinafter treated as already normalized.

The error calculation unit 260 calculates the error using loss functions. In GAN-based learning, the error can be expressed as the sum of L_G and L_D.

L_G is the loss function for the generation error of the generative model. Its inputs are the synthetic speech parameter sequence 230 and the output of the normalization unit 280 (that is, the natural speech parameter sequence). In other words, L_G represents the error (loss) between the synthetic and natural speech parameter sequences.

L_D is the loss function for the discrimination error. Its inputs are the discrimination value sequence obtained by the discrimination unit 250, described later, and the true value sequence representing "true" 291.

The update unit 270 updates the generative model stored in the generative model storage unit 220 by a gradient method, based on the error calculated by the error calculation unit 260. The gradient method used here is Adam.

(b3. Functional blocks of the discriminative model learning device 300)
The generative model storage unit 220 of the discriminative model learning device 300 stores the same generative model as the generative model storage unit 220 of the generative model learning device 200. That is, whenever the generative model learning device 200 updates the generative model, the updated generative model is stored.

Like the speech parameter sequence prediction unit 210 of the generative model learning device 200, the speech parameter sequence prediction unit 310 predicts the synthetic speech parameter sequence 320 from the natural linguistic feature sequence 120 using the generative model stored in the generative model storage unit 220. The synthetic speech parameter sequence 320 is normalized in the same way as the synthetic speech parameter sequence of the generative model learning device 200. The normalization unit 330 normalizes the natural speech parameter sequence 130 in the same way as the normalization unit 280 of the generative model learning device 200.

The Gram matrix conversion unit 340 takes as input either the output of the normalization unit 330 or the synthetic speech parameter sequence 320 and converts it into a Gram matrix. The update process of the discriminative model is described separately for each input. In FIG. 1, the processing that follows when the output of the normalization unit 330 (that is, the normalized natural speech parameter sequence) is input is shown by a one-dot chain line. The processing that follows when the synthetic speech parameter sequence 320 is input is shown by a two-dot chain line.

First, the update process of the discriminative model for the one-dot chain line case is described. The Gram matrix conversion unit 340 converts the output of the normalization unit 330 (the normalized natural speech parameter sequence) into a Gram matrix. The discrimination unit 350 outputs the discrimination value sequence described later using the discriminative model stored in the discriminative model storage unit 360.

The error calculation unit 370 takes as input the discrimination value sequence and the true value sequence representing "true" 391, and calculates L_D (the loss function for the discrimination error). The update unit 380 updates the discriminative model stored in the discriminative model storage unit 360 by a gradient method, based on the error calculated by the error calculation unit 370. The gradient method used here is Adam.

Next, the update process of the discriminative model for the two-dot chain line case is described. The Gram matrix conversion unit 340 converts the synthetic speech parameter sequence 320 into a Gram matrix. The discrimination unit 350 outputs a discrimination value sequence using the discriminative model stored in the discriminative model storage unit 360.

The error calculation unit 370 takes as input the discrimination value sequence and the false value sequence representing "false" 392, and calculates L_D (the loss function for the discrimination error). The update unit 380 updates the discriminative model stored in the discriminative model storage unit 360 by a gradient method, based on the error calculated by the error calculation unit 370. The gradient method used here is Adam.

(b4. Sequences and loss functions used in the error calculation)
The natural linguistic feature sequence 120 and the natural speech parameter sequence 130, which form the training data set of the generative model, are described using equations. The natural linguistic feature sequence 120 is defined by Equation (1).

x is the natural linguistic feature sequence 120, x_t is the natural linguistic feature vector at time frame t, and x_t^(k) is the natural linguistic feature of dimension k at time frame t. The transpose (superscript T) is used twice, inside and outside the vector, in order to take time information into account. The subscripts t and T are the index and the total number of time frames, respectively. In the present embodiment, the frame interval is 5 ms and the number of time frames is 1000. The superscripts (k) and (K) are the dimension index and the number of dimensions of the natural linguistic feature vector, respectively. The number of dimensions in the present embodiment is 521.

The natural speech parameter sequence 130 is defined by Equation (2).

y is the natural speech parameter sequence 130, y_t is the natural acoustic feature vector at time frame t, and y_t^(d) is the natural acoustic feature of dimension d at time frame t. The superscripts (d) and (D) are the dimension index and the number of dimensions of the acoustic feature vector, respectively. In the present embodiment, an FFNN (Feed-Forward Neural Network) is used as the generative model, and the number of dimensions is 521. An FFNN is used because it requires no post-processing and is therefore suited to computing environments with limited computational resources.

The synthetic speech parameter sequence 230 is defined by Equation (3) as the speech parameter sequence, corresponding to y, that the generative model predicts from x.

G is the generative model, y^ is the synthetic speech parameter sequence, y^_t is the predicted acoustic feature vector at time frame t, and y^_t^(d) is the predicted acoustic feature of dimension d at time frame t. The hat symbol "^" properly belongs above "y", but because of the character codes available in the specification, "y" and "^" are written side by side.
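Equations (1) to (3) appear as images in the original publication and are not reproduced in this text. A form consistent with the symbol definitions above, including the double use of the transpose, would be the following; treating each frame vector as a column vector is an assumption.

```latex
\begin{align}
\boldsymbol{x} &= \left[\boldsymbol{x}_1^{\top}, \ldots, \boldsymbol{x}_t^{\top}, \ldots, \boldsymbol{x}_T^{\top}\right]^{\top},
& \boldsymbol{x}_t &= \left[x_t^{(1)}, \ldots, x_t^{(k)}, \ldots, x_t^{(K)}\right]^{\top} \tag{1} \\
\boldsymbol{y} &= \left[\boldsymbol{y}_1^{\top}, \ldots, \boldsymbol{y}_t^{\top}, \ldots, \boldsymbol{y}_T^{\top}\right]^{\top},
& \boldsymbol{y}_t &= \left[y_t^{(1)}, \ldots, y_t^{(d)}, \ldots, y_t^{(D)}\right]^{\top} \tag{2} \\
\hat{\boldsymbol{y}} &= G(\boldsymbol{x}) = \left[\hat{\boldsymbol{y}}_1^{\top}, \ldots, \hat{\boldsymbol{y}}_t^{\top}, \ldots, \hat{\boldsymbol{y}}_T^{\top}\right]^{\top},
& \hat{\boldsymbol{y}}_t &= \left[\hat{y}_t^{(1)}, \ldots, \hat{y}_t^{(d)}, \ldots, \hat{y}_t^{(D)}\right]^{\top} \tag{3}
\end{align}
```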

The true values (291 and 391) and the false value 392, which are used as teacher data for the discriminative model 360 and correspond to y and y^, are defined by Equation (4).

z is the true/false value sequence used as teacher data, and z_t is the true/false value at time frame t. z^(R) is the true value sequence, consisting of T true values. z^(F) is the false value sequence, consisting of T false values.

The discrimination unit 350 takes y or y^ as input and outputs z^ using the discriminative model 360. The discrimination unit 250 of the generative model learning device 200 takes y^ as input and outputs z^ using the discriminative model 360. z^ corresponds to z and is defined by Equation (5).

D is the discriminative model 360, z^ is the discrimination value sequence of D, and z^_t is the discrimination value of D at time frame t.

The generation error of the speech parameter sequence is defined as the mean absolute error between y (the output of the normalization unit 280) and y^ (230).

Here, L_G is the loss function for the generation error.

The discrimination error of y and y^ is defined as the cross entropy between z and z^.

Here, L_D is the loss function for the discrimination error.
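The two loss functions can be sketched as follows. Equations (6) and (7) are not reproduced in this text, so the details here are assumptions: the mean absolute error is averaged over all frames and dimensions, a true value is encoded as 1 and a false value as 0, the cross entropy is the binary form averaged over the sequence, and the function names and the clipping constant `eps` are illustrative.

```python
import math


def generation_loss(y, y_hat):
    """L_G sketch: mean absolute error between natural and synthetic parameters.

    y and y_hat are lists of frames; each frame is a list of D values.
    """
    total, count = 0.0, 0
    for y_t, y_hat_t in zip(y, y_hat):
        for a, b in zip(y_t, y_hat_t):
            total += abs(a - b)
            count += 1
    return total / count


def discrimination_loss(z, z_hat, eps=1e-12):
    """L_D sketch: binary cross entropy between targets z and discrimination values z_hat."""
    total = 0.0
    for target, value in zip(z, z_hat):
        p = min(max(value, eps), 1.0 - eps)  # keep log() finite
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(z)
```

With these conventions, training the discriminative model on a natural sequence would pass a target sequence of ones, and training it on a synthetic sequence would pass a target sequence of zeros.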

GAN-based learning alternates between training the generative model and training the discriminative model. The training of the discriminative model is described first. The error calculation unit 370 alternately receives the z^ obtained for y (the output of 330) and the z^ obtained for y^ (320). When the z^ for y is input, z is set to z^(R) (391) and L_D is calculated. When the z^ for y^ is input, z is set to z^(F) (392) and L_D is calculated. The update unit 380 then updates the model parameters of the discriminative model based on L_D in each case.

In the training of the generative model, the error calculation unit 260 first calculates the error as the sum of L_G and the L_D obtained when z for the z^ of y^ is set to z^(R) 291. The update unit 270 then updates the model parameters of the generative model based on this sum of L_G and L_D.

The sum of L_G and L_D is defined by Equation (8).

L_A is the loss function used when training the generative model, E_G is the expected value of L_G, and E_D is the expected value of L_D. E_G and E_D are used to adjust for the difference in scale between the generation error L_G and the discrimination error L_D. E_G and E_D are recalculated each time the parameters of the generative model and the discriminative model are updated.
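Equation (8) is not reproduced in this text. One form that matches the description, in which the ratio of expected values brings the discrimination error to the scale of the generation error, is shown below. This is an assumption based on the stated roles of E_G and E_D, not the equation from the publication.

```latex
L_A(\boldsymbol{y}, \hat{\boldsymbol{y}})
  = L_G(\boldsymbol{y}, \hat{\boldsymbol{y}})
  + \frac{E_G}{E_D}\, L_D\!\left(\boldsymbol{z}^{(R)}, \hat{\boldsymbol{z}}\right) \tag{8}
```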

The present embodiment targets the configuration of the speech parameter sequence prediction unit 210 in which the generative model is an FFNN. With an FFNN, the generation error at time frame t affects only the acoustic features at time frame t. Because the generative model thus does not take the temporal structure of the acoustic features into account, the discriminative model must capture it. When the discriminative model is a CNN (Convolutional Neural Network), however, the convolution width is limited, so the features extracted from the speech parameter sequence do not include features of the entire time series.

The present embodiment therefore uses a discriminative model that operates on the Gram matrix of the acoustic features, which represents the characteristics of the entire time series. A conventional discriminative model discriminates the acoustic features of each time frame. In contrast, the discriminative model of the present embodiment discriminates a Gram matrix that represents the entire sequence. As a result, the generative model captures the frame-level characteristics of the acoustic features, and the discriminative model captures the characteristics of the entire sequence. By training the generative model and the discriminative model on different criteria in this way, the GAN-based learning device of the present embodiment can capture the acoustic features from multiple perspectives.

The Gram matrix of the natural speech parameter sequence is defined by Equation (9).

g is the D × D Gram matrix of the natural speech parameter sequence. D is the number of dimensions.

The Gram matrix of the synthetic speech parameter sequence 320 predicted using the generative model is defined by Equation (10).

g^ is the D × D Gram matrix of the synthetic speech parameter sequence 320. These Gram matrices are divided by T in order to remove the influence of the number of time frames of the acoustic features. The configuration of the convolutional layers from the Gram matrix to the true/false output is the same as that of the discriminative model of DC-GAN (Deep Convolutional GAN), a convolutional-neural-network GAN that performs well in image generation (A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," ICLR 2016, International Conference on Learning Representations, San Juan, Puerto Rico, 2016).
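The Gram matrix conversion can be sketched as below. Equations (9) and (10) are not reproduced in this text; the sketch assumes the usual form, the sum over time frames of the outer product of each D-dimensional frame vector with itself, divided by T. The function name is hypothetical, and plain lists are used only to keep the example self-contained.

```python
def gram_matrix(frames):
    """Convert a T x D parameter sequence into a D x D Gram matrix divided by T.

    frames: list of T frames, each a list of D acoustic feature values.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    gram = [[0.0] * dim for _ in range(dim)]
    for y_t in frames:
        # Accumulate the outer product y_t * y_t^T for this frame.
        for i in range(dim):
            for j in range(dim):
                gram[i][j] += y_t[i] * y_t[j]
    # Dividing by T removes the dependence on the number of time frames.
    return [[value / num_frames for value in row] for row in gram]
```

The output size depends only on D, so utterances of different lengths map to matrices of the same shape, and repeating a sequence twice leaves the result unchanged because of the division by T.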

Because the GAN-based learning method of the present embodiment uses a Gram matrix together with DC-GAN, it is called GDC-GAN (Gram matrix DC-GAN), a convolutional-neural-network GAN based on the Gram matrix.

FIG. 2 is a schematic explanatory diagram of the GDC-GAN according to the embodiment of the present invention. The flow of processing in which a Gram matrix is used as the input to the discriminative model is described below. In the present embodiment, the linguistic feature sequence is represented by a T × K matrix, the speech parameter sequence by a T × D matrix, and the Gram matrix by a D × D matrix. T is 1000 frames, K is 521 dimensions, and D is 521 dimensions. The generative model is an FFNN and the discriminative model is a CNN.

When the generative model is an FFNN, it is trained frame by frame. Consequently, when the generation error is calculated, the error at time frame t affects only the speech parameters at time frame t.

When the discriminative model is a CNN, on the other hand, it is trained taking into account as many time frames as the convolution width (also referred to as "the number of convolution layers"). Consequently, when the discrimination error is calculated, the discrimination error at time frame t can affect the speech parameters in the time frames around t that fall within the convolution width. The discrimination error cannot, however, take into account the characteristics of the entire time series T.

In the present embodiment, the speech parameter sequence is converted into a Gram matrix before being input to the discriminative model. The Gram matrix can then be regarded as a static image, and the CNN can perform discrimination while taking the characteristics of the Gram matrix into account.

Using separate error criteria for the generation error and the discrimination error in this way makes it possible to capture the speech parameter sequence from multiple perspectives. When the generation error is calculated, only the generation error at time frame t, shown by the dotted line in the figure, affects the acoustic features at time frame t. When the discrimination error is calculated, the discrimination error of the entire time series, shown by the one-dot chain line in the figure, can be reflected in the discrimination value.

Note that if the speech parameter sequence were input to the discriminative model as is, without conversion into a Gram matrix, discrimination would be performed for each time frame, and the output of the discriminative model would be a discrimination value sequence.

[C. Specific configuration of the speech synthesizer]
FIG. 3 is a functional block diagram of the speech synthesizer according to the present embodiment. The speech synthesizer 400 includes, as databases, a corpus storage unit 410, a generative model storage unit 440, and a vocoder storage unit 470. The speech synthesizer 400 also includes, as processing units, a speech parameter sequence prediction unit 430 and a waveform synthesis processing unit 460.

The corpus storage unit 410 stores the linguistic feature sequence 420 extracted from the text that the user wants to synthesize (the text to be synthesized).

The speech parameter sequence prediction unit 430 takes the linguistic feature sequence 420 as input and applies the same normalization as in model training. It then predicts a speech parameter sequence from the normalized linguistic feature sequence using the trained generative model stored in the generative model storage unit 440, and applies denormalization. Finally, it outputs the synthetic speech parameter sequence 450.

The waveform synthesis processing unit 460 takes the synthetic speech parameter sequence 450 as input, processes it with the vocoder stored in the vocoder storage unit 470, and outputs the synthetic speech waveform 480.
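The synthesis flow described above can be sketched as a small pipeline. All names here are hypothetical: `normalize`, `generator`, `denormalize`, and `vocoder` stand in for the training-time normalization, the trained generative model, the inverse normalization, and the stored vocoder, none of which are specified at code level in the text.

```python
def synthesize(linguistic_features, normalize, generator, denormalize, vocoder):
    """Sketch of units 430 and 460: features -> parameters -> waveform.

    linguistic_features: list of T frames of linguistic feature values.
    """
    # Same normalization as at training time (unit 430).
    normalized = normalize(linguistic_features)
    # A feed-forward generator maps each frame independently.
    predicted = [generator(x_t) for x_t in normalized]
    # Undo the acoustic-feature normalization to get sequence 450.
    parameters = denormalize(predicted)
    # The vocoder turns the parameter sequence into waveform 480 (unit 460).
    return vocoder(parameters)
```

Passing the stages in as callables keeps the sketch independent of any particular network or vocoder implementation.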

[D. Speech evaluation]
(d1. Experimental conditions)
As in the description of the model learning device 100, the speech evaluation experiment used the speech corpus of a single female speaker, a professional announcer whose native language is Japanese. The speech is in a read-aloud style; 2000 utterances were prepared for training and a separate 100 utterances for evaluation. The linguistic features are a 521-dimensional vector sequence and were normalized by an intra-utterance normalization method so that outliers do not occur. The fundamental frequency was extracted at a 5 ms frame period from recorded speech sampled at 16 bit, 48 kHz. As preprocessing for training, the fundamental frequency was converted to the logarithmic scale and then interpolated over the silent and unvoiced segments.
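The fundamental-frequency preprocessing can be illustrated with a short sketch. The text states only that the fundamental frequency is log-transformed and then interpolated over silent and unvoiced segments; the use of linear interpolation, the convention that such frames are marked by 0.0, and the holding of the nearest voiced value at the edges are assumptions, and the function name is hypothetical.

```python
import math


def interpolate_log_f0(f0):
    """Log-transform F0 (Hz) and fill frames marked 0.0 by linear interpolation."""
    voiced = [(i, math.log(v)) for i, v in enumerate(f0) if v > 0.0]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    result = []
    k = 0  # index of the last voiced frame at or before the current frame
    for i in range(len(f0)):
        while k + 1 < len(voiced) and voiced[k + 1][0] <= i:
            k += 1
        left_index, left_value = voiced[k]
        if i <= left_index or k + 1 == len(voiced):
            # On a voiced frame, or outside the voiced range: hold the value.
            result.append(left_value)
        else:
            right_index, right_value = voiced[k + 1]
            weight = (i - left_index) / (right_index - left_index)
            result.append(left_value + weight * (right_value - left_value))
    return result
```

Filling the gaps gives the frame-wise generative model a continuous target in every frame instead of undefined values in unvoiced regions.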

The five conditions compared are as follows. 1) The original speech is called the "reference data". 2) The case without GAN (that is, the generative model trained with L_G, the loss function for the generation error, alone, without the GAN-based learning method) is called the "anchor data", in the sense that it is the lowest baseline. 3) Comparative Example 1 is the case in which the discriminative model is an FFNN, called the "FFNN-GAN data". 4) Comparative Example 2 is the case in which the discriminative model is a CNN, called the "CNN-GAN data". 5) The present embodiment is called the "GDC-GAN data". When the discriminative model is an FFNN or a CNN, its input is the speech parameter sequence rather than a Gram matrix. The generative model is an FFNN in all cases.

(d2. Experimental results)
FIGS. 4 to 8 show spectrograms, and FIGS. 9 to 12 show spectral envelopes. Speech carries information in the form of a frequency distribution and magnitude (power) that change from moment to moment. In a spectrogram, the horizontal axis is time, the vertical axis is frequency, and the grayscale shading is power; darker regions indicate higher power.

The spectral envelope is a feature that controls the timbre of synthetic speech. Here, the mel-cepstrum, one way of representing the spectral envelope, is used. It can emphasize the formants of the spectral envelope and represent an envelope with pronounced peaks and valleys.

FIG. 4 is the spectrogram of the original speech (reference data). FIG. 5 is the spectrogram for the case without GAN (anchor data). FIG. 6 is the spectrogram of Comparative Example 1 (FFNN-GAN data). FIG. 7 is the spectrogram of Comparative Example 2 (CNN-GAN data). FIG. 8 is the spectrogram according to the embodiment of the present invention (GDC-GAN data).

Region 510 in FIG. 5 shows that the anchor data fails to reproduce the high-frequency band of the original speech (see FIG. 4). FIG. 6 (FFNN-GAN data) shows strong horizontal stripes, indicating that the temporal variation is slow. That is, the temporal variation is slow overall and the synthesized speech is flat.

FIG. 7 (CNN-GAN data) shows the overall temporal variation more clearly than FIG. 6, although it is limited by the convolution width of the CNN model. However, a mottled pattern appears in the vertical direction, indicating that the synthesized speech is hoarse. For the GDC-GAN data, in contrast, region 520 in FIG. 8 shows that the formants appear clearly, and region 530 shows that the power in the low-frequency band is also strongly represented. That is, the overall temporal variation is clearly represented and the synthesized speech is clear.

The spectral envelope in each case is described with reference to FIGS. 9 to 12. Each figure plots the spectral envelope at time 0.68 of the spectrogram; "Reference" denotes the ground truth, that is, the spectral envelope of the original speech.

FIG. 9 (anchor data) shows that the curve does not match the Reference and that the temporal variation does not match. FIG. 10 (FFNN-GAN data) also shows that the temporal variation does not match. As shown at 610, however, the formants match, the spectral intensity is high, and the timbre sounds good.

FIG. 11 (CNN-GAN data) shows that the temporal variation is well represented but the shapes of the peaks do not match. In FIG. 12 (GDC-GAN data), in contrast, the formants match, as shown at 620. The temporal variation is also well represented, and the shapes of the peaks match. This shows that the GDC-GAN according to the present embodiment trains the generative model so that it acquires model parameters that take into account the temporal structure of the speech parameter sequence and the relationships between dimensions.

[E. Effects]
The model learning device 100 converts both the natural and the synthetic speech parameter sequences into Gram matrices and trains by inputting these Gram matrices alternately into the discriminative model. This makes it possible to train a generative model that takes into account the temporal structure of the speech parameter sequence and the relationships between dimensions. Speech synthesis with an appropriately modeled generative model thereby becomes possible.

Embodiments of the present invention have been described above; two or more of these examples may be combined, or one of them may be implemented in part. For example, the generative model is not limited to an FFNN and may be an RNN (Recurrent Neural Network). The vocoder of the speech synthesizer may also be a neural vocoder.

The present invention is in no way limited to the above description of the embodiments. Various modifications that a person skilled in the art could readily conceive, without departing from the scope of the claims, are also included in the present invention.

100 Model learning device
200 Generative model learning device
300 Discriminative model learning device
400 Speech synthesizer

Claims

An acoustic model learning device comprising:
a corpus storage unit that stores, on a per-utterance basis, natural language feature sequences and natural speech parameter sequences extracted from a plurality of utterances;
a generative model learning device having
a generative model storage unit that stores a generative model for predicting a synthetic speech parameter sequence from a natural language feature sequence,
a first speech parameter sequence prediction unit that takes the natural language feature sequence as input and predicts a synthetic speech parameter sequence using the generative model,
a first Gram matrix conversion unit that converts the synthetic speech parameter sequence into a first Gram matrix,
a first discrimination unit that receives the first Gram matrix and obtains a first discrimination value using a discriminative model for determining whether a sequence is a natural speech parameter sequence or a synthetic speech parameter sequence,
a first calculation unit that calculates a first error relating to a generation error and a discrimination error using the natural language feature sequence, the synthetic speech parameter sequence, and the first discrimination value, and
a first update unit that updates the generative model based on the first error; and
a discriminative model learning device having
a second speech parameter sequence prediction unit that takes the natural language feature sequence as input and predicts a synthetic speech parameter sequence using the generative model,
a second Gram matrix conversion unit that converts the synthetic speech parameter sequence or the natural speech parameter sequence into a second Gram matrix,
a discriminative model storage unit that stores the discriminative model,
a second discrimination unit that receives the second Gram matrix and obtains a second discrimination value using the discriminative model,
a second calculation unit that calculates a second error relating to a discrimination error using the second discrimination value, and
a second update unit that updates the discriminative model based on the second error.

The acoustic model learning device according to claim 1, wherein the generative model is a feedforward neural network type model.

The acoustic model learning device according to claim 1, wherein the discriminative model is a convolutional neural network type model.

An acoustic model learning method comprising:
a generative model learning method of
predicting a synthetic speech parameter sequence by taking as input a natural language feature sequence from a corpus that stores, on a per-utterance basis, natural language feature sequences and natural speech parameter sequences extracted from a plurality of utterances, and using a generative model for predicting a synthetic speech parameter sequence from a natural language feature sequence,
converting the synthetic speech parameter sequence into a first Gram matrix,
receiving the first Gram matrix and obtaining a first discrimination value using a discriminative model for determining whether a sequence is a natural speech parameter sequence or a synthetic speech parameter sequence,
calculating a first error relating to a generation error and a discrimination error using the natural language feature sequence, the synthetic speech parameter sequence, and the first discrimination value, and
updating the generative model based on the first error; and
a discriminative model learning method of
predicting a synthetic speech parameter sequence by taking the natural language feature sequence as input and using the generative model,
converting the synthetic speech parameter sequence or the natural speech parameter sequence into a second Gram matrix,
receiving the second Gram matrix and obtaining a second discrimination value using the discriminative model,
calculating a second error relating to a discrimination error using the second discrimination value, and
updating the discriminative model based on the second error.

An acoustic model learning program comprising:
a generative model learning program that causes a computer to execute
a step of predicting a synthetic speech parameter sequence by taking as input a natural language feature sequence from a corpus that stores, on a per-utterance basis, natural language feature sequences and natural speech parameter sequences extracted from a plurality of utterances, and using a generative model for predicting a synthetic speech parameter sequence from a natural language feature sequence,
a step of converting the synthetic speech parameter sequence into a first Gram matrix,
a step of receiving the first Gram matrix and obtaining a first discrimination value using a discriminative model for determining whether a sequence is a natural speech parameter sequence or a synthetic speech parameter sequence,
a step of calculating a first error relating to a generation error and a discrimination error using the natural language feature sequence, the synthetic speech parameter sequence, and the first discrimination value, and
a step of updating the generative model based on the first error; and
a discriminative model learning method that causes a computer to execute
a step of predicting a synthetic speech parameter sequence by taking the natural language feature sequence as input and using the generative model,
a step of converting the synthetic speech parameter sequence or the natural speech parameter sequence into a second Gram matrix,
a step of receiving the second Gram matrix and obtaining a second discrimination value using the discriminative model,
a step of calculating a second error relating to a discrimination error using the second discrimination value, and
a step of updating the discriminative model based on the second error.

A speech synthesizer comprising:
a corpus storage unit that stores a language feature sequence of text to be synthesized;
a generative model storage unit that stores a generative model, trained by the acoustic model learning device according to claim 1, for predicting a synthetic speech parameter sequence from a language feature sequence;
a vocoder storage unit that stores a vocoder for generating a speech waveform;
a speech parameter sequence prediction unit that takes the language feature sequence as input and predicts a synthetic speech parameter sequence using the generative model; and
a waveform synthesis processing unit that takes the synthetic speech parameter sequence as input and generates a synthetic speech waveform using the vocoder.

A speech synthesis method comprising:
predicting a synthetic speech parameter sequence by taking as input a language feature sequence of text to be synthesized and using a generative model, trained by the acoustic model learning method according to claim 4, for predicting a synthetic speech parameter sequence from a language feature sequence; and
generating a synthetic speech waveform by taking the synthetic speech parameter sequence as input and using a vocoder for generating a speech waveform.

A speech synthesis program that causes a computer to execute:
a step of predicting a synthetic speech parameter sequence by taking as input a language feature sequence of text to be synthesized and using a generative model, trained by the acoustic model learning program according to claim 5, for predicting a synthetic speech parameter sequence from a language feature sequence; and
a step of generating a synthetic speech waveform by taking the synthetic speech parameter sequence as input and using a vocoder for generating a speech waveform.