JP6802958B2

JP6802958B2 - Speech synthesis system, speech synthesis program and speech synthesis method

Info

Publication number: JP6802958B2
Application number: JP2017037151A
Authority: JP
Inventors: Kentaro Tachibana (橘 健太郎); Yoshinori Shiga (志賀 芳則)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2017-02-28
Filing date: 2017-02-28
Publication date: 2020-12-23
Anticipated expiration: 2037-02-28
Also published as: WO2018159402A1; JP2018141915A

Description

The present invention relates to speech synthesis technology based on statistical parametric speech synthesis (hereinafter also abbreviated as "SPSS").

Speech synthesis technology has long been widely applied to text-to-speech applications, multilingual translation services, and the like. SPSS is known as one such speech synthesis technique. SPSS is a framework that synthesizes speech based on statistical models. For the past decade or more, the main research subject in SPSS has been speech synthesis based on hidden Markov models (hereinafter also abbreviated as "HMM").

In recent years, speech synthesis based on deep neural networks (hereinafter also abbreviated as "DNN"), a form of deep learning, has attracted attention (see, for example, Non-Patent Document 1). The research results presented in Non-Patent Document 1 show that DNN-based speech synthesis can generate higher-quality speech than HMM-based speech synthesis.

Many SPSS systems use a vocoder as the source-filter model for speech generation. More specifically, the source-filter model consists of a vocal tract filter and an excitation source. The vocal tract filter models the vocal tract and is represented by spectral envelope parameters. The source signal, which models the excitation source (vocal fold vibration), is represented by mixing a pulse sequence with a noise component.

In commonly used vocoders, each frame of the excitation source is classified as either a voiced segment or an unvoiced segment. For a frame classified as voiced, a pulse sequence at the fundamental frequency (hereinafter also abbreviated as "F₀"), which corresponds to the pitch of the voice, is generated; for a frame classified as unvoiced, the excitation is generated as white noise. The voiced/unvoiced decision is made according to whether F₀ is non-zero (voiced) or zero (unvoiced). In typical SPSS, this F₀ sequence is represented as a discontinuous sequence that switches between a one-dimensional sequence and a zero-dimensional discrete symbol, and each frame requires a flag (hereinafter also abbreviated as the "V/UV flag") for switching between voiced and unvoiced (hereinafter also abbreviated as "V/UV").
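The frame-by-frame switching described above can be illustrated with a minimal Python sketch. It is not taken from this document: the function name, the frame length of 80 samples, and the 16 kHz sampling rate are assumptions chosen only for illustration. Each frame with non-zero F₀ receives pulses spaced one pitch period apart, and each frame with zero F₀ receives white noise.

```python
import random


def conventional_excitation(f0_per_frame, frame_len=80, sample_rate=16000, seed=0):
    """Build an excitation signal frame by frame, switching on the V/UV decision.

    f0_per_frame holds one F0 value (Hz) per frame; 0 marks an unvoiced frame.
    frame_len and sample_rate are illustrative assumptions.
    """
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0.0  # sample position of the next pulse, carried across frames
    for frame_idx, f0 in enumerate(f0_per_frame):
        start = frame_idx * frame_len
        if f0 > 0:
            # Voiced frame: unit pulses spaced one pitch period apart.
            period = sample_rate / f0
            next_pulse = max(next_pulse, float(start))
            frame = [0.0] * frame_len
            while next_pulse < start + frame_len:
                frame[int(next_pulse) - start] = 1.0
                next_pulse += period
        else:
            # Unvoiced frame: white (Gaussian) noise.
            frame = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]
        excitation.extend(frame)
    return excitation
```

Because the branch is chosen per frame from F₀ alone, a single wrong V/UV decision replaces a whole frame of pulses with noise or the reverse, which is the source of the quality degradation discussed in the following paragraphs.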

V/UV decision errors in individual frames, together with the difficulty of modeling an excitation source that outputs a discontinuous sequence, can degrade the quality of the synthesized speech.

Multi-space distribution (MSD) modeling has been proposed as one method for modeling such sequences (see, for example, Non-Patent Document 2). However, MSD modeling inherently involves the difficulty of representing both continuous and discrete sequences. Moreover, frames whose V/UV prediction is wrong often degrade the quality of the synthesized speech during vocoding. For example, a frame erroneously classified as voiced produces a buzzy quality, and a frame erroneously classified as unvoiced produces a hoarse quality.

Non-Patent Document 1: H. Zen, Andrew Senior, Mike Schuster, "Statistical parametric speech synthesis using deep neural networks", Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 26-31 May 2013
Non-Patent Document 2: K. Tokuda, T. Masuko, N. Miyazaki, T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling", Acoustics, Speech and Signal Processing (ICASSP), 1999 IEEE International Conference, 15-19 March 1999
Non-Patent Document 3: K. Yu and S. Young, "Continuous F0 modeling for HMM based statistical parametric speech synthesis", IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 5, pp. 1071-1079, 2011
Non-Patent Document 4: Javier Latorre, Mark J. F. Gales, Sabine Buchholz, Kate Knill, Masatsune Tamura, Yamato Ohtani, Masami Akamine, "Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?", Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference, 22-27 May 2011
Non-Patent Document 5: T. G. Csapó, G. Németh, and M. Cernak, "Residual-based excitation with continuous F0 modeling in HMM-based speech synthesis", in Lecture Notes in Artificial Intelligence, vol. 9449, A.-H. Dediu, C. Martín-Vide, and K. Vicsi, Eds. Budapest, Hungary: Springer International Publishing, pp. 27-38, 2015
Non-Patent Document 6: E. Banos, D. Erro, A. Bonafonte, and A. Moreno, "Flexible harmonic/stochastic modeling for HMM-based speech synthesis", in Proc. VJTH, pp. 145-148, 2008
Non-Patent Document 7: Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, Stefan Scherer, "COVAREP - A collaborative voice analysis repository for speech technologies", Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference, 4-9 May 2014
Non-Patent Document 8: Ioannis Stylianou, "Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification", Ecole Nationale Superieure des Telecommunications, 1996

Several solutions have been proposed for the problems described above.
The first approach interpolates the discontinuous F₀ sequence so that it can be treated as a continuous sequence (see, for example, Non-Patent Document 3). It has been shown that this approach allows F₀ to be modeled as a continuous sequence and improves quality. However, this approach still requires a V/UV decision at waveform generation time, so a discrete sequence must still be modeled.
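As a minimal sketch of the interpolation idea, the following Python function fills frames whose F₀ is zero by interpolating between the neighbouring voiced frames. Linear interpolation and the function name are assumptions made for illustration; the cited work may use a different interpolation scheme. The function also returns the original V/UV pattern, reflecting the point that this approach still needs a discrete V/UV sequence at waveform generation time.

```python
def interpolate_f0(f0_per_frame):
    """Fill unvoiced frames (F0 == 0) by linear interpolation between voiced frames.

    Returns (continuous_f0, vuv_flags); vuv_flags keeps the original V/UV pattern.
    Linear interpolation is an illustrative assumption.
    """
    vuv_flags = [f0 > 0 for f0 in f0_per_frame]
    voiced = [i for i, flag in enumerate(vuv_flags) if flag]
    if not voiced:
        # Nothing to interpolate from: return the track unchanged.
        return list(f0_per_frame), vuv_flags
    continuous = [float(f0) for f0 in f0_per_frame]
    # Hold the first and last voiced values at the edges of the utterance.
    for i in range(voiced[0]):
        continuous[i] = continuous[voiced[0]]
    for i in range(voiced[-1] + 1, len(continuous)):
        continuous[i] = continuous[voiced[-1]]
    # Interpolate linearly inside each gap between consecutive voiced frames.
    for left, right in zip(voiced, voiced[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            continuous[i] = (1.0 - weight) * continuous[left] + weight * continuous[right]
    return continuous, vuv_flags
```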

Another approach is to determine V/UV from some continuous sequence. For example, a method has been proposed that determines V/UV from an aperiodicity measure instead of a V/UV flag (see, for example, Non-Patent Document 4). Although this method achieves fully continuous modeling, a V/UV decision is still required at waveform generation time, so the effects of V/UV decision errors cannot be completely avoided.

As yet another approach, a method has been proposed that uses, instead of the V/UV flag, the maximum voiced frequency (hereinafter also abbreviated as "MVF"), which indicates the highest band in which the speech signal is periodic (see, for example, Non-Patent Document 5). Using the MVF, speech can be split so that the high-frequency band is treated as the aperiodic component and the low-frequency band as the periodic component. Because the MVF is modeled continuously, it can also be made to function as a V/UV flag by setting a threshold. However, because the signal is split into only two bands, a high-frequency band and a low-frequency band, the modeling accuracy for the periodic and aperiodic components is insufficient.
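The two uses of the MVF mentioned above can be sketched as follows. The function names and the 1000 Hz threshold are hypothetical values chosen for illustration only; they do not come from this document or from the cited work.

```python
def mvf_to_vuv(mvf_track_hz, threshold_hz=1000.0):
    """Derive a per-frame V/UV flag by thresholding a continuous MVF track.

    threshold_hz is a hypothetical value used only for illustration.
    """
    return [mvf > threshold_hz for mvf in mvf_track_hz]


def split_bands(bin_freqs_hz, mvf_hz):
    """Label each frequency bin as periodic (below the MVF) or aperiodic (at or above it)."""
    return ["periodic" if freq < mvf_hz else "aperiodic" for freq in bin_freqs_hz]
```

The second function makes the limitation visible: every bin falls into exactly one of two bands, so a band that actually contains both periodic and aperiodic energy cannot be represented.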

The present technology is intended to solve these problems, and its object is to provide a new method for SPSS that can reduce the impact on quality of V/UV decision errors in the acoustic model.

According to one aspect of the present invention, a speech synthesis system based on SPSS is provided. The speech synthesis system includes: a first extraction unit that extracts, for each unit interval, the fundamental frequency of a speech waveform corresponding to known text; a second extraction unit that extracts, for each unit interval, a periodic component and an aperiodic component from the speech waveform; a third extraction unit that extracts the spectral envelopes of the extracted periodic and aperiodic components; a generation unit that generates context labels based on context information of the known text; and a learning unit that constructs a statistical model by learning acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, in association with the corresponding context labels.
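The association between context labels and acoustic features can be pictured with the following minimal sketch, which pairs one context label vector with one acoustic feature vector per unit interval. The function name and the flat list layout (F₀ first, then the periodic envelope, then the aperiodic envelope) are assumptions for illustration; the document does not specify a vector layout.

```python
def build_training_pairs(context_labels, f0, periodic_env, aperiodic_env):
    """Pair each context label with its acoustic feature vector, one pair per unit interval.

    The feature layout [F0, periodic envelope..., aperiodic envelope...] is an
    illustrative assumption.
    """
    n = len(context_labels)
    if not (len(f0) == len(periodic_env) == len(aperiodic_env) == n):
        raise ValueError("all inputs must have one entry per unit interval")
    pairs = []
    for label, f0_value, p_env, a_env in zip(context_labels, f0, periodic_env, aperiodic_env):
        acoustic = [f0_value] + list(p_env) + list(a_env)
        pairs.append((list(label), acoustic))
    return pairs
```

A statistical model such as a DNN would then be trained to map the first element of each pair to the second.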

Preferably, the speech synthesis system further includes: a determination unit that, in response to input of arbitrary text, determines context labels based on context information of that text; and an estimation unit that estimates, from the statistical model, acoustic features corresponding to the context labels determined by the determination unit. The estimated acoustic features include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component. The speech synthesis system further includes: a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency included in the estimated acoustic features, according to the spectral envelope of the periodic component; a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence according to the spectral envelope of the aperiodic component; and an addition unit that adds the reconstructed periodic and aperiodic components and outputs the result as a speech waveform corresponding to the input text.

Preferably, the second extraction unit extracts only the aperiodic component from unit intervals in which the first extraction unit cannot extract the fundamental frequency, and extracts both the periodic and aperiodic components from the other unit intervals.

Preferably, for unit intervals in which the fundamental frequency cannot be extracted, the first extraction unit determines the fundamental frequency by interpolation.

Preferably, the pulse sequence is generated from the interpolated fundamental frequency sequence, and the noise sequence is a sequence in which noise is generated over the entire interval.
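The waveform reconstruction described in the preceding paragraphs (a pulse sequence filtered by the periodic envelope, a noise sequence filtered by the aperiodic envelope, and the two results added) can be sketched in Python as follows. This is a simplified illustration, not the implementation of the embodiment: each spectral envelope is stood in for by a single fixed FIR impulse response rather than a frame-varying envelope, and all names, the frame length, and the sampling rate are assumptions.

```python
import random


def pulse_train(f0_per_frame, frame_len, sample_rate):
    """Place unit pulses one pitch period apart, following a continuous F0 track."""
    if any(f0 <= 0 for f0 in f0_per_frame):
        raise ValueError("F0 must be positive in every frame (interpolate first)")
    total = len(f0_per_frame) * frame_len
    pulses = [0.0] * total
    position = 0.0
    while position < total:
        index = int(position)
        pulses[index] = 1.0
        position += sample_rate / f0_per_frame[index // frame_len]
    return pulses


def fir_filter(signal, impulse_response):
    """Convolve the signal with an impulse response, truncated to the signal length."""
    output = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if k > n:
                break
            acc += h * signal[n - k]
        output.append(acc)
    return output


def reconstruct_waveform(f0_per_frame, periodic_ir, aperiodic_ir,
                         frame_len=80, sample_rate=16000, seed=0):
    """Sum a filtered pulse train and filtered noise; no V/UV switch is involved.

    periodic_ir and aperiodic_ir are FIR stand-ins for the two spectral
    envelopes (an assumption made to keep the sketch short).
    """
    rng = random.Random(seed)
    pulses = pulse_train(f0_per_frame, frame_len, sample_rate)
    # Noise is generated over the entire duration, not only in unvoiced frames.
    noise = [rng.gauss(0.0, 1.0) for _ in range(len(pulses))]
    periodic = fir_filter(pulses, periodic_ir)
    aperiodic = fir_filter(noise, aperiodic_ir)
    return [p + a for p, a in zip(periodic, aperiodic)]
```

In this structure the balance between the two components is carried by the envelopes themselves, so a frame with little periodic energy is represented by a small periodic envelope rather than by a hard V/UV switch.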

According to yet another aspect of the present invention, a speech synthesis program for realizing a speech synthesis method based on SPSS is provided. The speech synthesis program causes a computer to execute the steps of: extracting, for each unit interval, the fundamental frequency of a speech waveform corresponding to known text; extracting, for each unit interval, a periodic component and an aperiodic component from the speech waveform; extracting the spectral envelopes of the extracted periodic and aperiodic components; generating context labels based on context information of the known text; and constructing a statistical model by learning acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, in association with the corresponding context labels.

According to yet another aspect of the present invention, a speech synthesis method based on SPSS is provided. The speech synthesis method includes the steps of: extracting, for each unit interval, the fundamental frequency of a speech waveform corresponding to known text; extracting, for each unit interval, a periodic component and an aperiodic component from the speech waveform; extracting the spectral envelopes of the extracted periodic and aperiodic components; generating context labels based on context information of the known text; and constructing a statistical model by learning acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, in association with the corresponding context labels.

According to the present invention, the impact on quality of V/UV decision errors in the acoustic model can be reduced in SPSS.

FIG. 1 is a schematic diagram showing an overview of a multilingual translation system using the speech synthesis system according to the present embodiment.
FIG. 2 is a schematic diagram showing an example hardware configuration of the service providing device according to the present embodiment.
FIG. 3 is a schematic diagram for explaining an overview of speech synthesis processing according to related art.
FIG. 4 is a schematic diagram for explaining an overview of speech synthesis processing according to the present embodiment.
FIG. 5 is a block diagram for explaining the processing of the main part of the speech synthesis system according to the present embodiment.
FIG. 6 is a diagram showing an example of the speech waveforms of the periodic and aperiodic components output by the speech synthesis system according to the present embodiment.
FIG. 7 is a flowchart showing an example of a processing procedure in the speech synthesis system according to the present embodiment.
FIG. 8 is a flowchart showing an example of a processing procedure in the speech synthesis system according to the present embodiment.
FIG. 9 is a diagram showing an example of the evaluation results of a paired comparison experiment on the speech synthesis system according to the present embodiment.

Embodiments of the present invention will be described in detail with reference to the drawings. Identical or corresponding parts in the drawings are given the same reference numerals, and their description is not repeated.

[A. Application example]
First, one application example of the speech synthesis system according to the present embodiment will be described. More specifically, a multilingual translation system using the speech synthesis system according to the present embodiment will be described.

FIG. 1 is a schematic diagram showing an overview of a multilingual translation system 1 that uses the speech synthesis system according to the present embodiment. Referring to FIG. 1, the multilingual translation system 1 includes a service providing device 10. The service providing device 10 performs speech recognition, multilingual translation, and the like on input speech (some utterance spoken in a first language) from a mobile terminal 30 connected via a network 2, synthesizes the corresponding utterance in a second language, and outputs the synthesis result to the mobile terminal 30 as output speech.

For example, when a user 4 says the English words "Where is the station?" to the mobile terminal 30, the mobile terminal 30 generates input speech from the utterance using a microphone or the like and transmits the generated input speech to the service providing device 10. The service providing device 10 synthesizes output speech representing the Japanese words 「駅はどこですか？」, which correspond to "Where is the station?". On receiving the output speech from the service providing device 10, the mobile terminal 30 plays it back. As a result, the person the user 4 is talking to hears the Japanese words 「駅はどこですか？」.

Although not shown, the person the user 4 is talking to may have a similar mobile terminal 30. For example, when that person answers the question from the user 4 by saying 「まっすぐ行って左です」 into their own mobile terminal, the processing described above is executed, and the corresponding English words "Go straight and turn left" are output as the answer from that person's mobile terminal.

In this way, the multilingual translation system 1 can translate freely between utterances (speech) in the first language and utterances (speech) in the second language. The system is not limited to two languages; automatic translation may be provided between any number of languages.

Using such an automatic speech translation function makes overseas travel and communication with people from other countries easier.

The speech synthesis system according to the present embodiment, which is included in the service providing device 10, adopts one SPSS technique, as described later. As components related to the speech synthesis system, the service providing device 10 includes an analysis unit 12, a learning unit 14, a DNN 16, and a speech synthesis unit 18.

As components related to automatic translation, the service providing device 10 includes a speech recognition unit 20 and a translation unit 22. The service providing device 10 further includes a communication processing unit 24 for performing communication processing with the mobile terminal 30.

More specifically, the analysis unit 12 and the learning unit 14 are responsible for the machine learning used to construct the DNN 16. The functions and processing of the analysis unit 12 and the learning unit 14 are described in detail later. The DNN 16 stores the neural network obtained as a result of machine learning by the analysis unit 12 and the learning unit 14.

Although a DNN is used as an example in the present embodiment, any of a recurrent neural network (hereinafter also abbreviated as "RNN"), a long short-term memory (LSTM) RNN, or a convolutional neural network (CNN) may be used instead of the DNN.

The speech recognition unit 20 outputs speech recognition text by performing speech recognition processing on the input speech from the mobile terminal 30 received via the communication processing unit 24. The translation unit 22 generates text in a specified language (for convenience, also referred to as "translated text") from the speech recognition text supplied by the speech recognition unit 20. Any known method can be adopted for the speech recognition unit 20 and the translation unit 22.

The speech synthesis unit 18 performs speech synthesis on the translated text from the translation unit 22 with reference to the DNN 16 and transmits the resulting output speech to the mobile terminal 30 via the communication processing unit 24.
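The flow through the speech recognition unit 20, the translation unit 22, and the speech synthesis unit 18 can be summarized by the following minimal sketch. The function and parameter names are hypothetical, and the three stages are passed in as callables because the document leaves their implementations open (any known method may be used for recognition and translation).

```python
def translate_speech(input_speech, recognize, translate, synthesize, target_language):
    """Chain recognition, translation, and synthesis in the order described above.

    recognize, translate, and synthesize are caller-supplied callables; their
    names and signatures are assumptions made for this illustration.
    """
    recognized_text = recognize(input_speech)                      # speech recognition unit 20
    translated_text = translate(recognized_text, target_language)  # translation unit 22
    return synthesize(translated_text)                             # speech synthesis unit 18
```

Passing the stages as parameters also mirrors the later remark that individual functions may run on the service providing device 10 or on the mobile terminal 30.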

For convenience of explanation, FIG. 1 shows an example in which the components responsible for the machine learning used to construct the DNN 16 (mainly the analysis unit 12 and the learning unit 14) and the components responsible for multilingual translation using the generated DNN 16 (mainly the speech recognition unit 20, the translation unit 22, and the speech synthesis unit 18) are implemented in the same service providing device 10. However, these functions may be implemented in separate devices. In that case, a first device may construct the DNN 16 by performing machine learning, and a second device may use the generated DNN 16 to perform speech synthesis and provide services that use the speech synthesis.

In a multilingual translation service such as that described above, an application executed on the mobile terminal 30 may be responsible for at least some of the functions of the speech recognition unit 20 and the translation unit 22. An application executed on the mobile terminal 30 may also be responsible for the functions of the components in charge of speech synthesis (the DNN 16 and the speech synthesis unit 18).

In this way, the service providing device 10 and the mobile terminal 30 can cooperate in any form to realize the multilingual translation system 1 and the speech synthesis system that forms part of it. The functions assigned to each device may be determined as appropriate for the situation and are not limited to the multilingual translation system 1 shown in FIG. 1.

[B. Hardware configuration of the service providing device]
Next, an example of the hardware configuration of the service providing device will be described. FIG. 2 is a schematic diagram showing an example hardware configuration of the service providing device 10 according to the present embodiment. The service providing device 10 is typically implemented using a general-purpose computer.

Referring to FIG. 2, the service providing device 10 includes, as its main hardware components, a processor 100, a main memory 102, a display 104, an input device 106, a network interface (I/F) 108, an optical drive 134, and a secondary storage device 112. These components are connected to one another via an internal bus 110.

The processor 100 is the computing entity that performs the processing necessary to realize the service providing device 10 according to the present embodiment by executing various programs described later. It consists of, for example, one or more CPUs (central processing units) or GPUs (graphics processing units). A CPU or GPU having multiple cores may be used.

The main memory 102 is a storage area that temporarily holds program code, working memory, and the like while the processor 100 executes programs. It consists of, for example, a volatile memory device such as DRAM (dynamic random access memory) or SRAM (static random access memory).

The display 104 is a display unit that outputs the user interface for processing, processing results, and the like. It consists of, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.

The input device 106 is a device that accepts instructions and operations from the user and consists of, for example, a keyboard, mouse, touch panel, or pen. The input device 106 may also include a microphone for collecting the speech needed for machine learning, or an interface for connecting to a sound collection device that has collected the speech needed for machine learning.

The network interface 108 exchanges data with the mobile terminal 30 and any other information processing devices on the Internet or an intranet. Any communication scheme, such as Ethernet (registered trademark), wireless LAN (local area network), or Bluetooth (registered trademark), can be adopted for the network interface 108.

The optical drive 134 reads information stored on an optical disc 136, such as a CD-ROM (compact disc read-only memory) or DVD (digital versatile disc), and outputs it to other components via the internal bus 110. The optical disc 136 is an example of a non-transitory recording medium and is distributed with an arbitrary program stored on it in a non-volatile manner. When the optical drive 134 reads a program from the optical disc 136 and installs it in the secondary storage device 112 or the like, the general-purpose computer functions as the service providing device 10 (or as a speech synthesis device). The subject matter of the present invention may therefore also be the program itself installed in the secondary storage device 112 or the like, or a recording medium such as the optical disc 136 storing a program for realizing the functions and processing according to the present embodiment.

図２には、非一過的な記録媒体の一例として、光学ディスク１３６などの光学記録媒体を示すが、これに限らず、フラッシュメモリなどの半導体記録媒体、ハードディスクまたはストレージテープなどの磁気記録媒体、ＭＯ（magneto-optical disk）などの光磁気記録媒体を用いてもよい。 FIG. 2 shows an optical recording medium such as the optical disc 136 as an example of a non-transitory recording medium, but the medium is not limited to this. A semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or a storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may also be used.

二次記憶装置１１２は、プロセッサ１００にて実行されるプログラム、プログラムが処理対象とする入力データ（学習用の入力音声およびテキスト、ならびに、携帯端末３０からの入力音声などを含む）、および、プログラムの実行により生成される出力データ（携帯端末３０へ送信される出力音声などを含む）などを格納するコンポーネントであり、例えば、ハードディスク、ＳＳＤ（solid state drive）などの不揮発性記憶装置で構成される。 The secondary storage device 112 is a component that stores the programs executed by the processor 100, the input data processed by those programs (including input speech and text for learning and input speech from the mobile terminal 30), and the output data generated by executing the programs (including output speech transmitted to the mobile terminal 30). It consists of a non-volatile storage device such as a hard disk or an SSD (solid state drive).

より具体的には、二次記憶装置１１２は、典型的には、図示しないＯＳ（operating system）の他、分析部１２を実現するための分析プログラム１２１と、学習部１４を実現するための学習プログラム１４１と、音声認識部２０を実現するための音声認識プログラム２０１と、翻訳部２２を実現するための翻訳プログラム２２１と、音声合成部１８を実現するための音声合成プログラム１８１とを格納している。 More specifically, the secondary storage device 112 typically stores, in addition to an OS (operating system) that is not shown, an analysis program 121 for realizing the analysis unit 12, a learning program 141 for realizing the learning unit 14, a speech recognition program 201 for realizing the speech recognition unit 20, a translation program 221 for realizing the translation unit 22, and a speech synthesis program 181 for realizing the speech synthesis unit 18.

これらのプログラムをプロセッサ１００で実行する際に必要となるライブラリや機能モジュールの一部を、ＯＳが標準で提供するライブラリまたは機能モジュールを用いて代替するようにしてもよい。この場合には、各プログラム単体では、対応する機能を実現するために必要なプログラムモジュールのすべてを含むものにはならないが、ＯＳの実行環境下にインストールされることで、必要な機能を実現できる。このような一部のライブラリまたは機能モジュールを含まないプログラムであっても、本発明の技術的範囲に含まれ得る。 Some of the libraries and functional modules needed to execute these programs on the processor 100 may be replaced by libraries or functional modules that the OS provides as standard. In that case, each program by itself does not contain all of the program modules needed to realize the corresponding function, but it can realize the function once it is installed in the execution environment of the OS. Even a program that lacks some of these libraries or functional modules may fall within the technical scope of the present invention.

また、これらのプログラムは、上述したようないずれかの記録媒体に格納されて流通するだけでなく、インターネットまたはイントラネットを介してサーバ装置などからダウンロードすることで配布されてもよい。 Further, these programs are not only stored and distributed in any of the recording media as described above, but may also be distributed by downloading from a server device or the like via the Internet or an intranet.

なお、実際には、音声認識部２０および翻訳部２２を実現するためのデータベースが必要となるが、説明の便宜上、それらのデータベースについては描いていない。 Actually, a database for realizing the voice recognition unit 20 and the translation unit 22 is required, but for convenience of explanation, these databases are not drawn.

二次記憶装置１１２は、ＤＮＮ１６に加えて、ＤＮＮ１６を構築するための、機械学習用の入力音声１３０および対応するテキスト１３２を格納していてもよい。 In addition to the DNN 16, the secondary storage device 112 may store an input voice 130 for machine learning and a corresponding text 132 for constructing the DNN 16.

図２には、単一のコンピュータがサービス提供装置１０を構成する例を示すが、これに限らず、ネットワークを介して接続された複数のコンピュータが明示的または黙示的に連携して、多言語翻訳システム１およびその一部である音声合成システムを実現するようにしてもよい。 FIG. 2 shows an example in which a single computer constitutes the service providing device 10, but the configuration is not limited to this. A plurality of computers connected via a network may cooperate, explicitly or implicitly, to realize the multilingual translation system 1 and the speech synthesis system that forms part of it.

コンピュータ（プロセッサ１００）がプログラムを実行することで実現される機能の全部または一部を、集積回路などのハードワイヤード回路（hard-wired circuit）を用いて実現してもよい。例えば、ＡＳＩＣ（application specific integrated circuit）やＦＰＧＡ（field-programmable gate array）などを用いて実現してもよい。 All or part of the functions realized by the computer (processor 100) executing the program may be realized by using a hard-wired circuit such as an integrated circuit. For example, it may be realized by using an ASIC (application specific integrated circuit) or an FPGA (field-programmable gate array).

当業者であれば、本発明が実施される時代に応じた技術を適宜用いて、本実施の形態に従う音声合成システムを実現できるであろう。 A person skilled in the art will be able to realize a speech synthesis system according to the present embodiment by appropriately using a technique suitable for the times when the present invention is implemented.

［Ｃ．概要］ [C. Overview]
本実施の形態においては、ＳＰＳＳに従う音声合成システムが提供される。本実施の形態に従う音声合成システムにおいては、励振源を示す源信号を周期成分と非周期成分とに分解することで、Ｖ／ＵＶの判定を不要化した方式を採用する。源信号を表現する周期成分および非周期成分を示す音声パラメータをＤＮＮに適用して学習を行なう。 The present embodiment provides a speech synthesis system based on SPSS. The speech synthesis system according to the present embodiment adopts a method that makes the voiced/unvoiced (V/UV) decision unnecessary by decomposing the source signal, which represents the excitation source, into a periodic component and an aperiodic component. Learning is performed by applying to the DNN the speech parameters that represent the periodic and aperiodic components of the source signal.

まず、関連技術に係る音声合成処理および当該音声合成処理をＳＰＳＳに適用する場合の処理について説明する。図３は、関連技術に係る音声合成処理の概要を説明するための模式図である。図３を参照して、関連技術に係る音声合成処理においては、パルス生成部２５０と、ホワイトノイズ生成部２５２と、切替部２５４と、音声合成フィルタ２５６とを含む。図３に示す構成において、パルス生成部２５０、ホワイトノイズ生成部２５２、および、切替部２５４は、励振源をモデル化した部分に相当し、励振源からの源信号は、パルス生成部２５０から出力されるパルス系列と、ホワイトノイズ生成部２５２からの雑音系列とのうち、いずれか一方が切替部２５４にて選択されて、音声合成フィルタ２５６へ与えられる。パルス生成部２５０には、声の高さを示すＦ₀のパラメータが与えられ、Ｆ₀の逆数（基本周期／ピッチ周期）の間隔でパルス系列を出力する。なお、図示していないが、パルス生成部２５０には、声の大きさを示す振幅のパラメータが与えられてもよい。音声合成フィルタ２５６は、音声の音色を決定する部分であり、スペクトル包絡を示すパラメータが与えられる。 First, the speech synthesis process of the related art, and the processing performed when it is applied to SPSS, will be described. FIG. 3 is a schematic diagram outlining the speech synthesis process of the related art. Referring to FIG. 3, the speech synthesis process of the related art includes a pulse generation unit 250, a white noise generation unit 252, a switching unit 254, and a speech synthesis filter 256. In the configuration shown in FIG. 3, the pulse generation unit 250, the white noise generation unit 252, and the switching unit 254 together model the excitation source. As the source signal from the excitation source, the switching unit 254 selects either the pulse sequence output from the pulse generation unit 250 or the noise sequence from the white noise generation unit 252 and supplies it to the speech synthesis filter 256. The pulse generation unit 250 is given the parameter F₀, which indicates the pitch of the voice, and outputs a pulse sequence at intervals equal to the reciprocal of F₀ (the fundamental period, or pitch period). Although not shown, the pulse generation unit 250 may also be given an amplitude parameter indicating the loudness of the voice. The speech synthesis filter 256 determines the timbre of the speech and is given a parameter representing the spectral envelope.

図３に示す音声生成時のソースフィルタモデルにおいては、入力された音声波形を単位区間（例えば、フレーム単位）で区切るとともに、各単位区間が有声区間であるか無声区間であるかが判定され、有声区間についてはパルス系列が源信号として出力され、無声区間についてはノイズ系列が源信号として出力される。この有声区間と無声区間とを識別するパラメータがＶ／ＵＶフラグである。 In the source filter model at the time of voice generation shown in FIG. 3, the input voice waveform is divided into unit intervals (for example, frame units), and it is determined whether each unit section is a voiced section or an unvoiced section. The pulse sequence is output as the source signal for the voiced section, and the noise sequence is output as the source signal for the unvoiced section. The parameter that distinguishes between the voiced section and the unvoiced section is the V / UV flag.
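The per-frame switching of the related art can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `switched_excitation`, the frame length, the sampling rate, and the unit pulse amplitude are all assumptions made for the example, and the speech synthesis filter 256 is omitted.

```python
import numpy as np

def switched_excitation(f0_per_frame, vuv_per_frame, frame_len, fs, seed=0):
    """Build a source signal by selecting pulses (voiced) or white noise (unvoiced) per frame."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(f0_per_frame) * frame_len)
    next_pulse = 0  # sample index at which the next pulse is due
    for i, (f0, voiced) in enumerate(zip(f0_per_frame, vuv_per_frame)):
        start, end = i * frame_len, (i + 1) * frame_len
        if voiced:
            period = int(round(fs / f0))          # pitch period in samples (reciprocal of F0)
            next_pulse = max(next_pulse, start)   # restart pulses after an unvoiced gap
            while next_pulse < end:
                out[next_pulse] = 1.0
                next_pulse += period
        else:
            # The V/UV flag switches the source to noise for the whole frame.
            out[start:end] = rng.standard_normal(frame_len)
    return out

# Two voiced frames at 200 Hz followed by one unvoiced frame (hypothetical values).
src = switched_excitation([200.0, 200.0, 0.0], [True, True, False], frame_len=80, fs=16000)
```

The hard switch at the frame boundary is where a wrong V/UV flag changes the character of the source signal abruptly.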

図３に示すソースフィルタモデルをＳＰＳＳに適用する場合には、Ｆ₀、Ｖ／ＵＶフラグ、スペクトル包絡が学習対象のパラメータとなる。したがって、各単位区間についてＶ／ＵＶを正しく判定しなければならない。しかしながら、Ｖ／ＵＶの判定、および、パルス系列およびノイズ系列が切替えられることによる不連続性を伴う源信号のモデル化は容易ではないので、合成音声に品質劣化が生じる可能性がある。 When the source-filter model shown in FIG. 3 is applied to SPSS, F₀, the V/UV flag, and the spectral envelope are the parameters to be learned. V/UV must therefore be determined correctly for each unit interval. However, neither the V/UV decision nor the modeling of a source signal that is discontinuous because it switches between the pulse sequence and the noise sequence is easy, so the quality of the synthesized speech may deteriorate.

そこで、本実施の形態においては、音声波形の各単位区間についてのＶ／ＵＶを判定する必要のない手法を採用する。これにより、関連技術において生じ得る、Ｖ／ＵＶの判定エラーによる合成音声の品質への影響を低減する。 Therefore, in the present embodiment, a method that does not need to determine V / UV for each unit interval of the voice waveform is adopted. This reduces the effect of V / UV determination errors on the quality of synthetic speech that may occur in related technologies.

図４は、本実施の形態に従う音声合成処理の概要を説明するための模式図である。図４を参照して、本実施の形態に従う音声合成処理においては、パルス生成部２００と、音声合成フィルタ（周期成分）２０２と、ガウシアンノイズ生成部２０４と、音声合成フィルタ（非周期成分）２０６と、加算部２０８とを含む。 FIG. 4 is a schematic diagram outlining the speech synthesis process according to the present embodiment. Referring to FIG. 4, the speech synthesis process according to the present embodiment includes a pulse generation unit 200, a speech synthesis filter (periodic component) 202, a Gaussian noise generation unit 204, a speech synthesis filter (aperiodic component) 206, and an addition unit 208.

本実施の形態においては、図３に示すＶ／ＵＶフラグを用いた源信号の切替えではなく、周期成分および非周期成分のそれぞれに源信号を用意する。すなわち、音声信号を周期成分および非周期成分に分解する。 In the present embodiment, instead of switching the source signal using the V / UV flag shown in FIG. 3, a source signal is prepared for each of the periodic component and the aperiodic component. That is, the audio signal is decomposed into a periodic component and a non-periodic component.

より具体的には、パルス生成部２００および音声合成フィルタ（周期成分）２０２は、周期成分を生成する部分であり、パルス生成部２００は、指定されたＦ₀に従うパルス（後述するように、連続的なパルス系列）を生成するとともに、音声合成フィルタ（周期成分）２０２が周期成分に対応するスペクトル包絡に応じたフィルタを当該連続的なパルス系列に乗じることで、合成音声に含まれる周期成分を出力する。 More specifically, the pulse generation unit 200 and the speech synthesis filter (periodic component) 202 generate the periodic component. The pulse generation unit 200 generates pulses that follow the specified F₀ (a continuous pulse sequence, as described later), and the speech synthesis filter (periodic component) 202 applies to that continuous pulse sequence a filter derived from the spectral envelope of the periodic component, thereby outputting the periodic component of the synthesized speech.

このように、各単位区間がＶ／ＵＶのいずれであるかによらず、連続的なパルス系列を用いることができるのは、周期成分の無音区間は非可聴なパワーであると仮定し、全区間を有声であると扱うためである。すなわち、無音や無声といった周期性をもたない区間において、周期成分に対応するスペクトル包絡は、十分に振幅が小さいと仮定する。この仮定に従うと、このような無音または無声の区間において、Ｆ₀のパルス系列から周期成分を生成したとしても、非可聴な程に十分に小さくなると考えられる。そのため、関連技術に係る音声合成処理において、パルス系列の生成を停止していた無声区間においても、本実施の形態に従う音声合成処理においては、パルス系列を発生することで、パルス系列の不連続性に起因する合成音声への影響を低減することができる。 A continuous pulse sequence can be used regardless of whether each unit interval is voiced or unvoiced because the periodic component is assumed to have inaudible power in silent intervals, so that every interval is treated as voiced. That is, in intervals without periodicity, such as silence or unvoiced speech, the spectral envelope of the periodic component is assumed to have a sufficiently small amplitude. Under this assumption, even if a periodic component is generated from the F₀ pulse sequence in such a silent or unvoiced interval, it is considered small enough to be inaudible. Therefore, whereas the speech synthesis process of the related art stops generating the pulse sequence in unvoiced intervals, the speech synthesis process according to the present embodiment keeps generating it there, which reduces the effect that discontinuities in the pulse sequence have on the synthesized speech.

また、ガウシアンノイズ生成部２０４および音声合成フィルタ（非周期成分）２０６は、非周期成分を生成する部分であり、ガウシアンノイズ生成部２０４は、連続的なノイズ系列の一例として、ガウシアンノイズを生成するとともに、音声合成フィルタ（非周期成分）２０６が非周期成分に対応するスペクトル包絡に応じたフィルタを当該ノイズ系列に乗じることで、合成音声に含まれる非周期成分を出力する。 The Gaussian noise generation unit 204 and the speech synthesis filter (aperiodic component) 206 generate the aperiodic component. The Gaussian noise generation unit 204 generates Gaussian noise as an example of a continuous noise sequence, and the speech synthesis filter (aperiodic component) 206 applies to that noise sequence a filter derived from the spectral envelope of the aperiodic component, thereby outputting the aperiodic component of the synthesized speech.

最終的に、音声合成フィルタ（周期成分）２０２から出力される周期成分、および、音声合成フィルタ（非周期成分）２０６から出力される非周期成分が加算部２０８で加算されることで、合成音声を示す音声波形が出力される。 Finally, the addition unit 208 adds the periodic component output from the speech synthesis filter (periodic component) 202 and the aperiodic component output from the speech synthesis filter (aperiodic component) 206, and the speech waveform representing the synthesized speech is output.

このように、各単位区間がＶ／ＵＶのいずれであるかによらず、ノイズ系列を用いることができるのは、非周期成分が無声信号および無音により構成されると仮定し、全区間を無声であると扱うためである。以上のように、有声区間および無声区間を区別する必要のない音響モデルを用いるとともに、その音響モデルに基づく学習を行なうことで、Ｖ／ＵＶの判定を必要としない音声合成方法を実現できる。 In this way, the noise sequence can be used regardless of whether each unit interval is V / UV, assuming that the aperiodic component is composed of unvoiced signals and silence, and the entire section is unvoiced. This is to treat it as. As described above, by using an acoustic model that does not need to distinguish between a voiced section and an unvoiced section and performing learning based on the acoustic model, a speech synthesis method that does not require V / UV determination can be realized.
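The two-branch structure of FIG. 4 can be sketched as below. This is an illustration under simplifying assumptions, not the patent's implementation: F₀ is held constant, each speech synthesis filter is replaced by a fixed FIR impulse response (`h_pdc`, `h_apd`, both hypothetical names), and `synthesize` is a name chosen for the example. In an actual system the filters would change frame by frame according to the spectral envelopes.

```python
import numpy as np

def synthesize(f0, h_pdc, h_apd, n_samples, fs, seed=0):
    """Sum a filtered pulse train (periodic branch) and filtered Gaussian noise (aperiodic branch)."""
    rng = np.random.default_rng(seed)
    pulses = np.zeros(n_samples)
    pulses[::int(round(fs / f0))] = 1.0        # continuous pulse train, never switched off
    noise = rng.standard_normal(n_samples)     # continuous Gaussian noise
    periodic = np.convolve(pulses, h_pdc)[:n_samples]
    aperiodic = np.convolve(noise, h_apd)[:n_samples]
    return periodic + aperiodic, periodic, aperiodic

# Hypothetical filters: a short periodic-branch response and a low-gain noise branch.
y, periodic, aperiodic = synthesize(200.0, np.array([1.0, 0.5]), np.array([0.1]), 400, 16000)
```

No V/UV flag appears anywhere in the sketch. An interval is made effectively unvoiced by giving the periodic-branch filter a very small gain there, which matches the assumption stated above that the periodic component is inaudible in such intervals.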

［Ｄ．学習処理および音声合成処理］ [D. Learning processing and speech synthesis processing]
次に、本実施の形態に従う音声合成システムにおける学習処理および音声合成処理の詳細について説明する。図５は、本実施の形態に従う音声合成システムにおける要部の処理を説明するためのブロック図である。 Next, the learning process and the speech synthesis process in the speech synthesis system according to the present embodiment will be described in detail. FIG. 5 is a block diagram for explaining the processing of the main part of the speech synthesis system according to the present embodiment.

図５を参照して、音声合成システムは、ＤＮＮ１６を構築するための分析部１２および学習部１４と、ＤＮＮ１６を用いて音声波形を出力する音声合成部１８とを含む。以下、これらの各部の処理および機能について詳述する。 With reference to FIG. 5, the speech synthesis system includes an analysis unit 12 and a learning unit 14 for constructing the DNN 16, and a speech synthesis unit 18 for outputting a voice waveform using the DNN 16. Hereinafter, the processing and functions of each of these parts will be described in detail.

（ｄ１：分析部１２） (d1: Analysis unit 12)
まず、分析部１２における処理および機能について説明する。分析部１２は、音声分析を担当する部分であり、学習用の入力音声が示す音声波形から音響特徴量系列を生成する。本実施の形態に従う音声合成システムにおいて、フレーム毎の音響特徴量は、Ｆ₀およびスペクトル包絡（周期成分および非周期成分）を含む。 First, the processing and functions of the analysis unit 12 will be described. The analysis unit 12 is responsible for speech analysis and generates an acoustic feature sequence from the speech waveform of the input speech for learning. In the speech synthesis system according to the present embodiment, the acoustic features of each frame include F₀ and the spectral envelopes (of the periodic component and of the aperiodic component).

より具体的には、分析部１２は、Ｆ₀抽出部１２０と、周期／非周期成分抽出部１２２と、特徴量抽出部１２４とを含む。特徴量抽出部１２４は、Ｆ₀補間部１２６と、スペクトル包絡抽出部１２８とを含む。 More specifically, the analysis unit 12 includes an F₀ extraction unit 120, a periodic/aperiodic component extraction unit 122, and a feature amount extraction unit 124. The feature amount extraction unit 124 includes an F₀ interpolation unit 126 and a spectral envelope extraction unit 128.

Ｆ₀抽出部１２０は、既知のテキストに対応する音声波形のＦ₀をフレーム（単位区間）毎に抽出する。すなわち、Ｆ₀抽出部１２０は、入力される音声波形からＦ₀をフレーム毎に抽出する。抽出されたＦ₀は、周期／非周期成分抽出部１２２および特徴量抽出部１２４へ与えられる。 The F₀ extraction unit 120 extracts, for each frame (unit interval), the F₀ of the speech waveform corresponding to a known text. That is, the F₀ extraction unit 120 extracts F₀ frame by frame from the input speech waveform. The extracted F₀ is supplied to the periodic/aperiodic component extraction unit 122 and the feature amount extraction unit 124.

周期／非周期成分抽出部１２２は、入力される音声波形から周期成分および非周期成分をフレーム（単位区間）毎に抽出する。より具体的には、周期／非周期成分抽出部１２２は、入力される音声波形のＦ₀に基づいて、Ｆ₀から周期成分および非周期成分を抽出する。本実施の形態においては、源信号ｓ（ｔ）を以下の（１）式に示すように抽出する。 The periodic/aperiodic component extraction unit 122 extracts the periodic component and the aperiodic component from the input speech waveform for each frame (unit interval). More specifically, the periodic/aperiodic component extraction unit 122 extracts the periodic component and the aperiodic component on the basis of the F₀ of the input speech waveform. In the present embodiment, the source signal s(t) is extracted as shown in equation (1) below.

但し、ｆ_０（ｔ）は、音声波形のフレームｔにおけるＦ₀を示し、周期性信号ｓ_ｐｄｃ（ｔ）は、音声波形のフレームｔにおける周期成分を示し、非周期性信号ｓ_ａｐｄ（ｔ）は、音声波形のフレームｔにおける非周期成分を示す。 Here, f_0(t) denotes the F₀ in frame t of the speech waveform, the periodic signal s_pdc(t) denotes the periodic component in frame t of the speech waveform, and the aperiodic signal s_apd(t) denotes the aperiodic component in frame t of the speech waveform.

このように、入力される音声波形のフレームｔ毎に、Ｆ₀が存在する場合には、源信号は周期成分および非周期成分を含むものとして扱い、Ｆ₀が存在しない場合には、源信号は非周期成分のみを含むものとして扱う。すなわち、周期／非周期成分抽出部１２２は、Ｆ₀抽出部１２０がＦ₀を抽出できないフレーム（単位区間）から非周期成分のみを抽出し、それ以外のフレームから周期成分および非周期成分を抽出する。 Thus, for each frame t of the input speech waveform, the source signal is treated as containing both a periodic component and an aperiodic component when F₀ exists, and as containing only an aperiodic component when F₀ does not exist. That is, the periodic/aperiodic component extraction unit 122 extracts only the aperiodic component from frames (unit intervals) in which the F₀ extraction unit 120 cannot extract F₀, and extracts both the periodic component and the aperiodic component from all other frames.
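Equation (1) itself is not reproduced in this text. A form consistent with the definitions of f_0(t), s_pdc(t), and s_apd(t) given above and with the case distinction just described would be the following. Writing the condition "F₀ exists" as f_0(t) > 0 is an assumption made here for illustration.

```latex
s(t) =
\begin{cases}
  s_{\mathrm{pdc}}(t) + s_{\mathrm{apd}}(t), & f_0(t) > 0 \\
  s_{\mathrm{apd}}(t),                        & \text{otherwise}
\end{cases}
```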

本実施の形態においては、源信号の周期（harmonic）成分を表現する一例として、以下の（２）式に示すようなsinusoidalモデルを採用する。 In the present embodiment, a sinusoidal model such as that shown in equation (2) below is adopted as one way of representing the periodic (harmonic) component of the source signal.

（２）式において、Ｊはharmonicの数を示す。すなわち、（２）式に示すsinusoidalモデルにおいては、harmonicでの周波数および振幅は線形的に近似されている。このsinusoidalモデルを解くにあたって、α_ｋ，β_ｋ，γ，φ_ｋの値をそれぞれ決定する必要がある。より具体的には、以下の（３）式に従って定義されるδを最小化する値が解として決定される。 In equation (2), J denotes the number of harmonics. That is, in the sinusoidal model of equation (2), the frequency and amplitude of each harmonic are approximated linearly. To solve this sinusoidal model, the values of α_k, β_k, γ, and φ_k must each be determined. More specifically, the values that minimize δ, defined by equation (3) below, are taken as the solution.

但し、ω（ｔ）は、長さ２Ｎ_ｗ＋１の窓関数である。（３）式に従って定義されるδを最小化する値は、非特許文献８に示される解法によって決定される。 Here, ω(t) is a window function of length 2N_w + 1. The values that minimize δ as defined by equation (3) are determined by the solution method described in Non-Patent Document 8.

周期／非周期成分抽出部１２２は、上述したような数学的な解法に従って、入力される音声波形に含まれる周期性信号ｓ_ｐｄｃ（ｔ）および非周期性信号ｓ_ａｐｄ（ｔ）を抽出する。 Following the mathematical solution described above, the periodic/aperiodic component extraction unit 122 extracts the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) contained in the input speech waveform.
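The idea of fitting harmonics by windowed least squares and taking the residual as the aperiodic part can be illustrated with a deliberately simplified sketch. It is not the model of equation (2) or the solver of Non-Patent Document 8: each harmonic here has a constant amplitude and a fixed frequency k·F₀, whereas the text describes amplitude and frequency that vary linearly. The function name `split_frame`, the Hanning window, and the squared-window weighting are assumptions made for the example.

```python
import numpy as np

def split_frame(x, f0, fs, n_harmonics):
    """Split one frame into a harmonic (periodic) part and a residual (aperiodic) part."""
    n = len(x)
    t = (np.arange(n) - n // 2) / fs                   # time axis centred on the frame
    w = np.hanning(n)                                   # analysis window (assumed shape)
    k = np.arange(1, n_harmonics + 1)
    phase = 2.0 * np.pi * f0 * np.outer(t, k)           # shape (n, n_harmonics)
    basis = np.hstack([np.cos(phase), np.sin(phase)])   # one cos/sin pair per harmonic
    # Weighted least squares: minimise sum_t w(t)^2 * (x(t) - basis @ coef)^2
    coef, *_ = np.linalg.lstsq(basis * w[:, None], x * w, rcond=None)
    periodic = basis @ coef
    return periodic, x - periodic

# A purely harmonic test frame (hypothetical values): the residual should be negligible.
fs, n = 16000, 400
t = np.arange(n) / fs
frame = 0.8 * np.cos(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
periodic, aperiodic = split_frame(frame, 200.0, fs, n_harmonics=3)
```

Using a cos/sin pair per harmonic keeps the problem linear in the unknowns, so the amplitude and phase of each harmonic are obtained without an iterative search.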

特徴量抽出部１２４は、音響特徴量として、連続的なＦ₀、周期成分のスペクトル包絡、非周期成分のスペクトル包絡を出力する。スペクトル包絡としては、例えば、ＬＳＰ（line spectral pair）、ＬＰＣ（linear prediction coefficients）、メルケプストラム係数のいずれを採用してもよい。なお、音響特徴量としては、連続的なＦ₀の対数（以下、「連続的なｌｏｇＦ₀」とも略称する。）が用いられる。 The feature amount extraction unit 124 outputs, as acoustic features, the continuous F₀, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component. The spectral envelope may be represented by, for example, LSP (line spectral pair), LPC (linear prediction coefficients), or mel-cepstral coefficients. Note that the logarithm of the continuous F₀ (hereinafter also abbreviated as "continuous log F₀") is used as the acoustic feature.

Ｆ₀補間部１２６は、Ｆ₀抽出部１２０が音声波形からフレーム毎に抽出されるＦ₀を補間して、連続的なＦ₀（Ｆ₀系列）を生成する。より具体的には、例えば、直近の１または複数のフレームにおいて抽出されたＦ₀から所定の補間関数に従って、対象のフレームにおけるＦ₀を決定できる。Ｆ₀補間部１２６におけるＦ₀の補間方法は、公知の任意の方法を採用できる。 The F₀ interpolation unit 126 interpolates the F₀ that the F₀ extraction unit 120 extracts from the speech waveform for each frame and generates a continuous F₀ (F₀ sequence). More specifically, for example, the F₀ of a target frame can be determined from the F₀ extracted in the nearest one or more frames according to a predetermined interpolation function. Any known method can be adopted as the F₀ interpolation method in the F₀ interpolation unit 126.
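One possible interpolation is sketched below, assuming that frames without an extracted F₀ are marked with 0 and that linear interpolation in the log domain is acceptable. The text leaves the interpolation function open, so both choices, and the function name `continuous_log_f0`, are illustrative assumptions.

```python
import numpy as np

def continuous_log_f0(f0):
    """Fill frames without F0 (marked 0) by linear interpolation of log F0."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no frame with an extracted F0")
    frames = np.arange(len(f0))
    # np.interp holds the first/last known value constant outside the known range.
    return np.interp(frames, frames[voiced], np.log(f0[voiced]))

# Hypothetical per-frame F0 track with gaps at the start, middle, and end.
log_f0 = continuous_log_f0([0.0, 100.0, 0.0, 0.0, 200.0, 0.0])
```

The result has a value in every frame, which is what allows the pulse sequence to be generated continuously at synthesis time.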

スペクトル包絡抽出部１２８は、抽出される周期成分および非周期成分のスペクトル包絡を抽出する。より具体的には、スペクトル包絡抽出部１２８は、Ｆ₀抽出部１２０が抽出したＦ₀に基づいて、周期／非周期成分抽出部１２２から出力される周期性信号ｓ_ｐｄｃ（ｔ）および非周期性信号ｓ_ａｐｄ（ｔ）から、スペクトル包絡を抽出する。すなわち、スペクトル包絡抽出部１２８は、フレーム毎の周期性信号ｓ_ｐｄｃ（ｔ）に含まれる各周波数成分の分布特性を示す周期成分を示すスペクトル包絡（ｐｄｃ）を抽出するとともに、フレーム毎の非周期性信号ｓ_ａｐｄ（ｔ）に含まれる各周波数成分の分布特性を示す非周期成分を示すスペクトル包絡（ａｐｄ）を抽出する。 The spectral envelope extraction unit 128 extracts the spectral envelopes of the extracted periodic and aperiodic components. More specifically, on the basis of the F₀ extracted by the F₀ extraction unit 120, the spectral envelope extraction unit 128 extracts spectral envelopes from the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) output from the periodic/aperiodic component extraction unit 122. That is, the spectral envelope extraction unit 128 extracts a spectral envelope (pdc) of the periodic component, which represents the distribution of the frequency components contained in the periodic signal s_pdc(t) of each frame, and a spectral envelope (apd) of the aperiodic component, which represents the distribution of the frequency components contained in the aperiodic signal s_apd(t) of each frame.

図６は、本実施の形態に従う音声合成システムにおいて出力される周期成分および非周期成分の音声波形の一例を示す図である。図６には、一例として、話者が「すべて」と発したときの音声信号を示す。後述するように、ＤＮＮ１６において、フレーム単位で音響特徴量が学習される。 FIG. 6 shows an example of the speech waveforms of the periodic and aperiodic components output in the speech synthesis system according to the present embodiment. As an example, FIG. 6 shows the speech signal produced when a speaker utters the Japanese word "subete" ("all"). As described later, the DNN 16 learns the acoustic features frame by frame.

図６（ａ）には、入力された音声波形（源信号）を示し、図６（ｂ）には、源信号から抽出された周期成分の音声波形を示し、図６（ｃ）には、源信号から抽出された非周期成分の音声波形を示す。Ｆ₀が抽出される区間の周期成分が図６（ｂ）に示すように抽出される一方、Ｆ₀が抽出される区間の非周期成分とＦ₀が抽出されない区間とは、図６（ｃ）のようになる。図６（ｂ）中において「ｎｏｎ−Ｆ₀」とラベル付けされた区間では、振幅がほとんどゼロになっており、この区間がＦ₀が抽出されない区間に相当する。 FIG. 6(a) shows the input speech waveform (source signal), FIG. 6(b) shows the speech waveform of the periodic component extracted from the source signal, and FIG. 6(c) shows the speech waveform of the aperiodic component extracted from the source signal. The periodic component of the intervals in which F₀ is extracted is obtained as shown in FIG. 6(b), whereas the aperiodic component of those intervals, together with the intervals in which F₀ is not extracted, appears as shown in FIG. 6(c). In the interval labeled "non-F₀" in FIG. 6(b), the amplitude is almost zero. This interval corresponds to an interval in which F₀ is not extracted.

（ｄ２：学習部１４） (d2: Learning unit 14)
次に、学習部１４における処理および機能について説明する。ＳＰＳＳにおいては、入力されたテキストと当該テキストに対応する音声波形との関係を統計的に学習する。一般的に、この関係を直接モデル化することは容易ではない。そこで、本実施の形態に従う音声合成システムにおいては、入力されたテキストの文脈情報に基づくコンテキストラベル系列を生成するとともに、入力された音声波形からＦ₀およびスペクトル包絡を含む音響特徴量系列を生成する。そして、コンテキストラベル系列および音響特徴量系列を用いて学習することで、コンテキストラベル系列を入力とし、音響特徴量系列を出力する音響モデルを構築する。本実施の形態においては、ＤＮＮに従って統計モデルである音響モデルを構築する。その結果、ＤＮＮ１６には、構築される音響モデル（統計モデル）を示すパラメータが格納されることになる。 Next, the processing and functions of the learning unit 14 will be described. SPSS statistically learns the relationship between an input text and the speech waveform corresponding to that text. In general, it is not easy to model this relationship directly. The speech synthesis system according to the present embodiment therefore generates a context label sequence based on the context information of the input text, and generates from the input speech waveform an acoustic feature sequence that includes F₀ and the spectral envelopes. By learning from the context label sequence and the acoustic feature sequence, it constructs an acoustic model that takes the context label sequence as input and outputs the acoustic feature sequence. In the present embodiment, the acoustic model, which is a statistical model, is constructed using a DNN. As a result, the DNN 16 stores the parameters representing the constructed acoustic model (statistical model).

図５に示す構成においては、コンテキストラベル系列を生成するコンポーネントとして、テキスト分析部１６２およびコンテキストラベル生成部１６４を含む。テキスト分析部１６２およびコンテキストラベル生成部１６４は、既知のテキストの文脈情報に基づくコンテキストラベルを生成する。 In the configuration shown in FIG. 5, the text analysis unit 162 and the context label generation unit 164 are included as the components that generate the context label series. The text analysis unit 162 and the context label generation unit 164 generate a context label based on the context information of the known text.

コンテキストラベルは、学習部１４および音声合成部１８の両方で用いるため、学習部１４および音声合成部１８が共通に利用する構成例を示している。しかしながら、学習部１４および音声合成部１８の各々に、コンテキストラベルを生成するためのコンポーネントをそれぞれ実装するようにしてもよい。 Since the context label is used by both the learning unit 14 and the speech synthesis unit 18, a configuration example commonly used by the learning unit 14 and the speech synthesis unit 18 is shown. However, each of the learning unit 14 and the speech synthesis unit 18 may be equipped with a component for generating a context label.

テキスト分析部１６２は、入力される学習用または合成対象のテキストを分析して、その文脈情報をコンテキストラベル生成部１６４へ出力する。コンテキストラベル生成部１６４は、テキスト分析部１６２からの分脈情報に基づいて、コンテキストラベルを決定してモデル学習部１４０へ出力する。 The text analysis unit 162 analyzes the input text, which is either text for learning or text to be synthesized, and outputs its context information to the context label generation unit 164. The context label generation unit 164 determines the context labels based on the context information from the text analysis unit 162 and outputs them to the model learning unit 140.

本実施の形態に従う音声合成システムにおいては、フレーム毎の音響特徴量を用いて学習を行なうので、コンテキストラベル生成部１６４についても、フレーム毎のコンテキストラベルを生成する。一般的に、コンテキストラベルは音素単位で生成されるため、コンテキストラベル生成部１６４は、音素内における各フレームの位置情報を付与することで、フレーム単位のコンテキストラベルを生成する。 In the speech synthesis system according to the present embodiment, since learning is performed using the acoustic features for each frame, the context label generation unit 164 also generates the context label for each frame. Generally, since the context label is generated in phoneme units, the context label generation unit 164 generates the context label in frame units by adding the position information of each frame in the phoneme.
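Expanding phoneme-level labels to frame-level labels can be sketched as follows. The label format, the per-phoneme frame counts (which would come from some alignment step not described here), and the function name `frame_level_labels` are all hypothetical and chosen only for illustration.

```python
def frame_level_labels(phone_labels, frames_per_phone):
    """Attach within-phoneme position information to each frame of each phoneme."""
    out = []
    for label, n in zip(phone_labels, frames_per_phone):
        for i in range(n):
            # Relative position in [0, 1]; a single-frame phoneme gets 0.0.
            pos = i / (n - 1) if n > 1 else 0.0
            out.append((label, i, n, pos))
    return out

# Two phonemes lasting 2 and 3 frames (hypothetical durations).
labels = frame_level_labels(["s", "u"], [2, 3])
```

Each tuple carries the phoneme-level label plus the frame index, the phoneme length in frames, and the relative position, so that frames inside the same phoneme receive distinct inputs.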

モデル学習部１４０は、分析部１２からの音響特徴量系列１４２と、コンテキストラベル生成部１６４からのコンテキストラベル系列１６６とを入力として、ＤＮＮを用いて音響モデルを学習する。このように、モデル学習部１４０は、Ｆ₀、周期成分のスペクトル包絡、非周期成分のスペクトル包絡を含む音響特徴量と、対応するコンテキストラベルとを対応付けて学習することで、統計モデルである音響モデルを構築する。 The model learning unit 140 takes as input the acoustic feature sequence 142 from the analysis unit 12 and the context label sequence 166 from the context label generation unit 164, and learns the acoustic model using a DNN. In this way, the model learning unit 140 constructs the acoustic model, which is a statistical model, by learning the association between the acoustic features, which include F₀, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.

モデル学習部１４０でのＤＮＮに基づく音響モデルの学習においては、フレーム毎にコンテキストラベルを入力するとともに、フレーム毎の音響特徴量ベクトル（要素として、少なくとも、連続的なｌｏｇＦ₀、周期成分のスペクトル包絡、非周期成分のスペクトル包絡を含む）を出力とするＤＮＮを用いることで、確率分布のモデル化を行なう。典型的には、モデル学習部１４０は、正規化された音響特徴量ベクトルについての平均二乗誤差を最小化するようにＤＮＮを学習する。このようなＤＮＮの学習は、以下の（４）式に示すように、フレーム毎に変化する平均ベクトルおよびコンテキスト非依存の共分散行列をもつ正規分布により、確率分布のモデル化を行なうことと等価である。 In learning the DNN-based acoustic model, the model learning unit 140 models the probability distribution using a DNN that receives a context label for each frame and outputs an acoustic feature vector for each frame (containing at least the continuous log F₀, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component). Typically, the model learning unit 140 trains the DNN so as to minimize the mean squared error of the normalized acoustic feature vectors. As shown in equation (4) below, training the DNN in this way is equivalent to modeling the probability distribution with a normal distribution that has a mean vector varying from frame to frame and a context-independent covariance matrix.

但し、λはＤＮＮのパラメータセットを示し、Ｕはグローバルな共分散行列を示し、μｔはＤＮＮにより推定される音声パラメータの平均ベクトルを示す。したがって、生成された確率分布系列は、時変な平均ベクトルおよび時不変な共分散行列をもつことになる。 Here, λ denotes the parameter set of the DNN, U denotes the global covariance matrix, and μ_t denotes the mean vector of the speech parameters estimated by the DNN. The generated probability distribution sequence therefore has a time-varying mean vector and a time-invariant covariance matrix.
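Equation (4) itself is not reproduced in this text. A form consistent with the description above (a normal distribution per frame with a DNN-estimated mean μ_t and a global covariance U) would be the following. The symbols o_t for the acoustic feature vector of frame t, l for the context label sequence, and T for the number of frames are assumptions introduced here for illustration.

```latex
P(\mathbf{o} \mid \mathbf{l}, \lambda)
  = \prod_{t=1}^{T} \mathcal{N}\!\left(\mathbf{o}_t ;\, \boldsymbol{\mu}_t,\, \mathbf{U}\right),
\qquad
-\log P(\mathbf{o} \mid \mathbf{l}, \lambda)
  = \frac{1}{2} \sum_{t=1}^{T}
    (\mathbf{o}_t - \boldsymbol{\mu}_t)^{\top} \mathbf{U}^{-1} (\mathbf{o}_t - \boldsymbol{\mu}_t)
    + \mathrm{const.}
```

Under the further assumption that U is diagonal and the features are normalized by its per-dimension variances, the quadratic term reduces to a sum of squared errors, which is why minimizing the mean squared error and maximizing this likelihood lead to the same DNN parameters.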

（ｄ３：音声合成部１８） (d3: Speech synthesis unit 18)
次に、音声合成部１８における処理および機能について説明する。音声合成部１８は、合成対象のテキストから生成されるフレーム毎のコンテキストラベルを生成し、生成したフレーム毎のコンテキストラベルをＤＮＮ１６に入力することで、確率分布系列を推定する。そして、推定した確率分布系列に基づいて、学習時とは逆の処理を経て、音声波形を合成する。 Next, the processing and functions of the speech synthesis unit 18 will be described. The speech synthesis unit 18 generates a context label for each frame from the text to be synthesized and inputs the generated per-frame context labels to the DNN 16 to estimate a probability distribution sequence. Based on the estimated probability distribution sequence, it then synthesizes the speech waveform through processing that is the reverse of that used during learning.

より具体的には、音声合成部１８は、音響特徴量推定部１８０と、パルス生成部１８４と、周期成分生成部１８６と、非周期成分生成部１８８と、加算部１８７とを含む。 More specifically, the speech synthesis unit 18 includes an acoustic feature amount estimation unit 180, a pulse generation unit 184, a periodic component generation unit 186, an aperiodic component generation unit 188, and an addition unit 187.

何らかの合成対象のテキストが入力されると、テキスト分析部１６２が入力されたテキストを分析して文脈情報を出力し、コンテキストラベル生成部１６４が分脈情報に基づいてコンテキストラベルを生成する。すなわち、テキスト分析部１６２およびコンテキストラベル生成部１６４は、任意のテキストの入力に応答して、当該テキストの文脈情報に基づくコンテキストラベルを決定する。 When a text to be synthesized is input, the text analysis unit 162 analyzes it and outputs context information, and the context label generation unit 164 generates context labels based on that context information. That is, in response to the input of an arbitrary text, the text analysis unit 162 and the context label generation unit 164 determine context labels based on the context information of that text.

音響特徴量推定部１８０は、ＤＮＮ１６に構築された統計モデルである音響モデルから決定されたコンテキストラベルに対応する音響特徴量を推定する。より具体的には、音響特徴量推定部１８０は、生成されたフレーム毎のコンテキストラベルを、学習された音響モデルを示すＤＮＮ１６に入力する。音響特徴量推定部１８０は、入力されたコンテキストラベルに対応する音響特徴量をＤＮＮ１６から推定する。コンテキストラベル系列の入力に対応して、ＤＮＮ１６からはフレーム毎に平均ベクトルのみが変化する確率分布系列である音響特徴量系列１８２が出力される。 The acoustic feature amount estimation unit 180 estimates the acoustic features corresponding to the determined context labels from the acoustic model, that is, the statistical model constructed in the DNN 16. More specifically, the acoustic feature amount estimation unit 180 inputs the generated per-frame context labels to the DNN 16, which represents the learned acoustic model, and estimates from the DNN 16 the acoustic features corresponding to the input context labels. In response to the input context label sequence, the DNN 16 outputs an acoustic feature sequence 182, which is a probability distribution sequence in which only the mean vector changes from frame to frame.

音響特徴量系列１８２に含まれる、補間された連続的なＦ₀（Ｆ₀系列）、周期成分のスペクトル包絡、非周期成分のスペクトル包絡は、ＤＮＮ１６を用いて、コンテキストラベル系列から推定される。 The interpolated continuous F₀ (F₀ sequence), the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component contained in the acoustic feature sequence 182 are estimated from the context label sequence using the DNN 16.

補間された連続的なＦ₀（Ｆ₀系列）は、連続分布として表現できるため、連続的なパルス系列から構成される。周期成分のスペクトル包絡および非周期成分のスペクトル包絡は、それぞれについてモデル化される。 Since the interpolated continuous F ₀ (F ₀ series) can be expressed as a continuous distribution, it is composed of a continuous pulse series. The spectral envelope of the periodic component and the spectral envelope of the aperiodic component are modeled for each.

パルス生成部１８４および周期成分生成部１８６は、推定された音響特徴量に含まれるＦ₀に従って生成されたパルス系列を、周期成分のスペクトル包絡に応じてフィルタリングすることで、周期成分を再構成する。より具体的には、パルス生成部１８４は、音響特徴量推定部１８０からのＦ₀（Ｆ₀系列）に従ってパルス系列を生成する。周期成分生成部１８６は、パルス生成部１８４からのパルス系列を周期成分のスペクトル包絡でフィルタリングすることで、周期成分を生成する。 The pulse generation unit 184 and the periodic component generation unit 186 reconstruct the periodic component by filtering a pulse sequence, generated according to the F₀ included in the estimated acoustic features, according to the spectral envelope of the periodic component. More specifically, the pulse generation unit 184 generates a pulse sequence according to the F₀ (F₀ sequence) from the acoustic feature amount estimation unit 180. The periodic component generation unit 186 generates the periodic component by filtering the pulse sequence from the pulse generation unit 184 with the spectral envelope of the periodic component.
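
As a minimal sketch of the pulse generation described above, the following Python code emits one unit pulse per pitch period from a frame-wise continuous F₀ contour. The function name `pulse_train`, the phase-accumulation approach, and the default 16 kHz sampling rate and 5 ms frame period (borrowed from the experimental conditions) are assumptions for illustration; the text does not specify how the pulse generation unit 184 is implemented.

```python
import numpy as np

def pulse_train(f0_hz, fs=16000, frame_shift_s=0.005):
    """Generate a unit pulse train whose pulse spacing follows a frame-wise F0 contour."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    hop = int(round(fs * frame_shift_s))      # samples per frame (80 at 16 kHz, 5 ms)
    # Hold each frame's F0 for one hop to obtain a per-sample F0 track.
    f0_per_sample = np.repeat(f0_hz, hop)
    # Accumulate phase in cycles; emit a pulse whenever the phase passes an integer.
    phase = np.cumsum(f0_per_sample / fs)
    cycles = np.floor(phase)
    pulses = np.zeros_like(phase)
    pulses[1:] = (cycles[1:] > cycles[:-1]).astype(float)
    return pulses
```

Because F₀ is continuous in this scheme, the sketch never has to switch the excitation off for unvoiced frames; the resulting pulse sequence would then be filtered with the spectral envelope of the periodic component.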

非周期成分生成部１８８は、ガウシアンノイズ系列などのノイズ系列を非周期成分のスペクトル包絡に応じてフィルタリングすることで、非周期成分を再構成する。より具体的には、非周期成分生成部１８８は、任意の励振源からのガウス性ノイズを非周期成分のスペクトル包絡でフィルタリングすることで、非周期成分を生成する。 The aperiodic component generation unit 188 reconstructs the aperiodic component by filtering a noise sequence such as a Gaussian noise sequence according to the spectral envelope of the aperiodic component. More specifically, the aperiodic component generation unit 188 generates an aperiodic component by filtering Gaussian noise from an arbitrary excitation source by the spectral envelope of the aperiodic component.
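
The noise filtering described above can be sketched as follows, assuming the spectral envelope of the aperiodic component is available per frame as an amplitude response on an FFT grid. The function name `shape_noise`, the Hann window, the overlap-add scheme, and the hop of 80 samples are illustrative assumptions, not details given in the text.

```python
import numpy as np

def shape_noise(envelopes, hop=80, seed=0):
    """Filter Gaussian noise frame by frame with amplitude spectral envelopes.

    envelopes: array of shape (n_frames, n_bins), amplitude response on an
               rfft grid (n_bins = fft_size // 2 + 1).
    """
    envelopes = np.asarray(envelopes, dtype=float)
    n_frames, n_bins = envelopes.shape
    fft_size = 2 * (n_bins - 1)
    rng = np.random.default_rng(seed)
    window = np.hanning(fft_size)
    out = np.zeros(n_frames * hop + fft_size)
    for i, env in enumerate(envelopes):
        noise = rng.standard_normal(fft_size)           # Gaussian excitation
        spectrum = np.fft.rfft(noise * window) * env    # real gain per bin, so zero phase
        start = i * hop
        # Overlap-add the filtered frame into the output buffer.
        out[start:start + fft_size] += np.fft.irfft(spectrum, n=fft_size)
    return out
```

Multiplying in the frequency domain amounts to circular filtering within each frame, which is a simplification; a production vocoder would typically derive a proper impulse response from the envelope.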

加算部１８７は、周期成分生成部１８６からの周期成分と非周期成分生成部１８８からの非周期成分とを加算することで、音声波形を再構成する。すなわち、加算部１８７は、再構成された周期成分および非周期成分を加算して、入力された任意のテキストに対応する音声波形として出力する。 The addition unit 187 reconstructs the audio waveform by adding the periodic component from the periodic component generation unit 186 and the aperiodic component from the aperiodic component generation unit 188. That is, the addition unit 187 adds the reconstructed periodic component and the non-periodic component, and outputs the voice waveform corresponding to the input arbitrary text.

上述したように、本実施の形態に従う音声合成システムにおいては、予め学習により構築されたＤＮＮ１６を用いて、フレーム毎のコンテキストラベルについて確率分布系列を推定するとともに、静的特徴量と動的特徴量との間の明示的な関係を利用することで，適切に遷移する音響特徴量系列を生成する。そして、生成された音響特徴量系列をボコーダーに適用することで、推定された音響特徴量から合成音声を生成する。 As described above, in the speech synthesis system according to the present embodiment, the DNN 16 constructed in advance by learning is used to estimate a probability distribution series for the frame-by-frame context labels, and an acoustic feature series that transitions appropriately is generated by exploiting the explicit relationship between the static features and the dynamic features. The generated acoustic feature series is then applied to a vocoder to generate synthesized speech from the estimated acoustic features.
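
One common way to exploit the explicit relationship between static and dynamic features is maximum-likelihood parameter generation, in which the static trajectory is chosen so that its static, first-order, and second-order values best match the predicted means. The following is a one-dimensional sketch under stated assumptions: the window coefficients, the function names, and the use of per-frame diagonal variances are illustrative, and the text does not confirm that this exact formulation is used.

```python
import numpy as np

WINDOWS = (
    np.array([0.0, 1.0, 0.0]),    # static
    np.array([-0.5, 0.0, 0.5]),   # first-order dynamic (assumed window)
    np.array([1.0, -2.0, 1.0]),   # second-order dynamic (assumed window)
)

def build_window_matrix(n_frames):
    """Stack the three windows into W so that [static; delta; delta2] = W @ c."""
    blocks = []
    for win in WINDOWS:
        block = np.zeros((n_frames, n_frames))
        for t in range(n_frames):
            for k, coef in zip((-1, 0, 1), win):
                # Clamp at the edges so boundary frames reuse their nearest neighbour.
                block[t, min(max(t + k, 0), n_frames - 1)] += coef
        blocks.append(block)
    return np.vstack(blocks)

def generate_trajectory(means, variances):
    """means, variances: arrays of shape (3, n_frames) for static, delta, delta2."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n_frames = means.shape[1]
    w = build_window_matrix(n_frames)
    mu = means.reshape(-1)                  # same stacking order as the blocks of W
    precision = 1.0 / variances.reshape(-1)
    # Normal equations of the weighted least-squares problem: (W' P W) c = W' P mu
    lhs = w.T @ (precision[:, None] * w)
    rhs = w.T @ (precision * mu)
    return np.linalg.solve(lhs, rhs)
```

With very large dynamic-feature variances the result approaches the static means themselves; smaller variances pull the trajectory toward the predicted dynamics and smooth frame-to-frame transitions.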

このように、本実施の形態に従う音声合成システムにおいては、Ｖ／ＵＶの判定を行なうことなく、連続的な系列から音声波形を生成できる。 As described above, in the speech synthesis system according to the present embodiment, the speech waveform can be generated from the continuous sequence without determining V / UV.

なお、本実施の形態においては、典型例として、学習手段としてＤＮＮを用いるシステムを説明するが、学習手段としてはＤＮＮに限られず、任意の教師あり学習の方法を採用できる。例えば、ＨＭＭや再帰型ニューラルネットワーク（Recurrent Neural Network）などを採用してもよい。 In the present embodiment, a system using DNN as a learning means will be described as a typical example, but the learning means is not limited to DNN, and any supervised learning method can be adopted. For example, an HMM or a recurrent neural network may be adopted.

［Ｅ．処理手順］ [E. Processing procedure]
図７および図８は、本実施の形態に従う音声合成システムにおける処理手順の一例を示すフローチャートである。図７および図８に示す各ステップは、１または複数のプロセッサ（例えば、図２に示すプロセッサ１００）が１または複数のプログラムを実行することで実現されてもよい。 FIGS. 7 and 8 are flowcharts showing an example of the processing procedure in the speech synthesis system according to the present embodiment. Each step shown in FIGS. 7 and 8 may be realized by one or more processors (e.g., the processor 100 shown in FIG. 2) executing one or more programs.

図７には、ＤＮＮ１６を構築するための事前の機械学習の処理を示し、図８には、ＤＮＮ１６を用いた音声合成の処理を示す。 FIG. 7 shows the preliminary machine learning process for constructing the DNN 16, and FIG. 8 shows the speech synthesis process using the DNN 16.

図７を参照して、プロセッサ１００は、既知のテキストおよび当該テキストに対応する音声波形が入力されると（ステップＳ１００）、入力された音声波形をフレームに区切り（ステップＳ１０２）、フレーム毎に、入力されたテキストからコンテキストラベルを生成する処理（ステップＳ１１０〜Ｓ１１２）、および、音響特徴量系列を生成する処理（ステップＳ１２０〜Ｓ１２８）を実行することで、コンテキストラベル系列および音響特徴量系列を生成する。 Referring to FIG. 7, when a known text and the speech waveform corresponding to the text are input (step S100), the processor 100 divides the input speech waveform into frames (step S102) and, for each frame, executes the process of generating a context label from the input text (steps S110 to S112) and the process of generating the acoustic feature series (steps S120 to S128), thereby generating a context label series and an acoustic feature series.

すなわち、プロセッサ１００は、入力されたテキストを分析して文脈情報を生成し（ステップＳ１１０）、当該生成された文脈情報に基づいて、対応するフレームについてのコンテキストラベルを決定する（ステップＳ１１２）。 That is, the processor 100 analyzes the input text to generate contextual information (step S110), and determines the context label for the corresponding frame based on the generated contextual information (step S112).

また、プロセッサ１００は、入力された音声波形の対象フレームにおけるＦ₀を抽出し（ステップＳ１２０）、先に抽出されたＦ₀との間で補間処理を行なうことで、連続的なＦ₀を決定する（ステップＳ１２２）。そして、プロセッサ１００は、入力された音声波形の対象フレームにおける周期成分および非周期成分を抽出し（ステップＳ１２４）、それぞれの成分についてのスペクトル包絡を抽出する（ステップＳ１２６）。プロセッサ１００は、ステップＳ１２２において決定した連続的なＦ₀の対数、ならびに、ステップＳ１２６において抽出したスペクトル包絡（周期成分および非周期成分）を音響特徴量として決定する（ステップＳ１２８）。 The processor 100 also extracts F₀ in the target frame of the input speech waveform (step S120) and determines a continuous F₀ by performing interpolation with the previously extracted F₀ (step S122). The processor 100 then extracts the periodic component and the aperiodic component in the target frame of the input speech waveform (step S124) and extracts the spectral envelope of each component (step S126). The processor 100 determines the logarithm of the continuous F₀ determined in step S122 and the spectral envelopes (periodic and aperiodic components) extracted in step S126 as the acoustic features (step S128).
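
The interpolation of step S122 and the logarithm taken in step S128 can be sketched as follows. The text only says that interpolation is performed, so linear interpolation in the log domain, the convention that frames without a detected F₀ are marked with 0, and the function name `continuous_log_f0` are assumptions.

```python
import numpy as np

def continuous_log_f0(f0_hz):
    """Fill frames with no detected F0 (marked 0) by linear interpolation of log F0."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    if not voiced.any():
        raise ValueError("no voiced frame to interpolate from")
    frames = np.arange(len(f0_hz))
    log_f0_voiced = np.log(f0_hz[voiced])
    # np.interp holds the first/last voiced value across leading/trailing gaps.
    return np.interp(frames, frames[voiced], log_f0_voiced)
```

For example, the contour `[0, 100, 0, 0, 200, 0]` Hz yields a log F₀ track that rises steadily from 100 Hz to 200 Hz across the gap and is held constant at both ends.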

プロセッサ１００は、ステップＳ１１２において決定されたコンテキストラベルと、ステップＳ１２８において決定された音響特徴量とをＤＮＮ１６に追加する（ステップＳ１３０）。そして、プロセッサ１００は、未処理のフレームが存在するか否かを判断し（ステップＳ１３２）、未処理のフレームが存在する場合（ステップＳ１３２においてＹＥＳの場合）には、ステップＳ１１０〜Ｓ１１２、および、ステップＳ１２０〜Ｓ１２８の処理を繰返す。また、未処理のフレームが存在しない場合（ステップＳ１３２においてＮＯの場合）には、プロセッサ１００は、新たなテキストおよび当該テキストに対応する音声波形が入力されたか否かを判断し（ステップＳ１３４）、新たなテキストおよび当該テキストに対応する音声波形が入力された場合（ステップＳ１３４においてＹＥＳの場合）には、ステップＳ１０２以下の処理を繰返す。 The processor 100 adds the context label determined in step S112 and the acoustic features determined in step S128 to the DNN 16 (step S130). Then, the processor 100 determines whether or not there is an unprocessed frame (step S132), and if there is an unprocessed frame (YES in step S132), steps S110 to S112, and The processing of steps S120 to S128 is repeated. Further, when there is no unprocessed frame (NO in step S132), the processor 100 determines whether or not a new text and a voice waveform corresponding to the text have been input (step S134). When a new text and a voice waveform corresponding to the text are input (YES in step S134), the processing of step S102 and subsequent steps is repeated.

新たなテキストおよび当該テキストに対応する音声波形が入力されていない場合（ステップＳ１３４においてＮＯの場合）には、学習処理は終了する。 If a new text and a voice waveform corresponding to the text are not input (NO in step S134), the learning process ends.

なお、上述の説明においては、コンテキストラベルおよび音響特徴量が生成される毎に、ＤＮＮ１６へ入力する処理例を示すが、対象の音声波形からコンテキストラベル系列および音響特徴量系列の生成が完了した後に、まとめてＤＮＮ１６へ入力するようにしてもよい。 The above description shows an example in which a context label and an acoustic feature are input to the DNN 16 each time they are generated; alternatively, they may be input to the DNN 16 collectively after the context label series and the acoustic feature series have been generated from the target speech waveform.

次に、図８を参照して、プロセッサ１００は、合成対象のテキストが入力されると（ステップＳ２００）、入力されたテキストを分析して文脈情報を生成し（ステップＳ２０２）、当該生成された文脈情報に基づいて、対応するフレームについてのコンテキストラベルを決定する（ステップＳ２０４）。そして、プロセッサ１００は、ステップＳ２０４において決定したコンテキストラベルに対応する音響特徴量をＤＮＮ１６から推定する（ステップＳ２０６）。 Next, referring to FIG. 8, when text to be synthesized is input (step S200), the processor 100 analyzes the input text to generate context information (step S202) and, based on the generated context information, determines the context label for the corresponding frame (step S204). The processor 100 then estimates from the DNN 16 the acoustic features corresponding to the context label determined in step S204 (step S206).

プロセッサ１００は、推定した音響特徴量に含まれるＦ₀に従ってパルス系列を発生する（ステップＳ２０８）とともに、推定した音響特徴量に含まれるスペクトル包絡（周期成分）で当該発生したパルス系列をフィルタリングすることで、音声波形の周期成分を生成する（ステップＳ２１０）。 The processor 100 generates a pulse sequence according to the F₀ included in the estimated acoustic features (step S208) and filters the generated pulse sequence with the spectral envelope (periodic component) included in the estimated acoustic features, thereby generating the periodic component of the speech waveform (step S210).

また、プロセッサ１００は、ガウシアンノイズ系列を発生する（ステップＳ２１２）とともに、推定した音響特徴量に含まれるスペクトル包絡（非周期成分）で当該発生したガウシアンノイズ系列をフィルタリングすることで、音声波形の非周期成分を生成する（ステップＳ２１４）。 The processor 100 also generates a Gaussian noise sequence (step S212) and filters the generated Gaussian noise sequence with the spectral envelope (aperiodic component) included in the estimated acoustic features, thereby generating the aperiodic component of the speech waveform (step S214).

最終的に、プロセッサ１００は、ステップＳ２１０において生成した周期成分とステップＳ２１４において生成した非周期成分とを加算して、合成音声の音声波形として出力する（ステップＳ２１６）。そして、入力されたテキストに対する音声合成処理は終了する。なお、ステップＳ２０６〜Ｓ２１６の処理は、入力されたテキストを構成するフレームの数だけ繰返される。 Finally, the processor 100 adds the periodic component generated in step S210 and the aperiodic component generated in step S214, and outputs the voice waveform of the synthesized voice (step S216). Then, the voice synthesis process for the input text ends. The processing of steps S206 to S216 is repeated by the number of frames constituting the input text.

［Ｆ．実験的評価］ [F. Experimental evaluation]
次に、本実施の形態に従う音声合成システムのおける有効性について実施した実験的評価について説明する。 Next, the experimental evaluation carried out on the effectiveness of the speech synthesis system according to the present embodiment will be described.

（ｆ１：実験条件） (f1: Experimental conditions)
本実施の形態に係る実施例の比較対象となる比較例として、一般的なＤＮＮ音声合成を用いた。 General DNN speech synthesis was used as the comparative example against which the example according to the present embodiment was compared.

音声データとして、日本語女性話者１名により発声されたＡＴＲ音素バランス文５０３文を用いた。このうち、４９３文を学習データとして用いるとともに、残り１０文を評価文として用いた。 As the voice data, 503 ATR phoneme balance sentences uttered by one Japanese female speaker were used. Of these, 493 sentences were used as learning data, and the remaining 10 sentences were used as evaluation sentences.

音声データのサンプリング周波数は１６ｋＨｚとし、分析周期は５ｍｓとした。学習データの音声データに対するＷＯＲＬＤ分析によって得られた、スペクトルおよび非周期性指標（ＡＰ）を、それぞれ３９次のメルケプストラム係数（０次を含めて４０次）として表現した。 The sampling frequency of the speech data was 16 kHz, and the analysis period was 5 ms. The spectrum and the aperiodic index (AP) obtained by WORLD analysis of the training speech data were each expressed as 39th-order mel-cepstral coefficients (40 coefficients including the 0th order).

ｌｏｇＦ₀については、公知の複数の抽出法による結果を統合することで算出した上で、平滑化によってマイクロプロソディを除去した。 log F₀ was calculated by integrating the results of a plurality of known extraction methods, after which microprosody was removed by smoothing.

実施例の音素継続長モデルは、比較例のＨＭＭ音声合成と同様に、音素単位のコンテキストラベルを用いて、５状態のスキップ無しleft-to-right型のコンテキスト依存音素ＨＳＭＭ（hidden semi-Markov model：隠れセミマルコフモデル）を学習した。また、ＤＮＮによる音響モデルの学習では、さらに無声区間を補間した連続ｌｏｇＦ₀パターンを用いた。これらのパラメータに対して、さらに１次動的特徴量および２次動的特徴量を付与したものを音響特徴量とした。 For the phoneme duration model of the example, in the same manner as the HMM speech synthesis of the comparative example, a five-state, no-skip, left-to-right context-dependent phoneme HSMM (hidden semi-Markov model) was trained using phoneme-level context labels. In training the DNN acoustic model, a continuous log F₀ pattern in which the unvoiced sections were interpolated was additionally used. These parameters, with first-order and second-order dynamic features appended, were used as the acoustic features.
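
Appending first-order and second-order dynamic features to the static parameters can be sketched as below. The window coefficients `[-0.5, 0, 0.5]` and `[1, -2, 1]`, the edge handling by repeating the boundary frames, and the function name `add_dynamic_features` are assumptions; the text does not state which windows were used.

```python
import numpy as np

def add_dynamic_features(static):
    """Append first- and second-order dynamic features to a (n_frames, dim) array."""
    static = np.asarray(static, dtype=float)
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")   # repeat edge frames
    prev, nxt = padded[:-2], padded[2:]
    delta = 0.5 * (nxt - prev)              # assumed window [-0.5, 0, 0.5]
    delta2 = prev - 2.0 * static + nxt      # assumed window [1, -2, 1]
    return np.hstack([static, delta, delta2])
```

The output has three times as many columns as the input, so an 81-dimensional static vector per frame would become a 243-dimensional acoustic feature vector.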

比較例のＤＮＮ音声合成については、上記特徴量に加え、Ｖ／ＵＶ情報を用いた。入力ベクトルは、音素単位のコンテキストラベルに対して、ＨＳＭＭの継続長モデルから得られた継続長情報を付与することで、フレーム毎のコンテキストラベルを生成し、合計４８３次元のベクトルとして表現した。 For the DNN speech synthesis of the comparative example, V/UV information was used in addition to the above features. For the input vector, frame-level context labels were generated by adding the duration information obtained from the HSMM duration model to the phoneme-level context labels, and were expressed as a vector with a total of 483 dimensions.

出力ベクトルは、比較例が２４４次元の音響特徴量のベクトルとし、実施例が２４３次元の音響特徴量のベクトルとした。 The output vector was a 244-dimensional acoustic feature vector in the comparative example and a 243-dimensional acoustic feature vector in the example.
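
The two output dimensions are consistent with the following breakdown, which is an inference rather than something the text states: two 40-coefficient mel-cepstral streams plus one log F₀ value give 81 static dimensions, tripled by the static, first-order, and second-order features, with one extra V/UV dimension in the comparative example. The names in this sketch are illustrative.

```python
# Assumed breakdown (not stated explicitly in the text): 40 mel-cepstral
# coefficients per stream, one log F0 value, and static + delta + delta-delta.
MCEP_DIM = 40    # 39th order plus the 0th coefficient
STREAMS = 3      # static, first-order dynamic, second-order dynamic

def output_dim(n_envelopes, with_vuv):
    static_dim = n_envelopes * MCEP_DIM + 1      # envelopes + log F0
    return static_dim * STREAMS + (1 if with_vuv else 0)

example_dim = output_dim(n_envelopes=2, with_vuv=False)      # periodic + aperiodic envelopes
comparative_dim = output_dim(n_envelopes=2, with_vuv=True)   # spectrum + AP, plus a V/UV flag
print(example_dim, comparative_dim)   # 243 244
```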

実施例および比較例にそれぞれ用いた特徴量およびモデルの一覧を以下の表１に示す。但し、入力ベクトルおよび出力ベクトルは、いずれも平均が０、分散が１となるように正規化した。 Table 1 below shows a list of features and models used in the examples and comparative examples, respectively. However, both the input vector and the output vector were normalized so that the mean was 0 and the variance was 1.
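
The normalization to zero mean and unit variance can be sketched as follows, assuming the statistics are computed per dimension over the training data; the function names and the small `eps` guard for constant dimensions are illustrative additions.

```python
import numpy as np

def fit_normalizer(data, eps=1e-8):
    """Return per-dimension mean and standard deviation of a (n_frames, dim) array."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return mean, np.maximum(std, eps)   # eps guards constant dimensions

def normalize(data, mean, std):
    return (np.asarray(data, dtype=float) - mean) / std

def denormalize(data, mean, std):
    # Applied to network outputs to return to the original feature scale.
    return np.asarray(data, dtype=float) * std + mean
```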

ＤＮＮのネットワーク構成は、隠れ層を６層とし、ユニット数１０２４とした上で、重みは乱数を用いて初期化した。また、ミニバッチサイズは２５６として、ｅｐｏｃｈ数は３０として、学習係数は２．５×１０^４として、隠れ層の活性化関数はＲｅＬＵ（rectified linear unit）とし、optimizerはＡｄａｍとした。また、重み０．５のDropoutも用いた。 For the DNN network configuration, there were six hidden layers with 1024 units each, and the weights were initialized with random numbers. The mini-batch size was 256, the number of epochs was 30, the learning coefficient was 2.5 × 10^4, the hidden-layer activation function was ReLU (rectified linear unit), and the optimizer was Adam. Dropout with a weight of 0.5 was also used.
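
The layer sizes above can be illustrated with a plain NumPy forward pass. This is only a shape sketch under assumptions: the initialization scale, the linear output layer, and the function names are not taken from the text, and training (Adam, mini-batches, Dropout) is omitted.

```python
import numpy as np

def init_mlp(in_dim=483, hidden_dim=1024, n_hidden=6, out_dim=243, seed=0):
    """Randomly initialise weights for a feed-forward net with the stated layer sizes."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden_dim] * n_hidden + [out_dim]
    # The scale of the random initialisation is an assumption; the text only says "random".
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """ReLU on the hidden layers, linear output layer (regression of acoustic features)."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b
```

With the default sizes, a batch of frame-level context vectors of shape `(n, 483)` maps to acoustic feature vectors of shape `(n, 243)`.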

（ｆ２：主観評価） (f2: Subjective evaluation)
表１に示すように、実施例と比較例との間で音響特徴量が異なっているため、客観評価ではなく主観評価にて評価した。より具体的には、対比較実験により合成音声の自然性を比較した。 As shown in Table 1, because the acoustic features differ between the example and the comparative example, the evaluation was performed subjectively rather than objectively. More specifically, the naturalness of the synthesized speech was compared in a paired comparison experiment.

上述したように、ＡＴＲ音素バランス文５０３文のうち学習データとしなかった１０文を評価音声とした。実施例および比較例のそれぞれによって生成された合成音声を被験者（内訳：男性４名、女性１名）に聞いてもらい、より自然性である（音声品質が高い）と感じたものを選択してもらった。但し、提示音声対に差が感じられない際には、「どちらでもない」という選択肢を認めた。 As described above, the 10 sentences of the 503 ATR phoneme-balanced sentences that were not used as training data were used as the evaluation speech. The subjects (four males and one female) listened to the synthesized speech generated by the example and by the comparative example and selected the one they felt was more natural (higher speech quality). When no difference was perceived between the presented pair, the option "neither" was allowed.

なお、実施例および比較例ともに、スペクトル包絡のメルケプストラム係数に対するポストフィルタを適用した。 In both the example and the comparative example, a post-filter was applied to the mel-cepstral coefficients of the spectral envelope.

図９は、本実施の形態に従う音声合成システムについての対比較実験の評価結果例を示す図である。図９において、比較例の非周期性指標（ＡＰ）は０．０から１．０の間で非周期性を表現している。 FIG. 9 is a diagram showing an example of evaluation results of a pair comparison experiment for a speech synthesis system according to the present embodiment. In FIG. 9, the aperiodic index (AP) of the comparative example expresses aperiodicity between 0.0 and 1.0.

図９中のαはＡＰのしきい値を示す。α＝０．０の場合に完全に有声となり、α＝１．０の場合に完全に無声となる。ＡＰがしきい値αより低い場合は有声とし、高い場合は無声とした。 α in FIG. 9 indicates the AP threshold. When α = 0.0, it becomes completely voiced, and when α = 1.0, it becomes completely unvoiced. When AP was lower than the threshold α, the frame was treated as voiced, and when it was higher, as unvoiced.
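
The thresholding rule used for the comparative example can be written as a one-line decision. This sketch assumes one scalar AP value per frame and treats an AP exactly equal to α as unvoiced, a case the text leaves open; the function name is illustrative.

```python
def vuv_decision(ap_values, alpha):
    """Return True for frames treated as voiced (AP below the threshold alpha)."""
    # An AP equal to alpha is treated as unvoiced here (assumption).
    return [ap < alpha for ap in ap_values]
```

A frame with AP = 0.55 is unvoiced at α = 0.5 but voiced at α = 0.6, which illustrates how the hard decision depends on the chosen threshold.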

予備実験においてＶ／ＵＶの判定エラー率の低かったしきい値として、α＝０．５およびα＝０．６を用いた（図９（ａ）および（ｂ））。また、図９（ｃ）の「ｒｅｆｅｒｅｎｃｅ」は、Ｖ／ＵＶの判定結果の正解を与えた場合の結果を示す。 In the preliminary experiment, α = 0.5 and α = 0.6 were used as the threshold values at which the V / UV determination error rate was low (FIGS. 9 (a) and 9 (b)). Further, “reference” in FIG. 9C shows the result when the correct answer of the V / UV determination result is given.

図９（ａ）〜（ｃ）に示すいずれの場合についても、実施例が比較例に対して、検定統計量のｐ値がｐ＜０．０１となり、有意性を示したことが確認された。 In all of the cases shown in FIGS. 9(a) to 9(c), the p-value of the test statistic for the example relative to the comparative example was p < 0.01, confirming that the difference was significant.

（ｆ３：実験的評価の結論） (f3: Conclusion of the experimental evaluation)
本実施の形態に従う音声合成システムにおいては、入力音声を周期成分／非周期成分に分離することにより、連続的にＦ₀およびスペクトル包絡のトラジェクトリを表現できた。このような手法を採用することにより、モデリング精度の改善およびＶ／ＵＶの判定エラーの回避といった利点を得ることができたと考えられる。 In the speech synthesis system according to the present embodiment, separating the input speech into periodic and aperiodic components made it possible to represent the trajectories of F₀ and the spectral envelopes continuously. Adopting this approach is considered to have yielded advantages such as improved modeling accuracy and avoidance of V/UV determination errors.

上述の主観評価の結果によれば、本実施の形態に係る実施例は、比較例に対して正しいＶ／ＵＶ情報が与えられたときでさえ、より優れた性能を示した。このような結果によれば、周期成分と非周期成分とに分離したモデリングが品質改善に寄与していると評価できる。 According to the results of the subjective evaluation described above, the examples according to the present embodiment showed better performance even when correct V / UV information was given to the comparative examples. Based on these results, it can be evaluated that the modeling separated into the periodic component and the non-periodic component contributes to the quality improvement.

［Ｇ．まとめ］ [G. Summary]
本実施の形態に従う音声合成システムにおいては、ＳＰＳＳを実施するにあたって、源信号についてＶ／ＵＶを判定する必要のない手法を採用した。Ｖ／ＵＶを判定する代わりに、源信号を周期成分と非周期成分との組み合わせとして表現することで、Ｖ／ＵＶの判定エラーによる合成音声への品質劣化を抑制することができる。また、Ｆ₀系列を連続化することで、構築される音響モデルのモデリング精度を向上することもできる。 In implementing SPSS, the speech synthesis system according to the present embodiment adopts a method that does not require a V/UV determination for the source signal. By expressing the source signal as a combination of a periodic component and an aperiodic component instead of determining V/UV, degradation of the synthesized speech caused by V/UV determination errors can be suppressed. In addition, making the F₀ sequence continuous can improve the modeling accuracy of the constructed acoustic model.

本実施の形態に従う音声合成システムによる合成音声については、主観評価ながら、従来の手法に比較して、十分に品質を向上させることができることが示された。 Although the evaluation was subjective, it was shown that the quality of the speech synthesized by the speech synthesis system according to the present embodiment can be sufficiently improved compared with the conventional method.

今回開示された実施の形態は、すべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は、上記した実施の形態の説明ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。 The embodiments disclosed this time should be considered as exemplary in all respects and not restrictive. The scope of the present invention is shown by the claims rather than the description of the embodiments described above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

１多言語翻訳システム、２ネットワーク、４ユーザ、１０サービス提供装置、１２分析部、１４学習部、１８音声合成部、２０音声認識部、２２翻訳部、２４通信処理部、３０携帯端末、１００プロセッサ、１０２主メモリ、１０４ディスプレイ、１０６入力デバイス、１０８ネットワークインターフェイス、１１０内部バス、１１２二次記憶装置、１２０Ｆ₀抽出部、１２１分析プログラム、１２２周期／非周期成分抽出部、１２４特徴量抽出部、１２６Ｆ₀補間部、１２８スペクトル包絡抽出部、１３０入力音声、１３２テキスト、１３４光学ドライブ、１３６光学ディスク、１４０モデル学習部、１４１学習プログラム、１４２，１８２音響特徴量系列、１６２テキスト分析部、１６４コンテキストラベル生成部、１６６コンテキストラベル系列、１８０音響特徴量推定部、１８１音声合成プログラム、１８４，２００，２５０パルス生成部、１８６周期成分生成部、１８７，２０８加算部、１８８非周期成分生成部、２０１音声認識プログラム、２０４ガウシアンノイズ生成部、２２１翻訳プログラム、２５２ホワイトノイズ生成部、２５４切替部、２５６音声合成フィルタ。 1 multilingual translation system, 2 network, 4 user, 10 service providing device, 12 analysis unit, 14 learning unit, 18 speech synthesis unit, 20 speech recognition unit, 22 translation unit, 24 communication processing unit, 30 mobile terminal, 100 processor, 102 main memory, 104 display, 106 input device, 108 network interface, 110 internal bus, 112 secondary storage device, 120 F₀ extraction unit, 121 analysis program, 122 periodic/aperiodic component extraction unit, 124 feature amount extraction unit, 126 F₀ interpolation unit, 128 spectral envelope extraction unit, 130 input speech, 132 text, 134 optical drive, 136 optical disc, 140 model learning unit, 141 learning program, 142, 182 acoustic feature series, 162 text analysis unit, 164 context label generation unit, 166 context label series, 180 acoustic feature amount estimation unit, 181 speech synthesis program, 184, 200, 250 pulse generation unit, 186 periodic component generation unit, 187, 208 addition unit, 188 aperiodic component generation unit, 201 speech recognition program, 204 Gaussian noise generation unit, 221 translation program, 252 white noise generation unit, 254 switching unit, 256 speech synthesis filter.

Claims

統計的パラメトリック音声合成に従う音声合成システムであって、
既知のテキストに対応する音声波形の基本周波数を単位区間毎に抽出する第１の抽出部と、
前記音声波形から周期成分および非周期成分を単位区間毎に抽出する第２の抽出部と、
前記抽出された周期成分および非周期成分のスペクトル包絡を抽出する第３の抽出部と、
前記既知のテキストの文脈情報に基づくコンテキストラベルを生成する生成部と、
前記基本周波数、前記周期成分のスペクトル包絡、前記非周期成分のスペクトル包絡を含む音響特徴量と、対応する前記コンテキストラベルとを対応付けて学習することで、統計モデルを構築する学習部とを備える、音声合成システム。
A speech synthesis system according to statistical parametric speech synthesis, comprising:
a first extraction unit that extracts, for each unit interval, the fundamental frequency of a speech waveform corresponding to a known text;
a second extraction unit that extracts a periodic component and an aperiodic component from the speech waveform for each unit interval;
a third extraction unit that extracts the spectral envelopes of the extracted periodic and aperiodic components;
a generation unit that generates a context label based on context information of the known text; and
a learning unit that constructs a statistical model by learning acoustic features, including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, in association with the corresponding context label.

任意のテキストの入力に応答して、当該テキストの文脈情報に基づくコンテキストラベルを決定する決定部と、
前記統計モデルから前記決定部により決定されたコンテキストラベルに対応する音響特徴量を推定する推定部とを備え、当該推定される音響特徴量は、基本周波数、周期成分のスペクトル包絡、非周期成分のスペクトル包絡を含み、
前記推定された音響特徴量に含まれる基本周波数に従って生成されたパルス系列を、周期成分のスペクトル包絡に応じてフィルタリングすることで、周期成分を再構成する第１の再構成部と、
ノイズ系列を非周期成分のスペクトル包絡に応じてフィルタリングすることで、非周期成分を再構成する第２の再構成部と、
前記再構成された周期成分および非周期成分を加算して、前記入力された任意のテキストに対応する音声波形として出力する加算部とをさらに備える、請求項１に記載の音声合成システム。
The speech synthesis system according to claim 1, further comprising:
a determination unit that, in response to the input of arbitrary text, determines a context label based on context information of the text;
an estimation unit that estimates, from the statistical model, acoustic features corresponding to the context label determined by the determination unit, the estimated acoustic features including a fundamental frequency, a spectral envelope of a periodic component, and a spectral envelope of an aperiodic component;
a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency included in the estimated acoustic features, according to the spectral envelope of the periodic component;
a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence according to the spectral envelope of the aperiodic component; and
an addition unit that adds the reconstructed periodic and aperiodic components and outputs the sum as a speech waveform corresponding to the input arbitrary text.

前記第２の抽出部は、前記第１の抽出部が前記基本周波数を抽出できない単位区間から前記非周期成分のみを抽出し、それ以外の単位区間から前記周期成分および前記非周期成分を抽出する、請求項１または２に記載の音声合成システム。 The speech synthesis system according to claim 1 or 2, wherein the second extraction unit extracts only the aperiodic component from unit intervals in which the first extraction unit cannot extract the fundamental frequency, and extracts the periodic component and the aperiodic component from the other unit intervals.

前記第１の抽出部は、前記基本周波数を抽出できない単位区間について、補間処理により基本周波数を決定する、請求項１〜３のいずれか１項に記載の音声合成システム。 The speech synthesis system according to any one of claims 1 to 3, wherein the first extraction unit determines the fundamental frequency by interpolation processing for a unit interval in which the fundamental frequency cannot be extracted.

統計的パラメトリック音声合成に従う音声合成方法を実現するための音声合成プログラムであって、前記音声合成プログラムはコンピュータに
既知のテキストに対応する音声波形の基本周波数を単位区間毎に抽出するステップと、
前記音声波形から周期成分および非周期成分を単位区間毎に抽出するステップと、
前記抽出された周期成分および非周期成分のスペクトル包絡を抽出するステップと、
前記既知のテキストの文脈情報に基づくコンテキストラベルを生成するステップと、
前記基本周波数、前記周期成分のスペクトル包絡、前記非周期成分のスペクトル包絡を含む音響特徴量と、対応する前記コンテキストラベルとを対応付けて学習することで、統計モデルを構築するステップとを実行させる、音声合成プログラム。
A speech synthesis program for realizing a speech synthesis method according to statistical parametric speech synthesis, the speech synthesis program causing a computer to execute:
a step of extracting, for each unit interval, the fundamental frequency of a speech waveform corresponding to a known text;
a step of extracting a periodic component and an aperiodic component from the speech waveform for each unit interval;
a step of extracting the spectral envelopes of the extracted periodic and aperiodic components;
a step of generating a context label based on context information of the known text; and
a step of constructing a statistical model by learning acoustic features, including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, in association with the corresponding context label.

統計的パラメトリック音声合成に従う音声合成方法であって、
既知のテキストに対応する音声波形の基本周波数を単位区間毎に抽出するステップと、
前記音声波形から周期成分および非周期成分を単位区間毎に抽出するステップと、
前記抽出された周期成分および非周期成分のスペクトル包絡を抽出するステップと、
前記既知のテキストの文脈情報に基づくコンテキストラベルを生成するステップと、
前記基本周波数、前記周期成分のスペクトル包絡、前記非周期成分のスペクトル包絡を含む音響特徴量と、対応する前記コンテキストラベルとを対応付けて学習することで、統計モデルを構築するステップとを備える、音声合成方法。
A speech synthesis method according to statistical parametric speech synthesis, comprising:
a step of extracting, for each unit interval, the fundamental frequency of a speech waveform corresponding to a known text;
a step of extracting a periodic component and an aperiodic component from the speech waveform for each unit interval;
a step of extracting the spectral envelopes of the extracted periodic and aperiodic components;
a step of generating a context label based on context information of the known text; and
a step of constructing a statistical model by learning acoustic features, including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, in association with the corresponding context label.