JP6011758B2

JP6011758B2 - Speech synthesis system, speech synthesis method, and program

Info

Publication number: JP6011758B2
Application number: JP2011196779A
Authority: JP
Inventors: 芳則志賀
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2011-09-09
Filing date: 2011-09-09
Publication date: 2016-10-19
Anticipated expiration: 2031-09-09
Also published as: JP2013057843A

Description

The present invention relates to a speech treatment device and related apparatus for processing speech.

In recent years, mainstream speech synthesis models the speech features related to voice quality, speaking style, and the like, and statistically trains a speech model (hereinafter also simply "model" or "acoustic model") using a database called a speech corpus. Speech is then synthesized using the trained model. Hidden Markov models (hereinafter "HMM") are widely used as the model (see Non-Patent Documents 1, 2, and 3).

FIG. 14 shows the prior art. As shown in FIG. 14, the speech data that make up a speech corpus in the prior art are usually high-quality speech sampled at a high frequency (for example, 48 kHz) at recording time and then reduced to a sampling frequency suited to the purpose (for example, 16 kHz). When the sampling frequency is lowered (also called "downsampling"), the speech is first passed through a low-pass filter called an anti-aliasing filter (hereinafter "AAF") so that aliasing distortion does not occur, and the filter output is then resampled. Prior-art speech synthesis uses parameters (for example, the cepstrum) as the spectral representation of speech and computes those parameters from the speech after downsampling. A model (HMM) is trained with those parameters as one of the speech features, and speech is synthesized using the trained model.
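The conventional downsampling path described above can be sketched as follows. This is a minimal illustration, not the procedure used by any particular tool: the Hamming-windowed sinc filter, the tap count of 101, and the function names `windowed_sinc_lowpass` and `downsample_with_aaf` are assumptions made for this example.

```python
import math


def windowed_sinc_lowpass(num_taps, cutoff):
    """Hamming-windowed sinc FIR low-pass filter.

    `cutoff` is in cycles per sample (0 to 0.5). The taps are scaled so
    that the gain at 0 Hz is exactly 1.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]


def downsample_with_aaf(signal, factor, num_taps=101):
    """Low-pass at the new Nyquist frequency, then keep every `factor`-th sample.

    Only fully overlapped filter outputs are produced, so the result is
    shorter than len(signal) / factor. Filtering every sample before
    decimating is wasteful but keeps the sketch easy to read.
    """
    cutoff = 0.5 / factor  # new Nyquist frequency, in cycles per input sample
    taps = windowed_sinc_lowpass(num_taps, cutoff)
    filtered = [
        sum(h * signal[i + j] for j, h in enumerate(taps))
        for i in range(len(signal) - num_taps + 1)
    ]
    return filtered[::factor]
```

With `factor=3`, a 48 kHz signal becomes a 16 kHz signal. The filter gain falls off around 8 kHz, which is the attenuation near the new Nyquist frequency that the description identifies as the source of the problem.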

Non-Patent Document 1: Keiichi Tokuda, "Special Issue: State of the Art of Speech Information Processing Technology (1) Speech Recognition and Speech Synthesis by HMM", Journal of the Information Processing Society of Japan (Joho Shori), vol. 45, no. 10, pp. 1005-1011, Oct. 2004.

Non-Patent Document 2: Keiichi Tokuda, "Basics of Speech Synthesis with HMM", IEICE Technical Report, vol. 100, no. 392, SP2000-74, pp. 43-50, Oct. 2000.

Non-Patent Document 3: Keiichi Tokuda, "Application of Hidden Markov Models to Speech Synthesis", IEICE Technical Report, vol. 99, no. 255, SP99-61, pp. 47-54, Aug. 1999.

However, because of this low-pass filter, the energy of the downsampled speech near its highest frequency (the Nyquist frequency) is strongly attenuated, which gives the speech spectrum a steep, cliff-like shape. HMM-based speech synthesis as described above trains the model (HMM) on parameters (for example, the cepstrum) computed from spectra with this cliff-like shape. As a result, the synthesized output speech lacks energy in the high-frequency region, and the statistical processing during training is affected by this peculiar spectral shape, so the quality of the output speech is markedly degraded.

In view of the above problem, the present invention provides the following solutions.

The speech treatment device of the first invention comprises: a speech storage unit capable of storing speech; a feature amount storage unit capable of storing one or more feature amounts; a spectrum acquisition unit that acquires a spectrum or spectral envelope of the speech stored in the speech storage unit; a truncation processing unit that truncates, from the spectrum or spectral envelope acquired by the spectrum acquisition unit, the portion at frequencies equal to or higher than a predetermined threshold; a feature amount acquisition unit that acquires one or more feature amounts from the truncated spectrum or spectral envelope; and a feature amount accumulation unit that accumulates the one or more feature amounts acquired by the feature amount acquisition unit in the feature amount storage unit.

With this configuration, feature amounts that yield high-quality output speech in speech synthesis can be acquired.

The model creation device of the second invention comprises: a model storage unit capable of storing a speech model; a feature amount storage unit storing the one or more feature amounts accumulated by the speech treatment device of the first invention; and a model learning unit that constructs a speech model from the one or more feature amounts and accumulates it in the model storage unit.

With this configuration, an acoustic model that yields high-quality output speech in speech synthesis can be trained.

The speech synthesizer of the third invention comprises: a model storage unit capable of storing the speech model acquired by the model creation device of the second invention; a reception unit that receives synthesis content information, which is information indicating the content to be synthesized as speech; a speech generation unit that generates speech for the synthesis content information using the speech model stored in the model storage unit; and an output unit that outputs the speech generated by the speech generation unit.

With this configuration, high-quality output speech is obtained in speech synthesis.

The speech synthesizer of the fourth invention comprises: a feature amount storage unit storing the one or more feature amounts accumulated by the speech treatment device of the first invention; a reception unit that receives synthesis content information; a speech generation unit that generates speech for the synthesis content information using the one or more feature amounts in the feature amount storage unit; and an output unit that outputs the speech generated by the speech generation unit.

According to the speech treatment device of the present invention, the adverse effect of the AAF's attenuation characteristic on synthesized speech can be avoided, so feature amounts that yield high-quality output speech in speech synthesis can be acquired.

FIG. 1: Block diagram of the speech treatment device 1 in Embodiment 1
FIG. 2: Diagram showing a spectral envelope in Embodiment 1
FIG. 3: Diagram showing the spectral envelope after truncation
FIG. 4: Flowchart explaining the operation of the speech treatment device 1
FIG. 5: Diagram showing a typical log power spectrum of a voiced segment
FIG. 6: Diagram showing synthesized speech spectra
FIG. 7: Diagram showing the details of spectral feature extraction, HMM training, and speech synthesis
FIG. 8: Diagram showing the results of the listening test (MOS)
FIG. 9: Block diagram of the acoustic model creation device 2
FIG. 10: Block diagram of the speech synthesizer 3
FIG. 11: Block diagram of the speech synthesizer 4 in Embodiment 2
FIG. 12: External view of the computer system in the above embodiments
FIG. 13: Block diagram of the computer system
FIG. 14: Diagram explaining the prior art

Embodiments of the speech treatment device and related apparatus are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description may be omitted.

(Embodiment 1)

This embodiment describes a speech treatment device that, to avoid steep attenuation in the high-frequency region, obtains a spectrum equivalent to that of downsampled speech without using an anti-aliasing filter, converts this spectrum into parameters such as the cepstrum, and uses them for training an HMM or the like.

FIG. 1 is a block diagram of the speech treatment device 1 in this embodiment. The speech treatment device 1 includes a speech storage unit 11, a feature amount storage unit 12, a spectrum acquisition unit 13, a truncation processing unit 14, a feature amount acquisition unit 15, and a feature amount accumulation unit 16.

The speech storage unit 11 can store speech. The speech storage unit 11 is preferably a non-volatile recording medium, but can also be implemented with a volatile recording medium. How speech comes to be stored in the speech storage unit 11 does not matter. For example, speech may be stored in the speech storage unit 11 via a recording medium, speech transmitted over a communication line or the like may be stored in the speech storage unit 11, or speech input through an input device may be stored in the speech storage unit 11.

The feature amount storage unit 12 can store one or more feature amounts. This embodiment uses the mel-cepstrum as the one or more feature amounts, but there is no particular limitation: the cepstrum, LSP (Line Spectral Pairs), PARCOR coefficients (Partial Auto-Correlation Coefficients), or anything else may be used. The feature amount storage unit 12 may also store, together with the one or more feature amounts, the fundamental frequency (F₀) of the speech and the like as acoustic model training data.

The feature amount storage unit 12 is preferably a non-volatile recording medium, but can also be implemented with a volatile recording medium.

In this embodiment, speech with a sampling frequency of 16 kHz is to be synthesized, as an example. The spectrum acquisition unit 13 extracts a spectrum or spectral envelope from speech stored in the speech storage unit 11 whose sampling frequency is higher than the desired frequency (48 kHz in this embodiment, as an example). The spectrum acquisition unit 13 obtains, for example, the spectral envelope of FIG. 2. In FIG. 2, f₁ denotes the Nyquist frequency; in this embodiment, f₁ = 24 kHz. Techniques for extracting a spectrum or spectral envelope from speech are known, so a detailed description is omitted. The spectrum acquisition unit 13 can be implemented with, for example, STRAIGHT analysis (see H. Kawahara, in Proc. ICASSP-97, vol. 2, pp. 1303-1306, 1997).

The truncation processing unit 14 truncates, from the spectrum or spectral envelope acquired by the spectrum acquisition unit 13, the portion at frequencies equal to or higher than a predetermined threshold. The threshold is normally the Nyquist frequency of the desired speech (that is, of the speech as it would be after downsampling), which is 8 kHz in this example. The truncation can also be described as deleting the data of the spectral portion at frequencies equal to or higher than the predetermined threshold. Here, "equal to or higher than the threshold" is taken to include "higher than the threshold". For example, the truncation processing unit 14 deletes from the spectral envelope of FIG. 2 the data points in the spectral region above the desired Nyquist frequency f₂ (f₂ = 8 kHz in this embodiment) and obtains the spectral envelope of FIG. 3, which corresponds to a sampling frequency of 16 kHz.
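The truncation step can be sketched as a simple slice of the spectral data. The sketch below assumes that the envelope is a one-sided sequence sampled on a uniform frequency grid from 0 Hz up to and including the source Nyquist frequency, and that the bin lying exactly on the threshold is kept (the text allows either reading). The function name `truncate_spectrum` and the placeholder envelope values are hypothetical.

```python
def truncate_spectrum(envelope, source_rate_hz, target_rate_hz):
    """Drop every bin above the target Nyquist frequency.

    `envelope` holds one-sided spectral values on a uniform grid from
    0 Hz to source_rate_hz / 2 inclusive.
    """
    if len(envelope) < 2:
        raise ValueError("envelope needs at least two bins")
    if target_rate_hz > source_rate_hz:
        raise ValueError("target rate must not exceed source rate")
    source_nyquist = source_rate_hz / 2          # f1 (24 kHz for 48 kHz speech)
    target_nyquist = target_rate_hz / 2          # f2 (8 kHz for 16 kHz speech)
    bin_width = source_nyquist / (len(envelope) - 1)
    # Small epsilon guards against rounding when f2 falls exactly on a bin.
    last_kept = int(target_nyquist / bin_width + 1e-9)
    return envelope[: last_kept + 1]


# Placeholder envelope: 769 bins, i.e. a 1536-point analysis at 48 kHz.
envelope_48k = [float(k) for k in range(769)]
envelope_16k = truncate_spectrum(envelope_48k, 48000, 16000)
```

In this example the bin spacing is 31.25 Hz, so the 8 kHz threshold falls on bin 256 and the result has 257 bins, the same layout a 512-point analysis of 16 kHz speech would produce. No filtering is applied to the retained bins, so their values near 8 kHz are left as they were in the 48 kHz analysis.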

The feature amount acquisition unit 15 acquires one or more feature amounts from the truncated spectrum or spectral envelope. Acquisition of the feature amount (the mel-cepstrum in this embodiment) can be implemented with, for example, the mgcep command of the Speech Signal Processing Toolkit (SPTK) (see http://sp-tk.sourceforge.net/).
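To show how a truncated one-sided spectrum can be turned into cepstral features, the sketch below computes a plain real cepstrum (the inverse DFT of the log power spectrum). This is a simplified stand-in and not the mel-cepstral analysis performed by SPTK's mgcep: there is no frequency warping, and the function name `cepstrum_from_power_spectrum` and the `floor` parameter are assumptions for this example.

```python
import math


def cepstrum_from_power_spectrum(power, order, floor=1e-12):
    """Real cepstrum coefficients c[0..order] of a one-sided power spectrum.

    `power` covers bins 0..K (0 Hz to the Nyquist frequency inclusive).
    The full spectrum of length N = 2K is implied by even symmetry, so the
    inverse DFT reduces to a cosine sum over the one-sided bins.
    """
    if len(power) < 2:
        raise ValueError("power spectrum needs at least two bins")
    half = len(power) - 1            # K: index of the Nyquist bin
    full = 2 * half                  # N: length of the implied full spectrum
    log_power = [math.log(max(p, floor)) for p in power]
    coeffs = []
    for n in range(order + 1):
        acc = log_power[0] + (-1) ** n * log_power[half]
        for k in range(1, half):
            acc += 2 * log_power[k] * math.cos(math.pi * k * n / half)
        coeffs.append(acc / full)
    return coeffs
```

Because the input is only the one-sided spectrum, the same function works whether the bins came from a direct 16 kHz analysis or from a 48 kHz analysis truncated at 8 kHz; the Nyquist bin is simply the last element passed in.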

The feature amount accumulation unit 16 accumulates the one or more feature amounts acquired by the feature amount acquisition unit 15 in the feature amount storage unit 12. The feature amount accumulation unit 16 may also accumulate in the feature amount storage unit 12, together with the one or more feature amounts acquired by the feature amount acquisition unit 15, the fundamental frequency (F₀) of the speech and the like as acoustic model training data.

The spectrum acquisition unit 13, truncation processing unit 14, feature amount acquisition unit 15, and feature amount accumulation unit 16 can normally be implemented with an MPU, memory, and the like. Their processing procedures are normally implemented in software recorded on a recording medium such as a ROM, but they may also be implemented in hardware (dedicated circuits).

Next, the operation of the speech treatment device 1 is described using the flowchart of FIG. 4.

(Step S401) The spectrum acquisition unit 13 acquires speech from the speech storage unit 11.

(Step S402) The spectrum acquisition unit 13 acquires the spectrum or spectral envelope of the speech acquired in step S401.

(Step S403) The truncation processing unit 14 truncates, from the spectrum or spectral envelope acquired in step S402, the high-frequency portion at or above the predetermined threshold.

(Step S404) The feature amount acquisition unit 15 acquires one or more feature amounts from the spectrum or spectral envelope truncated in step S403.

(Step S405) The feature amount accumulation unit 16 accumulates the one or more feature amounts acquired in step S404 in the feature amount storage unit 12, and the processing ends.
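Steps S401 to S405 can be sketched as one loop over the stored speech. In the sketch below the class name, field names, and the use of plain Python lists as storage are assumptions for illustration. Spectrum extraction and feature extraction are injected as callables because the text leaves them to known techniques (for example STRAIGHT analysis and mel-cepstral analysis); the lambdas in the usage example are dummies, not real analyses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SpeechTreatmentDevice:
    extract_spectrum: Callable[[List[float]], List[float]]    # spectrum acquisition unit 13
    extract_features: Callable[[List[float]], List[float]]    # feature amount acquisition unit 15
    source_rate_hz: int = 48000
    target_rate_hz: int = 16000
    speech_store: List[List[float]] = field(default_factory=list)    # speech storage unit 11
    feature_store: List[List[float]] = field(default_factory=list)   # feature amount storage unit 12

    def truncate(self, spectrum: List[float]) -> List[float]:
        """Truncation processing unit 14: drop bins above the target Nyquist frequency."""
        bin_width = (self.source_rate_hz / 2) / (len(spectrum) - 1)
        last_kept = int((self.target_rate_hz / 2) / bin_width + 1e-9)
        return spectrum[: last_kept + 1]

    def run(self) -> None:
        for speech in self.speech_store:                   # S401
            spectrum = self.extract_spectrum(speech)       # S402
            truncated = self.truncate(spectrum)            # S403
            features = self.extract_features(truncated)    # S404
            self.feature_store.append(features)            # S405 (accumulation unit 16)


# Dummy callables: a flat 25-bin spectrum (0 to 24 kHz in 1 kHz steps)
# and a "feature" that just reports how many bins survived truncation.
device = SpeechTreatmentDevice(
    extract_spectrum=lambda speech: [1.0] * 25,
    extract_features=lambda spectrum: [float(len(spectrum))],
    speech_store=[[0.0] * 48],
)
device.run()
```

After `device.run()`, `device.feature_store` holds one entry per stored utterance. With the dummy callables, the entry records that 9 bins (0 to 8 kHz) remained after truncation.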

The effectiveness of the speech treatment device 1 of this embodiment is described below through an experiment. In this experiment, an acoustic model was generated from the one or more feature amounts produced by the speech treatment device 1, and speech was synthesized using that acoustic model.

The speech used in this experiment is data with a sampling frequency of 16 kHz from a British English corpus (hereinafter "16 kHz speech") and speech with a sampling frequency of 48 kHz ("48 kHz speech"). The 16 kHz speech was obtained by applying downsampling, including an AAF, to the 48 kHz speech. The downsampling was performed with widely used software (Edinburgh Speech Tools Library: http://www.cstr.ed.ac.uk/projects/speech_tools/).

The spectral feature in this experiment is the 39th-order mel-cepstrum. The feature amount acquisition unit 15 can compute the mel-cepstrum from the spectral envelope obtained by STRAIGHT analysis (hereinafter "STRAIGHT spectrum") using the mgcep command of the Speech Signal Processing Toolkit (SPTK).

FIG. 5 shows a typical log power spectrum of a voiced segment. In FIG. 5, the horizontal axis is frequency and the vertical axis is log power. The thick broken line (CEPS-TD) in FIG. 5 is the power spectrum reconstructed from the mel-cepstrum of the 16 kHz speech, and the thin solid line (SPEC48k) is the STRAIGHT spectrum of the corresponding 48 kHz speech. As the figure shows, the spectrum of the 16 kHz speech (CEPS-TD) has little energy near the Nyquist frequency (8 kHz) because of the low-pass filter characteristic. In addition, its spectral undulation at 4 to 6 kHz is flattened compared with the spectrum of the 48 kHz speech. If the mel-cepstrum of such a spectrum, which lacks energy in the high-frequency region and has flattened undulations, is used as the feature for speech synthesis, the quality of the synthesized speech clearly degrades. This is one cause of the degraded speech quality of prior-art speech synthesis.

By contrast, the thick solid line (CEPS-ST) in FIG. 5 is the power spectrum reconstructed from the mel-cepstrum (equivalent to speech with a sampling frequency of 16 kHz) generated from the 48 kHz speech according to the present invention. Near the Nyquist frequency (8 kHz) and in the 4 to 6 kHz range, the power of this spectrum matches the STRAIGHT spectrum of the 48 kHz speech (SPEC48k). Using a mel-cepstrum that represents such a spectrum as the feature for speech synthesis makes high-quality speech synthesis possible.

Next, in this experiment, HMMs were trained using the spectral features obtained according to the embodiment described above, and speech was synthesized from the trained HMMs. The synthesized speech was then examined to confirm the effect of the present invention.

In this experiment, models were first trained separately using the following two different mel-cepstra.
(1) Mel-cepstrum computed from the 16 kHz speech (prior art)
(2) Mel-cepstrum obtained from the 48 kHz speech by the speech treatment device of the present invention

When the HMMs were trained with (1) and (2) above as features, all conditions other than feature creation were identical. FIG. 6 shows the speech spectra synthesized with these models. In FIG. 6, the horizontal axis is frequency and the vertical axis is log power. The speech spectrum synthesized with the model trained on the features of (1) (the prior-art speech spectrum) is CEPS-TD in FIG. 6, and the speech spectrum synthesized with the model trained on the features of (2) (the speech spectrum according to the present invention) is CEPS-ST in FIG. 6.

As FIG. 6 shows, compared with the prior art, the synthesized speech of HMM speech synthesis to which the present invention is applied has substantially improved spectral energy in the high-frequency region (7 to 8 kHz), and its formants and anti-formants are less flattened across the entire frequency band. When the spectrum is flattened, as in prior-art synthesized speech, speech quality degrades and the speech is perceived as muffled. Using the present invention therefore mitigates or avoids such degradation.

Next, to examine how speech with the power spectra described above is perceived by the human ear, a listening test was conducted on the naturalness of the synthesized speech.

The raters in the listening test were five speech researchers, and each rater evaluated 10 sentences synthesized by the two systems. The rating scale had five levels from 1 ('completely unnatural') to 5 ('completely natural'), and the test was conducted in a quiet room using headphones.

FIG. 7 shows the details of the spectral feature extraction, HMM training, and speech synthesis used in this test.

The processing procedures of the two systems are as follows. The procedure of System 1 is the prior-art procedure. In System 1, the spectral envelope of speech sampled at 16 kHz (to which downsampling had been applied beforehand) was obtained by STRAIGHT analysis, training was performed with the mel-cepstrum computed from that spectral envelope as the spectral feature, and an HMM acoustic model was built. Speech was then synthesized using that HMM acoustic model.

In System 2, the spectral envelope of speech sampled at 48 kHz was obtained by STRAIGHT analysis, and the spectrum truncation of the speech treatment device 1 according to the present invention was applied to that spectral envelope. Training was then performed with the mel-cepstrum computed from the truncated spectral envelope as the spectral feature, and an HMM acoustic model was built. Speech was then synthesized using that HMM acoustic model.

FIG. 8 shows the mean opinion scores (MOS) from the listening test. System 1 (prior art), which used the 16 kHz speech, scored 2.5, and System 2 (present invention) scored 2.9.
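A mean opinion score is the arithmetic mean of the individual ratings on the 1 to 5 scale. The sketch below shows that calculation and the difference between the two reported scores. The function name and the sample ratings are placeholders invented for illustration; they are not the ratings collected in the experiment, and only the two reported means (2.5 and 2.9) come from the text.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a collection of ratings given on the 1 to 5 naturalness scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return mean(ratings)


# Placeholder ratings, NOT data from the experiment.
example_mos = mean_opinion_score([2, 3, 3, 2])

# Difference between the two scores reported for FIG. 8.
reported_system_1 = 2.5
reported_system_2 = 2.9
improvement = round(reported_system_2 - reported_system_1, 1)
```

Rounding to one decimal place avoids the floating-point residue in the subtraction, so that `improvement` is stored as 0.4.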

These results show the following. The adverse effect of the filter characteristic of the AAF used in downsampling can be avoided by using the speech treatment device 1 of the present invention, and a marked perceptual improvement in the naturalness of the synthesized speech, equivalent to 0.4 MOS, was observed.

As the above experimental results show, this embodiment makes it possible to acquire feature amounts that yield high-quality output speech in speech synthesis.

The one or more feature amounts generated by the speech treatment device 1 of this embodiment can be used not only for speech synthesis but also for other speech technologies that handle the same kind of feature amounts (for example, speech recognition and speaker recognition), and can contribute to improving the performance of those technologies.

The information supplied for speech synthesis is not limited to text; it may be a symbol string describing pronunciation and the like, a markup language such as Speech Synthesis Markup Language (SSML), or binary data representing any of these. In other words, the information supplied for speech synthesis may be anything that indicates the content to be synthesized as speech, and such information is referred to as synthesis content information.

A model creation device 2 that trains a speech model from the one or more feature amounts generated by the speech treatment device 1 can also be configured. An example block diagram of the model creation device 2 is shown in FIG. 9. The model creation device 2 includes a model storage unit 21, a feature amount storage unit 12, and a model learning unit 22.

The model storage unit 21 can store a speech model. As noted in the description of the prior art, a speech model models the speech features related to voice quality, speaking style, and the like; for example, it models the time-series pattern of the feature amounts for each phoneme (or for each phoneme with its preceding and following phonemic context taken into account). The speech model is preferably data based on, for example, a hidden Markov model (HMM) for each phoneme, but may be data based on another model.

The model learning unit 22 constructs a speech model from the one or more feature amounts and accumulates it in the model storage unit 21. One technique for constructing a speech model from one or more feature amounts is, for example, the HMM training shown in FIG. 14. That is, HMM training is performed on the one or more feature amounts (for example, the mel-cepstrum) to obtain an HMM acoustic model. Because the processing of the model learning unit 22 is a known technique, a detailed description is omitted.

A speech synthesizer 3 that uses the speech model generated by the model creation device 2 can also be configured. An example block diagram of the speech synthesizer 3 is shown in FIG. 10. The speech synthesizer 3 includes a model storage unit 21, a reception unit 31, a speech generation unit 32, and an output unit 33.

The reception unit 31 receives synthesis content information. As described above, synthesis content information is information indicating the content to be synthesized as speech; it is not limited to text and may be a symbol string describing pronunciation and the like, a markup language such as SSML, or binary data representing any of these. Here, receiving is a concept that includes accepting information input from an input device such as a keyboard or mouse, receiving information transmitted over a wired or wireless communication line, and accepting information read from a recording medium such as an optical disk, magnetic disk, or semiconductor memory. Any input means may be used for the synthesis content information, such as a numeric keypad, keyboard, mouse, or menu screen. The reception unit 31 can be implemented with a device driver for input means such as a numeric keypad or keyboard, control software for a menu screen, and the like.

The speech generation unit 32 generates speech (synthesized speech) for the synthesis content information accepted by the reception unit 31, using the speech model in the model storage unit 21. The speech generation unit 32 obtains synthesized speech through, for example, the speech feature generation and speech signal generation shown in FIG. 14. That is, the speech generation unit 32 performs speech feature generation on the speech model to generate speech feature amounts (here, mel-cepstra). The speech generation unit 32 then performs speech signal generation using the speech feature amounts to obtain synthesized speech. Since the processing of the speech generation unit 32 is a known technique, a detailed description is omitted. The speech generation unit 32 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. It may, however, be implemented in hardware (a dedicated circuit).
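
As a rough illustration of the first of the two stages named above (speech feature generation), the following sketch expands a sequence of model states into a frame-by-frame feature time series. It assumes, purely for illustration, that each state is reduced to a mean feature vector and a duration in frames; the name `generate_features` and this data layout are hypothetical and do not come from the specification. Practical HMM synthesizers also use dynamic features and a smoothing step, and the second stage (speech signal generation) is not shown.

```python
def generate_features(states):
    """Expand (mean_vector, duration_in_frames) pairs into a frame sequence.

    Simplifying assumption: each state emits its mean vector unchanged
    for the whole duration, with no smoothing between states.
    """
    frames = []
    for mean, duration in states:
        # Copy the mean so later edits to one frame do not affect others.
        frames.extend([list(mean) for _ in range(duration)])
    return frames


# Two hypothetical states with 2-dimensional features.
states = [([1.0, 2.0], 2), ([3.0, 4.0], 1)]
trajectory = generate_features(states)
```

The resulting `trajectory` has one feature vector per frame and would be passed on to a signal generation stage.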

The output unit 33 outputs the speech generated by the speech generation unit 32. Here, output is a concept that includes audio output through a speaker or the like and writing to an audio device, as well as writing to a file on an HDD or other recording medium and passing speech data to another application. The output unit 33 can be realized by, for example, a speaker.
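
One of the output forms mentioned above is writing the speech to a file. The sketch below shows how that might look with Python's standard `wave` module, assuming the synthesized speech is available as floating-point samples in the range -1.0 to 1.0. The function name `write_wav`, the 16 kHz rate, and the test tone are illustrative assumptions, not part of the specification.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM."""
    pcm = b"".join(
        # Clip to the valid range, then scale to signed 16-bit integers.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)


# Stand-in for synthesized speech: 10 ms of a 440 Hz tone at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
path = os.path.join(tempfile.gettempdir(), "synth_demo.wav")
write_wav(path, tone, 16000)

# Read the file back to confirm how many frames were written.
with wave.open(path, "rb") as check:
    frames_written = check.getnframes()
os.remove(path)
```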

(Embodiment 2)

This embodiment describes a speech synthesizer 4 that uses the one or more feature amounts generated by the speech treatment device 1 described in Embodiment 1.

FIG. 11 shows an example block diagram of the speech synthesizer 4. Except for the feature amount storage unit 12, the speech synthesizer 4 may use known techniques.

The speech synthesizer 4 includes a feature amount storage unit 12, a reception unit 31, a speech generation unit 42, and an output unit 33.

The speech generation unit 42 generates speech for the synthesis content information accepted by the reception unit 31, using the one or more feature amounts in the feature amount storage unit 12. The speech generation unit 42 generates speech directly from the one or more feature amounts. The speech generation unit 42 can be realized in various ways; in this embodiment it is realized by a speech generation method of the speech unit concatenation type. That is, the feature amounts are held in the feature amount storage unit 12 as speech units in a predetermined synthesis unit (for example, diphones), and the speech generation unit 42 retrieves speech units from the feature amount storage unit 12 according to the character information and concatenates them in sequence to generate the feature amount time series of the desired speech. The speech generation unit 42 then converts the feature amount time series into speech. Since this method of the speech generation unit 42 is also a known technique, a detailed description is omitted. The speech generation unit 42 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. It may, however, be implemented in hardware (a dedicated circuit).
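
The unit concatenation step described above can be sketched as follows. This is a minimal illustration under several assumptions: speech units are diphones keyed by names such as `a-i`, each unit is stored as a list of feature vectors, and the utterance is padded with a silence symbol `sil` at both ends. The names `to_diphones`, `concatenate_units`, and `unit_store` are hypothetical, and the final conversion of the feature time series into a waveform is not shown.

```python
def to_diphones(phonemes):
    """Pair consecutive phonemes, padding with silence ('sil') at both ends."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def concatenate_units(phonemes, unit_store):
    """Look up each diphone and join its feature frames into one time series."""
    series = []
    for name in to_diphones(phonemes):
        # A missing unit raises KeyError; a real system would need a fallback.
        series.extend(unit_store[name])
    return series


# Stand-in for the feature amount storage unit: diphone name -> feature frames.
unit_store = {
    "sil-a": [[0.1, 0.0]],
    "a-i": [[0.2, 0.1], [0.3, 0.1]],
    "i-sil": [[0.0, 0.0]],
}
feature_series = concatenate_units(["a", "i"], unit_store)
```

Here `feature_series` holds four frames, taken in order from the units `sil-a`, `a-i`, and `i-sil`.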

Needless to say, the speech treatment device 1 and the model creation device 2 may be realized as a single device. In that case, the device (speech treatment device) includes a speech storage unit 11, a feature amount storage unit 12, a spectrum acquisition unit 13, a truncation processing unit 14, a feature amount acquisition unit 15, a feature amount accumulation unit 16, an acoustic model storage unit 21, and a model learning unit 22.

Likewise, the speech treatment device 1, the model creation device 2, and the speech synthesizer 3 may, for example, be realized as a single device. In that case, the device (speech treatment device) includes a speech storage unit 11, a feature amount storage unit 12, a spectrum acquisition unit 13, a truncation processing unit 14, a feature amount acquisition unit 15, a feature amount accumulation unit 16, an acoustic model storage unit 21, a model learning unit 22, a reception unit 31, a speech generation unit 32, and an output unit 33.

Likewise, the speech treatment device 1 and the speech synthesizer 4 may, for example, be realized as a single device. In that case, the device (speech treatment device) includes a speech storage unit 11, a feature amount storage unit 12, a spectrum acquisition unit 13, a truncation processing unit 14, a feature amount acquisition unit 15, a feature amount accumulation unit 16, a reception unit 31, a speech generation unit 42, and an output unit 33.

Furthermore, the processing in the above embodiments may be implemented in software. This software may be distributed by software download or the like, or it may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to the other embodiments in this specification. The software that realizes the information processing apparatus in this embodiment is the following program. That is, this program causes a computer to function as: a spectrum acquisition unit that acquires a spectrum or spectral envelope of speech; a truncation processing unit that performs, on the spectrum or spectral envelope acquired by the spectrum acquisition unit, a process of truncating the spectrum at frequencies equal to or higher than a predetermined threshold; a feature amount acquisition unit that acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process; and a feature amount accumulation unit that accumulates, in a storage medium, the one or more feature amounts acquired by the feature amount acquisition unit.
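
The four functions of the program described above (spectrum acquisition, truncation, feature amount acquisition, and accumulation) can be sketched end to end as follows. This is an illustrative sketch, not the implementation of the specification: it assumes a plain DFT magnitude spectrum of a single frame, uses an ordinary real cepstrum in place of the mel-cepstrum, and treats the retained bins as the half spectrum of a lower-rate signal. All function names, the 48 kHz sampling rate, and the 8 kHz threshold are assumptions chosen for the example. The `inclusive` flag covers the two threshold variants ("equal to or higher than" versus "higher than").

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Spectrum acquisition: DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def truncate_spectrum(spectrum, sample_rate, cutoff_hz, inclusive=True):
    """Truncation: delete bins at or above (or strictly above) cutoff_hz."""
    n_fft = 2 * (len(spectrum) - 1)
    kept = []
    for k, value in enumerate(spectrum):
        freq = k * sample_rate / n_fft
        drop = freq >= cutoff_hz if inclusive else freq > cutoff_hz
        if not drop:
            kept.append(value)
    return kept


def cepstrum(half_spectrum, order, floor=1e-10):
    """Feature acquisition: real cepstrum c[0..order] of a half spectrum.

    The half spectrum is mirrored implicitly, so this equals the inverse
    DFT of the log magnitudes of the corresponding symmetric full spectrum.
    """
    m = len(half_spectrum)
    if m < 2:
        raise ValueError("need at least two spectral bins")
    log_mag = [math.log(max(v, floor)) for v in half_spectrum]
    coeffs = []
    for n in range(order + 1):
        acc = log_mag[0] + (-1) ** n * log_mag[-1]
        acc += 2 * sum(log_mag[k] * math.cos(math.pi * k * n / (m - 1))
                       for k in range(1, m - 1))
        coeffs.append(acc / (2 * (m - 1)))
    return coeffs


def extract_features(frame, sample_rate, cutoff_hz, order):
    """Run acquisition, truncation, and feature extraction on one frame."""
    spectrum = magnitude_spectrum(frame)
    kept = truncate_spectrum(spectrum, sample_rate, cutoff_hz, inclusive=False)
    return cepstrum(kept, order)


# Accumulation: a list stands in for the storage medium.
feature_store = []
frame = [math.sin(2 * math.pi * 3 * t / 48) for t in range(48)]
feature_store.append(extract_features(frame, 48000, 8000, order=4))
```

In this sketch no low-pass filtering or time-domain resampling is applied to the waveform; the high-frequency bins are simply dropped from the spectrum before the features are computed.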

FIG. 12 shows the external appearance of a computer that executes the programs described in this specification to realize the speech treatment devices of the various embodiments described above. The above embodiments can be realized by computer hardware and a computer program executed on it. FIG. 12 is an overview of this computer system 300, and FIG. 13 is a block diagram of the system 300.

In FIG. 12, the computer system 300 includes a computer 301 that includes an FD (Flexible Disk) drive and a CD-ROM (Compact Disc Read-Only Memory) drive, a keyboard 302, a mouse 303, a monitor 304, and a speaker 306.

In FIG. 13, in addition to the FD drive 3011 and the CD-ROM drive 3012, the computer 301 includes: an MPU 3013; a bus 3014 connected to the MPU 3013, the CD-ROM drive 3012, and the FD drive 3011; a ROM (Read-Only Memory) 3015 for storing programs such as a boot-up program; a RAM (Random Access Memory) 3016 that is connected to the MPU 3013, temporarily stores the instructions of application programs, and provides temporary storage space; and a hard disk 3017 for storing application programs, system programs, and data. Although not shown here, the computer 301 may further include a network card that provides a connection to a LAN.

The program that causes the computer system 300 to execute the functions of the speech treatment device of the above embodiments may be stored on a CD-ROM 3101 or an FD 3102, inserted into the CD-ROM drive 3012 or the FD drive 3011, and then transferred to the hard disk 3017. Alternatively, the program may be transmitted to the computer 301 over a network (not shown) and stored on the hard disk 3017. The program is loaded into the RAM 3016 when it is executed. The program may also be loaded directly from the CD-ROM 3101, the FD 3102, or the network.

The program need not necessarily include an operating system (OS), a third-party program, or the like that causes the computer 301 to execute the functions of the speech treatment device of the above embodiments. The program need only include those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 300 operates is well known, and a detailed description is omitted.

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

In each of the above embodiments, each process (each function) may be realized by centralized processing on a single device (system) or by distributed processing across a plurality of devices.

The present invention is not limited to the above embodiments. Various modifications are possible, and it goes without saying that they are also included within the scope of the present invention.

As described above, the speech treatment device according to the present invention has the effect of being able to acquire feature amounts that yield high-quality output speech in speech synthesis, and is useful as a speech synthesizer or the like.

Description of symbols
1 Speech treatment device
2 Model creation device
3, 4 Speech synthesizer
11 Speech storage unit
12 Feature amount storage unit
13 Spectrum acquisition unit
14 Truncation processing unit
15 Feature amount acquisition unit
16 Feature amount accumulation unit
21 Model storage unit
22 Model learning unit
31 Reception unit
32, 42 Speech generation unit
33 Output unit

Claims

A speech synthesis system comprising a model creation device, a speech treatment device, and a speech synthesizer, wherein
the speech synthesizer comprises:
a model storage unit capable of storing a speech model obtained by the model creation device;
a reception unit that accepts synthesis content information, which is information indicating content to be synthesized as speech;
a speech generation unit that generates speech for the synthesis content information using the speech model in the model storage unit; and
an output unit that outputs the speech generated by the speech generation unit,
the model creation device comprises:
a model storage unit capable of storing a speech model;
a feature amount storage unit storing one or more feature amounts accumulated by the speech treatment device; and
a model learning unit that constructs a speech model from the one or more feature amounts and stores the speech model in the model storage unit, and
the speech treatment device comprises:
a speech storage unit capable of storing speech;
a feature amount storage unit capable of storing one or more feature amounts;
a spectrum acquisition unit that acquires a spectrum or spectral envelope of the speech stored in the speech storage unit;
a truncation processing unit that performs, on the spectrum or spectral envelope acquired by the spectrum acquisition unit, a truncation process, which is a process of deleting the data of the spectral portion at frequencies equal to or higher than, or higher than, a predetermined threshold;
a feature amount acquisition unit that acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process; and
a feature amount accumulation unit that accumulates, in the feature amount storage unit, the one or more feature amounts acquired by the feature amount acquisition unit.

A speech synthesis system comprising a speech treatment device and a speech synthesizer, wherein
the speech synthesizer comprises:
a feature amount storage unit storing one or more feature amounts accumulated by the speech treatment device;
a reception unit that accepts synthesis content information, which is information indicating content to be synthesized as speech;
a speech generation unit that generates speech for the synthesis content information using the one or more feature amounts in the feature amount storage unit; and
an output unit that outputs the speech generated by the speech generation unit, and
the speech treatment device comprises:
a speech storage unit capable of storing speech;
a feature amount storage unit capable of storing one or more feature amounts;
a spectrum acquisition unit that acquires a spectrum or spectral envelope of the speech stored in the speech storage unit;
a truncation processing unit that performs, on the spectrum or spectral envelope acquired by the spectrum acquisition unit, a truncation process, which is a process of deleting the data of the spectral portion at frequencies equal to or higher than, or higher than, a predetermined threshold;
a feature amount acquisition unit that acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process; and
a feature amount accumulation unit that accumulates, in the feature amount storage unit, the one or more feature amounts acquired by the feature amount acquisition unit.

A speech synthesis method that can be realized by a spectrum acquisition unit, a truncation processing unit, a feature amount acquisition unit, a model learning unit, a reception unit, a speech generation unit, and an output unit, the method comprising:
a spectrum acquisition step in which the spectrum acquisition unit acquires a spectrum or spectral envelope of speech;
a truncation processing step in which the truncation processing unit performs, on the spectrum or spectral envelope acquired in the spectrum acquisition step, a truncation process, which is a process of deleting the data of the spectral portion at frequencies equal to or higher than, or higher than, a predetermined threshold;
a feature amount acquisition step in which the feature amount acquisition unit acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process;
a model learning step in which the model learning unit constructs a speech model from the one or more feature amounts;
a reception step in which the reception unit accepts synthesis content information, which is information indicating content to be synthesized as speech;
a speech generation step in which the speech generation unit generates speech for the synthesis content information using the speech model constructed in the model learning step; and
an output step in which the output unit outputs the speech generated in the speech generation step.

A speech synthesis method that can be realized by a spectrum acquisition unit, a truncation processing unit, a feature amount acquisition unit, a reception unit, a speech generation unit, and an output unit, the method comprising:
a spectrum acquisition step in which the spectrum acquisition unit acquires a spectrum or spectral envelope of speech stored on a recording medium;
a truncation processing step in which the truncation processing unit performs, on the spectrum or spectral envelope acquired in the spectrum acquisition step, a truncation process, which is a process of deleting the data of the spectral portion at frequencies equal to or higher than, or higher than, a predetermined threshold;
a feature amount acquisition step in which the feature amount acquisition unit acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process;
a reception step in which the reception unit accepts synthesis content information, which is information indicating content to be synthesized as speech;
a speech generation step in which the speech generation unit generates speech for the synthesis content information using the one or more feature amounts acquired by the feature amount acquisition unit; and
an output step in which the output unit outputs the speech generated by the speech generation unit.

A program for causing a computer to function as:
a spectrum acquisition unit that acquires a spectrum or spectral envelope of speech;
a truncation processing unit that performs, on the spectrum or spectral envelope acquired by the spectrum acquisition unit, a truncation process, which is a process of deleting the data of the spectral portion at frequencies equal to or higher than, or higher than, a predetermined threshold;
a feature amount acquisition unit that acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process;
a model learning unit that constructs a speech model from the one or more feature amounts;
a reception unit that accepts synthesis content information, which is information indicating content to be synthesized as speech;
a speech generation unit that generates speech for the synthesis content information using the speech model; and
an output unit that outputs the speech generated by the speech generation unit.

A program for causing a computer to function as:
a spectrum acquisition unit that acquires a spectrum or spectral envelope of speech stored on a recording medium;
a truncation processing unit that performs, on the spectrum or spectral envelope acquired by the spectrum acquisition unit, a truncation process, which is a process of deleting the data of the spectral portion at frequencies equal to or higher than, or higher than, a predetermined threshold;
a feature amount acquisition unit that acquires one or more feature amounts from the spectrum or spectral envelope subjected to the truncation process;
a reception unit that accepts synthesis content information, which is information indicating content to be synthesized as speech;
a speech generation unit that generates speech for the synthesis content information using the one or more feature amounts acquired by the feature amount acquisition unit; and
an output unit that outputs the speech generated by the speech generation unit.