JP6011039B2

JP6011039B2 - Speech synthesis apparatus and speech synthesis method

Info

Publication number: JP6011039B2
Application number: JP2012129798A
Authority: JP
Inventors: Jordi Bonada; Merlijn Blaauw; Makoto Tachibana
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-06-07
Filing date: 2012-06-07
Publication date: 2016-10-19
Anticipated expiration: 2032-06-07
Also published as: JP2013015829A

Description

The present invention relates to a technique for synthesizing speech, such as spoken or singing voices, by concatenating a plurality of speech segments.

Concatenative speech synthesis techniques, which synthesize a desired voice by connecting a plurality of speech segments to one another, have been proposed. For example, in the technique of Patent Document 1, the amplitude spectrum and the phase spectrum of each frame of a speech segment are stored in a storage device; the amplitude spectrum and the phase spectrum are each processed individually, converted into time-domain signals, and connected to one another to generate a speech signal.

Patent Document 1: Japanese Patent No. 4349316

However, the technique of Patent Document 1 requires a large-capacity storage device capable of storing both the amplitude spectrum and the phase spectrum for every frame of every speech segment. Moreover, if the amplitude spectrum and the phase spectrum are misaligned in time when they are converted into a time-domain signal, the listener perceives a phase mismatch in the synthesized sound, so special processing is needed to keep the amplitude spectrum and the phase spectrum of each frame aligned in time. In view of these circumstances, an object of the present invention is to easily maintain the temporal correspondence between the amplitude spectrum and the phase spectrum while reducing the storage capacity needed to hold speech segments.

The means adopted by the present invention to solve the above problems are described below. To make the invention easier to understand, the following description notes in parentheses the correspondence between elements of the invention and elements of the embodiments described later; this is not intended to limit the scope of the invention to the illustrated embodiments.

The speech synthesizer of the present invention comprises phase calculation means (for example, a phase calculation unit 32) that calculates, for each frame, a phase spectrum corresponding to the amplitude spectrum that speech segment data indicates for each frame of a speech segment, and speech synthesis means (for example, a speech synthesis unit 34) that generates a speech signal using the amplitude spectrum of each frame indicated by the speech segment data and the phase spectrum of each frame calculated by the phase calculation means. In this configuration, the phase spectrum is calculated from the amplitude spectrum indicated by the speech segment data, so the storage capacity required of the segment storage means is reduced compared with a configuration that holds both the amplitude spectrum and the phase spectrum. In addition, because the phase spectrum is calculated from the amplitude spectrum, the temporal correspondence between the amplitude spectrum and the phase spectrum of each frame is easily maintained, which in turn suppresses the phase mismatch in the synthesized sound that a time difference between the two would cause.

In a preferred aspect of the present invention, the phase calculation means calculates, as the phase spectrum, the minimum phase or the maximum phase corresponding to the amplitude spectrum indicated by the speech segment data. In another aspect of the present invention (for example, the fourth embodiment described later), the phase calculation means calculates the phase spectrum by smoothing, along the frequency axis, the differences in amplitude value between adjacent frequencies in the amplitude spectrum indicated by the speech segment data.

In a configuration that includes segment adjustment means (for example, a segment adjustment unit 26) for adjusting the amplitude spectrum of each frame indicated by the speech segment data, the segment adjustment means may adjust the amplitude spectrum after the phase calculation means has calculated the phase spectrum (aspect A). From the viewpoint of reducing the time difference between the amplitude spectrum and the phase spectrum, however, a configuration in which the phase calculation means calculates the phase spectrum from the amplitude spectrum already adjusted by the segment adjustment means is particularly suitable.

A speech synthesizer according to a preferred aspect of the present invention comprises first phase correction means (for example, a first phase correction unit 41) that randomly varies each phase value within a predetermined band of the phase spectrum of each frame calculated by the phase calculation means. In this aspect, each phase value within the predetermined band (for example, a high-frequency band at or above 4 kHz) of the phase spectrum calculated from the amplitude spectrum varies randomly; that is, fluctuation is imparted to the sequence of phase values. Compared with a configuration in which the phase spectrum calculated by the phase calculation means is applied as-is to the synthesis of the speech signal by the speech synthesis means, this has the advantage of producing a synthesized sound that is perceived as natural. A specific example of this aspect is described later as the second embodiment.

A speech synthesizer according to a preferred aspect of the present invention comprises second phase correction means (for example, a second phase correction unit 42) that identifies the temporal change in the degree of voicing (the degree to which the sound is voiced or unvoiced) within a speech segment and randomly varies each phase value of the phase spectrum of each frame calculated by the phase calculation means within a variation range (for example, a variation range α2) that depends on the degree of voicing of that frame. For example, a configuration that widens the variation range as the degree of voicing decreases (that is, as the sound becomes more unvoiced) is suitable. In this aspect, each phase value of the phase spectrum calculated from the amplitude spectrum varies randomly, so a synthesized sound that is perceived as natural can be generated, compared with a configuration in which the phase spectrum calculated by the phase calculation means is applied as-is to the synthesis of the speech signal by the speech synthesis means. Moreover, because the variation range of the phase values under the correction by the second phase correction means is controlled according to the degree of voicing of each frame, this effect is especially pronounced. A specific example of this aspect is described later as the third embodiment.

A speech synthesizer according to a preferred aspect of the present invention comprises third phase correction means that corrects the phase spectrum calculated for each frame by the phase calculation means by adding, to each phase value of the phase spectrum calculated for a given frame, the prediction error of the phase value predicted from the frame immediately preceding that frame. This configuration has the advantage of producing a synthesized sound that is perceived as natural. A specific example of this aspect is described later as the fifth embodiment.

The speech synthesizer according to each of the above aspects may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to speech synthesis, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention (for example, a program PGM) causes a computer to execute a phase calculation process that calculates, for each frame, a phase spectrum corresponding to the amplitude spectrum that speech segment data indicates for each frame of a speech segment, and a speech synthesis process that generates a speech signal using the amplitude spectrum of each frame indicated by the speech segment data and the phase spectrum of each frame obtained by the phase calculation process. This program provides the same operation and effects as the speech synthesizer of the present invention. The program may be provided to a user stored on a computer-readable recording medium and installed on a computer, or provided from a server device by distribution over a communication network and installed on a computer.

A block diagram of the speech synthesizer of the first embodiment of the present invention.
A schematic diagram of the segment group stored in the storage device.
A schematic diagram of an amplitude spectrum and a phase spectrum.
A block diagram of the speech synthesizer of the second embodiment.
An explanatory diagram of the operation of the first phase correction unit.
An explanatory diagram of the variation range of the phase values.
A block diagram of the speech synthesizer of the third embodiment.
A graph showing the temporal change in the degree of voicing.
An explanatory diagram of the operation of the phase calculation unit in the fourth embodiment.
A block diagram of the speech synthesizer of the fifth embodiment.

<A: First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 is a signal processing device that generates speech, such as spoken or singing voices, by concatenative speech synthesis. As shown in FIG. 1, it is realized by a computer system comprising an arithmetic processing device (CPU) 12, a storage device 14, and a sound emitting device 16.

By executing a program stored in the storage device 14, the arithmetic processing device 12 realizes a plurality of functions (a segment selection unit 22, an amplitude calculation unit 24, a segment adjustment unit 26, a phase calculation unit 32, and a speech synthesis unit 34) for generating a speech signal VOUT that represents the waveform of the synthesized sound. The functions of the arithmetic processing device 12 may instead be distributed over a plurality of integrated circuits, or a dedicated electronic circuit (DSP) may execute some of them. The sound emitting device 16 (for example, headphones or a loudspeaker) emits sound waves corresponding to the speech signal VOUT generated by the arithmetic processing device 12.

The storage device 14 stores the program PGM executed by the arithmetic processing device 12 and various data used by it (a segment group GA and synthesis information GB). Any known recording medium, such as a semiconductor or magnetic recording medium, or a combination of several types of recording media, may be used as the storage device 14. The storage device 14 may also be installed in an external device (for example, a server device) separate from the speech synthesizer 100, with the speech synthesizer 100 acquiring information from the storage device 14 over a communication network (for example, the Internet). In other words, the storage device 14 is not an essential element of the speech synthesizer 100.

As shown in FIG. 2, the segment group GA stored in the storage device 14 is a set (a speech synthesis library) of a plurality of speech segment data D corresponding to different speech segments. A speech segment is either a single phoneme, which is the smallest linguistic unit of speech, or a phoneme chain in which a plurality of phonemes are connected (for example, a diphone composed of two phonemes).

As shown in FIG. 2, each speech segment data D comprises a time series of unit data U (UA, UB), one for each frame obtained by dividing the speech segment on the time axis. Each unit data U is information that defines the frequency-domain spectrum of one frame of the speech segment; unit data UA corresponds to a frame of a voiced phoneme in the segment, and unit data UB corresponds to a frame of an unvoiced phoneme. The unvoiced unit data UB defines the spectrum (complex spectrum) of the speech; specifically, it specifies the amplitude spectrum SM and the phase spectrum SP of each frame.

The voiced unit data UA in the first embodiment includes amplitude characteristic data R. The amplitude characteristic data R is a set of variables that describe the shape of the amplitude spectrum SM (envelope) of the voiced sound. Specifically, the amplitude characteristic data R consists of EpR (Excitation plus Resonance) parameters comprising an excitation waveform envelope r1, a chest resonance r2, a vocal tract resonance r3, and a difference spectrum r4, and is generated by known SMS (Spectral Modeling Synthesis) analysis. EpR parameters and SMS analysis are also disclosed in, for example, Japanese Patent No. 3711880 and Japanese Unexamined Patent Application Publication No. 2007-226174.

The excitation waveform envelope (Excitation Curve) r1 is a variable that approximates the spectral envelope of the vocal fold vibration. The chest resonance (Chest Resonance) r2 specifies the bandwidth, center frequency, and amplitude value of each of a predetermined number of resonances (band-pass filters) that approximate the chest resonance characteristics. The vocal tract resonance (Vocal Tract Resonance) r3 specifies the bandwidth, center frequency, and amplitude value of each of a plurality of resonances that approximate the vocal tract resonance characteristics. The difference spectrum r4 is the difference (error) between the spectrum approximated by the excitation waveform envelope r1, the chest resonance r2, and the vocal tract resonance r3 and the amplitude spectrum of the speech.

As shown in FIG. 1, the storage device 14 stores synthesis information (score data) GB that specifies the synthesized sound as a time series. The synthesis information GB specifies, in time series and for example note by note, the pronunciation characters X1, the sounding period X2, and the pitch X3 of the synthesized sound. The pronunciation characters X1 are, for example, the character string of the lyrics when a singing voice is synthesized, and the sounding period X2 is specified by, for example, the start time and duration of the sound. The synthesis information GB is generated, for example, according to user instructions given through various input devices (not shown) and is stored in the storage device 14. Synthesis information GB received from another communication terminal over a communication network, or transferred from a portable recording medium, may also be used to generate the speech signal VOUT.

The segment selection unit 22 in FIG. 1 sequentially selects, from the segment group GA, the speech segment data D of the speech segments corresponding to the pronunciation characters X1 that the synthesis information GB specifies in time series. For example, when the pronunciation characters X1 are "sakura", the segment selection unit 22 selects, in order, the speech segment data D of seven speech segments: [Sil-s] (Sil: silence), [s-a], [a-k], [k-u], [u-r], [r-a], and [a-Sil]. Of the speech segment data D that the segment selection unit 22 sequentially selects, each voiced unit data UA is supplied to the amplitude calculation unit 24, and each unvoiced unit data UB is supplied to the segment adjustment unit 26.
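The "sakura" example can be sketched in a few lines of Python. This is only an illustration of how a phoneme sequence maps onto a diphone sequence padded with silence; the function name `to_diphones` is hypothetical, and the split of the pronunciation characters into phonemes is assumed to be given, since the text does not describe that step.

```python
def to_diphones(phonemes, silence="Sil"):
    """Return the diphone labels covering a phoneme sequence.

    The sequence is padded with silence at both ends, and every pair of
    neighbouring symbols becomes one diphone label such as "[s-a]".
    """
    padded = [silence, *phonemes, silence]
    return [f"[{left}-{right}]" for left, right in zip(padded, padded[1:])]


# Phoneme split of "sakura" (assumed to be supplied by an earlier step).
units = to_diphones(["s", "a", "k", "u", "r", "a"])
print(units)
```

A sequence of n phonemes always yields n + 1 diphones, which is why the six phonemes of "sakura" give the seven speech segments listed above.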

The amplitude calculation unit 24 generates the amplitude spectrum (envelope) SM of each voiced frame from the amplitude characteristic data R (r1 to r4) specified by each unit data UA supplied from the segment selection unit 22. Methods for generating the amplitude spectrum SM from the amplitude characteristic data R are disclosed in the aforementioned Japanese Patent No. 3711880 and Japanese Unexamined Patent Application Publication No. 2007-226174.

The segment adjustment unit 26 adjusts the amplitude spectrum SM of each voiced frame generated by the amplitude calculation unit 24 and the unit data UB (amplitude spectrum SM and phase spectrum SP) supplied from the segment selection unit 22. Specifically, it adjusts the voiced amplitude spectrum SM and the unvoiced unit data UB so that each phoneme of the speech segment corresponding to each speech segment data D selected by the segment selection unit 22 has a duration corresponding to the sounding period X2 and a pitch corresponding to the pitch X3 of the synthesis information GB, and so that the head and tail of each speech segment connect smoothly to the preceding and following speech segments (that is, the volume rises over time at the head and falls over time at the tail). The voiced amplitude spectrum SM and the unvoiced unit data UB may also be adjusted so that, for example, the speech segment has acoustic characteristics desired by the user (for example, timbre or clarity). The voiced amplitude spectrum SM adjusted by the segment adjustment unit 26 is supplied to the phase calculation unit 32, and the unvoiced unit data UB adjusted by the segment adjustment unit 26 is supplied to the speech synthesis unit 34.

The phase calculation unit 32 in FIG. 1 generates the phase spectrum SP of each frame from the voiced amplitude spectrum SM adjusted by the segment adjustment unit 26. The phase calculation unit 32 of the first embodiment generates, as the phase spectrum SP of each voiced frame, the minimum phase that is uniquely determined by the amplitude spectrum SM of that frame. In a time-domain signal obtained by inverse Fourier transform of the spectrum of, for example, a low-pitched male voice, energy tends to be concentrated near the start point on the time axis. Among signals sharing the same amplitude spectrum SM, the minimum-phase signal (the one with the smallest group delay) concentrates its energy near the start point and can therefore be said to match this tendency of speech.

The minimum phase of an amplitude spectrum is generally calculated by the Hilbert transform of the logarithm of the amplitude spectrum. The phase calculation unit 32 of the first embodiment therefore generates the phase spectrum SP by applying the Hilbert transform to the logarithm log(SM) of the amplitude spectrum SM. Specifically, the phase calculation unit 32 first applies an inverse Fourier transform (inverse fast Fourier transform) to log(SM) to obtain a time-domain sample sequence, sets to zero the portion of that sequence corresponding to negative time on the time axis (the second half), and then applies a Fourier transform (for example, a fast Fourier transform). The phase calculation unit 32 takes the imaginary part of the result of the Fourier transform (the minimum phase) as the phase spectrum SP. The phase calculation unit 32 sequentially supplies unit data UC, which contains the amplitude spectrum SM and the phase spectrum SP generated from it, to the speech synthesis unit 34 frame by frame.
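The procedure above can be sketched with NumPy as follows. This is a minimal illustration, not the implementation of the phase calculation unit 32. It follows the conventional cepstral form of the method, in which the positive-time samples are doubled when the negative-time half is zeroed so that the imaginary part equals the full minimum phase; the text mentions only the zeroing, so the doubling is an assumption here. The function name `minimum_phase` is hypothetical, and the input is assumed to be a full-length (two-sided, symmetric) amplitude spectrum with strictly positive values.

```python
import numpy as np


def minimum_phase(amplitude):
    """Minimum-phase spectrum (radians) for a two-sided amplitude spectrum.

    amplitude: N strictly positive values, symmetric like the magnitude of
    an N-point FFT of a real signal.
    """
    n = len(amplitude)
    # Inverse FFT of log(SM): a real, even time-domain sequence (cepstrum).
    cepstrum = np.fft.ifft(np.log(amplitude)).real
    # Keep time 0, double positive times, zero negative times (second half).
    window = np.zeros(n)
    window[0] = 1.0
    window[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        window[n // 2] = 1.0  # Nyquist-like sample is kept once for even N
    # Forward FFT: real part is log(SM), imaginary part is the minimum phase.
    return np.fft.fft(cepstrum * window).imag


# Check against a filter known to be minimum phase: h = [1, 0.5].
reference = np.fft.fft([1.0, 0.5], 512)
estimated = minimum_phase(np.abs(reference))
print(np.max(np.abs(estimated - np.angle(reference))))
```

Because only the amplitude values enter the computation, the phase always belongs to the same frame as the amplitude it was derived from, which is the temporal correspondence the description emphasizes.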

The speech synthesis unit 34 generates the speech signal VOUT from the voiced unit data UC sequentially supplied from the phase calculation unit 32 and the unvoiced unit data UB sequentially supplied from the segment adjustment unit 26. Specifically, the speech synthesis unit 34 calculates the time waveform of each frame by inverse Fourier transform of the amplitude spectrum SM and the phase spectrum SP of each unit data UC and each unit data UB, and generates the speech signal VOUT by overlapping and connecting (adding) the time waveforms of successive frames.
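The inverse transform and overlap-add step can be sketched as below. The text does not specify a window function, frame length, or hop size, so the Hann window, the one-sided spectrum layout, and the parameter `hop` are assumptions made for illustration, and the function name `overlap_add` is hypothetical.

```python
import numpy as np


def overlap_add(amplitudes, phases, hop):
    """Build a waveform from per-frame amplitude and phase spectra.

    amplitudes, phases: arrays of shape (frames, bins) holding one-sided
    spectra, so each frame spans 2 * (bins - 1) samples.
    hop: number of samples between the starts of successive frames.
    """
    frames, bins = amplitudes.shape
    frame_len = 2 * (bins - 1)
    window = np.hanning(frame_len)  # assumed taper for smooth joins
    out = np.zeros(hop * (frames - 1) + frame_len)
    for i in range(frames):
        # Combine amplitude and phase into a complex spectrum.
        spectrum = amplitudes[i] * np.exp(1j * phases[i])
        wave = np.fft.irfft(spectrum, n=frame_len)
        # Add the windowed frame at its position on the time axis.
        out[i * hop:i * hop + frame_len] += window * wave
    return out


# Round trip on a single 8-sample frame.
frame = np.array([0.3, -1.2, 0.7, 0.1, -0.4, 0.9, -0.6, 0.2])
spec = np.fft.rfft(frame)
signal = overlap_add(np.abs(spec)[None, :], np.angle(spec)[None, :], hop=4)
```

With a hop smaller than the frame length, neighbouring frames overlap and their windowed waveforms are summed, which corresponds to the "overlapping and connecting (adding)" described above.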

In the first embodiment described above, the phase spectrum SP is calculated from the amplitude spectrum SM of each voiced frame, so the phase spectrum SP of each voiced frame does not need to be stored in the storage device 14 in advance. The storage capacity needed to store speech segments can therefore be reduced compared with Patent Document 1, which requires both the amplitude spectrum SM and the phase spectrum SP of voiced sounds to be prepared and held in advance.

Furthermore, in the first embodiment, the phase spectrum SP is calculated from the amplitude spectrum SM of each voiced frame, so the temporal correspondence between the amplitude spectrum SM and the phase spectrum SP is easily maintained. This has the advantage that the phase mismatch in the synthesized sound caused by a time difference between the amplitude spectrum SM and the phase spectrum SP can be suppressed without any special mechanism for aligning the amplitude spectrum SM and the phase spectrum SP of each frame in time.

As a configuration in which the segment adjustment unit 26 adjusts the amplitude spectrum SM, one in which the amplitude spectrum SM is adjusted after the phase spectrum SP has been calculated (hereinafter "aspect A") is also conceivable. In aspect A, however, the adjusted amplitude spectrum supplied to the speech synthesis unit 34 may, for example, be delayed relative to the phase spectrum SP, and the listener may perceive a phase mismatch in the synthesized sound. In the first embodiment, the phase spectrum SP is calculated from the amplitude spectrum SM already adjusted by the segment adjustment unit 26, so the effect of suppressing the phase mismatch by easily and reliably maintaining the temporal correspondence between the amplitude spectrum SM and the phase spectrum SP is especially pronounced compared with aspect A. Aspect A is nevertheless included in the scope of the present invention.

In aspect A, moreover, the characteristics of the amplitude spectrum SM adjusted by the segment adjustment unit 26 may diverge from those of the phase spectrum SP generated from the amplitude spectrum SM before adjustment, making the synthesized sound unnatural. This problem becomes more pronounced the more the adjustment by the segment adjustment unit 26 changes the characteristics of the amplitude spectrum SM. In the first embodiment, a phase spectrum SP consistent with the characteristics of the adjusted amplitude spectrum SM is calculated, which has the advantage of producing a more natural-sounding synthesized sound than aspect A.

<B: Second Embodiment>
A second embodiment of the present invention is described below. In each of the aspects illustrated below, elements whose operation and function are equivalent to those of the first embodiment are given the reference signs used in the description above, and their detailed description is omitted as appropriate.

FIG. 3 is a graph showing the relationship between the amplitude spectrum WM (envelope) and the phase spectrum WP of actual speech. In the speech illustrated in FIG. 3, a relationship between the amplitude spectrum WM and the phase spectrum WP can be confirmed: the phase value of the phase spectrum WP changes greatly at each frequency where one of the first formant F1 through the fourth formant F4 exists in the amplitude spectrum WM. In the band above the fourth formant F4, by contrast, no clear relationship between the amplitude spectrum WM and the phase spectrum WP is observed. Moreover, speech in which the relationship between the amplitude spectrum WM and the phase spectrum WP is maintained too strictly even in the high-frequency band tends to sound unnatural. It can therefore be inferred that the weakening of the relationship between the amplitude spectrum WM and the phase spectrum WP at high frequencies contributes to the perceived naturalness of speech. In the second embodiment, accordingly, the phase value of each frequency within a predetermined band on the high-frequency side (for example, the band at or above 4 kHz) of the phase spectrum SP calculated by the phase calculation unit 32 is varied.

FIG. 4 is a block diagram of the speech synthesizer 100 of the second embodiment. As shown in FIG. 4, the arithmetic processing unit 12 of the second embodiment functions as a first phase correction unit 41 in addition to the same elements as in the first embodiment (segment selection unit 22, amplitude calculation unit 24, segment adjustment unit 26, phase calculation unit 32, and speech synthesis unit 34). The first phase correction unit 41 changes the phase value at each frequency of the phase spectrum SP that the phase calculation unit 32 calculates for each frame. Unit data UC, which includes the amplitude spectrum SM adjusted by the segment adjustment unit 26 and the phase spectrum SP corrected by the first phase correction unit 41, is supplied to the speech synthesis unit 34 for each frame.

Part (A) of FIG. 5 shows the amplitude spectrum SM indicated by unit data UA of a voiced sound. Part (B) of FIG. 5 shows the phase spectrum SP calculated by the phase calculation unit 32 (the phase spectrum before correction by the first phase correction unit 41), and part (C) of FIG. 5 shows the phase spectrum SP after correction by the first phase correction unit 41. In part (C) of FIG. 5, the uncorrected phase spectrum SP of part (B) is also drawn as a broken line. As shown in part (C) of FIG. 5, the first phase correction unit 41 randomly varies the phase value of each frequency within band B (for example, the band at and above 4 kHz) of the phase spectrum SP calculated by the phase calculation unit 32 (part (B) of FIG. 5). That is, fluctuation is imparted to the series of phase values within band B.

Specifically, the first phase correction unit 41 generates a random number for each frequency within band B and calculates the corrected phase value by adding the random number for that frequency to, or subtracting it from, the phase value of that frequency in the phase spectrum SP. Accordingly, as shown in part (A) of FIG. 6, the phase value of each frequency within band B of the corrected phase spectrum SP is set to an arbitrary value within a predetermined variation range α1 centered on the uncorrected phase value. As a result of the correction by the first phase correction unit 41, the correlation between the phase spectrum SP and the amplitude spectrum SM within band B is reduced, as in the speech illustrated in FIG. 3. The random number applied to each phase value is updated, for example, every frame.
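The band-limited correction described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the first phase correction unit 41: the function name `perturb_high_band`, the uniform distribution, and the value chosen for the width `alpha1` are assumptions that the description does not specify.

```python
import math
import random


def perturb_high_band(phases, freqs_hz, band_start_hz=4000.0,
                      alpha1=math.pi / 2, rng=None):
    """Return a copy of `phases` with random offsets added inside band B.

    `alpha1` is the full width of the variation range, so each offset lies
    in [-alpha1 / 2, +alpha1 / 2] around the uncorrected value.
    The uniform distribution and the default width are assumptions.
    """
    rng = rng or random.Random()
    corrected = []
    for phase, freq in zip(phases, freqs_hz):
        if freq >= band_start_hz:
            # Only frequencies in band B (here: 4 kHz and above) are varied.
            phase = phase + rng.uniform(-alpha1 / 2, alpha1 / 2)
        corrected.append(phase)
    return corrected


# One frame: a new random offset would be drawn again for the next frame.
freqs = [1000.0, 3000.0, 5000.0, 7000.0]
phases = [0.1, -0.4, 0.8, -1.2]
corrected = perturb_high_band(phases, freqs, rng=random.Random(0))
```

Calling the function once per frame with a fresh draw corresponds to updating the random numbers every frame.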

The second embodiment achieves the same effects as the first embodiment. In addition, because varying the phase values within band B of the phase spectrum SP reduces the correlation between the amplitude spectrum SM and the phase spectrum SP within band B, the second embodiment has the advantage that it can generate synthesized sound that is perceived as more natural than that of the first embodiment.

<C: Third Embodiment>
FIG. 7 is a block diagram of the speech synthesizer 100 of the third embodiment. As shown in FIG. 7, each item of speech segment data D in the segment group GA of the third embodiment includes type information C in addition to a time series of a plurality of unit data U (UA, UB). The type information C specifies the type of each phoneme in the speech segment. For example, types such as vowels (/a/, /i/, /u/), unvoiced plosives (/t/, /k/, /p/), voiced plosives (/b/, /d/, /g/), unvoiced affricates (/ts/), voiced affricates (/j/), unvoiced fricatives (/s/, /f/), voiced fricatives (/z/), and semivowels (/w/, /y/) are specified by the type information C.

The storage device 14 also stores in advance, for each type of voiced phoneme (vowel, voiced plosive, voiced affricate, voiced fricative), voicing degree information DV that specifies the temporal transition of the degree of voicing or unvoicing within a phoneme (hereinafter the "voicing degree") V. FIG. 8 is a schematic diagram of the temporal change in the voicing degree V indicated by the voicing degree information DV. Part (A) of FIG. 8 shows the voicing degree V of the speech segment [j-a], in which the voiced fricative /j/ and the vowel /a/ are concatenated, and part (B) of FIG. 8 shows the voicing degree V of the speech segment [b-a], in which the voiced plosive /b/ and the vowel /a/ are concatenated.

The voicing degree V varies from the start point to the end point of a phoneme between the value 0, which denotes voiced, and the value 1, which denotes unvoiced. As shown in part (A) of FIG. 8, the voicing degree V of the voiced fricative /j/ changes linearly from 0 to 1 within a section of predetermined length (for example, three frames) from the phoneme start point ts to time t1, remains at 1 from time t1 to time t2, and changes linearly from 1 to 0 within a section of predetermined length (for example, three frames) from time t2 to the end point te. As shown in part (B) of FIG. 8, the voicing degree V of the voiced plosive /b/ changes from 0 to 0.5 within the section (for example, four frames) from the phoneme start point ts to time t1, remains at 0.5 from time t1 to time t2, and changes from 0.5 to 0 within the section (for example, four frames) from time t2 to the end point te. The voicing degree V of the vowel /a/, by contrast, is held at 0 (voiced) over its entire duration.
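The trajectories of FIG. 8 can be illustrated with a small sketch that builds a rise, hold, and fall profile per frame. The function name `voicing_trajectory`, the frame counts used in the example, and the linear shape of the /b/ ramps are assumptions for illustration; the description states linearity only for /j/.

```python
def voicing_trajectory(num_frames, ramp_frames, peak):
    """Per-frame voicing degree V: 0 -> peak, hold at peak, peak -> 0.

    0 means voiced and 1 means unvoiced, as in the description.
    `ramp_frames` is the length of each ramp in frames.
    """
    values = []
    for i in range(num_frames):
        rise = i / ramp_frames                      # progress along the rising ramp
        fall = (num_frames - 1 - i) / ramp_frames   # progress along the falling ramp
        values.append(peak * max(0.0, min(1.0, rise, fall)))
    return values


v_j = voicing_trajectory(num_frames=10, ramp_frames=3, peak=1.0)  # like /j/ in FIG. 8(A)
v_b = voicing_trajectory(num_frames=10, ramp_frames=4, peak=0.5)  # like /b/ in FIG. 8(B)
v_a = voicing_trajectory(num_frames=10, ramp_frames=3, peak=0.0)  # vowel /a/: always 0
```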

As shown in FIG. 7, the arithmetic processing unit 12 of the third embodiment functions as a second phase correction unit 42 in addition to the same elements as in the first embodiment (segment selection unit 22, amplitude calculation unit 24, segment adjustment unit 26, phase calculation unit 32, and speech synthesis unit 34). As shown in part (B) of FIG. 6, the second phase correction unit 42 randomly varies the phase value at each frequency of the phase spectrum SP calculated by the phase calculation unit 32 within a variation range α2 centered on that phase value. Specifically, the second phase correction unit 42 generates a random number for each frequency over the entire band on the frequency axis and calculates the corrected phase value by adding the random number for that frequency to, or subtracting it from, the phase value of that frequency in the phase spectrum SP. The random numbers applied to the correction of the phase values are updated every frame.

In actual speech, the correlation between the amplitude spectrum WM and the phase spectrum WP tends to weaken as the speech approaches unvoiced. Consequently, if the correlation between the amplitude spectrum SM and the phase spectrum SP remains high even in frames where the speech is close to unvoiced, the synthesized sound may be perceived as artificial. In view of this tendency, the second phase correction unit 42 of the third embodiment variably controls the variation range α2, within which each phase value of the phase spectrum SP of each frame is varied, according to the voicing degree V that the voicing degree information DV specifies for that frame.

That is, the second phase correction unit 42 acquires from the storage device 14 the voicing degree information DV corresponding to the type information C of the speech segment data D selected by the segment selection unit 22 (that is, the voicing degree information DV corresponding to the phoneme to be synthesized), and randomly varies the phase values of the phase spectrum SP of each frame within a variation range α2 that depends on the voicing degree V specified for that frame by the acquired voicing degree information DV. Specifically, the random number for each frequency is set so that the variation range α2 becomes wider (that is, the correlation between the amplitude spectrum SM and the corrected phase spectrum SP becomes weaker) as the voicing degree V of the frame approaches the unvoiced value 1.

For example, the variation range α2 of the phase values in each frame of the phonemes of the voiced fricative /j/ in part (A) of FIG. 8 and the voiced plosive /b/ in part (B) of FIG. 8 widens from the phoneme start point ts to time t1, is held constant from time t1 to time t2, and narrows from time t2 to the end point te. The variation range α2 in each frame of the phoneme of the vowel /a/, by contrast, is held to a constant narrow range over the entire phoneme.
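The dependence of the variation range α2 on the voicing degree V can be sketched as below. The description only requires that α2 widen as V approaches 1; the linear mapping, the bounds `alpha_min` and `alpha_max`, the uniform distribution, and the function names are assumptions made for this sketch.

```python
import math
import random


def alpha2_for_voicing(v, alpha_min=0.1, alpha_max=math.pi):
    """Width of the variation range for voicing degree v in [0, 1].

    Linear interpolation between two hypothetical bounds: narrow when
    voiced (v = 0), wide when unvoiced (v = 1).
    """
    return alpha_min + (alpha_max - alpha_min) * v


def perturb_frame(phases, v, rng):
    """Vary every phase value of one frame within the range set by v."""
    half = alpha2_for_voicing(v) / 2
    return [p + rng.uniform(-half, half) for p in phases]


rng = random.Random(0)
frame_phases = [0.3, -0.2, 0.9, -1.1]
for v in (0.0, 0.5, 1.0):           # e.g. successive frames of /j/ from ts to t1
    corrected = perturb_frame(frame_phases, v, rng)
```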

The third embodiment achieves the same effects as the first embodiment. In addition, in the third embodiment the variation range α2, within which the second phase correction unit 42 varies the phase values of the phase spectrum SP of each frame of a voiced sound, is controlled according to the voicing degree V of that frame. Compared with the first and second embodiments, it is therefore possible to generate natural synthesized sound with a reduced artificial impression, particularly in the sections of voiced phonemes that are close to unvoiced.

<D: Fourth Embodiment>
In the first embodiment, the minimum phase corresponding to the amplitude spectrum SM is calculated as the phase spectrum SP. The fourth embodiment differs from the first embodiment in the method of calculating the phase spectrum SP corresponding to the amplitude spectrum SM. FIG. 9 illustrates the operation by which the phase calculation unit 32 of the fourth embodiment calculates the phase spectrum SP corresponding to the amplitude spectrum SM. Part (A) of FIG. 9 shows the amplitude spectrum SM after adjustment by the segment adjustment unit 26. The amplitude spectrum SM is expressed as a series of amplitude values A[k] corresponding to mutually different frequencies f[k] on the frequency axis. The symbol k denotes an arbitrary single frequency (frequency bin) on the frequency axis. Because the amplitude spectrum SM specified from the amplitude characteristic data R is an envelope, the symbol k corresponds to the order of a harmonic component (the fundamental component and a plurality of overtone components). That is, the amplitude value A[k] denotes the amplitude of the k-th harmonic component in the amplitude spectrum SM. The phase calculation unit 32 of the fourth embodiment executes the processing illustrated below sequentially for each frame.

First, for each frequency f[k] on the frequency axis, the phase calculation unit 32 calculates, for each frame, the difference δA[k] (hereinafter the "amplitude difference") between the amplitude value A[k] at frequency f[k] and the amplitude value A[k-1] at the frequency f[k-1] adjacent to f[k] on the frequency axis (δA[k] = A[k] − A[k-1]). Part (B) of FIG. 9 shows the amplitude differences δA[k] on the frequency axis.

Second, the phase calculation unit 32 calculates an amplitude difference δB[k] for each frequency f[k] by smoothing the amplitude differences δA[k] in the direction of the frequency axis. Any known technique may be used for this smoothing; for example, a configuration in which the moving average of a plurality of values including the amplitude difference δA[k] at each frequency f[k] is calculated as the smoothed amplitude difference δB[k] is suitable. Third, the phase calculation unit 32 converts each smoothed amplitude difference δB[k] into a value within the range from −π to +π inclusive, and generates a phase spectrum SP whose phase value at each frequency f[k] is the converted value. That is, the phase calculation unit 32 functions as an element that calculates the phase spectrum SP by smoothing, in the direction of the frequency axis, the amplitude differences δA[k] between frequencies (f[k], f[k-1]) adjacent to each other on the frequency axis in the amplitude spectrum SM.
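The three steps of the fourth embodiment can be sketched as follows. Several details are assumptions that the description leaves open: the amplitude difference of the lowest bin is set to 0 because it has no lower neighbor, the moving average is centered and truncated at the edges, and the conversion into the range from −π to +π is done here by scaling with a hypothetical `gain` and clamping. The function name `phase_from_amplitude` is also hypothetical.

```python
import math


def phase_from_amplitude(amplitudes, window=3, gain=1.0):
    """Derive a phase value per frequency bin from amplitude values A[k]."""
    n = len(amplitudes)
    # Step 1: amplitude difference dA[k] = A[k] - A[k-1].
    # The lowest bin has no lower neighbor, so 0.0 is used (assumption).
    delta_a = [0.0] + [amplitudes[k] - amplitudes[k - 1] for k in range(1, n)]

    # Step 2: smooth along the frequency axis with a centered moving average
    # (window truncated at the band edges).
    half = window // 2
    delta_b = []
    for k in range(n):
        lo, hi = max(0, k - half), min(n, k + half + 1)
        delta_b.append(sum(delta_a[lo:hi]) / (hi - lo))

    # Step 3: convert into [-pi, +pi]. Scaling by `gain` and clamping is an
    # assumption; the description does not fix the conversion rule.
    return [max(-math.pi, min(math.pi, gain * d)) for d in delta_b]


envelope = [0.0, 1.0, 2.0, 3.0]       # toy amplitude values A[k]
phase_sp = phase_from_amplitude(envelope)
```

A flat envelope yields a phase spectrum of zeros in this sketch, and rising or falling slopes of the envelope yield positive or negative phase values respectively.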

Part (C) of FIG. 9 shows the phase spectrum SP that the phase calculation unit 32 of the fourth embodiment calculates from the amplitude spectrum SM of part (A) of FIG. 9. Part (D) of FIG. 9 shows the phase spectrum SP (the minimum phase corresponding to the amplitude spectrum SM) that the phase calculation unit 32 of the first embodiment calculates from the amplitude spectrum SM of part (A) of FIG. 9. A comparison of parts (C) and (D) of FIG. 9 shows that the fourth embodiment can also generate, for each frame, a phase spectrum SP of similar shape to that of the first embodiment (with an equivalent relationship to the amplitude spectrum SM). That is, the fourth embodiment also achieves the same effects as the first embodiment.

Although the fourth embodiment has been described above on the basis of the first embodiment, a configuration may also be adopted in which the phase spectrum SP calculated by the phase calculation unit 32 of the fourth embodiment is corrected by one or both of the first phase correction unit 41 of the second embodiment and the second phase correction unit 42 of the third embodiment.

<E: Fifth Embodiment>
FIG. 10 is a block diagram of the speech synthesizer 100 of the fifth embodiment. As shown in FIG. 10, the arithmetic processing unit 12 of the fifth embodiment functions as a third phase correction unit 43 in addition to the same elements as in the first embodiment (segment selection unit 22, amplitude calculation unit 24, segment adjustment unit 26, phase calculation unit 32, and speech synthesis unit 34). The third phase correction unit 43 calculates a phase value φB[m] for each frame by correcting each phase value φA[m] of the phase spectrum SP that the phase calculation unit 32 calculates for each frame. The symbol m denotes an arbitrary single frame on the time axis (for example, a frame number). The phase value φB[m] is calculated for each harmonic component (each frequency bin) of the amplitude spectrum (envelope) SM. The series of phase values φB[m] corrected by the third phase correction unit 43 is used in the processing of the speech synthesis unit 34 as the phase spectrum SP of the m-th frame.

Specifically, as expressed by Equation (1), the third phase correction unit 43 calculates the phase value φB[m] of the corrected phase spectrum SP for each harmonic component by adding a prediction error Δφ[m] to each phase value φA[m] of the phase spectrum SP that the phase calculation unit 32 calculated for the m-th frame.

As expressed by Equation (2), the prediction error Δφ[m] of the m-th frame in Equation (1) corresponds to the difference (error) between the phase value predicted for the m-th frame (hereinafter the "predicted phase") φE[m] and the actual phase value φ[m] of the m-th frame of the speech segment. The predicted phase φE[m] of the m-th frame is a predicted value estimated from the actual phase value φ[m-1] of the immediately preceding ((m-1)-th) frame, and the phase value φ[m] is the actual phase value (measured value) in the speech segment represented by the speech segment data D.

The symbol princarg( ) in Equation (2) denotes an operator that converts the value in parentheses into a value within the range from −π to +π inclusive (that is, the numerical range of a phase).

The predicted phase φE[m] in Equation (2) is calculated for each harmonic component by the operation of Equation (3).

The symbol Δt in Equation (3) denotes the time difference between successive frames, and the symbol f[m] in Equation (3) denotes the frequency of one harmonic component (the harmonic component corresponding to the predicted phase φE[m]) in the m-th frame. As understood from Equation (3), its second term denotes the amount by which the phase of a sound at the average frequency ((f[m-1] + f[m]) / 2) of the frequencies f[m-1] and f[m] varies within the time Δt (that is, the predicted value of the phase variation between successive frames). The predicted phase φE[m] calculated by Equation (3) therefore corresponds to the phase value of the m-th frame predicted from the actual phase value φ[m-1] of the immediately preceding ((m-1)-th) frame, under the assumption that the phase changes linearly over time.
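Equations (1) to (3) themselves are not reproduced in this text. The following is a reconstruction from the surrounding prose only and should not be read as a copy of the published formulas. In particular, the sign convention inside princarg in Equation (2) and the factor 2π in Equation (3), which assumes that f[m] is expressed in hertz and Δt in seconds, are assumptions.

```latex
\begin{align}
\phi_B[m] &= \phi_A[m] + \Delta\phi[m] \tag{1} \\
\Delta\phi[m] &= \operatorname{princarg}\bigl(\phi[m] - \phi_E[m]\bigr) \tag{2} \\
\phi_E[m] &= \phi[m-1] + 2\pi \cdot \frac{f[m-1] + f[m]}{2} \cdot \Delta t \tag{3}
\end{align}
```

With this form, a harmonic component whose frequency is constant and whose phase advances exactly linearly between frames has a prediction error of zero, which matches the linear-phase assumption stated for Equation (3).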

In the fifth embodiment, the phase value φ[m] of each harmonic component is calculated in advance for each frame from the recorded speech segment, and the prediction error Δφ[m] calculated from each phase value φ[m] by the operations of Equations (2) and (3) is set in each unit data U of the speech segment data D. That is, the prediction error Δφ[m] of each frame of a speech segment is stored in advance in the storage device 14 for each speech segment. The third phase correction unit 43 calculates the corrected phase value φB[m] by adding the prediction error Δφ[m] of the m-th frame stored in the storage device 14 to each phase value φA[m] of the phase spectrum SP that the phase calculation unit 32 calculated for the m-th frame (Equation (1)).
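The offline calculation of the prediction errors and their later use can be sketched for a single harmonic component as follows. The sketch assumes frequencies in hertz (hence the factor 2π), takes the prediction error as the measured phase minus the predicted phase, and sets the error of the first frame to 0 because it has no predecessor; none of these details is fixed by the prose above. The function names are hypothetical.

```python
import math


def princarg(x):
    """Wrap an angle into the principal range [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def prediction_errors(phi, freq_hz, dt):
    """Prediction error per frame for one harmonic component.

    phi[m] and freq_hz[m] are the measured phase and frequency in frame m,
    and dt is the time difference between successive frames in seconds.
    """
    errors = [0.0]  # frame 0 has no preceding frame (assumption)
    for m in range(1, len(phi)):
        mean_freq = 0.5 * (freq_hz[m - 1] + freq_hz[m])
        predicted = phi[m - 1] + 2 * math.pi * mean_freq * dt  # linear-phase prediction
        errors.append(princarg(phi[m] - predicted))
    return errors


def apply_correction(phi_a, errors):
    """Add the stored prediction error to each calculated phase value."""
    return [a + e for a, e in zip(phi_a, errors)]


# A harmonic at a constant 100 Hz whose phase deviates by 0.3 rad in frame 2.
dt = 0.005
measured = [0.0, math.pi, 2 * math.pi + 0.3]
stored = prediction_errors(measured, [100.0, 100.0, 100.0], dt)
phi_b = apply_correction([0.1, 0.2, 0.3], stored)
```

In this sketch `prediction_errors` plays the role of the processing done in advance, and `apply_correction` the role of the correction performed at synthesis time.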

The fifth embodiment achieves the same effects as the first embodiment. In addition, because the prediction error Δφ[m] is added to the phase spectrum SP calculated by the phase calculation unit 32, the fifth embodiment has the advantage that it can calculate a phase spectrum SP whose behavior approximates the phase variation of actual speech (and can therefore generate synthesized sound that is perceived as natural). Although the above description illustrates a configuration in which the third phase correction unit 43 is added to the configuration of the first embodiment, the third phase correction unit 43 may also be added to, for example, the configuration of the fourth embodiment.

The total number of prediction errors Δφ[m] prepared for one frame in the fifth embodiment equals the total number of harmonic components in the speech segment used to generate the speech segment data D (hereinafter the "original segment"). Accordingly, a configuration is suitable in which, when the pitch X3 of the synthesized sound is higher than the pitch of the original segment (when the number of prediction errors Δφ[m] is excessive), a plurality of the prediction errors Δφ[m] are thinned out as appropriate before being applied to the correction of the phase values φA[m], and, when the pitch X3 of the synthesized sound is lower than the pitch of the original segment (when the number of prediction errors Δφ[m] is insufficient), each prediction error Δφ[m] is reused for a plurality of frequencies as appropriate before being applied to the correction of the phase values φA[m]. When the prediction errors Δφ[m] at a plurality of frequencies within one frame are equal to one another, the synthesized sound tends to sound unnatural. Therefore, when one prediction error Δφ[m] is reused over a plurality of frequencies, a configuration in which the prediction error Δφ[m] is made to differ from frequency to frequency (for example, by adding a random number to each prediction error Δφ[m]) is suitable.
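The thinning and reuse of prediction errors can be illustrated with the sketch below. The description says only that the errors are thinned out or reused "as appropriate"; the evenly spaced index mapping, the size of the random jitter, and the function name `fit_errors` are assumptions for this sketch.

```python
import random


def fit_errors(errors, num_harmonics, jitter=0.05, rng=None):
    """Fit one frame's stored prediction errors to `num_harmonics` values.

    Fewer harmonics than stored errors (synthesized pitch higher than the
    original): evenly spaced errors are kept and the rest dropped.
    More harmonics than stored errors (synthesized pitch lower): errors are
    reused, and a small random offset is added to every output value so that
    reused values differ from frequency to frequency.
    """
    rng = rng or random.Random()
    n = len(errors)
    if num_harmonics <= n:
        return [errors[(k * n) // num_harmonics] for k in range(num_harmonics)]
    fitted = []
    for k in range(num_harmonics):
        source = errors[(k * n) // num_harmonics]
        fitted.append(source + rng.uniform(-jitter, jitter))
    return fitted


stored = [0.10, -0.20, 0.05, 0.30]
higher_pitch = fit_errors(stored, 2)                       # thinning
lower_pitch = fit_errors(stored, 7, rng=random.Random(0))  # reuse with jitter
```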

The above description illustrates a configuration in which the prediction error Δφ[m] of each frame of each speech segment is calculated in advance and stored in the storage device 14 (hereinafter "configuration 1"), but the method by which the third phase correction unit 43 acquires the prediction error Δφ[m] applied to the correction of the phase spectrum SP may be changed as appropriate.

For example, a configuration may also be adopted in which the time waveform of each speech segment is included in the speech segment data D and, when the speech signal VOUT is synthesized, the third phase correction unit 43 calculates the phase value (measured value) φ[m] of each frame from the time waveform in the speech segment data D, calculates the prediction error Δφ[m] of each frame from each phase value φ[m] by the operations of Equations (2) and (3), and applies it to the correction of each phase value φA[m] of the phase spectrum SP (hereinafter "configuration 2"). Configuration 1 described above has the advantage that, because the time waveform of the speech segment need not be stored, the storage capacity required of the storage device 14 is smaller than in configuration 2. Configuration 1 also has the advantage that, because the phase values φ[m] need not be calculated from the time waveform when the speech signal VOUT is synthesized, the processing load on the third phase correction unit 43 is lighter than in configuration 2.

A configuration may also be adopted in which a representative value (for example, the average) of the prediction errors Δφ[m] calculated for the frames of a speech segment is stored in advance in the storage device 14 for each speech segment as the prediction error Δφ of each harmonic component (hereinafter "configuration 3"). The third phase correction unit 43 calculates the phase value φB[m] by adding the prediction error Δφ in the storage device 14 in common to each phase value φA[m] of the phase spectrum SP that the phase calculation unit 32 calculated for each frame. When the prediction error Δφ is common to a plurality of frames, the synthesized sound tends to sound unnatural. In configuration 3, therefore, a configuration in which the prediction error Δφ is made to differ from frame to frame (for example, by adding a random number to the prediction error Δφ of each frame) is suitable. Configuration 3 has the advantage that, because neither the per-frame prediction errors Δφ[m] nor the speech waveform of the speech segment needs to be held, the storage capacity required of the storage device 14 is smaller than in configurations 1 and 2.

<F: Modifications>
Each of the above embodiments may be modified in various ways. Specific modifications are illustrated below. Two or more modifications arbitrarily selected from the following examples may be combined as appropriate.

(1) In the first embodiment, the minimum phase corresponding to the amplitude spectrum SM is calculated as the phase spectrum SP, and in the fourth embodiment, the series of amplitude differences δB[k] obtained by smoothing the amplitude difference δA[k] at each frequency f[k] is calculated as the phase spectrum SP. However, the method of calculating, for each frame, the phase spectrum SP corresponding to the amplitude spectrum SM is not limited to these examples. For example, the maximum phase, which like the minimum phase is uniquely determined from the amplitude spectrum SM, may be calculated as the phase spectrum SP. Synthesized sound generated with the maximum phase as the phase spectrum SP tends to have its energy concentrated at later time points within one wavelength. A configuration that generates the maximum phase of the amplitude spectrum SM as the phase spectrum SP is therefore suitable, for example, when the speech signal VOUT is played back in reverse, or when speech segment data D is generated for a speech segment in which the order of the phonemes indicated by existing speech segment data D is reversed. An example of the latter case is to calculate the amplitude spectrum SM and the phase spectrum SP (the maximum phase of the amplitude spectrum SM) for each frame of the speech segment [Sil-a] and reverse the order of the frames, thereby generating the speech segment data D of the speech segment [a-Sil]. The method of calculating the minimum phase or the maximum phase of the amplitude spectrum SM is not limited to the above examples. As understood from the above examples, the phase calculation unit 32 in each of the above embodiments is encompassed as an element (phase calculation means) that calculates, for each frame, the phase spectrum SP corresponding to the amplitude spectrum SM of each frame of a speech segment.

(2) A configuration including both the first phase correction unit 41 of the second embodiment and the second phase correction unit 42 of the third embodiment may also be adopted. In such a configuration, as shown in parts (A) and (B) of FIG. 6, the variation range α1 of each phase value (within the band B) under correction by the first phase correction unit 41 is set wider than the variation range α2 of each phase value under correction by the second phase correction unit 42.

(3) In the third embodiment, the second phase correction unit 42 corrects the phase values over the entire band on the frequency axis, but it is also possible to limit the correction by the second phase correction unit 42 to the phase values within a specific band. For example, a configuration may be adopted in which the second phase correction unit 42 corrects each phase value within a band that excludes a predetermined number (for example, five) of frequencies on the low-frequency side.

(4) In the third embodiment, voicing degree information DV indicating the temporal change of the voicing degree V is stored in the storage device 14 in advance, but any method may be used to identify the temporal change of the voicing degree V within a speech unit. For example, a configuration may be adopted that calculates the temporal change of the voicing degree V from features of the speech identified from the speech unit data D (for example, spectral tilt, formant positions and intensities, and zero-crossing count), or a configuration that stores these features in the storage device 14 in advance and uses them to calculate the temporal change of the voicing degree V. As understood from the above description, the second phase correction unit 42 is encompassed as an element that identifies the temporal change of the voicing degree V within the speech unit (for example, by acquiring it from the storage device 14 or by calculating it with a predetermined method) and randomly varies the phase values of the phase spectrum SP of each frame within a variation range α2 that depends on the voicing degree V identified for that frame. The method of identifying the temporal change of the voicing degree V is immaterial.
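The random correction described in (3) and (4) can be sketched as follows. This is a minimal illustration, assuming that the voicing degree V lies in [0, 1], that the variation range α2 widens linearly as V falls, and that the variation is uniformly distributed. The function name, the `max_range` and `skip_low` parameters, and the linear mapping are hypothetical and are not taken from the embodiments.

```python
import numpy as np

def perturb_phase(phase, voicing, max_range=np.pi, skip_low=5, rng=None):
    """Randomly vary phase values within a range that depends on voicing.

    phase: phase spectrum SP of one frame (radians).
    voicing: voicing degree V of that frame, assumed to lie in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumed mapping: fully voiced (V = 1) gives no variation, and
    # fully unvoiced (V = 0) gives the widest range alpha2.
    alpha2 = max_range * (1.0 - np.clip(voicing, 0.0, 1.0))
    noise = rng.uniform(-alpha2 / 2, alpha2 / 2, size=len(phase))
    noise[:skip_low] = 0.0  # leave the lowest bins uncorrected, as in (3)
    return np.asarray(phase, dtype=float) + noise
```

The voicing value passed per frame could come from stored information such as DV or from a feature-based estimate; the sketch does not depend on which.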

(5) The format of the speech unit data D is arbitrary. For example, each of the above-described embodiments uses speech unit data D that includes amplitude characteristic data R for each frame, but a configuration in which the speech unit data D directly specifies the amplitude spectrum SM of each frame (that is, a series of amplitude values for each frequency) may also be adopted. In a configuration in which the speech unit data D contains the amplitude spectrum SM, the amplitude calculation unit 24 is omitted. As understood from the above examples, the speech unit data D is encompassed as data indicating the amplitude spectrum SM of each frame of a speech unit.

(6) Each of the above-described embodiments illustrates the speech synthesizer 100, which generates the audio signal VOUT using the phase spectrum SP calculated by the phase calculation unit 32. However, the present invention can also be implemented as a speech processing apparatus (phase calculation apparatus) that calculates, for each frame, the phase spectrum SP corresponding to the amplitude spectrum SM of each frame of a speech (speech unit). That is, the speech synthesis unit 34 (speech synthesis means) can be omitted.
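A phase calculation performed on its own, without the synthesis stage, might look like the following sketch. It follows the fourth embodiment's approach of smoothing the differences between amplitude values at adjacent frequencies. The moving-average kernel, its `width`, the handling of the last bin, and the use of the smoothed differences as phase values without further scaling are assumptions made for illustration.

```python
import numpy as np

def phase_from_amplitude_differences(amplitude, width=5):
    """Phase spectrum SP as a smoothed series of adjacent amplitude differences."""
    amplitude = np.asarray(amplitude, dtype=float)
    # delta_a[k]: difference between amplitude values at adjacent frequencies.
    # The last value is repeated so the result keeps one entry per frequency.
    delta_a = np.diff(amplitude, append=amplitude[-1])
    # delta_b[k]: moving average of delta_a along the frequency axis.
    kernel = np.ones(width) / width
    delta_b = np.convolve(delta_a, kernel, mode="same")
    return delta_b
```

Because the function needs only the amplitude spectrum SM of one frame, it can be applied frame by frame without storing any phase data alongside the speech unit.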

DESCRIPTION OF SYMBOLS 100 ... speech synthesizer, 12 ... arithmetic processing unit, 14 ... storage device, 16 ... sound emitting device, 22 ... unit selection section, 24 ... amplitude calculation unit, 26 ... unit adjustment section, 32 ... phase calculation unit, 34 ... speech synthesis unit, 41 ... first phase correction unit, 42 ... second phase correction unit, 43 ... third phase correction unit.

Claims

A speech synthesis apparatus comprising:
phase calculation means for calculating a phase spectrum for each frame by smoothing, in the direction of the frequency axis, the differences in amplitude value between frequencies adjacent to each other on the frequency axis in the amplitude spectrum that speech unit data indicates for each frame of a speech unit; and
speech synthesis means for generating a speech signal using the amplitude spectrum of each frame indicated by the speech unit data and the phase spectrum of each frame calculated by the phase calculation means.

A speech synthesis method in which a computer system:
calculates a phase spectrum for each frame by smoothing, in the direction of the frequency axis, the differences in amplitude value between frequencies adjacent to each other on the frequency axis in the amplitude spectrum that speech unit data indicates for each frame of a speech unit; and
generates a speech signal using the amplitude spectrum of each frame indicated by the speech unit data and the calculated phase spectrum of each frame.