JP2012252303A

JP2012252303A - Voice synthesizer

Info

Publication number: JP2012252303A
Application number: JP2011127123A
Authority: JP
Inventors: Keijiro Saino; 慶二郎才野
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-06-07
Filing date: 2011-06-07
Publication date: 2012-12-20
Anticipated expiration: 2031-06-07
Also published as: JP5914996B2

Abstract

PROBLEM TO BE SOLVED: To synthesize natural-sounding voice at a high speaking rate. SOLUTION: A storage device 14 stores a plurality of pieces of voice fragment data DA, each indicating a voice fragment V that contains a plurality of phoneme sections S corresponding to different phonemes. A fragment selecting part 22 sequentially selects voice fragments V. A phoneme length setting part 24 variably sets a synthesis time length T for each phoneme section S of the voice fragment V selected by the fragment selecting part 22. A voice synthesis part 26 generates a voice signal VOUT by linking together the voice waveforms indicated by the voice fragment data DA for a target section W, which is the portion of each phoneme section S of the selected voice fragment V spanning the synthesis time length T set by the phoneme length setting part 24. The position (front or rear) of the target section W within the phoneme section S is determined according to the phoneme type.

Description

The present invention relates to a technique for synthesizing voice, such as speech or singing, by concatenating a plurality of speech units.

Unit-concatenation speech synthesis techniques, which synthesize a desired voice by concatenating a plurality of speech units, have been proposed. For example, Patent Document 1 discloses a technique that synthesizes voice of a desired duration by stretching or compressing each speech unit (prosodic sub-unit) along the time axis according to the length of the sounding period specified by the user, and then concatenating the units.

Patent Document 1: JP 2003-108176 A

However, with the technique of Patent Document 1, when each speech unit is excessively compressed along the time axis in order to generate synthesized sound with a high speaking rate (number of phonemes per unit time), the result may be unnatural voice that sounds as if it were uttered faster than a human can actually change the shape of the mouth. In addition, when a human actually speaks quickly, the next phoneme tends to begin before the clear articulation of the current phoneme is complete (that is, part of the phoneme is omitted). With the technique of Patent Document 1, however, each phoneme is sounded in its entirety from start point to end point even when the speech unit is compressed, so the synthesized sound is perceptually unnatural. For example, when one phoneme is repeated in a short cycle (e.g., "wawawawa..."), in reality the next utterance begins before the mouth has fully opened in each repetition, whereas the technique of Patent Document 1 generates unnatural voice in which the mouth seems to open fully for every utterance. In view of these circumstances, an object of the present invention is to synthesize natural voice with a high speaking rate.

The means adopted by the present invention to solve the above problem are described below. To facilitate understanding, the following description notes in parentheses the correspondence between elements of the invention and elements of the embodiment described later; this is not intended to limit the scope of the invention to the embodiment.

The speech synthesizer of the present invention comprises: unit storage means (e.g., storage device 14) that stores a plurality of pieces of speech unit data (e.g., speech unit data DA), each representing a speech unit that includes a plurality of phoneme sections (e.g., phoneme sections S) corresponding to different phonemes; unit selection means (e.g., unit selection part 22) that sequentially selects speech units; phoneme length setting means (e.g., phoneme length setting part 24) that variably sets a synthesis time length (e.g., synthesis time length T) for each phoneme section of the speech unit selected by the unit selection means; and speech synthesis means (e.g., speech synthesis part 26) that generates a voice signal by concatenating the voice waveforms represented by the speech unit data for a target section (e.g., target section W), which is the portion of each phoneme section of the selected speech unit spanning the synthesis time length set by the phoneme length setting means. In this configuration, the voice signal is generated by concatenating the voice waveforms of the target sections, each spanning the synthesis time length set by the phoneme length setting means, within the phoneme sections of each speech unit. More natural voice can therefore be synthesized than with a configuration that stretches or compresses speech units for use in synthesis. Furthermore, because a target section is selected for each phoneme section of a speech unit, the important portion of each phoneme (e.g., the portion a listener relies on to identify the phoneme) can be selected individually for each phoneme section, which is an advantage over a configuration that selects a single section for the speech unit as a whole and allows natural voice to be synthesized.

In a preferred aspect of the present invention, a phoneme section located at the head of a speech unit and corresponding to a phoneme of a first type, which is produced by a momentary deformation of the vocal tract after it has been closed (e.g., phoneme section S1 corresponding to a phoneme of first type C1), includes the process in which the first-type phoneme is sounded (e.g., rear part pB), and a phoneme section located at the tail of a speech unit and corresponding to a first-type phoneme (e.g., phoneme section S2 corresponding to a phoneme of first type C1) includes the preparation process immediately before the first-type phoneme is sounded (e.g., preparation process pA2 of front part pA). When the head phoneme section of a speech unit corresponds to a first-type phoneme, the speech synthesis means selects as the target section the portion of that phoneme section extending from its start point rearward (later in time) over the synthesis time length (e.g., part (A) of FIG. 9); when the tail phoneme section of a speech unit corresponds to a first-type phoneme, it likewise selects the portion extending from the start point rearward over the synthesis time length (e.g., part (C) of FIG. 9). In this aspect, the earlier portion of the process in which the first-type phoneme is sounded is preferentially included in the target section, and the earlier portion of the preparation process of the first-type phoneme (the portion in which the influence of the immediately preceding phoneme is most prominent) is preferentially included in the target section. This has the advantage that the voice signal can be generated while preserving the parts of the first-type phoneme that matter, for example, for a listener to recognize it. A first-type phoneme is typically one whose sound is difficult to sustain over time; for example, plosives and affricates are classified into the first type.

In a preferred aspect of the present invention, a phoneme section located at the head of a speech unit and corresponding to a phoneme of a second type different from the first type (e.g., phoneme section S1 corresponding to a phoneme of second type C2) includes the process in which the second-type phoneme changes into the following phoneme (e.g., rear part qB), and a phoneme section located at the tail of a speech unit and corresponding to a second-type phoneme (e.g., phoneme section S2 corresponding to a phoneme of second type C2) includes the process in which the immediately preceding phoneme changes into the second-type phoneme (e.g., front part qA). When the head phoneme section of a speech unit corresponds to a second-type phoneme, the speech synthesis means selects as the target section the portion of that phoneme section extending from its end point forward (earlier in time) over the synthesis time length; when the tail phoneme section of a speech unit corresponds to a second-type phoneme, it selects the portion extending from the start point rearward over the synthesis time length. In this aspect, the later portion of the process in which the second-type phoneme changes into the following phoneme is preferentially included in the target section, and the earlier portion of the process in which the preceding phoneme changes into the second-type phoneme is preferentially included in the target section. This has the advantage that the voice signal can be generated while preserving the parts of the second-type phoneme that matter, for example, for a listener to recognize the transitions before and after it. A second-type phoneme is typically one whose sound can be sustained. For example, the second type includes vowels, semivowels, and fricatives, which are produced while the shape of the vocal tract is held steady, as well as liquids and nasals, which are produced by momentarily and rapidly deforming the vocal tract from a preparatory state in which it is partially closed while sound is sustained by airflow through part of the oral cavity or the nasal cavity.
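The selection rule in the two preferred aspects above can be summarized in a minimal sketch. The function name, the string labels "C1"/"C2" and "head"/"tail", and the use of seconds are illustrative assumptions rather than anything defined in the patent; the sketch covers only the shortening case, where T does not exceed the recorded length L.

```python
from typing import Tuple


def target_section(kind: str, position: str, length: float, t: float) -> Tuple[float, float]:
    """Return (start, end) of target section W inside one phoneme section.

    kind:     "C1" (first type, e.g. plosive) or "C2" (second type, e.g. vowel)
    position: "head" (section S1 of the unit) or "tail" (section S2)
    length:   recorded duration L of the phoneme section
    t:        synthesis time length T, assumed to satisfy 0 < t <= length
    """
    if kind not in ("C1", "C2") or position not in ("head", "tail"):
        raise ValueError("unknown phoneme type or position")
    if not 0 < t <= length:
        raise ValueError("this sketch only covers shortening (0 < T <= L)")
    if kind == "C2" and position == "head":
        # Second type at the head of a unit: keep the last T, ending at the end point.
        return (length - t, length)
    # First type (head or tail) and second type at the tail: keep the first T.
    return (0.0, t)
```

Only a second-type phoneme at the head of a unit is anchored at its end point; the other three cases are anchored at the start point.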

The speech synthesizer according to each of the above aspects may be implemented by hardware (electronic circuitry) dedicated to speech synthesis, such as a DSP (Digital Signal Processor), or by cooperation between a general-purpose arithmetic processing unit, such as a CPU (Central Processing Unit), and a program. The program of the present invention (e.g., program PGM) causes a computer equipped with unit storage means, which stores a plurality of pieces of speech unit data each representing a speech unit that includes a plurality of phoneme sections corresponding to different phonemes, to execute: a unit selection process of sequentially selecting speech units; a phoneme length setting process of variably setting a synthesis time length for each phoneme section of the speech unit selected in the unit selection process; and a speech synthesis process of generating a voice signal by concatenating the voice waveforms represented by the speech unit data for a target section, which is the portion of each phoneme section of the selected speech unit spanning the synthesis time length set in the phoneme length setting process. This program provides the same operation and effects as the speech synthesizer of the present invention. The program may be provided to users stored on a computer-readable recording medium and installed on a computer, or distributed from a server device over a communication network and installed on a computer.

FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram of the unit group stored in the storage device.
FIG. 3 is a schematic diagram of a speech unit.
FIG. 4 is a table showing the relationship between phoneme classifications and phoneme types.
FIG. 5 is an explanatory diagram of first-type phonemes.
FIG. 6 is an explanatory diagram of second-type phonemes.
FIG. 7 is an explanatory diagram of speech unit selection and synthesis time length setting.
FIG. 8 is a flowchart of the process of extracting the unit data of a target section.
FIG. 9 is an explanatory diagram of the operation of selecting a target section.
FIG. 10 is an explanatory diagram of a specific example of the operation of concatenating speech units.

FIG. 1 is a block diagram of a speech synthesizer 100 according to one embodiment of the present invention. The speech synthesizer 100 is a signal processing device that generates voice, such as speech or singing, by unit-concatenation speech synthesis. As shown in FIG. 1, it is implemented by a computer system comprising an arithmetic processing unit 12, a storage device 14, and a sound emitting device 16.

By executing a program PGM stored in the storage device 14, the arithmetic processing unit 12 (CPU) implements a plurality of functions (unit selection part 22, phoneme length setting part 24, speech synthesis part 26) for generating a voice signal VOUT that represents the waveform of the synthesized sound. A configuration in which the functions of the arithmetic processing unit 12 are distributed over a plurality of integrated circuits, or in which a dedicated electronic circuit (DSP) implements some of the functions, may also be adopted. The sound emitting device 16 (e.g., headphones or a loudspeaker) emits sound waves corresponding to the voice signal VOUT generated by the arithmetic processing unit 12. The storage device 14 stores the program PGM executed by the arithmetic processing unit 12 and various data used by it (unit group GA, synthesis information GB). A known recording medium such as a semiconductor or magnetic recording medium, or a combination of several types of recording media, is used as the storage device 14.

As shown in FIG. 2, the unit group GA stored in the storage device 14 is a set (speech synthesis library) of speech unit data DA and phoneme classification data DB corresponding to each of a plurality of speech units V. As shown in FIG. 3, one speech unit V is a diphone (phoneme chain) formed by joining two phoneme sections S (S1, S2) corresponding to different phonemes. Phoneme section S1 is a section of time length L1 that includes the start point of the speech unit V, and phoneme section S2 is a section of time length L2 that includes the end point of the speech unit V. Phoneme section S2 follows phoneme section S1. The time length L1 of phoneme section S1 and the time length L2 of phoneme section S2 are determined individually for each speech unit V. For convenience, silence is treated as one phoneme in the following description.

The speech unit data DA of FIG. 2 specifies the time waveform of a speech unit V. As shown in FIG. 2, the speech unit data DA for one speech unit V consists of a time series of unit data U, one for each frame obtained by dividing the speech unit V (phoneme section S1 and phoneme section S2) along the time axis. Each piece of unit data U defines the spectrum of the voice within one frame. For example, EpR (Excitation plus Resonance) parameters, which include a plurality of variables characterizing the shape of the voice spectrum (excitation waveform envelope, chest resonance, vocal tract resonance, difference spectrum), are suitable as the unit data U. EpR parameters are disclosed, for example, in Japanese Patent No. 3711880. Spectrum data indicating the intensity at each frequency (i.e., the spectrum itself) may also be used as the unit data U. As shown in FIG. 2, the time series of unit data U corresponding to phoneme section S1 of a speech unit V is denoted phoneme section data Q1, and the time series of unit data U corresponding to phoneme section S2 is denoted phoneme section data Q2.
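The layout of one speech unit described above might be modeled roughly as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, and each frame is reduced to a plain list of spectral magnitudes rather than the EpR parameters the text prefers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitData:
    """One frame U; here just spectral magnitudes (an assumption, not EpR)."""
    spectrum: List[float]


@dataclass
class PhonemeSection:
    """One phoneme section S with its phoneme section data Q."""
    phoneme: str
    frames: List[UnitData]


@dataclass
class SpeechUnit:
    """A diphone V: phoneme section S1 followed by phoneme section S2."""
    s1: PhonemeSection
    s2: PhonemeSection

    def frames(self) -> List[UnitData]:
        """Speech unit data DA as a time series: Q1 followed by Q2."""
        return self.s1.frames + self.s2.frames
```

Keeping Q1 and Q2 as separate sequences makes it straightforward to pick a target section per phoneme section rather than per unit.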

The phoneme classification data DB of FIG. 2 specifies the classification of the phoneme corresponding to each of phoneme section S1 and phoneme section S2 of a speech unit V. The phoneme classification indicated by the phoneme classification data DB is a classification by manner of articulation. For example, for Japanese phonemes, the phoneme classification data DB specifies classifications such as vowels (/a/, /i/, /u/), semivowels (/w/), fricatives (/s/, /f/), liquids (/r/), nasals (/m/, /n/), plosives (/t/, /k/, /p/), and affricates (/ts/), as shown in FIG. 4. The language of the voice represented by the speech units V is not, however, limited to Japanese.

As shown in FIG. 4, the phonemes belonging to each phoneme classification are divided into a first type C1 and a second type C2 according to their manner of articulation. A phoneme of the first type C1 is one whose sound is difficult to sustain over time; typically it is produced by momentarily and rapidly deforming the vocal tract from an initial preparatory state in which the vocal tract is completely closed. Specifically, of the phoneme classifications indicated by the phoneme classification data DB, plosives, affricates, and the like are assigned to the first type C1. A phoneme of the second type C2, on the other hand, is one whose sound can be sustained. Specifically, the second type C2 includes vowels, semivowels, and fricatives, which are produced while the shape of the vocal tract is held steady, as well as liquids and nasals, which are produced by momentarily and rapidly deforming the vocal tract from a preparatory state in which it is partially closed while sound is sustained by airflow through part of the oral cavity or the nasal cavity.
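The mapping from phoneme classification to phoneme type can be written as a small lookup table. The dictionary keys and the function name are illustrative assumptions; only the grouping itself comes from the description.

```python
# Phoneme classification (manner of articulation) -> phoneme type.
# Key spellings are hypothetical; the grouping follows the description.
PHONEME_TYPE = {
    "plosive": "C1",
    "affricate": "C1",
    "vowel": "C2",
    "semivowel": "C2",
    "fricative": "C2",
    "liquid": "C2",
    "nasal": "C2",
}


def phoneme_type(classification: str) -> str:
    """Return "C1" (hard to sustain) or "C2" (sustainable)."""
    try:
        return PHONEME_TYPE[classification]
    except KeyError:
        raise ValueError(f"unknown phoneme classification: {classification}")
```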

Each of phoneme section S1 and phoneme section S2 of one speech unit V corresponds to a phoneme of either the first type C1 or the second type C2. For example, part (A) of FIG. 5 illustrates a speech unit V whose phoneme section S1 corresponds to a phoneme of the first type C1 (plosive /t/), and part (B) of FIG. 5 illustrates a speech unit V whose phoneme section S2 corresponds to a phoneme of the first type C1. The phoneme of phoneme section S2 in part (A) of FIG. 5 and the phoneme of phoneme section S1 in part (B) of FIG. 5 are arbitrary. Similarly, part (A) of FIG. 6 illustrates a speech unit V whose phoneme section S1 corresponds to a phoneme of the second type C2 (vowel /a/), and part (B) of FIG. 6 illustrates a speech unit V whose phoneme section S2 corresponds to a phoneme of the second type C2. The phoneme of phoneme section S2 in part (A) of FIG. 6 and the phoneme of phoneme section S1 in part (B) of FIG. 6 are arbitrary. FIGS. 5 and 6 show one example waveform for each phoneme of the speech unit V for convenience; the actual waveform of each phoneme varies widely depending on the neighboring phonemes.

As shown in part (C) of FIG. 5, a phoneme of the first type C1 (e.g., plosive /t/) is divided on the time axis into a front part pA and a rear part pB. The front part pA includes a decay process pA1, in which another phoneme sounded immediately before fades out, and a preparation process pA2 immediately before the phoneme (plosive /t/) is actually sounded. The preparation process pA2 is the process of bringing the vocal tract into a state in which the phoneme can be sounded (e.g., closing or constricting the vocal tract with the tongue). In some cases there is no decay process pA1. The rear part pB, on the other hand, is the process in which the phoneme is actually sounded by momentarily and rapidly deforming the vocal tract from the state reached in the preparation process pA2 (e.g., releasing at once the air compressed upstream in the vocal tract during the preparation process pA2).

As shown in part (A) of FIG. 5, a phoneme section S1 of a speech unit V that corresponds to a phoneme of the first type C1 includes the rear part pB of that phoneme. On the other hand, as shown in part (B) of FIG. 5, a phoneme section S2 of a speech unit V that corresponds to a phoneme of the first type C1 includes the front part pA of that phoneme. That is, the phoneme of the first type C1 (plosive /t/) is reproduced by placing phoneme section S1 of part (A) of FIG. 5 after phoneme section S2 at the tail of the speech unit V illustrated in part (B) of FIG. 5.

As shown in part (C) of FIG. 6, a phoneme of the second type C2 (e.g., vowel /a/) includes a front part qA and a rear part qB. The front part qA is the process of changing from the immediately preceding phoneme into this phoneme (e.g., opening the mouth), and the rear part qB is the process in which this phoneme changes into the following phoneme (e.g., closing the mouth). For second-type phonemes such as the liquid /r/ and the nasal /m/, the process of momentarily and rapidly deforming the vocal tract from a preparatory state in which it is partially closed (e.g., flapping the tip of the tongue against the palate) is included at the start of the front part qA.

As shown in part (A) of FIG. 6, a phoneme section S1 of a speech unit V that corresponds to a phoneme of the second type C2 includes the rear part qB of that phoneme. On the other hand, as shown in part (B) of FIG. 6, a phoneme section S2 of a speech unit V that corresponds to a phoneme of the second type C2 includes the front part qA of that phoneme. Each speech unit V is extracted from the voice of a particular speaker so as to satisfy these conditions, each phoneme section S is delimited, and the speech unit data DA (phoneme section data Q1 and phoneme section data Q2) is then created for each speech unit V.

As shown in FIG. 1, the storage device 14 stores synthesis information (score data) GB that specifies the synthesized sound as a time series. The synthesis information GB specifies, for example for each note in time order, the sounding character X1, the sounding period X2, and the pitch X3 of the synthesized sound. The sounding character X1 is, for example, the character string of the lyrics when singing is synthesized, and the sounding period X2 is specified, for example, by the start time and the duration of the sound. The synthesis information GB is generated, for example, in response to user instructions given through various input devices and is stored in the storage device 14. Synthesis information GB received from another communication terminal over a communication network, or transferred from a portable recording medium, may also be used to generate the voice signal VOUT.

The unit selection part 22 of FIG. 1 sequentially selects, from the unit group GA, the speech units V corresponding to each sounding character X1 that the synthesis information GB specifies in time order. For example, as shown in FIG. 7, when the sounding character X1 "go straight" is specified, the unit selection part 22 selects the speech units V [Sil-gh], [gh-@U], [@U-s], [s-t], [t-r], [r-eI], [eI-t], and [t-Sil]. The phoneme symbols conform to SAMPA (Speech Assessment Methods Phonetic Alphabet), and the symbol "Sil" denotes silence.
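The way a phoneme sequence maps onto a chain of diphones can be sketched as below. The function name is hypothetical, and padding the sequence with silence at both ends is an assumption drawn from the "go straight" example; conversion from text to phonemes is outside the sketch.

```python
from typing import List


def select_units(phonemes: List[str]) -> List[str]:
    """Return diphone names for a phoneme sequence, padded with silence.

    Each diphone joins two adjacent phonemes, so N phonemes surrounded by
    silence yield N + 1 speech units.
    """
    padded = ["Sil", *phonemes, "Sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Phoneme symbols copied from the "go straight" example in the text.
    print(select_units(["gh", "@U", "s", "t", "r", "eI", "t"]))
```

Every phoneme appears twice in the chain, once as section S2 of one unit and once as section S1 of the next, which is why a target section has to be chosen separately for each phoneme section.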

The phoneme length setting part 24 of FIG. 1 variably sets, for each phoneme section S (S1, S2) of the speech units V sequentially selected by the unit selection part 22, the time length T that the section occupies when used to synthesize the voice signal VOUT (hereinafter the "synthesis time length"). The synthesis time length T of each phoneme section S is chosen according to the sounding period X2 that the synthesis information GB specifies in time order. Specifically, as shown in FIG. 7, the phoneme length setting part 24 sets the synthesis time length T of each phoneme section S (T(Sil), T(gh), T(@U), ...) so that the start point of the main vowel phoneme of the sounding character X1 (the phoneme shown in italics in FIG. 7) coincides with the start point of the sounding period X2 of that sounding character X1, and so that consecutive phoneme sections S are arranged on the time axis without gaps.
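The two timing constraints stated above (vowel onset aligned with the start of the sounding period, no gaps between sections) can be illustrated with a small sketch. The description does not say how the individual lengths T are chosen, so this sketch assumes they are already given and only computes where each section starts; the function and parameter names are hypothetical.

```python
from typing import List


def layout_sections(lengths: List[float], vowel_index: int, note_start: float) -> List[float]:
    """Return the start time of each phoneme section.

    lengths:     synthesis time lengths T in playback order (assumed given)
    vowel_index: index of the section where the main vowel begins
    note_start:  start of the sounding period X2
    """
    if not 0 <= vowel_index < len(lengths):
        raise ValueError("vowel_index out of range")
    # Sections before the vowel are placed ahead of the note start.
    t = note_start - sum(lengths[:vowel_index])
    starts = []
    for length in lengths:
        starts.append(t)
        t += length  # contiguous: the next section starts where this one ends
    return starts
```

With this layout the consonants preceding the vowel are sounded slightly before the nominal note start, which is what the vowel-onset alignment implies.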

The speech synthesis unit 26 in FIG. 1 generates the audio signal VOUT by connecting the speech segments V sequentially selected by the segment selection unit 22. Specifically, from the phoneme section data Q (Q1, Q2) of each phoneme section S (S1, S2) of a selected speech segment V, the speech synthesis unit 26 generates a time series of unit data U spanning the synthesis time length T that the phoneme length setting unit 24 set for that phoneme section S. It then converts the spectrum indicated by each unit data U into a time-domain waveform, connects the waveforms to one another, and adjusts them to the pitch X3 of the synthesis information GB, thereby generating the audio signal VOUT.

For example, when the synthesis time length T set for a phoneme section S is longer than the initial time length L (L1, L2) of that phoneme section S (that is, when the pronunciation speed is lowered relative to the speed at which the speech segment V was recorded), the phoneme section data Q corresponding to that phoneme section S is stretched to the synthesis time length T and then applied to the generation of the audio signal VOUT. Any known method may be used to stretch the phoneme section data Q, for example interpolating the unit data U at each time point within the synthesis time length T from the surrounding unit data U.
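One possible form of such interpolation is sketched below. It is an assumption for illustration only: the text does not fix the interpolation method, each unit data U is modeled here as a plain list of spectral values, time lengths are counted in frames, and the name `stretch_frames` is hypothetical.

```python
def stretch_frames(frames, target_len):
    """Resample a sequence of frames to target_len frames.

    Each output frame is linearly interpolated from the two nearest input
    frames. frames is a non-empty list of equal-length lists of floats.
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and one output frame")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo                          # weight of the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Stretching two one-dimensional frames to three inserts their midpoint.
print(stretch_frames([[0.0], [2.0]], 3))
```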

On the other hand, when the synthesis time length T set for a phoneme section S is shorter than the initial time length L (L1, L2) of that phoneme section S (that is, when the pronunciation speed is raised relative to the speed at which the speech segment V was recorded), only a section W of each phoneme section S (S1, S2) of the selected speech segment V is used (hereinafter the "target section"). The target section W has the synthesis time length T that the phoneme length setting unit 24 set for that phoneme section S, and the audio signal VOUT is generated by connecting the speech that the speech segment data DA indicates for the target sections W. Specifically, the speech synthesis unit 26 extracts, from the phoneme section data Q (Q1, Q2) of each phoneme section S, the time series of unit data U within the target section W spanning the synthesis time length T, and generates the audio signal VOUT by connecting the time-domain waveforms specified from the unit data U. In other words, the time series of unit data U within the target section W is extracted from the phoneme section data Q without any change to its content or order and is used to generate the audio signal VOUT.

FIG. 8 is a flowchart of the operation by which the speech synthesis unit 26 extracts the unit data U within the target section W from the phoneme section data Q when the pronunciation speed is raised. The processing of FIG. 8 is executed sequentially for each phoneme section S whose synthesis time length T, as set by the phoneme length setting unit 24, is shorter than its initial time length L (L1, L2).

When the processing of FIG. 8 starts, the speech synthesis unit 26 determines whether the one phoneme section S to be processed (hereinafter the "phoneme section of interest") is the leading phoneme section S1 of the speech segment V (SA1). If the result of SA1 is affirmative, the speech synthesis unit 26 determines whether the phoneme of the phoneme section of interest S belongs to the first type C1 (SA2). Specifically, the speech synthesis unit 26 makes the determination of SA2 according to whether the phoneme classification designated by the phoneme classification data DB for the phoneme section of interest S is one of the predetermined classifications belonging to the first type C1 (plosives, affricates, and the like).

As described with reference to part (A) of FIG. 5, a phoneme section S1 of a speech segment V that corresponds to a phoneme of the first type C1 contains the rear part pB of that phoneme. Within the rear part pB of a first-type C1 phoneme, the front portion, which includes the point at which the phoneme is actually sounded, is where the features of the phoneme are most clearly perceived by the listener (that is, the portion that matters for the listener to identify the phoneme). Therefore, when the phoneme section of interest S is the leading phoneme section S1 of the speech segment V (SA1: YES) and corresponds to a phoneme of the first type C1 (SA2: YES), the speech synthesis unit 26 preferentially selects the front portion of the phoneme section of interest S (rear part pB) as the target section W, as shown in part (A) of FIG. 9 (SA3). Specifically, the section of synthesis time length T beginning at the start point of the phoneme section of interest S is selected as the target section W.

Likewise, as described with reference to part (A) of FIG. 6, a phoneme section S1 of a speech segment V that corresponds to a phoneme of the second type C2 contains the rear part qB of that phoneme. Within the rear part qB of a second-type C2 phoneme, the rear portion, where the influence of the immediately following phoneme becomes pronounced, is especially important for the listener to perceive the phoneme transition. Therefore, when the phoneme section of interest S is the leading phoneme section S1 of the speech segment V (SA1: YES) and corresponds to a phoneme of the second type C2 (SA2: NO), the speech synthesis unit 26 preferentially selects the rear portion of the phoneme section of interest S (rear part qB) as the target section W, as shown in part (B) of FIG. 9 (SA4). Specifically, the section of synthesis time length T ending at the end point of the phoneme section of interest S is selected as the target section W.

On the other hand, when the phoneme section of interest S is the trailing phoneme section S2 of the speech segment V (SA1: NO), the speech synthesis unit 26 preferentially selects the front portion of the phoneme section of interest S as the target section W regardless of its phoneme type (C1, C2), as detailed below (SA3).

As described with reference to part (B) of FIG. 5, a phoneme section S2 of a speech segment V that corresponds to a phoneme of the first type C1 contains the front part pA of that phoneme. The preparation process pA2, located at the rear of the front part pA of a first-type C1 phoneme, is mostly silent and has almost no effect on the listener's identification of the phoneme. Therefore, when the phoneme section of interest S is the trailing phoneme section S2 of the speech segment V (SA1: NO) and corresponds to a phoneme of the first type C1, the speech synthesis unit 26 selects as the target section W, as shown in part (C) of FIG. 9, the section of synthesis time length T beginning at the start point of the phoneme section of interest S (front part pA) (SA3). That is, the reverberation process pA1 of the first-type C1 phoneme, in which the influence of the immediately preceding phoneme is pronounced, is preferentially included in the target section W.

Likewise, as described with reference to part (B) of FIG. 6, a phoneme section S2 of a speech segment V that corresponds to a phoneme of the second type C2 contains the front part qA of that phoneme. For second-type C2 phonemes such as vowels, semivowels, and fricatives, the front portion of the front part qA, where the influence of the immediately preceding phoneme is pronounced, is especially important for auditorily identifying the phoneme transition. For second-type C2 phonemes such as liquids and nasals, the front portion of the front part qA, which includes the process in which the vocal tract deforms from its preparatory state, is especially important for auditory identification. In view of these tendencies, when the phoneme section of interest S is the trailing phoneme section S2 of the speech segment V (SA1: NO) and corresponds to a phoneme of the second type C2, the speech synthesis unit 26 selects as the target section W, as shown in part (D) of FIG. 9, the section of synthesis time length T beginning at the start point of the phoneme section of interest S (front part qA) (SA3).

Once the target section W of the phoneme section of interest S has been selected by the above procedure, the speech synthesis unit 26 extracts the time series of unit data U within the target section W from the phoneme section data Q (Q1, Q2) of the phoneme section of interest S (SA5). As described above, each unit data U extracted in SA5 is applied to the generation of the audio signal VOUT. The unit data U outside the target section W, by contrast, is discarded without being used to generate the audio signal VOUT.
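The decision flow SA1 to SA5 can be summarized in a short sketch. This is an illustration under stated assumptions rather than the implementation of the speech synthesis unit 26: time lengths are counted in frames of unit data U, the set `FIRST_TYPE` lists only the two classifications the text names (it says "and the like"), and the function names are hypothetical.

```python
# Classifications treated as first type C1. Only the two named in the text
# are listed; the actual set is left open by the description.
FIRST_TYPE = {"plosive", "affricate"}


def target_section(is_leading, phoneme_class, n_frames, t_frames):
    """Return (start, end) frame indices of the target section W.

    is_leading    -- True for the leading section S1, False for S2 (SA1)
    phoneme_class -- classification label used for the C1/C2 test (SA2)
    n_frames      -- initial length L of the section, in frames
    t_frames      -- synthesis time length T, in frames
    """
    if not 0 < t_frames <= n_frames:
        raise ValueError("T must be positive and no longer than L")
    if is_leading and phoneme_class not in FIRST_TYPE:
        # SA1: YES, SA2: NO -> keep the tail of the section (SA4)
        return n_frames - t_frames, n_frames
    # Leading C1 section, or any trailing section -> keep the head (SA3)
    return 0, t_frames


def extract(frames, is_leading, phoneme_class, t_frames):
    """Extract the unit data inside W, unchanged in content and order (SA5)."""
    start, end = target_section(is_leading, phoneme_class, len(frames), t_frames)
    return frames[start:end]


# Leading vowel section of 10 frames shortened to 3: the last 3 frames survive.
print(extract(list(range(10)), True, "vowel", 3))
```

Only the leading section of a second-type phoneme keeps its tail; every other combination keeps its head, which is why a single branch suffices.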

FIG. 10 illustrates how the audio signal VOUT is generated from the unit data U extracted, by the processing of FIG. 9, for the target section W of each phoneme section S. FIG. 10 shows the case where the segment selection unit 22 has selected the three speech segments V ([s-a], [a-k], [k-a]) corresponding to the pronunciation character X1 "saka".

As shown in FIG. 10, for the trailing phoneme section S2 of the first speech segment V [s-a], which corresponds to the second-type C2 phoneme /a/ (SA1: NO), a target section W of synthesis time length T(a1) that includes the start point of that phoneme section S2 is selected (SA3). For the leading phoneme section S1 of the second speech segment V [a-k], which corresponds to the second-type C2 phoneme /a/ (SA1: YES, SA2: NO), a target section W of synthesis time length T(a2) that includes the end point of that phoneme section S1 is selected (SA4). For the trailing phoneme section S2 of the speech segment V [a-k], which corresponds to the first-type C1 phoneme /k/ (SA1: NO), a target section W of synthesis time length T(k1) that includes the start point of that phoneme section S2 is selected (SA3). For the leading phoneme section S1 of the third speech segment V [k-a], which corresponds to the first-type C1 phoneme /k/ (SA1: YES, SA2: YES), a target section W of synthesis time length T(k2) that includes the start point of that phoneme section S1 is selected (SA3). The unit data U within the target sections W selected in this way are connected directly to one another on the time axis to generate the audio signal VOUT.

As described above, in this embodiment the unit data within the target section W is extracted from the phoneme section data Q of each phoneme section S and used to generate the audio signal VOUT, while the unit data outside the target section W is discarded without being used. Because the content and order of the unit data U used to generate the audio signal VOUT are the same as those of the unit data U in the source phoneme section data Q, this embodiment can synthesize natural-sounding speech even when the synthesis information GB designates a high pronunciation speed. Specifically, it reduces the likelihood of generating speech that sounds as if it were uttered faster than a human can actually move the mouth, or speech that sounds as if the mouth were opened fully for every utterance despite the high pronunciation speed.

In addition, this embodiment selects a target section W for each phoneme section S of the speech segment V. Compared with, for example, a configuration that applies to synthesis a section of predetermined length beginning at the start point of the whole speech segment V, or one that applies a section of predetermined length ending at the end point of the whole speech segment V, this has the advantage that the important portion of each phoneme (for example, the portion that matters for the listener to identify the phoneme) can be selected individually for each phoneme section S, so that natural speech can be synthesized.

For example, in this embodiment, for a phoneme section S1 (rear part pB) that is located at the head of a speech segment V and corresponds to a phoneme of the first type C1, the head-side portion, which includes the point at which the phoneme is actually sounded, is selected as the target section W, as in part (A) of FIG. 9. For a phoneme section S2 (front part pA) that is located at the tail of a speech segment V and corresponds to a phoneme of the first type C1, the target section W is selected by partially removing the preparation process pA2, as in part (C) of FIG. 9. This has the advantage that each speech segment V can be shortened while preserving the portions of a first-type C1 phoneme that matter for the listener to recognize it.

The same applies to phonemes of the second type C2: for a phoneme section S1 (rear part qB), the tail-side portion is selected as the target section W, as in part (B) of FIG. 9, and for a phoneme section S2 (front part qA), the head-side portion is selected as the target section W, as in part (D) of FIG. 9. This has the advantage that each speech segment V can be shortened while preserving the portions of a second-type C2 phoneme that matter for the listener to recognize it.

<Modifications>
The embodiments above can be modified in various ways. Specific modifications are illustrated below. Two or more modifications chosen arbitrarily from the following examples may be combined as appropriate.

(1) In the embodiment above, the position (head side or tail side) of the target section W within the phoneme section of interest S is determined from the results of determining whether the phoneme section of interest S is a phoneme section S1 (SA1) and whether it corresponds to a phoneme of the first type C1 (SA2). Alternatively, information indicating whether the head-side portion or the tail-side portion of a phoneme section S should be selected as the target section W may be added to the speech segment data DA, and the position (head side or tail side) of the target section W may be determined from that information.
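Modification (1) might be realized along the following lines. This is a sketch under assumptions: the text does not specify how the information is encoded, so the `keep` field, its values "head" and "tail", and the names `PhonemeSection` and `extract_by_flag` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class PhonemeSection:
    """One phoneme section S with its unit data and a stored position flag."""
    frames: List            # time series of unit data U
    keep: Literal["head", "tail"]  # hypothetical flag stored with the data DA


def extract_by_flag(section, t_frames):
    """Extract the target section W using the stored flag instead of SA1/SA2."""
    if not 0 < t_frames <= len(section.frames):
        raise ValueError("T must be positive and no longer than L")
    if section.keep == "head":
        return section.frames[:t_frames]
    return section.frames[-t_frames:]


# A section flagged "tail" keeps its last frames when shortened.
print(extract_by_flag(PhonemeSection([1, 2, 3, 4], "tail"), 2))
```

Storing the flag moves the classification work to the time the segment data is prepared, so the synthesis step needs no phoneme-type lookup.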

(2) The format of the speech segment data DA is arbitrary. For example, the embodiment above uses as the speech segment data DA a time series of unit data U indicating the spectrum of each frame of the speech segment V, but a sequence of time-domain samples of the speech segment V may be used instead. In that case, the samples of the speech segment data DA within the target section W are applied to the generation of the audio signal VOUT, and the remaining samples are discarded.

(3) The method of setting the synthesis time length T of each phoneme section S may be changed as appropriate. For example, the stretch ratio applied in setting the synthesis time length T (the ratio relative to the time length L of each phoneme section S before stretching or compression) may differ between vowel phonemes and consonant phonemes. For example, the stretch ratio of vowel phonemes is set to a higher value than that of consonant phonemes. A configuration in which the stretch ratio differs between the phoneme section S2 corresponding to the front part pA and the phoneme section S1 corresponding to the rear part pB of a first-type C1 phoneme is also suitable. Specifically, the stretch ratio of the phoneme section S2 corresponding to the front part pA is set to a higher value (one that produces greater stretching or compression) than the stretch ratio of the phoneme section S1 corresponding to the rear part pB.

(4) Although the description above uses diphones as examples, the number of phonemes (phoneme sections S) making up a speech segment is arbitrary. For example, in a configuration that uses triphones containing three phoneme sections S as speech segments, a target section W can be selected, as in the embodiment above, for each of the three phoneme sections S of the speech segment selected by the segment selection unit 22, at a position that depends on the phoneme type (C1/C2). When two diphones are joined to form one triphone (for example, when the two diphones [a-s] and [s-e] are joined to form the triphone [a-s-e]), the triphone contains four phoneme sections S in total: the two phoneme sections S of the front diphone and the two phoneme sections S of the rear diphone.

DESCRIPTION OF SYMBOLS: 100 ... speech synthesis device, 12 ... arithmetic processing device, 14 ... storage device, 16 ... sound emitting device, 22 ... segment selection unit, 24 ... phoneme length setting unit, 26 ... speech synthesis unit.

Claims

A speech synthesis device comprising:
segment storage means for storing a plurality of speech segment data each indicating a speech segment that includes a plurality of phoneme sections corresponding to mutually different phonemes;
segment selection means for sequentially selecting speech segments;
phoneme length setting means for variably setting a synthesis time length for each phoneme section of the speech segment selected by the segment selection means; and
speech synthesis means for generating an audio signal by connecting to one another the speech waveforms that the speech segment data indicates for a target section, within each phoneme section of the speech segment selected by the segment selection means, having the synthesis time length set by the phoneme length setting means.

The speech synthesis device of claim 1, wherein a phoneme section that is located at the head of a speech segment and corresponds to a phoneme of a first type, which is sounded by a temporary deformation of the vocal tract after its closure, includes the process in which the phoneme of the first type is sounded, and a phoneme section that is located at the tail of a speech segment and corresponds to a phoneme of the first type includes the preparation process immediately before the phoneme of the first type is sounded, and
wherein the speech synthesis means, when the leading phoneme section of a speech segment corresponds to a phoneme of the first type, selects as the target section the section of that phoneme section spanning the synthesis time length from its start point, and, when the trailing phoneme section of a speech segment corresponds to a phoneme of the first type, selects as the target section the section of that phoneme section spanning the synthesis time length from its start point.

The speech synthesis device of claim 1 or claim 2, wherein a phoneme section that is located at the head of a speech segment and corresponds to a phoneme of a second type, different from the first type, includes the process in which the phoneme of the second type changes into the following phoneme, and a phoneme section that is located at the tail of a speech segment and corresponds to a phoneme of the second type includes the process in which the immediately preceding phoneme changes into the phoneme of the second type, and
wherein the speech synthesis means, when the leading phoneme section of a speech segment corresponds to a phoneme of the second type, selects as the target section the section of that phoneme section spanning the synthesis time length up to its end point, and, when the trailing phoneme section of a speech segment corresponds to a phoneme of the second type, selects as the target section the section of that phoneme section spanning the synthesis time length from its start point.