JPH11184497A

JPH11184497A - Voice analyzing method, voice synthesizing method, and medium

Info

Publication number: JPH11184497A
Application number: JP10093591A
Authority: JP
Inventors: Takahiro Kamai; 孝浩釜井; Kenji Matsui; 謙二松井
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1997-04-09
Filing date: 1998-04-06
Publication date: 1999-07-09
Anticipated expiration: 2018-04-06
Also published as: JP3576800B2

Abstract

PROBLEM TO BE SOLVED: To assign proper pitch marks to a voice waveform so that a synthesized voice free of roughness is obtained and the pitch can be made to conform precisely to the pitch marks of a recorded message. SOLUTION: Fixed low-pass filters 3002-a to 3002-d are set so that one of them passes only the fundamental wave of the voice, and a channel selection part 3004 successively selects among the peak detection results of peak detection parts 3003-a to 3003-d so that peak information on the fundamental wave is extracted at all times. The channel selection part 3004 judges which channel is the correct one from the intervals of the peaks detected by the peak detection parts 3003-a to 3003-d. On the basis of this peak information, the pitch of the original voice is analyzed, an adaptive low-pass filter 3005 passes only the fundamental wave of the voice, and a peak detection part 3006 detects the peaks of the fundamental wave to assign pitch marks.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to methods for detailed analysis of speech pitch, power, and the like, for high-quality speech synthesis and high-efficiency compression coding of speech using the results of that analysis, and to a medium.

[0002]

[Description of the Related Art] The purpose of a speech synthesis system is to synthesize arbitrary content as a speech waveform, and various methods have been devised for that purpose. A representative one is the waveform editing synthesis method, in which speech waveforms are stored in small units (called speech segments) and appropriate segments are selected and concatenated according to the target content.

[0003] In such a speech synthesis method, the pitch and duration of the speech segments are modified to reduce the discontinuity and unnaturalness caused by concatenating them, so that smooth speech is synthesized. A known technique for modifying pitch and duration is the PSOLA (Pitch Synchronous Overlap Add) method (F. Charpentier, M. Stella, "Diphone synthesis using an over-lapped technique for speech waveforms concatenation", Proc. ICASSP, pp. 2015-2018, Tokyo, 1986). In this method, pitch marks are assigned in advance to the local peak positions or glottal closure points of the speech segment waveform, and pitch waveforms are cut out with a window function centered on those positions and then used for synthesis. As described above, the pitch marks required by this speech synthesis method can be assigned either to local peaks on the time waveform or to glottal closure points.
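For illustration only, the windowed cut-out and overlap-add step described above can be sketched as follows. This is a minimal sketch, not the cited PSOLA implementation: the Hann window, the function names `cut_pitch_waveforms` and `overlap_add`, and the fixed `half_width` parameter are all assumptions introduced here.

```python
import math

def hann(length):
    """Symmetric Hann window (length >= 2); an assumed choice of window function."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1)) for i in range(length)]

def cut_pitch_waveforms(signal, pitch_marks, half_width):
    """Cut one windowed pitch waveform centred on each pitch mark."""
    window = hann(2 * half_width + 1)
    pieces = []
    for mark in pitch_marks:
        piece = []
        for k in range(-half_width, half_width + 1):
            idx = mark + k
            # Samples outside the signal are treated as zero.
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            piece.append(sample * window[k + half_width])
        pieces.append(piece)
    return pieces

def overlap_add(pieces, new_marks, half_width, out_length):
    """Re-centre each pitch waveform on a new mark and sum the overlaps."""
    out = [0.0] * out_length
    for piece, mark in zip(pieces, new_marks):
        for k, value in enumerate(piece):
            idx = mark - half_width + k
            if 0 <= idx < out_length:
                out[idx] += value
    return out
```

Placing the pitch waveforms at marks spaced more closely or more widely than the original ones changes the pitch of the output, which is why the accuracy of the marks matters for the quality of the synthesized speech.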

[0004] An example of assigning marks to local peaks on the time waveform is Kawai et al., "Creation of a waveform segment database for a waveform-segment-concatenation speech synthesis system", Proceedings of the Acoustical Society of Japan, 3-5-5, 1994-11. This method has the advantage of simplicity, but for complex speech waveforms rich in high-frequency components it is difficult to assign exactly one pitch mark per pitch period, and the peak itself jitters in phase because of the high-frequency components. As a result, the synthesized waveform also jitters back and forth from one pitch period to the next, which produces speech that sounds muddy.

[0005] Methods that instead use the glottal closure points of the speech waveform as pitch marks include Sakamoto et al., "Japanese text-to-speech synthesis system using the waveform superposition method", IEICE Technical Report, SP95-6, 1995-05, and Arai et al., "Examination of pitch waveform extraction positions using a speech signal model", Proceedings of the Acoustical Society of Japan, 1-4-22, 1995-3. In these methods, the glottal closure timing is estimated by analyzing the speech waveform with a wavelet transform or linear predictive analysis, and that position is used as the pitch mark. Methods based on extracting the glottal closure point have the advantage that exactly one pitch mark can be assigned correctly per pitch period. From the standpoint of waveform extraction, they also amount to cutting out the response waveform to the glottal closure pulse, so a good pitch waveform with little spectral distortion can be cut out. However, the analysis needed to estimate the glottal closure point is complex.

[0006] Separately from these, there is a technique that extracts the fundamental wave of speech by filtering it with an FIR linear-phase band-pass filter whose pass band is set adaptively near the pitch frequency of the speech, and that uses the zero-cross positions of the fundamental wave to segment the speech waveform into individual pitch periods. An example is Omura et al., "Extraction of fine pitch patterns by the fundamental wave filtering method", Journal of the Acoustical Society of Japan, Vol. 51, No. 7, pp. 509-518, 1995. Although its aim is fine pitch analysis, it is a method in which pitch periods are obtained in synchronization with the fundamental wave.

[0007] However, the segmentation points extracted by this method have no direct relationship to the local peaks or glottal closure points of the speech waveform, and they may be unsuitable for use as pitch marks as they are.

[0008]

[Problems to Be Solved by the Invention] As described above, the method that uses local peaks on the time waveform as pitch marks has the problem that jitter near the peaks of the time waveform is carried into the pitch marks and makes the synthesized speech sound muddy, and the method that uses glottal closure points as pitch marks has the problem that estimating the glottal closure point is complex. In addition, conventional methods that filter out the fundamental wave have been unable to extract timing suitable for use as pitch marks.

[0009] In view of these conventional problems, an object of the present invention is to provide a speech analysis method that can assign pitch marks more appropriately than before by a relatively simple method, a speech synthesis method that can synthesize better speech than before, a medium, and the like.

[0010]

[Means for Solving the Problems] The invention of claim 1 is a speech analysis method that generates pitch mark information, which is a temporal reference position corresponding to the pitch period of a speech waveform, using speech waveform storage means for storing a speech waveform, pitch analysis means for analyzing pitch, an adaptive filter, and peak detection means for detecting peaks, the method being characterized in that: part of the speech waveform is temporarily stored using the speech waveform storage means; rough pitch information for the temporarily stored speech waveform is generated using the pitch analysis means; the temporarily stored speech waveform is input to the adaptive filter, and the cutoff frequency or center frequency of the adaptive filter is changed on the basis of the rough pitch information so that only the fundamental wave of the input speech waveform is passed; and a plurality of local maxima on one side of the fundamental wave are detected using the peak detection means, whereby a series of accurate pitch mark information for the entire speech waveform is generated.

[0011] The invention of claim 2 is a speech analysis method that generates pitch mark information, which is a temporal reference position corresponding to the pitch period of a speech waveform, using a plurality of peak detection channels, each having a fixed low-pass filter and peak detection means for detecting peaks, and channel selection means for selecting a channel, the method being characterized in that: the cutoff frequencies of the plurality of fixed low-pass filters are set so that at least one of the fixed low-pass filters passes only the fundamental wave of the incoming speech waveform; each fixed low-pass filter is used to output a low-frequency component waveform, which is the component of the input speech at or below a predetermined frequency; the peak detection means is used to detect a plurality of local maxima on one side of the low-frequency component waveform output from the fixed low-pass filter and to output them as peak information; the channel selection means uses all or part of the peak information output from the plurality of peak detection channels to select a peak detection channel at each predetermined time interval on the basis of a predetermined selection criterion; and the peak information output from the selected peak detection channel is used to generate a series of pitch mark information for the entire speech waveform.

[0012] The invention of claim 18 is a speech synthesis method in which a target speech waveform, which is a speech waveform recorded in advance, is analyzed to create phoneme sequence information, phoneme timing information, pitch information, and amplitude information, and speech is synthesized on the basis of the phoneme sequence information, phoneme timing information, pitch information, and amplitude information, the method being characterized in that the phoneme sequence information holds the types of the phonemes contained in the target speech waveform and their order of appearance, the pitch information holds information on the pitch of the target speech waveform at each predetermined timing, and the amplitude information holds information on the amplitude of the target speech waveform at each predetermined timing.

[0013] The invention of claim 39 is a speech synthesis method that generates a predetermined message by combining a fixed message of natural speech with a synthesized message produced by speech synthesis, the method being characterized in that: pitch mark information corresponding to the natural speech is assigned in advance; at least at the junction between the fixed message and the synthesized message, speech with the same content as the fixed message is synthesized as the synthesized message by arranging the pitch waveforms of the speech waveform used for synthesizing the synthesized message on the basis of the pitch mark information; and the mixing ratio of the two utterances of the same content is varied over time and they are superimposed at the junction.

[0014] The invention of claim 41 is a speech synthesis method that generates a predetermined message by combining a first message with a second message, the method being characterized in that: the first message is generated by arranging the pitch waveforms of the speech waveform used for synthesizing the first message on the basis of pitch mark information corresponding to natural speech recorded in advance for each type of first message; at least at the junction between the first message and the second message, speech with the same content as the first message is synthesized as the second message; and the mixing ratio of the two utterances of the same content is varied over time and they are superimposed at the junction.

[0015] The invention of claim 44 is a medium characterized by having recorded thereon a program for causing a computer to execute all or some of the steps described in any of the above inventions.

[0016] The invention of claim 45 is a medium characterized by having recorded thereon a program for causing a computer to execute all or some of the steps described in any of the above inventions.

[0017] With the above configuration, for example, the target of local peak detection is a sinusoidal waveform, which makes it easy to extract segmentation points corresponding to the pitch period. Furthermore, by extracting peak positions rather than zero crossings as the segmentation points, pitch marks can be assigned at positions that nearly coincide with the local peaks and glottal closure points of the speech waveform.

[0018]

[Embodiments of the Invention] The pitch mark assignment method according to the speech analysis method of the present invention is described in detail below.

(Embodiment 1)

FIG. 1 is a configuration diagram showing a first embodiment of the pitch mark assignment method according to the speech analysis method of the present invention.

[0019] The configuration for realizing the pitch mark assignment method of this embodiment consists of a waveform storage unit 1001, a pitch analysis unit 1002, an adaptive low-pass filter 1003, and a peak detection unit 1004. The speech waveform input is connected to the waveform storage unit 1001, and the output of the waveform storage unit 1001 is connected in parallel to the pitch analysis unit 1002 and the adaptive low-pass filter 1003. The output of the pitch analysis unit 1002 is connected to the adaptive low-pass filter 1003, and the output of the adaptive low-pass filter 1003 is connected to the peak detection unit 1004. In addition, a polarity determination unit 1005 is connected to the waveform storage unit 1001, and the polarity determination unit 1005 and the peak detection unit 1004 are connected so that they can exchange information with each other.

[0020] The operation of the pitch mark assignment method configured as above is described in detail below.

[0021] The waveform storage unit 1001 temporarily stores part or all of the input speech waveform. The pitch analysis unit 1002 receives part of the speech waveform from the waveform storage unit 1001 and performs pitch analysis. Any generally known pitch analysis technique can be used here; one example is M. J. Ross et al., "Average Magnitude Difference Function Pitch Extractor", IEEE Transactions, Vol. ASSP-22, No. 5, 1974.

[0022] The pitch analysis result is output as pitch information to the adaptive low-pass filter 1003. The adaptive low-pass filter 1003 sets its cutoff frequency on the basis of the pitch information and processes the speech, thereby extracting the fundamental wave with the harmonic components of the speech waveform removed. This is achieved by using a cutoff frequency of about 1.2 times the pitch frequency. An FIR linear-phase filter is suitable for the adaptive low-pass filter 1003. Because this type of filter has the same delay at every frequency, shifting its output by a fixed amount makes the delay effectively zero.
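As a hedged illustration of the filtering step just described, the following sketch designs a linear-phase low-pass FIR filter with a cutoff of 1.2 times the pitch frequency and cancels its constant delay by shifting the output. The windowed-sinc design with a Hamming window, the tap count of 401, and the names `lowpass_fir` and `extract_fundamental` are assumptions of this sketch; the text specifies only the filter type and the approximate cutoff.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps):
    """Windowed-sinc (Hamming) linear-phase low-pass FIR; num_taps must be odd."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) // 2                # constant delay of the filter in samples
    taps = []
    for i in range(num_taps):
        k = i - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]          # normalise to unity gain at DC

def extract_fundamental(signal, pitch_hz, sample_rate_hz, num_taps=401):
    """Low-pass at 1.2 x pitch and shift the output so the net delay is zero."""
    taps = lowpass_fir(1.2 * pitch_hz, sample_rate_hz, num_taps)
    mid = (num_taps - 1) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for i, tap in enumerate(taps):
            idx = n + mid - i                # reading 'mid' samples ahead cancels the delay
            if 0 <= idx < len(signal):
                acc += tap * signal[idx]
        out.append(acc)
    return out
```

With the delay cancelled in this way, a peak of the extracted fundamental wave can be compared directly with the sample positions of the original speech waveform.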

[0023] FIG. 5 shows an example of a speech waveform and the fundamental wave obtained by processing it with the adaptive low-pass filter 1003. (a) is the speech waveform and (b) is the fundamental wave. As in (a), the speech waveform is complex because of its harmonics, whereas the fundamental wave is a simple sinusoidal waveform, as in (b).

[0024] Next, the peak detection unit 1004 detects the peaks corresponding to the periods of the fundamental wave. Its operation is described with reference to FIG. 6. The peak detection unit 1004 sets a suitable threshold according to the amplitude of the fundamental wave. It then takes each range in which the fundamental wave exceeds the threshold as a peak detection range. Finally, it detects the maximum point within each range as a peak. Because one such peak detection range arises automatically in every pitch period, one detected peak is likewise obtained per pitch period.
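The threshold-based procedure above can be sketched as follows. This is only an illustration: the text says the threshold is set according to the amplitude of the fundamental wave but gives no value, so the `ratio` parameter (a fraction of the maximum amplitude) and the function name `detect_peaks_by_threshold` are assumptions.

```python
def detect_peaks_by_threshold(fundamental, ratio=0.5):
    """Return one peak index per contiguous run of samples above the threshold."""
    threshold = ratio * max(fundamental)     # assumed rule for choosing the threshold
    peaks = []
    best = None                              # index of the largest sample in the current run
    for n, value in enumerate(fundamental):
        if value > threshold:
            if best is None or value > fundamental[best]:
                best = n
        elif best is not None:
            peaks.append(best)               # the run has ended: keep its maximum
            best = None
    if best is not None:
        peaks.append(best)                   # a run that reaches the end of the data
    return peaks
```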

[0025] There is also an alternative to the above approach, whose operation is described with reference to FIG. 7. The upper waveform in FIG. 7 is the fundamental wave and the lower waveform is the differential fundamental wave. The differential fundamental wave is the first difference of the fundamental wave (the change in the waveform obtained by subtracting the immediately preceding sample from each sample), an operation that corresponds to differentiation for an analog waveform.

[0026] Because the fundamental wave is sinusoidal, the differential fundamental wave is the fundamental wave advanced in phase by 90 degrees. The peaks of the fundamental wave therefore lie at the zero-cross positions of the differential fundamental wave. If the peaks to be detected are positive-going, the detection position is the point where the value of the differential fundamental wave changes from positive to negative. Because this method requires no threshold, it has the advantage of detecting peaks with high sensitivity even for a very weak fundamental wave.
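A minimal sketch of this threshold-free detection, under the assumption that the first sample has no predecessor and its difference is taken as zero; the function name `detect_peaks_by_difference` is hypothetical.

```python
def detect_peaks_by_difference(fundamental):
    """Find positive peaks where the first difference changes from positive to non-positive."""
    # diff[n] = x[n] - x[n-1]; diff[0] is set to 0 because x[-1] does not exist.
    diff = [0.0] + [fundamental[n] - fundamental[n - 1] for n in range(1, len(fundamental))]
    peaks = []
    for n in range(1, len(diff) - 1):
        # Rising into sample n and not rising out of it: sample n is a local maximum.
        if diff[n] > 0.0 and diff[n + 1] <= 0.0:
            peaks.append(n)
    return peaks
```

Because only the sign of the difference is examined, the result does not depend on the amplitude of the fundamental wave.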

[0027] Furthermore, by estimating the zero-cross positions of the differential fundamental wave, which is digital data, with high resolution, peak positions that could previously be obtained only to one-sample precision can be detected with any precision finer than one sample. Because the differential fundamental wave is sinusoidal, the waveform near a zero crossing can be approximated by a straight line. As shown in FIG. 8, a precise zero-cross position can therefore be estimated by linearly interpolating between the two data points of opposite sign on either side of the zero crossing.
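The interpolation can be sketched as follows. One detail is an assumption of this sketch and is not stated in the text: a difference between two adjacent samples is treated as the slope midway between them, so half a sample is added when converting the interpolated zero crossing back to a position on the original time axis. The function name `subsample_peaks` is hypothetical.

```python
def subsample_peaks(fundamental):
    """Estimate positive peak positions with sub-sample precision."""
    # diff[i] is the change from sample i to sample i + 1, so it is taken to
    # represent the slope at time i + 0.5 (assumption of this sketch).
    diff = [fundamental[i + 1] - fundamental[i] for i in range(len(fundamental) - 1)]
    peaks = []
    for i in range(len(diff) - 1):
        d0, d1 = diff[i], diff[i + 1]
        if d0 > 0.0 and d1 <= 0.0:
            # Straight line through (i, d0) and (i + 1, d1) crosses zero at i + frac.
            frac = d0 / (d0 - d1)
            peaks.append(i + 0.5 + frac)
    return peaks
```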

[0028] The zero-cross positions obtained in this way can be used as pitch mark information.

[0029] The peaks to be detected can have either of two polarities, positive or negative. In general, the peaks of one polarity coincide more closely with the peaks of the speech waveform. FIG. 9 shows an example of a speech waveform and its fundamental wave. In this figure, solid lines mark the positive peaks of the fundamental wave and wavy lines mark its negative peaks. The negative peaks nearly coincide with the abrupt change points of the speech waveform, whereas the positive peaks do not coincide with any change point or peak.

[0030] In such a case the negative peaks of the fundamental wave are considered to approximate the glottal closure timing. Peaks of both polarities are therefore extracted and compared against the speech waveform, and the polarity for which the speech waveform takes larger values at the extracted peak positions is chosen for the pitch marks. This comparison need not be made over the entire speech waveform; a decision based on a short section is sufficient. The polarity determination unit 1005 therefore receives the outputs of the peak detection unit 1004 for both polarities over a section of the speech and compares them with the waveform stored in the waveform storage unit 1001 to determine the polarity for the whole utterance. Thereafter, the peak detection unit 1004 detects only peaks of the determined polarity.
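The polarity decision might be sketched as below. The text says only that the polarity giving the larger speech waveform values at the peak positions is chosen; comparing the mean absolute value at each set of positions is an assumption of this sketch, as is the function name `choose_polarity`.

```python
def choose_polarity(speech, positive_peaks, negative_peaks):
    """Return +1 or -1 for the polarity whose peaks land on larger speech magnitudes."""
    def mean_magnitude(marks):
        # Average |speech| at the candidate pitch mark positions (assumed measure).
        if not marks:
            return 0.0
        return sum(abs(speech[m]) for m in marks) / len(marks)

    if mean_magnitude(positive_peaks) >= mean_magnitude(negative_peaks):
        return 1
    return -1
```

Only a short section of the speech needs to be passed in, since the result is then applied to the whole utterance.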

[0031] As noted above, the peaks of one polarity of the fundamental wave are considered to approximate the glottal closure timing; the reasoning is as follows.

[0032] When the speech waveform in the neighborhood of a certain time is expressed as in (Equation 1), its fundamental wave component can be expressed as in (Equation 2).

[0033]

[Equation 1]

[0034]

[Equation 2]

[0035] Meanwhile, a speech waveform can be modeled by an excitation source g(n) and a vocal tract transfer function. The excitation source is the pulse generated by glottal closure, and g(n) can be approximated by an impulse train as in (Equation 3). An impulse train has the property that the phases of all its harmonic components are zero. That is, the excitation source waveform g(n) can be expressed as in (Equation 4), and its fundamental wave component is therefore given by (Equation 5). Consequently, the peak positions of the fundamental wave component coincide with the positions of the impulses of the excitation source g(n); in other words, the peak positions coincide with the glottal closures.

[0036]

[Equation 3]

[0037]

[Equation 4]

[0038]

[Equation 5]
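The equation images for Equations 1 to 5 are not reproduced in this text. One form consistent with the surrounding description, offered as an assumption rather than as the original notation, is given below. The symbols are hypothetical: \omega_0 is the fundamental angular frequency, T = 2\pi/\omega_0 is the pitch period in samples, A_k and \phi_k are the amplitude and phase of the k-th harmonic of the speech, and G_k is the amplitude of the k-th harmonic of the excitation source.

```latex
\begin{align}
s(n)   &= \sum_{k=1}^{K} A_k \cos(k\omega_0 n + \phi_k) \tag{1}\\
s_1(n) &= A_1 \cos(\omega_0 n + \phi_1) \tag{2}\\
g(n)   &= \sum_{m=-\infty}^{\infty} \delta(n - mT) \tag{3}\\
g(n)   &= \sum_{k=0}^{K} G_k \cos(k\omega_0 n) \tag{4}\\
g_1(n) &= G_1 \cos(\omega_0 n) \tag{5}
\end{align}
```

In this form, every harmonic in (4) has zero phase, so the fundamental component (5) reaches its maximum exactly at n = mT, which are the impulse positions in (3).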

[0039] In practice, the excitation source is not an impulse, and the delay of the vocal tract transfer function and the transfer characteristics of the propagation path after radiation from the lips must also be taken into account, so the peaks of the fundamental wave component cannot always be used directly as pitch marks. In that case, a more appropriate pitch mark is determined by comparing against the speech segment waveform while shifting the mark back and forth. Such a method is described in the explanation of the pitch mark assignment method of the fourth embodiment of the present invention.

[0040] Furthermore, when the transfer characteristics of the propagation path have large phase distortion near the pitch frequency, for example when the microphone is far from the lips, using a so-called all-pass circuit of the kind used for phase equalization of communication channels is also considered effective. The transfer characteristic of the space between the lips and the microphone is approximately high-pass, so the phase is advanced in the low frequency band near the pitch frequency. Using an all-pass circuit that has delay at frequencies in that vicinity is therefore expected to compensate the phase and make accurate estimation of the glottal closure point possible.

[0041] As described above, the pitch mark assignment method of this embodiment makes it possible to assign pitch marks, which are temporal reference positions corresponding to the pitch period, with simple processing. In addition, in detecting the peaks of the fundamental wave component, linearly interpolating the zero-cross positions of the differential fundamental wave makes it possible to generate high-resolution pitch mark information. The pitch mark assignment method of this embodiment can therefore itself be regarded as a high-resolution pitch analysis method.

[0042] This embodiment uses the pitch analysis unit 1002, which must perform the preliminary pitch analysis with reasonable accuracy. If the pitch information output by the pitch analysis unit 1002 is wrong, the adaptive low-pass filter 1003 may block the fundamental wave as well or let harmonics through. Such pitch analysis errors should be avoided as far as possible.

[0043] In view of this problem, the following describes a method that makes the preliminary pitch analysis unnecessary by using multiple sets of the basic configuration of a low-pass filter and a peak detector.

(Embodiment 2)

FIG. 2 is a configuration diagram of a second embodiment of the pitch mark assignment method of the present invention.

[0044] The structure used in the pitch mark assignment method of this embodiment consists of fixed low-pass filters 2001-a to 2001-d, peak detection units 2002-a to 2002-d, and a channel selection unit 2003. The input is connected in parallel to the fixed low-pass filters 2001-a to 2001-d. The output of the fixed low-pass filter 2001-a is connected to the peak detection unit 2002-a, the output of the fixed low-pass filter 2001-b to the peak detection unit 2002-b, and so on, one to one. The outputs of the peak detection units 2002-a to 2002-d are connected to the multiple inputs of the channel selection unit 2003.

[0045] The part consisting of a pair of a fixed low-pass filter 2001 and a peak detection unit 2002 is called a peak detection channel, or simply a channel; the channel consisting of the fixed low-pass filter 2001-a and the peak detection unit 2002-a is called peak detection channel A, or simply channel A. The other pairs are likewise called peak detection channels B, C, and D.

【００４６】上記のように構成されたピッチマーク付与
のための構成について以下に詳しく説明する。The configuration for providing a pitch mark configured as described above will be described in detail below.

【００４７】固定型低域フィルタ２００１−ａ〜ｄには
共通の音声波形が入力される。固定型低域フィルタ２０
０１−ａ〜ｄの遮断周波数はそれぞれ７１Ｈｚ、１４１
Ｈｚ、２８３Ｈｚ、５６６Ｈｚに固定されている。この
ようにフィルタを構成することによって、上記４つの固
定型低域フィルタ２００１−ａ〜ｄのうちの一つが必ず
基本波のみを通過させる。これは入力される音声のピッ
チが３６Ｈｚ〜５６６Ｈｚの範囲にある限り成立する。The same speech waveform is input to all of the fixed low-pass filters 2001-a to 2001-d. Their cutoff frequencies are fixed at 71 Hz, 141 Hz, 283 Hz, and 566 Hz, respectively. With the filters configured this way, one of the four fixed low-pass filters 2001-a to 2001-d always passes only the fundamental wave. This holds as long as the pitch of the input speech is in the range of 36 Hz to 566 Hz.
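As a rough illustration of why one channel always isolates the fundamental wave, the sketch below lists the channels whose cutoff lies between the pitch f0 and its second harmonic 2*f0. It assumes ideal brick-wall filters, which is a simplification not stated in the text, and the function name and dictionary are hypothetical.

```python
# Nominal cutoff frequencies (Hz) of channels A to D, as given in the text.
CUTOFFS_HZ = {"A": 71.0, "B": 141.0, "C": 283.0, "D": 566.0}


def fundamental_only_channels(f0_hz, cutoffs=CUTOFFS_HZ):
    """Return the channels that pass f0 but block the second harmonic 2*f0.

    Assumption: each filter is an ideal brick-wall low-pass filter.
    """
    return [name for name, fc in cutoffs.items() if f0_hz <= fc < 2.0 * f0_hz]


for f0 in (40.0, 100.0, 200.0, 500.0):
    print(f0, fundamental_only_channels(f0))
```

Because the cutoffs roughly double from channel to channel, each pitch inside the stated range falls into one such band; real filters have a gradual roll-off, so the band edges are less sharp in practice.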

【００４８】遮断周波数が実際のピッチよりも高いチャ
ンネルでは、固定型低域フィルタ２００１は高調波も同
時に通過させるため、ピーク検出部２００２ではピッチ
周期よりも短い間隔の多数のピークが検出される。逆
に、遮断周波数が実際のピッチよりも低いチャンネルで
は、固定型低域フィルタ２００１は基本波も含めてすべ
ての成分を遮断し、ピーク検出部２００２には何ら信号
が入力されず、ピークは全く検出されない。In a channel whose cutoff frequency is higher than the actual pitch, the fixed low-pass filter 2001 also passes harmonics, so the peak detection unit 2002 detects many peaks at intervals shorter than the pitch period. Conversely, in a channel whose cutoff frequency is lower than the actual pitch, the fixed low-pass filter 2001 blocks all components including the fundamental wave, so no signal reaches the peak detection unit 2002 and no peaks are detected at all.

【００４９】上記のような各チャンネルからの、多数の
ピークの存在やピーク不存在などのピーク情報を利用し
てチャンネル選択部２００３がチャンネルを単位時間
ごとに適応的に選択する。このようにして、予備的なピ
ッチ分析が不要なピッチマーク付与方法が実現する。Using this peak information from each channel, such as the presence of many peaks or the absence of peaks, the channel selection unit 2003 adaptively selects a channel for each unit of time. This yields a pitch mark assigning method that requires no preliminary pitch analysis.

【００５０】以下にチャンネル選択部２００３の動作原
理について説明する。The principle of operation of the channel selection unit 2003 will be described below.

【００５１】図１０はある音声のチャンネルＣ（遮断周
波数２８３Ｈｚ）およびチャンネルＤ（遮断周波数５６
６Ｈｚ）の出力を示している。横軸はピーク検出部２０
０２−ｂが出力したピークの位置（単位はミリ秒）、
縦軸は各ピークから次のピークまでの時間的間隔をＴｐ
（単位は秒）とした場合、１／Ｔｐ（単位はＨｚ）を表
したものである。このピーク情報を仮のピッチマーク情
報と見なすと、縦軸は仮のピッチ周波数と見なすことが
できる。この音声データは６０ミリ秒から３９０ミリ秒
の区間に有声音声が存在している。同図で６０ミリ秒か
ら２３０ミリ秒にかけてチャンネルＤの仮のピッチ周波
数は低下している。しかし、２３０ミリ秒を越えると急
激に仮のピッチ周波数は上昇し、それ以降は激しく上下
を繰り返している。一方、チャンネルＣはそのような領
域でもなめらかに仮のピッチ周波数が低下し続けてい
る。FIG. 10 shows the outputs of channel C (cutoff frequency 283 Hz) and channel D (cutoff frequency 566 Hz) for a certain utterance. The horizontal axis is the position of the peaks output by the peak detection unit 2002-b (in milliseconds). The vertical axis is 1/Tp (in Hz), where Tp (in seconds) is the time interval from each peak to the next. If this peak information is regarded as provisional pitch mark information, the vertical axis can be regarded as a provisional pitch frequency. In this speech data, voiced speech is present from 60 ms to 390 ms. In the figure, the provisional pitch frequency of channel D falls from 60 ms to 230 ms. Beyond 230 ms, however, it rises sharply and thereafter fluctuates violently. In channel C, by contrast, the provisional pitch frequency continues to fall smoothly even in that region.
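The quantity plotted in FIG. 10 can be sketched as follows: each peak position is paired with the reciprocal of the interval to the next peak. The function name is hypothetical, and the input is assumed to be a sorted list of peak times in milliseconds.

```python
def provisional_pitch(peak_times_ms):
    """Return (peak time in ms, 1/Tp in Hz) pairs for consecutive peaks."""
    points = []
    for t0, t1 in zip(peak_times_ms, peak_times_ms[1:]):
        tp_s = (t1 - t0) / 1000.0  # interval Tp to the next peak, in seconds
        if tp_s > 0.0:
            points.append((t0, 1.0 / tp_s))
    return points


print(provisional_pitch([60.0, 64.0, 68.0, 73.0]))
```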

【００５２】この理由は、２３０ミリ秒以降は音声の真
のピッチ周波数が２３０Ｈｚを下回るために、チャンネ
ルＤの固定型低域フィルタ２００１−ｄの出力は基本波
ではなく高調波を含んだものとなり、１ピッチ周期内に
複数のピークを持つようになるためである。しかも、１
ピッチ周期内の複数のピークは間隔が一様ではなく、高
調波同士の位相や振幅の関係で極めて複雑な変化をす
る。The reason is that after 230 ms the true pitch frequency of the speech falls below 230 Hz, so the output of the fixed low-pass filter 2001-d of channel D is no longer the fundamental wave alone but contains harmonics, and it has multiple peaks within one pitch period. Moreover, the peaks within one pitch period are not evenly spaced, and they change in an extremely complicated way depending on the phase and amplitude relationships among the harmonics.

【００５３】このように、高調波を含んだチャンネルの
出力は、仮のピッチマークから求められた仮のピッチ周
波数の変化の激しさを検出することで判断できる。Thus, a channel whose output contains harmonics can be identified by detecting how violently the provisional pitch frequency derived from its provisional pitch marks changes.

【００５４】そこで、チャンネル選択部２００３は単位
時間ごとにその前後二つの仮のピッチ周波数を比較し、
（数６）で表される変化率Ａ（ｎ）が最小であるチャン
ネルを選択する。Therefore, for each unit of time, the channel selection unit 2003 compares the two provisional pitch frequencies before and after that time and selects the channel for which the rate of change A(n), given by (Equation 6), is smallest.

【００５５】[0055]

【数６】 (Equation 6)

【００５６】（数６）において、ｐ（ｎ）はある時刻の
直前にあるピッチマーク位置を表し、ｐ（ｎ＋１）とｐ
（ｎ＋２）はそれぞれその直後および二つ後のピッチマ
ーク位置である。In (Equation 6), p(n) is the position of the pitch mark immediately before a given time, and p(n+1) and p(n+2) are the positions of the next pitch mark and the one after it, respectively.
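The selection rule can be sketched as below. The exact form of (Equation 6) is not reproduced in this text, so the sketch assumes that A(n) is the relative change between the two provisional pitch frequencies derived from p(n), p(n+1), and p(n+2). The function names and the dictionary layout are hypothetical.

```python
from bisect import bisect_right


def change_rate(marks, t):
    """Assumed A(n): relative change between two consecutive provisional pitch
    frequencies, where p(n) is the last mark at or before time t.

    marks must be strictly increasing. Returns None if fewer than three marks
    are available from p(n) onward (for example, a channel with no peaks).
    """
    i = bisect_right(marks, t) - 1  # index of p(n)
    if i < 0 or i + 2 >= len(marks):
        return None
    f1 = 1.0 / (marks[i + 1] - marks[i])
    f2 = 1.0 / (marks[i + 2] - marks[i + 1])
    return abs(f2 - f1) / f1


def select_channel(channel_marks, t):
    """Pick the channel with the smallest change rate at time t, or None."""
    rates = {name: change_rate(m, t) for name, m in channel_marks.items()}
    valid = {name: r for name, r in rates.items() if r is not None}
    return min(valid, key=valid.get) if valid else None


channels = {
    "A": [],                                      # cutoff below the pitch: no peaks
    "C": [0.000, 0.005, 0.0101, 0.0153],          # evenly spaced peaks
    "D": [0.000, 0.002, 0.005, 0.0062, 0.0101],   # irregular peaks (harmonics)
}
print(select_channel(channels, 0.001))
```

Channels that yield too few peaks are simply skipped here; how the channel selection unit actually treats them is not specified in this passage.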

【００５７】また、より正確な判断を行うためにこの選
択アルゴリズムは様々な形式が考えられる。例えば、
（数７）のようにＡ（ｎ）とＡ（ｎ−１）とＡ（ｎ＋
１）の分散Ｖ（ｎ）を計算し、それを最小にするチャン
ネルを選択することも有効である。これは、高調波を含
むチャンネルの仮のピッチ周波数がなめらかな変化をせ
ず、上下を繰り返す特性を利用したものである。This selection algorithm can take various forms for more accurate decisions. For example, it is also effective to compute the variance V(n) of A(n), A(n-1), and A(n+1), as in (Equation 7), and select the channel that minimizes it. This exploits the fact that the provisional pitch frequency of a channel containing harmonics does not change smoothly but repeatedly moves up and down.
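A minimal sketch of this variant follows. (Equation 7) is not reproduced in this text, so the population variance of the three values is assumed; the function name is hypothetical, and rates is taken to be a list of A values for one channel.

```python
def rate_variance(rates, n):
    """Assumed V(n): population variance of A(n-1), A(n), and A(n+1).

    Returns None when one of the three values is unavailable.
    """
    if n < 1 or n + 1 >= len(rates):
        return None
    window = rates[n - 1 : n + 2]
    mean = sum(window) / 3.0
    return sum((a - mean) ** 2 for a in window) / 3.0


smooth = [0.25, 0.25, 0.25]   # steady change rate: variance is zero
jumpy = [0.0, 0.5, 0.0]       # alternating change rate: variance is large
print(rate_variance(smooth, 1), rate_variance(jumpy, 1))
```

A channel would then be chosen by comparing this value across channels in the same way as the change rate itself.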

【００５８】[0058]

【数７】 (Equation 7)

【００５９】このようにしてチャンネル選択部２００３
が逐次チャンネルを選択することにより、図１１のよう
ななめらかな曲線を抽出することができる。同図で横軸
は時間（単位はミリ秒）、縦軸は逐次選択されたチャン
ネルのピッチマーク情報から計算されたピッチ周波数
（単位はＨｚ）である。As the channel selection unit 2003 selects channels successively in this way, a smooth curve like the one in FIG. 11 can be extracted. In the figure, the horizontal axis is time (in milliseconds) and the vertical axis is the pitch frequency (in Hz) calculated from the pitch mark information of the successively selected channels.

【００６０】なお、ここでは説明の都合上チャンネルを
４つとしたが、それ以外のチャンネル数を用いてももち
ろんかまわない。例えば、入力される音声が非常に低い
と分かっている場合は低い周波数のチャンネルを設ける
ことが望ましい。その代わり、高い周波数のチャンネル
が省略できる場合もあり得る。また、チャンネル間の遮
断周波数の関係を順に２倍になるようにしたが、これよ
り狭い間隔で配置することももちろんかまわない。そう
することによって、常に複数のチャンネルが基本波のみ
を通過させることになり、隣接するチャンネルなら信頼
性が高く、チャンネル選択の信頼性が一層高まる。Four channels are used here for convenience of explanation, but any other number of channels may of course be used. For example, if the input speech is known to be very low in pitch, it is desirable to provide a channel with a lower cutoff frequency; in exchange, a high-frequency channel can sometimes be omitted. Also, the cutoff frequencies of successive channels are set to double in sequence here, but they may of course be spaced more closely. In that case more than one channel always passes only the fundamental wave, so adjacent channels are both reliable and channel selection becomes even more reliable.

【００６１】以上説明したように、本実施の形態のピッ
チマーク付与方法を用いることで、予備的なピッチ分析
をいっさい行わずに適切なピッチマーク付与が可能とな
る。As described above, by using the pitch mark assigning method of the present embodiment, it is possible to assign an appropriate pitch mark without performing any preliminary pitch analysis.

【００６２】ところで、この実施の形態２のピッチマー
ク付与方法は異なるチャンネルからのピッチマーク情報
をつなぎ合わせて一つのピッチマーク情報とするため
に、そのつなぎ目で若干の不規則さが発生する可能性が
ある。Incidentally, because the pitch mark assigning method of Embodiment 2 joins pitch mark information from different channels into a single sequence of pitch mark information, slight irregularities may occur at the joins.

【００６３】そこで、本実施の形態２のピッチマーク付
与方法を一種のピッチ分析法と考え、ピッチマーク情報
を一旦ピッチ情報に変換した上で改めで適応型低域フィ
ルタを制御することで一連のピッチマーク情報を正確に
作り直すことができる。そのような内容の実施の形態に
ついて次に説明する。Therefore, the pitch mark assigning method of Embodiment 2 can be regarded as a kind of pitch analysis method: the pitch mark information is first converted into pitch information, which is then used to control an adaptive low-pass filter, so that a continuous sequence of pitch mark information can be regenerated accurately. An embodiment of this kind is described next.

（実施の形態３）図３は本発明の第３の実施の形態のピッチマーク付与方法の構成図である。(Embodiment 3) FIG. 3 is a block diagram of the pitch mark assigning method according to a third embodiment of the present invention.

【００６４】本実施の形態のピッチマーク付与方法は波
形記憶部３００１、固定型低域フィルタ３００２−ａ〜
ｄ、ピーク検出部３００３−ａ〜ｄ、チャンネル選択部
３００４、適応型低域フィルタ３００５、ピーク検出部
３００６、極性判定部３００７からなる。この構成は本
発明の第１の実施の形態において、ピッチ分析部１００
２を固定型低域フィルタ３００２−ａ〜ｄ、ピーク検出
部３００３−ａ〜ｄ、チャンネル選択部３００４で置き
換えたもの、言い換えれば本発明の第２の実施の形態を
ピッチ分析部として用いたものである。The pitch mark assigning method of this embodiment consists of a waveform storage unit 3001, fixed low-pass filters 3002-a to 3002-d, peak detection units 3003-a to 3003-d, a channel selection unit 3004, an adaptive low-pass filter 3005, a peak detection unit 3006, and a polarity determination unit 3007. This configuration is the first embodiment of the present invention with the pitch analysis unit 1002 replaced by the fixed low-pass filters 3002-a to 3002-d, the peak detection units 3003-a to 3003-d, and the channel selection unit 3004; in other words, the second embodiment of the present invention is used as the pitch analysis unit.

【００６５】この構成によれば、予備的ピッチ分析が不
要なピッチマーク付与方法を一種のピッチ分析とし、そ
の結果得られるピッチ情報を用いてピッチマーク付与が
行える。With this configuration, the pitch mark assigning method that needs no preliminary pitch analysis serves as a kind of pitch analysis, and pitch marks are assigned using the pitch information it produces.

（実施の形態４）図４は本発明の音声分析方法にかかる第４の実施の形態のピッチマーク付与方法の構成図である。(Embodiment 4) FIG. 4 is a block diagram of the pitch mark assigning method according to a fourth embodiment of the speech analysis method of the present invention.

【００６６】本実施の形態のピッチマーク付与方法は波
形記憶部４００１、固定型低域フィルタ４００２−ａ〜
ｄ、ピーク検出部４００３−ａ〜ｄ、チャンネル選択部
４００４、適応型低域フィルタ４００５、ピーク検出部
４００６、ピッチマーク照合部４００７、極性判定部４
００８からなる。この構成は本発明の第３の実施の形態
にピッチマーク照合部４００７が追加されたものであ
る。The pitch mark assigning method of this embodiment consists of a waveform storage unit 4001, fixed low-pass filters 4002-a to 4002-d, peak detection units 4003-a to 4003-d, a channel selection unit 4004, an adaptive low-pass filter 4005, a peak detection unit 4006, a pitch mark matching unit 4007, and a polarity determination unit 4008. This configuration is the third embodiment of the present invention with the pitch mark matching unit 4007 added.

【００６７】ピッチマーク照合部４００７はピーク検出
部４００６の出力であるピーク位置情報を数種類の値に
よってシフトすることによって複数のピッチマーク候補
を作成する。例えば、ピーク検出部４００６によって抽
出されたピーク位置を（数８）のような系列で表すと
き、ピッチマーク候補を（数９）のように作成する。The pitch mark matching section 4007 creates a plurality of pitch mark candidates by shifting the peak position information output from the peak detecting section 4006 by several values. For example, when the peak position extracted by the peak detection unit 4006 is represented by a series as shown in (Equation 8), pitch mark candidates are created as shown in (Equation 9).

【００６８】[0068]

【数８】 (Equation 8)

【００６９】[0069]

【数９】 (Equation 9)

【００７０】次に、（数９）のように作成されたピッチ
マーク候補を音声素片波形と照合し、その結果を基にピ
ッチマーク候補の中からピッチマークを選び出し、出力
する。Next, the pitch mark candidate created as shown in (Equation 9) is collated with the speech unit waveform, and based on the result, a pitch mark is selected from the pitch mark candidates and output.

【００７１】照合の方法は以下の通りである。音声素片
波形は（数１０）のように表されるとすると、（数１
１）を用いて評価値を算出する。続いて、（数１１）を
最大にするkを求め、そのkに該当するピッチマーク候補
P'（m, k）をピッチマークとして選び出す。The matching method is as follows. Assuming the speech unit waveform is expressed as in (Equation 10), an evaluation value is calculated using (Equation 11). Then the k that maximizes (Equation 11) is found, and the corresponding pitch mark candidates P'(m, k) are selected as the pitch marks.

【００７２】[0072]

【数１０】 (Equation 10)

【００７３】[0073]

【数１１】 [Equation 11]

【００７４】このような、ピッチマーク照合部１００５
での処理の流れを言い換えると、検出されたピークを時
間的に前後にシフトしながら、音声素片波形のピークと
の一致度が最も高い所を検索することを意味する。検索
の範囲は適応型低域フィルタ４００５の遅延量に応じて
適切に選ぶべきで、遅延量を中心に前後１ピッチ周期以
内が適切である。Put another way, the processing in the pitch mark matching unit 1005 shifts the detected peaks back and forth in time and searches for the shift at which they best coincide with the peaks of the speech unit waveform. The search range should be chosen according to the delay of the adaptive low-pass filter 4005; a range within one pitch period before and after the delay is appropriate.
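The search described above might look like the following sketch. Equations 8 to 11 are not reproduced in this text, so two assumptions are made: the candidates are the detected peaks shifted by an integer number of samples k, and the evaluation value is the sum of the waveform samples at the candidate positions (which suits a waveform whose peaks are positive). All names are hypothetical.

```python
def match_pitch_marks(waveform, peaks, delay, max_shift):
    """Shift the detected peaks and keep the shift with the highest score.

    peaks: integer sample positions found on the filtered (delayed) signal.
    delay: filter delay in samples; the search is centred on a shift of -delay.
    max_shift: half-width of the search range, e.g. about one pitch period.
    Returns (best shift k, shifted pitch marks), or (None, []) if no shift fits.
    """
    best_k, best_score = None, float("-inf")
    for k in range(-delay - max_shift, -delay + max_shift + 1):
        positions = [p + k for p in peaks]
        if any(q < 0 or q >= len(waveform) for q in positions):
            continue  # skip shifts that push a candidate outside the waveform
        score = sum(waveform[q] for q in positions)
        if score > best_score:
            best_k, best_score = k, score
    if best_k is None:
        return None, []
    return best_k, [p + best_k for p in peaks]


# Toy waveform with unit pulses at 10, 20, 30; peaks detected 3 samples late.
wave = [0.0] * 50
for pos in (10, 20, 30):
    wave[pos] = 1.0
print(match_pitch_marks(wave, [13, 23, 33], delay=3, max_shift=4))
```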

【００７５】もし、適応型低域フィルタ４００５の遅延
量が小さければ、ピーク検出部４００６の出力をそのま
まピッチマークとして用いることも可能である。If the amount of delay of the adaptive low-pass filter 4005 is small, the output of the peak detector 4006 can be used as it is as a pitch mark.

【００７６】さて、上記第１〜第４の実施の形態に示し
たピッチマーク付与方法を用いることによる利点をまと
めると以下のようになる。Now, the advantages obtained by using the pitch mark providing methods shown in the first to fourth embodiments are summarized as follows.

【００７７】第一の利点は、既知のアルゴリズムの応用
による簡易な手法により実現可能な点である。すなわ
ち、ピッチ分析、ローパスフィルタなどの構成要素は既
に確立された手法であるため、安定した動作が期待でき
る。また、本発明の音声分析にかかるピッチマーク付与
方法の第２から第４の実施の形態を用いれば、最初の段
階での予備的なピッチ抽出自体が不要となるか、あるい
は本発明の音声分析にかかるピッチマーク付与方法を用
いることで予備的ピッチ抽出自体を実現することが可能
である。The first advantage is that the method can be realized simply by applying known algorithms. Components such as pitch analysis and low-pass filtering are well-established techniques, so stable operation can be expected. Moreover, with the second to fourth embodiments of the pitch mark assigning method for speech analysis of the present invention, the preliminary pitch extraction at the first stage becomes unnecessary, or the preliminary pitch extraction can itself be realized with the pitch mark assigning method of the present invention.

【００７８】第二の利点は、ピッチ周期に対応した確実
なピッチマークが付与できる点である。音声素片波形そ
のものからピークを抽出しようとすると、高調波の影響
を受けてうまくピッチ周期に対応したピークが抽出でき
ない場合がある。本発明によれば、ピーク抽出の対象は
基本波成分波形であるため、そのような心配がない。ま
た、有声無声の判定も基本波成分波形の振幅がある程度
の大きさを持つ部分のみを対象にすることで自動的に行
える。また、差分基本波のゼロクロス点を用いるピーク
検出法は極めて高い感度で基本波のピークを検出でき
る。従って、母音開始部や終了部などの微弱な波形の部
分からも精度良くピークを検出することが可能である。The second advantage is that a reliable pitch mark corresponding to the pitch period can be provided. When an attempt is made to extract a peak from the speech unit waveform itself, a peak corresponding to the pitch period may not be extracted properly due to the influence of harmonics. According to the present invention, there is no such concern because the peak extraction target is the fundamental wave component waveform. Further, the voiced / unvoiced determination can be automatically performed by targeting only a portion where the amplitude of the fundamental component waveform has a certain magnitude. Further, the peak detection method using the zero cross point of the differential fundamental wave can detect the peak of the fundamental wave with extremely high sensitivity. Therefore, it is possible to accurately detect a peak even from a weak waveform portion such as a vowel start portion and an end portion.

【００７９】第三の利点はざらつきのないなめらかな合
成音が得られる点である。例えばピッチマークを音声素
片波形上のピークに打つことができたとする。しかし、
音声素片波形のピークは高調波の影響で様々な揺らぎを
持っているため、ピッチマークの位置も複雑に揺らぎを
含む。そして音声合成時にはピッチ波形の位置をピッチ
マークの位置を基準に決めるため、そのようにピッチマ
ークの位置が前後に揺らいでいると合成音が大きなジッ
タを含むことになり、ざらついた音になる。このような
ことを防ぐには、ピッチマークの間隔を平滑化しなくて
はならない。また、たとえ声門閉鎖の位置に正確にピッ
チマークが付与できても、声門閉鎖位置自身が揺らぎを
持っていることも考えられる。通常、音声合成時にはピ
ッチ波形の配置をピッチマーク位置に基づいて行うた
め、音声合成時に、もとのピッチ間隔と異なる間隔で再
配置を行うことになる。このことにより、瞬時の揺らぎ
による影響を受けない多くの高調波成分などに揺らぎを
付加してしまうこととなり、このことが合成音に濁りを
生む場合も考えられる。本発明の音声分析方法にかか
るピッチマーク付与方法は、純音に近い基本波成分から
ピークを抽出するため、本来のなめらかなピッチ変化に
対応したピッチマークを適切に付与することができる。
その結果、揺らぎの成分を適切に合成音に反映させなが
らざらつきのないなめらかな音声を合成できる。The third advantage is that smooth synthesized speech without roughness is obtained. Suppose, for example, that pitch marks could be placed on the peaks of the speech unit waveform. Those peaks fluctuate in various ways under the influence of harmonics, so the pitch mark positions would also contain complicated fluctuations. Because the positions of pitch waveforms during synthesis are determined relative to the pitch marks, pitch marks that waver back and forth give the synthesized speech a large jitter and a rough sound. Preventing this would require smoothing the pitch mark intervals. Furthermore, even if pitch marks could be placed exactly at the glottal closure points, the glottal closure positions themselves may fluctuate. Since pitch waveforms are normally placed according to pitch mark positions, synthesis then rearranges them at intervals that differ from the original pitch intervals. This adds fluctuation to many harmonic components that are not actually affected by the instantaneous fluctuation, which may make the synthesized speech sound muddy. The pitch mark assigning method of the speech analysis method of the present invention extracts peaks from the fundamental component, which is close to a pure tone, so it can appropriately assign pitch marks that follow the original smooth pitch change. As a result, smooth speech without roughness can be synthesized while the fluctuation component is appropriately reflected in the synthesized speech.

【００８０】また、差分基本波のゼロクロス点を前後の
サンプルから直線補間で推定することにより、サンプル
点の粗さに影響を受けないなめらかなピーク間隔の変化
を反映させることができ、その結果極めてなめらかな音
質を実現できる。Furthermore, by estimating the zero-cross point of the differential fundamental wave by linear interpolation from the samples before and after it, smooth changes in the peak interval can be captured without being affected by the coarseness of the sample points, and as a result an extremely smooth sound quality can be achieved.
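The sub-sample peak estimate can be sketched as follows. The sketch takes the first difference of the fundamental component as the "differential fundamental wave" and treats each difference as the slope halfway between two samples; that half-sample convention is an assumption of this example, and the function name is hypothetical.

```python
import math


def subsample_peaks(x):
    """Return peak positions of x in samples, with fractional precision."""
    # d[i] is the first difference, treated here as the slope at i + 0.5.
    d = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    peaks = []
    for i in range(len(d) - 1):
        # A slope going from positive to non-positive marks a local maximum.
        if d[i] > 0.0 and d[i + 1] <= 0.0:
            frac = d[i] / (d[i] - d[i + 1])  # linear interpolation of the zero cross
            peaks.append(i + 0.5 + frac)
    return peaks


# Sinusoid with a period of 80.3 samples and a peak at sample 10.25.
wave = [math.cos(2.0 * math.pi * (n - 10.25) / 80.3) for n in range(300)]
print(subsample_peaks(wave))
```

On a nearly sinusoidal input such as the filtered fundamental, the interpolated positions track the true peaks far more finely than the sample grid, which is the property the paragraph relies on.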

【００８１】以上述べたように、本発明では、例えば、
音声波形を基本波成分のみを通過させるように設定され
たFIR直線位相型ローパスフィルタによって正弦波状の
基本波成分波形を抽出し、その基本波成分波形のローカ
ルピークにマークを付与し、その位置をピッチマークと
する。As described above, in the present invention, for example, a sinusoidal fundamental component waveform is extracted from the speech waveform by an FIR linear-phase low-pass filter set to pass only the fundamental component, marks are placed at the local peaks of that fundamental component waveform, and those positions are used as the pitch marks.

【００８２】この方法によれば、ローカルピーク検出の
対象が正弦波状の波形であるためにピッチ周期に対応し
た区分点の抽出を容易にし、さらにゼロクロスではなく
ピーク位置を区分点として抽出することにより、ほぼ音
声波形のローカルピーク及び声門閉鎖点に一致した位置
にピッチマークを付与することができる。With this method, the target of local peak detection is a sinusoidal waveform, which makes it easy to extract division points corresponding to the pitch period. In addition, by extracting peak positions rather than zero crossings as the division points, pitch marks can be assigned at positions that nearly coincide with the local peaks of the speech waveform and with the glottal closure points.

（実施の形態５）次に、本発明の音声合成方法の実施の形態について説明する。(Embodiment 5) Next, an embodiment of the speech synthesis method of the present invention will be described.

【００８３】図１２は本発明の音声合成方法の第１の実
施の形態を表している。FIG. 12 shows a first embodiment of the speech synthesis method of the present invention.

【００８４】本実施の形態の音声合成方法は、ピッチマ
ーク記憶部１２００１と振幅情報記憶部１２００２と音
韻境界記憶部１２００３と音韻種別記憶部１２００４と
ピッチ波形記憶部１２００５とピッチ波形重畳部１２０
０６、およびそれらを全て制御する制御部１２００７を
用いる。The speech synthesis method of this embodiment uses a pitch mark storage unit 12001, an amplitude information storage unit 12002, a phoneme boundary storage unit 12003, a phoneme type storage unit 12004, a pitch waveform storage unit 12005, a pitch waveform superimposing unit 12006, and a control unit 12007 that controls all of them.

【００８５】ピッチマーク記憶部１２００１と振幅情報
記憶部１２００２と音韻境界記憶部１２００３と音韻種
別記憶部１２００４とピッチ波形記憶部１２００５の出
力は全てピッチ波形重畳部１２００６に接続されてい
る。Outputs of the pitch mark storage unit 12001, the amplitude information storage unit 12002, the phoneme boundary storage unit 12003, the phoneme type storage unit 12004, and the pitch waveform storage unit 12005 are all connected to the pitch waveform superimposition unit 12006.

【００８６】ピッチマーク記憶部１２００１にはあらか
じめ発声されて録音された自然音声に対して付与された
ピッチマーク情報が記憶されている。振幅情報記憶部１
２００２には自然音声のピッチマーク付近での振幅を表
す情報がピッチマーク情報と一対一で記録されている。
音韻境界記憶部１２００３には前述の自然音声における
音韻境界のタイミングが記憶されている。例えば自然音
声が「ありがとう」の場合、「あ」「り」「が」「と」
「う」の開始タイミングがそれぞれ記憶される。音韻種
別記憶部１２００４には前述の自然音声における音韻の
種別が記憶されている。たとえば、「あ」「り」「が」
「と」「う」の五つの音韻を識別する情報が記憶されて
いる。ピッチ波形記憶部１２００５には音声合成用の素
片として録音された音声素片波形からピッチマークを中
心として切り出されたピッチ波形が多数記憶されてい
る。The pitch mark storage unit 12001 stores pitch mark information assigned to natural speech that was uttered and recorded in advance. The amplitude information storage unit 12002 stores information representing the amplitude of the natural speech near each pitch mark, in one-to-one correspondence with the pitch mark information. The phoneme boundary storage unit 12003 stores the timing of the phoneme boundaries in that natural speech. For example, if the natural speech is "arigatou" (ありがとう), the start timings of "a", "ri", "ga", "to", and "u" are each stored. The phoneme type storage unit 12004 stores the types of the phonemes in that natural speech, for example information identifying the five phonemes "a", "ri", "ga", "to", and "u". The pitch waveform storage unit 12005 stores a large number of pitch waveforms, each cut out around a pitch mark from speech unit waveforms recorded as units for speech synthesis.

【００８７】なお、ピッチマークの付与は、前述した実
施の形態１〜４の本発明によるピッチマーク付与方法に
よって可能である。また、ピッチ波形記憶部１２００５
におけるピッチ波形の作成およびこの後の動作の説明に
あるピッチ波形の配置による音声合成は公知の任意の技
術で可能である。例えば特開平７−１５２３９６に開示
されている。The pitch marks can be assigned by the pitch mark assigning method of the present invention described in Embodiments 1 to 4. The creation of the pitch waveforms held in the pitch waveform storage unit 12005, and the speech synthesis by placing pitch waveforms described in the operation below, can be done with any known technique, for example the one disclosed in JP-A-7-152396.

【００８８】また、振幅情報記憶部１２００２には自然
音声におけるピッチマークの前後、例えば１０ミリ秒の
間の波形の振幅の絶対値の最大値が各ピッチマークに対
して記憶されている。The amplitude information storage unit 12002 stores, for each pitch mark, the maximum value of the absolute value of the amplitude of the waveform before and after the pitch mark in natural speech, for example, for 10 milliseconds.

【００８９】このような条件の下で、自然音声と同じ内
容の合成音を合成する場合の動作を図１３に示す。以
下、図１３を参照しながら説明する。FIG. 13 shows the operation of synthesizing a synthesized speech having the same contents as natural speech under such conditions. Hereinafter, description will be made with reference to FIG.

【００９０】まず制御部１２００７は音韻種別記憶部１
２００４から最初の音韻種別情報Ｓを取得し（Ｓ７００
２）、続いて音韻境界記憶部１２００３から最初の音韻
境界情報Ｂを取得する（Ｓ７００３）。こうして、最初
の音韻の種別Ｓと、その開始タイミングを知る。続い
て、制御部１２００７はピッチマーク記憶部１２００１
からＢ以降の最も近いピッチマーク情報Ｐを取得すると
ともに、振幅情報記憶部１２００２からそのピッチマー
クに対応する振幅情報Ａを取得する（Ｓ７００４）。続
いて、ピッチ波形記憶部１２００５からＳの開始部分に
必要なピッチ波形を取得し（Ｓ７００６）、ピッチ波形
重畳部１２００６においてＰと同じタイミングになるよ
うにピッチ波形を配置し、Ａに従って振幅を制御する
（Ｓ７００７）。First, the control unit 12007 obtains the first phoneme type information S from the phoneme type storage unit 12004 (S7002), and then obtains the first phoneme boundary information B from the phoneme boundary storage unit 12003 (S7003). It thus knows the type S of the first phoneme and its start timing. Next, the control unit 12007 obtains from the pitch mark storage unit 12001 the nearest pitch mark information P at or after B, and obtains the amplitude information A corresponding to that pitch mark from the amplitude information storage unit 12002 (S7004). It then obtains the pitch waveform needed for the start of S from the pitch waveform storage unit 12005 (S7006), and the pitch waveform superimposing unit 12006 places the pitch waveform at the same timing as P and controls its amplitude according to A (S7007).

【００９１】続いて、ピッチマーク記憶部１２００１か
ら次のピッチマーク情報Ｐを取得するとともに、振幅情
報記憶部１２００２からそのピッチマークに対応する振
幅情報Ａを取得し（Ｓ７００４）、ピッチ波形記憶部１
２００５からＳの時刻（Ｔ−Ｂ）に対応するピッチ波形
を取得し、ピッチ波形重畳部１２００６においてＰと同
じタイミングになるように配置し、Ａに従って振幅を制
御する（Ｓ７００７）。これ以降、Ｓ７００４からＳ７
００７を繰り返すが、Ｓ７００４の直後で、取得したピ
ッチマーク情報Ｐが次の音韻境界を越えている場合はＳ
７００２に処理を移す（Ｓ７００５）。また、Ｓ７００
２の直前で次の音韻がない場合はメッセージ終了を意味
するので処理を終了する（Ｓ７００１）。Next, the control unit obtains the next pitch mark information P from the pitch mark storage unit 12001 and the amplitude information A corresponding to that pitch mark from the amplitude information storage unit 12002 (S7004), obtains from the pitch waveform storage unit 12005 the pitch waveform corresponding to time (T-B) of S, places it in the pitch waveform superimposing unit 12006 at the same timing as P, and controls its amplitude according to A (S7007). Steps S7004 to S7007 are then repeated. If, immediately after S7004, the obtained pitch mark information P lies beyond the next phoneme boundary, processing moves to S7002 (S7005). If, immediately before S7002, there is no next phoneme, the message has ended and processing terminates (S7001).

【００９２】Ｓ７００７における振幅の制御は以下のよ
うに行う。振幅情報Ａの値がａとする。これは、ピッチ
マーク情報Ｐに対応する自然音声波形の前後例えば１
０ミリ秒の間の振幅の絶対値の最大値である。一方、ピ
ッチ波形Ｗの振幅の絶対値の最大値をａｗとすると、
（数１２）によってこのピッチ波形に与えるゲインｇを
計算する。The amplitude control in S7007 is performed as follows. Let a be the value of the amplitude information A, that is, the maximum absolute amplitude of the natural speech waveform within, for example, 10 ms before and after the position given by the pitch mark information P. Let aw be the maximum absolute amplitude of the pitch waveform W. The gain g applied to this pitch waveform is then calculated by (Equation 12).

【００９３】[0093]

【数１２】 (Equation 12)

【００９４】このゲインｇの値をピッチ波形Ｗの前サン
プルに乗算することで振幅の制御を行う。The amplitude is controlled by multiplying every sample of the pitch waveform W by this gain g.
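The flow of S7001 to S7007, including the amplitude control, can be sketched as below. Two things are assumptions of this example rather than statements of the text: the gain is taken to be g = a / aw, since (Equation 12) is not reproduced here, and the lookup pick_waveform(S, offset), which stands in for the pitch waveform storage unit, is hypothetical. Positions are integer sample indices.

```python
def synthesize(phonemes, pitch_marks, amplitudes, pick_waveform, length):
    """Place one pitch waveform at every pitch mark and overlap-add the result.

    phonemes: list of (type S, start position B), sorted by B.
    pitch_marks: sorted integer positions P; amplitudes: target value a per mark.
    pick_waveform(S, offset): hypothetical lookup returning a list of samples
        centred on its middle element, for phoneme S at offset P - B.
    """
    out = [0.0] * length
    ph = 0
    for p, a in zip(pitch_marks, amplitudes):
        if p < phonemes[0][1]:
            continue  # mark lies before the first phoneme boundary
        # S7005: move on while P lies at or beyond the next phoneme boundary.
        while ph + 1 < len(phonemes) and p >= phonemes[ph + 1][1]:
            ph += 1
        s, b = phonemes[ph]
        w = pick_waveform(s, p - b)                # S7006
        aw = max(abs(v) for v in w)
        g = a / aw if aw > 0.0 else 0.0            # assumed form of the gain
        start = p - len(w) // 2
        for i, v in enumerate(w):                  # S7007: place and scale
            j = start + i
            if 0 <= j < length:
                out[j] += g * v
    return out


def toy_lookup(s, offset):
    # Hypothetical storage: one fixed three-sample waveform per phoneme type.
    return [0.0, 2.0, 0.0] if s == "a" else [0.0, -4.0, 0.0]


result = synthesize([("a", 0), ("i", 100)], [10, 60, 110], [1.0, 0.5, 2.0],
                    toy_lookup, 200)
print(result[10], result[60], result[110])
```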

【００９５】ところでピッチ波形記憶部１２００６には
あらかじめ音声素片専用の波形から切り出されたピッチ
波形が記憶されているが、これらのピッチ波形を切り出
す際にはやはりピッチマークを用いる。本発明の音声分
析方法にかかるピッチマーク付与方法の第１の実施の形
態のところで説明したように、ピッチマークを差分基本
波のゼロクロス点から求める場合、直線補間により１サ
ンプルよりも細かい単位でのピッチマークが得られる。
このことを生かしてピッチ波形の切り出しを１サンプル
よりも細かい単位で行っておくことにより、ピッチ波形
重畳部１２００６で合成された波形はより一層ざらつき
のないなめらかな音声となる。The pitch waveform storage unit 12006 stores pitch waveforms cut out in advance from waveforms dedicated to speech units, and pitch marks are likewise used when cutting out these pitch waveforms. As explained for the first embodiment of the pitch mark assigning method of the speech analysis method of the present invention, when pitch marks are obtained from the zero-cross points of the differential fundamental wave, linear interpolation yields pitch marks at a resolution finer than one sample. By exploiting this and cutting out the pitch waveforms at a resolution finer than one sample, the waveform synthesized by the pitch waveform superimposing unit 12006 becomes even smoother speech with no roughness.

【００９６】図１４はピッチ波形の切り出し方法を示し
たものである。上下の二つの図で横軸は時間、縦軸は波
形の振幅を表しており、横軸の目盛りはサンプルタイミ
ングである。ディジタルデータにはサンプルタイミング
でのみ値が定義される。また、上の図で○はディジタル
データとして記録されている音声波形のサンプルデータ
を表している。また、曲線はアナログ波形としての音声
波形である。縦線はピッチマークの位置を表している。FIG. 14 shows a method of cutting out a pitch waveform. In the upper and lower figures, the horizontal axis represents time, the vertical axis represents waveform amplitude, and the scale on the horizontal axis is sample timing. Values are defined for digital data only at the sample timing. In the above figure, ○ indicates sample data of a voice waveform recorded as digital data. The curve is a voice waveform as an analog waveform. The vertical line indicates the position of the pitch mark.

【００９７】ピッチマークが整数でない場合、この図の
ようにサンプルタイミングと一致しない。そこで、最寄
りのサンプルタイミングとその前後の合計３つのサンプ
ルデータを用いて二次補間でピッチマーク位置でのデー
タを推定する。また、ピッチマークから前後にサンプル
間隔の整数倍の位置（全てサンプルタイミングと一定量
ずれている）での全てのデータも同様に推定する。推定
された値は×で表されている。また、推定されたデータ
のみを抜き出したものが下の図に表されている。When a pitch mark is not an integer, it does not coincide with a sample timing, as shown in this figure. The data at the pitch mark position is therefore estimated by quadratic interpolation using three samples in total: the sample at the nearest sample timing and the samples before and after it. The data at every position offset from the pitch mark by an integer multiple of the sample interval (all of which are displaced from the sample timings by the same amount) is estimated in the same way. The estimated values are marked with ×. The lower figure shows only the estimated data.

【００９８】このようにして推定されたすべての値を切
り出し波形として記憶する。補間方法には上記の二次補
間の他、直線補間やスプライン補間など、いかなる補間
方法を用いることも可能である。All the values thus estimated are stored as cut-out waveforms. As the interpolation method, in addition to the above-described quadratic interpolation, any interpolation method such as linear interpolation or spline interpolation can be used.
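A minimal sketch of the cut-out with quadratic interpolation follows. It fits a parabola through the nearest sample and its two neighbours, as the text describes; the function names and the half_width parameter are hypothetical, and no window function is applied here.

```python
def quad_interp(x, t):
    """Estimate x at fractional position t from the nearest sample and its neighbours.

    Requires len(x) >= 3. Near the ends the centre sample is clamped, so the
    parabola is extrapolated there.
    """
    n = int(round(t))
    n = min(max(n, 1), len(x) - 2)
    u = t - n
    slope = 0.5 * (x[n + 1] - x[n - 1])
    curve = 0.5 * (x[n + 1] - 2.0 * x[n] + x[n - 1])
    return x[n] + u * slope + u * u * curve


def cut_pitch_waveform(x, mark, half_width):
    """Resample x at mark + k for k = -half_width .. +half_width."""
    return [quad_interp(x, mark + k) for k in range(-half_width, half_width + 1)]


signal = [float(n * n) for n in range(20)]  # a parabola is reproduced exactly
print(cut_pitch_waveform(signal, 5.3, 2))
```

Every output sample sits at the same fractional offset from the sample grid, which matches the description of positions displaced from the sample timings by a fixed amount.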

【００９９】また、ピッチマーク記憶部１２００１に記
憶されたピッチマーク情報が整数でない場合はピッチ波
形重畳部１２００６における波形の配置のタイミングも
整数でなくなるため、ピッチ波形の切り出しと同様の考
え方で補間を用いることにより、なめらかなピッチ変化
を持った合成音が生成できる。Likewise, if the pitch mark information stored in the pitch mark storage unit 12001 is not an integer, the placement timing of the waveforms in the pitch waveform superimposing unit 12006 is not an integer either. Applying interpolation in the same way as for cutting out the pitch waveforms then produces synthesized speech with a smooth pitch change.

【０１００】このようにして合成された音声は、ピッチ
マークのもとになった自然音声と同一のタイミング、ピ
ッチパターン、振幅の変化を有するばかりか、波形の
タイミングや位相のレベルでほぼ完全に一致するものと
なる。このことにより、子音やその前後で細かくピッチ
が上下する、いわゆるマイクロプロソディ（micro-pros
ody）の情報を含んだ極めて自然性の高い合成音を得る
ことが可能となる。Speech synthesized in this way not only has the same timing, pitch pattern, and amplitude changes as the natural speech from which the pitch marks were derived, but also matches it almost completely at the level of waveform timing and phase. This makes it possible to obtain highly natural synthesized speech that includes micro-prosody, the fine rises and falls of pitch at and around consonants.

【０１０１】なお、本実施の形態では、ピッチパターン
と振幅の情報をピッチマーク毎に保持するようにした
が、所定の区間ごとの平均値などを用いてもかまわな
い。こうすることで、ピッチパターン、振幅の情報を
圧縮することが可能で、合成音の音質もほとんど劣化を
伴わない。例えば、音韻開始点に挟まれた区間を一定の
区間数に区切れば、音声の発話速度に関わらず音韻の個
数に対応した効率の良い情報の保持が可能となる。ま
た、このような情報の持ち方は、音韻開始タイミング情
報を変形することによって、たとえ合成音のスピードを
任意に変化させても、極めて高い音質が保てるという利
点がある。また、ピッチ情報と振幅情報も変形すること
が可能となる。さらに、音韻系列情報を変更することに
より、発話の内容を変更することも可能である。変更が
可能な音韻は、変更前と変更後が互いに近い特性の音韻
である必要がある。例えば、有声音同士や無声音同士で
あれば比較的音質劣化が少なく入れ替えが可能である。In the present embodiment, the information on the pitch pattern and the amplitude is stored for each pitch mark, but an average value for each predetermined section may be used. By doing so, it is possible to compress the information of the pitch pattern and the amplitude, and the sound quality of the synthesized sound is hardly deteriorated. For example, if the section sandwiched by the phoneme start points is divided into a certain number of sections, efficient information corresponding to the number of phonemes can be held regardless of the speech utterance speed. Further, such a method of holding information has an advantage that extremely high sound quality can be maintained even if the speed of the synthesized sound is arbitrarily changed by modifying the phoneme start timing information. Also, the pitch information and the amplitude information can be changed. Further, by changing the phoneme sequence information, the content of the utterance can be changed. The phonemes that can be changed need to be phonemes with characteristics close to each other before and after the change. For example, voiced sounds and unvoiced sounds can be replaced with relatively little deterioration in sound quality.
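The per-section averaging mentioned above can be sketched as follows: the interval between two phoneme start points is divided into a fixed number of sections, and the values attached to the pitch marks in each section are averaged. The function name, the argument layout, and the use of None for empty sections are assumptions of this example.

```python
def section_averages(marks, values, start, end, n_sections):
    """Average per-mark values (pitch or amplitude) over equal sections of [start, end)."""
    sums = [0.0] * n_sections
    counts = [0] * n_sections
    width = (end - start) / n_sections
    for t, v in zip(marks, values):
        if start <= t < end:
            k = min(int((t - start) / width), n_sections - 1)
            sums[k] += v
            counts[k] += 1
    # A section containing no pitch mark has no average and is reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


marks = [0, 10, 20, 30, 40, 50, 60, 70]
pitch_hz = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0, 112.0, 114.0]
print(section_averages(marks, pitch_hz, 0, 80, 4))
```

Because the number of sections per phoneme is fixed, the amount of stored data depends on the number of phonemes rather than on the speaking rate, as the paragraph notes.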

【０１０２】なお、上記の説明では音韻種別情報Ｓとし
てどのような単位を用いるかは定義しなかったが、具体
的には音素を用いると良い。音素は子音や母音などの一
つ一つを表す単位で、例えば「カ」という音は/k/と/a/
の二つの音素からなる。The description above did not define which unit is used for the phoneme type information S; concretely, it is good to use phonemes (音素). A phoneme is a unit representing an individual consonant or vowel; for example, the sound "ka" consists of the two phonemes /k/ and /a/.

【０１０３】また、振幅情報を用いた場合についてのみ
説明したが、振幅情報を用いずに音声素片が持つ振幅の
ままで合成することも可能である。この場合、音質は若
干の不自然さを伴うものの、タイミングやピッチパター
ンが自然音声のものであるため、自然性の高いものとな
る。Only the case where amplitude information is used has been described, but it is also possible to synthesize without it, keeping the amplitude that each speech unit already has. In that case the result is slightly unnatural, but because the timing and pitch pattern are those of natural speech, it is still highly natural.

【０１０４】振幅情報を用いる場合、上記の説明ではピ
ッチマーク付近の振幅の絶対値の最大値を用いたが、他
の値を用いてももちろんかまわない。音声波形は振幅が
両方向に均一に分布するのではなく、一般にある極性に
偏った波形になる。これは声門閉鎖に伴って発生するパ
ルスが一方向であるためである。このパルスの方向に合
わせて片側の振幅の最大値を用いることは、音声波形に
含まれる揺らぎや雑音に影響を受けにくくするという効
果がある。また、ピッチマーク付近での短時間パワーを
用いることも考えられる。When amplitude information is used, the above description takes the maximum absolute amplitude near each pitch mark, but other values may of course be used. The amplitude of a speech waveform is not distributed evenly in both directions; in general the waveform is biased toward one polarity, because the pulse generated by glottal closure is unidirectional. Taking the maximum amplitude on one side only, matched to the direction of this pulse, makes the value less sensitive to fluctuations and noise in the speech waveform. Short-time power near the pitch mark could also be used.
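
The three amplitude measures mentioned above can be compared with a short sketch. The function names, the symmetric window of `half_width` samples around the pitch mark, and the `polarity` argument are assumptions introduced for illustration; the description does not specify the window or how the pulse direction is determined.

```python
def _window(samples, mark, half_width):
    """Samples within half_width of the pitch mark, clipped to the signal."""
    lo = max(0, mark - half_width)
    hi = min(len(samples), mark + half_width + 1)
    return samples[lo:hi]


def peak_abs(samples, mark, half_width):
    """Maximum absolute amplitude near the pitch mark."""
    return max(abs(x) for x in _window(samples, mark, half_width))


def peak_one_sided(samples, mark, half_width, polarity=1):
    """Maximum amplitude on one side only; polarity is +1 or -1 and should
    match the direction of the glottal pulse (assumed to be known)."""
    return max(0.0, max(polarity * x for x in _window(samples, mark, half_width)))


def short_time_power(samples, mark, half_width):
    """Mean squared amplitude near the pitch mark."""
    window = _window(samples, mark, half_width)
    return sum(x * x for x in window) / len(window)
```

For a waveform whose main pulse is negative, `peak_one_sided(..., polarity=-1)` ignores positive excursions entirely, which is the property the paragraph above attributes to the one-sided measure.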

【０１０５】さらに、振幅情報を抽出する前に事前に自
然音声を低域フィルタを用いて高域の成分を除去してお
くことが考えられる。これは、自然音声の振幅が高域成
分によって細かく変動することによる振幅情報の揺らぎ
を除去する効果がある。Furthermore, the high-frequency components of the natural speech may be removed with a low-pass filter before the amplitude information is extracted. This removes the fluctuation in the amplitude information caused by fine, high-frequency variation in the amplitude of natural speech.
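
As a rough illustration of this pre-filtering step, the sketch below uses a centered moving average as the low-pass filter. The filter type is an assumption chosen for brevity; the description only says that a low-pass filter is applied, and `moving_average` is a hypothetical name.

```python
def moving_average(samples, width):
    """Centered moving average, used here as a very simple low-pass filter.

    Near the edges the window is shortened rather than zero-padded.
    """
    half = width // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed
```

A rapidly alternating component is strongly attenuated by this filter while a slowly varying level passes through unchanged, so amplitude values read from the filtered signal fluctuate less from one pitch mark to the next.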

【０１０６】なお、音声合成の音質はピッチ波形記憶部
１２００５に記憶されたピッチ波形によって決定される
ので、ピッチマーク、振幅情報、音韻境界情報、音韻種
別情報は比較的低品質の音声から抽出したもので十分で
ある。例えば、ピッチ波形の帯域幅が10kHzであれば、
合成音の帯域幅も10kHzになる。従って、帯域幅5kHzの
音声からピッチマーク、振幅情報、音韻境界情報、音韻
種別情報を抽出しておけば、この音声よりも広帯域の高
品質音声として合成することが可能となる。これは、電
話回線などを通じて狭帯域になった音声を高品質音声に
変換することを可能とするため、極めて利用価値が高
い。（実施の形態６）次に、本発明の音声合成方法の別の実
施の形態について説明する。Since the sound quality of the synthesized speech is determined by the pitch waveforms stored in the pitch waveform storage unit 12005, it is sufficient to extract the pitch marks, amplitude information, phoneme boundary information, and phoneme type information from relatively low-quality speech. For example, if the bandwidth of the pitch waveforms is 10 kHz, the bandwidth of the synthesized speech is also 10 kHz. Therefore, if the pitch marks, amplitude information, phoneme boundary information, and phoneme type information are extracted from speech with a bandwidth of 5 kHz, the speech can be resynthesized as high-quality speech with a wider band than the original. This is extremely useful because it makes it possible to convert speech whose bandwidth has been narrowed, for example by passing through a telephone line, into high-quality speech.
(Embodiment 6) Next, another embodiment of the speech synthesis method of the present invention will be described.

【０１０７】音声メッセージの提供方法として用いられ
るものに、録音音声と合成音声の組み合わせがある。そ
のような方法が適するメッセージは定型の部分と不定形
の部分からなるものである。ここで言う定型の部分とは
様々なメッセージの中で多くのものに共通の部分であ
り、不定形な部分とは目的物や地名など数多くのパター
ンが考えられる部分である。One method used to provide voice messages combines recorded speech with synthesized speech. It suits messages that consist of a fixed part and an irregular part. Here, a fixed part is a part common to many different messages, and an irregular part is a part for which a great many variations are possible, such as an object or a place name.

【０１０８】このようなメッセージ提供方法では、定型
の部分を録音音声で、不定形の部分を合成音声で提供す
る。例えば、「次は、京都に止まります」というメッセ
ージがあり、他のメッセージには「次は、熱海に止まり
ます」などがあるとする。これら二つのメッセージは
「京都」と「熱海」の違いがあるのみで、「次は」と
「に止まります」の部分は共通のものが使用できる。こ
の場合、「次は」と「に止まります」が定型部分で、
「京都」と「熱海」の部分は他の地名や駅名が無数に考
えられるため、不定形部分となる。そこで、定型部分は
種類が少ないのであらかじめ自然に発声した音声を録音
しておき、不定形部分を音声合成によって生成するよう
にすることが行われる。しかし、音声合成の音質が録音
された音声と比べて劣るため、接続部分で音質の変化が
大きく、違和感を生む。In such a message providing method, the fixed part is provided as recorded speech and the irregular part as synthesized speech. For example, suppose one message is 「次は、京都に止まります」 ("Next, we stop at Kyoto") and another is 「次は、熱海に止まります」 ("Next, we stop at Atami"). The two messages differ only in 「京都」 (Kyoto) and 「熱海」 (Atami); the parts 「次は」 ("next") and 「に止まります」 ("we stop at") can be shared. In this case 「次は」 and 「に止まります」 are the fixed parts, and 「京都」 and 「熱海」 are the irregular parts, because countless other place names and station names could appear there. Since there are only a few kinds of fixed part, they are recorded in advance as naturally spoken speech, and the irregular parts are generated by speech synthesis. However, because the quality of synthesized speech is inferior to that of recorded speech, the sound quality changes sharply at the junctions, which sounds unnatural.

【０１０９】そこで、定型メッセージから合成音声に徐
々に切り替わるように混合比を変えながら接続すること
でその違和感を防ぐことが考えられる。この方法は例え
ば特開平５−２７７８９などがある。しかし、従来の合
成方法では、定型部との重なりの部分でピッチや位相が
異なるため、二重の音声として聞こえてしまう問題があ
る。One conceivable remedy is to join the two while changing the mixing ratio so that the output switches gradually from the fixed message to the synthesized speech. Such a method is described, for example, in Japanese Patent Application Laid-Open No. 5-27789. With conventional synthesis methods, however, the pitch and phase differ where the synthesized speech overlaps the fixed part, so the result is heard as two voices at once.

【０１１０】そこで、本発明の実施の形態では音声合成
部に第１の実施の形態の音声合成方法を用いる。その結
果、録音音声と合成音声の間で完全にピッチや位相が一
致し、両者を重ね合わせても単独の音声として聞こえる
ような優れた接続方法が実現する。In this embodiment, therefore, the speech synthesis method of the first embodiment is used in the speech synthesis section. As a result, the pitch and phase of the recorded speech and the synthesized speech match exactly, giving an excellent connection method in which the two are heard as a single voice even when superimposed.

【０１１１】図１５にその音声合成方法の構成を示す。
その音声合成方法では、定型メッセージ生成部１５００
１と合成メッセージ生成部１５００２と混合部１５００
３を用いる。定型メッセージ生成部１５００３にはメッ
セージの中で定型の部分の波形が記憶されており、必要
に応じて読み出されることによりメッセージの一部を出
力する。合成メッセージ生成部１５００２は図１２の構
成を有し、それらのピッチマーク記憶部１２００１、振
幅情報記憶部１２００２，音韻境界記憶部１２００３、
音韻種別記憶部１２００４のそれぞれには、定型メッセ
ージ生成部１５００１に記憶された波形から取り出され
たそれらの情報が記憶されている。FIG. 15 shows the configuration of this speech synthesis method. It uses a fixed message generation unit 15001, a synthesized message generation unit 15002, and a mixing unit 15003. The fixed message generation unit 15001 stores the waveforms of the fixed parts of the message and outputs part of the message when they are read out as needed. The synthesized message generation unit 15002 has the configuration shown in FIG. 12; its pitch mark storage unit 12001, amplitude information storage unit 12002, phoneme boundary storage unit 12003, and phoneme type storage unit 12004 each store the corresponding information extracted from the waveforms stored in the fixed message generation unit 15001.

【０１１２】以下に図１５に示す音声合成方法の動作に
ついて、先ほど例として示した「次は、京都に止まりま
す」のメッセージを用いて説明する。The operation of the speech synthesis method shown in FIG. 15 is described below using the example message given earlier, 「次は、京都に止まります」 ("Next, we stop at Kyoto").

【０１１３】説明を簡単にするために、定型メッセージ
生成部１５００１と合成メッセージ生成部１５００２の
両者は、共に同じメッセージ「次は、京都に止まりま
す」を生成するものとする。To simplify the description, assume that the fixed message generation unit 15001 and the synthesized message generation unit 15002 both generate the same message, 「次は、京都に止まります」.

【０１１４】図１６は混合部１５００３の二つの入力端
子のゲインの変化を示したものである。まず、メッセー
ジ開始部において、定型メッセージ生成部１５００１は
「次は」という定型部分の波形の読み出しと混合部１５
００３への出力を開始する。ここで、メッセージ開始部
とは、音声メッセージの冒頭部、即ち、図１６に示す
「つ」の時点のことである。FIG. 16 shows how the gains of the two input terminals of the mixing unit 15003 change. First, at the message start, the fixed message generation unit 15001 begins reading out the waveform of the fixed part 「次は」 and outputting it to the mixing unit 15003. Here, the message start is the very beginning of the voice message, that is, the point of 「つ」 ("tsu", the first syllable of 「次は」) shown in FIG. 16.

【０１１５】この時点で混合部１５００３は定型メッセ
ージ生成部１５００１側の入力ゲインを最大値にし、合
成メッセージ生成部１５００２側の入力ゲインを０にし
ている（Ｓ１６００１）。At this point, the mixing unit 15003 sets the input gain of the fixed message generation unit 15001 to the maximum value, and sets the input gain of the synthesized message generation unit 15002 to 0 (S16001).

【０１１６】一方、合成メッセージ生成部１５００２も
定型メッセージ生成部１５００１と同時に「次は」の合
成を開始する。このとき、ピッチマーク情報、音韻境界
情報、音韻種別情報は上述したように全て定型メッセー
ジ部分の波形から取り出したものを用いているので、合
成波形は定型メッセージ波形と同じピッチ、同じ位相を
有している。Meanwhile, the synthesized message generation unit 15002 starts synthesizing 「次は」 at the same time as the fixed message generation unit 15001. Because the pitch mark information, phoneme boundary information, and phoneme type information are all taken from the waveform of the fixed message part, as described above, the synthesized waveform has the same pitch and the same phase as the fixed message waveform.

【０１１７】メッセージ出力が「次は」の後半にさしか
かると、混合部１５００３は定型メッセージ生成部１５
００１側の入力ゲインを徐々に下げ、合成メッセージ生
成部１５００２側の入力ゲインを徐々に上げる（Ｓ１６
００２）。その結果、「次は」の後半部分は両者の波形
が重なり合ったものとなる。When the message output reaches the latter half of 「次は」, the mixing unit 15003 gradually lowers the input gain on the fixed message generation unit 15001 side and gradually raises the input gain on the synthesized message generation unit 15002 side (S16002). As a result, in the latter half of 「次は」 the two waveforms overlap.

【０１１８】メッセージ出力が「京都」にさしかかるま
でに、混合部１５００３は定型メッセージ生成部１５０
０１側の入力ゲインを０まで下げ、合成メッセージ生成
部１５００２側の入力ゲインを最大値にする（Ｓ１６０
０３）。その結果、「京都」の部分は合成音声のみで出
力される。By the time the message output reaches 「京都」, the mixing unit 15003 has lowered the input gain on the fixed message generation unit 15001 side to 0 and raised the input gain on the synthesized message generation unit 15002 side to the maximum value (S16003). As a result, the 「京都」 part is output as synthesized speech only.

【０１１９】メッセージ出力が「に止まります」にさし
かかると、混合部１５００３は先ほどと逆に定型メッセ
ージ生成部１５００１側の入力ゲインを徐々に上げ、合
成メッセージ生成部１５００２側の入力ゲインを徐々に
下げる（Ｓ１６００４）。そして、完全に定型メッセー
ジ生成部１５００１側の入力ゲインを最大に、合成メッ
セージ生成部１５００２側の入力ゲインを０にする（Ｓ
１６００５）。When the message output reaches 「に止まります」, the mixing unit 15003 does the reverse: it gradually raises the input gain on the fixed message generation unit 15001 side and gradually lowers the input gain on the synthesized message generation unit 15002 side (S16004). Finally, the input gain on the fixed message generation unit 15001 side is at its maximum and the input gain on the synthesized message generation unit 15002 side is 0 (S16005).
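
The gain changes in steps S16001 to S16005 can be sketched as a piecewise-linear schedule. This is a minimal sketch under assumptions the description does not state: the ramps are linear (the text only says "gradually"), the two gains are complementary so that they always sum to 1, and the names `synth_gain` and `mix` and the four fade time parameters are hypothetical.

```python
def synth_gain(t, fade_in_start, fade_in_end, fade_out_start, fade_out_end):
    """Gain applied to the synthesized-message input at time t."""
    if t <= fade_in_start or t >= fade_out_end:
        return 0.0                                    # S16001 / S16005: recorded speech only
    if t < fade_in_end:
        return (t - fade_in_start) / (fade_in_end - fade_in_start)      # S16002: ramp up
    if t <= fade_out_start:
        return 1.0                                    # S16003: synthesized speech only
    return (fade_out_end - t) / (fade_out_end - fade_out_start)         # S16004: ramp down


def mix(recorded, synthesized, times, **fade):
    """Mix the two inputs sample by sample with complementary gains."""
    output = []
    for r, s, t in zip(recorded, synthesized, times):
        g = synth_gain(t, **fade)
        output.append((1.0 - g) * r + g * s)
    return output
```

Because the synthesized waveform is built on the pitch marks of the recorded one, the two inputs are in phase during the ramps, which is why a simple weighted sum of this kind would not be heard as two separate voices.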

【０１２０】上記のような動作の結果、定型メッセージ
部分は録音音声によって、不定形メッセージ部分は合成
音声によってメッセージ提供が行われ、接続部分付近で
は両者の混合比率を徐々に変更しながらなめらかに移り
変わるような動作が実現する。不定形メッセージである
「京都」の部分を別の単語（たとえば「熱海」）に変更
することで、メッセージの変更が可能である。As a result of the above operation, the fixed message parts are provided as recorded speech and the irregular message part as synthesized speech, and near the junctions the output moves smoothly from one to the other as the mixing ratio is changed gradually. The message can be changed by replacing the irregular part 「京都」 with another word (for example, 「熱海」).

【０１２１】不定形メッセージ部分のピッチパターン
は、定型メッセージのピッチマークを用いて生成しても
よいが、他のピッチ生成方法を用いても構わない。特
に、「京都」以外の「熱海」などの地名の場合、「京
都」のピッチパターンがそのまま当てはまるとは限らな
いので「藤崎モデル」などのピッチ生成モデルを用いる
方が適切と考えられる。The pitch pattern of the irregular message part may be generated from the pitch marks of the fixed message, but another pitch generation method may also be used. In particular, for place names other than 「京都」, such as 「熱海」, the pitch pattern of 「京都」 does not necessarily fit as it is, so a pitch generation model such as the Fujisaki model is considered more appropriate.

【０１２２】なお、上記の説明では定型メッセージ生成
部１５００１と合成メッセージ生成部１５００２の両者
がメッセージ全体の生成を行うようにしていたが、必要
最小限の部分のみを受け持つようにしてももちろん構わ
ない。例えば、定型メッセージ生成部１５００１は「次
は」と「に止まります」の部分のみ、合成メッセージ生
成部１５００２は「は、京都に」の部分のみというよう
に部分的に生成したものを接続する事は可能であるし、
処理効率の上からもそれが望ましい。（実施の形態７）続いて、本発明の音声合成方法の更に
別の実施の形態について説明する。In the above description, both the fixed message generation unit 15001 and the synthesized message generation unit 15002 generate the entire message, but each may of course handle only the minimum necessary part. For example, the fixed message generation unit 15001 may generate only 「次は」 and 「に止まります」, and the synthesized message generation unit 15002 only 「は、京都に」; such partially generated outputs can be joined, and this is also preferable in terms of processing efficiency.
(Embodiment 7) Next, yet another embodiment of the speech synthesis method of the present invention will be described.

【０１２３】上記実施の形態６における音声合成方法の
説明でも述べたとおり、定型メッセージ部分と不定形メ
ッセージ部分の組み合わせによるメッセージ提供方法が
用いられている。このようなメッセージ提供方法の問題
点としては前述の「録音部分と合成部分の音質の差」が
あるが、そのほかにも「録音部分の記憶に必要な記憶装
置の容量の大きさ」がある。特に後者は録音メッセージ
部分の種類が多い場合に深刻となる。As noted in the description of the speech synthesis method of the sixth embodiment, message providing methods that combine a fixed message part with an irregular message part are in use. One problem with such methods is the difference in sound quality between the recorded and synthesized parts mentioned above; another is the large storage capacity needed to hold the recorded parts. The latter becomes serious when there are many kinds of recorded message part.

【０１２４】そこで本実施の形態では定型メッセージ部
分を録音によって蓄積するのではなく、ピッチマーク情
報、音韻境界情報、および音韻種別情報で蓄積してお
き、本発明の音声合成方法の第１の実施の形態によって
生成する。In this embodiment, therefore, the fixed message parts are not stored as recordings. Instead they are stored as pitch mark information, phoneme boundary information, and phoneme type information, and are generated by the speech synthesis method of the first embodiment of the present invention.

【０１２５】なお、本発明の第１のメッセージ、第２の
メッセージは、それぞれ、本実施の形態の定型メッセー
ジ、不定形のメッセージに対応する。Note that the first message and the second message of the present invention correspond to the fixed message and the irregular message of the present embodiment, respectively.

【０１２６】図１７は本実施の形態の音声合成方法の構
成を示すものである。その構成はピッチマーク記憶部１
２００１−１〜Ｎ、振幅情報記憶部１２００２−１〜
Ｎ、音韻境界記憶部１２００３−１〜Ｎ、音韻種別記憶
部１２００４−１〜Ｎ、ピッチ波形記憶部１２００５、
ピッチ波形重畳部１２００５、制御部１７００６からな
る。この構成は、図１２とほぼ同じであるが、ピッチマ
ーク記憶部１２００１、振幅情報記憶部１２００２、音
韻境界記憶部１２００３、音韻種別記憶部１２００４が
Ｎ個ずつ備わっている点が異なる。Ｎは定型メッセージ
の個数である。ｎを定型メッセージの番号とすると、そ
の定型メッセージの情報はピッチマーク記憶部１２００
１−ｎ、振幅情報記憶部１２００２−ｎ、音韻境界記憶
部１２００３−ｎ、音韻種別記憶部１２００４−ｎに記
憶されている。FIG. 17 shows the configuration of the speech synthesis method of this embodiment. It comprises pitch mark storage units 12001-1 to 12001-N, amplitude information storage units 12002-1 to 12002-N, phoneme boundary storage units 12003-1 to 12003-N, phoneme type storage units 12004-1 to 12004-N, a pitch waveform storage unit 12005, a pitch waveform superimposing unit 12005, and a control unit 17006. This configuration is almost the same as that of FIG. 12, except that there are N each of the pitch mark storage unit 12001, the amplitude information storage unit 12002, the phoneme boundary storage unit 12003, and the phoneme type storage unit 12004. N is the number of fixed messages. For fixed message number n, the information of that message is stored in the pitch mark storage unit 12001-n, the amplitude information storage unit 12002-n, the phoneme boundary storage unit 12003-n, and the phoneme type storage unit 12004-n.

【０１２７】ｋ番目の定型メッセージの合成を行うと
き、制御部１７００７はピッチマーク記憶部１２００１
−ｋ、振幅情報記憶部１２００２−ｋ、音韻境界記憶部
１２００３−ｋ、音韻種別記憶部１２００４−ｋを選択
する。以下、図１３に示したのと同様の手順で合成を行
う。すなわち、添え字ｋを省略すると、ピッチマーク記
憶部１２００１と振幅情報記憶部１２００２と音韻境界
記憶部１２００３と音韻種別記憶部１２００４に記憶さ
れた定型メッセージに関する情報を用いて合成を行う。When the k-th fixed message is synthesized, the control unit 17007 selects the pitch mark storage unit 12001-k, the amplitude information storage unit 12002-k, the phoneme boundary storage unit 12003-k, and the phoneme type storage unit 12004-k. Synthesis then proceeds by the same procedure as shown in FIG. 13. That is, omitting the suffix k, synthesis uses the fixed message information stored in the pitch mark storage unit 12001, the amplitude information storage unit 12002, the phoneme boundary storage unit 12003, and the phoneme type storage unit 12004.
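
The selection performed by the control unit can be pictured as indexing into N stored records. The sketch below is illustrative only: the `MessageInfo` record, the `Controller` class, and the use of plain lists for each kind of information are assumptions, and the real storage units are not necessarily organized this way.

```python
from dataclasses import dataclass


@dataclass
class MessageInfo:
    """Information held for one fixed message (the four storage units with suffix n)."""
    pitch_marks: list
    amplitudes: list
    phoneme_boundaries: list
    phoneme_types: list


class Controller:
    """Selects the k-th set of stored information, with k counted from 1 to N."""

    def __init__(self, messages):
        self.messages = list(messages)

    def select(self, k):
        if not 1 <= k <= len(self.messages):
            raise IndexError(f"fixed message number {k} is out of range")
        return self.messages[k - 1]
```

Once a record has been selected, the synthesis step itself does not need to know which message it is producing, which matches the remark that the suffix k can be omitted.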

【０１２８】不定形メッセージの合成には通常の音声合
成と同様に自分自身で生成したピッチパターンに従って
音声合成を行う。An irregular message is synthesized in the same way as in ordinary speech synthesis, following a pitch pattern that the synthesizer generates itself.

【０１２９】なお、この不定型メッセージの合成は、第
６の実施の形態で説明したものと同じ方法により音声合
成を行えば更に良い。すなわち、この場合、定型メッセ
ージと不定型メッセージとの少なくとも接続部において
は、不定型メッセージの音声合成に用いる音声波形のピ
ッチ波形をピッチマーク情報に基づいて配置することに
より定型メッセージと同じ内容の音声を不定型メッセー
ジとして合成するものである。The irregular message is better still if synthesized by the same method as described in the sixth embodiment. In that case, at least at the junction between the fixed message and the irregular message, the pitch waveforms used to synthesize the irregular message are placed according to the pitch mark information, so that speech with the same content as the fixed message is synthesized as part of the irregular message.

【０１３０】ここでのピッチマーク情報は、すでに説明
した定型メッセージの種類毎にあらかじめ録音された自
然音声から抽出したピッチマーク情報のことである。こ
れにより、接続部分での音質の変化の違和感がより一層
軽減されるという効果がある。The pitch mark information here is the pitch mark information, already described, that is extracted from natural speech recorded in advance for each kind of fixed message. This further reduces the unnatural change in sound quality at the junction.

【０１３１】このような動作により、定型メッセージ部
分と不定形メッセージ部分はともに合成音で提供される
ため、接続部分での音質の違和感は軽減される。さら
に、定型メッセージ部分には自然音声から抽出したピッ
チマーク情報を用いた合成音声を用いるため、従来の合
成音に比べて極めて自然性の高いものとなる。With this operation, the fixed message part and the irregular message part are both provided as synthesized speech, so the mismatch in sound quality at the junction is reduced. Moreover, because the fixed message part uses speech synthesized with pitch mark information extracted from natural speech, it is far more natural than conventional synthesized speech.

【０１３２】また、定型メッセージ部分の記憶容量は録
音によるメッセージ蓄積に比べてはるかに少なくすむ。
具体的には、１秒間のメッセージを記憶する場合、録音
に必要な記憶容量はサンプリング周波数22.05kHzで4ビ
ットADPCMを用いた場合、11キロバイトになる。一方、
本実施の形態によるメッセージ蓄積方法によれば、平均
ピッチを300Hzとしてピッチマークの個数は1秒あたり30
0個である。ピッチマーク一つあたり4バイト、振幅情報
一つあたり４バイトを割り当てるとすると300×4＋300
×4＝2400バイト＝2.4キロバイトとなる。また、振幅情
報を省略する方法では300×4=1200バイト=1.2キロバイ
トである。ピッチマーク情報に比べると音韻境界情報と
音韻種別情報は極めて小さいので無視できる。The storage capacity for the fixed message parts is also far smaller than when messages are stored as recordings. Specifically, storing a one-second message as a recording requires about 11 kilobytes when 4-bit ADPCM is used at a sampling frequency of 22.05 kHz. With the message storage method of this embodiment, on the other hand, assuming an average pitch of 300 Hz, there are 300 pitch marks per second. Allocating 4 bytes per pitch mark and 4 bytes per amplitude value gives 300 × 4 + 300 × 4 = 2400 bytes = 2.4 kilobytes. If the amplitude information is omitted, the figure is 300 × 4 = 1200 bytes = 1.2 kilobytes. The phoneme boundary information and phoneme type information are very small compared with the pitch mark information and can be ignored.
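
The storage estimate above can be reproduced with a few lines of arithmetic. The figures are those given in the text (22.05 kHz, 4-bit ADPCM, 300 Hz average pitch, 4 bytes per value); the variable names are introduced here only for illustration.

```python
SAMPLE_RATE_HZ = 22050          # 22.05 kHz sampling frequency
ADPCM_BITS_PER_SAMPLE = 4       # 4-bit ADPCM
MEAN_PITCH_HZ = 300             # one pitch mark per pitch period
BYTES_PER_PITCH_MARK = 4
BYTES_PER_AMPLITUDE = 4

# One second of recorded speech: samples x bits per sample, converted to bytes.
recorded_bytes = SAMPLE_RATE_HZ * ADPCM_BITS_PER_SAMPLE // 8   # 11025, about 11 KB

# One second stored as pitch marks, with and without amplitude information.
with_amplitude = MEAN_PITCH_HZ * (BYTES_PER_PITCH_MARK + BYTES_PER_AMPLITUDE)   # 2400
without_amplitude = MEAN_PITCH_HZ * BYTES_PER_PITCH_MARK                        # 1200

ratio_with = recorded_bytes / with_amplitude         # roughly 4.6
ratio_without = recorded_bytes / without_amplitude   # roughly 9.2
```

The two ratios, roughly 4.6 and 9.2, are consistent with rounded figures of about one-fifth and about one-tenth of the recording size.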

【０１３３】上記の考察によれば、録音に比べて５分の
１程度、振幅情報を省略すれば１０分の１程度の少ない
記憶容量でメッセージ蓄積が可能となる。また、前述し
たようにピッチマーク情報及び振幅情報はデータの形式
を工夫することでさらに効率よく圧縮することが可能で
ある。例えば、有声音素区間を４分割した小区間ごとに
ピッチ及び振幅情報を割り当てれば、録音データと比較
して１００分の１程度の情報量に圧縮することができ
る。By this estimate, messages can be stored in about one-fifth of the capacity needed for a recording, or about one-tenth if the amplitude information is omitted. As mentioned earlier, the pitch mark information and amplitude information can be compressed still further with a suitable data format. For example, if pitch and amplitude information is assigned to each of four subsections of every voiced phoneme section, the amount of information can be reduced to about 1/100 of that of the recorded data.

【０１３４】このように極めて小さい容量に圧縮された
情報から高品質な合成音を得ることができるため、これ
らの情報を記憶媒体から読み出す場合や、通信路を介し
て伝送する場合の効率も向上する。従って、情報をCD-R
OMなどのアクセス速度の遅い媒体に記憶したり、転送速
度の低い通信回線を通して高速の読み出しや伝送が可能
である。Because high-quality synthesized speech can be obtained from information compressed to such a small size, reading this information from a storage medium or transmitting it over a communication channel also becomes more efficient. The information can therefore be read or transmitted quickly even when it is stored on a medium with slow access, such as a CD-ROM, or sent over a communication line with a low transfer rate.

【０１３５】このような利点を生かして、音声メッセー
ジの効率の良い記憶方法や提示方法が実現できる。（実施の形態８）続いて、本発明を利用した音声通報シ
ステムの実施の形態について説明する。These advantages make efficient methods of storing and presenting voice messages possible.
(Embodiment 8) Next, an embodiment of a voice notification system using the present invention will be described.

【０１３６】図１８は本実施の形態における音声通報シ
ステムの構成図である。FIG. 18 is a configuration diagram of the voice notification system according to the present embodiment.

【０１３７】本実施の形態の音声通報システムは、複数
のセンサ１８００１、複数のメッセージ情報記憶部１８
００２、複数の通信回線１８００３、集中監視部１８０
０４および音声合成部１８００５からなる。センサ１８
００１およびメッセージ情報記憶部１８００２は例えば
各家庭のガスメータに取り付けられており、集中監視部
１８００４および音声合成部１８００５はガス会社の制
御室などにある。通信回線１８００３は電話回線などを
利用して、各家庭のガスメータとガス会社をつなぐもの
である。The voice notification system of this embodiment comprises a plurality of sensors 18001, a plurality of message information storage units 18002, a plurality of communication lines 18003, a centralized monitoring unit 18004, and a speech synthesis unit 18005. The sensors 18001 and message information storage units 18002 are attached, for example, to the gas meter of each home, and the centralized monitoring unit 18004 and speech synthesis unit 18005 are located, for example, in the gas company's control room. The communication lines 18003 connect each home's gas meter to the gas company, using telephone lines or the like.

【０１３８】メッセージ情報記憶部１８００２には所定
のメッセージの音韻系列情報、音韻タイミング情報、ピ
ッチ情報、振幅情報が記憶されている。以後、これらを
まとめてメッセージ情報と呼ぶ。センサ１８００１はガ
ス漏れなどの事象を関知するとメッセージ情報記憶部１
８００２にメッセージ情報を出力させる。メッセージ情
報は通信回線１８００３を介して集中監視部１８００４
に送られ、集中監視部１８００４はメッセージ情報を用
いて音声合成部１８００５を制御して音声を出力する。
音声合成部１８００５は本発明の実施の形態における音
声合成方法を利用した手段である。The message information storage unit 18002 stores the phoneme sequence information, phoneme timing information, pitch information, and amplitude information of predetermined messages. These are collectively referred to below as message information. When a sensor 18001 detects an event such as a gas leak, it causes the message information storage unit 18002 to output message information. The message information is sent to the centralized monitoring unit 18004 over the communication line 18003, and the centralized monitoring unit 18004 uses it to control the speech synthesis unit 18005, which outputs speech. The speech synthesis unit 18005 is a means that uses the speech synthesis method of the embodiments of the present invention.

【０１３９】この形式の利点は、メッセージ情報記憶部
１８００２に極めて小さい記憶容量で大量の音声メッセ
ージを記憶しておくことができる点である。また、通信
回線１８００３を通じて送られる情報が少なくて済むた
め、回線容量が小さい通信回線でも高速にメッセージ情
報を伝送できる。The advantage of this arrangement is that a large number of voice messages can be stored in the message information storage unit 18002 using very little storage capacity. In addition, because little information needs to be sent over the communication line 18003, message information can be transmitted quickly even over a line with small capacity.

【０１４０】従って、各家庭のガスメータに取り付けら
れたメッセージ情報記憶部１８００２には、ガス漏れな
どの事象を表す情報以外に、氏名、住所など、その家庭
固有の情報を個別に格納しておくことができる。このこ
とにより、ガス会社の制御室には異常の発生場所が適切
に通報され、迅速に対策を講じることを可能とする。ま
た、情報を制御室側に登録して管理するよりも、ガスの
新規契約や契約解除などに伴う変更が容易である。Accordingly, in addition to information representing events such as gas leaks, the message information storage unit 18002 attached to each home's gas meter can hold information specific to that home, such as name and address. The gas company's control room is thus informed accurately of where the abnormality has occurred and can respond quickly. It is also easier to handle changes arising from new gas contracts or cancellations than if the information were registered and managed on the control room side.

【０１４１】なお、本実施の形態ではガスメータとガス
会社を例に取って説明したが、他のあらゆる場面で本シ
ステムを利用することが可能である。（実施の形態９）次に、本発明を利用した音声合成シス
テムの実施の形態について説明する。Although this embodiment has been described using a gas meter and a gas company as an example, the system can be used in any other setting.
(Embodiment 9) Next, an embodiment of a speech synthesis system using the present invention will be described.

【０１４２】図１９は本実施の形態における音声合成シ
ステムの構成図である。FIG. 19 is a configuration diagram of a speech synthesis system according to the present embodiment.

【０１４３】本実施の形態における音声合成システムは
テキスト入力部１９００１、テキスト音韻系列変換部１
９００２、音韻系列記憶部１９００３、音声入力部１９
００４、音声記憶部１９００５、音韻タイミング検出部
１９００６、音韻タイミング記憶部１９００７、ピッチ
分析部１９００８、ピッチ情報記憶部１９００９、振幅
分析部１９０１０、振幅情報記憶部１９０１１、音声合
成部１９０１２からなる。The speech synthesis system of this embodiment comprises a text input unit 19001, a text-to-phoneme-sequence conversion unit 19002, a phoneme sequence storage unit 19003, a speech input unit 19004, a speech storage unit 19005, a phoneme timing detection unit 19006, a phoneme timing storage unit 19007, a pitch analysis unit 19008, a pitch information storage unit 19009, an amplitude analysis unit 19010, an amplitude information storage unit 19011, and a speech synthesis unit 19012.

【０１４４】テキスト入力部１９００１はユーザに対し
てテキスト入力を促し、ユーザはそれに従いこれからし
ゃべろうとする内容を仮名のテキストで入力する。テキ
スト音韻系列変換部１９００２は仮名文字列を音素など
の音韻系列に変換する。音韻系列記憶部は変換された音
韻系列を記憶する。The text input unit 19001 prompts the user to enter text, and the user enters what they are about to say as kana text. The text-to-phoneme-sequence conversion unit 19002 converts the kana string into a sequence of phonological units such as phonemes. The phoneme sequence storage unit 19003 stores the converted sequence.

【０１４５】続いて、音声入力部１９００４がユーザに
対して音声入力を促し、ユーザはそれに従い、先ほど入
力したテキストと同じ内容をしゃべることにより音声を
入力する。音声記憶部１９００５は入力された音声を一
時的に記憶する。音韻タイミング検出部１９００６は、
音声記憶部１９００５に一時的に記憶された音声と音韻
系列記憶部１９００３に記憶された音韻系列を用いて、
音声中の音韻のタイミングを全て検出する。このような
音韻タイミング検出処理はHMMなどの音声認識アルゴリ
ズムを用いて実現されている。検出された音韻タイミン
グ情報は音韻タイミング記憶部１９００７に記憶され
る。Next, the speech input unit 19004 prompts the user for speech input, and the user speaks the same content as the text just entered. The speech storage unit 19005 temporarily stores the input speech. The phoneme timing detection unit 19006 uses the speech temporarily stored in the speech storage unit 19005 and the phoneme sequence stored in the phoneme sequence storage unit 19003 to detect the timing of every phoneme in the speech. Phoneme timing detection of this kind is implemented with speech recognition algorithms such as HMMs. The detected phoneme timing information is stored in the phoneme timing storage unit 19007.

【０１４６】ピッチ分析部１９００８は本発明の音声分
析方法の実施の形態におけるピッチマーク付与方法を用
いて高性能なピッチ分析が実現できる。ピッチ分析部１
９００８は音声記憶部１９００５に一時的に記憶された
音声のピッチを分析する。ピッチ情報記憶部１９００９
は分析されたピッチ情報を記憶する。また、振幅分析部
１９０１０は音声記憶部１９００５に一時的に記憶され
た音声の振幅を分析する。振幅情報記憶部１９０１１は
分析された振幅情報を記憶する。The pitch analysis unit 19008 can achieve high-performance pitch analysis by using the pitch mark assignment method of the embodiments of the speech analysis method of the present invention. It analyzes the pitch of the speech temporarily stored in the speech storage unit 19005, and the pitch information storage unit 19009 stores the analyzed pitch information. Likewise, the amplitude analysis unit 19010 analyzes the amplitude of the speech temporarily stored in the speech storage unit 19005, and the amplitude information storage unit 19011 stores the analyzed amplitude information.

【０１４７】音声合成部１９０１２は本発明の実施の形
態における音声合成方法によるものである。音声合成部
１９０１２は音韻系列記憶部１９００８、音韻タイミン
グ記憶部１９００７、ピッチ情報記憶部１９００９、振
幅情報記憶部１９０１１からそれぞれ音韻系列情報、音
韻タイミング、ピッチ情報、振幅情報を読み出し、それ
らを用いて音声を合成する。The speech synthesis unit 19012 uses the speech synthesis method of the embodiments of the present invention. It reads the phoneme sequence information, phoneme timing, pitch information, and amplitude information from the phoneme sequence storage unit 19003, the phoneme timing storage unit 19007, the pitch information storage unit 19009, and the amplitude information storage unit 19011, respectively, and synthesizes speech from them.

【０１４８】上記の構成により、音声メッセージの以下
のような利用が可能になる。本音声合成システムを例え
ば家庭電化製品に組み込む。組み込む先の例として全自
動洗濯機を取り上げる。なお、組み込みが必要なのは音
韻系列記憶部１９００８、音韻タイミング記憶部１９０
０７、ピッチ情報記憶部１９００９、振幅情報記憶部１
９０１１のみである（図中破線で囲まれた部分）。それ
以外の部分は分析が終了したら取り外して構わない。This configuration allows voice messages to be used as follows. The speech synthesis system is built into, for example, a home appliance; a fully automatic washing machine is taken as the example here. Only the phoneme sequence storage unit 19003, the phoneme timing storage unit 19007, the pitch information storage unit 19009, and the amplitude information storage unit 19011 need to be built in (the part enclosed by the broken line in the figure). The other parts may be removed once the analysis is finished.

【０１４９】全自動洗濯機は衣類と洗剤を投入すると、
後はスイッチを押すだけで洗いとすすぎと脱水が自動的
に行われる。その間、ユーザは別の仕事にかかることが
できる。しかし、脱水が終わると洗濯物を干さなければ
ならないので、通常の全自動洗濯機にはブザーが内蔵さ
れており、脱水の終了をブザー音で告知する機能があ
る。With a fully automatic washing machine, once the clothes and detergent have been put in, washing, rinsing, and spin-drying are carried out automatically at the press of a switch. The user can do other work in the meantime. When spin-drying finishes, however, the laundry has to be hung out to dry, so an ordinary fully automatic washing machine has a built-in buzzer that announces the end of spin-drying.

【０１５０】しかし、最近は多くの家庭電化製品が同様
の機能を有するため、ブザー音が聞こえてもユーザにと
って何の告知かがわかりにくいという問題がある。However, recently, since many home appliances have the same function, there is a problem that it is difficult for the user to recognize what is notified even if the buzzer is heard.

【０１５１】この問題に対し、本音声合成システムを用
いることにより、あらかじめユーザが自分で全自動洗濯
機にしゃべらせたい内容を自分の声を使って登録するこ
とができる。すなわち、脱水の終了を「脱水が終わりま
した」や「洗濯が終了しました」などのように、ユーザ
の好みの内容でしゃべらせることができる。To address this problem, the present speech synthesis system lets the user register in advance, in their own voice, what they want the washing machine to say. The end of spin-drying can then be announced in whatever words the user prefers, such as 「脱水が終わりました」 ("Spin-drying has finished") or 「洗濯が終了しました」 ("The washing is done").

【０１５２】本システムはユーザが登録時にしゃべった
内容を、登録時と同じ内容とイントネーションで再現す
るものである。従って、しゃべらせたい内容の抑揚をユ
ーザが好みに応じて自由に変えることができ、利用目的
に応じて多彩な応用が可能となる。This system reproduces what the user said at registration with the same content and intonation. The user can therefore freely vary the intonation of the message to taste, which allows a wide variety of applications depending on the purpose.

【０１５３】ところで、自分の声を録音して再生すると
普段聞いている自分の声と違って聞こえるために、これ
を嫌うユーザは多い。これに対し、本システムはイント
ネーションのみが自分のしゃべり方になるだけで、声の
質は音声素片によって決定される。従って、自分がしゃ
べった音声がプロのナレーターなどの声質に変換され
る。このことにより、ユーザが自分の声を自分で聞くこ
とに対する抵抗を軽減でき、さらにプロの音声に変換さ
れることによる喜びを味わうことができる。Incidentally, many users dislike recording and playing back their own voice because it sounds different from the voice they usually hear. In this system, by contrast, only the intonation follows the user's way of speaking; the voice quality is determined by the speech units. The user's utterance is therefore converted into the voice quality of, for example, a professional narrator. This reduces users' reluctance to hear their own voice and also lets them enjoy hearing it converted into a professional voice.

【０１５４】なお、本実施の形態では家庭内の全自動洗
濯機を例にとって説明したが、他のあらゆる場面、あら
ゆる機器に対して本システムを利用することができる。
ところで、以上述べてきた各実施の形態のいずれか一
つの実施の形態に記載の各手段の全部又は一部の手段の
機能や処理をコンピュータに実行させるためのプログラ
ムを磁気記録媒体や光記録媒体などに記録した媒体を作
成し、これを用いて上記と同様の動作を実行してももち
ろん良い。In the present embodiment, a fully automatic washing machine at home has been described as an example. However, the present system can be used for any other occasions and any appliances.
Incidentally, a medium may of course be created by recording, on a magnetic recording medium, an optical recording medium, or the like, a program that causes a computer to execute the functions or processing of all or some of the means described in any one of the embodiments above, and the same operations as described above may be performed using that medium.

【０１５５】以上説明したように、本発明によるピッチ
マーク付与方法は、１）既知のアルゴリズムの応用によ
り実現可能、２）ピッチ周期に対応した確実なピッチマ
ークが付与可能、３）ざらつきのないなめらかな合成音
が得られる、という利点がある。As described above, the pitch mark assignment method according to the present invention has the following advantages: 1) it can be realized by applying known algorithms; 2) reliable pitch marks corresponding to the pitch period can be assigned; and 3) a smooth synthesized sound free of roughness is obtained.

【０１５６】また、本発明による音声合成方法は、１）
自然音声に含まれる自然なピッチパターンを詳細に再現
した自然性の極めて高い合成音が得られる、２）録音音
声と合成音声の接続部において極めてなめらかな変化を
持った違和感の少ない接続が可能、３）定型部と不定形
部の音質の差がないメッセージ提供が可能、４）定型部
音声の蓄積を従来の録音方式に比べ格段に少ない記憶容
量で実現可能、という利点がある。Further, the speech synthesis method according to the present invention has the following advantages: 1) a highly natural synthesized sound is obtained that reproduces in detail the natural pitch pattern contained in natural speech; 2) recorded speech and synthesized speech can be joined with a very smooth transition and little sense of incongruity at the junction; 3) messages can be provided with no difference in sound quality between the fixed part and the variable part; and 4) fixed-part speech can be stored in far less storage capacity than with the conventional recording method.

【０１５７】なお、上記の説明では定型部と不定形部の
組み合わせによるメッセージ提供方法を例にとって説明
したが、本実施の形態を定型部のみのメッセージ提供に
用いてももちろん構わない。Although the above description took as an example a method of providing messages by combining fixed parts and variable parts, the present embodiment may of course also be used to provide messages consisting of fixed parts only.

【０１５８】[0158]

【発明の効果】以上述べたところから明らかなように本
発明は、比較的簡単な方法で従来に比べてより適切に音
声分析が可能であり、例えばピッチマークがより適切に
付与できるという長所を有する。As is apparent from the above description, the present invention has the advantage that speech analysis can be performed more appropriately than before by a relatively simple method; for example, pitch marks can be assigned more appropriately.

【０１５９】また、本発明は、従来に比べて自然性が高
く、録音音声との接続部においても違和感の少ない音声
が合成できるという長所を有する。The present invention also has the advantage that it can synthesize speech that is more natural than before and causes little sense of incongruity even at the junction with recorded speech.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明の音声分析方法にかかるピッチマーク付
与方法の第１の実施の形態の構成図FIG. 1 is a configuration diagram of a first embodiment of a pitch mark adding method according to a voice analysis method of the present invention.

【図２】本発明の音声分析方法にかかるピッチマーク付
与方法の第２の実施の形態の構成図FIG. 2 is a configuration diagram of a second embodiment of a pitch mark adding method according to the voice analysis method of the present invention.

【図３】本発明の音声分析方法にかかるピッチマーク付
与方法の第３の実施の形態の構成図FIG. 3 is a configuration diagram of a third embodiment of a pitch mark adding method according to the voice analysis method of the present invention.

【図４】本発明の音声分析方法にかかるピッチマーク付
与方法の第４の実施の形態の構成図FIG. 4 is a configuration diagram of a fourth embodiment of a pitch mark adding method according to the voice analysis method of the present invention.

【図５】（ａ）：本実施の形態の音声波形の例を示す図（ｂ）：本実施の形態の基本波の例を示す図 FIG. 5(a) is a diagram showing an example of a speech waveform according to the present embodiment, and FIG. 5(b) is a diagram showing an example of a fundamental wave according to the present embodiment.

【図６】図１のピーク検出部１００４の動作の一例の説
明図FIG. 6 is an explanatory diagram of an example of the operation of the peak detection unit 1004 in FIG. 1.

【図７】図１のピーク検出部１００４の別の動作の一例
の説明図FIG. 7 is a diagram illustrating an example of another operation of the peak detection unit 1004 in FIG. 1;

【図８】差分基本波のゼロクロス付近での補間の説明図FIG. 8 is an explanatory diagram of interpolation of a differential fundamental wave near a zero cross.

【図９】音声波形と基本波の時間的対応の説明図FIG. 9 is an explanatory diagram of a temporal correspondence between a speech waveform and a fundamental wave.

【図１０】図２のチャンネルＣおよびチャンネルＤの出
力を示す図FIG. 10 is a diagram showing outputs of channels C and D in FIG. 2;

【図１１】図１のチャンネル選択部２００３が選択した
結果のピッチ周波数を示す図FIG. 11 is a view showing a pitch frequency as a result of selection by the channel selection unit 2003 in FIG. 1;

【図１２】本発明の音声合成方法の一実施の形態の構成
図FIG. 12 is a configuration diagram of an embodiment of a speech synthesis method according to the present invention.

【図１３】図１２の実施の形態の動作の流れ図FIG. 13 is a flowchart of the operation of the embodiment in FIG. 12;

【図１４】補間を行いながらピッチ波形を切り出す様子
を示した説明図FIG. 14 is an explanatory diagram showing a state in which a pitch waveform is cut out while performing interpolation.

【図１５】本発明の音声合成方法の別の実施の形態の構
成図FIG. 15 is a configuration diagram of another embodiment of the speech synthesis method of the present invention.

【図１６】図１５における混合部１５００３の二つの入
力端子のゲインの変化を示した説明図FIG. 16 is an explanatory diagram showing the change in gain at the two input terminals of the mixing unit 15003 in FIG. 15.

【図１７】本発明の音声合成方法の更に別の実施の形態
の構成図FIG. 17 is a configuration diagram of still another embodiment of the speech synthesis method of the present invention.

【図１８】本発明の音声通報システムの実施の形態の構
成図FIG. 18 is a configuration diagram of an embodiment of a voice notification system of the present invention.

【図１９】本発明の音声合成システムの実施の形態の構
成図FIG. 19 is a configuration diagram of an embodiment of a speech synthesis system of the present invention.

【符号の説明】[Explanation of symbols]

１００１波形記憶部１００２ピッチ分析部１００３適応型低域フィルタ１００４ピーク検出部１００５極性判定部２００１−ａ〜２００１−ｄ固定型低域フィルタ２００２−ａ〜２００２−ｄピーク検出部２００３チャンネル選択部３００１波形記憶部３００２−ａ〜３００２−ｄ固定型低域フィルタ３００３−ａ〜３００３−ｄピーク検出部３００４チャンネル選択部３００５適応型低域フィルタ３００６ピーク検出部３００７極性判定部４００１波形記憶部４００２−ａ〜４００２−ｄ固定型低域フィルタ４００３−ａ〜４００３−ｄピーク検出部４００４チャンネル選択部４００５適応型低域フィルタ４００６ピーク検出部４００７ピッチマーク照合部４００８極性判定部１２００１ピッチマーク記憶部１２００２振幅情報記憶部１２００３音韻境界記憶部１２００４音韻種別記憶部１２００５ピッチ波形記憶部１２００６ピッチ波形重畳部１２００７制御部１５００１定型メッセージ生成部１５００２合成メッセージ生成部１５００３混合部１２００１−１〜１２００１−Ｎピッチマーク記憶部１２００２−１〜１２００２−Ｎ振幅情報記憶部１２００３−１〜１２００３−Ｎ音韻境界記憶部１２００４−１〜１２００４−Ｎ音韻種別記憶部１７００７制御部１８００１−ａ〜ｄセンサ１８００２−ａ〜ｄメッセージ情報記憶部１８００３−ａ〜ｄ通信回線１８００４集中監視部１８００５音声合成部１９００１テキスト入力部１９００２テキスト音韻系列変換部１９００３音韻系列記憶部１９００４音声入力部１９００５音声記憶部１９００６音韻タイミング検出部１９００７音韻タイミング記憶部１９００８ピッチ分析部１９００９ピッチ情報記憶部１９０１０振幅分析部１９０１１振幅情報記憶部１９０１２音声合成部
1001 Waveform storage unit
1002 Pitch analysis unit
1003 Adaptive low-pass filter
1004 Peak detection unit
1005 Polarity determination unit
2001-a to 2001-d Fixed low-pass filters
2002-a to 2002-d Peak detection units
2003 Channel selection unit
3001 Waveform storage unit
3002-a to 3002-d Fixed low-pass filters
3003-a to 3003-d Peak detection units
3004 Channel selection unit
3005 Adaptive low-pass filter
3006 Peak detection unit
3007 Polarity determination unit
4001 Waveform storage unit
4002-a to 4002-d Fixed low-pass filters
4003-a to 4003-d Peak detection units
4004 Channel selection unit
4005 Adaptive low-pass filter
4006 Peak detection unit
4007 Pitch mark collation unit
4008 Polarity determination unit
12001 Pitch mark storage unit
12002 Amplitude information storage unit
12003 Phoneme boundary storage unit
12004 Phoneme type storage unit
12005 Pitch waveform storage unit
12006 Pitch waveform superposition unit
12007 Control unit
15001 Fixed message generation unit
15002 Synthesized message generation unit
15003 Mixing unit
12001-1 to 12001-N Pitch mark storage units
12002-1 to 12002-N Amplitude information storage units
12003-1 to 12003-N Phoneme boundary storage units
12004-1 to 12004-N Phoneme type storage units
17007 Control unit
18001-a to d Sensors
18002-a to d Message information storage units
18003-a to d Communication lines
18004 Centralized monitoring unit
18005 Speech synthesis unit
19001 Text input unit
19002 Text-to-phoneme-sequence conversion unit
19003 Phoneme sequence storage unit
19004 Speech input unit
19005 Speech storage unit
19006 Phoneme timing detection unit
19007 Phoneme timing storage unit
19008 Pitch analysis unit
19009 Pitch information storage unit
19010 Amplitude analysis unit
19011 Amplitude information storage unit
19012 Speech synthesis unit

Claims

【特許請求の範囲】[Claims]

【請求項１】音声波形を記憶する音声波形記憶手段
と、ピッチを分析するピッチ分析手段と、適応型フィル
タと、ピークを検出するピーク検出手段と、を用いて音
声波形のピッチ周期に対応する時間的基準位置であるピ
ッチマーク情報を生成する音声分析方法であって、前記音声波形記憶手段を用いて前記音声波形の一部を一
時的に記憶し、前記ピッチ分析手段を用いて前記一時的に記憶された音
声波形の大まかなピッチ情報を生成し、前記適応型フィルタへ前記一時的に記憶された音声波形
を入力させ、前記大まかなピッチ情報に基づいて、前記
適応型フィルタの遮断周波数あるいは中心周波数を変化
させることによって、その入力された音声波形から基本
波のみを通過させ、前記ピーク検出手段を用いて前記基本波における片側の
複数の極大点を検出することにより、音声波形全体に対
する一連の正確なピッチマーク情報を生成することを特
徴とする音声分析方法。1. A speech analysis method for generating pitch mark information, which is a temporal reference position corresponding to the pitch period of a speech waveform, using speech waveform storage means for storing a speech waveform, pitch analysis means for analyzing pitch, an adaptive filter, and peak detection means for detecting peaks, the method comprising: temporarily storing a part of the speech waveform using the speech waveform storage means; generating rough pitch information of the temporarily stored speech waveform using the pitch analysis means; inputting the temporarily stored speech waveform to the adaptive filter and changing the cutoff frequency or center frequency of the adaptive filter based on the rough pitch information so that only the fundamental wave of the input speech waveform passes; and detecting a plurality of local maxima on one side of the fundamental wave using the peak detection means, thereby generating a series of accurate pitch mark information for the entire speech waveform.

【請求項２】固定型低域フィルタ及び、ピークを検出
するピーク検出手段を有するピーク検出チャンネルの複
数組と、チャンネルを選択するためのチャンネル選択手
段とを用いて、音声波形のピッチ周期に対応する時間的
基準位置であるピッチマーク情報を生成する音声分析方
法であって、前記複数の固定型低域フィルタのそれぞれの遮断周波数
は、それら複数の固定型低域フィルタのうちの少なくと
も一つの固定型低域フィルタが、入力されてくる音声波
形の基本波のみを通過させるように設定されており、前記それぞれの固定型低域フィルタを用いて、入力され
た音声の所定の周波数以下の成分である低域成分波形を
出力し、前記ピーク検出手段を用いて、前記固定型低域フィルタ
から出力された前記低域成分波形から片側の複数の極大
点を検出してピーク情報として出力し、前記チャンネル選択手段により、前記複数のピーク検出
チャンネルから出力されたピーク情報の全部又は一部を
利用して、所定の時間間隔ごとに所定の選択基準に基づ
いて、ピーク検出チャンネルを選択し、前記選択されたピーク検出チャンネルから出力されたピ
ーク情報を利用して、音声波形全体に対する一連のピッ
チマーク情報を生成することを特徴とする音声分析方
法。2. A speech analysis method for generating pitch mark information, which is a temporal reference position corresponding to the pitch period of a speech waveform, using a plurality of peak detection channels, each having a fixed low-pass filter and peak detection means for detecting peaks, and channel selection means for selecting a channel, wherein: the cutoff frequencies of the plurality of fixed low-pass filters are set so that at least one of the fixed low-pass filters passes only the fundamental wave of the input speech waveform; each fixed low-pass filter outputs a low-frequency component waveform consisting of the components of the input speech at or below a predetermined frequency; each peak detection means detects a plurality of local maxima on one side of the low-frequency component waveform output from the corresponding fixed low-pass filter and outputs them as peak information; the channel selection means selects a peak detection channel at predetermined time intervals according to a predetermined selection criterion, using all or part of the peak information output from the plurality of peak detection channels; and a series of pitch mark information for the entire speech waveform is generated using the peak information output from the selected peak detection channel.

【請求項３】請求項１、又は２記載の音声分析方法に
よって得られたピッチマーク情報に基づいて、前記音声
波形にピッチマークを付与することを特徴とする音声分
析方法。3. A speech analysis method, wherein pitch marks are assigned to the speech waveform based on the pitch mark information obtained by the speech analysis method according to claim 1 or 2.

【請求項４】請求項１、又は２記載の音声分析方法に
よって得られたピッチマーク情報を利用して、ピッチ周
波数を得ることを特徴とする音声分析方法。4. A speech analysis method, wherein a pitch frequency is obtained using the pitch mark information obtained by the speech analysis method according to claim 1 or 2.

【請求項５】請求項１、又は２記載の音声分析方法に
よって得られたピッチマーク情報を仮のピッチマークと
し、所定の単位時間毎にその直前及び直後に存在する前
記仮のピッチマークの間隔を用いて、前記ピッチ周波数
を計算することを特徴とする請求項４記載の音声分析方
法。5. The speech analysis method according to claim 4, wherein the pitch mark information obtained by the speech analysis method according to claim 1 or 2 is treated as provisional pitch marks, and the pitch frequency is calculated for every predetermined unit time from the interval between the provisional pitch marks located immediately before and immediately after that time.
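As an illustrative sketch only (not part of the claims), the calculation in claim 5 might look as follows in Python. The function name `pitch_frequency_track`, the use of seconds for mark positions, and returning `None` where a frame lacks a pitch mark on one side are all assumptions made for this example.

```python
from bisect import bisect_right

def pitch_frequency_track(marks, frame_times):
    """Estimate the pitch frequency at each frame time.

    marks: provisional pitch mark times in seconds, sorted ascending.
    frame_times: times in seconds at which a pitch frequency is wanted.
    Returns one frequency in Hz (or None) per frame time.
    """
    track = []
    for t in frame_times:
        i = bisect_right(marks, t)  # marks[i - 1] <= t < marks[i]
        if i == 0 or i == len(marks):
            track.append(None)      # no mark on one side of t
        else:
            # Reciprocal of the interval between the marks just before and after t.
            track.append(1.0 / (marks[i] - marks[i - 1]))
    return track
```

A frame that falls between marks 10 ms apart is reported as roughly 100 Hz; frames outside the marked region are left undefined rather than extrapolated.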

【請求項６】前記複数の固定型低域フィルタは遮断周
波数が互いに１：２の関係になるように設定されたこと
を特徴とする請求項２記載の音声分析方法。6. The speech analysis method according to claim 2, wherein the cutoff frequencies of the plurality of fixed low-pass filters are set in a ratio of 1:2 to one another.

【請求項７】前記選択基準に基づいてピーク検出チャ
ンネルを選択しとは、それぞれの前記ピーク検出手段か
ら出力されるピーク情報から得られる、所定のピークと
その所定のピークに隣接するピークとの時間的間隔か
ら、前記所定のピーク位置における仮のピッチ周波数
を求め、前記仮のピッチ周波数の所定単位時間内での変化率が最
小であるピーク検出チャンネルを選択することを特徴と
する請求項２に記載の音声分析方法。7. The speech analysis method according to claim 2, wherein selecting a peak detection channel according to the selection criterion comprises: obtaining, from the peak information output from each peak detection means, a provisional pitch frequency at a given peak position from the time interval between that peak and an adjacent peak; and selecting the peak detection channel for which the rate of change of the provisional pitch frequency within a predetermined unit time is smallest.
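The selection criterion of claim 7 could be sketched roughly as below. This is a hypothetical illustration: the claim does not fix how the "rate of change" is measured, so the largest relative change between successive provisional pitch frequencies is assumed here, and the names `provisional_f0`, `change_rate`, and `select_channel` are invented for the example.

```python
def provisional_f0(peaks):
    """Provisional pitch frequency (Hz) from each peak to the next peak.

    peaks: peak times in seconds, sorted ascending.
    """
    return [1.0 / (b - a) for a, b in zip(peaks, peaks[1:])]

def change_rate(peaks):
    """Largest relative change between successive provisional frequencies.

    This particular measure is an assumption made for the sketch.
    """
    f0 = provisional_f0(peaks)
    if len(f0) < 2:
        return float("inf")  # too few peaks to judge this channel
    return max(abs(b - a) / a for a, b in zip(f0, f0[1:]))

def select_channel(channels):
    """channels: dict mapping a channel name to its peak times in one unit time.

    Returns the name of the channel whose provisional pitch varies least.
    """
    return min(channels, key=lambda name: change_rate(channels[name]))
```

A channel whose filter passes only the fundamental yields evenly spaced peaks and therefore a small change rate, whereas a channel that also passes harmonics tends to produce irregular spacing.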

【請求項８】前記選択基準に基づいてピーク検出チャ
ンネルを選択しとは、それぞれの前記ピーク検出手段か
ら出力されるピーク情報から得られる、所定のピークと
その所定のピークに隣接するピークとの時間的間隔か
ら、前記所定のピーク位置における仮のピッチ周波
数を求め、横軸にピーク位置、縦軸に仮のピッチ周波数を取る座標
系に、所定の時間範囲内に含まれる複数のピーク位置と
そのピーク位置に対応する前記仮のピッチ周波数を点と
して表したとき、それらの点をピーク位置の順に結んだ複数の直線の傾き
の分散が最小であるピーク検出チャンネルを選択するこ
とを特徴とする請求項２に記載の音声分析方法。8. The speech analysis method according to claim 2, wherein selecting a peak detection channel according to the selection criterion comprises: obtaining, from the peak information output from each peak detection means, a provisional pitch frequency at a given peak position from the time interval between that peak and an adjacent peak; and, when the peak positions within a predetermined time range and the corresponding provisional pitch frequencies are plotted as points in a coordinate system with peak position on the horizontal axis and provisional pitch frequency on the vertical axis, selecting the peak detection channel for which the variance of the slopes of the straight lines connecting those points in order of peak position is smallest.
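The slope-variance criterion of claim 8 can be illustrated with the following sketch. It assumes that each provisional pitch frequency is attached to the earlier of the two peaks that define it and that the population variance is used; the function names are hypothetical.

```python
from statistics import pvariance

def slope_variance(peaks):
    """Variance of the slopes of the polyline through (peak time, provisional f0).

    peaks: peak times in seconds within the time range, sorted ascending.
    """
    f0 = [1.0 / (b - a) for a, b in zip(peaks, peaks[1:])]
    points = list(zip(peaks, f0))  # f0[i] is attached to peaks[i] (assumption)
    slopes = [(f2 - f1) / (t2 - t1)
              for (t1, f1), (t2, f2) in zip(points, points[1:])]
    if len(slopes) < 2:
        return float("inf")        # too few points to compute a variance
    return pvariance(slopes)

def select_channel_by_slope(channels):
    """Return the channel name whose slope variance is smallest."""
    return min(channels, key=lambda name: slope_variance(channels[name]))
```

Unlike a plain change-rate measure, this criterion tolerates a pitch that rises or falls steadily, since a constant slope still has near-zero variance.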

【請求項９】前記ピーク検出手段は、前記低域成分波形又は前記基本波形の振幅が、一定又は
所定の単位時間毎に変化するしきい値を、越えた各部分
において、前記振幅の正または負方向の極大点を検出
することを特徴とする請求項１又は２に記載の音声分析
方法。9. The speech analysis method according to claim 1 or 2, wherein the peak detection means detects a local maximum of the amplitude in the positive or negative direction in each portion where the amplitude of the low-frequency component waveform or the fundamental waveform exceeds a threshold that is either constant or changed every predetermined unit time.

【請求項１０】前記ピーク検出手段は、前記基本波の差分をとった差分基本波の値が正から負ま
たは負から正に変化する位置を極大点とすることを特徴
とする請求項１又は２に記載の音声分析方法。10. The speech analysis method according to claim 1 or 2, wherein the peak detection means takes as a local maximum each position where the value of a difference fundamental wave, obtained by differencing the fundamental wave, changes from positive to negative or from negative to positive.

【請求項１１】前記ピーク検出手段は、前記基本波の
差分をとった差分基本波の値が正から負または負から正
に変化する点の前後の値から、直線補間により推定され
た０交差位置を極大点とする請求項１又は２のいずれか
一つに記載の音声分析方法。11. The speech analysis method according to claim 1 or 2, wherein the peak detection means takes as a local maximum the zero-crossing position estimated by linear interpolation from the values before and after the point at which the value of a difference fundamental wave, obtained by differencing the fundamental wave, changes from positive to negative or from negative to positive.
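A minimal sketch of the interpolated peak detection in claims 10 and 11 is given below. It handles only positive-going peaks (difference changing from positive to negative); the function name and the convention that the difference `x[n + 1] - x[n]` approximates the slope at sample position `n + 0.5` are assumptions of this example.

```python
def interpolated_peaks(x):
    """Fractional sample positions of the positive peaks of a fundamental wave.

    x: samples of the fundamental wave (output of the low-pass filter).
    """
    # Difference fundamental wave; d[n] is treated as the slope at n + 0.5.
    d = [x[n + 1] - x[n] for n in range(len(x) - 1)]
    peaks = []
    for n in range(len(d) - 1):
        if d[n] > 0 and d[n + 1] <= 0:
            # Zero of the straight line through (n + 0.5, d[n]) and (n + 1.5, d[n + 1]).
            frac = d[n] / (d[n] - d[n + 1])
            peaks.append(n + 0.5 + frac)
    return peaks
```

For a 110 Hz sinusoid sampled at 8 kHz, whose first peak lies at about sample 18.18, the estimate falls between samples rather than being rounded to the nearest one, which is what allows the pitch marks to be placed with sub-sample precision.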

【請求項１２】前記適応型フィルタはあらゆる周波数
に対して実質的に遅延量が０であることを特徴とする請
求項１に記載の音声分析方法。12. The speech analysis method according to claim 1, wherein the adaptive filter has a delay amount of substantially zero for all frequencies.

【請求項１３】前記固定型低域フィルタはあらゆる周
波数に対して実質的に遅延量が０であることを特徴とす
る請求項２に記載の音声分析方法。13. The speech analysis method according to claim 2, wherein the fixed low-pass filter has a delay amount of substantially zero for all frequencies.

【請求項１４】ピッチマーク照合手段により、一旦作
成された前記一連のピッチマーク情報に含まれる一つ一
つのピッチマークの互いの間隔を一定に保ったまま前後
にシフトすることによって複数のピッチマーク情報の候
補を作成し、前記ピッチマーク情報の候補に含まれる一つ一つのピッ
チマークが表す位置における音声波形の値を前記音声波
形記憶部から読み取り、前記読み取られた値を総合してピーク一致度を計算し、
前記ピーク一致度が最大となるようなピッチマーク候補
を選択することを特徴とする請求項１に記載の音声分析
方法。14. The speech analysis method according to claim 1, wherein pitch mark collation means: creates a plurality of pitch mark information candidates by shifting the series of pitch mark information, once created, forward and backward while keeping the intervals between the individual pitch marks constant; reads from the speech waveform storage unit the value of the speech waveform at the position indicated by each pitch mark in each candidate; computes a peak coincidence degree by combining the values read; and selects the pitch mark candidate for which the peak coincidence degree is largest.

【請求項１５】前記ピーク一致度は前記読み取られた
値の合計値であることを特徴とする請求項１４記載の音
声分析方法。15. The speech analysis method according to claim 14, wherein the degree of peak coincidence is a total value of the read values.
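The collation step of claims 14 and 15 might be sketched as follows, under the assumptions that pitch marks are integer sample indices, that candidate shifts are limited to a hypothetical `max_shift` samples in either direction, and that candidates running off the waveform are skipped.

```python
def best_shift(waveform, marks, max_shift):
    """Shift all pitch marks together and keep the shift that best hits the peaks.

    waveform: speech samples; marks: sorted sample indices of the pitch marks.
    Returns (shift, shifted_marks) for the candidate whose summed waveform
    values (the peak coincidence degree of claim 15) is largest.
    """
    best_s, best_score = None, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        shifted = [m + s for m in marks]  # intervals between marks are preserved
        if shifted[0] < 0 or shifted[-1] >= len(waveform):
            continue                      # candidate falls outside the waveform
        score = sum(waveform[m] for m in shifted)
        if score > best_score:
            best_s, best_score = s, score
    if best_s is None:
        return None, list(marks)
    return best_s, [m + best_s for m in marks]
```

Because the fundamental wave and the original waveform are not perfectly in phase, marks found on the fundamental can sit slightly off the true waveform peaks; this search realigns them without disturbing the measured pitch periods.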

【請求項１６】ピッチマーク照合手段により、一旦作
成された前記一連のピッチマーク情報に含まれる一つ一
つのピッチマークの互いの間隔を一定に保ったまま前後
にシフトすることによって複数のピッチマーク情報の候
補を作成し、前記ピッチマーク情報の候補に含まれる一つ一つのピッ
チマークが表す位置における音声波形の値を前記音声波
形記憶部から読み取り、前記読み取られた値を総合してピーク一致度を計算し、
前記ピーク一致度が最大となるようなピッチマーク候補
を選択することを特徴とする請求項２に記載の音声分析
方法。16. The speech analysis method according to claim 2, wherein pitch mark collation means: creates a plurality of pitch mark information candidates by shifting the series of pitch mark information, once created, forward and backward while keeping the intervals between the individual pitch marks constant; reads from the speech waveform storage unit the value of the speech waveform at the position indicated by each pitch mark in each candidate; computes a peak coincidence degree by combining the values read; and selects the pitch mark candidate for which the peak coincidence degree is largest.

【請求項１７】前記ピーク一致度は前記読み取られた
値の合計値であることを特徴とする請求項１６記載の音
声分析方法。17. The voice analysis method according to claim 16, wherein said peak coincidence is a total value of said read values.

【請求項１８】あらかじめ録音された音声波形である
目的音声波形を分析して音韻系列情報、音韻タイミング
情報、ピッチ情報、振幅情報を作成しておき、前記音韻系列情報、音韻タイミング情報、ピッチ情報、
振幅情報に基づいて音声を合成する音声合成方法であっ
て、前記音韻系列情報は前記目的音声波形に含まれる音韻の
種別とその出現順序を保持し、前記ピッチ情報は前記目的音声波形の所定のタイミング
ごとのピッチに関する情報を保持し、前記振幅情報は前記目的音声波形の所定のタイミングご
との振幅に関する情報を保持することを特徴とする音声
合成方法。18. A speech synthesis method in which a target speech waveform, which is a speech waveform recorded in advance, is analyzed to create phoneme sequence information, phoneme timing information, pitch information, and amplitude information, and speech is synthesized based on the phoneme sequence information, phoneme timing information, pitch information, and amplitude information, wherein: the phoneme sequence information holds the types of the phonemes contained in the target speech waveform and their order of appearance; the pitch information holds information on the pitch of the target speech waveform at each predetermined timing; and the amplitude information holds information on the amplitude of the target speech waveform at each predetermined timing.

【請求項１９】前記音韻系列情報は、前記目的音声波
形の内容を音素の並びで表したものであることを特徴と
する請求項１８記載の音声合成方法。19. The speech synthesis method according to claim 18, wherein the phoneme sequence information represents the contents of the target speech waveform in a sequence of phonemes.

【請求項２０】前記素片音声波形に、ピッチマークを
付与しておき、前記素片音声波形から、前記ピッチマー
クを時間的基準位置として、所定の関数を用いて切り出
したピッチ波形を、所定の時間間隔でずらして重ね合わ
せて任意のピッチの音声を合成する際、前記所定の時間間隔は、前記ピッチ情報に基づいて決定
され、前記振幅情報に基づいて前記ピッチ波形の振幅を制御す
ることを特徴とする請求項１８に記載の音声合成方法。20. The speech synthesis method according to claim 18, wherein pitch marks are assigned to the unit speech waveform in advance and, when speech of an arbitrary pitch is synthesized by overlapping, at predetermined time intervals, pitch waveforms cut out of the unit speech waveform with a predetermined function using the pitch marks as temporal reference positions, the predetermined time intervals are determined based on the pitch information and the amplitude of the pitch waveforms is controlled based on the amplitude information.

【請求項２１】前記ピッチ情報は、前記目的音声波形
に付与されたピッチマークであり、前記所定の時間間隔が前記ピッチ情報に基づいて決定さ
れるとは、前記ピッチ波形を前記ピッチマークと同一の
タイミングで配置することを意味することを特徴とする
請求項２０に記載の音声合成方法。21. The speech synthesis method according to claim 20, wherein the pitch information is the pitch marks assigned to the target speech waveform, and determining the predetermined time intervals based on the pitch information means placing the pitch waveforms at the same timings as those pitch marks.
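The cutting and placement of pitch waveforms described in claims 20 and 21 can be illustrated as below. This is a sketch under stated assumptions: a Hann window stands in for the unspecified "predetermined function", marks are integer sample indices, `half_width` must be at least 1, and both function names are hypothetical.

```python
import math

def cut_pitch_waveform(unit, mark, half_width):
    """Cut 2 * half_width + 1 samples centred on a pitch mark with a Hann window."""
    n = 2 * half_width + 1
    out = []
    for i in range(n):
        k = mark - half_width + i
        sample = unit[k] if 0 <= k < len(unit) else 0.0  # zero outside the unit
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1))
        out.append(sample * window)
    return out

def overlap_add(pitch_waveforms, target_marks, length):
    """Place each pitch waveform so its centre lands on a target pitch mark."""
    out = [0.0] * length
    for pw, mark in zip(pitch_waveforms, target_marks):
        half = len(pw) // 2
        for i, v in enumerate(pw):
            k = mark - half + i
            if 0 <= k < length:
                out[k] += v  # overlapping regions are summed
    return out
```

Passing the pitch marks of the target speech as `target_marks` reproduces its pitch pattern exactly, while the spectral envelope, and hence the voice quality, comes from the unit speech waveform.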

【請求項２２】前記振幅情報は、前記目的音声波形に
付与されたピッチマークの一つ一つが表す位置の近傍に
おける前記目的音声波形の振幅の代表値であることを特
徴とする請求項２１に記載の音声合成方法。22. The speech synthesis method according to claim 21, wherein the amplitude information is a representative value of the amplitude of the target speech waveform in the vicinity of the position indicated by each pitch mark assigned to the target speech waveform.

【請求項２３】前記振幅情報は、前記目的音声波形に
付与されたピッチマーク近傍での振幅の絶対値の最大値
であり、前記各ピッチ波形の振幅の絶対値の最大値が前記振幅情
報と等しくなるように制御することを特徴とする請求項
２２に記載の音声合成方法。23. The speech synthesis method according to claim 22, wherein the amplitude information is the maximum absolute value of the amplitude in the vicinity of a pitch mark assigned to the target speech waveform, and the maximum absolute value of the amplitude of each pitch waveform is controlled to equal the amplitude information.
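The amplitude control of claim 23 amounts to a per-waveform gain. A minimal sketch follows, with a hypothetical function name and the assumption that an all-zero pitch waveform is returned unchanged.

```python
def scale_to_peak(pitch_waveform, target_peak):
    """Scale a pitch waveform so its largest absolute sample equals target_peak.

    target_peak: maximum absolute amplitude measured near the corresponding
    pitch mark of the target speech waveform.
    """
    peak = max(abs(v) for v in pitch_waveform)
    if peak == 0.0:
        return list(pitch_waveform)  # silent waveform: nothing to scale
    gain = target_peak / peak
    return [v * gain for v in pitch_waveform]
```

Applying this to every pitch waveform before it is overlapped makes the synthesized speech follow the loudness contour of the recorded target.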

【請求項２４】前記振幅情報は、前記目的音声波形に
付与されたピッチマーク近傍での片側方向の振幅の最大
値であり、前記各ピッチ波形の振幅の片側方向の最大値が前記振幅
情報と等しくなるように制御することを特徴とする請求
項２２に記載の音声合成方法。24. The speech synthesis method according to claim 22, wherein the amplitude information is the maximum value of the amplitude in one direction in the vicinity of a pitch mark assigned to the target speech waveform, and the maximum value of the amplitude of each pitch waveform in that direction is controlled to equal the amplitude information.

【請求項２５】前記振幅情報は、前記目的音声波形に
付与されたピッチマーク近傍での短時間パワーであり、前記各ピッチ波形の振幅の短時間パワーを前記振幅情報
と等しくなるように制御することを特徴とする請求項２
２に記載の音声合成方法。25. The speech synthesis method according to claim 22, wherein the amplitude information is the short-time power in the vicinity of a pitch mark assigned to the target speech waveform, and the short-time power of each pitch waveform is controlled to equal the amplitude information.

【請求項２６】前記ピッチ情報は、前記目的音声波形
に付与されたピッチマーク情報を所定のタイミングごと
のピッチ情報に変換したものであることを特徴とする請
求項１９に記載の音声合成方法。26. The voice synthesizing method according to claim 19, wherein the pitch information is obtained by converting pitch mark information given to the target voice waveform into pitch information for each predetermined timing.

【請求項２７】前記所定のタイミングとは、前記音韻
系列情報に含まれる有声音素に対応する区間を所定の個
数に区切ったタイミングであることを特徴とする請求項
２６に記載の音声合成方法。27. The speech synthesis method according to claim 26, wherein the predetermined timings are the timings obtained by dividing each section corresponding to a voiced phoneme contained in the phoneme sequence information into a predetermined number of parts.

【請求項２８】前記振幅情報は、前記目的音声波形の
所定の周波数以下の成分のみを取り出した低域成分波形
から取り出したことを特徴とする請求項１８に記載の音
声合成方法。28. The speech synthesis method according to claim 18, wherein the amplitude information is extracted from a low-frequency component waveform obtained by extracting only a component having a frequency equal to or lower than a predetermined frequency of the target audio waveform.

【請求項２９】前記音韻系列情報、前記音韻タイミン
グ情報、前記ピッチ情報及び、前記振幅情報は、帯域制
限を受けた狭帯域音声から抽出されたことを特徴とする
請求項１８に記載の音声合成方法。29. The speech synthesis method according to claim 18, wherein the phoneme sequence information, the phoneme timing information, the pitch information, and the amplitude information are extracted from band-limited narrowband speech.

【請求項３０】前記音韻タイミング情報を変形するこ
とによって合成音の速度を変化させることを特徴とする
請求項１８に記載の音声合成方法。30. The speech synthesis method according to claim 18, wherein the speed of the synthesized sound is changed by modifying the phoneme timing information.

【請求項３１】前記ピッチ情報または前記振幅情報を
変形することによって合成音のピッチまたは音量を変化
させることを特徴とする請求項２３から３０のいずれか
に記載の音声合成方法。31. The speech synthesis method according to any one of claims 23 to 30, wherein the pitch or volume of the synthesized sound is changed by modifying the pitch information or the amplitude information.

【請求項３２】前記音韻系列情報を変更することによ
って、前記目的音声と異なる発話内容の音声を合成する
ことを特徴とする請求項１８に記載の音声合成方法。32. The speech synthesis method according to claim 18, wherein a speech having a speech content different from that of the target speech is synthesized by changing the phoneme sequence information.

【請求項３３】前記音韻系列情報、前記音韻タイミン
グ情報、前記ピッチ情報および、前記振幅情報を比較的
読み出し速度の遅い記憶媒体に記録し、必要に応じて前
記記憶媒体から情報を読み出すことによって音声を合成
することを特徴とする請求項１８に記載の音声合成方
法。33. The speech synthesis method according to claim 18, wherein the phoneme sequence information, the phoneme timing information, the pitch information, and the amplitude information are recorded on a storage medium with a relatively slow read speed, and speech is synthesized by reading the information from the storage medium as needed.

【請求項３４】複数のセンサと、それらの前記複数の
センサそれぞれに対応して接続された複数のメッセージ
情報記憶部と、それらのメッセージ情報記憶部それぞれ
に対応して接続された複数の通信回線と、それらの通
信回線に共通に接続された集中監視部と、その集中監視
部に接続された音声合成部と、を備え、前記メッセージ
情報記憶部には音声メッセージに対応する音韻系列情
報、音韻タイミング情報、ピッチ情報および、振幅情報
が記憶されており、前記複数のセンサのいずれかが所定の事象を感知したと
き、それに対応する前記メッセージ情報記憶部に記憶さ
れた音韻系列情報、音韻タイミング情報、ピッチ情報お
よび振幅情報が、対応する前記通信回線を介して前記集
中監視部に伝送され、前記音声合成部は、その集中監視部からの指示により、
前記音韻系列情報、音韻タイミング情報、ピッチ情報お
よび振幅情報に従って音声を合成することにより通報す
ることを特徴とする音声通報システムであって、前記音声合成部は請求項１８の音声合成方法を利用する
手段であることを特徴とする音声通報システム。34. A voice notification system comprising: a plurality of sensors; a plurality of message information storage units connected to the respective sensors; a plurality of communication lines connected to the respective message information storage units; a centralized monitoring unit connected in common to the communication lines; and a speech synthesis unit connected to the centralized monitoring unit, wherein: each message information storage unit stores phoneme sequence information, phoneme timing information, pitch information, and amplitude information corresponding to a voice message; when any of the plurality of sensors detects a predetermined event, the phoneme sequence information, phoneme timing information, pitch information, and amplitude information stored in the corresponding message information storage unit are transmitted to the centralized monitoring unit via the corresponding communication line; the speech synthesis unit, on instruction from the centralized monitoring unit, issues a notification by synthesizing speech according to the phoneme sequence information, phoneme timing information, pitch information, and amplitude information; and the speech synthesis unit is means that uses the speech synthesis method according to claim 18.

【請求項３５】テキスト入力部とテキスト記憶部とテ
キスト音韻系列変換部と音韻系列記憶部と音声入力部と
音声記憶部と音韻タイミング検出部と音韻タイミング記
憶部とピッチ分析部とピッチ情報記憶部と振幅分析部と
振幅情報記憶部と音声合成部とを備え、前記テキスト入力部は任意のテキストを入力し、前記テキスト記憶部は前記入力されたテキストを一時的
に記憶し、前記テキスト音韻系列変換部は前記一時的に記憶された
テキストを音素などの音韻系列に変換し、前記音韻系列記憶部は前記変換された音韻系列を記憶
し、前記音声入力部は前記テキストに対応する音声を入力
し、前記音声記憶部は前記入力された音声を一時的に記憶
し、前記音韻タイミング検出部は前記一時的に記憶された音
声からそれぞれの音韻のタイミングを検出し、前記音韻タイミング記憶部は前記検出された音韻のタイ
ミングを記憶し、前記ピッチ分析部は前記一時的に記憶された音声のピッ
チを分析し、前記ピッチ情報記憶部は前記分析されたピッチを記憶
し、前記振幅分析部は前記一時的に記憶された音声の振幅を
分析し、前記振幅記憶部は前記分析された振幅を記憶し、前記音声合成部は前記音韻系列記憶部に記憶された音韻
系列情報と前記音韻タイミング記憶部に記憶された音韻
タイミング情報と前記ピッチ情報記憶部に記憶されたピ
ッチ情報と前記振幅情報記憶部に記憶された振幅情報に
従って、音声を合成する音声合成システムであって、前記音声合成部は請求項１８の音声合成方法を利用する
ことを特徴とする音声合成システム。35. A speech synthesis system comprising a text input unit, a text storage unit, a text-to-phoneme-sequence conversion unit, a phoneme sequence storage unit, a speech input unit, a speech storage unit, a phoneme timing detection unit, a phoneme timing storage unit, a pitch analysis unit, a pitch information storage unit, an amplitude analysis unit, an amplitude information storage unit, and a speech synthesis unit, wherein: the text input unit receives arbitrary text; the text storage unit temporarily stores the input text; the text-to-phoneme-sequence conversion unit converts the temporarily stored text into a phoneme sequence, such as a sequence of phonemes; the phoneme sequence storage unit stores the converted phoneme sequence; the speech input unit receives speech corresponding to the text; the speech storage unit temporarily stores the input speech; the phoneme timing detection unit detects the timing of each phoneme from the temporarily stored speech; the phoneme timing storage unit stores the detected phoneme timings; the pitch analysis unit analyzes the pitch of the temporarily stored speech; the pitch information storage unit stores the analyzed pitch; the amplitude analysis unit analyzes the amplitude of the temporarily stored speech; the amplitude storage unit stores the analyzed amplitude; the speech synthesis unit synthesizes speech according to the phoneme sequence information stored in the phoneme sequence storage unit, the phoneme timing information stored in the phoneme timing storage unit, the pitch information stored in the pitch information storage unit, and the amplitude information stored in the amplitude information storage unit; and the speech synthesis unit uses the speech synthesis method according to claim 18.

【請求項３６】前記目的音声波形に付与されたピッチ
マークは、請求項１から１５のいずれか一つに記載の音
声分析方法により付与されたものであることを特徴とす
る請求項２１から３３のいずれか一つに記載の音声合成
方法。36. The speech synthesis method according to any one of claims 21 to 33, wherein the pitch marks assigned to the target speech waveform are assigned by the speech analysis method according to any one of claims 1 to 15.

【請求項３７】前記素片音声波形に付与されたピッチ
マークは、請求項１から１５のいずれか一つに記載の音
声分析方法により付与されたものであることを特徴とす
る請求項２０に記載の音声合成方法。37. The speech synthesis method according to claim 20, wherein the pitch marks assigned to the unit speech waveform are assigned by the speech analysis method according to any one of claims 1 to 15.

【請求項３８】前記ピッチ波形は、切り出し対象とな
る区間における全ての振幅値を、所定の補間処理により
補間して求めたものであり、その切り出し対象となる区間は、直線補間により推定さ
れた０交差位置により決定されたピーク情報から求めら
れたピッチマークを、時間的基準位置として、特定した
区間であることを特徴とする請求項３７に記載の音声合
成方法。38. The speech synthesis method according to claim 37, wherein each pitch waveform is obtained by computing all amplitude values in the section to be cut out through a predetermined interpolation process, and the section to be cut out is a section specified by using, as a temporal reference position, a pitch mark obtained from peak information determined by a zero-crossing position estimated by linear interpolation.

【請求項３９】自然音声による定型メッセージと、音
声合成による合成メッセージとを組み合わせることによ
り所定のメッセージを生成する音声合成方法において、前記自然音声に対応するピッチマーク情報があらかじめ
付与されており、前記定型メッセージと前記合成メッセージとの少なくと
も接続部においては、前記合成メッセージの音声合成に用いる音声波形のピッ
チ波形を前記ピッチマーク情報に基づいて配置すること
により、前記定型メッセージと同じ内容の音声を合成メ
ッセージとして合成し、それら同じ内容の双方の音声のそれぞれの混合比率を時
間的に変化させ、前記接続部において重ね合わせる、ことを特徴とする音声合成方法。39. A speech synthesis method for generating a given message by combining a fixed message in natural speech with a synthesized message produced by speech synthesis, wherein: pitch mark information corresponding to the natural speech is assigned in advance; at least at a junction between the fixed message and the synthesized message, speech with the same content as the fixed message is synthesized as the synthesized message by placing the pitch waveforms of the speech waveform used for synthesizing the synthesized message according to the pitch mark information; and the two utterances with the same content are superimposed at the junction while their respective mixing ratios are varied over time.

40. The speech synthesis method according to claim 39, wherein, at a connection portion from the fixed message to the synthesized message, the mixing ratio is changed gradually over time, starting before the connection portion, so that the share of the synthesized message becomes larger; and, at a connection portion from the synthesized message to the fixed message, the mixing ratio is changed gradually over time, starting before the connection portion, so that the share of the fixed message becomes larger.
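
The gradual change of the mixing ratio in claim 40 can be sketched as a crossfade between two time-aligned signals with the same content. The linear ramp, the function name `crossfade`, and the parameters `fade_start` and `fade_len` are assumptions for illustration. The claim only requires that the ratio change gradually, starting before the connection portion.

```python
def crossfade(outgoing, incoming, fade_start, fade_len):
    """Mix two equal-length, time-aligned signals with a time-varying ratio.

    Hypothetical helper: before fade_start only `outgoing` is heard, after
    fade_start + fade_len only `incoming` is heard, and in between the
    weight of `incoming` rises linearly (an assumed ramp shape).
    """
    if len(outgoing) != len(incoming):
        raise ValueError("signals must have the same length")
    mixed = []
    for n, (a, b) in enumerate(zip(outgoing, incoming)):
        if n < fade_start:
            w = 0.0
        elif n >= fade_start + fade_len:
            w = 1.0
        else:
            w = (n - fade_start + 1) / (fade_len + 1)
        mixed.append((1.0 - w) * a + w * b)
    return mixed


if __name__ == "__main__":
    fixed = [1.0] * 6      # stands in for the fixed (natural speech) message
    synthetic = [0.0] * 6  # stands in for the synthesized message
    print(crossfade(fixed, synthetic, fade_start=1, fade_len=3))
```

For a connection from the fixed message to the synthesized message, the fixed message is passed as `outgoing`. For the opposite direction the arguments are swapped.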

41. A speech synthesis method for generating a predetermined message by combining a first message and a second message, wherein the first message is generated by arranging pitch waveforms of the speech waveform used for synthesizing the first message on the basis of pitch mark information corresponding to natural speech recorded in advance for each type of the first message; at least at a connection portion between the first message and the second message, speech having the same content as the first message is synthesized as the second message; and the two speech signals having the same content are superimposed at the connection portion while their respective mixing ratios are varied over time.

42. The speech synthesis method according to claim 41, wherein the synthesis of the second message at least at the connection portion between the first message and the second message is performed by arranging pitch waveforms of the speech waveform used for speech synthesis of the second message on the basis of the pitch mark information.

43. The speech synthesis method according to any one of claims 39 to 42, wherein the pitch mark is one given by the speech analysis method according to any one of claims 1 to 15.

44. A medium on which is recorded a program for causing a computer to execute all or some of the steps according to any one of claims 1 to 15.

45. A medium on which is recorded a program for causing a computer to execute all or some of the steps according to any one of claims 18 to 43.