JP2612868B2

JP2612868B2 - Voice utterance speed conversion method

Info

Publication number: JP2612868B2
Application number: JP62250707A
Authority: JP
Inventors: 徹都木; 尚夫桑原
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1987-10-06
Filing date: 1987-10-06
Publication date: 1997-05-21
Anticipated expiration: 2012-05-21
Also published as: JPH0193795A

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to an utterance speed conversion method that controls the utterance speed when human speech is processed in broadcasting, film, music, and similar fields.

[Summary of the Invention] The present invention relates to a technique for temporarily recording human speech, changing its utterance speed, and outputting it again as speech. After the input speech is A/D converted, the pitch frequency is extracted for the voiced sections and the waveform is divided at each pitch interval. Thinning or repetition is then performed in pitch units, mainly in steady vowel sections, and silent sections and unvoiced consonant sections are also thinned out or repeated. The resulting pieces are connected and D/A converted. This makes it possible to change the utterance speed freely while keeping the phonemic quality and naturalness of the original speech well preserved.

[Prior Art] A classic example of this kind of technique is to record speech on an analog tape recorder and change the playback speed. In this case, not only the utterance speed but also the pitch frequency and the formant frequencies change uniformly. That is, if the playback speed is R times the recording speed, the utterance speed becomes R times faster, and the pitch and formant frequencies are all multiplied by R as well. Here, the overall level of the pitch frequency determines how high or low the voice sounds, and its local variation determines the intonation of the speech, such as accent. The formant frequencies determine the speaker individuality and the phonemic quality of the speech.

To restore the pitch and formant frequencies that have been multiplied by R, a speech waveform captured at clock frequency F using a BBD (bucket-brigade device) or the like can be read out at clock frequency F/R; the pitch and formant frequencies are then multiplied by 1/R and return to their original values. Before the waveform is loaded into the BBD, however, it is thinned out or repeated using a suitable time window and period so that there is neither a surplus nor a shortage of waveform.

Methods based on analysis-synthesis, a form of digital signal processing, have also been proposed. If the articulatory parameters and the residual waveform obtained by analysis are thinned out or repeated in suitable time units during synthesis, the utterance speed can be controlled without changing the pitch or formant frequencies.

[Problems to Be Solved by the Invention] Simply changing the playback speed of a tape recorder is easy, but it also changes the pitch and formant frequencies. A change in pitch or formant frequency affects speaker individuality, and when the change is large the phonemic quality degrades and the voice sounds inhuman.

The method that restores the pitch and formant frequencies processes the signal in blocks, so it is difficult to keep the waveform fully continuous, and the sound quality degrades markedly.

In the analysis-synthesis method, too, some degradation in sound quality is unavoidable because the output is synthesized speech generated by parameter control.

Moreover, the conventional methods process all sections uniformly. In real speech, however, the duration of some types of consonant hardly depends on the utterance speed, and stretching or compressing these parts by the same ratio as the vowel sections degrades the naturalness of conversational speech.

Furthermore, plosive consonants such as t and k are short, so they may disappear when the signal is thinned out in block units.

An object of the present invention is therefore to solve the above problems of the prior art and to provide an utterance speed conversion method that keeps the waveform continuous by thinning out and repeating in pitch units, and that prevents degradation of sound quality by using the original speech waveform as it is.

Another object of the present invention is to provide an utterance speed conversion method that can preserve the naturalness of speech by stretching or compressing vowel sections, voiced consonant sections, unvoiced consonant sections, and silent sections at separate ratios.

[Means for Solving the Problems] To this end, the present invention is characterized by: extracting vowel sections, voiced consonant sections, unvoiced consonant sections, and silent sections from an input speech waveform; extracting the pitch period from the voiced sections, which consist of the voiced consonant sections and the vowel sections, and thereby dividing the voiced sections at the pitch interval; setting a stretch ratio for each kind of section such that the stretch ratio of the utterance duration is large in the vowel sections and silent sections and small in the voiced consonant sections and unvoiced consonant sections; stretching or compressing the utterance duration in the vowel sections and voiced consonant sections by thinning out or repeating the waveform at the pitch interval according to the set stretch ratio; stretching or compressing the utterance duration section by section in the unvoiced consonant sections and silent sections according to the set stretch ratio; and then connecting the sections to form a new speech waveform.

[Operation] With the above configuration, the input speech is separated into vowel sections, voiced consonant sections, unvoiced consonant sections, and silent sections, and the utterance speed of each kind of section is converted by a method suited to the characteristics of human speech production.

That is, in voiced sections the unit of thinning and repetition is the pitch period, and the original speech waveform is used as it is.

In consonant sections, too, the stretching method is switched according to the nature of each consonant.

[Embodiment] The present invention is described in detail below with reference to the embodiment shown in the drawings.

FIG. 1 is a block diagram of an utterance speed conversion system according to one embodiment of the present invention. In the figure, 2 denotes an analysis unit, 4 a control unit, and 6 a waveform connection unit. Each unit is implemented in a computer, and the utterance speed conversion is carried out with the help of memory such as ROM, RAM, or a memory disk.

The speech waveform, sampled by A/D conversion, is input to the analysis unit 2, which discriminates speech from silence and voiced sound from unvoiced sound. For voiced sound it also performs linear prediction analysis to obtain the pitch period, the prediction coefficients, the resonance frequencies, and the resonance bandwidths.

Next, the control unit 4 changes the utterance speed, and the waveform connection unit 6 stretches or compresses the utterance duration and connects the waveforms.

When this series of utterance speed conversion steps is complete, the synthesized speech waveform is D/A converted to produce the output speech.

The processing in each unit is described in detail below with reference to the flowchart in FIG. 2.

Speech that has been A/D converted with 12-bit resolution at a sampling frequency of 15 kHz is first processed in the analysis unit 2. In step S1, speech sections and silent sections are discriminated according to the presence or absence of speech power. In step S2, PARCOR analysis and zero-crossing analysis are applied to the samples of the speech sections to discriminate unvoiced consonant sections from voiced sections. This is done by referring to the first-order PARCOR coefficient to examine the proportion of high-frequency components in the input, and by examining the zero-crossing count. Because the energy of unvoiced sound extends into the high-frequency region, unvoiced consonants and voiced sounds can be told apart by the proportion of high-frequency components and by the zero-crossing count, which increases with frequency. Both PARCOR analysis and zero-crossing analysis are used in order to make the discrimination reliable.
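The discrimination in steps S1 and S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and all three thresholds (`power_floor`, `zcr_max`, `k1_min`) are hypothetical and assume samples scaled to roughly ±1. The normalized lag-1 autocorrelation is used as the first-order PARCOR coefficient (they agree up to sign convention), and a frame is called voiced only when both criteria agree.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.count_nonzero(signs[1:] != signs[:-1]) / (len(frame) - 1)

def first_parcor(frame):
    """Normalized lag-1 autocorrelation r[1]/r[0].

    Close to +1 for low-frequency (voiced) frames and small or negative
    when high-frequency energy dominates (unvoiced frames).
    """
    r0 = np.dot(frame, frame)
    if r0 == 0.0:
        return 0.0
    return np.dot(frame[:-1], frame[1:]) / r0

def classify_frame(frame, power_floor=1e-4, zcr_max=0.25, k1_min=0.5):
    """Return 'silence', 'unvoiced', or 'voiced' (thresholds are assumptions)."""
    frame = np.asarray(frame, dtype=float)
    if np.mean(frame ** 2) < power_floor:          # step S1: power check
        return "silence"
    # step S2: high zero-crossing rate or weak lag-1 correlation -> unvoiced
    if zero_crossing_rate(frame) > zcr_max or first_parcor(frame) < k1_min:
        return "unvoiced"
    return "voiced"
```

In practice the thresholds would need tuning on real recordings; the source gives no numeric values for them.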

The durations of the silent sections found in step S1 and the waveforms of the unvoiced consonant sections found in step S2 are stored unchanged in RAM, on a memory disk, or the like in steps S15 and S16, respectively.

Next, in step S3, linear prediction analysis is performed by passing the samples of the speech waveform in the voiced section through a so-called vocal tract inverse filter based on a speech production model. This analysis yields the linear prediction coefficients and the residual waveform. The residual waveform is stored in RAM, on a memory disk, or the like in step S17.
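As an illustration of step S3, the sketch below estimates prediction coefficients with the autocorrelation method (Levinson-Durbin recursion) and obtains the residual by inverse filtering. The source does not say which estimation method is used, so this choice, the Hamming window, and the function names are assumptions. The coefficient convention follows the polynomial 1 + a1·z^-1 + ... + ap·z^-p.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns a[0..order] with a[0] = 1 for A(z) = 1 + a1 z^-1 + ... + ap z^-p.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                     # degenerate (e.g. all-zero) input
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                     # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc(frame, order):
    """Prediction coefficients of a Hamming-windowed frame."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    r = full[len(frame) - 1:len(frame) + order]
    return levinson_durbin(r, order)

def inverse_filter(signal, a):
    """Residual e[n] = sum_k a[k] * x[n-k], i.e. x passed through A(z)."""
    return np.convolve(signal, a)[:len(signal)]
```

For a voiced frame `x`, `inverse_filter(x, lpc(x, 14))` would give the residual that step S4 examines for periodicity.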

In step S4, a provisional pitch period is obtained from the period found in the correlation of the residual waveform obtained in step S3 and from the spacing of the peaks of the original speech waveform.
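The correlation half of step S4 might look like the sketch below, which picks the strongest autocorrelation peak of the residual within a plausible pitch range. The search range of 60 to 400 Hz and the function name are assumptions; the cross-check against the peak spacing of the original waveform that the text mentions is not shown.

```python
import numpy as np

def provisional_pitch_period(residual, fs, f_min=60.0, f_max=400.0):
    """Return the pitch period in samples from the residual autocorrelation."""
    lag_min = int(fs / f_max)              # shortest period considered
    lag_max = int(fs / f_min)              # longest period considered
    x = np.asarray(residual, dtype=float)
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    return lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
```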

Next, in step S5, as shown in FIG. 3, the point immediately before the waveform level rises sharply is taken as the start of a pitch period. Using the pitch period obtained above, the sample immediately before the start of the next pitch period is taken as the end point, which defines one pitch section.

In step S6, a window of about 20 ms is applied with the midpoint of this pitch section as the center of the analysis window. Windowing makes short-time spectral analysis possible with a finite number of samples, and linear prediction analysis is performed again on the windowed data. That is, the linear prediction coefficients $\alpha_1$ to $\alpha_p$ are calculated from the correlation function of the windowed samples. Here p is the order of the linear prediction analysis; typically p = 14 is used for male voices and about p = 10 for female voices.

Further, in step S18, the roots $z_1$ to $z_p$ of equation (1) below are found, and for each root $z_i$ the resonance frequency $F_i$ and its bandwidth $B_i$ are obtained from equations (2) and (3).

$$1 + \alpha_1 z^{-1} + \alpha_2 z^{-2} + \cdots + \alpha_p z^{-p} = 0 \quad (1)$$

$$F_i = \frac{F_s}{2\pi} \arg(z_i) \ \text{[Hz]} \quad (2)$$

$$B_i = \frac{F_s}{\pi} \left| \log |z_i| \right| \ \text{[Hz]} \quad (3)$$

where $F_s$ is the sampling frequency of the speech.
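Equations (1) to (3) translate directly into a few lines of NumPy. The sketch below is illustrative only; the function name is hypothetical, and keeping just the roots with positive imaginary part (one per conjugate pair) is a presentation choice rather than something the source specifies.

```python
import numpy as np

def resonances(alpha, fs):
    """Resonance frequencies F_i and bandwidths B_i in Hz.

    alpha holds the prediction coefficients [alpha_1, ..., alpha_p].
    """
    # Multiplying equation (1) by z^p gives
    # z^p + alpha_1 z^(p-1) + ... + alpha_p = 0.
    roots = np.roots(np.concatenate(([1.0], alpha)))
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = fs / (2 * np.pi) * np.angle(roots)             # equation (2)
    bws = fs / np.pi * np.abs(np.log(np.abs(roots)))       # equation (3)
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```

For example, a single pole pair with radius exp(-π·100/Fs) and angle 2π·500/Fs yields a resonance at 500 Hz with a bandwidth of 100 Hz.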

In step S7, the sum of the squares of the samples in this pitch section divided by the length of the pitch section is defined as the normalized power, and it is recorded in RAM, on a memory disk, or the like together with the length of the pitch section.
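The normalized power of step S7 is simply the mean squared sample value over one pitch section. A minimal sketch (the function name is hypothetical, and the section length is taken in samples):

```python
import numpy as np

def normalized_power(pitch_section):
    """Sum of squared samples divided by the section length in samples."""
    section = np.asarray(pitch_section, dtype=float)
    return float(np.sum(section ** 2) / len(section))
```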

The processing section is then shifted back by one pitch period and the above series of steps is carried out again; this is repeated until the voiced section ends.

The time trajectory of the resonance frequencies obtained from equation (2) changes continuously and slowly in steady vowel parts, whereas in voiced consonant parts it changes erratically and the bandwidths are wider than in vowel parts. In the time trajectory of the normalized power, a temporary, sharp drop often occurs in voiced consonant parts. In step S8, these features are used to separate vowel parts from voiced consonant parts, and the result is recorded for each pitch period in RAM, on a memory disk, or the like.

Based on the silent section lengths and the sequence of pitch periods obtained in the analysis unit 2, the control unit 4 creates a new pitch period sequence in which the utterance duration, that is, the utterance speed, has been changed. It does so by stretching or compressing the silent sections with a suitable allocation and by repeating or thinning out the individual pitch periods of the voiced sections.

Suppose that the analysis unit 2 has produced the following results.

Total utterance duration: $T_{all}$
Total duration of vowel parts: $T_v$
Total duration of voiced consonant parts: $T_{cv}$
Total duration of unvoiced consonant parts: $T_{cn}$
Total duration of silent parts: $T_s$

where

$$T_{all} = T_v + T_{cv} + T_{cn} + T_s \quad (4)$$

To make the utterance speed R times faster, $T_{all}$ only needs to be multiplied by 1/R.

In real speech, however, $T_{cn}$ and $T_{cv}$ change little when the utterance speed changes; it is mainly $T_s$ and $T_v$ that change. Therefore the lengths of $T_s$ and $T_v$ are changed with a weight of 1 and those of $T_{cn}$ and $T_{cv}$ with a weight of w (where w < 1), such that their sum $T'_{all}$ equals 1/R times $T_{all}$. That is, in step S9, the duration of each part after the change is set as follows.

$$T'_{all} = \gamma_0 \cdot T_{all} \quad (5)$$

$$T'_v = \gamma_1 \cdot T_v \quad (6)$$

$$T'_{cv} = \gamma_2 \cdot T_{cv} \quad (7)$$

$$T'_{cn} = \gamma_2 \cdot T_{cn} \quad (8)$$

$$T'_s = \gamma_1 \cdot T_s \quad (9)$$

where

$$\gamma_0 = 1/R \quad (10)$$

The value of w is about 0.3 to 0.5.
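The text as it stands gives equations (5) to (10) but does not show how $\gamma_1$ and $\gamma_2$ are derived from w. The sketch below fills that gap with an explicit assumption, namely that the consonant parts change by w times as much as the vowel and silent parts, $\gamma_2 - 1 = w(\gamma_1 - 1)$, and then solves $T'_v + T'_{cv} + T'_{cn} + T'_s = \gamma_0 T_{all}$ for $\gamma_1$. This relation and the function name are not taken from the source.

```python
def stretch_ratios(t_v, t_cv, t_cn, t_s, rate, w=0.4):
    """Return (gamma0, gamma1, gamma2) for a speed change by the factor `rate`.

    Assumption (not stated in the source): gamma2 - 1 = w * (gamma1 - 1).
    """
    gamma0 = 1.0 / rate                      # equation (10)
    a = t_v + t_s                            # parts scaled by gamma1
    c = t_cv + t_cn                          # parts scaled by gamma2
    # Solve a*gamma1 + c*(1 + w*(gamma1 - 1)) = gamma0 * (a + c) for gamma1.
    gamma1 = (gamma0 * (a + c) - c * (1.0 - w)) / (a + w * c)
    gamma2 = 1.0 + w * (gamma1 - 1.0)
    return gamma0, gamma1, gamma2
```

With durations of 0.6 s, 0.1 s, 0.1 s, and 0.2 s and `rate=0.5`, the vowel and silent parts are stretched by about 2.14 and the consonant parts by about 1.45, so that the total doubles.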

The waveform connection unit 6 stretches or compresses the utterance duration of each part by the ratio determined by the control unit 4 and connects the parts.

To multiply the utterance duration by $\gamma_1$ in vowel sections and by $\gamma_2$ in voiced consonant sections, pitch-unit waveforms are thinned out or repeated at a suitable rate and connected, as follows.

That is, in steps S10 and S11, when the utterance duration of a vowel section or voiced consonant section is to be multiplied by γ: if γ > 1, the same pitch waveform is repeated at a rate of one pitch period per 1/(γ − 1) pitch periods; if γ < 1, pitch periods are thinned out at a rate of one per 1/(1 − γ) pitch periods. FIG. 4 shows examples for γ = 1.5 and γ = 0.667. As the figure shows, for γ = 1.5, pitch sections 2 and 4 are repeated, one repetition every two pitch periods. For γ = 0.667, pitch sections 3 and 6 are thinned out, one every three pitch periods.
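The repetition and thinning rule of steps S10 and S11 can be sketched with a running accumulator, which reproduces the two cases described for FIG. 4. The accumulator is one way to generalize the "one per 1/|γ − 1| pitch periods" rule to ratios where that count is not an integer; that generalization, the tolerance `eps`, and the function name are assumptions.

```python
import numpy as np

def stretch_by_pitch(sections, gamma, eps=1e-9):
    """Repeat (gamma > 1) or drop (gamma < 1) whole pitch sections.

    sections is a list of 1-D arrays, one per pitch period; the samples
    themselves are never modified, only duplicated or omitted.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    step = abs(gamma - 1.0)      # fraction of a section owed per section
    out = []
    acc = 0.0
    for sec in sections:
        acc += step
        if gamma >= 1.0:
            out.append(sec)
            while acc >= 1.0 - eps:      # a full extra section is owed
                out.append(sec)          # repeat the same pitch waveform
                acc -= 1.0
        elif acc >= 1.0 - eps:
            acc -= 1.0                   # thin out this section
        else:
            out.append(sec)
    return out

# Joining the result gives the time-scaled voiced waveform:
# stretched = np.concatenate(stretch_by_pitch(sections, 1.5))
```

Because whole sections are added or removed, the achieved duration only approximates γ times the original; the text handles the remainder by adjusting neighbouring silent or vowel sections.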

A voiced consonant section whose original length is 25 ms or less is likely to be the liquid /r/. Since the length of such a section hardly depends on the utterance speed, it is not stretched or compressed.

In this way the utterance duration becomes approximately γ times that of the original speech, and the result sounds natural to the ear.

In a waveform in which pitch sections have been thinned out or repeated, the end point of one pitch section and the start point of the next are generally discontinuous. The samples several points before and after each join are therefore approximated by a cubic curve using the least-squares method so that the sections connect continuously.
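One possible reading of this smoothing step is sketched below: a cubic is fitted by least squares to a few samples on either side of the join, and those samples are replaced by the fitted values. The number of samples (`half_width`) and the decision to overwrite the samples with the fit are assumptions; the source only says that a cubic least-squares approximation is used.

```python
import numpy as np

def smooth_join(signal, join, half_width=4):
    """Replace samples around index `join` with a least-squares cubic fit."""
    out = np.asarray(signal, dtype=float).copy()
    lo = max(join - half_width, 0)
    hi = min(join + half_width, len(out))
    if hi - lo < 5:                  # too few points for a useful cubic fit
        return out
    x = np.arange(lo, hi) - join     # centered abscissa for conditioning
    coeffs = np.polyfit(x, out[lo:hi], 3)
    out[lo:hi] = np.polyval(coeffs, x)
    return out
```

Applied to a unit step, the jump of 1.0 at the join is reduced to well under half that size, while samples outside the fitted span are left untouched.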

In unvoiced consonant sections, in step S12, a section whose original length L is shorter than 60 ms is likely to be a plosive or affricate consonant, so the section itself is not stretched or compressed.

For sections with L greater than 60 ms: if $\gamma_2 < 1$, a length equal to $L(1 - \gamma_2)/2$ is removed at each end, working from the start point and from the end point toward the midpoint. If $2 \ge \gamma_2 > 1$, a waveform of length $L(\gamma_2 - 1)$ around the midpoint is cut out and inserted at the midpoint of the original waveform. This is illustrated in FIG. 5. If $\gamma_2 > 2$, repetition of the entire section is added as appropriate.
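The rules of step S12 can be sketched as follows. How exactly whole-section repetition is combined with midpoint insertion when the ratio exceeds 2 is not spelled out in the source, so the split used here (whole copies first, then a remaining ratio between 1 and 2) is an assumption, as is the function name.

```python
import numpy as np

def stretch_unvoiced(section, gamma2, fs, min_ms=60.0):
    """Time-scale one unvoiced consonant section by the ratio gamma2."""
    section = np.asarray(section, dtype=float)
    n = len(section)
    if n < min_ms * 1e-3 * fs or gamma2 == 1.0:
        return section.copy()            # short plosive or affricate: keep
    if gamma2 < 1.0:
        cut = int(round(n * (1.0 - gamma2) / 2.0))
        return section[cut:n - cut].copy()   # trim both ends toward the middle
    # Ratios above 2: add whole copies, leaving a remaining ratio in (1, 2].
    extra_full = int(np.floor(gamma2 - 1.0)) if gamma2 > 2.0 else 0
    frac = gamma2 - extra_full
    ins = int(round(n * (frac - 1.0)))   # length inserted at the midpoint
    mid = n // 2
    start = mid - ins // 2
    piece = section[start:start + ins]   # waveform around the midpoint
    stretched = np.concatenate((section[:mid], piece, section[mid:]))
    return np.concatenate([stretched] + [section] * extra_full)
```

For a 100 ms section at 15 kHz (1500 samples), ratios of 0.8, 1.4, and 2.5 give 1200, 2100, and 3750 samples, respectively.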

In silent sections, in step S13, the section length is in principle multiplied unconditionally by $\gamma_1$ to give the new section length. There are two exceptions. A silent part of 30 ms or less immediately after an unvoiced consonant is likely to be the aspiration of an unvoiced plosive, so its length is left unchanged. And when a silent part immediately before an unvoiced consonant is shortened, it is not allowed to become shorter than 30 ms.
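The silent-section rule of step S13 and its two exceptions fit in a few lines. The sketch below is illustrative; the flag names are hypothetical, and leaving a pre-consonant silence that is already shorter than 30 ms at its original length (rather than lengthening it to 30 ms) is an interpretation.

```python
def new_silence_length(length_ms, gamma1, after_unvoiced=False,
                       before_unvoiced=False, guard_ms=30.0):
    """Return the new length in ms of one silent section."""
    if after_unvoiced and length_ms <= guard_ms:
        return length_ms                 # likely aspiration: leave unchanged
    new_len = gamma1 * length_ms
    if before_unvoiced and gamma1 < 1.0:
        # Do not shorten below the guard (or below the original if shorter).
        new_len = max(new_len, min(length_ms, guard_ms))
    return new_len
```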

Any error in the stretched or compressed duration that arises in each part from the above processing is corrected by adjusting the length of a nearby silent section or vowel section.

When the processing of one section is finished, in step S14, rise and fall windows of about 1 ms are applied to its beginning and end, the section is connected to the preceding section, and processing moves on to the next section.
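A sketch of the edge windowing in step S14 follows. The source specifies only the duration of about 1 ms, so the raised-cosine shape and the function name are assumptions.

```python
import numpy as np

def taper_edges(section, fs, ramp_ms=1.0):
    """Apply a short rise window at the start and a fall window at the end."""
    out = np.asarray(section, dtype=float).copy()
    n = min(int(round(ramp_ms * 1e-3 * fs)), len(out) // 2)
    if n > 0:
        # Raised-cosine ramp from near 0 up to near 1 over n samples.
        ramp = 0.5 - 0.5 * np.cos(np.pi * (np.arange(n) + 0.5) / n)
        out[:n] *= ramp
        out[-n:] *= ramp[::-1]
    return out
```

At 15 kHz the ramp spans 15 samples at each end, and the interior of the section is left unchanged.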

Because it is difficult to process long continuous speech on the basis of its total utterance duration, whenever a relatively long silent section of around 100 to 200 ms is detected, the signal up to its midpoint is treated as one block. The above series of time-scaling steps is carried out within this block, and processing then moves on to the next block. If the original speech is relatively fast, however, the silent section length used to decide block division should be reduced to about 50 ms.
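Block division at long silences can be sketched as below. The representation of sections as `(label, start_ms, end_ms)` tuples and the function name are hypothetical; the threshold corresponds to the 100 to 200 ms figure in the text and can be lowered to about 50 ms for fast speech.

```python
def block_boundaries(sections, min_silence_ms=100.0):
    """Return cut times (ms) at the midpoints of sufficiently long silences.

    sections is a list of (label, start_ms, end_ms) tuples in time order,
    where label is e.g. 'silence', 'vowel', 'voiced_cons', 'unvoiced_cons'.
    """
    cuts = []
    for label, start, end in sections:
        if label == "silence" and end - start >= min_silence_ms:
            cuts.append((start + end) / 2.0)   # block ends at the midpoint
    return cuts
```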

Finally, the synthesized speech is D/A converted to produce the output speech.

The pitch frequency extraction method, the voiced/unvoiced discrimination method, the voiced consonant extraction method, and so on in the analysis unit 2 are not limited to those described here; any method that extracts them accurately may be used.

[Effects of the Invention] As described above, according to the present invention the input speech is first separated into vowel sections, voiced consonant sections, unvoiced consonant sections, and silent sections, and the utterance speed is changed in each kind of section by a conversion method suited to the characteristics of human speech production, so the resulting speech is highly natural.

In voiced sections, thinning out and repeating in pitch units keeps the waveform continuous, and using the original speech waveform as it is means there is almost no degradation in sound quality.

In consonant sections, too, the stretching method can be switched according to the nature of each consonant, so short consonants are not lost and any loss of intelligibility is kept to a minimum.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of a system according to one embodiment of the present invention; FIG. 2 is a flowchart showing one embodiment of the present invention; FIG. 3 is a waveform diagram explaining how pitch sections are determined in the embodiment; FIG. 4 is a waveform diagram explaining the repetition and thinning of waveforms in the embodiment; and FIG. 5 is a waveform diagram explaining the stretching and compression of the waveform of an unvoiced consonant part in the embodiment. 2: analysis unit; 4: control unit; 6: waveform connection unit.

Claims

(57) [Claims]

[Claim 1] An utterance speed conversion method for speech, characterized by: extracting vowel sections, voiced consonant sections, unvoiced consonant sections, and silent sections from an input speech waveform; extracting a pitch period from voiced sections consisting of the voiced consonant sections and the vowel sections, and thereby dividing the voiced sections at the interval of said pitch; setting a stretch ratio for each of said sections such that the stretch ratio of the utterance duration is large in the vowel sections and the silent sections and small in the voiced consonant sections and the unvoiced consonant sections; stretching or compressing the utterance duration in the vowel sections and the voiced consonant sections by thinning out or repeating the waveform at said pitch interval according to the set stretch ratio; stretching or compressing the utterance duration section by section in the unvoiced consonant sections and the silent sections according to the set stretch ratio; and then connecting said sections to form a new speech waveform.