JPH07295588A

JPH07295588A - Estimating method for speed of utterance

Info

Publication number: JPH07295588A
Application number: JP6083032A
Authority: JP
Inventors: Akio Ando; 彰男安藤; Eiichi Miyasaka; 栄一宮坂
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1994-04-21
Filing date: 1994-04-21
Publication date: 1995-11-10

Abstract

PURPOSE: To estimate the speech rate of arbitrarily uttered speech by treating the number of vowels detected in the input speech within a desired time interval as the number of syllables, and taking that number divided by the length of the interval as the speech rate. CONSTITUTION: A speech waveform division unit 2 divides the input speech into blocks of a fixed duration (several seconds), an acoustic analysis unit 4 analyzes each block, and a vowel detection unit 8 detects vowels using vowel standard patterns 10. After vowel detection, a speech rate calculation unit 12 divides the number of detected vowels (i.e., the number of syllables) by the block length (in seconds) to obtain the mean speech rate (mora/second) within the block. The mean speech rates obtained for the successive blocks are displayed in order on a speech rate display unit 14. In addition, when a devoiced vowel detection unit 6 judges that a vowel has been devoiced, the vowel is supplemented so that vowels missed because of devoicing are still counted.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a method for estimating the speech rate of a speaker's voice. Recently, in order to eliminate the difficulty of listening caused by fast speech, research has been conducted on "speech-rate-conversion hearing aid systems" that use signal processing to convert speech into slower speech (see Japanese Unexamined Patent Publication No. 5-80796). For such a system to work well, the portions of the input speech with a high speech rate must be detected automatically by some method, and speech-rate conversion must be applied so as to slow down only those portions. In addition, when announcers, and indeed anyone who speaks on broadcast media such as television and radio or at lectures, wish to train beforehand to speak at the rate that is easiest to listen to, a device that measures speech rate is expected to make that training more efficient. The present invention meets this demand for measuring speech rate.

[0002]

[Prior Art] Conventionally there has been no method for measuring or estimating speech rate. Some existing work, however, provides material that could be used for that purpose: for example, the method based on dynamic features of the speech signal disclosed in Japanese Unexamined Patent Publication No. 5-289691, and the approach shown in Osaka et al., "Word speech recognition considering input speech rate," IEICE Technical Report SP93-53, August 1993, which estimates phoneme durations and applies the result to speech recognition in order to improve recognition performance.

[0003]

[Problems to Be Solved by the Invention] Of the prior art described above, the former (Japanese Unexamined Patent Publication No. 5-289691) has difficulty estimating speech rate accurately from the features of the speech signal alone. The latter (Osaka et al.) does not aim to estimate speech rate, so it is difficult to use directly for speech rate estimation. In either case, no method currently exists that can estimate speech rate, and a method that can measure or estimate speech rate accurately has been urgently sought in response to the demand described above.

[0004]

[Means for Solving the Problems] To meet this expectation, the speech rate estimation method of the present invention provides a way to estimate speech rate, which was previously unavailable, and to do so with high accuracy (error of 5% or less). Specifically, the present invention is characterized by detecting vowels in the input speech, substituting the number of vowels detected within a desired time interval for the number of syllables within that interval, and estimating the speech rate as the value obtained by dividing that number of syllables by the desired time interval.

[0005] The present invention is further characterized in that, in detecting the vowels in the input speech, the Euclidean distance between the LPC cepstrum coefficients of each frame of the input speech and the vowel standard pattern of each vowel, itself expressed in LPC cepstrum coefficients, is calculated, and the vowels are detected on the basis of the result of that calculation. The present invention is also characterized in that, in detecting the vowels in the input speech, hidden Markov models of vowels are used for the detection.

[0006]

[Embodiments] The present invention is described in detail below by way of embodiments, with reference to the accompanying drawings. The present invention makes effective use of a characteristic of Japanese, namely that each syllable contains exactly one vowel. Instead of obtaining the speech rate by dividing a finite number of syllables by the time that contains them, the number of syllables is replaced by the number of vowels: the vowels in the input speech are detected by comparing the input speech against vowel standard patterns, and the speech rate is obtained by dividing the number of detected vowels by the time that contains them.

[0007] FIG. 1 shows a schematic block diagram of a speech rate estimation device constructed according to the present invention. In FIG. 1, the input speech is divided by the speech waveform division unit 2 into blocks of a fixed duration (on the order of several seconds), the acoustic analysis unit 4 performs acoustic analysis on each block, and the vowel detection unit 8 then detects vowels using the vowel standard patterns. Any acoustic analysis method may be used as long as it can extract the frequency structure of the input speech. This embodiment uses LPC cepstrum analysis: a speech segment of fixed length (on the order of several tens of milliseconds) is cut out on the time axis, with the cut-out position shifted step by step (each cut-out segment is called a frame), and for each frame the Fourier transform of the log power spectrum envelope (called the LPC cepstrum coefficients) is calculated.
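The framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `hamming` and `split_frames` are hypothetical, and the default 20 ms window with a 5 ms shift is taken from the embodiment's block B16 rather than from this paragraph, which only says "several tens of milliseconds". The LPC cepstrum computation itself is not shown.

```python
import math


def hamming(n):
    """Return an n-point Hamming window as a list of floats."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(samples, sample_rate, frame_ms=20.0, shift_ms=5.0):
    """Cut overlapping, Hamming-windowed frames out of one block of samples.

    frame_ms and shift_ms are assumed defaults (values from block B16).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and shift must be at least one sample")
    window = hamming(frame_len)
    frames = []
    start = 0
    # Slide the window along the block; drop a trailing partial frame.
    while start + frame_len <= len(samples):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
        start += shift
    return frames
```

At the embodiment's 15 kHz sampling rate, a 20 ms frame is 300 samples and a 5 ms shift is 75 samples, so consecutive frames overlap by 75%.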

[0008] The vowel standard patterns 10 are created in advance from LPC cepstrum coefficients obtained from the vowel portions of speech. Vowel detection is performed by calculating, for each frame of the input speech, the Euclidean distance between the frame's LPC cepstrum coefficients and the vowel standard pattern of each vowel, likewise expressed in LPC cepstrum coefficients, and judging that a vowel is present when the minimum of these distances is smaller than a preset threshold. To cope with vowels that drop out because of devoicing, the devoiced vowel detection unit 6 performs zero-crossing analysis on the input speech to check for fricative consonants (/s/, /sh/, /ts/, etc.). When a plosive consonant (/p/, /t/, /k/, etc.) follows a fricative consonant and no vowel portion has been detected between the two consonants, vowel devoicing is judged to have occurred and the vowel is supplemented.
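The per-frame vowel decision described above amounts to a nearest-template search with a rejection threshold. The sketch below is illustrative only: `detect_vowel` and `euclidean` are hypothetical names, the templates are passed in as a plain dictionary, and any template values or threshold a caller supplies are assumptions, since the patent gives no numeric values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_vowel(cepstrum, templates, threshold):
    """Return the label of the nearest vowel template, or None.

    cepstrum:  coefficient vector of one frame.
    templates: dict mapping a vowel label to its standard pattern.
    threshold: assumed tuning value; a frame is accepted as a vowel only
               when the smallest distance is below it.
    """
    best_label, best_dist = None, float("inf")
    for label, pattern in templates.items():
        dist = euclidean(cepstrum, pattern)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist < threshold else None
```

Returning `None` for a rejected frame keeps "no vowel" distinct from every vowel label, which is what the later end-point test relies on.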

[0009] After vowel detection, the speech rate calculation unit 12 divides the number of detected vowels (i.e., the number of syllables) by the block length (in seconds) to obtain the average speech rate within the block (in mora/second). The average speech rate obtained for each block is displayed in turn on the speech rate display unit 14.
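The calculation above is a single division per block. As a minimal sketch, assuming the detector reports one time stamp (in seconds) per detected vowel, the per-block rates could be computed as follows; `speech_rate` and `block_rates` are hypothetical names, and the time-stamp representation is an assumption, not something the patent specifies.

```python
def speech_rate(vowel_count, block_seconds):
    """Average speech rate of one block in mora/second."""
    if block_seconds <= 0:
        raise ValueError("block length must be positive")
    return vowel_count / block_seconds


def block_rates(vowel_times, total_seconds, block_seconds):
    """Split [0, total_seconds) into fixed-length blocks and rate each one.

    vowel_times: assumed input, one time stamp per detected vowel.
    The last block may be shorter; its actual length is used as divisor.
    """
    rates = []
    start = 0.0
    while start < total_seconds:
        end = min(start + block_seconds, total_seconds)
        count = sum(1 for t in vowel_times if start <= t < end)
        rates.append(speech_rate(count, end - start))
        start = end
    return rates
```

For example, five vowels detected in a 2-second block give 2.5 mora/second for that block.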

[0010] Next, the processing procedure in one embodiment of the speech rate estimation method of the present invention is shown in the flowcharts of FIG. 2 and FIG. 3 (continuation of FIG. 2). The description below is limited to what each block of the flowchart does, although for blocks that include a decision, the relationship to other blocks is also explained where necessary. In FIGS. 2 and 3, the connector symbols each indicate that the flow is joined at that point.

[0011] The processing performed in each block is as follows.

B2: Resets the speech rate display.
B3: Sets the variable m_vowel, which holds the total number of vowels detected during speech rate estimation, to 0.
B4: Sets the variable t_length, which holds the total block length (length of the speech sections, in seconds), to 0.
B5: A/D-converts the input speech. In this embodiment the sampling frequency is 15 kHz and the quantization is 16 bits.
B6: Judges whether input of the speech data has ended.
B7: Calculates the average speech rate over the whole estimation by dividing m_vowel by t_length.
B8: Displays the average speech rate obtained in B7.
B9: Reads one block of speech data (the block size is on the order of several seconds).
B10, B12: The A/D-converted speech is cut out with a 10 ms time window shifted 10 ms at a time, and the power and the number of zero crossings are obtained. For the power, a threshold T₁ for finding the boundary of a speech section and a threshold T₂ for speech section detection, set larger than T₁, are provided. A threshold T₃ is also provided for the number of zero crossings. The following judgments are then made.
(i) If the power exceeds T₂, the section is judged to be a speech section.
(ii) If the power is T₁ or more, or the number of zero crossings is T₃ or more, the process first looks back in time. If a point is found at which the power reached T₂ or more, with neither the power nor the number of zero crossings having fallen below T₁ or T₃ in between, the section is judged to be a speech section. Otherwise the speech data is read ahead, and if such a point is found at a later time, the section is judged to be a speech section.
(iii) In cases other than (i) and (ii), the section is judged not to be a speech section. Processing then returns to block B6, which judges whether input of the speech data has ended.
B14: Sets the variable n_vowel to 0 in order to count vowels.
B16: Divides the speech data into frames. In this embodiment, one block of speech data is divided into frames by cutting it out with a 20 ms Hamming window shifted 5 ms at a time.
B18: Sets the variable frame, which holds the frame number, to 1.
B20: Judges whether the variable frame exceeds the maximum number of frames that fit in the speech block currently being processed. If it does, processing proceeds to block B44, which calculates the speech rate.
B22: Performs acoustic analysis, consisting of linear prediction analysis and zero-crossing analysis, on each input frame. The average power within the frame is also calculated. In this embodiment the order of the linear prediction analysis is 18, and the pre-emphasis filter is 1 - 0.95z^-1. The LPC cepstrum coefficients are then obtained from the result of the linear prediction analysis.
B24: Calculates the Euclidean distance between the LPC cepstrum coefficients of the input speech and the vowel standard patterns, which are expressed in LPC cepstrum coefficients. The block also determines which vowel's standard pattern gives the minimum distance; if that distance is at most the threshold T₄, that vowel is taken as the vowel corresponding to the current frame. If the distance to every vowel exceeds T₄, the frame has no corresponding vowel.
B26, B28: Detect the end point of a vowel. The previous frame is taken as the end point of a vowel when the previous frame has a corresponding vowel and the current frame has none, or when both the previous and the current frame have corresponding vowels but the vowels differ. If no end point is detected, processing proceeds to block B32, which detects the next devoiced vowel.
B30: Adds 1 to the variable n_vowel.
B32: Adds 1 to the variable m_vowel.
B34, B36: Judge whether a devoiced vowel is present. First, the ratio of the average power of the current frame to the average power of the frame two frames earlier is calculated in order to judge whether the current frame contains a plosive consonant, which is characterized by a sharp rise in power. If a plosive consonant is judged to be present, the process looks back from this frame to check whether a fricative consonant is present. A fricative consonant is judged to be present when the number of zero crossings exceeds a predetermined threshold. If the current frame contains a plosive consonant, a fricative consonant precedes it, and no vowel has been detected between the two consonants, a devoiced vowel is judged to exist between them. If no devoiced vowel is found, processing proceeds to block B42, which adds 1 to the value of the variable frame.
B38: Adds 1 to the variable n_vowel.
B40: Adds 1 to the variable m_vowel.
B42: Adds 1 to the value of the variable frame.
B44: Calculates the speech rate. The speech rate is obtained in mora/second by dividing the variable n_vowel by the block length (in seconds).
B46: Displays the speech rate obtained in the speech rate calculation block B38.
B48: Adds the block length (in seconds) to the value of the variable t_length.
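The counting logic of the frame loop above (vowel end points in B26 to B32, devoiced vowels in B34 to B40, and the running totals of B3, B4, B44, and B48) can be sketched as follows. This is one possible reading under stated assumptions, not the patent's implementation: each frame is assumed to arrive as a tuple of (vowel label or `None`, mean power, zero-crossing count); `power_ratio`, `zc_threshold`, and `max_lookback` are invented tuning values; and the `barrier` index, which stops one fricative from being counted twice, and the requirement that the plosive frame itself carries no vowel label are additions that the source does not state.

```python
def count_vowels(frames, power_ratio=4.0, zc_threshold=50, max_lookback=30):
    """Count vowels in one block of per-frame observations.

    frames: list of (vowel_label_or_None, mean_power, zero_crossings).
    power_ratio, zc_threshold, max_lookback: assumed tuning values.
    """
    n_vowel = 0    # B14: per-block vowel counter
    barrier = -1   # assumption: never look back at or before this frame
    for i, (label, power, _zc) in enumerate(frames):
        prev_label = frames[i - 1][0] if i > 0 else None
        # B26/B28: a vowel ends when the previous frame had a vowel and
        # the current frame has none, or has a different vowel.
        if prev_label is not None and label != prev_label:
            n_vowel += 1   # B30
            continue       # move on to the next frame (B42)
        # B34/B36: a sharp power rise relative to two frames earlier is
        # taken as a plosive; then search backwards for a fricative.
        if label is None and i >= 2 and frames[i - 2][1] > 0:
            if power / frames[i - 2][1] >= power_ratio:
                j = i - 1
                while j > barrier and i - j <= max_lookback:
                    if frames[j][0] is not None:
                        break          # a vowel lies between: not devoiced
                    if frames[j][2] > zc_threshold:
                        n_vowel += 1   # B38: supplement the devoiced vowel
                        barrier = i
                        break
                    j -= 1
    return n_vowel


def overall_rate(blocks):
    """blocks: list of (frames, block_seconds) -> (per-block rates, average)."""
    m_vowel, t_length = 0, 0.0             # B3, B4
    rates = []
    for frames, seconds in blocks:
        n_vowel = count_vowels(frames)
        rates.append(n_vowel / seconds)    # B44: mora/second for the block
        m_vowel += n_vowel                 # total vowels (B32, B40)
        t_length += seconds                # B48
    average = m_vowel / t_length if t_length > 0 else 0.0   # B7
    return rates, average
```

Because a vowel is counted only at its end point, a vowel still in progress when the block ends is not counted in that block; the source text does not say how this boundary case is handled.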

[0012] In the description above, vowels in the input speech are detected by using vowel standard patterns and calculating the Euclidean distance to them. Other methods can also be used to detect the vowel portions, for example vowel HMMs (hidden Markov models).

[0013]

[Effects of the Invention] According to the present invention, the speech rate of arbitrarily uttered speech can be estimated. As one example, an experiment was conducted in which speech rate was estimated using, as evaluation data, recordings of five speakers reading a text of 50 sentences (1501 syllables in total). The result confirmed that speech rate can be estimated with an estimation error of 5% or less, demonstrating the effectiveness of the present invention.

[0014] The evaluation speech used here was taken from the speech database sold by ATR (Advanced Telecommunications Research Institute International). The vowel standard patterns were built from data in the same ATR speech database, uttered by 21 male speakers different from those who uttered the evaluation speech. For the evaluation, the block used to measure speech rate was not of fixed length; instead, each whole sentence utterance was treated as one block. That is, the present invention was evaluated by estimating the average speech rate for each sentence utterance and finding its error relative to the average speech rate calculated from the labels attached to the speech. The results are as stated above.
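The evaluation above compares each estimated rate with the rate derived from the database labels. The patent does not state how the error was defined or aggregated, so the sketch below assumes a per-sentence relative error, expressed as a percentage of the label-derived rate and averaged over sentences. The function names are hypothetical, and no numbers from the actual experiment are reproduced.

```python
def relative_error_percent(estimated, reference):
    """Absolute error of one estimate as a percentage of the reference rate."""
    if reference <= 0:
        raise ValueError("reference rate must be positive")
    return abs(estimated - reference) / reference * 100.0


def mean_error_percent(pairs):
    """Average relative error over (estimated, reference) rate pairs.

    Assumption: errors are averaged per sentence; the patent does not
    specify the aggregation.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    errors = [relative_error_percent(est, ref) for est, ref in pairs]
    return sum(errors) / len(errors)
```

With each sentence treated as one block, `reference` would be the labelled syllable count divided by the sentence duration and `estimated` the detected vowel count divided by the same duration.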

[Brief Description of the Drawings]

FIG. 1 is a block diagram outlining a speech rate estimation device configured according to the speech rate estimation method of the present invention.

FIG. 2 is a flowchart showing the processing procedure in one embodiment of the speech rate estimation method of the present invention.

FIG. 3 is a flowchart (continuation of FIG. 2) showing the processing procedure in one embodiment of the speech rate estimation method of the present invention.

[Explanation of Symbols]

2: speech waveform division unit
4: acoustic analysis unit
6: devoiced vowel detection unit
8: vowel detection unit
10: vowel standard patterns
12: speech rate calculation unit
14: speech rate display unit

Claims

[Claims]

[Claim 1] A speech rate estimation method characterized by detecting vowels in input speech, substituting the number of vowels detected within a desired time interval for the number of syllables within that time interval, and estimating the speech rate as the value obtained by dividing that number of syllables by the desired time interval.

[Claim 2] The speech rate estimation method according to claim 1, characterized in that, in detecting the vowels in the input speech, the Euclidean distance between the LPC cepstrum coefficients of each frame of the input speech and the vowel standard pattern of each vowel, expressed in LPC cepstrum coefficients, is calculated, and the vowels are detected on the basis of the result of that calculation.

[Claim 3] The speech rate estimation method according to claim 1, characterized in that, in detecting the vowels in the input speech, hidden Markov models of vowels are used for the detection.