JP2001067092A

JP2001067092A - Voice detecting device

Info

Publication number: JP2001067092A
Application number: JP23934099A
Authority: JP
Inventors: Junko Yagi; 順子八木; Junichi Nakabashi; 順一中橋; Mitsuhiko Serikawa; 光彦芹川; Yoshihisa Nakato; 良久中藤; Dairo Katayama; 大朗片山
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1999-08-26
Filing date: 1999-08-26
Publication date: 2001-03-16

Abstract

PROBLEM TO BE SOLVED: To realize a voice detecting device less susceptible to noise level and capable of stably detecting voice zones in an input signal. SOLUTION: A noise level computing part 1 computes the noise level of an input signal, and a first threshold computing part 2 computes a power threshold and voice start-point correcting parameters corresponding to the noise level. A detection parameter computing part 3 computes voice detection parameters such as standard deviation of power, and a voice start-point detecting part 4 detects voice start points. A voice start-point correcting part 5 corrects the voice start points by using the voice start-point correcting parameters. A second threshold computing part 6 computes a voice detection parameter threshold corresponding to the noise level. A voice detecting part 7 detects voice zones by using the voice detection parameter threshold and voice detection parameters so as to cover input signals after the corrected voice start points.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Technical Field of the Invention: The present invention relates to a voice detection device that detects voice sections in an input signal as preprocessing for a speech recognition device, a voice-activated recording device, or the like.

[0002]

Description of the Related Art: In recent years, speech recognition technology has been used to operate various devices by voice and to accept voice as an input means in place of a keyboard or the like.

[0003] To perform speech recognition, the voice section that actually contains human speech must first be detected in the input signal. A feature representation of the speech, such as a cepstrum (hereinafter, the input vector), is then calculated from the input signal in the detected voice section. The calculated input vector is matched against standard models that store in advance the input vectors of the words to be recognized, and the model with the highest similarity is taken as the speech recognition result.

[0004] To put such speech recognition technology into practical use, recognition performance must be improved in real environments where various kinds of background noise are present. The louder the background noise, the more the speech signal is buried in it and the more recognition performance degrades.

[0005] One way to improve recognition performance in real environments is to detect voice sections accurately, that is, to determine when speech starts and ends, even in environments with loud background noise.

[0006] FIG. 5 shows how the power of an input signal containing speech varies over time, with and without noise. A conventional voice detection device judges a section in which the input signal power exceeds threshold 23 to be a voice section. The threshold e_th is generally set to several times the mean noise power (the noise level). The noise level can be obtained by treating as noise the input signal in a section where the signal level is low and which is therefore considered to be non-speech. The standard deviation of the input signal power within a fixed period (hereinafter, the standard deviation of power) is also sometimes used to detect voice sections. The standard deviation of power becomes large when the power fluctuates strongly, and its rise is especially pronounced when speech input begins. FIG. 6 shows how the standard deviation of power varies over time, with and without noise. In non-speech sections the standard deviation of power is close to 0 regardless of the noise level, so judging the input signal to be speech or non-speech is very easy. The threshold used when detecting voice sections by the standard deviation of power is, in the same way as the noise threshold, generally set to several times the mean of the standard deviation of the noise power.

[0007]

Problems to Be Solved by the Invention: In a conventional voice detection device, however, the louder the background noise, the less accurately the start and end times of speech can be detected, and the speech recognition rate drops as a result. The following causes can be considered.

[0008] First, when power is used as the parameter for voice section detection, the threshold, which is set to several times the noise level, rises as the noise level rises. As a result, the detected start of the voice section lags behind the actual start, or the voice section cannot be detected at all.

[0009] The standard deviation of power also has the drawback that it grows in proportion to the magnitude of the power. The higher the noise level, the larger the standard deviation of power in the noise portion, so the threshold, which is set to several times the mean of the standard deviation of the noise power, also rises. Consequently, using the standard deviation of power as the voice section detection parameter causes the same problem as using power, although to a lesser degree.

[0010] In addition, as noted above, the threshold for the standard deviation of power is set to a large value in noisy conditions. After the start of a voice section has been detected, if the power remains above a certain level but fluctuates little, the standard deviation of power can therefore drop below the threshold for detecting the end of the voice section. In that case the voice section is wrongly judged to have ended.

[0011] Furthermore, across the variety of real environments, the appropriate threshold for voice section detection differs with the type of noise (hereinafter, the noise pattern) even when the noise level is the same. Calculating the threshold from the noise level alone can therefore prevent accurate voice section detection.

[0012] Also, particularly in noise, when a word containing a geminate consonant (sokuon) such as "Sapporo" or "gakkou" (school) is uttered, the voice detection parameter can fall below the threshold even inside the voice section, and the end of speech is falsely detected. If the thresholds are then updated successively after the end is detected, the estimate is made from the speech that follows the falsely detected end, a high threshold is set, and voice sections in subsequent speech input cannot be detected accurately.

[0013] In view of the above problems, an object of the present invention is to provide a voice detection device that is hardly affected by the noise level and that detects voice sections stably across various noise patterns and uttered vocabularies.

[0014]

Means for Solving the Problems: To solve the above problems, the voice detection device of the present invention comprises: first threshold calculating means that, based on the noise level obtained by noise level calculating means, calculates a power threshold used to detect a voice start point by power and a voice start-point correction parameter used to correct that power-based voice start point; voice start-point correcting means that corrects the power-based voice start point detected by voice start-point detecting means, using the voice start-point correction parameter, to obtain a corrected voice start point; and second threshold calculating means that calculates a voice detection parameter threshold used to detect voice sections, based on the noise level and the value of the voice detection parameter at the corrected voice start point.

[0015] With this configuration, the voice start point is detected and corrected according to the noise level, and voice sections are then detected from the corrected voice start point as a reference, using the voice detection parameter and a voice detection parameter threshold that depends on the noise level. This makes stable voice section detection possible with little influence from the noise level.

[0016] Further, in the voice detection device of the present invention, the voice detection parameter is the standard deviation of the input signal power within a fixed period, normalized by the mean of the input signal power within a fixed period (hereinafter, the normalized standard deviation).

[0017] With this configuration, because the standard deviation of power no longer depends on the magnitude of the power once it is normalized by the mean power, using the normalized standard deviation as the voice detection parameter makes stable voice section detection possible with little influence from the noise level.

[0018] Also, in the voice detection device of the present invention, the voice detection parameters are the input signal power and the standard deviation of the input signal power within a fixed period, and the end of a voice section is detected when both the input signal power and the standard deviation fall to or below their respective thresholds obtained by the threshold calculating means.

[0019] With this configuration, using power together with the standard deviation of power as voice detection parameters prevents the end of a voice section from being falsely detected while the input signal still has sufficient power.
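The end-of-section rule described above can be sketched as a single predicate. This is only an illustration under stated assumptions: the function name `is_section_end`, the parameter name `power_end_th`, and all numeric values are hypothetical and do not come from the patent text.

```python
def is_section_end(power, sdev, power_end_th, vl_th):
    """Return True only when BOTH parameters are at or below their thresholds.

    power        -- current input signal power
    sdev         -- current standard deviation of power
    power_end_th -- hypothetical threshold for power
    vl_th        -- end-detection threshold for the standard deviation
    """
    return power <= power_end_th and sdev <= vl_th


# Sustained vowel: power is high but barely fluctuates, so sdev is small.
# Judging by sdev alone would end the section here; the combined rule does not.
sustained_vowel = is_section_end(power=50.0, sdev=0.3, power_end_th=12.0, vl_th=1.0)

# Trailing silence: both power and sdev are low, so the section ends.
trailing_silence = is_section_end(power=4.0, sdev=0.2, power_end_th=12.0, vl_th=1.0)
```

Requiring both conditions is what keeps a steady, high-power sound from being cut off when its fluctuation drops.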

[0020] Further, in the voice detection device of the present invention, the voice detection parameters are the input signal power and the normalized standard deviation, and the end of a voice section is detected when both the input signal power and the normalized standard deviation fall to or below their respective thresholds obtained by the threshold calculating means.

[0021] With this configuration, using power together with the normalized standard deviation as voice detection parameters prevents the end of a voice section from being falsely detected while the input signal still has sufficient power.

[0022] Further, in the voice detection device of the present invention, the voice start-point correction parameter is obtained from at least one of a table and a function that take the noise level as a variable.

[0023] With this configuration, a voice start-point correction parameter finely matched to the noise level can be obtained.

[0024] Further, the voice detection device of the present invention comprises noise pattern determining means for determining the type of noise (hereinafter, the noise pattern). The first threshold calculating means calculates the first threshold based on the noise pattern in addition to the noise level, and the second threshold calculating means calculates the second threshold based on the noise pattern in addition to the noise level and the voice detection parameter at the corrected voice start point.

[0025] With this configuration, the noise pattern of the input signal is determined before voice section detection and is taken into account when the thresholds are calculated, which makes stable voice section detection possible across various noise patterns.

[0026] Also, the voice detection device of the present invention comprises, between the detection parameter calculation unit and the voice start-point detection unit, noise level re-estimation determining means for determining whether the noise level calculated by the noise level calculating means is appropriate.

[0027] With this configuration, even if the end of speech is falsely detected because of the uttered vocabulary, the noise level is re-estimated as necessary, which eliminates noise level estimation errors. The voice detection parameter threshold is therefore not set incorrectly, and voice sections can be detected accurately.

[0028]

Embodiments of the Invention: Voice detection devices according to embodiments of the present invention are described below with reference to the drawings.

[0029] (First Embodiment) FIG. 1 shows the configuration of a voice detection device according to a first embodiment of the present invention. The voice detection device of FIG. 1 comprises a noise level calculation unit 1 that calculates the noise level, a first threshold calculation unit 2 that calculates the power threshold and the voice start-point correction parameter, a detection parameter calculation unit 3 that calculates the voice detection parameter, a voice start-point detection unit 4 that detects the voice start point, a voice start-point correction unit 5 that corrects the voice start point, a second threshold calculation unit 6 that calculates the voice detection parameter threshold, a voice detection unit 7 that detects voice sections, at least one of a first table 8 and a first function 9, and at least one of a second table 10 and a second function 11.

[0030] Here, the standard deviation of power is taken as the voice detection parameter. FIGS. 1 and 5 are used in the explanation.

[0031] The noise level calculation unit 1 takes as input the power of an input signal containing speech and outputs the noise level μ of the non-speech section. FIG. 5 shows the temporal variation of the input signal power. After the input signal power P_i (i = 1, 2, ..., n) has been taken in, the mean of P_i over a fixed period n 22 is calculated, which gives the average power P_ave of the non-speech section according to Equation 1.

【００３２】[0032]

【数１】 (Equation 1)

[0033] Based on this P_ave, the noise level μ is determined using a function s (Equation 2).

【００３４】[0034]

【数２】 (Equation 2)
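As a minimal sketch of the noise level step, the following averages the first n power samples, which are assumed to be non-speech, and passes the result through a function s. The form of s is not given in the surrounding text, so the identity function used as the default here is purely an assumption, as are the sample values.

```python
def estimate_noise_level(power, n, s=lambda p_ave: p_ave):
    """Estimate the noise level mu from the first n power samples.

    power -- sequence of frame powers P_1, P_2, ...
    n     -- number of leading samples assumed to be non-speech
    s     -- mapping from average power to noise level (identity is an assumption)
    """
    if n <= 0 or len(power) < n:
        raise ValueError("need at least n leading power samples")
    p_ave = sum(power[:n]) / n   # mean of P_1..P_n, as described for Equation 1
    return s(p_ave)              # mu = s(P_ave), as described for Equation 2


# Hypothetical frame powers: four quiet frames followed by speech.
frame_power = [2.0, 4.0, 3.0, 3.0, 50.0, 80.0]
mu = estimate_noise_level(frame_power, n=4)
```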

[0035] The first threshold calculation unit 2 takes as input the noise level μ calculated by the noise level calculation unit 1, and outputs the power threshold e_th 23 used to detect the voice start point by power and the voice start-point correction parameter h 25 used to correct the power-based voice start point. The power threshold e_th 23 and the voice start-point correction parameter h 25 can be obtained from the noise level μ calculated by the noise level calculation unit by referring to a table defined in advance. Table 1 shows an example of such a table.

【００３６】[0036]

【表１】 [Table 1]

[0037] Table 1 shows the case where threshold coefficients C_E_TH are stored in the table; the power threshold is obtained by multiplying the threshold coefficient C_E_TH by the noise level (Equation 3). Of course, the power threshold itself may be stored in the table instead.

【００３８】[0038]

【数３】 (Equation 3)
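The table lookup followed by the multiplication can be sketched as below. The contents of Table 1 are not reproduced in this text, so the range boundaries and coefficient values are hypothetical placeholders chosen only to make the example run.

```python
import bisect

# Hypothetical stand-in for Table 1: noise-level range boundaries and the
# threshold coefficient C_E_TH assigned to each range.
NOISE_UPPER_BOUNDS = [10.0, 100.0]   # ranges: mu < 10, 10 <= mu < 100, mu >= 100
C_E_TH_BY_RANGE = [4.0, 3.0, 2.0]    # one coefficient per range


def power_threshold(mu):
    """Look up C_E_TH for the noise level mu and return e_th = C_E_TH * mu."""
    idx = bisect.bisect_right(NOISE_UPPER_BOUNDS, mu)
    return C_E_TH_BY_RANGE[idx] * mu


e_th_quiet = power_threshold(5.0)    # 4.0 * 5.0
e_th_noisy = power_threshold(200.0)  # 2.0 * 200.0
```

Storing coefficients rather than absolute thresholds keeps the table small, since the multiplication by μ supplies the scale.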

[0039] The voice start-point correction parameter h 25 can also be obtained by a function. Table 2 shows an example in which the voice start-point correction parameter h 25 is obtained by a function.

【００４０】[0040]

【表２】 [Table 2]

[0041] The example function shown here is a simple one that obtains the voice start-point correction parameter h 25 as a linear function of the noise level (Equation 4). The function is not limited to this; a quadratic function of the noise level or the like can also be used (Equation 5). A function that takes a discrete value for each range of noise levels can also be used (Equation 6).

【００４２】[0042]

【数４】 (Equation 4)

【００４３】[0043]

【数５】 (Equation 5)

【００４４】[0044]

【数６】 (Equation 6)
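The three functional forms mentioned above (linear, quadratic, and stepwise in the noise level) might look like the following sketch. The equations themselves are not reproduced in this text, so every coefficient, range boundary, and function name here is a hypothetical placeholder.

```python
def h_linear(mu, a=0.5, b=2.0):
    """Correction parameter as a linear function of the noise level."""
    return a * mu + b


def h_quadratic(mu, a=0.01, b=0.5, c=2.0):
    """Correction parameter as a quadratic function of the noise level."""
    return a * mu * mu + b * mu + c


def h_stepwise(mu, steps=((10.0, 2), (100.0, 5)), default=8):
    """Correction parameter taking one discrete value per noise-level range.

    steps   -- (upper_bound, h) pairs in ascending order of upper_bound
    default -- value used when mu is at or above the last upper bound
    """
    for upper, h in steps:
        if mu < upper:
            return h
    return default
```

The stepwise form behaves like a table lookup, which is why the text treats tables and functions as interchangeable.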

[0045] The detection parameter calculation unit 3 takes the input signal power P_i as input and outputs the standard deviation of power, sdev. The standard deviation of power sdev is obtained by first calculating the mean power P_ave over a fixed period (Equation 7).

【００４６】[0046]

【数７】 (Equation 7)
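A sliding-window standard deviation of power can be sketched as follows. The exact form of Equation 7 is not reproduced in this text, so the population form (dividing by the window length) and the window size are assumptions.

```python
import math


def power_sdev(power, i, window):
    """Standard deviation of the power samples in the window ending at index i.

    Uses the population form (divide by the number of samples); whether the
    patent divides by n or n - 1 is not stated here, so this is an assumption.
    """
    start = max(0, i - window + 1)
    frame = power[start:i + 1]
    p_ave = sum(frame) / len(frame)
    return math.sqrt(sum((p - p_ave) ** 2 for p in frame) / len(frame))


# Hypothetical frame powers: steady noise, then a sharp rise as speech begins.
power = [3.0, 3.0, 3.0, 3.0, 9.0, 27.0]
sdev = [power_sdev(power, i, window=4) for i in range(len(power))]
```

In this example the value stays at zero over the steady frames and climbs once the rising frames enter the window, which is the onset behavior the description relies on.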

[0047] The detection parameter calculation unit may also calculate an intermediate value mp_i of the power over a fixed period and calculate the standard deviation of power sdev from that value. This intermediate value mp_i may be used in place of the power in the voice start-point detection unit in the next stage.

[0048] The voice start-point detection unit 4 takes as input the input signal power (P_i or mp_i) and the power threshold e_th 23, and outputs the voice start point s_start 24. As shown in FIG. 5, it sets the voice start point s_start 24 to the time at which the power (P_i or mp_i) exceeds the power threshold e_th 23. Exceeding the power threshold e_th 23 is taken to mean that speech may have been input. As shown in Equation 2 above, the power threshold e_th 23 is obtained by multiplying the noise level μ by a threshold coefficient C_E_TH that is constant regardless of the noise level. Looking at the temporal variation of power in FIG. 5, however, the power in the non-speech section differs between the case with noise 20 and the case without noise 21, and the higher the noise level, the higher the power threshold becomes. The higher the noise level and hence the power threshold, the later the detected voice start point falls relative to the actual voice start point, so correction is necessary.

[0049] To account for the fact that the voice start point is detected later than the actual start as the noise level rises, the voice start-point correction unit 5 corrects the voice start point by an amount of time that depends on the noise level μ. Its inputs are the voice start point s_start 24 detected by the voice start-point detection unit and the voice start-point correction parameter h 25, and its output is the corrected voice start point s_start' 26. s_start' 26 is obtained by subtracting the voice start-point correction parameter from the detected voice start point s_start 24 (Equation 8).

【００５０】[0050]

【数８】 (Equation 8)
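Detection of the start point by a power threshold, followed by the subtraction of the correction parameter, can be sketched as below. Working in frame indices, returning `None` when nothing exceeds the threshold, and clamping the corrected start at frame 0 are all assumptions of this sketch, as are the sample values.

```python
def detect_start(power, e_th):
    """Return the index of the first frame whose power exceeds e_th, or None."""
    for i, p in enumerate(power):
        if p > e_th:
            return i
    return None


def correct_start(s_start, h):
    """Shift the detected start earlier by h frames (s_start' = s_start - h).

    Clamping at frame 0 is an assumption added so the index stays valid.
    """
    return max(0, s_start - h)


# Hypothetical frame powers with speech rising gradually out of noise.
power = [3.0, 3.0, 4.0, 8.0, 20.0, 60.0, 55.0]
s_start = detect_start(power, e_th=12.0)
s_start_corrected = correct_start(s_start, h=2)
```

Here the threshold is first crossed at frame 4, while the rise actually begins around frame 2; subtracting h recovers the earlier frame.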

[0051] The second threshold calculation unit 6 takes as input the noise level calculated by the noise level calculation unit 1, the voice start point s_start' 26 corrected by the voice start-point correction unit 5, and the power standard deviation sdev, and outputs the voice detection parameter thresholds. First, the mean sdev_ave of the standard deviation of power over a fixed period is obtained on the basis of the corrected voice start point s_start' 26 (Equation 9).

【００５２】[0052]

【数９】 (Equation 9)

[0053] Based on this mean sdev_ave, the thresholds for the power standard deviation are calculated by referring to a table defined in advance according to noise level, in the same way as in the first threshold calculation unit 2.

【００５４】[0054]

【表３】 [Table 3]

[0055] The thresholds for the power standard deviation are a start-detection threshold vh_th 29 and an end-detection threshold vl_th 30. If the power standard deviation is greater than the start-detection threshold vh_th 29 a fixed time after the power exceeds the power threshold, speech is deemed to have been detected. Conversely, if the power standard deviation is smaller than the start-detection threshold vh_th after that fixed time, noise is deemed to have been detected. After speech has been detected, the end of speech is deemed to have been detected when the standard deviation of power falls below the end-detection threshold vl_th 30. Table 3 shows an example of the table. Table 3 shows the case where the threshold coefficients vh_coef and vl_coef are stored in the table; each threshold is obtained by multiplying the mean sdev_ave of the voice detection parameter at the voice start point by the corresponding coefficient. Of course, the thresholds themselves may be stored in the table instead (Equations 10 and 11).

【００５６】[0056]

【数１０】 (Equation 10)

【００５７】[0057]

【数１１】 [Equation 11]
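The two thresholds for the standard deviation can be sketched as a mean over a short window anchored at the corrected start point, multiplied by the two coefficients. The text does not say exactly where the averaging window sits relative to s_start', so starting the window at s_start' is an assumption, and the coefficient values are hypothetical.

```python
def sdev_thresholds(sdev, s_start_corrected, window, vh_coef, vl_coef):
    """Return (vh_th, vl_th) derived from the mean sdev near the corrected start.

    The window is assumed to begin at s_start_corrected and span `window`
    frames; the patent text only says the mean is taken "on the basis of"
    the corrected start point.
    """
    frame = sdev[s_start_corrected:s_start_corrected + window]
    if not frame:
        raise ValueError("no sdev samples available at the corrected start point")
    sdev_ave = sum(frame) / len(frame)
    return vh_coef * sdev_ave, vl_coef * sdev_ave


# Hypothetical sdev track and coefficients (stand-ins for Table 3 values).
sdev = [0.0, 0.0, 2.0, 4.0, 6.0, 9.0]
vh_th, vl_th = sdev_thresholds(sdev, s_start_corrected=2, window=3,
                               vh_coef=1.5, vl_coef=0.5)
```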

[0058] The detection parameter thresholds can also be obtained by a function. Table 4 shows an example in which the end-detection threshold coefficient vl_coef for the standard deviation of power, used as the voice detection parameter, is obtained by a function.

【００５９】[0059]

【表４】 [Table 4]

[0060] The example shown here is a simple one that obtains the threshold coefficient of the voice section end-detection parameter as a linear function of the noise level μ (Equation 12). The function is not limited to this; a quadratic function of the noise level or the like can also be used (Equation 13). A function that takes a discrete value for each range of noise levels can also be used (Equation 14).

【００６１】[0061]

【数１２】 (Equation 12)

【００６２】[0062]

【数１３】 (Equation 13)

【００６３】[0063]

【数１４】 [Equation 14]

[0064] The voice detection unit 7 takes as input the voice detection parameter (the standard deviation of power, sdev) and the voice detection parameter thresholds (vh_th 29 and vl_th 30), and outputs the start sf and the end ef of the speech. FIG. 6 shows the temporal variation of the standard deviation of power. As described above, if the standard deviation is greater than the start-detection threshold vh_th 29 a fixed time after the corrected voice start point s_start' 26, speech is deemed to have been detected; if it is smaller, noise is deemed to have been detected. When speech has been detected, the time at which the standard deviation of power falls below the end-detection threshold vl_th 30 is taken as the end of the voice section, s_end 31. The desired start of speech is the voice start point s_start 24 and the desired end is the time s_end 31 at which the standard deviation of power falls below the end-detection threshold vl_th. However, because the standard deviation of power lags behind the rise and fall of the power itself, the start and end are set to the times detected by the respective thresholds minus the corresponding delay (Equations 15 and 16).

【００６５】[0065]

【数１５】 (Equation 15)

【００６６】[0066]

【数１６】 (Equation 16)

[0067] Table 5 shows an example of delay_sf and delay_ef, which account for the delay of the standard deviation of power.

【００６８】[0068]

【表５】 [Table 5]

[0069] As Table 5 shows, delay_sf and delay_ef take values that vary with the noise level μ. They may therefore also be expressed as functions of the noise level μ.
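The decision flow of the voice detection unit, including the delay compensation, can be sketched as follows. This is a simplified reading of the description, not the patented procedure: the single check at `start_point + wait`, the use of frame indices, the clamping, and all numeric values are assumptions, and which start point is passed in (detected or corrected) is left to the caller.

```python
def detect_voice_section(sdev, start_point, wait, vh_th, vl_th,
                         delay_sf=0, delay_ef=0):
    """Return (sf, ef) frame indices of the voice section, or None.

    sdev        -- per-frame standard deviation of power
    start_point -- reference voice start point (frame index)
    wait        -- fixed number of frames to wait before the start check
    vh_th       -- start-detection threshold
    vl_th       -- end-detection threshold
    delay_sf    -- delay subtracted from the start (hypothetical value)
    delay_ef    -- delay subtracted from the end (hypothetical value)
    """
    check = start_point + wait
    if check >= len(sdev) or sdev[check] <= vh_th:
        return None                      # judged to be noise, not speech
    for t in range(check + 1, len(sdev)):
        if sdev[t] < vl_th:              # sdev fell below the end threshold
            sf = max(0, start_point - delay_sf)
            ef = max(sf, t - delay_ef)
            return sf, ef
    return None                          # no end found in the available data


sdev = [0.1, 0.2, 3.0, 5.0, 6.0, 4.0, 1.0, 0.2, 0.1]
section = detect_voice_section(sdev, start_point=2, wait=2,
                               vh_th=4.0, vl_th=1.5, delay_sf=1, delay_ef=1)
rejected = detect_voice_section(sdev, start_point=2, wait=2,
                                vh_th=10.0, vl_th=1.5)
```

Since the description has the delays vary with the noise level, a caller could look up `delay_sf` and `delay_ef` from a table or function of μ before calling this.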

[0070] As described above, according to the first embodiment, the power threshold, the voice start-point correction parameter, and the threshold for the standard deviation of power used as the voice detection parameter are all obtained according to the noise level, so voice sections can be detected with little influence from noise.

[0071] (Second Embodiment) A second embodiment of the present invention uses the normalized standard deviation as the voice detection parameter of the first embodiment.

[0072] In FIG. 1, the second threshold calculation unit 6 obtains thresholds 29 and 30 for the standard deviation of power normalized by the average power, for detecting the start and the end of a voice section respectively. The detection parameter calculation unit 3 obtains the normalized standard deviation. The noise level calculation unit 1, first threshold calculation unit 2, voice start-point detection unit 4, voice start-point correction unit 5, voice detection unit 7, first and second tables 8 and 10, and first and second functions 9 and 11 operate in the same way as the identically numbered components of the first embodiment.

【００７３】[0073] The detection parameter calculation unit 3 takes the power P_i of the input signal as input and outputs the normalized standard deviation sdev_r as the voice detection parameter. The normalized standard deviation is obtained by dividing the standard deviation of the power over a fixed period of time by the average value of the power (Equation 17).

【００７４】[0074]

【数１７】 [Equation 17]
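Equation 17 is not reproduced in this text. As a minimal sketch based only on the verbal description in paragraph [0073], the normalized standard deviation can be computed over a sliding window as shown below. The function name `normalized_sdev`, the use of the population standard deviation, and the handling of a zero mean are assumptions.

```python
from statistics import mean, pstdev


def normalized_sdev(power, window):
    """Standard deviation of the power over each window, divided by the
    mean power over the same window.

    power  : sequence of per-frame power values P_i
    window : number of frames in the fixed period (assumed parameter)
    """
    out = []
    for i in range(window - 1, len(power)):
        frame = power[i - window + 1 : i + 1]
        m = mean(frame)
        # Guard against division by zero for an all-zero window (assumption).
        out.append(pstdev(frame) / m if m > 0 else 0.0)
    return out
```

Multiplying every power value by the same constant scales both the standard deviation and the mean by that constant, so the ratio is unchanged. This is the power independence that the second embodiment relies on.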

【００７５】[0075] FIG. 7 shows the temporal change of the normalized standard deviation.

【００７６】[0076] The second threshold value calculation unit 6 takes the normalized standard deviation sdev_r as input and outputs the start-detection threshold vh_th 32 and the end-detection threshold vl_th 33.

【００７７】[0077] The voice detection unit 7 takes the normalized standard deviation sdev_r, the start-detection threshold vh_th 32, and the end-detection threshold vl_th 33 as input, and outputs the start sf and the end ef of the voice section.

【００７８】[0078] Because the standard deviation of the power is normalized by the average power, the voice detection parameter no longer depends on the power and is therefore less susceptible to the noise level, which does depend on the power.

【００７９】[0079] As described above, according to the second embodiment, normalizing the standard deviation of the power by the average power removes its dependence on the power, so voice sections can be detected more reliably even in a poor environment with a very high noise level.

【００８０】[0080] (Third Embodiment) Even within a voice section in which the input signal has sufficient power, the standard deviation of the power can become small because the power fluctuates little, and the voice section may then be erroneously detected as having ended. To prevent such erroneous detection, the third embodiment uses the power itself together with the standard deviation of the power as the voice detection parameters of the first embodiment when detecting the end of a voice section.

【００８１】[0081] In FIG. 1, the voice detection unit 7 detects the end of the voice section only when the standard deviation of the power falls below the threshold vl_th and the power also falls below the power threshold e_th. The noise level calculation unit 1, the first threshold value calculation unit 2, the detection parameter calculation unit 3, the voice start point detection unit 4, the voice start point correction unit 5, the second threshold value calculation unit 6, the voice detection unit 7, the first and second tables 8 and 10, and the first and second functions 9 and 11 operate in the same way as the identically numbered components of the first embodiment.

【００８２】[0082] FIG. 8 shows the temporal variation of the standard deviation of the power of a certain utterance. In FIG. 8, 34 is the point at which the standard deviation of the power rises above the start-detection threshold vh_th 29, and 35 and 36 are points at which it falls below the end-detection threshold vl_th 30. When this standard deviation is input to the voice detection unit 7, the end is erroneously detected at time 35, which lies inside the voice section, because the standard deviation falls below vl_th 30 there, even though the end that should actually be detected is 36. Such a drop in the standard deviation within a voice section occurs when the power stays at a roughly constant value and fluctuates little. Therefore, to prevent the end from being erroneously detected inside the voice section, the voice detection unit 7 detects the end of the voice section only when the standard deviation of the power falls below the end-detection threshold vl_th 30 and the power falls below the power threshold e_th 23 in FIG. 5.
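The combined end condition of the third embodiment can be illustrated with a short sketch. The function name `find_end` and the per-frame list representation are assumptions made for illustration only.

```python
def find_end(sdev, power, start, vl_th, e_th):
    """Return the first frame at or after `start` where BOTH the standard
    deviation of the power is below vl_th AND the power is below e_th.
    Returns None if no such frame exists.
    """
    for t in range(start, min(len(sdev), len(power))):
        if sdev[t] < vl_th and power[t] < e_th:
            return t
    return None
```

With a sustained sound (high power, small fluctuation), a test on the standard deviation alone would fire inside the voice section, at the point labelled 35 in FIG. 8. Requiring the power to be low as well postpones detection until the power has actually dropped.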

【００８３】[0083] Although the power threshold e_th used for detecting the voice start point from the power is used here to detect the end of the voice section, a separate power threshold for end detection may be determined and used instead. The same effect is obtained when the power and the normalized standard deviation, rather than the standard deviation of the power, are used together as voice detection parameters. That is, the end of the voice section may be detected only when the normalized standard deviation falls below the end-detection threshold vl_th 33 in FIG. 7 and the power falls below the power threshold e_th 23 in FIG. 5.

【００８４】[0084] As described above, according to the third embodiment, using the power of the input signal as well when detecting the end of a voice section prevents the end from being erroneously detected while the input signal still has sufficient power.

【００８５】[0085] (Fourth Embodiment) In the first to third embodiments, each threshold is set based on the noise level, but even at the same noise level the appropriate threshold differs depending on the type of noise (hereinafter, noise pattern). The fourth embodiment of the present invention adds to the first to third embodiments a noise pattern determination unit 12 that determines the noise pattern from the input signal. Setting the voice detection parameter threshold based on both the noise level and the noise pattern of the input signal provides stable voice section detection.

【００８６】[0086] In FIG. 2, the noise pattern determination unit 12 takes the voice signal as input and outputs a noise pattern. FIG. 9 shows the noise pattern determination unit. The noise pattern determination unit 12 comprises a speech analysis unit 37 that calculates a time series of input vectors representing the acoustic features of the input signal and a time series of the power; a noise model storage unit 38 that stores such input vectors for various noise models; and a noise pattern selection unit 39 that calculates the similarity between the input vectors calculated by the speech analysis unit 37 and the noise models stored in the noise model storage unit 38 and determines the current noise pattern. The noise pattern selection unit 39 may either select the noise pattern that corresponds to one of those stored in the noise model storage unit 38 or select a combination of several noise patterns.
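As an illustration of the selection step performed by the noise pattern selection unit 39, the sketch below picks the stored noise model closest to the input vector. The source does not specify the feature vectors or the similarity measure, so the Euclidean distance, the three-element vectors, and the names `NOISE_MODELS` and `select_noise_pattern` are all hypothetical.

```python
import math

# Hypothetical stored noise models: one reference feature vector per pattern.
NOISE_MODELS = {
    "white": [1.0, 1.0, 1.0],
    "low": [3.0, 1.0, 0.2],
    "high": [0.2, 1.0, 3.0],
}


def select_noise_pattern(input_vec, noise_models=NOISE_MODELS):
    """Return the name of the stored model nearest to input_vec.

    A smaller Euclidean distance is treated as higher similarity
    (an assumption; the source leaves the measure unspecified).
    """
    return min(noise_models,
               key=lambda name: math.dist(input_vec, noise_models[name]))
```

Selecting a combination of patterns, which the text also allows, would need a different rule, for example keeping every model whose distance falls under some bound.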

【００８７】[0087] Alternatively, the noise pattern determination unit 12 may use the following method.

【００８８】[0088] The noise pattern determination unit 12 uses a frequency discriminator to examine at which frequencies the power is large, and determines whether the noise pattern of the input signal is high-frequency noise, low-frequency noise, white noise, or the like.

【００８９】[0089] The first threshold value calculation unit 13 takes the noise pattern as input in addition to the noise level, and outputs the power threshold and the voice start point correction parameter. Each threshold can be obtained by looking up a predetermined table according to the noise level obtained by the noise level calculation unit 1 and the noise pattern obtained by the noise pattern determination unit 12. Table 6 shows an example of such a table.

【００９０】[0090]

【表６】 [Table 6]

【００９１】[0091] Taking both the noise level and the noise pattern into account in this way is expected to allow more accurate threshold setting. The table may hold threshold coefficients, in which case each threshold is obtained by multiplying the coefficient by the noise level. The table may, of course, hold the thresholds themselves. The thresholds may also be obtained from, for example, a linear function whose variables are the noise level and the noise pattern. A function that takes discrete values for each range of noise level and each noise pattern may also be used.
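A table lookup of the kind described above might look like the following sketch. Table 6 is not reproduced in this text, so every numeric value, the pattern names, the noise-level ranges, and the names `FIRST_TABLE` and `first_thresholds` are hypothetical placeholders rather than values from the source.

```python
import bisect

# Hypothetical table. For each noise pattern, rows are
# (upper bound of noise-level range, power-threshold coefficient,
#  voice start point correction parameter in frames).
FIRST_TABLE = {
    "white": [(10.0, 2.0, 5), (30.0, 1.6, 8), (float("inf"), 1.3, 12)],
    "low": [(10.0, 2.2, 4), (30.0, 1.8, 7), (float("inf"), 1.4, 10)],
}


def first_thresholds(noise_level, pattern, table=FIRST_TABLE):
    """Return (power threshold e_th, start point correction parameter).

    The row is chosen by noise pattern and noise-level range; the power
    threshold is the stored coefficient multiplied by the noise level.
    """
    rows = table[pattern]
    bounds = [row[0] for row in rows]
    i = bisect.bisect_left(bounds, noise_level)
    _, coeff, correction = rows[i]
    return coeff * noise_level, correction
```

Storing coefficients rather than absolute thresholds lets one row cover a range of noise levels while the threshold still scales with the measured level. Storing the thresholds directly, as the text also permits, would simply drop the multiplication.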

【００９２】[0092] The second threshold value calculation unit 14 takes the noise pattern as input in addition to the noise level, the voice start point corrected by the voice start point correction unit 5, and the standard deviation of the power, and outputs the voice detection parameter threshold. As with the first threshold value calculation unit 13, the threshold may be calculated using a table or a function. When a threshold coefficient is set, the threshold is obtained by multiplying the coefficient by the average value of the detection parameter at the corrected voice start point.

【００９３】[0093] As described above, the first and second tables 15 and 17 and the first and second functions 16 and 18 take the noise pattern into account.

【００９４】[0094] The noise level calculation unit 1, the detection parameter calculation unit 3, the voice start point detection unit 4, the voice start point correction unit 5, and the voice detection unit 7 operate in the same way as the identically numbered components of the first to third embodiments.

【００９５】[0095] As described above, according to the fourth embodiment, the noise pattern of the input signal is determined before voice section detection and the thresholds are set with the noise pattern also taken into account, so stable voice section detection is possible for a variety of noise patterns.

【００９６】[0096] (Fifth Embodiment) Particularly in noisy conditions, when a word containing a geminate consonant (sokuon), such as "Sapporo" or "gakkou" (school), is uttered, the voice detection parameter may fall below the threshold even within the voice section, and the end of the voice may be erroneously detected. If the thresholds are updated successively after each end is detected, the estimate is then made from the part of the voice section that follows the erroneously detected end, so an excessively high threshold is set and voice sections cannot be detected accurately for subsequent voice input. To prevent this, the fifth embodiment of the present invention adds a noise level re-estimation determination unit between the voice detection parameter calculation unit and the voice start point detection unit of the fourth embodiment.

【００９７】[0097] In FIG. 3, the noise level re-estimation determination unit 19 determines whether the noise level calculated by the noise level calculation unit 1 is appropriate. After a voice section has been detected, the noise level calculation unit 1 first calculates a noise level. However, if the voice detection parameter remains greater than the voice detection parameter threshold for a certain period of time, the unit judges that the signal may still be voice and that the calculated noise level may in fact be a voice level, and determines that the noise level needs to be estimated again.

【００９８】[0098] The noise level re-estimation determination unit 19 operates after the detection parameter calculation unit 3 and before the voice start point detection unit 4. If re-estimation is not needed, processing continues with the voice start point detection unit and the subsequent units; if re-estimation is needed, processing returns to the noise level calculation unit 1.
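The re-estimation decision can be sketched as a check for a sustained run of frames above the threshold. The function name `needs_reestimation`, the frame-based representation, and the parameter `hold_frames` (the "certain period of time" expressed in frames) are assumptions.

```python
def needs_reestimation(param, threshold, hold_frames):
    """Return True if the voice detection parameter stays above the
    threshold for at least `hold_frames` consecutive frames.

    param       : per-frame voice detection parameter after a detected section
    threshold   : voice detection parameter threshold
    hold_frames : assumed length, in frames, of the fixed period
    """
    run = 0
    for value in param:
        # Count consecutive frames above the threshold; reset on any drop.
        run = run + 1 if value > threshold else 0
        if run >= hold_frames:
            return True
    return False
```

A caller would return to noise level calculation when this yields True and continue with voice start point detection otherwise.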

【００９９】[0099] Alternatively, the noise level re-estimation determination unit 19 may use the following method.

【０１００】[0100] If the voice detection parameter remains greater than the voice detection parameter threshold for a certain period of time after a voice section has been detected, the noise level re-estimation determination unit 19 waits until the voice detection parameter falls below the threshold and then determines that the noise level needs to be estimated again.

【０１０１】[0101] In addition, setting an upper limit and a lower limit on the voice detection parameter threshold prevents an excessively large or excessively small threshold from being set by mistake.
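Limiting the threshold in this way amounts to a clamp. A minimal sketch follows; the function name and the example limits are hypothetical.

```python
def clamp_threshold(threshold, lower, upper):
    """Keep a computed voice detection parameter threshold within
    [lower, upper], so that a misestimated noise level cannot push it
    to an extreme value."""
    return max(lower, min(threshold, upper))
```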

【０１０２】[0102] The noise level calculation unit 1, the first threshold value calculation unit 2, the detection parameter calculation unit 3, the voice start point detection unit 4, the voice start point correction unit 5, the second threshold value calculation unit 6, the voice detection unit 7, the first and second tables 8 and 10, and the first and second functions 9 and 11 operate in the same way as the identically numbered components of the first to third embodiments.

【０１０３】[0103] As described above, according to the fifth embodiment, the value of the voice detection parameter is tracked even after a voice section has been detected and the noise level is re-estimated as necessary. This prevents the erroneous threshold settings caused by misestimating the noise level, which occurs in particular after detecting a section of a word containing a geminate consonant, and allows stable voice section detection for a wide variety of spoken words.

【０１０４】[0104] (Sixth Embodiment) The sixth embodiment of the present invention integrates the fourth and fifth embodiments, adding both the noise pattern determination unit and the noise level re-estimation determination unit to the first to third embodiments.

【０１０５】[0105] In FIG. 4, the noise level calculation unit 1, the detection parameter calculation unit 3, the voice start point detection unit 4, the voice start point correction unit 5, and the voice detection unit 7 operate in the same way as the identically numbered components of the first to third embodiments. The noise pattern determination unit 12, the first threshold value calculation unit 13, the second threshold value calculation unit 14, the noise level re-estimation determination unit 19, the first and second tables 15 and 17, and the first and second functions 16 and 18 operate in the same way as the identically numbered components of the fourth and fifth embodiments.

【０１０６】[0106] Providing the noise pattern determination unit 12 and the noise level re-estimation determination unit 19 makes it possible to set appropriate thresholds for any noise pattern and to prevent the noise level misestimation caused by erroneous voice section detection, so stable voice section detection can be performed.

【０１０７】[0107]

【発明の効果】[Effects of the Invention] As described above, the present invention determines the power threshold, the voice start point correction parameter, and the voice detection parameter threshold according to the noise level, so it is less susceptible to the noise level and can detect voice sections stably.

【図面の簡単な説明】[Brief description of the drawings]

【図１】FIG. 1 is a block diagram showing the configuration of a voice detection device according to the first, second, and third embodiments of the present invention.

【図２】FIG. 2 is a block diagram showing the configuration of a voice detection device according to the fourth embodiment of the present invention.

【図３】FIG. 3 is a block diagram showing the configuration of a voice detection device according to the fifth embodiment of the present invention.

【図４】FIG. 4 is a block diagram showing the configuration of a voice detection device according to the sixth embodiment of the present invention.

【図５】FIG. 5 is a graph showing the temporal variation of the power of an input signal and the power threshold.

【図６】FIG. 6 is a graph showing the temporal variation of the standard deviation of the power of an input signal within a fixed time and its thresholds.

【図７】FIG. 7 is a graph showing the temporal variation of the normalized standard deviation of an input signal and its thresholds.

【図８】FIG. 8 is a graph showing a case in which the standard deviation of the power of an input signal within a fixed time drops and the end of the voice section is erroneously detected.

【図９】FIG. 9 is a block diagram showing the noise pattern determination unit of the present invention.

【符号の説明】[Explanation of symbols]

1 Noise level calculation unit
2 First threshold value calculation unit
3 Detection parameter calculation unit
4 Voice start point detection unit
5 Voice start point correction unit
6 Second threshold value calculation unit
7 Voice detection unit
8 First table
9 First function
10 Second table
11 Second function
12 Noise pattern determination unit
13 First threshold value calculation unit
14 Second threshold value calculation unit
15 First table
16 First function
17 Second table
18 Second function
19 Noise level re-estimation determination unit
20 Power of the input signal (without noise)
21 Power of the input signal (with noise)
22 Section used for calculating the noise level
23 Power threshold
24 Voice start point detected from the power
25 Voice start point correction parameter
26 Corrected voice start point
27 Standard deviation of the power of the input signal (without noise)
28 Standard deviation of the power of the input signal (with noise)
29 Threshold of the standard deviation of the power for detecting the start of a voice section
30 Threshold of the standard deviation of the power for detecting the end of a voice section
31 Time at which the standard deviation of the power falls below the end-detection threshold
32 Threshold of the standard deviation of the power normalized by the average power, for detecting the start of a voice section
33 Threshold of the standard deviation of the power normalized by the average power, for detecting the end of a voice section
34 Start of the voice section
35 Erroneously detected end of the voice section
36 End of the voice section
37 Speech analysis unit
38 Noise model storage unit
39 Noise pattern selection unit

Continued from the front page
(72) Inventor: Mitsuhiko Serikawa, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Yoshihisa Nakato, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Dairo Katayama, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
F-terms (reference): 5D015 CC14 DD03 DD04 DD05; 9A001 HH17 KK46

Claims

【特許請求の範囲】[Claims]

【請求項１】1. A voice detection device for detecting a voice section in an input signal, comprising:
noise level calculation means for calculating a noise level based on the power of the input signal in a section in which the power of the input signal is at or below a fixed value;
first threshold value calculation means for calculating, based on the noise level, a power threshold used for detecting a voice start point from the power and a voice start point correction parameter for correcting the voice start point detected from the power;
detection parameter calculation means for calculating a voice detection parameter from the power of the input signal;
voice start point detection means for detecting the voice start point from the power, using the power of the input signal and the power threshold;
voice start point correction means for correcting, using the voice start point correction parameter, the voice start point detected from the power by the voice start point detection means, to obtain a corrected voice start point;
second threshold value calculation means for calculating, based on the noise level and the voice detection parameter at the corrected voice start point, a voice detection parameter threshold used for detecting a voice section; and
voice detection means for detecting a voice section, using the voice detection parameter and the voice detection parameter threshold, in the input signal from the corrected voice start point onward.

【請求項２】2. The voice detection device according to claim 1, wherein the voice detection parameter is the standard deviation of the power of the input signal within a fixed time.

【請求項３】3. The voice detection device according to claim 1, wherein the voice detection parameter is the standard deviation of the power of the input signal within a fixed time, normalized by the average value of the power of the input signal within a fixed time.

【請求項４】4. The voice detection device according to claim 1, wherein the voice detection parameters are the power of the input signal and the standard deviation of the power of the input signal within a fixed time, and the end of a voice section is detected when the power of the input signal and the standard deviation both fall to or below the respective thresholds obtained by the threshold value calculation means.

【請求項５】5. The voice detection device according to claim 1, wherein the voice detection parameters are the power of the input signal and the standard deviation of the power of the input signal within a fixed time normalized by the average value of the power of the input signal within a fixed time, and the end of a voice section is detected when the power of the input signal and the normalized standard deviation both fall to or below the respective thresholds obtained by the threshold value calculation means.

【請求項６】6. The voice detection device according to claim 1, wherein the power threshold is obtained from at least one of a table and a function that take the noise level as a variable.

【請求項７】7. The voice detection device according to claim 1, wherein the voice start point correction parameter is obtained from at least one of a table and a function that take the noise level as a variable.

【請求項８】8. The voice detection device according to claim 1, wherein the voice detection parameter threshold is obtained from at least one of a table and a function that take the noise level as a variable.

【請求項９】9. The voice detection device according to claims 1 to 8, being a voice detection device for detecting a voice section in an input signal, further comprising noise pattern determination means for determining the type of noise (hereinafter, noise pattern), wherein the first threshold value calculation means calculates a first threshold based on the noise pattern in addition to the noise level, and the second threshold value calculation means calculates a second threshold based on the noise pattern in addition to the noise level and the voice detection parameter at the corrected voice start point.

【請求項１０】10. The voice detection device according to claims 1 to 8, being a voice detection device for detecting a voice section in an input signal, further comprising, between the detection parameter calculation unit and the voice start point detection unit, a noise level re-estimation determination unit that determines whether the noise level calculated by the noise level calculation means is appropriate.

【請求項１１】11. The voice detection device according to claims 1 to 8, being a voice detection device for detecting a voice section in an input signal, further comprising: noise pattern determination means for determining the noise pattern; and, between the detection parameter calculation unit and the voice start point detection unit, a noise level re-estimation determination unit that determines whether the noise level calculated by the noise level calculation means is appropriate, wherein the first threshold value calculation means calculates a first threshold based on the noise pattern in addition to the noise level, and the second threshold value calculation means calculates a second threshold based on the noise pattern in addition to the noise level and the voice detection parameter at the corrected voice start point.