JPH1049190A

JPH1049190A - Device and method for voice recognition

Info

Publication number: JPH1049190A
Application number: JP8220452A
Authority: JP
Inventors: Shoji Kuriki; 章次栗木
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1996-08-02
Filing date: 1996-08-02
Publication date: 1998-02-20
Anticipated expiration: 2016-08-02
Also published as: JP3484559B2

Abstract

PROBLEM TO BE SOLVED: To realize a voice recognition device that follows changes in the environment and yields correct recognition results with good usability even in a highly noisy environment. SOLUTION: When any one of the similarities of the recognition targets exceeds a reject threshold TH during voice input, a similarity average value calculating section 7 holds the average similarity from just before TH was exceeded. When a similarity peak detecting section 8 detects a peak of the similarity of a recognition target exceeding TH, a hold time determining section 9 determines the hold time until a result is output, based on the average value held in section 7 and the peak value detected by section 8. A result outputting section 10 then outputs the recognition target that gave the peak value as the recognition result if no recognition target gives a similarity exceeding the peak value before the hold time runs out.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION: The present invention relates to a voice recognition device and a voice recognition method.

[0002]

PRIOR ART: Japanese Patent Application Laid-Open No. Sho 62-111293 (hereinafter, prior art 1), for example, discloses a technique in which a sufficiently long section containing the voice to be recognized and the noise before and after it is taken as the input signal section, word spotting is performed within this section, and the recognition target with the highest similarity is output as the recognition result. Voice uttered in a noisy environment is thereby extracted from the sufficiently long input, which contains the target voice and the surrounding noise, and recognized without detecting the voice section.

[0003] In prior art 1, recognition is performed within a sufficiently long input voice section. Even if noise or the like raises the similarity of some recognition target, the similarity of the target that matches the intended voice rises above it. Outputting the recognition target with the maximum similarity in the input voice section therefore avoids outputting a misrecognition caused by noise.

[0004] FIG. 9 shows the change over time in the similarity of recognition targets a, b, and c processed according to prior art 1. In the example of FIG. 9, targets b and c have higher similarities than target a in the non-voice section, but once voice is input (in the voice section), the similarity of target a, the correct answer, becomes larger. As a result, target a, which attains the highest similarity in the input voice section, is output as the recognition result, and targets b and c, which had higher similarities than a in the non-voice section, are ignored.

[0005] Thus, according to prior art 1, a sufficiently long section containing the voice to be recognized and the noise before and after it is taken as the input signal section, word spotting is performed within this section, and the recognition target with the highest similarity is output as the recognition result. This avoids outputting a misrecognition result caused by noise.

[0006] However, in prior art 1 the recognition result is not output until the sufficiently long input voice section has ended. In an actual device this means a long reaction time: it takes a long while from uttering the target voice until the recognition result is output, which is impractical.

[0007] A recognition result can instead be output with a short reaction time by detecting the peak of the change in similarity without waiting for the input voice section to end, and outputting the recognition target whose peak was detected if no other target's similarity exceeds that peak within a fixed time (t₁ in FIG. 9) after the peak is detected. In this case, however, misrecognition results are output in the non-voice section (targets b and c are output as recognition results there). A reject threshold must therefore be set, and recognition results whose similarity is at or below the threshold must be rejected, to prevent misrecognition results from being output.
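The fixed-hold scheme of [0007] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `spot_with_fixed_hold`, the frame representation (one dict of target label to similarity per frame), and the simplification of treating any new running maximum above the threshold as the detected peak are all assumptions.

```python
from typing import Dict, List, Optional


def spot_with_fixed_hold(frames: List[Dict[str, float]],
                         reject_threshold: float,
                         hold_frames: int) -> Optional[str]:
    """Return the first target whose peak survives a fixed hold time.

    Assumption: a "peak" is approximated by the running maximum
    similarity above the reject threshold.
    """
    best_label: Optional[str] = None   # target that gave the current peak
    best_score = float("-inf")         # similarity at that peak
    waited = 0                         # frames elapsed since the peak
    for scores in frames:
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if score > reject_threshold and score > best_score:
            # A higher similarity appeared: restart the wait.
            best_label, best_score, waited = label, score, 0
        elif best_label is not None:
            waited += 1
            if waited >= hold_frames:
                return best_label
    return None
```

With a threshold that is too low for the current noise, a noise-driven target is output before the voice arrives; with a suitable threshold the correct target is output. This is the sensitivity to the environment that the following paragraphs discuss.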

[0008] However, when the environment changes, the similarity of the recognition targets that are misrecognized in the non-voice section also changes, so a single reject threshold cannot cope with every change in the environment.

[0009] To address this problem, Japanese Patent Publication No. Sho 60-60080 (hereinafter, prior art 2) and Japanese Patent Application Laid-Open No. Hei 1-321499 (hereinafter, prior art 3), for example, disclose techniques that vary the threshold used for reject processing according to the environment so that rejection works effectively under noise. Prior art 2 varies the reject threshold according to the ambient noise, and prior art 3 varies the threshold according to the S/N of the input voice.

[0010]

PROBLEMS TO BE SOLVED BY THE INVENTION: Because prior art 2 and prior art 3 can vary the reject threshold as the environment changes, an optimum reject threshold can be set. However, prior art 2 must detect the noise level, and prior art 3 must detect both the noise level and the voice level in order to detect the S/N. Detecting the noise level requires detecting the non-voice section, and the non-voice section is detected using power information. When the noise grows and the power difference between noise and voice becomes small, the non-voice section can no longer be detected, and the noise level cannot be detected either. Prior art 2 and prior art 3 therefore cannot set the reject threshold to suit the environment under high noise.

[0011] An object of the present invention is to provide a voice recognition device and a voice recognition method that can easily detect changes in the environment, such as the noise level, even under high noise, that thereby realize easy-to-use voice recognition which follows changes in the environment and obtains correct recognition results even under high noise, and that can output recognition results with a shorter reaction time than before.

[0012]

MEANS FOR SOLVING THE PROBLEMS: To achieve the above object, the invention of claim 1 comprises: feature extracting means for extracting voice feature data from a voice input signal; similarity calculating means for calculating similarities by comparing the extracted voice feature data with the standard patterns of all recognition targets; reject threshold setting means for setting a predetermined reject threshold; similarity average value calculating means for calculating the average of the similarities of all recognition targets while none of those similarities exceeds the predetermined reject threshold, and for holding the immediately preceding average when even one of them exceeds the threshold; similarity peak detecting means for detecting the peak value of the similarity of a recognition target that has exceeded the predetermined reject threshold; result output means for outputting a recognition result; and hold time determining means for determining, when the similarity peak detecting means detects a peak value exceeding the predetermined reject threshold for the similarity of some recognition target, the hold time until the recognition result is output, based at least on the average similarity held by the similarity average value calculating means. The result output means outputs the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time.

[0013] The invention of claim 2 comprises: feature extracting means for extracting voice feature data from a voice input signal; similarity calculating means for calculating similarities by comparing the extracted voice feature data with the standard patterns of all recognition targets; first threshold setting means for setting a threshold for holding the average value; second threshold setting means for setting a reject threshold for result output; similarity average value calculating means for calculating the average of the similarities of all recognition targets while none of those similarities exceeds the average-holding threshold, and for holding the immediately preceding average when even one of them exceeds that threshold; similarity peak detecting means for detecting the peak value of the similarity of a recognition target that has exceeded the average-holding threshold; result output means for outputting a recognition result; and hold time determining means for determining, when the similarity peak detecting means detects a peak value exceeding the average-holding threshold for the similarity of some recognition target, the hold time until the recognition result is output, based at least on the average similarity held by the similarity average value calculating means. The result output means outputs the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time and the similarity of the recognition target that gave the peak value exceeds the reject threshold for result output.

[0014] The invention of claim 3 is the voice recognition device of claim 2, in which the reject threshold for result output is set higher than the threshold for holding the average value.
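The two-threshold arrangement of claims 2 and 3 can be illustrated with a small sketch. The names `Thresholds` and `should_output` are hypothetical, and the ordering check encodes the condition of claim 3 only (claim 2 by itself does not require it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    average_hold: float    # threshold for holding the average value
    output_reject: float   # reject threshold for result output

    def __post_init__(self) -> None:
        # Claim 3: the output reject threshold is set higher than
        # the average-holding threshold.
        if self.output_reject <= self.average_hold:
            raise ValueError("output_reject must exceed average_hold")


def should_output(peak_value: float,
                  surpassed_during_hold: bool,
                  thresholds: Thresholds) -> bool:
    """Output only if nothing surpassed the peak during the hold time
    and the peak itself exceeds the output reject threshold."""
    return (not surpassed_during_hold) and peak_value > thresholds.output_reject
```

Separating the two thresholds lets averaging stop early, at a low threshold, while the decision to output still demands a higher similarity.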

[0015] The invention of claim 4 is the voice recognition device of claim 1 or claim 2, in which, when a recognition target gives a similarity exceeding the peak value during the hold time, the result output means does not output the recognition target that gave the peak value as the recognition result, and, when the similarity peak detecting means detects a new peak value for the similarity of the recognition target that exceeded the peak value, causes the hold time determining means to determine and set a new hold time.

[0016] The invention of claim 5 is the voice recognition device of any one of claims 1, 2, and 4, in which the hold time determining means determines the hold time from the average value held immediately before.

[0017] The invention of claim 6 is the voice recognition device of any one of claims 1, 2, and 4, in which the hold time determining means determines the hold time from the difference between the peak value of the similarity detected by the similarity peak detecting means and the average value held immediately before.

[0018] The invention of claim 7 is the voice recognition device of any one of claims 1, 2, and 4, in which the hold time determining means determines the hold time from the ratio between the peak value of the similarity detected by the similarity peak detecting means and the average value held immediately before.

[0019] The invention of claim 8 is the voice recognition device of any one of claims 1, 2, and 4, in which the similarity average value calculating means calculates the average value as an average taken from the start of operation of the voice recognition device, as a moving average obtained by averaging the per-frame average values within a fixed time window while shifting the window along the time axis, or by low-pass filtering along the time axis.
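The three averaging options named in claim 8 can be sketched as follows. The inputs are assumed to be the per-frame average similarities; the function names are hypothetical, and the first-order recursive filter is only one possible choice of low-pass filter, since the text does not specify the filter.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


def cumulative_average(values: Iterable[float]) -> List[float]:
    """Average of all per-frame values since the start of operation."""
    out: List[float] = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        out.append(total / count)
    return out


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Average over the most recent `window` frames, sliding in time."""
    recent: Deque[float] = deque(maxlen=window)
    out: List[float] = []
    for value in values:
        recent.append(value)
        out.append(sum(recent) / len(recent))
    return out


def low_pass(values: Iterable[float], coeff: float) -> List[float]:
    """First-order recursive low-pass filter (an assumed filter type).

    coeff close to 1.0 gives heavier smoothing.
    """
    out: List[float] = []
    state: Optional[float] = None
    for value in values:
        state = value if state is None else coeff * state + (1.0 - coeff) * value
        out.append(state)
    return out
```

The cumulative average adapts slowest, while the moving average and the low-pass filter forget old frames and so follow a changing environment more closely.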

[0020] The invention of claim 9 extracts voice feature data from a voice input signal and compares it with standard patterns to calculate similarities; calculates the average of the similarities of all recognition targets while none of those similarities exceeds a predetermined reject threshold; holds the immediately preceding average when even one of the similarities exceeds the predetermined reject threshold; determines, when a peak value exceeding the predetermined reject threshold is detected for the similarity of some recognition target, the hold time until the recognition result is output, based at least on the average value held immediately before; and outputs the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time.

[0021] The invention of claim 10 extracts voice feature data from a voice input signal and compares it with standard patterns to calculate similarities; calculates the average of the similarities of all recognition targets while none of those similarities exceeds a threshold for holding the average value; holds the immediately preceding average when even one of the similarities exceeds the average-holding threshold; determines, when a peak value exceeding the average-holding threshold is detected for the similarity of some recognition target, the hold time until the recognition result is output, based at least on the average value held immediately before; and outputs the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time and the similarity of the recognition target that gave the peak value exceeds a reject threshold for result output that is set higher than the average-holding threshold.

[0022] The inventions of claim 1 and claims 4 through 9 extract voice feature data from a voice input signal and compare it with standard patterns to calculate similarities; calculate the average of the similarities of all recognition targets while none of those similarities exceeds a predetermined reject threshold; hold the immediately preceding average when even one of the similarities exceeds the predetermined reject threshold; determine, when a peak value exceeding the predetermined reject threshold is detected for the similarity of some recognition target, the hold time until the recognition result is output, based at least on the average value held immediately before; and output the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time. Consequently, the recognition result can be output with a short reaction time at least in a quiet environment, and easy-to-use voice recognition that follows changes in the environment and obtains correct recognition results even under high noise can be realized.

[0023] The inventions of claims 2 through 8 and claim 10 extract voice feature data from a voice input signal and compare it with standard patterns to calculate similarities; calculate the average of the similarities of all recognition targets while none of those similarities exceeds a threshold for holding the average value; hold the immediately preceding average when even one of the similarities exceeds the average-holding threshold; determine, when a peak value exceeding the average-holding threshold is detected for the similarity of some recognition target, the hold time until the recognition result is output, based at least on the average value held immediately before; and output the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time and the similarity of the recognition target that gave the peak value exceeds a reject threshold for result output that is set higher than the average-holding threshold. In addition to the above effects, recognition processing better suited to the application or the like therefore becomes possible.

[0024]

EMBODIMENTS OF THE INVENTION: Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a configuration example of a voice recognition device according to the present invention. Referring to FIG. 1, the device comprises: an input unit 1 (for example, a microphone) that converts input voice into an electric signal (analog voice signal); an A/D conversion unit 2 that converts the analog voice signal from the input unit 1 into a digital voice signal; a feature extraction unit 3 that converts the digital voice signal into voice feature data frame by frame; a standard pattern storage unit 4 in which the standard patterns of all recognition targets are stored in advance; a similarity calculation unit 5 that compares the voice feature data from the feature extraction unit 3 with the standard pattern of each recognition target stored in the standard pattern storage unit 4 and calculates the similarity to each recognition target; a reject threshold setting unit 6 in which a reject threshold TH is set; a similarity average value calculation unit 7 that calculates the average of the similarities of all recognition targets while none of them exceeds the reject threshold TH, and holds the immediately preceding average once the similarity of even one recognition target exceeds TH; a similarity peak detection unit 8 that detects the peak of the similarity of a recognition target that has exceeded TH; a hold time determination unit 9 that, when the similarity peak detection unit 8 detects the peak of a recognition target exceeding TH, determines the hold time until the recognition target whose peak value was detected is output as the recognition result, based at least on the average similarity held by the similarity average value calculation unit 7 (that is, the average immediately before the similarity of any one recognition target exceeded TH); and a result output unit 10 that outputs, as the recognition result, the recognition target whose peak value exceeding TH was detected, when no recognition target gives a similarity exceeding that peak value during the hold time.

[0025] General voice recognition techniques can be used for the feature extraction unit 3, the standard pattern storage unit 4, and the similarity calculation unit 5.

[0026] That is, the voice recognition device of FIG. 1 also basically uses general voice recognition techniques. For example, a type can be used that detects the start and end points of the voice to define a voice section, performs voice pattern recognition or the like on that section, and outputs as the recognition result the word number or attribute data of the standard pattern (the recognition target) with the highest similarity.

[0027] Alternatively, recognition by a word spotting method, which does not require a voice section, can be used. For example, a word voice recognition method using a duration-controlled state transition model can be used. With this method the similarity reaches its maximum near the end of the voice, so the recognition result can be output by detecting the peak point of the similarity.

[0028] The similarity average value calculation unit 7 can calculate, as the average of the similarities of all recognition targets, the average of the recognition-target similarities computed for each frame, for example by the following equation.

[0029]

(Equation 1)
$$ \text{average similarity} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Sim}(i) $$

[0030] Here, Sim(i) is the similarity of recognition target i in a given frame, and n is the number of recognition targets.
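The behavior described for the similarity average value calculation unit 7 (average while every similarity stays at or below TH, hold the last average otherwise) might be sketched as below. The class name `SimilarityAverager` is hypothetical, and resuming the update as soon as all similarities fall back to TH or below is an assumption made for this sketch.

```python
from typing import Optional, Sequence


class SimilarityAverager:
    """Tracks the per-frame mean similarity, frozen while any target exceeds TH."""

    def __init__(self, reject_threshold: float) -> None:
        self.reject_threshold = reject_threshold
        self.held: Optional[float] = None  # last average computed below TH

    def update(self, sims: Sequence[float]) -> Optional[float]:
        # Update only when no recognition target exceeds the threshold;
        # otherwise keep (hold) the value from the last quiet frame.
        if max(sims) <= self.reject_threshold:
            self.held = sum(sims) / len(sims)
        return self.held
```

For example, with TH = 0.5, the frame `[0.125, 0.375]` yields 0.25, and a following frame `[0.9, 0.1]` leaves the held value at 0.25.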

[0031] The hold time determination unit 9 determines the hold time, for example, as follows.

[0032] The larger the average value held by the similarity average value calculation unit 7, the longer the hold time is made. Alternatively, the smaller the difference between the similarity peak value detected by the similarity peak detection unit 8 and the average similarity held by the similarity average value calculation unit 7, that is, the smaller (similarity peak value − held average value), the longer the hold time is made. Alternatively, the smaller the ratio of the similarity peak value detected by the similarity peak detection unit 8 to the average similarity held by the similarity average value calculation unit 7, that is, the smaller (similarity peak value / held average value), the longer the hold time is made.

[0033] That is, the hold time t_n is obtained based at least on the average similarity held in the similarity average value calculation unit 7 at the time the peak value is obtained. The hold time t_n is set longer when the difference between the average value and the peak value is small, and/or when the ratio of the peak value to the average value is small, and/or when the average similarity is large. For example, the hold time t_n is set by the following equation.

[0034]

(Equation 2)
$$ t_n = \alpha_1 \cdot \frac{1}{P - A} + \alpha_2 \cdot \frac{A}{P} + \alpha_3 \cdot A $$

[0035] Here, P is the peak value of the similarity, A is the held average value of the similarity, and α₁, α₂, and α₃ are coefficients obtained experimentally.
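Equation 2 translates directly into code. The sketch below assumes positive similarities and P > A (which holds when the peak exceeds TH and the average was computed while all similarities were at or below TH); the function name `hold_time` and any coefficient values passed to it are hypothetical, since the text only says the coefficients are found experimentally.

```python
def hold_time(peak: float, average: float,
              alpha1: float, alpha2: float, alpha3: float) -> float:
    """Hold time t_n per Equation 2: a1/(P - A) + a2*(A/P) + a3*A."""
    if peak <= average:
        raise ValueError("peak must exceed the held average")
    return (alpha1 / (peak - average)
            + alpha2 * (average / peak)
            + alpha3 * average)
```

All three terms grow as the held average A approaches the peak P, so a noisy background (large A) produces a longer hold time than a quiet one for the same peak.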

[0036] In this case, the hold time determined by the hold time determination unit 9 can be implemented using a table obtained by experiment. As an example, a hold time of 100 to 200 ms in a quiet environment and about 1 second under high noise is suitable for general applications.
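A table-driven variant could look like the following sketch. The breakpoints and intermediate values are invented for illustration; only the end points (100 to 200 ms when quiet, about 1 second under high noise) come from the text, and indexing the table by the held average alone is an assumption.

```python
from bisect import bisect_right

# Hypothetical table: upper bounds of held-average ranges and the
# hold time (in ms) assigned to each range.
AVERAGE_BREAKPOINTS = [0.2, 0.4, 0.6]
HOLD_TIMES_MS = [100, 200, 500, 1000]


def hold_time_from_table(average: float) -> int:
    """Look up the hold time for a held average similarity."""
    return HOLD_TIMES_MS[bisect_right(AVERAGE_BREAKPOINTS, average)]
```

A lookup table avoids evaluating Equation 2 at run time and lets the measured values be stored directly.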

[0037] After a peak value is detected, if no recognition target attains a similarity exceeding this peak value before the hold time t_n determined by the hold time determination unit 9 elapses, the result output unit 10 outputs the recognition target that gave the peak value as the recognition result. If, on the other hand, a recognition target gives a similarity exceeding this peak value before the hold time t_n elapses, the recognition target that gave the peak value is not output as the recognition result (the result is invalidated); a new hold time is set for the similarity of the recognition target that exceeded the peak value, and a new hold operation is performed.

[0038] Next, the operation of the speech recognition device of FIG. 1 is described. Speech entering the input unit 1 is converted there into an electric signal (analog speech signal) and then into digital values by the A/D conversion unit 2. The feature extraction unit 3 converts the digitized speech data into speech feature data frame by frame. Typical speech feature data are a TSP, which indicates power values in frequency bands, and cepstrum values.

[0039] The similarity calculation unit 5 compares the speech feature data obtained by the feature extraction unit 3 with the standard pattern of each recognition target stored in advance in the standard pattern storage unit 4, and calculates the similarity of each recognition target. The similarity of each recognition target is updated every frame period and changes over time. When the similarities of all recognition targets are below the reject threshold TH, the similarity average value calculation unit 7 calculates the average of the similarities of all recognition targets. This average value is updated every frame period.

[0040] When the similarity of even one recognition target exceeds the reject threshold TH, whether because speech has been input or because of noise, the similarity average value calculation unit 7 holds the average similarity from immediately before TH was exceeded and stops the per-frame update. The similarity peak detection unit 8 detects the peak value of the similarity of the recognition target that exceeded TH. When the similarity peak detection unit 8 detects a peak value, the hold time determination unit 9 determines the hold time t_n until result output, for example using Equation 2, based on the average value held in the similarity average value calculation unit 7 and the peak value detected by the similarity peak detection unit 8.
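The update-or-hold behavior of the similarity average value calculation unit can be sketched as below. The class and method names are illustrative, not taken from the source, and treating a similarity exactly equal to TH as "not exceeding" is an assumption.

```python
class SimilarityAverager:
    """Tracks the per-frame average similarity and freezes it above threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.held = None  # no average is available before the first quiet frame

    def update(self, similarities):
        """Feed one frame of similarities; return the current (possibly held) average."""
        if all(s <= self.threshold for s in similarities):
            # Every target is at or below TH: refresh the average.
            self.held = sum(similarities) / len(similarities)
        # Otherwise keep the value from immediately before TH was exceeded.
        return self.held
```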

[0041] After a peak value is detected, the result output unit 10 outputs the recognition target that gave the peak value as the recognition result if no recognition target yields a similarity exceeding this peak value before the hold time t_n determined by the hold time determination unit 9 elapses. If, on the other hand, a recognition target yields a similarity exceeding this peak value before the hold time t_n elapses, the recognition target that gave the peak value is not output as the recognition result (the result is invalidated); a new hold time is set for the similarity of the recognition target that exceeded the peak value, and a new hold operation is performed.
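The hold-and-invalidate logic can be sketched as a small loop over frames. This is a simplification under stated assumptions: the running maximum above TH stands in for true peak detection, the hold timer starts at the frame where that maximum was observed, the held average starts at 0.0 before any quiet frame is seen, and `hold_ms_fn` is a caller-supplied stand-in for the hold time determination unit. None of these names come from the source.

```python
def recognize(frames, threshold, frame_ms, hold_ms_fn):
    """Return (target, output_time_ms) for the first result output, or None.

    frames     -- list of dicts mapping target name -> similarity, one per frame
    hold_ms_fn -- function (peak, held_average) -> hold time in ms
    """
    held_avg = 0.0      # assumed starting value before any quiet frame
    candidate = None    # (target, peak value, deadline in ms)
    for i, sims in enumerate(frames):
        now = i * frame_ms
        values = list(sims.values())
        if max(values) <= threshold:
            held_avg = sum(values) / len(values)   # update only while all are quiet
        best = max(sims, key=sims.get)
        if candidate is not None and sims[best] > candidate[1]:
            candidate = None                       # exceeded in time: invalidate
        if candidate is not None and now >= candidate[2]:
            return candidate[0], now               # hold time elapsed: output
        if candidate is None and sims[best] > threshold:
            deadline = now + hold_ms_fn(sims[best], held_avg)
            candidate = (best, sims[best], deadline)
    return None  # input ended before any hold time elapsed
```

A candidate that is overtaken before its deadline is dropped and the overtaking target starts a fresh hold, which mirrors the "new hold operation" described above.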

[0042] Next, a specific operation example is described. FIG. 2 shows how the similarities of recognition targets a, b, and c change in a quiet environment. In the example of FIG. 2, all of the recognition targets a, b, and c keep a low similarity in sections without speech. This is because the speech feature values extracted in a quiet environment differ greatly from those extracted when speech is input, so the pattern distance is large and the similarity is low. FIG. 2 also shows, as the ×−× line, the calculated average of the similarities of all recognition targets.

[0043] Because the similarity of each recognition target changes over time, the similarity is generally calculated on speech feature data extracted from the speech waveform within a unit of time called a frame (set to roughly several ms to several tens of ms). The similarity is therefore updated every frame period, and the average of the similarities of all recognition targets is also updated every frame. The average is calculated only while the similarities of the recognition targets are at or below the reject threshold TH; when even one exceeds TH (in the example of FIG. 2, when speech corresponding to recognition target a is input and the similarity of recognition target a rises within the speech section and exceeds TH), the immediately preceding average value A1 is held. The held average value A1 reflects the state of the surrounding environment. In the example of FIG. 2 the surroundings are quiet and the similarities of all recognition targets are low, so A1 is small. Meanwhile, the similarity of the input speech to recognition target a exceeds TH within the speech section, reaches the peak value P0 near the end of the speech section, and then decreases.

[0044] In this case, for example, when the difference (P0 − A1) between the peak value P0 and the average value A1 held at the time P0 is obtained is large, recognition target a, which gave the peak value P0, stands out clearly against the average similarity A1 of the non-speech section. The recognition result (recognition target a, which gave the peak value P0) is therefore output after a short time (a short hold time) t₂ from the moment P0 is obtained. A value of about 100 ms to 200 ms is suitable for t₂.

[0045] Thus, according to the present invention, a recognition result can be output with a short response time in a quiet environment.

[0046] When the average value of the non-speech section is small, no recognition target exceeds the reject threshold TH in the non-speech section, so no recognition result is output outside the speech section. Also, when the difference between the peak similarity of the correct recognition target and the average value of the non-speech section is large, the reject threshold TH can easily be set experimentally.

[0047] As noise grows louder, however, some recognition targets show high similarity even in noise sections (non-speech sections). In speech sections, too, the noise added to the speech distorts the extracted speech feature values, so the peak similarity of the correct recognition target falls.

[0048] In the conventional method, as shown in FIG. 9, the erroneous recognition results b and c are rejected by setting the reject threshold TH between the noise-section peak values P1, P3 and the peak value P0 that gives the maximum similarity. When the noise grows still louder, however, the difference between the noise-section peaks P1, P3 and the peak P0 shrinks, and the reject threshold TH can no longer be set.

[0049] FIG. 3 shows this situation. In the example of FIG. 3 the noise has grown and the noise-section peak values P1 and P3 exceed the reject threshold TH, so the conventional method cannot reject the erroneous recognition results at P1 and P3 by means of the reject threshold TH.

[0050] In the present invention, by contrast, in a case such as FIG. 3, as shown in FIG. 4, the difference (P1 − A2) between the peak value P1 and the average value A2 held at the time the similarity of recognition target b gives the peak value P1 is small, so the time until result output (hold time) t₃ is set long. As a result, the similarity of the correct recognition target a exceeds the peak value P1 before the erroneous recognition result is output; the result for P1 is invalidated, and the erroneous recognition result b is prevented from being output.

[0051] When the peak value P0 of the similarity of the correct recognition target a is then obtained, the average value A3 held at that time is larger than, for example, the average value A1 of FIG. 2 (when noise is present, the speech feature data of the noise resembles that of speech, so the pattern distance is smaller and the similarity higher). The difference (P0 − A3) between the peak value P0 and the average value A3 is therefore smaller than the difference (P0 − A1) in FIG. 2, and the time t₄ until result output is set longer than t₂ of FIG. 2.

[0052] In the example of FIG. 4, the similarity of recognition target c reaches the peak value P3 before the hold time t₄ for outputting recognition target a (which gave the peak value P0) elapses. Because P3 is lower than P0, however, the recognition result based on P3, that is, the result for recognition target c, is rejected.

[0053] Thus, as the example of FIG. 4 shows, according to the present invention the correct recognition result, recognition target a, is output once the hold time t₄ has elapsed after the peak value P0 is obtained, and the erroneous recognition results for recognition targets b and c in the non-speech (noise) sections are rejected. That is, the present invention yields correct recognition results even under high noise.

[0054] In the example above, the average of the recognition targets' similarities calculated for each frame was used as the average similarity, but various other quantities can be used.

[0055] For example, when the average value of the frame immediately before a recognition target with similarity exceeding the reject threshold TH appears is used to determine the hold time, that single frame's average may not represent the surrounding environment. If a burst of noise occurred just before, for instance, the held average alone may be large. Burst noise can therefore be handled by filtering the per-frame average values along the time axis.

[0056] FIG. 5 shows how the per-frame average values since the start of operation of the speech recognition device are averaged in the time direction. Periods in which the similarity of a recognition target exceeds the reject threshold TH are excluded. That is, in the example of FIG. 5, as indicated by the solid arrows, the per-frame averages since the start of operation, excluding the periods in which a similarity exceeded TH, are averaged along the time axis to give the latest average value. This makes it possible to use an average value that represents the surrounding environment.
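One way to sketch the averaging of FIG. 5 is a running mean of per-frame averages that skips any frame in which a similarity exceeds TH. The class name and the choice to return `None` before the first usable frame are assumptions for illustration.

```python
class CumulativeAverage:
    """Mean of per-frame averages since start, skipping over-threshold frames."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.total = 0.0
        self.count = 0

    def update(self, similarities):
        """Feed one frame; return the latest cumulative average (None if no data)."""
        if max(similarities) <= self.threshold:
            # Only frames where every target stays at or below TH contribute.
            self.total += sum(similarities) / len(similarities)
            self.count += 1
        return self.total / self.count if self.count else None
```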

[0057] FIG. 6 shows how the per-frame average values within a fixed time are averaged in the time direction, that is, a moving average. In the example of FIG. 6, the average is taken over the section indicated by the solid arrow, giving a moving average along the time axis. This makes it possible to use an average value that follows changes in the surrounding environment.
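A moving average over a fixed number of recent frames could be sketched as follows. The window length is a free parameter that the source does not specify, and the caller is assumed to pass in only the per-frame averages it wants counted.

```python
from collections import deque

class MovingAverage:
    """Moving average of the most recent `window` per-frame averages."""

    def __init__(self, window):
        self.values = deque(maxlen=window)  # oldest entries drop out automatically

    def update(self, frame_average):
        """Add one per-frame average and return the mean over the window."""
        self.values.append(frame_average)
        return sum(self.values) / len(self.values)
```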

[0058] As filtering along the time axis, the time average may also be taken, for example, by the following equation.

【００５９】[0059]

[Equation 3] $\text{time average} = \dfrac{\text{average value of current frame} + \text{average value of previous frame}}{2}$

[0060] Taking the time average by Equation 3 gives a simple way to obtain an average value that follows the surrounding environment.
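Read literally, Equation 3 averages each frame's value with the one before it. The sketch below applies that to a sequence of per-frame averages; passing the first frame through unchanged is an assumption, since the source does not say how the first frame is handled.

```python
def two_frame_average(frame_averages):
    """Apply Equation 3 along a sequence of per-frame averages."""
    out = []
    prev = None
    for cur in frame_averages:
        # The first frame has no predecessor, so it is passed through unchanged.
        out.append(cur if prev is None else (cur + prev) / 2)
        prev = cur
    return out
```

A single-frame burst is halved and spread over two frames rather than appearing at full height in the held average.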

[0061] In the configuration example above, the same threshold TH serves both as the reject threshold for starting similarity peak detection (that is, the result-output reject threshold used to avoid erroneous recognition) and as the average-holding reject threshold for starting the average hold. Depending on the application, however, the result-output reject threshold may be set high to avoid erroneous recognition. If the average-holding threshold is then the same as the result-output threshold, the average keeps being calculated into the section where speech has been input and the similarity is rising, so the held average becomes high.

[0062] FIG. 7 shows this situation. When both the result-output reject threshold and the average-holding reject threshold are set to the low value TH1, the held average can be kept low, as indicated by A5; but because the result-output reject threshold is at the low value TH1, the rate of erroneous recognition may rise in some applications.

[0063] When, on the other hand, both thresholds in FIG. 7 are set to the high value TH2, the rate of erroneous recognition can be reduced, but the held average, indicated by A4, becomes higher than A5. That is, with the large reject threshold TH2 the average is held at a point where the similarity of the correct recognition target corresponding to the speech is already large (A4 in the figure), so the average over all recognition targets is raised by the target corresponding to the speech, which can interfere with proper recognition processing.

[0064] To avoid this problem when an application requires a high result-output reject threshold, the configuration of FIG. 1 can be modified as shown in FIG. 8.

[0065] In the configuration example of FIG. 8, the reject threshold setting unit 6 of FIG. 1 is replaced by a first threshold setting unit 21, in which the average-holding threshold TH1 is set, and a second threshold setting unit 22, in which the result-output reject threshold TH2 is set.

[0066] In this configuration, the average-holding threshold TH1 is set in the first threshold setting unit 21, and the result-output reject threshold TH2 is set in the second threshold setting unit 22 separately from and independently of TH1. The similarity average value calculation unit 7 performs the similarity averaging process using the threshold TH1 set by the first threshold setting unit 21 (that is, it holds the current average when even one similarity exceeds TH1). The similarity peak detection unit 8 detects the peak of the similarity of a recognition target that exceeded TH1. When the similarity peak detection unit 8 detects such a peak, the hold time determination unit 9 determines the hold time until the recognition target for which the peak was detected is output as the recognition result, based at least on the average similarity held in the similarity average value calculation unit 7 (the average from immediately before the similarity of even one recognition target exceeded TH1).

[0067] The result output unit 10 performs result output processing using the result-output reject threshold TH2 set by the second threshold setting unit 22: it accepts recognition results whose similarity exceeds TH2 and rejects those whose similarity is at or below TH2.

[0068] Accordingly, when an application requires a high result-output reject threshold, TH2 can be set to a high value as in FIG. 7 while the average-holding threshold TH1 is set lower than TH2, also as in FIG. 7. Setting TH2 high reduces the rate of erroneous recognition, and setting TH1 low keeps the held average at a low value such as A5 in FIG. 7 even when TH2 is high, so proper recognition processing can be performed.
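The effect of separating the two thresholds can be illustrated with a short sketch. Both function names and the sample similarity values are made up for illustration; the point is only that a low TH1 freezes the average before speech has pulled it up, while the separate TH2 decides whether a peak is accepted.

```python
def average_before_rise(frames, hold_threshold):
    """Average of the last frame before any similarity exceeds hold_threshold."""
    held = None
    for sims in frames:
        if max(sims) > hold_threshold:
            break                      # hold the average from the previous frame
        held = sum(sims) / len(sims)
    return held

def accept(peak, output_threshold):
    """Result-output check: only peaks above the output threshold are accepted."""
    return peak > output_threshold

# Illustrative rising similarity for one target while the other stays low.
rising = [[0.1, 0.1], [0.3, 0.1], [0.6, 0.1], [0.9, 0.1]]
low_th1 = average_before_rise(rising, 0.2)   # frozen before the rise starts
high_th = average_before_rise(rising, 0.7)   # frozen late, already inflated
```

Here `low_th1` stays at the quiet-frame level while `high_th` is noticeably larger, which corresponds to the contrast between A5 and A4 described for FIG. 7.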

【００６９】[0069]

[Effects of the Invention] As described above, according to the invention of claim 1 and claims 4 to 9, speech feature data is extracted from the speech input signal and compared with standard patterns to calculate similarities; when the similarities of all recognition targets do not exceed a predetermined reject threshold, the average of the similarities of all recognition targets is calculated; when even one of the similarities exceeds the predetermined reject threshold, the immediately preceding average is held; when a peak value exceeding the predetermined reject threshold is detected for the similarity of some recognition target, the hold time until the recognition result is output is determined based at least on the held average; and when no recognition target gives a similarity exceeding the peak value during the hold time, the recognition target that gave the peak value is output as the recognition result. At least in a quiet environment, the recognition result can therefore be output with a short response time, and correct recognition results can be obtained even under high noise, realizing easy-to-use speech recognition that follows changes in the environment.

[0070] According to the invention of claims 2 to 8 and claim 10, speech feature data is extracted from the speech input signal and compared with standard patterns to calculate similarities; when the similarities of all recognition targets do not exceed the average-holding threshold, the average of the similarities of all recognition targets is calculated; when even one of the similarities exceeds the average-holding threshold, the immediately preceding average is held; when a peak value exceeding the average-holding threshold is detected for the similarity of some recognition target, the hold time until the recognition result is output is determined based at least on the held average; and the recognition target that gave the peak value is output as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time and the similarity of the recognition target that gave the peak value exceeds the result-output reject threshold, which is set higher than the average-holding threshold. In addition to the effects above, recognition processing can thus be tailored more closely to the application.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing a configuration example of a speech recognition device according to the present invention.

FIG. 2 is a diagram showing changes in the similarity of recognition targets a, b, and c in a quiet environment.

FIG. 3 is a diagram showing changes in the similarity of recognition targets a, b, and c in a noisy environment.

FIG. 4 is a diagram showing the present invention applied to the changes in the similarity of recognition targets a, b, and c in a noisy environment such as that of FIG. 3.

FIG. 5 is a diagram showing how the per-frame average values since the start of operation of the speech recognition device are averaged in the time direction.

FIG. 6 is a diagram showing how the per-frame average values within a fixed time are averaged in the time direction, that is, a moving average.

FIG. 7 is a diagram for explaining the problem that arises when the result-output reject threshold and the average-holding reject threshold are set to the same value.

FIG. 8 is a diagram showing a modification of the speech recognition device of FIG. 1.

FIG. 9 is a diagram for explaining a conventional speech recognition method.

[Explanation of Symbols]

1 Input unit
2 A/D conversion unit
3 Feature extraction unit
4 Standard pattern storage unit
5 Similarity calculation unit
6 Reject threshold setting unit
7 Similarity average value calculation unit
8 Similarity peak detection unit
9 Hold time determination unit
10 Result output unit
21 First reject threshold setting unit
22 Second reject threshold setting unit

Claims

【特許請求の範囲】[Claims]

1. A speech recognition device comprising: feature extraction means for extracting speech feature data from a speech input signal; similarity calculation means for calculating similarities by comparing the extracted speech feature data with the standard patterns of all recognition targets; reject threshold setting means for setting a predetermined reject threshold; similarity average value calculation means for calculating the average of the similarities of all recognition targets when the similarities of all recognition targets do not exceed the predetermined reject threshold, and for holding the immediately preceding average when even one of the similarities exceeds the predetermined reject threshold; similarity peak detection means for detecting the peak value of the similarity of a recognition target that has exceeded the predetermined reject threshold; result output means for outputting a recognition result; and hold time determination means for determining, when the similarity peak detection means detects a peak value exceeding the predetermined reject threshold for the similarity of a recognition target, a hold time until the recognition result is output, based at least on the average similarity held by the similarity average value calculation means, wherein the result output means outputs the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time.

2. A speech recognition device comprising: feature extraction means for extracting speech feature data from a speech input signal; similarity calculation means for calculating similarities by comparing the extracted speech feature data with the standard patterns of all recognition targets; first threshold setting means for setting an average-holding threshold; second threshold setting means for setting a result-output reject threshold; similarity average value calculation means for calculating the average of the similarities of all recognition targets when the similarities of all recognition targets do not exceed the average-holding threshold, and for holding the immediately preceding average when even one of the similarities exceeds the average-holding threshold; similarity peak detection means for detecting the peak value of the similarity of a recognition target that has exceeded the average-holding threshold; result output means for outputting a recognition result; and hold time determination means for determining, when the similarity peak detection means detects a peak value exceeding the average-holding threshold for the similarity of a recognition target, a hold time until the recognition result is output, based at least on the average similarity held by the similarity average value calculation means, wherein the result output means outputs the recognition target that gave the peak value as the recognition result when no recognition target gives a similarity exceeding the peak value during the hold time and the similarity of the recognition target that gave the peak value exceeds the result-output reject threshold.

【請求項３】請求項２記載の音声認識装置において、
前記結果出力用のリジェクト閾値は、平均値保持用の閾
値よりも高く設定されることを特徴とする音声認識装
置。3. The speech recognition apparatus according to claim 2, wherein the reject threshold for result output is set higher than the threshold for holding the average value.

【請求項４】請求項１または請求項２記載の音声認識
装置において、前記結果出力手段は、前記保留時間中に
前記ピーク値を越える類似度を与える認識対象がある場
合は、前記ピーク値を与えた認識対象を認識結果として
出力せず、前記ピーク値を越えた認識対象の類似度につ
いて新たにピーク値が前記類似度ピーク検出手段により
検出されるとき、前記保留時間決定部に新たな保留時間
を決定させ設定させることを特徴とする音声認識装置。4. The speech recognition apparatus according to claim 1 or claim 2, wherein, when a recognition target gives a similarity exceeding the peak value during the holding time, the result output means does not output the recognition target that gave the peak value as the recognition result, and, when a new peak value is detected by the similarity peak detecting means for the similarity of the recognition target that exceeded the peak value, causes the holding time determining unit to determine and set a new holding time.

【請求項５】請求項１，請求項２，請求項４のいずれ
か一項に記載の音声認識装置において、前記保留時間決
定手段は、前記直前に保持された平均値から保留時間を
決定することを特徴とする音声認識装置。5. The speech recognition apparatus according to any one of claims 1, 2, and 4, wherein the holding time determining means determines the holding time from the average value held immediately before.

【請求項６】請求項１，請求項２，請求項４のいずれ
か一項に記載の音声認識装置において、前記保留時間決
定手段は、前記類似度ピーク検出手段により検出された
類似度のピーク値と前記直前に保持された平均値との差
から保留時間を決定することを特徴とする音声認識装
置。6. The speech recognition apparatus according to any one of claims 1, 2, and 4, wherein the holding time determining means determines the holding time from the difference between the peak value of the similarity detected by the similarity peak detecting means and the average value held immediately before.

【請求項７】請求項１，請求項２，請求項４のいずれ
か一項に記載の音声認識装置において、前記保留時間決
定手段は、前記類似度ピーク検出手段により検出された
類似度のピーク値と前記直前に保持された平均値との比
から保留時間を決定することを特徴とする音声認識装
置。7. The speech recognition apparatus according to any one of claims 1, 2, and 4, wherein the holding time determining means determines the holding time from the ratio of the peak value of the similarity detected by the similarity peak detecting means to the average value held immediately before.
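
Claims 5 to 7 can be read as three alternative mappings from the held average (and, in claims 6 and 7, the detected peak) to a holding time. The claims name the inputs but not the direction or shape of the mapping, so the following Python sketch is illustrative only. It assumes positive similarities, a longer hold when the held average is high (a noisier background), and a shorter hold when the peak stands well clear of that average. The function names, the linear forms, and the `min_hold`, `max_hold`, and `gain` parameters are all hypothetical.

```python
def _clamp(value, low, high):
    """Keep a holding time inside [low, high] seconds."""
    return max(low, min(high, value))


def hold_from_average(avg, min_hold=0.1, max_hold=1.0, gain=2.0):
    """Claim 5 style: holding time from the held average alone.

    Assumption: a higher background average leads to a longer hold.
    """
    return _clamp(gain * avg, min_hold, max_hold)


def hold_from_difference(peak, avg, min_hold=0.1, max_hold=1.0, gain=1.0):
    """Claim 6 style: holding time from (peak - held average).

    Assumption: a larger margin over the background leads to a shorter hold.
    """
    return _clamp(max_hold - gain * (peak - avg), min_hold, max_hold)


def hold_from_ratio(peak, avg, min_hold=0.1, max_hold=1.0, gain=1.0, eps=1e-6):
    """Claim 7 style: holding time from (peak / held average).

    Assumption: a larger ratio leads to a shorter hold. eps guards
    against division by a zero average.
    """
    ratio = peak / max(avg, eps)
    if ratio <= 0:
        return max_hold
    return _clamp(gain / ratio, min_hold, max_hold)
```

Under these assumptions a clear peak over a quiet background is released quickly, while a marginal peak over a noisy background is held longer so that a better candidate can still displace it.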

【請求項８】請求項１，請求項２，請求項４のいずれ
か一項に記載の音声認識装置において、前記類似度平均
値計算手段は、前記平均値を、該音声認識装置の動作開
始時からの平均値、または、一定時間内のフレーム当た
りの平均値を時間方向に移動させながら平均をとった移
動平均、または、時間方向のローパスフィルタリングと
して算出することを特徴とする音声認識装置。8. The speech recognition apparatus according to any one of claims 1, 2, and 4, wherein the similarity average value calculating means calculates the average value as an average taken from the start of operation of the speech recognition apparatus, as a moving average obtained by averaging the per-frame average values within a fixed time while shifting in the time direction, or by low-pass filtering in the time direction.
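
The three averaging options named in claim 8 can be sketched as small Python classes. Each `update` call takes one per-frame value (for example, the mean similarity over all recognition targets in that frame) and returns the current average. The class names, the window length, and the first-order filter with coefficient `alpha` are assumptions made for illustration; the claim does not fix a filter order or a window size.

```python
from collections import deque


class CumulativeAverage:
    """Average of every frame value since operation started."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, x):
        self.total += x
        self.count += 1
        return self.total / self.count


class MovingAverage:
    """Average over the most recent `window` frame values."""

    def __init__(self, window):
        # A bounded deque drops the oldest value automatically.
        self.buf = deque(maxlen=window)

    def update(self, x):
        self.buf.append(x)
        return sum(self.buf) / len(self.buf)


class LowPassAverage:
    """First-order low-pass filter (exponential smoothing) over time."""

    def __init__(self, alpha):
        self.alpha = alpha  # 0 < alpha <= 1; smaller means smoother
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x  # initialise on the first frame
        else:
            self.value += self.alpha * (x - self.value)
        return self.value
```

The cumulative form adapts slowly once many frames have accumulated, whereas the moving average and the low-pass filter follow changes in background noise more closely.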

【請求項９】音声入力信号から音声特徴データを抽出
し標準パターンと比較して類似度を計算し、全ての認識
対象の類似度が所定のリジェクト閾値を越えない場合は
全ての認識対象の類似度の平均値を計算し、全ての認識
対象の類似度のうち１つでも所定のリジェクト閾値を越
える場合は、その直前の平均値を保持し、ある認識対象
の類似度について所定のリジェクト閾値を越えたピーク
値が検出された場合、少なくとも、前記直前に保持され
た平均値に基づいて認識結果を出力するまでの保留時間
を決定し、保留時間中に前記ピーク値を越える類似度を
与える認識対象が無い場合は、前記ピーク値を与えた認
識対象を認識結果として出力することを特徴とする音声
認識方法。9. A speech recognition method comprising: extracting speech feature data from a speech input signal and comparing it with standard patterns to calculate similarities; when the similarities of all recognition targets do not exceed a predetermined reject threshold, calculating the average value of the similarities of all recognition targets; when even one of the similarities of all recognition targets exceeds the predetermined reject threshold, holding the average value immediately before that; when a peak value exceeding the predetermined reject threshold is detected for the similarity of a certain recognition target, determining a holding time until a recognition result is output, based at least on the average value held immediately before; and, when no recognition target gives a similarity exceeding the peak value during the holding time, outputting the recognition target that gave the peak value as the recognition result.

【請求項１０】音声入力信号から音声特徴データを抽
出し標準パターンと比較して類似度を計算し、全ての認
識対象の類似度が平均値保持用の閾値を越えない場合は
全ての認識対象の類似度の平均値を計算し、全ての認識
対象の類似度のうち１つでも平均値保持用の閾値を越え
る場合は、その直前の平均値を保持し、ある認識対象の
類似度について平均値保持用の閾値を越えたピーク値が
検出された場合、少なくとも、前記直前に保持された平
均値に基づいて認識結果を出力するまでの保留時間を決
定し、保留時間中に前記ピーク値を越える類似度を与え
る認識対象が無く、かつ、前記ピーク値を与えた認識対
象の類似度が平均値保持用の閾値よりも高く設定されて
いる結果出力用のリジェクト閾値を越えた場合には、前
記ピーク値を与えた認識対象を認識結果として出力する
ことを特徴とする音声認識方法。10. A speech recognition method comprising: extracting speech feature data from a speech input signal and comparing it with standard patterns to calculate similarities; when the similarities of all recognition targets do not exceed a threshold for holding the average value, calculating the average value of the similarities of all recognition targets; when even one of the similarities of all recognition targets exceeds the threshold for holding the average value, holding the average value immediately before that; when a peak value exceeding the threshold for holding the average value is detected for the similarity of a certain recognition target, determining a holding time until a recognition result is output, based at least on the average value held immediately before; and, when no recognition target gives a similarity exceeding the peak value during the holding time and the similarity of the recognition target that gave the peak value exceeds a reject threshold for result output that is set higher than the threshold for holding the average value, outputting the recognition target that gave the peak value as the recognition result.
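
The method of claim 10 can be sketched as a frame-by-frame loop. This Python outline is a minimal illustration under stated assumptions, not the patented implementation. Similarities arrive as one dictionary per frame, a peak is taken to be a local maximum noticed one frame late, the background average is a cumulative mean, and the holding time is counted in frames. The names `recognize`, `th_hold`, `th_reject`, and `hold_frames_fn` are hypothetical.

```python
def recognize(frames, th_hold, th_reject, hold_frames_fn):
    """Return the recognized target, or None if nothing is output.

    frames: iterable of {target: similarity} dicts, one per frame.
    hold_frames_fn(peak, held_avg): holding time in frames (assumed form).
    """
    total, count, held_avg = 0.0, 0, 0.0
    prev, prev2 = {}, {}
    candidate, peak, remaining = None, None, 0

    for sims in frames:
        # A similarity above the pending peak cancels the pending output;
        # a new holding time is set once that similarity peaks in turn.
        if candidate is not None and any(s > peak for s in sims.values()):
            candidate, peak = None, None

        # Count down the holding time and output or reject at expiry.
        if candidate is not None:
            remaining -= 1
            if remaining <= 0:
                if peak > th_reject:
                    return candidate
                candidate, peak = None, None

        # Peak detection: the previous frame was a local maximum above th_hold.
        for target, s in sims.items():
            p1, p2 = prev.get(target), prev2.get(target)
            is_peak = (p1 is not None and p1 > th_hold and p1 > s
                       and (p2 is None or p1 >= p2))
            if is_peak and (peak is None or p1 > peak):
                candidate, peak = target, p1
                remaining = hold_frames_fn(peak, held_avg)

        # Update the average only while no target exceeds th_hold;
        # otherwise held_avg stays frozen at its last value.
        if all(s <= th_hold for s in sims.values()):
            total += sum(sims.values()) / len(sims)
            count += 1
            held_avg = total / count

        prev2, prev = prev, dict(sims)

    return None
```

Passing the same value for `th_hold` and `th_reject` gives the single-threshold behaviour of claim 9, because any detected peak then already exceeds the reject threshold.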