JP3474949B2 - Voice recognition device

Info

Publication number: JP3474949B2
Application number: JP29172694A
Authority: JP
Inventors: 浩也村尾
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1994-11-25
Filing date: 1994-11-25
Publication date: 2003-12-08
Anticipated expiration: 2018-12-08
Also published as: JPH08146986A

Description

【発明の詳細な説明】【０００１】【産業上の利用分野】この発明は、音声によりデータを
入力するための音声認識装置に関し、たとえば、録画番
組の予約が音声入力によって行われる録画装置等に利用
される音声認識装置に関する。【０００２】【従来の技術】図４は、従来の音声認識装置の構成を示
している。【０００３】音声分析部１０１は、入力音声の音声パワ
ー信号と、入力音声に対する音声スペクトルとを生成す
る。入力音声の音声パワー信号は、音声区間検出部１０
２に送られる。入力音声に対する音声スペクトルは、音
声パターン作成部１０３に送られる。【０００４】音声区間検出部１０２は、音声検出部１１
１および音声区間切出し部１１２とを備えている。音声
検出部１１１は、図５に示すように、音声検出用しきい
値αを用いて、音声パワー信号中の音声部分を検出す
る。【０００５】音声区間切出し部１１２は、図５に示すよ
うに、切出し用しきい値βを用いて、音声認識に有効な
音声区間Ｌを求める。切出し用しきい値βは、音声検出
部１１１によって検出された音声部分より所定時間前の
雑音パワーに基づいて決定される。【０００６】音声パターン作成部１０３は、音声区間切
出し部１１２によって求められた音声区間Ｌに対する音
声スペクトルに基づいて、音声パターンを作成する。作
成された音声パターンは、学習済のニューラルネットワ
ーク１０４に入力される。【０００７】このニューラルネットワーク１０４の学習
は、次のように行なわれる。まず、各認識対象音声に対
する標準音声パターンを、予め収集した音声を用いてそ
れぞれ求める。そして、各標準音声パターンを入力パタ
ーンとし、各入力パターンに対応する音声を表す音声識
別データを教師データとして、ニューラルネットワーク
１０４を学習させる。【０００８】学習済のニューラルネットワーク１０４
に、音声パターンが入力されることにより、入力された
音声パターンに対応する出力パターンが得られる。この
出力パターンは、認識結果判定部１０５に送られる。認
識結果判定部１０５は、送られてきた出力パターンに基
づいて当該音声検出部分の音声を認識し、その認識結果
を出力する。【０００９】【発明が解決しようとする課題】このような音声認識装
置では、音声認識に有効な音声区間を設定するための切
出し用しきい値βは１つであるため、雑音が音声区間に
含まれてしまうことによって誤認識が発生したり、音声
パワーの小さい語尾等が音声区間から脱落してしまうこ
とによって誤認識が発生したりする可能性が高い。図５
の例では、本来「しち」と認識すべきところが、「し」
と誤認識されてしまう。【００１０】この発明は、認識精度の向上が図れる音声
認識装置を提供することを目的とする。【００１１】【課題を解決するための手段】この発明の音声認識装置
は、入力音声の音声検出部分より所定時間前の雑音パワ
ーに基づいて決定される音声区間判定用の複数のパワー
しきい値と入力音声の音声パワーとに基づいて複数の音
声区間を設定する音声区間設定手段、各音声区間の音声
スペクトルに基づいて各音声区間ごとの音声パターンを
それぞれ作成する音声パターン作成手段、および各音声
区間ごとの音声パターンに基づいて入力音声を認識する
ものにおいて、上記各パワーしきい値として、該音声認
識手段が、各認識対象音声に対する標準音声パターンを
入力パターンとし、各入力パターンに対応する音声を表
す音声識別データを教師データとして、学習が行なわれ
たニューラルネットワーク、上記各音声区間ごとの音声
パターンを上記ニューラルネットワークにそれぞれ入力
して、上記各音声区間ごとの音声パターンに対する出力
パターンを求める手段、および求められた全ての出力パ
ターンのうち、教師データとの類似度が最も高い出力パ
ターンに基づいて、入力音声を認識する手段を備えたこ
とを特徴とするものである。【００１２】【００１３】【００１４】【００１５】【００１６】【００１７】【作用】この発明の音声認識装置によれば、入力音声の
音声パワーと、音声区間判定用の複数のパワーしきい値
とに基づいて、複数の音声区間が設定される。各音声区
間の音声スペクトルに基づいて、各音声区間ごとの音声
パターンがそれぞれ作成される。そして、各音声区間ご
との音声パターンに基づいて、入力音声が認識される。
上記各パワーしきい値は、入力音声の音声検出部分より
所定時間前の雑音パワーに基づいて決定される。この場
合の音声区間の特徴としては、たとえば、音声スペクト
ルが挙げられる。この発明の音声認識装置に用いられて
いる音声認識手段としては、各認識対象音声に対する標
準音声パターンを入力パターンとし、各入力パターンに
対応する音声を表す音声識別データを教師データとし
て、学習が行なわれたニューラルネットワーク、上記各
音声区間ごとの音声パターンを上記ニューラルネットワ
ークにそれぞれ入力して、上記各音声区間ごとの音声パ
ターンに対する出力パターンを求める手段、および求め
られた全ての出力パターンのうち、教師データとの類似
度が最も高い出力パターンに基づいて、入力音声を認識
する手段を備えているものが用いられる。【００１８】【実施例】以下、図１〜図４を参照して、この発明の実
施例について説明する。【００１９】図１は、音声認識装置の構成を示してい
る。【００２０】音声認識装置は、音声分析部１、音声区間
検出部２、音声パターン作成部３、ニューラルネットワ
ーク演算部４、認識結果記憶部５および認識結果判定部
６を備えている。音声区間検出部２は、音声検出部２
１、音声区間切出し部２２および切出し位置記憶部２３
を備えている。【００２１】図２は、ニューラルネットワーク演算部４
に設けられているニューラルネットワークの構造の一例
を示している。【００２２】このニューラルネットワークは、入力層４
１、中間層４２および出力層４３からなる。入力層４１
は、たとえば、１２８個（１６channel ×８frame ) の
入力ユニットから構成されている。中間層４２は、入力
層４１の各入力ユニットと相互に結合された、たとえ
ば、５０個の中間ユニットから構成されている。出力層
４３は、中間層４２の各中間ユニットと相互に結合され
た、たとえば、２０個の出力ユニットから構成されてい
る。【００２３】ここでは、認識対象音声は２０個あるもの
とする。各認識対象音声を表す音声識別データは、各出
力ユニットに対応した２０個のデータからなり、その１
つのみが”１”で他が全て”０”のデータで構成されて
いるものとする。そして、データ”１”の位置が、各音
声識別データごとに異なっている。【００２４】このニューラルネットワークの学習は、次
のように行なわれる。まず、各認識対象音声に対する標
準音声パターンを、予め収集した音声を用いてそれぞれ
求める。各標準音声パターンとしては、対応する標準音
声信号の音声区間を８等分した各区間それぞれの平均ス
ペクトルが用いられている。また、各区間の音声スペク
トルは、予め定められた１６の周波数帯域に対する音声
スペクトルから構成されている。そして、求められた各
標準音声パターンを入力パターンとし、各入力パターン
に対応する音声を表す音声識別データを教師データとし
て、バックプロパゲーション法により、ニューラルネッ
トワークを学習させる。【００２５】図１の音声認識装置の動作について説明す
る。【００２６】音声分析部１は、入力音声の音声パワー信
号と、入力音声に対する音声スペクトルとを生成する。
入力音声の音声パワー信号は、音声区間検出部２に送ら
れる。入力音声に対する音声スペクトルは、音声パター
ン作成部３に送られる。【００２７】音声検出部２１は、図３に示すように、音
声検出用しきい値αを用いて、入力された音声パワー信
号中の音声部分を検出する。【００２８】音声区間切出し部２２は、図３に示すよう
に、複数の切出し用しきい値β１、β２、β３、β４を
用いて、複数の音声区間を設定する。この例では、第１
から第４の音声区間Ｌ１、Ｌ２、Ｌ３、Ｌ４を設定す
る。そして、設定した各音声区間Ｌ１〜Ｌ４の開始点と
終了点とを、各音声区間Ｌ１〜Ｌ４に対応させて、切出
し位置記憶部２３に格納する。【００２９】各切出し用しきい値β１、β２、β３、β
４は、たとえば、次のようにして設定される。まず、最
小の切出し用しきい値β１が、音声検出部２１によって
検出された音声部分（音声検出部分）の開始位置より所
定時間前の雑音パワーに基づいて決定される。そして、
決定された最小の切出し用しきい値β１に、定数γが加
算されることによりしきい値β２が求められ、しきい値
β２に定数γが加算されることによりしきい値β３が求
められ、しきい値β３に定数γが加算されることにより
しきい値β４が求められる。【００３０】音声パターン作成部３は、音声区間切出し
部２２によって求められた各音声区間Ｌ１〜Ｌ４に対す
る音声スペクトルに基づいて、各音声区間Ｌ１〜Ｌ４ご
とに音声パターンを作成して、ニューラルネットワーク
演算部４に入力させる。【００３１】つまり、切出し位置記憶部２３に格納され
ている第１の音声区間Ｌ１の開始点と終了点とに基づい
て、当該音声区間Ｌ１に対する音声パターン（Ｐ１）を
作成する。この音声パターンは、当該音声区間を８等分
した各区間それぞれの平均スペクトルが用いられてい
る。そして、各区間の音声スペクトルパターンは、予め
定められた１６の周波数帯域に対する音声スペクトルか
ら構成されている。作成された第１の音声パターン（Ｐ
１）は、学習済のニューラルネットワークに入力され
る。【００３２】学習済のニューラルネットワークに、第１
の音声パターン（Ｐ１）が入力されることにより、第１
の音声パターン（Ｐ１）に対応する出力パターンが得ら
れる。そして、得られた出力パターンに基づいて、認識
結果と出力最大値（２０個の出力のうちの最大値）と
が、第１認識結果として認識結果記憶部５に記憶され
る。【００３３】次に、切出し位置記憶部２３に格納されて
いる第２の音声区間Ｌ２の開始点と終了点とに基づい
て、当該音声区間Ｌ２に対する音声パターン（Ｐ２）が
作成され、作成された第２の音声パターン（Ｐ２）が学
習済のニューラルネットワークに入力される。これによ
り、第２の音声パターン（Ｐ２）に対応する出力パター
ンが得られる。そして、得られた出力パターンに基づい
て、認識結果と出力最大値とが、第２認識結果として認
識結果記憶部５に記憶される。【００３４】次に、第３の音声区間Ｌ３の開始点と終了
点とに基づいて、当該音声区間Ｌ３に対する音声パター
ン（Ｐ３）が作成されて、学習済のニューラルネットワ
ークに入力される。これにより、第３の音声パターン
（Ｐ３）に対応する出力パターンが得られる。そして、
得られた出力パターンに基づいて、認識結果と出力最大
値とが、第３認識結果として認識結果記憶部５に記憶さ
れる。【００３５】次に、第４の音声区間Ｌ４の開始点と終了
点とに基づいて、当該音声区間Ｌ４に対する音声パター
ン（Ｐ４）が作成されて、学習済のニューラルネットワ
ークに入力される。これにより、第４の音声パターン
（Ｐ４）に対応する出力パターンが得られる。そして、
得られた出力パターンに基づいて、認識結果と出力最大
値とが、第４認識結果として認識結果記憶部５に記憶さ
れる。【００３６】このようにして、第１〜第４の音声パター
ン（Ｐ１〜Ｐ４）に対する第１〜第４の認識結果が得ら
れると、認識結果判定部６は、認識結果記憶部５に記憶
されている第１〜第４の認識結果のうち、出力最大値
が”１”に最も近い音声認識結果を、当該検出音声部分
の音声認識結果として選択して出力する。つまり、音声
識別データ（教師データ）に類似度が最も高い出力パタ
ーンに基づいて、入力音声が認識される。【００３７】上記実施例では、１つの音声検出部分に対
して、複数の切出し用しきい値β１〜β４によって得ら
れた複数の音声区間Ｌ１〜Ｌ４が設定されている。そし
て、各音声区間ごとの音声パターンに基づいて、当該音
声検出部分の音声が認識されているので、雑音が音声区
間に含まれてしまうことによって誤認識が発生したり、
音声パワーの小さい語尾等が音声区間から脱落してしま
うことによって誤認識が発生したりするといったことが
防止される。この結果、音声認識精度が向上する。【００３８】図３の例では、切出し用しきい値β１によ
って設定された第１の音声区間Ｌ１の音声パターンに対
する出力パターンが、音声「しち」を表す音声識別デー
タ（教師データ）に最も近くなるので、当該音声検出部
に対しては「しち」と認識される。【００３９】上記実施例では、複数の音声区間は、入力
音声の音声パワーと、複数の切出し用しきい値とに基づ
いて設定されているが、音声パワー以外の音声区間判定
用のパラメータと、そのパラメータに応じた複数のしき
い値とに基づいて複数の音声区間を設定してもよい。音
声区間判定用のパラメータとしては、音声パワー以外
に、パワーの傾き、広域パワー、低域パワー等がある。【００４０】また、各音声区間ごとの音声パターンをそ
れぞれ作成するための、音声区間の特徴としては、音声
スペクトルの他、音声スペクトルの傾き、音声パワー等
を用いてもよい。【００４１】また、この発明は、入力音声から作成され
た音声パターンと、標準音声パターンとの類似度を、Ｄ
Ｐマッチング法( DTW : dynamic time warping )等によ
って判定する音声認識装置にも適用することができる。【００４２】【発明の効果】この発明によれば、認識精度の向上が図
れる。

Description:

[Detailed Description of the Invention]

[0001] [Field of Industrial Application] The present invention relates to a speech recognition device for entering data by voice, for example a speech recognition device used in a video recorder or similar apparatus in which recordings of programs are scheduled by voice input.

[0002] [Prior Art] FIG. 4 shows the configuration of a conventional speech recognition device.

[0003] A speech analysis unit 101 generates a speech power signal of the input speech and a speech spectrum of the input speech. The speech power signal is sent to a speech section detection unit 102. The speech spectrum is sent to a speech pattern creation unit 103.

[0004] The speech section detection unit 102 includes a speech detection unit 111 and a speech section cutout unit 112. As shown in FIG. 5, the speech detection unit 111 detects the speech portion of the speech power signal using a speech detection threshold α.

[0005] As shown in FIG. 5, the speech section cutout unit 112 uses a cutout threshold β to find the speech section L that is effective for speech recognition. The cutout threshold β is determined based on the noise power a predetermined time before the speech portion detected by the speech detection unit 111.

[0006] The speech pattern creation unit 103 creates a speech pattern based on the speech spectrum of the speech section L found by the speech section cutout unit 112. The created speech pattern is input to a trained neural network 104.

[0007] The neural network 104 is trained as follows. First, a standard speech pattern for each speech item to be recognized is obtained from speech collected in advance. The neural network 104 is then trained using each standard speech pattern as an input pattern and, as teacher data, the speech identification data representing the speech that corresponds to each input pattern.

[0008] When a speech pattern is input to the trained neural network 104, an output pattern corresponding to that speech pattern is obtained. This output pattern is sent to a recognition result determination unit 105, which recognizes the speech of the detected speech portion based on the output pattern and outputs the recognition result.

[0009] [Problems to Be Solved by the Invention] Because such a speech recognition device has only one cutout threshold β for setting the speech section effective for recognition, misrecognition is likely, either because noise is included in the speech section or because a word ending or other portion with low speech power drops out of the speech section. In the example of FIG. 5, an utterance that should be recognized as "shichi" is misrecognized as "shi".

[0010] An object of the present invention is to provide a speech recognition device with improved recognition accuracy.

[0011] [Means for Solving the Problems] The speech recognition device of the present invention comprises: speech section setting means for setting a plurality of speech sections based on the speech power of the input speech and a plurality of power thresholds for speech section determination, the thresholds being determined based on the noise power a predetermined time before the detected speech portion of the input speech; speech pattern creation means for creating a speech pattern for each speech section based on the speech spectrum of that section; and speech recognition means for recognizing the input speech based on the speech pattern of each speech section. The speech recognition means comprises: a neural network trained using the standard speech pattern for each speech item to be recognized as an input pattern and, as teacher data, the speech identification data representing the speech that corresponds to each input pattern; means for inputting the speech pattern of each speech section to the neural network to obtain an output pattern for each of those speech patterns; and means for recognizing the input speech based on whichever of the obtained output patterns has the highest similarity to the teacher data.

[0012] [0013] [0014] [0015] [0016]

[0017] [Operation] According to the speech recognition device of the present invention, a plurality of speech sections are set based on the speech power of the input speech and a plurality of power thresholds for speech section determination. A speech pattern is created for each speech section based on the speech spectrum of that section. The input speech is then recognized based on the speech pattern of each speech section.
Each of the power thresholds is determined based on the noise power a predetermined time before the detected speech portion of the input speech. One example of a feature of the speech section used here is the speech spectrum. The speech recognition means used in the speech recognition device of the present invention comprises: a neural network trained using the standard speech pattern for each speech item to be recognized as an input pattern and, as teacher data, the speech identification data representing the speech that corresponds to each input pattern; means for inputting the speech pattern of each speech section to the neural network to obtain an output pattern for each of those speech patterns; and means for recognizing the input speech based on whichever of the obtained output patterns has the highest similarity to the teacher data.

[0018] [Embodiment] An embodiment of the present invention is described below with reference to FIGS. 1 to 4.

[0019] FIG. 1 shows the configuration of the speech recognition device.

[0020] The speech recognition device includes a speech analysis unit 1, a speech section detection unit 2, a speech pattern creation unit 3, a neural network computation unit 4, a recognition result storage unit 5, and a recognition result determination unit 6. The speech section detection unit 2 includes a speech detection unit 21, a speech section cutout unit 22, and a cutout position storage unit 23.

[0021] FIG. 2 shows an example of the structure of the neural network provided in the neural network computation unit 4.

[0022] This neural network consists of an input layer 41, an intermediate layer 42, and an output layer 43. The input layer 41 consists of, for example, 128 input units (16 channels × 8 frames). The intermediate layer 42 consists of, for example, 50 intermediate units interconnected with each input unit of the input layer 41. The output layer 43 consists of, for example, 20 output units interconnected with each intermediate unit of the intermediate layer 42.

[0023] Here, the number of speech items to be recognized is assumed to be 20. The speech identification data representing each item consists of 20 values, one per output unit, of which exactly one is "1" and all the others are "0". The position of the "1" differs for each item's speech identification data.

[0024] The neural network is trained as follows. First, a standard speech pattern for each speech item to be recognized is obtained from speech collected in advance. Each standard speech pattern consists of the average spectra of the eight sections obtained by dividing the speech section of the corresponding standard speech signal into eight equal parts. The speech spectrum of each of these sections consists of the spectra for 16 predetermined frequency bands. The neural network is then trained by backpropagation, using each standard speech pattern as an input pattern and, as teacher data, the speech identification data representing the speech that corresponds to each input pattern.

[0025] The operation of the speech recognition device of FIG. 1 is described next.

[0026] The speech analysis unit 1 generates a speech power signal of the input speech and a speech spectrum of the input speech.
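The 128-value input pattern (8 equal sections × 16 frequency bands) and the 20-value teacher data described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `make_pattern` and `one_hot`, the representation of the spectrum as a list of per-frame band values, and the section-major ordering of the flattened pattern are not taken from the patent.

```python
def make_pattern(spectra, start, end, sections=8):
    """Build a flat speech pattern from frames start..end (inclusive).

    spectra: list of frames, each a list of band values (16 bands in the
    embodiment). The section is split into `sections` nearly equal parts
    and the band values are averaged within each part.
    The flattening order (section-major) is an assumption.
    """
    frames = spectra[start:end + 1]
    n = len(frames)
    if n < sections:
        raise ValueError("speech section is shorter than the number of parts")
    bands = len(frames[0])
    pattern = []
    for s in range(sections):
        lo = s * n // sections          # first frame of this part
        hi = (s + 1) * n // sections    # one past the last frame of this part
        chunk = frames[lo:hi]
        for b in range(bands):
            pattern.append(sum(f[b] for f in chunk) / len(chunk))
    return pattern


def one_hot(index, size=20):
    """Teacher data: exactly one value is 1, all others are 0."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy spectrum: 16 frames, 16 bands, every band of frame t holds the value t.
spectra = [[float(t)] * 16 for t in range(16)]
pattern = make_pattern(spectra, 0, 15)
teacher = one_hot(6)
print(len(pattern), pattern[0], pattern[16], sum(teacher))
```

With 16 frames, each of the 8 parts holds 2 frames, so the first part averages frames 0 and 1 and the second averages frames 2 and 3.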
The speech power signal is sent to the speech section detection unit 2. The speech spectrum is sent to the speech pattern creation unit 3.

[0027] As shown in FIG. 3, the speech detection unit 21 detects the speech portion of the input speech power signal using a speech detection threshold α.

[0028] As shown in FIG. 3, the speech section cutout unit 22 sets a plurality of speech sections using a plurality of cutout thresholds β1, β2, β3, and β4. In this example, first to fourth speech sections L1, L2, L3, and L4 are set. The start point and end point of each of the speech sections L1 to L4 are stored in the cutout position storage unit 23 in association with the respective section.

[0029] The cutout thresholds β1, β2, β3, and β4 are set, for example, as follows. First, the smallest cutout threshold β1 is determined based on the noise power a predetermined time before the start position of the speech portion (the detected speech portion) detected by the speech detection unit 21. Threshold β2 is then obtained by adding a constant γ to β1, threshold β3 by adding γ to β2, and threshold β4 by adding γ to β3.

[0030] The speech pattern creation unit 3 creates a speech pattern for each of the speech sections L1 to L4, based on the speech spectrum of each section found by the speech section cutout unit 22, and inputs it to the neural network computation unit 4.

[0031] Specifically, a speech pattern (P1) for the first speech section L1 is created based on the start point and end point of that section stored in the cutout position storage unit 23. This speech pattern consists of the average spectra of the eight sections obtained by dividing the speech section into eight equal parts. The spectrum pattern of each of these sections consists of the spectra for 16 predetermined frequency bands. The created first speech pattern (P1) is input to the trained neural network.

[0032] When the first speech pattern (P1) is input to the trained neural network, an output pattern corresponding to P1 is obtained. Based on this output pattern, the recognition result and the maximum output value (the largest of the 20 outputs) are stored in the recognition result storage unit 5 as the first recognition result.

[0033] Next, a speech pattern (P2) for the second speech section L2 is created based on the start point and end point of that section stored in the cutout position storage unit 23, and the created second speech pattern (P2) is input to the trained neural network. An output pattern corresponding to P2 is thereby obtained. Based on this output pattern, the recognition result and the maximum output value are stored in the recognition result storage unit 5 as the second recognition result.

[0034] Next, a speech pattern (P3) for the third speech section L3 is created based on the start point and end point of that section and is input to the trained neural network. An output pattern corresponding to P3 is thereby obtained. Based on this output pattern, the recognition result and the maximum output value are stored in the recognition result storage unit 5 as the third recognition result.

[0035] Next, a speech pattern (P4) for the fourth speech section L4 is created based on the start point and end point of that section and is input to the trained neural network. An output pattern corresponding to P4 is thereby obtained. Based on this output pattern, the recognition result and the maximum output value are stored in the recognition result storage unit 5 as the fourth recognition result.

[0036] Once the first to fourth recognition results for the first to fourth speech patterns (P1 to P4) have been obtained in this way, the recognition result determination unit 6 selects, from the first to fourth recognition results stored in the recognition result storage unit 5, the one whose maximum output value is closest to "1", and outputs it as the recognition result for the detected speech portion. In other words, the input speech is recognized based on the output pattern with the highest similarity to the speech identification data (teacher data).

[0037] In the above embodiment, a plurality of speech sections L1 to L4, obtained with a plurality of cutout thresholds β1 to β4, are set for a single detected speech portion. Because the speech of the detected speech portion is recognized based on the speech pattern of each of these sections, misrecognition caused by noise being included in the speech section, and misrecognition caused by a word ending or other portion with low speech power dropping out of the speech section, are prevented. As a result, speech recognition accuracy improves.

[0038] In the example of FIG. 3, the output pattern for the speech pattern of the first speech section L1, set with the cutout threshold β1, is the closest to the speech identification data (teacher data) representing the speech "shichi", so the detected speech portion is recognized as "shichi".

[0039] In the above embodiment, the plurality of speech sections are set based on the speech power of the input speech and a plurality of cutout thresholds. However, the plurality of speech sections may instead be set based on a speech section determination parameter other than speech power and a plurality of thresholds suited to that parameter. Besides speech power, such parameters include the slope of the power, wide-band power, and low-band power.

[0040] As the feature of the speech section used to create the speech pattern of each section, the slope of the speech spectrum, the speech power, and the like may be used in addition to the speech spectrum.

[0041] The present invention can also be applied to a speech recognition device that judges the similarity between a speech pattern created from the input speech and a standard speech pattern by DP matching (DTW: dynamic time warping) or a similar method.

[0042] [Effects of the Invention] According to the present invention, recognition accuracy can be improved.
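The multi-threshold cutout and the final selection described in the embodiment can be sketched as follows. This is a minimal sketch, not the patented implementation. Several details are assumptions because the text and the unavailable figures do not fix them: β1 is taken as the noise power plus a hypothetical `margin`, the noise power is read from a single frame `lead` frames before the onset, and a section is taken to run from the first to the last frame whose power is at least β. All function and parameter names are illustrative.

```python
def detect_onset(power, alpha):
    """Index of the first frame whose power reaches the detection threshold."""
    for i, p in enumerate(power):
        if p >= alpha:
            return i
    return None


def cut_thresholds(power, onset, lead, margin, gamma, count=4):
    """beta1 from the noise power `lead` frames before the onset,
    then beta2..beta4 by repeatedly adding the constant gamma.
    `beta1 = noise + margin` is an assumed rule."""
    noise = power[max(onset - lead, 0)]
    beta1 = noise + margin
    return [beta1 + k * gamma for k in range(count)]


def cut_section(power, beta):
    """(start, end), inclusive: first and last frame with power >= beta
    (assumed cutout rule). Returns None if no frame qualifies."""
    above = [i for i, p in enumerate(power) if p >= beta]
    return (above[0], above[-1]) if above else None


def select_result(outputs):
    """Pick the (label index, maximum output) pair whose maximum output
    is closest to 1 among the output patterns of all sections."""
    best = None
    for out in outputs:
        peak = max(out)
        candidate = (out.index(peak), peak)
        if best is None or abs(1.0 - peak) < abs(1.0 - best[1]):
            best = candidate
    return best


# Toy power contour: a loud syllable followed by a weak word ending.
power = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 2.0, 4.0, 4.0, 1.0, 1.0]
onset = detect_onset(power, alpha=8.0)
betas = cut_thresholds(power, onset, lead=3, margin=1.0, gamma=2.0)
sections = [cut_section(power, b) for b in betas]
print(onset, betas, sections)

# Hypothetical network outputs for three sections (3 classes for brevity).
print(select_result([[0.1, 0.6, 0.2], [0.05, 0.1, 0.93], [0.7, 0.2, 0.1]]))
```

In this toy contour the two lower thresholds keep the weak ending inside the section, while the two higher ones cut it off, which mirrors the "shichi" versus "shi" case discussed in the text.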

【図面の簡単な説明】【図１】音声認識装置の構成を示すブロック図である。【図２】図１のニューラルネットワーク演算部に用いら
れているニューラルネットワークの構造を示す模式図で
ある。【図３】図１の音声認識装置において、複数の切出し用
しきい値に基づいて複数の音声区間が設定されることを
示すタイムチャートである。【図４】従来の音声認識装置の構成を示すブロック図で
ある。【図５】図４の音声認識装置において、１つの切出し用
しきい値に基づいて１つの音声区間が設定されることを
示すタイムチャートである。【符号の説明】１音声分析部２音声区間検出部３音声パターン作成部４ニューラルネットワーク演算部５認識結果記憶部６認識結果判定部２１音声検出部２２音声区間切出し部２３切出し位置記憶部

[Brief Description of the Drawings]
FIG. 1 is a block diagram showing the configuration of the speech recognition device.
FIG. 2 is a schematic diagram showing the structure of the neural network used in the neural network computation unit of FIG. 1.
FIG. 3 is a time chart showing that a plurality of speech sections are set based on a plurality of cutout thresholds in the speech recognition device of FIG. 1.
FIG. 4 is a block diagram showing the configuration of a conventional speech recognition device.
FIG. 5 is a time chart showing that one speech section is set based on one cutout threshold in the speech recognition device of FIG. 4.

[Description of Reference Signs]
1: speech analysis unit
2: speech section detection unit
3: speech pattern creation unit
4: neural network computation unit
5: recognition result storage unit
6: recognition result determination unit
21: speech detection unit
22: speech section cutout unit
23: cutout position storage unit

フロントページの続き (56)参考文献特開平４−31896（ＪＰ，Ａ) 特開平３−116099（ＪＰ，Ａ) 特開昭61−99196（ＪＰ，Ａ) 特開昭59−36300（ＪＰ，Ａ) 特開昭61−99149（ＪＰ，Ａ) 特開昭59−211098（ＪＰ，Ａ) 特公平６−7343（ＪＰ，Ｂ２) 特公昭63−29754（ＪＰ，Ｂ２) 特許3091537（ＪＰ，Ｂ２) 特許2754960（ＪＰ，Ｂ２) 特許3322491（ＪＰ，Ｂ２) 特許3322536（ＪＰ，Ｂ２) (58)調査した分野(Int.Cl.⁷，ＤＢ名) G10L 11/02 G10L 15/16 G06N 3/00

Continuation of the front page
(56) References: JP-A-4-31896 (JP, A); JP-A-3-116099 (JP, A); JP-A-61-99196 (JP, A); JP-A-59-36300 (JP, A); JP-A-61-99149 (JP, A); JP-A-59-211098 (JP, A); JP-B-6-7343 (JP, B2); JP-B-63-29754 (JP, B2); Patent 3091537 (JP, B2); Patent 2754960 (JP, B2); Patent 3322491 (JP, B2); Patent 3322536 (JP, B2)
(58) Fields searched (Int. Cl.⁷, DB name): G10L 11/02; G10L 15/16; G06N 3/00

Claims

(57)【特許請求の範囲】【請求項１】入力音声の音声検出部分より所定時間前の
雑音パワーに基づいて決定される音声区間判定用の複数
のパワーしきい値と入力音声の音声パワーとに基づいて
複数の音声区間を設定する音声区間設定手段、各音声区
間の音声スペクトルに基づいて各音声区間ごとの音声パ
ターンをそれぞれ作成する音声パターン作成手段、およ
び各音声区間ごとの音声パターンに基づいて入力音声を
認識する音声認識手段を備えた音声認識装置において、上記各パワーしきい値として、該音声認識手段が、各
認識対象音声に対する標準音声パターンを入力パターン
とし、各入力パターンに対応する音声を表す音声識別デ
ータを教師データとして、学習が行なわれたニューラル
ネットワーク、上記各音声区間ごとの音声パターンを上
記ニューラルネットワークにそれぞれ入力して、上記各
音声区間ごとの音声パターンに対する出力パターンを求
める手段、および求められた全ての出力パターンのう
ち、教師データとの類似度が最も高い出力パターンに基
づいて、入力音声を認識する手段を備えたことを特徴と
する音声認識装置。

(57) [Claims]

[Claim 1] A speech recognition device comprising: speech section setting means for setting a plurality of speech sections based on the speech power of the input speech and a plurality of power thresholds for speech section determination, the thresholds being determined based on the noise power a predetermined time before the detected speech portion of the input speech; speech pattern creation means for creating a speech pattern for each speech section based on the speech spectrum of that section; and speech recognition means for recognizing the input speech based on the speech pattern of each speech section, characterized in that the speech recognition means comprises: a neural network trained using the standard speech pattern for each speech item to be recognized as an input pattern and, as teacher data, the speech identification data representing the speech that corresponds to each input pattern; means for inputting the speech pattern of each speech section to the neural network to obtain an output pattern for each of those speech patterns; and means for recognizing the input speech based on the output pattern that, among all the obtained output patterns, has the highest similarity to the teacher data.