JPS6332400B2

Info

Publication number: JPS6332400B2
Application number: JP56109265A
Authority: JP
Inventors: Isamu Nose; Kaneyoshi Mizuno
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1981-07-15
Filing date: 1981-07-15
Publication date: 1988-06-29
Also published as: JPS5811998A

Description

[Detailed description of the invention]

The present invention relates to a weighted dissimilarity calculation that can improve the recognition rate of a speech recognition device.

A block diagram of a conventional speech recognition device is shown in FIG. 1. In FIG. 1, 11 is an input terminal, 12 is a frequency analysis section, 13 is a voice section detection section, 14 is a voice section start detection signal, 15 is a voice section end detection signal, 16 is a spectrum conversion section, 17 is a dissimilarity calculation section, and 18 is a determination section. Each section is explained below.

The frequency analysis section 12 is configured as shown in FIG. 2. The input audio signal 21 is amplified by a preamplifier 22 and analyzed by a group of band-pass filters 23-1, 23-2, ..., 23-n whose center frequencies are set at logarithmically equal intervals between approximately 200 Hz and 6000 Hz, a group of full-wave rectifiers 24-1, 24-2, ..., 24-n, and a group of low-pass filters 25-1, 25-2, ..., 25-n. The result passes through a multiplexer 26, is quantized by an analog-to-digital converter 27 at preset time intervals (hereinafter referred to as the sample period), and is output to the output terminal 29 through a logarithmic converter 28.

The results analyzed by the frequency analysis section 12 are sent to the voice section detection section 13 and the spectrum conversion section 16.

The voice section detection section 13 detects the start and end of a voice section and sends a start detection signal 14 and an end detection signal 15 to the dissimilarity calculation section.
As a simple detection method, the average of the n analysis values output by the frequency analysis section 12 is computed at every sample period; the point at which this average first exceeds a preset threshold is taken as the start, and the point at which it last falls to or below the threshold is taken as the end.

The spectrum conversion section 16 normalizes the speaker-dependent sound source characteristics and the power, using the method published in the paper "A method of word speech recognition using nonlinear spectral matching" by Ohara et al. (IEICE technical research report PRL79-46). The calculation method is explained first.

Let the N pieces of data analyzed at a certain time by the frequency analysis section 12 be $x_i$ (i = 1 to N). The spectral conversion data $\hat{x}_i$ (i = 1 to N) is then given by equation (1).

$$\hat{x}_i = x_i - (A i + B) \qquad \cdots (1)$$

In equation (1), A and B are the slope and intercept of the least-squares approximation line of $x_i$ (i = 1 to N), and are obtained from the following equations.

$$A = \frac{N \sum_{i=1}^{N} i\,x_i - \sum_{i=1}^{N} i \sum_{i=1}^{N} x_i}{N \sum_{i=1}^{N} i^2 - \left(\sum_{i=1}^{N} i\right)^2} \qquad \cdots (2)$$

$$B = \frac{\sum_{i=1}^{N} i^2 \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} i \sum_{i=1}^{N} i\,x_i}{N \sum_{i=1}^{N} i^2 - \left(\sum_{i=1}^{N} i\right)^2} \qquad \cdots (3)$$

If the number of data N is fixed in equations (2) and (3), $\sum_{i=1}^{N} i$ and $\sum_{i=1}^{N} i^2$ are constants, and therefore the denominators of equations (2) and (3) are also constants. Putting $C_1 = \sum_{i=1}^{N} i$ and $C_2 = \sum_{i=1}^{N} i^2$, equations (2) and (3) become

$$A = \frac{N \sum_{i=1}^{N} i\,x_i - C_1 \sum_{i=1}^{N} x_i}{C_3} \qquad \cdots (4)$$

$$B = \frac{C_2 \sum_{i=1}^{N} x_i - C_1 \sum_{i=1}^{N} i\,x_i}{C_3} \qquad \cdots (5)$$

Here, $C_3 = N \sum_{i=1}^{N} i^2 - \left(\sum_{i=1}^{N} i\right)^2$.
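The two front-end steps described above, threshold-based voice section detection and the least-squares tilt removal of equations (1) to (5), can be sketched as follows. This is a minimal illustration rather than the patented circuit: the function names, the list-of-lists frame layout, and the choice of returning the end as an exclusive frame index are assumptions made for the example.

```python
def detect_voice_section(frames, threshold):
    """Return (start, end) frame indices, or None if no frame exceeds the threshold.

    frames: one list of n channel values per sample period (assumed non-empty).
    start:  first frame whose channel average exceeds the threshold.
    end:    frame following the last above-threshold frame (exclusive index;
            this convention is an assumption of the sketch).
    """
    averages = [sum(frame) / len(frame) for frame in frames]
    above = [t for t, avg in enumerate(averages) if avg > threshold]
    if not above:
        return None
    return above[0], above[-1] + 1


def spectral_transform(x):
    """Subtract the least-squares line A*i + B from x_1..x_N (equation (1)).

    A and B follow equations (4) and (5); requires at least two values,
    otherwise C3 would be zero.
    """
    n = len(x)
    idx = range(1, n + 1)
    c1 = sum(idx)                      # C1 = sum of i
    c2 = sum(i * i for i in idx)       # C2 = sum of i^2
    c3 = n * c2 - c1 * c1              # C3 = N*C2 - C1^2
    s_ix = sum(i * xi for i, xi in zip(idx, x))  # sum of i * x_i
    s_x = sum(x)                                 # sum of x_i
    a = (n * s_ix - c1 * s_x) / c3     # slope, equation (4)
    b = (c2 * s_x - c1 * s_ix) / c3    # intercept, equation (5)
    return [xi - (a * i + b) for i, xi in zip(idx, x)]
```

Because C1, C2 and C3 depend only on N, a fixed channel count would allow them to be precomputed once, which is what the constants in equations (4) and (5) express.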
As is clear from equations (4) and (5), once $\sum_{i=1}^{N} i\,x_i$ and $\sum_{i=1}^{N} x_i$ have been obtained from the input data, the values of A and B follow from equations (4) and (5), and the spectral conversion data $\hat{x}_i$ (i = 1 to N) can then be obtained from equation (1).

FIG. 3 shows a block diagram of the spectrum conversion section 16, which is explained below with reference to the figure.

The product of the input data $x_i$ (i = 1 to N) entered at the input terminal 31 and the index i generated by a counter 32, which counts in synchronization with the input data, is obtained by a multiplier 33. An adder 34 and a register 35 then accumulate the values of $i \cdot x_i$, so that $\sum_{i=1}^{N} i\,x_i$ is set in the register 35. Likewise, an adder 36 and a register 37 set $\sum_{i=1}^{N} x_i$ in the register 37.

Next, the multiplexers 38 and 39 select the values N and $C_1$, respectively, so that the multiplier 40 yields $N \cdot \sum_{i=1}^{N} i\,x_i$ and the multiplier 41 yields $C_1 \cdot \sum_{i=1}^{N} x_i$. A subtractor-divider 42 then computes $(N \sum_{i=1}^{N} i\,x_i - C_1 \sum_{i=1}^{N} x_i)/C_3$, and the result, that is, the value of A, is set in a register 43.
Similarly, the multiplexers 38 and 39 select $C_1$ and $C_2$, respectively, and the multipliers 40 and 41 and a subtractor-divider 44 compute $(C_2 \sum_{i=1}^{N} x_i - C_1 \sum_{i=1}^{N} i\,x_i)/C_3$; the result, that is, the value of B, is set in a register 45.

Next, a counter 46 generates i, a multiplier 47 computes $A \cdot i$, and an adder 48 computes $Ai + B$. A subtractor 50 then subtracts $Ai + B$, obtained by the adder 48, from the input data $x_i$ delayed by a delay circuit 49, and the spectral conversion data $\hat{x}_i$ is output at the output terminal.

Next, the configuration of the dissimilarity calculation section 17 is shown in FIG. 4 and explained below with reference to the figure. In FIG. 4, 101 is the voice section start detection signal, 102 is the voice section end detection signal, 103 is the input data from the spectrum conversion section 16, 104 is an input memory control circuit, 105 is an input memory, 106 is a standard pattern memory control circuit, 107 is a standard pattern memory, 108 is an absolute difference calculation circuit, 109 is an adder, and 110 is a register.

From the time the voice section start detection signal 101 is generated until the voice section end detection signal 102 is generated, the input data 103 is stored in the input memory 105 by the input memory control circuit 104. When storage of the input data 103 for the voice section is complete, the dissimilarity between the contents of the input memory 105 and each desired standard pattern, analyzed in advance and stored in the standard pattern memory 107, is calculated in turn.

Dissimilarity is commonly calculated by using dynamic programming to align the input data and the standard pattern nonlinearly, but to simplify the explanation a method using linear correspondence is described below. It is nevertheless clear that the present invention can also be applied to nonlinear correspondence.

Corresponding elements of the input data and the standard pattern are read out via the input memory control circuit 104 and the standard pattern memory control circuit 106, the absolute value of their difference is computed by the absolute difference calculation circuit 108, the result is added to the contents of the register 110 by the adder 109, and the sum is written back into the register 110.

Repeating this operation for all corresponding elements gives the dissimilarity between the input data and one standard pattern. In this way, the dissimilarity with all or some of the standard patterns stored in the standard pattern memory 107 is calculated. The register 110 must, however, be initialized to 0 at the start of the dissimilarity calculation for each standard pattern.

That is, the dissimilarity between the standard pattern P of a recognition word and the input data Q, assuming that their corresponding elements have been normalized in advance, is given by equation (6).

$$\text{Dissimilarity} = \sum_{l=1}^{L} \sum_{i=1}^{N} \left| \hat{x}^{P}(i, l) - \hat{x}^{Q}(i, l) \right| \qquad \cdots (6)$$

In equation (6), i is the number assigned to the corresponding element, and l is the number assigned to the time series after normalization of the voice section lengths of the standard pattern P and the input data Q.

Based on the result of the dissimilarity calculation section 17, the determination section 18 judges that the speech that was input is the same as the standard pattern with the lowest dissimilarity, that is, the highest similarity, and outputs the result.

However, in the conventional technique described above, because speech varies not only between speakers but also from utterance to utterance of the same speaker, misrecognition occurs between words whose analysis results are similar.

The present invention therefore remedies this drawback of the conventional technique. Its object is to improve the recognition rate of a speech recognition device: weight region data is added to the standard pattern memory, and a function is added to the dissimilarity calculation section that determines the magnitude of the weight from the mutual relationship between the levels, including sign, of the input pattern and the standard pattern.

That is, even patterns that can clearly be recognized as different when their short-time spectra are inspected visually may yield a small overall dissimilarity and be misrecognized.

Thus, one effective method for discriminating the small number of utterances that appear similar under a uniform dissimilarity calculation alone is to apply a weight, in the direction that increases the dissimilarity, to specific regions of a standard pattern whose elements are spectral conversion data.

The present invention performs this weighted dissimilarity calculation taking into account the positions of peaks and valleys in the short-time spectrum. In particular, it exploits the fact that peaks and valleys of the short-time spectrum appear as the positive and negative signs and the absolute magnitudes of the spectral conversion data.
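For reference, the conventional calculation of equation (6) with linear correspondence, together with the decision rule of the determination section 18, can be sketched as below. This is only an illustrative sketch: the function names and the dictionary of templates are assumptions, and both patterns are assumed to have already been normalized to the same number of frames L and channels N.

```python
def dissimilarity(pattern, query):
    """Equation (6): sum of |x^P(i, l) - x^Q(i, l)| over frames l and channels i.

    pattern, query: lists of L frames, each a list of N spectral conversion values.
    """
    if len(pattern) != len(query):
        raise ValueError("patterns must be length-normalized to the same frame count")
    total = 0.0  # plays the role of register 110, cleared before each pattern
    for p_frame, q_frame in zip(pattern, query):
        for p, q in zip(p_frame, q_frame):
            total += abs(p - q)  # absolute difference circuit 108 plus adder 109
    return total


def recognize(templates, query):
    """Return the word whose standard pattern has the lowest dissimilarity."""
    return min(templates, key=lambda word: dissimilarity(templates[word], query))
```

A nonlinear (dynamic programming) alignment would replace the frame-by-frame `zip` with a search over frame correspondences, while the per-element distance stays the same.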
FIG. 5 is a block diagram of an embodiment of the present invention, in which 11 is an input terminal, 12 is a frequency analysis section, 13 is a voice section detection section, 14 is a voice section start detection signal, 15 is a voice section end detection signal, 16 is a spectrum conversion section, 55 is a weighted dissimilarity calculation section, and 18 is a determination section. Since the configuration is the same as that of FIG. 1 except for the weighted dissimilarity calculation section 55, the weighted dissimilarity calculation section 55 is explained in detail below with reference to FIG. 6.

In FIG. 6, 101 is the voice section start detection signal, 102 is the voice section end detection signal, 103 is the input data from the spectrum conversion section 16, 104 is an input memory control circuit, 105 is an input memory, 108 is an absolute difference calculation circuit, 203 is a standard pattern memory control circuit, 204 is a standard pattern memory, 201 is the output signal line of the input memory, 205 is the output signal line for the pattern data of the standard pattern memory, 207 is the output signal line for the weight calculation designation of the standard pattern memory, 208 and 209 are level conversion circuits, 210 and 211 are the output signal lines of the level conversion circuits 208 and 209, 212 is a table memory, 213 is a multiplier, 109 is an adder, and 110 is a register.

From the time the voice section start detection signal 101 is generated until the voice section end detection signal 102 is generated, the input data 103 is stored in the input memory 105 by the input memory control circuit 104.

When storage of the input data 103 is complete, the weighted dissimilarity between the contents of the input memory 105 and the standard patterns, analyzed in advance and stored in the standard pattern memory 204, is calculated in turn.

In the weighted dissimilarity calculation, the standard pattern is described as a time series of pairs consisting of spectral conversion data $\hat{z}_i$, calculated in the same way as in equation (1) (written $\hat{z}_i$ below to distinguish it from the input data $\hat{x}_i$), and weight designation data $P_i$, whereas the input data consists only of the spectral conversion data $\hat{x}_i$ given by equation (1). Each spectral conversion data is output via the output signal lines 205 and 201 to the inputs of the absolute difference calculation circuit 108 and to the inputs of the level conversion circuits 209 and 208, and at the same time the weight designation data is output via the line 207, which forms part of the address input lines of the table memory. The weight designation data $P_i$ is set to $P_i = 1$ when a weight is designated and to $P_i = 0$ when no weight is designated.

Because the input data of the logarithmic converter 27 has a large number of bits (8 bits or more), the level conversion circuits 208 and 209 reduce the number of bits so that the capacity of the table memory 212 does not become large. The outputs of the level conversion circuits 208 and 209 are normally converted to about 2 to 4 bits. For example, in the 2-bit case, if the input data is $\hat{y}_i$ (with minimum value MIN, a negative number, and maximum value MAX, a positive number, so that $\mathrm{MIN} \le \hat{y}_i \le \mathrm{MAX}$), the conversion output is as shown in the following table.

[Table]
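The data path of FIG. 6 described above (level conversion, table lookup addressed by the two level codes and $P_i$, and multiply-accumulate) can be sketched as follows. Everything numeric here is hypothetical: the conversion table is not reproduced in this text, so the thresholds at MIN/2, 0 and MAX/2, the contents of the weight table, and the use of weight 1 when $P_i = 0$ are assumptions chosen only to illustrate the mechanism, with the constraint from the claims that every weight is at least 1.

```python
def level_convert(y, y_min, y_max):
    """Map y in [y_min, y_max] (y_min < 0 < y_max) to a 2-bit code 0..3.

    The thresholds are illustrative assumptions, not the patent's table.
    """
    if y < y_min / 2:
        return 0  # strongly negative
    if y < 0:
        return 1  # weakly negative
    if y < y_max / 2:
        return 2  # weakly positive (includes zero)
    return 3      # strongly positive


def build_weight_table():
    """Hypothetical table memory contents, indexed [P][x_level][z_level]."""
    table = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
    for xl in range(4):
        for zl in range(4):
            opposite_sign = (xl < 2) != (zl < 2)
            if opposite_sign:
                # Emphasize sign disagreement, more strongly for a larger level gap.
                table[1][xl][zl] = 1.0 + abs(xl - zl)
    return table  # table[0] stays all 1.0: no emphasis where P = 0


def weighted_dissimilarity(pattern, flags, query, table, y_min, y_max):
    """Sum of w * |x - z| over all elements, with w read from the table."""
    total = 0.0  # register 110, cleared before each standard pattern
    for z_frame, p_frame, x_frame in zip(pattern, flags, query):
        for z, p, x in zip(z_frame, p_frame, x_frame):
            x_level = level_convert(x, y_min, y_max)
            z_level = level_convert(z, y_min, y_max)
            w = table[p][x_level][z_level]  # table memory 212
            total += w * abs(x - z)         # multiplier 213 and adder 109
    return total
```

With 2-bit codes for each input and one bit for $P_i$, the table in this sketch has only 32 entries, which illustrates why reducing the bit width keeps the table memory small.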

Claims

[Claims]

1. A speech recognition device comprising: frequency analysis means for frequency-analyzing input speech; voice section detection means connected to the output thereof; spectral conversion means for outputting spectral conversion data in which the speaker-dependent sound source characteristics and the power are normalized with respect to the output data of the frequency analysis means; dissimilarity calculation means, connected to the output of the spectral conversion means, for calculating a dissimilarity by comparing input spectral conversion data with a standard pattern within the voice section between the start and the end of speech given by the voice section detection means; and determination means connected to the output thereof for recognizing the speech, wherein the dissimilarity calculation means comprises:

first storage means for storing a time series of the input spectral conversion data $\hat{x}_i$;

second storage means for storing a standard pattern consisting of a time series of pre-calculated spectral conversion data $\hat{z}_i$ and weight designation data $P_i$ indicating whether or not a weight is designated for each spectral conversion data $\hat{z}_i$;

third storage means in which weighting coefficients ($w_i \ge 1$) are stored, preset so as to emphasize the difference between the possible input spectral conversion data $\hat{X}_i$ and the possible spectral conversion data $\hat{Z}_i$ of the standard pattern according to their polarities and absolute magnitudes, and which outputs the weighting coefficient $w_i$ on the basis of the input spectral conversion data $\hat{x}_i$ from the first storage means and the spectral conversion data $\hat{z}_i$ and the weight designation data $P_i$ from the second storage means;

first calculation means for calculating and outputting the distance between the input spectral conversion data $\hat{x}_i$ from the first storage means and the spectral conversion data $\hat{z}_i$ from the second storage means; and

second calculation means for calculating and outputting a weighted dissimilarity by performing a product-sum operation on the outputs of the third storage means and the first calculation means,

wherein, of the weight designation data, the weight designation data indicating that a weight is designated is preset in specific regions that are effective for distinguishing the standard pattern from other standard patterns.