JPH024919B2

Info

Publication number: JPH024919B2
Application number: JP56089880A
Authority: JP
Inventors: Hidekazu Tsuboka; Yoshiteru Mifune; Satoru Kabasawa
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1981-06-11
Filing date: 1981-06-11
Publication date: 1990-01-30
Also published as: JPS57204598A

Description

[Detailed Description of the Invention] One approach to speech recognition first recognizes phonemes, then matches the resulting phoneme string, at the phoneme level, against each word in a word dictionary whose entries are written as phoneme strings, using an inter-phoneme similarity measure, and takes the word with the greatest similarity as the recognition result. To recognize phonemes in such a scheme, standard patterns of the phonemes are registered in advance, and the input speech signal is compared with these standard patterns. The present invention relates to a method for obtaining such phoneme standard patterns. An example of a conventional speech recognition device is described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a speech recognition device based on this kind of phoneme recognition. Reference numeral 1 denotes a speech signal input terminal, which receives a speech signal converted into an electrical signal by a microphone (not shown) or the like. Reference numeral 2 denotes a feature extraction unit, which converts the input speech signal into a time series of feature parameters. If, for example, it consists of a 20-channel filter bank, the input speech signal is converted into a time series of sets of 20 values (20-dimensional vectors), each set giving the output magnitudes of the 20 band-pass filters arranged along the frequency axis. Reference numeral 3 denotes a phoneme standard pattern holding unit, in which each phoneme to be recognized is held as a 20-dimensional vector. These vectors are extracted for each phoneme by feature extraction unit 2 in the same way as above and are prepared in advance. Reference numeral 4 denotes a phoneme recognition unit, which at fixed time intervals compares the feature vector received through switch 11 with each phoneme in phoneme standard pattern holding unit 3 and outputs, as the recognition result, the phoneme standard pattern with the highest similarity, that is, the smallest distance. Reference numeral 5 denotes a word dictionary, which holds the words to be recognized as phoneme sequences. Reference numeral 6 denotes a word recognition unit, which compares the phoneme string output by phoneme recognition unit 4 with each word in word dictionary 5 and outputs the most similar word to terminal 7 as the recognized word. The similarity between the input phoneme string and each dictionary word is computed from the inter-phoneme similarity. This inter-phoneme similarity is determined experimentally in advance. For example, in the 20-dimensional vector space, the similarity between phoneme X and phoneme Y is taken to be a linear transformation of the distance between the point given by the mean of many vectors representing phoneme X and the point given by the mean of many vectors representing phoneme Y. Reference numeral 8 denotes a phoneme standard pattern creation unit. Switch 11 routes the output of feature extraction unit 2 to phoneme standard pattern creation unit 8 when phoneme patterns are being created, and to phoneme recognition unit 4 during recognition.
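The comparison performed by phoneme recognition unit 4 and the distance-based inter-phoneme similarity can be sketched as follows. This is only an illustration: the text does not name the distance measure, so Euclidean distance is an assumption, the 4-channel vectors stand in for the 20-channel ones, and the function names and the `scale` and `offset` constants of the linear transformation are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def recognize_phoneme(frame, templates):
    """Return the label of the standard pattern closest to the input frame."""
    return min(templates, key=lambda label: euclidean(frame, templates[label]))

def phoneme_similarity(mean_x, mean_y, scale=-1.0, offset=10.0):
    """Similarity as a linear function of the distance between two mean vectors.
    scale and offset are hypothetical constants chosen for illustration."""
    return scale * euclidean(mean_x, mean_y) + offset

# Toy 4-channel standard patterns (made-up values).
templates = {"a": [1.0, 5.0, 2.0, 0.5], "i": [4.0, 1.0, 0.5, 3.0]}
frame = [1.2, 4.6, 2.1, 0.4]
best = recognize_phoneme(frame, templates)
```

With a negative `scale`, a smaller distance gives a larger similarity, which matches the text's "highest similarity, that is, the smallest distance".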

In the configuration above, a phoneme standard pattern has conventionally been created simply by averaging many vectors representing the same phoneme. The pattern of a given phoneme X differs from speaker to speaker, and even for a single speaker it varies under the influence of the preceding and following phonemes. To obtain a pattern that is as standard as possible, many vectors for phoneme X are therefore collected from different speakers and with different neighboring phonemes, and their average is taken. Expressed mathematically, if the k-dimensional vectors for phoneme X are $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ with $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$, the standard pattern vector $\bar{\mathbf{x}}$ of phoneme X is given by

$$\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i = \left(\frac{1}{n}\sum_{i=1}^{n}x_{i1},\ \frac{1}{n}\sum_{i=1}^{n}x_{i2},\ \ldots,\ \frac{1}{n}\sum_{i=1}^{n}x_{ik}\right).$$
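The conventional averaging amounts to a component-wise mean. A minimal sketch, with a hypothetical function name and made-up 3-dimensional sample vectors:

```python
def mean_vector(vectors):
    """Component-wise mean of n equal-length vectors (the conventional standard pattern)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Two toy 3-dimensional vectors for the same phoneme.
samples = [[2.0, 4.0, 6.0], [4.0, 8.0, 0.0]]
standard = mean_vector(samples)
```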

In FIG. 2, the horizontal axis is the channel number of the band-pass filters making up the filter bank, and the vertical axis is the output intensity of the phoneme. A and B are spectra of phoneme X uttered by two different speakers, and C is their average spectrum. As the figure shows, A and B each have two pronounced peaks when taken alone, whereas in C these features are considerably weakened. This is because the peaks of A and B lie at different positions on the frequency axis, mainly because differences in the speakers' vocal tract lengths produce different resonance frequencies. According to phonetics, the way these peaks (formants) appear is closely tied to the type of phoneme and carries information that is extremely important for discriminating phonemes. It is therefore undesirable for the averaging operation to weaken these features, as it does in C. The explanation above concerns the average of only two spectra. When more spectra are averaged in this way, the features risk being diminished even further.
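The weakening of the peaks can be reproduced with made-up numbers. In this sketch the two 8-channel spectra are invented for illustration (they are not the data of FIG. 2): both have the same two-peak shape, offset by one channel.

```python
# Same two-peak shape, shifted by one channel between the two speakers.
spectrum_a = [1.0, 8.0, 1.0, 1.0, 7.0, 1.0, 1.0, 1.0]
spectrum_b = [1.0, 1.0, 8.0, 1.0, 1.0, 7.0, 1.0, 1.0]

# Plain channel-by-channel average, as in the conventional method.
spectrum_c = [(x + y) / 2 for x, y in zip(spectrum_a, spectrum_b)]
```

Each peak of height 8 or 7 is spread over two channels at roughly half the height, so the average no longer shows the sharp formant structure of either input.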

The present invention eliminates this drawback and provides a speech recognition device that obtains appropriate standard patterns.

The basic idea is to bring pattern A closer to pattern B by nonlinearly expanding and contracting its frequency axis, and then to average the two. The frequency-axis warping method used in an embodiment of the invention is described next.

FIG. 3 is a so-called lattice graph, in which the horizontal axis is the frequency axis of pattern B and the vertical axis is the frequency axis of pattern A. The region bounded by lines 13 and 14, or by lines 13 and 15, is the matching window, and a route may be selected only inside this window. Case (a) is the one in which the frequency axis of pattern A is shifted toward higher frequencies relative to pattern B, and case (b) is the one in which it is shifted toward lower frequencies. In both cases the shift is limited to one channel.

FIG. 4 shows the route selection conditions. Here i is the coordinate corresponding to a channel of pattern B, j is the coordinate corresponding to a channel of pattern A, and the figure shows the routes by which point (i, j) may be reached. The numbers 1 and 2 written above the line segments are examples of the weighting coefficient k applied when that route is chosen. Under the first condition shown in FIG. 4, only three routes are allowed: the route from (i−1, j) to (i, j) carries k = 1, the route from (i−1, j−1) to (i, j) carries k = 2, and the route from (i, j−1) to (i, j) carries k = 1. Under the second condition shown in FIG. 4, likewise only three routes are allowed, from (i−1, j), (i−1, j−1), and (i−1, j−2) to (i, j), and all of them carry k = 1.

Using the three routes and the weighting coefficients k of the first condition of FIG. 4, let the vector of pattern A be $\mathbf{a} = (a_1, a_2, \ldots, a_I)$ and the vector of pattern B be $\mathbf{b} = (b_1, b_2, \ldots, b_I)$, where I is the number of channels, and let $d(i, j) = |b_i - a_j|$. The distance between pattern A and pattern B can then be defined by applying well-known dynamic programming, that is, by solving the recurrence relations below.

The two cases of FIG. 3 are described separately below: (a) the frequency axis of pattern A is shifted toward higher frequencies relative to pattern B, and (b) it is shifted toward lower frequencies.

(a) The frequency axis of pattern A is shifted toward higher frequencies relative to pattern B, and the route selection condition of FIG. 4 is applied.

With the initial value $g(1,1) = d(1,1)$, the values $g(i,j)$ are computed successively from

$$g(i,i) = \min\left[\begin{array}{l} g(i,i-1) + d(i,i) \\ g(i-1,i-1) + k\,d(i,i) \end{array}\right]$$

$$g(i,i-1) = \min\left[\begin{array}{l} g(i-1,i-2) + k\,d(i,i-1) \\ g(i-1,i-1) + d(i,i-1) \end{array}\right]$$

(where $k = 2$), and

$$D_H(A,B) = \frac{g(I,I)}{2I}$$

is taken as the distance between pattern A and pattern B. Here $g(i,j) = \infty$ whenever an index $i$ or $j$ is $\le 0$, and min in the two recurrences denotes the smaller of the values in the brackets. Since I is constant, the distance can also be defined simply as $D_H(A,B) = g(I,I)$.
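The recurrences of case (a) can be written out directly. The following is a sketch under stated assumptions: the function name is hypothetical, the 4-channel vectors are toy data, and lattice points that are never reached are treated as infinite.

```python
import math

def distance_case_a(a, b, k=2):
    """D_H(A, B) = g(I, I) / 2I over the lattice points (i, i) and (i, i-1)."""
    I = len(a)

    def d(i, j):
        # d(i, j) = |b_i - a_j| with 1-based channel numbers
        return abs(b[i - 1] - a[j - 1])

    g = {(1, 1): d(1, 1)}  # initial value g(1, 1) = d(1, 1)

    def G(i, j):
        # points outside the lattice (including index <= 0) count as infinity
        return g.get((i, j), math.inf)

    for i in range(2, I + 1):
        # (i, i-1): diagonal step with weight k, or horizontal step with weight 1
        g[(i, i - 1)] = min(G(i - 1, i - 2) + k * d(i, i - 1),
                            G(i - 1, i - 1) + d(i, i - 1))
        # (i, i): vertical step with weight 1, or diagonal step with weight k
        g[(i, i)] = min(G(i, i - 1) + d(i, i),
                        G(i - 1, i - 1) + k * d(i, i))
    return g[(I, I)] / (2 * I)

# Pattern B has its peak one channel above pattern A's peak.
a = [0, 5, 0, 0]
b = [0, 0, 5, 0]
dist_match = distance_case_a(a, b)
dist_mismatch = distance_case_a(b, a)
```

In the first call the window lets channel 3 of B pair with channel 2 of A, so the accumulated cost is zero. With the roles swapped the required pairing lies outside this window and the distance stays positive.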

(b) The frequency axis of pattern A is shifted toward lower frequencies relative to pattern B, and the route selection condition of FIG. 4 is applied.

With the initial value $g(1,1) = d(1,1)$, the values $g(i,j)$ are computed successively from

$$g(i,i) = \min\left[\begin{array}{l} g(i-1,i) + d(i,i) \\ g(i-1,i-1) + k\,d(i,i) \end{array}\right]$$

$$g(i-1,i) = \min\left[\begin{array}{l} g(i-2,i-1) + k\,d(i-1,i) \\ g(i-1,i-1) + d(i-1,i) \end{array}\right]$$

(where $k = 2$), and

$$D_L(A,B) = \frac{g(I,I)}{2I}$$

is taken as the distance between pattern A and pattern B. Here $g(i,j) = \infty$ whenever an index $i$ or $j$ is $\le 0$, and min in the two recurrences denotes the smaller of the values in the brackets. Since I is constant, the distance can also be defined simply as $D_L(A,B) = g(I,I)$.

In the course of obtaining the distance $D_H(A,B)$ for case (a) of FIG. 3 and the distance $D_L(A,B)$ for case (b) in this way, the optimal routes are found as shown in FIG. 5(a) and FIG. 5(b), respectively. That is, starting from (I, I) and tracing the recurrence relations backward for each case makes the route explicit.
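Tracing the recurrence backward from (I, I) can be sketched by remembering, for each lattice point, which predecessor gave the minimum. The example below does this for case (b). The function name, the `prev` table, and the toy vectors are illustrative assumptions, not part of the patent text.

```python
import math

def route_case_b(a, b, k=2):
    """Return (g(I, I), route) for case (b): lattice points (i, i) and (i-1, i)."""
    I = len(a)

    def d(i, j):
        return abs(b[i - 1] - a[j - 1])  # d(i, j) = |b_i - a_j|

    g = {(1, 1): d(1, 1)}
    prev = {}

    def relax(point, candidates):
        # candidates: (predecessor, weight); unreached predecessors cost infinity
        cost, best = min((g.get(p, math.inf) + w * d(*point), p)
                         for p, w in candidates)
        g[point], prev[point] = cost, best

    for i in range(2, I + 1):
        relax((i - 1, i), [((i - 2, i - 1), k), ((i - 1, i - 1), 1)])
        relax((i, i), [((i - 1, i), 1), ((i - 1, i - 1), k)])

    # Trace back from (I, I) to (1, 1), then reverse.
    route = [(I, I)]
    while route[-1] != (1, 1):
        route.append(prev[route[-1]])
    return g[(I, I)], route[::-1]

# Pattern B's peak sits one channel below pattern A's peak.
cost, route = route_case_b([0, 0, 5, 0], [0, 5, 0, 0])
```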

Computing the distances $D_H(A,B)$ and $D_L(A,B)$ between patterns A and B for I = 20 with these recurrence relations gives $D_H(A,B) = 146$ and $D_L(A,B) = 187$, so that $D_H < D_L$. As FIG. 5(a) and FIG. 5(b) make clear, the selection of FIG. 3(b), in which the frequency axis of pattern A is shifted toward lower frequencies relative to pattern B, is preferable to the method shown in FIG. 5(b) of shifting the frequency axis of pattern A toward higher frequencies relative to pattern B.

In FIG. 6, A′ is the spectrum obtained by shifting the frequency axis of A toward lower frequencies relative to pattern B along the route found above, and C′ is the average of spectrum A′ and spectrum B. Where, for example, channel 2 and channel 3 of pattern A both correspond to channel 3 of pattern B, the average is taken over three spectral intensities: channel 3 of pattern B and channels 2 and 3 of pattern A. The spectrum C′ obtained in this way preserves the spectral features well.
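The averaging along a route, including the three-way average when two channels of pattern A fall on one channel of pattern B, might look like this. The route and the 3-channel values are invented for illustration and the function name is hypothetical.

```python
from collections import defaultdict

def average_along_route(a, b, route):
    """For each channel i of B, average b_i with every a_j paired with it on the route."""
    matched = defaultdict(list)
    for i, j in route:
        matched[i].append(a[j - 1])
    return [(b[i - 1] + sum(matched[i])) / (1 + len(matched[i]))
            for i in range(1, len(b) + 1)]

a = [2.0, 6.0, 4.0]
b = [2.0, 4.0, 8.0]
# Channel 3 of B is paired with channels 2 and 3 of A.
route = [(1, 1), (2, 2), (3, 2), (3, 3)]
c_prime = average_along_route(a, b, route)
```

Channels 1 and 2 are ordinary two-way averages, while channel 3 is the mean of 8, 6, and 4.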

Alternatively, the distance between pattern A and pattern B can be defined by using the three routes and the weighting coefficients k of the second condition of FIG. 4 as the route selection condition and solving the recurrence relations below.

The two cases of FIG. 3 are again described separately: (a) the frequency axis of pattern A is shifted toward higher frequencies relative to pattern B, and (b) it is shifted toward lower frequencies.

(a) The frequency axis of pattern A is shifted toward higher frequencies relative to pattern B.

With the initial value $g(1,1) = d(1,1)$, the values $g(i,j)$ are computed successively from

$$g(i,i) = \min\left[\begin{array}{l} g(i-1,i-1) + d(i,i) \\ g(i-1,i-2) + d(i,i) \end{array}\right]$$

$$g(i,i-1) = \min\left[\begin{array}{l} g(i-1,i-1) + d(i,i-1) \\ g(i-1,i-2) + d(i,i-1) \end{array}\right]$$

and

$$D_H(A,B) = \frac{g(I,I)}{2I}$$

is taken as the distance between pattern A and pattern B. Here $g(i,j) = \infty$ whenever an index $i$ or $j$ is $\le 0$, and min in the two recurrences denotes the smaller of the values in the brackets. Since I is constant, the distance can also be defined simply as $D_H(A,B) = g(I,I)$.

(b) The frequency axis of pattern A is shifted toward lower frequencies relative to pattern B.

With the initial value $g(1,1) = d(1,1)$, the values $g(i,j)$ are computed successively from

$$g(i,i) = \min\left[\begin{array}{l} g(i-1,i-1) + d(i,i) \\ g(i-1,i) + d(i,i) \end{array}\right]$$

$$g(i-1,i) = \min\left[\begin{array}{l} g(i-2,i-1) + d(i-1,i) \\ g(i-2,i-2) + d(i-1,i) \end{array}\right]$$

and

$$D_L(A,B) = \frac{g(I,I)}{2I}$$

is taken as the distance between pattern A and pattern B. Here $g(i,j) = \infty$ whenever an index $i$ or $j$ is $\le 0$, and min in the two recurrences denotes the smaller of the values in the brackets. Since I is constant, the distance can also be defined simply as $D_L(A,B) = g(I,I)$.
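Both cases of this second set of recurrences share one rule: point (i, j) is reached from (i−1, j), (i−1, j−1), or (i−1, j−2) with unit weight, and only the window differs. A sketch under that reading follows. The `shift` parameter and the function name are hypothetical, and the sketch returns g(I, I) without the constant divisor.

```python
import math

def accumulated_cost(a, b, shift):
    """g(I, I) with unit weights.
    shift=-1: window j in {i-1, i} (case a); shift=+1: window j in {i, i+1} (case b)."""
    I = len(a)
    lo, hi = (-1, 0) if shift < 0 else (0, 1)
    g = {(1, 1): abs(b[0] - a[0])}  # g(1, 1) = d(1, 1)
    for i in range(2, I + 1):
        for j in range(max(1, i + lo), min(I, i + hi) + 1):
            # predecessors outside the window were never stored, so they cost infinity
            best = min(g.get((i - 1, j - s), math.inf) for s in (0, 1, 2))
            g[(i, j)] = best + abs(b[i - 1] - a[j - 1])
    return g[(I, I)]

a = [0, 5, 0, 0]
b = [0, 0, 5, 0]
cost_a = accumulated_cost(a, b, shift=-1)
cost_b = accumulated_cost(a, b, shift=+1)
```

For this toy pair the two windows give different costs, and comparing them is how the shift direction would be chosen.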

When the distance $D_H(A,B)$ for case (a) of FIG. 3 and the distance $D_L(A,B)$ for case (b) are obtained in this way, $D_H > D_L$. As shown in FIG. 7, the selection of FIG. 3(a), in which the frequency axis of pattern A is shifted toward higher frequencies relative to pattern B, is therefore preferable.

In FIG. 8, A″ is the spectrum obtained by shifting the frequency axis of pattern A toward higher frequencies relative to pattern B along the route found above. A feature of this route selection scheme is that the route found contains no segment perpendicular to the horizontal axis (the frequency axis of pattern B), so two frequencies of pattern A never correspond simultaneously to a single frequency of pattern B. However, as seen for example at channels 2, 3, and 4 of pattern B, two routes may exist, and the channel of pattern A corresponding to a given channel of pattern B differs depending on which route is chosen. In such a case, C″ is the average spectrum obtained by averaging three values: the intensity of that channel of pattern B and the values of the pattern A channels that correspond to it along each of the two routes. Here too, the spectral features are well preserved.

Stated in general terms, the above is as follows.

Let the p feature spectra for phoneme X, obtained as above from many speakers and contexts, be $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$, where $\mathbf{x}_e = (x_{e,1}, x_{e,2}, \ldots, x_{e,n})$, and fix a reference vector $\mathbf{x}_r = (x_{r,1}, x_{r,2}, \ldots, x_{r,n})$. Next, for an arbitrary vector $\mathbf{x}_n = (x_{n,1}, \ldots, x_{n,j}, \ldots, x_{n,n})$ among $\mathbf{x}_1, \ldots, \mathbf{x}_p$ and the reference vector $\mathbf{x}_r$, construct the lattice graph described above and let $C(k) = (i(k), j(k))$ be the point at the intersection of $i(k)$ and $j(k)$. For the distance $d(C(k))$ between $x_{r,i(k)}$ and $x_{n,j(k)}$ and the weighting coefficient $w(k)$, determine the point sequence $C(1)\,C(2) \ldots C(k) \ldots C(K)$ so that the weighted average

$$\frac{\sum_{k=1}^{K} d(C(k))\,w(k)}{\sum_{k=1}^{K} w(k)}$$

is minimized, and obtain the vector $\mathbf{x}'_n$ in which the component $x_{n,i(k)}$ of $\mathbf{x}_n$ has been converted to $x_{n,j(k)}$. In this way the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_p$ are converted to $\mathbf{x}'_1, \ldots, \mathbf{x}'_p$, and the mean vector of $\mathbf{x}'_1, \ldots, \mathbf{x}'_p$ is taken as the standard pattern of phoneme X.
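For a given point sequence, the weighted average and the component conversion can be evaluated as below. This sketch only scores and applies a path that is already known and does not search for the minimizing one. The function names, the example paths, and the weights are assumptions, and when several points share the same i, the last one on the path determines the converted component.

```python
def weighted_average_cost(ref, vec, path, weights):
    """sum_k d(C(k)) w(k) / sum_k w(k), with d(C(k)) = |ref[i(k)] - vec[j(k)]|."""
    numerator = sum(w * abs(ref[i - 1] - vec[j - 1])
                    for (i, j), w in zip(path, weights))
    return numerator / sum(weights)

def convert(vec, path):
    """Replace component i(k) of vec by component j(k), following the path in order."""
    converted = list(vec)
    for i, j in path:
        converted[i - 1] = vec[j - 1]
    return converted

ref = [0, 0, 5, 0]
vec = [0, 5, 0, 0]
warped_path = [(1, 1), (2, 1), (3, 2), (4, 3), (4, 4)]
diagonal_path = [(1, 1), (2, 2), (3, 3), (4, 4)]

cost_warped = weighted_average_cost(ref, vec, warped_path, [1, 1, 2, 2, 1])
cost_diagonal = weighted_average_cost(ref, vec, diagonal_path, [1, 2, 2, 2])
converted = convert(vec, warped_path)
```

The warped path has the smaller weighted average, and converting `vec` along it moves the peak onto the reference channel.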

When this weighted average is minimized, it is usual, as shown in the embodiment, to provide a matching window or to restrict the route selection so that the vector $\mathbf{x}_n$ is not deformed unnaturally. Such restrictions are not limited to those shown in this embodiment, and various other methods may of course be used.

FIG. 9 shows the configuration of phoneme standard pattern creation unit 8 in a speech recognition device that creates the standard pattern of each phoneme by warping the frequency axis by the method above and then averaging.

Reference numeral 80 denotes a memory that stores the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ of phoneme X obtained by feature extraction unit 2 from many speakers with various preceding and following phonemes. Reference numeral 85 denotes a reference pattern memory that stores one vector $\mathbf{x}_i$ chosen from the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ held in memory 80 (where $1 \le i \le n$, preferably the vector giving the most standard spectral pattern). Reference numeral 81 denotes a to-be-warped pattern memory that temporarily holds, one at a time, the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ stored in memory 80. Reference numeral 82 denotes a frequency-axis warping unit that warps the frequency axis of the vector $\mathbf{x}_j$ held in to-be-warped pattern memory 81 against the reference vector $\mathbf{x}_i$ held in the reference pattern memory by the method above, and produces the frequency-warped vector $\tilde{\mathbf{x}}_j$. Reference numeral 83 denotes a memory that stores the outputs $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_n$ of frequency-axis warping unit 82. Reference numeral 84 denotes an averaging unit that averages the vectors $\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_n$ stored in memory 83 to obtain $\bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\tilde{\mathbf{x}}_j$.
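The flow through units 80 to 85 can be sketched end to end. This is an illustrative reading rather than the patented implementation: it assumes unit weights with steps from (i−1, j), (i−1, j−1), (i−1, j−2), a one-channel window on either side, and a warping unit that tries both shift directions and keeps the one with the smaller accumulated distance. All function names are hypothetical and the 5-channel vectors are toy data.

```python
def warp_one_sided(ref, vec, lo, hi):
    """DP over lattice points (i, j) with lo <= j - i <= hi; returns (cost, warped)."""
    n = len(ref)
    g = {(1, 1): abs(ref[0] - vec[0])}
    prev = {}
    for i in range(2, n + 1):
        for j in range(max(1, i + lo), min(n, i + hi) + 1):
            cost, best = min((g[(i - 1, j - s)], (i - 1, j - s))
                             for s in (0, 1, 2) if (i - 1, j - s) in g)
            g[(i, j)] = cost + abs(ref[i - 1] - vec[j - 1])
            prev[(i, j)] = best
    # Trace back from (n, n); each channel i of the reference gets one channel of vec.
    warped, point = list(vec), (n, n)
    while True:
        i, j = point
        warped[i - 1] = vec[j - 1]
        if point == (1, 1):
            break
        point = prev[point]
    return g[(n, n)], warped

def warp_to_reference(ref, vec):
    """Frequency-axis warping unit 82: keep the shift direction with the smaller cost."""
    return min(warp_one_sided(ref, vec, -1, 0), warp_one_sided(ref, vec, 0, 1))[1]

def make_standard_pattern(vectors, ref_index=0):
    ref = vectors[ref_index]                               # reference pattern memory 85
    warped = [warp_to_reference(ref, v) for v in vectors]  # unit 82 feeding memory 83
    n = len(warped)
    return [sum(column) / n for column in zip(*warped)]    # averaging unit 84

# Memory 80: the same single peak at channels 3, 2, and 4 for three speakers.
stored = [[0, 0, 6, 0, 0], [0, 6, 0, 0, 0], [0, 0, 0, 6, 0]]
standard = make_standard_pattern(stored)
```

A plain average of the three stored vectors would give [0, 2, 2, 2, 0], whereas warping each vector to the reference first keeps the peak at full height.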

The mean vector $\bar{\mathbf{x}}$ obtained by this device is stored, as the standard pattern of phoneme X, in phoneme standard pattern holding unit 3 via line 10.

Creating standard phoneme patterns by the method above yields ideal standard phoneme patterns.

During recognition, the standard phoneme patterns obtained in this way are used in phoneme recognition unit 4, which recognizes input phonemes as in Japanese Patent Application No. Sho 55-109145. The frequency axis of the input phoneme vector is warped against each standard phoneme pattern in the same manner as above, and the distance between the two is then computed. This method serves as an ideal speaker normalization method and markedly improves the accuracy of phoneme recognition.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition device. FIGS. 2 and 6 are explanatory diagrams showing the output intensity of a phoneme. FIG. 3 is a diagram for explaining the lattice graph. FIG. 4 is an explanatory diagram showing examples of route selection methods. FIGS. 5 and 7 are diagrams showing route selections obtained by concrete calculation. FIG. 8 is a spectrum diagram for the same. FIG. 9 is a block diagram showing the main part of an embodiment of the present invention. 2: feature extraction unit; 3: phoneme standard pattern holding unit; 80: memory; 81: to-be-warped pattern memory; 82: frequency-axis warping unit; 83: spectrum memory; 84: averaging unit; 85: reference pattern memory.

Claims

[Claims] 1. A speech recognition device including means for converting an input speech signal into a time series of n-dimensional feature vectors, and means for converting the time series of feature vectors into a phoneme sequence by comparing each vector of the time series with each of a set of phoneme standard patterns, the phoneme standard patterns being n-dimensional feature vectors prepared in advance for each phoneme to be identified, the device being characterized in that, in obtaining the standard pattern for a phoneme X, where the feature vectors collected for phoneme X from many speakers and contexts are $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ with $\mathbf{x}_e = (x_{e,1}, x_{e,2}, \ldots, x_{e,n})$, a reference vector $\mathbf{x}_r = (x_{r,1}, \ldots, x_{r,i}, \ldots, x_{r,n})$ is fixed, and an arbitrary vector among the feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_p$ is written $\mathbf{x}_n = (x_{n,1}, \ldots, x_{n,j}, \ldots, x_{n,n})$, the device has: means for determining a point sequence $c(1)\,c(2)\,c(3) \ldots c(k) \ldots c(K)$ of lattice points $c(k) = (i(k), j(k))$ on the i-j plane such that, for the distance $d(c(k))$ between $x_{r,i(k)}$ and $x_{n,j(k)}$ and the weighting coefficient $w(k)$, the quantity

$$\frac{\sum_{k=1}^{K} d(c(k))\,w(k)}{\sum_{k=1}^{K} w(k)}$$

is minimized; means for obtaining a converted vector $\mathbf{x}'_n$ by converting the component $x_{n,i(k)}$ of the vector $\mathbf{x}_n$ to $x_{n,j(k)}$ in accordance with that point sequence; and means for averaging $\mathbf{x}'_1, \mathbf{x}'_2, \ldots, \mathbf{x}'_r, \ldots, \mathbf{x}'_p$, the resulting mean vector being used as the standard pattern for the phoneme X.