JPS6290000A

JPS6290000A - Detection of formant frequency and voice signal recognition method and system utilizing it

Info

Publication number: JPS6290000A
Application number: JP26920485A
Authority: JP
Inventors: George R. Doddington; Yeunung Chen; R. Gary Leonard
Original assignee: Texas Instruments Inc
Current assignee: Texas Instruments Inc
Priority date: 1984-11-30
Filing date: 1985-11-29
Publication date: 1987-04-24
Anticipated expiration: 2010-10-09
Also published as: JPH0792680B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION <Field of Industrial Application> The present invention relates to speech recognition technology, and more particularly to the use of formant frequencies in speech recognition.

<Prior Art and Problems> Speech recognition technology has made steady progress, but adequate speech recognition by machine remains a goal far from realization. Many techniques have been developed to recognize certain characteristics of speech effectively, yet highly useful and accurate recognition, especially speaker-independent recognition, has hardly been achieved. The main reason is that the speech signal varies widely from speaker to speaker, even for the same word. Despite this variation, humans recognize speech very well, because human perception can pick out the modal frequencies of speech, which remain relatively constant under distortion, noise, and speaker differences. Current machine techniques, by contrast, invariably use spectral amplitude as the principal distance measure for recognition. Because distortion, noise, and speaker differences all affect spectral amplitude, such measurements severely limit recognition performance.

Because the human vocal tract behaves like a variable resonator, the speech signal contains modal frequencies, which speech scientists generally call formants or formant frequencies. It has long been known in speech science that formant frequencies need to be analyzed in order to make suitable measurements for speech recognition. According to the prevailing opinion among many scientists, however, determining formant frequencies with adequate reliability is so difficult that machine speech recognition based on them is considered impractical. Earlier attempts to use formant frequencies have accordingly often suffered from inaccurate recognition.

<Means for Solving the Problems> The present invention provides a method and apparatus for selecting, or determining, formant frequencies with very high reliability, so that formant frequencies can be used for accurate speech recognition.

According to the invention, the statistical principle of hypothesis testing is used to determine the formant frequencies used for speech recognition. Rather than trying to determine formant frequencies in an absolute sense, the invention ties their determination to the testing of recognition hypotheses.

According to the invention, a set of formant frequency candidates is determined independently of the recognition hypotheses, so the candidates include every possible choice of formant. For each frame of reference data, the formant frequency candidates that best match the formants of the reference frame are determined. Once suitable formant candidates have been determined, the recognition decision is made using standard decision-making techniques.

As another notable feature of the invention, all formant frequency candidates for each frame of speech data are determined using a linear predictive coding (LPC) polynomial. For each frame of reference data, the choice among the formant frequency candidates that best matches the reference formants is determined.

In a preferred embodiment, formant frequencies are expressed as mel frequencies and log bandwidths, log pitch frequency is included as a further feature, and these data are modeled as a multivariate Gaussian random variable. A covariance matrix estimated from a set of training tokens is used in computing the likelihood function. Every possible set of formant frequency candidates is evaluated exhaustively, and the set that maximizes the likelihood of the observed data is selected. After suitable formant frequency candidates have been selected, standard time alignment and decision-making techniques are used to make the specific recognition decision.

For example, a frame error is computed from features such as formant frequency and bandwidth, and dynamic programming is used to find the optimal time alignment, yielding a decision function for each recognition hypothesis.

<Embodiment> Figs. 1a to 1e show spectrograms of five different voices pronouncing the word "chaff" (to tease). As is well known, a spectrogram plots speech frequency against time, with the intensity of the plot representing the amplitude of the speech signal.

The maximum frequency shown in Figs. 1a to 1e is about 4 kHz, and the time scale shown is 1-1/2 seconds. Because the five spectrograms of Figs. 1a to 1e were produced from five different voices, the results clearly differ considerably. This is because human vocal tracts differ in shape and size, so different people produce output at different frequencies.

The first portion of each spectrogram, indicated by region 10, represents the pronunciation of the letters "ch". Because "ch" is an unvoiced sound generated by turbulence rather than by vocal cord action, the determination of formant frequencies extracted from it differs. Portion 12 of the spectrogram contains the voiced sound "a".

This portion of the word is voiced and is determined by the particular resonator formed by the shape of the speaker's mouth, so modal, or formant, frequencies are produced in each spectrogram. The last portion of the spectrogram, indicated by reference numeral 14, represents the pronunciation of the letters "ff". Because the speaker generally does not produce this sound by vocal cord action, no distinct formant frequencies are produced.

The formant frequencies shown in portion 12 of each of Figs. 1a to 1e carry valuable information that is very useful for speech recognition. Referring to Fig. 1, it can be seen that each spectrogram produces its own distinctive "signature", which is very effective for comparison with stored speech signals in order to indicate the occurrence of a particular speech sound.

To use the information contained in the formant frequencies, however, it is necessary to estimate very accurately which formant frequencies are present. This is a very difficult process, and if the wrong formant frequencies are picked for comparison, the result will be wrong.

Referring to Fig. 1a, for example, five possible formant frequencies can be identified in portion 12. The parts of the plot marked by arrows 16-24 are horizontal bands of frequency of the kind generally identified as modal or formant frequencies. It is very difficult to determine definitively which three of the five modal frequencies 16-24 are the prominent formant frequencies. Such a determination must be made for each of the spectrograms in Figs. 1a to 1e, and both the determination and its result differ from one spectrogram to the next.

An object of the invention is to make it possible to make the above determination with very high reliability.

Fig. 2 is a graph of the portion of the spectrogram shown in Fig. 1a, plotted as amplitude against frequency. Fig. 2 shows five distinct peaks, each corresponding to one of the possible formant frequencies shown in Fig. 1a. For example, peak 26 corresponds to region 16 of the plot, peak 28 to region 18, peak 30 to region 20, and so on. Fig. 2 illustrates the difficulty of determining the three lower-order formant frequencies used for speech recognition.

The invention does not try to determine the lower-order formants in an absolute sense; instead, formant frequencies are determined in conjunction with the testing of recognition hypotheses. First, a set of formant frequency candidates is determined independently of the recognition hypotheses. The candidates therefore include every possible choice of formant frequency; in the case shown in Fig. 2, they include all five formant frequency candidates 26-34. For each defined frame of reference data, the choice of formant frequency candidates that best matches the formants of the reference frame is determined. Once suitable formant frequency candidates have been determined for each frame, standard time alignment and decision-making techniques are applied to perform recognition.

Fig. 3 illustrates a system that uses formant frequencies for speech recognition according to the invention. A speech signal, denoted ST, is applied to a preprocessor 36. The speech signal is represented as a sequence of frames, each of a predetermined duration such as 20 or 30 milliseconds. The data in each frame is analyzed according to the invention and compared with reference data that has previously been enrolled and stored in the system in the conventional manner.

A more detailed description of conventionally enrolled and stored reference data is given in copending U.S. patent application Serial No. 461,884, filed January 28, 1983 and assigned to the present applicant, which is incorporated here by reference. For example, the system of the invention may be used to recognize a vocabulary of ten words. The ten words are spoken into the system, recorded, processed, and stored for comparison with speech signals processed according to the invention. In the preferred embodiment, preprocessor 36 uses the complex roots of the linear predictive coding (LPC) polynomial as formant frequency candidates.

All of the complex roots forming the set of formant frequency candidates for each frame are passed from preprocessor 36 to the optimal formant selector 38.

In the preferred embodiment, formants are expressed in terms of mel frequency and log bandwidth. Log pitch frequency is also used as one of the features by selector 38, which picks the best choice of formant frequency candidates.

The resulting data output from selector 38 is applied to comparator 40, which compares it with previously stored speech reference data.

Comparator 40 works with selector 38 in an iterative process to select, from among the candidates, the formants most similar to the stored reference formants. The output of comparator 40 is applied to a dynamic time warp operating system 42, which determines the minimum error between the reference vectors and the input signal vectors and thereby finds the path with the lowest cumulative error. The optimally aligned vector hypotheses are passed from dynamic time warp operating system 42 to high-level recognition logic 44, which uses conventional time alignment and decision-making techniques to decide whether a word can be declared recognized. Further description of system 42 and recognition logic 44 is given in the aforementioned copending U.S. patent application Serial No. 461,884.
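The cumulative-error search performed by the dynamic time warp stage can be sketched as follows. This is a minimal Python illustration of textbook dynamic time warping, not the system of the cited application; the function name `dtw_cost`, the one-dimensional toy features, the word labels, and the absolute-difference frame error are all assumptions made for the example.

```python
def dtw_cost(reference, observed, frame_error):
    """Lowest cumulative frame error over all monotonic alignments."""
    inf = float("inf")
    rows, cols = len(reference), len(observed)
    # cost[i][j]: best cumulative error aligning the first i reference
    # frames with the first j observed frames.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = frame_error(reference[i - 1], observed[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # reference advances
                                    cost[i][j - 1],      # observation advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]


def abs_error(a, b):
    """Toy frame error (assumption): absolute difference of scalar features."""
    return abs(a - b)


# Hypothetical one-dimensional feature tracks for two vocabulary words.
templates = {"word_a": [1.0, 2.0, 3.0], "word_b": [3.0, 3.0, 1.0]}
spoken = [1.0, 2.0, 2.0, 3.0]
scores = {word: dtw_cost(track, spoken, abs_error)
          for word, track in templates.items()}
best_word = min(scores, key=scores.get)
```

In a real system each frame would be a feature vector and the frame error a statistical distance; the word with the lowest cumulative error would then be handed to the decision logic.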

To describe preprocessor 36 in more detail, the linear predictive coding (LPC) polynomial processing is carried out in preprocessor 36. The LPC algorithm models human speech as the output of a time-varying, recursive, N-pole filter driven by a periodic or random input.
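In the usual notation (the symbols are assumptions, since the text gives none), an N-pole model of this kind has the transfer function below, where $G$ is a gain and $a_k$ are the predictor coefficients. The poles of the filter are the roots of the polynomial $A(z)$.

```latex
H(z) = \frac{G}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{N} a_k z^{-k}
```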

In the preferred embodiment, the polynomial is then factored using a modified Bairstow root-finding method. The method assumes that the polynomial represents a linear prediction model of speech and that most of its roots are complex conjugate roots lying near the unit circle. The resulting complex conjugate poles are then converted to frequencies and bandwidths, which represent the speech signal in the form of all possible formant frequencies for the particular frame. For the signal shown in Fig. 2, all five formant frequencies 26-34 are sent to the optimal formant selector 38.
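The conversion from poles to frequencies and bandwidths is not spelled out in the text. A minimal Python sketch, assuming the standard relations (frequency from the pole angle, bandwidth from the pole radius) and the 8000 samples-per-second rate mentioned elsewhere in the description, might look like this; the function names are hypothetical.

```python
import cmath
import math


def pole_to_formant(pole, sample_rate_hz):
    """Convert one z-plane pole to (centre frequency, bandwidth) in Hz."""
    frequency = cmath.phase(pole) * sample_rate_hz / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate_hz / math.pi
    return frequency, bandwidth


def formant_candidates(poles, sample_rate_hz):
    """Keep one pole per conjugate pair (the one with positive imaginary part)."""
    picked = [pole_to_formant(p, sample_rate_hz) for p in poles if p.imag > 0]
    return sorted(picked)


fs = 8000.0
# Hypothetical pole: radius 0.95 at the angle corresponding to 1000 Hz.
pole = 0.95 * cmath.exp(1j * 2.0 * math.pi * 1000.0 / fs)
candidates = formant_candidates([pole, pole.conjugate()], fs)
```

Poles closer to the unit circle give narrower bandwidths, which is why the assumption that most roots lie near the unit circle matters for formant-like resonances.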

The LPC operations performed by preprocessor 36 can be carried out on a suitably programmed digital processor, for example one similar to the system described in the aforementioned copending U.S. patent application Serial No. 461,884. A program that has been found to work well in practice for preprocessor 36 is specified by the following Fortran source.

      if (norder .le. 2) then        ! skip Bairstow procedure and wrap up
        F(1,nfactor) = aw(2)/aw(1)
        if (norder .eq. 2) then
          F(2,nfactor) = aw(3)/aw(1)
        else
          F(2,nfactor) = 0.0
        end if
        CONVERGED = .true.
        return
      else                           ! do Bairstow's procedure
        do 4 ntry=1,N                ! try to get convergence with various starting points
          call double (fstart(1,ntry), F(1,nfactor), 4)
          call BAIRSTOW (aw, norder, F(1,nfactor), CONVERGED)
          if (CONVERGED) go to 3
4       continue
        return
      end if
3     continue
      return

The LPC polynomial described above is factored using the modified Bairstow root-finding method. The factoring is performed by the following program, defined in Fortran source.
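For orientation, the textbook (unmodified) form of Bairstow's method can be sketched in Python as follows. This is not the patent's listing and does not reproduce its modifications; the function name, the starting values, the tolerance, and the test polynomial are assumptions chosen for illustration.

```python
def bairstow_factor(coeffs, r=-1.0, s=-1.0, tol=1e-10, max_iter=100):
    """Find a quadratic factor x**2 - r*x - s of a real polynomial.

    coeffs[i] is the coefficient of x**i; the degree must be at least 3.
    Returns (r, s, quotient), with quotient in the same ascending order.
    """
    n = len(coeffs) - 1
    if n < 3:
        raise ValueError("degree must be at least 3")
    for _ in range(max_iter):
        # Synthetic division by x**2 - r*x - s; b[1] and b[0] carry the remainder.
        b = [0.0] * (n + 1)
        b[n] = coeffs[n]
        b[n - 1] = coeffs[n - 1] + r * b[n]
        for i in range(n - 2, -1, -1):
            b[i] = coeffs[i] + r * b[i + 1] + s * b[i + 2]
        # A second division gives the partial derivatives of the remainder.
        c = [0.0] * (n + 1)
        c[n] = b[n]
        c[n - 1] = b[n - 1] + r * c[n]
        for i in range(n - 2, 0, -1):
            c[i] = b[i] + r * c[i + 1] + s * c[i + 2]
        # Newton step: solve c2*dr + c3*ds = -b1 and c1*dr + c2*ds = -b0.
        det = c[2] * c[2] - c[3] * c[1]
        if det == 0.0:
            raise ZeroDivisionError("singular Newton step; try another start")
        dr = (-b[1] * c[2] + b[0] * c[3]) / det
        ds = (-b[0] * c[2] + b[1] * c[1]) / det
        r += dr
        s += ds
        if abs(dr) < tol and abs(ds) < tol:
            return r, s, b[2:]
    raise RuntimeError("no convergence; try another starting point")


# Test polynomial with roots 0.5, -1, 2 and 1 +/- 0.5j (illustrative only).
poly = [1.25, -3.875, 2.125, 2.75, -3.5, 1.0]
r, s, quotient = bairstow_factor(poly)
```

Each call removes one quadratic factor, so a caller would repeat it on the returned quotient and, if the iteration fails, retry from a different starting point, much as the fragment above loops over several starting points.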

The optimal formant selector 38, in association with comparator 40, must select from the formant frequency candidates the formants that match, or are most similar to, the reference formants. This is an iterative process. Referring to Fig. 2, for example, the three lower-order formant frequency candidates 26, 28, 30 are considered first, then candidates 26, 28, 32, then candidates 26, 28, 34, and so on. The iteration is repeated until every admissible combination of three formant frequencies has been considered.
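The enumeration order described above is the ordinary lexicographic listing of three-element combinations. As a small Python illustration, using the reference numerals of Fig. 2 as stand-in labels (an assumption made purely for readability):

```python
from itertools import combinations

# Reference numerals of the five peaks in Fig. 2, used here as labels.
candidate_labels = [26, 28, 30, 32, 34]

# Every way of choosing three candidates, in increasing order.
triples = list(combinations(candidate_labels, 3))
```

With five candidates there are ten such triples, so an exhaustive search per frame stays cheap.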

For each candidate combination of three formant frequencies, the distance between the input vector and the reference vector is computed, and the smallest distance is determined. Optionally, a linear transformation of the Δ vector may be applied to yield the formant frequencies most similar to the reference formants. This candidate-by-candidate process is depicted in comparator 40 of Fig. 3 as a bank of distance computations |x − r_1| through |x − r_N|.
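A minimal sketch of this minimum-distance selection, assuming a plain Euclidean distance on frequencies in Hz, is shown below. The description elsewhere uses a covariance-weighted likelihood on mel and log features; the function name and the numeric values here are hypothetical.

```python
import math
from itertools import combinations


def best_triple(candidates_hz, reference_hz):
    """Return (distance, combination) for the combination closest to the reference."""
    size = len(reference_hz)
    return min((math.dist(combo, reference_hz), combo)
               for combo in combinations(sorted(candidates_hz), size))


# Five hypothetical candidate frequencies and three reference formants.
distance, chosen = best_triple([300.0, 900.0, 1500.0, 2400.0, 3300.0],
                               [320.0, 950.0, 2350.0])
```

Here the spurious middle candidate is skipped because including it would push one of the three formants far from its reference value.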

Comparator 40 compares the input data in each frame with the reference data. For each frame, comparator 40 generates N distance measures; in this embodiment, 50N distance measures are generated per second, with the original speech input signal sampled at 8000 samples per second. The output of dynamic time warp operating device 42 comprises N errors corresponding to the M words in the vocabulary.
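The figures quoted here imply a frame rate and frame length, which can be worked out as below. The sketch assumes one set of N distance measures per frame, which is how "50N per second" is read here; the constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000      # stated in the text
FRAMES_PER_SECOND = 50     # implied by "50N distance measures per second"

samples_per_frame = SAMPLE_RATE_HZ // FRAMES_PER_SECOND
frame_ms = 1000 / FRAMES_PER_SECOND


def distances_per_second(n_references):
    """Distance computations per second for N reference comparisons per frame."""
    return FRAMES_PER_SECOND * n_references
```

Under that reading each frame spans 20 ms, which agrees with the frame durations of 20 or 30 milliseconds mentioned for the input signal.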

In this embodiment, for each frame of reference data, the formant frequencies are expressed as mel frequencies and log bandwidths. Log pitch frequency is added as a further feature for making the decision.
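A feature vector of this kind could be assembled as in the following sketch. The text does not say which mel formula or which ordering of features was used, so the common `2595 * log10(1 + f/700)` mapping, the feature order, and the function names are assumptions, and the numeric values are made up.

```python
import math


def hz_to_mel(frequency_hz):
    # One common mel formula (assumption); the source does not specify one.
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def feature_vector(pitch_hz, formants):
    """Build [log pitch, mel(f1), log(bw1), mel(f2), log(bw2), ...].

    formants is a list of (frequency_hz, bandwidth_hz) pairs.
    """
    features = [math.log(pitch_hz)]
    for frequency_hz, bandwidth_hz in formants:
        features.append(hz_to_mel(frequency_hz))
        features.append(math.log(bandwidth_hz))
    return features


vector = feature_vector(120.0, [(300.0, 60.0), (900.0, 90.0), (2400.0, 150.0)])
```

With one pitch term plus a frequency and a bandwidth per formant, three formants give a seven-dimensional vector.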

These data are modeled as a multivariate Gaussian random variable. The covariance matrix estimated for each frame from the training tokens is then used to compute the likelihood function.
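For reference, the log-likelihood of a $d$-dimensional feature vector $x$ under a multivariate Gaussian with mean $\mu$ and covariance $\Sigma$ has the standard form below. The symbols are assumptions, since the text names none; taking $\mu$ as the reference frame's feature vector and $\Sigma$ as that frame's estimated covariance is one natural reading.

```latex
\log p(x) = -\tfrac{1}{2}\,(x-\mu)^{\mathsf T}\,\Sigma^{-1}\,(x-\mu)
            \;-\; \tfrac{1}{2}\,\log\lvert\Sigma\rvert
            \;-\; \tfrac{d}{2}\,\log 2\pi
```

Maximizing this likelihood over candidate combinations is equivalent to minimizing the covariance-weighted (Mahalanobis) distance when $\Sigma$ is fixed for the frame.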

The selection and comparison process according to the invention is carried out on a suitable digital processor by the program given below in Fortran source.

Subroutine FMT_MAP4 (PCljlN, MEL_IN, LBW_IN,

This subroutine includes the logarithm of the pitch period as one component of the feature vector used to determine the optimal formant mapping. It takes mel frequencies and bandwidths as input rather than second-order factors, and it returns the mapping as indices into the formant candidates rather than as an explicit set of ordered second-order factors. The distance is normalized specifically for each reference frame. An initialization entry point is provided so that the covariance matrix for each reference frame can be preloaded.

The subroutine can be made faster by restricting the frequency range that is searched. The maximum error between input and reference mel frequencies is bounded (thus roughly bounding the ratio of input to reference frequency), which sets limits on the admissible input formant candidates. The subroutine maps the reference formants onto the input formants. The appropriately mapped input data is output, and the optimal mapping error of the formant mapping is returned. The mapping is performed by examining every possible correspondence and selecting the one that maximizes the likelihood of the data. The likelihood function is computed using formant covariance matrices determined from formant-based vowel recognition.

This subroutine is shown below.


      subroutine FMT_COVINV (COV, NDIM, INVCOV)
      parameter (maxfmts=4, maxdim=1+2*maxfmts)
      integer*2 NDIM
      real*4 COV(NDIM*(NDIM+1)/2), INVCOV(maxdim,maxdim)
      real*4 eigval(maxdim), eigvec(maxdim,maxdim),
     &       work(maxdim*(maxdim+3)/2)
      call EIGRS (COV, NDIM, 2, eigval, eigvec, maxdim, work, ierror)
      if (ierror.ne.0 .or. work(1).gt.1.0)
     &   type *, 'error:', ierror, work(1)
      do i=1,NDIM
        if (eigval(i).le.0.0) STOP
     &    'covar not positive definite - FMT_COVINV (FMT_MAP4)'
        do j=1,NDIM
          cij = 0.0
          do k=1,NDIM
            cij = cij + eigvec(i,k)*eigvec(j,k)/eigval(k)
          end do
          INVCOV(i,j) = cij
        end do
      end do
      type 99, INVCOV
99    format (/(1x,<NDIM>f9.4))
      return
      end

When the invention is applied with reference to Fig. 2, for example, frequencies 26, 28, and 32 may be determined to be the correct three lower-order formant frequencies. These three lower-order formant frequencies are used by dynamic time warp operating system 42 and high-level recognition logic 44 and compared with the reference data, enabling very accurate speech recognition.
The two lower order formant frequencies can be used in a dynamic warb manipulation system 42 and a high quality level recognition logic system 44 and compared to reference data to provide highly accurate speech recognition.

The output of dynamic time warp operating system 42 comprises the optimally aligned vector hypotheses, which the high-level recognition logic uses to decide whether a hypothesis is good enough to declare that a particular word has been recognized.

Using the invention, and referring to Figs. 1 and 2, high-level recognition logic 44 determines that the word represented by the signal ST illustrated in Fig. 1a is "chaff" (to tease).

Prior art on LPC analysis in general, and on the various methods of extracting LPC parameters, is described in numerous references, including the well-known "Linear Prediction of Speech" by Markel and Gray (1976), "Digital Processing of Speech Signals" by Rabiner and Schafer (1978), and the references cited in them, all of which are incorporated here by reference. The invention can be implemented and operated on many commercially available digital processors, but the technique can clearly also be realized in hardware.

It will thus be apparent that the invention provides a technique for determining very accurately the formant frequencies used in highly reliable speech recognition. The technique generates all possible formant frequency candidates for each frame of the speech signal and selects the optimal formant frequencies for each frame according to a predetermined reference criterion. The invention requires many measurements to be made under many hypotheses, and therefore requires additional computation and techniques to be developed in advance. The number of computations required can, however, be reduced by using cluster analysis. As the number of frames used in operating the invention increases, the scope for reducing processing time increases.

Although embodiments of the invention have been described in detail, it will be apparent to those skilled in the art that various changes, modifications, and variations are possible without departing from the technical idea of the invention as set out in the claims.

The main embodiments of the invention are as follows.

(1) A method of detecting formant frequencies in a series of frames of a speech signal, comprising the steps of generating all possible formant frequency candidates for each frame of the speech signal, and selecting the optimal formant frequencies for each frame according to a predetermined reference criterion.

(2) A method of recognizing a speech signal organized into a series of frames using formant frequencies, comprising the steps of storing the formant frequencies of a plurality of reference frames; generating a plurality of formant frequency candidates for each frame of the speech signal to be recognized; selecting from the formant frequency candidates, according to a predetermined criterion, the optimal formant frequencies that best match the stored formant frequencies; and recognizing the speech signal in response to the selected optimal formant frequencies.

(3) The method of item 2, wherein all possible alternative candidates for the formant frequency candidates are generated.

(4) The method of item 2, wherein the selecting step uses linear predictive coding to generate the formant frequency candidates.

(5) The method of item 4, including the step of factoring the output of the linear predictive coding with the Bairstow algorithm.

(6) The method of item 2, wherein the selecting step uses the pitch frequency of the speech signal in accordance with the predetermined criterion.

(7) The method of item 7, wherein the selecting step includes generating a signal representing the formant frequency candidates as mel frequencies and log bandwidths, and modeling the representation signal as a multivariate Gaussian random variable.

(8) The method of item 7, including the step of computing, using a covariance matrix, a likelihood function indicating the best match.

(9) A system for detecting formant frequencies of a speech signal organized into a series of frames, comprising means for generating all possible formant frequency candidates for each frame of the speech signal, and means for selecting the optimal formant frequencies for each frame according to a predetermined reference criterion.

(10) The system of item 9, further comprising means for storing a plurality of reference frames of formant frequencies; selection means for selecting from the formant frequency candidates, for each frame, the optimal formant frequencies that best match the stored formant frequencies; and means for recognizing the speech signal in response to the selected optimal formant frequencies.

(11) The system of item 10, wherein the selection means generates the formant frequency candidates using linear predictive coding.

(12) The system of item 11, further comprising means for factoring the linear predictive coding output with the Bairstow algorithm.

(13) The system of item 10, wherein the selection means uses the pitch frequency of the speech signal in accordance with the predetermined criterion.

(14) The system of item 11, comprising means for generating a representation of the formant frequency candidates as mel frequencies and log bandwidths, and means for modeling the representation as a multivariate Gaussian random variable.

(15) The system of item 14, comprising means for computing, using a covariance matrix, a likelihood function indicating the best match.

[Brief Description of the Drawings]

Figs. 1a to 1e show spectrograms of the word "chaff" as spoken by five different speakers; Fig. 2 is a graph of amplitude against frequency for a portion of the spectrogram of Fig. 1a; and Fig. 3 is a block diagram of a preferred embodiment of an apparatus that performs speech recognition using the invention.

36 preprocessor; 38 optimal formant frequency selector; 40 comparator; 42 dynamic time warp operating device; 44 high-level recognition logic

Claims

[Claims]

(1) A method of detecting formant frequencies in a series of frames of a speech signal, comprising the steps of generating all possible formant frequency candidates for each frame of the speech signal, and selecting the optimal formant frequencies for each frame according to a predetermined reference criterion.