JPS6114520B2

Info

Publication number: JPS6114520B2
Application number: JP52043972A
Authority: JP
Inventors: Hiroya Fujisaki; Fujitoshi Takamura; Hidekazu Shiratori; Osamu Terao; Yasuo Sato
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1977-04-15
Filing date: 1977-04-15
Publication date: 1986-04-18
Also published as: JPS53128905A

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a speech recognition method, and in particular to a speech recognition method well suited to word speech recognition based on registration and matching.

In conventional word speech recognition based on registration and matching, the parameters needed for recognition, for example the first and second formant frequencies, were sampled at fixed time intervals, and recognition was performed using these parameters.

However, these speech parameters do not always change gradually over time; they may change abruptly.

Consequently, lowering the parameter sampling frequency lowers the recognition rate, while raising the sampling frequency to improve the recognition rate increases both the memory capacity needed to store the parameters and the processing time.

There is therefore a need for a method that represents the information in speech efficiently with fewer samples, that is, extracts the speech parameters with fewer samples, and that also raises the recognition rate.

An object of the present invention is to provide a novel speech recognition method that meets this need. The object is achieved by a speech recognition method in which a parameter is extracted for each phoneme from the result of frequency analysis of the speech signal, and unknown input speech is recognized using this parameter and pre-registered phoneme parameters, wherein a cumulative variation AV(t_n) of the input speech is computed and accumulated successively, and the parameter is extracted at the time the cumulative variation reaches or exceeds a predetermined threshold value TH. In other words, unlike the conventional approach, the present invention does not sample the parameters at fixed time intervals (uniform sampling). Instead it takes many samples where the parameters change abruptly and few samples where they vary little (non-uniform sampling). The total number of samples, and hence the average sampling frequency, is thereby kept small, which reduces memory capacity, speeds up processing, and improves the recognition rate.

The present invention is described in detail below with reference to the drawings.

FIG. 1 is a circuit block diagram for realizing speech recognition according to the present invention. In the figure, 1 is a preprocessing means, namely a band filter group, that decomposes the input speech into frequency signals on N channels, for example 15 channels; 2 is a parameter extraction unit that calculates feature parameters of the speech, for example M₁ and M₂, which correspond to the first and second formant frequencies; 3 is a sampling time decision circuit that decides the times at which the parameters of the input speech should be sampled; 4 is a non-uniform sampling circuit that samples the first and second formant frequencies non-uniformly at the timing given by the sampling time decision circuit and supplies them to the parameter time series registration unit in registration mode and to the matching unit in recognition mode; 5 is a parameter time series registration unit that, in registration mode, stores the non-uniformly sampled first and second formant frequencies in association with a word name; 6 is a parameter time series matching unit that compares the parameter time series of unknown input speech with the parameter time series of known speech already stored in the registration unit 5 and recognizes the unknown input speech; 7 is an output circuit; 8 is a switching circuit that routes the output signal of the sampling circuit to the parameter time series registration unit in registration mode and to the matching unit 6 in recognition mode; and 9 is a controller. Solid lines in the figure are signal lines, and dotted lines are control lines.

When speech is input, the band filter group 1 decomposes it into N-channel frequency signals P₁(t), P₂(t), ..., P_N(t), and these signals are supplied both to the sampling time decision circuit 3 and to the parameter extraction unit 2.

Each time a clock pulse of fixed period occurs, the parameter extraction unit 2 calculates the first and second formant frequencies using the following equations and stores the results in a register (not shown).

Here P_i(t_n) is the output of the i-th filter sampled at time t_n, for example every 10 msec, W_ij is its weight, and F_i is its center frequency. The weights W_ij are determined experimentally in advance so that the quantities M₁ and M₂ obtained from the filter outputs for synthesized sounds of known formant frequencies coincide with those formant frequencies.

Meanwhile, the sampling time decision circuit 3 computes, at the same period as the calculation of M₁ and M₂, the cumulative variation AV(t_n) defined by the following equation, and determines the non-uniform sampling times t_nk.

That is, the decision circuit 3 monitors whether the cumulative variation AV(t_n) has exceeded a predetermined threshold TH, takes the time t_nk at which the threshold is exceeded as the k-th non-uniform sampling time, and generates an output.

Here V(t_n) is the variation of the filter outputs and is defined by the following equation.

When the cumulative variation AV(t_n) exceeds the threshold at time t_nk, the decision circuit 3 issues a sampling command to the non-uniform sampling circuit 4. In response, the sampling circuit 4 samples the first and second formant frequencies M₁ and M₂ for time t_nk held in the register (not shown) of the parameter extraction unit 2, and either stores them in the parameter time series registration unit 5 (during registration) or supplies them to the parameter matching unit 6 (during recognition).

As soon as it has output the sampling command, the decision circuit 3 resets the cumulative variation AV(t_n) to zero and again begins accumulating the filter-output variation V(t_n) in accordance with (2) and (3).
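The accumulate, compare, sample, and reset cycle described above can be sketched as follows. This is a minimal illustration, not the circuit of FIG. 1: the function name and the example values are hypothetical, and the per-frame variation V(t_n) is taken as a given input because its defining equation is not reproduced in this text.

```python
from typing import Iterable, List


def nonuniform_sample_indices(variation: Iterable[float], threshold: float) -> List[int]:
    """Return the frame indices at which the accumulated variation exceeds the threshold.

    `variation` stands for the per-frame variation V(t_n); how V is computed
    from the filter outputs is outside this sketch.
    """
    indices: List[int] = []
    accumulated = 0.0  # plays the role of AV(t_n)
    for n, v in enumerate(variation):
        accumulated += v
        # Strict ">" follows the description ("exceeds"); the claims word the
        # condition as reaching or exceeding the threshold.
        if accumulated > threshold:
            indices.append(n)   # k-th non-uniform sampling time
            accumulated = 0.0   # reset immediately after the sampling command
    return indices


# Hypothetical variation sequence: steady, abrupt change, steady again.
v = [0.1, 0.1, 0.1, 0.9, 1.2, 1.1, 0.1, 0.1, 0.1, 0.1]
print(nonuniform_sample_indices(v, threshold=1.0))  # [3, 4, 5]
```

In this toy sequence all three samples fall in the rapidly changing stretch, and the steady stretches produce none until enough variation has accumulated.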

Thereafter the cumulative variation is monitored in the same way until the input speech ends, and each time the threshold is exceeded the first and second formant frequencies at that time are sampled and supplied to the registration unit 5 or the matching unit 6.

The description above covers the case in which the cumulative variation and the quantities M₁ and M₂, corresponding to the first and second formant frequencies, are computed in parallel. Alternatively, M₁ and M₂ may be computed only when the cumulative variation AV(t_n) exceeds the threshold TH.
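A sketch of this variant, in which the parameter calculation runs only at the sampling instants, is given below. The names `sample_lazily` and `toy_extract` are hypothetical, and `toy_extract` is a placeholder rather than the calculation of M₁ and M₂, whose equations are not reproduced in this text.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def sample_lazily(
    frames: Iterable[Sequence[float]],
    variation: Iterable[float],
    threshold: float,
    extract: Callable[[Sequence[float]], T],
) -> List[T]:
    """Call `extract` only on frames where the accumulated variation exceeds the threshold."""
    samples: List[T] = []
    accumulated = 0.0
    for frame, v in zip(frames, variation):
        accumulated += v
        if accumulated > threshold:
            samples.append(extract(frame))  # parameter calculation deferred to here
            accumulated = 0.0
    return samples


calls = 0


def toy_extract(frame: Sequence[float]):
    """Placeholder feature extractor that also counts how often it is invoked."""
    global calls
    calls += 1
    return (min(frame), max(frame))


frames = [[n, n + 1] for n in range(10)]          # hypothetical filter outputs
v = [0.1, 0.1, 0.1, 0.9, 1.2, 1.1, 0.1, 0.1, 0.1, 0.1]
print(sample_lazily(frames, v, 1.0, toy_extract))  # [(3, 4), (4, 5), (5, 6)]
print(calls)                                       # 3 extractions for 10 frames
```

The saving comes from skipping the extraction on every frame that is not sampled; the variation itself must still be computed for every frame.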

Doing so substantially improves throughput when, as described later, the computation of AV(t_n), M₁ and M₂, the matching, and so on are carried out by a computer.

In word recognition mode, the parameter matching unit 6 first stores the parameter time series sampled by the sampling circuit 4 in an internal register (not shown). It then compares this series, by a well-known method, with the parameter time series of each known word stored in the parameter time series registration unit 5, and outputs the most similar word to the output circuit 7 as the recognition result for the unknown input speech. The output circuit 7 then presents the recognized word on a display or through a speaker according to the recognition result from the matching unit.
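The text says only that the comparison uses a well-known method. As one possible illustration, the sketch below uses dynamic time warping (DTW) over (M₁, M₂) pairs and picks the registered word with the smallest distance. DTW is an assumption made here for illustration, not a method named in this document, and the function names, word labels, and numeric values are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Frame = Tuple[float, float]  # one sample: (M1, M2)


def dtw_distance(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Dynamic time warping distance between two parameter time series."""
    cost = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between frames
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def recognize(unknown: Sequence[Frame], templates: Dict[str, Sequence[Frame]]) -> str:
    """Return the registered word whose time series is closest to the unknown one."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))


# Hypothetical registered templates and an unknown utterance.
templates = {
    "word_a": [(300.0, 2200.0), (400.0, 2000.0)],
    "word_b": [(700.0, 1200.0), (650.0, 1100.0)],
}
unknown = [(310.0, 2190.0), (390.0, 2010.0), (400.0, 2000.0)]
print(recognize(unknown, templates))  # word_a
```

The nested loop visits every pair of samples from the two series, so with this particular choice of method the matching work grows with the product of the two series lengths.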

FIG. 2 shows another embodiment of the present invention, in which the speech recognition of FIG. 1 is carried out in software by a computer. In the figure, 201 is a processing unit (CPU), 202 is a program memory, 203 is a memory that stores computation results, 205 is an adapter, and 206 is the N-channel band filter group of FIG. 1.

FIG. 3 is a flowchart showing the speech recognition procedure according to the present invention, for the case in which the first and second formant frequencies M₁ and M₂ are calculated after AV(t_n) > TH is reached. When the sampling times t_nk are determined in this way and the first formant frequency is calculated at those times, the temporal transition of the first formant frequency of, for example, |ʃitʃi| (shichi) is as shown in FIG. 4. FIG. 4(a) shows the case of uniform sampling, and FIG. 4(b) shows the temporal transition of M₁ under the non-uniform sampling of the present invention at a sampling rate of 0.33, taking the uniform sampling rate as 1.

As these figures show, more samples are taken where the first formant frequency changes abruptly, and samples are taken more sparsely where it changes little.

FIG. 5 shows the recognition rates for uniform sampling and for the present invention. Taking the fixed-rate uniform sampling rate as 1, the recognition rate of the present invention is markedly higher than that of uniform sampling at relative sampling rates of 0.33 or less.

The above data were obtained as follows. One adult male uttered 30 words (digits and arithmetic symbols) seven times each, giving 210 words in total. These were frequency-analyzed with a 15-channel 1/3-octave filter bank (center frequencies 200 Hz to 5000 Hz), rectified and smoothed, then A/D-converted with a sampling period of 10 msec and a precision of 11 bits and fed to a computer, where M₁ and M₂, corresponding to the first and second formant frequencies, were obtained as feature parameters. Using these parameters, the relationship between sampling rate and recognition rate was determined by a recognition experiment on six words with comparatively similar features, |san|, |sain|, |yon|, |itʃi|, |ʃi|, and |ʃitʃi|, 42 words in total.
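The filter-bank figures quoted above are mutually consistent, which a short calculation can confirm. The sketch below assumes exact 1/3-octave spacing (each center frequency is 2^(1/3) times the previous one) starting at 200 Hz; the text gives only the channel count and the frequency range, so the spacing rule and the function name are assumptions made here.

```python
from typing import List


def third_octave_centers(f_low: float, channels: int) -> List[float]:
    """Center frequencies spaced by exactly one third of an octave, starting at f_low."""
    return [f_low * 2 ** (k / 3) for k in range(channels)]


centers = third_octave_centers(200.0, 15)
print(len(centers))             # 15
print(round(centers[-1]))       # 5080, close to the stated 5000 Hz upper center
```

Fifteen steps of one third of an octave from 200 Hz end near 5080 Hz, so a 15-channel bank spanning 200 Hz to 5000 Hz (as rounded nominal values) is what 1/3-octave spacing would give.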

The dashed line shows uniform sampling and the solid line shows non-uniform sampling. The difference between them tends to increase sharply once the relative sampling rate falls below about 0.3.

To confirm this, the recognition rates for uniform and non-uniform sampling were next determined using all 210 words of the speech material, at relative sampling rates of 0.33 and 0.17. As shown in FIG. 6, the same tendency was observed.

As shown in FIG. 7, the processing time required for word speech recognition was found to be proportional to the square of the relative sampling rate, and the storage capacity to be directly proportional to it.
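To make these proportionalities concrete, the sketch below evaluates them at the relative sampling rates used in the experiments. It assumes pure proportionality with no constant overhead, normalized so that uniform sampling (rate 1) costs 1; the function name is hypothetical.

```python
from typing import Tuple


def relative_cost(sampling_rate: float) -> Tuple[float, float]:
    """Return (processing time, storage), each relative to uniform sampling at rate 1."""
    return sampling_rate ** 2, sampling_rate


for r in (1.0, 0.33, 0.17):
    time_ratio, memory_ratio = relative_cost(r)
    print(f"rate {r:.2f}: time x{time_ratio:.3f}, memory x{memory_ratio:.2f}")
# rate 1.00: time x1.000, memory x1.00
# rate 0.33: time x0.109, memory x0.33
# rate 0.17: time x0.029, memory x0.17
```

Under this assumption, a relative sampling rate of 0.33 would cut storage to about a third and processing time to roughly a ninth of the uniform-sampling case.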

As described above, the present invention reduces the overall number of samples, so the memory capacity can be reduced and, at the same time, the matching time can be shortened.

In addition, because more samples are taken where the parameters change abruptly, the features of the speech are captured reliably, and the recognition rate can be raised.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows one embodiment of the present invention. FIG. 2 shows another embodiment of the present invention. FIG. 3 is a flowchart showing the non-uniform sampling procedure according to the present invention. FIG. 4 shows the temporal transition of the first formant frequency according to the present invention. FIGS. 5 and 6 compare the recognition rates of non-uniform sampling according to the present invention and conventional uniform sampling. FIG. 7 shows the relationship between memory capacity and relative sampling rate.

In the figures, 1 is a band filter group, 2 is a parameter extraction unit, 3 is a sampling time decision circuit, 4 is a non-uniform sampling circuit, 5 is a parameter time series registration unit, and 6 is a parameter time series matching unit.

Claims

1. A speech recognition method in which a parameter is extracted for each phoneme from the result of frequency analysis of a speech signal, and unknown input speech is recognized using this parameter and pre-registered phoneme parameters, characterized in that a cumulative variation AV(t_n) of the input speech is computed and accumulated successively, and the parameter is extracted at the time the cumulative variation reaches or exceeds a predetermined threshold value.

2. The speech recognition method according to claim 1, characterized in that the parameter obtained when the cumulative variation reaches or exceeds the predetermined threshold value is stored as a parameter of known input speech.

3. The speech recognition method according to claim 1, characterized in that the parameter obtained when the cumulative variation reaches or exceeds the predetermined threshold value is used as a parameter of the unknown input speech.