JPS58195893A

JPS58195893A - Pretreatment for voice recognition equipment

Info

Publication number: JPS58195893A
Application number: JP57078309A
Authority: JP
Inventors: 村田　隆憲
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1982-05-12
Filing date: 1982-05-12
Publication date: 1983-11-15
Also published as: JPS6258515B2

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION
The present invention relates to a preprocessing method for a speech recognition device, and in particular to a preprocessing method for analyzing and compressing a speech waveform.

A speech recognition device consists broadly of the following two parts.

(a) An analysis section that extracts the feature parameters of the speech.
(b) A matching section that computes the similarity (or dissimilarity) between the extracted feature parameter time series and pre-registered feature parameter time series, makes a decision on it, and obtains the recognition result.
In the analysis section (a), the speech waveform has conventionally been analyzed with a period (the frame period) of several msec to several tens of msec.

There are two reasons for this. First, in the stationary parts of speech (vowels and the like), a frame period of several tens of msec is sufficient. Second, as the frame period is shortened, the amount of data in the feature parameter time series increases, which enlarges the device and lengthens the processing time in the matching section (b), so reducing the frame period to a few msec or less is problematic in terms of cost.
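To make the data-volume argument concrete, the following minimal sketch counts how many feature values per second of speech the matching section has to handle for a given frame period. The channel count of 16 is an assumption for illustration only; the source simply calls it N.

```python
def feature_values_per_second(frame_period_ms: float, channels: int) -> float:
    """Feature values produced per second of speech: one N-dimensional vector per frame."""
    frames_per_second = 1000.0 / frame_period_ms
    return frames_per_second * channels


# Hypothetical channel count; the source does not state a value for N.
N_CHANNELS = 16

for period_ms in (1, 8, 20):
    rate = feature_values_per_second(period_ms, N_CHANNELS)
    print(f"frame period {period_ms:>2} msec: {rate:.0f} values per second")
```

Under these assumptions, going from an 8 msec frame period to a 1 msec frame period multiplies the stored and matched data by eight.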

However, when analysis is performed with a frame period of several msec to several tens of msec, as in the conventional approach, the frame period is not short enough to follow the changes in the speech in rapidly changing parts such as consonants. The feature parameters obtained therefore differ depending on the point in the speech at which analysis starts, so the feature parameters are unstable.

Consider, for example, recognizing utterances whose feature parameter time series closely resemble each other, such as the monosyllables "ha" and "ka", or the words "hashi" (ハシ) and "kashi" (カシ). The initial consonants /h/ and /k/ have similar places of articulation, and everything after the initial consonant is the same sound. Because the feature parameter time series of the consonant part is unstable, as described above, it is particularly difficult to distinguish consonants with similar places of articulation, and this lowers the reliability of the speech recognition device.

Stable feature parameters that are adequate for recognition, even in rapidly changing parts such as consonants, could be obtained by analyzing with a frame period short enough to follow the changes in the speech. As noted above, however, this enlarges the device and lengthens the processing time.

In view of the above, an object of the present invention is to provide a preprocessing method that contributes to improving the reliability of a speech recognition device without enlarging the device or lengthening the processing time.

To achieve this object, the present invention is characterized in that, when the input speech is analyzed, an analysis reference point is detected and the analysis is performed starting from that analysis reference point, so that stable feature parameters are obtained even in rapidly changing parts such as consonants.

The present invention is described in detail below with reference to Fig. 1.

Fig. 1 (1) shows an example of a speech signal. In the consonant part and in the transition from the consonant to the stationary vowel part, the feature parameters obtained clearly differ depending on the position of the window relative to the speech waveform. If a registered pattern whose feature parameter time series was analyzed at the window positions of Fig. 1 (2) is matched against an input pattern whose feature parameter time series was analyzed at the window positions of Fig. 1 (3), the similarity is low and the input is easily misrecognized as a different utterance.
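The dependence on window position can be illustrated with a toy example. The sketch below is not taken from the patent: it uses a made-up one-channel signal, an assumed rectangular averaging window, and the 16-sample window and 8-sample period that the description introduces later, and compares two window grids offset by 4 samples.

```python
def frame_features(signal, start, window_len=16, period=8):
    """Rectangular-window averages taken every `period` samples, beginning at `start`."""
    feats = []
    t = start
    while t + window_len <= len(signal):
        feats.append(sum(signal[t:t + window_len]) / window_len)
        t += period
    return feats


# Toy "consonant onset": silence, a short burst, a dip, then a steady vowel (values are made up).
signal = [0.0] * 20 + [8.0] * 6 + [2.0] * 6 + [5.0] * 40

registered = frame_features(signal, start=0)  # window grid used at registration
shifted = frame_features(signal, start=4)     # same utterance, grid 4 samples later

for u, (r, s) in enumerate(zip(registered, shifted)):
    print(u, r, s)
```

The frames that fall in the steady vowel agree on both grids, while the frames that straddle the onset do not, which is the instability the text describes.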

For convenience, the following explanation takes as an example the case where the filter bank output values of the speech are sampled every 1 msec, and the feature parameters are the smoothed frequency components obtained by applying, to each channel, a window of length W_L = 16 msec with a frame period (window period) of 8 msec.

The filter bank output values of the channels at time t are written as the vector J_t, and the sum of the filter bank output values over the channels, that is, the power, is written as p_t.

$$J_t = (j_{t1}, j_{t2}, \ldots, j_{tN}) \tag{1}$$
$$p_t = \sum_{n=1}^{N} j_{tn} \tag{2}$$
$$t = 0, 1, 2, \ldots, T \tag{3}$$
$$n = 1, 2, \ldots, N \tag{4}$$
Here j_{tn} denotes the filter bank output value of the n-th channel at time t.
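As a minimal sketch of the power computation in equation (2), the snippet below sums the channel outputs of each frame. The three-channel values are invented for illustration.

```python
def frame_power(J_t):
    """Power p_t: the sum of the filter bank outputs j_t1 ... j_tN at one time t."""
    return sum(J_t)


# Made-up filter bank outputs: three frames (t = 0, 1, 2) of N = 3 channels each.
J = [
    [1, 2, 3],
    [0, 0, 1],
    [4, 4, 4],
]

# One power value per frame.
P = [frame_power(J_t) for J_t in J]
print(P)
```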

The time series of the filter bank output values J_t is written as J, and the time series of the power p_t is written as P.

$$J = (J_0, J_1, \ldots, J_T) \tag{5}$$
$$P = (p_0, p_1, \ldots, p_T) \tag{6}$$
The smoothed power obtained by applying a window of length 16 msec to the power time series P is written as p'_t.

$$p'_t = \sum_{i=0}^{15} p_{t+i} \cdot w_i \tag{7}$$
Here w_0, w_1, ..., w_15 are the window coefficients.
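The smoothed power of equation (7) is a weighted sum over 16 consecutive power samples. The sketch below assumes a rectangular window normalized to unit sum, because the source does not give the actual coefficients, and uses an invented power series.

```python
def smoothed_power(P, t, w):
    """p'_t = sum over i of p_(t+i) * w_i, following equation (7)."""
    if t < 0 or t + len(w) > len(P):
        raise IndexError("window extends outside the power series")
    return sum(P[t + i] * w[i] for i in range(len(w)))


# Assumed window: 16 equal coefficients summing to 1 (not specified in the source).
w = [1.0 / 16] * 16

# Made-up power series: 10 msec of silence followed by 30 msec at a constant level.
P = [0.0] * 10 + [16.0] * 30

print(smoothed_power(P, 0, w))   # window covers silence and the start of the sound
print(smoothed_power(P, 10, w))  # window lies entirely inside the sound
```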

The time c at which the power p_t exceeds the threshold TH_c is taken as the voice detection point.

$$p_c \ge TH_c \tag{8}$$
Starting from t = c, the window position is moved backward in time as shown in Fig. 1 (4), the smoothed powers p'_c, p'_{c-1}, ... are computed in turn according to equation (7), and the time b at which the smoothed power falls to or below the threshold TH_b is determined.

$$p'_b \le TH_b \tag{9}$$
Fig. 1 (5) shows this process.
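The two detection steps can be sketched as follows: a forward scan for the voice detection point c (equation (8)), then a backward walk from c until the smoothed power satisfies equation (9). The thresholds, the rectangular window, and the power series are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def detect_voice(P, th_c):
    """Return the first time c with p_c >= TH_c, or None if the threshold is never reached."""
    for t, p in enumerate(P):
        if p >= th_c:
            return t
    return None


def find_reference_point(P, c, w, th_b):
    """Walk backward from c and return the first b with smoothed power p'_b <= TH_b."""
    for b in range(c, -1, -1):
        if b + len(w) > len(P):
            continue  # window would run past the end of the stored power series
        p_smooth = sum(P[b + i] * w[i] for i in range(len(w)))
        if p_smooth <= th_b:
            return b
    return None


# Assumed rectangular window and made-up power series: silence, a ramp, then steady sound.
w = [1.0 / 16] * 16
P = [0.0] * 30 + [2.0, 4.0, 8.0, 16.0] + [20.0] * 30

c = detect_voice(P, th_c=10.0)                   # hypothetical threshold TH_c
b = find_reference_point(P, c, w, th_b=1.0)      # hypothetical threshold TH_b
print("c =", c, "b =", b)
```

In this toy series the reference point b lies before c, at the last position where the 16-sample window still sits almost entirely in silence.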

Taking time b as the analysis reference point, the feature parameter time series K is obtained according to equation (10).

$$k_{un} = \sum_{i=0}^{15} j_{(t+i)n} \cdot w_i \tag{10}$$
$$u = 0, 1, 2, \ldots, U \tag{11}$$
$$t = a + 8u \tag{12}$$
$$a = b \bmod 8 \tag{13}$$
$$k_u = (k_{u1}, k_{u2}, \ldots, k_{uN}) \tag{14}$$
$$K = (k_0, k_1, \ldots, k_U) \tag{15}$$
Here k_{un} is the feature parameter of the n-th channel at frame u.
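Equations (10) to (15) can be sketched as a short routine that places the window grid so that one frame starts exactly at the reference point b. The rectangular window, the rule for choosing U (as many frames as fit in the stored data), and the sample values are assumptions for illustration.

```python
def extract_features(J, b, w, period=8):
    """Feature time series K: frames start at t = (b mod period) + period * u."""
    n_channels = len(J[0])
    a = b % period                      # equation (13): grid offset tied to b
    K = []
    t = a
    while t + len(w) <= len(J):         # assumption: U is set by the available data
        k_u = [
            sum(J[t + i][n] * w[i] for i in range(len(w)))  # equation (10)
            for n in range(n_channels)
        ]
        K.append(k_u)
        t += period                     # equation (12)
    return K


# Assumed rectangular window and made-up two-channel data: a ramp and a constant.
w = [1.0 / 16] * 16
J = [[float(t), 1.0] for t in range(40)]

K = extract_features(J, b=19, w=w)
for u, k_u in enumerate(K):
    print(u, k_u)
```

With b = 19 the frames start at t = 3, 11 and 19, so the third frame begins exactly at the reference point.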

In the feature parameter time series K obtained in this way, the condition p'_b ≤ TH_b of equation (9) fixes the window position, so the series is stable even in rapidly changing consonant parts.

Fig. 1 (6) shows the window positions when time b is used as the reference.
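The stabilizing effect can be checked on toy data. The sketch below is an illustration under assumptions (rectangular window, invented thresholds and signal, hypothetical function names): it processes the same utterance twice, once delayed by 3 msec, and compares features taken on a grid anchored at b with features taken on a fixed grid starting at t = 0. It assumes the thresholds are actually reached; otherwise `next` raises `StopIteration`.

```python
WINDOW = [1.0 / 16] * 16  # assumed rectangular window; coefficients are not given in the source
PERIOD = 8


def windowed(seq, t):
    """Weighted sum of 16 consecutive values of `seq` starting at t."""
    return sum(seq[t + i] * WINDOW[i] for i in range(len(WINDOW)))


def anchored_features(J, th_c, th_b):
    """Return (b, K) with frames on the grid t = (b mod 8) + 8u."""
    P = [sum(J_t) for J_t in J]
    c = next(t for t, p in enumerate(P) if p >= th_c)
    b = next(t for t in range(c, -1, -1)
             if t + len(WINDOW) <= len(P) and windowed(P, t) <= th_b)
    channels = list(zip(*J))  # one sequence per channel
    starts = range(b % PERIOD, len(J) - len(WINDOW) + 1, PERIOD)
    return b, [[windowed(ch, t) for ch in channels] for t in starts]


def fixed_grid_features(J):
    """Conventional analysis: frames always start at t = 0, 8, 16, ..."""
    channels = list(zip(*J))
    starts = range(0, len(J) - len(WINDOW) + 1, PERIOD)
    return [[windowed(ch, t) for ch in channels] for t in starts]


# Made-up two-channel utterance and the same utterance delayed by 3 msec of silence.
utterance = ([[0.0, 0.0]] * 30
             + [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [8.0, 8.0]]
             + [[12.0, 8.0]] * 30)
delayed = [[0.0, 0.0]] * 3 + utterance

b1, K1 = anchored_features(utterance, th_c=10.0, th_b=1.0)
b2, K2 = anchored_features(delayed, th_c=10.0, th_b=1.0)

# Compare frames from the one that starts at b onward.
print(K1[b1 // PERIOD:] == K2[b2 // PERIOD:])
print(fixed_grid_features(utterance) == fixed_grid_features(delayed))
```

On this data the anchored frames from b onward are identical for both inputs, whereas the fixed-grid features differ around the onset.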

The present invention is described in detail below with reference to an embodiment.

Fig. 3 is a block diagram showing one embodiment of a circuit that implements the preprocessing method according to the present invention. Fig. 2 is a block diagram showing one embodiment of a monosyllable speech recognition system that includes the circuit of Fig. 3; parts that appear in Fig. 3 are given the same reference numbers.

Speech is converted into an electrical signal by the microphone 1, amplified by the preamplifier 2, and high-frequency emphasized by the pre-emphasis circuit 3.

The signal is then decomposed into N channels by the filter bank 4. The frequency components are selected in turn by the analog multiplexer 5 and converted into digital signals by the A/D converter 6, giving the filter bank output values J_t.

The filter bank output values J_t, obtained every 1 msec, are sent to the input buffer 7. At the same time, the adder-accumulator 8 executes the calculation of equation (2), and the power p_t is sent to the power buffer 9.

At the same time, the power p_t is sent to the voice detection unit 11 and compared with the threshold TH_c, and the time t = c at which p_c ≥ TH_c is detected.

The voice detection unit 11 consists of a register that stores the threshold TH_c and a comparator. The comparator compares each power value p_t sent from the adder-accumulator 8 with the threshold TH_c stored in the register, and the time t = c is detected.

Once the filter bank output time series J covering a fixed interval before and after time t = c (t = 0 to T) has been stored in the input buffer 7, the analysis reference point detection unit 10 executes the calculation of equation (7). The resulting smoothed power p'_t is compared with the threshold TH_b, and the time t = b at which p'_b ≤ TH_b is detected.

The analysis reference point detection unit 10 is configured as shown in Fig. 3.

The window coefficient memory 101 stores the window coefficients w_0 to w_15. The multiplier-adder 102 combines the power p_t sent from the power buffer 9 with the window coefficients w_0 to w_15 according to equation (7). The comparator 104 compares the resulting smoothed power p'_t with the threshold TH_b stored in the threshold register 103, and the time t = b at which p'_t ≤ TH_b is detected.

With the analysis reference point t = b as the reference, the smoothing unit 12 executes the calculation of equation (10), and the feature parameter time series K is stored in the feature parameter buffer 13.

The smoothing unit 12 is configured as shown in Fig. 3. The window coefficient memory 121 stores the window coefficients w_0 to w_15. The multiplier-adder 122 combines the filter bank output values j_{tn} sent from the input buffer 7 with the window coefficients w_0 to w_15 according to equation (10), and the result is sent to the feature parameter buffer 13.

The feature parameter time series K stored in the feature parameter buffer 13 is normalized by the normalization unit 14 and sent to the output buffer 15.

The normalized data stored in the output buffer 15 is input to the recognition unit 17. The normalized registered speech patterns are input in turn from the registered pattern memory 16 to the recognition unit 17, where the similarity is computed and recognition is performed, and the recognition result is output to the terminal 19.

The control unit 18 controls each of the units described above.

In Fig. 3, the window coefficient memories 101 and 121 and the multiplier-adders 102 and 122 are provided separately, but they may be shared by using them on a time-division basis.

The embodiment above was described using a monosyllable speech recognition device as an example, but the invention is not limited to this. It can clearly be applied to speech recognition devices in general, such as word speech recognition devices.

As described above, the feature parameter time series obtained by the preprocessing method of the present invention is stable even in rapidly changing consonant parts. Compared with shortening the analysis frame period, the method improves the reliability of recognition without enlarging the device or lengthening the processing time.

This effect is also borne out by the improved recognition rate of a monosyllable speech recognition device that incorporates the preprocessing method.

[Brief description of the drawings]

Fig. 1 is a diagram explaining the concept of one embodiment of the present invention. Fig. 2 is a block diagram showing an example configuration of a monosyllable recognition device that uses the preprocessing method of the present invention. Fig. 3 is a block diagram showing one embodiment of a circuit that implements the preprocessing method of the present invention.
1: microphone, 2: preamplifier, 3: pre-emphasis circuit, 4: filter bank, 5: analog multiplexer, 6: A/D converter, 7: input buffer, 8: adder-accumulator, 9: power buffer, 10: analysis reference point detection unit, 11: voice detection unit, 12: smoothing unit, 13: feature parameter buffer, 14: normalization unit, 15: output buffer, 16: registered pattern memory, 17: recognition unit, 18: control unit, 19: output terminal.

Claims

[Claims]

In a speech recognition device that recognizes input speech by comparing an n-dimensional feature parameter time series obtained by analyzing the input speech with pre-registered n-dimensional feature parameter time series, a preprocessing method characterized in that: the input speech is decomposed into a plurality of frequency components by a filter bank; the time at which the sum of the frequency components, that is, the power, exceeds a threshold is taken as the voice detection point; windows are then applied to the power going backward in time, and the time at which the resulting smoothed power falls to or below a threshold is taken as the analysis reference point; and a window is applied to each frequency component starting from that analysis reference point to obtain the n-dimensional feature parameter time series.