JP2000047686A

JP2000047686A - Voice recognizing device and method

Info

Publication number: JP2000047686A
Application number: JP10216623A
Authority: JP
Inventors: Tetsuo Kosaka; 哲夫小坂
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1998-07-31
Filing date: 1998-07-31
Publication date: 2000-02-18

Abstract

PROBLEM TO BE SOLVED: To perform recognition and adaptation almost simultaneously and in parallel, and to prevent degradation of recognition performance immediately after the start of recognition. SOLUTION: Within the speech sections of the input speech (S101, S102), a speech model (e.g., a hidden Markov model (HMM)) is adapted for each frame of the input speech, in synchronization with recognition of that speech (S103-S105). To adapt the speech model, an input speech feature is first generated for each speech frame, a frame being a prescribed unit of time. The difference between the input speech feature of each frame and the model feature selected from the speech model for that frame during recognition is then averaged over the period up to the current frame, and the speech model is adapted in synchronization with the speech input on the basis of the averaged result.

Description

[Detailed Description of the Invention]

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus and method that perform speech recognition using a hidden Markov model (HMM).

[0002]

[Description of the Related Art] When speech recognition is performed in a real environment, two problems are particularly significant: distortion of the line characteristics caused by microphones, telephone line characteristics and the like, and additive noise such as internal noise. The Signal Bias Removal (SBR) method has been proposed to deal with the distortion of line characteristics. The SBR method is described in detail in, for example, Rahim, et al.: "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition", IEEE Trans. on Speech and Audio Processing, Vol. 4, No. 1 (1996.1).

[0003] The SBR method is one technique for compensating for distortion of the line characteristics. It corrects the line distortion on the basis of information in the input speech and adapts to the input environment, so it can cope flexibly even when the line characteristics vary. In the method described so far, processing is performed after the adaptation speech has been uttered; the same paper also proposes a sequential SBR (SSBR) method that performs the processing in synchronization with the input. This allows recognition and adaptation to be performed almost simultaneously in parallel.

[0004]

[Problems to be Solved by the Invention] As described above, the SBR method makes it possible to cope with distortion of the line characteristics caused by microphones, telephone line characteristics and the like, and the SSBR method allows adaptation and recognition to be performed in synchronization with the speech input. However, because the SSBR method does not take into account the amount of data available for adaptation, adaptation immediately after the start of recognition is unstable, and recognition performance may deteriorate.

[0005] The present invention has been made in view of the above problem. Its object is to provide a speech recognition method and apparatus that perform recognition and adaptation almost simultaneously in parallel and that prevent degradation of recognition performance immediately after the start of recognition.

[0006]

[Means for Solving the Problems] To achieve the above object, a speech recognition apparatus according to one aspect of the present invention has, for example, the following configuration. That is, it comprises: generating means for generating, on the basis of input speech, an input speech feature for each speech frame, a frame being a predetermined unit of time; calculating means for averaging, over the period up to the current speech frame, the difference between the input speech feature of each speech frame and the model feature selected from a speech model for that frame during recognition of the input speech; and adaptation means for adapting the speech model in synchronization with the speech input on the basis of the averaging result obtained by the calculating means.

[0007] A speech recognition method according to another aspect of the present invention for achieving the above object comprises, for example, the following steps: a generating step of generating, on the basis of input speech, an input speech feature for each speech frame, a frame being a predetermined unit of time; a calculating step of averaging, over the period up to the current speech frame, the difference between the input speech feature of each speech frame and the model feature selected from a speech model for that frame during recognition of the input speech; and an adaptation step of adapting the speech model in synchronization with the speech input on the basis of the averaging result obtained in the calculating step.

[0008] The present invention further provides a storage medium storing a control program that causes a computer to implement the above speech recognition method.

[0009]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0010] FIG. 1 is a block diagram of the speech recognition apparatus according to this embodiment. In FIG. 1, reference numeral 100 denotes a microphone for inputting speech. 101 denotes an A/D converter that converts the captured speech into a digital signal. 102 denotes an interface for passing recognition results to other devices; it can communicate via RS232C or the like. 103 denotes a display for showing results. 104 denotes a CPU, which reads a control program stored in the external storage device 107 or the ROM 105 into the RAM 106 and performs the recognition processing described later on the basis of that program.

[0011] The ROM 105 stores the various control programs for the processing executed by the CPU 104, as well as data such as the phoneme models. The RAM 106 provides a work area for the CPU 104 when it executes processing based on the control programs. The external storage device 107 is a hard disk, floppy disk, or the like, and can likewise store control programs for the processing performed by the CPU 104.

[0012] The procedure of the speech recognition processing in this embodiment is described next with reference to FIG. 2, which is a flowchart of that processing.

[0013] First, in step S101, speech captured by the microphone 100 is converted into a digital signal by the A/D converter 101. The subsequent processing of steps S102 to S107 is carried out by the CPU 104, using the RAM 106 as a work area, under a control program read from the ROM 105 or the external storage device 107.

[0014] In step S102, speech sections are detected from the speech waveform using power information and the like. The subsequent processing is applied only to the speech sections detected in step S102. In step S103, the digitized speech signal is converted into a short-time spectrum. In step S104, cepstrum analysis is performed to obtain the cepstrum time series of the input speech signal.
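The power-based detection of step S102 might be sketched as follows. This is a minimal illustration rather than the procedure of the patent, which says only that power information is used: the frame length, the threshold value, and the names `frame_power` and `detect_speech_frames` are all assumptions.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in samples) / len(samples)


def detect_speech_frames(signal, frame_len=80, threshold=0.01):
    """Return the indices of frames whose power reaches the threshold.

    frame_len=80 corresponds to 10 ms at an assumed 8 kHz sampling rate;
    both frame_len and threshold are illustrative values, not taken from
    the patent.
    """
    speech = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        if frame_power(frame) >= threshold:
            speech.append(i)
    return speech
```

Only the frames returned by such a detector would then be passed on to spectral analysis and adaptation.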

[0015] In step S105, the input parameters obtained above are used to adapt the parameters of the phoneme models, which are expressed as HMMs (hidden Markov models). Continuous-density HMMs are assumed here, and the description takes as an example the case where a multipass search is used as the recognition search. The phoneme models are stored in the ROM 105 or the external storage device 107 and are read into the RAM 106 when the speech recognition apparatus starts up.

[0016] The parameters of the input speech form a time series and are generally obtained at intervals of 5 to 10 ms (one such interval is called one time step or one frame). The adaptation processing of equation (1) below is performed for each frame.

[0017]

[Equation 1]

[0018] Here, x_m on the right-hand side is the mean of the m-th HMM state used for recognition, and x^_m on the left-hand side is the mean after adaptation. n is the number of samples used for adaptation (the number of frames up to that point). y_k is the input vector of the k-th frame, and τ is a constant. z_k is the mean of the HMM state, or of the distribution, that shows the maximum likelihood during recognition at the k-th frame.

[0019] For example, if one frame (one time step) is 10 ms, then 0.5 seconds after speech input begins, n in equation (1) is 50 (0.5 s = 500 ms = 50 × 10 ms). The average of the differences between the input vectors y_1 to y_50 and the HMM states z_1 to z_50 closest to each of them is computed and multiplied by 50/(50 + τ), and the result is subtracted from the mean x_m of the m-th state to obtain the adapted mean x^_m.
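The image for equation (1) is not reproduced in this text. Read together, paragraphs [0018] and [0019] suggest a form like the one sketched here. This is a reconstruction, not the published equation: the symbol d_k is introduced for illustration, and the order of the difference inside it is an assumption, chosen so that subtracting the correction moves the model mean toward the input. The opposite order would flip the sign of the correction.

```latex
\hat{x}_m \;=\; x_m \;-\; \frac{n}{n+\tau}\cdot\frac{1}{n}\sum_{k=1}^{n} d_k ,
\qquad d_k = z_k - y_k \quad \text{(assumed order)}
```

With n = 50, as in the example of paragraph [0019], the factor in front of the averaged difference is 50/(50 + τ).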

[0020] The adaptation described above is performed on all phoneme models held by the recognition system. When there are many kinds of phoneme model, however, adapting all of them takes time. To shorten the computation time, adaptation may instead be applied only to the phoneme models needed during the search, rather than to all of them. In that case, the following method can be used, for example.

[0021] For example, with a grammar that recognizes only the two words "Tokyo" and "Osaka", there is no need to adapt any phonemes other than those required for these two words. When context-dependent phoneme models are used in particular, the number of phoneme models that need no adaptation increases. During recognition, the required phonemes are selected along the grammar network; the adaptation described above need only be applied to a model at the time it is selected.
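Restricting adaptation to the phonemes a grammar actually needs might look like the sketch below. The lexicon, its phoneme strings, the scalar model means, and the function names are all hypothetical and serve only to illustrate the idea of the "Tokyo"/"Osaka" example.

```python
# Hypothetical pronunciation lexicon; the phoneme symbols are illustrative.
LEXICON = {
    "tokyo": ["t", "o", "k", "y", "o"],
    "osaka": ["o", "s", "a", "k", "a"],
}


def phonemes_needed(vocabulary, lexicon=LEXICON):
    """Collect the set of phonemes required by the words in the grammar."""
    needed = set()
    for word in vocabulary:
        needed.update(lexicon[word])
    return needed


def adapt_selected(means, needed, correction):
    """Subtract `correction` only from the models in `needed`.

    means: dict mapping phoneme name to a scalar mean (scalar for brevity).
    Models outside `needed` are returned unchanged.
    """
    return {p: (m - correction if p in needed else m) for p, m in means.items()}
```

With this vocabulary, a model for a phoneme such as "e" is never touched, which is where the saving in computation comes from.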

[0022] FIG. 3 shows the list generated as input speech frames are processed. For each n-th input speech frame, the list registers the input vector y_k (k = n) and the value z_k (k = n) of the state closest to it, that is, the state showing the maximum likelihood (or the mean of the distribution). Whenever a new input speech frame is obtained, the corresponding input vector is obtained, the value of the state showing the maximum likelihood is selected, and both are registered in the list. Equation (1) is calculated with reference to this list. The list holds data for at least the frames of the detected speech section.
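The calculation over the per-frame list can be sketched in a few lines. This is an illustrative reading of the verbal description in paragraph [0019], not the published equation: the function name `adapt_mean` is hypothetical, and the difference is taken as z_k - y_k, an assumption made so that the adapted mean moves toward the input.

```python
def adapt_mean(x_m, frames, tau):
    """Return the adapted mean of one state after the frames seen so far.

    x_m:    current mean vector of the state (list of floats).
    frames: list of (y_k, z_k) pairs, one per frame, as in the list of FIG. 3;
            y_k is the input vector and z_k the selected model mean.
    tau:    positive constant; larger values make early adaptation weaker.
    ASSUMPTION: the averaged difference is z_k - y_k.
    """
    n = len(frames)
    if n == 0:
        return list(x_m)  # nothing observed yet, leave the model unchanged
    dim = len(x_m)
    mean_diff = [sum(z[d] - y[d] for y, z in frames) / n for d in range(dim)]
    weight = n / (n + tau)  # grows from near 0 toward 1 as frames accumulate
    return [x_m[d] - weight * mean_diff[d] for d in range(dim)]
```

Calling the function again after each new frame is appended gives the frame-synchronous behavior; a running sum of the differences would avoid recomputing the average from scratch.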

[0023] When the number of mixture components of the HMM is 1, the mean of the state and the mean of the distribution coincide; with a mixture distribution they do not. If the state mean is to be used in the mixture case, the distributions within the same state are merged into a single distribution as follows. That is, X_m obtained from equation (2) below is substituted for x_m in equation (1).

[0024]

[Equation 2]

[0025] Here, w_km is the weight of the k-th distribution of state m and x_km is its mean. The adaptation described above is performed for every frame of the input speech during the forward search at recognition time.
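The image for equation (2) is not reproduced here. Given the definitions of w_km and x_km, a natural reading is that the merged mean X_m is the weighted sum of the component means; the sketch below takes that reading as an assumption, and also assumes the mixture weights of a state sum to 1. The function name is hypothetical.

```python
def merged_state_mean(weights, means):
    """Weighted mean of the mixture components of one state.

    weights: mixture weights w_km of the state (assumed to sum to 1).
    means:   component mean vectors x_km, each a list of floats.
    """
    dim = len(means[0])
    return [sum(w * mu[d] for w, mu in zip(weights, means)) for d in range(dim)]
```

The returned vector plays the role of x_m when the state-level mean of a mixture state is adapted.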

[0026] In step S106, a backward search is performed after the forward search has finished, and in step S107 the recognition result is output to the display 103 or the I/F 102.

[0027] As described above, according to this embodiment the means of the HMM states are adapted frame by frame while speech recognition is in progress, so that recognition and adaptation can be performed almost simultaneously in parallel. In addition, because the state values are adapted using a weight that depends on the number of input frames so far (n/(n + τ)), the amount of data available for adaptation is taken into account, and degradation of recognition performance immediately after the start of recognition can be prevented.
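The effect of the weight n/(n + τ) is easy to see numerically. The short sketch below prints the weight for a few frame counts; the value chosen for τ is purely illustrative, since the text does not give one.

```python
def adaptation_weight(n, tau):
    """Weight n / (n + tau) applied to the averaged difference after n frames."""
    return n / (n + tau)


TAU = 20  # illustrative value only; the patent does not specify tau

for n in (1, 10, 50, 500):
    print(n, round(adaptation_weight(n, TAU), 3))
```

The weight is close to 0 for the first few frames, when the averaged difference is estimated from very little data, and approaches 1 as frames accumulate, which is how the amount of adaptation data is taken into account.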

[0028] [Other Embodiments] [1] Equation (1) treats all input data identically, but the invention is not limited to this. For example, the phoneme models may be divided into several classes and the second term of equation (1) obtained separately for each class of the phoneme z_k. This allows finer-grained adaptation.

[0029] The following methods can be used for the classification. First, the models can be classified by phoneme type, for example into vowels and consonants. Alternatively, a distance measure between phonemes can be defined and the models classified by a clustering method based on that distance; the k-means method or the like can be used for the clustering. The power information held by the phoneme models can also be used. The parameters of a phoneme model include, for example, cepstrum, delta cepstrum, power, and delta power. Using the power parameter, a power threshold can be set so that, for example, models at or above the threshold are assigned to class 1 and models at or below it to class 0.
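The power-threshold classification might be sketched as follows. The function name and the dictionary representation are hypothetical, and because the text assigns the threshold value itself to both classes, this sketch assumes that a model exactly at the threshold goes to class 1.

```python
def classify_by_power(model_power, threshold):
    """Assign each phoneme model to class 1 or class 0 by its power parameter.

    model_power: dict mapping model name to its power value.
    ASSUMPTION: a power equal to the threshold is placed in class 1.
    """
    return {name: (1 if p >= threshold else 0) for name, p in model_power.items()}
```

For example, `classify_by_power({"a": 0.8, "s": 0.1}, 0.5)` places the high-power model "a" in class 1 and the low-power model "s" in class 0.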

[0030] When the phoneme x_m belongs to class C_i, that is, when x_m ∈ C_i, class-specific adaptation is performed as in equation (3) below.

[0031]

[Equation 3]

[0032] Here, n_i is the number of times, up to the frame in question (the current input speech frame), that a phoneme z_k belonging to class C_i has been selected.

[0033] FIG. 4 shows the list generated as input speech frames are processed in this other embodiment. As shown in FIG. 4, the phoneme class corresponding to the selected HMM state is registered for each input speech frame. The average in equation (3) is calculated by extracting those frames in which the selected z_k belongs to the same phoneme class as the HMM state selected in the current frame.
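Class-wise adaptation over the list of FIG. 4 could be sketched as follows. The image for equation (3) is not reproduced in this text, so this follows the verbal description only: the same form as equation (1), with n replaced by n_i and the average restricted to frames of the same class. The function name is hypothetical, features are scalars for brevity, and the difference is again assumed to be z_k - y_k.

```python
def adapt_mean_by_class(x_m, class_of_x, frames, tau):
    """Adapt a scalar mean using only frames whose selected state is in its class.

    frames: list of (y_k, z_k, class_k) triples, as in the list of FIG. 4.
    ASSUMPTION: the averaged difference is z_k - y_k.
    """
    diffs = [z - y for y, z, c in frames if c == class_of_x]
    n_i = len(diffs)
    if n_i == 0:
        return x_m  # no frame of this class observed yet
    return x_m - (n_i / (n_i + tau)) * (sum(diffs) / n_i)
```

A class that has not yet been selected leaves its models unchanged, which reflects the data-amount trade-off that comes with finer classes.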

[0034] [2] Equation (1) treats all input data identically. Alternatively, the phoneme models may first be divided into several classes and adaptation performed by the following method. Here, x_m ∈ C_i.

[0035]

[Equation 4]

[0036] Here, n is the total number of frames of the input speech up to the frame in question, and n_i is the number of times a phoneme z_k belonging to class C_i has been selected. The second term of the above equation is calculated over all classes, and the third term is calculated for each class separately. The list of FIG. 4 described above can clearly be used when processing the speech frames.

[0037] The appropriate correction amount for adaptation (the second term on the right-hand side of equation (1)) inherently differs from phoneme to phoneme. The more finely the phonemes are divided into classes, therefore, the more accurate the adaptation becomes. Conversely, the finer the division, the less adaptation data is available to each class, so more speech data is needed before adaptation takes effect; this is a trade-off. Two classes thus require more data than one class but give better accuracy. Accordingly, when enough data (speech) is available, the method of equation (3) gives a higher recognition rate than the method of equation (1). This is the advantage of equation (3).

[0038] The method of equation (4) attempts to mitigate the disadvantage of equation (3) as far as possible. By forming a weighted sum of the correction amount obtained from all classes and the correction amount obtained from each class, equation (4) seeks to improve the recognition rate while reducing the drawback of requiring a large amount of data.
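The combination described for equation (4) can be illustrated with a rough sketch. The image for equation (4) is not reproduced in this text, and the text does not say how the two corrections are weighted, so the interpolation weight `lam` below is entirely hypothetical, as are the function names. Features are scalars for brevity and the difference is assumed to be z_k - y_k.

```python
def weighted_mean_diff(diffs, tau):
    """Averaged difference scaled by n / (n + tau); 0.0 when no data exists."""
    n = len(diffs)
    if n == 0:
        return 0.0
    return (n / (n + tau)) * (sum(diffs) / n)


def adapt_mean_combined(x_m, class_of_x, frames, tau, lam=0.5):
    """Combine an all-class correction with a class-specific correction.

    frames: list of (y_k, z_k, class_k) triples, as in the list of FIG. 4.
    lam:    HYPOTHETICAL interpolation weight between the two corrections;
            the actual weighting of equation (4) is not recoverable here.
    """
    all_diffs = [z - y for y, z, c in frames]
    class_diffs = [z - y for y, z, c in frames if c == class_of_x]
    correction = (lam * weighted_mean_diff(all_diffs, tau)
                  + (1.0 - lam) * weighted_mean_diff(class_diffs, tau))
    return x_m - correction
```

Early in an utterance the all-class term, which is estimated from every frame, carries most of the correction, while the class-specific term contributes more as frames of that class accumulate.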

[0039] [3] Some applications of a speech recognition system achieve their purpose with a single utterance, while others require several. With multiple utterances, continuing the adaptation described above across utterances can gradually improve recognition performance. However, the environment and the speaker do not necessarily remain the same. The following method may therefore be used.

[0040] First, before recognition starts, the phoneme models are copied to disk or memory and saved. Adaptation is then performed by the method described above. When one utterance ends, the adapted models are discarded and the saved phoneme model parameters are reloaded; recognition and adaptation of the second utterance start from them. With this method the effect of adaptation is cleared after each utterance, so adaptation to the current utterance is unaffected by adaptation to the previous one.
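The save-and-restore scheme can be sketched with a small helper. The class name `AdaptiveModelStore` and the dictionary representation of the model are hypothetical; the sketch keeps the saved copy in memory, whereas the text also allows saving it to disk.

```python
import copy


class AdaptiveModelStore:
    """Keeps an unadapted copy of the model parameters for per-utterance reset."""

    def __init__(self, model):
        self._saved = copy.deepcopy(model)  # copy made before recognition starts
        self.model = copy.deepcopy(model)   # working copy that adaptation modifies

    def reset(self):
        """Discard the adapted model and reload the saved parameters."""
        self.model = copy.deepcopy(self._saved)
```

Calling `reset()` at the end of each utterance means the next utterance starts from the original parameters, so a change of speaker or environment does not inherit a stale correction.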

[0041] [4] In the above embodiment the recognition result is output to the display 103 or the I/F 102 in step S107. Clearly, the recognition result, which is output as characters or symbols, can also be passed to an application so that the application is controlled by speech recognition.

[0042] [5] The above embodiment uses the cepstrum as an example of the speech feature, but the logarithmic spectrum, or parameters representing it, may be used instead.

[0043] [6] The above embodiment is directed at Japanese, but the invention can equally be applied to other languages by changing the speech models and grammar used for recognition.

[0044] In conventional speech recognition apparatuses that perform adaptation and recognition simultaneously in synchronization with the speech input, adaptation is insufficient immediately after the start of recognition, which degrades recognition performance. This embodiment prevents that degradation. The influence of differences in microphone characteristics and telephone line characteristics, which are a problem when speech recognition is used, can therefore be reduced further, and users can be provided with a recognition apparatus of stable performance.

[0045] The present invention may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copier or a facsimile machine).

[0046] Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a storage medium on which is recorded the program code of software that implements the functions of the embodiments described above, and having a computer (or a CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium.

[0047] In this case, the program code read from the storage medium itself implements the functions of the embodiments described above, and the storage medium storing that program code constitutes the present invention.

[0048] Storage media that can be used to supply the program code include, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory card, and ROM.

[0049] Needless to say, the invention covers not only the case where the functions of the embodiments described above are implemented by the computer executing the program code it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the embodiments are implemented by that processing.

[0050] Needless to say, the invention also covers the case where the program code read from the storage medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that board or in that unit performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the embodiments described above are implemented by that processing.

[0051]

[Effects of the Invention] As described above, according to the present invention, recognition and adaptation can be performed almost simultaneously in parallel, and degradation of recognition performance immediately after the start of recognition can be prevented.

[0052]

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the speech recognition apparatus according to the embodiment.

FIG. 2 is a flowchart illustrating the speech recognition processing according to the embodiment.

FIG. 3 is a diagram showing the list generated as input speech frames are processed.

FIG. 4 is a diagram showing the list generated as input speech frames are processed in another embodiment.

Claims

[Claims]

[Claim 1] A speech recognition apparatus comprising: generating means for generating, on the basis of input speech, an input speech feature for each speech frame, a frame being a predetermined unit of time; calculating means for averaging, over the period up to the current speech frame, the difference between the input speech feature of each speech frame and the model feature selected from a speech model for that speech frame in recognition of the input speech; and adaptation means for adapting the speech model in synchronization with the speech input on the basis of the averaging result obtained by the calculating means.

[Claim 2] The speech recognition apparatus according to claim 1, wherein the generating means generates either a cepstrum or a logarithmic spectrum time series on the basis of a short-time spectrum obtained from the input speech, and generates the input speech feature for each input frame on the basis thereof.

[Claim 3] The speech recognition apparatus according to claim 1 or 2, wherein the speech model is composed of model features divided into classes, and the calculating means averages, for the class to which the model feature selected at recognition belongs, the difference between the input feature of each speech frame and the model feature selected at recognition over the period up to the current speech frame.

[Claim 4] The speech recognition apparatus according to claim 1 or 2, wherein the speech model is composed of model features divided into classes; the calculating means comprises first averaging means for averaging, regardless of class and over the period up to the current speech frame, the difference between the input speech feature of each speech frame and the model feature selected from the speech model for that speech frame in recognition of the input speech, and second averaging means for averaging, for the class to which the speech model selected at recognition belongs, the difference between the input feature of each speech frame and the model feature selected at recognition over the period up to the current speech frame; and the adaptation means adapts the speech model in synchronization with the speech input on the basis of the averaging results obtained by the first and second averaging means.

[Claim 5] The speech recognition apparatus according to any one of claims 1 to 4, wherein the classification is performed on the basis of power information held by each speech model.

6. The speech recognition device according to any one of claims 1 to 5, wherein the adaptation means applies to the averaging result a weight that varies with the number of frames of the input speech, and adapts the speech model in synchronization with the speech input based on the weighted averaging result.
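Claim 6 states only that the weight varies with the number of input frames; it does not give the weighting function. The sketch below therefore assumes one plausible form, w(n) = n / (n + tau), which starts at zero and approaches one as more frames arrive. The function names, the constant `tau`, and the choice of shifting the model mean by the weighted bias are all assumptions made for illustration.

```python
def frame_weight(num_frames, tau=10.0):
    """Assumed weight: 0 at the first instant, approaching 1 as frames accumulate."""
    return num_frames / (num_frames + tau)


def adapt_mean(model_mean, bias, num_frames, tau=10.0):
    """Shift a model mean by the averaged difference, scaled by the frame weight.

    bias is the averaged (input feature - model feature) up to the current frame.
    """
    w = frame_weight(num_frames, tau)
    return [m + w * b for m, b in zip(model_mean, bias)]
```

With this choice the average has little influence on the model while it is still based on only a few frames, and its influence grows as it becomes more reliable.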

7. The speech recognition device according to any one of claims 1 to 6, wherein, when the speech model is adapted, adaptation is performed only on the speech models required for the search at the time of the search.
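One way to read claim 7 is that the averaged difference is applied only to the models the search is currently examining, rather than to every model in the inventory. The following minimal sketch assumes that the search supplies a list of active model names; `adapt_active` and its arguments are hypothetical names, not taken from the patent.

```python
def adapt_active(model_means, active_names, bias):
    """Return adapted means only for the models the search currently needs.

    model_means:  dict mapping model name -> mean vector (list of floats)
    active_names: names of models on currently active search paths (assumed input)
    bias:         averaged (input feature - model feature) up to the current frame
    """
    return {
        name: [m + b for m, b in zip(model_means[name], bias)]
        for name in active_names
    }
```

Models outside `active_names` are left untouched, so the per-frame cost scales with the number of active models rather than with the size of the model set.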

8. The speech recognition device according to any one of claims 1 to 6, further comprising holding means for holding the pre-adaptation model when the speech model is adapted, wherein, for each speech input, the speech model is replaced with the pre-adaptation model at the start of recognition.
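The holding and replacement behaviour of claim 8 can be sketched as a model object that keeps a copy of its pre-adaptation parameters and restores them at the start of each speech input. The class `AdaptableModel` and its method names are hypothetical; the sketch also assumes that the model is represented simply as a dictionary of mean vectors.

```python
import copy


class AdaptableModel:
    """Speech model that can be restored to its pre-adaptation state."""

    def __init__(self, means):
        self._original = copy.deepcopy(means)  # held pre-adaptation model
        self.means = copy.deepcopy(means)      # working copy that gets adapted

    def shift(self, name, bias):
        """Adapt one model by shifting its mean vector by bias."""
        self.means[name] = [m + b for m, b in zip(self.means[name], bias)]

    def reset(self):
        """Call at the start of recognition of each speech input."""
        self.means = copy.deepcopy(self._original)
```

Calling `reset()` before each utterance keeps adaptation from one input from carrying over into the next.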

9. The speech recognition device according to any one of claims 1 to 8, wherein the speech model is a hidden Markov model.

10. A speech recognition method comprising: a generation step of generating, based on input speech, an input speech feature for each speech frame, a speech frame being a unit of a predetermined time; a calculation step of averaging the difference between the input speech feature of each speech frame and the model feature selected from a speech model for that speech frame in recognition of the input speech, over the period up to the current speech frame; and an adaptation step of adapting the speech model in synchronization with the speech input based on the averaging result obtained in the calculation step.
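The three steps of claim 10 can be illustrated as a single frame-synchronous loop. The sketch below makes several simplifying assumptions that are not in the patent: model selection is a nearest-mean search under squared Euclidean distance (standing in for the HMM search), adaptation shifts every model mean by the running average, and the names `recognize_and_adapt`, `frames`, and `model_means` are hypothetical.

```python
def recognize_and_adapt(frames, model_means):
    """Select a model per frame while adapting the models frame by frame.

    frames:      list of feature vectors, one per speech frame
    model_means: dict mapping model name -> unadapted mean vector
    Returns the selected model name per frame and the final averaged difference.
    """
    if not frames:
        return [], []
    dim = len(frames[0])
    diff_sum = [0.0] * dim  # sum of (input - selected model feature) so far
    bias = [0.0] * dim      # running average, applied to all model means
    selected = []
    for n, x in enumerate(frames, start=1):
        # Recognition step (stand-in): nearest adapted mean, i.e. mean + bias.
        def distance(name):
            mean = model_means[name]
            return sum((xi - (mi + bi)) ** 2 for xi, mi, bi in zip(x, mean, bias))

        best = min(model_means, key=distance)
        selected.append(best)
        # Calculation step: average the difference up to the current frame.
        for i in range(dim):
            diff_sum[i] += x[i] - model_means[best][i]
        # Adaptation step: the updated average is used from the next frame on.
        bias = [s / n for s in diff_sum]
    return selected, bias
```

For example, with means `{"a": [0.0, 0.0], "b": [4.0, 4.0]}` and frames that carry a constant offset of 1.0 in each dimension, the running average converges to that offset after the first frame, and later frames are matched against the shifted means.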

11. The speech recognition method according to claim 10, wherein the generation step generates either a cepstrum or a logarithmic-spectrum time series based on a short-time spectrum obtained from the input speech, and generates the input speech feature for each input frame based thereon.
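As a rough illustration of the feature generation in claim 11, the sketch below derives a log-magnitude spectrum and a real cepstrum from one frame of samples. It is not the patent's front end: the Hamming window, the naive DFT, the flooring constant, and the function name `log_spectrum_and_cepstrum` are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def log_spectrum_and_cepstrum(frame, num_coeffs, floor=1e-10):
    """Return (log-magnitude spectrum, first num_coeffs real-cepstrum values)."""
    n = len(frame)
    if n < 2:
        raise ValueError("frame must contain at least two samples")
    # Short-time analysis: apply a Hamming window (an assumed choice).
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
        for k, s in enumerate(frame)
    ]
    # Short-time spectrum via a naive DFT (O(n^2), fine for illustration).
    spectrum = [
        sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Logarithmic spectrum, floored to avoid log(0).
    log_mag = [math.log(max(abs(c), floor)) for c in spectrum]
    # Real cepstrum: inverse DFT of the log-magnitude spectrum.
    cepstrum = [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(num_coeffs)
    ]
    return log_mag, cepstrum
```

In both representations a fixed gain or channel response becomes an additive offset, which is why a difference between input features and model features can be averaged and used as a correction.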

12. The speech recognition method according to claim 10 or 11, wherein the speech model is composed of model features classified into classes, and the calculation step averages, for the class to which the model feature selected at the time of recognition belongs, the difference between the input feature of each speech frame and the model feature selected at the time of recognition, over the period up to the current speech frame.

13. The speech recognition method according to claim 10 or 11, wherein the speech model is composed of model features classified into classes; the calculation step comprises a first averaging step of averaging, regardless of class, the difference between the input speech feature of each speech frame and the model feature selected from the speech model for that speech frame in recognition of the input speech, over the period up to the current speech frame, and a second averaging step of averaging, for the class to which the speech model selected at the time of recognition belongs, the difference between the input feature of each speech frame and the model feature selected at the time of recognition, over the period up to the current speech frame; and the adaptation step adapts the speech model in synchronization with the speech input based on the averaging results obtained in the first and second averaging steps.

14. The speech recognition method according to any one of claims 10 to 13, wherein the classification is performed based on power information held by each speech model.

15. The speech recognition method according to any one of claims 10 to 14, wherein the adaptation step applies to the averaging result a weight that varies with the number of frames of the input speech, and adapts the speech model in synchronization with the speech input based on the weighted averaging result.

16. The speech recognition method according to any one of claims 10 to 15, wherein, when the speech model is adapted, adaptation is performed only on the speech models required for the search at the time of the search.

17. The speech recognition method according to any one of claims 10 to 15, further comprising a holding step of holding the pre-adaptation model when the speech model is adapted, wherein, for each frame input, the speech model is replaced with the pre-adaptation model at the start of recognition.

18. The speech recognition method according to any one of claims 10 to 17, wherein the speech model is a hidden Markov model.

19. A storage medium storing a control program for causing a computer to execute speech recognition processing, the control program comprising: code for a generation step of generating, based on input speech, an input speech feature for each speech frame, a speech frame being a unit of a predetermined time; code for a calculation step of averaging the difference between the input speech feature of each speech frame and the model feature selected from a speech model for that speech frame in recognition of the input speech, over the period up to the current speech frame; and code for an adaptation step of adapting the speech model in synchronization with the speech input based on the averaging result obtained in the calculation step.