JPH0990975A

JPH0990975A - Model learning method for pattern recognition

Info

Publication number: JPH0990975A
Application number: JP7244275A
Authority: JP
Inventors: Junichi Takahashi (高橋 淳一); Shigeki Sagayama (嵯峨山 茂樹)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-09-22
Filing date: 1995-09-22
Publication date: 1997-04-04

Abstract

PROBLEM TO BE SOLVED: To improve recognition performance even when only a small amount of learning data is available, by combining learning methods that use different optimization criteria. SOLUTION: Feature parameters of a small amount of learning data 22 are obtained by an analysis process 25. Using this data, the mean vectors of the corresponding models in a speaker-independent initial model 21 are learned by a maximum a posteriori probability estimation method 26. The mean vectors of the models not covered by the data are learned by a moving vector field smoothing method 27, through interpolation that uses the already learned models and the mean vectors of the initial model. The model obtained by these two learning steps is taken as an initial model 23 and, using the same data 22, is further trained by an identification error minimization learning method 28.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention applies to pattern recognition of speech, characters, figures, and the like. It relates to a method that takes a standard model prepared in advance as an initial model and corrects it by learning from a small amount of learning data, so as to obtain a model with the highest possible performance. More specifically, it relates to a model learning method for pattern recognition that can be applied, for example, to the adaptation problem in speech recognition, in which an acoustic model is adjusted so that a given speaker's speech is easier to recognize, thereby coping with the individual characteristics of various speakers.

[0002]

[Prior Art] The hidden Markov model method (hereinafter, the HMM method), which models the feature data sequence of a recognition target on the basis of probability and statistical theory, is a useful technique for pattern recognition of speech, characters, figures, and the like. In speech recognition in particular, it is now the mainstream method. Details of the HMM method are given, for example, in Seiichi Nakagawa, "Speech Recognition by Stochastic Models" (確率モデルによる音声認識), edited by the Institute of Electronics, Information and Communication Engineers. So many HMM-based techniques have been researched and developed that it is no exaggeration to say HMM technology has advanced through speech recognition, and the hidden Markov model techniques of the speech recognition field can be said to encompass almost all conventional pattern recognition techniques that use hidden Markov models. The prior art is therefore described below using HMM-based speech recognition as an example.

[0003] The speech recognition procedure of the HMM method is described with reference to FIG. 4A. HMM processing is broadly divided into two phases: "learning" and "search". In the learning phase, switches 10 and 11 in FIG. 4A each select side A, connecting the speech database 12 and the learning processing unit 13 to the analysis processing unit 14. Using the data in the speech database 12, which stores speech signals for various speech units (phonological units, phonemes, syllables) as well as words and sentences, a model expressing the acoustic properties of each phoneme, syllable, word, and so on is obtained with the learning algorithm of the HMM method. The signals in the database 12 used in this process are converted by the analysis processing unit 14 from speech signals into time series of feature-parameter vectors that express the characteristics of the speech signal, and these vector time series are used to train the acoustic models. This sequence of operations corresponds to the path in which speech signal data is input from the speech database 12 to the analysis processing unit 14, and the analysis output, that is, the time series of feature-parameter vectors, is input to the learning processing unit 13. In FIG. 4A, the arrow from the HMM set 15, which stores all the models finally obtained by learning, to the learning process indicates that the model structure of the HMM to be learned (number of states, form of transitions between states, and so on) and the initial values of the model parameters (state transition probabilities, symbol output probabilities, initial state probabilities) are set when the learning process is executed. The signal processing most often used in the analysis processing unit 14 is linear predictive analysis (Linear Predictive Coding, LPC), and the feature parameters include the LPC cepstrum, LPC delta cepstrum, mel cepstrum, and logarithmic power. The phoneme, syllable, and other models obtained by this learning process are stored as elements of the HMM set 15, and this HMM set 15 represents all acoustic phenomena that appear in the speech database. The Baum-Welch re-estimation method, which is based on maximum likelihood estimation, is commonly used as the learning algorithm.

[0004] In the search phase, switches 10 and 11 in FIG. 4A each select side B, connecting the unknown speech input unit 16 and the search processing unit 17 to the analysis processing unit 14. The input unknown speech signal is converted by the analysis processing unit 14 into a time series of feature-parameter vectors. The search processing unit 17 then determines which model in the HMM set 15 this time series most resembles, by computing a kind of score called the likelihood for each model. The likelihoods are compared, the model giving the largest likelihood is selected, and the name of the phoneme, syllable, or other unit that the model represents is taken as the recognition result. As search algorithms of the HMM method for computing this likelihood, the trellis calculation based on the forward-backward algorithm and the Viterbi algorithm are commonly used. For word recognition, when the models represent phonemes or syllables, a model for each word to be recognized is built by concatenating those models according to the word's notation (for example, its phoneme sequence), and the likelihood is computed for each word model obtained in this way. The likelihoods of the word models are then compared, and the word giving the largest likelihood is taken as the recognition result.
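
The search step described in paragraph [0004] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`viterbi_log_score`, `recognize`, `safe_log`) and the toy two-state model are hypothetical, and the per-frame output probabilities are taken as given instead of being computed from feature vectors.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Log of a probability, mapping 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_score(log_init, log_trans, log_emit):
    """Log-likelihood of the best state path through one model.

    log_init[i]     : log initial probability of state i
    log_trans[i][j] : log transition probability from state i to state j
    log_emit[t][j]  : log output probability of frame t in state j
    """
    delta = [li + e for li, e in zip(log_init, log_emit[0])]
    n = len(delta)
    for frame in log_emit[1:]:
        # For each state j, keep the best predecessor, then add the frame score.
        delta = [
            max(delta[i] + log_trans[i][j] for i in range(n)) + frame[j]
            for j in range(n)
        ]
    return max(delta)


def recognize(log_scores):
    """Return the label whose model gives the largest likelihood."""
    return max(log_scores, key=log_scores.get)


# Toy two-state left-to-right model scored on two frames (values are made up).
log_init = [safe_log(1.0), safe_log(0.0)]
log_trans = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
log_emit = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.1), safe_log(0.9)]]
score = viterbi_log_score(log_init, log_trans, log_emit)
```

In a word recognizer, one such score would be computed per concatenated word model and passed to `recognize` as a dictionary keyed by word.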

[0005] Speech recognition using the HMM method therefore requires collecting speech data for model training as information about the speech signals to be recognized. Various speech databases exist, but most of them are databases of high-quality speech. Building on the HMM method described above, various technical refinements have been added, and high-performance acoustic models have become obtainable. The main refinements concern the HMM model structure and the learning method for the acoustic model. For the former, the results of research and development to date indicate that phoneme-environment-dependent modeling works well. This technique models phonemes on the observation that even the same phoneme exhibits different acoustic phenomena depending on the phonemes that precede and follow it. Here, the phoneme environment means the acoustic influence of the preceding and following phonemes. For example, let the phoneme notations for the utterances "aki" (秋, autumn) and "eki" (駅, station) be "a-k-i" and "e-k-i", where "-" is a symbol marking a phoneme boundary. In this example, the phoneme "k" lies between "a" and "i" in "aki" but between "e" and "i" in "eki". Because the surrounding phonemes differ, a separate acoustic model is created for each case even though the phoneme is written "k" in both. As for the learning method, most systems rely on the maximum likelihood estimation (Maximum Likelihood Estimation) described above, applying the Baum-Welch learning algorithm to HMMs with the refined model structure.
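
The "aki" / "eki" example can be made concrete with a short sketch that labels each phoneme with its neighbors. The function name `context_dependent_units` and the "#" boundary marker are assumptions made for illustration; the patent does not specify how context-dependent units are named.

```python
def context_dependent_units(phonemes):
    """Label each phoneme with its left and right neighbors.

    Word boundaries are marked with "#" (an assumption of this sketch).
    Returns a list of (left, center, right) tuples.
    """
    padded = ["#"] + list(phonemes) + ["#"]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


aki = context_dependent_units("a-k-i".split("-"))
eki = context_dependent_units("e-k-i".split("-"))

# The center phoneme "k" is the same, but its context differs,
# so the two occurrences would map to different acoustic models.
k_in_aki = aki[1]
k_in_eki = eki[1]
```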

[0006] In recent years, however, learning methods for acoustic models based on identification error minimization (Minimum Classification Error) have been studied, with the aim of creating acoustic models that outperform those trained by maximum likelihood estimation. The principle of this learning method is disclosed, for example, in B.-H. Juang and S. Katagiri, "Discriminative Learning for Minimum Error Classification," IEEE Transactions on Signal Processing, Vol. 40, No. 2, pp. 3043-3054, 1992, and in W. Chou, B.-H. Juang and C.-H. Lee, "Segmental GPD Training of HMM Based Speech Recognizer," Proceedings of the International Conference on Acoustics, Speech & Signal Processing, pp. 473-476, 1992. The optimization criterion used in this learning differs from that of conventional maximum likelihood estimation. In maximum likelihood learning, the model for each target category is trained under the likelihood-maximization criterion using only the sample data belonging to that category. The identification error minimization learning method instead trains the models under a criterion that minimizes the number of recognition errors on the sample data.

[0007] The difference between the two learning methods is illustrated by the case of training the acoustic model of the phoneme "a". In maximum likelihood estimation, only the speech data corresponding to the phoneme "a" is used, and the parameters of the "a" model are determined so that the likelihood of this data under the model (a measure of the degree of similarity) is maximized. Speech data for phonemes other than "a" is not used at all. In other words, this is a kind of within-class learning that models each category in isolation. In the identification error minimization learning method, by contrast, the parameters of not only the "a" model but also the other acoustic models are determined so that the speech data corresponding to "a" is recognized as "a" as often as possible. The other acoustic models are adjusted so that the likelihood they assign to the "a" data becomes smaller than the likelihood assigned by the "a" model, and recognition errors decrease as a result. Because the learning not only models the target category but also contributes to the modeling of the other categories, it can be regarded as a kind of between-class learning. The papers cited above report that this method is effective for training acoustic models with large amounts of learning data and yields acoustic models with higher recognition performance than maximum likelihood estimation.

[0008] However, there are almost no reports on whether the method is effective for training acoustic models with a small amount of learning data, on the order of a few dozen words. The only recent report, which concerns its application to speaker adaptation, is Matsui and Furui, "A Study of a Speaker Adaptation Method Based on Identification Error Minimization" (識別誤り最小化による話者適応化法の検討), Proceedings of the 1995 Spring Meeting of the Acoustical Society of Japan, 3-5-10, pp. 95-96, which reports that with a small amount of learning data the improvement in the recognition performance of the acoustic model is small. Consequently, with a small amount of learning data, the identification error minimization learning method alone cannot train a higher-performance acoustic model. If a learning method that exploits the capability of identification error minimization could produce acoustic models that exceed the recognition performance of those obtained by existing learning methods suited to small amounts of data, the recognition performance of speech recognition systems would improve, that is, recognition errors would decrease, and more comfortable speech recognition application services would become possible.

[0009]

[Problem to Be Solved by the Invention] An object of the present invention is to provide a learning method that creates a higher-performance model from a small amount of learning data, in order to improve the performance of pattern recognition and thereby enhance system functionality or service convenience in practical systems and services that use pattern recognition.

[0010]

[Means for Solving the Problem] According to the present invention, an initial model prepared in advance is first trained on a small amount of learning data by a learning method that combines maximum a posteriori probability estimation with moving vector field smoothing. The resulting model is then trained by the identification error minimization learning method using the same small amount of learning data.

[0011]

[Embodiments of the Invention] FIG. 1 shows the method of the present invention, which broadly consists of three processes. The first extracts feature parameters from the data used for model training. In FIG. 1, this corresponds to the path in which the small amount of learning data 22 is sent to the analysis process 25 and the resulting feature parameter data is input to each learning process. The remaining two processes are both model learning processes. The process 29 enclosed by the broken line is a learning method called MAP/VFS, which combines maximum a posteriori probability estimation (Maximum A Posteriori estimation: MAP) 26 with moving vector field smoothing (Vector Field Smoothing: VFS) 27. The MAP/VFS method is disclosed in Japanese Patent Application No. 6-156238 and Japanese Patent Application No. 6-226505. The process 29 shown within the broken line is the method disclosed in Japanese Patent Application No. 6-156238; that combination is used here as the example, but the method disclosed in Japanese Patent Application No. 6-226505 may be used instead. The other learning process is the identification error minimization learning method 28.

[0012] In the method of the present invention, the initial model 21 is first trained by the MAP/VFS method 29 using the feature parameter data of the learning data 22, yielding a first learning model 23. The first learning model 23 is then regarded as an initial model and trained by the identification error minimization learning method 28, yielding a second learning model 24. This second step uses the same learning data 22 as the MAP/VFS method 29. The resulting second learning model 24 is the desired high-performance model.
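
The two-stage flow of paragraph [0012] can be outlined as a small driver function. This is only a structural sketch: `train_two_stage`, `map_vfs_step`, and `mce_step` are hypothetical names, and the two learning steps are passed in as callables because their internals are not defined at this point in the text.

```python
def train_two_stage(initial_model, features, map_vfs_step, mce_step):
    """Run MAP/VFS learning, then identification error minimization learning.

    Both steps receive the same feature data, as the text requires.
    map_vfs_step and mce_step are caller-supplied functions (assumptions of
    this sketch) with signature step(model, features) -> model.
    """
    # Stage 1: initial model 21 -> first learning model 23
    first_model = map_vfs_step(initial_model, features)
    # Stage 2: first learning model 23 -> second learning model 24
    second_model = mce_step(first_model, features)
    return first_model, second_model
```

Passing the steps as arguments keeps the sketch independent of any particular implementation of either learning method.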

[0013] As described above, in the method of the present invention, a model trained by the MAP/VFS method on a small amount of learning data is further trained, on exactly the same learning data, by the identification error minimization learning method, whose optimization criterion differs from that of the MAP/VFS method. This produces a higher-performance model. The principles of the MAP/VFS method and the identification error minimization learning method are explained below with mathematical expressions, and the concrete procedure for learning the HMM model parameters by the method of the present invention is made clear.

[0014] An example in which the method of the present invention is applied to the speaker adaptation problem is described. Speech recognition systems generally assume an unspecified large number of users and use a speaker-independent model as the acoustic model. This model is trained on a large amount of speech data from speakers of various genders, ages, and so on, and its recognition performance is usually within an acceptable range for most speakers. However, since even a large training set is finite, there may be speech whose speaker characteristics are not represented in it. For such speech, the recognition performance of even a speaker-independent model degrades. Speaker adaptation addresses this problem by adjusting the acoustic model through adaptive learning so that the speaker's speech becomes easier to recognize. Because the learning data available for adaptive learning is generally limited to a small amount, the key issue in adaptation is how to learn a high-performance model from that limited data. The higher the performance of the resulting model, the better.

[0015] In the following description, the HMM of each phoneme is a left-to-right mixture continuous HMM with four states and three mixture components, as shown in FIG. 4B. In FIG. 4B, each circle represents a state 30, and the number written below a circle is its state number. The arrows between states represent state transition branches: a self-loop 31 that remains in the same state, and a transition branch 32 that moves to the state immediately to the right. The parameter $a_{ij}$ shown beside each branch is its state transition probability. State 4 is the final state of the phoneme model. When phoneme models are concatenated to build models of syllables, words, sentences, and so on, this final state 4 is overlaid on state 1 of the following phoneme model. The left-to-right structure allows only self-loops and transitions to the right-hand neighbor, and is widely used because it represents speech phenomena well. "Mixture continuous" means that the symbol output probability density function of each state is expressed as a linear combination of several Gaussian (normal) distributions; this is the mainstream model representation in current speech recognition algorithms.
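
The left-to-right structure can be written down as a transition matrix. The sketch below assumes that each emitting state's self-loop and right-neighbor probabilities sum to one and that the final state has no outgoing branches; the function name `left_to_right_transitions` and the probability values are illustrative only.

```python
def left_to_right_transitions(self_loop_probs):
    """Build a transition matrix with only self-loops and right-neighbor moves.

    self_loop_probs[i] is the self-loop probability of emitting state i.
    One extra final state is appended; it has no outgoing transitions.
    """
    n = len(self_loop_probs) + 1
    a = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(self_loop_probs):
        a[i][i] = p            # self-loop
        a[i][i + 1] = 1.0 - p  # move to the right-hand neighbor
    return a


# Four states as in FIG. 4B: three emitting states plus the final state.
A = left_to_right_transitions([0.6, 0.5, 0.7])

# Nonzero branches, using the 1-based state numbers of the figure.
allowed = {
    (i + 1, j + 1)
    for i, row in enumerate(A)
    for j, value in enumerate(row)
    if value > 0
}
```

The set `allowed` lists exactly the branches a left-to-right model of this size permits: each state's self-loop and its move to the next state.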

[0016] The HMM parameters of each phoneme are defined as follows, in accordance with the model structure of FIG. 4B.

State transition probability: $a_{ij}$, $(i,j) = (1,1), (1,2), (2,2), (2,3), (3,3), (3,4)$

Symbol output probability: $b_j(x) = \sum_{k=1}^{3} \omega_{jk} N(x \mid \mu_{jk}, \Sigma_{jk})$, $j = 1, 2, 3$

Here, the function $N(x \mid \mu_{jk}, \Sigma_{jk})$ is a Gaussian distribution function, and the coefficient $\omega_{jk}$ is the branch probability.

[0017] The Gaussian distribution function is given by

$$N(x \mid \mu_{jk}, \Sigma_{jk}) = \frac{1}{(2\pi)^{n/2} \, |\Sigma_{jk}|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_{jk})^{t} \, \Sigma_{jk}^{-1} \, (x - \mu_{jk}) \right)$$

Here, $x$ is the vector at a given time in the time series of speech feature-parameter vectors, and $n$ is its dimension. $\mu_{jk}$ and $\Sigma_{jk}$ are the parameters that characterize the Gaussian distribution function: the mean vector and the covariance matrix, respectively.
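
The symbol output probability of one state, a weighted sum of Gaussians, can be evaluated numerically as in the sketch below. The function names `gaussian_pdf` and `output_probability` are assumptions for illustration, and NumPy is used for the linear algebra.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density N(x | mean, cov)."""
    n = x.shape[0]
    diff = x - mean
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(cov))
    # solve(cov, diff) computes cov^{-1} diff without forming the inverse.
    exponent = -0.5 * diff @ np.linalg.solve(cov, diff)
    return float(np.exp(exponent) / norm)


def output_probability(x, weights, means, covs):
    """Mixture output probability b(x) = sum_k w_k N(x | mean_k, cov_k)."""
    return sum(
        w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs)
    )


# A standard 2-dimensional Gaussian evaluated at its mean: 1 / (2 * pi).
density_at_mean = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```

For long feature vectors the density underflows easily, so practical systems usually work with log densities; this sketch keeps the direct form to mirror the equation.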

[0018] On the basis of the above definitions, the process of learning the HMM model parameters is described in detail, following the procedure of the method of the present invention shown in FIG. 1. The model parameters to be learned are the mean vectors of the Gaussian distributions of the symbol output probability. In the following description, the state number $j$ is omitted from parameter subscripts, and parameters are indexed only by the component distribution number $k$ of the symbol output probability distribution.

[0019] The first learning step of the method of the present invention is the MAP/VFS method, a combination of maximum a posteriori probability estimation (the MAP method) and moving vector field smoothing (the VFS method). Its principle is shown below; details are disclosed in Japanese Patent Application No. 6-156238.

MAP/VFS method

[0020]

[Equation 1]

[0021] In the above series of equations, Equation (1) is the estimation formula for the mean vector in the MAP method, and Equation (2) is the estimation formula in the VFS method. In the MAP/VFS method, the mean vector $\hat{\mu}_k$ is first obtained from the given learning data by the MAP method using Equation (1). As Equation (1) shows, the estimate $\hat{\mu}_k$ is a weighted average of $\mu_k$, the prior knowledge from the initial model, and the sample mean of the new learning data $x_t$. The parameter $\tau_k$ controls the confidence placed in the prior knowledge relative to the sample data. In other words, each model that appears in the learning data is estimated by the MAP method from that data, with the corresponding initial model as the starting point.
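
The weighted-average behavior of the MAP estimate can be illustrated numerically. Equation (1) itself is not reproduced in this text, so the sketch below assumes a common textbook form, $\hat{\mu} = (\tau \mu + T \bar{x}) / (\tau + T)$ for $T$ samples with sample mean $\bar{x}$; the patent's exact formula may weight the samples differently. The function name `map_mean` is hypothetical.

```python
import numpy as np


def map_mean(prior_mean, samples, tau):
    """MAP-style mean estimate as a weighted average of prior and data.

    Assumed form (not taken from the patent's Equation (1)):
        (tau * prior_mean + T * sample_mean) / (tau + T)
    samples must contain at least one vector.
    """
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    count = samples.shape[0]
    sample_mean = samples.mean(axis=0)
    return (tau * np.asarray(prior_mean, dtype=float) + count * sample_mean) / (
        tau + count
    )


prior = np.array([0.0, 0.0])
data = np.array([[2.0, 2.0], [4.0, 4.0]])  # sample mean is [3, 3]

balanced = map_mean(prior, data, tau=2.0)     # prior and data weighted equally
data_only = map_mean(prior, data, tau=0.0)    # no trust in the prior
prior_heavy = map_mean(prior, data, tau=1e6)  # almost all trust in the prior
```

A larger `tau` keeps the estimate close to the initial model, which is the safer choice when only a handful of samples is available for a component.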

[0022] Because the amount of learning data is small, the MAP method, being within-class learning, cannot learn the mean vectors of the component distributions of every model; component distributions of unlearned models always remain. In the VFS method, the mean vectors of these unlearned component distributions are obtained by the interpolation/extrapolation process shown in Equation (2). In addition, since the amount of learning data is small, the MAP estimates are considered to contain statistical estimation error, and they are corrected by the smoothing process of the VFS method, as shown in Equation (2). FIG. 2 gives a geometric illustration of the VFS method, which treats the change in a mean vector caused by learning as a movement in the acoustic parameter space. The upper part of FIG. 2 shows the result of within-class learning by the MAP method: MAP estimates are obtained for some mean vectors but not for others.

【００２３】[0023] The lower-left diagram shows the interpolation process. For each mean vector learned by the MAP method, the difference m_k = (μ^_k − μ_k) between the mean vector after and before learning is regarded as a movement vector. The movement vector m_p for an unlearned mean vector μ_p is obtained by linear interpolation of the neighboring movement vectors m₁ to m₄. The estimated movement vector m_p is then added to the initial mean vector μ_p to obtain the estimate μ^_p of the mean vector after learning.

The lower-right diagram, on the other hand, shows the smoothing process. The movement vector m_q of a mean vector μ^_q learned by the MAP method is smoothed by linear interpolation from the neighboring movement vectors m₁ to m₄, giving the smoothed movement vector m_q′, and the initial mean vector μ_q is moved by m_q′. In this case, the movement vector m_q of the mean vector μ^_q being corrected is itself included in the linear interpolation.

In both interpolation and smoothing, the weight of each movement vector in the linear interpolation is given by the Gaussian window function of the distance between mean vectors (usually the Euclidean distance) shown in equation (4). The parameter s is a smoothing parameter that controls how strongly the linear interpolation depends on neighboring movement vectors. In this way, the MAP/VFS method learns the mean vectors for the element distributions of all models even though the training data is limited. Performing the smoothing process as described is preferable, but MAP/VFS learning may also be ended after the interpolation process.
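The interpolation and smoothing steps of the VFS method described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patent's implementation: equation (4) is not reproduced in this text, so the window is assumed to be exp(−‖a − b‖² / s), every learned mean vector is treated as a "neighbor", and the names `gaussian_weight`, `blended_movement`, and `vfs` are hypothetical.

```python
import math

def gaussian_weight(a, b, s):
    # ASSUMPTION: window shape exp(-||a - b||^2 / s); equation (4) may differ.
    dist2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-dist2 / s)

def blended_movement(target, anchors, movements, s):
    # Weighted average of movement vectors. Each weight depends on the distance
    # between the target mean and the initial mean the movement belongs to.
    weights = [gaussian_weight(target, a, s) for a in anchors]
    total = sum(weights)  # can underflow to 0 if s is tiny relative to distances
    return [sum(w * m[i] for w, m in zip(weights, movements)) / total
            for i in range(len(target))]

def vfs(initial_means, map_means, s, smooth=True):
    # initial_means: id -> initial mean vector (all element distributions)
    # map_means:     id -> MAP-estimated mean vector (learned subset only)
    learned = list(map_means)
    anchors = [initial_means[k] for k in learned]
    # Movement vector m_k = MAP mean minus initial mean.
    movements = [[post - pre for post, pre in zip(map_means[k], initial_means[k])]
                 for k in learned]
    result = {}
    for key, mu in initial_means.items():
        if key in map_means and not smooth:
            result[key] = list(map_means[key])  # keep the raw MAP estimate
            continue
        # Unlearned means are interpolated; learned means are smoothed, and
        # their own movement vector takes part in the weighted average.
        move = blended_movement(mu, anchors, movements, s)
        result[key] = [x + d for x, d in zip(mu, move)]
    return result

initial = {"a": [0.0, 0.0], "b": [2.0, 0.0], "p": [1.0, 0.0]}
map_est = {"a": [1.0, 0.0], "b": [2.0, 1.0]}  # "p" has no training data
adapted = vfs(initial, map_est, s=1.0, smooth=False)
print(adapted["p"])
```

In this toy case the unlearned mean "p" lies midway between the two learned means, so it receives the average of their two movement vectors. A larger `s` makes distant movement vectors count more, which corresponds to stronger smoothing.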

【００２４】[0024] Next, the principle of the second method, identification error minimization learning, is described below.

Identification error minimization learning method

【００２５】[0025]

【数２】 [Equation 2]

【００２６】[0026] In identification error minimization learning, as shown in equation (5), the log-likelihood log[L(X)], which is used to judge similarity in HMM-based speech recognition, serves as the discriminant function g_c(X, Λ). The likelihood of data X under the model with parameter set Λ is obtained by the HMM likelihood computation. The number of identification errors, which is the quantity optimized by this learning method, is defined through the loss function l(d_c). The difference d_c(X, Λ) between the log-likelihood g_c(X, Λ) of the correct-class model and the geometric mean G_c(X, Λ) of the log-likelihoods of the near-miss incorrect-class models is defined, and the effective number of identification errors is obtained from the sigmoid function of d_c (equation (8)). For example, when g_c(X, Λ) is much larger than G_c(X, Λ), there is no identification error, so the loss function takes the value l(d_c) = 0. Under the opposite condition an identification error has occurred, so l(d_c) = 1.

The learning problem is to find the model parameters Λ^- that minimize the loss function, and these are found by the steepest descent method shown in equation (9). The learning step size ε_t is set to a small positive number, and the optimal parameters are found iteratively. Equation (9) is a recurrence for the HMM model parameter set Λ. For the mean vector μ_k, ∇l_c(X; Λ^-) in equation (9) is obtained from equations (10), (11), (12), and (13), and Λ in equation (9) is replaced with μ_k.
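The sigmoid loss and the steepest descent update described above can be illustrated with a toy Python sketch. Equations (5) to (13) are not reproduced in this text, so everything below is an assumption made for illustration: the misclassification measure is taken as d = G − g (rival minus correct log-likelihood), there is a single rival class so that G is simply its log-likelihood, the models are one-dimensional unit-variance Gaussians rather than HMMs, and the names `sigmoid_loss`, `log_likelihood`, and `descent_step` are hypothetical.

```python
import math

def sigmoid_loss(g_correct, g_rival, alpha=1.0):
    # ASSUMPTION: d = g_rival - g_correct, so the loss tends to 0 when the
    # correct class wins by a wide margin and to 1 when it loses badly.
    d = g_rival - g_correct
    return 1.0 / (1.0 + math.exp(-alpha * d))

def log_likelihood(x, mean):
    # Toy stand-in for the HMM log-likelihood: a unit-variance 1-D Gaussian
    # with the constant term dropped.
    return -0.5 * (x - mean) ** 2

def descent_step(x, mean_correct, mean_rival, step, alpha=1.0):
    # One steepest descent update of both means for a single sample x.
    loss = sigmoid_loss(log_likelihood(x, mean_correct),
                        log_likelihood(x, mean_rival), alpha)
    slope = alpha * loss * (1.0 - loss)         # derivative of the sigmoid w.r.t. d
    grad_correct = -slope * (x - mean_correct)  # dd/dmu_correct = -(x - mu_correct)
    grad_rival = slope * (x - mean_rival)       # dd/dmu_rival   =  (x - mu_rival)
    return mean_correct - step * grad_correct, mean_rival - step * grad_rival

x = 1.0
before = sigmoid_loss(log_likelihood(x, 0.0), log_likelihood(x, 2.0))
mu_c, mu_r = descent_step(x, mean_correct=0.0, mean_rival=2.0, step=0.1)
after = sigmoid_loss(log_likelihood(x, mu_c), log_likelihood(x, mu_r))
print(before, after)
```

The update pulls the correct-class mean toward the sample and pushes the rival mean away from it, which lowers the smoothed error count for that sample. Because the sigmoid slope is largest near d = 0, samples close to the decision boundary drive most of the change.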

【００２７】[0027] In the above, adaptation by the maximum a posteriori probability estimation / moving vector field smoothing method takes multiple adaptation training data items as input one at a time and performs MAP/VFS learning using each item. Alternatively, the speaker-independent model may be adapted using the first training data item, and the resulting adapted model may then be adapted further using the second and subsequent data items.

【００２８】[0028] In the method of this invention, the model obtained by the MAP/VFS method, which is the first learning method, is used as the initial model for this identification error minimization learning. Accordingly, learning is performed with μ_k in equations (5) to (13) replaced by the mean vector μ~_k obtained by the MAP/VFS method, and the resulting estimate μ−_k is the mean vector of the finally obtained learned model. The training data is exactly the same as the data used in the first learning method.

【００２９】[0029] Next, an experimental example based on computer simulation is described. With a speaker-independent model as the initial model, and using small training sets of 10, 20, and 50 words, the phoneme recognition performance of the acoustic models was compared between the method of this invention (hereinafter MAP/VFS+MCE) and other methods. Four other learning methods were considered: identification error minimization learning (hereinafter MCE), maximum a posteriori probability estimation (hereinafter MAP), MAP+MCE, and MAP/VFS.

The initial model used in the comparative evaluation of recognition performance was a phoneme-context-dependent model trained by maximum likelihood estimation on the commercially available ATR speech database, using the 216 phoneme-balanced words and the even-numbered words of the 5240 important words, for 16 speakers. Its structure is a hidden Markov network. The HMM has 450 states, 924 element distributions for the symbol output probabilities, and 2 mixtures. For the small training sets, 50 words were selected arbitrarily from the odd-numbered words of the ATR 5240 words, and training data of 10, 20, and 50 words was created; the remaining word data was used as evaluation data. The speakers were the male speaker MMY and the female speaker FYN.

FIG. 3 compares the error rates and the misrecognition improvement rates in phoneme recognition. FIG. 3 shows that MAP/VFS+MCE, the method of this invention, gives higher recognition performance than any of the other methods. For example, for 20-word training the misrecognition improvement rate is 4.3% with MCE and 21.8% with MAP/VFS+MCE, so the acoustic model trained by the method of this invention performs about five times better than that of MCE. For 50 words, the performance is about three times higher. Compared with MAP/VFS as well, the misrecognition improvement rate is about 3% and 5% higher for 20 and 50 words, respectively.
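The "about five times" figure for 20-word training can be checked directly from the two improvement rates quoted above. The short Python sketch below does this arithmetic. The helper `improvement_rate` shows one common way such a rate is defined (relative reduction of the error rate, in percent); that definition is an assumption, since this text does not state the one used for FIG. 3.

```python
def improvement_rate(baseline_error, adapted_error):
    # ASSUMED definition: relative reduction of the error rate, in percent.
    return 100.0 * (baseline_error - adapted_error) / baseline_error

# Improvement rates quoted in the text for 20-word training.
mce_20 = 4.3        # MCE alone
proposed_20 = 21.8  # MAP/VFS+MCE

ratio_20 = proposed_20 / mce_20
print(f"{ratio_20:.2f}")  # prints 5.07
```

The ratio of roughly 5.07 agrees with the statement that the proposed method is about five times better than MCE alone at 20 words.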

【００３０】[0030]

【発明の効果】[Effects of the Invention] As explained above, the method of this invention has the effect of being able to train acoustic models with higher recognition performance than was previously attainable. This is thought to be because, even when the training data is small, combining learning methods with different optimization criteria allows each method to extract from that data, from its own viewpoint, the information needed to build an acoustic model, and the model can then be trained by combining that information.

【図面の簡単な説明】[Brief description of drawings]

【図１】FIG. 1 is a flowchart showing the processing in the method of this invention.

【図２】FIG. 2 is a conceptual diagram showing the principle of the moving vector field smoothing method.

【図３】FIG. 3 is a diagram comparing the speech recognition performance of the method of this invention with that of conventional methods.

【図４】FIG. 4A is a diagram for explaining a speech recognition processing method using hidden Markov models, and FIG. 4B is a diagram showing an example of an HMM model structure.

Claims

【特許請求の範囲】[Claims]

【請求項１】1. A model learning method for pattern recognition, in which an initial model prepared in advance is trained using a small amount of training data, the similarity to an input pattern is computed using the resulting model, and the category represented by the model giving the highest similarity is taken as the recognition result, characterized in that the initial model is trained, using a small amount of training data, by a learning method that combines the maximum a posteriori probability estimation method and the moving vector field smoothing method, and the model obtained by this training is then trained by the identification error minimization learning method using said small amount of training data.

【請求項２】2. The model learning method according to claim 1, characterized in that the model used in said pattern recognition is a hidden Markov model.

【請求項３】3. The model learning method according to claim 2, characterized in that said hidden Markov model is a mixture continuous hidden Markov model.

【請求項４】4. The model learning method according to claim 3, characterized in that, in said hidden Markov model, the model parameters to be learned are the mean vectors.

【請求項５】5. The model learning method according to any one of claims 1 to 4, characterized in that, in the learning method combining said maximum a posteriori probability estimation method and said moving vector field smoothing method, the models that are learning targets in said training data are obtained by the maximum a posteriori probability estimation method using that training data, with the corresponding initial models as initial values, and the models that are not learning targets in said training data are obtained by interpolation (interpolation/extrapolation) processing under the moving vector field smoothing method, using the models obtained by said maximum a posteriori probability estimation method and said initial models.

【請求項６】6. The model learning method according to claim 5, characterized in that said learning-target models obtained by said maximum a posteriori probability estimation method are corrected by the smoothing process of said moving vector field smoothing method.