JP3439700B2

JP3439700B2 - Acoustic model learning device, acoustic model conversion device, and speech recognition device

Info

Publication number: JP3439700B2
Application number: JP30675699A
Authority: JP
Inventors: 優高野
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1999-10-28
Filing date: 1999-10-28
Publication date: 2003-08-25
Anticipated expiration: 2019-10-28
Also published as: JP2001125589A

Description

【発明の詳細な説明】Detailed Description of the Invention

【０００１】[0001]

【発明の属する技術分野】本発明は、複数の話者の発声
音声データに基づいて音響モデルを学習して生成する音
響モデル学習装置、所定の音響モデルに基づいて、各状
態における分岐数を減少させるように変換する音響モデ
ル変換装置、並びに上記音響モデル学習装置又は上記音
響モデル変換装置によって生成又は変換された音響モデ
ルを用いて音声認識する音声認識装置に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to an acoustic model learning device that learns and generates an acoustic model based on speech data uttered by a plurality of speakers, to an acoustic model conversion device that converts a predetermined acoustic model so as to reduce the number of branches in each state, and to a speech recognition device that performs speech recognition using an acoustic model generated or converted by the acoustic model learning device or the acoustic model conversion device.

【０００２】[0002]

【従来の技術】現在広く使われている隠れマルコフモデ
ル（以下、ＨＭＭという。）などの音素モデルでは、音
響的特徴を表す「状態」が時系列順に接続したものを使
って、各音素の音響的特徴の変遷を表現している。ここ
で、音素モデルの各状態は該当する音響的特徴を統計的
に表現するために、固有の確率分布を持っている。入力
音声中の各瞬間における音響特徴量のこの確率分布に対
する確率がその瞬間の該当状態の尤度となる。2. Description of the Related Art In phoneme models in wide use today, such as the hidden Markov model (hereinafter, HMM), the transition of the acoustic characteristics of each phoneme is expressed by "states" representing acoustic characteristics, connected in chronological order. Each state of the phoneme model has its own probability distribution in order to represent the corresponding acoustic characteristics statistically. The probability of the acoustic feature at each instant of the input speech under this distribution is the likelihood of the corresponding state at that instant.

【０００３】確率分布としては、正規分布（ガウス分
布）がよく用いられる。同一の音素であっても性別、発
話速度その他の要因によって音響的特徴が大きく異なる
ことがあるため、ほとんどの場合、特に不特定話者用の
音響モデルの場合は必ずと言っていいほど、１つの状態
に関する確率分布としていくつかのガウス分布を一定比
率で混合した混合分布（混合ガウス分布）を使用する。A normal distribution (Gaussian distribution) is often used as the probability distribution. Even for the same phoneme, the acoustic characteristics may differ greatly depending on gender, speech rate, and other factors. Therefore, in most cases, and almost always in the case of acoustic models for unspecified speakers, a mixture distribution (Gaussian mixture distribution) in which several Gaussian distributions are mixed at fixed ratios is used as the probability distribution for one state.
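As an illustration of how a state with a Gaussian mixture scores one feature vector, the following is a minimal sketch. It assumes diagonal covariances and uses hypothetical function names (`log_gaussian_diag`, `log_mixture_likelihood`) that do not come from the patent.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_likelihood(x, weights, means, variances):
    """Log likelihood of x under a Gaussian mixture, computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    # Subtracting the largest term keeps the exponentials numerically stable.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

The mixture weights play the role of the fixed mixing ratios mentioned above; the state likelihood at an instant is this value evaluated at that instant's feature vector.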

【０００４】少ない学習音声で信頼性の高い音響モデル
を構築するために、状態は、異なる音素又は音素環境間
で共有されることがある。また、混合分布の要素となる
正規分布も異なる混合分布間又は異なる状態間で共有さ
れることがある。States may be shared between different phonemes or phoneme environments in order to build reliable acoustic models with less training speech. Further, the normal distribution that is an element of the mixture distribution may be shared between different mixture distributions or between different states.

【０００５】現在、この形態の音響モデルを生成する方
法は大まかに、以下の２つの手順からなる。（１）各音素（ここで、音素環境で細分することが多
い。）に、状態共有構造を含め状態系列を割り当てる
（以下、状態分割処理という。）。（２）各状態に、混合分布を割り当てる（以下、混合分
布分割処理という。）。At present, methods for generating an acoustic model of this form roughly consist of the following two procedures. (1) A state sequence, including its state-sharing structure, is assigned to each phoneme (phonemes are often subdivided by phoneme environment); this is hereinafter referred to as state division processing. (2) A mixture distribution is assigned to each state; this is hereinafter referred to as mixture distribution division processing.

【０００６】状態分割処理で生成される状態連結構造は
完全に人為的に、例えば各音素３状態で均一で共有なし
等で決定することが多いが、適切な構造は現在のところ
知られていない。そこで、学習用の音声データ主導で状
態接続構造を作成する方法として、鷹見らによる逐次状
態分割法（以下、ＳＳＳ法という。）や、マリ・オステ
ンドルフらによる最尤逐次状態分割法（以下、ＭＬ−Ｓ
ＳＳ法という。）などが知られている。The state connection structure generated by the state division process is often determined completely artificially, for example, in each phoneme 3 state being uniform and not shared, but an appropriate structure is not known at present. . Therefore, as a method of creating a state connection structure led by voice data for learning, the sequential state division method by Takami et al. (Hereinafter referred to as SSS method) and the maximum likelihood sequential state division method by Mari Ostendorff et al. ML-S
It is called SS method. ) Is known.

【０００７】また、混合分布分割処理は、各状態に属す
る音声サンプルをクラスタリング（すなわち、類似性に
従って分類処理を行う。）する方法で行なわれる。これ
についても適切に定める方法が知られていないため、通
常の場合、適当な自然数Ｎ（２〜１００程度）を決め
て、例えば公知のｋ−ｍｅａｎｓ法、公知のＬＢＧ（Li
nde-Buzo-Gray）アルゴリズム等で状態毎に音声サンプ
ル群をＮ分割する方法が取られる。The mixture distribution division processing is performed by clustering the speech samples belonging to each state (that is, classifying them according to similarity). Since no method of determining this appropriately is known either, the usual approach is to choose a suitable natural number N (about 2 to 100) and divide the group of speech samples for each state into N, for example by the known k-means method or the known LBG (Linde-Buzo-Gray) algorithm.
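The conventional N-way division of a state's samples can be sketched with plain k-means (Lloyd's algorithm). This is an illustrative sketch only: the function name `kmeans` and the evenly spaced initialization are assumptions, not details taken from the patent.

```python
def kmeans(samples, n_clusters, n_iter=20):
    """Lloyd's k-means on a list of equal-length float vectors.

    Initial centroids are taken at evenly spaced positions in the sample
    list (a deterministic, illustrative choice).
    """
    step = len(samples) / n_clusters
    centroids = [list(samples[int(i * step)]) for i in range(n_clusters)]
    labels = [0] * len(samples)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, x in enumerate(samples):
            labels[idx] = min(
                range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
            )
        # Update step: each centroid becomes the mean of its members.
        for k in range(n_clusters):
            members = [x for x, lab in zip(samples, labels) if lab == k]
            if members:  # keep the old centroid if a cluster emptied
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each resulting cluster would then supply the mean and variance of one mixture component. The sketch makes the criticized point visible: the same `n_clusters` is applied to every state regardless of how variable its samples are.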

【０００８】[0008]

【発明が解決しようとする課題】音響モデルを生成する
方法の従来例においては、大きく分けて以下の２つの問
題点がある。The conventional example of the method for generating the acoustic model has the following two problems.

【０００９】（１）混合分布分割処理において、全状態
に均一の混合数Ｎを割り当てる根拠はない。すなわち、
混合分布は、該当音素の音響的な変化形を大まかに分類
したものであることが望まれるが、音響モデルの性能を
最適にするような混合数が全音素に対して均一であると
は思えない。むしろ、個人差が大きく出る母音と、それ
ほどでもない子音では最適な混合数が異なると考えるの
が自然である。そして、特定音素の必要以上の細分類
は、該当音素の存在領域を広げることから、該当音素以
外の音響に高い尤度を与える結果を引き起こす可能性が
あり、他音素の混合数とのつりあいを考慮した数でなけ
ればならない。実験的にも、最近の研究で知られるよう
に、識別性能等を基準に複数混合数を用いる試みが行な
われており、認識性能向上が見られる。(1) In the mixture distribution division processing, there is no basis for assigning a uniform number of mixture components N to all states. The mixture distribution should ideally be a rough classification of the acoustic variants of the phoneme concerned, but it is hard to believe that the number of mixture components that optimizes the performance of the acoustic model is uniform across all phonemes. Rather, it is natural to think that the optimal number differs between vowels, which show large individual differences, and consonants, which show less. Moreover, subdividing a particular phoneme more finely than necessary widens the region that the phoneme occupies and may therefore give a high likelihood to sounds other than that phoneme, so the number must be chosen in balance with the numbers of mixture components of the other phonemes. Experimentally as well, recent research has attempted to use multiple numbers of mixture components based on criteria such as discrimination performance, and improved recognition performance has been observed.

【００１０】（２）音響モデルを生成する方法の従来例
の手順は、状態分割処理を行った後に混合分布分割処理
を行うという順序を堅持しているが、これには何の正当
性もない。上記問題点（１）においては、音響的特徴を
音素毎又は音素環境毎に分類するという処理が行なわれ
る。また、混合分布分割処理では、状態分割処理で生成
された各状態を、音響的特徴により細分割する。これ
は、異なる音素Ａ，Ｂ間の音響的差異が同音素の変形
Ａ，Ａ’間の差異より大きいという思想を結果的に表現
しているが、そのような証拠はない。日本語に見られる
無声化（典型的には無声子音間にはさまれたｉ，ｕ）の
有無等による音響的差異は異なる音素間の音響的差異に
匹敵するほど大きく、むしろこの分割を先行させる方
が、学習用音響サンプルの効率的利用の点で有利であ
る。上記の従来例の手順では、このような音素又は音素
環境によらない分割を行なうことができないという問題
点があった。(2) The conventional procedure for generating an acoustic model adheres to the order of performing the state division processing first and the mixture distribution division processing second, but there is no justification for this order. In the above problem (1), acoustic features are classified for each phoneme or each phoneme environment. The mixture distribution division processing then subdivides each state generated by the state division processing according to acoustic features. This ordering implicitly expresses the idea that the acoustic difference between different phonemes A and B is greater than the difference between variants A and A' of the same phoneme, but there is no evidence for that. The acoustic difference caused by, for example, the presence or absence of devoicing found in Japanese (typically of i and u between voiceless consonants) is large enough to be comparable to the acoustic difference between different phonemes, and performing this division first is in fact advantageous for the efficient use of the training acoustic samples. The conventional procedure has the problem that such a division, independent of phoneme or phoneme environment, cannot be performed.

【００１１】さらに、シングルパスのＨＭＭのみを用い
て音声認識する音声認識装置においては、マルチパスの
ＨＭＭを用いて音声認識することはできず、少なくとも
ＨＭＭの各状態からの分岐数を減少させて、好ましくは
シングルパスのＨＭＭに変換する必要がある。Furthermore, a speech recognition device that recognizes speech using only single-path HMMs cannot recognize speech using a multipath HMM; it is necessary at least to reduce the number of branches from each state of the HMM, and preferably to convert it into a single-path HMM.

【００１２】本発明の目的は以上の問題点を解決し、生
成された音響モデルを用いて音声認識するときの音声認
識率を従来例に比較して高めることができる音響モデル
を生成することができる音響モデル学習装置、音響モデ
ルの各状態からの分岐数を減少させるように変換するこ
とができる音響モデル変換装置、及び上記音響モデル学
習装置又は音響モデル変換装置を用いて音声認識するこ
とにより改善された音声認識率で音声認識することがで
きる音声認識装置を提供することにある。An object of the present invention is to solve the above problems and to provide an acoustic model learning device capable of generating an acoustic model that gives a higher speech recognition rate than the conventional example when used for speech recognition, an acoustic model conversion device capable of converting an acoustic model so as to reduce the number of branches from each state, and a speech recognition device capable of recognizing speech with an improved recognition rate by using the acoustic model learning device or the acoustic model conversion device.

【００１３】[0013]

【課題を解決するための手段】本発明に係る請求項１記
載の音響モデル学習装置は、複数の話者の発声音声デー
タに基づいて音響モデルを学習して生成するモデル学習
手段を備えた音響モデル学習装置において、上記モデル
学習手段は、複数の特定話者の発声音声データに基づい
て、所定の学習アルゴリズムを用いて単一ガウス分布の
音響モデルを生成した後、上記生成した単一ガウス分布
の音響モデルのすべての状態に対してコンテキスト方
向、時間方向及びパス方向に分割したときの分割前後の
尤度期待値の増加量を計算し、上記すべての状態に対し
て計算された尤度期待値の増加量のうち最大の尤度期待
値の増加量を有する状態を検索して分割することを繰り
返すことにより、音響モデルを学習して生成することを
特徴とする。The acoustic model learning device according to claim 1 of the present invention comprises model learning means for learning and generating an acoustic model based on speech data uttered by a plurality of speakers, wherein the model learning means generates a single-Gaussian acoustic model using a predetermined learning algorithm based on speech data uttered by a plurality of specific speakers, then calculates, for every state of the generated single-Gaussian acoustic model, the increase in the expected likelihood before and after division in the context direction, the time direction, and the path direction, and learns and generates the acoustic model by repeatedly searching for and dividing the state having the largest increase in expected likelihood among the increases calculated for all the states.

【００１４】また、請求項２記載の音響モデル学習装置
は、請求項１記載の音響モデル学習装置において、上記
モデル学習手段は、複数の特定話者の発声音声データに
基づいて、所定の学習アルゴリズムを用いて単一ガウス
分布の音響モデルを生成する初期モデル生成手段と、上
記初期モデル生成手段によって生成された単一ガウス分
布の音響モデルにおいて、１つの状態をコンテキスト方
向、時間方向及びパス方向に分割したときに、最大の尤
度期待値の増加量を有する状態を検索する検索手段と、
上記検索手段によって検索された最大の尤度期待値の増
加量を有する状態を、最大の尤度期待値の増加量に対応
するコンテキスト方向、時間方向、又はパス方向に分割
した後、所定の学習アルゴリズムを用いて単一ガウス分
布の音響モデルを生成する生成手段と、上記生成手段の
処理と上記検索手段の処理を、単一ガウス分布の音響モ
デル内の状態を分割することができなくなるまで又は単
一ガウス分布の音響モデル内の状態数が予め決められた
分割数となるまで繰り返すことにより、音響モデルを学
習して生成する制御手段とを備えたことを特徴とする。The acoustic model learning device according to claim 2 is the acoustic model learning device according to claim 1, wherein the model learning means comprises: initial model generating means for generating a single-Gaussian acoustic model using a predetermined learning algorithm based on speech data uttered by a plurality of specific speakers; search means for searching the single-Gaussian acoustic model generated by the initial model generating means for the state having the largest increase in expected likelihood when one state is divided in the context direction, the time direction, and the path direction; generating means for dividing the state found by the search means in the context direction, the time direction, or the path direction, whichever corresponds to the largest increase in expected likelihood, and then generating a single-Gaussian acoustic model using a predetermined learning algorithm; and control means for learning and generating the acoustic model by repeating the processing of the generating means and the processing of the search means until no state in the single-Gaussian acoustic model can be divided or until the number of states in the single-Gaussian acoustic model reaches a predetermined number of divisions.
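The search, divide, and repeat control flow described in claims 1 and 2 can be sketched as a greedy loop. The sketch below is an assumption-laden outline rather than the claimed device: `grow_model`, `best_split`, and `apply_split` are hypothetical names, states are treated as opaque objects, and the retraining step is left as a comment.

```python
def grow_model(states, best_split, apply_split, max_states):
    """Greedy successive state splitting.

    states      : list of state objects (opaque here)
    best_split  : function(state) -> (gain, direction), or None if the state
                  cannot be split; direction is "context", "time", or "path"
    apply_split : function(state, direction) -> (state_a, state_b)
    max_states  : stop once the model has this many states
    """
    states = list(states)
    while len(states) < max_states:
        candidates = []
        for index, state in enumerate(states):
            proposal = best_split(state)
            if proposal is not None:
                gain, direction = proposal
                candidates.append((gain, index, direction))
        if not candidates:
            break  # no state can be split any further
        # Pick the state whose best split gives the largest gain.
        gain, index, direction = max(candidates)
        state_a, state_b = apply_split(states[index], direction)
        states[index:index + 1] = [state_a, state_b]
        # A full implementation would retrain the affected states here
        # (the source names a learning algorithm for this); omitted.
    return states
```

For simplicity the sketch re-evaluates every state on each pass; caching the per-state results and refreshing only the states touched by a split would give the same selections at lower cost.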

【００１５】さらに、本発明に係る請求項３記載の音響
モデル変換装置は、所定の音響モデルに基づいて、各状
態における分岐数を減少させるように上記音響モデルを
変換する変換処理を行う音響モデル変換装置であって、
所定の音響モデルに基づいて、各音素並びに対して、す
べての処理対象状態を、処理対象状態よりも先行する先
行状態と、処理対象状態に後続する後続状態との毎に分
類し、分類後の各集合に対して、対応する集合に属する
すべての古い状態のすべてのガウス分布を構成要素とし
て有する新しい状態を生成する分類手段と、上記分類手
段によって分類された後の各集合において、処理対象状
態が一致する新しい状態をすべて合併することにより、
合併後の新しい状態の集合を生成する合併手段と、上記
合併手段による合併前の古い状態間のすべての遷移を、
合併前の古い始点を含む新しい状態から、合併前の古い
終点を含む新しい状態への遷移として設定することによ
り、処理後のガウス分布を含む音響モデルを生成する遷
移設定手段とを備えたことを特徴とする。Further, the acoustic model conversion device according to claim 3 of the present invention performs conversion processing that converts a predetermined acoustic model so as to reduce the number of branches in each state, and comprises: classification means for classifying, based on the predetermined acoustic model and for each phoneme sequence, all processing target states according to the preceding state that precedes the processing target state and the succeeding state that follows it, and for generating, for each set after classification, a new state having as its components all the Gaussian distributions of all the old states belonging to that set; merging means for generating a set of merged new states by merging, across the sets classified by the classification means, all new states whose processing target states coincide; and transition setting means for generating an acoustic model including the processed Gaussian distributions by setting every transition between old states before merging as a transition from the new state containing the old start point to the new state containing the old end point.

【００１６】また、請求項４記載の音響モデル学習装置
は、請求項１又は２記載の音響モデル学習装置と、上記
音響モデル学習装置によって生成された音響モデルに対
して変換処理を行う請求項３記載の音響モデル変換装置
とを備えたことを特徴とする。The acoustic model learning device according to claim 4 comprises the acoustic model learning device according to claim 1 or 2 and the acoustic model conversion device according to claim 3, which performs conversion processing on the acoustic model generated by the acoustic model learning device.

【００１７】さらに、本発明に係る請求項５記載の音声
認識装置は、入力される発声音声文の音声信号に基づい
て所定の音響モデルを参照して音声認識する音声認識手
段を備えた音声認識装置において、上記音声認識手段
は、請求項１、２又は４に記載の音響モデル学習装置、
もしくは請求項３記載の音響モデル変換装置によって生
成又は変換された音響モデルを参照して音声認識するこ
とを特徴とする。Further, the speech recognition device according to claim 5 of the present invention comprises speech recognition means for recognizing speech by referring to a predetermined acoustic model based on the speech signal of an input spoken sentence, wherein the speech recognition means recognizes speech by referring to an acoustic model generated or converted by the acoustic model learning device according to claim 1, 2, or 4 or by the acoustic model conversion device according to claim 3.

【００１８】[0018]

【発明の実施の形態】以下、図面を参照して本発明に係
る実施形態について説明する。DETAILED DESCRIPTION OF THE INVENTION Embodiments of the present invention will be described below with reference to the drawings.

【００１９】図１は、本発明に係る一実施形態である音
声認識システムの構成を示すブロック図である。この実
施形態は、通常よく知られている音声認識システムに、
音響モデル学習部３１と、音響モデル変換部３２とを追
加したことを特徴としている。FIG. 1 is a block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention. This embodiment is characterized in that an acoustic model learning unit 31 and an acoustic model conversion unit 32 are added to a commonly known speech recognition system.

【００２０】先の「発明が解決しようとする課題」の項
で述べた問題点を解決するために、状態分割と混合分布
分割をこの順でなく同時に行うことを考える。考えなけ
ればならないことは、状態分割の効率と混合分布分割の
効率を同一尺度で計ることである。これにより、両者が
従来の決定木上での従来例の状態分割法で統一的に扱え
るようになる。尺度としては、従来のＳＳＳ法における
分散値によるもの、ＭＬ−ＳＳＳ法における尤度期待値
の増分によるもの等がある。認識対象に真の分布が存在
すると仮定した場合、作成したモデルと真の分布との近
さが尤度の期待値で表せることから、ＭＬ−ＳＳＳ法で
採用している尤度期待値の増分を尺度に逐次分割を行う
のが真の分布に近いモデルを得るという点で妥当と思わ
れる。そこで、本実施形態においては、ＭＬ−ＳＳＳ法
をベースに、状態分割処理及び混合分布分割処理を含め
た逐次状態分割処理を、尤度期待値の増分の尺度で行う
装置及び方法を提案する。本装置及び方法の目的は、上
記従来法の問題点を解決し、状態分割及び混合分布分割
を適切な順序で行うことである。To solve the problems described in the section "Problems to be Solved by the Invention" above, consider performing state division and mixture distribution division simultaneously rather than in this fixed order. What is needed is to measure the efficiency of state division and that of mixture distribution division on the same scale. Both can then be handled uniformly by the conventional state division method on a conventional decision tree. Possible measures include the variance used in the conventional SSS method and the increment of the expected likelihood used in the ML-SSS method. Assuming that a true distribution exists for the recognition target, the closeness between the constructed model and the true distribution can be expressed by the expected value of the likelihood, so successive division using the increment of the expected likelihood adopted in the ML-SSS method as the measure appears appropriate for obtaining a model close to the true distribution. The present embodiment therefore proposes an apparatus and method, based on the ML-SSS method, that perform successive state division, encompassing both state division processing and mixture distribution division processing, using the increment of the expected likelihood as the measure. The purpose of the apparatus and method is to solve the problems of the conventional method described above and to perform state division and mixture distribution division in an appropriate order.

【００２１】本実施形態においては、ＭＬ−ＳＳＳ法中
の状態分割の２領域であるコンテキスト方向及び時間方
向（図４及び図５参照。）に加え、ある状態を表す音響
から離れた距離の方向である（言い換えれば、ある音響
状態からの遷移経路の方向）パス方向の分割（図６参
照。）という領域を用意する。なお、図４以降のＨＭＭ
における音素表記において、例えば、ａ／ｋ／ａｉと
いう表記は、ａ／ｋ／ａと、ａ／ｋ／ｉとを含むことを
意味する。前記２領域が、音素及び音素環境毎の状態分
割であったのに対して、パス方向の分割は音素に依存し
ない状態分割である。パス方向の分割は、元の状態と同
じ音素を表す状態を並列に生成する。このパス方向の分
割を用いることで、単一の音素が複数の状態系列を持つ
ことが可能になる。生成された音響モデルであるマルチ
パスモデルはそのままでも音声認識に使用可能である
が、音響モデル変換部３２を用いることにより、各状態
の分岐数を減少させ、好ましくは従来互換のシングルパ
スの音響モデルに変換することができる。In the present embodiment, in addition to the two division domains of the ML-SSS method, namely the context direction and the time direction (see FIGS. 4 and 5), a further domain is provided: division in the path direction (see FIG. 6), which is the direction of distance away from the sound that a state represents (in other words, the direction of the transition paths from an acoustic state). In the phoneme notation of the HMMs in FIG. 4 and subsequent figures, a notation such as a/k/ai means that a/k/a and a/k/i are both included. Whereas the first two domains divide states by phoneme and phoneme environment, division in the path direction is a state division that does not depend on the phoneme. Division in the path direction generates, in parallel, states that represent the same phoneme as the original state. Using this division, a single phoneme can have multiple state sequences. The resulting multipath acoustic model can be used for speech recognition as it is, but the acoustic model conversion unit 32 can reduce the number of branches of each state and preferably convert it into a conventional-compatible single-path acoustic model.

【００２２】まず、マルチパス状態分割アルゴリズムに
ついて説明する。従来のＭＬ−ＳＳＳ法は、状態分割の
繰り返しによってＨＭＭを生成するアルゴリズムであ
る。ＭＬ−ＳＳＳ法は、ＨＭＭにおける状態の１つを分
割する毎に状態数を１つずつ増大させる。分割される状
態は、コンテキスト方向の分割（図４参照。）及び時間
方向（図５参照。）の分割の各分割領域に関するその対
数尤度の期待値の利得によって選択される。ここで、長
さＴを有する観察された系列ｙ₁ ^Tの非観測系列ｓ ₁ ^Tによ
る尤度は、次式のように定義される。First, the multipath state division algorithm is described. The conventional ML-SSS method is an algorithm that generates an HMM by repeated state division. Each time it divides one of the states of the HMM, the number of states increases by one. The state to be divided is selected by the gain in the expected value of the log-likelihood for each division domain: division in the context direction (see FIG. 4) and division in the time direction (see FIG. 5). Here, the likelihood of an observed sequence $y_1^T$ of length $T$ together with the unobserved state sequence $s_1^T$ is defined by the following equation.

【００２３】[0023]

【数１】 [Equation 1]

【００２４】ここで、ｓ₁ ^T＝｛ｓ₁，ｓ₂，…，ｓ_T｝で
ある。対数尤度の期待値Ｑ（θ│θ⁽ ^p)）は、直前モデ
ルのパラメータセットθ^(p)を使用して次式のように計
算される。Here, $s_1^T = \{s_1, s_2, \ldots, s_T\}$. The expected value $Q(\theta \mid \theta^{(p)})$ of the log-likelihood is calculated by the following equation using the parameter set $\theta^{(p)}$ of the immediately preceding model.

【００２５】[0025]

【数２】ここで、[Equation 2] here,

【数３】[Equation 3] $a_{ss'} = p(s_t = s' \mid s_{t-1} = s, \theta_{A(s)})$

【数４】[Equation 4] $p(y_t \mid s, \theta_{B(s)}) \approx N(\mu_s, \Sigma_s)$

【数５】[Equation 5] $\gamma_t(s) = p(s_t = s \mid y_1^T, \theta^{(p)})$

【数６】[Equation 6] $\xi_t(s, s') = p(s_t = s', s_{t-1} = s \mid y_1^T, \theta^{(p)})$

【００２６】ここで、θ＝｛θ_A，θ_B｝であり、θ_A(s)
は状態ｓにおけるモデル全体の遷移確率すべての集合で
あり、また、θ_B(s)は状態ｓによる出力確率の計算のた
めのパラメータセットである。μ_s及びΣ_sはそれぞれ、
状態ｓのガウス分布の平均値ベクトル及び分散ベクトル
である。Here, $\theta = \{\theta_A, \theta_B\}$, where $\theta_{A(s)}$ is the set of all transition probabilities of the model at state $s$, and $\theta_{B(s)}$ is the parameter set for calculating the output probability of state $s$. $\mu_s$ and $\Sigma_s$ are the mean vector and the variance vector, respectively, of the Gaussian distribution of state $s$.
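For orientation, with the quantities defined in Equations 3 to 6, the joint likelihood and the expected log-likelihood of a standard HMM are usually written in the textbook form below. This is a hedged sketch that assumes the standard EM formulation; it is not a reproduction of the patent's Equations 1 and 2. Here $s_0$ denotes a fixed initial state, which is an assumption made to keep the product compact.

```latex
\begin{align}
p(y_1^T, s_1^T \mid \theta)
  &= \prod_{t=1}^{T} a_{s_{t-1} s_t}\, p(y_t \mid s_t, \theta_{B(s_t)}) \\
Q(\theta \mid \theta^{(p)})
  &= \sum_{s} \sum_{t} \gamma_t(s) \log p(y_t \mid s, \theta_{B(s)})
   + \sum_{s, s'} \sum_{t} \xi_t(s, s') \log a_{ss'}
\end{align}
```

The first sum collects the output-probability terms weighted by the state occupancies $\gamma_t(s)$, and the second collects the transition terms weighted by $\xi_t(s, s')$.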

【００２７】従来のＭＬ−ＳＳＳ法では、分割利得は、
θ^(p)による分割前後の対数尤度の期待値の差として計
算される。状態ｓ^*が２つの状態ｑ₀、ｑ₁に分割される
ときの分割利得Ｇ（ｓ^*，ｑ₀，ｑ₁）は、次式のように
計算される。In the conventional ML-SSS method, the division gain is calculated as the difference between the expected values of the log-likelihood before and after the division, evaluated with $\theta^{(p)}$. The division gain $G(s^*, q_0, q_1)$ obtained when a state $s^*$ is divided into two states $q_0$ and $q_1$ is calculated by the following equation.

【００２８】[0028]

【数７】ここで、[Equation 7] here,

【数８】 [Equation 8]

【数９】 [Equation 9]

【数１０】 [Equation 10]

【００２９】コンテキスト方向の分割においては、各状
態に関して、分割ファクタ（先行状態、中心状態（処理
対象状態をいう。）又は後続状態）及び分割後の各２状
態の３つの音素並びであるトライフォン（triphone）の
セットがチャウ（Chou）の公知のクラスタリングアルゴ
リズムを使用して決定される。時間方向の分割では、分
割利得が制約付きのバウム・ウェルチの学習アルゴリズ
ム（又は、フォワード・バックワードの学習アルゴリズ
ムともいう。）を使用して計算される。こうして、これ
らの２つの領域に関する分割利得の期待値を計算するこ
とができる。分割する状態及びその分割領域は、どの利
得が最大であるかに基づいて決定される。In division in the context direction, the division factor (preceding state, center state (that is, the processing target state), or succeeding state) and the set of triphones (sequences of three phonemes) for each of the two states after division are determined for each state using Chou's known clustering algorithm. In division in the time direction, the division gain is calculated using the constrained Baum-Welch learning algorithm (also called the forward-backward learning algorithm). The expected division gain can thus be calculated for both of these domains. The state to be divided and its division domain are determined according to which gain is largest.
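A division gain of this kind can be sketched for the simplest case. The sketch assumes hard sample assignments with unit weight, diagonal-covariance Gaussians with maximum-likelihood parameters, and no transition terms; under those assumptions the expected log-likelihood of n samples is -0.5 * n * (log-determinant + constant), so the constants cancel in the difference. The names `split_gain` and `_log_det_diag` and the variance floor are illustrative and not taken from the patent.

```python
import math


def _log_det_diag(samples):
    """Log-determinant of the ML diagonal covariance of the samples."""
    n = len(samples)
    total = 0.0
    for column in zip(*samples):
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        total += math.log(max(var, 1e-8))  # variance floor, illustrative
    return total


def split_gain(cluster_0, cluster_1):
    """Gain in log-likelihood from modelling two clusters separately.

    Compares one Gaussian fitted to the union of the clusters with one
    Gaussian fitted to each cluster.
    """
    n0, n1 = len(cluster_0), len(cluster_1)
    parent = cluster_0 + cluster_1
    return 0.5 * (
        (n0 + n1) * _log_det_diag(parent)
        - n0 * _log_det_diag(cluster_0)
        - n1 * _log_det_diag(cluster_1)
    )
```

A division that separates the samples cleanly yields a large gain, while one that leaves both clusters as spread out as the parent yields a gain near zero, which is what lets gains from different division domains be compared on one scale.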

【００３０】次いで、本実施形態に係るパス方向の分割
によるマルチパスの状態分割処理について説明する。音
声認識のための音響モデルにおける状態は、性別、話の
速度他に起因する音素の音響的変化を捕捉するために、
幾つかの分布をその混合成分として保有している場合が
多い。こうした音響的変化は、時として音声的環境に起
因するものよりも大きく、また場合によっては音素自体
の差異に起因するものよりも大きい。従来例のＭＬ−Ｓ
ＳＳ法及び他の大部分の音響モデルの生成アルゴリズム
は、１つのトライフォンを１つの音響的現象と見なす。
従って、この場合に生成される音響モデルでは１つのト
ライフォンがわずか１つのパス（状態シーケンス）に対
応し、トライフォンの音響的変化は、状態ネットワーク
の生成後に他の混合分割アルゴリズムによって生成され
る混合分布において捕捉されることになる。音響モデル
を生成する先行方法が２つの段階を必要とするのは、こ
の理由による。Next, multipath state division using division in the path direction according to the present embodiment is described. A state in an acoustic model for speech recognition often holds several distributions as its mixture components in order to capture the acoustic variation of a phoneme caused by gender, speech rate, and other factors. This acoustic variation is sometimes larger than that caused by the phonetic environment, and in some cases larger than that caused by the difference between phonemes themselves. The conventional ML-SSS method and most other acoustic model generation algorithms regard one triphone as one acoustic phenomenon. In the acoustic model they generate, one triphone therefore corresponds to only one path (state sequence), and the acoustic variation of the triphone is captured in mixture distributions produced by a separate mixture division algorithm after the state network has been generated. This is why previous methods of generating acoustic models require two stages.

【００３１】そこで、本発明者らは、上述の音響的変化
に対処するために第３の分割領域であるパス方向の分割
をさらに導入する。コンテキスト方向の分割において
は、元の状態のトライフォンを分割することによって１
つの状態が２つの状態に分割されるが、パス方向の分割
では、図６に示すように、１つの状態が、ともに元の状
態が表現するものと同じトライフォンを表現する２つの
並行する状態に分割される。パス方向の分割において
は、対応するデータセットがトライフォンについて一切
考慮することなく２つのクラスタに分割され、次いで、
分割後の各状態のパラメータが各クラスタ内のデータを
使用して評価される。その結果、パス方向の分割によっ
て生成されるＨＭＭのトポロジーは、１つのトライフォ
ンが多数の経路を保有可能であるマルチパス構造となる
ことができる。マルチパスＨＭＭとの確率的整合性のた
めに、尤度の計算においては、以下のような任意の確率
ｃによる分岐を認める修正が必要である。Therefore, the present inventors further introduce a third division domain, division in the path direction, to deal with the acoustic variation described above. In division in the context direction, one state is divided into two states by dividing the set of triphones of the original state. In division in the path direction, as shown in FIG. 6, one state is divided into two parallel states, both of which represent the same triphones as the original state. In division in the path direction, the corresponding data set is divided into two clusters without any regard to triphones, and the parameters of each state after division are then estimated using the data in each cluster. As a result, the topology of the HMM generated by division in the path direction can be a multipath structure in which one triphone can have many paths. For probabilistic consistency with the multipath HMM, the likelihood calculation must be modified as follows to allow branching with an arbitrary probability $c$.

【００３２】[0032]

【数１１】ここで、[Equation 11] here,

【数１２】[Equation 12] $a'_{ss'} = p(s_t \in S(s') \mid s_{t-1} = s, \theta_{A(s)})$

【数１３】[Equation 13] $c_s = p(s_t = s \mid s_t \in S(s), \theta_{C(s)})$
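Read together, Equations 12 and 13 suggest how an ordinary transition probability factorizes in the multipath model. The following is a sketch obtained from the two definitions by the chain rule, under the assumption (not stated explicitly in the text) that the choice among the parallel states in $S(s')$ does not depend on the preceding state $s$:

```latex
p(s_t = s' \mid s_{t-1} = s)
  = p\bigl(s_t \in S(s') \mid s_{t-1} = s\bigr)\,
    p\bigl(s_t = s' \mid s_t \in S(s')\bigr)
  = a'_{ss'}\, c_{s'}
```

In words, a transition first selects the group of parallel states and then selects one branch within the group with probability $c_{s'}$.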

【００３３】ここで、下記の拘束条件を用いる。Here, the following constraint conditions are used.

【００３４】任意のトライフォンｍに対して、For any triphone m,

【数１４】ここで、Ｓ_m（ｓ）は状態ｓとトライフォンｍ及びパス
内の時間方向の順序（図８参照。）を共有する状態セッ
トを意味している。また、Ｓ（ｓ）は状態ｓにおける、
学習サンプル内での文脈（既知）のトライフォンｍに関
する状態セットＳ_m（ｓ）を意味する。θ_C(s)は、ｃ_sを
計算するためのパラメータセットである。この定義のも
とでは、対数尤度の期待値及び利得は次式のように書き
替えられる。[Equation 14] Here, $S_m(s)$ denotes the set of states that share with state $s$ the triphone $m$ and the temporal position within the path (see FIG. 8). $S(s)$ denotes the state set $S_m(s)$ for the (known) context triphone $m$ of the training sample at state $s$. $\theta_{C(s)}$ is the parameter set for calculating $c_s$. Under this definition, the expected value of the log-likelihood and the gain are rewritten as follows.

【００３５】[0035]

【数１５】ここで、[Equation 15] here,

【数１６】[Equation 16] $\xi'_t(s, s') = p(s_t \in S(s'), s_{t-1} = s \mid y_1^T, \theta^{(p)})$

【数１７】[Equation 17] $\lambda_t(s) = p(s_t = s \mid s_t \in S(s), y_1^T, \theta^{(p)})$

【００３６】γ_t（ｓ）の定義は、従来例のＭＬ−ＳＳ
Ｓ法の場合と同一である。従って、新たに定義された対
数尤度増分は次式のように計算される。The definition of $\gamma_t(s)$ is the same as in the conventional ML-SSS method. The newly defined log-likelihood increment is therefore calculated by the following equation.

【００３７】[0037]

【数１８】ここで、[Equation 18] here,

【数１９】 [Formula 19]

【数２０】 [Equation 20]

【００３８】ここで、Ｎ₃（ｓ^*）ｌｏｇｃ_s*の項、及
びHere, the term $N_3(s^*) \log c_{s^*}$ and the term

【数２１】の項は、状態ｓの分岐重み係数であるθ_c(s)から導出さ
れる。コンテキスト方向の分割ではこうした重み係数は
変化しないため、これらの項は利得に影響しない。時間
方向の分割においては、先行状態の重み係数は変化せ
ず、後続状態の重み係数は１となる。従って、こうした
２つの領域の分割利得の計算は、従来例のＭＬ−ＳＳＳ
法の場合と同一である。こららの項を含む利得の期待
は、パス方向の分割においてのみ行われる。当該状態に
対応する全てのサンプルを分割するため、本発明者らは
チャウ（Chou）のアルゴリズムを利用してコンテキスト
方向の分割と同じ方法でパス方向の分割を行なう。コン
テキスト方向の分割とパス方向の分割との相違は、次の
２点のみである。第１に、コンテキスト方向の分割は、
学習サンプルをトライフォン毎にクラスタリングし、パ
ス方向の分割は学習サンプルをサンプル毎にクラスタリ
ングする。第２に、パス方向の分割では元の状態ｃ _s*の
分岐重み係数は最適な比率で分割されるが、コンテキス
ト方向の分割では分割されない。[Equation 21] are derived from $\theta_{C(s)}$, the branch weight coefficients of state $s$. These weight coefficients do not change in division in the context direction, so these terms do not affect the gain. In division in the time direction, the weight coefficient of the preceding state does not change and that of the succeeding state becomes 1. The calculation of the division gain for these two domains is therefore the same as in the conventional ML-SSS method. The expected gain including these terms is evaluated only for division in the path direction. To divide all the samples corresponding to the state, the present inventors use Chou's algorithm and perform division in the path direction in the same way as division in the context direction. Division in the context direction and division in the path direction differ in only two respects. First, division in the context direction clusters the training samples triphone by triphone, whereas division in the path direction clusters them sample by sample. Second, in division in the path direction the branch weight coefficient $c_{s^*}$ of the original state is divided at the optimal ratio, whereas in division in the context direction it is not divided.
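The two points that distinguish division in the path direction (clustering sample by sample, and dividing the branch weight) can be sketched as follows. The sketch swaps in a simple two-cluster k-means for Chou's algorithm, uses hard assignments, and takes the branch weights in proportion to the sample counts, which is the maximum-likelihood choice under hard assignments. The function name `path_split` and the dictionary layout are assumptions for illustration.

```python
def path_split(samples, parent_weight=1.0, n_iter=10):
    """Split one state's samples into two parallel states (path direction).

    samples : list of float vectors assigned to the state; no triphone
              labels are consulted
    Returns two dicts with keys "mean", "var", and "weight".
    """
    dims = list(zip(*samples))
    # Initialise the two centroids at the per-dimension minimum and maximum.
    centroids = [[min(d) for d in dims], [max(d) for d in dims]]
    for _ in range(n_iter):
        clusters = [[], []]
        for x in samples:
            dist = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            clusters[0 if dist[0] <= dist[1] else 1].append(x)
        if not clusters[0] or not clusters[1]:
            raise ValueError("samples cannot be separated into two clusters")
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] for cl in clusters]
    states = []
    for cl, mean in zip(clusters, centroids):
        var = [
            sum((v - m) ** 2 for v in col) / len(cl)
            for col, m in zip(zip(*cl), mean)
        ]
        states.append({
            "mean": mean,
            "var": var,
            # Branch weight: the parent's weight shared by sample count.
            "weight": parent_weight * len(cl) / len(samples),
        })
    return states
```

A context-direction split would instead group the samples by triphone label before clustering and would leave the branch weight untouched.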

【００３９】図２は、図１の音響モデル学習部３１によ
って実行される音響モデル学習処理を示すフローチャー
トである。図２において、まず、ステップＳ１１では、
複数の特定話者の発声音声データを格納する音声発声デ
ータメモリ４１内の発声音声データ（具体的には、発声
音声の特徴パラメータのデータである。）に基づいてそ
れぞれ後述する所定の音声の特徴パラメータを抽出した
後音素を切り出して、従来の方法で複数の特定話者用単
一ガウス分布のＨＭ網を生成する。そして、生成したＨ
Ｍ網に基づいて、例えば公知のバウム・ウェルチの学習
アルゴリズムを用いて学習を行って単一ガウス分布のＨ
Ｍ網を生成する。次いで、ステップＳ１２では、ＨＭ網
内のすべての状態に対して分割可能な状態の分割情報を
得る。この処理は、ステップＳ１５と同様に実行され
る。すなわち、詳細後述する最尤分割設定処理を用いて
すべての状態に対して将来の分割の中で最良の分割方向
（本実施形態においては、コンテキスト方向、時間方
向、及びパス方向を用いる。）及び音素（又は音素ラベ
ル）を検索して決定し、これらを分割情報としてメモリ
に記憶する。すなわち、分割情報とは、以下の通りであ
る。（１）分割したときの尤度期待値の増加量、（２）分割
は、コンテキスト方向であるか、時間方向であるか、パ
ス方向であるか、並びに、（３）コンテキスト方向の前
の音素、当該音素、後の音素。FIG. 2 is a flowchart showing the acoustic model learning processing executed by the acoustic model learning unit 31 of FIG. 1. In FIG. 2, first, in step S11, predetermined speech feature parameters, described later, are extracted from the uttered speech data (specifically, data of the feature parameters of the uttered speech) in the uttered speech data memory 41, which stores the uttered speech data of a plurality of specific speakers; phonemes are then segmented, and a single-Gaussian HM network for the plurality of specific speakers is generated by a conventional method. Based on the generated HM network, training is performed using, for example, the known Baum-Welch learning algorithm to generate a single-Gaussian HM network. Next, in step S12, division information on the possible divisions is obtained for all the states in the HM network. This processing is executed in the same manner as step S15. That is, using the maximum likelihood division setting processing described in detail later, the best division direction (in the present embodiment, the context direction, the time direction, or the path direction) and the phonemes (or phoneme labels) among the future divisions are searched for and determined for every state, and these are stored in memory as the division information. The division information consists of the following: (1) the increase in the expected likelihood when the state is divided; (2) whether the division is in the context direction, the time direction, or the path direction; and (3) the preceding phoneme, the phoneme concerned, and the succeeding phoneme in the context direction.

[0040] Next, in step S13, the state to be split that has the largest increase in expected likelihood is searched for on the basis of the split information, and the state found is split according to the split information. That is, the state to be split that has the maximum likelihood is split in the best direction (the context direction, the time direction, or the path direction). In step S14, the states affected by the split are searched for and determined, and these affected states are trained with the known Baum-Welch training algorithm to produce a single-Gaussian HM network. In step S15, the maximum likelihood split setting process described in detail later is used to search for and determine the best split direction and phoneme (or phoneme label) among the possible future splits for the two states produced in step S13 and for the affected states, and these are stored in memory as split information. Here, for K affected states, (K-1) context-direction split tests and one time-direction split test are executed. In step S16, it is determined whether the states in the single-Gaussian HM network can no longer be split, or whether the number of states in the single-Gaussian HM network has reached a predetermined number (hereinafter, the predetermined number of splits). If splitting is still possible and the predetermined number of splits has not been reached, the process returns to step S13 and the above processing is repeated. If, in step S16, splitting is impossible or the predetermined number of splits has been reached, then in step S17 the resulting HM network is stored as the acoustic model in the hidden Markov model network memory (hereinafter, HM network memory) 42.
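The control flow of steps S12 to S17 can be sketched as a greedy loop. This is only an illustrative sketch, not the patented implementation: `SplitInfo`, `greedy_split`, `best_split`, `apply_split`, and the toy sample counts in `demo` are hypothetical names and stand-ins, the Baum-Welch retraining of step S14 is omitted, and the "gain" is a placeholder for the increase in expected likelihood.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SplitInfo:
    gain: float      # (1) increase in expected likelihood if the state is split
    direction: str   # (2) "context", "time" or "path"


def greedy_split(
    states: List[str],
    best_split: Callable[[str], Optional[SplitInfo]],
    apply_split: Callable[[str, SplitInfo], Tuple[List[str], List[str]]],
    max_states: int,
) -> List[str]:
    """Repeatedly split the state with the largest gain (hypothetical sketch)."""
    states = list(states)
    # S12: compute split information for every state (None = not splittable).
    info: Dict[str, Optional[SplitInfo]] = {s: best_split(s) for s in states}
    while len(states) < max_states:                 # S16: size limit
        splittable = [s for s in states if info[s] is not None]
        if not splittable:                          # S16: nothing left to split
            break
        # S13: pick and split the state with the largest gain.
        target = max(splittable, key=lambda s: info[s].gain)
        children, affected = apply_split(target, info.pop(target))
        states.remove(target)
        states.extend(children)
        # S14 (retraining the affected states) is omitted in this toy sketch.
        for s in children + affected:               # S15: refresh split info
            info[s] = best_split(s)
    return states                                   # S17: final network


def demo() -> List[str]:
    counts = {"s0": 8}  # toy stand-in: number of training samples per state

    def best_split(s: str) -> Optional[SplitInfo]:
        n = counts[s]
        return SplitInfo(gain=float(n), direction="path") if n >= 2 else None

    def apply_split(s: str, info: SplitInfo) -> Tuple[List[str], List[str]]:
        n = counts.pop(s)
        a, b = s + "a", s + "b"
        counts[a], counts[b] = n // 2, n - n // 2
        return [a, b], []  # this toy has no affected neighbouring states

    return greedy_split(["s0"], best_split, apply_split, max_states=4)
```

Recomputing split information only for the new and affected states, as step S15 describes, avoids re-evaluating every state after each split.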

[0041] A further advantage of this embodiment is that training with the Baum-Welch algorithm is performed on single Gaussian distributions, which runs far faster than training on Gaussian mixture distributions.

[0042] FIG. 3 is a flowchart showing the acoustic model conversion process executed by the acoustic model conversion unit 32 of FIG. 1. In FIG. 3, first, in step S21, on the basis of the acoustic model (the HM network) stored in the HM network memory 42, all states are classified, for each triphone, by their preceding states and succeeding states, and a new state is generated for each resulting set. Each new state has as its components all the Gaussian distributions of all the old states belonging to the corresponding set. Next, in step S22, among the sets obtained by the classification, all new states whose target states coincide are merged, producing the merged set of new states. In step S23, every transition between old states before the merge is set as a transition from the new state containing the old start point to the new state containing the old end point, all old states are discarded, and the resulting acoustic model (the processed HM network) is stored in the HM network memory 11. Duplicate transitions are deleted at this point.

[0043] In this embodiment, an HM network is used as the statistical phoneme model set for speech recognition. The HM network is an efficiently represented phoneme-environment-dependent model, and one HM network contains many phoneme-environment-dependent models. The HM network consists of connected states, each containing Gaussian distributions, and states are shared among the individual phoneme-environment-dependent models. A robust model can therefore be generated even when the data available for parameter estimation are insufficient. The HM network is generated automatically by at least one of the acoustic model learning unit 31 and the acoustic model conversion unit 32. In this embodiment, the parameters of the HM network are output probabilities expressed by Gaussian distributions and transition probabilities, so during speech recognition it can be handled in the same way as an ordinary HMM. An SSS-LR (left-to-right rightmost type) speaker-independent continuous speech recognition device using this HM network is described next. This speech recognition device uses the efficient phoneme-environment-dependent HMM representation called an HM network, stored in memory.

[0044] In FIG. 1, a speaker's uttered speech is input to the microphone 1, converted into a speech signal, and input to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
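How the 34 dimensions add up (1 log power + 16 cepstrum coefficients, plus a Δ term for each) can be illustrated with a short sketch. The text does not say how the Δ terms are computed, so this sketch assumes a plain first difference between consecutive frames. `add_deltas` is a hypothetical helper, and the LPC analysis that would produce the 17 static values per frame is not shown.

```python
from typing import List, Sequence


def add_deltas(static_frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Append a delta term to every static feature of every frame.

    Each input frame is assumed to hold 17 static values
    (log power + 16 cepstrum coefficients); each output frame holds 34.
    The delta is ASSUMED to be the difference from the previous frame;
    the first frame gets zero deltas.
    """
    if not static_frames:
        return []
    out: List[List[float]] = []
    prev = static_frames[0]
    for frame in static_frames:
        delta = [cur - old for cur, old in zip(frame, prev)]
        out.append(list(frame) + delta)
        prev = frame
    return out
```

Practical front ends often compute deltas by regression over several neighbouring frames; the first difference is used here only to keep the sketch short.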

[0045] The HM network stored in the HM network memory 11 connected to the phoneme matching unit 4 is represented as a plurality of networks whose nodes are the states, and each state holds the following information:
(a) a state number;
(b) the acceptable context class;
(c) lists of preceding states and succeeding states;
(d) the parameters of the output probability density distribution; and
(e) the self-transition probability and the transition probabilities to the succeeding states.
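Items (a) to (e) map naturally onto a record type. The following is a minimal sketch under stated assumptions: the class name `HMNetState`, its field names, the encoding of the context class as three phoneme sets, and the diagonal-Gaussian mean/variance fields are all hypothetical choices for illustration and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class HMNetState:
    state_number: int                                   # (a)
    # (b) hypothetical encoding: allowed preceding / centre / following phonemes
    context_class: Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]
    mean: List[float]                                   # (d) assumed Gaussian mean
    variance: List[float]                               # (d) assumed diagonal variance
    self_transition: float                              # (e)
    predecessors: List[int] = field(default_factory=list)        # (c)
    successors: List[int] = field(default_factory=list)          # (c)
    next_transitions: Dict[int, float] = field(default_factory=dict)  # (e)

    def transitions_normalised(self, tol: float = 1e-9) -> bool:
        """Check that the outgoing probabilities of this state sum to 1."""
        total = self.self_transition + sum(self.next_transitions.values())
        return abs(total - 1.0) <= tol
```

For example, a state accepted in the contexts a/k/a and a/k/i could be built with `context_class=(frozenset({"a"}), frozenset({"k"}), frozenset({"a", "i"}))`.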

[0046] The phoneme matching unit 4 executes phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. The likelihood of the data within the phoneme matching interval is computed with the speaker-independent model, and this likelihood is returned to the LR parser 5 as the phoneme matching score. Because the model used here is equivalent to an HMM, the likelihood is computed with the forward-pass algorithm used for ordinary HMMs, unchanged.
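For reference, the standard HMM forward pass can be written in a few lines. This is a generic textbook sketch, not the matching unit's actual code: `forward_likelihood` and its argument layout are hypothetical, the per-frame output probabilities are assumed to be precomputed, at least one frame is assumed, and probabilities are kept in the linear domain for readability (practical systems usually work with logarithms or scaling to avoid underflow).

```python
from typing import Sequence


def forward_likelihood(
    initial: Sequence[float],
    transition: Sequence[Sequence[float]],
    emission: Sequence[Sequence[float]],
) -> float:
    """Return P(observations | model) by the forward algorithm.

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[t][i]   : output probability of frame t in state i
    """
    n = len(initial)
    # Initialisation: alpha_0(i) = pi_i * b_i(o_0)
    alpha = [initial[i] * emission[0][i] for i in range(n)]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, len(emission)):
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n)) * emission[t][j]
            for j in range(n)
        ]
    # Termination: sum over all states at the last frame
    return sum(alpha)
```

The sketch sums over all states at the last frame. A left-to-right phoneme model would typically read off only the final state instead.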

[0047] Meanwhile, a predetermined context-free grammar (CFG) in the context-free grammar database memory 14 is automatically converted by a known method into an LR table, which is stored in the LR table memory 13. The LR parser 5 refers to the LR table in the LR table memory 13 and processes the input phoneme prediction data from left to right without backtracking. Where there is syntactic ambiguity, the stack is split and all candidate analyses are processed in parallel. The LR parser 5 predicts the next phoneme from the LR table and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching by referring to the information in the HM network corresponding to that phoneme and returns the likelihood to the LR parser 5 as a speech recognition score. Phonemes are concatenated one after another in this way to recognize continuous speech, and the speech recognition result data are output. When several phonemes are predicted during continuous speech recognition, the presence of all of them is checked, and high-speed processing is achieved by beam-search pruning that retains only the partial trees with a high partial recognition likelihood.
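The pruning step of a beam search can be sketched as follows. The text does not state the pruning criterion, so this sketch assumes a fixed beam width (keep the N best partial hypotheses). `prune` and the example scores are hypothetical.

```python
import heapq
from typing import Dict


def prune(hypotheses: Dict[str, float], beam_width: int) -> Dict[str, float]:
    """Keep only the beam_width partial hypotheses with the highest score.

    hypotheses maps a partial phoneme sequence to its log-likelihood score.
    """
    best = heapq.nlargest(beam_width, hypotheses.items(), key=lambda kv: kv[1])
    return dict(best)
```

With the hypothetical scores `{"ka": -3.0, "ki": -1.0, "ke": -7.5, "ko": -2.0}` and a beam width of 2, only `"ki"` and `"ko"` survive. A score-threshold beam (drop everything more than a fixed margin below the best score) is a common alternative.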

[0048] In the embodiment above, the feature extraction unit 2, the phoneme matching unit 4, the LR parser 5, the acoustic model learning unit 31, and the acoustic model conversion unit 32 are implemented, for example, by a digital computer. The uttered speech data memory 41 for specific speakers, the HM network memories 42 and 11, the context-free grammar database memory 14, the LR table memory 13, and the buffer memory 3 are implemented, for example, by hard disk memory.

[0049] In the embodiment above, speech recognition uses the HM network converted by the acoustic model conversion unit 32, but the present invention is not limited to this; speech recognition may instead use the acoustic model generated by the acoustic model learning unit 31.

[0050]

[Examples] First, conversion examples 1 and 2 in the acoustic model conversion unit 32 are described. Conversion example 1, shown in FIG. 10, converts a multi-path model into a single-path model. Conversion example 2, shown in FIG. 11, converts a multi-path model into a multi-path model with fewer branches in each state.

[0051] <Conversion example 1> Executing the process of step S21 on the HMM of FIG. 10(a) gives the following results.

[0052]

[Table 1] Triphone a/k/a
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S2}            {S0}            a/k/a-0
{S0}               {S4}                {S1, S2}        a/k/a-1
{S1, S2}           -                   {S4}            a/k/a-2
------------------------------------------------------------------
(Note) In the "New state" column, each new state is labeled with its phoneme and an identification code so that the new states can be distinguished; the same applies below.

[0053]

[Table 2] Triphone a/k/i
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S2}            {S0}            a/k/i-0
{S0}               {S4}                {S1, S2}        a/k/i-1
{S1, S2}           -                   {S4}            a/k/i-2
------------------------------------------------------------------

[0054]

[Table 3] Triphone a/k/e
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S3}            {S0}            a/k/e-0
{S0}               {S4}                {S1, S3}        a/k/e-1
{S1, S3}           -                   {S4}            a/k/e-2
------------------------------------------------------------------

[0055]

[Table 4] Triphone a/k/o
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S3}            {S0}            a/k/o-0
{S0}               {S5}                {S1, S3}        a/k/o-1
{S1, S3}           -                   {S5}            a/k/o-2
------------------------------------------------------------------

[0056] Next, the process of step S22 is executed as follows.
(a) The new states a/k/*-0, which have the target element {S0}, are merged into a new state T0. Here, * denotes any phoneme.
(b) The new states a/k/a-1 and a/k/i-1, which have the target element {S1, S2}, are merged into a new state T1.
(c) The new states a/k/e-1 and a/k/o-1, which have the target element {S1, S3}, are merged into a new state T2.
(d) The new states a/k/a-2, a/k/i-2, and a/k/e-2, which have the target element {S4}, are merged into a new state T3.
(e) The new state a/k/o-2, which has the target element {S5}, becomes the new state T4 as it is.

[0057] The process of step S23 is then executed as follows.
(a) The transition S0→S1 between old states is adopted and set as the transition T0→T1 between new states.
(b) The transition S0→S1 between old states is adopted and set as the transition T0→T2 between new states.
(c) The transition S0→S2 between old states is adopted as the transition T0→T1 between new states, but it duplicates the above and is deleted.
(d) The transition S0→S3 between old states is adopted as the transition T0→T2 between new states, but it duplicates the above and is deleted.
(e) The transition S1→S4 between old states is adopted and set as the transition T1→T3 between new states.
(f) The transition S2→S4 between old states is adopted as the transition T1→T3 between new states, but it duplicates the above and is deleted.
(g) The transition S3→S4 between old states is adopted and set as the transition T2→T3 between new states.
(h) The transition S3→S4 between old states is adopted and set as the transition T2→T4 between new states.

[0058] Through the processing of conversion example 1 above, the HMM of FIG. 10(a) can be converted into the HMM of FIG. 10(b).
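Steps S21 to S23 can be sketched compactly with sets. The sketch below is an illustration under stated assumptions, not the patented implementation: `convert` is a hypothetical function, each new state is represented only by the set of old states it contains (standing in for the union of their Gaussian distributions), self-transitions are ignored, and the per-triphone edge lists in `EXAMPLE_1` are a reading of Tables 1 to 4, since FIG. 10(a) itself is not reproduced in the text.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]
NewState = FrozenSet[str]


def convert(
    triphone_edges: Dict[str, List[Edge]],
) -> Tuple[Set[NewState], Set[Tuple[NewState, NewState]]]:
    new_states: Set[NewState] = set()
    new_transitions: Set[Tuple[NewState, NewState]] = set()
    for edges in triphone_edges.values():
        preds: Dict[str, Set[str]] = defaultdict(set)
        succs: Dict[str, Set[str]] = defaultdict(set)
        for u, v in edges:
            succs[u].add(v)
            preds[v].add(u)
        old_states = set(preds) | set(succs)
        # S21: group old states sharing the same preceding and succeeding sets.
        groups: Dict[Tuple[FrozenSet[str], FrozenSet[str]], Set[str]] = defaultdict(set)
        for s in old_states:
            groups[(frozenset(preds[s]), frozenset(succs[s]))].add(s)
        group_of = {s: frozenset(m) for m in groups.values() for s in m}
        # S22: identical member sets from different triphones merge in the set.
        new_states.update(group_of.values())
        # S23: map each old transition onto the new states; the set drops duplicates.
        new_transitions.update((group_of[u], group_of[v]) for u, v in edges)
    return new_states, new_transitions


# ASSUMED old transitions per triphone, inferred from Tables 1 to 4.
EXAMPLE_1: Dict[str, List[Edge]] = {
    "a/k/a": [("S0", "S1"), ("S0", "S2"), ("S1", "S4"), ("S2", "S4")],
    "a/k/i": [("S0", "S1"), ("S0", "S2"), ("S1", "S4"), ("S2", "S4")],
    "a/k/e": [("S0", "S1"), ("S0", "S3"), ("S1", "S4"), ("S3", "S4")],
    "a/k/o": [("S0", "S1"), ("S0", "S3"), ("S1", "S5"), ("S3", "S5")],
}
```

Under these assumptions, `convert(EXAMPLE_1)` yields the five member sets {S0}, {S1, S2}, {S1, S3}, {S4}, and {S5}, corresponding to T0 to T4, and five distinct transitions between them.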

[0059] <Conversion example 2> Executing the process of step S21 on the HMM of FIG. 11(a) gives the following results.

[0060]

[Table 5] Triphone a/k/a
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S2}            {S0}            a/k/a-0
{S0}               {S4}                {S1, S2}        a/k/a-1
{S1, S2}           -                   {S5}            a/k/a-2
------------------------------------------------------------------

[0061]

[Table 6] Triphone a/k/i
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S2}            {S0}            a/k/i-0
{S0}               {S4}                {S1, S2}        a/k/i-1
{S1, S2}           -                   {S5}            a/k/i-2
------------------------------------------------------------------

[0062]

[Table 7] Triphone a/k/e
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S3}            {S0}            a/k/e-0
{S0}               {S5}                {S1}            a/k/e-1
{S0}               {S4}                {S3}            a/k/e-2
{S3}               {S5}                {S4}            a/k/e-3
{S1, S4}           -                   {S5}            a/k/e-4
------------------------------------------------------------------

[0063]

[Table 8] Triphone a/k/o
------------------------------------------------------------------
Preceding states   Succeeding states   Target states   New state
------------------------------------------------------------------
-                  {S1, S3}            {S0}            a/k/o-0
{S0}               {S5}                {S1}            a/k/o-1
{S0}               {S4}                {S3}            a/k/o-2
{S3}               {S6}                {S4}            a/k/o-3
{S4}               -                   {S6}            a/k/o-4
------------------------------------------------------------------

[0064] Next, the process of step S22 is executed as follows.
(a) The new states a/k/*-0, which have the target element {S0}, are merged into a new state T0.
(b) The new states a/k/a-1 and a/k/i-1, which have the target element {S1, S2}, are merged into a new state T1.
(c) The new states a/k/e-1 and a/k/o-1, which have the target element {S1}, are merged into a new state T2.
(d) The new states a/k/e-2 and a/k/o-2, which have the target element {S3}, are merged into a new state T3.
(e) The new states a/k/e-3 and a/k/o-3, which have the target element {S4}, are merged into a new state T4.
(f) The new states a/k/a-2, a/k/i-2, and a/k/e-4, which have the target element {S5}, are merged into a new state T5.
(g) The new state a/k/o-4, which has the target element {S6}, becomes the new state T6 as it is.

[0065] The process of step S23 is then executed as follows.
(a) The transition S0→S1 between old states is adopted and set as the transition T0→T1 between new states.
(b) The transition S0→S1 between old states is adopted and set as the transition T0→T2 between new states.
(c) The transition S0→S2 between old states is adopted as the transition T0→T1 between new states, but it duplicates the above and is deleted.
(d) The transition S0→S3 between old states is adopted and set as the transition T0→T3 between new states.
(e) The transition S1→S5 between old states is adopted and set as the transition T1→T5 between new states.
(f) The transition S1→S5 between old states is adopted and set as the transition T2→T5 between new states.
(g) The transition S2→S5 between old states is adopted as the transition T2→T5 between new states, but it duplicates the above and is deleted.
(h) The transition S3→S4 between old states is adopted and set as the transition T3→T4 between new states.
(i) The transition S4→S5 between old states is adopted and set as the transition T4→T5 between new states.
(j) The transition S4→S6 between old states is adopted and set as the transition T4→T6 between new states.

[0066] <Experiments and results> The inventors carried out the following experiment to verify the speech recognition system of this embodiment.

[0067] With the acoustic model conversion process used in the acoustic model conversion unit 32 described above, an ordinary acoustic model with a single path and mixture distributions can be generated. Recognition experiments can therefore be run without any special decoder for multi-path phoneme models. The experiment was carried out under the conditions shown in Tables 9 and 10, using the acoustic model generated by this embodiment without time-direction splitting.

[0068]

[Table 9] Experimental conditions 1
------------------------------------------------------------------
Training samples      Spontaneous Japanese speech for a hotel
                      reservation task (230 speakers)
Recognition samples   Spontaneous Japanese speech for a hotel
                      reservation task (41 speakers)
Language model        Multi-class composite n-gram
------------------------------------------------------------------

[0069]

[Table 10] Experimental conditions 2
------------------------------------------------------------------
Conventional model         N-state, M-mixture model created from a
                           3-state speaker-independent model by the
                           ML-SSS method and the known LBG algorithm
Model of this embodiment   Converted N-state model generated from a
                           3-state speaker-independent model by this
                           embodiment
------------------------------------------------------------------

[0070] Tables 11 and 12 show the word recognition rates obtained in the experiment.

[0071]

[Table 11]
------------------------------------------------------------------
Conventional model                        Word recognition rate
------------------------------------------------------------------
800 states, 3 mixtures (2400 Gaussians)   80.1%
800 states, 5 mixtures (4000 Gaussians)   80.7%
800 states, 10 mixtures (8000 Gaussians)  81.4%
------------------------------------------------------------------

[0072]

[Table 12]
------------------------------------------------------------------
Model of this embodiment                  Word recognition rate
------------------------------------------------------------------
2000 Gaussians                            81.1%
3000 Gaussians                            81.5%
------------------------------------------------------------------

[0073] As is clear from Tables 11 and 12, the acoustic model of this embodiment achieves better performance than the conventional acoustic model created with the ML-SSS method and the LBG algorithm using twice as many Gaussian distributions. In this experiment, because of the enormous computation time, training of the HMM topology started from context-independent phoneme models that were already divided phoneme by phoneme. The acoustic model is therefore thought not to capture acoustic variation across many phonemes sufficiently.
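The comparison behind the "twice as many Gaussian distributions" statement can be reproduced from the figures in Tables 11 and 12. The short sketch below pairs each model of this embodiment with the smallest conventional model that has at least twice as many Gaussians and reports the difference in word recognition rate. `compare` and the pairing rule are assumptions made for illustration; only the numbers come from the tables.

```python
from typing import Dict, List, Tuple

# Word recognition rates (%) keyed by number of Gaussians, from Tables 11 and 12.
CONVENTIONAL: Dict[int, float] = {2400: 80.1, 4000: 80.7, 8000: 81.4}
PROPOSED: Dict[int, float] = {2000: 81.1, 3000: 81.5}


def compare(
    proposed: Dict[int, float],
    conventional: Dict[int, float],
    factor: float = 2.0,
) -> List[Tuple[int, int, float]]:
    """Return (proposed size, conventional size, rate difference in points)."""
    rows = []
    for n_prop, rate_prop in sorted(proposed.items()):
        # Smallest conventional model with at least `factor` times the Gaussians.
        bigger = [n for n in sorted(conventional) if n >= factor * n_prop]
        if bigger:
            n_conv = bigger[0]
            rows.append((n_prop, n_conv, round(rate_prop - conventional[n_conv], 1)))
    return rows
```

With the table values, the 2000-Gaussian model is 0.4 points above the 4000-Gaussian conventional model, and the 3000-Gaussian model is 0.1 points above the 8000-Gaussian conventional model.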

[0074] This embodiment proposes a new and effective topology generation method for acoustic modeling that covers not only the state network but also the mixture structure. It also introduces a well-functioning method for converting the final multi-path model into an ordinary single-path model. As a result, models created by the proposed method achieved good performance: their recognition rate exceeds that of ordinary models using at least twice as many Gaussians.

[0075] As described above, this embodiment has the following particular effects.
(1) Because the acoustic model learning unit 31 learns and generates the acoustic model taking into account splitting in the path direction in addition to state splitting in the context and time directions, the method can be applied to acoustic models with multiple mixture components, and the model size is smaller than in the conventional example. The speech recognition rate obtained with this acoustic model can therefore be higher than in the conventional example.
(2) Because the acoustic model conversion unit 32 can reduce the number of branches in each state, the model size is smaller than in the conventional example. The recognition time when speech is recognized with this acoustic model can therefore be shortened.

[0076]

[Effects of the Invention] As described in detail above, the acoustic model learning device according to the present invention comprises model learning means for learning and generating an acoustic model on the basis of uttered speech data of a plurality of speakers. The model learning means generates a single-Gaussian acoustic model from the uttered speech data of a plurality of specific speakers using a predetermined training algorithm. It then computes, for every state of the generated single-Gaussian acoustic model, the increase in expected likelihood before and after splitting in the context direction, the time direction, and the path direction, and repeatedly searches for and splits the state having the largest increase in expected likelihood among those computed for all states, thereby learning and generating the acoustic model. Because the acoustic model is thus learned and generated taking path-direction splitting into account in addition to state splitting in the context and time directions, it can be applied to acoustic models with multiple mixture components, and the model size is smaller than in the conventional example. The speech recognition rate obtained with this acoustic model can therefore be higher than in the conventional example.

[0077] Further, the acoustic model conversion device according to the present invention performs conversion processing that converts a predetermined acoustic model so as to reduce the number of branches in each state. Based on the predetermined acoustic model, for each phoneme sequence, the device classifies all processing-target states by each combination of a preceding state, which precedes the processing-target state, and a succeeding state, which follows it. For each set obtained by the classification, it generates a new state having as its components all the Gaussian distributions of all the old states belonging to that set. Within each classified set, it merges all new states whose processing-target states coincide, thereby generating the set of merged new states. It then sets every transition between old states before merging as a transition from the new state containing the old start point to the new state containing the old end point, thereby generating an acoustic model containing the processed Gaussian distributions. Since the number of branches in each state can be reduced, the model size is smaller than in the conventional example. As a result, the recognition time when speech recognition is performed using the acoustic model can be shortened.
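The last two steps of the conversion, pooling the Gaussian distributions of merged old states and re-pointing the old transitions, can be sketched as follows. This is a simplified illustration, not the patented procedure: the function `merge_states` and all state and Gaussian names are hypothetical, the grouping of old states is passed in directly instead of being derived from the classification by preceding and succeeding state, and each old state is assumed to belong to exactly one group.

```python
from typing import Dict, List, Set, Tuple


def merge_states(
    gaussians: Dict[str, List[str]],
    groups: List[Set[str]],
    transitions: Set[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], Set[Tuple[str, str]]]:
    """Build new states from groups of old states and remap transitions."""
    new_of: Dict[str, str] = {}            # old state -> new state containing it
    new_gaussians: Dict[str, List[str]] = {}
    for idx, group in enumerate(groups):
        name = f"N{idx}"
        pooled: List[str] = []
        for old in sorted(group):
            new_of[old] = name
            # The new state keeps every Gaussian of every old state in the group.
            pooled.extend(gaussians[old])
        new_gaussians[name] = pooled
    # Each old transition becomes a transition between the new states that
    # contain its old start point and its old end point.
    new_transitions = {(new_of[a], new_of[b]) for a, b in transitions}
    return new_gaussians, new_transitions


# Toy two-path model (made up): paths a1 -> b1 and a2 -> b2, with self-loops.
old_gaussians = {"a1": ["g_a1"], "a2": ["g_a2"], "b1": ["g_b1"], "b2": ["g_b2"]}
old_transitions = {
    ("a1", "a1"), ("a1", "b1"), ("b1", "b1"),
    ("a2", "a2"), ("a2", "b2"), ("b2", "b2"),
}
new_gaussians, new_transitions = merge_states(
    old_gaussians, [{"a1", "a2"}, {"b1", "b2"}], old_transitions
)
print(new_gaussians)
print(sorted(new_transitions))
```

In this toy case the two parallel paths collapse into a single path of two states, each holding a two-component mixture, so the six old transitions reduce to three.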

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the acoustic model learning process executed by the acoustic model learning unit 31 of FIG. 1.

FIG. 3 is a flowchart showing the acoustic model conversion process executed by the acoustic model conversion unit 32 of FIG. 1.

FIG. 4 is a diagram showing context-direction splitting of one state of the acoustic model in the present embodiment.

FIG. 5 is a diagram showing time-direction splitting of one state of the acoustic model in the present embodiment.

FIG. 6 is a diagram showing path-direction splitting of one state of the acoustic model in the present embodiment.

FIG. 7 is a diagram showing a processing example of the ML-SSS method including the path splitting direction: (a) shows the HMM at the first stage of the processing example, (b) shows the HMM at the second stage, after path-direction splitting is executed at the first stage, and (c) shows the HMM after context-direction splitting is executed at the second stage.

FIG. 8 is a diagram showing an example of classification processing for a processing-target state in the present embodiment.

FIG. 9 is a diagram showing the redefined states and their Gaussian mixture distributions in the present embodiment.

FIG. 10 is a diagram showing conversion example 1, a conversion from (a) a multipath model to (b) a single-path model, executed by the acoustic model conversion unit 32 of FIG. 1.

FIG. 11 is a diagram showing conversion example 2, a conversion from (a) a multipath model to (b) a multipath model, executed by the acoustic model conversion unit 32 of FIG. 1.

[Explanation of Symbols]

1 ... microphone, 2 ... feature extraction unit, 3 ... buffer memory, 4 ... phoneme matching unit, 5 ... LR parser, 11, 42 ... hidden Markov network memory (HM network memory), 13 ... LR table memory, 14 ... context-free grammar database memory, 31 ... acoustic model learning unit, 32 ... acoustic model conversion unit, 41 ... memory for uttered speech data of specific speakers.

Continuation of the front page: (58) Fields surveyed (Int. Cl.7, DB name): G10L 15/14, G10L 15/06, G10L 15/10

Claims

(57) [Claims]

1. An acoustic model learning device comprising model learning means for learning and generating an acoustic model based on uttered speech data of a plurality of speakers, wherein the model learning means generates a single-Gaussian acoustic model using a predetermined learning algorithm based on uttered speech data of a plurality of specific speakers, then calculates, for every state of the generated single-Gaussian acoustic model, the increase in expected likelihood before and after splitting in the context direction, the time direction, and the path direction, and learns and generates the acoustic model by repeatedly searching for and splitting the state having the largest increase in expected likelihood among the increases calculated for all of the states.

2. The acoustic model learning device according to claim 1, wherein the model learning means comprises: initial model generation means for generating a single-Gaussian acoustic model using a predetermined learning algorithm based on uttered speech data of a plurality of specific speakers; search means for searching, in the single-Gaussian acoustic model generated by the initial model generation means, for the state having the largest increase in expected likelihood when one state is split in the context direction, the time direction, and the path direction; generation means for splitting the state found by the search means in the context direction, the time direction, or the path direction, whichever corresponds to the largest increase in expected likelihood, and then generating a single-Gaussian acoustic model using a predetermined learning algorithm; and control means for learning and generating the acoustic model by repeating the processing of the generation means and the processing of the search means until no state in the single-Gaussian acoustic model can be split or until the number of states in the single-Gaussian acoustic model reaches a predetermined number of splits.

3. An acoustic model conversion device that performs conversion processing for converting a predetermined acoustic model so as to reduce the number of branches in each state, the device comprising: classification means for classifying, based on the predetermined acoustic model and for each phoneme sequence, all processing-target states by each combination of a preceding state that precedes the processing-target state and a succeeding state that follows the processing-target state, and for generating, for each set obtained by the classification, a new state having as its components all the Gaussian distributions of all the old states belonging to that set; merging means for generating a set of merged new states by merging, in each set classified by the classification means, all new states whose processing-target states coincide; and transition setting means for generating an acoustic model containing the processed Gaussian distributions by setting every transition between old states before merging by the merging means as a transition from the new state containing the old start point to the new state containing the old end point.

4. An acoustic model learning device comprising: the acoustic model learning device according to claim 1 or 2; and the acoustic model conversion device according to claim 3, which performs conversion processing on the acoustic model generated by the acoustic model learning device.

5. A speech recognition device comprising speech recognition means for recognizing speech by referring to a predetermined acoustic model based on a speech signal of an input uttered sentence, wherein the speech recognition means recognizes speech by referring to an acoustic model generated or converted by the acoustic model learning device according to claim 1, 2, or 4, or by the acoustic model conversion device according to claim 3.