JP3315565B2

JP3315565B2 - Voice recognition device

Info

Publication number: JP3315565B2
Application number: JP21342995A
Authority: JP
Inventors: Seiji Hamaguchi (濱口 清治); Koichi Yamaguchi (山口 耕市); Toshio Akabane (赤羽 俊夫)
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1995-08-22
Filing date: 1995-08-22
Publication date: 2002-08-19
Anticipated expiration: 2015-08-22
Also published as: JPH0962290A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus that uses subword-unit HMMs (hidden Markov models).

[0002]

[Description of the Related Art] Methods based on HMMs have long been known as one approach to speech recognition. HMM-based speech recognition is described in detail in, for example, Seiichi Nakagawa, "Speech Recognition by Stochastic Models," published by IEICE. An HMM is a kind of state transition network for which three sets of probabilities are defined: the probability of each state being the initial state, the state-to-state transition probabilities, and the symbol output probabilities of each state.

[0003] A subword-unit HMM is an HMM for a unit smaller than a word, such as a phoneme or a demisyllable. Subword-unit HMMs can be applied to isolated word recognition, continuous speech recognition, and similar tasks. When phoneme HMMs are used for word recognition, for example, each word is regarded as a phoneme sequence, and the concatenation of its phoneme HMMs is taken as the model of that word's feature sequence. When a speech pattern is input, the Viterbi algorithm with the phoneme HMMs computes the probability that each word generates the feature sequence, and the word with the highest probability becomes the recognition result. When phoneme HMMs are used for continuous speech recognition, on the other hand, a syntax network is prepared in the form of a finite-state automaton built from a directed graph of words, as shown in FIG. 7. Because each word can be regarded as a phoneme sequence, this syntax network can be viewed as a network of phonemes. A search using the phoneme HMMs then outputs, as the recognition result, the phoneme sequence with the highest generation probability among those the syntax network permits.
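The word recognition procedure above can be sketched as follows. Each word's phoneme HMMs are concatenated into one chain of states, the chain is scored with the Viterbi algorithm, and the best-scoring word is selected. This sketch makes several assumptions that are not stated in the patent. The topology is left-to-right, so each state may only stay or advance. Transition probabilities are omitted. Scores are log likelihoods read from a precomputed table `loglik[t][state]`. All function and argument names are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_word_score(loglik, states):
    """Best log score of one word model over all frames.

    loglik[t][s]: local log likelihood of state s in frame t (hypothetical table).
    states: the word model, i.e. the concatenated state ids of its phoneme HMMs.
    Each state may stay (self-loop) or advance to the next state.
    """
    n = len(states)
    prev = [NEG_INF] * n
    prev[0] = loglik[0][states[0]]          # a path must start in the first state
    for t in range(1, len(loglik)):
        cur = [NEG_INF] * n
        for k in range(n):
            best = max(prev[k], prev[k - 1] if k > 0 else NEG_INF)
            if best > NEG_INF:
                cur[k] = best + loglik[t][states[k]]
        prev = cur
    return prev[-1]                          # ... and end in the last state


def recognize_word(loglik, lexicon, phoneme_states):
    """Return the word whose concatenated phoneme HMMs score highest, plus all scores."""
    scores = {}
    for word, phonemes in lexicon.items():
        chain = [s for p in phonemes for s in phoneme_states[p]]
        scores[word] = viterbi_word_score(loglik, chain)
    return max(scores, key=scores.get), scores
```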

[0004] When word recognition or continuous speech recognition is performed as described above, the symbol output probability of each state in the phoneme HMMs or the syntax network is normally computed in advance for every frame and stored in a table. When the cumulative scores for finding the maximum generation probability of each word (or other unit) are computed, this precomputed likelihood table is looked up instead of recomputing the probabilities.
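A likelihood table of this kind can be sketched as one row per frame and one column per state. The patent does not specify the output distributions, so this sketch assumes a single one-dimensional Gaussian per state and a scalar acoustic parameter per frame. Both are simplifications, and `state_params` is a hypothetical name. Every word model then scores a state in frame t by looking up `table[t][state]` instead of re-evaluating the density.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a state's output distribution)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def build_likelihood_table(frames, state_params):
    """Precompute log output probabilities for every frame and every state.

    frames: one scalar acoustic parameter per frame (a simplification).
    state_params: {state_id: (mean, var)}, hypothetical single-Gaussian outputs.
    Returns table[t][state_id] = log o_state(t).
    """
    return [
        {s: gaussian_logpdf(x, m, v) for s, (m, v) in state_params.items()}
        for x in frames
    ]
```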

[0005] Speech recognition also requires a rejection function that withholds the recognition result when the input is speech not contained in the recognition vocabulary (an unknown utterance). One method of rejecting unknown utterances in an HMM speech recognizer is presented in, for example, Takao Watanabe and Satoshi Tsukada, "Rejection of unknown utterances by likelihood correction using speech recognition," IEICE Transactions, D-II, Vol. J75-D-II, No. 12, pp. 2002-2009, December 1992. That paper uses demisyllable HMMs and, independently of the recognition target network, maintains a syllable network such as the one shown in FIG. 8. The maximum generation probability obtained from the syllable network is compared with the maximum generation probability obtained from the recognition target network. If the difference is equal to or greater than a fixed value, the utterance is judged to be unknown and is rejected.

[0006]

[Problem to Be Solved by the Invention] The less processing and memory the rejection decision for unknown utterances requires, the better, in terms of both processing speed and cost. The rejection method that consults a syllable network, described above, is certainly effective. However, it requires two networks to be stored, the syllable network and the recognition target network, and therefore needs a large memory capacity. Moreover, because demisyllable HMMs are used, the HMM for the first demisyllable of a syllable must be followed by the HMM for its second demisyllable. As shown in FIG. 8, however, there is no constraint at all on how syllables connect to one another. Consequently, computing the maximum generation probability for the rejection decision from the syllable network requires repeated full searches of the network, regardless of whether demisyllable HMMs share states. This demands a large amount of computation and memory.

[0007] Besides the rejection method that consults the syllable network, the paper by Watanabe et al. also presents a simpler method that needs less processing. The maximum local symbol output probability over all states in each frame is accumulated over the entire input, and this accumulated value replaces the maximum generation probability obtained from the syllable network. This simplified method, however, places no constraint on the connections between demisyllable HMMs, nor even on the transitions between states. It also ignores state transition probabilities and the connection conditions between demisyllables; for example, only a CV-type demisyllable can follow a VC-type demisyllable. For these reasons, the simplified method is reported to perform worse than the rejection method that consults a syllable network with state transition constraints and linguistic connection constraints.

[0008] As shown in FIG. 9, an HMM consists of a chain of several states, and each state can make transitions only to particular states. The method of accumulating the per-frame maximum of the local symbol output probabilities over the entire input therefore ignores the state transition constraints of the HMMs completely. As a result, it inevitably performs worse than a rejection method that takes those constraints into account.

[0009] Accordingly, an object of the present invention is to provide a speech recognition apparatus that imposes connection constraints between HMMs in addition to the state transition constraints within each HMM. With these constraints, the maximum generation probability for the rejection decision can be obtained with little processing and little memory.

[0010]

[Means for Solving the Problem] To achieve the above object, the invention of claim 1 comprises the following units:
- an acoustic analysis unit that extracts acoustic parameters from input speech;
- an HMM data memory that stores subword-unit HMMs having state transition constraint information;
- a likelihood table creation unit that computes the local likelihoods of all states constituting all the HMMs, from the extracted acoustic parameters and the stored HMMs, and creates a likelihood table;
- a reject determination reference cumulative score calculation unit that sets paths on the likelihood table according to the constraints given by the HMMs' state transition constraint information, and computes the maximum reference cumulative score along those paths by the Viterbi algorithm;
- a recognition task cumulative score calculation unit that computes the maximum cumulative score along paths on the likelihood table that follow the HMMs and each recognition task; and
- a reject determination unit that computes the difference between the maximum reference cumulative score from the reject determination reference cumulative score calculation unit and the maximum cumulative score from the recognition task cumulative score calculation unit. If the difference is equal to or greater than a predetermined value, this unit judges the utterance to be an unknown utterance outside the recognition targets and rejects it.

The state transition constraint information of the HMMs stored in the HMM data memory includes connection constraint information between phonemes at HMM boundaries, determined by the language to be recognized.

[0011] In this configuration, once the acoustic analysis unit extracts acoustic parameters from the input speech, the likelihood table creation unit creates the likelihood table from those parameters and the HMMs stored in the HMM data memory. The reject determination reference cumulative score calculation unit then sets paths on the likelihood table according to the constraints given by the HMMs' state transition constraint information, and computes the maximum reference cumulative score along those paths by the Viterbi algorithm. Meanwhile, the recognition task cumulative score calculation unit computes the maximum cumulative score along paths on the likelihood table that follow the HMMs and each recognition task. Finally, the reject determination unit computes the difference between the two maximum cumulative scores. If the difference is equal to or greater than a predetermined value, the utterance is judged to be an unknown utterance outside the recognition targets and is rejected.

[0012] The maximum reference cumulative score for the rejection decision is obtained along Viterbi paths set on the likelihood table. Therefore, when several HMMs share a state in the same frame, their computation paths for the rejection score converge into a single path at that shared state. The maximum reference cumulative score for the rejection decision can thus be obtained with little computation and little memory.

[0013]

[0014] Furthermore, the paths set on the likelihood table are constrained by the connection constraint information between phonemes at HMM boundaries, as determined by the language to be recognized. This further reduces the computation of the maximum reference cumulative score for the rejection decision. It also limits which HMMs can be connected to those plausible in the recognition target language, so the maximum reference cumulative score is computed accurately.

[0015] The invention of claim 2 comprises the following units:
- an acoustic analysis unit that extracts acoustic parameters from input speech;
- an HMM data memory that stores subword-unit HMMs having state transition constraint information;
- a likelihood table creation unit that computes the local likelihoods of all states constituting all the HMMs, from the extracted acoustic parameters and the stored HMMs, and creates a likelihood table;
- a reject determination reference cumulative score calculation unit that sets paths on the likelihood table according to the constraints given by the HMMs' state transition constraint information, and computes the maximum reference cumulative score along those paths by the Viterbi algorithm;
- a recognition task cumulative score calculation unit that computes the maximum cumulative score along paths on the likelihood table that follow the HMMs and each recognition task; and
- a reject determination unit that computes the difference between the maximum reference cumulative score from the reject determination reference cumulative score calculation unit and the maximum cumulative score from the recognition task cumulative score calculation unit. If the difference is equal to or greater than a predetermined value, this unit judges the utterance to be an unknown utterance outside the recognition targets and rejects it.

The HMMs stored in the HMM data memory are phoneme-environment-dependent phoneme HMMs, and their state transition constraint information includes connection constraint information based on the phoneme environment at HMM boundaries.

[0016] In this configuration, the HMMs stored in the HMM data memory are phoneme-environment-dependent phoneme HMMs, and the paths set on the likelihood table are constrained by connection constraint information based on the phoneme environment. This reduces the computation of the maximum reference cumulative score for the rejection decision even further. Because the phoneme environment determines which phoneme HMMs can be connected, the maximum reference cumulative score for the rejection decision is computed with very high accuracy.

[0017]

[Embodiments of the Invention] The present invention is described in detail below with reference to the illustrated embodiment. FIG. 1 is a block diagram of the speech recognition apparatus of this embodiment. The apparatus computes an HMM likelihood table from phoneme HMMs and acoustic parameters, and sets Viterbi paths on that table that follow each state transition constraint of the HMMs. It then decides whether to reject an unknown utterance based on the maximum reference cumulative score on these paths.

[0018] Before the speech recognition apparatus of this embodiment is described, this section explains how the maximum cumulative score is computed along Viterbi paths set on the HMM likelihood table. FIG. 4 shows how the Viterbi algorithm obtains the maximum cumulative score from the HMM likelihood table. FIG. 4(a) is a path diagram of the state transitions used when updating the cumulative score of the intermediate state j or the final state m of the HMM shown in FIG. 9. FIG. 4(b) is a path diagram of the state transitions used when updating the cumulative score of the initial state c in FIG. 9. In both diagrams, the vertical axis represents the states and the horizontal axis represents the frame number.

[0019] Each state in a given frame of the speech spectrum is reached by a transition from some of the states in the immediately preceding frame. The only possible transition paths are those allowed by the HMM state transition constraints. If no states are shared between HMMs, exactly two paths lead into an intermediate or final state of an HMM, as shown in FIG. 9. In FIG. 4(a), state j in frame i is reached from state c and from state j in the preceding frame (i-1). The larger of the cumulative scores of these two paths becomes the cumulative score of state j in frame i. The cumulative score is computed by the Viterbi algorithm given in equation (1).

(Equation 1)

[0020] As described above, equation (1) gives the cumulative score of each state for every frame, and the path by which each state was reached does not need to be stored. Computing a state's cumulative score also requires only the cumulative scores of the states in the immediately preceding frame. Because those are the only scores that need to be kept, the required memory is small.
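The per-frame update in [0019] and [0020] can be sketched as below. Equation (1) is not reproduced in this text, so the sketch assumes the usual Viterbi form in the log domain: g_j(i) = max over k in P(j) of [g_k(i-1) + log a_kj] + log o_j(i). Here P(j) is the set of states allowed to move into j. The transition term is optional and treated as 0 when absent. Only the previous frame's scores are passed in, reflecting the point that nothing older needs to be kept. The names `viterbi_step`, `preds`, and `log_trans` are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_step(prev_scores, frame_loglik, preds, log_trans=None):
    """One frame of the cumulative-score update (assumed form of equation (1)).

    prev_scores:  {state: g_k(i-1)}, the only history that is kept
    frame_loglik: {state: log o_j(i)}, row i of the likelihood table
    preds:        {state j: states k allowed to move into j}, e.g. j <- {c, j}
    log_trans:    optional {(k, j): log a_kj}; missing entries count as 0.0
    """
    trans = log_trans or {}
    new_scores = {}
    for j, sources in preds.items():
        best = max((prev_scores[k] + trans.get((k, j), 0.0) for k in sources),
                   default=NEG_INF)
        new_scores[j] = best + frame_loglik[j]
    return new_scores


def run(loglik_rows, preds, init_scores):
    """Apply the update frame by frame; each previous row of scores is discarded."""
    scores = init_scores
    for row in loglik_rows:
        scores = viterbi_step(scores, row, preds)
    return scores
```

For the three-state chain c → j → m of FIG. 9, `preds = {"c": ["c"], "j": ["c", "j"], "m": ["j", "m"]}` encodes the two incoming paths for j and for m.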

[0021] As shown in FIG. 9, an HMM proceeds from its initial state through its intermediate states to its final state. After the final state is left, a new sequence of state transitions begins from the initial state of another HMM or of the same HMM. For HMMs that do not depend on the preceding or following phoneme or syllable, many initial states of next HMMs can follow the final state of a given HMM. FIG. 4(b) shows the initial state c in frame i being reached from the initial state c of the same HMM and from the final states a, f, m, and p of other HMMs, all in frame (i-1). Linguistic knowledge, however, makes it possible to restrict connections between particular phonemes or between particular syllables. In Japanese, for example, a consonant is considered rarely to be followed directly by another consonant. Connection constraints based on the language to be recognized (hereinafter, linguistic connection constraints) can therefore be set, for example by providing no transition path from phoneme /m/ to phoneme /h/.

[0022] FIGS. 5 and 6 show examples of linguistic connection constraints between Japanese phonemes. Each entry means that a transition is possible from the final state of the phoneme HMM on the left to the initial state of the phoneme HMM on the right. Phonemes are written in Hepburn romanization, with /q/ denoting the geminate consonant (sokuon) and /N/ the moraic nasal (hatsuon). In the example of FIG. 5, seven phonemes can precede /h/: /a/, /i/, /u/, /e/, /o/, /N/, and /q/. There are therefore seven paths into the initial state of /h/. Specializing the constraints to the recognition language in this way allows the cumulative score to be computed more accurately.
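The linguistic connection constraint from [0021] and [0022] can be written as a table of allowed left contexts. That table determines which states feed a phoneme's initial state, as in FIG. 4(b). Only the /h/ row is given in the text, so the table below is deliberately partial. The state naming passed in through `initial_state` and `final_state` is a hypothetical convention.

```python
# Left contexts of /h/ as listed for FIG. 5. The figure's other rows are not
# reproduced in the text, so this table is deliberately partial.
CAN_PRECEDE = {
    "h": {"a", "i", "u", "e", "o", "N", "q"},
}


def initial_state_sources(phoneme, can_precede, initial_state, final_state):
    """States in frame i-1 that may feed the initial state of `phoneme` in frame i.

    The first entry is the initial state's own self-loop (as in FIG. 4(b)).
    The others are final states of phoneme HMMs allowed to precede `phoneme`.
    initial_state/final_state map a phoneme to a (hypothetical) state id.
    """
    sources = [initial_state(phoneme)]
    sources += [final_state(q) for q in sorted(can_precede.get(phoneme, ()))]
    return sources
```

For /h/, this yields the self-loop plus seven final states. No path from /m/ is created, which matches the /m/ → /h/ example in [0021].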

[0023] A phoneme-environment-dependent phoneme HMM specifies the preceding and following phonemes of each phoneme. For example, /a/ preceded by /k/ and /a/ preceded by /m/ are treated as different phonemes, and each is represented by its own phoneme HMM. Phoneme-environment-dependent HMMs require more models than phoneme-environment-independent ones, but they perform better. For such HMMs, the connection constraints between phoneme HMMs matter as much as the transition constraints between states. Imposing connection constraints based on the phoneme environment (hereinafter, phoneme environment connection constraints) when computing the cumulative score for the rejection decision is therefore particularly effective. It improves both the accuracy of that computation and its ease.
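One way to express the phoneme environment connection constraint is shown below. The sketch assumes each phoneme-environment-dependent HMM is labeled by a (left, center, right) triple, as in triphone models; the patent itself only says that preceding and following phonemes are specified. Under that assumption, a model can follow another only if their contexts agree. The class and function names are hypothetical.

```python
from typing import List, NamedTuple


class Triphone(NamedTuple):
    left: str    # preceding phoneme
    center: str  # phoneme modeled by this HMM
    right: str   # following phoneme


def can_follow(prev: Triphone, nxt: Triphone) -> bool:
    """Phoneme environment connection constraint (sketch).

    The HMM for `nxt` may follow `prev` only if prev's right context is nxt's
    center phoneme and nxt's left context is prev's center phoneme.
    """
    return prev.right == nxt.center and nxt.left == prev.center


def allowed_successors(prev: Triphone, inventory: List[Triphone]) -> List[Triphone]:
    """HMMs whose initial state may be entered from the final state of `prev`."""
    return [m for m in inventory if can_follow(prev, m)]
```

For example, `Triphone("k", "a", "m")`, an /a/ preceded by /k/, can be followed by `Triphone("a", "m", "i")` but not by `Triphone("m", "m", "i")`. This constraint prunes far more paths than the language-level table in [0022].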

[0024] The cumulative score finally used in the rejection decision is the largest cumulative score in the last frame. At this point, the initial and intermediate states may be excluded, so that the maximum is taken only over the cumulative scores of the final states.

[0025] A specific example of a speech recognition apparatus that applies this method of computing the maximum cumulative score is described next, with reference to FIG. 1.

A speech signal from a microphone 1 is converted by an A/D converter 2 and passed to an acoustic analysis unit 3, which extracts acoustic parameters for each frame from the digital speech data. A likelihood table creation unit 4 uses these parameters and all the phoneme HMMs stored in an HMM data memory 9 to compute the symbol output probabilities o of all the distinct states in the phoneme HMMs, and creates a likelihood table.

A reject determination reference cumulative score calculation unit 5 then uses the likelihood table and the phoneme HMMs to compute, for each frame, the reference cumulative score for the rejection decision in each state. It does so as described above, along paths such as those in FIG. 4 to which the linguistic and phoneme environment connection constraints have been applied, and stores the results in a reject determination reference cumulative score storage unit 10. Meanwhile, a recognition task cumulative score calculation unit 6 uses the likelihood table, the phoneme HMMs in the HMM data memory 9, and a recognition task dictionary 12 to compute, for each frame, the recognition task cumulative score, which is the accumulated generation probability of each recognition task. It stores the results in a recognition task cumulative score storage unit 11.

When the last frame has been processed, a determination unit 7 normalizes the difference between the maximum reject determination reference cumulative score in storage unit 10 and the maximum recognition task cumulative score in storage unit 11. Based on this normalized value, it either outputs the recognition result or rejects the utterance. A speech recognition control unit 8 controls the A/D converter 2, the acoustic analysis unit 3, the likelihood table creation unit 4, the reject determination reference cumulative score calculation unit 5, the recognition task cumulative score calculation unit 6, and the determination unit 7 to carry out the speech recognition processing.

[0026] The phoneme HMMs stored in the HMM data memory 9 are phoneme-environment-dependent phoneme HMMs. Phoneme environment connection constraint information, together with linguistic connection constraint information such as that shown in FIGS. 5 and 6, is attached to the HMM boundaries.

[0027] FIG. 2 is a flowchart of the speech recognition processing. Under the control of the speech recognition control unit 8, this processing is executed by the A/D converter 2, the acoustic analysis unit 3, the likelihood table creation unit 4, the reject determination reference cumulative score calculation unit 5, the recognition task cumulative score calculation unit 6, and the determination unit 7. The speech recognition processing of this embodiment is described below with reference to FIG. 2.

[0028] In step S1, the reject determination reference cumulative scores gr_j (j: state number) stored in the reject determination reference cumulative score storage unit 10 are initialized; these are hereinafter simply called reject cumulative scores. In step S2, the recognition task cumulative scores gt_k (k: recognition task number) stored in the recognition task cumulative score storage unit 11 are initialized; these are hereinafter simply called recognition cumulative scores. In step S3, the frame number i and the number of frames I are initialized to 0. In step S4, the frame number i is incremented. In step S5, the i-th frame of the speech signal input from the microphone 1 is captured. In step S6, the A/D converter 2 digitizes the speech signal of frame i. In step S7, the acoustic analysis unit 3 extracts the acoustic parameters of frame i from the digital speech signal.

[0029] In step S8, the likelihood table creation unit 4 uses the extracted acoustic parameters of the frame and the phoneme HMM data in the HMM data memory 9 to calculate the symbol output probability o_j(i) of every distinct state j making up all the phoneme HMMs stored in the HMM data memory 9. The phoneme HMM data were created by task-independent training on phonetically balanced words spoken by many speakers. In step S9, the likelihood table creation unit 4 normalizes and log-transforms the calculated symbol output probabilities o_j(i) to obtain the likelihood of each state, producing the likelihood table entry for the current frame. The likelihood table is organized as shown in FIG. 3, with a likelihood stored for each state in each frame. In practice, the likelihood table need not be kept in a separate storage unit: the entry for the current frame can be held in the internal memory of the likelihood table creation unit 4 and deleted once the reject cumulative score gr_j(i) and the recognition cumulative score gt_k(i) for frame i have been calculated. This reduces the amount of memory needed for speech recognition.
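Steps S8 and S9 compute, once per frame, a normalized log likelihood for every distinct state. The sketch below assumes continuous single-Gaussian output densities over a scalar acoustic parameter, and assumes that "normalization" means dividing by the sum over all states; the patent specifies neither, and `likelihood_row` and the state names are hypothetical:

```python
import math
from typing import Dict, Tuple

# Hypothetical state inventory: distinct state id -> (mean, variance) of a
# one-dimensional Gaussian output density. A state shared by several phoneme
# HMMs appears only once, so the table has one entry per distinct state j.
StateTable = Dict[str, Tuple[float, float]]


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log of a 1-D Gaussian density, standing in for log o_j(i)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def likelihood_row(x: float, states: StateTable) -> Dict[str, float]:
    """One frame of the likelihood table (S8-S9).

    Normalization is assumed to mean dividing by the sum over all states,
    done in the log domain (log-sum-exp) to avoid underflow.
    """
    logs = {j: log_gauss(x, m, v) for j, (m, v) in states.items()}
    peak = max(logs.values())
    log_total = peak + math.log(sum(math.exp(l - peak) for l in logs.values()))
    return {j: l - log_total for j, l in logs.items()}
```

Only the current row has to be held in memory: it is consumed by both cumulative-score updates for frame i and can then be discarded, as the paragraph above notes.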

[0030] In step S10, the reject determination reference cumulative score calculation unit 5 reads, from the reject determination reference cumulative score storage unit 10, the reject cumulative score gr_j(i−1) of each state j in the preceding frame (i−1). Using the phoneme HMMs stored in the HMM data memory 9 and the likelihood table, it then calculates the reject cumulative score gr_j(i) of each state j in frame i by equation (1), following state transition paths based on the phoneme HMMs, as shown in FIG. 4. The contents of the reject determination reference cumulative score storage unit 10 are updated with the reject cumulative scores gr_j(i) calculated in this way.
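Equation (1) is not reproduced in this section. The sketch below assumes it takes the standard log-domain Viterbi form gr_j(i) = max over predecessors h of [gr_h(i−1) + log a_hj] + l_j(i), where l_j(i) is the likelihood table entry for state j in frame i and a_hj is a transition probability; the predecessor-list representation and `viterbi_step` are hypothetical:

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")

# For each distinct state j: the (predecessor h, log a_hj) pairs permitted by
# the HMM topology and boundary constraints. Hypothetical representation.
Preds = Dict[str, List[Tuple[str, float]]]


def viterbi_step(prev: Dict[str, float], row: Dict[str, float],
                 preds: Preds) -> Dict[str, float]:
    """Update gr_j(i) from gr_h(i-1) for every state j (assumed form of eq. (1))."""
    new: Dict[str, float] = {}
    for j, incoming in preds.items():
        # Best way into state j from any permitted predecessor h.
        best = max((prev.get(h, NEG_INF) + a for h, a in incoming),
                   default=NEG_INF)
        # Add this frame's likelihood for state j.
        new[j] = best + row[j]
    return new
```

For step S1, gr_j(0) could be set to 0 for states permitted to begin an utterance and to negative infinity elsewhere; that initialization is likewise an assumption.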

[0031] In other words, the reject determination reference cumulative score calculation unit 5 calculates the reject cumulative score gr_j(i) for each state j along the Viterbi paths set on the likelihood table according to the phoneme HMMs. The phoneme HMMs stored in the HMM data memory 9 are phoneme-environment-dependent phoneme HMMs, and phoneme-environment connection constraint information and linguistic connection constraint information are attached to the HMM boundaries. The state transition paths that can be set on the likelihood table from the phoneme HMMs are therefore restricted, which reduces the number of Viterbi computations needed to calculate the reject cumulative score gr_j(i).
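How boundary constraints restrict the paths can be sketched by building the predecessor lists of a left-to-right phoneme network, adding a cross-boundary arc only for permitted HMM pairs. The data layout, `build_preds`, and the uniform placeholder log transition probabilities are hypothetical; the patent does not describe how the constraint information is encoded:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NEG_INF = float("-inf")


def build_preds(hmms: Dict[str, List[str]],
                allowed: Set[Tuple[str, str]],
                self_loop: float, advance: float,
                boundary: float) -> Dict[str, List[Tuple[str, float]]]:
    """Predecessor lists (pred state, log prob) for a left-to-right network.

    hmms maps a phoneme HMM name to its state ids; allowed lists the
    (left, right) HMM pairs that the boundary constraints permit.
    """
    arcs: Dict[Tuple[str, str], float] = {}

    def add(h: str, j: str, lp: float) -> None:
        # A shared state may be listed by several HMMs; keep one arc per pair.
        arcs[(h, j)] = max(lp, arcs.get((h, j), NEG_INF))

    for states in hmms.values():
        for k, s in enumerate(states):
            add(s, s, self_loop)                      # self-loop
            if k + 1 < len(states):
                add(s, states[k + 1], advance)        # next state in the HMM
    # Cross-boundary arcs exist only for permitted pairs; this is where the
    # connection constraints prune the search space.
    for left, right in allowed:
        add(hmms[left][-1], hmms[right][0], boundary)

    preds: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for (h, j), lp in arcs.items():
        preds[j].append((h, lp))
    return dict(preds)
```

With two two-state HMMs "a" and "i", permitting only the pair ("a", "i") yields 7 arcs, whereas permitting all four pairs yields 10: every forbidden pair removes one arc that the Viterbi update would otherwise have to evaluate in every frame.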

[0032] In step S11, the recognition task cumulative score calculation unit 6 reads, from the recognition task cumulative score storage unit 11, the recognition cumulative score gt_k(i−1) of each recognition task k in the preceding frame (i−1), and calculates the recognition cumulative score gt_k(i) using the phoneme HMMs stored in the HMM data memory 9, the likelihood table, and the recognition task dictionary 12. The contents of the recognition task cumulative score storage unit 11 are updated with the recognition cumulative scores gt_k(i) calculated in this way. Each recognition task is represented as a concatenation of phoneme HMMs. The recognition cumulative score gt_k(i) of frame i is therefore obtained by accumulating the symbol output probability o_k(i) and the state transition probability p_k(i) in frame i onto the recognition cumulative score gt_k(i−1) of each recognition task k in the preceding frame (i−1).
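Because each recognition task is a concatenation of phoneme HMMs, its score can be sketched as a left-to-right Viterbi pass over the task's state chain, reading the same likelihood table rows as the reject score. The sketch assumes each state either stays or advances by one, and uses uniform placeholder log transition probabilities in place of the HMMs' own p_k(i); `task_score` and the state names are hypothetical. Each iteration of the outer loop corresponds to one frame's step S11 update:

```python
from typing import Dict, List

NEG_INF = float("-inf")


def task_score(chain: List[str], rows: List[Dict[str, float]],
               self_loop: float, advance: float) -> float:
    """Recognition cumulative score gt_k(I) for one task k.

    chain is the task's state sequence (its phoneme HMMs concatenated);
    rows are the likelihood-table rows for frames 1..I.
    """
    # Frame 1: a path must start in the first state of the chain.
    scores = [rows[0][chain[0]]] + [NEG_INF] * (len(chain) - 1)
    for row in rows[1:]:
        new = []
        for n, state in enumerate(chain):
            stay = scores[n] + self_loop
            move = scores[n - 1] + advance if n > 0 else NEG_INF
            # Better transition plus this frame's log likelihood for the state.
            new.append(max(stay, move) + row[state])
        scores = new
    # The task score is that of a path which has reached the final state.
    return scores[-1]
```

For the two-state chain ["a1", "a2"] with frame rows {a1: −1, a2: −3} and {a1: −2, a2: −1} and both log transition probabilities set to −0.5, the best path is a1 then a2, giving −1 − 0.5 − 1 = −2.5.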

[0033] In other words, the recognition task cumulative score calculation unit 6 calculates the recognition cumulative score gt_k(i) for each recognition task k along the Viterbi paths set on the likelihood table according to the phoneme HMMs and the recognition task.

[0034] In step S12, it is determined whether frame i is the last frame. If it is not, the process returns to step S4 and moves on to the next frame (i+1); if it is, the process proceeds to step S13. In step S13, the frame count I is set to the frame number i. In step S14, the determination unit 7 searches the reject cumulative scores gr_j(I) stored in the reject determination reference cumulative score storage unit 10 for the maximum value, which becomes the maximum reference cumulative score for reject determination, Lr. Likewise, it searches the recognition cumulative scores gt_k(I) stored in the recognition task cumulative score storage unit 11 for the maximum value, which becomes the recognition task maximum cumulative score, Lt.

[0035] In step S15, the determination unit 7 calculates a normalized reject determination value L′ from the retrieved maximum reference cumulative score for reject determination, Lr, and the recognition task maximum cumulative score, Lt, by equation (2):

L′ = (Lr − Lt) / I   …(2)

The difference between Lr and Lt grows in proportion to the number of frames I, so equation (2) normalizes it by I. In step S16, the determination unit 7 determines whether the normalized reject determination value L′ is smaller than a threshold. If it is, the process proceeds to step S18; otherwise it proceeds to step S17. In step S17, the determination unit 7 judges the utterance to be an unknown utterance, rejects the phoneme string that yields the recognition task maximum cumulative score Lt, and ends the speech recognition operation. In step S18, the determination unit 7 judges the utterance to be within the recognition vocabulary, outputs the phoneme string that yields the recognition task maximum cumulative score Lt as the recognition result, and ends the speech recognition operation.
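Steps S14 to S18 reduce to two maximum searches, equation (2), and one comparison. A minimal sketch follows; the threshold is not given in the patent, and the task names and score values used below are hypothetical:

```python
from typing import Dict, Tuple


def reject_decision(gr_final: Dict[str, float], gt_final: Dict[str, float],
                    num_frames: int, threshold: float) -> Tuple[bool, str, float]:
    """Steps S14-S18: return (reject?, best task, normalized value L')."""
    lr = max(gr_final.values())                             # S14: Lr
    best_task = max(gt_final, key=lambda k: gt_final[k])    # S14: task giving Lt
    lt = gt_final[best_task]
    l_norm = (lr - lt) / num_frames                         # S15: equation (2)
    # S16: reject (S17) unless L' is below the threshold, else accept (S18).
    return l_norm >= threshold, best_task, l_norm
```

As a worked example, with Lr = −100, Lt = −130 (task "hai") and I = 50 frames, L′ = (−100 − (−130)) / 50 = 0.6. With a threshold of 1.0 the utterance is accepted as "hai"; with a threshold of 0.5 it is rejected.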

[0036] As described above, in this embodiment the HMM data memory 9 holds phoneme-environment-dependent phoneme HMMs, with phoneme-environment connection constraint information and linguistic connection constraint information attached to the HMM boundaries. The likelihood table creation unit 4 creates a likelihood table from the acoustic parameters of the audio signal captured frame by frame and from the phoneme HMMs in the HMM data memory 9. The reject determination reference cumulative score calculation unit 5 then uses the Viterbi algorithm to calculate the reject cumulative score gr_j(i) along the paths set on the likelihood table according to the phoneme HMMs carrying the phoneme-environment and linguistic connection constraints, and updates the contents of the reject determination reference cumulative score storage unit 10. Meanwhile, the recognition task cumulative score calculation unit 6 calculates the recognition cumulative score gt_k(i) for each recognition task k along the paths set on the likelihood table according to the phoneme HMMs and the recognition task, and updates the contents of the recognition task cumulative score storage unit 11. When this processing has been completed up to the final frame, the determination unit 7 searches for the maximum values of the cumulative scores gr_j(I) and gt_k(I) held at that point in the reject determination reference cumulative score storage unit 10 and the recognition task cumulative score storage unit 11, obtaining the maximum reference cumulative score for reject determination, Lr, and the recognition task maximum cumulative score, Lt. It then decides whether to reject the recognition result using the normalized reject determination value L′ derived from Lr and Lt.

[0037] Thus, the reject determination reference cumulative score calculation unit 5 of this embodiment calculates the reject cumulative score gr_j(i) for each state j along the Viterbi paths set on the likelihood table according to the phoneme HMMs. Consequently, when a state j in the current frame is shared by, for example, two phoneme HMMs, two paths enter the shared state j from different states of those two HMMs in the preceding frame, and the Viterbi algorithm cuts off one of them (the one with the smaller reject cumulative score) in that frame. From the next frame onward, the reject cumulative score gr_j(i) no longer has to be computed for the cut-off phoneme HMM, and the amount of computation is reduced accordingly. By contrast, the "likelihood correction method using syllable recognition" described in the paper by Watanabe et al. mentioned in the description of the prior art computes generation probabilities with a syllable network such as the one shown in FIG. 8. There, even when syllable HMMs partially share states, a cumulative score is computed independently for each syllable HMM, so the computation of the cumulative scores is not reduced. In this embodiment, therefore, when the phoneme HMMs in use include state-sharing phoneme HMMs, the processing required for reject determination is greatly reduced.
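The saving can be made concrete by counting score updates per frame. The sketch below uses two hypothetical phoneme HMMs, "k-a" and "t-a", that differ in their first state and share the remaining two, and assumes a simplified cost model of one update per state per frame:

```python
from typing import Dict, List, Tuple


def state_updates(hmms: Dict[str, List[str]]) -> Tuple[int, int]:
    """Per-frame updates: (independent per-HMM scoring, shared-state scoring)."""
    independent = sum(len(states) for states in hmms.values())
    shared = len({s for states in hmms.values() for s in states})
    return independent, shared


def survivor(incoming: List[Tuple[str, float]]) -> Tuple[str, float]:
    """At a shared state, Viterbi keeps only the best-scoring incoming path."""
    return max(incoming, key=lambda pair: pair[1])
```

Scoring each HMM independently, as in the syllable-network approach, costs 6 updates per frame for this pair, while scoring per distinct state costs 4. At the shared state "a2", entered from "ka1" with score −4.0 and from "ta1" with score −6.5, only the path through "ka1" survives.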

[0038] In addition, because phoneme-environment connection constraint information and linguistic connection constraint information are attached to the boundaries of the phoneme HMMs, restricting which phoneme HMMs can be concatenated reduces the computation further. Impossible paths are also eliminated in advance, so the reject cumulative score gr_j(i) is calculated more accurately and the accuracy of reject determination improves.

[0039] Both the reject cumulative score gr_j(i) and the recognition cumulative score gt_k(i) are calculated from the phoneme HMM data stored in the HMM data memory 9. Unlike the prior art, there is therefore no need to store two separate networks, one for reject determination and one for the recognition task, and the required storage is reduced. Moreover, because both gr_j(i) and gt_k(i) are calculated from the phoneme HMMs stored in the HMM data memory 9, the reject determination process described above applies whether the recognition task is word recognition or continuous speech recognition using a syntax network.

[0040] In the above embodiment, phoneme-environment-dependent phoneme HMMs are registered in the HMM data memory 9, and both phoneme-environment connection constraint information and linguistic connection constraint information are attached to the HMM boundaries as constraint information. The present invention is not limited to this, however; only the phoneme-environment connection constraint information may be attached as constraint information. Alternatively, phoneme-environment-independent phoneme HMMs may be registered in the HMM data memory with only linguistic connection constraint information attached to the HMM boundaries. In that case, however, the reject cumulative score gr_j(i) is calculated less accurately.

[0041]

[Effects of the Invention] As is apparent from the above, in the speech recognition apparatus of the invention according to claim 1, the likelihood table creation unit creates a likelihood table from the acoustic parameters and the subword-unit HMMs; the reject determination reference cumulative score calculation unit uses the Viterbi algorithm to calculate the maximum reference cumulative score along paths set on the likelihood table according to constraints based on the state transition constraint information of the HMMs; the recognition task cumulative score calculation unit calculates the maximum cumulative score along paths on the likelihood table that follow the HMMs and each recognition task; and the reject determination unit decides whether to reject the utterance based on the difference between the two maximum cumulative scores. Therefore, when several HMMs share a state in the same frame during computation of the maximum reference cumulative score for reject determination, the computation of that score is greatly reduced. As a result, less memory is needed for the maximum reference cumulative score. Furthermore, because the maximum reference cumulative score for reject determination and the maximum cumulative score for the recognition task are both calculated from a likelihood table based on the same subword-unit HMMs, separate networks for reject determination and for the recognition task are unnecessary, and the storage capacity can be reduced.

[0042] In this case, the state transition constraint information of the HMMs includes connection constraint information between phonemes of the recognition target language at the HMM boundaries, so the linguistic connection constraints are imposed on the paths set on the likelihood table. According to this invention, therefore, the computation of the maximum reference cumulative score for reject determination is reduced further, and because the HMMs that can be concatenated are restricted according to the recognition target language, the maximum reference cumulative score for reject determination is calculated more accurately.

[0043] The speech recognition apparatus of the invention according to claim 2 has the same acoustic analysis unit, HMM data memory, likelihood table creation unit, reject determination reference cumulative score calculation unit, recognition task cumulative score calculation unit, and reject determination unit as the invention according to claim 1. Therefore, when several HMMs share a state in the same frame during computation of the maximum reference cumulative score for reject determination, the computation of that score is greatly reduced. As a result, less memory is needed for the maximum reference cumulative score. Because the maximum reference cumulative score for reject determination and the maximum cumulative score for the recognition task are both calculated from a likelihood table based on the same subword-unit HMMs, separate networks for reject determination and for the recognition task are unnecessary, and the storage capacity can be reduced. Furthermore, the HMMs stored in the HMM data memory are phoneme-environment-dependent phoneme HMMs, and their state transition constraint information includes connection constraint information based on the phoneme environment at the HMM boundaries, so the phoneme-environment connection constraints are imposed on the paths set on the likelihood table. According to this invention, therefore, the computation of the maximum reference cumulative score for reject determination is reduced further, and because the phoneme HMMs that can be concatenated are determined by the phoneme environment, the maximum reference cumulative score for reject determination is calculated much more accurately.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the speech recognition apparatus according to the present invention.

FIG. 2 is a flowchart of the speech recognition operation performed under the control of the speech recognition control unit in FIG. 1.

FIG. 3 is a diagram showing an example of a likelihood table.

FIG. 4 is a diagram showing an example of the paths followed when a cumulative score is obtained from the likelihood table by the Viterbi algorithm.

FIG. 5 is a diagram showing an example of connection constraints between phonemes.

FIG. 6 is a diagram, continued from FIG. 5, showing an example of connection constraints between phonemes.

FIG. 7 is a diagram showing an example of a syntax network.

FIG. 8 is a diagram showing a syllable network used in conventional unknown-utterance rejection.

FIG. 9 is an explanatory diagram of an HMM.

[Explanation of Reference Numerals]

3: acoustic analysis unit
4: likelihood table creation unit
5: reject determination reference cumulative score calculation unit
6: recognition task cumulative score calculation unit
7: determination unit
8: speech recognition control unit
9: HMM data memory
10: reject determination reference cumulative score storage unit
11: recognition task cumulative score storage unit

Continuation of the front page

(56) References cited:
JP-A-8-6588 (JP, A)
Seiichi Miki, Tatsuya Kawahara, Shuji Doshita, "Rejection of utterances deviating from the syntactic constraints of a task," Proceedings of the Acoustical Society of Japan 1994 Autumn Meeting, Japan, October 1994, 1-Q-1, pp. 143-144.
Seiichi Nakagawa, Speech Recognition by Stochastic Models, IEICE, Japan, July 1988, p. 96.
Hiroki Ohnishi, Masanori Miyatake, "A study of continuous speech recognition using phoneme environment," Proceedings of the Acoustical Society of Japan 1990 Autumn Meeting, Japan, September 1990, 2-8-4, pp. 53-54.
Yoshiaki Noda, Shigeki Sagayama, "HMM-LR speech recognition by A* beam search using forward likelihood," IEICE Technical Report (Speech), Japan, June 17, 1994, SP94-23, pp. 1-7.
Takao Watanabe, Satoshi Tsukada, "Rejection of unknown utterances by likelihood correction using syllable recognition," IEICE Transactions D-II, Japan, December 1992, Vol. J75-D-II, No. 12, pp. 2002-2009.

(58) Fields searched (Int. Cl.⁷, DB name): G10L 15/14, G10L 15/18, G10L 15/28; JICST file (JOIS)

Claims

(57) [Claims]

[Claim 1] A speech recognition apparatus comprising:
an acoustic analysis unit that extracts acoustic parameters from input speech;
an HMM data memory storing subword-unit HMMs having state transition constraint information;
a likelihood table creation unit that, based on the extracted acoustic parameters and the stored HMMs, calculates the local likelihoods of all states making up all the HMMs and creates a likelihood table;
a reject determination reference cumulative score calculation unit that sets paths on the likelihood table according to constraints based on the state transition constraint information of the HMMs and calculates, by the Viterbi algorithm, a maximum reference cumulative score along those paths;
a recognition task cumulative score calculation unit that calculates a maximum cumulative score along paths on the likelihood table that follow the HMMs and each recognition task; and
a reject determination unit that calculates the difference between the maximum reference cumulative score calculated by the reject determination reference cumulative score calculation unit and the maximum cumulative score calculated by the recognition task cumulative score calculation unit and, if the difference is equal to or greater than a predetermined value, determines that the utterance is an unknown utterance outside the recognition targets and rejects it,
wherein the state transition constraint information of the HMMs stored in the HMM data memory includes connection constraint information between phonemes of the recognition target language at the HMM boundaries.

[Claim 2] A speech recognition apparatus comprising:
an acoustic analysis unit that extracts acoustic parameters from input speech;
an HMM data memory storing subword-unit HMMs having state transition constraint information;
a likelihood table creation unit that, based on the extracted acoustic parameters and the stored HMMs, calculates the local likelihoods of all states making up all the HMMs and creates a likelihood table;
a reject determination reference cumulative score calculation unit that sets paths on the likelihood table according to constraints based on the state transition constraint information of the HMMs and calculates, by the Viterbi algorithm, a maximum reference cumulative score along those paths;
a recognition task cumulative score calculation unit that calculates a maximum cumulative score along paths on the likelihood table that follow the HMMs and each recognition task; and
a reject determination unit that calculates the difference between the maximum reference cumulative score calculated by the reject determination reference cumulative score calculation unit and the maximum cumulative score calculated by the recognition task cumulative score calculation unit and, if the difference is equal to or greater than a predetermined value, determines that the utterance is an unknown utterance outside the recognition targets and rejects it,
wherein the HMMs stored in the HMM data memory are phoneme-environment-dependent phoneme HMMs, and the state transition constraint information of the HMMs stored in the HMM data memory includes connection constraint information based on the phoneme environment at the HMM boundaries.