JP4270732B2

JP4270732B2 - Voice recognition apparatus, voice recognition method, and computer-readable recording medium recording voice recognition program

Info

Publication number: JP4270732B2
Application number: JP2000280655A
Authority: JP
Inventors: 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2000-09-14
Filing date: 2000-09-14
Publication date: 2009-06-03
Anticipated expiration: 2020-09-14
Also published as: JP2002091484A

Abstract

PROBLEM TO BE SOLVED: To obtain a language model having high estimation precision and a voice recognition device having high recognition precision. SOLUTION: A learning text data tree structure clustering means 2001 performs tree-structured clustering that hierarchically divides learning text data 1001 so that each cluster has linguistically similar properties, and generates tree-structured learning text data clusters 2002. A language model generating means 1004 generates tree-structured cluster-specific language models 2003 using the learning text data belonging to each of the clusters 2002.

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識を行う際に参照する言語モデル生成装置及びこれを用いた音声認識装置、言語モデル生成方法及びこれを用いた音声認識方法、並びに言語モデル生成プログラムを記録したコンピュータ読み取り可能な記録媒体及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体に関するものである。
【０００２】
【従来の技術】
近年、使用話者が単語を連続して入力できる連続音声認識技術の実用化検討が盛んに行われている。連続音声認識は、単語の復号列が最大事後確率を持つように、音声の音響的な観測系列に基づいて復号することである。これは次の（１）式で表される。
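【数１】
$$\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O) = \operatorname*{arg\,max}_{W} P(O \mid W)\,P(W) \qquad (1)$$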
ここで、Ｏは音声の音響的な観測値系列［ｏ₁，ｏ₂，ｏ₃，．．．，ｏ_T］であり、Ｗは単語列［ｗ₁，ｗ₂，ｗ₃，．．．，ｗ_n］である。Ｐ（Ｏ｜Ｗ）は単語列Ｗが与えられたときの観測値系列Ｏに対する確率であり、音響モデルによって計算するものである。Ｐ（Ｗ）は単語列Ｗの生起確率（出現確率）であり、言語モデルによって計算するものである。
【０００３】
音声認識については、森北出版（株）から出版されている古井貞煕著の「音声情報処理」（以降、文献１とする）、電子情報通信学会から出版されている中川聖一著の「確率モデルによる音声認識」（以降、文献２とする）、ＮＴＴアドバンステクノロジ（株）から出版されているＬａｗｒｅｎｃｅＲａｂｉｎｅｒ、Ｂｉｉｎｇ−ＨｗａｎｇＪｕａｎｇ著、古井貞煕監訳の「音声認識の基礎（上、下）」（以降、文献３とする）に詳しく説明されている。
【０００４】
音響モデルによって計算するＰ（Ｏ｜Ｗ）は、最近は統計的手法である隠れマルコフモデル（ＨＭＭ）を用いる検討が盛んである。隠れマルコフモデルを用いた音響モデルは、例えば文献３の６章に詳しく述べられている。
【０００５】
また、言語モデルによって計算するＰ（Ｗ）は統計的な手法を用いることが多く、代表的なものにＮ−ｇｒａｍモデルがある（Ｎは２以上）。これらについては、東京大学出版会から出版されている北研二著の「確率的言語モデル」（以下文献４とする）の３章において詳しく説明されている。Ｎ−ｇｒａｍモデルは、直前の（Ｎ−１）個の単語から次の単語への遷移確率を統計的に与えるものである。Ｎ−ｇｒａｍモデルによる単語列ｗ^L ₁＝ｗ₁．．．ｗ_Lの生起確率は、次の（２）式によって与えられる。
【数２】
$$P(w_1^L) = \prod_{t=1}^{L} P(w_t \mid w_{t+1-N}^{t-1}) \qquad (2)$$
【０００６】
上記（２）式において、確率Ｐ（ｗ_t｜ｗ_t+1-N ^t-1）は（Ｎ−１）個の単語からなる単語列ｗ_t+1-N ^t-1の後に単語ｗ_tが生起する確率であり、Πは積を表している。例えば、「私・は・駅・へ・行く」（・は単語の区切りを表す）といった単語列の生起確率を２−ｇｒａｍ（バイグラム）で求める場合は、次の（３）式のようになる。（３）式において、＃は文頭、文末を表す記号である。
Ｐ（私・は・駅・へ・行く）＝
Ｐ（私｜＃）Ｐ（は｜私）Ｐ（駅｜は）Ｐ（へ｜駅）Ｐ（行く｜へ）
Ｐ（＃｜行く）（３）
【０００７】
確率Ｐ（ｗ_t｜ｗ_t+1-N ^t-1）は学習用テキストデータの単語列の相対頻度によって求められる。単語列Ｗの学習用テキストデータにおける出現頻度をＣ（Ｗ）とすれば、例えば、「私・は」の２−ｇｒａｍ確率Ｐ（は｜私）は、次の（４）式によって計算される。（４）式において、Ｃ（私・は）は単語列「私・は」の出現頻度、Ｃ（私）は「私」の出現頻度である。
Ｐ（は｜私）＝Ｃ（私・は）／Ｃ（私）（４）
【０００８】
しかしながら、Ｎ−ｇｒａｍモデルの確率値を単純に相対頻度によって推定すると、学習用テキストデータ中に出現しない単語組を０にしてしまうという大きな欠点がある（ゼロ頻度問題）。また、例え学習用テキストデータ中に出現したとしても出現頻度の小さな単語列に対しては、統計的に信頼性のある確率値を推定するのが難しい（スパースネスの問題）。これらの問題に対処するために、通常はスムージングあるいは平滑化と呼ばれる手法を用いる。スムージングについては、上記文献４の３．３章にいくつかの手法が述べられているので、ここでは、具体的な説明は省略する。
【０００９】
言語モデルの学習には、音声認識の対象とする分野や場面・状況の文を学習用テキストデータとして用いるが、実際のアプリケーションでは、音声認識の対象がさまざまな分野や、さまざまな場面・状況の音声である場合が多い。単語列の生起確率は分野、場面・状況が違うと異なった確率となるので、分野、場面・状況の異なりを無視して学習用テキストデータを一括して学習して言語モデルを生成した場合は、言語モデルの精度は良くない。
【００１０】
このような、さまざまな分野や、さまざまな場面・状況を音声認識の対象とした音声認識装置の性能を上げるために、言語モデルの学習用テキストデータをクラスタリングして、分割されたクラスタ毎に言語モデルを作成する方法が検討されている。従来技術としては、例えば、公開特許公報２０００−７５８８６号の「統計的言語モデル生成装置及び音声認識装置」（以降、文献５とする）がある。ここで、クラスタとは、例えばクラスタ１が政治、クラスタ２がスポーツといった分野別の分割や、文の距離を定義して文をクラスタリングして得ることができる。
【００１１】
学習用テキストデータをクラスタに分割した場合には、クラスタ当たりの学習用テキストデータは少なくなるので、更に前述のゼロ頻度問題やスパースネスの問題が大きくなる。これに対して文献５では、クラスタに分割しない全学習用テキストデータを用いて推定した言語モデルＬＭ_aと、クラスタに分割された学習用テキストデータを用いて推定したクラスタ別の言語モデルＬＭ_c ^k（ｋはクラスタ番号）を用いて、最大事後確率推定法によってＬＭ_map ^kを推定することで精度の高い言語モデルを得ている。
【００１２】
図１３は文献５に記述されている従来の言語モデル生成装置の構成を示すブロック図である。図において、１００１は言語モデルの学習用テキストデータ、１００２は学習用テキストデータクラスタリング手段、１００３は学習用テキストデータクラスタ、１００３−１〜１００３−Ｍはクラスタ１〜Ｍの学習用テキストデータ、１００４は言語モデル生成手段、１００５はクラスタ別言語モデル、１００５−１〜１００５−Ｍはクラスタ１〜Ｍの言語モデルである。
【００１３】
次に動作について説明する。
学習用テキストデータ１００１は、言語モデルを学習するためのテキストデータであり、音声認識装置が認識対象とする単語や文を文字にしたものである。この学習用テキストデータ１００１は、学習用テキストデータクラスタリング手段１００２へ入力される。
【００１４】
学習用テキストデータクラスタリング手段１００２は、学習用テキストデータ１００１をクラスタリングする。文献５では、ｋ−ｍｅａｎｓ法に類似した方法を用いてテキストを文単位でクラスタリングしている。通常のｋ−ｍｅａｎｓ法と異なる点は、（１）クラスタ中心ベクトルを、そのクラスタに属する文で生成される言語モデルとすること、（２）距離尺度に文の生成確率を用いていることである。また、言語モデルにはＮ−ｇｒａｍモデルを用いている。
【００１５】
学習用テキストデータクラスタ１００３は、学習用テキストデータクラスタリング手段１００２によって、Ｍ個のクラスタにクラスタリングされたクラスタ１の学習用テキストデータ１００３−１〜クラスタＭの学習用テキストデータ１００３−Ｍで構成されている。
【００１６】
言語モデル生成手段１００４は、学習用テキストデータクラスタリング手段１００２によって得られたクラスタ１の学習用テキストデータ１００３−１〜クラスタＭの学習用テキストデータ１００３−Ｍをそれぞれ入力して、クラスタ１の言語モデル１００５−１〜クラスタＭの言語モデル１００５−Ｍで構成するクラスタ別言語モデル１００５を生成する。言語モデル生成手段１００４は、クラスタ毎の学習用テキストデータ数の減少による言語モデルの推定精度の低下を防ぐために、クラスタ分割しない全学習用テキストデータを用いて推定した言語モデルＬＭ_ａと、クラスタに分割された学習用テキストデータを用いて推定したクラスタ別の言語モデルＬＭ_c ^kを用いて、最大事後確率推定法によってクラスタ別の言語モデルＬＭ_map ^kを推定している。
【００１７】
次に上記言語モデル生成装置を用いた従来の音声認識装置の説明を行う。図１４は文献５に開示された従来の音声認識装置の構成を示すブロック図である。図において、１１０１は認識対象音声、１１０２は音声特徴量抽出手段、１１０３は音響モデル、１１０４は言語モデル選択手段、１１０５は照合手段、１１０６は音声認識結果である。クラスタ別言語モデル１００５は、図１３と同一の機能ブロックであり、同一の符号を付すと共に説明は省略する。
【００１８】
次に動作について説明する。
認識対象音声１１０１は認識対象とする音声であり、音声特徴量抽出手段１１０２へ入力される。音声特徴量抽出手段１１０２は、認識対象音声１１０１に含まれている音声特徴量を抽出する。音響モデル１１０３は音声の音響的な照合を行うためのモデルである。音響モデル１１０３は、例えば、多数の話者が発声した文や単語の音声を用いて学習した、前後音素環境を考慮した音素を認識ユニットとしたＨＭＭを用いる。
【００１９】
言語モデル選択手段１１０４は、言語モデル生成装置を用いて生成したクラスタ１の言語モデル１００５−１〜クラスタＭの言語モデル１００５−Ｍで構成されるクラスタ別言語モデル１００５の中から、照合手段１１０５で用いる言語モデルを選択する。文献５では、クラスタに分割する前の不特定言語モデルを用いて照合を行い、得られた認識結果候補の単語列に対して、最も生起確率が高いクラスタ別言語モデルを、クラスタ１の言語モデル１００５−１〜クラスタＭの言語モデル１００５−Ｍから１つ選択している。
【００２０】
照合手段１１０５は、言語モデル選択手段１１０４によって選択された言語モデルが設定している認識対象の単語［Ｗ（１），Ｗ（２），・・・，Ｗ（ｗｎ）］（ｗｎは認識対象とする単語数）の発音表記を認識ユニットラベル表記に変換し、このラベルにしたがって、音響モデル１１０３に格納されている音素単位のＨＭＭを連結し、認識対象単語の標準パターン［λ_W(1)，λ_W(2)，．．．，λ_W(wn)］を作成する。
【００２１】
そして、照合手段１１０５は、認識対象単語の標準パターンと選択された言語モデルによって表される単語列の生起確率を用いて、音声特徴量分析手段１１０２の出力である音声特徴量に対して照合を行い、音声認識結果１１０６を出力する。音声認識結果１１０６は、認識対象音声１１０１に対して、認識対象単語で最も照合スコアが高い単語の単語番号系列Ｒｎ＝［ｒ（１），ｒ（２），．．．，ｒ（ｍ）］を計算し、単語番号に対応する単語Ｒｗ＝［Ｗ（ｒ（１）），Ｗ（ｒ（２）），．．．，Ｗ（ｒ（ｍ））］を出力する。ここで、ｒ（ｉ）は音声認識結果１１０６の単語系列のｉ番目の単語の単語番号を示す。また、ｍは認識単語系列の単語数を示す。
【００２２】
【発明が解決しようとする課題】
従来の言語モデル生成装置は以上のように構成されているので、クラスタリングによって分割するクラスタ数が多くなると、クラスタ当たりの学習用テキストデータ数が少なくなり、言語モデルの推定精度が悪くなるので音声認識精度が高くならないという課題があった。
【００２３】
また、分割するクラスタ数が多くなると、１発声が複数のクラスタの言語性質を持つような場合、認識率が高くならないという課題があった。
【００２４】
この発明は、上記のような課題を解決するためになされたものであり、推定精度の高い言語モデルを作成できる言語モデル生成装置、言語モデル生成方法及び言語モデル生成プログラムを記録したコンピュータ読み取り可能な記録媒体を得ることを目的とする。
【００２５】
また、この発明は、推定精度の高い言語モデルを用いて、音声認識精度の高い音声認識装置、音声認識方法及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体を得ることを目的とする。
【００３０】
【課題を解決するための手段】
この発明に係る音声認識装置は、認識対象音声を入力して音声認識を行い音声認識結果を出力するものにおいて、上記認識対象音声を入力し音声特徴量を抽出する音声特徴量抽出手段と、音声の音響的な観測値系列の確率を求める音響モデルと、学習用テキストデータを言語的に類似した性質を持つように階層的に分割する木構造クラスタリングを行い、各木構造クラスタの学習用テキストデータを用いて生成された木構造クラスタ別言語モデルと、上記木構造クラスタ別言語モデルから、音声認識結果候補の単語列に対して生起確率の高い複数の言語モデルを選択する複数言語モデル選択手段と、上記複数言語モデル選択手段によって選択された複数の言語モデルを入力して混合言語モデルを生成する混合言語モデル生成手段と、上記混合言語モデル生成手段により生成された言語モデルと上記音響モデルを用いて、上記音声特徴量抽出手段が抽出した音声特徴量に対して照合を行い音声認識結果を出力する照合手段とを備えたものである。
【００３１】
この発明に係る音声認識装置は、複数言語モデル選択手段が、木構造クラスタ別言語モデルにおける最も下層の葉ノードのクラスタ別言語モデルから複数の言語モデルを選択するものである。
【００３２】
この発明に係る音声認識装置は、木構造クラスタ別言語モデルが、木構造の上位に位置する木構造クラスタ別言語モデルを用いて補間処理が行われた補間処理された木構造クラスタ別言語モデルであることを特徴とするものである。
【００３７】
この発明に係る音声認識方法は、認識対象音声を入力した音声認識を行い音声認識結果を出力するものにおいて、上記認識対象音声を入力し音声特徴量を抽出する第１のステップと、学習用テキストデータを言語的に類似した性質を持つように階層的に分割する木構造クラスタリングを行い、各木構造クラスタの学習用テキストデータを用いて生成された木構造クラスタ別言語モデルから、音声認識結果候補の単語列に対して生起確率が高い複数の言語モデルを選択する第２のステップと、上記第２のステップで選択された複数の言語モデルを入力して混合言語モデルを生成する第３のステップと、音声の音響的な観測値系列の確率を求める音響モデルと、上記第３のステップで生成された言語モデルを用いて、上記第１のステップで抽出した音声特徴量に対して照合を行い音声認識結果を出力する第４のステップとを備えたものである。
【００３８】
この発明に係る音声認識方法は、第２のステップで、木構造クラスタ別言語モデルにおける最も下層の葉ノードのクラスタ別言語モデルから複数の言語モデルを選択するものである。
【００４３】
この発明に係る音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体は、認識対象音声を入力して音声認識を行い音声認識結果を出力するものであって、上記認識対象音声を入力し音声特徴量を抽出する音声特徴量抽出手順と、学習用テキストデータを言語的に類似した性質を持つように階層的に分割する木構造クラスタリングを行い、各木構造クラスタの学習用テキストデータを用いて生成された木構造クラスタ別言語モデルから、音声認識結果候補の単語列に対して生起確率の高い複数の言語モデルを選択する複数言語モデル選択手順と、上記複数言語モデル選択手順によって選択された複数の言語モデルを入力して混合言語モデルを生成する混合言語モデル生成手順と、音声の音響的な観測値系列の確率を求める音響モデルと、上記混合言語モデル生成手順により生成された言語モデルを用いて、上記音声特徴量抽出手順により抽出された音声特徴量に対して照合を行い音声認識結果を出力する照合手順とを実現させるものである。
【００４４】
この発明に係る音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体は、複数言語モデル選択手順が、木構造クラスタ別言語モデルにおける最も下層の葉ノードのクラスタ別言語モデルから複数の言語モデルを選択するものである。
【００４５】
【発明の実施の形態】
以下、この発明の一形態を説明する。
実施の形態１．
図１はこの発明の実施の形態１による言語モデル生成装置の構成を示すブロック図である。図において、２００１は学習用テキストデータ木構造クラスタリング手段、２００２は木構造学習用テキストデータクラスタ、２００２−１〜２００２−Ｍは木構造クラスタ１〜Ｍの学習用テキストデータ、２００３は木構造クラスタ別言語モデル、２００３−１〜２００３−Ｍは木構造クラスタ１〜Ｍの言語モデルである。従来の言語モデル生成装置の構成を示す図１３と同一の機能ブロックについては、同一の符号を付し説明を省略する。
【００４６】
なお、言語モデルの学習用テキストデータ１００１は、音声認識の認識対象とする分野や場面・状況において用いられる単語や文を文字化したものである。例えば、アナウンサーが発声する政治のニュースを音声認識対象とした場合は、新聞の政治欄の記事や、政治の放送ニュースの発声内容を文字として書き起こしたテキストデータである。
【００４７】
次に動作について説明する。
図２はこの発明の実施の形態１による言語モデル生成装置における言語モデル生成方法を示すフローチャートである。学習用テキストデータ木構造クラスタリング手段２００１は、ステップＳＴ１０１において、学習用テキストデータ１００１を入力し、ステップＳＴ１０２において、クラスタリングの階層Ｉを０とし、ステップＳＴ１０３において、初めに学習用テキストデータ１００１の全てに対してクラスタリングを行う。この学習用テキストデータ１００１全てに対するクラスタリングを、階層０のクラスタリングとする。
【００４８】
ここで、クラスタリングとは、人手で２つ以上の分野に分けることや、文献５に示してあるｋ−ｍｅａｎｓアルゴリズムに類似した方法を用いて、学習用テキストデータを２つ以上の集合に分割することである。クラスタリングによって得られるクラスタに属する学習用テキストデータは、言語的に類似した性質を持つものとなる。
【００４９】
図３は学習用テキストデータ木構造クラスタリング手段２００１で行われる学習用テキストデータ木構造クラスタリングの説明図であり、文を単位として階層的にクラスタリングしている様子を示したものである。図３では、階層０の木構造クラスタ００のクラスタリングにより、学習用テキストデータ１００１全てを２つのクラスタに分割している。分割された学習用テキストデータの集合は、階層１の木構造クラスタ１０と木構造クラスタ１１となっている。
【００５０】
図２のステップＳＴ１０４において、学習用テキストデータ木構造クラスタリング手段２００１は、階層Ｉをインクリメントし、ステップＳＴ１０５において、学習用テキストデータ木構造クラスタリング手段２００１は、階層Ｉ−１（ここでは、Ｉ＝０）でクラスタリングされた各クラスタに属する学習用テキストデータに対してクラスタリングを行う。図３では、階層１のクラスタリングにより、階層１の木構造クラスタ１０から階層２の木構造クラスタ２０と木構造クラスタ２１を生成し、木構造クラスタ１１から階層２の木構造クラスタ２２と木構造クラスタ２３を生成している。
【００５１】
ステップＳＴ１０６において、クラスタ数が予め定めた数Ｍになったかを調べて、予め定めた数Ｍにならない場合には、ステップＳＴ１０４に戻り、階層Ｉをインクリメントし、ステップＳＴ１０５のクラスタリングの処理を繰り返す。以上の処理をクラスタ数が予め定めた数Ｍになるまで繰り返して、木構造クラスタ１の学習用テキストデータ２００２−１〜木構造クラスタＭの学習用テキストデータ２００２−Ｍを生成する。
【００５２】
予め定めたクラスタ数まで学習用テキストデータの木構造クラスタリングを行った後に、ステップＳＴ１０７において、言語モデル生成手段１００４は、クラスタリングされた木構造クラスタ別に、各クラスタに属する学習用テキストデータを用いて言語モデルの生成を行い、木構造クラスタ１の言語モデル２００３−１〜木構造クラスタＭの言語モデル２００３−Ｍで構成される木構造クラスタ別言語モデル２００３を生成する。
【００５３】
上記ステップＳＴ１０６において、階層的な学習用テキストデータのクラスタリングを、予め定めたクラスタ数Ｍになるまで繰り返す。ここでは、クラスタ数をクラスタリングの終了の基準にしているが、階層数を基準としても、クラスタ内の学習用テキストデータ数がある値以下であるならば、クラスタリングを終了するとしても良い。階層的なクラスタリングによって得られるクラスタは、階層が下になるほどクラスタに属する学習テキストデータの性質は分野や場面・状況の違いをよく表現している。
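The layer-by-layer procedure of steps ST101 to ST106 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `tree_cluster` and `split_in_two` are hypothetical, the split shown is a trivial placeholder for manual field labels or the k-means-like method of Reference 5, and the target count M is assumed to count every node of the tree, root included, as FIG. 4 suggests.

```python
def split_in_two(sentences):
    """Placeholder split (assumption): first half / second half.
    A real system would split by field labels or by the k-means-like method."""
    mid = len(sentences) // 2
    return [sentences[:mid], sentences[mid:]]

def tree_cluster(sentences, max_clusters, split=split_in_two):
    """Cluster layer by layer until the tree has max_clusters nodes.

    Each node records its parent, its layer I, and its learning text data.
    A parent node keeps all the data of its children.
    """
    nodes = [{"id": 0, "parent": None, "layer": 0, "data": list(sentences)}]
    current_layer = [nodes[0]]
    while len(nodes) < max_clusters:
        next_layer = []
        for parent in current_layer:
            if len(parent["data"]) < 2:   # too little data to split further
                continue
            for part in split(parent["data"]):
                child = {"id": len(nodes), "parent": parent["id"],
                         "layer": parent["layer"] + 1, "data": part}
                nodes.append(child)
                next_layer.append(child)
        if not next_layer:                # nothing left to split
            break
        current_layer = next_layer
    return nodes

# Eight toy "sentences" and M = 7 give the tree of FIG. 3: 1 + 2 + 4 nodes.
nodes = tree_cluster([f"s{i}" for i in range(8)], max_clusters=7)
layers = [n["layer"] for n in nodes]
```

The alternative stopping rules mentioned in the text (a maximum number of layers, or a minimum amount of data per cluster) would replace the `len(nodes) < max_clusters` test and the `len(parent["data"]) < 2` test respectively.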
【００５４】
図４は木構造クラスタ別の言語モデル生成の説明図である。図４では木構造のノードが学習用テキストデータの木構造クラスタを表しており、各木構造クラスタ毎にそこに属する学習用テキストデータを用いて言語モデルの生成を行う。木構造の親ノードの木構造クラスタは、子ノードの木構造クラスタに属する学習用テキストデータ全てを含むものとなっている。図４では、例えば、木構造クラスタ００に属する学習用テキストデータを用いて生成した言語モデルが、木構造クラスタ００の言語モデルＬＭ００，木構造クラスタ１０に属する学習用テキストデータを用いて生成した言語モデルが、木構造クラスタ１０の言語モデルＬＭ１０にそれぞれ対応している。
【００５５】
生成される言語モデルの性質は、下層の木構造クラスタの言語モデルへいくほど、分野や場面・状況の違いによる言語の性質の違いを、より表現した言語モデルとなる。また、上層の木構造クラスタの言語モデルは、分野や場面・状況の違いによる言語の性質の違いは細かく表していないが、複数の分野や場面・状況の言語特徴を含んでいるので、発声が複数の分野や場面・状況を含んでいる場合には、有効な言語モデルとなっている。さらに、上層の木構造クラスタの言語モデルは学習テキストデータが多いので、木構造クラスタと同数のクラスタ数に一度に分割した場合に比べてパラメータ推定精度が高い。
【００５６】
言語モデルの生成の具体的方法は、文献４の３章から５章に述べられている、Ｎ−ｇｒａｍモデル、隠れマルコフモデル、確率文脈自由文法等である。
【００５７】
また、この実施の形態１における言語モデル生成方法を言語モデル生成プログラムとして記録媒体に記録することもできる。この場合には、学習用テキストデータ木構造クラスタリング手段２００１と同様の処理を実現する学習用テキストデータ木構造クラスタリング手順と、言語モデル生成手段１００４と同様の処理を実現する言語モデル生成手順とから構成される言語モデル生成プログラムを記録媒体に記録する。
【００５８】
以上のように、この実施の形態１の言語モデル生成装置及び言語モデル生成方法によれば、学習用テキストデータを階層的に木構造クラスタリングし、各木構造クラスタに属する学習用テキストデータを用いて、木構造クラスタ別言語モデルを生成するので、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、認識率の高い言語モデルが生成できる効果が得られる。また、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルが存在するので、認識率の高い言語モデルが生成できる効果が得られる。
【００５９】
実施の形態２．
図５はこの発明の実施の形態２による言語モデル生成装置の構成を示すブロック図である。図において、３００１は言語モデル補間手段、３００２は補間処理された木構造クラスタ別言語モデル、３００２−１〜３００２−Ｍは補間処理された木構造クラスタ１〜Ｍの言語モデルである。実施の形態１の図１と同一の機能ブロックについては、同一の符号を付し説明を省略する。
【００６０】
次に動作について説明する。
図６はこの発明の実施の形態２による言語モデル生成装置における言語モデル生成方法を示すフローチャートである。ステップＳＴ２０１からステップＳＴ２０７までの処理は、実施の形態１の図２におけるステップＳＴ１０１からステップＳＴ１０７までの処理と同一である。
【００６１】
ステップＳＴ２０８において、言語モデル補間手段３００１は、言語モデル生成手段１００４によって生成された木構造クラスタ別言語モデルである木構造クラスタ１の言語モデル２００３−１〜木構造クラスタＭの言語モデル２００３−Ｍを入力し、補間処理された木構造クラスタ１の言語モデル３００２−１〜補間処理された木構造クラスタＭの言語モデル３００２−Ｍを生成する。このときの補間処理は、補間対象のクラスタ言語モデルが位置する木構造のノードの親ノードの木構造クラスタの言語モデルを用いて補間処理を行う。
【００６２】
図４の例では、木構造クラスタ２０の言語モデルＬＭ２０を補間する場合は、親ノードである木構造クラスタ１０の言語モデルＬＭ１０と、更に上層の親ノードである木構造クラスタ００の言語モデルＬＭ００とを用いて補間する。この補間処理において、例えば言語モデルがＮ−ｇｒａｍモデルである場合には、単語列ｗ_n+1-N ^n-1に続いてｗ_nが生起する確率がパラメータであり、次の（５）式によって求める。
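【数３】
$$P'_{s}(w_n \mid w_{n+1-N}^{n-1}) = \sum_{i \in \Omega} \alpha_i \, P_i(w_n \mid w_{n+1-N}^{n-1}) \qquad (5)$$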
【００６３】
上記（５）式において、Ｐ’_s（ｗ_n｜ｗ_n+1-N ^n-1）は補間処理された木構造クラスタＳの言語モデルにおける単語列ｗ_n+1-N ^n-1に続いてｗ_nが生起する確率、Ωは木構造クラスタＳとその親ノードのクラスタ番号の集合、Ｐ_i（ｗ_n｜ｗ_n+1-N ^n-1）は木構造クラスタｉの言語モデルにおける単語列ｗ_n+1-N ^n-1に続いてｗ_nが生起する確率、α_iは重み係数である。このα_iは、例えば、文献４の３章に述べられている削除補間法によって推定可能である。
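The interpolation of equation (5) can be illustrated with a small Python sketch. It is a sketch under assumptions rather than the patented implementation: the function name `interpolate`, the dictionary layout, the toy probabilities, and the hand-picked weights are all hypothetical. In practice the weights α_i would be estimated, for example by the deleted interpolation method the text mentions, and they are assumed here to sum to 1.

```python
def interpolate(node_id, parents, models, alphas, context, word):
    """P'_s(word | context) = sum over i in Omega of alpha_i * P_i(word | context).

    Omega is the node itself plus all of its ancestors up to the root.
    """
    total = 0.0
    current = node_id
    while current is not None:
        p_i = models[current].get((context, word), 0.0)  # 0 if unseen in cluster i
        total += alphas[current] * p_i
        current = parents[current]
    return total

# Path 20 -> 10 -> 00 from FIG. 4, with hypothetical bigram probabilities.
parents = {"00": None, "10": "00", "20": "10"}
models = {
    "00": {("は", "駅"): 0.10, ("は", "家"): 0.10},
    "10": {("は", "駅"): 0.20, ("は", "家"): 0.05},
    "20": {("は", "駅"): 0.50},                      # "は 家" unseen in the leaf
}
alphas = {"20": 0.6, "10": 0.3, "00": 0.1}           # hypothetical weights

p_seen = interpolate("20", parents, models, alphas, "は", "駅")
p_unseen = interpolate("20", parents, models, alphas, "は", "家")
```

The second call shows the intended effect: a bigram missing from the small leaf cluster still receives a non-zero probability from the larger parent clusters.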
【００６４】
この説明では、Ｐ_i（ｗ_n｜ｗ_n+1-N ^n-1）は補間する前の生起確率としたが、木構造の上層から補間し、補間処理された生起確率Ｐ’_i（ｗ_n｜ｗ_n+1-N ^n-1）を用いても良い。木構造クラスタでは、下層のクラスタは学習用テキストデータが少量であるので、言語モデル生成において、ゼロ頻度問題やスパースネスの問題が生じやすいが、このように、学習用テキストデータ数が多い親ノードのクラスタの言語モデルを用いて、パラメータすなわち単語列ｗ_n+1-N ^n-1に続いてｗ_nが生起する確率の補間処理を行うので、言語モデル推定精度が高くなる。
【００６５】
また、実施の形態２における言語モデル生成方法を言語モデル生成プログラムとして記録媒体に記録することもできる。この場合には、学習用テキストデータ木構造クラスタリング手段２００１と同様の処理を実現する学習用テキストデータ木構造クラスタリング手順と、言語モデル生成手段１００４と同様の処理を実現する言語モデル生成手順と、言語モデル補間手段３００１と同様の処理を実現する言語モデル補間手順とから構成される言語モデル生成プログラムを記録媒体に記録する。
【００６６】
以上のように、この実施の形態２の言語モデル生成装置及び言語モデル生成方法によれば、学習用テキストデータを階層的に木構造クラスタリングし、各木構造クラスタに属する学習用テキストデータを用いて木構造クラスタ別言語モデルを生成し、生成されたクラスタ言語モデルを木構造の親ノードのクラスタ言語モデルを用いて補間するので、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、さらに認識率の高い言語モデルを生成できるという効果が得られる。
【００６７】
また、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルが存在するので、認識率の高い言語モデルが生成できるという効果が得られる。
【００６８】
実施の形態３．
図７はこの発明の実施の形態３による音声認識装置の構成を示すブロック図である。図において、実施の形態１の図１，及び従来の音声認識装置の図１４と同一の機能ブロックについては、同一の符号を付し説明を省略する。
【００６９】
次に動作について説明する。
図８はこの発明の実施の形態３による音声認識装置における音声認識方法を示すフローチャートである。音声特徴量抽出手段１１０２は、ステップＳＴ３０１において認識対象音声１１０１を入力し、ステップＳＴ３０２において音声特徴量を抽出する。ここで、音声特徴量とは少ない情報量で音声の特徴を表すものであり、例えば、文献１の５章で述べているようなケプストラム、ケプストラムの動的特徴で構成する特徴ベクトルである。
【００７０】
ステップＳＴ３０３において、言語モデル選択手段１１０４は、照合手段１１０５で用いる言語モデルを、木構造クラスタ別言語モデル２００３の木構造クラスタ１の言語モデル２００３−１〜木構造クラスタＭの言語モデル２００３−Ｍから１つ選択する。言語モデルの選択は、例えば文献５に示されている方法を用い、最も生起確率が高い木構造クラスタの言語モデルを選択する。
【００７１】
ステップＳＴ３０４において、照合手段１１０５は、言語モデル選択手段１１０４によって選択された木構造クラスタ言語モデルと、音響モデル１１０３を入力して認識対象音声１１０１の音声特徴量に対して照合を行い、最も尤度（照合スコア）が高い単語列を音声認識結果１１０６として出力する。
【００７２】
この場合の照合処理を具体的に説明する。照合手段１１０５は、言語モデル選択手段１１０４によって選択された木構造クラスタ言語モデルが設定している認識対象の単語［Ｗ（１），Ｗ（２），．．．，Ｗ（ｗｎ）］（ｗｎは認識対象とする単語数）の発音表記を、認識ユニットラベル表記に変換し、このラベルにしたがって、音響モデル１１０３に格納されている音素ユニットのＨＭＭを連結し、認識対象単語の標準パターン［λ_W(1)，λ_W(2)，．．．，λ_W(wn)］を作成する。
【００７３】
そして、照合手段１１０５は、認識対象単語標準パターンと選択された木構造クラスタ言語モデルによって表される単語列の生起確率を用いて、音声特徴量分析手段１１０２の出力である音声特徴量に対して照合を行い、音声認識結果１１０６を出力する。音声認識結果１１０６は、認識対象音声に対して認識対象単語で最も尤度が高い単語の単語番号系列Ｒｎ＝［ｒ（１），ｒ（２），．．．，ｒ（ｍ）］を計算し、単語番号に対応する単語Ｒｗ＝［Ｗ（ｒ（１）），Ｗ（ｒ（２）），．．．，Ｗ（ｒ（ｍ））］を出力する。ここで、ｒ（ｉ）は音声認識結果の単語系列のｉ番目の単語の単語番号を示し、ｍは認識単語系列の単語数を示す。
【００７４】
以上は、選択対象の木構造クラスタ別言語モデルを、実施の形態１で生成した木構造クラスタ１の言語モデル２００３−１〜木構造クラスタＭの言語モデル２００３−Ｍとして説明したが、実施の形態２で生成した補間処理された木構造クラスタ１の言語モデル３００２−１〜補間処理された木構造クラスタＭの言語モデル３００２−Ｍとしても良い。
【００７５】
また、実施の形態３における音声認識方法を音声認識プログラムとして記録媒体に記録することもできる。この場合には、実施の形態１の言語モデル生成プログラムに加えて、音声特徴量抽出手段１１０２と同様の処理を実現する音声特徴量抽出手順と、言語モデル選択手段１１０４と同様の処理を実現する言語モデル選択手順と、照合手段１１０５と同様の処理を実現する照合手順を含む音声認識プログラムを記録媒体に記録する。
【００７６】
以上のように、この実施の形態３における音声認識装置及び音声認識方法によれば、学習用テキストデータ１００１を階層的に木構造クラスタリングし、各木構造クラスタに属する学習用テキストデータ２００２−１〜２００２−Ｍを用いて、木構造クラスタ別言語モデル２００３−１〜２００３−Ｍを生成するので、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、この木構造クラスタ別言語モデル２００３から言語モデルを選択して音声認識を行うので、認識精度が高い音声認識ができるという効果が得られる。
【００７７】
また、認識対象の音声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した木構造クラスタ言語モデルを選択し音声認識を行うので、認識性能が高い音声認識ができる効果が得られる。
【００７８】
実施の形態４．
図９はこの発明の実施の形態４による音声認識装置の構成を示すブロック図である。図において、５００１は複数言語モデル選択手段、５００２は混合言語モデル生成手段である。実施の形態３の図７と同一の機能ブロックについては、同一の符号を付し説明を省略する。
【００７９】
次に動作について説明する。
図１０はこの発明の実施の形態４による音声認識装置における音声認識方法を示すフローチャートである。ステップＳＴ４０１及びステップＳＴ４０２の処理は、実施の形態３における図８のステップＳＴ３０１及びステップＳＴ３０２の処理と同一である。
【００８０】
ステップＳＴ４０３において、複数言語モデル選択手段５００１は、木構造クラスタ１の言語モデル２００３−１〜木構造クラスタＭの言語モデル２００３−Ｍから２つ以上（Ｋ個以下）の木構造クラスタの言語モデルを選択する。言語モデルの選択は、例えば文献５に示されている方法を拡張し、生起確率が高い順からＫ個の言語モデルを選択する方法を用いる。
【００８１】
ステップＳＴ４０４において、混合言語モデル生成手段５００２は、複数言語モデル選択手段５００１によって選択された複数の木構造クラスタ言語モデルを入力し、１つの混合言語モデルを生成する。混合モデルは、例えばＮ−ｇｒａｍモデルであるならば、次の（６）式によって生起確率を計算する。
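【数４】
$$P_{m}(w_n \mid w_{n+1-N}^{n-1}) = \sum_{i \in \Psi} \beta_i \, P_i(w_n \mid w_{n+1-N}^{n-1}) \qquad (6)$$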
【００８２】
上記（６）式において、Ｐ_m（ｗ_n｜ｗ_n+1-N ^n-1）は混合言語モデルの生起確率であり、Ψは複数言語モデル選択手段５００１によって選択された木構造クラスタ言語モデルの番号の集合、Ｐ_i（ｗ_n｜ｗ_n+1-N ^n-1）は選択された言語モデルの生起確率であり、β_iは重み係数である。ここでβ_iについては、例えば文献５に示されている言語モデル選択時の生起確率にしたがって、生起確率が高い言語モデルはβ_iが大きくなるように設定する。
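Steps ST403 and ST404 (selecting K models and mixing them by equation (6)) can be sketched in Python as follows. The text only requires that a model with a higher occurrence probability gets a larger β_i. Making β_i proportional to that probability and normalizing to 1 is an assumption of this sketch, as are the function names and the toy numbers.

```python
import math

def top_k_models(scores, k):
    """Ids of the K models with the highest log probability for the candidate."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def mixture_weights(scores, selected):
    """beta_i proportional to each selected model's probability, summing to 1."""
    best = max(scores[i] for i in selected)
    unnormalized = {i: math.exp(scores[i] - best) for i in selected}
    z = sum(unnormalized.values())
    return {i: v / z for i, v in unnormalized.items()}

def mixture_prob(models, betas, context, word):
    """P_m(word | context) = sum over i in Psi of beta_i * P_i(word | context)."""
    return sum(beta * models[i].get((context, word), 0.0)
               for i, beta in betas.items())

# Hypothetical log probabilities of a recognition candidate under three models.
scores = {"20": math.log(0.004), "21": math.log(0.001), "22": math.log(0.00001)}
models = {
    "20": {("は", "駅"): 0.5},
    "21": {("は", "駅"): 0.1},
    "22": {("は", "駅"): 0.01},
}

selected = top_k_models(scores, k=2)
betas = mixture_weights(scores, selected)
p_mix = mixture_prob(models, betas, "は", "駅")
```

With these numbers the two best models receive weights 0.8 and 0.2, so the mixed probability is 0.8 × 0.5 + 0.2 × 0.1 = 0.42.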
【００８３】
ステップＳＴ４０５において、照合手段１１０５は、混合言語モデル生成手段５００２によって生成された混合言語モデルと、音響モデル１１０３を入力し、認識対象音声１１０１の音声特徴量に対して照合を行い、最も尤度が高い単語列を音声認識結果１１０６として出力する。
【００８４】
以上は、選択対象の木構造クラスタ言語モデルを、実施の形態１で生成した木構造クラスタ１の言語モデル２００３−１〜木構造クラスタＭの言語モデル２００３−Ｍとして説明したが、実施の形態２で生成した補間処理された木構造クラスタ１の言語モデル３００２−１〜補間処理された木構造クラスタＭの言語モデル３００２−Ｍとしても良い。
【００８５】
また、実施の形態４における音声認識方法を音声認識プログラムとして記録媒体に記録することもできる。この場合には、実施の形態１の言語モデル生成プログラムに加えて、音声特徴量抽出手段１１０２と同様の処理を実現する音声特徴量抽出手順と、照合手段１１０５と同様の処理を実現する照合手順と、複数言語モデル選択手段５００１と同様の処理を実現する複数言語モデル選択手順と、混合言語モデル生成手段５００２と同様の処理を実現する混合言語モデル生成手順とを含む音声認識プログラムを記録媒体に記録する。
【００８６】
以上のように、この実施の形態４における音声認識装置及び音声認識方法によれば、学習用テキストデータ１００１を階層的に木構造クラスタリングし、各木構造クラスタの学習用テキストデータ２００２−１〜２００２−Ｍを用いて、木構造クラスタ別言語モデル２００３−１〜２００３−Ｍを生成し、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、この木構造クラスタ別言語モデル２００３から複数選択した木構造クラスタ言語モデルによって混合言語モデルを生成して、音声認識に用いるので、さらに認識精度が高い音声認識ができるという効果が得られる。
【００８７】
また、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルを選択し混合言語モデルを生成して音声認識に用いるので、認識性能が高い音声認識ができる効果が得られる。
【００８８】
実施の形態５．
図１１はこの発明の実施の形態５による音声認識装置の構成を示すブロック図である。図において、６００１は葉ノードのクラスタ別言語モデル、６００１−１〜６００１−Ｌは葉ノードクラスタ１〜Ｌの言語モデルである。実施の形態３の図７と同一の機能ブロックについては、同一の符号を付し説明を省略する。
【００８９】
次に動作について説明する。
図１２はこの発明の実施の形態５による音声認識装置における音声認識方法を示すフローチャートである。ステップＳＴ５０１及びステップＳＴ５０２の処理は、実施の形態３における図８のステップＳＴ３０１及びステップＳＴ３０２の処理と同一である。
【００９０】
ステップＳＴ５０３において、言語モデル選択手段１１０４は、木構造クラスタの葉ノードクラスタの言語モデルから、照合手段１１０５で用いる言語モデルを、葉ノードクラスタ１の言語モデル６００１−１〜葉ノードクラスタＬの言語モデル６００１−Ｌから１つ選択する。ここで、葉ノードクラスタの言語モデルとは、木構造の最も下層の木構造クラスタの言語モデルである。図４の例では、木構造クラスタ２０の言語モデルＬＭ２０，木構造クラスタ２１の言語モデルＬＭ２１，木構造クラスタ２２の言語モデルＬＭ２２，木構造クラスタ２３の言語モデルＬＭ２３が葉ノードクラスタの言語モデルに相当する。
【００９１】
このような葉ノードクラスタの言語モデルは、分野や場面・状況の違いによる言語の性質の違いを詳細に表現するモデルとなっているので、分野や場面・状況が明確に分かれるような認識対象の音声である場合は有効である。また、全ての木構造クラスタ別の言語モデルを用いる場合に比べて、選択対象のクラスタ言語モデルの数が少ないので、省メモリー、演算量削減の効果がある。葉ノードクラスタの言語モデルの選択は、例えば文献５に示されている方法を用い、最も生起確率が高い葉ノードクラスタの言語モデルを選択する。
【００９２】
ステップＳＴ５０４において、照合手段１１０５は、言語モデル選択手段１１０４によって選択された葉ノードクラスタの言語モデルと、音響モデル１１０３を入力して、認識対象音声１１０１の音声特徴量に対して照合を行い、最も尤度が高い単語列を音声認識結果１１０６として出力する。
【００９３】
以上は、選択対象の葉ノードクラスタの言語モデルを、実施の形態１で生成した木構造クラスタ別言語モデル２００３の葉ノードクラスタの言語モデルとしたが、実施の形態２で生成した補間処理された木構造クラスタ別言語モデル３００２の葉ノードクラスタの言語モデルとしても良い。また、言語モデル選択手段１１０４を複数言語モデル選択手段５００１とし、後段に混合言語モデル生成手段５００２を接続し、混合言語モデルを用いて照合処理を行っても良い。
【００９４】
また、実施の形態５における音声認識方法を音声認識プログラムとして記録媒体に記録することもできる。この場合には、実施の形態１の言語モデル生成プログラムに加えて、音声特徴量抽出手段１１０２と同様の処理を実現する音声特徴量抽出手順と、言語モデル選択手段１１０４と同様の処理を実現する言語モデル選択手順と、照合手段１１０５と同様の処理を実現する照合手順を含む音声認識プログラムを記録媒体に記録する。
【００９５】
以上のように、この実施の形態５における音声認識装置及び音声認識方法によれば、学習用テキストデータ１００１を階層的に木構造クラスタリングし、各木構造クラスタの学習用テキストデータ２００２−１〜２００２−Ｍを用いて、木構造クラスタ言語モデル２００３を生成するので、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、この木構造クラスタ言語モデル２００３の葉ノードクラスタの言語モデル６００１から選択した言語モデルを音声認識に用いるので、認識精度が高い音声認識ができると共に、言語モデルのメモリ容量を削減でき、言語モデルを選択する際の演算量を削減できるという効果が得られる。
【００９６】
また、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の葉ノードクラスタの言語モデルを選択し混合言語モデルを生成すれば、複数の分野や場面・状況の言語特徴を学習した言語モデルを音声認識に用いることになるので、認識性能が高い音声認識ができる効果が得られる。
【０１０１】
【発明の効果】
この発明によれば、音声認識装置が、音声特徴量抽出手段と、音響モデルと、学習用テキストデータを言語的に類似した性質を持つように階層的に分割する木構造クラスタリングを行い、各木構造クラスタの学習用テキストデータを用いて生成された木構造クラスタ別言語モデルと、木構造クラスタ別言語モデルから、音声認識結果候補の単語列に対して生起確率の高い複数の言語モデルを選択する複数言語モデル選択手段と、選択された複数の言語モデルを入力して混合言語モデルを生成する混合言語モデル生成手段と、生成された言語モデルと音響モデルを用いて、音声特徴量抽出手段が抽出した音声特徴量に対して照合を行い音声認識結果を出力する照合手段とを備えたことにより、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、木構造クラスタ別言語モデルから複数選択した木構造クラスタ言語モデルによって混合言語モデルを生成して、音声認識に用いるので、さらに認識精度が高い音声認識ができると共に、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルを選択し混合言語モデルを生成して音声認識に用いるので、認識性能が高い音声認識ができる効果がある。
【０１０２】
この発明によれば、音声認識装置の複数言語モデル選択手段が、木構造クラスタ別言語モデルにおける最も下層の葉ノードのクラスタ別言語モデルから複数の言語モデルを選択することにより、言語モデルのメモリ容量を削減でき、言語モデルを選択する際の演算量を削減できるという効果がある。
【０１０３】
この発明によれば、音声認識装置の木構造クラスタ別言語モデルが、木構造の上位に位置する木構造クラスタ別言語モデルを用いて補間処理が行われた補間処理された木構造クラスタ別言語モデルであることにより、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、さらに認識率の高い言語モデルを生成できると共に、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルが存在するので、認識率の高い言語モデルが生成できるという効果がある。
【０１０８】
この発明によれば、音声認識方法として、音声特徴量を抽出する第１のステップと、学習用テキストデータを言語的に類似した性質を持つように階層的に分割する木構造クラスタリングを行い、各木構造クラスタの学習用テキストデータを用いて生成された木構造クラスタ別言語モデルから、音声認識結果候補の単語列に対して生起確率が高い複数の言語モデルを選択する第２のステップと、選択された複数の言語モデルを入力して混合言語モデルを生成する第３のステップと、音響モデルと生成された言語モデルを用いて、抽出した音声特徴量に対して照合を行い音声認識結果を出力する第４のステップとを備えたことにより、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、木構造クラスタ別言語モデルから複数選択した木構造クラスタ言語モデルによって混合言語モデルを生成して、音声認識に用いるので、さらに認識精度が高い音声認識ができると共に、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルを選択し混合言語モデルを生成して音声認識に用いるので、認識性能が高い音声認識ができる効果がある。
【０１０９】
この発明によれば、音声認識方法の第２のステップで、木構造クラスタ別言語モデルにおける最も下層の葉ノードのクラスタ別言語モデルから複数の言語モデルを選択することにより、言語モデルを選択する際の演算量を削減できるという効果がある。
【０１１４】
この発明によれば、音声認識プログラムを記録した記録媒体で、音声特徴量を抽出する音声特徴量抽出手順と、学習用テキストデータを言語的に類似した性質を持つように階層的に分割する木構造クラスタリングを行い、各木構造クラスタの学習用テキストデータを用いて生成された木構造クラスタ別言語モデルから、音声認識結果候補の単語列に対して生起確率の高い複数の言語モデルを選択する複数言語モデル選択手順と、選択された複数の言語モデルを入力して混合言語モデルを生成する混合言語モデル生成手順と、音響モデルと、生成された言語モデルを用いて、抽出された音声特徴量に対して照合を行い音声認識結果を出力する照合手順とを実現させることにより、学習用テキストデータが少量であることによって生じる言語モデルのゼロ頻度問題やスパースネスの問題を軽減でき、木構造クラスタ別言語モデルから複数選択した木構造クラスタ言語モデルによって混合言語モデルを生成して、音声認識に用いるので、さらに認識精度が高い音声認識ができると共に、認識対象の１発声が複数の分野や場面・状況を含む場合であっても、複数の分野や場面・状況の言語特徴を学習した言語モデルを選択し混合言語モデルを生成して音声認識に用いるので、認識性能が高い音声認識ができる効果がある。
【０１１５】
この発明によれば、音声認識プログラムの複数言語モデル選択手順が、木構造クラスタ別言語モデルにおける最も下層の葉ノードのクラスタ別言語モデルから複数の言語モデルを選択することにより、言語モデルを選択する際の演算量を削減できるという効果が得られる。
【図面の簡単な説明】
【図１】この発明の実施の形態１による言語モデル生成装置の構成を示すブロック図である。
【図２】この発明の実施の形態１による言語モデル生成装置における言語モデル生成方法を示すフローチャートである。
【図３】この発明の実施の形態１による学習用テキストデータ木構造クラスタリングの説明図である。
【図４】この発明の実施の形態１による木構造クラスタ別の言語モデル生成の説明図である。
【図５】この発明の実施の形態２による言語モデル生成装置の構成を示すブロック図である。
【図６】この発明の実施の形態２による言語モデル生成装置における言語モデル生成方法を示すフローチャートである。
【図７】この発明の実施の形態３による音声認識装置の構成を示すブロック図である。
【図８】この発明の実施の形態３による音声認識装置における音声認識方法を示すフローチャートである。
【図９】この発明の実施の形態４による音声認識装置の構成を示すブロック図である。
【図１０】この発明の実施の形態４による音声認識装置における音声認識方法を示すフローチャートである。
【図１１】この発明の実施の形態５による音声認識装置の構成を示すブロック図である。
【図１２】この発明の実施の形態５による音声認識装置における音声認識方法を示すフローチャートである。
【図１３】従来の言語モデル生成装置の構成を示すブロック図である。
【図１４】従来の音声認識装置の構成を示すブロック図である。
【符号の説明】
１００１学習用テキストデータ、１００４言語モデル生成手段、１１０１認識対象音声、１１０２音声特徴量抽出手段、１１０３音響モデル、１１０４言語モデル選択手段、１１０５照合手段、１１０６音声認識結果、２００１学習用テキストデータ木構造クラスタリング手段、２００２木構造学習用テキストデータクラスタ、２００２−１木構造クラスタ１の学習用テキストデータ、２００２−２木構造クラスタ２の学習用テキストデータ、２００２−Ｍ木構造クラスタＭの学習用テキストデータ、２００３木構造クラスタ別言語モデル、２００３−１木構造クラスタ１の言語モデル、２００３−２木構造クラスタ２の言語モデル、２００３−Ｍ木構造クラスタＭの言語モデル、３００１言語モデル補間手段、３００２補間処理された木構造クラスタ別言語モデル、３００２−１補間処理された木構造クラスタ１の言語モデル、３００２−２補間処理された木構造クラスタ２の言語モデル、３００２−Ｍ補間処理された木構造クラスタＭの言語モデル、５００１複数言語モデル選択手段、５００２混合言語モデル生成手段、６００１葉ノードのクラスタ別言語モデル、６００１−１葉ノードクラスタ１の言語モデル、６００１−２葉ノードクラスタ２の言語モデル、６００１−Ｌ葉ノードクラスタＬの言語モデル。
[0001]
[Technical field of the invention]
The present invention relates to a language model generation apparatus that generates the language model referred to in speech recognition and a speech recognition apparatus using it, to a language model generation method and a speech recognition method using it, and to a computer-readable recording medium on which a language model generation program is recorded and a computer-readable recording medium on which a speech recognition program is recorded.
[0002]
[Prior art]
In recent years, continuous speech recognition, which lets a speaker input words in succession, has been actively studied for practical use. Continuous speech recognition decodes the acoustic observation sequence of speech so that the decoded word string has the maximum posterior probability. This is expressed by the following equation (1).
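[Expression 1]
$$\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O) = \operatorname*{arg\,max}_{W} P(O \mid W)\,P(W) \qquad (1)$$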
Here, O is the acoustic observation sequence of the speech, [o_1, o_2, o_3, ..., o_T], and W is a word string, [w_1, w_2, w_3, ..., w_n]. P(O | W) is the probability of the observation sequence O given the word string W and is calculated by the acoustic model. P(W) is the occurrence probability (appearance probability) of the word string W and is calculated by the language model.
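As a rough illustration of this decision rule, the following Python sketch picks, from a fixed list of candidate word strings, the one that maximizes log P(O | W) + log P(W). The function name `decode`, the candidate strings, and all score values are hypothetical. A real recognizer searches a much larger hypothesis space rather than scoring a short list.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the word string W that maximizes log P(O|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Two competing hypotheses with illustrative (made-up) probabilities.
candidates = [("recognize", "speech"), ("wreck", "a", "nice", "beach")]
acoustic_logprob = {
    candidates[0]: math.log(0.020),   # P(O | W) from the acoustic model
    candidates[1]: math.log(0.025),
}
lm_logprob = {
    candidates[0]: math.log(0.010),   # P(W) from the language model
    candidates[1]: math.log(0.001),
}

best = decode(candidates, acoustic_logprob, lm_logprob)
```

Here the second hypothesis has the slightly better acoustic score, but the language model makes the first one win, which is the role P(W) plays in equation (1).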
[0003]
Speech recognition is described in detail in "Speech Information Processing" (音声情報処理) by Sadaoki Furui, published by Morikita Publishing Co., Ltd. (hereinafter Reference 1); "Speech Recognition by Probabilistic Models" (確率モデルによる音声認識) by Seiichi Nakagawa, published by the Institute of Electronics, Information and Communication Engineers (hereinafter Reference 2); and "Fundamentals of Speech Recognition (Vols. 1 and 2)" (音声認識の基礎（上、下）) by Lawrence Rabiner and Biing-Hwang Juang, translation supervised by Sadaoki Furui, published by NTT Advanced Technology Co., Ltd. (hereinafter Reference 3).
[0004]
For P(O | W), which is calculated by the acoustic model, the hidden Markov model (HMM), a statistical method, has recently been studied actively. Acoustic models using HMMs are described in detail in, for example, Chapter 6 of Reference 3.
[0005]
P(W), which is calculated by the language model, is often obtained by a statistical method, a representative example being the N-gram model (N is 2 or more). These are described in detail in Chapter 3 of "Probabilistic Language Models" (確率的言語モデル) by Kenji Kita, published by the University of Tokyo Press (hereinafter Reference 4). The N-gram model statistically gives the transition probability from the preceding (N-1) words to the next word. The occurrence probability of the word string w_1^L = w_1 ... w_L under the N-gram model is given by the following equation (2).
[Expression 2]
$$P(w_1^L) = \prod_{t=1}^{L} P(w_t \mid w_{t+1-N}^{t-1}) \qquad (2)$$
[0006]
In equation (2), the probability P(w_t | w_{t+1-N}^{t-1}) is the probability that the word w_t occurs after the word string w_{t+1-N}^{t-1} consisting of (N-1) words, and Π denotes a product. For example, the occurrence probability of the word string "私・は・駅・へ・行く" ("I go to the station"; ・ marks a word boundary) under a 2-gram (bigram) model is given by the following equation (3), where # is a symbol representing the beginning and end of a sentence.
P(私・は・駅・へ・行く) =
P(私 | #) P(は | 私) P(駅 | は) P(へ | 駅) P(行く | へ) P(# | 行く)   (3)
[0007]
The probability P(w_t | w_{t+1-N}^{t-1}) is obtained from the relative frequency of word strings in the learning text data. Let C(W) be the frequency with which the word string W appears in the learning text data. Then, for example, the 2-gram probability P(は | 私) of "私・は" is calculated by the following equation (4), where C(私・は) is the frequency of the word string "私・は" and C(私) is the frequency of "私".
P(は | 私) = C(私・は) / C(私)   (4)
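Equations (3) and (4) can be tried out with a short Python sketch. It is only a minimal illustration: the function names and the two-sentence toy corpus are hypothetical, and C(prev) is counted as the number of times `prev` occurs as a context (including the sentence-start symbol), which is an implementation choice of this sketch.

```python
from collections import Counter

BOUNDARY = "#"  # sentence start/end symbol, as in equation (3)

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with the boundary symbol."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [BOUNDARY] + list(words) + [BOUNDARY]
        unigrams.update(padded[:-1])                   # every token used as a context
        bigrams.update(zip(padded[:-1], padded[1:]))   # adjacent word pairs
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Relative-frequency estimate P(word | prev) = C(prev, word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(unigrams, bigrams, words):
    """Product of bigram probabilities over the padded sentence, as in equation (3)."""
    padded = [BOUNDARY] + list(words) + [BOUNDARY]
    p = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        p *= bigram_prob(unigrams, bigrams, prev, word)
    return p

corpus = [
    ["私", "は", "駅", "へ", "行く"],
    ["私", "は", "家", "へ", "帰る"],
]
unigrams, bigrams = train_bigram(corpus)
p_wa = bigram_prob(unigrams, bigrams, "私", "は")
p_sentence = sentence_prob(unigrams, bigrams, corpus[0])
p_unseen = sentence_prob(unigrams, bigrams, ["駅", "へ", "行く"])
```

In this toy corpus "私" is always followed by "は", so P(は | 私) is 1.0, while a sentence that starts with "駅" gets probability 0 because the pair (#, 駅) never occurs.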
[0008]
However, if the probabilities of the N-gram model are estimated simply by relative frequency, any word sequence that does not appear in the learning text data is assigned a probability of 0, which is a serious drawback (the zero-frequency problem). Moreover, even for a word string that does appear in the learning text data, it is difficult to estimate a statistically reliable probability if its frequency is low (the sparseness problem). These problems are usually handled by a technique called smoothing. Several smoothing methods are described in Section 3.3 of Reference 4, so a detailed description is omitted here.
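The text leaves the choice of smoothing method to Reference 4. Purely as an illustration of the idea, the sketch below uses add-k (additive) smoothing, which is one of the simplest methods and is not necessarily the one used in the cited works. The function name, the toy counts, and the vocabulary are hypothetical.

```python
from collections import Counter

def add_k_bigram_prob(unigrams, bigrams, vocab_size, prev, word, k=1.0):
    """Add-k smoothed estimate: (C(prev, word) + k) / (C(prev) + k * V)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

# Toy counts: the context "私" was seen twice, both times followed by "は".
unigrams = Counter({"私": 2})
bigrams = Counter({("私", "は"): 2})
vocab = ["私", "は", "駅"]

seen = add_k_bigram_prob(unigrams, bigrams, len(vocab), "私", "は")
unseen = add_k_bigram_prob(unigrams, bigrams, len(vocab), "私", "駅")
total = sum(add_k_bigram_prob(unigrams, bigrams, len(vocab), "私", w) for w in vocab)
```

The unseen pair now gets a small non-zero probability (0.2 instead of 0), at the cost of lowering the seen pair from 1.0 to 0.6, and the distribution over the vocabulary still sums to 1.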
[0009]
A language model is trained on text from the field, scene, or situation targeted by speech recognition. In real applications, however, the speech to be recognized often spans many fields, scenes, and situations. Because the occurrence probability of a word string differs between fields, scenes, and situations, a language model trained on all the learning text data at once, ignoring those differences, is not accurate.
[0010]
To improve the performance of speech recognition apparatuses that handle such a variety of fields, scenes, and situations, methods have been studied that cluster the learning text data of the language model and create a language model for each resulting cluster. One example of the prior art is "Statistical language model generation device and speech recognition device," Japanese Patent Application Laid-Open No. 2000-75886 (hereinafter Reference 5). Here, clusters can be obtained either by dividing the data by field, for example cluster 1 for politics and cluster 2 for sports, or by defining a distance between sentences and clustering the sentences.
[0011]
When the learning text data is divided into clusters, the amount of learning text data per cluster decreases, so the zero-frequency and sparseness problems described above become more severe. To address this, Reference 5 obtains an accurate language model LM_map^k by maximum a posteriori (MAP) estimation from the language model LM_a, estimated from all the learning text data without division into clusters, and the cluster-specific language model LM_c^k (k is the cluster number), estimated from the learning text data of each cluster.
[0012]
FIG. 13 is a block diagram showing the configuration of the conventional language model generation apparatus described in Reference 5. In the figure, 1001 denotes the learning text data for the language model, 1002 a learning text data clustering means, 1003 the learning text data clusters, 1003-1 to 1003-M the learning text data of clusters 1 to M, 1004 a language model generation means, 1005 the cluster-specific language models, and 1005-1 to 1005-M the language models of clusters 1 to M.
[0013]
Next, the operation will be described.
The learning text data 1001 is text data for training the language model and consists of the words and sentences to be recognized by the speech recognition apparatus, written out as text. The learning text data 1001 is input to the learning text data clustering means 1002.
[0014]
The learning text data clustering means 1002 clusters the learning text data 1001. In Reference 5, the text is clustered sentence by sentence using a method similar to the k-means method. It differs from the ordinary k-means method in that (1) the center of each cluster is the language model generated from the sentences belonging to that cluster, and (2) the generation probability of a sentence is used as the distance measure. An N-gram model is used as the language model.
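The two differences from ordinary k-means can be made concrete with the following Python sketch. It is a loose illustration, not the procedure of Reference 5: the function names, the round-robin initialization, the add-k smoothing (needed so that no sentence gets probability 0), and the toy sentences are all assumptions.

```python
import math
from collections import Counter

def train(sentences, vocab_size, k=0.5):
    """Add-k smoothed bigram model for one cluster; this plays the role of the centroid."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        padded = ["#"] + s + ["#"]
        uni.update(padded[:-1])
        bi.update(zip(padded[:-1], padded[1:]))
    return lambda prev, w: (bi[(prev, w)] + k) / (uni[prev] + k * vocab_size)

def log_prob(model, s):
    """Log generation probability of sentence s; this plays the role of the distance."""
    padded = ["#"] + s + ["#"]
    return sum(math.log(model(p, w)) for p, w in zip(padded[:-1], padded[1:]))

def cluster(sentences, num_clusters, iterations=10):
    vocab_size = len({w for s in sentences for w in s}) + 1   # +1 for "#"
    assign = [i % num_clusters for i in range(len(sentences))]  # round-robin start
    for _ in range(iterations):
        # Re-estimate one language model per cluster from its current members.
        models = [train([s for s, a in zip(sentences, assign) if a == c], vocab_size)
                  for c in range(num_clusters)]
        # Move each sentence to the cluster whose model generates it best.
        new_assign = [max(range(num_clusters), key=lambda c: log_prob(models[c], s))
                      for s in sentences]
        if new_assign == assign:
            break
        assign = new_assign
    return assign

sentences = [
    ["首相", "が", "会見"], ["投手", "が", "登板"],
    ["首相", "が", "会見"], ["投手", "が", "登板"],
]
assign = cluster(sentences, num_clusters=2)
```

On this toy input the politics-like and sports-like sentences end up in different clusters. With real data the result depends on the initialization, as with ordinary k-means.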
[0015]
The learning text data clusters 1003 consist of the learning text data 1003-1 of cluster 1 through the learning text data 1003-M of cluster M, obtained by the learning text data clustering means 1002 dividing the data into M clusters.
[0016]
The language model generation means 1004 receives the learning text data 1003-1 of cluster 1 through the learning text data 1003-M of cluster M obtained by the learning text data clustering means 1002, and generates the cluster-specific language models 1005, which consist of the language model 1005-1 of cluster 1 through the language model 1005-M of cluster M. To prevent the estimation accuracy of the language models from falling as the amount of learning text data per cluster decreases, the language model generation means 1004 estimates the cluster-specific language model LM_map^k by MAP estimation from the language model LM_a, estimated from all the learning text data without division into clusters, and the cluster-specific language model LM_c^k, estimated from the learning text data of each cluster.
[0017]
Next, a conventional speech recognition apparatus using the above language model generation apparatus is described. FIG. 14 is a block diagram showing the configuration of the conventional speech recognition apparatus disclosed in Reference 5. In the figure, 1101 denotes the speech to be recognized, 1102 a speech feature extraction means, 1103 an acoustic model, 1104 a language model selection means, 1105 a matching means, and 1106 a speech recognition result. The cluster-specific language models 1005 are the same functional block as in FIG. 13; they are given the same reference numeral and their description is omitted.
[0018]
Next, the operation will be described.
The recognition target speech 1101 is the speech to be recognized and is input to the speech feature extraction means 1102. The speech feature extraction means 1102 extracts the speech features contained in the recognition target speech 1101. The acoustic model 1103 is a model for acoustic matching of speech. The acoustic model 1103 uses, for example, HMMs trained on sentences and words uttered by many speakers, with phonemes that take the preceding and following phoneme context into account as the recognition units.
[0019]
The language model selection means 1104 selects the language model to be used by the collation means 1105 from the cluster-specific language model 1005, which consists of the language model 1005-1 of cluster 1 through the language model 1005-M of cluster M generated by the language model generation device. In Reference 5, collation is first performed using an unspecified language model, that is, the one estimated before division into clusters, and the cluster language model that gives the highest occurrence probability for the resulting recognition-candidate word string is selected from the language model 1005-1 of cluster 1 through the language model 1005-M of cluster M.
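The selection rule just described can be sketched as follows. This is only an illustration under stated assumptions: the cluster models are toy bigram tables, and `table_model`, `log_occurrence_prob`, `select_language_model`, and the probability floor for unseen bigrams are hypothetical names and choices, not taken from Reference 5.

```python
import math

def table_model(table):
    """Wrap a {(prev, cur): probability} table as a bigram model (illustrative only)."""
    return lambda prev, cur: table.get((prev, cur), 0.0)

def log_occurrence_prob(words, model, floor=1e-6):
    """Log occurrence probability of a word string; unseen bigrams get a small floor."""
    return sum(math.log(max(model(p, c), floor)) for p, c in zip(words, words[1:]))

def select_language_model(candidate, cluster_models):
    """Name of the cluster model that gives the candidate the highest probability."""
    return max(cluster_models,
               key=lambda name: log_occurrence_prob(candidate, cluster_models[name]))

cluster_models = {
    "cluster_1": table_model({("the", "vote"): 0.6, ("vote", "passed"): 0.5}),
    "cluster_2": table_model({("the", "vote"): 0.1, ("vote", "passed"): 0.1}),
}

# Word string obtained from the first pass with the unspecified language model.
candidate = ["the", "vote", "passed"]
selected = select_language_model(candidate, cluster_models)
```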
[0020]
The collation means 1105 converts the phonetic notation of the recognition target words [W(1), W(2), ..., W(wn)] (where wn is the number of recognition target words) set by the language model selected by the language model selection means 1104 into recognition-unit label notation, concatenates the phoneme-unit HMMs stored in the acoustic model 1103 according to these labels, and creates the standard patterns [λ_{W(1)}, λ_{W(2)}, ..., λ_{W(wn)}] of the recognition target words.
[0021]
Then, the collation means 1105 collates the speech feature amount output by the speech feature amount extraction unit 1102, using the standard patterns of the recognition target words and the occurrence probability of the word string given by the selected language model, and outputs the speech recognition result 1106. The speech recognition result 1106 is output as the word number sequence Rn = [r(1), r(2), ..., r(m)] of the recognition target words for the recognition target speech 1101, and as the corresponding word sequence Rw = [W(r(1)), W(r(2)), ..., W(r(m))]. Here, r(i) is the word number of the i-th word in the word sequence of the speech recognition result 1106, and m is the number of words in the recognized word sequence.
[0022]
[Problems to be solved by the invention]
Since the conventional language model generation device is configured as described above, increasing the number of clusters produced by clustering reduces the amount of learning text data per cluster and degrades the estimation accuracy of the language model, so that the recognition accuracy does not improve.
[0023]
Further, when the number of clusters to be divided increases, there is a problem that the recognition rate does not increase when one utterance has the language characteristics of a plurality of clusters.
[0024]
The present invention has been made in order to solve the above-described problems, and its object is to obtain a language model generation device, a language model generation method, and a computer-readable recording medium recording a language model generation program, each capable of generating a language model with high estimation accuracy.
[0025]
Another object of the present invention is to obtain a speech recognition apparatus, a speech recognition method, and a computer-readable recording medium recording a speech recognition program, each achieving high speech recognition accuracy by using a language model with high estimation accuracy.
[0030]
[Means for Solving the Problems]
The speech recognition apparatus according to the present invention receives recognition target speech, performs speech recognition, and outputs a speech recognition result. It includes: a speech feature amount extraction unit that receives the recognition target speech and extracts a speech feature amount; an acoustic model that determines the probability of the acoustic observation value sequence of speech; tree-structure cluster-specific language models generated by performing tree-structure clustering, which hierarchically divides the learning text data so that each cluster has linguistically similar properties, and by using the learning text data of each tree-structure cluster; multi-language model selection means for selecting, from the tree-structure cluster-specific language models, a plurality of language models having a high occurrence probability for the word string of a speech recognition result candidate; mixed language model generation means for receiving the plurality of language models selected by the multi-language model selection means and generating a mixed language model; and a collation unit that collates the speech feature amount extracted by the speech feature amount extraction unit, using the mixed language model generated by the mixed language model generation means and the acoustic model, and outputs the speech recognition result.
[0031]
In the speech recognition apparatus according to the present invention, the multi-language model selection means selects a plurality of language models from the cluster-specific language model of the lowermost leaf node in the tree-structured cluster-specific language model.
[0032]
In the speech recognition apparatus according to the present invention, the tree-structure cluster-specific language models are interpolated language models, each obtained by interpolating the language model of a tree-structure cluster using the language models of the tree-structure clusters positioned above it in the tree structure.
[0037]
The speech recognition method according to the present invention receives recognition target speech, performs speech recognition, and outputs a speech recognition result. It includes: a first step of receiving the recognition target speech and extracting a speech feature amount; a second step of selecting, from tree-structure cluster-specific language models generated by performing tree-structure clustering, which hierarchically divides the learning text data so that each cluster has linguistically similar properties, and by using the learning text data of each tree-structure cluster, a plurality of language models having a high occurrence probability for the word string of a speech recognition result candidate; a third step of receiving the plurality of language models selected in the second step and generating a mixed language model; and a fourth step of collating the speech feature amount extracted in the first step, using an acoustic model that determines the probability of the acoustic observation value sequence of speech and the mixed language model generated in the third step, and outputting the speech recognition result.
[0038]
In the speech recognition method according to the present invention, in the second step, a plurality of language models are selected from the cluster-specific language models of the lowermost leaf nodes in the tree-structured cluster-specific language model.
[0043]
A computer-readable recording medium according to the present invention records a speech recognition program that receives recognition target speech, performs speech recognition, and outputs a speech recognition result. The program includes: a speech feature extraction procedure that receives the recognition target speech and extracts a speech feature; a multi-language model selection procedure for selecting, from tree-structure cluster-specific language models generated by performing tree-structure clustering, which hierarchically divides the learning text data so that each cluster has linguistically similar properties, and by using the learning text data of each tree-structure cluster, a plurality of language models having a high occurrence probability for the word string of a speech recognition result candidate; a mixed language model generation procedure for receiving the plurality of language models selected by the multi-language model selection procedure and generating a mixed language model; and a collation procedure for collating the speech feature extracted by the speech feature extraction procedure, using an acoustic model that determines the probability of the acoustic observation value sequence of speech and the mixed language model generated by the mixed language model generation procedure, and outputting the speech recognition result.
[0044]
In the computer-readable recording medium on which the speech recognition program according to the present invention is recorded, the multi-language model selection procedure selects a plurality of language models from the cluster-specific language models of the lowest-level leaf nodes among the tree-structure cluster-specific language models.
[0045]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the present invention will be described.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration of a language model generation apparatus according to Embodiment 1 of the present invention. In the figure, 2001 is a learning text data tree structure clustering means, 2002 is a tree-structure learning text data cluster, 2002-1 to 2002-M are the learning text data of tree-structure clusters 1 to M, 2003 is a tree-structure cluster-specific language model, and 2003-1 to 2003-M are the language models of tree-structure clusters 1 to M. The same functional blocks as those in FIG. 13, which shows the configuration of the conventional language model generation apparatus, are denoted by the same reference numerals and their description is omitted.
[0046]
Note that the language model learning text data 1001 is text data in which the words or sentences used in the field, scene, or situation targeted by speech recognition have been written out as text. For example, when political news read by an announcer is the speech recognition target, it is text data such as articles from the politics section of a newspaper or transcripts of political broadcast news.
[0047]
Next, the operation will be described.
FIG. 2 is a flowchart showing a language model generation method in the language model generation apparatus according to Embodiment 1 of the present invention. The learning text data tree structure clustering means 2001 receives the learning text data 1001 in step ST101, sets the clustering hierarchy I to 0 in step ST102, and in step ST103 first performs clustering on all of the learning text data 1001. This clustering of the whole of the learning text data 1001 is defined as the clustering of hierarchy 0.
[0048]
Here, clustering means dividing the learning text data into two or more sets, for example by manually dividing it into two or more fields or by using a method similar to the k-means algorithm shown in Reference 5. The learning text data belonging to a cluster obtained by clustering has linguistically similar properties.
[0049]
FIG. 3 is an explanatory diagram of the learning text data tree structure clustering performed by the learning text data tree structure clustering means 2001, and shows hierarchical clustering in units of sentences. In FIG. 3, the entire learning text data 1001 forms the tree-structure cluster 00 of hierarchy 0, and clustering it divides the data into two clusters. The resulting sets of learning text data are the tree-structure cluster 10 and the tree-structure cluster 11 of hierarchy 1.
[0050]
In step ST104 of FIG. 2, the learning text data tree structure clustering means 2001 increments the hierarchy I, and in step ST105 it performs clustering on the learning text data belonging to each cluster obtained by the clustering of hierarchy I-1 (hierarchy 0 in this first pass). In FIG. 3, the tree-structure clusters 20 and 21 of hierarchy 2 are generated from the tree-structure cluster 10 of hierarchy 1, and the tree-structure clusters 22 and 23 of hierarchy 2 are generated from the tree-structure cluster 11.
[0051]
In step ST106, it is checked whether or not the number of clusters has reached a predetermined number M. If not, the process returns to step ST104, increments the hierarchy I, and repeats the clustering process in step ST105. The above processing is repeated until the number of clusters reaches a predetermined number M to generate learning text data 2002-1 for the tree-structure cluster 1 to learning text data 2002-M for the tree-structure cluster M.
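The loop of steps ST101 to ST106 can be sketched as below. This is a minimal sketch under two assumptions that the text does not fix: the number M is taken to count the clusters of all hierarchy levels together, and the actual division (manual or k-means-like) is replaced by a hypothetical placeholder, `split_in_half`, that splits a cluster by position only.

```python
def split_in_half(cluster):
    """Placeholder for the real division of a cluster; splits by position only."""
    if len(cluster) < 2:
        return [cluster]
    mid = len(cluster) // 2
    return [cluster[:mid], cluster[mid:]]

def tree_cluster(sentences, split, max_clusters):
    """Hierarchically split the learning text until max_clusters clusters exist.

    Returns a list of levels; levels[i] holds the clusters of hierarchy i.
    """
    levels = [[list(sentences)]]        # hierarchy 0 holds all learning text (ST101 to ST103)
    total = 1
    while total < max_clusters:         # ST106: stop once M clusters exist
        # ST104 and ST105: move down one level and split every cluster above it
        next_level = [part for cluster in levels[-1] for part in split(cluster)]
        if len(next_level) == len(levels[-1]):
            break                       # no cluster could be divided further
        levels.append(next_level)
        total += len(next_level)
    return levels

sentences = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]
levels = tree_cluster(sentences, split_in_half, max_clusters=7)
sizes = [len(level) for level in levels]
```

For the eight placeholder sentences this produces the shape of FIG. 3: one cluster at hierarchy 0, two at hierarchy 1, and four at hierarchy 2.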
[0052]
After the tree-structure clustering of the learning text data has been carried out up to the predetermined number of clusters, in step ST107 the language model generation means 1004 generates a language model for each tree-structure cluster using the learning text data belonging to that cluster, thereby generating the tree-structure cluster-specific language model 2003, which consists of the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M.
[0053]
In step ST106, the hierarchical clustering of the learning text data is repeated until the predetermined number of clusters M is reached. Here, the number of clusters is used as the criterion for ending clustering. However, the number of hierarchy levels may be used as the criterion instead, or clustering may be ended when the amount of learning text data in a cluster falls to or below a certain value. In the clusters obtained by hierarchical clustering, the learning text data belonging to a cluster expresses the differences in field, scene, and situation more distinctly the lower the cluster is in the hierarchy.
[0054]
FIG. 4 is an explanatory diagram of language model generation for each tree-structure cluster. In FIG. 4, the nodes of the tree structure represent the tree-structure clusters of the learning text data, and a language model is generated using the learning text data belonging to each tree-structure cluster. The tree-structure cluster of a parent node includes all the learning text data belonging to the tree-structure clusters of its child nodes. In FIG. 4, for example, the language model generated using the learning text data belonging to the tree-structure cluster 00 is the language model LM00 of the tree-structure cluster 00, and the language model generated using the learning text data belonging to the tree-structure cluster 10 is the language model LM10 of the tree-structure cluster 10.
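The parent-child relationship of FIG. 4 can be illustrated with a short sketch in which each node's statistics are collected from the text of all leaves below it. The node names follow FIG. 4; the toy sentences, the restriction to bigram counts, and the helper names `node_text` and `bigram_counts` are assumptions made for illustration.

```python
from collections import Counter

# Tree of FIG. 4: cluster 00 at the top, clusters 20 to 23 at the bottom.
tree = {"00": ["10", "11"], "10": ["20", "21"], "11": ["22", "23"]}
leaf_text = {
    "20": [["the", "vote", "passed"]],
    "21": [["the", "bill", "passed"]],
    "22": [["the", "game", "ended"]],
    "23": [["the", "match", "ended"]],
}

def node_text(node):
    """Learning text of a node: its own text for a leaf, else all text of its children."""
    if node not in tree:
        return leaf_text[node]
    return [sentence for child in tree[node] for sentence in node_text(child)]

def bigram_counts(sentences):
    """Bigram counts, the raw statistics of a bigram language model."""
    counts = Counter()
    for words in sentences:
        counts.update(zip(words, words[1:]))
    return counts

nodes = ["00", "10", "11", "20", "21", "22", "23"]
models = {node: bigram_counts(node_text(node)) for node in nodes}
```

Because a parent cluster contains all text of its children, the counts behind LM00 equal the sum of the counts behind LM10 and LM11, which is why the upper-level models are estimated from more data.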
[0055]
The generated language models express the differences in language properties due to differences in field, scene, and situation more distinctly the lower the tree-structure cluster is in the tree. Conversely, the language model of an upper-level tree-structure cluster does not express those differences, but it contains the language features of multiple fields, scenes, and situations, so it is effective when the speech to be recognized includes multiple fields, scenes, and situations. Furthermore, because an upper-level tree-structure cluster has a large amount of learning text data, the parameter estimation accuracy of its language model is higher than that obtained when the data is divided at once into the same number of clusters as the tree-structure clusters.
[0056]
Specific methods for generating the language model include the N-gram model, the hidden Markov model, and the probabilistic context-free grammar described in Chapters 3 to 5 of Reference 4.
[0057]
The language model generation method according to the first embodiment can also be recorded on a recording medium as a language model generation program. In this case, a language model generation program composed of a learning text data tree structure clustering procedure that realizes the same processing as the learning text data tree structure clustering means 2001 and a language model generation procedure that realizes the same processing as the language model generation means 1004 is recorded on the recording medium.
[0058]
As described above, according to the language model generation device and the language model generation method of the first embodiment, the learning text data is hierarchically clustered into a tree structure and a tree-structure cluster-specific language model is generated using the learning text data belonging to each tree-structure cluster. This reduces the zero-frequency and sparseness problems caused by a small amount of learning text data, so a language model giving a high recognition rate can be generated. In addition, even if one utterance to be recognized includes multiple fields, scenes, and situations, a language model that has learned the language features of multiple fields, scenes, and situations can be generated.
[0059]
Embodiment 2.
FIG. 5 is a block diagram showing a configuration of a language model generation apparatus according to Embodiment 2 of the present invention. In the figure, 3001 is a language model interpolation means, 3002 is a language model for each tree structure cluster subjected to interpolation processing, and 3002-1 to 3002-M are language models of tree structure clusters 1 to M subjected to interpolation processing. The same functional blocks as those in FIG. 1 of the first embodiment are denoted by the same reference numerals and description thereof is omitted.
[0060]
Next, the operation will be described.
FIG. 6 is a flowchart showing a language model generation method in the language model generation apparatus according to Embodiment 2 of the present invention. The processing from step ST201 to step ST207 is the same as the processing from step ST101 to step ST107 in FIG. 2 of the first embodiment.
[0061]
In step ST208, the language model interpolation unit 3001 interpolates the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M, which are the tree-structure cluster-specific language models generated by the language model generation unit 1004, and generates the interpolated language model 3002-1 of tree-structure cluster 1 through the interpolated language model 3002-M of tree-structure cluster M. The interpolation uses the language model of the tree-structure cluster at the parent node of the node where the cluster language model to be interpolated is located.
[0062]
In the example of FIG. 4, the language model LM20 of the tree-structure cluster 20 is interpolated using the language model LM10 of the tree-structure cluster 10, which is its parent node, and the language model LM00 of the tree-structure cluster 00, which is the parent node one level further up. When the language model is an N-gram model, for example, the parameter being interpolated is the occurrence probability of $w_n$ following the word string $w_{n+1-N}^{n-1}$, and it is obtained by the following equation (5).
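[Equation 3]
$$P'_S(w_n \mid w_{n+1-N}^{n-1}) = \sum_{i \in \Omega} \alpha_i \, P_i(w_n \mid w_{n+1-N}^{n-1}) \tag{5}$$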
[0063]
In equation (5) above, $P'_S(w_n \mid w_{n+1-N}^{n-1})$ is the occurrence probability of $w_n$ following the word string $w_{n+1-N}^{n-1}$ in the interpolated language model of tree-structure cluster S; $\Omega$ is the set of cluster numbers of tree-structure cluster S and its parent nodes; $P_i(w_n \mid w_{n+1-N}^{n-1})$ is the occurrence probability of $w_n$ following the word string $w_{n+1-N}^{n-1}$ in the language model of tree-structure cluster i; and $\alpha_i$ is a weighting coefficient. The coefficients $\alpha_i$ can be estimated, for example, by the deleted interpolation method described in Chapter 3 of Reference 4.
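As an illustration of equation (5), the following sketch interpolates the model of cluster 20 with those of its parent nodes 10 and 00 for a bigram (N = 2). The probability tables and the weights in `alpha` are made-up values fixed by hand; estimating them by deleted interpolation is not shown, and the function and variable names are hypothetical.

```python
def table_model(table):
    """Wrap a {(prev, cur): probability} table as a bigram model (illustrative only)."""
    return lambda prev, cur: table.get((prev, cur), 0.0)

# Parent links of FIG. 4 for the branch 20 -> 10 -> 00.
parent = {"20": "10", "10": "00"}

models = {
    "20": table_model({}),                        # bigram unseen in the small leaf cluster
    "10": table_model({("the", "vote"): 0.4}),
    "00": table_model({("the", "vote"): 0.3}),
}

# Hand-picked weights for the set Omega of cluster 20; they sum to 1.
alpha = {"20": 0.5, "10": 0.3, "00": 0.2}

def interpolated_prob(prev, cur, cluster):
    """Weighted sum over the cluster and its parent nodes, as in equation (5)."""
    total = 0.0
    node = cluster
    while node is not None:
        total += alpha[node] * models[node](prev, cur)
        node = parent.get(node)    # None once the top of the tree is passed
    return total

p = interpolated_prob("the", "vote", "20")
```

Here the bigram has probability zero in the leaf model alone, while the interpolated value is non-zero because the parent models contribute to it.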
[0064]
In this description, $P_i(w_n \mid w_{n+1-N}^{n-1})$ is the occurrence probability before interpolation, but the interpolated occurrence probability $P'_i(w_n \mid w_{n+1-N}^{n-1})$, obtained by interpolating in order from the upper layers of the tree structure, may be used instead. In the tree-structure clusters, a lower-level cluster has less learning text data, so the zero-frequency and sparseness problems are likely to occur when its language model is generated. Because the parameters, that is, the occurrence probabilities of $w_n$ following the word string $w_{n+1-N}^{n-1}$, are interpolated using the language models of the upper-level clusters, the estimation accuracy of the language model increases.
[0065]
Moreover, the language model generation method in Embodiment 2 can also be recorded on a recording medium as a language model generation program. In this case, a language model generation program composed of a learning text data tree structure clustering procedure that realizes the same processing as the learning text data tree structure clustering means 2001, a language model generation procedure that realizes the same processing as the language model generation means 1004, and a language model interpolation procedure that realizes the same processing as the language model interpolation unit 3001 is recorded on the recording medium.
[0066]
As described above, according to the language model generation device and the language model generation method of the second embodiment, the learning text data is hierarchically clustered into a tree structure, a tree-structure cluster-specific language model is generated using the learning text data belonging to each tree-structure cluster, and each generated cluster language model is interpolated using the cluster language model of its parent node in the tree structure. This reduces the zero-frequency and sparseness problems of the language model caused by a small amount of learning text data, so a language model giving a higher recognition rate can be generated.
[0067]
In addition, even if one utterance to be recognized includes multiple fields, scenes, and situations, a language model that has learned the language features of multiple fields, scenes, and situations can be generated.
[0068]
Embodiment 3.
FIG. 7 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 3 of the present invention. In the figure, the same functional blocks as those in FIG. 1 of the first embodiment and FIG. 14 of the conventional speech recognition apparatus are denoted by the same reference numerals and description thereof is omitted.
[0069]
Next, the operation will be described.
FIG. 8 is a flowchart showing a speech recognition method in the speech recognition apparatus according to Embodiment 3 of the present invention. The speech feature amount extraction unit 1102 receives the recognition target speech 1101 in step ST301, and extracts the speech feature amount in step ST302. Here, the speech feature amount represents the characteristics of the speech with a small amount of information, and is, for example, a feature vector composed of the cepstrum and the dynamic features of the cepstrum, as described in Chapter 5 of Reference 1.
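A feature vector made of the cepstrum and its dynamic features can be sketched as follows. The text does not give a formula, so this sketch assumes the widely used regression (delta) formula over a window of neighbouring frames, with edge frames repeated at the boundaries; the function name `add_dynamic_features` and the window size are assumptions, and the cepstrum computation itself is not shown.

```python
def add_dynamic_features(frames, window=2):
    """Append regression-based dynamic (delta) coefficients to each cepstral frame."""
    norm = 2 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t, frame in enumerate(frames):
        delta = []
        for d in range(len(frame)):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, last)][d]    # repeat the last frame at the end
                behind = frames[max(t - k, 0)][d]      # repeat the first frame at the start
                acc += k * (ahead - behind)
            delta.append(acc / norm)
        out.append(list(frame) + delta)
    return out

# One-dimensional toy "cepstrum" that rises by 1.0 per frame.
cepstra = [[0.0], [1.0], [2.0], [3.0], [4.0]]
features = add_dynamic_features(cepstra)
```

In the middle of the ramp the dynamic coefficient equals the slope of the static coefficient, which is the information the dynamic features are meant to carry.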
[0070]
In step ST303, the language model selection unit 1104 selects one language model to be used by the collation unit 1105 from the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M in the tree-structure cluster-specific language model 2003. The selection uses, for example, the method shown in Reference 5, and the language model of the tree-structure cluster with the highest occurrence probability is selected.
[0071]
In step ST304, the collation unit 1105 receives the tree-structure cluster language model selected by the language model selection unit 1104 and the acoustic model 1103, collates the speech feature amount of the recognition target speech 1101, and outputs the word string with the highest likelihood (matching score) as the speech recognition result 1106.
[0072]
The collation process in this case will be described specifically. The collation unit 1105 converts the recognition target words [W(1), W(2), ..., W(wn)] (where wn is the number of recognition target words) set by the tree-structure cluster language model selected by the language model selection unit 1104 into recognition-unit label notation, concatenates the phoneme-unit HMMs stored in the acoustic model 1103 according to these labels, and creates the standard patterns [λ_{W(1)}, λ_{W(2)}, ..., λ_{W(wn)}] of the recognition target words.
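The construction of a word standard pattern can be sketched as below. Everything concrete here is an assumption for illustration: the pronunciation lexicon, the `left-centre+right` label format for context-dependent phonemes, the silence symbol `sil` at word boundaries, and the use of plain strings as stand-ins for trained HMMs.

```python
def to_context_labels(phonemes):
    """Turn a phoneme sequence into labels that carry the left and right context."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{left}-{centre}+{right}"
            for left, centre, right in zip(padded, padded[1:], padded[2:])]

def build_standard_pattern(word, lexicon, phoneme_hmms):
    """Concatenate the phoneme-unit HMMs along the label sequence of one word."""
    return [phoneme_hmms[label] for label in to_context_labels(lexicon[word])]

# Hypothetical pronunciation lexicon and HMM inventory.
lexicon = {"hai": ["h", "a", "i"]}
labels = to_context_labels(lexicon["hai"])
phoneme_hmms = {label: "HMM(" + label + ")" for label in labels}

pattern = build_standard_pattern("hai", lexicon, phoneme_hmms)
```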
[0073]
Then, the collation unit 1105 collates the speech feature amount output by the speech feature amount extraction unit 1102, using the standard patterns of the recognition target words and the occurrence probability of the word string given by the selected tree-structure cluster language model, and outputs the speech recognition result 1106. The speech recognition result 1106 is output as the word number sequence Rn = [r(1), r(2), ..., r(m)] of the recognition target words for the recognition target speech 1101, and as the corresponding word sequence Rw = [W(r(1)), W(r(2)), ..., W(r(m))]. Here, r(i) is the word number of the i-th word in the word sequence of the speech recognition result, and m is the number of words in the recognized word sequence.
[0074]
In the above, the tree-structure cluster-specific language models to be selected have been described as the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M generated in the first embodiment, but the interpolated language model 3002-1 of tree-structure cluster 1 through the interpolated language model 3002-M of tree-structure cluster M generated in the second embodiment may be used instead.
[0075]
Further, the speech recognition method according to Embodiment 3 can be recorded on a recording medium as a speech recognition program. In this case, in addition to the language model generation program of the first embodiment, a speech feature amount extraction procedure that realizes processing similar to that of the speech feature amount extraction unit 1102 and processing similar to the language model selection unit 1104 are realized. A speech recognition program including a language model selection procedure and a collation procedure for realizing the same processing as the collation unit 1105 is recorded on a recording medium.
[0076]
As described above, according to the speech recognition apparatus and the speech recognition method of the third embodiment, the learning text data 1001 is hierarchically clustered into a tree structure, and the language models 2003-1 to 2003-M of the tree-structure clusters are generated using the learning text data 2002-1 to 2002-M belonging to the respective tree-structure clusters, so the zero-frequency and sparseness problems of the language model caused by a small amount of learning text data can be reduced. Because speech recognition is performed with a language model selected from the tree-structure cluster-specific language model 2003, speech recognition with high recognition accuracy is obtained.
[0077]
In addition, even if the speech to be recognized includes multiple fields, scenes, and situations, recognition is performed by selecting a tree-structure cluster language model that has learned the language features of multiple fields, scenes, and situations, so speech recognition with high recognition performance is obtained.
[0078]
Embodiment 4.
FIG. 9 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 4 of the present invention. In the figure, reference numeral 5001 denotes a multi-language model selection unit, and reference numeral 5002 denotes a mixed language model generation unit. The same functional blocks as those in FIG. 7 of the third embodiment are denoted by the same reference numerals, and description thereof is omitted.
[0079]
Next, the operation will be described.
FIG. 10 is a flowchart showing a speech recognition method in the speech recognition apparatus according to Embodiment 4 of the present invention. The processing of step ST401 and step ST402 is the same as the processing of step ST301 and step ST302 of FIG. 8 in the third embodiment.
[0080]
In step ST403, the multiple language model selection unit 5001 selects at least two and at most K tree-structure cluster language models from the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M. For the selection, for example, the method shown in Reference 5 is extended so that K language models are selected in descending order of occurrence probability.
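Selecting K language models in descending order of occurrence probability can be sketched as follows. The toy bigram tables, the probability floor for unseen bigrams, and the names `table_model`, `log_occurrence_prob`, and `select_top_k` are assumptions made for illustration, not part of the method of Reference 5.

```python
import heapq
import math

def table_model(table):
    """Wrap a {(prev, cur): probability} table as a bigram model (illustrative only)."""
    return lambda prev, cur: table.get((prev, cur), 0.0)

def log_occurrence_prob(words, model, floor=1e-6):
    """Log occurrence probability of a word string; unseen bigrams get a small floor."""
    return sum(math.log(max(model(p, c), floor)) for p, c in zip(words, words[1:]))

def select_top_k(candidate, cluster_models, k):
    """Return (name, log probability) for the k best-scoring cluster models."""
    scores = {name: log_occurrence_prob(candidate, model)
              for name, model in cluster_models.items()}
    best = heapq.nlargest(k, scores, key=scores.get)
    return [(name, scores[name]) for name in best]

cluster_models = {
    "10": table_model({("the", "vote"): 0.5, ("vote", "passed"): 0.4}),
    "20": table_model({("the", "vote"): 0.6, ("vote", "passed"): 0.5}),
    "22": table_model({("the", "vote"): 0.05, ("vote", "passed"): 0.1}),
}

candidate = ["the", "vote", "passed"]
selected = select_top_k(candidate, cluster_models, k=2)
selected_names = [name for name, _ in selected]
```

The scores are returned together with the names so that they can later be reused when weighting the selected models.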
[0081]
In step ST404, the mixed language model generation unit 5002 receives a plurality of tree structure cluster language models selected by the multiple language model selection unit 5001, and generates one mixed language model. If the mixed model is, for example, an N-gram model, the occurrence probability is calculated by the following equation (6).
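[Expression 4]
$$P_m(w_n \mid w_{n+1-N}^{n-1}) = \sum_{i \in \Psi} \beta_i \, P_i(w_n \mid w_{n+1-N}^{n-1}) \tag{6}$$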
[0082]
In equation (6) above, $P_m(w_n \mid w_{n+1-N}^{n-1})$ is the occurrence probability of the mixed language model; $\Psi$ is the set of numbers of the tree-structure cluster language models selected by the multi-language model selection means 5001; $P_i(w_n \mid w_{n+1-N}^{n-1})$ is the occurrence probability of the selected language model i; and $\beta_i$ is a weighting coefficient. The coefficients $\beta_i$ are set, for example, according to the occurrence probabilities obtained at the time of language model selection shown in Reference 5, so that a language model with a higher occurrence probability receives a larger $\beta_i$.
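The mixture of equation (6) can be sketched as below for a bigram. The text only requires that a model with a higher occurrence probability receive a larger weight; making the weights directly proportional to the occurrence probabilities is one possible choice assumed here, and all names and numbers are illustrative.

```python
import math

def table_model(table):
    """Wrap a {(prev, cur): probability} table as a bigram model (illustrative only)."""
    return lambda prev, cur: table.get((prev, cur), 0.0)

def mixture_weights(log_probs):
    """Weights proportional to the occurrence probabilities at selection time (assumed rule)."""
    top = max(log_probs.values())
    scaled = {name: math.exp(lp - top) for name, lp in log_probs.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

def mixed_prob(prev, cur, models, beta):
    """Weighted sum over the selected models, as in equation (6)."""
    return sum(beta[name] * models[name](prev, cur) for name in beta)

# Two selected cluster models and their occurrence probabilities for the candidate.
models = {
    "20": table_model({("the", "vote"): 0.4}),
    "22": table_model({("the", "vote"): 0.2}),
}
log_probs = {"20": math.log(0.03), "22": math.log(0.01)}

beta = mixture_weights(log_probs)
p = mixed_prob("the", "vote", models, beta)
```

Working with log probabilities and subtracting the maximum before exponentiating keeps the normalization stable when the occurrence probabilities of long word strings become very small.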
[0083]
In step ST405, the collation unit 1105 receives the mixed language model generated by the mixed language model generation unit 5002 and the acoustic model 1103, collates the speech feature amount of the recognition target speech 1101, and outputs the word string with the highest likelihood as the speech recognition result 1106.
[0084]
The tree-structure cluster language models to be selected have been described as the language model 2003-1 of tree-structure cluster 1 through the language model 2003-M of tree-structure cluster M generated in the first embodiment, but the interpolated language model 3002-1 of tree-structure cluster 1 through the interpolated language model 3002-M of tree-structure cluster M generated in the second embodiment may be used instead.
[0085]
In addition, the speech recognition method according to Embodiment 4 can be recorded on a recording medium as a speech recognition program. In this case, in addition to the language model generation program of the first embodiment, a speech recognition program composed of a speech feature amount extraction procedure that realizes the same processing as the speech feature amount extraction unit 1102, a collation procedure that realizes the same processing as the collation unit 1105, a multi-language model selection procedure that realizes the same processing as the multi-language model selection unit 5001, and a mixed language model generation procedure that realizes the same processing as the mixed language model generation unit 5002 is recorded on the recording medium.
[0086]
As described above, according to the speech recognition apparatus and speech recognition method of the fourth embodiment, the learning text data 1001 is hierarchically clustered into a tree structure, and the language models 2003-1 to 2003-M of the tree-structure clusters are generated using the learning text data 2002-1 to 2002-M of the respective tree-structure clusters, so the zero-frequency and sparseness problems of the language model caused by a small amount of learning text data can be reduced. Because a mixed language model is generated from the tree-structure cluster language models selected from the tree-structure cluster-specific language model 2003 and used for speech recognition, speech recognition with higher recognition accuracy is obtained.
[0087]
In addition, even when a single utterance to be recognized includes multiple fields, scenes, and situations, a language model that learns the language features of multiple fields, scenes, and situations is selected and a mixed language model is generated for speech recognition. Therefore, the effect of performing speech recognition with high recognition performance can be obtained.
[0088]
Embodiment 5.
FIG. 11 is a block diagram showing the structure of a speech recognition apparatus according to Embodiment 5 of the present invention. In the figure, reference numeral 6001 denotes a language model for each leaf node cluster, and reference numerals 6001-1 to 6001-L denote language models for the leaf node clusters 1 to L. The same functional blocks as those in FIG. 7 of the third embodiment are denoted by the same reference numerals, and description thereof is omitted.
[0089]
Next, the operation will be described.
FIG. 12 is a flowchart showing a speech recognition method in the speech recognition apparatus according to Embodiment 5 of the present invention. The processing in step ST501 and step ST502 is the same as the processing in step ST301 and step ST302 in FIG. 8 in the third embodiment.
[0090]
In step ST503, the language model selection unit 1104 selects the language model to be used in the collation unit 1105 from among the language models of the leaf node clusters of the tree structure clusters; that is, it selects one from the language model 6001-1 of leaf node cluster 1 through the language model 6001-L of leaf node cluster L. Here, a leaf node cluster language model is the language model of a tree structure cluster at the lowest level of the tree structure. In the example of FIG. 4, the language model LM20 of the tree structure cluster 20, the language model LM21 of the tree structure cluster 21, the language model LM22 of the tree structure cluster 22, and the language model LM23 of the tree structure cluster 23 correspond to the language models of the leaf node clusters.
[0091]
The language models of such leaf node clusters express in detail the differences in language characteristics that arise from differences in fields, scenes, and situations, and are effective when the recognition target speech is speech of a specific field, scene, or situation. In addition, since the number of cluster language models to choose from is smaller than when all of the tree structure cluster-specific language models are used, memory is saved and the amount of calculation is reduced. For example, the language model of the leaf node cluster having the highest occurrence probability is selected using the method shown in Reference 5.
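A minimal sketch of restricting the selection to leaf node clusters follows. Only the leaf cluster numbers 20 to 23 come from the FIG. 4 example; the upper cluster numbers (0, 10, 11), the toy unigram probabilities, the floor value, and the helper names are all assumptions made for illustration. The selection method of Reference 5 is not reproduced here; the sketch simply picks the leaf model that gives the candidate word string the highest occurrence probability.

```python
import math

# Hypothetical tree: cluster id -> parent cluster id (None for the root).
# Leaf ids 20-23 follow the FIG. 4 example; ids 0, 10, 11 are assumed.
PARENT = {0: None, 10: 0, 11: 0, 20: 10, 21: 10, 22: 11, 23: 11}

# Toy unigram language models for every cluster (illustrative values only).
LMS = {
    0:  {"a": 0.5, "b": 0.5},
    10: {"a": 0.7, "b": 0.3},
    11: {"a": 0.3, "b": 0.7},
    20: {"a": 0.9, "b": 0.1},
    21: {"a": 0.6, "b": 0.4},
    22: {"a": 0.4, "b": 0.6},
    23: {"a": 0.1, "b": 0.9},
}
FLOOR = 1e-6  # assumed floor probability for unseen words


def leaf_ids(parent):
    """Clusters that are not the parent of any other cluster, i.e. the lowest level."""
    parents = set(parent.values())
    return sorted(node for node in parent if node not in parents)


def log_prob(lm, words):
    """Log occurrence probability of a word string under a unigram model."""
    return sum(math.log(lm.get(w, FLOOR)) for w in words)


def select_leaf_model(parent, lms, candidate):
    """Pick the leaf cluster whose model gives the candidate the highest probability."""
    return max(leaf_ids(parent), key=lambda node: log_prob(lms[node], candidate))


best_leaf = select_leaf_model(PARENT, LMS, ["b", "b", "a"])
print(leaf_ids(PARENT), best_leaf)
```

Only four of the seven models are scored in this toy tree, which is the source of the reduction in calculation described above.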
[0092]
In step ST504, the collation unit 1105 receives the language model of the leaf node cluster selected by the language model selection unit 1104 and the acoustic model 1103, collates the speech feature amount of the recognition target speech 1101 against them, and outputs a word string having a high likelihood as the speech recognition result 1106.
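The collation step combines the acoustic model score P(O|W) and the language model score P(W) as in equation (1). The following sketch shows that combination as a rescoring of a fixed candidate list, which is a simplification: an actual collation unit searches over word strings rather than scoring a short list. The candidate strings, acoustic log-likelihoods, unigram probabilities, and floor value are all hypothetical.

```python
import math

# Hypothetical candidates: word string -> acoustic log-likelihood log P(O|W).
ACOUSTIC_LOGLIK = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
}

# Hypothetical selected language model (unigram, illustrative values only).
LM = {
    "recognize": 0.2, "speech": 0.3,
    "wreck": 0.01, "a": 0.3, "nice": 0.1, "beach": 0.09,
}
FLOOR = 1e-6  # assumed floor probability for unseen words


def lm_log_prob(lm, words):
    """log P(W) under a unigram language model."""
    return sum(math.log(lm.get(w, FLOOR)) for w in words)


def collate(acoustic_loglik, lm):
    """Return the word string maximizing log P(O|W) + log P(W)."""
    return max(
        acoustic_loglik,
        key=lambda words: acoustic_loglik[words] + lm_log_prob(lm, words),
    )


result = collate(ACOUSTIC_LOGLIK, LM)
print(result)
```

In this toy case the second candidate has the better acoustic score, but the language model score reverses the ranking.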
[0093]
In the above description, the language model of the leaf node cluster to be selected is taken from the tree structure cluster-specific language model 2003 generated in the first embodiment; however, it may instead be a leaf node cluster language model taken from the interpolated tree structure cluster-specific language model 3002 generated in the second embodiment. Alternatively, the language model selection unit 1104 may be replaced by the multiple language model selection unit 5001, with the mixed language model generation unit 5002 connected at the subsequent stage, so that collation is performed using the mixed language model.
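As a rough illustration of interpolating a cluster language model with the model of a cluster positioned above it in the tree, the sketch below uses linear interpolation with a fixed weight. This form and the weight value are assumptions chosen for illustration and are not necessarily the interpolation processing of the second embodiment; the unigram tables are toy data.

```python
def interpolate(child, parent, lam):
    """Linear interpolation: lam * P_child(w) + (1 - lam) * P_parent(w).

    The fixed weight lam is an assumed parameter for this sketch.
    """
    vocab = set(child) | set(parent)
    return {w: lam * child.get(w, 0.0) + (1.0 - lam) * parent.get(w, 0.0) for w in vocab}


# Toy data: the leaf cluster never saw "b", so it assigns it zero probability.
leaf_lm = {"a": 1.0}
upper_lm = {"a": 0.5, "b": 0.5}

smoothed = interpolate(leaf_lm, upper_lm, lam=0.8)
print(smoothed)
```

The word that had zero frequency in the leaf cluster now receives a nonzero probability borrowed from the upper cluster, which was trained on more text.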
[0094]
In addition, the speech recognition method according to Embodiment 5 can be recorded on a recording medium as a speech recognition program. In this case, in addition to the language model generation program of the first embodiment, a speech recognition program is recorded on the recording medium that includes a speech feature amount extraction procedure that realizes processing similar to that of the speech feature amount extraction unit 1102, a language model selection procedure that realizes processing similar to that of the language model selection unit 1104, and a collation procedure that realizes processing similar to that of the collation unit 1105.
[0095]
As described above, according to the speech recognition apparatus and speech recognition method of the fifth embodiment, the learning text data 1001 is hierarchically clustered into a tree structure, and the learning text data 2002-1 to 2002-M of the individual tree structure clusters are used to generate the tree structure cluster-specific language model 2003. This reduces the zero frequency problem and the sparseness problem of the language model that are caused by a small amount of learning text data. Furthermore, since a language model selected from the leaf node cluster-specific language model 6001 is used for speech recognition, speech recognition with high recognition accuracy can be performed, the memory capacity required for the language models can be reduced, and the amount of calculation when selecting the language model can be reduced.
[0096]
Even if one utterance to be recognized includes multiple fields, scenes, and situations, when a mixed language model is generated by selecting the language models of multiple leaf node clusters, language models that have learned the language features of those fields, scenes, and situations are used for speech recognition, so that speech recognition with high recognition performance can be performed.
[0101]
[Effects of the invention]
According to this invention, the speech recognition apparatus comprises: speech feature amount extraction means; an acoustic model; a tree structure cluster-specific language model generated by performing tree structure clustering that hierarchically divides the learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster; multiple language model selection means for selecting, from the tree structure cluster-specific language model, a plurality of language models having a high occurrence probability for the word string of a speech recognition result candidate; mixed language model generation means for generating a mixed language model from the plurality of selected language models; and collation means for collating the speech feature amount extracted by the speech feature amount extraction means using the generated language model and the acoustic model and outputting a speech recognition result. As a result, the zero frequency problem and sparseness problem of the language model caused by a small amount of learning text data can be reduced, and since a mixed language model is generated from the language models selected from the tree structure cluster-specific language model and is used for speech recognition, speech recognition with higher recognition accuracy can be achieved. Moreover, even when one utterance to be recognized includes multiple fields, scenes, and situations, language models that have learned the language features of those fields, scenes, and situations are selected and mixed for use in speech recognition, so that speech recognition with high recognition performance can be performed.
[0102]
According to this invention, the multiple language model selection means of the speech recognition apparatus selects a plurality of language models from the cluster-specific language models of the lowermost leaf nodes of the tree structure cluster-specific language model, so that the memory capacity of the language model and the amount of calculation when selecting language models can be reduced.
[0103]
According to this invention, the tree structure cluster-specific language model of the speech recognition apparatus has undergone interpolation processing using the tree structure cluster-specific language models positioned higher in the tree structure. As a result, the zero frequency problem and sparseness problem of the language model caused by a small amount of learning text data can be reduced, and a language model with a higher recognition rate can be generated. Moreover, even when an utterance to be recognized includes multiple fields, scenes, or situations, language models that have learned the language features of multiple fields, scenes, or situations are available, so that a language model with a high recognition rate can be generated.
[0108]
According to the present invention, the speech recognition method comprises: a first step of extracting a speech feature amount; a second step of selecting a plurality of language models having a high occurrence probability for the word string of a speech recognition result candidate from the tree structure cluster-specific language model, which is generated by performing tree structure clustering that hierarchically divides the learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster; a third step of generating a mixed language model from the plurality of selected language models; and a fourth step of collating the extracted speech feature amount using the acoustic model and the generated language model and outputting a speech recognition result. As a result, the zero frequency problem and sparseness problem of the language model caused by a small amount of learning text data can be reduced, and since a mixed language model is generated from the language models selected from the tree structure cluster-specific language model and is used for speech recognition, speech recognition with higher recognition accuracy can be achieved. Moreover, even when one utterance to be recognized includes multiple fields, scenes, and situations, language models that have learned the language features of those fields, scenes, and situations are selected and mixed for use in speech recognition, so that speech recognition with high recognition performance can be performed.
[0109]
According to the present invention, in the second step of the speech recognition method, a plurality of language models are selected from the cluster-specific language models of the lowermost leaf nodes of the tree structure cluster-specific language model, so that the amount of calculation when selecting language models can be reduced.
[0114]
According to the present invention, the recording medium on which the speech recognition program is recorded holds a program that realizes: a speech feature amount extraction procedure for extracting a speech feature amount; a multiple language model selection procedure for selecting a plurality of language models having a high occurrence probability for the word string of a speech recognition result candidate from the tree structure cluster-specific language model, which is generated by performing tree structure clustering that hierarchically divides the learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster; a mixed language model generation procedure for generating a mixed language model from the plurality of selected language models; and a collation procedure for collating the extracted speech feature amount using the acoustic model and the generated language model and outputting a speech recognition result. As a result, the zero frequency problem and sparseness problem of the language model caused by a small amount of learning text data can be reduced, and since a mixed language model is generated from the language models selected from the tree structure cluster-specific language model and is used for speech recognition, speech recognition with higher recognition accuracy can be achieved. Moreover, even when one utterance to be recognized includes multiple fields, scenes, and situations, language models that have learned the language features of those fields, scenes, and situations are selected and mixed for use in speech recognition, so that speech recognition with high recognition performance can be performed.
[0115]
According to this invention, the multiple language model selection procedure of the speech recognition program selects a plurality of language models from the cluster-specific language models of the lowermost leaf nodes of the tree structure cluster-specific language model, so that the amount of calculation when selecting language models can be reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a language model generation apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing a language model generation method in the language model generation apparatus according to Embodiment 1 of the present invention.
FIG. 3 is an explanatory diagram of learning text data tree structure clustering according to Embodiment 1 of the present invention.
FIG. 4 is an explanatory diagram of language model generation for each tree structure cluster according to Embodiment 1 of the present invention.
FIG. 5 is a block diagram showing a configuration of a language model generation apparatus according to Embodiment 2 of the present invention.
FIG. 6 is a flowchart showing a language model generation method in the language model generation apparatus according to Embodiment 2 of the present invention.
FIG. 7 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 8 is a flowchart showing a speech recognition method in a speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 9 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 10 is a flowchart showing a speech recognition method in a speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 11 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 12 is a flowchart showing a speech recognition method in a speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 13 is a block diagram illustrating a configuration of a conventional language model generation device.
FIG. 14 is a block diagram showing a configuration of a conventional speech recognition apparatus.
[Explanation of symbols]
1001 learning text data, 1004 language model generation means, 1101 recognition target speech, 1102 speech feature amount extraction means, 1103 acoustic model, 1104 language model selection means, 1105 collation means, 1106 speech recognition result, 2001 learning text data tree structure clustering means, 2002 tree structure learning text data cluster, 2002-1 learning text data of tree structure cluster 1, 2002-2 learning text data of tree structure cluster 2, 2002-M learning text data of tree structure cluster M, 2003 tree structure cluster-specific language model, 2003-1 language model of tree structure cluster 1, 2003-2 language model of tree structure cluster 2, 2003-M language model of tree structure cluster M, 3001 language model interpolation means, 3002 interpolated tree structure cluster-specific language model, 3002-1 interpolated language model of tree structure cluster 1, 3002-2 interpolated language model of tree structure cluster 2, 3002-M interpolated language model of tree structure cluster M, 5001 multiple language model selection means, 5002 mixed language model generation means, 6001 leaf node cluster-specific language model, 6001-1 language model of leaf node cluster 1, 6001-2 language model of leaf node cluster 2, 6001-L language model of leaf node cluster L.

Claims

A speech recognition apparatus that receives recognition target speech, performs speech recognition, and outputs a speech recognition result, comprising:
speech feature amount extraction means for receiving the recognition target speech and extracting a speech feature amount;
an acoustic model for determining the probability of an acoustic observation value sequence of speech;
a tree structure cluster-specific language model generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster;
multiple language model selection means for selecting, from the tree structure cluster-specific language model, a plurality of language models having a high occurrence probability for a word string of a speech recognition result candidate;
mixed language model generation means for generating a mixed language model from the plurality of language models selected by the multiple language model selection means; and
collation means for collating the speech feature amount extracted by the speech feature amount extraction means using the language model generated by the mixed language model generation means and the acoustic model, and outputting a speech recognition result.

The speech recognition apparatus according to claim 1, wherein the multiple language model selection means selects the plurality of language models from the cluster-specific language models of the lowermost leaf nodes of the tree structure cluster-specific language model.

The speech recognition apparatus according to claim 1, wherein the tree structure cluster-specific language model is an interpolated tree structure cluster-specific language model on which interpolation processing has been performed using the tree structure cluster-specific language models positioned higher in the tree structure.

A speech recognition method that receives recognition target speech, performs speech recognition, and outputs a speech recognition result, comprising:
a first step of receiving the recognition target speech and extracting a speech feature amount;
a second step of selecting a plurality of language models having a high occurrence probability for a word string of a speech recognition result candidate from a tree structure cluster-specific language model generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster;
a third step of generating a mixed language model from the plurality of language models selected in the second step; and
a fourth step of collating the speech feature amount extracted in the first step using an acoustic model for determining the probability of an acoustic observation value sequence of speech and the language model generated in the third step, and outputting a speech recognition result.

The speech recognition method according to claim 4, wherein in the second step the plurality of language models are selected from the cluster-specific language models of the lowermost leaf nodes of the tree structure cluster-specific language model.

A computer-readable recording medium on which is recorded a speech recognition program that receives recognition target speech, performs speech recognition, and outputs a speech recognition result, the speech recognition program realizing:
a speech feature amount extraction procedure for receiving the recognition target speech and extracting a speech feature amount;
a multiple language model selection procedure for selecting a plurality of language models having a high occurrence probability for a word string of a speech recognition result candidate from a tree structure cluster-specific language model generated by performing tree structure clustering that hierarchically divides learning text data so as to have linguistically similar properties and by using the learning text data of each tree structure cluster;
a mixed language model generation procedure for generating a mixed language model from the plurality of language models selected by the multiple language model selection procedure; and
a collation procedure for collating the speech feature amount extracted by the speech feature amount extraction procedure using an acoustic model for determining the probability of an acoustic observation value sequence of speech and the language model generated by the mixed language model generation procedure, and outputting a speech recognition result.

The computer-readable recording medium on which the speech recognition program is recorded according to claim 6, wherein the multiple language model selection procedure selects the plurality of language models from the cluster-specific language models of the lowermost leaf nodes of the tree structure cluster-specific language model.