JPH11296515A

JPH11296515A - Language model approximation learning device, its method and storage medium recording approximation learning program

Info

Publication number: JPH11296515A
Application number: JP10099488A
Authority: JP
Inventors: Yasunari Maeda; 康成前田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-04-10
Filing date: 1998-04-10
Publication date: 1999-10-29

Abstract

PROBLEM TO BE SOLVED: To learn an approximate model that is expressed with fewer parameters than the true model yet is close to it, by outputting, from learning data, a low-order n-gram approximate model close to the true n-gram model with the Kullback-Leibler (KL) information as the evaluation measure. SOLUTION: In a device for approximation learning of a language model expressed by parameters, an n-gram Bayesian learning means 100 receives a word sequence as learning data, calculates the Bayes estimate with respect to the KL information relative to the true n-gram model corresponding to the language model, and outputs an n-gram Bayesian estimation model. A low-order n-gram learning means 200 receives the n-gram Bayesian estimation model learned by the means 100 and, with the KL information as the evaluation measure, calculates a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

FIELD OF THE INVENTION: The present invention relates to approximation learning of a language model in natural language processing and, more particularly, to an apparatus and a method that, when an n-gram model (a higher-order Markov chain) is assumed as the language model, learn an approximate model that is as close as possible to the true language model while having few parameters.

[0002]

DESCRIPTION OF THE RELATED ART: Natural language processing requires learning a language model of the language concerned. In many cases an n-gram model, which is a higher-order Markov chain, is adopted as the language model. Here, an n-gram denotes the statistics of how often each word appears immediately after a tuple of n words that appear adjacently in a text. Various methods that asymptotically identify the true model have been proposed for learning higher-order Markov chains. For example, in the Bayes codes based on Bayesian statistics described in Matsushima, Inazumi, Hirasawa, "A Class of Distortionless Codes Designed by Bayes Decision Theory", IEEE Trans. IT, Vol. 37, No. 5, Sept., pp. 1288-1293 (1991), or in Matsushima and Hirasawa, "On Universal Codes for FSMX Sources", 18th Symposium on Information Theory and Its Applications, pp. 377-380 (1995), a beta distribution is assumed as the prior distribution of the parameters and the Bayes estimator with respect to the KL information is used. As a result, the learned model is the one that, for finite learning data, minimizes under the Bayes criterion the KL information, which is the distance between the true distribution and the estimated distribution.

[0003] The Bayes estimator with respect to the KL information is given by

[0004]

(Equation 1)

[0005] where a_i is a symbol of the source alphabet, x^M is the learning data, θ is the continuous parameter governing the higher-order Markov chain, α(a_i|a^n) is a parameter of the beta distribution, n(a_i|a^n) is a frequency counter giving the number of times a_i occurred after the sequence a^n in the learning data, and p(a_i|a^n, x^M) is the estimate, based on the learning data x^M, of the probability that a_i occurs after the sequence a^n.
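The image for Equation 1 is not reproduced in this text. Under the assumptions stated above (a beta, or more generally Dirichlet, prior with parameters α and the KL information as the loss), the Bayes estimator is the posterior predictive probability, which would take the standard form below. This is a reconstruction from the symbol definitions in paragraph [0005], not the patent's typeset equation; the summation index a_j is introduced here only for illustration.

```latex
p(a_i \mid a^n, x^M) =
  \frac{n(a_i \mid a^n) + \alpha(a_i \mid a^n)}
       {\sum_{a_j} \bigl[\, n(a_j \mid a^n) + \alpha(a_j \mid a^n) \,\bigr]}
```

In this form every symbol with α > 0 receives a nonzero probability even when its count n is zero.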

[0006] The KL information is given by

[0007]

(Equation 2)

[0008] where θ* is the true parameter governing the higher-order Markov chain, and π(a^n, θ*) denotes the stationary distribution of state a^n of the Markov chain governed by θ*, obtained by solving the simultaneous equations

[0009]

(Equation 3)

[0010] However, when a learned probability model is to be implemented on a general-purpose computer or the like, implementing a model of the same order as the true model as it is means that the parameters representing the model occupy an enormous amount of memory. For this reason, approximation learning methods for language models have been proposed in which a model of the same order as the true model is learned first and is then relearned as a model with fewer parameters than the true model.
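As an illustration of the quantities defined around Equations 2 and 3, the following sketch computes the stationary distribution of a small first-order chain and the stationary-weighted KL information between a true and an estimated transition matrix. The equation images are missing from this text, so the weighted-sum form used here is an assumption based on the symbol definitions. The function names and the two-state matrices are hypothetical, and power iteration stands in for solving the simultaneous equations directly.

```python
import math

def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate pi satisfying pi = pi P by power iteration.

    P is a row-stochastic matrix (list of rows); the chain is assumed
    to be irreducible and aperiodic so that the iteration converges.
    """
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        nxt = [sum(pi[a] * P[a][b] for a in range(k)) for b in range(k)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

def kl_information(P_true, P_est):
    """Stationary-weighted KL between two transition matrices (assumed form)."""
    pi = stationary_distribution(P_true)
    total = 0.0
    for a, row in enumerate(P_true):
        for b, p in enumerate(row):
            if p > 0.0:
                total += pi[a] * p * math.log(p / P_est[a][b])
    return total

# Hypothetical two-state example.
P_true = [[0.9, 0.1], [0.5, 0.5]]
P_est = [[0.8, 0.2], [0.5, 0.5]]
pi = stationary_distribution(P_true)
kl = kl_information(P_true, P_est)
```

The value is zero when the two matrices coincide and grows as the estimated transition probabilities move away from the true ones in frequently visited states.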

[0011] In such conventional approximation learning methods, the evaluation measure used during learning cannot be said to express accurately the difference between the true model and the approximate model. For example, in the method proposed in Brown, "Class-Based n-gram models of Natural Language", Computational Linguistics, Vol. 18, No. 4, pp. 467-479 (1992), as in the following equation,

[0012]

(Equation 4)

[0013] the n-gram model on the left-hand side, which is an (n-1)-th order Markov chain, is approximated by the combination, on the right-hand side, of an (n-1)-th order Markov chain over classes and a multinomial distribution of words for each class. Here, w_i denotes a word

[0014]

[External character 1]

[0015] and c_i denotes a class

[0016]

[External character 2]

[0017] FIG. 9 is a diagram explaining the operating principle of the conventional approximation learning method for the case of a 2-gram model. Learning data is input in step 101, and in step 102 the maximum likelihood estimator

[0018]

(Equation 5)

[0019] is used to learn a 2-gram model from the learning data. In step 103, |W| classes are prepared and one class is assigned to each word w_i; in step 104, a time counter is set to 0. Next, in step 105, letting C_i be the partition C at time t, the two classes that minimize the decrease in the average mutual information between classes, given by equation (6) below,

[0020]

(Equation 6)

[0021] namely the two classes

[0022]

[External character 3]

[0023] are merged. In this way, a configuration (partition C) that keeps the decrease in the value of the likelihood function accompanying the reduction in the number of parameters as small as possible is obtained sequentially by a greedy algorithm. The maximum likelihood estimator based on relative frequencies is used as the estimator. In step 106 the time counter is updated, and in step 107 the value of the time counter is used to repeat step 105 T times. Repeating step 105 T times reduces the number of classes to |W| - T. In terms of the number of parameters, the original 2-gram model requires |W|(|W| - 1) parameters, whereas after approximation learning the model can be represented with (|W| - T)(|W| - 1) + (|W| - T)(|W| - T - 1) parameters.

[0024]

PROBLEMS TO BE SOLVED BY THE INVENTION: However, the conventional approximation learning method described above has the following problems. First, the maximum likelihood estimator is used in step 102, but with a finite number of samples the maximum likelihood estimator is guaranteed, at best, only to minimize the variance of the estimation error. Learning the 2-gram model with the maximum likelihood estimator therefore does not provide a sufficient guarantee.

[0025] Second, the maximum likelihood method is essentially a technique that maximizes the value of the likelihood function with the model held fixed. In step 105, however, a class configuration that makes the value of the likelihood function as large as possible is sought while the model itself is changed by merging classes. Consequently, the theoretical validity of using the average mutual information of equation (6) as the evaluation measure is not guaranteed.

[0026] Third, the maximum likelihood estimator is used in steps 102 and 105, but it cannot cope with the zero-frequency problem, in which the parameter estimate for a word observed zero times becomes 0, so a separate correction means must be provided to deal with that problem. Because of the first to third problems above, an approximate model learned by the conventional approximation learning method is not necessarily close to the original 2-gram model. The same holds for the generalized n-gram model.

[0027] Accordingly, in view of the above problems of the prior art, an object of the present invention is to provide an approximation learning apparatus and method for a language model that, by introducing a more rigorous evaluation measure, learn an approximate model that is expressed with fewer parameters than the true model and is close to the true model. Another object of the present invention is to provide a recording medium on which an approximation learning program for a language model is recorded.

[0028]

MEANS FOR SOLVING THE PROBLEMS: FIG. 1 is a diagram showing the principle configuration of the present invention. The approximation learning apparatus according to the present invention, for a language model expressed by parameters, comprises: n-gram Bayesian learning means 100, which receives a word sequence as learning data, calculates the Bayes estimator with respect to the KL (Kullback-Leibler) information relative to the true n-gram model corresponding to the language model, and outputs an n-gram Bayesian estimation model; and low-order n-gram learning means 200, which receives the n-gram Bayesian estimation model learned by the n-gram Bayesian learning means 100 and, with the KL information as the evaluation measure, calculates a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model. The apparatus is characterized in that it outputs, from the learning data, a low-order n-gram approximate model that is close to the true n-gram model with the KL information as the evaluation measure.

[0029] FIG. 2 is a flowchart for explaining the principle of the present invention. The approximation learning method for a language model according to the present invention comprises: step 10 of inputting a word sequence as learning data; step 20 of learning an n-gram Bayesian estimation model by calculating the Bayes estimator with respect to the KL information relative to the true n-gram model corresponding to the language model; step 30 of receiving the learned n-gram Bayesian estimation model and, with the KL information as the evaluation measure, learning a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model; and step 40 of outputting the low-order n-gram approximate model, obtained by this learning, that is close to the true n-gram model.

[0030] Accordingly, first, because the approximation learning technique of the apparatus and method of the present invention is based on an n-gram Bayesian estimation model that uses the Bayes estimator with respect to the KL information, the low-order n-gram approximate model finally output is close to the true model in terms of the KL information. Second, according to the present invention, the KL information, which indicates the distance between models, is also used as the evaluation measure when the low-order n-gram approximate model is derived from the n-gram Bayesian estimation model, so the low-order n-gram approximate model finally output is likewise close to the true model in terms of the KL information. Third, because the present invention uses a Bayes estimator, the zero-frequency problem can be handled through the choice of the prior distribution.

[0031] The present invention also provides a recording medium on which an approximation learning program for a language model expressed by parameters is recorded, the program comprising: a process of receiving a word sequence as learning data, calculating the Bayes estimator with respect to the KL information relative to the true n-gram model corresponding to the language model, and learning an n-gram Bayesian estimation model; and a process of calculating, with the KL information as the evaluation measure, a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model, the program being characterized in that it causes a low-order n-gram approximate model close to the true n-gram model to be output from the learning data with the KL information as the evaluation measure.

[0032]

EMBODIMENTS OF THE INVENTION: FIG. 3 is a configuration diagram of a preferred embodiment of the language model approximation learning apparatus according to the present invention. As shown in the figure, the apparatus comprises an n-gram Bayesian learning unit 100 and a low-order n-gram learning unit 200. The n-gram Bayesian learning unit 100 comprises a frequency calculator 110, a Bayes estimator calculator 120, and a beta distribution parameter table 130. The low-order n-gram learning unit 200 comprises a class merger 210 and a KL information calculator 220.

[0033] First, the operation of the n-gram Bayesian learning unit 100 is described with reference to its operation flowchart shown in FIG. 4. A word sequence is first supplied as learning data (step 50). This learning data is formed, for example, by applying morphological analysis to "natural sentences", that is, ordinary text, and writing the result with a space between morphemes. Given the word sequence of the learning data, the frequency calculator 110 calculates the frequency counter n(w_i|w^(n-1)), that is, the number of times the word w_i occurred after the sequence w^(n-1) in the learning data (step 52).
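The counting performed in step 52 can be sketched as follows. This is a minimal illustration, assuming the learning data is already a list of morphemes; the function name `count_ngrams`, the parameter `context_len` (the length of the preceding sequence w^(n-1)), and the toy word list are hypothetical and do not come from the patent.

```python
from collections import Counter

def count_ngrams(words, context_len):
    """Return counts[(context, word)]: how often `word` followed `context`.

    `context` is a tuple of `context_len` consecutive words, playing the
    role of the sequence w^(n-1); `word` plays the role of w_i.
    """
    counts = Counter()
    for i in range(context_len, len(words)):
        context = tuple(words[i - context_len:i])
        counts[(context, words[i])] += 1
    return counts

# Toy learning data: a morpheme-separated "sentence".
words = "a b a b b a".split()
counts = count_ngrams(words, context_len=1)
```

A `Counter` returns 0 for pairs that were never observed, which is exactly the case the prior distribution has to cover in the next step.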

[0034] Given the frequency counters, the Bayes estimator calculator 120 reads the parameters α(w_i|w^(n-1)) of the beta distribution representing the prior distribution, which are stored in the beta distribution parameter table 130 (step 54). The Bayes estimator calculator 120 then calculates the Bayes estimator with respect to the KL information given by the following equation,

[0035]

(Equation 7)

[0036] (step 56), and outputs the n-gram Bayesian estimation model, that is, the model that, for finite learning data, minimizes under the Bayes criterion the KL information, which is the distance between the true model and the estimated model (step 58). Next, the operation of the low-order n-gram learning unit 200 is described with reference to its operation flowchart shown in FIG. 5. First, the class merger 210 reads the n-gram Bayesian estimation model output from the Bayes estimator calculator 120 (step 60).
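Steps 54 to 58 can be sketched as below. The image for Equation 7 is missing from this text, so the code assumes the standard posterior predictive form (count plus prior parameter, normalized over the vocabulary). As a simplification, a single value `alpha` is used for every word and context instead of a per-entry table such as the beta distribution parameter table 130; all names and numbers are hypothetical.

```python
def bayes_estimate(counts, alpha, vocab, context):
    """Estimate p(w | context) as (n + alpha) / sum over vocab of (n + alpha).

    counts: mapping (context, word) -> observed frequency n(w | context)
    alpha:  prior parameter, assumed identical for all words (simplification)
    """
    denom = sum(counts.get((context, v), 0) + alpha for v in vocab)
    return {w: (counts.get((context, w), 0) + alpha) / denom for w in vocab}

# Toy counts: after context ("a",), "b" was seen 3 times, "a" once, "c" never.
vocab = ["a", "b", "c"]
counts = {(("a",), "b"): 3, (("a",), "a"): 1}
p = bayes_estimate(counts, alpha=1.0, vocab=vocab, context=("a",))
```

The unseen word "c" still receives a positive probability, which is how a suitable prior addresses the zero-frequency problem without a separate correction step.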

[0037] Next, the class merger 210 prepares |W|^(n-1) classes, assigns one class to each word sequence w^(n-1) (step 62), and sets the time counter t to 0 (step 64). Note that in this preferred embodiment of the present invention a "set of word sequences" is partitioned into "classes", whereas in the conventional approximation learning method described above a "set of words" is partitioned into "classes". That is, the preferred embodiment considers a set of sequences whose length matches the true model, while the conventional method always considers only sequences of length 1, namely a set of "words".

[0038] Next, the class merger 210 finds the two classes that minimize the quantity given by the following equation, which is calculated by the KL information calculator 220,

[0039]

(Equation 8)

[0040] that is, the increase in the KL information caused by merging two classes. These two classes,

[0041]

[External character 4]

[0042] are merged, and in this way a class configuration (partition C) that is as close as possible to the true model is obtained sequentially by a greedy algorithm (step 66). In equation (8), p(·|·, t) denotes the approximate model at time t (at time 0 it corresponds to the n-gram Bayesian estimation model), and π(c_i, t) denotes the stationary distribution of class c_i at time t. Further, p(w_k|c_i ∪ c_j, t+1) is the estimate that minimizes the increase in the KL information with the two classes to be merged, (c_i, c_j), held fixed, and is obtained by the following equation.

[0043]

(Equation 9)

[0044] As can be seen from equation (9), in this preferred embodiment the increase in the KL information that accompanies each merge operation is minimized by taking a weighted average with respect to the stationary distribution. The stationary distribution at time 0, π(c_i, 0), is obtained by solving the simultaneous equations consisting of equations (10) and (11) below.

[0045]

(Equation 10)

[0046] For time 1 and later, the stationary distribution of the merged class is updated by
π(c_i ∪ c_j, t+1) = π(c_i, t) + π(c_j, t)   (12)
For all other classes the stationary distribution does not change:
π(c_i, t+1) = π(c_i, t)   (13)

[0047] For classes that are not merged, p(w_k|c_i, t+1) does not change:
p(w_k|c_i, t+1) = p(w_k|c_i, t)   (14)
As shown in FIG. 3, the class merger 210 and the KL information calculator 220 exchange the distribution information needed to calculate the KL information of equation (8) and the resulting KL information.
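One iteration of the greedy merge in step 66 can be sketched as follows. The images for equations (8) and (9) are missing from this text, so both forms used here are assumptions inferred from the description: the merged distribution is the stationary-weighted average of the two class distributions, and the cost of a merge is the stationary-weighted KL information from each original class to the merged class. The updates of π follow equations (12) and (13), and unmerged classes are kept unchanged as in equation (14). All function names and the toy numbers are hypothetical.

```python
import math
from itertools import combinations

def merged_distribution(p_i, p_j, pi_i, pi_j):
    """Stationary-weighted average of two class distributions (assumed eq. 9)."""
    total = pi_i + pi_j
    return [(pi_i * a + pi_j * b) / total for a, b in zip(p_i, p_j)]

def kl(p, q):
    """KL information from p to q; terms with p = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def merge_cost(p_i, p_j, pi_i, pi_j):
    """Assumed form of eq. 8: KL increase caused by merging classes i and j."""
    m = merged_distribution(p_i, p_j, pi_i, pi_j)
    return pi_i * kl(p_i, m) + pi_j * kl(p_j, m)

def greedy_merge_step(p, pi):
    """Merge the cheapest pair of classes; return (new_p, new_pi, pair)."""
    i, j = min(
        combinations(range(len(p)), 2),
        key=lambda ij: merge_cost(p[ij[0]], p[ij[1]], pi[ij[0]], pi[ij[1]]),
    )
    merged = merged_distribution(p[i], p[j], pi[i], pi[j])
    keep = [k for k in range(len(p)) if k not in (i, j)]
    new_p = [p[k] for k in keep] + [merged]            # eq. (14) for kept classes
    new_pi = [pi[k] for k in keep] + [pi[i] + pi[j]]   # eqs. (13) and (12)
    return new_p, new_pi, (i, j)

# Toy model: three classes over a two-word vocabulary.
p = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
pi = [0.5, 0.3, 0.2]
new_p, new_pi, pair = greedy_merge_step(p, pi)
```

Calling `greedy_merge_step` T times on its own output would reduce the number of classes by T. The sketch evaluates every pair at each step, which is quadratic in the number of classes.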

[0048] The time counter is updated (step 68), and it is determined whether the value of the time counter equals T (step 68). If it does not, processing returns to step 66, so that step 66 is repeated T times. Repeating step 66 T times yields a low-order n-gram approximate model in which the number of classes has been reduced to |W|^(n-1) - T. In terms of the number of parameters, the original n-gram model requires |W|^(n-1) (|W| - 1) parameters, whereas the low-order n-gram approximate model after approximation learning can be represented with (|W|^(n-1) - T)(|W| - 1) parameters. Finally, this low-order n-gram approximate model is output from the class merger 210.

[0049] FIG. 6 is a configuration diagram of a language model learning system according to an embodiment of the present invention. The language model learning system has a morphological analysis unit 50 that receives natural sentences, separates them into morphemes, and generates learning data by writing the result with a space between morphemes. The learning system of this example further has the n-gram Bayesian learning unit 100, which receives the learning data, calculates the Bayes estimator with respect to the KL information relative to the true n-gram model corresponding to the language model, and outputs an n-gram Bayesian estimation model. The learning system further has the low-order n-gram learning unit 200, which receives the n-gram Bayesian estimation model learned by the n-gram Bayesian learning unit 100 and, with the KL information as the evaluation measure, calculates a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model. With the language model learning system of this example, an approximate model of the language model of a language can be learned from natural sentences, such as ordinary text, written in the language for which the language model is to be generated.

[0050] FIG. 7 is a configuration diagram of a document file compression system according to another embodiment of the present invention. The document file compression system of this embodiment incorporates the language model learning system described with reference to FIG. 6; it compresses an input document file and outputs the compressed document file. The document file compression system comprises the morphological analysis unit 50, which receives a document file consisting of natural sentences and generates learning data, the n-gram Bayesian learning unit 100, and the low-order n-gram learning unit 200. The document file compression system further has a document file compression unit 300, which is connected to the morphological analysis unit 50 and the low-order n-gram learning unit 200, receives the learning data and the low-order n-gram approximate model, compresses the morpheme-separated data using the approximate model, and outputs the compressed document file.

[0051] Conventional document compression techniques compress a document file as a sequence of 0s and 1s, whereas the document compression system according to this other embodiment of the present invention is characterized in that it compresses a document as a sequence of morphemes. Because the probability model used for compression can be learned to fit the document file to be compressed, a higher compression efficiency than that of conventional document compression techniques can be achieved.
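The link between the probability model and compression efficiency can be illustrated with the ideal code length of a morpheme sequence. The patent does not specify the coding scheme used by the document file compression unit 300; the sketch below assumes an entropy coder (for example arithmetic coding), whose output length approaches the sum of -log2 p over the sequence. The function name, the model layout (a dictionary from context tuples to next-word probabilities), and the numbers are hypothetical.

```python
import math

def ideal_code_length_bits(words, model, context_len):
    """Sum of -log2 p(word | context) over the sequence.

    This is the length, in bits, that an ideal entropy coder driven by
    `model` would approach; the first `context_len` words are not coded here.
    """
    bits = 0.0
    for i in range(context_len, len(words)):
        context = tuple(words[i - context_len:i])
        bits += -math.log2(model[context][words[i]])
    return bits

# Toy first-order model over two morphemes.
model = {
    ("a",): {"a": 0.25, "b": 0.75},
    ("b",): {"a": 0.5, "b": 0.5},
}
words = ["a", "b", "b", "a"]
bits = ideal_code_length_bits(words, model, context_len=1)
```

A model that assigns higher probability to the morphemes that actually occur yields a smaller total, which is why fitting the model to the document being compressed can improve efficiency.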

[0052] The configuration of the language model approximation learning apparatus is not limited to the examples described in the above embodiments. Each component of the apparatus may be built as software (a program), stored on a disk device or the like, and installed as needed on the computer of the language model approximation learning apparatus to perform learning of the approximate model. Furthermore, the program so built may be stored on a portable storage medium such as a floppy disk or a CD-ROM and used generally wherever such a system is employed.

[0053] The present invention is not limited to the above embodiments, and various modifications and applications are possible within the scope of the claims.

[0054]

EXAMPLE: FIG. 8 is a diagram explaining simulation results for the language model approximation learning method according to the present invention. In this example, the true model is assumed to be a 2-gram model with |W| = 10 and a first-order Markov chain transition probability matrix as given by the following equation,

[0055]

(Equation 11)

[0056] and the conventional approximation learning method is compared with the approximation learning method of the present invention by simulation. In equation (15), θ* is the true parameter governing the transition probability matrix. When the Bayes estimator calculator 120 of the preferred embodiment calculates the Bayes estimator, the parameters in the beta distribution parameter table 130 are set so that the prior distribution is uniform.

【００５７】[0057]

In the simulation, learning data are generated from the simple Markov chain of equation (15) above. From these data, an approximate model is learned by each of the prior-art approximation learning methods and by the approximation learning method of the present invention, and the resulting models are evaluated by cross-entropy. Cross-entropy is an evaluation measure frequently used to assess how close two distributions are. The entropy is the theoretical limit (the value attained by the true model), and the lower the cross-entropy, the closer the approximate model is judged to be to the true model. As shown in FIG. 8 under the label "present invention 45", whereas the true model has 90 parameters, the approximation learning method of the present invention uses five classes and yields an approximate model expressed with only 45 parameters. By contrast, the prior-art approximation learning method labeled "conventional 65" in the same figure uses five classes and reduces the number of parameters to 65, and the prior-art method labeled "conventional 48" uses four classes and reduces the number of parameters to 48. In FIG. 8, the vertical axis represents the cross-entropy, which is the evaluation measure, and the horizontal axis represents the length of the learning data. As the figure shows, the approximate model learned by the present invention is closer to the true model than those learned by the conventional approximation learning methods, which use more parameters. That is, the present invention succeeds in learning an approximate model that is closer to the true model with fewer parameters than the prior art. The figure also shows that the cross-entropy of the approximate model according to the present invention decreases monotonically as the learning data become longer.
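The evaluation described above can be illustrated with a small sketch. It assumes the cross-entropy is taken as a per-symbol rate of a first-order Markov chain, weighted by the stationary distribution of the true chain; the patent may instead measure it empirically on generated data. The function names and the two-state matrices `P` and `Q` are hypothetical and are not the matrix of equation (15).

```python
import math

def stationary(P, iters=1000):
    """Approximate the stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def cross_entropy_rate(P_true, Q_model):
    """Cross-entropy (bits per symbol) of Q_model measured against the true chain P_true.

    Q_model must assign non-zero probability wherever P_true does.
    """
    pi = stationary(P_true)
    h = 0.0
    for i, row in enumerate(P_true):
        for j, p in enumerate(row):
            if p > 0.0:
                h -= pi[i] * p * math.log2(Q_model[i][j])
    return h

# Hypothetical two-state example (not taken from the patent).
P = [[0.9, 0.1], [0.2, 0.8]]   # "true" transition matrix
Q = [[0.5, 0.5], [0.5, 0.5]]   # a crude approximate model

entropy = cross_entropy_rate(P, P)   # theoretical limit: the true model scored against itself
xent = cross_entropy_rate(P, Q)      # cross-entropy of the approximation
```

In this sketch `entropy` is the lower bound and `xent` can never fall below it, which mirrors the way the curves in FIG. 8 are read: the closer a model's cross-entropy lies to the entropy of the true model, the better the approximation.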

【００５８】[0058]

【発明の効果】[Effects of the Invention] As described above, according to the present invention, a Bayesian estimator with respect to the KL information amount is calculated to learn an n-gram Bayesian estimation model, and, taking the learned n-gram Bayesian estimation model as input, a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model is learned with the KL information amount as the evaluation measure. A low-order n-gram approximate model close to the true model can therefore be obtained from learning data consisting of word sequences.
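The first of the two stages summarized above can be sketched as follows. This is a minimal sketch rather than the patent's own formula, which is not reproduced in this passage. It relies on a standard result: under a Dirichlet prior (a beta prior in the two-outcome case), the estimate that minimizes the expected KL information amount is the posterior mean, (count + α) / (total + α·V). The function name `bayes_bigram`, the parameter `alpha`, and the toy data are hypothetical; `alpha = 1.0` corresponds to a uniform prior.

```python
from collections import Counter, defaultdict

def bayes_bigram(words, vocab, alpha=1.0):
    """Posterior-mean estimate of P(w_i | previous word) under a symmetric Dirichlet prior."""
    counts = defaultdict(Counter)
    # Count how often each word follows each context word.
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    model = {}
    for ctx in vocab:
        total = sum(counts[ctx].values()) + alpha * len(vocab)
        model[ctx] = {w: (counts[ctx][w] + alpha) / total for w in vocab}
    return model

# Hypothetical toy data.
vocab = ["a", "b", "c"]
model = bayes_bigram("a b a c a b".split(), vocab)
```

Here "b" follows "a" twice out of three occurrences, so with the uniform prior the estimate is (2 + 1) / (3 + 3) = 0.5, and no transition receives zero probability even if it was never observed.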

【００５９】[0059]

Furthermore, because the number of parameters representing the approximate model is reduced, the memory capacity required to implement the approximate model is also reduced.
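The parameter reduction comes from merging classes of contexts. A minimal sketch of such a greedy merge is given below. It assumes that a merged class takes the weight-averaged distribution of its two parts, and that the increase in KL information amount is the weighted sum of the divergences from each part to the merged distribution. These choices, and all names (`kl`, `merged`, `merge_cost`, `greedy_merge`), are illustrative assumptions, not the patent's exact procedure.

```python
import math
from itertools import combinations

def kl(p, q):
    """KL divergence D(p || q) in nats; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def merged(pa, wa, pb, wb):
    """Weight-averaged distribution of two classes, and the combined weight."""
    w = wa + wb
    return [(wa * x + wb * y) / w for x, y in zip(pa, pb)], w

def merge_cost(pa, wa, pb, wb):
    """Assumed increase in KL information amount caused by merging two classes."""
    pm, _ = merged(pa, wa, pb, wb)
    return wa * kl(pa, pm) + wb * kl(pb, pm)

def greedy_merge(classes, target):
    """classes: list of (members, distribution, weight); merge until `target` classes remain."""
    classes = list(classes)
    while len(classes) > target:
        # Pick the pair whose merge increases the KL information amount the least.
        a, b = min(
            combinations(range(len(classes)), 2),
            key=lambda ij: merge_cost(classes[ij[0]][1], classes[ij[0]][2],
                                      classes[ij[1]][1], classes[ij[1]][2]),
        )
        (ma, pa, wa), (mb, pb, wb) = classes[a], classes[b]
        pm, wm = merged(pa, wa, pb, wb)
        classes = [c for k, c in enumerate(classes) if k not in (a, b)]
        classes.append((ma + mb, pm, wm))
    return classes

# Hypothetical example: contexts "x" and "y" behave alike, "z" does not.
start = [(("x",), [0.9, 0.1], 1.0),
         (("y",), [0.88, 0.12], 1.0),
         (("z",), [0.1, 0.9], 1.0)]
result = greedy_merge(start, 2)
```

In the example the two similar contexts are merged first and the dissimilar one is left alone. If each class over a V-word vocabulary carries V − 1 free probabilities, halving the number of classes halves the parameter count; with an assumed V = 10 this would give 90 and 45, consistent with the figures quoted for the simulation, although the vocabulary size is not stated in this passage.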

【図面の簡単な説明】[Brief description of the drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 is a flowchart illustrating the principle of the present invention.

FIG. 3 is a block diagram of a preferred embodiment of the language model approximation learning apparatus according to the present invention.

FIG. 4 is an operation flowchart of the n-gram Bayesian learning unit according to a preferred embodiment of the present invention.

FIG. 5 is an operation flowchart of the low-order n-gram learning unit according to a preferred embodiment of the present invention.

FIG. 6 is a configuration diagram of a language model learning system according to an embodiment of the present invention.

FIG. 7 is a configuration diagram of a document file compression system according to another embodiment of the present invention.

FIG. 8 is a diagram explaining the simulation results of the language model approximation learning method according to the present invention.

FIG. 9 is a diagram explaining the operating principle of a conventional approximation learning method.

【符号の説明】[Explanation of symbols]

100: n-gram Bayesian learning means
200: low-order n-gram learning means

Claims

【特許請求の範囲】[Claims]

1. An approximation learning apparatus for a language model expressed by parameters, comprising: n-gram Bayesian learning means for receiving a word sequence as learning data, calculating a Bayesian estimator with respect to the KL (Kullback-Leibler) information amount relative to a true n-gram model corresponding to the language model, and thereby learning an n-gram Bayesian estimation model; and low-order n-gram learning means for receiving the n-gram Bayesian estimation model learned by the n-gram Bayesian learning means and calculating, with the KL information amount as the evaluation measure, a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model, wherein a low-order n-gram approximate model close to the true n-gram model is output from the learning data with the KL information amount as the evaluation measure.

2. The approximation learning apparatus for a language model according to claim 1, wherein the n-gram Bayesian learning means comprises: frequency calculation means for calculating the number of times a word (w_i) occurs after a word sequence (w^{n-1}) in the learning data; and Bayesian estimator calculation means for reading the parameters of a prior distribution from a distribution parameter table that stores the parameters of the prior distribution of the word (w_i) occurring after the word sequence (w^{n-1}), and for calculating, using the number of times calculated by the frequency calculation means and the parameters of the prior distribution, a Bayesian estimator with respect to the KL information amount such that the KL information amount, which represents the distance between the true model and the estimated n-gram Bayesian estimation model, is minimized under the Bayes criterion for the learning data.

3. The approximation learning apparatus for a language model according to claim 1 or 2, wherein the low-order n-gram learning means comprises: KL information amount calculation means for calculating, for the classes assigned to the respective word sequences of length n-1, the increase in KL information amount between the low-order n-gram approximate model before two classes are merged and the low-order n-gram approximate model after the merge; and class merging means for merging two classes so as to minimize the increase in KL information amount calculated by the KL information amount calculation means, and wherein the n-gram Bayesian estimation model received from the n-gram Bayesian learning means is used as the initial pre-merge low-order n-gram approximate model, and class configurations close to the true n-gram model are obtained successively.

4. An approximation learning method for a language model expressed by parameters, comprising the steps of: inputting a word sequence as learning data; calculating a Bayesian estimator with respect to the KL information amount relative to a true n-gram model corresponding to the language model, thereby learning an n-gram Bayesian estimation model; receiving the learned n-gram Bayesian estimation model and learning, with the KL information amount as the evaluation measure, a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model; and outputting the low-order n-gram approximate model obtained by the learning, which is close to the true n-gram model.

5. The approximation learning method for a language model according to claim 4, wherein the step of learning the n-gram Bayesian estimation model comprises the steps of: calculating the number of times a word (w_i) occurs after a word sequence (w^{n-1}) in the learning data; reading the parameters of a beta distribution representing the prior distribution; calculating the Bayesian estimator with respect to the KL information amount; and outputting the n-gram Bayesian estimation model for which the KL information amount, which is the distance between the true model and the estimated model, is minimized under the Bayes criterion for finite learning data.

6. The approximation learning method for a language model according to claim 4 or 5, wherein the step of learning the low-order n-gram model comprises the steps of: reading the n-gram Bayesian estimation model; assigning one class to each word sequence of length n-1; successively obtaining class configurations as close as possible to the true model by merging the two classes that minimize the increase in KL information amount, thereby calculating a low-order n-gram approximate model with a reduced number of classes; and outputting the low-order n-gram approximate model.

7. A recording medium on which is recorded an approximation learning program for a language model expressed by parameters, the program comprising: a process of receiving a word sequence as learning data, calculating a Bayesian estimator with respect to the KL (Kullback-Leibler) information amount relative to a true n-gram model corresponding to the language model, and learning an n-gram Bayesian estimation model; and a process of calculating, with the KL information amount as the evaluation measure, a low-order n-gram approximate model expressed with fewer parameters than the n-gram Bayesian estimation model, wherein the program causes a low-order n-gram approximate model close to the true n-gram model to be output from the learning data with the KL information amount as the evaluation measure.

8. The recording medium on which is recorded the approximation learning program for a language model according to claim 7, wherein the program further comprises: a process of calculating the number of times a word (w_i) occurs after a word sequence (w^{n-1}) in the learning data; and a process of reading the parameters of a prior distribution from a distribution parameter table that stores the parameters of the prior distribution of the word (w_i) occurring after the word sequence (w^{n-1}), and calculating, using the number of times calculated by the frequency calculation means and the parameters of the prior distribution, a Bayesian estimator with respect to the KL information amount such that the KL information amount, which represents the distance between the true model and the estimated n-gram Bayesian estimation model, is minimized under the Bayes criterion for the learning data.

9. The recording medium on which is recorded the approximation learning program for a language model according to claim 7 or 8, wherein the program further comprises: a process of calculating, for the classes assigned to the respective word sequences of length n-1, the increase in KL information amount between the low-order n-gram approximate model before two classes are merged and the low-order n-gram approximate model after the merge; and a process of merging two classes so as to minimize the calculated increase in KL information amount, and wherein the n-gram Bayesian estimation model is used as the initial pre-merge low-order n-gram approximate model, and class configurations close to the true n-gram model are obtained successively.