JP3043625B2

JP3043625B2 - Word classification processing method, word classification processing device, and speech recognition device

Info

Publication number: JP3043625B2
Application number: JP8198950A
Authority: JP
Inventors: 明潮田; 仁飯田
Original assignee: 株式会社エイ・ティ・アール音声翻訳通信研究所
Priority date: 1996-02-15
Filing date: 1996-07-29
Publication date: 2000-05-22
Anticipated expiration: 2016-07-29
Also published as: JPH09282321A

Description

【発明の詳細な説明】DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、音声認識装置、形
態素解析装置、及び構文解析装置のための単語分類処理
方法及び単語分類処理装置、並びに、上記単語分類処理
装置を備えた音声認識装置に関する。The present invention relates to a word classification processing method and a word classification processing device for use in speech recognition devices, morphological analysis devices, and syntax analysis devices, and to a speech recognition device provided with the word classification processing device.

【０００２】[0002]

【従来の技術】単語の分類体系は、音声認識装置、形態
素解析装置や構文解析装置において処理を円滑に行う上
で非常に重要な知識の１つである。この単語の分類体系
を構築するための１つの方法として、大量のテキストデ
ータに基づいて単語間の相互情報量を用いた方法（以
下、第１の従来例という。）が、例えば、従来技術文献
「Peter Brown, et al, “Class-Based n-gram Models
of Natural Language", Computational Linguistics,Vo
l.18,No.4,pp.467-479,１９９２年」において提案され
ている。この従来例の方法においては、ｎ−グラムモデ
ルを用いて所定の相互情報量を計算して英単語の分類を
行っている。2. Description of the Related Art: A word classification system is one of the most important kinds of knowledge for smooth processing in speech recognition devices, morphological analysis devices, and syntax analysis devices. One method for constructing such a word classification system, which uses the mutual information between words computed from a large amount of text data (hereinafter, the first conventional example), was proposed in, for example, the prior art document Peter Brown, et al., "Class-Based n-gram Models of Natural Language", Computational Linguistics, Vol. 18, No. 4, pp. 467-479, 1992. In this conventional method, English words are classified by computing a predetermined mutual information using an n-gram model.

【０００３】しかしながら、第１の従来例の相互情報量
による分類処理方法を用いて英単語を分類した場合、出
現頻度の低い単語が不適切に分類される場合が多いとい
う問題点があった。この問題点を解決するために、単語
分類処理装置及び音声認識装置（以下、第２の従来例と
いう。）が、本出願人により特願平７−０５６９１８号
の特許出願において提案されている。[0003] However, when English words are classified using the mutual-information-based classification method of the first conventional example, words with low appearance frequency are often classified inappropriately. To solve this problem, the present applicant proposed a word classification processing device and a speech recognition device (hereinafter, the second conventional example) in Japanese Patent Application No. 7-056918.

【０００４】当該第２の従来例の単語分類処理装置は、
単語のｎ−グラムを利用して、同一の単語に隣接する割
合の多い単語を同一のクラスに割り当てるという基準で
複数の単語を複数のクラスに分類する第１の分類手段
と、上記第１の分類手段によって分類された複数の単語
に対して、すべての単語の出現頻度を調べ、互いに異な
る第１のクラスの単語と第２のクラスの単語とが隣接し
て出現する頻度を、上記第１のクラスの単語の出現頻度
と第２のクラスの単語の出現頻度との積に対する相対的
な頻度の割合を表わす所定の相互情報量が最大となるよ
うに、上記複数の単語を二分木の形式で複数のクラスに
分類する第２の分類手段とを備えたことを特徴としてい
る。ここで、上記第２の分類手段は、好ましくは、上記
第１の分類手段によって分類された複数の単語に対し
て、すべての単語の出現頻度を調べ、出現頻度の高い単
語から順に、所定の複数Ｎ個のクラスに割り当て、Ｎ個
のクラスの中で上記相互情報量が最大である２つのクラ
スを１つのクラスとしてまとめることにより、（Ｎ−
１）個のクラスに分類し、クラスに割り当てられていな
い単語の中で、出現頻度が最大のものを新たにＮ番目の
クラスとして割り当て、すべての単語がＮ個のクラスに
割り当てられるまで、上記の処理を繰り返し、現在ある
クラスから上記相互情報量が最大である２つのクラスを
１つのクラスとしてまとめ、この処理を１個のクラスに
まとまるまで繰り返す。これにより、単語分類をより安
定に実行することができ、テキストデータから単語の分
類体系を自動的に獲得するときに、より精密で正確な分
類体系を得ることができるという特徴を有する。[0004] The word classification processing device of the second conventional example comprises first classification means and second classification means. The first classification means uses word n-grams to classify a plurality of words into a plurality of classes on the criterion that words that are frequently adjacent to the same word are assigned to the same class. The second classification means examines the appearance frequencies of all of the words classified by the first classification means and classifies them into a plurality of classes in the form of a binary tree so as to maximize a predetermined mutual information, which represents the ratio of the frequency with which words of a first class and words of a different second class appear adjacently to the product of the appearance frequency of the first-class words and the appearance frequency of the second-class words. Preferably, the second classification means examines the appearance frequencies of all of the words classified by the first classification means and assigns the words, in descending order of appearance frequency, to a predetermined number N of classes; merges the two classes having the largest mutual information among the N classes into one class, leaving (N−1) classes; assigns the most frequent word not yet assigned to a class as a new N-th class; repeats this processing until every word has been assigned to one of the N classes; and then repeatedly merges the two classes having the largest mutual information among the current classes until only one class remains. As a result, word classification can be performed more stably, and a more precise and accurate classification system can be obtained when a word classification system is acquired automatically from text data.

【０００５】[0005]

【発明が解決しようとする課題】上記第１の従来例の相
互情報量による分類処理方法を用いて英単語を分類した
場合、出現頻度の低い単語が不適切に分類される場合が
多い。この原因としては、分離結果がバランスのとれた
階層構造となっていないためであると考えられる。When English words are classified using the mutual-information-based classification method of the first conventional example, words with low appearance frequency are often classified inappropriately. The likely cause is that the resulting classification does not form a balanced hierarchical structure.

【０００６】また、上記第２の従来例においては、互い
に異なる第１のクラスの単語と第２のクラスの単語とが
隣接して出現する頻度を、上記第１のクラスの単語の出
現頻度と第２のクラスの単語の出現頻度との積に対する
相対的な頻度の割合を表わす所定の相互情報量が最大と
なるように、上記複数の単語を二分木の形式で複数のク
ラスに分類しているので、上記第１のクラスの単語と上
記第２のクラスの単語においては、局所的に最適化され
た単語分類結果を得ることができるが、全体的に最適化
された単語分類結果を得ることができないという問題点
があった。Further, in the second conventional example, the plurality of words are classified into a plurality of classes in the form of a binary tree so as to maximize a predetermined mutual information, which represents the ratio of the frequency with which words of a first class and words of a different second class appear adjacently to the product of the appearance frequency of the first-class words and the appearance frequency of the second-class words. A locally optimized word classification result can therefore be obtained for the words of the first class and the words of the second class, but a globally optimized word classification result cannot.

【０００７】本発明の目的は以上の問題点を解決し、単
語分類処理によりバランスのとれた階層構造を有しかつ
全体的に最適化された単語分類結果を得ることができる
単語分類処理方法、単語分類処理装置、及びその単語分
類処理装置を備えた音声認識装置を提供することにあ
る。An object of the present invention is to solve the above problems and to provide a word classification processing method and a word classification processing device that yield a word classification result having a balanced hierarchical structure and optimized globally, as well as a speech recognition device provided with the word classification processing device.

【０００８】[0008]

【課題を解決するための手段】本発明に係る請求項１記
載の単語分類処理方法は、複数の単語を含むテキストデ
ータに対して、互いに異なるすべての複数ｖ個の単語の
出現頻度を調べ、出現頻度の高い単語から順に並べて、
複数ｖ個のクラスに割り当てるステップと、上記複数ｖ
個のクラスの単語のうち出現頻度が高いｖ個未満の（ｃ
＋１）個のクラスの単語を１つのウィンドウ内のクラス
の単語として第１の記憶装置に記憶するステップと、上
記第１の記憶装置に記憶された１つのウィンドウ内のク
ラスの単語に基づいて、第１のクラスの単語の出現確率
と第２のクラスの単語の出現確率との積に対する、互い
に異なる第１のクラスの単語と第２のクラスの単語とが
隣接して出現する確率の相対的な割合を表わす所定の平
均相互情報量が最大となるように、上記複数の単語を二
分木の形式で複数ｃ個のクラスに分類し、分類された複
数ｃ個のクラスを、単語分類結果を表わす全体のツリー
図の中間層の複数ｃ個のクラスとして第２の記憶装置に
記憶するステップと、上記第２の記憶装置に記憶された
中間層の複数ｃ個のクラスの単語に基づいて、上記平均
相互情報量が最大となるように、上記複数ｃ個のクラス
の単語を二分木の形式で１個のクラスになるまで分類
し、当該分類結果を上記ツリー図の上側層として第３の
記憶装置に記憶するステップと、上記第２の記憶装置に
記憶された中間層の複数ｃ個のクラスの各クラス毎に、
上記中間層の複数ｃ個のクラスの各クラス内の複数の単
語に基づいて、上記平均相互情報量が最大となるよう
に、上記複数の単語を二分木の形式で１個のクラスにな
るまでそれぞれ分類し、当該各クラス毎の複数の分類結
果を上記ツリー図の下側層として第４の記憶装置に記憶
するステップと、上記第４の記憶装置に記憶された上記
ツリー図の下側層を、上記第２の記憶装置に記憶された
上記中間層の複数ｃ個のクラスと連結する一方、上記第
３の記憶装置に記憶された上記ツリー図の上側層を、上
記第２の記憶装置に記憶された上記中間層の複数ｃ個の
クラスと連結することにより、上側層と中間層と下側層
とを備えた上記ツリー図を求めて単語分類結果として第
５の記憶装置に記憶するステップとを備えたことを特徴
とする。The word classification processing method according to claim 1 of the present invention comprises the steps of: examining, for text data including a plurality of words, the appearance frequencies of all v mutually different words, arranging the words in descending order of appearance frequency, and assigning them to v classes; storing, in a first storage device, the words of the (c+1) most frequent classes among the v classes, where (c+1) is less than v, as the words of the classes within one window; classifying the plurality of words into c classes in the form of a binary tree, based on the words of the classes within the one window stored in the first storage device, so as to maximize a predetermined average mutual information that represents the ratio of the probability that words of a first class and words of a different second class appear adjacently to the product of the occurrence probability of the first-class words and the occurrence probability of the second-class words, and storing the resulting c classes in a second storage device as the c classes of an intermediate layer of an overall tree diagram representing the word classification result; classifying the words of the c classes of the intermediate layer stored in the second storage device in the form of a binary tree, so as to maximize the average mutual information, until they form a single class, and storing the classification result in a third storage device as an upper layer of the tree diagram; classifying, for each of the c classes of the intermediate layer stored in the second storage device, the plurality of words within that class in the form of a binary tree, so as to maximize the average mutual information, until they form a single class, and storing the resulting classification results for the respective classes in a fourth storage device as a lower layer of the tree diagram; and connecting the lower layer of the tree diagram stored in the fourth storage device to the c classes of the intermediate layer stored in the second storage device, and connecting the upper layer of the tree diagram stored in the third storage device to the c classes of the intermediate layer stored in the second storage device, thereby obtaining the tree diagram comprising the upper layer, the intermediate layer, and the lower layer, and storing it in a fifth storage device as the word classification result.
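The first two steps of the claimed method (frequency counting, one class per word, and selection of the window of the (c+1) most frequent classes) can be sketched as follows. This is only an illustrative sketch: the function name `initial_window`, the alphabetical tie-breaking, and the use of Python lists in place of the claimed storage devices are assumptions, not part of the patent text.

```python
from collections import Counter

def initial_window(tokens, c):
    """Split singleton classes into a window of c+1 classes and the remainder.

    Each distinct word starts as its own class. Classes are ordered by
    descending frequency; ties are broken alphabetically (an assumption).
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    classes = [[w] for w in ordered]          # v singleton classes
    return classes[: c + 1], classes[c + 1:]  # window, classes outside it

tokens = "the cat sat on the mat the cat ran".split()
window, rest = initial_window(tokens, c=2)
```

In this toy input the window holds the three most frequent singleton classes and `rest` holds the classes that would later be inserted into the window one at a time.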

【０００９】また、請求項２記載の単語分類処理方法
は、請求項１記載の単語分類処理方法において、上記分
類された複数ｃ個のクラスを上記第２の記憶装置に記憶
するステップは、上記第１の記憶装置に記憶された１つ
のウィンドウよりも外側のクラスが存在し、又は上記１
つのウィンドウ内のクラスがｃ個ではないときは、現在
のウィンドウよりも外側にあり、最大の出現頻度を有す
るクラスの単語を上記ウィンドウ内に挿入した後、上記
二分木の形式の単語分類処理を実行することを特徴とす
る。The word classification processing method according to claim 2 is the method according to claim 1, wherein, in the step of storing the c classes in the second storage device, when a class exists outside the one window stored in the first storage device, or when the number of classes within the one window is not c, the words of the class that lies outside the current window and has the highest appearance frequency are inserted into the window, and the binary-tree word classification processing is then executed.

【００１０】本発明に係る請求項３記載の単語分類処理
装置は、複数の単語を含むテキストデータに対して、互
いに異なるすべての複数ｖ個の単語の出現頻度を調べ、
出現頻度の高い単語から順に並べて、複数ｖ個のクラス
に割り当てる第１の制御手段と、上記複数ｖ個のクラス
の単語のうち出現頻度が高いｖ個未満の（ｃ＋１）個の
クラスの単語を１つのウィンドウ内のクラスの単語とし
て第１の記憶装置に記憶する第２の制御手段と、上記第
１の記憶装置に記憶された１つのウィンドウ内のクラス
の単語に基づいて、第１のクラスの単語の出現確率と第
２のクラスの単語の出現確率との積に対する、互いに異
なる第１のクラスの単語と第２のクラスの単語とが隣接
して出現する確率の相対的な割合を表わす所定の平均相
互情報量が最大となるように、上記複数の単語を二分木
の形式で複数ｃ個のクラスに分類し、分類された複数ｃ
個のクラスを、単語分類結果を表わす全体のツリー図の
中間層の複数ｃ個のクラスとして第２の記憶装置に記憶
する第３の制御手段と、上記第２の記憶装置に記憶され
た中間層の複数ｃ個のクラスの単語に基づいて、上記平
均相互情報量が最大となるように、上記複数ｃ個のクラ
スの単語を二分木の形式で１個のクラスになるまで分類
し、当該分類結果を上記ツリー図の上側層として第３の
記憶装置に記憶する第４の制御手段と、上記第２の記憶
装置に記憶された中間層の複数ｃ個のクラスの各クラス
毎に、上記中間層の複数ｃ個のクラスの各クラス内の複
数の単語に基づいて、上記平均相互情報量が最大となる
ように、上記複数の単語を二分木の形式で１個のクラス
になるまでそれぞれ分類し、当該各クラス毎の複数の分
類結果を上記ツリー図の下側層として第４の記憶装置に
記憶する第５の制御手段と、上記第４の記憶装置に記憶
された上記ツリー図の下側層を、上記第２の記憶装置に
記憶された上記中間層の複数ｃ個のクラスと連結する一
方、上記第３の記憶装置に記憶された上記ツリー図の上
側層を、上記第２の記憶装置に記憶された上記中間層の
複数ｃ個のクラスと連結することにより、上側層と中間
層と下側層とを備えた上記ツリー図を求めて単語分類結
果として第５の記憶装置に記憶する第６の制御手段とを
備えたことを特徴とする。[0010] The word classification processing device according to claim 3 of the present invention comprises: first control means for examining, for text data including a plurality of words, the appearance frequencies of all v mutually different words, arranging the words in descending order of appearance frequency, and assigning them to v classes; second control means for storing, in a first storage device, the words of the (c+1) most frequent classes among the v classes, where (c+1) is less than v, as the words of the classes within one window; third control means for classifying the plurality of words into c classes in the form of a binary tree, based on the words of the classes within the one window stored in the first storage device, so as to maximize a predetermined average mutual information that represents the ratio of the probability that words of a first class and words of a different second class appear adjacently to the product of the occurrence probability of the first-class words and the occurrence probability of the second-class words, and for storing the resulting c classes in a second storage device as the c classes of an intermediate layer of an overall tree diagram representing the word classification result; fourth control means for classifying the words of the c classes of the intermediate layer stored in the second storage device in the form of a binary tree, so as to maximize the average mutual information, until they form a single class, and for storing the classification result in a third storage device as an upper layer of the tree diagram; fifth control means for classifying, for each of the c classes of the intermediate layer stored in the second storage device, the plurality of words within that class in the form of a binary tree, so as to maximize the average mutual information, until they form a single class, and for storing the resulting classification results for the respective classes in a fourth storage device as a lower layer of the tree diagram; and sixth control means for connecting the lower layer of the tree diagram stored in the fourth storage device to the c classes of the intermediate layer stored in the second storage device, and connecting the upper layer of the tree diagram stored in the third storage device to the c classes of the intermediate layer stored in the second storage device, thereby obtaining the tree diagram comprising the upper layer, the intermediate layer, and the lower layer, and for storing it in a fifth storage device as the word classification result.

【００１１】また、請求項４記載の単語分類処理装置
は、請求項３記載の単語分類処理装置において、上記第
３の制御手段は、上記第１の記憶装置に記憶された１つ
のウィンドウよりも外側のクラスが存在し、又は上記１
つのウィンドウ内のクラスがｃ個ではないときは、現在
のウィンドウよりも外側にあり、最大の出現頻度を有す
るクラスの単語を上記ウィンドウ内に挿入した後、上記
二分木の形式の単語分類処理を実行することを特徴とす
る。The word classification processing device according to claim 4 is the device according to claim 3, wherein, when a class exists outside the one window stored in the first storage device, or when the number of classes within the one window is not c, the third control means inserts into the window the words of the class that lies outside the current window and has the highest appearance frequency, and then executes the binary-tree word classification processing.

【００１２】本発明に係る請求項５記載の音声認識装置
は、入力される発声音声の音声信号に基づいて、請求項
３又は４記載の単語分類処理装置によって複数の単語が
複数のクラスに分類された単語分類結果を含む単語辞書
と、所定の隠れマルコフモデルとを参照して上記発声音
声を音声認識する音声認識手段を備えたことを特徴とす
る。The speech recognition device according to claim 5 of the present invention comprises speech recognition means for recognizing input uttered speech, based on its speech signal, by referring to a predetermined hidden Markov model and to a word dictionary containing the word classification result in which a plurality of words have been classified into a plurality of classes by the word classification processing device according to claim 3 or 4.

【００１３】[0013]

【発明の実施の形態】以下、図面を参照して本発明に係
る実施形態について説明する。図１は、本発明に係る第
１の実施形態の音声認識装置のブロック図である。この
音声認識装置は、テキストデータメモリ１０内のテキス
トデータ内の単語について出現頻度の比較的低い単語
を、同一の単語に隣接する割合の多い単語を同一のクラ
スに割り当てるという基準で分類した後、単語分類結果
を中間層、上側層、及び下側層の３つの階層に分類し、
テキストデータ内のすべての単語を対象とするグローバ
ルな（全体的な）コスト関数である所定の平均相互情報
量を用いて、中間層、上側層、及び下側層の順序で階層
別に単語の分類を実行して、単語辞書メモリ１１内に単
語辞書として格納する単語分類処理部２０を備えたこと
を特徴とする。Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to a first embodiment of the present invention. This speech recognition device is characterized by comprising a word classification processing unit 20. For the words in the text data stored in the text data memory 10, the unit first classifies words of relatively low appearance frequency on the criterion that words frequently adjacent to the same word are assigned to the same class. It then divides the word classification result into three layers (an intermediate layer, an upper layer, and a lower layer) and classifies the words layer by layer, in the order intermediate layer, upper layer, lower layer, using a predetermined average mutual information, which is a global cost function covering all words in the text data. The result is stored as a word dictionary in the word dictionary memory 11.

【００１４】＜単語分類処理方法＞まず、本発明に係る
本実施形態の単語の分類（クラスタリング）方法につい
て、第１の従来例の方法と対比させて説明する。本発明
の方法は、従来技術文献に開示された第１の従来例の方
法を修正しかつ大幅に発展させて改善させた方法であっ
て、第１の従来例の式と、本実施形態の式との相違につ
いて説明し、次いで、単語の分類処理方法について説明
する。ここで、第１の従来例と、本実施形態とを比較す
るために、第１の従来例で用いた表記法と同一の表記法
を用いることにする。<Word Classification Processing Method> First, the word classification (clustering) method of the present embodiment is described in comparison with the method of the first conventional example. The method of the present invention modifies, substantially extends, and improves on the method of the first conventional example disclosed in the prior art document. The differences between the equations of the first conventional example and those of the present embodiment are described first, followed by the word classification processing method. To allow comparison between the first conventional example and the present embodiment, the same notation as in the first conventional example is used.

【００１５】まず、相互情報量を用いたクラスタリング
の方法について述べる。ここで、単語数Ｔのテキスト、
語数Ｖの語彙、それに語彙の分割関数πとが存在すると
仮定し、ここで、語彙の分割関数πは語彙Ｖから語彙の
中の単語クラスセットＣへの分割写像（マッピング）を
表わす写像関数である。第１の従来例においては、複数
の単語からなるテキストデータを生成するバイグラムの
クラスモデルの尤度Ｌ（π）は次式によって得られる。First, the clustering method using mutual information is described. Assume a text of T words, a vocabulary of V words, and a vocabulary partition function π, where π is a mapping function that maps the vocabulary V onto the set C of word classes in the vocabulary. In the first conventional example, the likelihood L(π) of a bigram class model that generates text data consisting of a plurality of words is given by the following equation.

【００１６】[0016]

【数１】 $L(\pi) = -H + I$

【００１７】ここで、Ｈはモノグラムの単語分布のエン
トロピーであり、Ｉはテキストデータ内の隣接する２つ
のクラスＣ₁，Ｃ₂に関する平均的な相互情報量（Averag
e Mutual Information；以下、平均相互情報量とし、Ａ
ＭＩと表記する。）であり、次式で計算することができ
る。Here, H is the entropy of the monogram (unigram) word distribution, and I is the average mutual information (hereinafter abbreviated AMI) of two adjacent classes C₁ and C₂ in the text data, which can be calculated by the following equation.

【００１８】[0018]

【数２】 (Equation 2)
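The image for Equation 2 is not reproduced in this text. Judging from the definitions that follow it and from the form of Equation 4, a form consistent with the description would be the following; this is a reconstruction under that assumption, not the original typeset equation, which may instead be written with the conditional probability Pr(C₁|C₂).

```latex
I = \sum_{C_1,\,C_2} \Pr(C_1, C_2)\,
    \log \frac{\Pr(C_1, C_2)}{\Pr(C_1)\,\Pr(C_2)}
```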

【００１９】ここで、Ｐｒ（Ｃ₁）は第１のクラスＣ₁の
単語の出現確率であり、Ｐｒ（Ｃ₂）は第２のクラスＣ₂
の単語の出現確率であり、Ｐｒ（Ｃ₁｜Ｃ₂）は、第２の
クラスＣ₂の単語は出現した後に、第１のクラスＣ₁の単
語が出現する条件付き確率であり、Ｐｒ（Ｃ₁，Ｃ₂）は
第１のクラスＣ₁の単語と第２のクラスＣ₂の単語が隣接
して出現する確率である。従って、上記数２で表される
ＡＭＩは、互いに異なる第１のクラスＣ₁の単語と第２
のクラスＣ₂の単語とが隣接して出現する確率を、上記
第１のクラスＣ₁の単語の出現確率と第２のクラスＣ₂の
単語の出現確率との積で割った相対的な割合を表わす。Here, Pr(C₁) is the occurrence probability of words of the first class C₁, Pr(C₂) is the occurrence probability of words of the second class C₂, Pr(C₁|C₂) is the conditional probability that a word of the first class C₁ appears after a word of the second class C₂ has appeared, and Pr(C₁, C₂) is the probability that a word of the first class C₁ and a word of the second class C₂ appear adjacently. Accordingly, the AMI of Equation 2 represents the relative ratio obtained by dividing the probability that a word of the first class C₁ and a word of a different second class C₂ appear adjacently by the product of the occurrence probability of words of the first class C₁ and the occurrence probability of words of the second class C₂.

【００２０】エントロピーＨは写像関数πに依存しない
値であることから、ＡＭＩを最大にする写像関数は同時
にテキストの尤度Ｌ（π）も最大にする。従って、ＡＭ
Ｉを単語のクラス構成における目的関数として使用する
ことができる。Since the entropy H does not depend on the mapping function π, the mapping function that maximizes the AMI also maximizes the likelihood L(π) of the text. The AMI can therefore be used as the objective function for constructing word classes.
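As an illustration of the objective function, the AMI of adjacent classes can be computed directly from a token sequence and a word-to-class mapping. The sketch below assumes base-2 logarithms and estimates all probabilities from bigram counts; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def average_mutual_information(tokens, word_to_class):
    """AMI of adjacent classes in bits (log base 2 is an assumption)."""
    classes = [word_to_class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))  # adjacent class pairs
    total = sum(bigrams.values())
    left, right = Counter(), Counter()            # marginal counts
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n
    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p = n / total
        ami += p * math.log2(p / ((left[c1] / total) * (right[c2] / total)))
    return ami

tokens = "a b a b a b".split()
ami_two = average_mutual_information(tokens, {"a": 0, "b": 1})
ami_one = average_mutual_information(tokens, {"a": 0, "b": 0})
```

With every word placed in a single class the AMI is zero, whereas separating the two alternating words gives a positive value, which is why maximizing the AMI favors informative class assignments.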

【００２１】第１の従来例の相互情報量を用いたクラス
タリング方法では、下側層から上側層へのボトムアップ
のマージ手順を用いている。初期の段階では、各単語を
それぞれ１つのクラスに割り当てる。次いで、すべての
クラスのペア（対）の中で最小のＡＭＩの減少量を与え
る２つのクラスのペアを探索し、その２つのクラスのペ
アをマージし、マージ後のクラス数が予め決められた数
ｃになるまで上記マージの処理を繰り返す。この第１の
従来例の基本的な方法において、例えばコンピュータに
よって実行される演算時間のコンプレキシティー（又は
演算時間のコスト）は、当該処理を以下に示すように直
接的に実行したとき、Ｖ⁵（語彙の語数Ｖの５乗）に比
例するオーダーであり、これをＯ（Ｖ⁵）と表記する。
ここで、演算時間のコンプレキシティーは、演算時間が
どれぐらいかかるかを示す指標である。The clustering method of the first conventional example, which uses mutual information, employs a bottom-up merging procedure from the lower layer to the upper layer. Initially, each word is assigned to its own class. Then, among all pairs of classes, the pair whose merging gives the smallest reduction in AMI is found and merged, and this merging is repeated until the number of classes falls to a predetermined number c. In this basic method of the first conventional example, the computation-time complexity (or computation-time cost) when executed, for example, by a computer is of order proportional to V⁵ (the fifth power of the vocabulary size V), written O(V⁵), if the processing is carried out directly as described below. Here, computation-time complexity is an index of how much computation time is required.
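The direct bottom-up procedure described above can be sketched as follows, recomputing the full AMI for every trial merge (the costly variant). The sketch operates on a square matrix of class bigram counts; the function names are hypothetical, and for simplicity it reindexes by deleting the merged class and shifting the later ones, which is an assumption of this sketch rather than the indexing convention of the patent.

```python
import math

def ami(counts):
    """AMI in bits of a square class-bigram count matrix."""
    total = sum(map(sum, counts))
    rows = [sum(r) for r in counts]
    cols = [sum(col) for col in zip(*counts)]
    value = 0.0
    for l, row in enumerate(counts):
        for m, n in enumerate(row):
            if n:
                value += (n / total) * math.log2(n * total / (rows[l] * cols[m]))
    return value

def merged(counts, i, j):
    """Return a new matrix with class j folded into class i (i < j)."""
    keep = [x for x in range(len(counts)) if x != j]
    def cell(a, b):
        rows = (i, j) if a == i else (a,)
        cols = (i, j) if b == i else (b,)
        return sum(counts[r][c] for r in rows for c in cols)
    return [[cell(a, b) for b in keep] for a in keep]

def greedy_merge(counts, c):
    """Merge the pair with the smallest AMI reduction until c classes remain."""
    members = [[x] for x in range(len(counts))]
    while len(counts) > c:
        base = ami(counts)
        best = None
        for i in range(len(counts)):
            for j in range(i + 1, len(counts)):
                loss = base - ami(merged(counts, i, j))  # one trial merge
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        counts = merged(counts, i, j)
        members[i] += members[j]
        del members[j]
    return members

# Words 0 and 1 share the same contexts, as do words 2 and 3.
toy = [[0, 0, 5, 5], [0, 0, 5, 5], [5, 5, 0, 0], [5, 5, 0, 0]]
result = greedy_merge(toy, 2)
```

In the toy matrix, merging words with identical contexts loses no AMI, so those pairs are merged first.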

【００２２】＜ステップＡ１＞マージ処理の回数は合計
で（Ｖ−ｃ）回であり、このときの演算時間のコンプレ
キシティーは語彙の語数Ｖに比例するオーダーＯ（Ｖ）
である。＜ステップＡ２＞ｎ回のマージ処理の後には、（Ｖ−
ｎ）個のクラスが残り、次のマージ処理の段階では、組
み合わせ数_V-nＣ₂（すなわち、Ｖ−ｎ個のクラスから２
つのクラスをとるときの組み合わせ数）個のマージ処理
のテスト又はトライアル（trial）（以下、トライアル
といい、ここで、複数回のマージ処理を実行するが、実
際にマージして単語分類結果に反映させるのは、このう
ちの１つであるので、本実施形態ではこのように呼
ぶ。）を実行して探索する必要がある。そのうちの１つ
のみが後のマージ処理で有効化される。従って、このと
きの演算時間のコンプレキシティーは語彙の語数の２乗
Ｖ²に比例するオーダーＯ（Ｖ²）である。＜ステップＡ３＞第ｎ段階での１つのマージ処理のトラ
イアルには、上記数２を用いてＡＭＩを演算するための
（Ｖ−ｎ）²個の項又はクラスに関する加算演算を含
む。従って、このときの演算時間のコンプレキシティー
は語彙の語数の２乗Ｖ²に比例するオーダーＯ（Ｖ²）で
ある。<Step A1> The total number of merges is (V−c), so the computation-time complexity of this step is O(V), proportional to the vocabulary size V.
<Step A2> After n merges, (V−n) classes remain, and at the next merging stage it is necessary to search by executing _{V−n}C₂ merge trials (that is, the number of ways of choosing two classes from (V−n) classes). They are called trials in this embodiment because, although merging is attempted many times, only one of the attempts is actually merged and reflected in the word classification result. Only one of the trials is made effective in the subsequent merge. The computation-time complexity of this step is therefore O(V²), proportional to the square of the vocabulary size.
<Step A3> One merge trial at the n-th stage involves additions over (V−n)² terms or classes in order to compute the AMI using Equation 2. The computation-time complexity of this step is therefore O(V²), proportional to the square of the vocabulary size.

【００２３】従って、例えばコンピュータによって実行
される全体の演算時間のコンプレキシティーは、語数の
５乗Ｖ⁵に比例するオーダーＯ（Ｖ⁵）となる。しかしな
がら、後述するように、冗長的な計算を除くことによっ
て、演算時間のコンプレキシティーを、語彙の語数の３
乗Ｖ³に比例するオーダーＯ（Ｖ³）に減らすことも可能
である。つまり、次のような本発明に係る方法によれ
ば、上記ステップＡ３の部分を一定時間で実行すること
ができる。The overall computation-time complexity when executed, for example, by a computer is therefore O(V⁵), proportional to the fifth power of the vocabulary size. As described below, however, eliminating redundant calculations makes it possible to reduce the computation-time complexity to O(V³), proportional to the cube of the vocabulary size. Specifically, with the following method according to the present invention, the part corresponding to step A3 can be executed in constant time.

【００２４】＜ステップＢ１＞上記数２は、前の段階の
マージ処理で値が変更されたクラスのみについて計算す
る。従って、演算時間のコンプレキシティーは、第１の
従来例におけるオーダーＯ（Ｖ²）から、オーダーＯ
（Ｖ）となる。＜ステップＢ２＞前の段階のマージ処理におけるすべて
のトライアルの結果を格納する。従って、演算時間のコ
ンプレキシティーは、第１の従来例におけるオーダーＯ
（Ｖ）から、語彙の語数Ｖに依存しない一定のオーダー
Ｏ（１）となる。<Step B1> Equation 2 is evaluated only for the classes whose values were changed by the merge at the previous stage. The computation-time complexity thereby falls from O(V²) in the first conventional example to O(V).
<Step B2> The results of all trials in the merge at the previous stage are stored. The computation-time complexity thereby falls from O(V) in the first conventional example to O(1), a constant order independent of the vocabulary size V.
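The overall orders quoted in the text follow from multiplying the per-step orders. The summary below is a restatement under the assumption that steps B1 and B2 together bring step A3 down to constant time, as the description states.

```latex
\underbrace{O(V)}_{\text{A1}} \cdot
\underbrace{O(V^{2})}_{\text{A2}} \cdot
\underbrace{O(V^{2})}_{\text{A3}} = O(V^{5}),
\qquad
\underbrace{O(V)}_{\text{A1}} \cdot
\underbrace{O(V^{2})}_{\text{A2}} \cdot
\underbrace{O(1)}_{\text{A3 with B1, B2}} = O(V^{3}).
```

As a purely illustrative figure, for V = 1000 the two orders correspond to roughly 10¹⁵ and 10⁹ elementary operations, ignoring constant factors.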

【００２５】例えば、Ｖ個のクラス数の語彙から始めて
既に（Ｖ−ｋ）回のマージ処理を実行して、ｋ個のクラ
スＣ_k（１），Ｃ_k（２），…，Ｃ_k（ｋ）が残っている
と仮定する。この段階でのＡＭＩＩｋは次式で計算され
る。For example, suppose that, starting from a vocabulary of V classes, (V−k) merges have already been executed, so that k classes C_k(1), C_k(2), …, C_k(k) remain. The AMI at this stage, I_k, is calculated by the following equations.

【００２６】[0026]

【数３】 (Equation 3)

【数４】 $q_k(l, m) = p_k(l, m)\,\log\dfrac{p_k(l, m)}{pl_k(l)\,pr_k(m)}$
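The image for Equation 3 is not reproduced in this text. The description states that q_k is summed over the entire (k × k) class bigram plane table, so Equation 3 is presumably the following sum of the terms defined in Equation 4; this is a reconstruction, not the original typeset equation.

```latex
I_k = \sum_{l=1}^{k} \sum_{m=1}^{k} q_k(l, m)
```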

【００２７】ここで、ｐ_k（ｌ，ｍ）は、クラスＣ
_k（ｌ）における単語の次に、クラスＣ_k（ｍ）における
単語が続く確率であり、次式のように表される。なお、
本明細書及び図面において、表示を明確にするために、
ｌ（小文字のエル）としてｌをも用い、ｌ＝ｌとする。Here, p_k(l, m) is the probability that a word in class C_k(l) is followed by a word in class C_k(m), and is expressed by the following equation. In this specification and the drawings, for clarity of presentation, an alternative glyph is also used for l (lowercase el); the two denote the same symbol.

【００２８】[0028]

【数５】 $p_k(l, m) = \Pr\bigl(C_k(l),\, C_k(m)\bigr)$

【数６】 (Equation 6)
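The image for Equation 6 is not reproduced in this text. Since Equation 4 uses pl_k(l) and pr_k(m) without defining them elsewhere, Equation 6 presumably defines them as the left and right marginals of p_k, as in the first conventional example. The following is a reconstruction under that assumption.

```latex
pl_k(l) = \sum_{m} p_k(l, m),
\qquad
pr_k(m) = \sum_{l} p_k(l, m)
```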

【００２９】上記数３においては、ｑ_kは（ｋ×ｋ）ク
ラスのバイグラム平面テーブルの全体にわたって加算さ
れ、ここで、（ｌ，ｍ）のセルはｑ_k（ｌ，ｍ）で表わ
す。いま、クラスＣ_k（ｉ）とクラスＣ_k（ｊ）とのマー
ジ処理のトライアルを探索したとき、当該マージ処理に
よるＡＭＩの減少量を、Ｌ_k（ｉ，ｊ）≡Ｉ_k−Ｉ
_k（ｉ，ｊ）とし、ここで、Ｉ_k（ｉ，ｊ）は当該のマー
ジ処理後のＡＭＩである。In Equation 3, q_k is summed over the entire (k × k) class bigram plane table, in which cell (l, m) holds q_k(l, m). When a trial merge of class C_k(i) and class C_k(j) is examined, the reduction in AMI caused by that merge is defined as L_k(i, j) ≡ I_k − I_k(i, j), where I_k(i, j) is the AMI after the merge.

【００３０】図３は、本発明に係る単語分類処理におけ
る加算領域及び加減算処理を示すクラスバイグラム平面
テーブルの図である。ここで、図３及び、以下に示す図
４と図５は、２つのクラスのバイグラムの平面を示す。
図３に示すように、上記数３における加算領域Ｐ０は、
図３の部分領域Ｐ１、Ｐ２及びＰ３の和から部分領域Ｐ
４を減じた部分として表わすことができる。この４つの
部分Ｐ１，Ｐ２，Ｐ３，Ｐ４のうち、部分領域Ｐ１の加
算値はＣ_k（ｉ）とＣ_k（ｊ）とのマージ処理によって変
化することはない。従って、ＡＭＩの減少量Ｌ_k（ｉ，
ｊ）を算出する場合、加算領域Ｐ０を、２次元の領域
（すなわち、正方形の領域）から１次元の領域（すなわ
ち、複数のライン又は線）に減らすことが可能である。
よって、上記ステップＡ３における演算時間のコンプレ
キシティーは、オーダーＯ（Ｖ²）からオーダーＯ
（Ｖ）に減少させることができる。クラスＣ_k（ｉ）と
クラスＣ_k（ｊ）とのマージ処理によって生成されるク
ラスを表わす表記法として、Ｃ_k（ｉ＋ｊ）を使用する
と、ＡＭＩの減少量Ｌ_k（ｉ，ｊ）は次式によって与え
られる。FIG. 3 shows a class bigram plane table illustrating the summation region and the addition and subtraction processing in the word classification processing according to the present invention. FIG. 3, and FIGS. 4 and 5 described below, show the plane of bigrams of two classes. As shown in FIG. 3, the summation region P0 of Equation 3 can be expressed as the sum of partial regions P1, P2, and P3 minus partial region P4. Of these four parts, the sum over partial region P1 is unchanged by the merging of C_k(i) and C_k(j). Consequently, when calculating the AMI reduction L_k(i, j), the summation region P0 can be reduced from a two-dimensional region (a square) to a one-dimensional region (a set of lines). The computation-time complexity of step A3 can thus be reduced from O(V²) to O(V). Using C_k(i+j) to denote the class produced by merging class C_k(i) and class C_k(j), the AMI reduction L_k(i, j) is given by the following equation.

【００３１】[0031]

【数７】ここで、(Equation 7) where

【数８】 (Equation 8)
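The images for Equations 7 and 8 are not reproduced in this text. The sketch below assumes the form used in the first conventional example, namely L_k(i, j) = s_k(i) + s_k(j) − q_k(i, j) − q_k(j, i) − (Σ_{l≠i,j} q_k(l, i+j) + Σ_{m≠i,j} q_k(i+j, m) + q_k(i+j, i+j)), with s_k(i) = Σ_l q_k(l, i) + Σ_m q_k(i, m) − q_k(i, i). It checks numerically that summing only the affected rows and columns gives the same AMI reduction as recomputing the whole table. All function names are hypothetical, and for brevity the marginals are recomputed on every call, whereas an implementation aiming at the stated complexity would cache them.

```python
import math

def q(p, pl, pr):
    """One AMI term p * log2(p / (pl * pr)), taken as 0 when p is 0."""
    return p * math.log2(p / (pl * pr)) if p > 0 else 0.0

def marginals(P):
    return [sum(row) for row in P], [sum(col) for col in zip(*P)]

def ami(P):
    pl, pr = marginals(P)
    k = len(P)
    return sum(q(P[l][m], pl[l], pr[m]) for l in range(k) for m in range(k))

def merge(P, i, j):
    """Fold class j into class i and drop index j."""
    keep = [x for x in range(len(P)) if x != j]
    def cell(a, b):
        rows = (i, j) if a == i else (a,)
        cols = (i, j) if b == i else (b,)
        return sum(P[r][c] for r in rows for c in cols)
    return [[cell(a, b) for b in keep] for a in keep]

def loss_fast(P, i, j):
    """AMI reduction for merging i and j, summing only affected cells."""
    k = len(P)
    pl, pr = marginals(P)
    def s(a):  # row a plus column a, diagonal cell counted once
        return (sum(q(P[l][a], pl[l], pr[a]) for l in range(k))
                + sum(q(P[a][m], pl[a], pr[m]) for m in range(k))
                - q(P[a][a], pl[a], pr[a]))
    before = s(i) + s(j) - q(P[i][j], pl[i], pr[j]) - q(P[j][i], pl[j], pr[i])
    pl_ij, pr_ij = pl[i] + pl[j], pr[i] + pr[j]
    others = [x for x in range(k) if x != i and x != j]
    after = (sum(q(P[l][i] + P[l][j], pl[l], pr_ij) for l in others)
             + sum(q(P[i][m] + P[j][m], pl_ij, pr[m]) for m in others)
             + q(P[i][i] + P[i][j] + P[j][i] + P[j][j], pl_ij, pr_ij))
    return before - after

counts = [[4, 1, 2, 1], [1, 3, 1, 2], [2, 1, 5, 1], [1, 2, 1, 4]]
total = sum(map(sum, counts))
P = [[n / total for n in row] for row in counts]
```

The agreement between `loss_fast` and the direct difference `ami(P) - ami(merge(P, i, j))` reflects the observation in the text that the region unaffected by the merge contributes nothing to L_k(i, j).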

【００３２】すべてのクラスのペアのＡＭＩの減少量Ｌ
_kを算出したら、当該ＡＭＩの減少量Ｌ_kが最小となるよ
うなペア、例えばクラスＣ_k（ｉ）とクラスＣ_k（ｊ）
（但しｉ＜ｊ）とを選択し、次いで、これらのクラスの
ペアをマージさせたときの新しいクラスの名前をＣ_k-l
（ｉ）と命名し、さらに、（ｋ−１）個のクラスの新た
なセットによる次のマージ処理を続けて実行する。クラ
スＣ_k（ｉ）とクラスＣ_k（ｊ）を除くすべてのクラスに
ついてマージ処理後に同じ方法で索引番号（インデック
ス）を付与する。すなわち、クラスＣ_k（ｍ）をクラス
Ｃ_k-l（ｍ）とし、ただし、ｍ≠ｉ，ｊである。ここ
で、ｊ≠ｋであればクラスＣ_k（ｋ）をクラスＣ
_k-l（ｊ）とし、ｊ＝ｋであればマージ処理後にＣ
_k（ｋ）を削除する。Once the AMI reduction L_k has been calculated for all pairs of classes, the pair that minimizes L_k, say class C_k(i) and class C_k(j) (where i < j), is selected; the new class obtained by merging this pair is named C_{k−1}(i), and the next merge is carried out on the new set of (k−1) classes. All classes other than C_k(i) and C_k(j) keep the same index numbers after the merge; that is, class C_k(m) becomes class C_{k−1}(m) for m ≠ i, j. If j ≠ k, class C_k(k) becomes class C_{k−1}(j); if j = k, C_k(k) is deleted after the merge.
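The index bookkeeping described above can be sketched as follows. The function name and the list-of-lists representation are assumptions, and the indices are 0-based here, whereas the text numbers classes from 1.

```python
def merge_and_reindex(classes, i, j):
    """Merge classes[j] into classes[i] (i < j), then reuse slot j.

    The merged class keeps index i. If j is not the last index, the last
    class is moved into slot j; otherwise the last slot is simply dropped.
    All other classes keep their indices.
    """
    assert 0 <= i < j < len(classes)
    classes = [list(c) for c in classes]  # work on a copy
    classes[i] += classes[j]
    last = classes.pop()                  # remove the final slot
    if j < len(classes):                  # j was not the last index
        classes[j] = last
    return classes

start = [["a"], ["b"], ["c"], ["d"]]
case_inner = merge_and_reindex(start, 0, 1)  # j is not the last index
case_last = merge_and_reindex(start, 1, 3)   # j is the last index
```

Moving only the last class keeps every other index stable, which is what allows previously stored values of L_k to be reused at the next stage.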

【００３３】前の段階のマージ処理によるすべてのＡＭ
Ｉの減少量Ｌ_kを記憶装置に格納することによって、別
の最適化処理を実行することができる。ここで、クラス
のペア（Ｃ_k（ｉ），Ｃ_k（ｊ））がマージ処理の対象と
して選択され、すなわち、すべてのペア（ｌ，ｍ）に対
して、Ｌ_k（ｉ，ｊ）≦Ｌ_k（ｌ，ｍ）であると仮定す
る。次のマージ処理の段階では、すべての（ｌ，ｍ）の
ペアに対して、ＡＭＩの減少量Ｌ_k-l ^(i,j)（ｌ，ｍ）を
計算する必要がある。ここで、上付き文字（ｉ，ｊ）
は、クラスのペア（Ｃ_k（ｉ），Ｃ_k（ｊ））が前のマー
ジ処理の段階でマージされたことを意味している。ここ
で、ＡＭＩの減少量Ｌ_k-l ^(i,j)（ｌ，ｍ）とＬ_k（ｌ，
ｍ）の違いに注意する必要がある。すなわち、Ｌ_k-l
^(i,j)（ｌ，ｍ）はクラスｉとクラスｊとのマージ処理
の後にクラスｌとクラスｍとをマージしたことによるＡ
ＭＩの減少量であり、Ｌ_k（ｌ，ｍ）はクラスｉとクラ
スｊとのマージ処理なしにクラスｌとクラスｍとをマー
ジしたことによるＡＭＩ減少量である。従って、ＡＭＩ
の減少量Ｌ_k-l ^(i,j)（ｌ，ｍ）と、ＡＭＩの減少量Ｌ_k
（ｌ，ｍ）との差分は、クラスのペア（Ｃ_k（ｉ），Ｃ_k
（ｊ））のマージ処理によって影響を受ける項又はクラ
スのみから発生する。A further optimization is possible by storing in a storage device all of the AMI reductions L_k from the merge at the previous stage. Suppose that the class pair (C_k(i), C_k(j)) was selected for merging, that is, L_k(i, j) ≤ L_k(l, m) for all pairs (l, m). At the next merging stage, the AMI reduction L_{k−1}^{(i,j)}(l, m) must be calculated for all pairs (l, m). The superscript (i, j) indicates that the class pair (C_k(i), C_k(j)) was merged at the previous stage. Note the difference between L_{k−1}^{(i,j)}(l, m) and L_k(l, m): L_{k−1}^{(i,j)}(l, m) is the AMI reduction caused by merging class l and class m after class i and class j have been merged, whereas L_k(l, m) is the AMI reduction caused by merging class l and class m without class i and class j having been merged. The difference between L_{k−1}^{(i,j)}(l, m) and L_k(l, m) therefore arises only from the terms or classes affected by the merging of the class pair (C_k(i), C_k(j)).

【００３４】上記の処理を、図４を参照して説明する
と、ＡＭＩの減少量Ｌ_k-l ^(i,j)（ｌ，ｍ）と、ＡＭＩの
減少量Ｌ_k（ｌ，ｍ）に対するクラスバイグラム平面テ
ーブルの加算領域は図４の（ｂ）及び（ａ）のようにな
る。領域｛（ｘ，ｙ）｜ｘ≠ｉ，ｊ，ｌ，ｍ、及びｙ≠
ｉ，ｊ，ｌ，ｍ｝の加算値は、クラスｉとクラスｊとの
マージ処理によって、あるいは、クラスｌとクラスｍと
のマージ処理によって変化することはないため、それら
の領域については図示していない。ここで、ｍｈは、図
４から明らかなように、クラスｍをクラスｌにマージし
たときに領域が抜けてしまうクラスである。さらに、詳
細後述するように、｛Ｌ_k-l ^(i,j)（ｌ，ｍ）−Ｌ
_k（ｌ，ｍ）｝を計算するときに、図４の図中のほとん
どの領域は互いに相殺されて数カ所のポイントの領域の
みが残る。こうして、上記ステップＡ３における演算時
間のコンプレキシティーを定数にまで減少することがで
きる。The above processing is explained with reference to FIG. 4. The summation regions of the class bigram plane table for the AMI reduction L_{k-1}^{(i,j)}(l, m) and for the AMI reduction L_k(l, m) are as shown in FIG. 4(b) and FIG. 4(a), respectively. The sums over the region {(x, y) | x ≠ i, j, l, m and y ≠ i, j, l, m} are not changed by the merging of class i and class j or by the merging of class l and class m, so that region is not shown. Here, as is clear from FIG. 4, mh is the class whose region drops out when class m is merged into class l. Furthermore, as described in detail below, when {L_{k-1}^{(i,j)}(l, m) - L_k(l, m)} is calculated, most of the regions in FIG. 4 cancel one another and only a few point regions remain. In this way, the time complexity of the calculation in step A3 can be reduced to a constant.

【００３５】[0035]

【数９】(Equation 9)
L_k(l, m) = I_k - I_k(l, m) and L_{k-1}^{(i,j)}(l, m) = I_{k-1}^{(i,j)} - I_{k-1}^{(i,j)}(l, m), and therefore
L_{k-1}^{(i,j)}(l, m) - L_k(l, m) = -(I_{k-1}^{(i,j)}(l, m) - I_k(l, m)) + (I_{k-1}^{(i,j)} - I_k)

【００３６】ＡＭＩＩ_k-l ^(i,j)（ｌ，ｍ）と、ＡＭＩＩ
_kの加算領域のうちの幾つかの部分は、ＡＭＩＩ_k-l
^(i,j)の一部、あるいはＡＭＩＩ_k（ｌ，ｍ）の一部とと
もに相殺される。ここで、Ｉｈ_k-l ^(i,j)（ｌ，ｍ）、Ｉ
ｈ_k（ｌ，ｍ）、Ｉｈ_k-l ^(i,j)、Ｉｈ_kはそれぞれ、相殺
可能な共通のクラスをすべて相殺した後のＡＭＩＩ_k-l
^(i,j)（ｌ，ｍ）、Ｉ_k（ｌ，ｍ）、Ｉ_k-l ^(i,j)、Ｉ_kで
あることを表わす。このとき、次のような関係式が与え
られる。Some parts of the summation regions of the AMI I_{k-1}^{(i,j)}(l, m) and of the AMI I_k cancel against parts of the AMI I_{k-1}^{(i,j)} or parts of the AMI I_k(l, m). Let Ih_{k-1}^{(i,j)}(l, m), Ih_k(l, m), Ih_{k-1}^{(i,j)}, and Ih_k denote, respectively, I_{k-1}^{(i,j)}(l, m), I_k(l, m), I_{k-1}^{(i,j)}, and I_k after all common terms that can be cancelled have been cancelled. The following relation then holds.

【００３７】[0037]

【数１０】(Equation 10)
L_{k-1}^{(i,j)}(l, m) - L_k(l, m) = -(Ih_{k-1}^{(i,j)}(l, m) - Ih_k(l, m)) + (Ih_{k-1}^{(i,j)} - Ih_k)
where

【数１１】(Equation 11)
Ih_{k-1}^{(i,j)}(l, m) = q_{k-1}(l+m, i) + q_{k-1}(i, l+m)

【数１２】(Equation 12)
Ih_k(l, m) = q_k(l+m, i) + q_k(i, l+m) + q_k(l+m, j) + q_k(j, l+m)

【数１３】(Equation 13)
Ih_{k-1}^{(i,j)} = q_{k-1}(i, l) + q_{k-1}(i, m) + q_{k-1}(l, i) + q_{k-1}(m, i)

【数１４】(Equation 14)
Ih_k = q_k(i, l) + q_k(i, m) + q_k(j, l) + q_k(j, m) + q_k(l, i) + q_k(l, j) + q_k(m, i) + q_k(m, j)
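The following Python sketch shows how Equations 10 to 14 could be evaluated with a fixed number of q lookups, independent of the vocabulary size. It is an illustration only: the function name `loss_delta`, the callables `q_k` and `q_k1` (standing for q_k and q_{k-1} after the merge of i and j), and the use of the tuple `(l, m)` to stand for the merged class l+m are all assumptions of this sketch, not part of the specification.

```python
def loss_delta(q_k, q_k1, i, j, l, m):
    """Return L_{k-1}^{(i,j)}(l, m) - L_k(l, m) following Equations 10-14.

    q_k(x, y)  : AMI term between classes x and y before merging (i, j)
    q_k1(x, y) : the same term after merging (i, j) into class i
    Both callables are hypothetical; the merged class l+m is passed
    to them as the tuple (l, m).
    """
    lm = (l, m)
    # Equation 11
    ih_k1_lm = q_k1(lm, i) + q_k1(i, lm)
    # Equation 12
    ih_k_lm = q_k(lm, i) + q_k(i, lm) + q_k(lm, j) + q_k(j, lm)
    # Equation 13
    ih_k1 = q_k1(i, l) + q_k1(i, m) + q_k1(l, i) + q_k1(m, i)
    # Equation 14: all eight terms between {i, j} and {l, m}
    ih_k = sum(q_k(a, b) + q_k(b, a) for a in (i, j) for b in (l, m))
    # Equation 10
    return -(ih_k1_lm - ih_k_lm) + (ih_k1 - ih_k)


# Toy check with constant stand-ins for the q terms.
delta = loss_delta(lambda x, y: 1.0, lambda x, y: 1.0, 1, 2, 3, 4)
```

Because only terms touching i, j, l, or m are read, a stored value of L_k(l, m) can be updated to L_{k-1}^{(i,j)}(l, m) by adding this difference instead of recomputing the full sum.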

【００３８】上記数１０におけるＩｈの加算領域を図５
に示す。第１の従来例においては、上記数１０の右辺第
２項を無視して第１項のみを使用して、ＡＭＩの減少量
Ｌ_k- _l ^(i,j)（ｌ，ｍ）−Ｌ_k（ｌ，ｍ）を計算している
ようである。なお、上記数１０の第１項に対応する従来
技術文献における方程式（１７）の第３式において、符
号は正負逆である。しかしながら、上記数１０の第２項
は、その第１項と同じ重み係数を有するので、本発明者
は、本発明の当該モデルを完全なものとするために、上
記数１０を用いる。FIG. 5 shows the summation regions of Ih in Equation 10 above. In the first conventional example, the second term on the right-hand side of Equation 10 appears to be ignored, and only the first term is used to calculate L_{k-1}^{(i,j)}(l, m) - L_k(l, m). Note that in the third expression of equation (17) in the prior-art document, which corresponds to the first term of Equation 10, the sign is reversed. However, since the second term of Equation 10 has the same weight as the first term, the present inventors use Equation 10 in full in order to make the model of the present invention complete.

【００３９】演算時間のコンプレキシティーのオーダー
Ｏ（Ｖ³）を有する方法を使用する場合でも、語彙数が
１０⁴又はそれ以上のオーダーのように大きいときに
は、実際に計算することができない。何れにしても、上
記ステップＡ１において、オーダーＯ（Ｖ）の演算時間
が必要であるため、修正できるのは上記ステップＡ２し
かないと考えられる。上記ステップＡ２においては、可
能なクラスペアのマージのすべてについて検討すること
もできるが、実際には探索するクラスペアの範囲を限定
することは可能である。このことに関しては、第１の従
来例においては、以下のような方法を提案しており、本
発明の方法もこれを採用している。まず、互いに重複し
ない単語数Ｖを含むテキストデータ内の単語に基づい
て、Ｖ個の単体のクラスを作り、これを頻度の高い順に
配列して、「マージ領域」（本実施形態では、ウィンド
ウという。従って、本明細書においては、マージ領域と
ウィンドウとは同義語である。）を、クラス順位の最初
のｃ＋１個のクラスの単語とする。従って、まずは、
（ｃ＋１）個の頻度の高い単語がマージ領域となる。次
いで下記の処理を行う。Even a method with time complexity of order O(V³) cannot be run in practice when the vocabulary size is large, on the order of 10⁴ or more. In any case, since step A1 requires computation time of order O(V), step A2 is considered to be the only part that can be improved. In step A2 all possible merges of class pairs could be examined, but in practice the range of class pairs to be searched can be restricted. For this purpose the first conventional example proposes the following method, which the method of the present invention also adopts. First, V singleton classes are created from the V distinct words in the text data and arranged in descending order of frequency, and the "merge region" (called a window in this embodiment; in this specification, merge region and window are synonymous) is defined as the first c+1 classes in this order. Thus, the (c+1) most frequent words initially form the merge region. The following processing is then performed.

【００４０】＜ステップＤ１＞マージ領域内のすべての
ペアの中でも、ＡＭＩの減少量を最小にするようなクラ
スのペアをマージする。＜ステップＤ２＞（ｃ＋２）番目の位置にあるクラスを
マージ領域又はウィンドウの中に挿入し、（ｃ＋２）番
目の位置のクラスよりも後ろの各クラスをその左側方向
に１つだけ移動させる。＜ステップＤ３＞残りのクラスが所定のｃ個になるまで
上記ステップＤ１とＤ２の処理を繰り返す。
<Step D1> Among all pairs in the merge region, merge the pair of classes that minimizes the AMI reduction.
<Step D2> Insert the class at the (c+2)-th position into the merge region (window), and shift each class after the (c+2)-th position one place to the left.
<Step D3> Repeat steps D1 and D2 until the predetermined number c of classes remains.
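Steps D1 to D3 can be sketched in Python as follows. This is a minimal, unoptimized illustration under stated assumptions: the AMI is recomputed from scratch for every candidate pair (none of the speed-ups described above is applied), words outside the window are treated as singleton classes when the AMI is computed, and the names `ami` and `window_cluster` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ami(bigrams, assign):
    """Average mutual information of adjacent classes under word -> class map."""
    joint, left, right = Counter(), Counter(), Counter()
    total = sum(bigrams.values())
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[c1, c2] += n
        left[c1] += n
        right[c2] += n
    # p(c1,c2) * log(p(c1,c2) / (pl(c1) * pr(c2))) written with raw counts
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in joint.items())


def window_cluster(tokens, c):
    """Merge inside a window of c+1 classes until c classes remain."""
    if c < 1:
        raise ValueError("c must be at least 1")
    freq = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    order = [w for w, _ in freq.most_common()]  # most frequent first
    assign = {w: w for w in order}              # every word starts as its own class
    window, rest = order[:c + 1], order[c + 1:]

    def merged(a, b):
        """Word -> class map after folding class b into class a."""
        return {w: (a if cl == b else cl) for w, cl in assign.items()}

    while len(window) > c:
        # D1: the pair with the highest AMI after merging has the smallest reduction
        a, b = max(combinations(window, 2),
                   key=lambda pair: ami(bigrams, merged(*pair)))
        assign = merged(a, b)
        window.remove(b)
        # D2: shift the next most frequent class into the window
        if rest:
            window.append(rest.pop(0))
    return assign  # D3: the loop ends when c classes remain
```

Each class is named after one of its member words, so the number of distinct values in the returned map equals the number of classes.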

【００４１】当該第１の従来例の処理のアルゴリズムに
おいては、上記ステップＡ２の演算時間のコンプレキシ
ティーは、最終クラス数ｃの２乗であるｃ²に比例する
オーダーＯ（ｃ²）となり、全体の演算時間は、ｃ²Ｖに
比例するオーダー（ｃ²Ｖ）に減少する。In the algorithm of the first conventional example, the time complexity of step A2 is of order O(c²), proportional to the square of the final number of classes c, and the total computation time is reduced to order O(c²V), proportional to c²V.

【００４２】次いで、単語のクラスタリング構造を得る
ための方法について述べる。単語のクラスタリング構造
を表わすツリーでの表現を得るための最も簡単な方法
は、マージ処理における副産物としてデンドログラム
（ｄｅｎｄｒｏｇｒａｍ；ツリーの系統図又はツリー図
ともいう。）系統樹を構築すること、即ち具体的には、
マージの順序の記録（又は履歴）を取ってその記録に基
づいて二分木を作ることである。図６に、５単語から成
る語彙を使用した簡単な例を示す。図６におけるマージ
履歴（又はマージのヒストリともいう。）は、次の表１
に示す通りである。なお、表１の第１行目は、「クラス
ＡとクラスＢとをマージして、マージ後の新しいクラス
をＡと名づけた。」ということを意味する。Next, a method for obtaining the clustering structure of words is described. The simplest way to obtain a tree representation of the word clustering structure is to build a dendrogram (also called a tree diagram) as a by-product of the merging process; specifically, a record (history) of the order of merges is kept and a binary tree is built from that record. FIG. 6 shows a simple example using a vocabulary of five words. The merge history for FIG. 6 is as shown in Table 1 below. The first row of Table 1 means that "class A and class B were merged, and the new class after the merge was named A."

【００４３】[0043]

【表１】[Table 1] Merge history
─────────────
Merge(A, B → A)
Merge(C, D → C)
Merge(C, E → C)
Merge(A, C → A)
─────────────
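As an illustration of how a binary tree can be built from a merge history such as Table 1, the following Python sketch replays the merges in order. The function name `build_tree` and the encoding of each merge as a `(left, right, new_name)` tuple and of the tree as nested tuples are assumptions of this sketch.

```python
def build_tree(history):
    """Turn a merge history [(left, right, new_name), ...] into nested tuples."""
    nodes = {}
    for left, right, name in history:
        left_node = nodes.pop(left, left)     # a name not seen before is a leaf word
        right_node = nodes.pop(right, right)
        nodes[name] = (left_node, right_node)
    if len(nodes) != 1:
        raise ValueError("history does not end in a single root")
    return next(iter(nodes.values()))


# The four merges listed in Table 1.
history = [("A", "B", "A"), ("C", "D", "C"), ("C", "E", "C"), ("A", "C", "A")]
tree = build_tree(history)
```

For the history in Table 1 this yields the nested structure ((A, B), ((C, D), E)), in which each tuple is an internal node and each string is a leaf word.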

【００４４】しかしながら、この方法を、上記第１の従
来例の方法のＯ（Ｃ²Ｖ）アルゴリズムに直接的に適用
した場合、各クラスのバランスは極端に悪くなり、図７
に示されているようなほぼ左側方向の分岐のみのツリー
構造となる。この理由は、ＡＭＩ量に関して言えば、マ
ージ領域にある複数のクラスをある一定の大きさを有す
るように成長させた後に、比較的大きなサイズを有する
より高い頻度を有するクラスをマージするよりは、より
低い頻度を有する単集合のクラスをマージした方が、大
幅にコストが安くなるからである。However, if this method is applied directly to the O(C²V) algorithm of the first conventional example, the classes become extremely unbalanced, and the result is a tree structure that branches almost exclusively to the left, as shown in FIG. 7. The reason is that, in terms of AMI, once the classes in the merge region have grown to a certain size, merging a lower-frequency singleton class is far cheaper than merging a higher-frequency class of relatively large size.

【００４５】本発明者が採用した本発明に係る新しい方
法は以下の通りである。＜ステップＥ１＞ＭＩクラスタリング：マージ領域の制
約条件を有する相互情報クラスタリングアルゴリズムを
使用してｃ個のクラスを作成する。当該ｃ個のクラス
は、図１９に示すように、最後のツリー図であるデンド
ログラムの中間層１００を構成する。＜ステップＥ２＞外部クラスタリング：テキストデータ
中のすべての単語をクラス・トークン（ｃｌａｓｓｔ
ｏｋｅｎ）と置換し（実際には、テキストデータ中の全
体の文章の代わりにバイグラム・テーブルについてのみ
処理を行う。）、マージ領域の制約条件なしにすべての
クラスがマージ処理によって単一のクラスになるまで二
分木の形式でマージ処理（バイナリーマージ処理）を実
行する。当該処理によって、デンドログラムＤ_rootを作
成する。このデンドログラムＤ_rootは、例えば図１９に
示すように、最終のツリー構造の上側層１０１を構成す
る。＜ステップＥ３＞内部クラスタリング：｛Ｃ¹，Ｃ²，
…，Ｃⁱ，…，Ｃ^c｝を上記ステップＥ１で得られた中間
層１００のクラスの集合（クラスセット）とする。そし
て、それぞれのｉ（１≦ｉ≦ｃ；ｉは自然数である。）
について以下の処理を行う。The new method according to the present invention adopted by the present inventors is as follows.
<Step E1> MI clustering: Create c classes using the mutual information clustering algorithm with the merge-region constraint. As shown in FIG. 19, these c classes constitute the intermediate layer 100 of the final dendrogram.
<Step E2> Outer clustering: Replace every word in the text data with its class token (in practice, only the bigram table is processed rather than the full text of the text data), and perform binary merging without the merge-region constraint until all classes have been merged into a single class. This produces the dendrogram D_root, which constitutes the upper layer 101 of the final tree structure, as shown for example in FIG. 19.
<Step E3> Inner clustering: Let {C^1, C^2, …, C^i, …, C^c} be the set of classes of the intermediate layer 100 obtained in step E1. The following processing is performed for each i (1 ≤ i ≤ c; i is a natural number).

【００４６】＜ステップＥ３−１＞クラスＣⁱのものを
除いて、テキスト中のすべての単語をそのクラス・トー
クンと置き換える。新しい語彙Ｖ’＝Ｖ₁∪Ｖ₂を決定す
る。ここで、Ｖ₁＝｛Ｃⁱにおけるすべての単語｝、Ｖ₂
＝｛Ｃ¹，Ｃ²，…，Ｃ^i-1，Ｃⁱ⁺ ¹，Ｃ^c｝であり、Ｃ^jは
ｊ番目のクラスのトークンである（１≦ｊ≦ｃ）。語彙
Ｖ’の各要素を各々のクラスに割り当て、語彙Ｖ₁の要
素のみを含むクラスに限ってマージ処理が可能となると
いう制約条件付きマージ処理によって二分木の形式でマ
ージ処理を実行する。当該処理は、最初の│Ｖ₁│個の
クラスにおける語彙Ｖ₁の要素（すなわち、単語）を含
む語彙Ｖ’の要素を頻度の順序で順序づけし、次いで、
最初に幅｜Ｖ₁｜を有しかつ各マージ処理によって１つ
ずつ減少する幅を有するマージ領域内でマージ処理を実
行することによって実行することができる。ここで、｜
Ｖ₁｜は語彙Ｖ₁の単語の個数である。＜ステップＥ３−２＞語彙Ｖ₁におけるすべての要素が
単一のクラスに入るまでマージ処理を繰り返す。各クラ
ス毎に、マージ処理によって、図１９に示すように、下
側層１０２のデンドログラムＤ_subを作成する。このデ
ンドログラムは、葉のノード（ｌｅａｆｎｏｄｅ）が
クラス内の各単語を表している各クラスのサブツリーを
構成する。<Step E3-1> Replace every word in the text, except those in class C^i, with its class token. Define a new vocabulary V' = V_1 ∪ V_2, where V_1 = {all words in C^i}, V_2 = {C^1, C^2, …, C^{i-1}, C^{i+1}, …, C^c}, and C^j is the token of the j-th class (1 ≤ j ≤ c). Assign each element of V' to its own class, and perform binary merging under the constraint that only classes consisting exclusively of elements of V_1 may be merged. This can be done by ordering the elements of V' by frequency such that the elements of V_1 (that is, the words) occupy the first |V_1| classes, and then merging within a merge region whose width is initially |V_1| and decreases by one with each merge. Here |V_1| is the number of words in V_1.
<Step E3-2> Repeat the merging until all elements of V_1 are in a single class. For each class, the merging produces a dendrogram D_sub of the lower layer 102, as shown in FIG. 19. This dendrogram constitutes the subtree for that class, whose leaf nodes represent the individual words in the class.
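The token replacement at the start of step E3-1 can be sketched as follows. The function name `relabel_outside`, the word-to-class dictionary, and the toy class tokens `C1` to `C3` are assumptions of this illustration.

```python
def relabel_outside(tokens, word_class, target):
    """Keep the words of class `target`; replace every other word by its class token."""
    return [w if word_class[w] == target else word_class[w] for w in tokens]


# Hypothetical intermediate-layer classes for a toy text.
word_class = {"cat": "C1", "dog": "C1", "runs": "C2", "sleeps": "C2", "the": "C3"}
tokens = ["the", "cat", "runs", "the", "dog", "sleeps"]
relabelled = relabel_outside(tokens, word_class, "C1")
new_vocabulary = set(relabelled)  # V' = V_1 (words of C1) together with V_2 (other class tokens)
```

The constrained merging of step E3-1 would then operate on bigram counts taken from the relabelled sequence, merging only classes made up of the retained words.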

【００４７】＜ステップＥ４＞上側層１０１のデンドロ
グラムＤ_rootの各葉のノードを、対応する下側層１０２
のデンドログラムＤ_subと置き換えすることによってデ
ンドログラムを合成し、これによって、全体のデンドロ
グラムを得ることができる。<Step E4> The dendrograms are combined by replacing each leaf node of the dendrogram D_root of the upper layer 101 with the corresponding dendrogram D_sub of the lower layer 102, which yields the complete dendrogram.
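Step E4 amounts to substituting a subtree for each leaf of D_root. A minimal Python sketch, assuming trees are encoded as nested tuples whose leaves are class tokens (the name `splice` and the example trees are hypothetical):

```python
def splice(root, subtrees):
    """Replace each leaf of D_root (a class token) with that class's D_sub."""
    if isinstance(root, tuple):
        left, right = root
        return (splice(left, subtrees), splice(right, subtrees))
    return subtrees[root]


# Toy upper-layer tree over three class tokens and one subtree per class.
d_root = ("C1", ("C2", "C3"))
d_sub = {"C1": ("a", "b"), "C2": "c", "C3": (("d", "e"), "f")}
full_tree = splice(d_root, d_sub)
```

A class containing a single word has a one-leaf D_sub, which is why `d_sub["C2"]` is a plain string here.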

【００４８】本発明に係る単語クラスタリングの方法
は、意味又は統語的特徴が似通った単語が近接した位置
に配置された点で、バランスが取れた二分木の形式を有
するツリー構造を生成することができる。図８は、本発
明の方法を用いて、ウォール・ストリート・ジャーナル
（以下、ＷＳＪという。）のコーパスの中で最も使用頻
度の高い上位７０，０００語に関して組み立てた５００
クラスの内の１クラスに対する下側層のデンドログラム
Ｄ_subの一例を示したものである。最後に、根のノード
（ルートノード（ｒｏｏｔｎｏｄｅ））から葉のノー
ド（リーフノード（ｌｅａｆｎｏｄｅ）に至るパスの
追跡し、左側方向の分岐又は右側方向の分岐をそれぞれ
表わす０又は１の１ビットを各分岐に割り当てることに
よって、語彙の中の各単語に対して、ビットストリング
（単語ビット）を割り当てることができる。The word clustering method according to the present invention can generate a tree structure in the form of a balanced binary tree in which words with similar semantic or syntactic features are placed close together. FIG. 8 shows an example of a lower-layer dendrogram D_sub for one of 500 classes built with the method of the present invention from the 70,000 most frequent words in the Wall Street Journal (hereinafter, WSJ) corpus. Finally, by tracing the path from the root node to each leaf node and assigning to each branch one bit, 0 or 1, representing a left or right branch respectively, a bit string (word bits) can be assigned to each word in the vocabulary.
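The assignment of word bits can be illustrated with a short recursive traversal. The nested-tuple tree encoding and the function name `word_bits` are assumptions of this sketch; the example tree is the one implied by the merge history of Table 1.

```python
def word_bits(tree, prefix=""):
    """Map each leaf word to its 0/1 path from the root (0 = left, 1 = right)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    bits = word_bits(left, prefix + "0")
    bits.update(word_bits(right, prefix + "1"))
    return bits


tree = (("A", "B"), (("C", "D"), "E"))
bits = word_bits(tree)
```

Words that share a long prefix of bits lie in the same subtree, so the length of the common prefix gives a simple measure of how closely two words were clustered.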

【００４９】図１０は、図１の単語分類処理部２０の構
成を示すブロック図である。図１０を参照して、単語分
類処理部２０の構成及び動作について説明する。図１０
において、単語分類処理部２０は、ＣＰＵ５０を備えた
コントローラであって、ＣＰＵ５０と、ＣＰＵ５０によ
って実行される単語分類処理のプログラム及び当該プロ
グラムを実行するために必要なデータを格納するための
ＲＯＭ５１と、上記単語分類処理を実行するときに必要
なワークエリアであるワークＲＡＭ５２と、上記単語分
類処理を実行するときに必要な複数のメモリエリアを有
するＲＡＭ５３と、２つのメモリインターフェース５
４，５５とを備え、これらの各回路５０乃至５５はバス
５６を介して互いに接続される。ここで、メモリインタ
ーフェース５４は、テキストデータメモリ１０とＣＰＵ
５０との間に設けられ、テキストデータメモリ１０とＣ
ＰＵ５０との間の信号変換などのインターフェース処理
を実行するためのインターフェース回路である一方、メ
モリインターフェース５５は、単語辞書メモリ１１とＣ
ＰＵ５０との間に設けられ、単語辞書メモリ１１とＣＰ
Ｕ５０との間の信号変換などのインターフェース処理を
実行するためのインターフェース回路である。ＲＡＭ５
３は、次のように区分された複数のメモリ部を備える。FIG. 10 is a block diagram showing the configuration of the word classification processing unit 20 of FIG. 1. The configuration and operation of the word classification processing unit 20 are described with reference to FIG. 10. In FIG. 10, the word classification processing unit 20 is a controller built around a CPU 50. It comprises the CPU 50; a ROM 51 that stores the word classification program executed by the CPU 50 and the data needed to run that program; a work RAM 52 that serves as the work area needed during the word classification processing; a RAM 53 that has the plurality of memory areas needed during the word classification processing; and two memory interfaces 54 and 55. These circuits 50 to 55 are connected to one another via a bus 56. The memory interface 54 is an interface circuit provided between the text data memory 10 and the CPU 50 that performs interface processing, such as signal conversion, between them. Likewise, the memory interface 55 is an interface circuit provided between the word dictionary memory 11 and the CPU 50 that performs interface processing, such as signal conversion, between them. The RAM 53 comprises a plurality of memory sections divided as follows.

【００５０】（ａ）初期化クラス単語メモリ６１：後述
する初期化処理によって得られたｖ個の単語及びそのク
ラスを格納する；（ｂ）ＡＭＩメモリ６２：後述する中間層クラスタリン
グ処理、上側層クラスタリング処理及び下側層クラスタ
リング処理において１つのウィンドウ内のクラスの単語
の中ですべての組わせの仮ペアを作り、各仮ペアをマー
ジしたときの平均相互情報量を数２を用いて計算した結
果を格納する；（ｃ）中間層メモリ６３：後述する中間層クラスタリン
グ処理によって得られたｃ個の中間層のクラスの単語を
格納する；（ｄ）上側層ヒストリメモリ６４：後述する上側層クラ
スタリング処理における各マージ処理の履歴（又はヒス
トリ）を格納する；（ｅ）上側層ツリーメモリ６５：上記上側層クラスタリ
ング処理によって得られたツリー図であるデンドログラ
ムＤ_rootを格納する；（ｆ）下側層ヒストリメモリ６６：上記下側層クラスタ
リング処理によって得られた、中間層１００の各クラス
に対して１つのツリー図である複数ｃ個のデンドログラ
ムＤ_subを格納する；（ｇ）下側層ツリーメモリ６７：上記下側層クラスタリ
ング処理によって得られたツリー図であるデンドログラ
ムＤ_subを格納する；（ｈ）ツリーメモリ６７：上側層１０１の１つのデンド
ログラムと下側層１０２の複数ｃ個のデンドログラムと
を、中間層１００を介して連結することにより得られた
全体のツリー図であるデンドログラムを格納する。
(a) Initialization class word memory 61: stores the v words and their classes obtained by the initialization processing described later;
(b) AMI memory 62: stores the results obtained, in the intermediate layer, upper layer, and lower layer clustering processing described later, by forming provisional pairs of every combination among the class words in one window and calculating with Equation 2 the average mutual information when each provisional pair is merged;
(c) Intermediate layer memory 63: stores the words of the c intermediate-layer classes obtained by the intermediate layer clustering processing described later;
(d) Upper layer history memory 64: stores the history of each merge in the upper layer clustering processing described later;
(e) Upper layer tree memory 65: stores the dendrogram D_root, the tree diagram obtained by the upper layer clustering processing;
(f) Lower layer history memory 66: stores the c dendrograms D_sub, one tree diagram for each class of the intermediate layer 100, obtained by the lower layer clustering processing;
(g) Lower layer tree memory 67: stores the dendrograms D_sub, the tree diagrams obtained by the lower layer clustering processing;
(h) Tree memory 68: stores the dendrogram that is the overall tree diagram obtained by connecting the one dendrogram of the upper layer 101 and the c dendrograms of the lower layer 102 via the intermediate layer 100.

【００５１】図１１は、図１の単語分類処理部２０によ
って実行されるメインルーチンの単語分類処理を示すフ
ローチャートである。図１１に示すように、まず、ステ
ップＳ１においてテキストデータに基づいて出現頻度の
高い単語から順に並べる処理を実行する初期化処理を実
行し、次いで、ステップＳ２において中間層１００のク
ラスの単語を求める中間層クラスタリング処理を実行
し、さらに、ステップＳ３において上側層１０１のツリ
ー図を求める上側層クラスタリング処理を実行し、そし
て、ステップＳ４において下側層１０２のツリー図を求
める下側層クラスタリング処理を実行し、最後に、ステ
ップＳ５において上側層１０１の１つのツリー図と下側
層１０２の複数ｃ個のツリー図とを、中間層１００を介
して連結することにより得られた全体のツリー図である
デンドログラムを求めて、その結果を単語辞書として単
語辞書メモリ１１に格納するデータ出力処理を実行す
る。これによって、単語分類処理が終了する。なお、こ
れらのツリー図においては、各単語がそれぞれ１つのク
ラスに分類されかつクラス間の連結関係が示される。FIG. 11 is a flowchart showing the word classification processing of the main routine executed by the word classification processing unit 20 of FIG. 1. As shown in FIG. 11, first, in step S1, initialization processing is executed to arrange the words in descending order of appearance frequency based on the text data. Next, in step S2, intermediate layer clustering processing is executed to obtain the class words of the intermediate layer 100. Then, in step S3, upper layer clustering processing is executed to obtain the tree diagram of the upper layer 101, and in step S4, lower layer clustering processing is executed to obtain the tree diagrams of the lower layer 102. Finally, in step S5, data output processing is executed to obtain the dendrogram that is the overall tree diagram, formed by connecting the one tree diagram of the upper layer 101 and the c tree diagrams of the lower layer 102 via the intermediate layer 100, and to store the result in the word dictionary memory 11 as a word dictionary. This completes the word classification processing. In these tree diagrams, each word is classified into exactly one class and the connection relationships between the classes are shown.

【００５２】図１２は、図１１のサブルーチンの初期化
処理（Ｓ１）を示すフローチャートである。図１２に示
すように、ステップＳ１１において、テキストデータメ
モリ１０内のテキストデータに基づいて、単語の重複を
省いたすべての複数ｖ個の単語の出現頻度を調べて、出
現頻度の高い単語から順に並べて、これを複数ｖ個のク
ラスに割り当てて、複数ｖ個のクラスの単語を初期化ク
ラス単語メモリ６１に記憶して、元のメインルーチンに
戻る。ここで、ｖは２以上の自然数である。FIG. 12 is a flowchart showing the initialization processing (S1), a subroutine of FIG. 11. As shown in FIG. 12, in step S11, the appearance frequencies of all v distinct words are counted from the text data in the text data memory 10, the words are arranged in descending order of appearance frequency and assigned to v classes, the words of the v classes are stored in the initialization class word memory 61, and processing returns to the main routine. Here, v is a natural number of 2 or more.
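The initialization of step S11 can be sketched in a few lines of Python. The function name `initialize_classes` and the representation of each class as a set of words are assumptions of this illustration; ties in frequency are left in first-seen order, which the specification does not address.

```python
from collections import Counter


def initialize_classes(tokens):
    """Return one singleton class per distinct word, most frequent word first."""
    counts = Counter(tokens)
    ranked = [word for word, _ in counts.most_common()]
    return [{word} for word in ranked]


initial = initialize_classes("b a b c a b".split())
```

The resulting list plays the role of the v classes held in the initialization class word memory 61, with the position in the list giving the frequency rank.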

【００５３】図１３は、図１１のサブルーチンの中間層
クラスタリング処理（Ｓ２）を示すフローチャートであ
る。図１３に示すように、まず、ステップＳ２１におい
て、初期化クラス単語メモリ６１から複数ｖ個のクラス
の単語を読み出した後、複数ｖ個のクラスの単語のうち
の出現頻度の高いクラスの単語からｖ個未満の（ｃ＋
１）個のクラスの単語を１つのウィンドウ（又はマージ
領域）内のクラスの単語として、図１７に示すように、
ワークＲＡＭ５２に記憶する。ここで、１＜ｃ＋１＜ｖ
である。次いで、ステップＳ２２において、ワークＲＡ
Ｍ５２に記憶された１つのウィンドウ内のクラスの単語
の中で、すべての２個ずつの組み合わせの仮ペアを作
り、各仮ペアをそれぞれマージしたときの平均相互情報
量を数２を用いて計算して、各仮ペアとそれに対応する
計算された平均相互情報量とを次の表２の形式でＡＭＩ
メモリ６２に記憶する。FIG. 13 is a flowchart showing the intermediate layer clustering processing (S2), a subroutine of FIG. 11. As shown in FIG. 13, first, in step S21, the words of the v classes are read from the initialization class word memory 61, and then the (c+1) class words with the highest appearance frequency among them, where c+1 is less than v, are stored in the work RAM 52 as the class words in one window (or merge region), as shown in FIG. 17. Here, 1 < c+1 < v. Next, in step S22, provisional pairs are formed from every combination of two among the class words in the window stored in the work RAM 52, the average mutual information obtained when each provisional pair is merged is calculated using Equation 2, and each provisional pair is stored together with its calculated average mutual information in the AMI memory 62 in the format of Table 2 below.

【００５４】[0054]

【表２】[Table 2]
──────────────────
Provisional pair    Average mutual information
──────────────────
(C^1, C^2)          0.867678
(C^2, C^3)          0.234689
(C^3, C^4)          0.125686
…                   …
(C^c, C^{c+1})      0.675642
──────────────────

【００５５】次いで、ステップＳ２３において、図１８
に示すように、ＡＭＩメモリ６２に記憶された各仮ペア
の平均相互情報量のうち、最大となる仮ペアを見つけて
当該仮ペアをマージすることにより、１つのクラスが減
少し、マージ後の１つのウィンドウ内のクラスの単語を
更新して、更新後のクラスの単語をワークＲＡＭ５２に
記憶する。そして、ステップＳ２４において、ウィンド
ウ外のクラスはなくなりかつウィンドウ内のクラスの数
はｃ個となったか否かが判断され、その判断がＮＯであ
るとき、ステップＳ２５において、図１８に示すよう
に、現在のウィンドウよりも外側にあり、最大の出現頻
度を有するクラスの単語をウィンドウ内に挿入し、挿入
後の１つのウィンドウ内のクラスの単語を更新して、更
新後のクラスの単語をワークＲＡＭ５２に記憶した後、
ステップＳ２２に戻って、ステップＳ２２以降の処理を
繰り返す。Next, in step S23, as shown in FIG. 18, the provisional pair with the largest average mutual information among those stored in the AMI memory 62 is found and merged, which reduces the number of classes by one; the class words in the window are updated to reflect the merge, and the updated class words are stored in the work RAM 52. Then, in step S24, it is determined whether no classes remain outside the window and the number of classes in the window has become c. If the determination is NO, in step S25, as shown in FIG. 18, the class word with the highest appearance frequency among those outside the current window is inserted into the window, the class words in the window are updated to reflect the insertion, and the updated class words are stored in the work RAM 52; processing then returns to step S22, and the steps from S22 onward are repeated.

【００５６】一方、ステップＳ２４においてＹＥＳであ
るときは、ステップＳ２６において、ワークＲＡＭ５２
に記憶された、ウィンドウ内のｃ個のクラス及びそれに
属する単語を中間層１００として中間層メモリ６３に記
憶する。これによって、中間層クラスタリング処理が終
了し、メインルーチンに戻る。On the other hand, if the determination in step S24 is YES, in step S26 the c classes in the window stored in the work RAM 52, together with the words belonging to them, are stored in the intermediate layer memory 63 as the intermediate layer 100. This completes the intermediate layer clustering processing, and processing returns to the main routine.

【００５７】図１４は、図１１のサブルーチンの上側層
クラスタリング処理（Ｓ３）を示すフローチャートであ
り、図１９に示すように、中間層１００から矢印２０１
の方向でツリー図を求める処理である。図１４に示すよ
うに、まず、ステップＳ３１において、中間層メモリ６
３内のｃ個のクラスの単語を読み出した後、当該ｃ個の
クラスの単語を１つのウィンドウ内のクラス単語とし
て、ワークＲＡＭ５２に記憶する。次いで、ステップＳ
３２において、ステップＳ２２と同様に、ワークＲＡＭ
５２に記憶された１つのウィンドウ内のクラスの単語の
中で、すべての２個ずつの組み合わせの仮ペアを作り、
各仮ペアをそれぞれマージしたときの平均相互情報量を
数２を用いて計算して、各仮ペアとそれに対応する計算
された平均相互情報量とを前述の表２の形式でＡＭＩメ
モリ６２に記憶する。FIG. 14 is a flowchart showing the upper layer clustering processing (S3), a subroutine of FIG. 11; as shown in FIG. 19, this processing obtains the tree diagram in the direction of arrow 201 from the intermediate layer 100. As shown in FIG. 14, first, in step S31, the words of the c classes are read from the intermediate layer memory 63 and stored in the work RAM 52 as the class words in one window. Next, in step S32, as in step S22, provisional pairs are formed from every combination of two among the class words in the window stored in the work RAM 52, the average mutual information obtained when each provisional pair is merged is calculated using Equation 2, and each provisional pair is stored together with its calculated average mutual information in the AMI memory 62 in the format of Table 2 above.

【００５８】次いで、ステップＳ３３において、ステッ
プＳ２３と同様に、ＡＭＩメモリ６２に記憶された各仮
ペアの平均相互情報量のうち、最大となる仮ペアを見つ
けて当該仮ペアをマージすることにより、１つのクラス
が減少し、マージ後の１つのウィンドウ内のクラスの単
語を更新して、更新後のクラスの単語をワークＲＡＭ５
２に記憶する。また、例えば表１の形式を有し、どのク
ラスとどのクラスとがマージされて新しく名づけられた
クラスとなったかを表わす当該マージ処理の履歴を上側
層ヒストリメモリ６４に記憶する。そして、ステップＳ
３４において、ウィンドウ内のクラスの数はｃ個となっ
たか否かが判断され、その判断がＮＯであるとき、ステ
ップＳ３２に戻って、ステップＳ３２以降の処理を繰り
返す。Next, in step S33, as in step S23, the provisional pair with the largest average mutual information among those stored in the AMI memory 62 is found and merged, which reduces the number of classes by one; the class words in the window are updated to reflect the merge, and the updated class words are stored in the work RAM 52. In addition, the history of the merge, which has for example the format of Table 1 and indicates which classes were merged and what the resulting class was named, is stored in the upper layer history memory 64. Then, in step S34, it is determined whether the number of classes in the window has become c; if the determination is NO, processing returns to step S32, and the steps from S32 onward are repeated.

【００５９】一方、ステップＳ３４においてＹＥＳであ
るときは、ステップＳ３５において、上側層ヒストリメ
モリ６４内の上側層の履歴又はヒストリに基づいて、例
えば図６に示すように、上側層のツリー図又はデンドロ
グラムＤ_rootを作成して上側層ツリーメモリ６５に記憶
する。これによって、上側層クラスタリング処理が終了
し、メインルーチンに戻る。On the other hand, if the determination in step S34 is YES, in step S35 the tree diagram of the upper layer, that is, the dendrogram D_root, is created from the upper-layer history in the upper layer history memory 64, for example as shown in FIG. 6, and stored in the upper layer tree memory 65. This completes the upper layer clustering processing, and processing returns to the main routine.

【００６０】図１５は、図１１のサブルーチンの下側層
クラスタリング処理（Ｓ４）を示すフローチャートであ
り、図１５に示すように、下側層１０２の底辺に位置す
る単語から、中間層１００の各クラスＣⁱ毎に、矢印２
０２の方向でツリー図を求める処理である。ある。図１
５に示すように、まず、ステップＳ４１において、中間
層メモリ６３内のｃ個のクラスの単語を読み出した後、
当該ｃ個のクラスから１つのクラスを選択する。そし
て、ステップＳ４２において、選択されたクラス内のｖ
_i個の単語を１つのウィンドウ内のクラス単語として、
ワークＲＡＭ５２に記憶する。次いで、ステップＳ４３
において、ステップＳ２２及びＳ３２と同様に、ワーク
ＲＡＭ５２に記憶された１つのウィンドウ内のクラスの
単語の中で、すべての２個ずつの組み合わせの仮ペアを
作り、各仮ペアをそれぞれマージしたときの平均相互情
報量を数２を用いて計算して、各仮ペアとそれに対応す
る計算された平均相互情報量とを前述の表２の形式でＡ
ＭＩメモリ６２に記憶する。FIG. 15 is a flowchart showing the lower layer clustering processing (S4), a subroutine of FIG. 11; as shown in FIG. 15, this processing obtains a tree diagram in the direction of arrow 202 for each class C^i of the intermediate layer 100, starting from the words at the bottom of the lower layer 102. As shown in FIG. 15, first, in step S41, the words of the c classes are read from the intermediate layer memory 63, and one of the c classes is selected. Then, in step S42, the v_i words in the selected class are stored in the work RAM 52 as the class words in one window. Next, in step S43, as in steps S22 and S32, provisional pairs are formed from every combination of two among the class words in the window stored in the work RAM 52, the average mutual information obtained when each provisional pair is merged is calculated using Equation 2, and each provisional pair is stored together with its calculated average mutual information in the AMI memory 62 in the format of Table 2 above.

【００６１】次いで、ステップＳ４４において、ステッ
プＳ２３及びＳ３３と同様に、ＡＭＩメモリ６２に記憶
された各仮ペアの平均相互情報量のうち、最大となる仮
ペアを見つけて当該仮ペアをマージすることにより、１
つのクラスが減少し、マージ後の１つのウィンドウ内の
クラスの単語を更新して、更新後のクラスの単語をワー
クＲＡＭ５２に記憶する。また、例えば表１の形式を有
し、どのクラスとどのクラスとがマージされて新しく名
づけられたクラスとなったかを表わす当該マージ処理の
履歴を下側層ヒストリメモリ６６に記憶する。そして、
ステップＳ４５において、ウィンドウ内のクラスの数は
ｃ個となったか否かが判断され、その判断がＮＯである
とき、ステップＳ４３に戻って、ステップＳ４３以降の
処理を繰り返す。ここで、ステップＳ４３及びＳ４４の
処理は、中間層１００の各クラス毎に実行される。Next, in step S44, as in steps S23 and S33, the provisional pair with the largest average mutual information among those stored in the AMI memory 62 is found and merged, which reduces the number of classes by one; the class words in the window are updated to reflect the merge, and the updated class words are stored in the work RAM 52. In addition, the history of the merge, which has for example the format of Table 1 and indicates which classes were merged and what the resulting class was named, is stored in the lower layer history memory 66. Then, in step S45, it is determined whether the number of classes in the window has become c; if the determination is NO, processing returns to step S43, and the steps from S43 onward are repeated. Here, the processing of steps S43 and S44 is executed for each class of the intermediate layer 100.

【００６２】一方、ステップＳ４５においてＹＥＳであ
るときは、ステップＳ４６においてすべての中間層１０
０のクラスについて処理したか否かが判断され、当該判
断がＮＯであるとき、未処理のクラスが残っているの
で、ステップＳ４７において残っている中間層１００の
別の未処理のクラスを選択した後、ステップＳ４２に進
む。一方、ステップＳ４６においてＹＥＳであるとき
は、ステップＳ４８において、下側層ヒストリメモリ６
６内の下側層の履歴又はヒストリに基づいて、例えば図
６に示すように、下側層のツリー図又はデンドログラム
Ｄ_subを作成して下側層ツリーメモリ６７に記憶する。
これによって、下側層クラスタリング処理が終了し、メ
インルーチンに戻る。On the other hand, if the determination in step S45 is YES, in step S46 it is determined whether all classes of the intermediate layer 100 have been processed. If that determination is NO, unprocessed classes remain, so in step S47 another unprocessed class of the intermediate layer 100 is selected and processing proceeds to step S42. If the determination in step S46 is YES, in step S48 the tree diagrams of the lower layer, that is, the dendrograms D_sub, are created from the lower-layer history in the lower layer history memory 66, for example as shown in FIG. 6, and stored in the lower layer tree memory 67. This completes the lower layer clustering processing, and processing returns to the main routine.

【００６３】[0063]

FIG. 16 is a flowchart showing the data output process (S5), a subroutine of FIG. 11. As shown in FIG. 16, first, in step S51, the upper-layer tree diagram in the upper layer tree memory 65 and the lower-layer tree diagrams in the lower layer tree memory 67 are connected via the classes Cⁱ of the intermediate layer 100, as shown in FIG. 19. That is, the upper-layer tree diagram in the upper layer tree memory 65 is connected to each class Cⁱ of the intermediate layer 100, while each lower-layer tree diagram in the lower layer tree memory 67 is connected, at the class at its apex, to the corresponding class Cⁱ of the intermediate layer 100. An overall tree diagram based on the text data is thereby created, and the information of the tree diagram is stored in the tree memory 68. As shown in FIGS. 6 and 8, the tree memory 68 stores the connection relations among the words of each class as a word dictionary. Then, in step S52, the tree diagram information in the tree memory 68 is output to and stored in the word dictionary memory 11 as the word classification result (or word clustering result).
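As an illustration of the connection performed in step S51, the following minimal Python sketch joins an upper-layer tree to lower-layer trees at the intermediate-layer classes and then reads off one bit string per word. The nested-tuple tree representation, the class names C1 to C3, the function names `link` and `word_bits`, and the convention that 0 denotes the left branch and 1 the right branch are all assumptions made for this sketch, not details taken from the embodiment.

```python
def link(upper, lower):
    """Replace every leaf of the upper tree (an intermediate-layer class name)
    with the lower-layer tree built for that class."""
    if isinstance(upper, tuple):
        return tuple(link(child, lower) for child in upper)
    return lower[upper]


def word_bits(tree, prefix=""):
    """Walk the linked binary tree and return {word: path from the root}."""
    if isinstance(tree, tuple):
        left, right = tree
        bits = word_bits(left, prefix + "0")
        bits.update(word_bits(right, prefix + "1"))
        return bits
    return {tree: prefix}


# Hypothetical upper layer over three intermediate-layer classes.
upper = (("C1", "C2"), "C3")
# Hypothetical lower-layer trees, one per intermediate-layer class.
lower = {"C1": ("position", "location"), "C2": "of", "C3": ("for", "to")}

tree = link(upper, lower)
bits = word_bits(tree)
```

In this toy tree the bit strings have unequal lengths because the tree is not of uniform depth; a practical dictionary might pad them to a fixed width, which this sketch does not attempt.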

【００６４】[0064]

<First Embodiment> FIG. 1 is a block diagram of a speech recognition device according to a first embodiment of the present invention. In FIG. 1, text data stored in the text data memory 10 and containing a plurality of words, for example in English or Japanese, is subjected to the above-described word classification process by the word classification processing unit 20. The words are thereby classified into a plurality of classes, and the result is stored in the word dictionary memory 11 as a word dictionary that describes the connection relations among the classes.

【００６５】[0065]

On the other hand, uttered speech consisting of a plurality of words is input to the microphone 1, converted into a speech signal by the microphone 1, and then A/D-converted into a digital speech signal by the A/D converter 2. The digital speech signal is input to the feature extraction unit 3, which performs, for example, LPC analysis on it to extract feature parameters such as cepstrum coefficients and log power, and outputs them to the speech recognition unit 5 via the buffer memory 4. The speech recognition unit 5 performs speech recognition word by word, referring to the word dictionary stored in the word dictionary memory 11 and to the language model stored in the language model memory 12, which is, for example, a phoneme hidden Markov model (hereinafter, phoneme HMM), and outputs the speech recognition result. Here, the word dictionary in the word dictionary memory 11 contains information such as each word together with a bit string representing the class to which the word belongs, for example: (a) 010010010, position; (b) 010010011, location; (c) 110010100, for.
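One way such bit strings can be used is to compare words by the number of leading bits they share. The sketch below assumes, as an interpretation rather than something stated in this paragraph, that each bit records one branch on the path from the root of the tree diagram, so that a longer shared prefix means two words part ways deeper in the tree. The helper name `shared_prefix_length` is hypothetical; the three entries are the examples (a) to (c) given above.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Number of leading bits two bit strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Entries (a) to (c) of the word dictionary example.
WORD_BITS = {
    "position": "010010010",
    "location": "010010011",
    "for": "110010100",
}

near = shared_prefix_length(WORD_BITS["position"], WORD_BITS["location"])
far = shared_prefix_length(WORD_BITS["position"], WORD_BITS["for"])
```

Under this reading, "position" and "location" differ only in their last bit, whereas "position" and "for" already differ in the first bit.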

【００６６】[0066]

<Second Embodiment> FIG. 2 is a block diagram of a morphological and syntactic analysis device according to a second embodiment of the present invention. In FIG. 2, two sets of text data, each consisting of a plurality of words and stored in the text data memories 31 and 32 respectively, are each subjected to the above-described word classification process by the word classification processing unit 20. Each is thereby classified into a plurality of classes and stored, as a word dictionary describing the connection relations among the classes, in the word dictionary memories 41 and 42, respectively.

【００６７】[0067]

A natural language sentence, consisting of a character string in a given language such as Japanese or English and containing a plurality of words, is input to the morphological analysis unit 21. Referring to the word dictionary stored in the word dictionary memory 41, the morphological analysis unit 21 divides the natural language sentence into a plurality of words according to the surface form of each word, attaches information such as part of speech, inflected form, canonical form, and synonym code to each surface form, and outputs these analysis results to the syntactic analysis unit 22. The syntactic analysis unit 22 then refers to the word dictionary stored in the word dictionary memory 42, performs a predetermined syntactic analysis, attaches syntax tree information to the word sequence, and outputs the result as the analysis result.

【００６８】[0068]

As described above, and as shown in FIG. 19, the words are classified into a plurality of classes in the form of a binary tree that is layered into a lower layer, an intermediate layer, and an upper layer, so the word classification process yields a classification result with a balanced hierarchical structure. Moreover, in the lower, intermediate, and upper layers alike, the AMI is calculated over the words of all classes, so the clustering is performed on the basis of a global AMI that includes information on all the words rather than a local AMI. A globally optimized word classification result can therefore be obtained. As a result, a more precise and accurate classification system can be obtained when a word classification system is acquired automatically from text data. Furthermore, by performing speech recognition based on the word dictionary obtained by the word classification processing unit 20, speech can be recognized at a higher recognition rate than in the conventional example.

【００６９】[0069]

In the above embodiments, the speech recognition unit 5, the word classification processing unit 20, the morphological analysis unit 21, and the syntactic analysis unit 22 are implemented by, for example, a digital computer. In the word classification process of the above embodiments, the intermediate layer clustering process, the upper layer clustering process, and the lower layer clustering process are executed in that order, as shown in FIG. 11. However, the present invention is not limited to this, and the processes may be executed in the order of intermediate layer clustering, lower layer clustering, and upper layer clustering. In the above embodiments, a process may also be executed before the initialization process of FIG. 11 that uses word n-grams to classify the plurality of words into a plurality of classes, on the criterion that words that are frequently adjacent to the same word are assigned to the same class.

【００７０】[0070]

【実施例】[Example]

<Experiment (Simulation)> Clusters and word bits were created using plain text data from six years of the WSJ corpus. The text sizes were 5 million, 10 million, 20 million, and 50 million words (5 MW, 10 MW, 20 MW, and 50 MW, respectively, where W stands for words). The vocabulary consisted of the 70,000 most frequent words in the entire corpus. The final number of classes c was set to 500. The obtained clusters and word bits were evaluated using the following two measures, SS1 and SS2, respectively.

(a) Measure SS1 is the perplexity method, which computes the perplexity of a class-based trigram model based on the WSJ corpus and on a general English treebank, a corpus owned by the applicant.

(b) Measure SS2 is the error rate of the applicant's part-of-speech tagger (labeler), which uses decision trees.

【００７１】[0071]

<Perplexity Method> Using a class function G that maps each word to the class it belongs to, the word trigram probability can be rewritten as follows.

【００７２】[0072]

【数１５】[Equation 15]
$$P(w_i \mid w_{i-2}\, w_{i-1}) = P_c\bigl(G(w_i) \mid G(w_{i-2})\, G(w_{i-1})\bigr)\, P_m\bigl(w_i \mid G(w_i)\bigr)$$

【００７３】[0073]

Here, P_c is a second-order Markov chain probability and P_m is the word membership probability. P_c and P_m are smoothed using Katz back-off and the Good-Turing formula, respectively. The training text is 1.9 MW and the test text is 150 KW, both taken from the WSJ corpus. The vocabulary size is 77 KW. FIG. 9 shows the relationship between the perplexity of the test text and the size of the clustering text. The zero point on the clustering text size axis represents the perplexity of the word trigram model. As the clustering text size increases, the perplexity decreases monotonically, which indicates that the clustering improves. At 50 MW, the perplexity is 18% lower than that of the word trigram model. This result is in marked contrast to the result of the first conventional example, in which the perplexity of the class trigram was slightly higher than that of the word trigram model.
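The class-based factorization of Equation 15 and the perplexity measure can be sketched as follows. This is a minimal illustration on a toy corpus, not the experimental setup: it uses unsmoothed relative frequencies in place of the Katz back-off and Good-Turing smoothing described above, so it fails on unseen class histories and unseen words. The class map `G`, the class names, and the function names `train`, `prob`, and `perplexity` are assumptions introduced for the sketch.

```python
import math
from collections import Counter


def train(words, G):
    """Collect the counts needed for P_c and P_m from a training word list."""
    classes = [G[w] for w in words]
    return {
        "tri": Counter(zip(classes, classes[1:], classes[2:])),
        "hist": Counter(zip(classes[:-2], classes[1:-1])),  # trigram histories
        "word": Counter(words),
        "cls": Counter(classes),
    }


def prob(model, G, w2, w1, w):
    """P(w | w2 w1) = P_c(G(w) | G(w2) G(w1)) * P_m(w | G(w)), unsmoothed."""
    c2, c1, c = G[w2], G[w1], G[w]
    p_c = model["tri"][(c2, c1, c)] / model["hist"][(c2, c1)]
    p_m = model["word"][w] / model["cls"][c]
    return p_c * p_m


def perplexity(model, G, test_words):
    """2 ** (negative mean log2 probability) over all trigram predictions."""
    logs = [math.log2(prob(model, G, *test_words[i - 2:i + 1]))
            for i in range(2, len(test_words))]
    return 2 ** (-sum(logs) / len(logs))


# Toy class function and training text (both hypothetical).
G = {"the": "DET", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}
model = train("the cat sat the dog sat the cat ran".split(), G)

p = prob(model, G, "the", "cat", "sat")
ppl = perplexity(model, G, ["the", "dog", "ran"])
```

In the toy data the class history (DET, N) is always followed by class V, so P_c is 1 and the probability of "sat" is determined by its membership probability within class V. Note that "the dog ran" never occurs as a word trigram in the training text, yet the class model still assigns it a nonzero probability.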

【００７４】[0074]

<Part-of-Speech Tagging Using Decision Trees> The applicant's decision-tree part-of-speech tagger is an integrated module of the applicant's decision-tree parser, which is based on SPATTER (see, for example, prior art document 2: D. Magerman, "Natural Language Parsing as Statistical Recognition", Doctoral Dissertation, Stanford University, Stanford, California, 1994). The tagger uses 441 syntactic tags, one order of magnitude more than the University of Pennsylvania Treebank project. The training text, the test text, and the held-out text all consist of sequences of word-tag pairs. In the training phase, an event is a set of feature values, or a set of pairs of a question and its answer. A feature is any attribute of the context in which the current word to be processed, word(0), appears, and for convenience it is expressed in the form of a question. Tagging proceeds from left to right. Table 3 shows an example of an event in which the current word is "like".

【００７５】[0075]

【表３】[Table 3]
───────────────────────────────────
Event-128: {
  <word(0), "like"> <word(-1), "flies"> <word(-2), "time">
  <word(1), "an"> <word(2), "arrow">
  <tag(-1), "Verb-3rd-Sg-type3"> <tag(-2), "Noun-Sg-type14">
  ............................ (Basic Questions)
  <Inclass?(word(0), Class295), "yes"> <WordBits(Word(-1), 29), "1">
  ............................ (WordBits Questions)
  <Tag, "Prep-type5">
}
───────────────────────────────────

【００７６】[0076]

The last pair of this event is a special item that gives the answer, that is, the correct tag of the current word. The first two lines represent questions about the identity of the words around the current word and about the tags of the preceding words. These questions are called basic questions. The second type of question (word bits questions) concerns clusters and word bits, for example "Is the current word in class 295?" or "What is the 29th bit of the word bits of the preceding word?".
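The two kinds of word bits question can be sketched in Python as simple lookups. The class numbers in `CLASS_OF`, the 9-bit strings in `WORD_BITS` (borrowed from the dictionary example in the first embodiment, which is why bit 9 is queried here rather than bit 29), and the function names are all hypothetical and serve only to show how an answer would be derived for an event.

```python
# Hypothetical dictionary contents: bit string and intermediate-layer class per word.
WORD_BITS = {"position": "010010010", "location": "010010011", "for": "110010100"}
CLASS_OF = {"position": 295, "location": 295, "for": 17}


def inclass(word, class_id):
    """Answer to 'Is this word in the given class?'."""
    return "yes" if CLASS_OF.get(word) == class_id else "no"


def wordbits(word, k):
    """Answer to 'What is the k-th bit of this word's bit string?' (k is 1-based)."""
    return WORD_BITS[word][k - 1]


def wordbits_questions(words, i):
    """Question-answer pairs for the current word words[i] and its predecessor."""
    return [
        (("Inclass?", "word(0)", "Class295"), inclass(words[i], 295)),
        (("WordBits", "word(-1)", 9), wordbits(words[i - 1], 9)),
    ]


pairs = wordbits_questions(["for", "location", "position"], 2)
```

Each returned pair has the same question-answer shape as the entries of Table 3, so it could be appended to the basic questions of an event.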

【００７７】[0077]

A decision tree is built from a set of events. The root node of the decision tree represents the set of all events, each of which contains the correct tag for its word. The probability distribution of tags at the root node is obtained by computing the relative frequencies of the tags in that set. By asking for the value of a feature in each event of the set, the set can be split into N subsets, where N is the number of possible values of the feature. The conditional probability distribution of tags for each subset, conditioned on the feature value, can then be computed. The reduction in entropy produced by the split is computed for each feature, and the feature that maximizes the reduction is selected. By repeating this procedure and splitting the sets into subsets, a decision tree is built whose leaf nodes contain conditional probability distributions of tags. The resulting probability distributions are then smoothed using the held-out data. For details of the smoothing, see prior art document 2 above. In the test phase, the system looks up the conditional probability distribution for each word in the test text and selects the most probable tag sequence using beam search.
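The feature selection step described above, choosing the question whose answers reduce tag entropy the most, can be sketched as follows. The event representation (a dictionary of feature values paired with a tag), the feature names, and the function names are assumptions for illustration; smoothing and recursive tree growth are omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(tags):
    """Entropy in bits of the relative-frequency distribution of tags."""
    n = len(tags)
    return -sum((k / n) * math.log2(k / n) for k in Counter(tags).values())


def entropy_reduction(events, feature):
    """Entropy of the whole set minus the weighted entropy of its subsets
    after splitting on the value of one feature."""
    subsets = defaultdict(list)
    for feats, tag in events:
        subsets[feats[feature]].append(tag)
    before = entropy([tag for _, tag in events])
    after = sum(len(s) / len(events) * entropy(s) for s in subsets.values())
    return before - after


def best_feature(events, features):
    """Feature whose split yields the largest entropy reduction."""
    return max(features, key=lambda f: entropy_reduction(events, f))


# Four toy events: (feature values, correct tag). All values are hypothetical.
events = [
    ({"word(-1)": "flies", "WordBits(word(0),1)": "0"}, "Prep"),
    ({"word(-1)": "flies", "WordBits(word(0),1)": "1"}, "Verb"),
    ({"word(-1)": "time", "WordBits(word(0),1)": "0"}, "Prep"),
    ({"word(-1)": "time", "WordBits(word(0),1)": "1"}, "Verb"),
]
chosen = best_feature(events, ["word(-1)", "WordBits(word(0),1)"])
```

In this toy set the word bits question separates the two tags perfectly, while the identity of the preceding word carries no information about the tag, so the word bits question is selected.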

【００７８】[0078]

For the tagging experiments, the present inventors used WSJ text and a corpus owned by the applicant (hereinafter, the ATR corpus). The WSJ text was manually re-tagged using the applicant's syntactic tag set. The ATR corpus is a comprehensive sample of written American English, covering a very wide range of styles and settings and drawn from many different domains. Because the ATR corpus is still under development, the amount of text available for this experiment is rather small considering the large size of the tag set. Table 4 shows the sizes of the texts used in this experiment.

【００７９】[0079]

【表４】[Table 4]
───────────────────────────────────
Text size (number of words)
              Training    Test      Held-out
───────────────────────────────────
WSJ text      75,139      5,831     6,534
ATR corpus    76,132      23,163    6,680
───────────────────────────────────

【００８０】[0080]

FIG. 20 shows the tagging error rate for various clustering text sizes. In this experiment, both types of question, basic questions and word bits questions, are used. To see the effect of introducing word bits information into the tagger, a separate control experiment was performed in which a randomly generated bit string was assigned to each word, and basic questions and word bits questions were used. (A distinct bit string is assigned to each word, and the tagger also uses the bit string as the identification number of each word being processed. In this control experiment the bit strings are assigned at random, but no two words have the same word bits. Random word bits give the tagger no class information other than the identity of the word.) The results of the control experiment are plotted at a clustering text size of zero. For both the WSJ text and the ATR corpus, the tagging error rate dropped by more than 30% when word bits information extracted from the 5 MW text was used, and it decreased further as the clustering text size increased. At 50 MW, the error rate dropped by 43%. This again shows that the quality of the clusters improves as the clustering text size increases. The overall high error rates are attributable to the very large tag set and the small training set. A notable point of this result is that, although the ATR corpus text and the WSJ text come from very different domains, introducing word bits constructed from WSJ text was as effective for tagging the ATR corpus text as it was for tagging the WSJ text. This suggests that the obtained hierarchical clusters are portable across domains.

【００８１】[0081]

As described above, the present inventors have proposed an algorithm for hierarchical clustering of words and have conducted clustering experiments using large text data ranging from 5 MW to 50 MW. The high quality of the obtained clusters has been confirmed by two kinds of evaluation. The perplexity of the class-based trigram model is 18% lower than that of the word-based trigram model. Introducing word bits into the applicant's decision-tree part-of-speech tagger reduces the tagging error rate by as much as 43%. The hierarchical clustering obtained from WSJ text was also found to be effective for tagging ATR corpus text, which belongs to an entirely different domain from the WSJ text.

【００８２】[0082]

【発明の効果】[Effects of the Invention]

As described in detail above, the word classification processing method according to claim 1 of the present invention comprises the steps of: examining, for text data containing a plurality of words, the occurrence frequencies of all v mutually different words, arranging the words in descending order of occurrence frequency, and assigning them to v classes; storing in a first storage device, as the words of the classes in one window, the words of the (c+1) classes with the highest occurrence frequencies among the words of the v classes, where (c+1) is less than v; classifying the plurality of words into c classes in the form of a binary tree, based on the words of the classes in the one window stored in the first storage device, so as to maximize a predetermined average mutual information that represents the relative ratio of the probability that a word of a first class and a word of a second class, different from each other, occur adjacently to the product of the occurrence probability of the words of the first class and the occurrence probability of the words of the second class, and storing the resulting c classes in a second storage device as the c classes of the intermediate layer of an overall tree diagram representing the word classification result; classifying the words of the c classes in the form of a binary tree until they form a single class, based on the words of the c classes of the intermediate layer stored in the second storage device, so as to maximize the average mutual information, and storing the classification result in a third storage device as the upper layer of the tree diagram; classifying, for each of the c classes of the intermediate layer stored in the second storage device, the plurality of words in that class in the form of a binary tree until they form a single class, so as to maximize the average mutual information, and storing the classification results for the respective classes in a fourth storage device as the lower layer of the tree diagram; and obtaining the tree diagram comprising the upper layer, the intermediate layer, and the lower layer by connecting the lower layer of the tree diagram stored in the fourth storage device to the c classes of the intermediate layer stored in the second storage device and connecting the upper layer of the tree diagram stored in the third storage device to the c classes of the intermediate layer stored in the second storage device, and storing the tree diagram in a fifth storage device as the word classification result. Accordingly, because the words are classified into a plurality of classes in the form of a binary tree that is layered into a lower layer, an intermediate layer, and an upper layer, the word classification process yields a classification result with a balanced hierarchical structure. Moreover, in the lower, intermediate, and upper layers alike, the average mutual information is calculated over the words of all classes, so the clustering is performed on the basis of a global average mutual information that includes information on all the words rather than a local average mutual information. A globally optimized word classification result can therefore be obtained. As a result, a more precise and accurate classification system can be obtained when a word classification system is acquired automatically from text data.

【００８３】[0083]

In the word classification processing method according to claim 2, which is the method according to claim 1, the step of storing the classified c classes in the second storage device is characterized in that, when a class exists outside the one window stored in the first storage device, or when the number of classes in the one window is not c, the words of the class that lies outside the current window and has the highest occurrence frequency are inserted into the window, and the binary-tree word classification process is then executed. An intermediate layer having the predetermined number c of classes can therefore be obtained in an optimized form.
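The windowed procedure of claims 1 and 2 for the intermediate layer can be sketched in Python as follows. This is a deliberately naive illustration under several assumptions: average mutual information is taken in its usual class-bigram form with base-2 logarithms, it is recomputed from scratch for every candidate merge (a practical implementation would update it incrementally), ties in frequency are broken alphabetically, and c is at least 1. The function names `ami` and `intermediate_layer` and the toy text are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ami(words, class_of):
    """Average mutual information of adjacent class pairs over the whole text."""
    seq = [class_of[w] for w in words]
    pairs = Counter(zip(seq, seq[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n
    return sum(n / total * math.log2(n * total / (left[a] * right[b]))
               for (a, b), n in pairs.items())


def intermediate_layer(words, c):
    """Merge frequency-ranked singleton classes inside a window of c+1 classes."""
    freq = Counter(words)
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    class_of = {w: i for i, w in enumerate(ranked)}   # one class per word
    window = list(range(min(c + 1, len(ranked))))     # c+1 most frequent classes
    next_outside = len(window)
    while len(window) > c:
        best = None
        for a, b in combinations(window, 2):
            # Try merging class b into class a and score the result globally.
            trial = {w: (a if k == b else k) for w, k in class_of.items()}
            score = ami(words, trial)
            if best is None or score > best[0]:
                best = (score, b, trial)
        _, merged_away, class_of = best
        window.remove(merged_away)
        if next_outside < len(ranked):
            # Insert the most frequent class still outside the window.
            window.append(next_outside)
            next_outside += 1
    groups = {}
    for w in ranked:
        groups.setdefault(class_of[w], []).append(w)
    return list(groups.values())


text = "the cat sat on the mat the dog sat on the rug a cat ran a dog ran".split()
classes = intermediate_layer(text, 3)
```

Because each candidate merge is scored over every class in the text, including those still outside the window, the score used here is global in the sense discussed in the description. The loop ends only when no class remains outside the window and exactly c classes are left.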

【００８４】[0084]

The word classification processing device according to claim 3 of the present invention comprises: first control means for examining, for text data containing a plurality of words, the occurrence frequencies of all v mutually different words, arranging the words in descending order of occurrence frequency, and assigning them to v classes; second control means for storing in a first storage device, as the words of the classes in one window, the words of the (c+1) classes with the highest occurrence frequencies among the words of the v classes, where (c+1) is less than v; third control means for classifying the plurality of words into c classes in the form of a binary tree, based on the words of the classes in the one window stored in the first storage device, so as to maximize a predetermined average mutual information that represents the relative ratio of the probability that a word of a first class and a word of a second class, different from each other, occur adjacently to the product of the occurrence probability of the words of the first class and the occurrence probability of the words of the second class, and for storing the resulting c classes in a second storage device as the c classes of the intermediate layer of an overall tree diagram representing the word classification result; fourth control means for classifying the words of the c classes in the form of a binary tree until they form a single class, based on the words of the c classes of the intermediate layer stored in the second storage device, so as to maximize the average mutual information, and for storing the classification result in a third storage device as the upper layer of the tree diagram; fifth control means for classifying, for each of the c classes of the intermediate layer stored in the second storage device, the plurality of words in that class in the form of a binary tree until they form a single class, so as to maximize the average mutual information, and for storing the classification results for the respective classes in a fourth storage device as the lower layer of the tree diagram; and sixth control means for obtaining the tree diagram comprising the upper layer, the intermediate layer, and the lower layer by connecting the lower layer of the tree diagram stored in the fourth storage device to the c classes of the intermediate layer stored in the second storage device and connecting the upper layer of the tree diagram stored in the third storage device to the c classes of the intermediate layer stored in the second storage device, and for storing the tree diagram in a fifth storage device as the word classification result. Accordingly, because the words are classified into a plurality of classes in the form of a binary tree that is layered into a lower layer, an intermediate layer, and an upper layer, the word classification process yields a classification result with a balanced hierarchical structure. Moreover, in the lower, intermediate, and upper layers alike, the average mutual information is calculated over the words of all classes, so the clustering is performed on the basis of a global average mutual information that includes information on all the words rather than a local average mutual information. A globally optimized word classification result can therefore be obtained. As a result, a more precise and accurate classification system can be obtained when a word classification system is acquired automatically from text data.

【００８５】また、請求項４記載の単語分類処理装置
は、請求項３記載の単語分類処理装置において、上記第
３の制御手段は、上記第１の記憶装置に記憶された１つ
のウィンドウよりも外側のクラスが存在し、又は上記１
つのウィンドウ内のクラスがｃ個ではないときは、現在
のウィンドウよりも外側にあり、最大の出現頻度を有す
るクラスの単語を上記ウィンドウ内に挿入した後、上記
二分木の形式の単語分類処理を実行する。従って、所定
の複数ｃ個のクラスを有する中間層を最適化形式で得る
ことができる。According to the word classification processing device of claim 4, in the word classification processing device of claim 3, when a class exists outside the one window stored in the first storage device, or when the number of classes in the one window is not c, the third control means inserts the words of the class that lies outside the current window and has the highest appearance frequency into the window, and then executes the word classification processing in the form of the binary tree. Therefore, an intermediate layer having the predetermined plurality of c classes can be obtained in an optimized form.
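
The window handling described above can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation of the device: the function names `cluster_intermediate_layer` and `two_rarest` are hypothetical, the merge criterion is passed in as a callback, and the callback used in the demonstration (merge the two least frequent classes) is only a placeholder for the average-mutual-information criterion.

```python
from collections import deque

def cluster_intermediate_layer(classes_by_freq, c, pick_pair):
    """Reduce frequency-sorted classes to at most c classes using one window.

    classes_by_freq: list of (words, count), sorted by descending count.
    pick_pair: callback returning indices (i, j), i < j, of the two window
               classes to merge (hypothetical stand-in for the AMI criterion).
    """
    items = [(frozenset(words), count) for words, count in classes_by_freq]
    window = items[:c + 1]           # the (c + 1) most frequent classes
    outside = deque(items[c + 1:])   # remaining classes, most frequent first
    while len(window) > c:
        i, j = pick_pair(window)
        (wi, ni), (wj, nj) = window[i], window[j]
        window[i] = (wi | wj, ni + nj)   # merge class j into class i
        del window[j]
        if outside:                      # refill from outside the window
            window.append(outside.popleft())
    return window

def two_rarest(window):
    """Placeholder criterion (an assumption): merge the two rarest classes."""
    order = sorted(range(len(window)), key=lambda k: window[k][1])
    i, j = sorted(order[:2])
    return i, j

vocab = [({"the"}, 50), ({"of"}, 40), ({"cat"}, 10), ({"dog"}, 9), ({"ran"}, 3)]
layer = cluster_intermediate_layer(vocab, 2, two_rarest)
print(len(layer))
```

Each pass merges one pair and then pulls in the most frequent class still outside the window, so the window alternates between c + 1 and c classes until nothing remains outside.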

【００８６】本発明に係る請求項５記載の音声認識装置
によれば、入力される発声音声の音声信号に基づいて、
請求項３又は４記載の単語分類処理装置によって複数の
単語が複数のクラスに分類された単語分類結果を含む単
語辞書と、所定の隠れマルコフモデルとを参照して上記
発声音声を音声認識する音声認識手段を備える。従っ
て、上記単語分類処理装置により得られた、バランスの
とれた階層構造を有しかつ全体的に最適化された単語辞
書に基づいて音声認識することにより、従来例に比較し
て高い認識率で音声認識することができる。The speech recognition device according to claim 5 of the present invention includes speech recognition means for recognizing uttered speech, based on the speech signal of the input uttered speech, by referring to a word dictionary containing a word classification result in which a plurality of words have been classified into a plurality of classes by the word classification processing device according to claim 3 or 4, and to a predetermined hidden Markov model. Therefore, by performing speech recognition based on a word dictionary that has a well-balanced hierarchical structure and is globally optimized, as obtained by the word classification processing device, speech can be recognized at a higher recognition rate than in the conventional example.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明に係る第１の実施形態である音声認識
装置のブロック図である。FIG. 1 is a block diagram of a speech recognition device according to a first embodiment of the present invention.

【図２】本発明に係る第２の実施形態である形態素及
び構文解析装置のブロック図である。FIG. 2 is a block diagram of a morphological and syntactic analysis device according to a second embodiment of the present invention.

【図３】図１及び図２の単語分類処理部によって実行
される単語分類処理における加算領域及び加減算処理を
示すクラスバイグラム平面テーブルの図である。FIG. 3 is a diagram of a class bigram plane table showing the addition areas and the addition/subtraction processing in the word classification process executed by the word classification processing unit of FIGS. 1 and 2.

【図４】（ａ）は上記単語分類処理におけるＡＭＩ減
少量Ｌ_k（ｌ，ｍ）に対する加算領域を示すクラスバイ
グラム平面テーブルの図であり、（ｂ）は上記単語分類
処理におけるＡＭＩ減少量Ｌ_k-1 ^(i,j)（ｌ，ｍ）に対す
る加算領域を示すクラスバイグラム平面テーブルの図で
ある。FIG. 4(a) is a diagram of a class bigram plane table showing the addition area for the AMI reduction amount L_k(l, m) in the word classification process, and FIG. 4(b) is a diagram of a class bigram plane table showing the addition area for the AMI reduction amount L_{k-1}^{(i,j)}(l, m) in the word classification process.

【図５】（ａ）は上記単語分類処理におけるマージ後
のＡＭＩ量Ｉｈ_kを示すクラスバイグラム平面テーブル
の図であり、（ｂ）は上記単語分類処理におけるマージ
後のＡＭＩ量Ｉｈ_k（ｌ，ｍ）を示すクラスバイグラム
平面テーブルの図であり、（ｃ）は上記単語分類処理に
おけるマージ後のＡＭＩ量Ｉｈ_k-1 ^(i, ^j)を示すクラスバ
イグラム平面テーブルの図であり、（ｄ）は上記単語分
類処理におけるマージ後のＡＭＩ量Ｉｈ_k-1 ^(i,j)（ｌ，
ｍ）を示すクラスバイグラム平面テーブルの図である。FIG. 5(a) is a diagram of a class bigram plane table showing the AMI amount Ih_k after the merge in the word classification process; FIG. 5(b) is a diagram of a class bigram plane table showing the AMI amount Ih_k(l, m) after the merge; FIG. 5(c) is a diagram of a class bigram plane table showing the AMI amount Ih_{k-1}^{(i,j)} after the merge; and FIG. 5(d) is a diagram of a class bigram plane table showing the AMI amount Ih_{k-1}^{(i,j)}(l, m) after the merge.

【図６】上記単語分類処理によって得られるデンドロ
グラム（ツリーの系統図）の一例を示す図である。FIG. 6 is a diagram showing an example of a dendrogram (tree system diagram) obtained by the word classification processing.

【図７】上記単語分類処理によって得られる左側方向
の分岐ツリーの一例を示す図である。FIG. 7 is a diagram illustrating an example of a left-side branch tree obtained by the word classification processing.

【図８】上記単語分類処理によって得られる１つのク
ラスに対するサブツリーの一例を示す図である。FIG. 8 is a diagram showing an example of a subtree for one class obtained by the word classification processing.

【図９】本発明の音声認識装置におけるシミュレーシ
ョン結果である、テキストの大きさに対するパープレキ
シティーを示すグラフである。FIG. 9 is a graph showing perplexity with respect to text size, which is a simulation result in the speech recognition device of the present invention.

【図１０】図１の単語分類処理部２０の構成を示すブ
ロック図である。FIG. 10 is a block diagram illustrating the configuration of the word classification processing unit 20 of FIG. 1.

【図１１】図１の単語分類処理部２０によって実行さ
れるメインルーチンの単語分類処理を示すフローチャー
トである。FIG. 11 is a flowchart illustrating the word classification process of the main routine executed by the word classification processing unit 20 of FIG. 1.

【図１２】図１１のサブルーチンの初期化処理（Ｓ
１）を示すフローチャートである。FIG. 12 is a flowchart showing the initialization process (S1) of the subroutine of FIG. 11.

【図１３】図１１のサブルーチンの中間層クラスタリ
ング処理（Ｓ２）を示すフローチャートである。FIG. 13 is a flowchart showing the intermediate layer clustering process (S2) of the subroutine of FIG. 11.

【図１４】図１１のサブルーチンの上側層クラスタリ
ング処理（Ｓ３）を示すフローチャートである。FIG. 14 is a flowchart showing the upper layer clustering process (S3) of the subroutine of FIG. 11.

【図１５】図１１のサブルーチンの下側層クラスタリ
ング処理（Ｓ４）を示すフローチャートである。FIG. 15 is a flowchart showing the lower layer clustering process (S4) of the subroutine of FIG. 11.

【図１６】図１１のサブルーチンのデータ出力処理
（Ｓ５）を示すフローチャートである。FIG. 16 is a flowchart showing the data output process (S5) of the subroutine of FIG. 11.

【図１７】図１１のサブルーチンの中間層クラスタリ
ング処理（Ｓ２）におけるステップＳ２１の処理を示
し、単語クラスの集合を示す図である。FIG. 17 is a diagram showing a process of step S21 in the intermediate layer clustering process (S2) of the subroutine of FIG. 11, and showing a set of word classes.

【図１８】図１１のサブルーチンの中間層クラスタリ
ング処理（Ｓ２）におけるステップＳ２３及びＳ２４の
処理を示し、単語クラスの集合を示す図である。FIG. 18 is a diagram showing a process of steps S23 and S24 in the intermediate layer clustering process (S2) of the subroutine of FIG. 11, and showing a set of word classes.

【図１９】図１１の単語分類処理における処理及びそ
の処理によって得られる階層構造を示す図である。FIG. 19 is a diagram showing the processing in the word classification process of FIG. 11 and the hierarchical structure obtained by that processing.

【図２０】本発明の音声認識装置のシミュレーション
結果である、テキストの大きさに対するクラスタリング
処理後のラベル付けの誤り率を示すグラフである。FIG. 20 is a graph showing the error rate of the labeling after the clustering process with respect to the text size, which is a simulation result of the speech recognition device of the present invention.

【符号の説明】[Explanation of symbols]

１…マイクロホン、２…Ａ／Ｄ変換器、３…特徴抽出部、４…バッファメモリ、５…音声認識部、１０，３１，３２…テキストデータメモリ、１１，４１，４２…単語辞書メモリ、１２…言語モデル、２０…単語分類処理部、２１…形態素解析部、２２…構文解析部、５０…ＣＰＵ、５１…ＲＯＭ、５２…ワークＲＡＭ、５３…ＲＡＭ、５４，５５…メモリインターフェース、６１…初期化クラス単語メモリ、６２…ＡＭＩメモリ、６３…中間層メモリ、６４…上側層ヒストリメモリ、６５…上側層ツリーメモリ、６６…下側層ヒストリメモリ、６７…下側層ツリーメモリ、６８…ツリーメモリ、１００…中間層、１０１…上側層、１０２…下側層、Ｓ１…初期化処理、Ｓ２…中間層クラスタリング処理、Ｓ３…上側層クラスタリング処理、Ｓ４…下側層クラスタリング処理、Ｓ５…データ出力処理。 1 ... microphone, 2 ... A/D converter, 3 ... feature extraction unit, 4 ... buffer memory, 5 ... speech recognition unit, 10, 31, 32 ... text data memory, 11, 41, 42 ... word dictionary memory, 12 ... language model, 20 ... word classification processing unit, 21 ... morphological analysis unit, 22 ... syntax analysis unit, 50 ... CPU, 51 ... ROM, 52 ... work RAM, 53 ... RAM, 54, 55 ... memory interface, 61 ... initialized class word memory, 62 ... AMI memory, 63 ... intermediate layer memory, 64 ... upper layer history memory, 65 ... upper layer tree memory, 66 ... lower layer history memory, 67 ... lower layer tree memory, 68 ... tree memory, 100 ... intermediate layer, 101 ... upper layer, 102 ... lower layer, S1 ... initialization process, S2 ... intermediate layer clustering process, S3 ... upper layer clustering process, S4 ... lower layer clustering process, S5 ... data output process.

フロントページの続き (56)参考文献特開平３−131967（ＪＰ，Ａ) 特開平３−111972（ＪＰ，Ａ) 特開昭63−172372（ＪＰ，Ａ) 柏岡秀紀他、“７Ｇ−５相互情報量を用いた単語の分類における出現頻度の低い単語の処理方法”、情報処理学会第 49回（平成６年度後期）全国大会講演論文集（３）、平成６年９月28日〜30日、ｐ．３−185〜３−186 (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06F 17/27 G10L 3/00 Continuation of the front page. (56) References: JP-A-3-131967 (JP, A); JP-A-3-111972 (JP, A); JP-A-63-172372 (JP, A); Hideki Kashioka et al., "7G-5 Processing Method for Low-Frequency Words in Word Classification Using Mutual Information", Proceedings of the 49th National Convention of the Information Processing Society of Japan (second half of fiscal 1994), (3), September 28-30, 1994, pp. 3-185 to 3-186. (58) Fields searched (Int. Cl.⁷, DB name): G06F 17/27; G10L 3/00

Claims

(57)【特許請求の範囲】(57) [Claims]

【請求項１】複数の単語を含むテキストデータに対し
て、互いに異なるすべての複数ｖ個の単語の出現頻度を
調べ、出現頻度の高い単語から順に並べて、複数ｖ個の
クラスに割り当てるステップと、上記複数ｖ個のクラスの単語のうち出現頻度が高いｖ個
未満の（ｃ＋１）個のクラスの単語を１つのウィンドウ
内のクラスの単語として第１の記憶装置に記憶するステ
ップと、上記第１の記憶装置に記憶された１つのウィンドウ内の
クラスの単語に基づいて、第１のクラスの単語の出現確
率と第２のクラスの単語の出現確率との積に対する、互
いに異なる第１のクラスの単語と第２のクラスの単語と
が隣接して出現する確率の相対的な割合を表わす所定の
平均相互情報量が最大となるように、上記複数の単語を
二分木の形式で複数ｃ個のクラスに分類し、分類された
複数ｃ個のクラスを、単語分類結果を表わす全体のツリ
ー図の中間層の複数ｃ個のクラスとして第２の記憶装置
に記憶するステップと、上記第２の記憶装置に記憶された中間層の複数ｃ個のク
ラスの単語に基づいて、上記平均相互情報量が最大とな
るように、上記複数ｃ個のクラスの単語を二分木の形式
で１個のクラスになるまで分類し、当該分類結果を上記
ツリー図の上側層として第３の記憶装置に記憶するステ
ップと、上記第２の記憶装置に記憶された中間層の複数ｃ個のク
ラスの各クラス毎に、上記中間層の複数ｃ個のクラスの
各クラス内の複数の単語に基づいて、上記平均相互情報
量が最大となるように、上記複数の単語を二分木の形式
で１個のクラスになるまでそれぞれ分類し、当該各クラ
ス毎の複数の分類結果を上記ツリー図の下側層として第
４の記憶装置に記憶するステップと、上記第４の記憶装置に記憶された上記ツリー図の下側層
を、上記第２の記憶装置に記憶された上記中間層の複数
ｃ個のクラスと連結する一方、上記第３の記憶装置に記
憶された上記ツリー図の上側層を、上記第２の記憶装置
に記憶された上記中間層の複数ｃ個のクラスと連結する
ことにより、上側層と中間層と下側層とを備えた上記ツ
リー図を求めて単語分類結果として第５の記憶装置に記
憶するステップとを備えたことを特徴とする単語分類処
理方法。1. A word classification processing method comprising the steps of: examining, for text data containing a plurality of words, the appearance frequencies of all v mutually different words, arranging the words in descending order of appearance frequency, and assigning them to v classes; storing in a first storage device, as the words of the classes in one window, the words of (c + 1) classes, fewer than v, that have the highest appearance frequencies among the words of the v classes; classifying the plurality of words into c classes in the form of a binary tree, based on the words of the classes in the one window stored in the first storage device, so as to maximize a predetermined average mutual information that represents the relative ratio of the probability that mutually different words of a first class and words of a second class appear adjacent to each other to the product of the appearance probability of the words of the first class and the appearance probability of the words of the second class, and storing the c classified classes in a second storage device as the c classes of the intermediate layer of an overall tree diagram representing the word classification result; classifying the words of the c classes in the form of a binary tree until they become one class, based on the words of the c classes of the intermediate layer stored in the second storage device, so as to maximize the average mutual information, and storing the classification result in a third storage device as the upper layer of the tree diagram; classifying, for each of the c classes of the intermediate layer stored in the second storage device, the plurality of words in that class in the form of a binary tree until they become one class, so as to maximize the average mutual information, and storing the plurality of classification results for the respective classes in a fourth storage device as the lower layer of the tree diagram; and obtaining the tree diagram comprising the upper layer, the intermediate layer, and the lower layer by connecting the lower layer of the tree diagram stored in the fourth storage device to the c classes of the intermediate layer stored in the second storage device, while connecting the upper layer of the tree diagram stored in the third storage device to the c classes of the intermediate layer stored in the second storage device, and storing the tree diagram in a fifth storage device as the word classification result.
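
The merge criterion in claim 1 can be illustrated with a small brute-force sketch. It assumes Python, a token list as the text data, base-2 logarithms, and the hypothetical function names `ami` and `best_merge`; none of these come from the claim. It recomputes the average mutual information from scratch for every candidate pair, whereas the class bigram tables of FIGS. 3 to 5 suggest an incremental computation that this sketch does not attempt to reproduce.

```python
import math
from collections import Counter
from itertools import combinations

def ami(tokens, word2class):
    """Average mutual information between adjacent classes in a token list."""
    classes = [word2class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n    # how often class a appears as the first element
        right[b] += n   # how often class b appears as the second element
    return sum((n / total) * math.log2(n * total / (left[a] * right[b]))
               for (a, b), n in bigrams.items())

def best_merge(tokens, word2class):
    """Try every class pair; return (score, a, b) for the merge keeping AMI highest."""
    labels = sorted(set(word2class.values()))
    best = None
    for a, b in combinations(labels, 2):
        # Merging relabels every word of class b as class a.
        merged = {w: (a if c == b else c) for w, c in word2class.items()}
        score = ami(tokens, merged)
        if best is None or score > best[0]:
            best = (score, a, b)
    return best

tokens = "a x b x a x b x".split()
one_class_each = {w: w for w in set(tokens)}
score, first, second = best_merge(tokens, one_class_each)
print(first, second)
```

In this toy text the words "a" and "b" occur in the same contexts, so merging them loses the least information and that pair is selected.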

【請求項２】上記分類された複数ｃ個のクラスを上記
第２の記憶装置に記憶するステップは、上記第１の記憶
装置に記憶された１つのウィンドウよりも外側のクラス
が存在し、又は上記１つのウィンドウ内のクラスがｃ個
ではないときは、現在のウィンドウよりも外側にあり、
最大の出現頻度を有するクラスの単語を上記ウィンドウ
内に挿入した後、上記二分木の形式の単語分類処理を実
行することを特徴とする請求項１記載の単語分類処理方
法。2. The word classification processing method according to claim 1, wherein, in the step of storing the c classified classes in the second storage device, when a class exists outside the one window stored in the first storage device, or when the number of classes in the one window is not c, the words of the class that lies outside the current window and has the highest appearance frequency are inserted into the window, and the word classification processing in the form of the binary tree is then executed.

【請求項３】複数の単語を含むテキストデータに対し
て、互いに異なるすべての複数ｖ個の単語の出現頻度を
調べ、出現頻度の高い単語から順に並べて、複数ｖ個の
クラスに割り当てる第１の制御手段と、上記複数ｖ個のクラスの単語のうち出現頻度が高いｖ個
未満の（ｃ＋１）個のクラスの単語を１つのウィンドウ
内のクラスの単語として第１の記憶装置に記憶する第２
の制御手段と、上記第１の記憶装置に記憶された１つのウィンドウ内の
クラスの単語に基づいて、第１のクラスの単語の出現確
率と第２のクラスの単語の出現確率との積に対する、互
いに異なる第１のクラスの単語と第２のクラスの単語と
が隣接して出現する確率の相対的な割合を表わす所定の
平均相互情報量が最大となるように、上記複数の単語を
二分木の形式で複数ｃ個のクラスに分類し、分類された
複数ｃ個のクラスを、単語分類結果を表わす全体のツリ
ー図の中間層の複数ｃ個のクラスとして第２の記憶装置
に記憶する第３の制御手段と、上記第２の記憶装置に記憶された中間層の複数ｃ個のク
ラスの単語に基づいて、上記平均相互情報量が最大とな
るように、上記複数ｃ個のクラスの単語を二分木の形式
で１個のクラスになるまで分類し、当該分類結果を上記
ツリー図の上側層として第３の記憶装置に記憶する第４
の制御手段と、上記第２の記憶装置に記憶された中間層の複数ｃ個のク
ラスの各クラス毎に、上記中間層の複数ｃ個のクラスの
各クラス内の複数の単語に基づいて、上記平均相互情報
量が最大となるように、上記複数の単語を二分木の形式
で１個のクラスになるまでそれぞれ分類し、当該各クラ
ス毎の複数の分類結果を上記ツリー図の下側層として第
４の記憶装置に記憶する第５の制御手段と、上記第４の記憶装置に記憶された上記ツリー図の下側層
を、上記第２の記憶装置に記憶された上記中間層の複数
ｃ個のクラスと連結する一方、上記第３の記憶装置に記
憶された上記ツリー図の上側層を、上記第２の記憶装置
に記憶された上記中間層の複数ｃ個のクラスと連結する
ことにより、上側層と中間層と下側層とを備えた上記ツ
リー図を求めて単語分類結果として第５の記憶装置に記
憶する第６の制御手段とを備えたことを特徴とする単語
分類処理装置。3. A word classification processing device comprising: first control means for examining, for text data containing a plurality of words, the appearance frequencies of all v mutually different words, arranging the words in descending order of appearance frequency, and assigning them to v classes; second control means for storing in a first storage device, as the words of the classes in one window, the words of (c + 1) classes, fewer than v, that have the highest appearance frequencies among the words of the v classes; third control means for classifying the plurality of words into c classes in the form of a binary tree, based on the words of the classes in the one window stored in the first storage device, so as to maximize a predetermined average mutual information that represents the relative ratio of the probability that mutually different words of a first class and words of a second class appear adjacent to each other to the product of the appearance probability of the words of the first class and the appearance probability of the words of the second class, and for storing the c classified classes in a second storage device as the c classes of the intermediate layer of an overall tree diagram representing the word classification result; fourth control means for classifying the words of the c classes in the form of a binary tree until they become one class, based on the words of the c classes of the intermediate layer stored in the second storage device, so as to maximize the average mutual information, and for storing the classification result in a third storage device as the upper layer of the tree diagram; fifth control means for classifying, for each of the c classes of the intermediate layer stored in the second storage device, the plurality of words in that class in the form of a binary tree until they become one class, so as to maximize the average mutual information, and for storing the plurality of classification results for the respective classes in a fourth storage device as the lower layer of the tree diagram; and sixth control means for obtaining the tree diagram comprising the upper layer, the intermediate layer, and the lower layer by connecting the lower layer of the tree diagram stored in the fourth storage device to the c classes of the intermediate layer stored in the second storage device, while connecting the upper layer of the tree diagram stored in the third storage device to the c classes of the intermediate layer stored in the second storage device, and for storing the tree diagram in a fifth storage device as the word classification result.
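
The linking step performed by the sixth control means can be pictured with a short sketch. It assumes Python and a nested-tuple representation of binary trees, where the leaves of the upper tree are intermediate-layer class identifiers and each class has its own lower subtree of words; the names `join_layers` and `paths`, and the idea of printing a root-to-leaf path for each word, are illustrative assumptions and are not recited in the claim.

```python
def join_layers(upper, lower_by_class):
    """Replace each leaf of the upper tree (a class id) with that class's subtree."""
    if isinstance(upper, tuple):
        left, right = upper
        return (join_layers(left, lower_by_class),
                join_layers(right, lower_by_class))
    return lower_by_class[upper]

def paths(tree, prefix=""):
    """Map each word to its root-to-leaf path (0 = left branch, 1 = right branch)."""
    if isinstance(tree, tuple):
        result = paths(tree[0], prefix + "0")
        result.update(paths(tree[1], prefix + "1"))
        return result
    return {tree: prefix}

# Upper layer over three intermediate-layer classes 0, 1, 2 (toy data).
upper = ((0, 1), 2)
# Lower layer: one binary subtree of words per intermediate-layer class.
lower = {0: ("the", "a"), 1: "cat", 2: (("ran", "sat"), "slept")}

full_tree = join_layers(upper, lower)
print(paths(full_tree))
```

Words that share a long path prefix sit in the same lower subtree, while a short shared prefix only indicates a common branch in the upper layer.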

【請求項４】上記第３の制御手段は、上記第１の記憶
装置に記憶された１つのウィンドウよりも外側のクラス
が存在し、又は上記１つのウィンドウ内のクラスがｃ個
ではないときは、現在のウィンドウよりも外側にあり、
最大の出現頻度を有するクラスの単語を上記ウィンドウ
内に挿入した後、上記二分木の形式の単語分類処理を実
行することを特徴とする請求項３記載の単語分類処理装
置。4. The word classification processing device according to claim 3, wherein, when a class exists outside the one window stored in the first storage device, or when the number of classes in the one window is not c, the third control means inserts the words of the class that lies outside the current window and has the highest appearance frequency into the window, and then executes the word classification processing in the form of the binary tree.

【請求項５】入力される発声音声の音声信号に基づい
て、請求項３又は４記載の単語分類処理装置によって複
数の単語が複数のクラスに分類された単語分類結果を含
む単語辞書と、所定の隠れマルコフモデルとを参照して
上記発声音声を音声認識する音声認識手段を備えたこと
を特徴とする音声認識装置。5. A speech recognition device comprising speech recognition means for recognizing uttered speech, based on a speech signal of the input uttered speech, by referring to a word dictionary containing a word classification result in which a plurality of words have been classified into a plurality of classes by the word classification processing device according to claim 3 or 4, and to a predetermined hidden Markov model.