JP2010238043A - Text analysis learning device

Info

Publication number: JP2010238043A
Application number: JP2009086407A
Authority: JP
Inventors: Koichi Tanigaki; 宏一谷垣; Yasuhiro Takayama; 泰博高山
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2009-03-31
Filing date: 2009-03-31
Publication date: 2010-10-21

Abstract

PROBLEM TO BE SOLVED: To provide a text analysis learning device that achieves highly accurate text analysis through learning that makes efficient use of both labeled learning data and unlabeled learning data.

SOLUTION: The likelihood of the analysis result indicated by the label attached to labeled learning data is calculated. An evaluation value is calculated that indicates the degree of consistency between the analysis result for unlabeled learning data and the analysis results for input sentences belonging to the same category. The values of the model parameters corresponding to feature data are updated so that an objective function based on the likelihood and the consistency evaluation value is maximized, and this update, based on the likelihood and the evaluation value recalculated with the updated parameter values, is repeated until the updated values satisfy a predetermined convergence condition. An analysis dictionary to be used by a text analyzer is then generated from the model parameters that satisfy the convergence condition, the feature data, and the list of labels.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a text analysis learning device.

Some of the processes generally called text analysis can be regarded as the problem of assigning structured labels to input data. For example, word segmentation of a Japanese sentence can be viewed as the problem of assigning, to the input character string, labels that indicate the presence or absence of word boundary positions.

FIG. 14 shows an example of Japanese word segmentation by labeling. In FIG. 14, the input sentence is 「アイビーキャリアカレッジ」 ("Ivy Career College"). The label "1" indicates that a word boundary lies immediately after the character, and the label "0" indicates that it does not. The input sentence is therefore divided into three words: 「アイビー」 ("Ivy"), 「キャリア」 ("Career"), and 「カレッジ」 ("College").
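The labeling view of word segmentation can be illustrated with a minimal Python sketch. The function name `segment` and the concrete label sequence are assumptions made for illustration (the actual labels of FIG. 14 are not reproduced here); the sketch only assumes one 0/1 label per character, with "1" meaning a boundary immediately after that character.

```python
def segment(text, labels):
    """Split text into words; labels[i] == 1 means a word boundary follows text[i]."""
    if len(text) != len(labels):
        raise ValueError("one label per character is required")
    words, start = [], 0
    for i, label in enumerate(labels):
        if label == 1:
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # trailing characters after the last boundary
        words.append(text[start:])
    return words


sentence = "アイビーキャリアカレッジ"
# Assumed labels: a boundary after the 4th, 8th and 12th characters.
labels = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
print(segment(sentence, labels))
```

Under these assumed labels the call yields the three words described above.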

Conventionally, such text analysis has been implemented with manually written analysis rules. Because hand-written rules have limited scalability (coverage and maintainability), in recent years it has become common to prepare labeled learning data such as that in FIG. 14 and to acquire the analysis rules automatically by statistical methods.

With statistical methods, the machine automatically handles the extraction of individual analysis rules and the dependencies among them. The advantage is that a highly accurate text analyzer is obtained automatically, provided only that a large amount of labeled learning data sufficiently covering the target field is prepared.

In statistical methods, therefore, the conventional problem of writing rules is replaced by the problem of securing a sufficient amount of labeled learning data. Labeling, however, still requires expertise in both the target domain and grammar and remains a costly task, so it is difficult to prepare a sufficient amount of labeled learning data for the target field of the text analyzer to be built.

Securing the labeled learning data needed for training is thus a major issue in building text analyzers by statistical methods. One approach to this problem is semi-supervised learning, which additionally uses learning data to which no correct label has been assigned (unlabeled learning data). The present invention is also a form of semi-supervised learning.

In the word segmentation example above, unlabeled learning data is plain raw data to which no word boundaries have been given. It corresponds to a list of input sentences for the text analyzer, for example 「アーツカレッジヨコハマ」, 「アートカレッジ専門学校」, and 「アートガレージかわさき」. Such raw data can be obtained and used in large quantities at low cost compared with labeled learning data.

One conventional technique that uses such unlabeled learning data is disclosed in Patent Document 1, for example.

Japanese Unexamined Patent Application Publication No. 2008-225907 (JP 2008-225907 A)

The conventional technique that uses unlabeled learning data is a kind of hybrid learning method that trains two models of different nature (a discriminative model and a generative model), using the analysis results of one model on the unlabeled learning data for supervised training of the other. Because it exploits likelihood differences that arise essentially from differences in the interpolation characteristics of the models, it has the advantage of being highly general with respect to analysis tasks and data.

However, the conventional method described above does not rely on a priori knowledge, so it uses unlabeled learning data inefficiently and its accuracy is difficult to improve.

The present invention has been made to solve the problem described above, and its object is to obtain a text analysis learning device that can achieve highly accurate text analysis through learning that makes efficient use of both labeled learning data and unlabeled learning data.

The text analysis learning device according to the present invention comprises:
a labeled learning data storage unit that stores, as labeled learning data, combinations of an input sentence for a text analyzer, a label indicating the correct analysis result of the input sentence, and category information indicating the category to which the input sentence belongs;
an unlabeled learning data storage unit that stores, as unlabeled learning data, combinations of an input sentence for the text analyzer and category information indicating the category to which the input sentence belongs;
a label storage unit that stores a list of labels;
a feature data storage unit that stores, as feature data corresponding to an input sentence, combinations of a matching condition for the input sentence and a label of the input sentence;
a feature template storage unit that stores feature templates in which part of the matching condition and the label are variables;
a feature generation unit that receives the labeled learning data, the unlabeled learning data, the list of labels, and the feature templates, and generates the feature data by extracting, from the labeled learning data, the unlabeled learning data, and the list of labels, the character strings and labels corresponding to the part of the matching condition and to the label, and rewriting the variables of the feature templates with them;
a model parameter storage unit that stores model parameters corresponding to the feature data;
likelihood evaluation means that analyzes the input sentences of the labeled learning data on the basis of the feature data, the model parameters, and the list of labels, and calculates the likelihood of the analysis result indicated by the label attached to the labeled learning data;
consistency evaluation means that analyzes the input sentences of the unlabeled learning data on the basis of the feature data, the model parameters, and the list of labels, and calculates an evaluation value indicating the degree of consistency between the analysis result for the unlabeled learning data and the analysis results for input sentences belonging to the same category;
update means that updates the values of the model parameters so that an objective function based on the likelihood calculated by the likelihood evaluation means and the evaluation value calculated by the consistency evaluation means is maximized, and repeats the update, based on the likelihood and the evaluation value calculated with the updated values, until the updated values satisfy a predetermined convergence condition; and
analysis dictionary output means that generates an analysis dictionary used by the text analyzer from the feature data, the model parameters that satisfy the predetermined convergence condition, and the list of labels.

According to the present invention, the likelihood of the analysis result indicated by the label attached to the labeled learning data is calculated, and an evaluation value is calculated that indicates the degree of consistency between the analysis result for the unlabeled learning data and the analysis results for input sentences belonging to the same category. The values of the model parameters corresponding to the feature data are updated so that an objective function based on the likelihood and the consistency evaluation value is maximized, and this update, based on the likelihood and the evaluation value calculated with the updated values, is repeated until the updated values satisfy a predetermined convergence condition. The analysis dictionary used by the text analyzer is then generated from the model parameters that satisfy the convergence condition, the feature data, and the list of labels. In this way, in addition to the likelihood of the labels in the labeled learning data, the tendency of data in the same category to have mutually similar labels is taken into account, so that knowledge can also be acquired efficiently from the unlabeled learning data. Using the analysis dictionary generated in this way improves the analysis accuracy of the text analyzer.

FIG. 1 is a diagram showing an outline of the text analysis learning process according to the present invention.
FIG. 2 is a block diagram showing the configuration of the text analysis learning device according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the flow of the feature generation process according to Embodiment 1.
FIG. 4 is a diagram showing an example of labeled learning data.
FIG. 5 is a diagram showing an example of feature templates.
FIG. 6 is a diagram showing an example of the result of replacing the variable values of the feature templates.
FIG. 7 is a diagram showing an example of feature data.
FIG. 8 is a diagram showing an example of unlabeled learning data.
FIG. 9 is a flowchart showing the flow of the likelihood evaluation process according to Embodiment 1.
FIG. 10 is a diagram showing an example of a word segmentation hypothesis graph.
FIG. 11 is a diagram showing the word segmentation hypothesis graph of FIG. 10 with subgraphs marked.
FIG. 12 is a flowchart showing the flow of the consistency evaluation process.
FIG. 13 is a diagram showing an example of a text analysis dictionary.
FIG. 14 is a diagram showing an example of Japanese word segmentation by labeling.

Embodiment 1.
In the following, as an example of text analysis, an analysis process that divides a Japanese character string (input sentence) corresponding to a facility name into words is taken up, and the configuration and operation of the text analysis learning device according to the present invention when learning this analysis process are described. The present invention is not, however, limited to learning word segmentation. It is also applicable to various kinds of text analysis such as morphological analysis including part-of-speech identification, named entity extraction, and syntactic parsing.

FIG. 1 is a diagram showing an outline of the text analysis learning process according to the present invention. What this process ultimately generates is a text analysis dictionary in which probabilistic rules for text analysis are described. In Embodiment 1, a CRF (Conditional Random Fields) (see Reference 1) is used as the probability estimation model, but the invention is not limited to this, and other probabilistic models such as the maximum entropy method or Bayesian networks may be used.
Reference 1:
Lafferty, J., et al., "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Proc. of ICML-2001, pp. 282-289 (2001).

Using the text analysis dictionary generated by the text analysis learning device according to the present invention, the text analyzer, as shown in FIG. 1, appropriately divides the input sentence 「アイビーキャリアカレッジ」 into words and outputs 「アイビー／キャリア／カレッジ」 as the analysis result.

In FIG. 1, the learning data used for the text analysis learning process is a facility name list (the original data). Labeled learning data is produced by manually assigning correct word boundary positions (word segmentation tagging) to part of this list. The slash 「／」 in the labeled learning data of FIG. 1 marks a word boundary position. Because labeling requires manual work, only a limited (small) amount of labeled learning data can be constructed. The original data without labels, on the other hand, exists in large quantity, so the text analysis learning device of the present invention uses it as learning data (unlabeled learning data).

As shown in FIG. 1, the facility name list carries attribute information (business type, address, and so on) used by the original application. Here, the business type is treated as the category information of the facility name and serves as a clue when extracting word segmentation rules from the unlabeled learning data. Category information in the present invention means, for example, attribute information of the learning data that can be obtained from the application, or information on the source from which the data was extracted.

FIG. 2 is a block diagram showing the configuration of the text analysis learning device according to Embodiment 1 of the present invention. In FIG. 2, the text analysis learning device 1 according to Embodiment 1 includes a labeled learning data storage unit 2, a feature template storage unit 3, an unlabeled learning data storage unit 4, a feature generation unit 5, a feature data storage unit 6, a label storage unit 7, a model parameter storage unit 8, correct label likelihood evaluation means (likelihood evaluation means) 9, intra-category consistency evaluation means (consistency evaluation means) 10, parameter update means (update means) 11, and analysis dictionary output means 12.

The labeled learning data storage unit 2 is a storage unit that is given in advance, and stores as labeled learning data, multiple combinations of a sample input sentence for the text analyzer, the category of the input sentence, and the correct word boundary positions (labels) of the input sentence. In the example of FIG. 1, combinations of a sample facility name, the category (business type) of the input sentence, and the word boundary positions (labels) of the input sentence are stored.
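One way to picture a stored labeled entry is the following Python sketch. The class `LabeledExample`, the function `parse_labeled`, and the pairing of this facility name with the category 「各種学校」 are hypothetical choices for illustration. The sketch also assumes that the last character of every word, including the final one, receives label 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledExample:
    text: str          # input sentence without boundary marks
    category: str      # category information, e.g. business type
    labels: List[int]  # labels[i] == 1: word boundary right after text[i]


def parse_labeled(marked: str, category: str) -> LabeledExample:
    """Convert a slash-delimited string such as 'アート／ＰＣ／教室' into an example."""
    words = [w for w in marked.replace("／", "/").split("/") if w]
    text = "".join(words)
    labels = [0] * len(text)
    pos = 0
    for word in words:
        pos += len(word)
        labels[pos - 1] = 1  # assumption: the last character of each word is labeled 1
    return LabeledExample(text, category, labels)


example = parse_labeled("アート／ＰＣ／教室", "各種学校")
print(example.text, example.labels)
```

An unlabeled entry would carry only the `text` and `category` fields.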

The feature template storage unit 3 is a storage unit that stores predefined feature templates, which serve as patterns for generating features. Here, a feature is a combination of a matching condition for an input sentence to the text analyzer and a label. It represents a characteristic of a word boundary hypothesis (where a cut is being attempted) that the correct label likelihood evaluation means 9 and the intra-category consistency evaluation means 10 generate internally. In a feature template, as described later with reference to FIG. 7, part of the matching condition for the input sentence and the label are defined as variables.

The unlabeled learning data storage unit 4 is a storage unit that is given in advance, and stores as unlabeled learning data, combinations of a sample input sentence for the text analyzer and the category of the input sentence. In the example of FIG. 1, combinations of a sample facility name and the category (business type) of the input sentence are stored. No label indicating the correct word boundary positions is attached to this unlabeled learning data.

The feature generation unit 5 generates features by rewriting the variables of the feature templates using the labeled learning data, the unlabeled learning data, and the list of labels read from the label storage unit 7. The feature data storage unit 6 is a storage unit that stores and holds the features generated by the feature generation unit 5.

The label storage unit 7 is a storage unit that stores a list of labels indicating the presence or absence of a word boundary position (labels indicating the correct analysis result of the text analyzer). This list of labels is used when the text analyzer outputs its analysis result. The model parameter storage unit 8 is a storage unit that holds initial values when processing by the parameter update means 11 starts and, after processing has started, holds the model parameters (real-valued parameters corresponding to the features) that are successively updated by the parameter update means 11.

The correct label likelihood evaluation means 9 analyzes the input sentences of the labeled learning data on the basis of the features, the model parameters, and the list of labels, and calculates the likelihood of the word boundary positions indicated by the labels attached to the labeled learning data (the likelihood of the correct analysis result of the input sample).

The intra-category consistency evaluation means 10 analyzes the word boundaries of the input sentences of the unlabeled learning data on the basis of the features, the model parameters, and the list of labels, and evaluates how consistent the word segmentation result for the unlabeled learning data is with the analysis results for unlabeled learning data (input samples) belonging to the same category.

The parameter update means 11 updates the model parameters on the basis of the likelihood of the word boundary positions indicated by the labels attached to the labeled learning data (the likelihood of the correct analysis result of the input sample) and the consistency, within the same category, of the word segmentation results for the unlabeled learning data. If the updated model parameters do not satisfy a predetermined convergence condition, they are used for another iteration: the correct label likelihood evaluation means 9 recalculates the likelihood of the analysis result for the labeled learning data, the intra-category consistency evaluation means 10 recalculates the consistency, and the parameter update means 11 updates the model parameters again on the basis of these results.
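The iterate-until-convergence structure of this update can be sketched in Python as follows. This is only an illustration under stated assumptions: the objective is taken to be a weighted sum of the two evaluation values, the update rule is plain gradient ascent with a numerical gradient, and the convergence test is a threshold on the parameter change. None of these choices, nor the names `train`, `weight`, `step`, and `tol`, come from this text, and the two toy functions merely stand in for the likelihood and the consistency evaluation value.

```python
from typing import Callable, List

Objective = Callable[[List[float]], float]


def numerical_gradient(f: Objective, theta: List[float], eps: float = 1e-6) -> List[float]:
    """Central-difference approximation of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def train(log_likelihood: Objective, consistency: Objective, theta: List[float],
          weight: float = 1.0, step: float = 0.1, tol: float = 1e-8,
          max_iter: int = 10_000) -> List[float]:
    """Repeat parameter updates until the largest parameter change falls below tol."""
    def objective(t: List[float]) -> float:
        return log_likelihood(t) + weight * consistency(t)

    for _ in range(max_iter):
        grad = numerical_gradient(objective, theta)
        new_theta = [t + step * g for t, g in zip(theta, grad)]
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tol:  # assumed convergence condition
            break
    return theta


# Toy stand-ins for the two evaluation values (not the patent's formulas).
toy_likelihood = lambda t: -(t[0] - 1.0) ** 2
toy_consistency = lambda t: -(t[1] + 2.0) ** 2
result = train(toy_likelihood, toy_consistency, [0.0, 0.0])
print(result)
```

With these toy functions the parameters approach the maximizer of the combined objective, which lies at (1, -2).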

The analysis dictionary output means 12 generates and outputs a text analysis dictionary, in which probabilistic rules for text analysis are described, using the features read from the feature data storage unit 6, the model parameters read from the model parameter storage unit 8, and the list of labels read from the label storage unit 7.

Next, the operation will be described.
The text analysis learning device according to Embodiment 1 generates the target text analysis dictionary in three stages. The first stage is pre-processing for learning, in which the feature generation unit 5 generates feature data. The second stage is the actual learning process, in which the correct label likelihood evaluation means 9, the intra-category consistency evaluation means 10, and the parameter update means 11 generate the model parameters. The third stage is post-processing for learning, in which the analysis dictionary output means 12 converts the format of the model parameters and generates the text analysis dictionary.

(1) Pre-processing for learning (first stage)
FIG. 3 is a flowchart showing the flow of the feature generation process according to Embodiment 1, and the details of the process are described with reference to this figure.
First, the feature generation unit 5 takes one unprocessed labeled learning data item from the labeled learning data storage unit 2 (step ST1). Here, it is assumed that labeled learning data as shown in FIG. 4 has been taken out. In the example of FIG. 4, multiple character strings that are input sentences (facility names), such as 「アースビジネスカレッジ」, are stored together with their correct word boundary positions 「／」. Each entry is also given, as secondary information, the category information of the facility, such as 「各種学校」 ("miscellaneous schools") or 「自動車整備」 ("automobile maintenance"). The feature generation unit 5 then takes an unprocessed character position in the extracted labeled learning data as the boundary point of interest (step ST2). In labeled learning data, the word boundary positions specified by the labels are defined as word boundary points of interest.

Next, the feature generation unit 5 takes the feature templates from the feature template storage unit 3 and uses them to generate features at the current point of interest (step ST3). Here, for example, the feature templates shown in FIG. 5 are taken out. These feature templates are defined in advance, for example by the designer of the text analyzer, and held in the feature template storage unit 3.

In FIG. 5, the numbers 1, 2, ... in the leftmost column are feature template numbers, which are serial numbers assigned to the feature templates. In the example of FIG. 5, 26 kinds of feature templates are defined. A part beginning with "%", such as %l[0], is a variable part of the feature template, and the number in [] indicates the relative position (in characters) from the point of interest. %l[] is a label indicating whether a word boundary is present. %c[] represents the character at the position specified by the number in []. %s[] is a variable representing the character type of the character at the position specified by the number in []. The feature generation unit 5 generates features by replacing these variable parts of the feature templates with values extracted from the labeled learning data.

For example, take 「アート／ＰＣ／教室」, the second labeled learning data item shown in FIG. 4, with the word boundary immediately after the 「ト」 of 「アート」 as the point of interest. The feature generation unit 5 replaces the values of the variable parts of the feature templates shown in FIG. 5 with values extracted from this learning data, as shown in FIG. 6. In FIG. 6, t is the relative position (in characters) from the word boundary point of interest. %l[t] is a label indicating whether a word boundary is present: %l[t] = 1 means that there is a word boundary, and %l[t] = 0 means that there is none. %c[t] represents the character at position t. %s[t] is a variable representing the character type of the character at position t, where 「カ」 denotes katakana, 「A」 denotes alphabetic characters, and 「漢」 denotes kanji.
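The variable substitution can be sketched in Python as below. The template string syntax, the function names `char_type` and `expand`, the padding symbol "#", the catch-all type 「他」, and the convention that offset 0 refers to the character just before the boundary are all assumptions made for this illustration. The actual templates of FIG. 5 and the values of FIG. 6 are not reproduced here.

```python
import re


def char_type(ch: str) -> str:
    """Coarse character type: 'カ' katakana, '漢' kanji, 'A' alphabet, '他' other."""
    if "\u30a0" <= ch <= "\u30ff":
        return "カ"
    if "\u4e00" <= ch <= "\u9fff":
        return "漢"
    if (ch.isascii() and ch.isalpha()) or "\uff21" <= ch <= "\uff3a" or "\uff41" <= ch <= "\uff5a":
        return "A"  # ASCII or full-width Latin letter
    return "他"  # hypothetical catch-all, not taken from the patent


def expand(template: str, text: str, focus: int, label: int) -> str:
    """Replace %l[0], %c[t] and %s[t] in template for the boundary after text[focus]."""
    def substitute(match):
        kind, offset = match.group(1), int(match.group(2))
        if kind == "l":
            if offset != 0:
                raise ValueError("this sketch only supports the label at the focus point")
            return str(label)
        pos = focus + offset
        if not 0 <= pos < len(text):
            return "#"  # hypothetical padding symbol outside the string
        return text[pos] if kind == "c" else char_type(text[pos])

    return re.sub(r"%([lcs])\[(-?\d+)\]", substitute, template)


# Boundary after 「ト」 (index 2) in 「アートＰＣ教室」, label 1.
print(expand("%l[0]/%c[0]/%c[1]/%s[0]/%s[1]", "アートＰＣ教室", 2, 1))
```

Each distinct string produced this way plays the role of one feature, that is, one matching condition combined with a label.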

By replacing the variable values of the feature templates as shown in FIG. 6, the feature generation unit 5 generates the feature data shown in FIG. 7 from the feature templates shown in FIG. 5. In FIG. 7, the variables %l[] (the label) and %c[] and %s[] (parts of the matching condition) have been replaced with the values extracted from the learning data 「アート／ＰＣ／教室」 shown in FIG. 6.

Each time it generates features by the procedure described above, the feature generation unit 5 determines whether the feature generation process has been completed for all character positions (word boundaries) in the labeled learning data (step ST4). If processing has not been completed for all character positions, the process returns to step ST2, and steps ST2 and ST3 are repeated for the unprocessed character positions.

When it is determined in step ST4 that processing has been completed for all character positions, the feature generation unit 5 determines whether all the unprocessed labeled learning data stored in the labeled learning data storage unit 2 has been processed (step ST5). If unprocessed labeled learning data remains, the process returns to step ST1, and steps ST1 to ST4 are repeated for the unprocessed labeled learning data.

When all labeled learning data has been processed, the feature generation unit 5 performs a preliminary selection by frequency of use over all generated feature data, adopts the features in the top 20% by frequency, and holds them as feature set F1 (step ST6).
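
A minimal Python sketch of this frequency-based preselection follows. The text does not say how "top 20%" is counted, so the sketch assumes it means the top 20% of distinct features ranked by occurrence count, rounded up, with at least one feature kept. The function name and the keep_percent parameter are hypothetical.

```python
from collections import Counter


def preselect(features: list[str], keep_percent: int = 20) -> list[str]:
    """Keep the most frequent distinct features (top keep_percent by rank)."""
    counts = Counter(features)
    # Integer ceiling of len(counts) * keep_percent / 100, never less than 1.
    n_keep = max(1, (len(counts) * keep_percent + 99) // 100)
    return [feature for feature, _ in counts.most_common(n_keep)]
```

Ties at the cut-off are broken by first occurrence, which is the order Counter.most_common uses for equal counts.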

Next, the feature generation unit 5 takes one unprocessed unlabeled learning data item from the unlabeled learning data storage unit 4 (step ST7). Here, assume that unlabeled learning data such as that shown in FIG. 8 has been taken out. In the example of FIG. 8, each entry consists of a character string that is the input sentence (a facility name) and its category information. Unlike the labeled learning data shown in FIG. 4, however, no word break positions "／" are given.

The feature generation unit 5 takes an unprocessed character position in the extracted unlabeled learning data as the break point of interest (step ST8). Because unlabeled learning data carries no labels, each position in its character string is taken in turn as the break point of interest.

Next, the feature generation unit 5 retrieves the feature template from the feature template storage unit 3 and uses it to generate features at the current point of interest (step ST9). Because unlabeled learning data carries no labels, features are generated using both labels, 0 and 1. That is, the label %l[] at the point of interest is assigned both 0 and 1, and feature data is generated for each case.
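
The difference from the labeled case can be sketched as below: every position is visited and both label values are substituted. For brevity the sketch uses a single hypothetical template (%l[0] combined with %c[0]); the real templates of FIG. 5 are not reproduced in this text.

```python
def features_for_unlabeled(text: str) -> list[str]:
    """Generate feature strings for every position under both labels, 0 and 1."""
    features = []
    for pos, ch in enumerate(text):
        for label in (0, 1):                  # no gold label, so try both values
            features.append(f"%l[0]={label} & %c[0]={ch}")
    return features
```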

After this, each time it generates features by the above procedure, the feature generation unit 5 determines whether feature generation has been completed for every character position in the unlabeled learning data (step ST10). If not, the process returns to step ST8, and steps ST8 and ST9 are repeated for the unprocessed character positions.

When all character positions have been processed, the feature generation unit 5 determines whether all unlabeled learning data stored in the unlabeled learning data storage unit 4 has been processed (step ST11). If unprocessed unlabeled learning data remains, the process returns to step ST7, and steps ST7 to ST10 are repeated for it.

When all unlabeled learning data has been processed, the feature generation unit 5 performs a preliminary selection by frequency over all feature data generated from the unlabeled learning data, adopts the features in the top 20% by frequency, and holds them as feature set F2 (step ST12).

Finally, the feature generation unit 5 merges feature set F1 held in step ST6 with feature set F2 held in step ST12 and stores the result as feature data in the feature data storage unit 6 (step ST13).

(2) Learning process (second stage)
In this learning process, the correct label likelihood evaluation means 9 performs likelihood evaluation and the intra-category consistency evaluation means 10 performs consistency evaluation. Based on these results, the parameter update means 11 generates the model parameters, which are real-valued parameters corresponding to the features generated in the first stage.

(2-1) Likelihood evaluation process
FIG. 9 is a flowchart showing the flow of the likelihood evaluation process according to Embodiment 1; the details are described with reference to it.
First, the correct label likelihood evaluation means 9 takes one unprocessed labeled learning data item from the labeled learning data storage unit 2 (step ST1a). Here, assume that the labeled learning data shown in FIG. 4 has been taken out. It then generates a word break hypothesis graph corresponding to that item (step ST2a).

FIG. 10 shows an example of a word break hypothesis graph, here for "アートＰＣ教室" (art PC classroom), the second labeled learning data item shown in FIG. 4. The black nodes in FIG. 10 are inputs and represent the characters of the sentence. The "＃" at the left and right ends are pseudo input characters marking the beginning and end of the sentence. The white nodes are word break hypotheses corresponding to the input character positions; S is a pseudo label for the start position and E is a pseudo label for the end position. The bold path represents the correct hypothesis "アート／ＰＣ／教室".

A link connecting a black input-character node to a white output-label node indicates an input character that is considered when estimating the label value at that position. For notational convenience, FIG. 10 omits the links to white nodes whose label value is 0, but those nodes have the same links as the white nodes whose label value is 1.

The correct label likelihood evaluation means 9 calculates the likelihood according to the CRF probability formula while matching features against each white node (output node) and each link between white nodes (between output nodes) on the output side of the generated word break hypothesis graph (step ST3a). The likelihood is computed with the forward-backward algorithm, a form of dynamic programming. The CRF probability formula is given by formula (1) below.
Here, p_Λ(y|x) is the conditional probability (as estimated under model parameters Λ) that, given an input x (for example, "アートＰＣ教室"), its word breaks are the output y (for example, 0010101, that is, アート／ＰＣ／教室).
c is a subgraph of the hypothesis graph called a clique; here it ranges over all edges E_y and vertices V_y that make up the output y (a path). f_i is feature data: a function that takes the value 1 when its condition is matched and 0 otherwise. y|c is the label within the output label sequence that corresponds to c (a vertex or an edge). λ_i is the real-valued weight corresponding to the i-th feature, and Λ = {λ_0, ..., λ_i, ...} is the model parameter (vector). The values of the model parameters are updated successively by the parameter update means 11, starting from the initial values λ_i = 0 for all i. Z(x) is given by formula (2) below.
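
The formula images for (1) and (2) are not reproduced in this text. Assuming the patent follows the standard linear-chain CRF definition, which matches the symbol definitions above, the two formulas would take roughly the following form. The summation index y' over all candidate paths is notation introduced here.

```latex
p_\Lambda(y \mid x) = \frac{1}{Z(x)}
  \exp\Bigl( \sum_{c \in E_y \cup V_y} \sum_i \lambda_i \, f_i(y|_c, x) \Bigr)
  \tag{1}

Z(x) = \sum_{y'} \exp\Bigl( \sum_{c \in E_{y'} \cup V_{y'}} \sum_i \lambda_i \, f_i(y'|_c, x) \Bigr)
  \tag{2}
```

With the initial values λ_i = 0, every path receives the same score, so p_Λ(y|x) is uniform over all candidate segmentations.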

FIG. 11 shows the word break hypothesis graph of FIG. 10 with its subgraphs marked. In FIG. 11, the edges c1, c3, ... and the vertices c2, c4, ... are the subgraphs of the path drawn in bold, which represents the correct word breaks for the input "アートＰＣ教室". The likelihood of this path (the correct label likelihood) is obtained by finding the features f_i that fire (take the value 1) in the subgraphs c1, c2, ..., c14 and summing the corresponding real-valued weights λ_i according to formula (1) above.

For example, given the 26 features shown in FIG. 7, six features fire on edge c5 of the subgraph: those whose feature numbers (the leftmost column) are 2, 8, 9, 13, 14 and 20. The features that fire on vertex c6 of the subgraph are stated to be the 20 types with feature numbers 1, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16, 18, 19, 21, 22, 23, 24, 25 and 26. In this way, the correct label likelihood evaluation means 9 calculates the likelihood p_Λ(y|x) of the correct label sequence y for the input x under the current model parameters Λ.
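
The likelihood computation can be sketched in Python for a binary label chain as follows. This is a simplified illustration, not the patent's implementation: it assumes the fired weights have already been summed into a per-position vertex score node[t][label] and a position-independent edge score edge[prev][next], it omits the S and E pseudo labels, and it uses only the forward pass, which is enough to obtain Z(x).

```python
import math
from itertools import product


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(node, edge, y):
    """Sum of fired weights along path y (vertex scores plus edge scores)."""
    score = sum(node[t][y[t]] for t in range(len(y)))
    score += sum(edge[y[t - 1]][y[t]] for t in range(1, len(y)))
    return score


def log_partition(node, edge):
    """log Z(x), computed by the forward recursion over labels {0, 1}."""
    alpha = [node[0][0], node[0][1]]
    for t in range(1, len(node)):
        alpha = [
            node[t][b] + _logsumexp([alpha[a] + edge[a][b] for a in (0, 1)])
            for b in (0, 1)
        ]
    return _logsumexp(alpha)


def log_likelihood(node, edge, y):
    """log p(y | x) = path score minus log Z(x)."""
    return path_score(node, edge, y) - log_partition(node, edge)
```

The forward recursion costs time linear in the sentence length, whereas enumerating all 2^n label sequences directly would be exponential.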

In step ST4a, the correct label likelihood evaluation means 9 checks whether the correct label likelihood calculation described above has been performed for all labeled learning data stored in the labeled learning data storage unit 2. If so, the process proceeds to step ST5a; if not, it returns to step ST1a and continues with the unprocessed learning data.

In step ST5a, the correct label likelihood evaluation means 9 uses formula (3) below to calculate the sum of the log-likelihoods over all labeled learning data d ∈ D_L. Here p_Λ(y_d|x_d) is the likelihood defined by formula (1) above.
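
The image for formula (3) is not reproduced in this text. From the description (a sum of log-likelihoods over all d ∈ D_L), it would read as follows, with L_DL(Λ) being the name used for this quantity later in the description.

```latex
L_{D_L}(\Lambda) = \sum_{d \in D_L} \log p_\Lambda(y_d \mid x_d)
  \tag{3}
```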

(2-2) Consistency evaluation process
FIG. 12 is a flowchart showing the flow of the consistency evaluation process; the details are described with reference to it.
First, the intra-category consistency evaluation means 10 takes one unprocessed unlabeled learning data item from the unlabeled learning data storage unit 4 (step ST1b). It then generates a word break hypothesis graph corresponding to that item (step ST2b).

As in the likelihood evaluation process described above, the intra-category consistency evaluation means 10 calculates the likelihood according to the CRF probability formula while matching features against each output node and each link between output nodes on the generated word break hypothesis graph (step ST3b).

In step ST4b, the intra-category consistency evaluation means 10 checks whether the likelihood calculation described above has been performed for all unlabeled learning data stored in the unlabeled learning data storage unit 4. If so, the process proceeds to step ST5b; if not, it returns to step ST1b and continues with the unprocessed learning data.

Next, to evaluate the "goodness" of the estimates made with model parameters Λ for unlabeled learning data, which has no correct labels, the intra-category consistency evaluation means 10 calculates the entropy given by formula (4) below as an evaluation value (step ST5b). Formula (4) represents the conditional entropy of estimating the subgraph c for the unlabeled learning data d ∈ D_U in each category κ ∈ K. Here p̃ (p with a tilde) is the observed probability in the learning data, and p_Λ is the probability estimated using model parameters Λ.
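
Formula (4) itself is not reproduced in this text, so its exact form cannot be confirmed. The Python sketch below shows only one possible reading of "entropy within a category": model predictions are pooled per (category, context) pair, weighted by how often that pair is observed (standing in for p̃), and the binary entropy of the pooled break probability (standing in for p_Λ) is summed. The context key, the pooling by averaging, and the function names are all assumptions.

```python
import math
from collections import defaultdict


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def intra_category_entropy(predictions) -> float:
    """predictions: list of (category, context, p_break) triples from the model."""
    pooled = defaultdict(list)
    for category, context, p_break in predictions:
        pooled[(category, context)].append(p_break)
    total = len(predictions)
    entropy = 0.0
    for probs in pooled.values():
        weight = len(probs) / total            # empirical frequency of the pair
        mean_p = sum(probs) / len(probs)       # pooled model estimate
        entropy += weight * binary_entropy(mean_p)
    return entropy
```

Under this reading, the value is 0 when the model makes the same confident decision for the same context throughout a category, and it grows when decisions within a category conflict.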

(2-3) Model parameter update process
The parameter update means 11 updates the model parameters Λ so as to maximize the target function G(Λ) given by formula (5) below, using the likelihood L_DL(Λ) for the labeled learning data, calculated by the correct label likelihood evaluation means 9 with formula (3) above, and the subgraph entropy H_Λ(c|D_U, K), calculated by the intra-category consistency evaluation means 10 with formula (4) above. In formula (5), α and β are constants determined experimentally. The second term on the right-hand side is a penalty that depends on the magnitude of the model parameters Λ and is introduced to prevent overfitting. ||Λ|| is the Euclidean norm given by formula (6) below.
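
The images for formulas (5) and (6) are not reproduced in this text. One form consistent with the description (likelihood as the first term, a norm penalty as the second, entropy as the third) is sketched below. Whether the penalty uses the norm or its square, and which of α and β scales which term, are assumptions; only the Euclidean norm in (6) follows directly from the text.

```latex
G(\Lambda) = L_{D_L}(\Lambda) - \alpha \, \lVert \Lambda \rVert^2
  - \beta \, H_\Lambda(c \mid D_U, K)
  \tag{5}

\lVert \Lambda \rVert = \sqrt{\sum_i \lambda_i^2}
  \tag{6}
```

The entropy enters with a negative sign because maximizing G(Λ) is meant to make the entropy smaller.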

The model parameters Λ that maximize the target function G(Λ) can be found by the well-known hill-climbing method. That is, Λ is updated successively, iterating together with the calculation of L_DL by the correct label likelihood evaluation means 9 and of H_Λ(c|D_U, K) by the intra-category consistency evaluation means 10; when the update amount of Λ falls to or below a predetermined value, the process is regarded as converged and ends.

Because the target function G(Λ) is differentiable, its gradient can be computed and a quasi-Newton method (for example, BFGS) applied in place of hill climbing, which can reduce the number of iterations needed to converge.
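
The iterate-until-converged loop can be sketched in Python as plain gradient ascent with a stopping test on the update amount. This is an illustrative stand-in rather than the patent's procedure: the grad callback, the fixed step size, and the tolerance are hypothetical, and a real implementation would recompute L_DL and the entropy inside grad on every iteration.

```python
import math


def maximize(grad, lam, step=0.1, tol=1e-8, max_iter=10_000):
    """Gradient ascent; stops when the update amount is at or below tol."""
    for _ in range(max_iter):
        g = grad(lam)
        new_lam = [l + step * gi for l, gi in zip(lam, g)]
        update = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_lam, lam)))
        lam = new_lam
        if update <= tol:                      # convergence condition on the update
            break
    return lam


def toy_grad(lam):
    """Gradient of a toy concave target: -(l0-1)^2 - (l1+2)^2 - 0.5*||lam||^2."""
    return [-2.0 * (lam[0] - 1.0) - lam[0], -2.0 * (lam[1] + 2.0) - lam[1]]
```

Starting from the all-zero initial values, maximize(toy_grad, [0.0, 0.0]) converges to approximately [2/3, -4/3], the maximizer of the toy target.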

In parameter estimation that maximizes the target function G(Λ), the criterion for labeled learning data is maximization of the likelihood, the first term on the right-hand side of formula (5) above. For unlabeled learning data, learning proceeds so as to favor structure estimates that reduce the entropy in the third term on the right-hand side of formula (5), that is, structure estimates that are consistent within a category. Because the latter is intuitively close to the guidelines followed when tagging by hand, useful rules can be acquired in unsupervised learning.
The model parameters Λ obtained in this way are stored in the model parameter storage unit 8 by the parameter update means 11.

(3) Post-processing of learning (third stage)
When the parameter update means 11 completes model parameter estimation, the analysis dictionary output means 12 combines the model parameters, the feature data stored in the feature data storage unit 6, and the list of labels stored in the label storage unit 7, and outputs them in a predetermined format as a text analysis dictionary.

FIG. 13 shows an example of a text analysis dictionary. The data format shown in FIG. 13 consists of three sections: @LABELS, @FEATURES and @WEIGHTS. The @LABELS section lists the labels held by the label storage unit 7. The @FEATURES section lists the features held by the feature data storage unit 6; in the example of FIG. 13, each row consists of a feature number i and the definition of feature f_i. Features numbered 27 and above are omitted from FIG. 13. The @WEIGHTS section lists the values of the model parameters held in the model parameter storage unit 8; in FIG. 13, each row consists of a feature number i and the weight parameter λ_i of feature f_i. As with the features, only the weight parameters up to feature number 26 are shown, and the omitted part is indicated by "...".
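
A reader for such a three-section dictionary might look like the Python sketch below. Only the section names and the row contents come from the text; the exact file layout of FIG. 13 is not reproduced here, so the sketch assumes one entry per line, with the feature number separated from the definition or weight by whitespace.

```python
def parse_dictionary(text: str) -> dict:
    """Parse @LABELS, @FEATURES and @WEIGHTS sections into Python containers."""
    sections = {"@LABELS": [], "@FEATURES": {}, "@WEIGHTS": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                           # skip blank lines
        if line in sections:
            current = line                     # section header
        elif current == "@LABELS":
            sections[current].append(line)
        elif current == "@FEATURES":
            number, definition = line.split(None, 1)
            sections[current][int(number)] = definition
        elif current == "@WEIGHTS":
            number, weight = line.split(None, 1)
            sections[current][int(number)] = float(weight)
    return sections
```

Keying both features and weights by feature number i lets a text analyzer look up λ_i for each feature f_i that fires.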

As shown in FIG. 1, the text analysis dictionary output by the analysis dictionary output means 12 is saved as data such as an electronic file, loaded by a text analyzer, and used for text analysis.

As described above, according to Embodiment 1, the device calculates the likelihood of the analysis result indicated by the label attached to labeled learning data; calculates an evaluation value indicating the degree of consistency between the analysis result for unlabeled learning data and the analysis results for input sentences belonging to the same category; updates the values of the model parameters corresponding to the feature data so as to maximize a target function based on the likelihood and the consistency evaluation value; repeats this update, using the likelihood and evaluation value recalculated with the updated model parameters, until the updated values satisfy a predetermined convergence condition; and generates the analysis dictionary used by the text analyzer from the model parameters that satisfy the convergence condition, the feature data, and the list of labels.
Thus, within a semi-supervised learning framework that uses labeled and unlabeled learning data together, the device exploits a secondary attribute of the data (category information) that is easily obtained from data sources and applications. It performs ordinary maximum likelihood estimation on labeled learning data while learning to make estimates on unlabeled learning data that are as consistent as possible within each category. The learning guideline for unlabeled learning data in this invention assumes similarity among the data in a category and commonality of their labels. This assumption is intuitively close to the explicit or implicit guidelines followed when labeling by hand during corpus creation, and is therefore considered natural. Incorporating such a priori knowledge into the learning criterion makes it possible to learn efficiently from unlabeled learning data and to improve analysis accuracy.

1 text analysis learning device, 2 labeled learning data storage unit, 3 feature template storage unit, 4 unlabeled learning data storage unit, 5 feature generation unit, 6 feature data storage unit, 7 label storage unit, 8 model parameter storage unit, 9 correct label likelihood evaluation means (likelihood evaluation means), 10 intra-category consistency evaluation means (consistency evaluation means), 11 parameter update means (update means), 12 analysis dictionary output means.

Claims

A text analysis learning device comprising:
a labeled learning data storage unit that stores, as labeled learning data, a combination of an input sentence for a text analyzer, a label indicating the correct analysis result of the input sentence, and category information indicating the category to which the input sentence belongs;
an unlabeled learning data storage unit that stores, as unlabeled learning data, a combination of an input sentence for the text analyzer and category information indicating the category to which the input sentence belongs;
a label storage unit that stores a list of the labels;
a feature data storage unit that stores, as feature data corresponding to an input sentence, a combination of a matching condition for the input sentence and the label of the input sentence;
a feature template storage unit that stores a feature template in which a part of the matching condition and the label are variables;
a feature generation unit that receives the labeled learning data, the unlabeled learning data, the list of labels, and the feature template, extracts, from the labeled learning data, the unlabeled learning data, and the list of labels, the character strings and labels corresponding respectively to the part of the matching condition and to the label, and generates the feature data by rewriting the variables of the feature template;
a model parameter storage unit that stores model parameters corresponding to the feature data;
likelihood evaluation means that analyzes the input sentence of the labeled learning data based on the feature data, the model parameters, and the list of labels, and calculates the likelihood of the analysis result indicated by the label attached to the labeled learning data;
consistency evaluation means that analyzes the input sentence of the unlabeled learning data based on the feature data, the model parameters, and the list of labels, and calculates an evaluation value indicating the degree of consistency between the analysis result for the unlabeled learning data and the analysis results for input sentences belonging to the same category;
update means that updates the values of the model parameters so as to maximize a target function based on the likelihood calculated by the likelihood evaluation means and the evaluation value calculated by the consistency evaluation means, and repeats the update of the model parameters, based on the likelihood and the evaluation value calculated using the updated values, until the updated values of the model parameters satisfy a predetermined convergence condition; and
analysis dictionary output means that generates an analysis dictionary used by the text analyzer from the feature data, the model parameters satisfying the predetermined convergence condition, and the list of labels.