JP2005327107A

JP2005327107A - Proper name category estimation device and program

Info

Publication number: JP2005327107A
Application number: JP2004144999A
Authority: JP
Inventors: Takeshi Nagamine; 猛志永峯; Katsunori Yoshiji; 克典芳地; Akio Yamashita; 明男山下
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-05-14
Filing date: 2004-05-14
Publication date: 2005-11-24

Abstract

PROBLEM TO BE SOLVED: To provide a technique capable of estimating the category of a proper name even when the name appears in a sentence by itself rather than as part of a compound word.

SOLUTION: A group of learning documents is analyzed to find proper names whose category is known and the independent words that co-occur with them in the same sentence. By totaling those independent words for each proper-name category, a category-specific appearance frequency table 148 is produced; it stores, for each independent word, an appearance frequency score indicating how likely the word is to co-occur with proper names of each category. When an analysis target document 300 contains a proper name whose category the morphological analysis unit 102 cannot uniquely specify, a co-occurring word detection unit 108 finds the independent words that co-occur with the proper name in the same sentence of the analysis target document 300, and a category estimation unit 110 obtains each word's appearance frequency score for each category from the category-specific appearance frequency table 148 and totals the scores per category. The category with the highest fitness value found by this totaling is estimated to be the category of the proper name.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to natural-language sentence analysis, and more particularly to a technique for estimating the category of a proper name.

Proper names fall into various semantic categories, such as personal names, place names, and organization names (for example, company names). The same proper name often belongs to more than one category; for example, the proper name “Aoyama” (青山) is used both as a personal name and as a place name.

For this reason, in natural-language sentence analysis it is important to narrow the category of a proper name down to one.

As prior art for this purpose, Patent Document 1 discloses an apparatus that narrows a proper name (proper noun) belonging to multiple categories down to a single category based on its relationship to the following word (a suffix or common noun) that comes after it. Specifically, the apparatus holds information on the following words that can come after proper names, together with rules that determine the category and part of speech of the proper name and the following word from their combination, and uses this information to specify the category of a proper name in a sentence.

Patent Document 2, although it does not address proper-name categories as such, describes an apparatus that narrows a word belonging to multiple parts of speech down to a single part of speech. This apparatus stores numerical information that combines the degree of ease of connection between adjacent parts of speech with information indicating which part of speech should be removed when connection is not possible, and determines the part of speech of each word in a sentence based on this information.

Patent Document 1: JP-A-1-248277
Patent Document 2: JP-A-2-51772

The technique of Patent Document 2 relies on the relationship between adjacent parts of speech. In determining the category of a proper name, however, the target proper name is always a noun, and the adjacent parts of speech differ little even when the category changes, so this prior art is of limited use for determining proper-name categories.

The technique of Patent Document 1 processes compound words consisting of a proper name and the word that follows it, and the following words that serve as the key to the analysis are specific words registered in advance. This prior art therefore cannot handle a proper name that appears in a sentence on its own rather than as part of a compound word.

The present invention provides a technique capable of estimating the category of a proper name even when the name appears in a sentence on its own rather than as part of a compound word.

The apparatus according to the present invention comprises: a category-specific appearance frequency table that stores, for each proper-name category, appearance frequency scores of independent words that appear within a predetermined range of proper names belonging to that category in documents; and estimation means for estimating the category to which a proper name in a given analysis target document belongs. The estimation means extracts the independent words that appear within the predetermined range of the proper name in the analysis target document, obtains each extracted word's appearance frequency score for each category from the category-specific appearance frequency table, calculates a fitness for each category by totaling those scores per category, and estimates the category to which the proper name belongs based on the fitness.

Here, the “predetermined range” for a proper name is, for example, the sentence or the paragraph to which the proper name belongs, and can be defined as appropriate.

In this configuration, when estimating the category of a proper name that appears in the analysis target document, the estimation means extracts the independent words that appear within the predetermined range of the proper name in that document, calculates the fitness of the proper name for each category from the extracted words' appearance frequency scores for that category, and estimates the category of the proper name based on the fitness values.
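The estimation step described above can be sketched in Python as follows. This is only a minimal illustration, assuming the table is held as a nested dictionary; the names `FREQ_TABLE` and `estimate_category` and all score values are hypothetical and do not come from the patent.

```python
# Hypothetical table: category -> {independent word: appearance frequency score}.
# The scores are made-up illustrative values, not figures from the patent.
FREQ_TABLE = {
    "organization name": {"sales": 5.0, "president": 4.0, "announcement": 6.0},
    "person name": {"president": 3.0, "marriage": 7.0, "singer": 8.0},
    "place name": {"merger": 6.0, "public institution": 5.0},
}


def estimate_category(cooccurring_words, table):
    """Total each word's score per category; return (best category, fitness map)."""
    fitness = {category: 0.0 for category in table}
    for category, scores in table.items():
        for word in cooccurring_words:
            # A word never seen with this category contributes nothing.
            fitness[category] += scores.get(word, 0.0)
    # The category with the highest fitness is taken as the estimate.
    best = max(fitness, key=fitness.get)
    return best, fitness


best, fitness = estimate_category(["president", "sales", "announcement"], FREQ_TABLE)
print(best, fitness)
```

With these illustrative scores, a proper name that co-occurs with “president,” “sales,” and “announcement” is assigned to the organization-name category because that category accumulates the largest total.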

In a preferred aspect of the present invention, the proper name category estimation apparatus further comprises learning means that obtains, from each given learning document, proper names and the independent words appearing within the predetermined range of each proper name, obtains the appearance frequency score of each independent word for each category by totaling, per proper-name category, the number of times the independent words appeared within the predetermined range of proper names belonging to that category, and creates the category-specific appearance frequency table.

In this aspect, the learning means creates the category-specific appearance frequency table from the given learning documents, so the table can be generated automatically simply by supplying the apparatus with learning documents as examples.

In another preferred aspect of the present invention, the learning means extracts, from the proper names obtained from each learning document, those whose category can be uniquely specified, and creates the category-specific appearance frequency table based on the extracted proper names and the independent words that appear within the predetermined range of them.

In this aspect, the learning means automatically extracts from the learning documents the proper names whose category can be uniquely specified, and creates the category-specific appearance frequency table from information about those proper names. Because information about proper names with an ambiguous category is not reflected in the table, an accurate table can be created.

In a further preferred aspect of the present invention, the learning means gives each independent word that appears within the predetermined range of a proper name a weight score that is higher the closer the word is to the proper name, and obtains the appearance frequency score by totaling the weight scores of the independent words that appeared.

In this aspect, when the learning means totals the independent words that appear within the predetermined range of a proper name while creating the category-specific appearance frequency table, it counts each word with a weight score that is higher the closer the word is to the proper name. With a table created in this way, independent words that are closer to the proper name, and are therefore considered more strongly related to it, have a stronger influence on category estimation.
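A rough sketch of distance-weighted counting during learning is given below. The text only requires that the weight score be higher for closer words; the reciprocal-distance weight, the measurement of distance in word positions, and the names `distance_weight` and `accumulate_weighted` are all assumptions made for illustration.

```python
from collections import defaultdict


def distance_weight(distance):
    # Assumed weighting (1 / distance); the source only says "closer means higher".
    return 1.0 / distance


def accumulate_weighted(tokens, name_index, category, totals):
    """Add a distance-based weight score for every word around the proper name.

    tokens     -- the proper name and its co-occurring independent words, in order
    name_index -- position of the proper name within tokens
    """
    for i, word in enumerate(tokens):
        if i == name_index:
            continue  # skip the proper name itself
        totals[category][word] += distance_weight(abs(i - name_index))


# category -> word -> accumulated weight score
totals = defaultdict(lambda: defaultdict(float))
accumulate_weighted(["singer", "CCC", "violation", "arrest"], 1, "person name", totals)
print(dict(totals["person name"]))
```

In this sketch the words adjacent to the proper name each add 1.0 to their totals, while a word two positions away adds only 0.5.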

In another preferred aspect of the present invention, the estimation means calculates the fitness by multiplying the appearance frequency score of each independent word that appears within the predetermined range of the proper name by a weight that is larger the closer the word is to the proper name, and then totaling the results.

In this aspect, the estimation means calculates the fitness by multiplying each independent word's appearance frequency score by a weight that is larger the closer the word is to the proper name and totaling the results, so independent words that are closer to the proper name, and are therefore considered more strongly related to it, have a stronger influence on category estimation.
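The weighted totaling at estimation time might look like the following sketch. The reciprocal-distance weight, the function name `weighted_fitness`, and the scores are assumptions for illustration only; the source specifies just that closer words receive a larger weight.

```python
def weighted_fitness(words_with_distance, table):
    """Return category -> fitness, scaling each score by an assumed 1/distance weight."""
    fitness = {category: 0.0 for category in table}
    for category, scores in table.items():
        for word, distance in words_with_distance:
            fitness[category] += scores.get(word, 0.0) * (1.0 / distance)
    return fitness


# Made-up scores: category -> {independent word: appearance frequency score}.
table = {
    "organization name": {"president": 4.0, "marriage": 1.0},
    "person name": {"president": 3.0, "marriage": 4.0},
}

# "president" is next to the proper name; "marriage" is four words away.
weighted = weighted_fitness([("president", 1), ("marriage", 4)], table)
# With every distance set to 1 the weights vanish and plain totals result.
unweighted = weighted_fitness([("president", 1), ("marriage", 1)], table)
print(weighted, unweighted)
```

With these values the plain totals favor the person-name category (7.0 against 5.0), whereas the weighted totals favor the organization-name category (4.25 against 4.0), because the distant word “marriage” is discounted.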

In yet another aspect of the present invention, the proper name category estimation apparatus further comprises means for morphologically analyzing the analysis target document with reference to a dictionary, which, on detecting an unknown word that is not registered in the dictionary, regards the unknown word as a proper name and causes the estimation means to estimate the category to which it belongs.

New proper names are coined every day, and a proper name that the dictionary has not caught up with becomes an unknown word. In this aspect, an unknown word in the analysis target document is automatically regarded as a proper name and its category is estimated, so category estimation is possible even for proper names that the dictionary does not cover.

According to the present invention, the category of a proper name in a document is estimated by totaling, per category, the appearance frequency scores of the independent words that appear within a predetermined range of the proper name, such as the same sentence or the same paragraph. The category can therefore be specified even when the proper name appears in the document on its own, without a following word.

The best mode for carrying out the present invention (hereinafter referred to as the “embodiment”) is described below with reference to the drawings.

FIG. 1 is a diagram showing an example of a hardware configuration on which the proper name category estimation apparatus according to the present invention is implemented. As shown in the figure, the apparatus can be built on a general computer system in which a CPU (central processing unit) 10, a RAM (random access memory) 12, a nonvolatile storage device 14 such as a hard disk drive, and various input/output interfaces (I/F) 16 such as a portable disk reader and a LAN (local area network) interface circuit are connected to a bus 18. A learning program 142 and an analysis program 144 are installed in the nonvolatile storage device 14. When the CPU 10 executes the learning program 142, the computer system functions as the learning subsystem shown in FIG. 3. The learning subsystem learns the information used as the basis for estimating proper-name categories. More specifically, it processes a group of learning documents to create and update the category-specific appearance frequency table 148 used in later analysis. Further details of the category-specific appearance frequency table 148 and the learning subsystem are given later. When the CPU 10 executes the analysis program 144, the computer system functions as the analysis subsystem shown in FIG. 4. The analysis subsystem performs analysis processing such as morphological analysis on a document to be analyzed, for application processing such as machine translation or a natural-language user interface. In particular, it uses the category-specific appearance frequency table 148 to estimate the category of proper names whose category cannot be uniquely specified by the dictionary 146. This subsystem is also described in detail later. The proper name category estimation apparatus of this embodiment combines the learning subsystem and the analysis subsystem.

The nonvolatile storage device 14 also stores a dictionary 146 used for learning and analysis, and the category-specific appearance frequency table 148, which serves as the basis for estimating proper-name categories.

The dictionary 146 is a database in which data on a group of words is registered. Here, a compound word, in which several words are joined to form a single unit of meaning, is also treated as a word. The data for each word in the dictionary 146 includes the part of speech to which the word belongs (there may be more than one) and the proper-name category to which the word belongs (again, there may be more than one). A dictionary of the kind generally used in the field of natural language analysis can be used as the dictionary 146.

The category-specific appearance frequency table 148 is a table that registers, for each proper-name category, a list of the words that may appear in a document within a predetermined range of a proper name belonging to that category (that is, within a predetermined range defined relative to the proper name).

In the following, to keep the description simple, “appearing within the predetermined range of a proper name in a document” is simply called “co-occurring.” The examples below describe the case where the “predetermined range,” that is, the range searched for words co-occurring with a proper name, is “within the same sentence as the proper name.”

The information for each word in the table 148 includes an appearance frequency score indicating the degree to which the word co-occurs (how likely it is to co-occur) with proper names belonging to that category.

FIG. 2 shows an example of the category-specific appearance frequency table 148. In this example, each row of the table holds the data for one category. “Organization name” (such as company names), “person name,” and “place name” are shown as proper-name categories, but the categories are not limited to these. In each category's row, the information on the words that co-occur with proper names belonging to that category is arranged across the columns. In this example the words are listed in the order in which they were detected during learning, and the order itself has no particular meaning. The information for each word takes the form “word : number of occurrences of the word : appearance frequency score of the word,” that is, the word's character string, its number of occurrences during learning, and its appearance frequency score obtained from the learning results, in that order and separated by colons “:”. In this example, the appearance frequency score of a word for a category is the number of times the word co-occurred with proper names belonging to that category during learning (its “number of occurrences”) divided by the sum of the numbers of occurrences of all words that co-occurred with proper names belonging to that category (shown as the “total by category” in FIG. 2); to keep the figure readable, it displays 100 times the calculated value. In other words, the appearance frequency score is the number of occurrences of the word normalized by the total for the category. This definition is only an example: the appearance frequency score need only be a value indicating how readily the word co-occurs with the category, and any formula may be used for the normalization. To allow the score to be calculated in this way, the category-specific appearance frequency table 148 has, for each category, an area for recording that category's “total by category.”
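The normalization described above (number of occurrences divided by the total for the category, shown multiplied by 100 as in FIG. 2) can be sketched as follows. The function name `appearance_frequency_scores` and the sample counts are hypothetical.

```python
def appearance_frequency_scores(counts):
    """Return ({word: score}, total) for one category's occurrence counts.

    The score is 100 * occurrences / total, following the example definition
    in the text; other normalizations are equally permissible.
    """
    total = sum(counts.values())
    if total == 0:
        return {}, 0
    return {word: 100.0 * n / total for word, n in counts.items()}, total


# Made-up counts for one category: five words, each seen once.
place_counts = {
    "public institution": 1,
    "efficiency improvement": 1,
    "plan": 1,
    "aim": 1,
    "merger": 1,
}
scores, total = appearance_frequency_scores(place_counts)
print(total, scores)
```

Keeping the total alongside the scores mirrors the “total by category” area of the table, which lets the scores be recomputed when further learning documents update the counts.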

In this embodiment, the words registered in the category-specific appearance frequency table 148 are limited to independent words; attached words are not registered. As is well known, an independent word is a word, such as a noun or a verb, that stands on its own in a sentence without depending on other words. An attached word, by contrast, is a word that carries meaning only when attached to an independent word; particles and auxiliary verbs are typical examples. Prefixes and suffixes also fall within the category of attached words.

Independent words are units of meaning, and there is no grammatical reason for them to co-occur with a proper name; the co-occurrence tendencies of independent words are therefore likely to depend on the category of the proper name. By contrast, when attached words, particularly particles and auxiliary verbs, co-occur with a proper name, they do so because grammar requires them to attach to the proper name or to an independent word that co-occurs with it, so their correlation with the semantic category of the proper name is considered low. For this reason, in this embodiment the category-specific appearance frequency table 148 is built solely from the co-occurrence of independent words, which are considered highly dependent on the category of the proper name.

As Patent Document 1 also shows, prefixes and suffixes are an important factor in specifying the category of a proper name. However, category estimation based on prefixes and suffixes can be achieved with the technique of Patent Document 1, so this embodiment presents a configuration focused on estimating the category of proper names whose category cannot be uniquely specified even with that technique. For this reason, prefixes and suffixes are not registered in the category-specific appearance frequency table 148.

Next, the learning subsystem of the proper name category estimation apparatus is described with reference to FIG. 3.

The learning subsystem includes a morphological analysis unit 102, a category-specific appearance list generation unit 104, and a table generation/update unit 106. Each of these functional modules 102, 104, and 106 is realized by the CPU 10 executing the learning program 142.

The morphological analysis unit 102 is a functional module that performs morphological analysis on the text data of a document given as the target of analysis.

In this morphological analysis, the data in the dictionary 146 is used to break the text of the document down into a sequence of morphemes and words and to determine the part of speech of each. When a word is determined to be a proper name, its category is also determined. The part of speech and the proper-name category are determined from the part-of-speech and category information registered for each word in the dictionary 146. Consequently, for a proper name that has only a single category registered in the dictionary 146, the category can be uniquely specified at the morphological analysis stage. For a proper name that has multiple categories registered in the dictionary 146, on the other hand, the category cannot be uniquely specified at that stage. Morphological analysis is a well-known technique, so further detailed description is omitted.
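The distinction between uniquely specified and ambiguous proper names can be illustrated with a small lookup sketch. The dictionary layout, the entries, and the function name `resolve_category` are hypothetical; an actual morphological analyzer holds far richer data.

```python
# Hypothetical dictionary: proper name -> set of registered categories.
DICTIONARY = {
    "Aoyama": {"person name", "place name"},  # ambiguous, as in the earlier example
    "AAA Shoji": {"organization name"},       # a single registered category
}


def resolve_category(word, dictionary):
    """Return the category if exactly one is registered, otherwise None."""
    categories = dictionary.get(word, set())
    if len(categories) == 1:
        return next(iter(categories))
    # Zero categories (unregistered word) or several (ambiguous): not resolvable here.
    return None


print(resolve_category("AAA Shoji", DICTIONARY))  # organization name
print(resolve_category("Aoyama", DICTIONARY))     # None
```

Names for which this lookup returns `None` are the ones that would be passed on to the co-occurrence-based estimation.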

The morphological analysis unit 102 may additionally apply, to proper names whose category cannot be uniquely specified by ordinary morphological analysis, category determination processing that uses information on following words (suffixes and common nouns), as in Patent Document 1, or information on prefixes, in an attempt to specify the category.

In the learning subsystem, the morphological analysis unit 102 is used to extract proper names, and the words (in particular the independent words) that co-occur with them, from documents.

One or more learning documents 200 serving as examples are input to the learning subsystem via the input/output I/F 16. Various kinds of documents can be used as learning documents 200, such as newspaper and magazine articles and document content on information sites on the WWW (World Wide Web). To specialize in a particular field, documents from that field can be collected and used as the learning documents 200.

When one or more learning documents 200 are input to the learning subsystem, the morphological analysis unit 102 performs morphological analysis on each learning document 200 individually. As an example, suppose that, as shown in FIG. 3, a document consisting of the following six sentences, Sentence 1 to Sentence 6, is input as a learning document 200.

Sentence 1: AAA Shoji announced that its sales for the current term are expected to reach 100 billion.
Sentence 2: BBB Electric announced that it has succeeded in developing a new material and will adopt it in its products for the current term.
Sentence 3: The singer CCC was arrested for violating the Road Traffic Act.
Sentence 4: The entertainer DDD married EEE and divorced three years later.
Sentence 5: FFF and GGG merged with the aim of improving the efficiency of public institutions.
Sentence 6: President III of HHH Electric presented a vision for the current term's product strategy.

In this case, the output of the morphological analysis unit 102 is, for each sentence, a list of the proper names that appeared in the sentence and the independent words that co-occurred in it, as follows.

Sentence 1: AAA Shoji (organization name), current term, sales, 100 billion, achievement, prospect, announcement
Sentence 2: BBB Electric (organization name), new material, development, success, current term, product, adoption, announcement
Sentence 3: singer, CCC (person name), Road Traffic Act violation, arrest
Sentence 4: entertainer, DDD (person name), EEE (person name), marriage, three years later, divorce
Sentence 5: FFF (place name), GGG (place name), public institution, efficiency improvement, plan, aim, merger
Sentence 6: HHH Electric (organization name), III (person name), president, current term, product strategy, vision, show

In this example, a word followed by a parenthetical such as “(organization name),” as in “AAA Shoji,” is a proper name, and the parenthetical gives the category to which the proper name belongs. Words without a parenthetical, such as “current term” and “plan,” are independent words that co-occurred with a proper name.

The list of proper names and co-occurring independent words shown here for each sentence is called the per-sentence co-occurrence list.

In this example, the category of every proper name that appears in the sentences can be uniquely specified using the dictionary 146 (determination using following words and the like may also be applied). In practice, however, the morphological analysis unit 102 cannot always uniquely determine the category of every proper name in the learning documents 200. In this embodiment, a proper name whose category the morphological analysis unit 102 could not uniquely determine is not included in the per-sentence co-occurrence list that the unit outputs. That is, proper names whose category cannot be uniquely specified are not used in the subsequent processing for creating the category-specific appearance frequency table 148. The purpose of the learning stage is to collect information on the words that co-occur with proper names of each category, so a proper name whose category cannot be specified is of no use.

また、以上の例では、辞書１４６や特許文献１の手法を利用した判定で固有名のカテゴリを判別したが、これは一例である。この他に、例えば、学習用文書２００として、各固有名に対し予めそのカテゴリをタグなどの形で付加した文書を用意すれば、そのような手法によらずとも、形態素解析部１０２にて各固有名のカテゴリを判別できる。 In the above example, the category of each proper name is determined using the dictionary 146 and the method of Patent Document 1, but this is only an example. Alternatively, if documents in which each proper name has been annotated in advance with its category, for example in the form of a tag, are prepared as the learning documents 200, the morphological analysis unit 102 can determine the category of each proper name without relying on such methods.

次にカテゴリ別出現リスト生成部１０４は、これら文ごとの共起リストの情報を、各文に含まれる固有名の所属カテゴリごとに集計することで、カテゴリ別出現リストを生成する。カテゴリ別出現リストは、固有名のカテゴリごとに、そのカテゴリに属する固有名と同一文内に共起した自立語と、その出現回数（すなわち共起の回数）とを配列したリストである。図示の例では、上に例示した共起リスト群から、次のようなカテゴリ別出現リストが生成される。 Next, the category-specific appearance list generation unit 104 generates the category-specific appearance list by aggregating the information on the co-occurrence list for each sentence for each category to which the unique name included in each sentence belongs. The category-specific occurrence list is a list in which, for each category of proper names, independent words that co-occur in the same sentence as proper names belonging to the category, and the number of appearances (that is, the number of co-occurrence) are arranged. In the illustrated example, the following category-specific occurrence list is generated from the co-occurrence list group exemplified above.

組織名：今期＊３、売上、１０００億、達成、見込み、発表＊２、新素材、開発、成功、製品、採用、社長、製品戦略、ビジョン、示す：１８個
人名：歌手、道路交通法違反、逮捕、タレント、結婚、３年後、離婚、社長、今期、製品戦略、ビジョン、示す：１２個
地名：公共機関、効率化、図る、狙い、合併：５個
Organization name: current term * 3, sales, 100 billion, achievement, forecast, announcement * 2, new material, development, success, product, adoption, president, product strategy, vision, show: 18 items
Person name: singer, road traffic law violation, arrest, TV personality, marriage, three years later, divorce, president, current term, product strategy, vision, show: 12 items
Place name: public institution, efficiency, plan, aim, merger: 5 items

この例では、組織名、人名、地名という各カテゴリごとに、カテゴリ別出現リストが生成されている。 In this example, a category-specific appearance list is generated for each category of organization name, person name, and place name.

この例では、例えば「今期」という自立語は、文１にて組織名である「ＡＡＡ商事」と、文２にて組織名である「ＢＢＢ電機」と、文６にて組織名である「ＨＨＨ電機」と、それぞれ同一文内に共起している。したがって、集計すれば、「今期」は組織名のカテゴリの中の共起語としてカテゴリ別出現リストに登録され、その出現数３も登録される。なお、上記の例では、出現数が１回の語については、出現回数の数値の表示は省略している。 In this example, for example, the independent word “current term” is “AAA Shoji”, which is the organization name in sentence 1, “BBB Electric”, which is the organization name in sentence 2, and the organization name in sentence 6. "HHH Electric" co-occurs in the same sentence. Therefore, by summing up, “this term” is registered in the category-specific appearance list as a co-occurrence word in the category of the organization name, and the number of occurrences 3 is also registered. In the above example, the numerical value of the number of appearances is omitted for a word having the number of appearances of one.

また、各カテゴリ別出現リストの末尾に付属する「１８個」等の個数の値は、そのカテゴリに属する各固有名と共起した自立語の総数である。これは、カテゴリ別出現頻度テーブル１４８（図２参照）における「カテゴリ別総数」の値を算出するために集計した値である。 Further, the value of the number such as “18” attached to the end of each category appearance list is the total number of independent words co-occurring with each unique name belonging to the category. This is a value aggregated to calculate the value of “total number by category” in the appearance frequency table by category 148 (see FIG. 2).
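The aggregation described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the data layout (a pair of proper names with categories and a list of independent words per sentence) and the function name `build_category_lists` are assumptions, and only sentences 1, 2 and 6 of the example are included. Consistent with the listed totals, each sentence is counted once per category even if it contains several proper names of that category.

```python
from collections import Counter, defaultdict

# Per-sentence co-occurrence lists (hypothetical layout):
# ([(proper_name, category), ...], [independent words ...])
cooccurrence_lists = [
    ([("AAA Shoji", "organization")],
     ["current term", "sales", "100 billion", "achievement",
      "forecast", "announcement"]),
    ([("BBB Electric", "organization")],
     ["new material", "development", "success", "current term",
      "product", "adoption", "announcement"]),
    ([("HHH Electric", "organization"), ("III", "person")],
     ["president", "current term", "product strategy", "vision", "show"]),
]


def build_category_lists(lists):
    """Count, per category, the independent words co-occurring with its proper names."""
    counts = defaultdict(Counter)
    for proper_names, words in lists:
        # A sentence contributes once to each category present in it.
        for category in {cat for _, cat in proper_names}:
            counts[category].update(words)
    return counts


category_lists = build_category_lists(cooccurrence_lists)
# Total number of co-occurring independent words per category.
totals = {cat: sum(c.values()) for cat, c in category_lists.items()}
```

With these three sentences, "current term" is counted three times for the organization category and the organization total is 18, matching the list above.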

このようにしてカテゴリ別出現リストの生成が終わると、次にテーブル生成・更新部１０６が、それらカテゴリ別出現リストのデータを用いて、カテゴリ別出現頻度テーブル１４８の生成又は更新を行う。 When the generation of the category-specific appearance list is completed in this manner, the table generating / updating unit 106 generates or updates the category-specific appearance frequency table 148 using the data of the category-specific appearance list.

すなわち、学習の時点でカテゴリ別出現頻度テーブル１４８がまだ存在していない場合には、カテゴリ別出現頻度テーブル１４８を生成する。この場合、各カテゴリ別出現リストに含まれる各自立語とその出現数をそれぞれテーブル１４８における当該カテゴリの行のエントリとして登録する。そして更に、カテゴリ別出現リストの末尾の当該カテゴリの出現総数と各自立語の出現数からそれら各自立語の出現頻度スコアを計算し、これをテーブル１４８に登録する。 That is, if the category-specific appearance frequency table 148 does not yet exist at the time of learning, the table 148 is generated. In this case, each independent word included in each category-specific appearance list and its number of occurrences are registered as entries in the row of that category in the table 148. Further, the appearance frequency score of each independent word is calculated from the category's total number of occurrences, recorded at the end of the category-specific appearance list, and the number of occurrences of that independent word, and the score is registered in the table 148.

一方、学習の時点でカテゴリ別出現頻度テーブル１４８が既に存在している場合には、各カテゴリ別出現リストのデータを用いてそのテーブル１４８を更新する。この更新では、カテゴリ別出現リストに含まれる各自立語のうち、テーブル１４８に当該カテゴリのデータとして既に登録されているものについては、テーブル１４８に登録されている出現数に当該リストでの出現数を加算する。また、「カテゴリ別総数」の値は、カテゴリ別出現リストの末尾の出現総数の値を加算して更新する。そして、このように更新された各自立語の出現数及びカテゴリ別総数の値を用いて、各自立語の出現頻度スコアを再計算し、この計算結果の値に更新する。なお、カテゴリ別出現リスト上の自立語のうち、テーブル１４８に登録されていないものについては、テーブル１４８の新規生成の場合と同様の処理を行えばよい。 On the other hand, if the category-specific appearance frequency table 148 already exists at the time of learning, the table 148 is updated using the data of each category-specific appearance list. In this update, for each independent word in a category-specific appearance list that is already registered in the table 148 as data of that category, the number of occurrences in the list is added to the number of occurrences registered in the table 148. The value of “total number by category” is updated by adding the total number of occurrences recorded at the end of the category-specific appearance list. The appearance frequency score of each independent word is then recalculated from the updated occurrence counts and category totals, and the stored score is replaced with the result. Independent words on a category-specific appearance list that are not yet registered in the table 148 are handled in the same way as when the table 148 is newly generated.
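The generate-or-update behaviour can be sketched as below. The class name `FrequencyTable` and its methods are hypothetical. The text says only that the score is calculated from a word's occurrence count and the category total, so the ratio `count / total` used here is an assumption. The sketch also recomputes scores on demand instead of storing them, which yields the same values as recalculating after each update.

```python
from collections import Counter


class FrequencyTable:
    """Hypothetical stand-in for the category-specific appearance frequency table."""

    def __init__(self):
        self.counts = {}  # category -> Counter of independent words
        self.totals = {}  # category -> total number by category

    def update(self, category_lists):
        """Create entries for new words and add counts for existing ones."""
        for category, word_counts in category_lists.items():
            self.counts.setdefault(category, Counter()).update(word_counts)
            added = sum(word_counts.values())
            self.totals[category] = self.totals.get(category, 0) + added

    def score(self, category, word):
        """Assumed score: occurrences of the word divided by the category total."""
        total = self.totals.get(category, 0)
        if total == 0:
            return 0.0
        return self.counts[category][word] / total


table = FrequencyTable()
table.update({"organization": {"current term": 3, "sales": 1}})   # first learning run
table.update({"organization": {"current term": 1, "vision": 1}})  # later update
```

After the second call, "current term" has 4 occurrences out of a category total of 6, and the newly seen word "vision" has 1 out of 6.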

このように学習用サブシステムは、与えられた学習用文書２００に基づきカテゴリ別出現頻度テーブル１４８を生成したり、更新したりすることができる。 As described above, the learning subsystem can generate or update the category-specific appearance frequency table 148 based on the given learning document 200.

次に、解析用サブシステムについて、図４を参照して説明する。 Next, the analysis subsystem will be described with reference to FIG.

図示のように、解析用サブシステムは、形態素解析部１０２、共起語検出部１０８、カテゴリ推定部１１０を備えている。これら各機能モジュール１０２，１０８及び１１０は、ＣＰＵ１０で解析用プログラム１４４を実行することにより実現される。解析用サブシステムは、入出力Ｉ／Ｆ１６を介して入力された解析対象文書を解析し、その中に含まれる固有名のカテゴリを推定する。 As illustrated, the analysis subsystem includes a morphological analysis unit 102, a co-occurrence word detection unit 108, and a category estimation unit 110. Each of these functional modules 102, 108 and 110 is realized by the CPU 10 executing the analysis program 144. The analysis subsystem analyzes the analysis target document input via the input / output I / F 16 and estimates the category of the unique name included therein.

解析用サブシステムにおいて、形態素解析部１０２ａが果たす機能は、前述の学習用サブシステムの形態素解析部１０２と基本的に同様である。相違する点は、学習用サブシステムの形態素解析部１０２は、カテゴリが１つに特定できない固有名は解析結果から省いたのに対し、この解析用サブシステムの形態素解析部１０２ａは、カテゴリが１つに特定できない固有名を、該当する可能性のある複数のカテゴリの情報を含んだ形で出力する点である。 In the analysis subsystem, the function performed by the morphological analysis unit 102a is basically the same as that of the morphological analysis unit 102 of the learning subsystem described above. The difference is that the morphological analysis unit 102 of the learning subsystem omits from its analysis result any proper name whose category cannot be narrowed down to one, whereas the morphological analysis unit 102a of the analysis subsystem outputs such a proper name together with information on the multiple categories to which it may belong.

例えば、解析対象文書３００として、図示のように、次の２つの文を含んだ文書が与えられた場合を考える。 For example, consider a case where a document including the following two sentences is given as the analysis target document 300 as shown in the figure.

文１：富士が今期の売上を発表した。
文２：宮崎が候補地として発表された。 Sentence 1: Fuji announced the sales of this term.
Sentence 2: Miyazaki was announced as a candidate site.

この場合、辞書１４６（及び場合によっては特許文献１の技術）によるカテゴリ判定では、文１の「富士」及び文２の「宮崎」という固有名はカテゴリが一意に絞り込めない。したがって、形態素解析部１０２ａの出力は次のようなものとなる。 In this case, category determination using the dictionary 146 (and, where applicable, the technique of Patent Document 1) cannot narrow the proper names “Fuji” in sentence 1 and “Miyazaki” in sentence 2 down to a single category. Therefore, the output of the morphological analysis unit 102a is as follows.

文１：富士（人名、組織名、地名）、今期、売上、発表
文２：宮崎（人名、地名）、候補地、発表
Sentence 1: Fuji (person name, organization name, place name), current term, sales, announcement
Sentence 2: Miyazaki (person name, place name), candidate site, announcement

この例では、例えば文１の固有名「富士」は３つのカテゴリに属しており、形態素解析のレベルでは１つに絞り込めない。このため、以下の共起語検出部１０８及びカテゴリ推定部１１０では、解析対象文書３００内でその固有名と同一文内に共起した自立語の情報を用いてカテゴリの絞込を行う。 In this example, for example, the proper name “Fuji” of sentence 1 belongs to three categories, and cannot be narrowed down to one at the level of morphological analysis. For this reason, the following co-occurrence word detection unit 108 and category estimation unit 110 narrow down categories using information on independent words co-occurring in the same sentence as the proper name in the analysis target document 300.

まず共起語検出部１０８は、解析対象文書３００中から検出された、カテゴリが一意に特定できない各固有名について、その固有名と同一文内から自立語を検出する。図示の例では、共起語検出部１０８が個々の固有名ごとに検出した共起自立語のリストは次のようになる。 First, the co-occurrence word detection unit 108 detects an independent word from the same sentence as the proper name for each proper name detected from the analysis target document 300 and whose category cannot be uniquely specified. In the illustrated example, the list of co-occurrence independent words detected by the co-occurrence word detection unit 108 for each unique name is as follows.

富士：今期、売上、発表
宮崎：候補地、発表
Fuji: current term, sales, announcement
Miyazaki: candidate site, announcement

このように検出された、個々の固有名に対する共起自立語のリストは、固有名のカテゴリを推定する際の有力な文脈情報となる。 The list of co-occurring independent words for individual proper names detected in this way is useful context information in estimating the proper name category.
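As a rough sketch of this detection step, assume the morphological analysis output is a list of sentences, each a list of `(surface, categories)` pairs in which `categories` is `None` for an independent word and a tuple of candidate categories for a proper name. This layout and the function name `detect_cooccurring_words` are illustrative assumptions.

```python
def detect_cooccurring_words(analyzed_sentences):
    """For each proper name with more than one candidate category,
    collect the independent words found in the same sentence."""
    results = []
    for index, tokens in enumerate(analyzed_sentences, start=1):
        words = [surface for surface, cats in tokens if cats is None]
        for surface, cats in tokens:
            if cats is not None and len(cats) > 1:
                results.append((index, surface, words))
    return results


document = [
    [("Fuji", ("person", "organization", "place")),
     ("current term", None), ("sales", None), ("announcement", None)],
    [("Miyazaki", ("person", "place")),
     ("candidate site", None), ("announcement", None)],
]
found = detect_cooccurring_words(document)
```

Each result keeps the sentence number so that the same surface form appearing in different sentences can be estimated separately.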

カテゴリ推定部１１０は、固有名ごとに、その固有名の共起自立語のリストを用いて、その固有名が各カテゴリに属する可能性を示す適合度を計算する。この計算では、例えば、カテゴリごとに、各共起自立語のそのカテゴリに対する出現頻度スコアをカテゴリ別出現頻度テーブル１４８から求め、総和する。図示の例の場合、固有名である「富士」及び「宮崎」の、組織名、人名及び地名の各カテゴリへの適合度は次のように計算される。 For each proper name, the category estimation unit 110 uses the list of co-occurring independent words of the proper name to calculate a fitness indicating the possibility that the proper name belongs to each category. In this calculation, for example, for each category, the appearance frequency score for each category of each co-occurrence independent word is obtained from the category-specific appearance frequency table 148 and summed. In the case of the illustrated example, the degree of fit of the proper names “Fuji” and “Miyazaki” to each category of organization name, person name, and place name is calculated as follows.

富士：
組織名：今期（０．３５）＋売上（０．９３）＋発表（４．１３）＝５．４１
人名：今期（０．３３）＋売上（０）＋発表（０）＝０．３３
地名：今期（０）＋売上（０）＋発表（０）＝０
宮崎：
組織名：候補地（０）＋発表（４．１３）＝４．１３
人名：候補地（０）＋発表（０）＝０
地名：候補地（５．１０）＋発表（０）＝５．１０
Fuji:
Organization name: current term (0.35) + sales (0.93) + announcement (4.13) = 5.41
Person name: current term (0.33) + sales (0) + announcement (0) = 0.33
Place name: current term (0) + sales (0) + announcement (0) = 0
Miyazaki:
Organization name: candidate site (0) + announcement (4.13) = 4.13
Person name: candidate site (0) + announcement (0) = 0
Place name: candidate site (5.10) + announcement (0) = 5.10

この例では、例えば、固有名「富士」のカテゴリ「組織名」に対する適合度は５．４１、カテゴリ「人名」に対する適合度は０．３３となる。適合度が高いほど、その固有名はそのカテゴリに属する可能性が高いと考えられる。 In this example, for example, the fitness level of the proper name “Fuji” with respect to the category “organization name” is 5.41, and the fitness level with respect to the category “person name” is 0.33. The higher the degree of matching, the more likely that the proper name belongs to that category.

カテゴリ推定部１１０は、この計算結果に基づき、各固有名について、適合度が最も高いカテゴリをその固有名の属するカテゴリと推定する。したがって、図示の例では、カテゴリ推定部１１０の出力する解析結果３５０では、文１中の固有名「富士」のカテゴリは組織名、文２中の固有名「宮崎」のカテゴリは地名となる。 Based on the calculation result, the category estimation unit 110 estimates, for each unique name, the category having the highest matching degree as the category to which the proper name belongs. Therefore, in the illustrated example, in the analysis result 350 output from the category estimation unit 110, the category of the proper name “Fuji” in the sentence 1 is the organization name, and the category of the proper name “Miyazaki” in the sentence 2 is the place name.
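The fitness calculation and the choice of the best category can be sketched as follows. The score values are the ones quoted in the example above; the dictionary layout and the function names are assumptions made for illustration, and words absent from a category are treated as having score 0.

```python
# Appearance frequency scores quoted in the example (category -> word -> score).
SCORES = {
    "organization": {"current term": 0.35, "sales": 0.93, "announcement": 4.13},
    "person": {"current term": 0.33},
    "place": {"candidate site": 5.10},
}
CATEGORIES = ("organization", "person", "place")


def fitness(words, table, categories=CATEGORIES):
    """Sum, per category, the scores of the co-occurring independent words."""
    return {
        category: sum(table.get(category, {}).get(word, 0.0) for word in words)
        for category in categories
    }


def estimate_category(words, table, categories=CATEGORIES):
    """Return the category with the highest fitness (first one wins on a tie)."""
    values = fitness(words, table, categories)
    return max(values, key=values.get)


fuji = fitness(["current term", "sales", "announcement"], SCORES)
miyazaki = fitness(["candidate site", "announcement"], SCORES)
```

How ties between categories should be resolved is not stated in the text; this sketch simply keeps the first category in `CATEGORIES`.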

以上説明したように、本実施形態によれば、辞書等の情報ではカテゴリが確定できない固有名のカテゴリを、その固有名と同一文内に現れる自立語の情報を用いることで、いわば文脈の面から推定することができる。 As described above, according to the present embodiment, the category of a proper name that cannot be determined from information such as a dictionary can be estimated from the context, so to speak, by using information on the independent words that appear in the same sentence as the proper name.

以上の実施形態では、固有名と同一文内に共起する自立語はすべて同等に取り扱ったが、この代わりに共起自立語に対し、固有名との距離に応じた重み付けを行うことで、推定の精度の向上を図ることも考えられる。すなわち、一般的には、同じく共起する自立語でも、固有名との距離が近いものほどその固有名との関連が強いと考えられる。そこで、関連が強い共起自立語ほど固有名のカテゴリ推定に及ぼす重みを強くするわけである。重み付けのやり方には、カテゴリ別出現頻度テーブル１４８に対して重み付けを反映させるやり方と、カテゴリ推定部１１０での適合度の計算において重み付けを反映させるやり方の２通りが考えられる。もちろん両者を組み合わせることも可能である。 In the above embodiment, all independent words that co-occur in the same sentence as a proper name are treated equally. Instead, the accuracy of estimation may be improved by weighting each co-occurring independent word according to its distance from the proper name. In general, among co-occurring independent words, those closer to the proper name are considered to be more strongly related to it. Accordingly, co-occurring independent words that are more strongly related are given greater weight in estimating the category of the proper name. There are two ways of weighting: reflecting the weights in the category-specific appearance frequency table 148, and reflecting the weights in the calculation of the fitness in the category estimation unit 110. The two can of course be combined.

図５に示す例は、カテゴリ別出現頻度テーブル１４８に対して重み付けを反映させる場合の例であり、学習サブシステムでの処理の例を示している。この例は、カテゴリ別出現リスト生成部１０４ａとテーブル生成・更新部１０６ａの処理内容が、図３の構成内の対応要素と一部異なるだけであり、他の部分は図３の構成と同様である。また、図５に例示する学習用文書２００の例は、図３の場合と同じものである。 The example shown in FIG. 5 reflects the weighting in the category-specific appearance frequency table 148 and illustrates the processing in the learning subsystem. In this example, the processing of the category-specific appearance list generation unit 104a and the table generation/update unit 106a differs only partly from that of the corresponding elements in the configuration of FIG. 3; the other parts are the same as in FIG. 3. The learning document 200 illustrated in FIG. 5 is the same as in FIG. 3.

図５の変形例において、カテゴリ別出現リスト生成部１０４ａは、形態素解析部１０２の出力を集計する際、固有名と同一文内に共起した各自立語に対し、その固有名との距離に応じた重みスコアを求め、カテゴリ別出現リストに登録する。図示の例では、固有名から共起自立語までの間の自立語の数（この数には共起自立語自身を含める）をその固有名と共起自立語との距離とし、その距離の逆数を重みスコアとしている。重みスコアをこのように決めれば、固有名との距離が近いほど、スコアの値は高くなるので、図３の例の出現頻度スコアと同様の扱いが可能となる。なお、距離の逆数を重みスコアとするという求め方はあくまで一例であり、この他の計算式で距離から重みスコアを求めてももちろんよい。 In the modification of FIG. 5, the category-specific occurrence list generation unit 104 a calculates the distance from the proper name for each independent word that co-occurs in the same sentence as the proper name when counting the outputs of the morphological analysis unit 102. The corresponding weight score is obtained and registered in the category-specific appearance list. In the example shown, the number of independent words between proper names and co-occurring independent words (this number includes the co-occurring independent words themselves) is the distance between the proper name and the co-occurring independent words, and the distance The reciprocal is used as the weight score. If the weight score is determined in this way, the closer the distance to the proper name is, the higher the score value becomes, so that the same treatment as the appearance frequency score in the example of FIG. 3 is possible. The method of obtaining the reciprocal of the distance as the weight score is merely an example, and the weight score may be obtained from the distance using another calculation formula.

この例において、図示の文１〜文６からなる学習用文書２００からカテゴリ別出現リスト生成部１０４ａが生成するカテゴリ別出現リストは、次のようになる。 In this example, the category-specific appearance list generated by the category-specific appearance list generation unit 104a from the learning document 200 including the illustrated sentence 1 to sentence 6 is as follows.

組織名：今期（１／１）、売上（１／２）、１０００億（１／３）、達成（１／４）、見込み（１／５）、発表（１／６）、新素材（１／１）、開発（１／２）、成功（１／３）、今期（１／４）、製品（１／５）、採用（１／６）、発表（１／７）、社長（１／２）、今期（１／３）、製品戦略（１／４）、ビジョン（１／５）、示す（１／７）：１８個

人名：歌手（１／１）、道路交通法違反（１／１）、逮捕（１／２）、タレント（１／１）、結婚（１／２）、３年後（１／３）、離婚（１／４）、タレント（１／２）、結婚（１／１）、３年後（１／２）、離婚（１／３）、社長（１／１）、今期（１／２）、製品戦略（１／３）、ビジョン（１／４）、示す（１／５）：１６個

地名：公共機関（１／２）、効率化（１／３）、図る（１／４）、狙い（１／５）、合併（１／６）、公共機関（１／１）、効率化（１／２）、図る（１／３）、狙い（１／４）、合併（１／５）：１０個

Organization name: current term (1/1), sales (1/2), 100 billion (1/3), achievement (1/4), forecast (1/5), announcement (1/6), new material (1/1), development (1/2), success (1/3), current term (1/4), product (1/5), adoption (1/6), announcement (1/7), president (1/2), current term (1/3), product strategy (1/4), vision (1/5), show (1/7): 18 items

Person name: singer (1/1), road traffic law violation (1/1), arrest (1/2), TV personality (1/1), marriage (1/2), three years later (1/3), divorce (1/4), TV personality (1/2), marriage (1/1), three years later (1/2), divorce (1/3), president (1/1), current term (1/2), product strategy (1/3), vision (1/4), show (1/5): 16 items

Place name: public institution (1/2), efficiency (1/3), plan (1/4), aim (1/5), merger (1/6), public institution (1/1), efficiency (1/2), plan (1/3), aim (1/4), merger (1/5): 10 items

この例では、例えば「今期」は、文１内で、対象となる固有名「ＡＡＡ商事」の隣に位置するので、両者の距離は１となり、重みスコアは１／１となる。 In this example, for example, “this term” is located next to the target proper name “AAA Shoji” in sentence 1, so the distance between them is 1 and the weight score is 1/1.

同一文内に固有名が複数含まれる場合は、その文内の各自立語をそれら各固有名に対してそれぞれ別々のものとしてカテゴリ別出現リストに登録する。各自立語についての重みスコアは、それぞれ対象とする固有名との距離から求める。図示の例では、文５には「ＦＦＦ」と「ＧＧＧ」という２つの固有名（カテゴリは共に地名）が含まれているので、文５に出てくる自立語「公共機関」は、固有名「ＦＦＦ」に対して重みスコア１／２で、固有名「ＧＧＧ」に対して重みスコア１／１でカテゴリ別出現リストに登録される。このように、この変形例では図３の例と異なり、１つの自立語を共起する固有名ごとに別々に登録しているので、カテゴリ別出現リストの末尾に登録されるカテゴリ別の総出現数の値が図３の例よりも大きい値となっている。 When a sentence contains more than one proper name, each independent word in the sentence is registered in the category-specific appearance list separately for each of those proper names. The weight score of each independent word is obtained from its distance to the proper name in question. In the illustrated example, sentence 5 contains two proper names, “FFF” and “GGG” (both in the place name category), so the independent word “public institution” in sentence 5 is registered in the category-specific appearance list with a weight score of 1/2 for the proper name “FFF” and with a weight score of 1/1 for the proper name “GGG”. Because this modification, unlike the example of FIG. 3, registers one independent word separately for each proper name it co-occurs with, the total number of occurrences per category registered at the end of a category-specific appearance list can be larger than in the example of FIG. 3.
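The distance-based weight scores can be sketched as follows, using the reciprocal of the distance as in the illustrated example. The token layout and the function name `weight_scores` are assumptions. The distance is taken here as the difference in position within the per-sentence list (proper names included), which is one interpretation that reproduces the weights listed for sentences 3 and 5.

```python
from fractions import Fraction


def weight_scores(tokens):
    """tokens: (surface, category) pairs in sentence order; category is None
    for an independent word. Returns one entry per (proper name, word) pair."""
    entries = []
    for i, (name, category) in enumerate(tokens):
        if category is None:
            continue  # not a proper name
        for j, (word, word_category) in enumerate(tokens):
            if word_category is None:
                # Reciprocal of the positional distance as the weight score.
                entries.append((category, name, word, Fraction(1, abs(i - j))))
    return entries


sentence5 = [
    ("FFF", "place"), ("GGG", "place"), ("public institution", None),
    ("efficiency", None), ("plan", None), ("aim", None), ("merger", None),
]
sentence3 = [
    ("singer", None), ("CCC", "person"),
    ("road traffic law violation", None), ("arrest", None),
]
place_entries = weight_scores(sentence5)
person_entries = weight_scores(sentence3)
```

Sentence 5 yields ten entries for the place name category, five for each of its two proper names, which is why the per-category total grows compared with the unweighted list.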

テーブル生成・更新部１０６ａは、カテゴリ別出現頻度テーブルを作成する際、単に各自立語の出現数をカウントして登録する代わりに、自立語ごとに、その自立語の重みスコアの総和を求め、登録する。例えば、「今期」は、例示の学習用文書からは、組織名と共起するものとして３つ検出され、それらの重みスコアはそれぞれ１／１，１／４，１／３なのでその総和は１．５８となる。そして、出現頻度スコアは、その重みスコアの総和を「カテゴリ別総数」で正規化することで求める。 When creating the category-specific appearance frequency table, the table generation/update unit 106a does not simply count and register the number of occurrences of each independent word; instead, for each independent word, it obtains and registers the sum of that word's weight scores. For example, “current term” is detected three times in the example learning document as co-occurring with an organization name, with weight scores of 1/1, 1/4, and 1/3, so the sum is 1.58. The appearance frequency score is then obtained by normalizing this sum of weight scores by the “total number by category”.
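The arithmetic for “current term” can be checked with a short sketch. It assumes that "normalizing" means dividing the sum of weight scores by the total number by category, and it takes 18 as that total from the organization name list; the variable names are illustrative.

```python
from fractions import Fraction

# Weight scores of "current term" for the organization name category.
weights = [Fraction(1, 1), Fraction(1, 4), Fraction(1, 3)]
weight_sum = sum(weights)  # 19/12, about 1.58

# Assumed normalization: divide by the total number by category.
category_total = 18
appearance_frequency_score = weight_sum / category_total
```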

このように求められたカテゴリ別出現頻度テーブル１４８の例を図６に示す。この例では、テーブル１４８には、各カテゴリと共起する自立語ごとに、「単語：重みスコアの総和：出現頻度スコア」が登録されている。出現頻度スコアとしては、煩雑さを避けるため、実際の値に１００００を掛けた値を表示している。 An example of the category-specific appearance frequency table 148 thus obtained is shown in FIG. In this example, “word: sum of weight scores: appearance frequency score” is registered in the table 148 for each independent word that co-occurs with each category. As the appearance frequency score, a value obtained by multiplying the actual value by 10,000 is displayed in order to avoid complexity.

以上は、カテゴリ別出現頻度テーブル１４８に対して重み付けを反映させるやり方の例であった。次に、解析用サブシステムのカテゴリ推定部１１０での適合度の計算において重み付けを反映させる場合の例について、図７を用いて説明する。 The above is an example of a method of reflecting weighting on the appearance frequency table 148 for each category. Next, an example in which weighting is reflected in the calculation of the fitness in the category estimation unit 110 of the analysis subsystem will be described with reference to FIG.

図７の例は、カテゴリ推定部１１０ａの処理内容が図４の構成内の対応要素と一部異なるが、他の部分は図４の構成と同様である。またこの例では、カテゴリ別出現頻度テーブル１４８として、図６の例で求められる「距離」を反映したテーブルを用いているが、この代わりに図４の例と同様「距離」を考慮しないテーブルを用いてもよい。また、図７に例示する解析対象文書３００の例は、図４の場合と同じものである。 In the example of FIG. 7, the processing contents of the category estimation unit 110a are partially different from the corresponding elements in the configuration of FIG. 4, but the other parts are the same as the configuration of FIG. In this example, as the category-specific appearance frequency table 148, a table reflecting the “distance” obtained in the example of FIG. 6 is used. Instead, a table that does not consider the “distance” as in the example of FIG. 4 is used. It may be used. The example of the analysis target document 300 illustrated in FIG. 7 is the same as that in FIG.

この例のカテゴリ推定部１１０ａは、各固有名の各カテゴリに対する適合度を計算する際に、各自立語の出現頻度スコアに対し、固有名との距離に応じた重みを乗じた上で総和する。このときの重みの求め方は、図５の例における重みスコアの場合と同様でよい。したがって、カテゴリ推定部１１０ａでは、次のように各固有名の各カテゴリに対する適合度が計算される。 When calculating the fitness of each proper name for each category, the category estimation unit 110a in this example multiplies the appearance frequency score of each independent word by a weight that depends on the word's distance from the proper name, and then sums the products. The weight may be obtained in the same way as the weight score in the example of FIG. 5. The category estimation unit 110a therefore calculates the fitness of each proper name for each category as follows.

文１の「富士」：
組織名：今期（６．６６）×１／１＋売上（３．２３）×１／２＋発表（９．０３）×１／３＝１１．２９
人名：今期（７．６４）×１／１＋売上（０）×１／２＋発表（０）×１／３＝７．６４
地名：今期（０）×１／１＋売上（０）×１／２＋発表（０）×１／３＝０
文２の「宮崎」：
組織名：候補地（０）×１／１＋発表（９．０３）×１／２＝４．５２
人名：候補地（０）×１／１＋発表（０）×１／２＝０
地名：候補地（９３．９３）×１／１＋発表（０）×１／２＝９３．９３
“Fuji” in sentence 1:
Organization name: current term (6.66) x 1/1 + sales (3.23) x 1/2 + announcement (9.03) x 1/3 = 11.29
Person name: current term (7.64) x 1/1 + sales (0) x 1/2 + announcement (0) x 1/3 = 7.64
Place name: current term (0) x 1/1 + sales (0) x 1/2 + announcement (0) x 1/3 = 0
“Miyazaki” in sentence 2:
Organization name: candidate site (0) x 1/1 + announcement (9.03) x 1/2 = 4.52
Person name: candidate site (0) x 1/1 + announcement (0) x 1/2 = 0
Place name: candidate site (93.93) x 1/1 + announcement (0) x 1/2 = 93.93

求めた適合度がもっとも高いカテゴリを解析結果として採用する点は上述の例と同様である。 The point that the category with the highest degree of matching is adopted as the analysis result is the same as in the above example.
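The weighted fitness calculation can be sketched as follows. The scores are the values quoted in the example; the table layout, the `(word, distance)` input format, and the function name `weighted_fitness` are assumptions, and the weight is taken as the reciprocal of the distance.

```python
# Appearance frequency scores quoted in the example (category -> word -> score).
TABLE = {
    "organization": {"current term": 6.66, "sales": 3.23, "announcement": 9.03},
    "person": {"current term": 7.64},
    "place": {"candidate site": 93.93},
}
CATEGORIES = ("organization", "person", "place")


def weighted_fitness(words_with_distance, table, categories=CATEGORIES):
    """Sum score * (1 / distance) per category for (word, distance) pairs."""
    return {
        category: sum(
            table.get(category, {}).get(word, 0.0) / distance
            for word, distance in words_with_distance
        )
        for category in categories
    }


fuji = weighted_fitness(
    [("current term", 1), ("sales", 2), ("announcement", 3)], TABLE)
miyazaki = weighted_fitness(
    [("candidate site", 1), ("announcement", 2)], TABLE)
best_fuji = max(fuji, key=fuji.get)
best_miyazaki = max(miyazaki, key=miyazaki.get)
```

With these inputs the highest fitness is the organization name category for “Fuji” and the place name category for “Miyazaki”.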

このように、固有名と共起自立語との距離を考慮することで、カテゴリ推定精度の向上を見込むことができる。 Thus, by considering the distance between the proper name and the co-occurring independent word, it is possible to expect an improvement in category estimation accuracy.

以上、本発明の好適な実施の形態とその変形例を説明した。以上に説明した本実施形態及び変形例の固有名カテゴリ推定装置は、例えば、機械翻訳や自然言語ユーザインタフェース等の応用処理のための前処理の１つとして用いることができる。このような前処理としてよく知られているものに形態素解析や構文解析などがあるが、固有名のカテゴリ推定は、形態素解析における単語の品詞確定処理の一部分として利用できる。すなわち、機械翻訳などの対象として解析対象文書が与えられた場合、その中に含まれる多義性のある固有名（該当する可能性のあるカテゴリが複数ある固有名）については、この固有名カテゴリ推定装置により、該当する可能性が最も高いカテゴリを推定し、この情報を後段の応用処理に提供する。 The preferred embodiment of the present invention and its modifications have been described above. The proper name category estimation apparatus of the embodiment and modifications described above can be used, for example, as one of the preprocessing steps for application processing such as machine translation or a natural language user interface. Well-known preprocessing steps of this kind include morphological analysis and syntactic analysis, and category estimation of proper names can be used as part of the part-of-speech determination for words in morphological analysis. That is, when an analysis target document is given as the input to machine translation or the like, the proper name category estimation apparatus estimates the most likely category for each ambiguous proper name in the document (a proper name with more than one possible category) and provides this information to the subsequent application processing.

以上では、固有名に共起する単語を見つける対象範囲である「所定の範囲」として、「固有名と同一の文の中」を採用した場合を説明したが、これは一例に過ぎない。この他にも「固有名と同一の段落の中」を「所定の範囲」として採用することもできる。また、文書が段落よりも大きい、例えば節や章などといった論理的な単位まで構造化されている場合は、そのような論理的な単位を「所定の範囲」として採用することもできる。 In the above description, the case where “in the same sentence as the proper name” is adopted as the “predetermined range” that is the target range for finding the words that co-occur in the proper name, but this is only an example. In addition, “in the same paragraph as the proper name” can be adopted as the “predetermined range”. Further, when a document is structured to be larger than a paragraph, for example, a logical unit such as a section or a chapter, such a logical unit can be adopted as a “predetermined range”.

また、以上では、文書中での固有名と個々の自立語との距離として、その固有名からその自立語に達するまでにある自立語の数を用いたが、これもあくまで一例である。この他に、「固有名と同一文内にある」、「固有名と同一文内ではないが同一段落内にはある」、「固有名と同一段落内にはないが同一節内にはある」等という具合に、文書の論理構造上での関係の近さ（遠さ）を数値化し、これを「距離」として用いてもよい。 In the above description, the number of independent words from the proper name to the independent word is used as the distance between the proper name and each independent word in the document. This is just an example. Other than this, “is in the same sentence as the proper name”, “is not in the same sentence as the proper name but is in the same paragraph”, “is not in the same paragraph as the proper name but is in the same section” Or the like, the closeness (distantness) of the relationship in the logical structure of the document may be digitized and used as the “distance”.

また、ところで組織名や人名、地名などの固有名は、日々新たに生成されていくものであり、辞書１４６が常に最新の固有名までカバーすることは事実上困難である。そこで、本実施形態の解析用サブシステムの変形例として、形態素解析部１０２の処理では未知の語（辞書１４６に未登録の語）と判定された語があった場合、これをカテゴリ不明の固有名とみなし、これに対してカテゴリが一意にと規定できない固有名の場合と同様のカテゴリ推定処理を施す処理が考えられる。この変形例によれば、日々新たに生まれる固有名に対しても、カテゴリ推定を行うことができる。 Proper names such as organization names, person names, and place names are newly created every day, and it is practically difficult for the dictionary 146 to always cover the latest ones. As a modification of the analysis subsystem of this embodiment, therefore, a word that the morphological analysis unit 102 judges to be unknown (a word not registered in the dictionary 146) may be regarded as a proper name of unknown category and subjected to the same category estimation processing as a proper name whose category cannot be uniquely determined. According to this modification, category estimation can be performed even for proper names that are newly created every day.
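A minimal sketch of this unknown-word handling is shown below. The dictionary layout (surface form mapped to a tuple of categories, empty for ordinary independent words), the fixed category set, and both function names are hypothetical and serve only to illustrate the idea of treating an unregistered word as a proper name with every category as a candidate.

```python
ALL_CATEGORIES = ("organization", "person", "place")


def candidate_categories(surface, dictionary):
    """Return the candidate categories of a word.

    A word missing from the dictionary is regarded as a proper name of
    unknown category, so every category is a candidate.
    """
    categories = dictionary.get(surface)
    if categories is None:
        return ALL_CATEGORIES
    return categories


def needs_estimation(surface, dictionary):
    """True when context-based category estimation should be applied."""
    return len(candidate_categories(surface, dictionary)) > 1


dictionary = {
    "AAA Shoji": ("organization",),
    "Fuji": ("person", "organization", "place"),
    "sales": (),  # ordinary independent word, not a proper name
}
```

Under these assumptions, an unregistered word is routed to the same estimation step as an ambiguous name such as “Fuji”.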

本発明に係る固有名カテゴリ推定装置が実装されるハードウエア構成の例を示す図である。 A diagram showing an example of the hardware configuration on which the proper name category estimation apparatus according to the present invention is implemented.
カテゴリ別出現頻度テーブルのデータ内容の例を示す図である。 A diagram showing an example of the data content of the category-specific appearance frequency table.
学習用サブシステムの構成及び処理を説明するための図である。 A diagram for explaining the configuration and processing of the learning subsystem.
解析用サブシステムの構成及び処理を説明するための図である。 A diagram for explaining the configuration and processing of the analysis subsystem.
重み付けをカテゴリ別出現頻度テーブルに反映させる変形例を示す図である。 A diagram showing a modification in which weighting is reflected in the category-specific appearance frequency table.
重み付けを反映したカテゴリ別出現頻度テーブルのデータ内容の例を示す図である。 A diagram showing an example of the data content of the category-specific appearance frequency table with weighting reflected.
重み付けを適合度の計算に反映させる変形例を示す図である。 A diagram showing a modification in which weighting is reflected in the calculation of the fitness.

符号の説明 Explanation of symbols

１０ＣＰＵ、１２ＲＡＭ、１４不揮発性記憶装置、１６入出力Ｉ／Ｆ、１８バス、１０２形態素解析部、１０４カテゴリ別出現リスト生成部、１０６テーブル生成・更新部、１０８共起語検出部、１１０カテゴリ推定部、１４２学習用プログラム、１４４解析用プログラム、１４６辞書、１４８カテゴリ別出現頻度テーブル、２００学習用文書、３００解析対象文書。 10 CPU, 12 RAM, 14 Non-volatile storage device, 16 Input / output I / F, 18 bus, 102 Morphological analysis unit, 104 Category-specific appearance list generation unit, 106 Table generation / update unit, 108 Co-occurrence word detection unit, 110 Category estimation unit, 142 learning program, 144 analysis program, 146 dictionary, 148 appearance frequency table by category, 200 learning document, 300 analysis target document.

Claims

固有名のカテゴリごとに、文書中において該カテゴリに属する固有名に対し所定の範囲内に出現する自立語の出現頻度スコアを記憶したカテゴリ別出現頻度テーブルと、
与えられた解析対象文書に含まれる固有名が属するカテゴリを推定する手段であって、該解析対象文書中で該固有名に対し前記所定の範囲に出現する自立語を抽出し、それら各自立語の各カテゴリに対する出現頻度スコアを前記カテゴリ別出現頻度テーブルからそれぞれ求め、それら出現頻度スコアをカテゴリごとに集計することによりカテゴリごとの適合度を計算し、該適合度に基づき該固有名が属するカテゴリを推定する推定手段と、
を備える固有名カテゴリ推定装置。 A proper name category estimation apparatus comprising:
a category-specific appearance frequency table that stores, for each category of proper names, appearance frequency scores of independent words that appear within a predetermined range of proper names belonging to that category in documents; and
estimating means for estimating the category to which a proper name included in a given analysis target document belongs, the estimating means extracting independent words that appear within the predetermined range of the proper name in the analysis target document, obtaining from the category-specific appearance frequency table the appearance frequency score of each of those independent words for each category, calculating a fitness for each category by totaling those appearance frequency scores category by category, and estimating the category to which the proper name belongs based on the fitness.

与えられた各学習用文書から固有名及び該固有名に対し前記所定の範囲内に出現する自立語を求め、固有名のカテゴリごとに該カテゴリに属する固有名に対し前記所定の範囲内に出現した自立語の出現回数を集計することにより各カテゴリに対するそれら各自立語の出現頻度スコアを求め、カテゴリ別出現頻度テーブルを作成する学習手段、を更に備える請求項１記載の固有名カテゴリ推定装置。 The proper name category estimation apparatus according to claim 1, further comprising learning means for obtaining, from each given learning document, proper names and the independent words that appear within the predetermined range of each proper name, obtaining the appearance frequency score of each independent word for each category by totaling, for each category of proper names, the number of occurrences of the independent words that appeared within the predetermined range of proper names belonging to that category, and creating the category-specific appearance frequency table.

前記学習手段は、各学習用文書から求めた固有名のうち、属するカテゴリが一意に特定できるものを抽出し、抽出した固有名とこれに対し前記所定の範囲内に出現した自立語に基づき前記カテゴリ別出現頻度テーブルを作成することを特徴とする請求項２記載の固有名カテゴリ推定装置。 The proper name category estimation apparatus according to claim 2, wherein the learning means extracts, from the proper names obtained from each learning document, those whose category can be uniquely identified, and creates the category-specific appearance frequency table based on the extracted proper names and the independent words that appeared within the predetermined range of them.

The proper name category estimation apparatus according to claim 2 or 3, wherein the learning means gives each independent word that appeared within the predetermined range of a proper name a weight score that is higher the shorter the distance from that word to the proper name, and obtains the appearance frequency score by totaling the weight scores of the independent words that appeared.
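A distance-weighted variant of the learning step could be sketched as follows. The claim only requires that the weight grow as the distance shrinks; the specific function `1 / distance`, the unit of distance (number of words), and all names below are assumptions chosen for illustration.

```python
from collections import defaultdict


def distance_weight(distance):
    """Assumed weighting: 1/distance, so closer words weigh more."""
    return 1.0 / max(distance, 1)


def build_weighted_table(samples):
    """Build word -> {category: summed weight score}.

    samples: iterable of (category, [(word, distance), ...]) pairs, where
    distance is the assumed word offset from the proper name.
    """
    table = defaultdict(lambda: defaultdict(float))
    for category, words_with_distance in samples:
        for word, distance in words_with_distance:
            # Add a weight score instead of a plain count of 1.
            table[word][category] += distance_weight(distance)
    return {word: dict(by_cat) for word, by_cat in table.items()}


samples = [
    ("PERSON", [("said", 1), ("yesterday", 4)]),
    ("PERSON", [("said", 2)]),
]
table = build_weighted_table(samples)
```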

The proper name category estimation apparatus according to any one of claims 1 to 4, wherein the estimating means calculates the fitness by multiplying the appearance frequency score of each independent word that appears within the predetermined range of the proper name by a weight that is larger the shorter the distance between the independent word and the proper name, and then aggregating the weighted scores.
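The weighted aggregation in this claim can be illustrated with the sketch below. As before, the `1 / distance` weight, the distance unit, the table contents, and the function name are assumptions; only the multiply-then-aggregate structure comes from the claim.

```python
from collections import defaultdict


def weighted_fitness(words_with_distance, table,
                     weight=lambda d: 1.0 / max(d, 1)):
    """Return {category: fitness}, scaling each score by a distance weight.

    words_with_distance: [(independent_word, distance_to_proper_name), ...]
    table: word -> {category: appearance frequency score}
    weight: assumed function that decreases as distance increases.
    """
    fitness = defaultdict(float)
    for word, distance in words_with_distance:
        for category, score in table.get(word, {}).items():
            fitness[category] += weight(distance) * score
    return dict(fitness)


# Invented scores: "said" strongly suggests PERSON but is far away.
table = {
    "station": {"PLACE": 4.0, "PERSON": 1.0},
    "said": {"PERSON": 6.0},
}
fitness = weighted_fitness([("station", 1), ("said", 4)], table)
best = max(fitness, key=fitness.get)
```

In this invented example an unweighted sum would favor `PERSON` (7.0 against 4.0), whereas the weighted sum favors `PLACE` (4.0 against 2.5) because "station" is adjacent to the proper name and "said" is four words away.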

The proper name category estimation apparatus according to any one of claims 1 to 5, further comprising means for performing morphological analysis of the analysis target document with reference to a dictionary, the means, upon detecting an unknown word that is not registered in the dictionary, regarding the unknown word as a proper name and causing the estimating means to estimate the category to which the unknown word belongs.
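The unknown-word handling can be sketched as below. Real morphological analysis is not shown: the sentence is assumed to be already split into tokens, the dictionary is a plain set, and `estimate` stands in for the estimating means. All of these names, and the sample tokens, are hypothetical. As a simplification, every other token in the sentence is passed as context instead of only the independent words.

```python
def estimate_unknown_words(tokens, dictionary, estimate):
    """Treat tokens missing from the dictionary as proper names.

    tokens: pre-segmented tokens of one sentence (segmentation assumed).
    dictionary: set of registered words.
    estimate: callable taking the context words and returning a category.
    Returns {unknown_word: estimated_category}.
    """
    results = {}
    for i, token in enumerate(tokens):
        if token not in dictionary:
            # Context is every other token in the same sentence.
            context = tokens[:i] + tokens[i + 1:]
            results[token] = estimate(context)
    return results


# Toy stand-in for the estimating means, for illustration only.
def toy_estimate(context):
    return "PLACE" if "station" in context else None


dictionary = {"visited", "the", "station", "near"}
tokens = ["visited", "Zorbia", "station"]  # "Zorbia" is a made-up unknown word
result = estimate_unknown_words(tokens, dictionary, toy_estimate)
```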

A program for causing a computer system to function as:
a category-specific appearance frequency table that stores, for each category of proper names, the appearance frequency scores of independent words that appear in documents within a predetermined range of proper names belonging to that category; and
estimating means for estimating the category to which a proper name included in a given analysis target document belongs, the estimating means extracting the independent words that appear within the predetermined range of the proper name in the analysis target document, obtaining the appearance frequency score of each of those independent words for each category from the category-specific appearance frequency table, aggregating the appearance frequency scores for each category to calculate a fitness for each category, and estimating the category to which the proper name belongs based on the fitness.