JP2008225565A

JP2008225565A - Device and method for extracting set of interrelated unique expression

Info

Publication number: JP2008225565A
Application number: JP2007058794A
Authority: JP
Inventors: Toru Hirano; 徹平野; Yoshihiro Matsuo; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-03-08
Filing date: 2007-03-08
Publication date: 2008-09-25

Abstract

PROBLEM TO BE SOLVED: To provide a device and a method for extracting sets of interrelated named entities (the "unique expressions" of the title) in a way that adapts to each individual case.
SOLUTION: When a text is input, the method applies morphological analysis to the input text and extracts the named entities it contains. For each set of named entities formed by combining the extracted entities, it extracts features that include at least the mutual information of the entities in the set when they also appear together in other texts. It then determines, for each set, whether the entities in the set are related, based on the extracted features, on relation judgments made in advance for prescribed sets of named entities, and on features extracted in advance from texts containing the entities of those prescribed sets.
COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to an apparatus and method for extracting, from input text, sets of named entities each consisting of a plurality of mutually related named entities. Such sets play an important role in, for example, summarization systems that summarize input text.

First, concrete examples of interrelated named entities are described.

For example, take the text "Masami Nagasawa gave a stage greeting for a new film in Shibuya, and Mokomichi Hayami did so in Shinjuku," and consider the combinations between the person-name entities "Masami Nagasawa" and "Mokomichi Hayami" and the place-name entities "Shibuya" and "Shinjuku". The text is read as "Masami Nagasawa gave a stage greeting for a new film in Shibuya," so "Masami Nagasawa" and "Shibuya" stand in the relation "gave" (行なった). Likewise, it is read as "Mokomichi Hayami gave a stage greeting for a new film in Shinjuku," so "Mokomichi Hayami" and "Shinjuku" stand in the same relation. However, "Masami Nagasawa" and "Shinjuku" are unrelated, as are "Mokomichi Hayami" and "Shibuya". Extracting the sets of named entities that are actually related is therefore important in summarization systems that summarize input text, in retrieval systems that obtain needed information from large volumes of text data, and the like.

Conventionally, as an apparatus and method of this kind for extracting sets of named entities, one is known that uses machine learning with the word information lying between two named entities as features (see, for example, Non-Patent Document 1).

In this conventional apparatus, when determining whether "Masami Nagasawa" and "Shinjuku" are related, the string lying between the two entities, "は渋谷で、速水もこみちは" (roughly, "[topic marker] in Shibuya, and Mokomichi Hayami [topic marker]"), is used as the feature.
Non-Patent Document 1: Kambhatla, "Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations", The Companion Volume to the Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, July 2004, pp. 178-181

Many of the texts handled by such summarization and retrieval systems are written on the assumption that the reader has certain knowledge. For example, the text "Against Mr. Kamei, the Liberal Democratic Party endorsed Mr. Horie for Hiroshima 6th District." presupposes the reader's knowledge that "Mr. Kamei is running in Hiroshima 6th District." By contrast, the text "Against Mr. Ozawa, the Liberal Democratic Party endorsed Mr. Horie for Hiroshima 6th District." presupposes the knowledge that "Mr. Ozawa is the leader of the Democratic Party" and that "Mr. Ozawa is not running in Hiroshima 6th District." Judged with this knowledge, "Kamei" and "Hiroshima 6th District" are related, whereas "Ozawa" and "Hiroshima 6th District" are not. Thus, even when two input texts have the same sentence structure, the judgment as to whether the named entities are related can differ depending on the reader's background knowledge.

However, because the conventional apparatus uses only the word information lying between the named entities as features, it has difficulty judging correctly whether the entities are related when, as in the examples above, the relation between them depends on the reader's knowledge.

The present invention has been made in view of this problem, and its object is to provide an apparatus and method capable of extracting sets of interrelated named entities in a way that adapts to each individual case.

To achieve this object, the apparatus of the present invention for extracting sets of interrelated named entities extracts, from input text, sets of named entities each consisting of a plurality of mutually related named entities, and comprises: a named entity extraction processing unit that, when text is input, performs morphological analysis on the input text and extracts the plurality of named entities contained in it; a feature extraction processing unit that, for each of the sets of named entities formed by combining the extracted named entities, extracts features including at least statistical information representing the degree of relation between the entities of the set when they appear together in other texts; and a discrimination processing unit that determines, for each set, whether the entities of the set are related, based on the features extracted by the feature extraction processing unit, on results determined in advance as to whether the entities of predetermined sets of named entities are related, and on prior features extracted in advance by the feature extraction processing unit from texts containing the entities of those predetermined sets.

Likewise, to achieve this object, the method of the present invention for extracting sets of interrelated named entities uses a computer to extract, from input text, sets of named entities each consisting of a plurality of mutually related named entities. The computer performs: a first step of, when text is input, performing morphological analysis on the input text and extracting the plurality of named entities contained in it; a second step of extracting, for each of the sets of named entities formed by combining the extracted named entities, features including at least statistical information representing the degree of relation between the entities of the set when they appear together in other texts; and a third step of determining, for each set, whether the entities of the set are related, based on the extracted features, on results determined in advance as to whether the entities of predetermined sets of named entities are related, and on prior features extracted by performing the second step in advance on texts containing the entities of those predetermined sets.

As a result, the features include statistical information representing the degree of relation between the entities of a set when they appear together in other texts, and the relation between the entities is judged using those features. It therefore becomes possible to determine whether the entities are related on the basis of how strongly they are related in other texts.

According to the apparatus and method of the present invention, whether named entities are related can be determined on the basis of the degree of relation between them in other texts. Consequently, even for input text in which the relation between the entities depends on the reader's knowledge, the relation can be judged appropriately, and sets of interrelated named entities can be extracted in a way that adapts to each individual case.

FIGS. 1 to 6 show a first embodiment of the present invention. FIG. 1 is a block diagram of the named entity set extraction apparatus, FIG. 2 is a flowchart of the named entity set extraction process, FIG. 3 shows an outline of the analysis result of the dependency analysis unit, FIG. 4 shows an outline of the processing result of the base analysis result synthesis unit, FIG. 5 shows an example of the processing result of the inter-entity information extraction unit, and FIG. 6 shows an example of the processing result of the external knowledge information extraction unit.

An overview of the apparatus and method of the present invention for extracting sets of interrelated named entities is given below with reference to the drawings.

The apparatus of the present invention for extracting sets of interrelated named entities (hereinafter, the extraction apparatus) consists of a computer built around a well-known CPU, and includes display means such as a monitor, input means such as a keyboard, storage means such as a hard disk and memory, and a communication device connectable to an external network (none of which are shown). The extraction apparatus is provided with a named entity extraction processing unit 10, a feature extraction processing unit 20, a text storage unit 30, a discrimination processing unit 40, and a model storage unit 50.

As shown in FIG. 1, the named entity extraction processing unit 10 consists of a morphological analysis unit 11, a named entity extraction unit 12, a dependency analysis unit 13, a base analysis result synthesis unit 14, and a named entity pair generation unit 15. It performs morphological analysis on the text entered through the input means and extracts the plurality of named entities contained in that text.

On acquiring the input text (step S1 in FIG. 2), the morphological analysis unit 11 applies well-known morphological analysis to segment the text into words, assigns a part of speech to each word, and outputs the result (step S2 in FIG. 2). For example, when the text "亀井氏に対して自民党は堀江氏を広島６区に推薦した。" ("Against Mr. Kamei, the Liberal Democratic Party endorsed Mr. Horie for Hiroshima 6th District.") is input, the output of the morphological analysis unit 11 is "亀井 (noun) / 氏 (noun) / に (particle) / 対し (verb) / て (particle) / 自民党 (noun) / は (particle) / 堀江 (noun) / 氏 (noun) / を (particle) / 広島６区 (noun) / に (particle) / 推薦 (verbal noun) / し (verb) / た (verb suffix) / 。 (period)".

The named entity extraction unit 12 applies well-known named entity extraction to the morphologically analyzed input text obtained from the morphological analysis unit 11, assigns each extracted entity a type such as person name or place name, and outputs the entities (step S3 in FIG. 2). When the morphologically analyzed example text is input to the named entity extraction unit 12, it outputs "亀井 (Kamei; person name)", "自民党 (Liberal Democratic Party; organization name)", "堀江 (Horie; person name)", and "広島６区 (Hiroshima 6th District; place name)".

The dependency analysis unit 13 applies well-known dependency analysis to the morphologically analyzed input text obtained from the morphological analysis unit 11: it divides the text into phrases (bunsetsu), analyzes the dependency relations among the phrases, and outputs them (step S4 in FIG. 2). For the example text, it outputs information representing the dependency structure (a dependency tree) as shown in FIG. 3. Here, the phrases "対し／て", "自民党／は", "堀江／氏／を", and "広島６区／に" each depend on the phrase "推薦／し／た／。", and the phrase "亀井／氏／に" depends on the phrase "対し／て". When these dependency relations are implemented as data, they are expressed, for example, as "(推薦した。(対して(亀井氏に))(自民党は)(堀江氏を)(広島６区に))".

The base analysis result synthesis unit 14 merges the information output by the named entity extraction unit 12 with the information output by the dependency analysis unit 13 (step S5 in FIG. 2). Specifically, on acquiring the information from both units, it attaches to each named entity a tag indicating its type. For example, the tag <PSN>, indicating a person name, is placed before and after "亀井" (Kamei) and "堀江" (Horie); the tag <LOC>, indicating a place name, is placed before and after "広島６区" (Hiroshima 6th District); and the tag <ORG>, indicating an organization, is placed before and after "自民党" (Liberal Democratic Party). The resulting output of the base analysis result synthesis unit 14 is shown in FIG. 4.

On acquiring the processing result of the base analysis result synthesis unit 14, the named entity pair generation unit 15 generates a plurality of sets (pairs) of named entities by combining all the named entities contained in that result (step S6 in FIG. 2). In this embodiment, a set consists of two named entities, one person name and one place name, and is written in the form "亀井：広島６区" (Kamei: Hiroshima 6th District). The entity that appears earlier in the text is written to the left of the "：" as the front entity, and the entity that appears later is written to the right as the back entity. From the example text, the two sets "亀井：広島６区" (Kamei: Hiroshima 6th District) and "堀江：広島６区" (Horie: Hiroshima 6th District) are output.
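The pairing rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tuple layout, the function name `generate_pairs`, and the use of an ASCII colon as separator are assumptions made for the example, and the token positions are those of the example sentence.

```python
from itertools import product

# Tagged entities of the example sentence: (surface form, type tag, token position).
# The tuple layout is an assumption made for this sketch.
entities = [
    ("亀井", "PSN", 0),
    ("自民党", "ORG", 5),
    ("堀江", "PSN", 7),
    ("広島６区", "LOC", 10),
]

def generate_pairs(entities, type_a="PSN", type_b="LOC"):
    """Pair every type_a entity with every type_b entity.

    Within each pair, the entity that appears earlier in the text is
    written on the left (front entity) and the later one on the right.
    """
    group_a = [e for e in entities if e[1] == type_a]
    group_b = [e for e in entities if e[1] == type_b]
    pairs = []
    for a, b in product(group_a, group_b):
        front, back = sorted((a, b), key=lambda e: e[2])
        pairs.append(f"{front[0]}:{back[0]}")
    return pairs

print(generate_pairs(entities))
```

The organization entity is skipped because this embodiment only pairs person names with place names; other type combinations would be passed through `type_a` and `type_b`.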

Next, an overview of the feature extraction processing unit 20 is given. The feature extraction processing unit 20 consists of an inter-entity information extraction unit 21 and an external knowledge information extraction unit 22. For each of the sets of named entities generated by the named entity pair generation unit 15, it extracts features describing the entities in that set (step S7 in FIG. 2).

On acquiring the processing result of the base analysis result synthesis unit 14 and a set of named entities generated by the named entity pair generation unit 15, the inter-entity information extraction unit 21 extracts the following from the input text: the words lying between the two entities of the set, their parts of speech, the number of those words, and the number of named entities lying between the two entities; the words and parts of speech of the phrase on which each entity's phrase depends; whether the two entities are in the same phrase; and the length of the shortest path between the phrases containing the two entities. As a concrete example, consider the processing result of the base analysis result synthesis unit 14 shown in FIG. 4 together with the set "亀井：広島６区" (Kamei: Hiroshima 6th District). First, the words between the two entities, their parts of speech, the number of words, and the number of intervening named entities are "氏／に／対し／て／自民党／は／堀江／氏／を", "noun / particle / verb / particle / noun / particle / noun / noun / particle", "9", and "2". The words and parts of speech of the phrase on which "亀井／氏／に" depends are "対し／て" and "verb / particle", and for the phrase "広島６区／に" they are "推薦／し／た／。" and "verbal noun / verb / verb suffix / period". Whether the two entities are in the same phrase is "NO", because "亀井" and "広島６区" belong to different phrases. The length of the shortest path between the phrases containing the entities is the length of the shortest path in the dependency tree between the phrase "亀井／氏／に" and the phrase "広島６区／に". The phrase "亀井／氏／に" reaches the phrase "広島６区／に" through the phrases "対し／て" and "推薦／し／た／。", and there is no other route, so the shortest path is "亀井／氏／に" → "対し／て" → "推薦／し／た／。" → "広島６区／に", and its length is 3. The processing result of the inter-entity information extraction unit 21 for this example is shown in FIG. 5.
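The inter-entity features listed above can be reproduced for the example sentence with a short sketch. The data layout (token list, phrase ranges, a `head` map for the dependency tree) and all function names are assumptions made for illustration; the patent does not specify a representation beyond the figures.

```python
from collections import deque

# Morphemes of the example sentence: (surface form, part of speech).
tokens = [
    ("亀井", "noun"), ("氏", "noun"), ("に", "particle"),
    ("対し", "verb"), ("て", "particle"),
    ("自民党", "noun"), ("は", "particle"),
    ("堀江", "noun"), ("氏", "noun"), ("を", "particle"),
    ("広島６区", "noun"), ("に", "particle"),
    ("推薦", "verbal noun"), ("し", "verb"), ("た", "verb suffix"), ("。", "period"),
]
# Token position of each named entity (assumed layout).
entity_index = {"亀井": 0, "自民党": 5, "堀江": 7, "広島６区": 10}

# Phrases (bunsetsu) as half-open token ranges, and the phrase each one depends on.
phrases = [(0, 3), (3, 5), (5, 7), (7, 10), (10, 12), (12, 16)]
head = {0: 1, 1: 5, 2: 5, 3: 5, 4: 5, 5: None}

def phrase_of(token_pos):
    """Index of the phrase containing the given token position."""
    return next(i for i, (s, e) in enumerate(phrases) if s <= token_pos < e)

def path_length(a, b):
    """Shortest path between two phrases, by breadth-first search
    over the dependency tree treated as an undirected graph."""
    adjacent = {i: set() for i in head}
    for child, parent in head.items():
        if parent is not None:
            adjacent[child].add(parent)
            adjacent[parent].add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adjacent[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

def head_phrase_tokens(phrase_idx):
    """Tokens of the phrase that the given phrase depends on, or None for the root."""
    h = head[phrase_idx]
    if h is None:
        return None
    start, end = phrases[h]
    return tokens[start:end]

def inter_entity_features(front, back):
    i, j = entity_index[front], entity_index[back]
    between = tokens[i + 1:j]
    pf, pb = phrase_of(i), phrase_of(j)
    return {
        "words_between": [w for w, _ in between],
        "pos_between": [p for _, p in between],
        "word_count": len(between),
        "entity_count": sum(1 for k in entity_index.values() if i < k < j),
        "front_head": head_phrase_tokens(pf),
        "back_head": head_phrase_tokens(pb),
        "same_phrase": pf == pb,
        "path_length": path_length(pf, pb),
    }

features = inter_entity_features("亀井", "広島６区")
```

For the pair "亀井：広島６区" this yields 9 intervening words, 2 intervening named entities, different phrases, and a path length of 3, matching the values worked out in the text.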

On acquiring the processing result of the base analysis result synthesis unit 14 and a set of named entities generated by the named entity pair generation unit 15, the external knowledge information extraction unit 22 extracts statistical information representing the degree of relation between the entities of the set when they appear together in other texts. Specifically, the external knowledge information extraction unit 22 first uses the other texts stored in the text storage unit 30 to obtain the degree of co-occurrence of the two entities in those texts. In this embodiment, co-occurrence between the entities is defined by three patterns: a first pattern in which one entity of the set appears before the other in the same sentence; a second pattern in which the other entity appears before the first in the same sentence; and a third pattern in which the two entities appear together in the same sentence. The external knowledge information extraction unit 22 obtains the degree of co-occurrence for each of these patterns. The degree of co-occurrence is obtained using mutual information, a well-known co-occurrence measure. For example, let P(x) be the probability that one entity x of the set appears in the other texts, P(y) the probability that the other entity y appears in the other texts, and P(x, y) the probability that x and y co-occur in the other texts. The mutual information MI(x, y) is then obtained by the following equation (1).
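Equation (1) itself is not reproduced in this text. Given the definitions of P(x), P(y), and P(x, y) above, the mutual information commonly used as a co-occurrence measure would take the following form. This is a reconstruction under that assumption; in particular, the base of the logarithm (2 here) is not stated in the text.

```latex
\mathrm{MI}(x, y) = \log_{2} \frac{P(x, y)}{P(x)\,P(y)} \tag{1}
```

Under this form, MI(x, y) is positive when x and y co-occur more often than independence would predict, zero when they co-occur exactly as often, and negative when they co-occur less often.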

For the patterns above, the co-occurrence probability P(x, y) used to obtain the mutual information MI1 of the first pattern is the probability that entity x appears before entity y in the same sentence of the other texts. The co-occurrence probability P(x, y) used to obtain the mutual information MI2 of the second pattern is the probability that entity y appears before entity x in the same sentence of the other texts. The co-occurrence probability P(x, y) used to obtain the mutual information MI3 of the third pattern is the probability that x and y appear together in the same sentence of the other texts. The external knowledge information extraction unit 22 extracts MI1, MI2, and MI3 as features. An example of its processing result is shown in FIG. 6.
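The three features MI1, MI2, and MI3 can be sketched as follows. Several points here are assumptions rather than statements from the patent: probabilities are estimated as fractions of sentences in the reference corpus, the logarithm is base 2, a pair that never co-occurs is given negative infinity, and the function names and the corpus layout (one list of entities per sentence, in order of first appearance) are invented for the example.

```python
import math

def mutual_information(p_xy, p_x, p_y):
    """log2(P(x,y) / (P(x) P(y))); negative infinity if any probability is zero."""
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

def mi_features(sentences, x, y):
    """Compute MI1, MI2, MI3 for entities x and y.

    sentences: non-empty list of sentences, each given as the list of
    entities it contains in order of first appearance (assumed layout).
    """
    n = len(sentences)
    p_x = sum(1 for s in sentences if x in s) / n
    p_y = sum(1 for s in sentences if y in s) / n
    both = [s for s in sentences if x in s and y in s]
    x_first = sum(1 for s in both if s.index(x) < s.index(y))
    y_first = len(both) - x_first
    return {
        "MI1": mutual_information(x_first / n, p_x, p_y),    # x before y
        "MI2": mutual_information(y_first / n, p_x, p_y),    # y before x
        "MI3": mutual_information(len(both) / n, p_x, p_y),  # either order
    }

# Made-up toy corpus, for illustration only.
corpus = [
    ["亀井", "広島６区"], ["亀井", "広島６区"], ["広島６区", "亀井"],
    ["小沢"], ["亀井"], ["小沢", "民主党"], ["広島６区"], ["小沢"],
]
kamei = mi_features(corpus, "亀井", "広島６区")
ozawa = mi_features(corpus, "小沢", "広島６区")
```

In this toy corpus "亀井" and "広島６区" share sentences while "小沢" and "広島６区" never do, so the two pairs receive different feature values even though the input sentences in the earlier example have the same structure.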

The external knowledge information extraction unit 22 may also be configured to compute in advance the degree of co-occurrence between the named entities contained in the other texts and to store the computed values in the text storage unit 30.

Next, an overview of the discrimination processing unit 40 is given. The discrimination processing unit 40 consists of a model selection unit 41, a classifier 42, and a named entity pair output unit 43. Based on the features and other information obtained from the feature extraction processing unit 20, it determines, for each set of named entities, whether the entities of the set are related.

On acquiring, for each set of named entities, the processing results of the inter-entity information extraction unit 21 and the external knowledge information extraction unit 22, the model selection unit 41 classifies the set according to the tags attached by the base analysis result synthesis unit 14 and selects the type of model to be retrieved by the classifier 42 described below (step S8 in FIG. 2). For example, when the set "亀井：広島６区" (Kamei: Hiroshima 6th District) is input, the model selection unit 41 classifies it as type "person name: place name" and outputs this type together with the features obtained from the feature extraction processing unit 20.

On acquiring the information output by the model selection unit 41, the classifier 42 retrieves a model from the model storage unit 50, which stores a plurality of models, according to the set type selected by the model selection unit 41. Using the retrieved model, it determines whether the entities of the set are related (step S9 in FIG. 2).

Each model is generated beforehand by well-known machine learning, using results determined in advance as to whether the entities of predetermined sets of named entities are related, together with information extracted in advance by the named entity extraction processing unit 10 and the feature extraction processing unit 20 from texts containing the entities of those predetermined sets. The relation judgments for the predetermined sets are made in advance by human judgment. The models may be built separately for each type of set, such as "person name: place name" or "person name: person name", or a single model may be built without distinguishing types.

In this case, because the classifier 42 also uses the information obtained from the external knowledge information extraction unit 22, it can judge the relationship between specific expressions based on how strongly they are related in other texts.

In addition to the determination of whether a relationship exists, the classifier 42 may be configured to output a numerical value representing the degree of relationship. Any well-known machine learning method may be used, but one that can learn directly from tree-structured or graph-structured input data is preferable.

The specific expression pair output unit 43 outputs to the display means each set whose specific expressions the classifier 42 has determined to be related (step S10 in FIG. 2). When the classifier 42 outputs a numerical degree of relationship, the specific expression pair output unit 43 may output a set only when its degree of relationship exceeds a preset threshold.
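The flow from model selection (S8) through classification (S9) to thresholded output (S10) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `select_model` and `related_pairs`, the `"default"` fallback key, and the representation of a model as a function from a feature dictionary to a score are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Features = Dict[str, float]
Model = Callable[[Features], float]  # hypothetical: features -> degree of relationship


def select_model(models: Dict[str, Model], pair_type: str) -> Model:
    """Pick the model stored for this pair type, else a type-agnostic one."""
    return models.get(pair_type, models["default"])


def related_pairs(
    candidates: List[Tuple[Pair, str, Features]],
    models: Dict[str, Model],
    threshold: float,
) -> List[Pair]:
    """Return the pairs whose score is strictly greater than the threshold."""
    selected = []
    for pair, pair_type, features in candidates:
        score = select_model(models, pair_type)(features)
        if score > threshold:
            selected.append(pair)
    return selected
```

Keying the model store by pair type mirrors the option of building one model per type; registering only the `"default"` entry corresponds to the option of building a single model without distinguishing types.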

As described above, in this embodiment, when text is input, the input text is morphologically analyzed to extract the plurality of specific expressions it contains. For each set formed by combining the extracted specific expressions, features are extracted that include at least the mutual information observed when the specific expressions of the set appear together in other texts. Whether the specific expressions of each set are related is then determined from the extracted features, from results determined in advance as to whether the specific expressions of predetermined sets are related, and from prior features extracted in advance from texts containing those specific expressions. The determination can therefore draw on the degree to which the specific expressions are related in other texts, so the relationship can be judged appropriately even for input text in which it can be recognized only through the reader's own knowledge. Consequently, sets of interrelated specific expressions can be extracted according to the individual case.

In addition, the extracted features include at least the mutual information values (MI1, MI2, MI3) that represent, respectively, the degree to which one specific expression x appears before the other specific expression y in the same sentence of other texts, the degree to which y appears before x in the same sentence of other texts, and the degree to which x and y appear together in the same sentence of other texts. The presence or absence of a relationship can thus be judged in finer detail, taking into account the relative positions of the specific expressions when they co-occur in other texts, which improves the accuracy of extracting interrelated sets. The external knowledge information extraction unit 22 may also be configured to extract at least one of the mutual information values MI1, MI2, and MI3 as a feature.
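As a rough illustration of order-aware co-occurrence statistics, the sketch below computes three values in the spirit of MI1, MI2, and MI3 from a list of sentences. It assumes standard pointwise mutual information, log(P(x, y) / (P(x) P(y))), with probabilities estimated as sentence-level relative frequencies; this passage does not state the exact formula, so that choice, the substring matching, and the function names `pmi` and `order_aware_mi` are assumptions.

```python
import math
from typing import List, Optional, Tuple


def pmi(joint: int, count_x: int, count_y: int, total: int) -> Optional[float]:
    """Pointwise mutual information from counts; None if any count is zero."""
    if joint == 0 or count_x == 0 or count_y == 0:
        return None
    return math.log(joint * total / (count_x * count_y))


def order_aware_mi(
    sentences: List[str], x: str, y: str
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (x-before-y, y-before-x, co-occurrence) values over sentences."""
    count_x = count_y = x_first = y_first = 0
    for sentence in sentences:
        pos_x, pos_y = sentence.find(x), sentence.find(y)
        if pos_x >= 0:
            count_x += 1
        if pos_y >= 0:
            count_y += 1
        if pos_x >= 0 and pos_y >= 0:
            # Order is decided by the first occurrence of each expression.
            if pos_x < pos_y:
                x_first += 1
            else:
                y_first += 1
    total = len(sentences)
    both = x_first + y_first
    return (
        pmi(x_first, count_x, count_y, total),
        pmi(y_first, count_x, count_y, total),
        pmi(both, count_x, count_y, total),
    )
```

Returning `None` for zero counts leaves it to the caller to decide how a pair that never co-occurs should be encoded as a feature.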

An extraction device according to a second embodiment of the present invention is described below. This embodiment differs from the first in two respects. First, it includes a discrimination result storage unit 60 that stores the discrimination result each time the discrimination processing unit 40 determines whether the specific expressions in a set are related. Second, once the features for a set have been extracted, the discrimination processing unit 40 acquires the discrimination results stored in the discrimination result storage unit 60 and determines whether the specific expressions of the set are related based on those results, the extracted features, results determined in advance as to whether the specific expressions of predetermined sets are related, and prior features extracted in advance from texts containing those specific expressions. The other configurations and operations are the same as in the first embodiment, so only the differences are described here with reference to FIGS. 7 to 11.

The feature extraction processing unit 20 of this embodiment includes a specific expression pair rearrangement unit 23 that reorders the sets of specific expressions generated by the specific expression pair generation unit 15 of the specific expression extraction processing unit 10.

The discrimination processing unit 40 of this embodiment includes a discrimination result acquisition unit 34, which stores in the discrimination result storage unit 60 the classifier 42's determination of whether the specific expressions are related and also acquires the discrimination results stored there.

The operations of the feature extraction processing unit 20 and the discrimination processing unit 40 in the extraction device of this embodiment are described using the input text given as an example in the first embodiment and the flow of FIG. 8. Each specific expression is assigned a specific expression ID in order of appearance in the input text. For example, "Kamei", "Liberal Democratic Party", "Horie", and "Hiroshima 6th District" are assigned the IDs "ID1", "ID2", "ID3", and "ID4", respectively. Initially, nothing is stored in the discrimination result storage unit 60.

First, when the sets of specific expressions are output from the specific expression pair generation unit 15 of the specific expression extraction processing unit 10 in the order shown in FIG. 10, the specific expression pair rearrangement unit 23 reorders them according to a rearrangement rule (step S11). This embodiment uses the rule "sort in ascending order of the absolute difference between the specific expression IDs; when the absolute differences are equal, sort in ascending order of the sum of the specific expression IDs." The absolute differences and sums of the specific expression IDs are as shown in FIG. 11. As a result, the sets are reordered as shown in FIG. 12.
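The rearrangement rule amounts to a two-level sort key. A minimal sketch, assuming the specific expression IDs are plain integers and pairs are tuples (the function name `sort_pairs` is illustrative):

```python
from itertools import combinations
from typing import List, Tuple


def sort_pairs(pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort by absolute ID difference, then by ID sum, both ascending."""
    return sorted(pairs, key=lambda p: (abs(p[0] - p[1]), p[0] + p[1]))


# All pairs over ID1..ID4, as in the running example.
ordered = sort_pairs(list(combinations([1, 2, 3, 4], 2)))
print(ordered)
# [(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)]
```

Under this rule, pairs whose members are close together in the text are classified first, so their results are already available when more distant pairs such as ID1-ID4 are processed.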

Next, the inter-specific-expression information extraction unit 21 and the external knowledge information extraction unit 22 take the first set in the reordered list ("ID1-ID2") as the processing target (step S12) and perform feature extraction on it (step S13). The feature extraction itself is the same as in the first embodiment.

The discrimination result acquisition unit 34 of the discrimination processing unit 40 then checks whether any discrimination result is stored in the discrimination result storage unit 60 (step S14) and, if so, acquires all the stored discrimination results (step S15). When the processing target is "ID1-ID2", nothing is stored in the discrimination result storage unit 60 yet, so processing proceeds to step S16.

Once the model selection unit 41 has classified the set (step S16), the classifier 42 determines whether the specific expressions of the set are related (step S17) and outputs the result to the specific expression pair output unit 43 (step S18). When the classifier 42 has determined that the specific expressions are related, the specific expression pair output unit 43 outputs the set to the display means. If any set has not yet been taken as a processing target, the discrimination result acquisition unit 34 stores the current set and its discrimination result in the discrimination result storage unit 60 and returns processing to step S12 (steps S19 and S20). The next processing target is chosen according to the reordered list.

In step S15 above, all discrimination results are acquired from the discrimination result storage unit 60, but it is also possible to acquire only the results relevant to the set being processed. Three ways of doing so are described below.

In the first method, the discrimination results acquired from the discrimination result storage unit 60 are those of sets that share a specific expression ID with the set being processed. For example, when the set being processed is "ID1-ID4", the results for "ID1-ID2", "ID3-ID4", "ID1-ID3", and "ID2-ID4" are acquired from among the sets already judged.
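The first method is a filter on shared IDs. The sketch below assumes the stored results are kept in a dictionary keyed by ID pair, in the order they were judged; the function name and the True/False labels are purely illustrative, not results from the patent.

```python
from typing import Dict, Tuple

PairId = Tuple[int, int]


def results_sharing_id(
    stored: Dict[PairId, bool], target: PairId
) -> Dict[PairId, bool]:
    """Keep stored results whose pair shares at least one ID with the target."""
    target_ids = set(target)
    return {pair: label for pair, label in stored.items() if target_ids & set(pair)}


# Results stored before ID1-ID4 is processed (labels are made up).
stored = {(1, 2): True, (2, 3): False, (3, 4): True, (1, 3): False, (2, 4): False}
print(list(results_sharing_id(stored, (1, 4))))
# [(1, 2), (3, 4), (1, 3), (2, 4)]
```

The pair ID2-ID3 is dropped because it contains neither ID1 nor ID4, which matches the example given for the first method.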

In the second method, the discrimination results acquired from the discrimination result storage unit 60 are those of sets containing a specific expression ID that lies between the IDs of the set being processed. For example, when the set being processed is "ID1-ID4", the IDs lying between them are "ID2" and "ID3". Accordingly, the results for "ID1-ID2", "ID3-ID4", "ID1-ID3", and "ID2-ID4" are acquired from among the sets already judged.

In the third method, the shortest path in the dependency tree between the specific expressions of the set being processed is extracted, and the discrimination results of sets containing a specific expression that lies on that path are acquired from the discrimination result storage unit 60. For example, when the set being processed is "ID1-ID4", the shortest path in the dependency tree of FIG. 3 between the phrase containing "Kamei" and the phrase containing "Hiroshima 6th District" passes through no phrase containing a specific expression, so no discrimination result is acquired.

The three methods may also be combined to acquire the discrimination results.

When discrimination results have been acquired from the discrimination result storage unit 60, the classifier 42, in step S17 of the above flow, determines whether the specific expressions are related based on the acquired discrimination results, the features between the specific expressions, and the model retrieved from the model storage unit 50.
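The overall loop of the second embodiment (steps S12 to S20) can be sketched as below. This is a simplified illustration under stated assumptions: `extract_features` and `classify` are hypothetical callables standing in for the feature extraction processing unit and the classifier, all earlier results are passed to the classifier (the "acquire all" variant of step S15), and every result is stored, including the last one.

```python
from typing import Callable, Dict, List, Tuple

PairId = Tuple[int, int]
Features = Dict[str, float]


def classify_in_order(
    ordered_pairs: List[PairId],
    extract_features: Callable[[PairId], Features],
    classify: Callable[[PairId, Features, Dict[PairId, bool]], bool],
) -> Dict[PairId, bool]:
    """Classify pairs one by one, feeding earlier results to later decisions."""
    store: Dict[PairId, bool] = {}  # plays the role of the result storage unit
    for pair in ordered_pairs:
        features = extract_features(pair)  # S13
        previous = dict(store)  # S14/S15: empty for the first pair
        store[pair] = classify(pair, features, previous)  # S17, then S20
    return store
```

Because the store starts empty, the first pair is judged from its features alone, and each later pair sees the results of every pair that precedes it in the reordered list.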

Although omitted from the description of the above flow, the model used in this embodiment is created from a plurality of predetermined sets of specific expressions according to the flow of FIG. 9. In that case, the determination in step S17 is made by human judgment, and the human judgment, the features between the specific expressions, and the discrimination results acquired from the discrimination result storage unit 60 in step S15 are stored together as a training example in a predetermined storage unit. When, in step S19, discrimination results have been obtained for all the sets, a model is created from the training examples. Step S18 is omitted when creating a model.

Thus, in this embodiment, each time it is determined whether the specific expressions of a set are related, the result is stored in the discrimination result storage unit 60. When the features for a set have been extracted, the stored discrimination results are acquired, and whether the specific expressions of the set are related is determined based on those results, the extracted features, results determined in advance as to whether the specific expressions of predetermined sets are related, and features extracted in advance from texts containing those specific expressions. The discrimination results of sets related to the set under consideration can therefore be put to use easily.

The first and second embodiments above are merely specific examples of the present invention, and the present invention is not limited to them. For example, the present invention can also be realized by installing on a well-known computer, via a medium or a communication line, a program that implements the functions shown in the configuration diagrams of FIGS. 1 and 7 or a program comprising the procedures shown in the flows of FIGS. 2 and 8.

The above embodiments use mutual information as the means of measuring co-occurrence, but the Dice coefficient or the log-likelihood ratio may be used instead.
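For comparison with mutual information, the Dice coefficient can be computed from the same sentence-level counts. The sketch below uses the standard definition 2·c(x, y) / (c(x) + c(y)); the function name and the choice to return 0.0 when neither expression occurs are assumptions for the example.

```python
def dice_coefficient(count_x: int, count_y: int, count_xy: int) -> float:
    """Dice coefficient from occurrence counts of x, y, and x together with y."""
    if count_x + count_y == 0:
        return 0.0  # neither expression occurs; treat as no association
    return 2 * count_xy / (count_x + count_y)


print(dice_coefficient(4, 6, 2))  # 0.4
```

Unlike pointwise mutual information, this value is bounded between 0 and 1 and is defined even when the two expressions never co-occur.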

Furthermore, in the above embodiments the inter-specific-expression information extraction unit 21 outputs the features shown in FIG. 5, but, as shown in FIG. 12, it may also include in the features the words other than the specific expression in the phrase containing each specific expression, together with their parts of speech. This prevents a case in which, for the word "Japanese culture", which is not treated as a specific expression, only "Japan" is extracted as a place-name specific expression and then judged for relationships with other specific expressions.

The inter-specific-expression information extraction unit 21 may also include in the features whether a noun phrase immediately follows each specific expression, as shown in FIG. 13. As above, this prevents a case in which only "Japan" is extracted from the word "Japanese culture" as a place-name specific expression and then judged for relationships with other specific expressions.

In addition, the inter-specific-expression information extraction unit 21 may include in the features whether another specific expression lying between the two specific expressions has the same notation as, or a partial notation of, one of them, as shown in FIG. 14. This prevents specific expressions from being wrongly extracted as targets of relationship determination.

Configuration diagram of the specific expression set extraction device according to the first embodiment of the present invention
Flow diagram of the specific expression set extraction process
Diagram outlining the analysis result of the dependency analysis unit
Diagram outlining the processing result of the base analysis result synthesis unit
Diagram showing an example of the processing result of the inter-specific-expression information extraction unit
Diagram showing an example of the processing result of the external knowledge information extraction unit
Configuration diagram of the specific expression set extraction device according to the second embodiment of the present invention
Flow diagram showing the operations of the feature extraction processing unit and the discrimination processing unit
Diagram showing the processing target list
Diagram showing the absolute differences and sums of the specific expression IDs
Diagram showing the processing target list after rearrangement
Diagram showing a modification of the processing result of the inter-specific-expression information extraction unit
Diagram showing a modification of the processing result of the inter-specific-expression information extraction unit
Diagram showing a modification of the processing result of the inter-specific-expression information extraction unit

Explanation of symbols

10: specific expression extraction processing unit; 11: morphological analysis unit; 12: dependency analysis unit; 20: feature extraction processing unit; 22: external knowledge information extraction unit; 40: discrimination processing unit; 42: classifier; 60: discrimination result storage unit.

Claims

A device for extracting, from input text, sets of specific expressions each consisting of a plurality of interrelated specific expressions, the device comprising:
a specific expression extraction processing unit that, when text is input, morphologically analyzes the input text and extracts a plurality of specific expressions contained in the input text;
a feature extraction processing unit that, for each of a plurality of sets formed by combining the specific expressions extracted by the specific expression extraction processing unit, extracts features including at least statistical information representing the degree of relationship between the specific expressions of the set when they appear together in other texts; and
a discrimination processing unit that determines, for each set, whether the specific expressions of the set are related, based on the features extracted by the feature extraction processing unit, results determined in advance as to whether the specific expressions of predetermined sets are related, and prior features extracted in advance by the feature extraction processing unit from texts containing the specific expressions of the predetermined sets.

The device for extracting sets of interrelated specific expressions according to claim 1, wherein the feature extraction processing unit extracts features including at least statistical information representing the degree to which one of the specific expressions appears before the other in the same sentence of other texts.

The device for extracting sets of interrelated specific expressions according to claim 1 or 2, wherein the feature extraction processing unit extracts features including at least statistical information representing the degree to which the other of the specific expressions appears before the one in the same sentence of other texts.

The device for extracting sets of interrelated specific expressions according to any one of claims 1 to 3, wherein the feature extraction processing unit extracts features including at least statistical information representing the degree to which the specific expressions appear together in the same sentence of other texts.

The device for extracting sets of interrelated specific expressions according to any one of claims 1 to 4, further comprising a discrimination result storage unit that stores a discrimination result each time the discrimination processing unit determines whether the specific expressions included in a set are related,
wherein, when the features corresponding to a set have been extracted by the feature extraction processing unit, the discrimination processing unit acquires the discrimination results stored in the discrimination result storage unit and determines whether the specific expressions of the set are related based on the acquired discrimination results, the extracted features, results determined in advance as to whether the specific expressions of predetermined sets are related, and prior features extracted in advance by the feature extraction processing unit from texts containing the specific expressions of the predetermined sets.

A method for extracting, using a computer, sets of specific expressions each consisting of a plurality of interrelated specific expressions from input text, wherein the computer performs:
a first step of, when text is input, morphologically analyzing the input text and extracting a plurality of specific expressions contained in the input text;
a second step of, for each of a plurality of sets formed by combining the extracted specific expressions, extracting features including at least statistical information representing the degree of relationship between the specific expressions of the set when they appear together in other texts; and
a third step of determining, for each set, whether the specific expressions of the set are related, based on the extracted features, results determined in advance as to whether the specific expressions of predetermined sets are related, and prior features extracted by performing the second step in advance on texts containing the specific expressions of the predetermined sets.

The method for extracting sets of interrelated specific expressions according to claim 6, wherein, in the second step, the computer extracts features including at least statistical information representing the degree to which one of the specific expressions appears before the other in the same sentence of other texts.

The method for extracting sets of interrelated specific expressions according to claim 6 or 7, wherein, in the second step, the computer extracts features including at least statistical information representing the degree to which the other of the specific expressions appears before the one in the same sentence of other texts.

The method for extracting sets of interrelated specific expressions according to any one of claims 6 to 8, wherein, in the second step, the computer extracts features including at least statistical information representing the degree to which the specific expressions appear together in the same sentence of other texts.

The method for extracting sets of interrelated specific expressions according to any one of claims 6 to 9, wherein the computer performs a fourth step of storing a discrimination result in a predetermined discrimination result storage unit each time it is determined in the third step whether the specific expressions of a set are related, and
in the third step, when the features corresponding to a set have been extracted, acquires the discrimination results stored in the discrimination result storage unit and determines whether the specific expressions of the set are related based on the acquired discrimination results, the extracted features, results determined in advance as to whether the specific expressions of predetermined sets are related, and prior features extracted by performing the second step in advance on texts containing the specific expressions of the predetermined sets.