WO2006137516A1 - Binary relation extracting device - Google Patents

Binary relation extracting device

Info

Publication number
WO2006137516A1
WO2006137516A1 PCT/JP2006/312592 JP2006312592W
Authority
WO
WIPO (PCT)
Prior art keywords
solution
binary relation
feature
candidate
extracted
Prior art date
Application number
PCT/JP2006/312592
Other languages
French (fr)
Japanese (ja)
Inventor
Masaki Murata
Tomohiro Mitsumori
Kouichi Doi
Yasushi Fukuda
Original Assignee
National Institute Of Information And Communications Technology
National University Corporation NARA Institute of Science and Technology
Priority date
Filing date
Publication date
Application filed by National Institute Of Information And Communications Technology, National University Corporation NARA Institute of Science and Technology filed Critical National Institute Of Information And Communications Technology
Publication of WO2006137516A1 publication Critical patent/WO2006137516A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Definitions

  • Binary relation extraction device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing
  • The present invention relates to a binary relation extraction technique for extracting pairs of expressions (words, character strings, etc.) that stand in a binary relation from text data using supervised machine learning processing, and to information retrieval technology that uses the binary relation extraction processing.
  • In Non-Patent Document 1, pattern frames for extracting the target information are defined using the predicate-argument structures produced by parsing; patterns are extracted from a corpus annotated with correct answers, inappropriate patterns are removed, and information is then extracted using the selected patterns.
  • Non-Patent Document 1: Akane Yakushiji et al., "Medical/biological information extraction using predicate-argument structure patterns", 11th Annual Meeting of the Association for Natural Language Processing, March 2005
  • In Non-Patent Document 1, in order to improve the accuracy of the patterns, the patterns are selected against the learning corpus, which improves the accuracy of the binary relation extraction process.
  • An object of the present invention is to provide a technique applicable to any problem of extracting binary relations from text data, and to provide a binary relation extraction device that can extract binary relations with high performance even for complex problems.
  • Another object of the present invention is to provide an information retrieval device using the binary relation extraction processing, the processing methods executed by these devices, and programs for causing a computer to function as these devices.
  • The present invention is a binary relation extraction processing device that uses machine learning processing to extract binary relations appearing in sentence data stored in a computer-readable storage device.
  • The device comprises: 1) teacher data storage means for storing teacher data consisting of cases that pair a problem and its solution, where the problem is a binary relation appearing in the sentence data and the solution indicates whether it should be extracted; 2) solution-feature pair extraction means for taking cases from the teacher data storage means, extracting predetermined information as features for each case, and generating pairs of the solution and the set of extracted features; 3) machine learning means for learning, by machine learning processing over the pairs of solution and feature set, what kind of solution results from what kind of feature set, and storing this as learning result information in the learning result storage means; 4) candidate extraction means for extracting binary relation elements from the text data stored in the storage device, extracting pairs composed of those elements, and treating each extracted pair as a binary relation candidate; 5) feature extraction means for extracting the predetermined information as features for each binary relation candidate; 6) solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of a binary relation candidate is likely to yield the solution; and 7) binary relation extraction means for selecting, as binary relations to be extracted, the candidates whose estimated likelihood exceeds a predetermined level.
  • In the present invention, teacher data, including cases in which solution information indicating whether a binary relation appearing in the sentence data should be extracted is attached to that binary relation, is stored in a teacher data storage means. Cases are then taken from the teacher data storage means by the solution-feature pair extraction means; for each case, the predetermined information is extracted as features, and a pair of the extracted feature set and the solution is generated.
  • The machine learning means performs machine learning processing, based on a predetermined machine learning algorithm, to determine what kind of solution results from what kind of feature set, and stores information indicating "what kind of solution is obtained for what kind of feature set" in the learning result storage means as learning result information.
  • The candidate extraction means extracts binary relation elements from the text data stored in the storage device, extracts pairs composed of those elements, and treats each extracted pair as a binary relation candidate. The feature extraction means extracts the predetermined information as features for each binary relation candidate, by the same extraction process as that performed by the solution-feature pair extraction means. Then, based on the learning result information stored in the learning result storage means, the solution estimation means estimates the degree to which the feature set of each candidate is likely to yield the solution.
  • The binary relation extraction means selects a binary relation candidate as a binary relation to be extracted when, according to the estimation result, the candidate's likelihood of being the solution exceeds a predetermined level.
  • Further, the present invention is an information retrieval device that, in an information retrieval process using a plurality of search keywords, extracts search results using the result of binary relation extraction processing based on supervised machine learning.
  • The device comprises: 1) teacher data storage means for storing teacher data consisting of cases that pair a problem and its solution, where the problem is a binary relation whose elements are search keywords and the solution indicates whether it should be extracted; 2) solution-feature pair extraction means for taking cases from the teacher data storage means, extracting predetermined information as features for each case, and generating pairs of the solution and the extracted feature set; 3) machine learning means for learning, based on a predetermined machine learning algorithm, what kind of solution results from what kind of feature set, and storing information indicating "what kind of solution is obtained for what kind of feature set" in the learning result storage means; 4) candidate extraction means for generating, from a plurality of input search keywords, search keyword pairs as binary relation candidates; 5) feature extraction means for extracting the predetermined information as features for each binary relation candidate, by the same extraction process as that performed by the solution-feature pair extraction means; 6) solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of each binary relation candidate is likely to yield the solution; and 7) search result extraction means for selecting a candidate as a binary relation to be extracted when its likelihood of being the solution exceeds a predetermined level, and extracting text data containing the selected binary relation as a search result.
  • In the present invention, teacher data, including cases in which solution information indicating whether a binary relation whose elements are search keywords should be extracted is attached to that binary relation, is stored in a teacher data storage means.
  • The solution-feature pair extraction means takes cases from the teacher data storage means, extracts predetermined information as features for each case, and generates pairs of the extracted feature set and the solution.
  • The machine learning means performs machine learning processing, based on a predetermined machine learning algorithm, on the pairs of solution and feature set, and stores the processed information in the learning result storage means as learning result information indicating "what kind of solution is obtained for what kind of feature set".
  • A binary relation candidate whose likelihood of being the solution exceeds a predetermined level is selected as a binary relation to be extracted, and text data that includes the selected binary relation is extracted as a search result.
  • Further, the present invention relates to the binary relation extraction processing method and the information retrieval processing method using the binary relation extraction processing realized by the above binary relation extraction device and information retrieval device, respectively.
  • The present invention also provides a binary relation extraction processing program for causing a computer to execute the processing steps of the binary relation extraction processing method, and an information retrieval processing program using the binary relation extraction processing method.
  • According to the present invention, by performing machine learning using, as learning data, text data manually tagged to indicate whether each binary relation should be extracted, it can be determined, when a new binary relation candidate is given, whether that candidate is a binary relation to be extracted. For example, by using as learning data "pairs of interacting protein names" tagged to indicate whether they should be extracted, desired "interacting proteins" can be obtained from a text database.
  • Similarly, for the two search keywords of an AND search in an information retrieval process, "search keyword pairs" tagged to indicate whether the keywords have a meaningful relationship in the retrieved documents can be used as learning data.
  • FIG. 1 is a diagram showing a configuration example of a binary relation extraction device according to the present invention.
  • FIG. 2 is a diagram showing a processing flow of a binary relation extraction device.
  • FIG. 3 is a diagram showing an example of teacher data.
  • FIG. 4 is a diagram showing the concept of margin maximization in the support vector machine method.
  • FIG. 5 is a diagram showing an example of a pair of a solution and the set of features of the binary relation shown in FIG. 3.
  • FIG. 6 is a diagram showing a configuration example of an information search device according to the present invention.
  • FIG. 7 is a diagram showing a flow of processing of the information search device.
  • FIG. 8 is a diagram showing an example of a set of teacher data and a set of features of the binary relation.
  • FIG. 9 is a diagram showing an example of a set of teacher data and a set of features of the binary relation.
  • FIG. 10 is a diagram showing an example of a set of teacher data and a set of features of the binary relation.
  • Explanation of symbols
  • The binary relation extraction device 1 is a processing device that performs machine learning, using teacher data, i.e., text data tagged to indicate whether each binary relation should be extracted, to learn which pairs of terms should be extracted, and that extracts the binary relations 3 to be extracted from given text data 2.
  • FIG. 1 shows a configuration example of a binary relation extraction apparatus 1 according to the present invention.
  • The binary relation extraction device 1 includes a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result storage unit 14, a candidate extraction unit 15, a feature extraction unit 16, a solution estimation unit 17, and a binary relation extraction unit 18.
  • the teacher data storage unit 11 is means for storing text data that is teacher data used in the machine learning process.
  • As the teacher data, cases are used in which the elements of a binary relation appearing in a sentence of the text data (one element is referred to as the first element and the other as the second element) are annotated with a solution indicating whether they form a binary relation to be extracted. Specifically, for a sentence containing two or more binary relation elements, each pair of binary relation elements in that sentence is given a tag indicating its solution: either a pair that should be extracted (positive example) or a pair that should not be extracted (negative example). If three or more binary relation elements are contained in a sentence, a tag is assigned to every pair in the combinations of those elements. As teacher data, data in which only the solution indicating the pairs to be extracted (positive examples) is given may also be used.
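As a minimal sketch of what such tagged cases might look like in memory (the sentence, element names, and tags here are illustrative, not from the patent's actual teacher data):

```python
from itertools import combinations

# Hypothetical in-memory form of one tagged training sentence: the binary
# relation elements are listed, and each element pair carries a solution tag.
sentence = "delta-catenin interacts with presenilin 1 but not with GFAP."
elements = ["delta-catenin", "presenilin 1", "GFAP"]

# With three elements in one sentence, every 2-combination becomes a case.
cases = {pair: None for pair in combinations(elements, 2)}
cases[("delta-catenin", "presenilin 1")] = "positive"  # pair to extract
cases[("delta-catenin", "GFAP")] = "negative"          # pair not to extract
cases[("presenilin 1", "GFAP")] = "negative"

assert len(cases) == 3  # n(n-1)/2 pairs for n = 3 elements
```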
  • the solution-feature pair extraction unit 12 is a processing unit that extracts a set of a solution and a set of features from cases in the text data stored in the teacher data storage unit 11.
  • a feature is information used in machine learning processing.
  • The solution-feature pair extraction unit 12 extracts as features, for example, the binary relation elements themselves, the words/characters appearing around the elements together with their positions and order, part-of-speech information of the elements and surrounding words, morphological analysis information, parsing information, the distance between the elements, and the presence or absence of other binary relation elements between the elements.
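The surface-level part of that feature set can be sketched as below (POS, morphological, and parse features would come from an external analyzer and are omitted; the function name and feature encoding are assumptions for illustration):

```python
def extract_features(tokens, i1, i2, element_idx, window=3):
    """Sketch of surface features: the two elements, surrounding words,
    their distance, and whether another element lies between them."""
    feats = set()
    feats.add("e1=" + tokens[i1])
    feats.add("e2=" + tokens[i2])
    for w in tokens[max(0, i1 - window):i1]:   # words before the 1st element
        feats.add("before_e1=" + w)
    for w in tokens[i2 + 1:i2 + 1 + window]:   # words after the 2nd element
        feats.add("after_e2=" + w)
    for w in tokens[i1 + 1:i2]:                # words between the elements
        feats.add("between=" + w)
    feats.add("dist=" + str(i2 - i1))          # appearance distance
    feats.add("other_elem_between=" +
              str(any(i1 < j < i2 for j in element_idx)))
    return feats

tokens = "delta-catenin interacts with presenilin".split()
f = extract_features(tokens, 0, 3, element_idx={0, 3})
```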
  • The machine learning unit 13 is a processing means that learns, by a supervised machine learning method, what kind of solution is likely to result from what kind of feature set, using the pairs of solution and feature set extracted by the solution-feature pair extraction unit 12.
  • the learning result is stored in the learning result storage unit 14.
  • the feature extraction unit 16 is a processing unit that extracts a predetermined feature for a binary relation candidate extracted from the text data 2.
  • The solution estimation unit 17 is a processing means that refers to the learning result in the learning result storage unit 14 and estimates, for each binary relation candidate, what kind of solution (classification destination) its feature set is likely to have, and to what degree.
  • The binary relation extraction unit 18 is a processing means that outputs, as binary relations 3, those candidates estimated with a high degree to have the solution indicating a binary relation that should be extracted.
  • FIG. 2 shows a processing flow of the binary relation extraction apparatus 1.
  • The teacher data storage unit 11 of the binary relation extraction device 1 stores, as teacher data, text data 2 containing cases in which binary relations, i.e., pairs of elements having a certain meaning, are given "solution" information indicating whether they are binary relations that should be extracted (positive) or binary relations that should not be extracted (negative).
  • Alternatively, text data 2 in which the predetermined solution is given only to the pairs to be extracted may be stored.
  • In this case, a pair of text data 2 to which a solution is given is treated as having been given the (positive) solution indicating a binary relation to be extracted, while a pair to which no solution is given is treated as having been given the (negative) solution indicating a binary relation that should not be extracted.
  • The solution-feature pair extraction unit 12 extracts predetermined features for each case from the teacher data in the teacher data storage unit 11, and generates pairs of the solution (the information given by the tag) and the set of extracted features (step S1).
  • Specifically, the solution-feature pair extraction unit 12 extracts the binary relations from the text data serving as teacher data using the predetermined tags, and, for the extracted binary relation elements, extracts the predetermined features by morphological analysis processing, parsing processing, and calculation of the elements' positions and the distance between elements.
  • The machine learning unit 13 learns, by machine learning processing, what kind of solution (positive or negative) results from the pairs of solution and feature set generated by the solution-feature pair extraction unit 12, and stores the learning result in the learning result storage unit 14 (step S2).
  • the machine learning unit 13 uses, as a supervised machine learning method, for example, a machine learning process using any one of the methods such as the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, and the support vector machine method. I do.
  • the candidate extraction unit 15 inputs the text data 2 from which the binary relation is to be extracted, and extracts the binary relation candidate from the input text data 2 (step S3).
  • The candidate extraction unit 15 divides the text data into sentences, treats as processing targets only sentences in which two or more binary relation elements appear, and extracts binary relation candidates from those sentences.
  • the feature extraction unit 16 extracts features for each binary relation candidate extracted from the text data 2 by processing similar to the processing in the solution-feature pair extraction unit 12 (step S4).
  • The solution estimation unit 17 estimates, for each candidate, what kind of solution its feature set is likely to have, that is, the degree to which the candidate is "likely to be positive" or "likely to be negative", based on the learning result in the learning result storage unit 14 (step S5).
  • The binary relation extraction unit 18 outputs, as binary relations 3 to be extracted, those candidates estimated to be "likely positive" with a degree better than a predetermined level (step S6).
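The flow of steps S1 through S6 can be sketched end to end as follows. The learner here is a toy feature-count scorer standing in for the SVM, Bayes, or decision-list learners the patent actually names; the function names, data, and threshold are illustrative assumptions:

```python
def train(cases, extract_feats):                             # steps S1-S2
    """Build 'learning result information' as per-feature counts of how
    often each feature co-occurs with positive vs. negative solutions."""
    counts = {}
    for text, sol in cases:
        for f in extract_feats(text):
            pos, neg = counts.get(f, (0, 0))
            counts[f] = (pos + (sol == "positive"), neg + (sol == "negative"))
    return counts

def classify(candidate, counts, extract_feats, threshold=0.0):  # steps S3-S6
    """Score a candidate's feature set and keep it if the degree of being
    'positive' is better than the predetermined level (threshold)."""
    score = 0
    for f in extract_feats(candidate):
        pos, neg = counts.get(f, (0, 0))
        score += pos - neg
    return "positive" if score > threshold else "negative"

cases = [("interacts with", "positive"), ("located near", "negative")]
feats = lambda s: set(s.split())
model = train(cases, feats)
```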
  • In this example, the binary relation extraction device 1 extracts binary relations of protein expressions (protein names) that interact with each other from a text database of biomedical papers. Here it is assumed that the protein expressions are identified with 100% accuracy.
  • In the process of creating the teacher data, when named-entity expressions such as protein expressions, disease names, and treatment methods are extracted as binary relation elements, the expressions that are binary relation elements are, for example, extracted based on tags.
  • FIG. 3 shows an example of teacher data.
  • English text data including binary relations with interacting protein expressions as elements is used as teacher data.
  • In the teacher data, a tag indicating the solution (correct/positive) is attached only to the binary relations to be extracted; that is, teacher data containing only positive cases is used in the machine learning process.
  • Fig. 3 (B) shows an example of tags attached to teacher data.
  • The teacher data includes two binary relation pairs P1 and P2.
  • The binary relation (pair) P1 consists of the first element p1 "delta-catenin" and the second element p2 "presenilin 1". The binary relation (pair) P2 consists of the first element p1 "presenilin (PS) 1" and the second element p2 "delta-catenin".
  • The solution-feature pair extraction unit 12 extracts pairs of a solution and a set of features from the cases in the text data stored in the teacher data storage unit 11, extracting, for example, the following information.
  • As words or characters appearing around the binary relation elements: for example, a predetermined number of words/characters before the first element, a predetermined number of words/characters after the second element, and a predetermined number of words/characters between the first and second elements.
  • Part-of-speech information is acquired using an existing morphological analysis processing method such as the morphological analysis system "ChaSen" (see: http://chasen.aist-nara.ac.jp/index.html.ja).
  • Part-of-speech information for English text data can be obtained, for example, with the tagger described in "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" (Eric Brill, Computational Linguistics, Vol. 21, No. 4, pp. 543-565, 1995).
  • The solution-feature pair extraction unit 12 extracts features from cases of tagged teacher data as shown in Fig. 3(B), and generates pairs of a feature set and a solution. For example, for the binary relation P2, a pair of the solution (positive) and the following feature set is generated, as shown in Fig. 5.
  • Based on these pairs of solution and feature set, the machine learning unit 13 performs machine learning processing to learn what kind of feature set is likely to be positive, and stores the learning result in the learning result storage unit 14.
  • The machine learning unit 13 uses, as a supervised machine learning method, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
  • the k-nearest neighbor method is a technique that uses the most similar k cases instead of the most similar case, and obtains a classification destination (solution) by majority decision of these k cases.
  • k is a predetermined integer number, and generally an odd number between 1 and 9 is used.
  • The simple Bayes method estimates the probability of each classification based on Bayes' theorem and takes the classification with the largest probability value as the classification destination.
  • Here, the context b is a set of predefined features f_j (∈ F, 1 ≤ j ≤ k). p(b) is the appearance probability of context b; since it does not depend on the classification a, it is a constant and is not calculated. P̃(a) and P̃(f_j | a) (where P̃ denotes p with a tilde) are probabilities estimated from the teacher data, meaning the appearance probability of classification a and the probability that a case of classification a has the feature f_j, respectively. If the value obtained by maximum likelihood estimation is used as P̃(f_j | a), the value often becomes zero, and it may be impossible to determine the classification destination because the value of Equation (2) becomes zero. Therefore, smoothing is performed; here, smoothing using the following Equation (3) was used.
  • freq(f_j, a) means the number of cases having the feature f_j and classification a, and freq(a) means the number of cases having classification a.
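Since the equation images did not survive in this text, the standard form consistent with the description is p(a | b) ∝ p(a) · ∏_j p(f_j | a), with smoothing applied to the conditional estimates. A minimal sketch, with additive (Laplace) smoothing standing in for the patent's unspecified smoothing formula (3):

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Count-based simple Bayes training; examples are (feature_set, class)."""
    cls_freq = Counter()
    feat_freq = defaultdict(Counter)   # feat_freq[cls][feature]
    vocab = set()
    for feats, cls in examples:
        cls_freq[cls] += 1
        for f in feats:
            feat_freq[cls][f] += 1
            vocab.add(f)
    return cls_freq, feat_freq, vocab

def classify_nb(feats, cls_freq, feat_freq, vocab, alpha=1.0):
    """argmax_a p(a) * prod_j p(f_j | a), smoothed so no factor is zero."""
    n = sum(cls_freq.values())
    best, best_lp = None, -math.inf
    for cls, ca in cls_freq.items():
        lp = math.log(ca / n)
        denom = sum(feat_freq[cls].values()) + alpha * len(vocab)
        for f in feats:
            lp += math.log((feat_freq[cls][f] + alpha) / denom)
        if lp > best_lp:
            best, best_lp = cls, lp
    return best

examples = [({"interacts", "with"}, "positive"), ({"near"}, "negative")]
model = train_nb(examples)
```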
  • In the decision list method, pairs of a feature and a classification destination are used as rules, and these are stored in a list in a predetermined priority order. When input data to be classified is given, the input is compared with the rule features in the order determined in the list, and the classification destination of the first rule whose feature matches is taken as the classification destination of the input.
  • In the decision list method, the probability value of each classification is obtained using only one of the predefined features f_j (∈ F, 1 ≤ j ≤ k).
  • The probability of outputting classification a in a context b is given by P̃(a | f_j) (where P̃ denotes p with a tilde), the rate of occurrence of classification a when the context has the feature f_j.
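A minimal sketch of such a decision list, where rules are ordered by the estimated conditional probability p(class | feature) and the first matching rule decides (the helper names and default class are assumptions for illustration):

```python
from collections import Counter, defaultdict

def build_decision_list(examples):
    """One rule per (feature -> class), ordered by p(class | feature)."""
    feat_cls = defaultdict(Counter)
    for feats, cls in examples:
        for f in feats:
            feat_cls[f][cls] += 1
    rules = []
    for f, ctr in feat_cls.items():
        cls, n = ctr.most_common(1)[0]
        rules.append((n / sum(ctr.values()), f, cls))
    rules.sort(reverse=True)           # highest-probability rule first
    return [(f, cls) for _, f, cls in rules]

def apply_decision_list(feats, rules, default="negative"):
    for f, cls in rules:               # first matching rule decides
        if f in feats:
            return cls
    return default

rules = build_decision_list(
    [({"interacts"}, "positive"), ({"near"}, "negative")])
```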
  • The maximum entropy method obtains the probability distribution p(a, b) that maximizes an expression representing entropy (Equation (7)) while satisfying the following constraint (Equation (6)), where F is the set of predefined features f_j (1 ≤ j ≤ k); among the probabilities of each classification determined according to this distribution, the classification with the highest probability value is obtained.
  • A and B mean the sets of classifications and contexts, and g_j(a, b) is a function that is 1 if context b has the feature f_j and the classification is a, and 0 otherwise.
  • P̃(a, b) (where P̃ denotes p with a tilde) means the rate of occurrence of (a, b) in the known data.
  • The expected value of the frequency of an output-feature pair is obtained by multiplying the probability p by the function g_j, which indicates the appearance of the output-feature pair.
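The equation images referenced above (Equations (6) and (7)) are not reproduced in this text; a reconstruction of the standard maximum-entropy formulation consistent with the surrounding description is:

```latex
% Reconstruction of the standard maximum-entropy constraint and objective;
% the equation numbers (6)-(7) follow the surrounding text.
\begin{align}
\sum_{a \in A,\, b \in B} p(a,b)\, g_j(a,b)
  &= \sum_{a \in A,\, b \in B} \tilde{p}(a,b)\, g_j(a,b)
  \qquad (1 \le j \le k) \tag{6} \\
H(p) &= -\sum_{a \in A,\, b \in B} p(a,b) \log p(a,b) \tag{7}
\end{align}
```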
  • The support vector machine method is a method that classifies data consisting of two classifications by dividing the space with a hyperplane.
  • FIG. 4 shows the concept of margin maximization in the support vector machine method.
  • In the figure, white circles indicate positive examples, black circles indicate negative examples, solid lines indicate the hyperplane dividing the space, and broken lines indicate the planes representing the boundaries of the margin region.
  • Fig. 4(A) is a conceptual diagram of the case where the interval between positive and negative examples is narrow (small margin), and Fig. 4(B) of the case where the interval is wide (large margin).
  • In an expanded form of the method, the learning data may include a small number of cases inside the margin region, and the linear part of the hyperplane is extended to be non-linear (by the introduction of a kernel function).
  • This extended method is equivalent to classification using the following discriminant function, and the two classifications can be discriminated according to whether the output value of the discriminant function is positive or negative.
  • each ⁇ is the value when maximizing Eq. (9) under the constraints of Eqs. (10) and (11).
  • The function K is called a kernel function, and various functions are used; in this embodiment, the following polynomial kernel is used.
  • C and d are experimentally set constants; C was fixed to 1 throughout the entire process, and two values of d, 1 and 2, were tested.
  • The x_i with α_i > 0 are called support vectors, and the summation in Equation (8) is usually computed using only these cases. In other words, only the cases in the learning data called support vectors are used in the actual analysis.
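The equation images referenced here (Equations (8) through (12)) are likewise missing; a reconstruction of the standard soft-margin SVM formulas that matches the surrounding description (dual objective, box constraints with the constant C, and a polynomial kernel of degree d) is:

```latex
% Reconstruction of the standard SVM dual formulation; the equation
% numbers (8)-(12) follow the surrounding text.
\begin{align}
f(x) &= \operatorname{sgn}\Bigl(\sum_{i} \alpha_i y_i K(x_i, x) + b\Bigr) \tag{8} \\
L(\alpha) &= \sum_{i} \alpha_i
  - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{9} \\
0 &\le \alpha_i \le C \tag{10} \\
\sum_{i} \alpha_i y_i &= 0 \tag{11} \\
K(x_1, x_2) &= (x_1 \cdot x_2 + 1)^d \tag{12}
\end{align}
```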
  • The support vector machine method handles data with two classifications. Therefore, when dealing with cases having three or more classifications, it is usually used in combination with a method such as the pairwise method or the one-vs-rest method.
  • The pairwise method generates, for data with n classifications, all pairs of two different classification destinations (n(n−1)/2 pairs), and for each pair obtains a binary classifier, i.e., a support vector machine processing module, that determines which of the two is better; the final classification destination is obtained by majority decision over the classification destinations produced by the n(n−1)/2 binary classifications.
  • The one-vs-rest method, for example with three classification destinations a, b, and c, forms the three sets "classification destination a vs. the rest", "classification destination b vs. the rest", and "classification destination c vs. the rest", and performs learning processing for each set using the support vector machine method. Using the learning results of the three support vector machines, the classification destination of the classifier whose separating hyperplane is farthest from the candidate is obtained. For example, if a candidate is farthest from the separating hyperplane of the support vector machine created by the learning process of the "classification destination a vs. the rest" set, the candidate's classification destination is taken to be a.
  • The candidate extraction unit 15 extracts binary relation candidates from the input new text data 2. Specifically, text data 2 is divided into sentences, and the expressions (character strings) that are binary relation elements in each sentence are extracted. Then, it is checked whether two or more expressions that are binary relation elements exist in a sentence, and all two-element combinations (pairs) of the binary relation elements in a sentence are generated as binary relation candidates.
  • Alternatively, new text data 2 may be divided into paragraphs, the expressions that are binary relation elements in each paragraph extracted, and, for paragraphs having two or more elements, all two-element combinations (pairs) of elements from the same paragraph generated as binary relation candidates.
  • As the method for extracting expressions that are binary relation elements from text data 2, the methods described above for generating teacher data are used: for example, expressions matching a pattern or dictionary entry are extracted, or expressions estimated based on the learning result of supervised machine learning are extracted.
  • The extracted pairs of elements are determined as binary relation candidates; if three or more elements appear in a sentence, every pair in the combinations of those elements becomes a binary relation candidate.
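The candidate extraction step above can be sketched directly; `find_elements` stands in for the pattern/dictionary/NER element spotter, and the dictionary here is illustrative:

```python
from itertools import combinations

def binary_relation_candidates(sentences, find_elements):
    """Keep only sentences with >= 2 recognized elements and emit every
    element pair as a binary relation candidate, as described above."""
    for sent in sentences:
        elems = find_elements(sent)
        if len(elems) >= 2:
            for pair in combinations(elems, 2):
                yield sent, pair

dictionary = {"delta-catenin", "presenilin", "GFAP"}
spot = lambda s: [w for w in s.split() if w in dictionary]
sents = ["delta-catenin binds presenilin near GFAP", "no elements here"]
cands = list(binary_relation_candidates(sents, spot))
# 3 elements in the first sentence -> 3 candidate pairs; none in the second
```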
  • The feature extraction unit 16 extracts features from the binary relation candidates by the same processing as the solution-feature pair extraction unit 12.
  • The solution estimation unit 17 estimates, for the feature set of each candidate, the likelihood of the positive solution. Based on the estimation result of the solution estimation unit 17, the binary relation extraction unit 18 outputs, as binary relations 3, the candidates estimated with a high degree to be likely to have the positive solution.
  • the above features were extracted and the support vector machine method was used as the machine learning process.
  • the F value is the harmonic mean of recall and precision.
  • The recall is the ratio indicating how many of the binary relations that should be extracted from the text data 2 were actually output.
  • The precision (relevance ratio) is the ratio indicating how many of the binary relations output by the binary relation extraction device 1 are binary relations that should be extracted.
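The three measures defined above can be computed as follows (the function name and the toy extracted/gold pair sets are illustrative):

```python
def evaluate(extracted, gold):
    """Recall, precision, and F-value (their harmonic mean) as defined above."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # correctly output pairs
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(extracted) if extracted else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f

# One of two output pairs is correct, and one of two gold pairs was found.
r, p, f = evaluate({("a", "b"), ("a", "c")}, {("a", "b"), ("b", "c")})
```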
  • The machine learning unit 13 performs machine learning processing, based on a predetermined machine learning algorithm, using the pairs of each binary relation's solution and feature set obtained from the given teacher data, to learn what kind of feature set results in what kind of solution; information indicating this is stored in the learning result storage unit 14 as learning result information, and the solution estimation unit 17 estimates, based on the learning result information, the degree to which the feature set of a binary relation candidate is likely to yield the solution.
  • When the k-nearest neighbour method is used, the machine learning unit 13 defines the similarity between cases of the teacher data based on the ratio of overlapping features (the number of shared features) between their feature sets, and stores the defined similarity measure and the cases in the learning result storage unit 14 as learning result information.
  • The solution estimation unit 17 then refers to the similarity measure and the cases in the learning result storage unit 14, selects the k cases most similar to the binary relation candidate extracted from the text data 2, and estimates the classification determined by majority vote among the selected k cases as the classification (solution) of the binary relation candidate. In other words, the solution estimation unit 17 takes the number of votes obtained by a classification as the degree of likelihood that the candidate's feature set leads to that solution.
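The k-nearest-neighbour estimation above can be sketched as follows. The similarity measure (Jaccard-style overlap ratio) and the feature names are assumptions for illustration; the patent only specifies "the ratio of overlapping features".

```python
# k-NN over feature sets: similarity = ratio of overlapping features,
# solution = majority vote over the k most similar teacher-data cases.
def similarity(features_a, features_b):
    return len(features_a & features_b) / max(len(features_a | features_b), 1)

def knn_estimate(candidate_features, training_cases, k=3):
    # training_cases: list of (feature_set, solution) pairs from teacher data
    ranked = sorted(training_cases,
                    key=lambda case: similarity(candidate_features, case[0]),
                    reverse=True)
    votes = [solution for _, solution in ranked[:k]]
    # the winning classification's vote count is the "degree" of the solution
    return max(set(votes), key=votes.count)

training = [({"dist_small", "order_AB"}, "positive"),
            ({"dist_small", "order_BA"}, "positive"),
            ({"dist_large", "order_AB"}, "negative")]
print(knn_estimate({"dist_small", "order_AB"}, training, k=3))
```

The candidate overlaps most with the two positive cases, so the majority vote returns "positive".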
  • When the naive Bayes method is used, the machine learning unit 13 stores the pairs of each case's solution and feature set from the teacher data in the learning result storage unit 14 as learning result information. Then, when new text data 2 is input, the solution estimation unit 17 uses the stored learning result information and Bayes' theorem to calculate the probability of each classification for the feature set of a binary relation candidate acquired by the feature extraction unit 16, and selects the classification with the highest probability as the classification (solution) of the candidate. In other words, the solution estimation unit 17 takes the probability of each classification, here the probability of "to be extracted", as the degree of likelihood that the candidate's feature set leads to that solution.
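A toy version of the naive Bayes estimation above can be written as follows. The add-one smoothing and the feature names are assumptions added so the sketch runs; the patent only specifies classification via Bayes' theorem over feature sets.

```python
# Naive Bayes over feature sets: P(class) * product of per-feature likelihoods,
# assuming feature independence; the highest-scoring class is the solution.
def naive_bayes(candidate_features, training_cases):
    labels = {label for _, label in training_cases}
    scores = {}
    for label in labels:
        cases = [fs for fs, lab in training_cases if lab == label]
        prior = len(cases) / len(training_cases)
        score = prior
        for feat in candidate_features:
            count = sum(1 for fs in cases if feat in fs)
            score *= (count + 1) / (len(cases) + 2)   # add-one smoothing
        scores[label] = score
    return max(scores, key=scores.get)

training = [({"dist_small", "order_AB"}, "positive"),
            ({"dist_small", "order_BA"}, "positive"),
            ({"dist_large", "order_AB"}, "negative")]
print(naive_bayes({"dist_small", "order_AB"}, training))
```

The per-class scores here play the role of the "probability of each classification" in the text: the class with the larger score is selected as the solution.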
  • When the decision list method is used, the machine learning unit 13 stores in the learning result storage unit 14 a list of rules, learned from the teacher data, that pair a feature with a classification, arranged in a predetermined priority order. Then, when new text data 2 is input, the solution estimation unit 17 compares the features of the binary relation candidate extracted from the text data 2 with the rules, in descending order of priority in the list, and estimates the classification of the first rule whose feature matches as the classification (solution) of the candidate. In other words, the solution estimation unit 17 takes the priority in the list, or a numerical value or scale corresponding to it, here the priority of the rule classifying the candidate as "to be extracted", as the degree of likelihood that the candidate's feature set leads to that solution.
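The decision-list estimation above reduces to a first-match scan over prioritized rules. The rules, feature names, and default class below are invented for illustration.

```python
# Decision list: (feature, classification) rules in descending priority;
# the first rule whose feature appears in the candidate's set decides.
decision_list = [
    ("dist_extra_large", "negative"),   # highest-priority rule
    ("order_AB", "positive"),
    ("dist_small", "positive"),
]

def decision_list_estimate(candidate_features, rules, default="negative"):
    for feature, classification in rules:   # scanned in priority order
        if feature in candidate_features:
            return classification
    return default

print(decision_list_estimate({"dist_small", "order_BA"}, decision_list))
```

The position of the matching rule in the list corresponds to the "priority" that the text uses as the degree of the solution.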
  • When the maximum entropy method is used, the machine learning unit 13 identifies the classes that can be solutions from the cases of the teacher data, and obtains the probability distribution over feature sets and solution classes that maximizes an entropy expression while satisfying a predetermined conditional expression, storing it in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 uses the probability distribution in the learning result storage unit 14 to obtain, for the feature set of the binary relation candidate extracted from the text data 2, the probability of each class that can be a solution, identifies the class with the largest probability value, and estimates that class as the solution for the candidate. In other words, the solution estimation unit 17 takes the probability of being classified into each class, here the probability of "to be extracted", as the degree of likelihood that the candidate's feature set leads to that solution.
  • When the support vector machine method is used, the machine learning unit 13 identifies the classes that can be solutions from the cases of the teacher data and divides them into positive and negative examples. Then, in a space in which the feature sets of the cases are mapped to dimensions by a predetermined function using a kernel function, it obtains the hyperplane that maximizes the margin between the positive and negative examples and separates them, and stores it in the learning result storage unit 14. When new text data 2 is input, the solution estimation unit 17 uses the hyperplane in the learning result storage unit 14 to determine on which side of the hyperplane the feature set of the binary relation candidate extracted from the text data 2 falls.
  • In this case, the solution estimation unit 17 takes the distance from the separating hyperplane into the space of the positive examples (binary relations to be extracted) as the degree of likelihood that the candidate's feature set leads to the solution. More specifically, when binary relations to be extracted are positive examples and binary relations that should not be extracted are negative examples, a case located in the space on the positive-example side of the separating hyperplane is given, as its degree, its distance from the separating hyperplane.
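Once a separating hyperplane has been learned, the "degree" described above can be read off as a signed distance from that plane. The weights below are invented for illustration; a real system would learn them (possibly with a kernel) from the teacher data.

```python
# Degree = signed distance of a candidate's feature vector from the learned
# hyperplane w.x + b = 0; positive side = "to be extracted".
import math

w = {"dist_small": 1.5, "order_AB": 0.5, "dist_extra_large": -2.0}  # assumed weights
b = -1.0                                                            # assumed bias

def degree(candidate_features):
    score = b + sum(w.get(f, 0.0) for f in candidate_features)  # w.x + b (binary features)
    norm = math.sqrt(sum(v * v for v in w.values()))
    return score / norm   # signed distance from the separating plane

print(degree({"dist_small", "order_AB"}))   # positive side of the plane
print(degree({"dist_extra_large"}))         # negative side of the plane
```

A candidate deep on the positive side gets a large degree and is the most confidently extracted; a candidate on the negative side is rejected.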
  • the solution-feature pair extraction unit 12 may use, for example, “words of two elements themselves” as the feature.
  • The first and second words/character strings from the front of each element, and the first and second words/character strings from the rear, may also be used as features. In the case of Fig. 3 (A), the features are:
  • the first element is“ presenilin (PS) 1 ”;
  • the second element is "delta-catenin";
  • the first word of the first element is “presenilin”
  • the second word is “(PS)”
  • the second word from the end of the first element is “(PS)”;
  • the first word from the end is “1”;
  • the first word of the second element is "delta"
  • the second word is “-”
  • the first character of the second element is "d"
  • the part of speech of the second word is “verb”;
  • the word before the second element is “of”;
  • The distance between the two elements may also be used as a feature, bucketed into states: a state of fewer than 5 words is labelled "distance medium", a state from 5 to 9 words "distance large", and a state of 10 or more words "distance extra large"; the resulting label is the feature.
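The bucketed distance feature above can be sketched as follows. The boundary for the "medium" bucket is an assumption (the text only gives 5-9 as "large" and 10 or more as "extra large").

```python
# Map the raw word distance between the two elements to a coarse label.
def distance_feature(word_distance):
    if word_distance >= 10:
        return "distance extra large"
    if word_distance >= 5:
        return "distance large"
    return "distance medium"   # assumed bucket for shorter distances

print(distance_feature(3), distance_feature(7), distance_feature(12))
```

Coarse buckets let the learner generalize over exact distances instead of treating every integer distance as a distinct feature.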
  • the appearance order of elements may be used as a feature.
  • For example, the information that "the first element is a disease name and the second element is a treatment method", or that "the first element is a treatment method and the second element is a disease name", is a feature.
  • By giving the binary relation extraction device 1, as teacher data, examples of various binary relations, such as the binary relation between a disease name and a treatment method, between a disease name and a protein expression, between a disease name and an organ name, between interacting protein expressions, and abbreviation-term relations, these binary relations can be extracted from text data 2 of biomedical papers.
  • text data including the following binary relations can be used as teacher data.
  • Oral corticosteroids (element: treatment method) are the preference of many for the treatment of CIDP (element: disease name), being much less expensive than IVIG (element: treatment method) infusion or TA (element: treatment method).
  • CM (complementary metaplasia) is mainly found in gastric mucosa (element: organ name) that harbours gastric cancer (element: disease name)
  • Variant Creutzfeldt-Jakob disease (element: disease name) is a transmissible spongiform encephalopathy believed to be caused by the bovine (element: animal species) spongiform encephalopathy agent, an abnormal isoform of the prion protein (PrP(sc)) (element: protein expression)
  • AIDP disease name
  • CIDP disease name
  • carbohydrate epitope -NeuAcalpha2-8NeuAcalpha2-3Galbetal-4Glc-
  • BSE PrP
  • In this way, teacher data for machine learning processing is used, so that binary relations estimated to be worth extracting can be extracted automatically from new text data. This avoids the complexity of manually generating the patterns used in conventional binary relation extraction processing.
  • Moreover, as the accuracy of supervised machine learning improves, the performance of the binary relation extraction processing can also be expected to improve.
  • The information retrieval device 4 regards the relationship between two search keywords in AND search processing as a potentially meaningful binary relation. It performs machine learning using teacher data in which binary relations having such search keywords as elements are tagged with a solution of either "relation that should be extracted (positive)" or "relation that should not be extracted (negative)".
  • It is a processing device that then outputs, as search results 6, the articles from the search text data 5 that contain the two search keywords as a pair estimated to be a binary relation that should be extracted.
  • FIG. 6 shows a configuration example of the information search device 4 according to the present invention.
  • the information retrieval device 4 includes an information retrieval unit 40, a teacher data storage unit 41, a feature pair extraction unit 42, a machine learning unit 43, a learning result storage unit 44, a candidate extraction unit 45, a feature extraction unit 46, and a solution estimation unit 47. , And a search result extraction unit 48.
  • The teacher data storage unit 41, solution-feature pair extraction unit 42, machine learning unit 43, learning result storage unit 44, candidate extraction unit 45, feature extraction unit 46, and solution estimation unit 47 of the information search device 4 are processing means that perform processing similar to the teacher data storage unit 11, solution-feature pair extraction unit 12, machine learning unit 13, learning result storage unit 14, candidate extraction unit 15, feature extraction unit 16, and solution estimation unit 17 of the binary relation extraction device 1 shown in Fig. 1.
  • the information search unit 40 searches the search text data 5 using the search keyword given in the AND search process, and acquires the corresponding article (text data).
  • the candidate extraction unit 45 extracts a binary relation candidate having the same character string (word) pair as two search keywords included in the article acquired by the information search unit 40 as elements.
  • The search result extraction unit 48 extracts, from the binary relation candidates of the articles searched from the search text data 5, those estimated to be positive (to be extracted) to a degree better than a certain threshold, and outputs as search results 6 the articles containing the extracted binary relation candidates, or information that identifies those articles.
  • Fig. 7 shows the processing flow of the information retrieval device 4.
  • The teacher data storage unit 41 of the information search device 4 stores, as teacher data, text data including cases in which a binary relation having two search keywords given in AND search processing as its elements is annotated with a "solution": either that it is a binary relation to be extracted (positive) or that it is not (negative).
  • The solution-feature pair extraction unit 42 extracts predetermined features for each case from the teacher data in the teacher data storage unit 41, and generates a pair of the solution (the information given by the tag) and the set of extracted features (step S11).
  • Specifically, the solution-feature pair extraction unit 42 extracts the binary relations, identified by a predetermined tag, from the text data that is the teacher data. Predetermined features of the search keywords (elements) are then extracted by, for example, morphological analysis processing, syntax analysis processing, calculating the positions at which the elements appear, and calculating the distance between the elements.
  • The machine learning unit 43 performs machine learning, from the pairs of solution and feature set generated by the solution-feature pair extraction unit 42, of what kind of solution (positive or negative) results from what kind of feature set, and stores the learning result in the learning result storage unit 44 (step S12).
  • The machine learning unit 43 performs this processing using a supervised machine learning method such as the k-nearest neighbour method, the naive Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
  • The candidate extraction unit 45 generates all pairs of the two input search keywords given in the AND search processing (step S13).
  • the information search unit 40 performs an AND search on the search text data 5 using two pairs of input search keywords to extract articles (text data) including the input search keyword pairs.
  • From each extracted article, all combinations (pairs) of the two input search keywords are extracted as binary relation candidates (step S14).
  • The feature extraction unit 46 uses processing similar to that of the solution-feature pair extraction unit 42 to extract a set of predetermined features for each binary relation candidate appearing in the searched articles (step S15).
  • The solution estimation unit 47 estimates, for each candidate, which solution its feature set is likely to lead to, that is, the degree to which it is "likely to be positive" or "likely to be negative", based on the learning results in the learning result storage unit 44 (step S16). Then, the search result extraction unit 48 selects, as binary relations to be extracted, the candidates estimated to be "likely to be positive" to a degree better than a predetermined threshold, and outputs the articles containing these binary relations, or information identifying those articles, as search results 6 (step S17).
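The selection in steps S16-S17 amounts to thresholding the estimated positive degree of each candidate. The article names, degrees, and threshold below are invented for illustration.

```python
# Keep the articles whose candidate pair is estimated "positive" to a degree
# better than a chosen threshold; these form search result 6.
candidates = [
    {"article": "doc-1", "pair": ("Kyoto Univ.", "President"), "degree": 0.92},
    {"article": "doc-2", "pair": ("Kyoto Univ.", "President"), "degree": 0.31},
    {"article": "doc-3", "pair": ("Kyoto Univ.", "President"), "degree": 0.77},
]

THRESHOLD = 0.5   # assumed cut-off for "a degree better than a predetermined threshold"
search_result = sorted({c["article"] for c in candidates if c["degree"] > THRESHOLD})
print(search_result)
```

Only the articles whose keyword pair clears the threshold are returned, which is what distinguishes this device from a plain AND search.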
  • In this way, the information search device 4 treats the search text data 5 as text data containing binary relations whose elements are character strings that can serve as the two search keywords used in AND search processing. It creates binary relation candidates whose elements are the input search keywords given in the AND search processing, retrieves articles from the search text data 5 using these candidates, estimates whether each candidate appearing in a retrieved article should be extracted and to what degree, and outputs the articles containing candidates that should be extracted as search results 6.
  • Figs. 8 to 10 show examples of teacher data stored in the teacher data storage unit 41 and examples of features extracted from the teacher data by the solution-feature pair extraction unit 42.
  • The teacher data D1 and D2 in Figs. 8 and 9 are given a tag indicating that they are binary relations to be extracted, i.e. that the solution is positive.
  • the teacher data D3 in Fig. 10 is given a tag that indicates that the binary relation should not be extracted and that the solution is negative.
  • the teacher data D1 in Fig. 8 includes a binary relation pair P3, which is a pair of two search keywords.
  • The binary relation (pair) P3 consists of the first element p1 (search keyword K1) "Kyoto Univ." and the second element p2 (search keyword K2) "President", and the positive solution is given to the binary relation pair P3.
  • The teacher data D2 in Fig. 9 includes a binary relation pair P4, which is a pair of two search keywords. The binary relation (pair) P4 consists of the first element p1 (search keyword K1) "Kyoto Univ." and the second element p2 (search keyword K2) "President", and the positive solution is given to the binary relation pair P4.
  • This is because the teacher data in Figs. 8 and 9 can be judged to express the content "President of Kyoto University".
  • The teacher data D3 in Fig. 10 includes a binary relation pair P5, which is a pair of two search keywords. The binary relation (pair) P5 consists of the first element p1 (search keyword K1) "Kyoto Univ." and the second element p2 (search keyword K2) "President", and the negative solution is given. This is because, although "Kyoto Univ." and "President" appear in the same data, they are not related to each other, and the data can be judged not to express the content "President of Kyoto University".
  • the solution-feature pair extraction unit 42 extracts a set of a solution and a set of features from the example of the teacher data stored in the teacher data storage unit 41.
  • Here, the features are the two words before and after each element (search keyword) and the parts of speech of those words. For example, taking teacher data D1, the features for the first element are as follows:
  • the next word is “ga”;
  • the part of speech of the next word is “particle”;
  • the second word is “attendance”
  • the part of speech of the word after the second is “noun”.
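The context-window features listed above can be sketched as follows. The POS lookup here is a stub dictionary standing in for real morphological analysis, and the feature-name format is an assumption.

```python
# Extract the two words around an element and their parts of speech as features.
POS = {"ga": "particle", "attendance": "noun", "of": "preposition"}  # stub tagger

def context_features(tokens, element_index, window=2):
    feats = set()
    for offset in range(1, window + 1):
        for side, idx in (("next", element_index + offset),
                          ("prev", element_index - offset)):
            if 0 <= idx < len(tokens):
                word = tokens[idx]
                feats.add(f"{side}{offset}_word={word}")
                feats.add(f"{side}{offset}_pos={POS.get(word, 'unknown')}")
    return feats

tokens = ["Kyoto Univ.", "ga", "attendance"]
feats = context_features(tokens, 0)   # features around the first element
print(sorted(feats))
```

For an element at the start of the sentence there are no preceding words, so only the "next" features are produced, mirroring the D1 example above.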
  • In addition, the solution-feature pair extraction unit 42 can extract the kinds of information described for the binary relation extraction processing as features.
  • The machine learning unit 43 performs machine learning processing of which solution (positive/negative) is likely to result from which feature set, and stores the learning result in the learning result storage unit 44.
  • As the supervised machine learning method, the machine learning unit 43 uses the processing methods described above, such as the k-nearest neighbour method, the naive Bayes method, the decision list method, the maximum entropy method, and the support vector machine method.
  • The information search unit 40 performs an AND search on the search text data 5 based on the given input search keywords "Kyoto Univ." and "President", and acquires articles including the input search keywords.
  • The candidate extraction unit 45 extracts binary relation candidates from the extracted articles. Specifically, binary relation candidates are extracted from the input search keywords included in the articles that are the results of the AND search.
  • The feature extraction unit 46 extracts the same features as the solution-feature pair extraction unit 42 from the binary relation candidates, and the solution estimation unit 47 estimates the degree of likelihood of the solution for each candidate using the learning results stored in the learning result storage unit 44.
  • Based on the estimation result of the solution estimation unit 47, the search result extraction unit 48 extracts from the binary relation candidates those estimated to be positive to a good degree, and outputs the articles containing these binary relations, or information identifying those articles, as search results 6.
  • The candidate extraction unit 45 generates all combinations (pairs) of two input search keywords from the given input search keywords, and takes each generated pair as a binary relation candidate. Then, the information search unit 40 performs AND search processing using the elements of each binary relation candidate (the two input search keywords). The feature extraction unit 46 then extracts a set of predetermined features for the binary relation candidates appearing in the retrieved articles.
  • The solution estimation unit 47 then estimates the degree of likelihood of the solution for each candidate's feature set. When each binary relation candidate that is an input search keyword pair appears only once in a searched article, the degree to which each of these candidates is positive (to be extracted) is estimated, and the articles containing candidates estimated positive to a sufficient degree, or information identifying those articles, are output as search results 6.
  • Alternatively, the candidate extraction unit 45 generates all pairs of two input search keywords from the given input search keywords, and takes each generated pair as a binary relation candidate. Then, the information search unit 40 performs AND search processing using the elements of each binary relation candidate (the two input search keywords), and the feature extraction unit 46 extracts a set of predetermined features for the binary relation candidates appearing in the retrieved articles.
  • The solution estimation unit 47 then estimates the degree of likelihood of the solution for each candidate's feature set. When each binary relation candidate that is an input search keyword pair appears only once in a searched article, the degree to which each candidate is positive (to be extracted) is estimated, and the product of the estimated positive degrees over all the binary relation candidates is taken as the positive degree of the article.
  • The articles whose positive degree is estimated to be sufficiently good, or information identifying those articles, are then output as search results 6.
  • When a binary relation candidate appears multiple times in an article, the degree of positiveness is estimated for each of the multiple appearances, and the highest value among them is taken as the degree of that binary relation candidate. The degree of each binary relation candidate is obtained in this way, and the result of multiplying these degrees together is the positive degree of the article. The articles whose positive degree is estimated to be sufficiently good, or information identifying those articles, are output as search results 6.
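The article-scoring variant above (maximum over occurrences of a pair, product over pairs) can be sketched as follows. The pairs and degree values are invented for illustration.

```python
# Article positive degree: for each candidate pair, take the best degree over
# its occurrences; multiply those per-pair degrees across all pairs.
def article_degree(occurrence_degrees_per_pair):
    # occurrence_degrees_per_pair: {pair: [degree of each occurrence]}
    degree = 1.0
    for degrees in occurrence_degrees_per_pair.values():
        degree *= max(degrees)   # the best occurrence represents the pair
    return degree

article = {("Kyoto Univ.", "President"): [0.4, 0.9],
           ("2000", "Kyoto Univ."): [0.5]}
print(article_degree(article))   # max(0.4, 0.9) * max(0.5)
```

Taking the maximum per pair means one clearly related occurrence is enough; taking the product across pairs requires every keyword pair in the article to look related.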
  • In this way, teacher data for machine learning processing is used: it is only necessary to prepare text data annotated with an evaluation of whether or not the binary relation between two search keywords in AND search processing is a binary relation to be extracted. Articles containing binary relations that deserve to be extracted can then be extracted automatically from new search text data 5.
  • By evaluating, with the binary relation extraction processing, the relation between the search keywords appearing in the articles returned by the AND search processing, the information search device 4 of the present invention can output as search results only those articles in which the search keywords are actually related.
  • improvement in the performance of information retrieval can be expected by improving the accuracy of supervised machine learning.
  • the example of the binary relation composed of two elements has been described in the binary relation extraction process and the information search process.
  • the present invention can also be applied to a ternary relationship composed of three elements.
  • In this case, the solution-feature pair extraction unit 12 determines the features of the ternary relation from, for example, the word information of the first element (the element appearing first), the second element (the element appearing in the middle), and the third element (the last element), of all the words between the first and second elements, and of all the words between the second and third elements.
  • The machine learning unit 13 can then learn the likelihood of each solution based on the feature sets of ternary relations, and the binary relation extraction unit 18 can handle the extraction of ternary relations.
  • the solution given to the ternary relation is the same as in the case of the binary relation: “ternary relation that should be extracted” or “ternary relation that should not be extracted”.
  • Alternatively, each processing means of the binary relation extraction device 1 can treat the binary relations obtained by decomposing the ternary relation of the teacher data, namely the binary relation between the first and second elements, the binary relation between the second and third elements, and the binary relation between the first and third elements, as separate binary relations. A ternary relation that should be extracted is then reconstructed by combining the extracted binary relations.
  • In this case, the binary relation extraction unit 18 can obtain a confidence of extraction when extracting a binary relation 3. The product of the confidences of the combined binary relations is used as the confidence of a ternary relation created by combining multiple binary relations, and the ternary relation with the highest confidence is extracted.
  • As the confidence of a binary relation, the confidence calculated in the normal machine learning processing is used.
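The ternary-relation reconstruction above can be sketched as follows: a triple's confidence is the product of the confidences of its three component binary relations. The confidence values are invented for illustration.

```python
# Confidence of a reconstructed ternary relation = product of the confidences
# of the three binary relations it decomposes into.
binary_confidence = {          # (element_a, element_b) -> extraction confidence
    ("2000", "Kyoto Univ."): 0.8,
    ("Kyoto Univ.", "President"): 0.9,
    ("2000", "President"): 0.7,
    ("2000", "Osaka"): 0.6,
}

def ternary_confidence(e1, e2, e3):
    pairs = [(e1, e2), (e2, e3), (e1, e3)]
    conf = 1.0
    for pair in pairs:
        if pair not in binary_confidence:
            return 0.0          # a missing binary relation rules the triple out
        conf *= binary_confidence[pair]
    return conf

print(ternary_confidence("2000", "Kyoto Univ.", "President"))   # 0.8 * 0.9 * 0.7
```

Among candidate triples, the one with the highest product is extracted; any triple with a missing component binary relation gets confidence 0 and is discarded.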
  • Such ternary relation extraction processing can be performed in the same manner in the information retrieval device 4. For example, when searching for articles related to "the President of Kyoto University in 2000", teacher data containing ternary relations among the three search keywords "2000", "Kyoto University", and "President" is given, and search results 6 of an AND search using these three search keywords are output from the search text data 5.
  • the present invention can be implemented as a program read and executed by a computer.
  • The program for realizing the present invention can be stored in an appropriate computer-readable recording medium such as a portable memory medium, a semiconductor memory, or a hard disk, and is provided by being recorded on such a recording medium, or by transmission and reception using various communication networks via a communication interface.


Abstract

Provided is a device capable of extracting binary relations efficiently even for complex problems. A solution-feature pair extraction unit (12) extracts the features of each case from a teacher data storage unit (11) that stores teacher data containing cases in which a solution, indicating whether it is to be extracted, is given to a binary relation appearing in text data, and creates a pair of the feature set and the solution. A machine learning unit (13) machine-learns, by a predetermined machine learning method, what solution a given feature set leads to, and stores the learning result information in a learning result storage unit (14). A candidate extraction unit (15) extracts candidates for binary relations from text data (2), and a feature extraction unit (16) extracts the feature set of each binary relation candidate. On the basis of the learning result information, a solution estimation unit (17) estimates the degree of likelihood of the solution for the feature set of each binary relation candidate. From the estimation result, a binary relation extraction unit (18) extracts the binary relation candidates with a good estimate of a positive solution.

Description

Specification
Binary relation extraction device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing

Technical Field
[0001] The present invention relates to a binary relation extraction technique for extracting pairs of expressions (words, character strings, etc.) having a binary relation from text data using supervised machine learning processing, and to an information retrieval technique using the binary relation extraction processing.
Background Art
[0002] As a technique for extracting information from sources such as text databases, a method is known that extracts desired information by focusing on binary relations between related words. For example, the method of Non-Patent Document 1 gives pattern frames, based on the predicate-argument structures produced by syntax analysis, for extracting the information to be obtained; patterns are extracted from a corpus with correct answers, inappropriate patterns are eliminated from those extracted, and matching information is then extracted using the selected patterns.
Non-Patent Document 1: Akane Yakushiji et al., "Information Extraction in the Medical and Biological Fields Using Predicate-Argument Structure Patterns", 11th Annual Conference of the Association for Natural Language Processing, March 2005
Disclosure of the Invention
Problems to Be Solved by the Invention
[0003] Conventionally, techniques that extract binary relations using manually created patterns have mainly been used. In the method of Non-Patent Document 1, in order to improve pattern accuracy, the patterns are selected by checking them against a learning corpus, thereby improving the accuracy of the binary relation extraction processing.
[0004] However, when patterns are used as binary relation extraction rules, there is the problem that the patterns become complicated as the target problem becomes complex. For this reason, methods using patterns have their limits. There is also the problem that the performance of such extraction methods does not become high.
[0005] An object of the present invention is to provide a binary relation extraction device that can be used for any problem of extracting binary relations from text data, and that can extract binary relations with good performance even for complex problems. Another object of the present invention is to provide an information retrieval device using the binary relation extraction processing, the processing methods executed by these devices, and programs for causing a computer to function as these devices.
Means for Solving the Problem
[0006] The present invention is a binary relation extraction processing device that extracts, using machine learning processing, binary relations appearing in sentence data stored in a computer-readable storage device, comprising: 1) teacher data storage means storing teacher data that includes cases each consisting of a pair of a problem and a solution, where the problem is a binary relation appearing in the sentence data and the solution indicates that it is a binary relation to be extracted; 2) solution-feature pair extraction means for taking the cases out of the teacher data storage means, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features; 3) machine learning means for machine-learning, based on a predetermined machine learning algorithm, from the pairs of solution and feature set, what kind of feature set leads to the solution, and storing information indicating this in learning result storage means as learning result information; 4) candidate extraction means for extracting the elements of binary relations from text data stored in the storage device, extracting pairs composed of the elements, and taking the extracted pairs as binary relation candidates; 5) feature extraction means for extracting the predetermined information as features for the binary relation candidates by extraction processing similar to that performed by the solution-feature pair extraction means; 6) solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree of likelihood of the solution for the feature set of each binary relation candidate; and 7) binary relation extraction means for selecting a binary relation candidate as a binary relation to be extracted when, in the estimation result, its degree of likelihood of the solution is better than a predetermined level.
ことを特徴とする。  It is characterized by that.
[0007] In the present invention, teacher data including cases in which binary relations appearing in sentence data are annotated with solution information indicating that they are binary relations to be extracted is stored in the teacher data storage means in advance. The solution-feature pair extraction means then takes the cases out of the teacher data storage means and, for each case, extracts predetermined information as features and generates a pair of the extracted feature set and the solution. Further, the machine learning means performs machine learning processing, on the basis of a predetermined machine learning algorithm and on the pairs of solution and feature set, of what kind of feature set leads to what solution, and saves information indicating "what kind of feature set leads to what solution" in the learning result storage means as learning result information.
[0008] Thereafter, when the candidate extraction means extracts the elements of binary relations from text data stored in the storage device, extracts pairs composed of those elements, and treats the extracted pairs as binary relation candidates, the feature extraction means extracts the predetermined information as features for each binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means. Then, on the basis of the learning result information stored in the learning result storage means, the solution estimation means estimates the degree to which the feature set of each binary relation candidate is likely to yield the solution, and the binary relation extraction means extracts, from the estimation results, the binary relation candidates whose degree of likelihood of being the solution is better than a predetermined level.
[0009] The present invention is also an information retrieval device that, in information retrieval processing with a plurality of search keywords, extracts search results by using the results of binary relation extraction processing based on supervised machine learning, comprising: 1) teacher data storage means storing teacher data that includes cases each consisting of a problem and a solution, where the problem is a binary relation whose elements are search keywords and the solution indicates that it is a binary relation to be extracted; 2) solution-feature pair extraction means for taking the cases out of the teacher data storage means, extracting predetermined information as features for each case, and generating pairs of the solution and the set of extracted features; 3) machine learning means for performing machine learning processing, on the basis of a predetermined machine learning algorithm and from the pairs of solution and feature set, of what kind of feature set leads to the solution, and for saving information indicating what kind of feature set leads to the solution in learning result storage means as learning result information; 4) information retrieval means for generating input search keyword pairs from a plurality of input search keywords and extracting and acquiring, from the text data to be searched, text data containing an input search keyword pair; 5) candidate extraction means for generating, from each piece of text data acquired by the search, pairs composed of the input search keywords and treating the generated pairs as binary relation candidates; 6) feature extraction means for extracting the predetermined information as features for each binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; 7) solution estimation means for estimating, on the basis of the learning result information stored in the learning result storage means, the degree to which the feature set of a binary relation candidate is likely to yield the solution; and 8) search result extraction means for selecting, as the estimation result, a binary relation candidate as a binary relation to be extracted when its degree of likelihood of being the solution is better than a predetermined level, and for extracting the text data containing the selected binary relation as a search result.
[0010] In the present invention, teacher data including cases in which binary relations whose elements are search keywords are annotated with solution information indicating that they are binary relations to be extracted is stored in the teacher data storage means in advance. The solution-feature pair extraction means then takes the cases out of the teacher data storage means and, for each case, extracts predetermined information as features and generates a pair of the extracted feature set and the solution. Further, the machine learning means performs machine learning processing, on the basis of a predetermined machine learning algorithm, of what kind of feature set leads to what solution, and saves information indicating "what kind of feature set leads to what solution" in the learning result storage means as learning result information.
[0011] Thereafter, when the information retrieval means generates input search keyword pairs from a plurality of input search keywords and extracts and acquires, from the text data to be searched, the text data containing an input search keyword pair, the candidate extraction means generates, from each piece of text data acquired by the search, pairs composed of the input search keywords and treats the generated pairs as binary relation candidates. The feature extraction means then extracts the predetermined information as features for each binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means. Further, when the solution estimation means estimates, on the basis of the learning result information stored in the learning result storage means, the degree to which the feature set of each binary relation candidate is likely to yield the solution, the search result extraction means selects, from the estimation results, each binary relation candidate whose degree of likelihood of being the solution is better than a predetermined level as a binary relation to be extracted, and extracts the text data containing the selected binary relations as search results.
[0012] The present invention is also a binary relation extraction processing method and an information retrieval processing method using the binary relation extraction processing method, realized by the binary relation extraction device and the information retrieval device, respectively.
[0013] The present invention is also a binary relation extraction processing program for causing a computer to execute each processing step performed in the binary relation extraction processing method, and an information retrieval processing program, using the binary relation extraction processing method, for causing a computer to execute each processing step performed in the information retrieval processing method.
Effects of the Invention
[0014] According to the present invention, by performing machine learning with text data that has been manually annotated with tags indicating whether each binary relation is one to be extracted as training data, it becomes possible, when a new binary relation candidate is given, to judge whether that candidate is a binary relation to be extracted. For example, by using as training data "pairs of names of interacting proteins" annotated with tags indicating whether they are binary relations to be extracted, the desired information on "pairs of names of interacting proteins" can be obtained from a text database or the like.
[0015] Further, by using as training data "pairs of search keywords" in which the two search keywords of an AND search in information retrieval processing are annotated with tags indicating whether they stand in a meaningful relation in the retrieved documents, meaningful search results can be extracted from the text data to be searched.
[0016] Since the present invention can be applied to any problem of extracting binary relations from text data, it is extremely versatile.
Brief Description of the Drawings
[0017] [Fig. 1] A diagram showing a configuration example of the binary relation extraction device according to the present invention.
[Fig. 2] A diagram showing the processing flow of the binary relation extraction device.
[Fig. 3] A diagram showing an example of teacher data.
[Fig. 4] A diagram showing the concept of margin maximization in the support vector machine method.
[Fig. 5] A diagram showing an example of the pairs of the binary relations shown in Fig. 3 and their feature sets.
[Fig. 6] A diagram showing a configuration example of the information retrieval device according to the present invention.
[Fig. 7] A diagram showing the processing flow of the information retrieval device.
[Fig. 8] A diagram showing an example of teacher data and its pairs of binary relations and feature sets.
[Fig. 9] A diagram showing an example of teacher data and its pairs of binary relations and feature sets.
[Fig. 10] A diagram showing an example of teacher data and its pairs of binary relations and feature sets.
Explanation of Reference Numerals
1 Binary relation extraction device
11 Teacher data storage unit
12 Solution-feature pair extraction unit
13 Machine learning unit
14 Learning result storage unit
15 Candidate extraction unit
16 Feature extraction unit
17 Solution estimation unit
18 Binary relation extraction unit
2 Text data
3 Binary relation
4 Information retrieval device
40 Information retrieval unit
41 Teacher data storage unit
42 Solution-feature pair extraction unit
43 Machine learning unit
44 Learning result storage unit
45 Candidate extraction unit
46 Feature extraction unit
47 Solution estimation unit
48 Search result extraction unit
5 Text data for search
6 Search results
Best Mode for Carrying Out the Invention
[0019] An embodiment of the binary relation extraction device 1 of the present invention is described below.
[0020] The binary relation extraction device 1 is a processing device that uses teacher data, namely text data annotated with tags indicating whether each binary relation is one to be extracted, to machine-learn what kinds of word pairs are binary relations to be extracted, acquires binary relation candidates from given text data 2, and extracts the binary relations 3 to be extracted.
[0021] Fig. 1 shows a configuration example of the binary relation extraction device 1 according to the present invention. The binary relation extraction device 1 comprises a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result storage unit 14, a candidate extraction unit 15, a feature extraction unit 16, a solution estimation unit 17, and a binary relation extraction unit 18.
[0022] The teacher data storage unit 11 is a means for storing text data that serves as the teacher data used in the machine learning processing.
[0023] As the teacher data, cases are used in which the problem is a pair of binary relation elements appearing in a sentence of the text data (one element is called the first element, the other the second element), and the solution is information on whether the pair is a binary relation to be extracted. Specifically, only for sentences containing two or more binary relation elements in one sentence of the text data, each pair of binary relation elements in the sentence is manually annotated with a tag indicating one of the two solutions: a pair to be extracted (positive example) or a pair that should not be extracted (negative example). When a sentence contains three or more binary relation elements, a tag is assigned to every pair among all combinations of the elements. As teacher data cases, binary relations annotated only with the solution indicating pairs to be extracted (positive examples) may also be used.
[0024] The solution-feature pair extraction unit 12 is a processing means that extracts pairs of a solution and a feature set from the cases in the text data stored in the teacher data storage unit 11.
[0025] Features are the information used in the machine learning processing. As features, the solution-feature pair extraction unit 12 extracts, for example, the elements of the binary relation; the words or characters appearing around the elements, together with their positions and order; part-of-speech information, morphological analysis information, and syntactic analysis information of the elements and surrounding words; the distance between the elements; and the presence or absence of other binary relation elements between the elements. [0026] The machine learning unit 13 is a processing means that learns, by a supervised machine learning method and from the pairs of solution and feature set extracted by the solution-feature pair extraction unit 12, what kind of feature set is likely to lead to what solution. The learning result is stored in the learning result storage unit 14.
[0027] The feature extraction unit 16 is a processing means that extracts predetermined features for the binary relation candidates extracted from the text data 2.
[0028] The solution estimation unit 17 is a processing means that refers to the learning results in the learning result storage unit 14 and estimates, for each binary relation candidate, the degree to which its feature set is likely to lead to each solution (classification).
[0029] The binary relation extraction unit 18 is a processing means that, on the basis of the estimation results of the solution estimation unit 17, outputs as binary relations 3 those candidates estimated with a high degree to have the solution indicating a binary relation to be extracted.
[0030] Fig. 2 shows the processing flow of the binary relation extraction device 1.
[0031] In the teacher data storage unit 11 of the binary relation extraction device 1, text data 2 is stored in advance as teacher data; it includes cases in which binary relations, that is, pairs of elements having a certain meaning, are annotated with "solution" information of one of two kinds: being a binary relation to be extracted (positive) or being a binary relation that should not be extracted (negative).
[0032] Alternatively, text data 2 including cases in which a predetermined solution is assigned only to the pairs to be extracted may be stored. In this case, the pairs in the text data 2 to which a solution is assigned are regarded as having been given the (positive) solution of being binary relations to be extracted, and the remaining pairs, to which no solution is assigned, are treated as having been given the (negative) solution of being binary relations that should not be extracted.
[0033] First, the solution-feature pair extraction unit 12 extracts predetermined features for each case from the teacher data in the teacher data storage unit 11, and generates pairs of the solution (the information assigned by the tag) and the set of extracted features (step S1). The solution-feature pair extraction unit 12 extracts the binary relations from the text data serving as teacher data by means of the predetermined tags, and extracts the predetermined features for the extracted binary relation elements by performing morphological analysis processing, syntactic analysis processing, calculation of element positions and of the distance between elements, and so on. [0034] Then, the machine learning unit 13 learns, by a machine learning method and from the pairs of solution and feature set generated by the solution-feature pair extraction unit 12, what kind of feature set is likely to lead to what solution (positive or negative), and stores the learning result in the learning result storage unit 14 (step S2). As the supervised machine learning method, the machine learning unit 13 performs the machine learning processing using, for example, any of the k-nearest neighbor method, the Simple Bayes method, the decision list method, the maximum entropy method, and the support vector machine method.
[0035] Thereafter, the candidate extraction unit 15 receives the text data 2 from which binary relations are to be extracted, and extracts binary relation candidates from the input text data 2 (step S3). The candidate extraction unit 15 divides the text data into sentences, treats as processing targets only those sentences in which two or more binary relation elements appear in one sentence, and extracts binary relation candidates from those sentences.
[0036] The feature extraction unit 16 extracts features for each binary relation candidate extracted from the text data 2, by substantially the same processing as that of the solution-feature pair extraction unit 12 (step S4).
[0037] For each candidate, the solution estimation unit 17 estimates, on the basis of the learning results in the learning result storage unit 14, what solution its feature set is likely to lead to, that is, the degree to which it is "likely to be positive" or "likely to be negative" (step S5). The binary relation extraction unit 18 then outputs, from among the candidates estimated to be "likely to be positive" with a better degree, those of a predetermined level as the binary relations 3 to be extracted (step S6).
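The training phase (steps S1 and S2) and the extraction phase (steps S3 to S6) above can be sketched as follows. This is a minimal illustration only: the position-prefixed feature encoding, the toy feature-weighting learner, and the zero threshold are stand-ins chosen here for brevity, not the embodiment's actual feature set or machine learning method (k-nearest neighbor, Simple Bayes, SVM, and so on).

```python
# Sketch of the flow in Fig. 2. The learner below simply weights each
# feature by how often it co-occurs with positive vs. negative cases;
# any supervised method can be substituted for learn()/estimate().

def extract_features(tokens, i, j):
    """Steps S1/S4: feature set for an element pair at token positions i < j."""
    return ({f"before:{w}" for w in tokens[max(0, i - 3):i]}
            | {f"between:{w}" for w in tokens[i + 1:j]}
            | {f"after:{w}" for w in tokens[j + 1:j + 4]})

def learn(examples):
    """Step S2: learn feature weights from (feature set, solution) pairs."""
    weights = {}
    for features, solution in examples:
        delta = 1 if solution == "positive" else -1
        for f in features:
            weights[f] = weights.get(f, 0) + delta
    return weights

def estimate(weights, features):
    """Step S5: degree to which this feature set is likely to be positive."""
    return sum(weights.get(f, 0) for f in features)

def extract_relations(weights, candidates, threshold=0):
    """Step S6: keep the candidate pairs scoring above the threshold."""
    return [pair for pair, feats in candidates
            if estimate(weights, feats) > threshold]
```

Training on one positive and one negative case and then scoring two new candidate pairs keeps only the candidate whose surrounding context resembles the positive case.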
[0038] Next, a specific example of the binary relation extraction processing of the present invention is described. In this example, the binary relation extraction device 1 extracts binary relations of interacting protein expressions (protein names) from a text database of biomedical papers, and it is assumed that the protein expressions in the text database are identified with 100% accuracy.
[0039] It is also assumed that the elements constituting a binary relation appear in the same sentence. Note that the elements constituting a binary relation may instead be elements appearing in the same paragraph or in the same document.
[0040] In the processing of creating the teacher data, when specific expressions that become binary relation elements, for example protein expressions, or disease names and treatment methods, are to be taken out as binary relation elements, this is done as follows.
[0041] 1) Extracting elements using rules. A pattern such as "NF-Kappa [A-Z], where [A-Z] is any single letter from A to Z" is defined manually, and the matching expressions are extracted. With this pattern, elements that are protein name expressions such as NF-Kappa A and NF-Kappa B are extracted.
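A pattern such as the "NF-Kappa [A-Z]" rule above maps directly onto an ordinary regular expression. The embodiment does not fix a concrete pattern syntax, so the rendering below is one possible sketch:

```python
import re

# One rendering of the manually defined rule "NF-Kappa [A-Z]":
# the literal string "NF-Kappa" followed by a single capital letter.
PROTEIN_RULE = re.compile(r"NF-Kappa [A-Z]\b")

def extract_by_rule(sentence):
    """Rule-based element extraction (method 1)."""
    return PROTEIN_RULE.findall(sentence)

print(extract_by_rule("NF-Kappa A binds NF-Kappa B in vitro."))
# ['NF-Kappa A', 'NF-Kappa B']
```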
[0042] 2) Extracting elements using a dictionary. Using dictionaries in which expressions such as disease names and treatment methods are recorded, character strings that exactly match the expressions (character strings, word strings, and so on) in those dictionaries are extracted as elements, that is, as expressions of disease names or treatment methods.
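Exact-match dictionary lookup as in method 2 can be sketched as follows; the dictionary entries are hypothetical examples, not taken from the embodiment:

```python
def extract_by_dictionary(sentence, dictionary):
    """Dictionary-based element extraction (method 2): return every
    dictionary expression that occurs verbatim in the sentence."""
    return [term for term in dictionary if term in sentence]

# Hypothetical disease-name dictionary for illustration.
diseases = ["diabetes", "hypertension", "asthma"]
print(extract_by_dictionary("Treatment of diabetes with insulin.", diseases))
# ['diabetes']
```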
[0043] 3) Extracting elements by machine learning processing. Text data in which start position tags and end position tags are attached before and after expressions such as protein expressions, or disease names and treatment methods, is prepared as training data. Machine learning processing is performed using this tagged training data, and by using the learning result, elements in new, untagged text data are identified by inserting tags at the start and end positions of the corresponding expressions.
[0044] 4) Extracting elements using information indicating predetermined binary relations. Using data in which expressions that can become binary relation elements are tagged in advance, the expressions that are binary relation elements are extracted on the basis of those tags.
[0045] Fig. 3 shows an example of teacher data. English text data containing binary relations whose elements are interacting protein expressions, as shown in Fig. 3(A), is used as teacher data. In this example, tags indicating the solution (positive) are assigned in the teacher data only to the binary relations to be extracted. That is, teacher data containing only positive cases is used in the machine learning processing.
[0046] Fig. 3(B) shows an example of the tags assigned to the teacher data. The teacher data contains two binary relation pairs, P1 and P2. The binary relation (pair) P1 consists of the first element p1 "delta-catenin" and the second element p2 "presenilin 1". The binary relation (pair) P2 consists of the first element p1 "presenilin (PS) 1" and the second element p2 "delta-catenin".
[0047] The solution-feature pair extraction unit 12 extracts pairs of a solution and a feature set from the cases in the text data stored in the teacher data storage unit 11. For example, the following information is extracted as features.
[0048] 1) Words or characters appearing around the binary relation elements. For example, a predetermined number of words/characters before the first element of the binary relation, a predetermined number of words/characters after the second element, and a predetermined number of words/characters between the first and second elements;
2) the positions, order, and so on of the words/characters appearing around the binary relation elements;
3) the two elements of the binary relation;
4) part-of-speech information, morphological analysis information, and the like of the binary relation elements or surrounding words;
5) syntactic analysis information of the binary relation elements or surrounding words;
6) the distance between the occurrences of the first and second elements of the binary relation;
7) the presence or absence of element occurrences between the first and second elements of the binary relation.
Among these features, the part-of-speech information, for example, is obtained using an existing morphological analysis processing method such as the morphological analysis system "ChaSen" (see: http://chasen.aist-nara.ac.jp/index.html.ja). For English text data, the part-of-speech information is obtained using, for example, "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" (Eric Brill, Computational Linguistics, Vol. 21, No. 4, pp. 543-565, 1995).
[0049] Here, when the binary relation elements appear in the same paragraph, information on whether the binary relation elements span sentence boundaries may be used as a feature. When the binary relation elements appear in the same document, information on whether the binary relation elements span sentence boundaries and whether they span paragraph boundaries may be used as features.
[0050] The solution-feature pair extraction unit 12 extracts features from the teacher data cases tagged as shown in Fig. 3(B), and generates pairs of a feature set and a solution. For example, suppose that for the case of binary relation P2, a pair of the solution (positive) and the following feature set is generated, as shown in Fig. 5.
"'for', 'interaction', and 'with' appear within the three words before the first element;
'and', 'cloned', 'the', 'full', '-', 'length', 'cDNA', 'of', and 'human' appear between the elements;
'which', 'encoded', and '1225' appear within the three words after the second element."
[0051] On the basis of such pairs of solution and feature set, the machine learning unit 13 performs machine learning processing of what kind of feature set is likely to lead to the solution (positive), and stores the learning result in the learning result storage unit 14.
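Written out as data, the solution-feature pair generated for P2 might look as follows. The position-prefix encoding ("before:", "between:", "after:") is an assumption made here for concreteness; the embodiment does not prescribe a specific feature representation.

```python
def to_feature_set(before, between, after):
    """Encode windowed context words as position-prefixed features."""
    return ({f"before:{w}" for w in before}
            | {f"between:{w}" for w in between}
            | {f"after:{w}" for w in after})

# The pair of Fig. 5: the solution plus the extracted feature set.
p2_pair = ("positive",
           to_feature_set(
               before=["for", "interaction", "with"],
               between=["and", "cloned", "the", "full", "-",
                        "length", "cDNA", "of", "human"],
               after=["which", "encoded", "1225"]))
```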
[0052] As the supervised machine learning method, the machine learning unit 13 uses, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
[0053] The k-nearest neighbor method, instead of using the single most similar case, uses the k most similar cases and determines the classification (solution) by majority vote among those k cases. Here k is a predetermined integer; generally an odd number between 1 and 9 is used. The simple Bayes method estimates the probability of each classification based on Bayes' theorem, and takes the classification with the largest probability value as the classification to be obtained.
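The k-nearest neighbor classification described in paragraph [0053] can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes training cases are given as (feature set, label) pairs, and uses feature overlap as the similarity measure, as the specification later describes for paragraph [0088].

```python
from collections import Counter

def knn_classify(candidate_features, training_examples, k=3):
    """Classify by majority vote over the k most similar training cases.

    Similarity between two cases is the fraction of overlapping features.
    `training_examples` is a list of (feature_set, label) pairs.
    """
    def similarity(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    # Rank training cases by similarity to the candidate, most similar first.
    ranked = sorted(training_examples,
                    key=lambda ex: similarity(candidate_features, ex[0]),
                    reverse=True)
    # Majority vote among the k nearest cases decides the solution.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 3 and two of the three nearest cases labeled positive, the candidate is classified positive.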
[0054] In the simple Bayes method, the probability of outputting classification a in context b is given by the following equations (1) and (2).

[0055] [Equation 1]

p(a|b) = p(a) · p(b|a) / p(b)    (1)

p(a|b) ≈ p̃(a) · Π_{f_j ∈ F} p̃(f_j|a)    (2)
[0056] Here, the context b is the set of predetermined features f_j (f_j ∈ F, 1 ≤ j ≤ k). p(b) is the occurrence probability of the context b; since it does not depend on the classification a and is a constant, it is not computed. p̃(a) (where p̃ denotes p with a tilde above it) and p̃(f_j|a) are probabilities estimated from the teacher data, meaning the occurrence probability of classification a and the probability of having feature f_j given classification a, respectively. If the value obtained by maximum likelihood estimation is used for p̃(f_j|a), the value is often zero, in which case the value of equation (2) becomes zero and it may be difficult to determine the classification. For this reason, smoothing is performed; here, smoothing using the following equation (3) is used.
[0057] [Equation 2]

p̃(f_j|a) = ( freq(f_j, a) + 0.01 · p̃(f_j) ) / ( freq(a) + 0.01 )    (3)

where p̃(f_j) denotes the occurrence probability of the feature f_j estimated from the teacher data.
[0058] Here, freq(f_j, a) denotes the number of cases that have the feature f_j and whose classification is a, and freq(a) denotes the number of cases whose classification is a.
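The simple Bayes classification of equations (1) and (2) can be sketched as follows. This is a minimal illustration: the smoothing of equation (3) is stood in for by simple additive smoothing with the constant 0.01 from the text, and the (feature_set, label) data structure is an assumption.

```python
import math
from collections import Counter

def train_naive_bayes(examples):
    """Count class and (feature, class) frequencies from (feature_set, label) pairs."""
    class_count = Counter(label for _, label in examples)
    feat_count = Counter()  # maps (feature, label) -> freq(f_j, a)
    for feats, label in examples:
        for f in feats:
            feat_count[(f, label)] += 1
    return class_count, feat_count, len(examples)

def classify(feats, class_count, feat_count, n):
    """Pick argmax_a p(a) * prod_f p(f|a), in log space for stability.

    Additive smoothing with 0.01 keeps unseen features from zeroing
    out the product, in the spirit of equation (3).
    """
    best, best_lp = None, -math.inf
    for a, freq_a in class_count.items():
        lp = math.log(freq_a / n)
        for f in feats:
            lp += math.log((feat_count[(f, a)] + 0.01) / (freq_a + 0.01))
        if lp > best_lp:
            best, best_lp = a, lp
    return best
```

Because p(b) is constant across classifications, it is simply omitted, exactly as noted in paragraph [0056].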
[0059] The decision list method treats pairs of a feature and a classification as rules, stores them in a list in a predetermined priority order, and, when an input to be detected is given, compares the input data with the features of the rules starting from the highest-priority entry of the list; the classification of the first rule whose feature matches is taken as the classification of the input.
[0060] In the decision list method, the probability value of each classification is obtained using only one of the predetermined features f_j (f_j ∈ F, 1 ≤ j ≤ k) as the context. The probability of outputting classification a in a context b is given by the following equation:
[0061] p(a|b) = p(a|f_max)    (4)

where f_max is given by the following equation:

[0062] [Equation 3]

f_max = argmax_{f_j ∈ F} max_{a} p̃(a|f_j)    (5)
[0063] Here, p̃(a|f_j) (where p̃ denotes p with a tilde above it) is the proportion of occurrences of classification a when the feature f_j is present in the context.
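The decision list lookup of paragraphs [0059] to [0063] can be sketched as follows; a minimal illustration in which each rule is an assumed (feature, label, score) triple, with the score playing the role of p̃(a|f_j) for ordering the list.

```python
def build_decision_list(rules):
    """Sort (feature, label, score) rules so the highest-scoring rule,
    i.e. the one with the largest p(a|f), is tried first."""
    return sorted(rules, key=lambda r: r[2], reverse=True)

def apply_decision_list(dlist, feats, default="negative"):
    """Return the label of the first (highest-priority) rule whose
    feature appears in the candidate's feature set; fall through to a
    default rule if nothing matches."""
    for feature, label, _ in dlist:
        if feature in feats:
            return label
    return default
```

The priority order is fixed at training time, so classification at run time is a single scan down the list.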
[0064] The maximum entropy method is a method that, where F is the set of predetermined features f_j (1 ≤ j ≤ k), obtains the probability distribution p(a, b) that maximizes the entropy expressed by equation (7) while satisfying equation (6) below, and takes, among the probabilities of the classifications determined according to that distribution, the classification with the largest probability value as the classification to be obtained.
[0065] [Equation 4]

Σ_{a∈A, b∈B} p(a, b) · g_j(a, b) = Σ_{a∈A, b∈B} p̃(a, b) · g_j(a, b)    (for all j, 1 ≤ j ≤ k)    (6)

H(p) = − Σ_{a∈A, b∈B} p(a, b) · log( p(a, b) )    (7)
[0066] Here, A and B denote the sets of classifications and contexts, respectively, and g_j(a, b) is a function that takes the value 1 when the context b contains the feature f_j and the classification is a, and takes the value 0 otherwise. p̃(a, b) (where p̃ denotes p with a tilde above it) denotes the proportion of occurrences of the pair (a, b) in the known data.
[0067] Equation (6) obtains the expected value of the frequency of each pair of an output and a feature by multiplying the probability p by the function g_j, which indicates the appearance of that pair. Under the constraint that the expected value in the known data on the right-hand side equals the expected value computed from the probability distribution being sought on the left-hand side, entropy maximization (smoothing of the probability distribution) is performed to obtain the probability distribution of outputs and contexts. For details of the maximum entropy method, see References 1 and 2 below.
Reference 1: Eric Sven Ristad, Maximum Entropy Modeling for Natural Language (ACL/EACL Tutorial Program, Madrid, 1997);
Reference 2: Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta (http://www.mnemonic.com/software/memt, 1998)
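The interplay of constraint (6) and entropy (7) can be illustrated with a deliberately tiny example: three outcomes, a single empirical constraint fixing the mass on the first outcome, and a brute-force grid search (real maximum entropy trainers use iterative scaling or gradient methods instead). Everything here is an assumption made for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum p log p, cf. equation (7)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def maxent_three_outcomes(p0, step=0.01):
    """Among distributions (p0, p1, p2) over three outcomes whose mass
    on outcome 0 is fixed to the empirical value p0 (one constraint of
    the form of equation (6)), find the maximum-entropy one by grid
    search over p1."""
    best, best_h = None, -1.0
    steps = int(round((1.0 - p0) / step))
    for i in range(steps + 1):
        p1 = i * step
        p2 = 1.0 - p0 - p1
        if p2 < -1e-9:
            continue
        cand = (p0, p1, max(p2, 0.0))
        h = entropy(cand)
        if h > best_h:
            best, best_h = cand, h
    return best
```

With p0 = 0.5, the result is approximately (0.5, 0.25, 0.25): the unconstrained remainder of the mass is spread evenly, which is the "smoothing of the probability distribution" the text describes.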
The support vector machine method is a method for classifying data into two classes by dividing the space with a hyperplane.
[0068] Fig. 4 shows the concept of margin maximization in the support vector machine method. In Fig. 4, white circles denote positive examples, black circles denote negative examples, the solid line denotes the hyperplane dividing the space, and the broken lines denote the planes forming the boundaries of the margin region. Fig. 4(A) is a conceptual diagram of the case where the interval between positive and negative examples is narrow (small margin), and Fig. 4(B) is a conceptual diagram of the case where the interval between positive and negative examples is wide (large margin).
[0069] Assuming that the two classes consist of positive and negative examples, the larger the interval (margin) between the positive and negative examples in the training data, the lower the probability of misclassifying open data is considered to be; therefore, as shown in Fig. 4(B), the hyperplane that maximizes this margin is obtained and used for classification.
[0070] The method is basically as described above, but in practice extended versions are used: an extension that allows a small number of cases to fall inside the margin region of the training data, and an extension that makes the linear part of the hyperplane nonlinear (introduction of a kernel function).
[0071] This extended method is equivalent to classifying with the following discriminant function, and the two classes can be discriminated according to whether the output value of the discriminant function is positive or negative.
[0072] [Equation 5]

f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )    (8)

b = − ( max_{i: y_i = −1} b_i + min_{i: y_i = 1} b_i ) / 2

b_i = Σ_{j=1}^{l} α_j y_j K(x_j, x_i)

[0073] Here, x is the context (feature set) of the case to be classified, and x_i and y_i (i = 1, ..., l, y_i ∈ {1, −1}) denote the contexts and classifications of the training data. The function sgn is

sgn(x) = 1 (x ≥ 0)
       = −1 (otherwise)

and each α_i is the value that maximizes equation (9) under the constraints of equations (10) and (11).
[0074] [Equation 6]

L(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} α_i α_j y_i y_j K(x_i, x_j)    (9)

0 ≤ α_i ≤ C    (i = 1, ..., l)    (10)

Σ_{i=1}^{l} α_i y_i = 0    (11)
[0075] The function K is called a kernel function, and various functions can be used; in this embodiment, the following polynomial kernel is used.
[0076] K(x, y) = (x · y + 1)^d    (12)
C and d are constants set experimentally. In the specific example described later, C was fixed to 1 throughout all processing, and two values of d, 1 and 2, were tried. The cases x_i with α_i > 0 are called support vectors, and the summation in equation (8) is usually computed using only these cases. That is, only the cases in the training data called support vectors are used in the actual analysis.
[0077] For details of the extended support vector machine method, see References 3 and 4 below.
Reference 3: Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (Cambridge University Press, 2000);
Reference 4: Taku Kudoh, TinySVM: Support Vector Machines (http://cl.aist-nara.ac.jp/taku-ku/software/TinySVM/index.html, 2000)
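Evaluating the discriminant function of equation (8) with the polynomial kernel of equation (12) can be sketched as follows. This assumes the α_i, y_i, support vectors, and b have already been obtained by training (e.g. with a package such as TinySVM); only the classification step is shown, with feature vectors represented as dicts.

```python
def svm_decision(x, support_vectors, alphas, ys, b, d=1):
    """Evaluate f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ),
    with the polynomial kernel K(x, y) = (x . y + 1)^d of eq. (12).
    Feature vectors are dicts mapping feature name -> value."""
    def dot(u, v):
        return sum(u[k] * v.get(k, 0.0) for k in u)

    def kernel(u, v):
        return (dot(u, v) + 1.0) ** d

    # Only the support vectors (alpha_i > 0) contribute to the sum.
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, ys)) + b
    return 1 if score >= 0 else -1
```

The signed score before taking sgn is the distance-like quantity that the solution estimation unit 17 later uses as the degree of likelihood of a solution.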
The support vector machine method handles data with two classifications. Therefore, when handling cases with three or more classifications, it is usually combined with a technique such as the pairwise method or the one-versus-rest method.
[0078] In the pairwise method, for data with n classifications, all pairs of two different classifications (n(n−1)/2 pairs) are generated; for each pair, the better classification is determined by a binary classifier, that is, a support vector machine processing module, and finally the classification is determined by majority vote over the classifications produced by the n(n−1)/2 binary classifications.
[0079] In the one-versus-rest method, when there are three classifications a, b, and c, for example, three sets are generated: classification a versus the rest, classification b versus the rest, and classification c versus the rest, and each set is trained with the support vector machine method. In the estimation processing based on the learning results, the learning results of the three support vector machines are used. By examining how the binary relation candidate to be estimated is classified by the three support vector machines, the classification that is on the non-rest side and for which the candidate is farthest from the separating hyperplane of the support vector machine is taken as the solution. For example, if a candidate is farthest from the separating hyperplane of the support vector machine created by the learning process for the set "classification a versus the rest", the classification of that candidate is estimated to be a.
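The two multiclass schemes of paragraphs [0078] and [0079] can be sketched as follows. This is a minimal illustration: each underlying binary classifier is assumed to be supplied as a callable, standing in for a trained support vector machine.

```python
from collections import Counter

def pairwise_vote(candidate, pair_classifiers):
    """Pairwise method: `pair_classifiers[(a, b)]` is a binary
    classifier returning the winner between classifications a and b;
    the final classification is the majority vote over all
    n(n-1)/2 pair decisions."""
    votes = Counter(clf(candidate) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]

def one_vs_rest(candidate, margin_by_class):
    """One-versus-rest method: `margin_by_class[c]` returns the signed
    distance of the candidate from the separating hyperplane of the
    classifier 'c versus the rest'; the classification whose hyperplane
    the candidate is farthest from on the non-rest side is chosen."""
    return max(margin_by_class, key=lambda c: margin_by_class[c](candidate))
```

Both reduce an n-class decision to binary support vector machine decisions, differing only in how the binary outputs are combined.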
[0080] Thereafter, the candidate extraction unit 15 extracts binary relation candidates from the newly input text data 2. Specifically, the text data 2 is divided into sentences, and expressions (character strings) that can be elements of a binary relation are extracted from each sentence. Then, whether two or more such expressions exist in one sentence is checked, and every combination (pair) of two binary relation elements in a sentence is generated as a binary relation candidate.
[0081] Alternatively, the new text data 2 may be divided into paragraphs, expressions that can be binary relation elements may be extracted from each paragraph, and, for paragraphs containing two or more elements, every combination (pair) of two elements may be generated as a binary relation candidate. Alternatively, expressions that can be binary relation elements may be extracted from one document of the text data 2, and every combination (pair) of two elements may be generated as a binary relation candidate.
[0082] As the technique for extracting expressions that can be binary relation elements from the text data 2, the techniques described above for generating the teacher data are used: for example, extracting expressions that match patterns or dictionary entries, or extracting expressions estimated on the basis of the learning results of supervised machine learning.
[0083] When two or more elements appear in one sentence of the text data 2, each pair of those elements is taken as a binary relation candidate. When three or more elements appear in one sentence, every pairwise combination of the elements is taken as a binary relation candidate.
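The candidate generation of paragraphs [0080] to [0083] can be sketched as follows; a minimal illustration in which `recognize_elements` is an assumed stand-in for the pattern, dictionary, or machine-learning based element recognizer described above.

```python
from itertools import combinations

def extract_candidates(sentences, recognize_elements):
    """For each sentence, find element mentions and emit every unordered
    pair as a binary relation candidate.

    `recognize_elements` is any function returning the element strings
    found in one sentence (e.g. a dictionary or pattern matcher).
    """
    candidates = []
    for sentence in sentences:
        elements = recognize_elements(sentence)
        if len(elements) >= 2:
            # All pairwise combinations of elements in the sentence.
            candidates.extend(combinations(elements, 2))
    return candidates
```

The same loop applies unchanged at the paragraph or document level; only the unit of text passed in differs.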
[0084] Then, the feature extraction unit 16 extracts from each binary relation candidate the same features as the solution-feature pair extraction unit 12, by the same processing.
[0085] Based on the learning results stored in the learning result storage unit 14, the solution estimation unit 17 estimates, for each binary relation candidate, how likely the candidate's feature set is to yield the positive solution. Based on the estimation results of the solution estimation unit 17, the binary relation extraction unit 18 outputs, from among the binary relation candidates, those with a high degree of likelihood of being a positive solution as extracted binary relations.
[0086] In this example, the above features were extracted, and the support vector machine method was used as the machine learning processing. When the accuracy was examined using 10-fold cross-validation, an accuracy of F-measure = 47.5% was obtained. The F-measure is the harmonic mean of the recall and the precision. The recall is the proportion of the binary relations that should be extracted from the text data 2 that were actually output. The precision is the proportion of the binary relations extracted by the binary relation extraction device 1 that were binary relations that should be extracted.
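The evaluation measures of paragraph [0086] can be computed as follows; a minimal sketch in which extracted and gold-standard relations are assumed to be comparable tuples.

```python
def f_measure(extracted, gold):
    """Harmonic mean of precision (correct / extracted) and recall
    (correct / gold), as defined in the text."""
    correct = len(set(extracted) & set(gold))
    if not extracted or not gold or not correct:
        return 0.0
    precision = correct / len(set(extracted))
    recall = correct / len(set(gold))
    return 2 * precision * recall / (precision + recall)
```

For instance, if one of two extracted relations is correct and the gold standard holds two relations, both precision and recall are 0.5, so the F-measure is 0.5.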
[0087] In the binary relation extraction device 1, the machine learning unit 13, based on a predetermined machine learning algorithm and using the given teacher data, performs machine learning on the pairs of each binary relation's solution and feature set to determine what kind of solution results from what kind of feature set, and stores information indicating this in the learning result storage unit 14 as learning result information; the solution estimation unit 17 then estimates, based on this learning result information, the degree of likelihood of the solution for the feature set of each binary relation candidate.
[0088] When the k-nearest neighbor method is used as the machine learning method in the binary relation extraction device 1, the machine learning unit 13 defines the similarity between cases of the teacher data based on the proportion of overlapping features among the feature sets extracted from those cases (the proportion of features they have in common), and stores the defined similarity and the cases in the learning result storage unit 14 as learning result information.
[0089] Then, when new text data 2 is input, the solution estimation unit 17 refers to the similarity definition and the cases in the learning result storage unit 14, selects, for each binary relation candidate extracted from the text data 2, the k cases most similar to the candidate from the cases in the learning result storage unit 14, and estimates the classification determined by majority vote among the selected k cases as the classification (solution) of the binary relation candidate. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the number of majority votes among the selected k cases, here the number of votes obtained by the classification "should be extracted". When the simple Bayes method is used as the machine learning method, the machine learning unit 13 stores, for each case of the teacher data, the pair of the case's solution and feature set in the learning result storage unit 14 as learning result information. Then, when new text data 2 is input, the solution estimation unit 17, based on the pairs of solutions and feature sets in the learning result information of the learning result storage unit 14, calculates, using Bayes' theorem, the probability of each classification for the feature set of the binary relation candidate obtained by the feature extraction unit 16, and estimates the classification with the largest probability value as the classification (solution) of the candidate's features. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the probability of each classification, here the probability of the classification "should be extracted".
[0090] When the decision list method is used as the machine learning method, the machine learning unit 13 stores, for the cases of the teacher data, a list in which rules pairing features with classifications are arranged in a predetermined priority order in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 compares the features of each binary relation candidate extracted from the text data 2 with the features of the rules in descending order of priority in the list of the learning result storage unit 14, and estimates the classification of the first rule whose feature matches as the classification (solution) of the candidate. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the predetermined priority order, or a numerical value or measure corresponding to it, here the priority in the list of the rules yielding the classification "should be extracted".
[0091] When the maximum entropy method is used as the machine learning method, the machine learning unit 13 identifies the classifications that can be solutions from the cases of the teacher data, obtains the probability distribution over pairs of feature sets and possible classifications that maximizes the expression representing the entropy while satisfying the predetermined conditional expression, and stores it in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 uses the probability distribution in the learning result storage unit 14 to obtain, for the feature set of each binary relation candidate extracted from the text data 2, the probability of each possible classification, identifies the possible classification with the largest probability value, and estimates the identified classification as the candidate's solution. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the probability of each classification, here the probability of the classification "should be extracted".
[0092] When the support vector machine method is used as the machine learning method, the machine learning unit 13 identifies the classifications that can be solutions from the cases of the teacher data, divides the classifications into positive and negative examples, and, in the space whose dimensions are the features of the cases, obtains, according to a predetermined execution function using a kernel function, the hyperplane that maximizes the interval between the positive and negative examples and divides the positive examples from the negative examples, and stores it in the learning result storage unit 14. Then, when new text data 2 is input, the solution estimation unit 17 uses the hyperplane in the learning result storage unit 14 to determine on which side, positive or negative, the feature set of each binary relation candidate extracted from the text data 2 falls in the space divided by the hyperplane, and estimates the classification determined by that result as the candidate's solution. That is, the solution estimation unit 17 takes as the degree of likelihood of a solution for the candidate's feature set the distance from the separating hyperplane into the space of the positive examples (binary relations that should be extracted). More specifically, when binary relations that should be extracted are positive examples and binary relations that should not be extracted are negative examples, a case located in the space on the positive-example side of the separating hyperplane is judged to be a "case that should be extracted", and the distance of the case from the separating hyperplane is taken as the degree of that case.
[0093] The solution-feature pair extraction unit 12 may also use, for example, "the words of the two elements themselves" as features, or may use as features "the first and second words/character strings from the front of each element and the first and second words/character strings from the rear". In the case of Fig. 3(A), the features are:
"the first element is "presenilin (PS) 1";
the second element is "delta-catenin";
the first word of the first element is "presenilin";
its second word is "(PS)";
the second word from the end of the first element is "(PS)";
the first word from its end is "1";
the first word of the second element is "delta";
its second word is "-";
the second word from the end of the second element is "-";
the first word from its end is "catenin"".
[0094] Alternatively:

"the first character of the first element is "p";
its first two characters are "pr";
its first three characters are "pre";
its last character is "1";
its last two characters are "space, 1";
its last three characters are "), space, 1";
the first character of the second element is "d";
its first two characters are "de";
its first three characters are "del";
its last character is "n";
its last two characters are "in";
its last three characters are "nin"".
[0095] When the two words before and after each element and their part-of-speech information are used as features, the features are:

"the word two before the first element is "interaction";
its part of speech is "noun";
the word one before is "with";
its part of speech is "preposition";
the word one after is "and";
its part of speech is "conjunction";
the word two after is "cloned";
its part of speech is "verb";
the word two before the second element is "of";
its part of speech is "preposition";
the word one before is "human";
its part of speech is "noun";
the word one after is "which";
its part of speech is "pronoun";
the word two after is "encoded";
its part of speech is "verb"".
[0096] When the number of words between the two elements is used as a feature representing the distance between them, the information "the distance between the two elements is 9" becomes a feature.
[0097] When the states are used as features such that 0 to 1 words between the two elements is "small distance", 2 to 4 words is "medium distance", 5 to 9 words is "large distance", and 10 or more words is "extra-large distance", the information "the distance between the two elements is "large distance"" becomes a feature.
[0098] When whether there is another element between the two elements is used as a feature, the information "there is no other element between the two elements" becomes a feature.
[0099] Furthermore, when different types of terms are set as the elements of a binary relation, the order of appearance of the elements may be used as a feature. For example, in the case of a binary relation between a disease name and a treatment method, the information "the first element is a disease name and the second element is a treatment method" or "the first element is a treatment method and the second element is a disease name" becomes a feature.
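The context-word and binned-distance features of paragraphs [0095] to [0097] can be sketched together as follows; a minimal illustration assuming the sentence is already tokenized and each element occupies a single token position.

```python
def extract_features(tokens, i1, i2):
    """Build a feature set for an element pair: the elements themselves,
    the neighboring words, the words between the elements, and a binned
    inter-element distance. `i1 < i2` are token indices of the elements."""
    feats = set()
    feats.add("elem1=" + tokens[i1])
    feats.add("elem2=" + tokens[i2])
    # Up to three words preceding the first element.
    for w in tokens[max(0, i1 - 3):i1]:
        feats.add("before1=" + w)
    # Up to three words following the second element.
    for w in tokens[i2 + 1:i2 + 4]:
        feats.add("after2=" + w)
    # All words between the two elements.
    for w in tokens[i1 + 1:i2]:
        feats.add("between=" + w)
    # Binned distance, using the thresholds given in the text.
    n_between = i2 - i1 - 1
    if n_between <= 1:
        feats.add("dist=small")
    elif n_between <= 4:
        feats.add("dist=medium")
    elif n_between <= 9:
        feats.add("dist=large")
    else:
        feats.add("dist=extra-large")
    return feats
```

Part-of-speech, character n-gram, and element-order features would be added in the same style as further string-valued entries of the set.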
[0100] By being given, as teacher data, examples of various binary relations other than the binary relation between expressions of interacting proteins, such as the binary relation between a disease name and a treatment method, between a disease name and a protein expression, between a disease name and an organ, between a disease name and an animal species, between a disease name and a related chemical substance, or between a protein expression and the experimental methods that have been applied to that protein, the binary relation extraction device 1 can extract these corresponding binary relations from text data 2 of biomedical papers.
[0101] For example, text data containing binary relations such as the following can be used as teacher data.
"Oral corticosteroids (element: treatment method) are the preference of many for the treatment of CIDP (element: disease name), being much less expensive than IVIG (element: treatment method) infusion or TA (element: treatment method)."
"In the CIDP (element: disease name) patient, the IgG antibody (element: protein expression) titer to GD3 (element: chemical substance expression) was remarkably elevated (titer, 1:10,000), indicating maximal avidity to the tetrasaccharide epitope (-NeuAcalpha2-8NeuAcalpha2-3Galbeta1-4Glc-)."
"Ciliated metaplasia (CM) in the stomach (element: organ name) is mainly found in gastric mucosa (element: organ name) that harbours gastric cancer (element: disease name)."
"Variant Creutzfeldt-Jakob disease (CJD) (element: disease name) is a transmissible spongiform encephalopathy believed to be caused by the bovine (element: animal species) spongiform encephalopathy agent, an abnormal isoform of the prion protein (PrP(sc)) (element: protein expression)."
"AIDP (element: disease name) and CIDP (element: disease name) having specific antibodies to the carbohydrate epitope (-NeuAcalpha2-8NeuAcalpha2-3Galbeta1-4Glc-) of gangliosides (element: chemical substance expression)."
"Gene expression in archived frozen sural nerve biopsies of patients with chronic inflammatory demyelinating polyneuropathy (CIDP) (element: disease name) was compared to that in vasculitic nerve biopsies (VAS) and to normal nerve (NN) by DNA microarray technology (element: experimental method)."
"This novel interaction was identified in a yeast two-hybrid screen (element: experimental method) using PrP(C) (element: protein expression) as bait and confirmed by an in vitro binding assay and co-immunoprecipitations."
"Comparative study of the PrP(BSE) (element: protein expression) distribution in brains (element: organ name) from BSE (element: disease name) field cases using rapid tests (element: test method)."
It is also possible, for example, to extract the pair of a company's product name and the reputation of that product (for example, information such as a good or bad reputation) as a binary relation.
[0102] As described above, according to the binary relation extracting device 1 of the present invention, merely by preparing, as teacher data for machine learning processing, text data annotated with an evaluation (solution) of whether each instance is a binary relation to be extracted, it becomes possible to automatically extract from new text data the binary relations estimated to be worth extracting. This avoids the complexity of generating the patterns used in binary relation extraction processing. Furthermore, improvements in the accuracy of supervised machine learning can be expected to improve the performance of the binary relation extraction processing.
[0103] Next, an embodiment of the information retrieval device 4 of the present invention will be described.
[0104] The information retrieval device 4 regards the relation between the two search keywords of an AND search as a potentially meaningful binary relation. Using teacher data in which binary relations whose elements are such search keywords are tagged with one of two solutions, namely that the relation should be extracted (positive) or that it should not be extracted (negative), the device performs machine learning, and outputs as the search result 6 those articles in the search text data 5 to be searched that contain the two search keywords and in which the keyword pair is estimated to be a binary relation to be extracted.
[0105] Fig. 6 shows a configuration example of the information retrieval device 4 according to the present invention. The information retrieval device 4 comprises an information retrieval unit 40, a teacher data storage unit 41, a solution-feature pair extraction unit 42, a machine learning unit 43, a learning result storage unit 44, a candidate extraction unit 45, a feature extraction unit 46, a solution estimation unit 47, and a search result extraction unit 48.
[0106] The teacher data storage unit 41, solution-feature pair extraction unit 42, machine learning unit 43, learning result storage unit 44, candidate extraction unit 45, feature extraction unit 46, and solution estimation unit 47 of the information retrieval device 4 are processing means that perform the same processing as the teacher data storage unit 11, solution-feature pair extraction unit 12, machine learning unit 13, learning result storage unit 14, candidate extraction unit 15, feature extraction unit 16, and solution estimation unit 17 of the binary relation extracting device 1 shown in Fig. 1, respectively.
[0107] The information retrieval unit 40 searches the search text data 5 using the search keywords given to the AND search processing, and acquires the matching articles (text data).
[0108] The candidate extraction unit 45 extracts, as binary relation candidates, pairs of character strings (words) identical to the two search keywords contained in the articles acquired by the information retrieval unit 40.
[0109] Based on the estimation results of the solution estimation unit 47, the search result extraction unit 48 extracts, from the binary relation candidates in the articles retrieved from the search text data 5, those whose estimated likelihood of the positive solution (being a binary relation to be extracted) is better than a predetermined level, and outputs the articles containing the extracted binary relation candidates, or information identifying those articles, as the search result 6.
[0110] Fig. 7 shows the processing flow of the information retrieval device 4. The teacher data storage unit 41 of the information retrieval device 4 stores, as teacher data, text data containing examples in which a binary relation whose elements are the two search keywords given to the AND search processing is annotated with one of two "solutions": that it is a binary relation to be extracted (positive) or that it is not (negative).
[0111] First, the solution-feature pair extraction unit 42 extracts predetermined features for each example from the teacher data in the teacher data storage unit 41, and generates pairs of the solution (the information assigned by the tag) and the set of extracted features (step S11). The solution-feature pair extraction unit 42 extracts the binary relations from the teacher text data by means of the predetermined tags, and extracts the predetermined features for the elements (search keywords) of each extracted binary relation by performing morphological analysis processing, syntactic analysis processing, and computation of element appearance positions and inter-element distances.
[0112] Then, from the pairs of solutions and feature sets generated by the solution-feature pair extraction unit 42, the machine learning unit 43 learns by a machine learning method which solutions (positive or negative) tend to arise for which sets of features, and stores the learning result in the learning result storage unit 44 (step S12). As the supervised machine learning method, the machine learning unit 43 uses, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
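The patent does not give an implementation of this learning step, but one of the methods it names, the simple Bayes (naive Bayes) method, can be sketched over solution-feature pairs as follows. All function names and the toy feature strings are assumptions for illustration.

```python
from collections import defaultdict
import math

def train_naive_bayes(examples):
    """examples: list of (feature_set, solution) pairs,
    solution being "positive" or "negative"."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for feats, sol in examples:
        class_count[sol] += 1
        for f in feats:
            feat_count[sol][f] += 1
            vocab.add(f)
    return class_count, feat_count, vocab

def score(model, feats, sol):
    """Log-probability of solution sol given the feature set,
    with add-one smoothing over the feature vocabulary."""
    class_count, feat_count, vocab = model
    total = sum(class_count.values())
    logp = math.log(class_count[sol] / total)
    denom = sum(feat_count[sol].values()) + len(vocab)
    for f in feats:
        logp += math.log((feat_count[sol][f] + 1) / denom)
    return logp

def classify(model, feats):
    """Return the solution with the higher estimated likelihood."""
    return max(("positive", "negative"), key=lambda s: score(model, feats, s))
```

The per-class log-scores produced by `score` also serve as the "degree of likelihood" of each solution that the later estimation step (step S16) relies on.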
[0113] Thereafter, the candidate extraction unit 45 generates all pairwise combinations of the two input search keywords given to the AND search processing (step S13). The information retrieval unit 40 performs AND search processing on the search text data 5 using the pairs of input search keywords and extracts the articles (text data) containing the input search keyword pairs, and the candidate extraction unit 45 extracts, as binary relation candidates, all pairwise combinations of the input search keywords appearing in the articles extracted by the search processing (step S14).
[0114] Then, by processing substantially the same as that of the solution-feature pair extraction unit 42, the feature extraction unit 46 extracts a predetermined set of features for each binary relation candidate appearing in the retrieved articles (step S15).
[0115] For each candidate, the solution estimation unit 47 estimates, based on the learning results in the learning result storage unit 44, which solution is likely for that candidate's set of features, that is, the degree to which it is "likely to be positive" or "likely to be negative" (step S16). The search result extraction unit 48 then selects, as binary relations to be extracted, those candidates estimated to be "likely to be positive" to a degree better than a predetermined level, and outputs the articles containing these binary relations, or information identifying those articles, as the search result 6 (step S17).
[0116] Next, a specific example of the information retrieval processing of the present invention will be described. In this example, the information retrieval device 4 uses as teacher data text data, taken from the search text data 5, that contains binary relations whose elements are character strings that can serve as the two search keywords of an AND search. It then creates binary relation candidates whose elements are the input search keywords given to the AND search processing, searches the search text data 5 using these binary relation candidates, and extracts articles. It estimates whether the binary relation candidates formed by the input search keywords contained in each retrieved article should be extracted, and outputs as the search result 6 the articles containing binary relation candidates whose estimated degree of being ones to extract is good.
[0117] Assume that 「京大」 (Kyoto University) and 「総長」 (president) are set as the search keywords of an AND search. Whether the binary relation between the search keywords is positive or negative is judged by a human, and a tag indicating the positive or negative solution is assigned manually. Teacher data containing both positive and negative examples is therefore used in the machine learning processing.
[0118] Figs. 8 to 10 show examples of the teacher data stored in the teacher data storage unit 41 and of the features extracted from that teacher data by the solution-feature pair extraction unit 42. In this example, the teacher data D1 and D2 in Figs. 8 and 9 are given tags indicating that the solution is positive, i.e., that the binary relation should be extracted. The teacher data D3 in Fig. 10 is given a tag indicating that the solution is negative, i.e., that the binary relation should not be extracted.
[0119] The teacher data D1 in Fig. 8 contains the binary relation pair P3, a pair of the two search keywords: the binary relation (pair) P3 consists of the first element p1 (search key K1) 「京大」 and the second element p2 (search key K2) 「総長」, and the pair P3 is given the positive solution.
[0120] Similarly, the teacher data D2 in Fig. 9 contains the binary relation pair P4, a pair of the two search keywords: the binary relation (pair) P4 consists of the first element p1 (search key K1) 「京大」 and the second element p2 (search key K2) 「総長」, and the pair P4 is given the positive solution. This is because the teacher data of Figs. 8 and 9 can be judged to be about 「京大の総長」 (the president of Kyoto University).
[0121] The teacher data D3 in Fig. 10 contains the binary relation pair P5, a pair of the two search keywords: the binary relation (pair) P5 consists of the first element p1 (search key K1) 「京大」 and the second element p2 (search key K2) 「総長」, and the pair P5 is given the negative solution. This is because, although 「京大」 and 「総長」 both appear in the same data, they bear no relation to each other, and the data can be judged not to be about "the president of Kyoto University."
[0122] The solution-feature pair extraction unit 42 extracts pairs of a solution and a set of features from the teacher data examples stored in the teacher data storage unit 41. For example, the two words before and after each element (search keyword), both the words themselves and their parts of speech, are used as features. Taking the teacher data D1 as an example, the features are:
the word two before the first element is 「今日」 (today);
the part of speech of that word is "noun";
the word one before the first element is 「,」;
the part of speech of that word is "comma";
the word one after the first element is 「で」;
the part of speech of that word is "particle";
the word two after the first element is 「の」;
the part of speech of that word is "particle";
the word two before the second element is 「で」;
the part of speech of that word is "particle";
the word one before the second element is 「,」;
the part of speech of that word is "comma";
the word one after the second element is 「が」;
the part of speech of that word is "particle";
the word two after the second element is 「出席」 (attendance);
and the part of speech of that word is "noun".
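The surrounding-word features listed above can be sketched as follows. The helper names are hypothetical, and a real implementation would take its tokens and part-of-speech tags from a morphological analyzer rather than from hand-built lists.

```python
def context_features(tokens, pos_tags, idx, prefix):
    """Features for the element at token position idx:
    the two words before and after it, plus their parts of speech."""
    feats = []
    for offset in (-2, -1, 1, 2):
        j = idx + offset
        if 0 <= j < len(tokens):
            feats.append(f"{prefix}:word[{offset:+d}]={tokens[j]}")
            feats.append(f"{prefix}:pos[{offset:+d}]={pos_tags[j]}")
    return feats

def pair_features(tokens, pos_tags, i1, i2):
    """Combined context features for the two elements of a candidate pair."""
    return (context_features(tokens, pos_tags, i1, "e1")
            + context_features(tokens, pos_tags, i2, "e2"))
```

For a tokenized sentence in which 「京大」 is the first element and 「総長」 the second, `pair_features` yields strings such as "e1:word[-2]=今日" and "e2:word[+2]=出席", which play the role of the feature list above.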
[0123] Note that the solution-feature pair extraction unit 42 can also extract as features the kinds of information described for the binary relation extraction processing.
[0124] Based on these pairs of solutions and feature sets, the machine learning unit 43 performs machine learning processing to determine which solution (positive/negative) tends to arise for which set of features, and stores the learning result in the learning result storage unit 44. As the supervised machine learning method, the machine learning unit 43 uses one of the processing methods described above, for example the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
[0125] Thereafter, the information retrieval unit 40 performs an AND search on the search text data 5 based on the given input search keywords 「京大」 and 「総長」, and acquires the articles containing the input search keywords. The candidate extraction unit 45 then extracts binary relation candidates from the retrieved articles; specifically, it extracts them from the input search keywords contained in the articles returned by the AND search. The feature extraction unit 46 extracts from each binary relation candidate the same features as the solution-feature pair extraction unit 42, and the solution estimation unit 47 estimates, based on the learning results stored in the learning result storage unit 44, the degree to which each binary relation candidate is likely to be positive or negative given its set of features. Based on the estimation results of the solution estimation unit 47, the search result extraction unit 48 extracts from the binary relation candidates those with a good estimated likelihood of being positive, and outputs the articles containing these binary relations, or information identifying those articles, as the search result 6.
[0126] For example, the candidate extraction unit 45 generates all pairwise combinations of the given input search keywords and treats the generated pairs as binary relation candidates. The information retrieval unit 40 then performs AND search processing using the elements of each binary relation candidate (the two input search keywords), and the feature extraction unit 46 extracts a predetermined set of features for each binary relation candidate appearing in the retrieved articles.
[0127] Based on the learning results in the learning result storage unit 44, the solution estimation unit 47 estimates, for each binary relation candidate, the likelihood of each solution given the candidate's set of features. When each binary relation candidate formed by a pair of input search keywords appears only once in a retrieved article, and all of those binary relation candidates are estimated to have a good degree of being positive (to be extracted), the article, or information identifying it, is taken as the search result 6.
[0128] When a binary relation formed by a pair of input search keywords appears more than once in a retrieved article, the condition is that at least one of the multiple appearing candidates is estimated to have a good degree of being positive (to be extracted); when every binary relation candidate satisfies all of the above conditions and is estimated to have a good positive degree, the article, or information identifying it, is taken as the search result 6.
[0129] Further, the candidate extraction unit 45 generates all pairs of two input search keywords from the given input search keywords and treats the generated pairs as binary relation candidates. The information retrieval unit 40 performs AND search processing using the elements of each binary relation candidate (the two input search keywords), and the feature extraction unit 46 extracts a predetermined set of features for each binary relation candidate appearing in the retrieved articles.
[0130] Based on the learning results in the learning result storage unit 44, the solution estimation unit 47 estimates, for each binary relation candidate, the likelihood of each solution given the candidate's set of features. When each binary relation candidate formed by a pair of input search keywords appears only once in a retrieved article, the positive (to-be-extracted) degree is estimated for all of those candidates, and the product of the estimated positive degrees over all the candidates is taken as the positive degree of the article. The articles estimated to have a good positive degree, or information identifying them, are then taken as the search result 6.
[0131] When a binary relation formed by a pair of input search keywords appears more than once in a retrieved article, the positive degree is estimated for each of the appearing candidates, and the best of those estimated degrees is taken as the degree of that binary relation candidate. The degree of each binary relation is obtained in this way, and the product of the obtained degrees is taken as the positive degree of the article. The articles estimated to have a good positive degree, or information identifying them, are then taken as the search result 6.
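A minimal sketch of the article scoring described in paragraphs [0130] and [0131]: for each keyword-pair relation the best degree among its occurrences in the article is taken, and the per-pair degrees are multiplied to give the article's positive degree. The function name and the assumption that degrees are given as probabilities in [0, 1] are illustrative, not from the patent.

```python
def article_score(candidate_degrees):
    """candidate_degrees maps each keyword pair to the list of
    estimated positive degrees of its occurrences in one article.
    The article score is the product, over the pairs, of the best
    degree observed for each pair."""
    score = 1.0
    for degrees in candidate_degrees.values():
        score *= max(degrees)
    return score
```

Articles whose score exceeds a threshold (or the highest-scoring articles) would then be returned as the search result 6.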
[0132] As described above, according to the information retrieval device 4 of the present invention, merely by preparing, as teacher data for machine learning processing, text data in which the binary relations between the two search keywords of AND search processing are annotated with an evaluation of whether they are binary relations to be extracted, it becomes possible to automatically extract from new search text data 5 the articles containing binary relations judged worth extracting.
[0133] By evaluating, through the binary relation extraction processing, the relations between the search keywords appearing in the articles returned by AND search processing, the information retrieval device 4 of the present invention can exclude articles that were hit merely because they contain the search keywords but in which the keywords are only weakly related to each other, so that the content is irrelevant and deviates, so to speak, from the search intent. Furthermore, improvements in the accuracy of supervised machine learning can be expected to improve the performance of the information retrieval processing.
[0134] In the above embodiments, examples of binary relations composed of two elements were described for the binary relation extraction processing and the information retrieval processing. The present invention can also be applied to ternary relations composed of three elements.
[0135] For example, in the binary relation extracting device 1, data containing ternary relations of three elements is prepared as teacher data. The solution-feature pair extraction unit 12 then takes as the features of such a ternary relation, for example, the word information of the two words preceding the first element (the element that appears first), the two words following the third element (the element that appears last), all the words between the first element and the second element (the element that appears in the middle), and all the words between the second element and the third element. The machine learning unit 13 can thereby learn the likelihood of each solution from the sets of ternary relation features, and the binary relation extraction unit 18 can handle the extraction of ternary relations. As in the binary case, the solution given to a ternary relation is either "a ternary relation to be extracted" or "a ternary relation not to be extracted."
[0136] Alternatively, in the binary relation extracting device 1, data containing ternary relations of three elements is prepared as teacher data, and each processing means of the binary relation extracting device 1 treats each of the binary relations obtained by decomposing the ternary relation of the teacher data (the binary relation between the first and second elements, between the second and third elements, and between the first and third elements) as a separate binary relation. For every one of these binary relations, the degree of the solution that the relation should be extracted is computed, and the value obtained by multiplying the computed degrees together is taken as the degree of the ternary relation. Ternary relations with a large degree are then taken out as the ternary relations to be extracted.
[0137] At this time, when the machine learning unit 13 uses the support vector machine method, the classification targets are two (positive or negative), so the ternary relations are machine-learned using the pairwise method or the one-versus-rest method.
[0138] Also, the binary relation extraction unit 18 is configured to obtain a confidence of extraction when extracting the binary relations 3. As the confidence of a ternary relation created by combining several binary relations, the product of the confidences of the combined binary relations is used, and ternary relations with a large confidence are taken out. The confidence of a binary relation uses the confidence computed in ordinary machine learning processing.
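The ternary confidence of paragraph [0138], the product of the confidences of the three constituent binary relations, can be sketched as follows. The names are hypothetical, and the confidences are assumed to be given as probability-like values.

```python
from itertools import combinations

def ternary_confidence(binary_confidence, elements):
    """Confidence of a ternary relation over three elements, computed
    as the product of the confidences of its constituent binary
    relations: (1st, 2nd), (1st, 3rd), and (2nd, 3rd)."""
    conf = 1.0
    for a, b in combinations(elements, 2):
        conf *= binary_confidence[(a, b)]
    return conf
```

Ternary relations whose confidence is large would then be taken out as the relations to be extracted.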
[0139] Such ternary relation extraction processing can likewise be performed in the information retrieval device 4. For example, when searching for articles about "the president of Kyoto University in Heisei 12 (the year 2000)," data containing ternary relations of the three search keywords 「平成12年」, 「京大」, and 「総長」 is given as teacher data, and the search result 6 of an AND search using these three search keywords is output from the search text data 5.
[0140] In this example, the solution information given to the binary or ternary relations of the examples was described as "positive (a binary relation to be extracted)" or "negative (a binary relation not to be extracted)," but the solution information may instead be multi-class, for example "interacting," "counteracting," and "no effect."
[0141] 以上,本発明をその実施の形態により説明したが,本発明はその主旨の範囲におい て種々の変形が可能であることは当然である。  [0141] While the present invention has been described with reference to the embodiments, it is obvious that the present invention can be variously modified within the scope of the gist thereof.
[0142] また,本発明は,コンピュータにより読み取られ実行されるプログラムとして実施することができる。本発明を実現するプログラムは,コンピュータが読み取り可能な,可搬媒体メモリ,半導体メモリ,ハードディスクなどの適当な記録媒体に格納することができ,これらの記録媒体に記録して提供され,または,通信インタフェースを介して種々の通信網を利用した送受信により提供されるものである。  [0142] The present invention can also be implemented as a program read and executed by a computer. A program realizing the present invention can be stored in an appropriate computer-readable recording medium such as a portable memory medium, a semiconductor memory, or a hard disk, and is provided either recorded on such a recording medium or by transmission and reception over various communication networks via a communication interface.

Claims

請求の範囲  The scope of the claims
[1] コンピュータが読み取り可能な記憶装置に格納された文データ中に出現する二項 関係を,機械学習処理を用いて抽出する処理装置であって,  [1] A processing device that extracts binary relations that appear in sentence data stored in a computer-readable storage device using machine learning processing.
問題と解との組で構成される事例であって,問題が文データ中に出現する二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段と,  teacher data storage means storing teacher data that includes cases each consisting of a pair of a problem and a solution, the problem being a binary relation appearing in the sentence data and the solution indicating that the binary relation is to be extracted;
前記教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出手段と,  solution-feature pair extraction means for retrieving the cases from the teacher data storage means, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習手段と,  machine learning means for performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
前記記憶装置に格納されたテキストデータから,前記二項関係の要素を抽出し,前記要素で構成される対を抽出し,前記抽出した対を二項関係の候補とする候補抽出手段と,  candidate extraction means for extracting the elements of the binary relation from the text data stored in the storage device, extracting pairs composed of the elements, and using the extracted pairs as binary relation candidates;
前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出手段と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定手段と,  feature extraction means for extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択する二項関係抽出手段とを備える  and binary relation extraction means for selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level,
ことを特徴とする二項関係抽出装置。  A binary relation extraction apparatus characterized by that.
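As an illustrative, non-normative sketch of how the means of claim 1 fit together: the feature definition (pair elements plus intervening words) and the frequency-based stand-in learner are assumptions for demonstration only; the claim covers any machine learning algorithm.

```python
def extract_features(pair, sentence):
    """Illustrative feature extractor: the pair elements themselves plus the
    words appearing between them (the claim leaves the feature set open)."""
    words = sentence.split()
    i, j = sorted((words.index(pair[0]), words.index(pair[1])))
    return frozenset(pair) | frozenset(words[i + 1:j])

def learn(examples):
    """Stand-in 'learning result': per feature, how often it co-occurred
    with a positive ('extract') solution in the teacher data."""
    pos, total = {}, {}
    for feats, label in examples:
        for f in feats:
            total[f] = total.get(f, 0) + 1
            if label:
                pos[f] = pos.get(f, 0) + 1
    return {f: pos.get(f, 0) / total[f] for f in total}

def estimate(model, feats):
    """Degree to which the candidate's feature set suggests the positive
    solution: mean per-feature positive rate (0.5 for unseen features)."""
    scores = [model.get(f, 0.5) for f in feats]
    return sum(scores) / len(scores)

def extract_relations(model, candidates, threshold=0.5):
    """Select candidates whose estimated degree exceeds the threshold."""
    return [pair for pair, feats in candidates
            if estimate(model, feats) > threshold]
```

The same skeleton applies to the method claims (claim 15) and the program claims (claim 17); only the claimed category differs.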
[2] 前記教師データ記憶手段は,前記事例として,問題の二項関係が,抽出するべき二項関係であることを示す正の解が与えられた正の事例と,問題の二項関係が,抽出するべきではない二項関係であることを示す負の解が与えられた負の事例とを含む教師データが格納される  [2] wherein the teacher data storage means stores teacher data including, as the cases, positive cases given a positive solution indicating that the binary relation of the problem is a binary relation to be extracted, and negative cases given a negative solution indicating that the binary relation of the problem is a binary relation that should not be extracted,
ことを特徴とする請求項 1記載の二項関係抽出装置。  The binary relation extraction device according to claim 1.
[3] 前記機械学習手段は,前記教師データから,前記所定の情報である素性の集合と解を示す情報との対で構成した規則を設定し,前記規則を所定の順序でリスト上に並べたものを学習結果とし,前記規則のリストを学習結果情報として前記学習結果記憶手段に格納し,  [3] wherein the machine learning means sets rules each composed of a pair of a feature set, which is the predetermined information, and information indicating a solution, takes a list of the rules arranged in a predetermined order as the learning result, and stores the list of rules in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記規則のリストを先頭からチェックして,前記二項関係の候補から抽出された素性の集合と一致する規則を検出し,検出した規則の解を示す情報をもとに,前記二項関係の候補の解として推定する  the solution estimation means checks, from the top, the list of rules that is the learning result information stored in the learning result storage means, detects a rule matching the set of features extracted from the binary relation candidate, and estimates the solution of the binary relation candidate based on the information indicating the solution of the detected rule,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
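A sketch of claim 3's decision-list learning: rules pair a feature set with a solution, are ordered (here by observed reliability, one concrete choice for the claim's "predetermined order"), and are checked from the top until one matches. Single-feature rules and the reliability ordering are illustrative assumptions.

```python
def build_decision_list(examples):
    """Build a decision list of (feature set, solution) rules from teacher
    data, ordered by how reliably each feature predicted one solution."""
    stats = {}
    for feats, label in examples:
        for f in feats:
            pos, neg = stats.get(f, (0, 0))
            stats[f] = (pos + 1, neg) if label else (pos, neg + 1)
    rules = []
    for f, (pos, neg) in stats.items():
        label = pos >= neg
        reliability = max(pos, neg) / (pos + neg)
        rules.append((reliability, frozenset([f]), label))
    rules.sort(key=lambda r: -r[0])
    rules.append((0.0, frozenset(), False))  # default rule at the end
    return [(feats, label) for _, feats, label in rules]

def apply_decision_list(rules, feats):
    """Check rules from the top; the first rule whose feature set is
    contained in the candidate's features gives the estimated solution."""
    for rule_feats, label in rules:
        if rule_feats <= feats:
            return label
    return False
```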
[4] 前記機械学習手段は,前記教師データから,解となりうる分類を特定し,所定の条件式を満足しかつエントロピーを示す式を最大にするときの素性の集合と解となりうる分類の二項からなる確率分布を求め,前記確率分布を前記学習結果情報として前記学習結果記憶部に格納し,  [4] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, obtains a probability distribution over the pairs of a feature set and a classification that can be a solution when predetermined conditional expressions are satisfied and an expression representing entropy is maximized, and stores the probability distribution in the learning result storage unit as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記確率分布を利用して,前記二項関係の候補の素性の集合の場合のそれぞれの解となりうる分類の確率を求めて,最も大きい確率値を持つ解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  the solution estimation means uses the probability distribution, which is the learning result information stored in the learning result storage means, to obtain the probability of each classification that can be a solution for the feature set of the binary relation candidate, identifies the classification with the largest probability value, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
[5] 前記機械学習手段は,前記教師データから解となりうる分類を特定し,前記分類を正例と負例とに分割し,所定のカーネル関数を用いたサポートベクトルマシン法を実行する関数にしたがって,前記二項関係の候補から抽出された素性の集合を次元とする空間上で前記正例と前記負例との間隔を最大にしかつ超平面で分割する超平面を求め,前記超平面を前記学習結果情報として前記学習結果記憶手段に格納し,前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記超平面を利用して,前記二項関係の候補から抽出された素性の集合が,前記超平面で分割された前記空間において前記正例の側か前記負例の側のどちらにあるかを特定し,前記特定された結果にもとづいて定まる解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  [5] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, divides the classifications into positive examples and negative examples, obtains, according to a function executing a support vector machine method using a predetermined kernel function, a hyperplane that maximizes the margin between the positive examples and the negative examples and divides them, in a space whose dimensions are the set of features extracted from the binary relation candidate, and stores the hyperplane in the learning result storage means as the learning result information, and the solution estimation means uses the hyperplane, which is the learning result information stored in the learning result storage means, to identify whether the set of features extracted from the binary relation candidate lies on the positive example side or the negative example side of the space divided by the hyperplane, identifies a classification that can be the solution determined based on the identified result, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
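Claim 5's support vector machine can be illustrated with a deliberately minimal linear, bias-free variant trained by Pegasos-style sub-gradient steps; the kernel function and the exact max-margin optimization of the claim are omitted, and all names are assumptions.

```python
import random

def train_linear_svm(examples, dims, lam=0.01, epochs=300, seed=0):
    """Minimal linear SVM via sub-gradient descent on the hinge loss.
    `examples` are (feature_vector, label) with labels in {+1, -1}."""
    rng = random.Random(seed)
    w = [0.0] * dims
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(examples, len(examples)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - eta * lam) * wi for wi in w]  # regularization shrink
            if margin < 1:  # margin violation: move the separating hyperplane
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    """Which side of the hyperplane the candidate's feature vector lies on:
    +1 (positive example side, extract) or -1 (negative example side)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```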
[6] 前記機械学習手段は,前記教師データの事例同士について,その事例から抽出された素性の集合のうち重複する素性の割合にもとづく事例同士の類似度を定義しておき,前記定義した類似度と事例を前記学習結果情報として前記学習結果記憶手段に格納し,  [6] wherein the machine learning means defines a similarity between cases of the teacher data based on the proportion of overlapping features among the sets of features extracted from those cases, and stores the defined similarity and the cases in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記定義した類似度と前記事例を参照して,前記二項関係の候補についてその候補との類似度が高い順に k個の事例を選択し,前記選択した k個の事例での多数決によって定めた分類先を,前記二項関係の候補の解と推定する  the solution estimation means refers to the defined similarity and the cases, which are the learning result information stored in the learning result storage means, selects, for the binary relation candidate, k cases in descending order of similarity to the candidate, and estimates the classification determined by a majority vote among the selected k cases as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
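A sketch of claim 6's similarity-based (k-nearest-neighbour) estimation, using Jaccard overlap as one concrete reading of "the proportion of overlapping features"; the function names are illustrative.

```python
def similarity(f1, f2):
    """Jaccard similarity between two feature sets: the size of the overlap
    divided by the size of the union."""
    if not f1 and not f2:
        return 1.0
    return len(f1 & f2) / len(f1 | f2)

def knn_estimate(examples, feats, k=3):
    """Select the k teacher cases most similar to the candidate's feature
    set and estimate the solution by majority vote among their labels."""
    ranked = sorted(examples, key=lambda ex: similarity(ex[0], feats),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)
```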
[7] 前記機械学習手段は,前記解と素性の集合との組を前記学習結果情報として前記学習結果記憶手段に格納し,  [7] wherein the machine learning means stores the pairs of the solution and the feature set in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段の前記解と素性の集合との組をもとに,ベイズの定理にもとづいて,前記素性抽出手段から得た前記二項関係の候補の素性の集合の場合の各分類になる確率を算出し,前記確率の値が最も大きい分類を,前記二項関係の候補の解と推定する  the solution estimation means calculates, based on Bayes' theorem and on the pairs of the solution and the feature set in the learning result storage means, the probability of each classification for the set of features of the binary relation candidate obtained from the feature extraction means, and estimates the classification with the largest probability value as the solution of the binary relation candidate,
ことを特徴とする請求項 1または請求項 2のいずれか一項に記載の二項関係抽出装置。  The binary relation extraction device according to any one of claims 1 and 2.
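A sketch of claim 7's Bayes-rule estimation. Add-one smoothing is added so that an unseen feature does not zero out a class; the smoothing and the function names are assumptions beyond the claim text.

```python
import math

def train_nb(examples):
    """The 'learning result' of claim 7 is essentially the solution-feature
    pairs themselves; store them as counts."""
    class_count, feat_count, vocab = {}, {}, set()
    for feats, label in examples:
        class_count[label] = class_count.get(label, 0) + 1
        for f in feats:
            feat_count[(label, f)] = feat_count.get((label, f), 0) + 1
            vocab.add(f)
    return class_count, feat_count, vocab

def nb_estimate(model, feats):
    """P(class | features) proportional to P(class) * product of
    P(feature | class) with add-one smoothing; return the most probable
    class as the estimated solution."""
    class_count, feat_count, vocab = model
    n = sum(class_count.values())
    best, best_lp = None, -math.inf
    for c, cc in class_count.items():
        lp = math.log(cc / n)
        for f in feats:
            lp += math.log((feat_count.get((c, f), 0) + 1) / (cc + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```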
[8] 複数の検索キーワードによる情報検索処理において,教師あり機械学習処理を用いた二項関係抽出処理結果を利用して検索結果を抽出する処理装置であって,問題と解との組で構成される事例であって,問題が検索キーワードを要素とする二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段と,  [8] A processing device that, in information retrieval processing using a plurality of search keywords, extracts search results by using the results of binary relation extraction processing using supervised machine learning processing, comprising: teacher data storage means storing teacher data that includes cases each consisting of a pair of a problem and a solution, the problem being a binary relation whose elements are search keywords and the solution indicating that the binary relation is to be extracted;
前記教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出手段と,  solution-feature pair extraction means for retrieving the cases from the teacher data storage means, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習手段と,  machine learning means for performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
入力された複数の検索キーワードを用いた入力検索キーワード対を生成し,検索対象となるテキストデータから前記入力検索キーワード対を含むテキストデータを抽出して取得する情報検索手段と,  information retrieval means for generating input search keyword pairs using a plurality of input search keywords, and extracting and acquiring, from the text data to be searched, text data containing the input search keyword pairs;
前記検索して取得された各テキストデータから,前記入力検索キーワードで構成される対を生成し,前記生成した対を二項関係の候補とする候補抽出手段と,  candidate extraction means for generating, from each piece of text data acquired by the search, a pair composed of the input search keywords, and using the generated pair as a binary relation candidate;
前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出手段と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定手段と,  feature extraction means for extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; solution estimation means for estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択し,前記選択した二項関係を含むテキストデータを検索結果として抽出する検索結果抽出手段とを備える  and search result extraction means for selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level, and extracting text data containing the selected binary relation as a search result,
ことを特徴とする二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing characterized by the above.
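The retrieval flow of claim 8 (AND search followed by filtering on the learned relation judgement) can be sketched independently of any particular learner; `relation_score` here stands in for the solution estimation means and is an assumption, as are all names.

```python
def retrieve(texts, kw_pair, relation_score, threshold=0.5):
    """AND search, then relation filtering: keep only texts in which the
    input keyword pair is judged a binary relation to be extracted.
    `relation_score(pair, text)` is any learned estimator returning the
    degree (in [0, 1]) that the pair is an extractable relation."""
    # Step 1: AND search over whitespace tokens (illustrative tokenization).
    hits = [t for t in texts if all(k in t.split() for k in kw_pair)]
    # Step 2: keep only hits whose keyword pair passes the learned judgement.
    return [t for t in hits if relation_score(kw_pair, t) >= threshold]
```

This mirrors the point of the claim: a plain AND search returns every co-occurrence, and the learned binary relation judgement prunes the spurious ones.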
[9] 前記教師データ記憶手段は,前記事例として,問題の二項関係が,抽出するべき二項関係であることを示す正の解が与えられた正の事例と,問題の二項関係が,抽出するべきではない二項関係であることを示す負の解が与えられた負の事例とを含む教師データが格納される  [9] wherein the teacher data storage means stores teacher data including, as the cases, positive cases given a positive solution indicating that the binary relation of the problem is a binary relation to be extracted, and negative cases given a negative solution indicating that the binary relation of the problem is a binary relation that should not be extracted,
ことを特徴とする請求項 8記載の二項関係抽出処理を用いた情報検索装置。 [10] 前記機械学習手段は,前記教師データから,前記所定の情報である素性の集合と解を示す情報との対で構成した規則を設定し,前記規則を所定の順序でリスト上に並べたものを学習結果とし,前記規則のリストを学習結果情報として前記学習結果記憶手段に格納し,  An information retrieval apparatus using binary relation extraction processing according to claim 8. [10] wherein the machine learning means sets rules each composed of a pair of a feature set, which is the predetermined information, and information indicating a solution, takes a list of the rules arranged in a predetermined order as the learning result, and stores the list of rules in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記規則のリストを先頭からチェックして,前記二項関係の候補から抽出された素性の集合と一致する規則を検出し,検出した規則の解を示す情報をもとに,前記二項関係の候補の解として推定する  the solution estimation means checks, from the top, the list of rules that is the learning result information stored in the learning result storage means, detects a rule matching the set of features extracted from the binary relation candidate, and estimates the solution of the binary relation candidate based on the information indicating the solution of the detected rule,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[11] 前記機械学習手段は,前記教師データから,解となりうる分類を特定し,所定の条件式を満足しかつエントロピーを示す式を最大にするときの素性の集合と解となりうる分類の二項からなる確率分布を求め,前記確率分布を前記学習結果情報として前記学習結果記憶部に格納し,  [11] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, obtains a probability distribution over the pairs of a feature set and a classification that can be a solution when predetermined conditional expressions are satisfied and an expression representing entropy is maximized, and stores the probability distribution in the learning result storage unit as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記確率分布を利用して,前記二項関係の候補の素性の集合の場合のそれぞれの解となりうる分類の確率を求めて,最も大きい確率値を持つ解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  the solution estimation means uses the probability distribution, which is the learning result information stored in the learning result storage means, to obtain the probability of each classification that can be a solution for the feature set of the binary relation candidate, identifies the classification with the largest probability value, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[12] 前記機械学習手段は,前記教師データから解となりうる分類を特定し,前記分類を正例と負例とに分割し,所定のカーネル関数を用いたサポートベクトルマシン法を実行する関数にしたがって,前記二項関係の候補から抽出された素性の集合を次元とする空間上で前記正例と前記負例との間隔を最大にしかつ超平面で分割する超平面を求め,前記超平面を前記学習結果情報として前記学習結果記憶手段に格納し,前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記超平面を利用して,前記二項関係の候補から抽出された素性の集合が,前記超平面で分割された前記空間において前記正例の側か前記負例の側のどちらにあるかを特定し,前記特定された結果にもとづいて定まる解となりうる分類を特定し,前記特定した分類を前記二項関係の候補の解と推定する  [12] wherein the machine learning means identifies, from the teacher data, classifications that can be solutions, divides the classifications into positive examples and negative examples, obtains, according to a function executing a support vector machine method using a predetermined kernel function, a hyperplane that maximizes the margin between the positive examples and the negative examples and divides them, in a space whose dimensions are the set of features extracted from the binary relation candidate, and stores the hyperplane in the learning result storage means as the learning result information, and the solution estimation means uses the hyperplane, which is the learning result information stored in the learning result storage means, to identify whether the set of features extracted from the binary relation candidate lies on the positive example side or the negative example side of the space divided by the hyperplane, identifies a classification that can be the solution determined based on the identified result, and estimates the identified classification as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[13] 前記機械学習手段は,前記教師データの事例同士について,その事例から抽出された素性の集合のうち重複する素性の割合にもとづく事例同士の類似度を定義しておき,前記定義した類似度と事例を前記学習結果情報として前記学習結果記憶手段に格納し,  [13] wherein the machine learning means defines a similarity between cases of the teacher data based on the proportion of overlapping features among the sets of features extracted from those cases, and stores the defined similarity and the cases in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段に格納された前記学習結果情報である前記定義した類似度と前記事例を参照して,前記二項関係の候補についてその候補との類似度が高い順に k個の事例を選択し,前記選択した k個の事例での多数決によって定めた分類先を,前記二項関係の候補の解と推定する  the solution estimation means refers to the defined similarity and the cases, which are the learning result information stored in the learning result storage means, selects, for the binary relation candidate, k cases in descending order of similarity to the candidate, and estimates the classification determined by a majority vote among the selected k cases as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[14] 前記機械学習手段は,前記解と素性の集合との組を前記学習結果情報として前記学習結果記憶手段に格納し,  [14] wherein the machine learning means stores the pairs of the solution and the feature set in the learning result storage means as the learning result information, and
前記解推定手段は,前記学習結果記憶手段の前記解と素性の集合との組をもとに,ベイズの定理にもとづいて,前記素性抽出手段から得た前記二項関係の候補の素性の集合の場合の各分類になる確率を算出し,前記確率の値が最も大きい分類を,前記二項関係の候補の解と推定する  the solution estimation means calculates, based on Bayes' theorem and on the pairs of the solution and the feature set in the learning result storage means, the probability of each classification for the set of features of the binary relation candidate obtained from the feature extraction means, and estimates the classification with the largest probability value as the solution of the binary relation candidate,
ことを特徴とする請求項 8または請求項 9のいずれか一項に記載の二項関係抽出処理を用いた情報検索装置。  An information retrieval apparatus using binary relation extraction processing according to any one of claims 8 and 9.
[15] コンピュータが読み取り可能な記憶装置に格納された文データ中に出現する二項関係を,機械学習処理を用いて抽出する二項関係抽出処理方法であって,問題と解との組で構成される事例であって,問題が文データ中に出現する二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  [15] A binary relation extraction processing method in which a computer extracts, using machine learning processing, binary relations appearing in sentence data stored in a readable storage device, comprising: a solution-feature pair extraction process of retrieving cases, each consisting of a pair of a problem and a solution, the problem being a binary relation appearing in the sentence data and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
前記記憶装置に格納されたテキストデータから,前記二項関係の要素を抽出し,前記要素で構成される対を抽出し,前記抽出した対を二項関係の候補とする候補抽出処理過程と,  a candidate extraction process of extracting the elements of the binary relation from the text data stored in the storage device, extracting pairs composed of the elements, and using the extracted pairs as binary relation candidates;
前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択する二項関係抽出処理過程とを備えることを特徴とする二項関係抽出処理方法。  and a binary relation extraction process of selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level. A binary relation extraction processing method characterized by comprising the above processes.
[16] コンピュータが複数の検索キーワードによる情報検索処理を行う場合に,教師あり機械学習処理を用いた二項関係抽出処理結果を利用して検索結果を抽出する情報検索処理方法であって,  [16] An information retrieval processing method for extracting search results by using the results of binary relation extraction processing using supervised machine learning processing when a computer performs information retrieval processing using a plurality of search keywords, comprising:
問題と解との組で構成される事例であって,問題が検索キーワードを要素とする二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  a solution-feature pair extraction process of retrieving cases, each consisting of a pair of a problem and a solution, the problem being a binary relation whose elements are search keywords and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
入力された複数の検索キーワードを用いた入力検索キーワード対を生成し,検索対象となるテキストデータから前記入力検索キーワード対を含むテキストデータを抽出して取得する情報検索処理過程と,  an information retrieval process of generating input search keyword pairs using a plurality of input search keywords, and extracting and acquiring, from the text data to be searched, text data containing the input search keyword pairs;
前記検索して取得された各テキストデータから,前記入力検索キーワードで構成される対を生成し,前記生成した対を二項関係の候補とする候補抽出処理過程と,前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a candidate extraction process of generating, from each piece of text data acquired by the search, a pair composed of the input search keywords, and using the generated pair as a binary relation candidate; a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択し,前記選択した二項関係を含むテキストデータを検索結果として抽出する検索結果抽出処理過程とを備える  and a search result extraction process of selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level, and extracting text data containing the selected binary relation as a search result,
ことを特徴とする二項関係抽出処理を用いた情報検索処理方法。  An information retrieval processing method using binary relation extraction processing characterized by the above.
[17] コンピュータに,読み取り可能な記憶装置に格納された文データ中に出現する二項関係を,機械学習処理を用いて抽出する処理方法として,  [17] A program for causing a computer to execute, as a processing method for extracting binary relations appearing in sentence data stored in a readable storage device using machine learning processing:
問題と解との組で構成される事例であって,問題が文データ中に出現する二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から,前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  a solution-feature pair extraction process of retrieving the cases, each consisting of a pair of a problem and a solution, the problem being a binary relation appearing in the sentence data and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
前記記憶装置に格納されたテキストデータから,前記二項関係の要素を抽出し,前記要素で構成される対を抽出し,前記抽出した対を二項関係の候補とする候補抽出処理過程と,前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a candidate extraction process of extracting the elements of the binary relation from the text data stored in the storage device, extracting pairs composed of the elements, and using the extracted pairs as binary relation candidates; a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係であることを示す解となりやすい度合いが所定の程度より良い場合に,前記二項関係の候補を抽出するべき二項関係として選択する二項関係抽出処理過程とを,  and a binary relation extraction process of selecting the binary relation candidate as a binary relation to be extracted when, as the estimation result, the degree of likelihood of yielding a solution indicating that the candidate is a binary relation to be extracted is better than a predetermined level,
実行させるための二項関係抽出処理プログラム。  A binary relation extraction processing program for execution.
[18] コンピュータに,複数の検索キーワードによる情報検索処理を行う場合に,教師あり機械学習処理を用いた二項関係抽出処理結果を利用して検索結果を抽出する方法として,  [18] A program for causing a computer to execute, as a method for extracting search results by using the results of binary relation extraction processing using supervised machine learning processing when performing information retrieval processing using a plurality of search keywords:
問題と解との組で構成される事例であって,問題が検索キーワードを要素とする二項関係であって解が抽出するべき二項関係であるものを含む教師データが格納された教師データ記憶手段から前記事例を取り出し,前記事例ごとに,所定の情報を素性として抽出し,前記解と前記抽出した素性の集合との組を生成する解-素性対抽出処理過程と,  a solution-feature pair extraction process of retrieving the cases, each consisting of a pair of a problem and a solution, the problem being a binary relation whose elements are search keywords and the solution indicating that the binary relation is to be extracted, from teacher data storage means storing teacher data including such cases, extracting predetermined information as features for each case, and generating a pair of the solution and the set of extracted features;
所定の機械学習アルゴリズムにもとづいて,前記解と素性の集合との組について,どのような素性の集合の場合に前記解となるかということを機械学習処理し,前記どのような素性の集合の場合に前記解となるかということを示す情報を学習結果情報として学習結果記憶手段に保存する機械学習処理過程と,  a machine learning process of performing, based on a predetermined machine learning algorithm, machine learning processing on the pairs of the solution and the feature set as to what kind of feature set yields the solution, and storing information indicating what kind of feature set yields the solution in learning result storage means as learning result information;
入力された複数の検索キーワードを用いた入力検索キーワード対を生成し,検索対象となるテキストデータから前記入力検索キーワード対を含むテキストデータを抽出して取得する情報検索処理過程と,  an information retrieval process of generating input search keyword pairs using a plurality of input search keywords, and extracting and acquiring, from the text data to be searched, text data containing the input search keyword pairs;
前記検索して取得された各テキストデータから,前記入力検索キーワードで構成される対を生成し,前記生成した対を二項関係の候補とする候補抽出処理過程と,前記解-素性対抽出手段が行う抽出処理と同様の抽出処理によって,前記二項関係の候補について前記所定の情報を素性として抽出する素性抽出処理過程と,前記学習結果記憶手段に格納された前記学習結果情報にもとづいて,前記二項関係の候補の素性の集合の場合の前記解となりやすい度合いを推定する解推定処理過程と,  a candidate extraction process of generating, from each piece of text data acquired by the search, a pair composed of the input search keywords, and using the generated pair as a binary relation candidate; a feature extraction process of extracting the predetermined information as features of the binary relation candidate by the same extraction processing as performed by the solution-feature pair extraction means; a solution estimation process of estimating, based on the learning result information stored in the learning result storage means, the degree to which the feature set of the binary relation candidate is likely to yield the solution;
前記推定結果として,前記二項関係の候補について抽出するべき二項関係である ことを示す解となりやす 、度合 、が所定の程度より良 、場合に,前記二項関係の候 補を抽出するべき二項関係として選択し,前記選択した二項関係を含むテキストデ ータを検索結果として抽出する検索結果抽出処理過程とを,  As a result of the estimation, the candidate for the binomial relationship is likely to be a solution indicating that the binomial relationship should be extracted. If the degree is better than a predetermined level, the candidate for the binomial relationship should be extracted. A search result extraction process for selecting as a binary relation and extracting text data including the selected binary relation as a search result.
実行させるための二項関係抽出処理を用いた情報検索処理プログラム。  An information search processing program using binary relation extraction processing for execution.
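The claimed pipeline (learn from labeled keyword-pair cases which feature sets indicate a binary relation to extract, then score and threshold candidate pairs from retrieved text) can be sketched roughly as follows. This is a minimal illustration only: the claim leaves the machine learning algorithm and the feature set unspecified, so the context-word features, the toy data, and the simple count-based weighting below are all assumptions, not the patented method.

```python
from collections import defaultdict

# Hypothetical teacher data: each case pairs a "problem" (a keyword pair)
# with a feature set and a label saying whether that pair is a binary
# relation that should be extracted (the "solution").
teacher_data = [
    (("protein", "gene"), {"ctx:activates", "dist:1"}, True),
    (("protein", "gene"), {"ctx:and", "dist:5"}, False),
    (("drug", "disease"), {"ctx:treats", "dist:2"}, True),
    (("drug", "disease"), {"ctx:near", "dist:8"}, False),
]

# "Machine learning" step: learn a per-feature weight toward the positive
# class (a naive count-based stand-in for the unspecified algorithm);
# the weights play the role of the stored learning result information.
weights = defaultdict(float)
for _pair, feats, is_solution in teacher_data:
    for f in feats:
        weights[f] += 1.0 if is_solution else -1.0

def solution_score(features):
    """Estimate the degree to which a candidate's feature set is likely
    to yield the solution, using the learned weights."""
    return sum(weights[f] for f in features)

# Candidate extraction + selection: keyword pairs found in retrieved text
# become candidates, and candidates whose estimated degree exceeds a
# predetermined threshold are selected as binary relations to extract.
candidates = [
    (("aspirin", "headache"), {"ctx:treats", "dist:2"}),
    (("aspirin", "water"), {"ctx:near", "dist:8"}),
]
threshold = 0.0
selected = [pair for pair, feats in candidates if solution_score(feats) > threshold]
```

In this sketch only the first candidate scores above the threshold, so only its pair would be carried forward as a search result.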
PCT/JP2006/312592 2005-06-23 2006-06-23 Binary relation extracting device WO2006137516A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005-183495 2005-06-23
JP2005183495A JP4565106B2 (en) 2005-06-23 2005-06-23 Binary relation extracting device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing

Publications (1)

Publication Number Publication Date
WO2006137516A1 true WO2006137516A1 (en) 2006-12-28

Family

ID=37570533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/312592 WO2006137516A1 (en) 2005-06-23 2006-06-23 Binary relation extracting device

Country Status (3)

Country Link
JP (1) JP4565106B2 (en)
CN (1) CN101253497A (en)
WO (1) WO2006137516A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008225565A (en) * 2007-03-08 2008-09-25 Nippon Telegr & Teleph Corp <Ntt> Device and method for extracting sets of interrelated specific expressions
JP4793932B2 (en) * 2007-03-08 2011-10-12 日本電信電話株式会社 Apparatus and method for extracting sets of interrelated specific expressions
JP4646078B2 (en) * 2007-03-08 2011-03-09 日本電信電話株式会社 Apparatus and method for extracting sets of interrelated specific expressions
JP4793931B2 (en) * 2007-03-08 2011-10-12 日本電信電話株式会社 Apparatus and method for extracting sets of interrelated specific expressions
JP5116775B2 (en) * 2007-11-19 2013-01-09 日本電信電話株式会社 Information retrieval method and apparatus, program, and computer-readable recording medium
JP4671440B2 (en) * 2007-12-04 2011-04-20 日本電信電話株式会社 Reputation relationship extraction device, method and program thereof
WO2009123288A1 (en) * 2008-04-03 2009-10-08 日本電気株式会社 Word classification system, method, and program
JP5858456B2 (en) * 2011-01-21 2016-02-10 国立研究開発法人情報通信研究機構 Information retrieval service providing apparatus and computer program
EP2953064B1 (en) 2013-02-01 2022-07-06 Fujitsu Limited Information conversion method, information conversion device, and information conversion program
WO2014118978A1 (en) 2013-02-01 2014-08-07 富士通株式会社 Learning method, image processing device and learning program
JP6004014B2 (en) 2013-02-01 2016-10-05 富士通株式会社 Learning method, information conversion apparatus, and learning program
JP6505421B2 (en) 2014-11-19 2019-04-24 株式会社東芝 Information extraction support device, method and program
JP6775935B2 (en) 2015-11-04 2020-10-28 株式会社東芝 Document processing equipment, methods, and programs
JP6490607B2 (en) 2016-02-09 2019-03-27 株式会社東芝 Material recommendation device
JP6602243B2 (en) 2016-03-16 2019-11-06 株式会社東芝 Learning apparatus, method, and program
JP6622172B2 (en) 2016-11-17 2019-12-18 株式会社東芝 Information extraction support device, information extraction support method, and program


Patent Citations (5)

Publication number Priority date Publication date Assignee Title
JPH08147307A (en) * 1994-11-22 1996-06-07 Gijutsu Kenkyu Kumiai Shinjoho Shiyori Kaihatsu Kiko Semantic knowledge acquisition device
JP2003186894A (en) * 2001-12-21 2003-07-04 Hitachi Ltd Substance dictionary creating method, and inter- substance binary relationship extracting method, predicting method and displaying method
JP2003196636A (en) * 2001-12-26 2003-07-11 Communication Research Laboratory Notation error detection processing method using machine learning method having teacher, its processing device and its processing program
JP2003223456A (en) * 2002-01-31 2003-08-08 Communication Research Laboratory Method and device for automatic summary evaluation and processing, and program therefor
JP2005157524A (en) * 2003-11-21 2005-06-16 National Institute Of Information & Communication Technology Question response system, and method for processing question response

Non-Patent Citations (1)

Title
MITSUMORI T. ET AL.: "Gene/protein name recognition based on support vector machine using dictionary as features", BMC BIOINFORMATICS 2005, vol. 6, no. SUPPL. 1, 24 May 2005 (2005-05-24), XP021001017 *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN103678681A (en) * 2013-12-25 2014-03-26 中国科学院深圳先进技术研究院 Self-adaptive parameter multiple kernel learning classification method based on large-scale data
CN104361224A (en) * 2014-10-31 2015-02-18 深圳信息职业技术学院 Confidence classification method and confidence machine
CN109791632A (en) * 2016-09-26 2019-05-21 国立研究开发法人情报通信研究机构 Scene segment classifier, scene classifier and the computer program for it
CN109791632B (en) * 2016-09-26 2023-07-21 国立研究开发法人情报通信研究机构 Scene segment classifier, scene classifier, and recording medium
JP2020052902A (en) * 2018-09-28 2020-04-02 株式会社東芝 Named entity extraction apparatus, method, and program
WO2020067313A1 (en) * 2018-09-28 2020-04-02 株式会社 東芝 Named entity extraction device, method, and storage medium
JP7286291B2 (en) 2018-09-28 2023-06-05 株式会社東芝 Named entity extraction device, method and program
US11868726B2 (en) 2018-09-28 2024-01-09 Kabushiki Kaisha Toshiba Named-entity extraction apparatus, method, and non-transitory computer readable storage medium
WO2020095655A1 (en) * 2018-11-05 2020-05-14 日本電信電話株式会社 Selection device and selection method
JP2020077054A (en) * 2018-11-05 2020-05-21 日本電信電話株式会社 Selection device and selection method

Also Published As

Publication number Publication date
CN101253497A (en) 2008-08-27
JP4565106B2 (en) 2010-10-20
JP2007004458A (en) 2007-01-11

Similar Documents

Publication Publication Date Title
JP4565106B2 (en) Binary relation extracting device, information retrieval device using binary relation extraction processing, binary relation extraction processing method, information retrieval processing method using binary relation extraction processing, binary relation extraction processing program, and information retrieval processing program using binary relation extraction processing
Arora et al. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
Oussous et al. ASA: A framework for Arabic sentiment analysis
Levy et al. Neural word embedding as implicit matrix factorization
Hermann et al. Semantic frame identification with distributed word representations
US9262406B1 (en) Semantic frame identification with distributed word representations
Shen et al. Voting between multiple data representations for text chunking
Fernandes et al. Learning from partially annotated sequences
Xing et al. Um-checker: A hybrid system for english grammatical error correction
Şenel et al. Measuring cross-lingual semantic similarity across European languages
Hu et al. Bootstrapping object coreferencing on the semantic web
Kocmi et al. SubGram: extending skip-gram word representation with substrings
Huang et al. Analyzing multiple medical corpora using word embedding
JP5366179B2 (en) Information importance estimation system, method and program
JP4895645B2 (en) Information search apparatus and information search program
Jaber et al. NER in English translation of hadith documents using classifiers combination
Choi et al. How to generate data for acronym detection and expansion
Kaewphan et al. TurkuNLP entry for interactive Bio-ID assignment
Suzdaltseva et al. De-identification of Medical Information for Forming Multimodal Datasets to Train Neural Networks.
Chen et al. Extract protein-protein interactions from the literature using support vector machines with feature selection
JP3780341B2 (en) Language analysis processing system and sentence conversion processing system
Pham Sensitive keyword detection on textual product data: an approximate dictionary matching and context-score approach
JP2008021093A (en) Sentence conversion processing system, translation processing system having sentence conversion function, voice recognition processing system having sentence conversion function, and speech synthesis processing system having sentence conversion function
Wang et al. Learning functional sections in medical conversations: iterative pseudo-labeling and human-in-the-loop approach
Stampolidou Extracting Local Features to Improve Transformer-based Biomedical Question Answering Models

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200680022356.9

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06780635

Country of ref document: EP

Kind code of ref document: A1