JP2010272006A

JP2010272006A - Relation extraction apparatus, relation extraction method and program

Info

Publication number: JP2010272006A
Application number: JP2009124403A
Authority: JP
Inventors: Yoshio Ishizawa; 善雄石澤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-05-22
Filing date: 2009-05-22
Publication date: 2010-12-02

Abstract

PROBLEM TO BE SOLVED: To provide a relation extraction apparatus, a relation extraction method, and a program that extract relations between words and sets of words satisfying those relations without preparing dictionaries.

SOLUTION: The relation extraction apparatus 11 extracts relations between words and sets of words satisfying those relations. It includes: a character string expression creation part 3 that extracts two or more elements as an element set from structured data containing a plurality of words as elements, and substitutes the extracted element set into a preset character string pattern representing a relation between words to create a plurality of character string expressions; and a character string expression determination part 4 that determines whether each of the created character string expressions appears in a document set, and outputs the character string expressions appearing in the document set as the relation between words and the sets of words satisfying the relation.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a relation extraction apparatus, a relation extraction method, and a program that are used in the field of information retrieval and that extract a relation between words and a set of words satisfying that relation.

In recent years, with the rapid development of the Internet, a large amount of information has been converted into electronic documents and made available to many people. As a result, demand is growing for effective use of the information buried in those documents. In making use of such a large volume of electronic documents, information retrieval from them is particularly important.

For example, to search for information about a certain company, a user normally sets the company name as a keyword and runs a search with a keyword search engine provided on the Internet. The necessary company information (number of employees, average annual income, and so on) can then be obtained from the Web pages that are hit.

However, conventional keyword search engines do not take the relation between words into account, and therefore cannot handle information extraction by attribute search. For example, when an attribute search is required, such as an exhaustive search for "companies with 2,000 or more employees" or for "companies with an average annual income of 5 million yen or more", it is difficult to obtain such information with a conventional keyword search engine.

This is because a conventional search engine accepts only keywords such as "number of employees" or "average annual income". These are general words, not keywords specific to Web pages that comprehensively summarize company information. Consequently, a large number of Web pages that contain the keywords but are not useful are also retrieved.

As a result, for example, manually finding the useful Web pages in a large set of Web pages and exhaustively extracting "companies with an average annual income of 5 million yen or more" places a heavy burden on the user. To perform such searches efficiently, methods have been considered such as analyzing documents and creating an index suited to the purpose in advance, or attaching tags that represent words and the relations between words to the documents.

Realizing such methods requires knowledge (a database) indicating what attributes and attribute values a word has. There is therefore a need for a technique that builds such knowledge from existing electronic documents, that is, a technique that extracts the relation between words. Here, an "attribute" of a word is a characteristic of that word. For example, if the word is a company name, its attributes include the company's address, telephone number, number of employees, and average annual income.

One technique for extracting the relation between words is disclosed in Patent Document 1. In that technique, morphological analysis and syntactic analysis are first performed on a document. In addition, a domain dictionary in which a domain (type) is registered for each word is used, and each morpheme is assigned a domain (type) such as a function type or a device type.

Next, character string patterns prepared in advance, each representing a specific relation between two words, are used to check whether a sentence whose words have been assigned domains matches any of them. Examples of such patterns include "B with an improved A part" (A and B are device types), which represents a component relation between two words, and "B capable of A" (A is a function type, B is a device type), which represents a processing-function relation between two words. As a result, the sets of two words that fit a character string pattern and the relation between the two words specified by that pattern are extracted.

Thus, the technique disclosed in Patent Document 1 makes it possible to build knowledge (a database) indicating relations between words and the sets of words satisfying those relations. With such knowledge, the attribute search described above is expected to be easy to perform.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H7-85041 (JP 7-85041 A)

However, the technique disclosed in Patent Document 1 requires a dictionary for morphological and syntactic analysis as well as a domain dictionary; in other words, dictionaries are needed to create the knowledge (database).

Moreover, in that technique, improving the accuracy with which relations between words and the words representing those relations are extracted requires a more complete dictionary, which in turn requires looking up and registering as many words as possible in advance. However, words such as company names and product names increase daily, so it is extremely difficult to survey a large number of words exhaustively every day or every week and store them in a dictionary.

For these reasons, there is a need for a technique that extracts the relation between words without requiring a domain dictionary or dictionaries for morphological and syntactic analysis.

An object of the present invention is to solve the above problems and to provide a relation extraction apparatus, a relation extraction method, and a program that can extract a relation between words and a set of words satisfying that relation without preparing a dictionary in advance.

To achieve the above object, a relation extraction apparatus according to the present invention is an apparatus that extracts a relation between words and a set of words satisfying that relation, the apparatus comprising:
a character string expression creation unit that extracts two or more elements as an element set from structured data including a plurality of words as elements, and applies the extracted element set to a preset character string pattern representing a relation between words to create a plurality of character string expressions; and
a character string expression determination unit that determines whether each of the created character string expressions appears in a document set, and outputs the character string expressions appearing in the document set as the relation between words and the set of words satisfying that relation.

To achieve the above object, a relation extraction method according to the present invention is a method for extracting a relation between words and a set of words satisfying that relation, the method comprising:
(a) extracting two or more elements as an element set from structured data including a plurality of words as elements, and applying the extracted element set to a preset character string pattern representing a relation between words to create a plurality of character string expressions; and
(b) determining whether each of the character string expressions created in step (a) appears in a document set, and outputting the character string expressions appearing in the document set as the relation between words and the set of words satisfying that relation.

Furthermore, to achieve the above object, a program according to the present invention is a program for causing a computer to extract a relation between words and a set of words satisfying that relation, the program causing the computer to execute:
(a) extracting two or more elements as an element set from structured data including a plurality of words as elements, and applying the extracted element set to a preset character string pattern representing a relation between words to create a plurality of character string expressions; and
(b) determining whether each of the character string expressions created in step (a) appears in a document set, and outputting the character string expressions appearing in the document set as the relation between words and the set of words satisfying that relation.

With the above features, the relation extraction apparatus, relation extraction method, and program of the present invention can extract a relation between words and a set of words satisfying that relation without preparing a dictionary in advance.

FIG. 1 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the structured data used in Embodiment 1 of the present invention.
FIG. 3 is a diagram showing examples of the data from which structured data is derived; FIGS. 3(a) and 3(b) show different examples.
FIG. 4 is a diagram showing an example of the character string patterns used in Embodiment 1 of the present invention.
FIG. 5 is a diagram showing an example of the character string expressions stored in the output data storage unit shown in FIG. 1.
FIG. 6 is a flowchart showing the operation of the relation extraction apparatus according to Embodiment 1 of the present invention.
FIG. 7 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing the operation of the relation extraction apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 3 of the present invention.
FIG. 10 is a flowchart showing the operation of the relation extraction apparatus according to Embodiment 3 of the present invention.
FIG. 11 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 4 of the present invention.
FIG. 12 is a flowchart showing the operation of the relation extraction apparatus according to Embodiment 4 of the present invention.
FIG. 13 is a block diagram showing the schematic configuration of another example of the relation extraction apparatus according to Embodiment 4 of the present invention.
FIG. 14 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 5 of the present invention.
FIG. 15 is a flowchart showing the operation of the relation extraction apparatus according to Embodiment 5 of the present invention.
FIG. 16 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 6 of the present invention.
FIG. 17 is a flowchart showing the operation of the boundary detection unit included in the relation extraction apparatus according to Embodiment 6 of the present invention.
FIG. 18 is a flowchart showing another example of the operation of the relation extraction apparatus according to Embodiment 1 of the present invention.

(Summary of the Invention)
The present invention uses structured data including a plurality of words as elements, character string patterns each representing a relation between words, and a document set for checking whether the character string expressions generated from the character string patterns exist. Examples of structured data include a table, which has a data structure in which a plurality of elements are arranged in rows and columns, and a bulleted list, in which a plurality of elements are enumerated according to a fixed rule.

Here, an element is, for example, a portion of the structured data enclosed between one delimiter and the next. When the structured data is a table, the content of each cell is an element; when it is a bulleted list, the content of each item is an element. FIG. 2, described later, shows an example in which the structured data is a table. In FIG. 2, the words written in the cells, such as "company name", "Company A", and "Company B", are the elements.
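The notion of an element as the text between two delimiters can be sketched as follows. This is only a minimal illustration, assuming a tab-separated rendering of a table; the function name `extract_elements`, the tab delimiter, and the cell value 1500 are hypothetical and do not come from the specification.

```python
from typing import List


def extract_elements(structured_text: str, delimiter: str = "\t") -> List[str]:
    """Return the non-empty text between delimiters (and line breaks) as elements."""
    elements = []
    for row in structured_text.splitlines():
        for cell in row.split(delimiter):
            cell = cell.strip()
            if cell:  # skip empty cells
                elements.append(cell)
    return elements


# Hypothetical table in the spirit of FIG. 2; the value 1500 is made up.
table = "company name\tnumber of employees\nCompany A\t1500\nCompany B\t2000"
print(extract_elements(table))
```

Each cell becomes one element regardless of whether it is a header or a value, which matches the description above: no dictionary is consulted to decide what kind of word a cell contains.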

An example of a character string pattern is "A's B is C" (where A, B, and C each represent a word). In this character string pattern, the relation among the elements A, B, and C is [target, attribute, attribute value] (see FIG. 4, described later).

In the present invention, to extract a relation between words and a set of words satisfying that relation, one character string pattern is selected first. Next, elements constituting the structured data are taken out of the structured data, with both the order and the combination varied. That is, a combination of elements to be applied to the selected character string pattern is taken from the set of all elements extracted from the structured data, and the order in which the elements were taken is preserved. Hereinafter, such an ordered combination is referred to as an element set.

For example, when the character string pattern "A's B is C" is selected, the pattern requires three elements. Three elements are therefore selected from the structured data, and they form an element set that preserves the order of selection. If the structured data is as shown in FIG. 2 and the elements "Company B", "number of employees", and "2000" are selected in that order from the set of all elements, the element set is (Company B, number of employees, 2000). If the elements are selected in the order "number of employees", "2000", "Company B", the element set is (number of employees, 2000, Company B).
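Because an element set preserves the order of selection, enumerating all element sets of a given size amounts to enumerating ordered selections without repetition. A minimal sketch using Python's standard `itertools.permutations` follows; the function name `ordered_element_sets` is an assumption made for illustration.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def ordered_element_sets(elements: Sequence[str], size: int) -> List[Tuple[str, ...]]:
    """Enumerate every ordered selection of `size` distinct elements."""
    return list(permutations(elements, size))


elements = ["Company B", "number of employees", "2000"]
sets_of_three = ordered_element_sets(elements, 3)
print(len(sets_of_three))  # 3! = 6 ordered selections
```

Both (Company B, number of employees, 2000) and (number of employees, 2000, Company B) are among the six results. The number of element sets grows quickly with the number of elements, so a practical implementation would probably need to limit the candidates; the passage above does not address this.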

Each element set selected in this way is then applied in turn to the character string pattern to generate a character string expression. For example, applying the element set (Company B, number of employees, 2000) to the pattern "A's B is C" creates the character string expression "[Company B]'s [number of employees] is [2000]". Applying the element set (number of employees, 2000, Company B) to the same pattern creates the character string expression "[number of employees]'s [2000] is [Company B]".
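Applying an element set to a character string pattern is a slot-filling operation. The sketch below assumes the pattern is written as a Python format string with named slots `{A}`, `{B}`, `{C}`; this encoding and the function name `apply_pattern` are illustrative choices, not part of the specification.

```python
from typing import Sequence


def apply_pattern(rule: str, element_set: Sequence[str]) -> str:
    """Fill the slots {A}, {B}, {C}, ... of `rule` with the elements in order."""
    slot_names = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return rule.format(**dict(zip(slot_names, element_set)))


rule = "{A}'s {B} is {C}"
print(apply_pattern(rule, ("Company B", "number of employees", "2000")))
print(apply_pattern(rule, ("number of employees", "2000", "Company B")))
```

The second expression is grammatical nonsense, which is the point: it is generated anyway and is expected to be filtered out later because it does not occur in any document.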

It is then determined whether each generated character string expression appears in the prepared document set. If the expression appears in the document set, the element set used in it is considered to represent appropriately the relation between words expressed by the character string pattern. In that case, the element set, the character string pattern, and the relation among the elements in that pattern are extracted from the character string expression. The same processing is then performed on another element set.

If, on the other hand, the expression does not appear in the document set, the element set used in it is considered not to represent that relation appropriately. That element set is not extracted, and the same processing is performed on another element set.

When every element set taken out for a given character string pattern has been applied to the pattern and checked against the documents in the document set, another character string pattern is selected and the above processing is repeated until all character string patterns have been processed. The character string expressions identified in this way (those that matched a document in the document set) correspond to the relations between words and the sets of words satisfying those relations.
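The overall procedure (select a pattern, enumerate element sets, generate expressions, keep those found in the document set) can be sketched as a pair of nested loops. The code below is a simplified illustration under stated assumptions: patterns are encoded as format strings with positional slots, "appears in the document set" is approximated by a plain substring test, and the function name `extract_relations` and the sample document are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, Tuple[str, ...]]  # (rule with {0}, {1}, ... slots, role names)


def extract_relations(
    elements: Sequence[str],
    patterns: Sequence[Pattern],
    documents: Sequence[str],
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (expression, role -> element) for every expression found in a document."""
    results = []
    for rule, roles in patterns:  # one character string pattern at a time
        for element_set in permutations(elements, len(roles)):
            expression = rule.format(*element_set)
            # Plain substring test stands in for "appears in the document set".
            if any(expression in document for document in documents):
                results.append((expression, dict(zip(roles, element_set))))
    return results


elements = ["company name", "Company A", "Company B", "number of employees", "2000"]
patterns = [("{0}'s {1} is {2}", ("target", "attribute", "attribute value"))]
documents = ["According to the report, Company B's number of employees is 2000."]
for expression, relation in extract_relations(elements, patterns, documents):
    print(expression, relation)
```

Of the 60 ordered triples generated from the five elements, only the one that matches the sample document survives, and it carries the roles target, attribute, and attribute value with it.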

As described above, the present invention can extract a relation between words and a set of words satisfying that relation without using a domain dictionary or the dictionaries required for morphological and syntactic analysis.

(Embodiment 1)
A relation extraction apparatus, a relation extraction method, and a program according to Embodiment 1 of the present invention are described below with reference to FIGS. 1 to 6. First, the configuration of the relation extraction apparatus according to Embodiment 1 is described with reference to FIG. 1. FIG. 1 is a block diagram showing the schematic configuration of the relation extraction apparatus according to Embodiment 1 of the present invention.

The relation extraction apparatus 11 shown in FIG. 1 extracts a relation between words and a set of words satisfying that relation. As shown in FIG. 1, it includes a character string expression creation unit 3 and a character string expression determination unit 4.

The character string expression creation unit 3 first extracts two or more elements as an element set from structured data including a plurality of words as elements. It then applies the extracted element set to a preset character string pattern representing a relation between words, and thereby creates a plurality of character string expressions.

The character string expression determination unit 4 determines whether each of the created character string expressions appears in the document set, and outputs the expressions that do appear as the relation between words and the set of words satisfying that relation.

In Embodiment 1, the relation extraction apparatus 11 further includes a structured data storage unit 1, a character string pattern storage unit 2, a document set storage unit 5, and an output data storage unit 6. The structured data, the character string patterns, and the document set are prepared in advance and stored in the structured data storage unit 1, the character string pattern storage unit 2, and the document set storage unit 5, respectively. The output data storage unit 6 stores the character string expressions that are output.
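The division of the apparatus into four storage units and two processing units can be mirrored in a small class. This is a rough sketch only: the class and method names are hypothetical, in-memory lists stand in for the storage units, and a substring test stands in for the appearance check.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Iterator, List, Tuple


@dataclass
class RelationExtractionApparatus:
    structured_data: List[List[str]]   # storage unit 1: one element list per table or list
    patterns: List[Tuple[str, int]]    # storage unit 2: (rule, number of slots)
    documents: List[str]               # storage unit 5: the document set
    output: List[str] = field(default_factory=list)  # storage unit 6

    def create_expressions(self) -> Iterator[str]:
        """Creation unit 3: apply every ordered element set to every pattern."""
        for elements in self.structured_data:
            for rule, slots in self.patterns:
                for element_set in permutations(elements, slots):
                    yield rule.format(*element_set)

    def run(self) -> List[str]:
        """Determination unit 4: keep the expressions that appear in a document."""
        for expression in self.create_expressions():
            if any(expression in document for document in self.documents):
                self.output.append(expression)
        return self.output


apparatus = RelationExtractionApparatus(
    structured_data=[["Company B", "number of employees", "2000"]],
    patterns=[("{0}'s {1} is {2}", 3)],
    documents=["Company B's number of employees is 2000."],
)
print(apparatus.run())
```

Keeping creation and determination as separate methods follows the description: unit 3 never looks at the documents, and unit 4 never looks at the structured data.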

In this way, the relation extraction apparatus 11 creates character string expressions from the structured data and the character string patterns, and determines, from whether each expression matches a document in the document set, whether the element set used in the expression appropriately represents the relation between words expressed by the character string pattern. The apparatus can thus make this determination without a domain dictionary or the dictionaries used for morphological and syntactic analysis. Accordingly, the relation extraction apparatus 11 can extract a relation between words and a set of words satisfying that relation without a dictionary prepared in advance, and the effort needed to create, maintain, and update dictionaries is eliminated.

The configuration of the relation extraction apparatus 11 is now described more specifically with reference to FIGS. 2 to 5. FIG. 2 is a diagram showing an example of the structured data used in Embodiment 1 of the present invention. FIG. 3 is a diagram showing examples of the data from which structured data is derived; FIGS. 3(a) and 3(b) show different examples. FIG. 4 is a diagram showing an example of the character string patterns used in Embodiment 1 of the present invention. FIG. 5 is a diagram showing an example of the character string expressions stored in the output data storage unit shown in FIG. 1.

In Embodiment 1, the structured data storage unit 1 stores, for example, the structured data shown in FIG. 2. The structured data in FIG. 2 is a table, with a data structure in which a plurality of elements are arranged in rows and columns. Each portion enclosed between delimiters, that is, the content of each cell of the table, is an element. The character string expression creation unit 3 selects elements from the cells to form an element set such as (Company B, number of employees, 2000).

The structured data shown in FIG. 2 can be obtained, for example, from a table published on a Web page. Specifically, as shown in FIG. 3(a), a table on a Web page (lower part) is written in HTML (upper part). In this case, the structured data storage unit 1 stores, as structured data, the HTML file that specifies the table.

In Embodiment 1, a bulleted list in which a plurality of elements are enumerated according to a fixed rule can also be used as structured data. A bulleted list can likewise be obtained from a Web page. Specifically, as shown in FIG. 3(b), a bulleted list (lower part) is also written in HTML (upper part), so the structured data storage unit 1 stores the HTML file that specifies the bulleted list.

In this embodiment, structured data may be collected manually or by a computer. In the latter case, the computer searches HTML files on the Internet for table elements or ul elements and collects the structured data. In FIGS. 3(a) and 3(b), the lower part shows how the data appears when displayed in a browser.
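Collecting elements from table and ul markup can be sketched with the `html.parser` module of the Python standard library. The class name `StructuredDataCollector` and the sample markup are assumptions for illustration; fetching pages from the Internet is out of scope here, and nested tables or nested lists are not handled.

```python
from html.parser import HTMLParser


class StructuredDataCollector(HTMLParser):
    """Collect the text of table cells (th, td) and list items (li) as elements."""

    CELL_TAGS = {"th", "td", "li"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._buffer = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.CELL_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.CELL_TAGS and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.elements.append(text)
            self._buffer = None


# Hypothetical markup; the actual contents of FIG. 3 are not reproduced here.
page = (
    "<table><tr><th>company name</th><th>number of employees</th></tr>"
    "<tr><td>Company B</td><td>2000</td></tr></table>"
    "<ul><li>item one</li><li>item two</li></ul>"
)
collector = StructuredDataCollector()
collector.feed(page)
print(collector.elements)
```

The tags themselves act as the delimiters: whatever text sits between an opening cell or item tag and its closing tag is taken as one element.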

The number of pieces of structured data stored in the structured data storage unit 1 is not particularly limited, and the unit can store a plurality of them. The stored structured data may consist only of tables, only of bulleted lists, or of both.

In Embodiment 1, the character string pattern storage unit 2 stores, for example, the character string patterns shown in FIG. 4. In the example of FIG. 4, each character string pattern consists of a character string rule and the relation among the elements applied to it. The "relation among the elements" expresses how each element relates to the others when the character string pattern with an element set applied (that is, the character string rule filled with the element set) matches a document in the document set and is therefore judged appropriate.
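A character string pattern as described here, a rule paired with the relation among its elements, maps naturally onto a small record type. The sketch below is an assumed encoding: the class name `StringPattern`, the format-string slot syntax, and the consistency check are illustrative and are not taken from FIG. 4.

```python
from dataclasses import dataclass
from string import Formatter
from typing import Tuple


@dataclass(frozen=True)
class StringPattern:
    rule: str                  # character string rule with named slots
    relation: Tuple[str, ...]  # role of each slot, in slot order

    def slots(self) -> Tuple[str, ...]:
        """Names of the slots that occur in the rule, in order of appearance."""
        return tuple(name for _, name, _, _ in Formatter().parse(self.rule) if name)

    def is_consistent(self) -> bool:
        """A pattern is usable only if it has exactly one role per slot."""
        return len(self.slots()) == len(self.relation)


attribute_pattern = StringPattern(
    "{A}'s {B} is {C}", ("target", "attribute", "attribute value")
)
print(attribute_pattern.slots(), attribute_pattern.is_consistent())
```

Storing the roles alongside the rule means that, once an expression is confirmed in the document set, each element can be labeled with its role without any further analysis.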

Furthermore, in Embodiment 1, the character string patterns may be created manually or by a computer. The latter case is described here. For example, to create a character string pattern consisting of a hypernym and a hyponym, such as No. 5 in FIG. 4, the computer first uses data prepared in advance in which hypernyms and hyponyms are paired (for example, (furniture, chair), (home appliance, refrigerator), and so on) to extract documents in which a pair occurs within a single sentence. The document set from which the documents are extracted may be one stored in a document database or one that exists on the Internet.

Next, the computer compares the sentences of the extracted documents, extracts their common part, and uses it as the character string rule. For example, when sentence 1, "a chair, which is furniture" (家具である椅子), and sentence 2, "a refrigerator, which is a home appliance" (家電製品である冷蔵庫), have been extracted, the computer extracts "B, which is A" (ＡであるＢ) as the character string rule. At this time, the relationship between the elements is (hypernym, hyponym), in accordance with the initial setting.
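
One possible way to obtain such a common part is sketched below. The sketch replaces each known pair with placeholders and keeps the templates shared by several pairs; this substitution approach, the function name `induce_rules`, and the `min_pairs` parameter are assumptions made for illustration, since the text only states that the common part of the sentences is extracted.

```python
from collections import Counter


def induce_rules(pairs, sentences, min_pairs=2):
    """Returns templates, with {A}/{B} placeholders, supported by at least min_pairs pairs.

    pairs:     (hypernym, hyponym) tuples prepared in advance
    sentences: candidate sentences taken from the document set
    """
    support = Counter()
    for hypernym, hyponym in pairs:
        templates = set()
        for sentence in sentences:
            # Only sentences that contain both words of the pair are used.
            if hypernym in sentence and hyponym in sentence:
                template = sentence.replace(hypernym, "{A}").replace(hyponym, "{B}")
                templates.add(template)
        # Each pair supports a given template at most once.
        support.update(templates)
    return [template for template, count in support.items() if count >= min_pairs]


# The two example sentences from the text (Japanese originals).
EXAMPLE_PAIRS = [("家具", "椅子"), ("家電製品", "冷蔵庫")]
EXAMPLE_SENTENCES = ["家具である椅子", "家電製品である冷蔵庫"]
```

With the example data, both sentences reduce to the same template, which corresponds to the rule ＡであるＢ with the relationship (hypernym, hyponym).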

In Embodiment 1, the number of character string patterns stored in the character string pattern storage unit 2 is not particularly limited and may be set so that the performance required of the relationship extraction apparatus 11 is obtained. Furthermore, in Embodiment 1, after the character string expression determination unit 4 has determined matches against the documents in the document set, the number of matches and the like may be obtained for each character string pattern, and the user may select which character string patterns to keep stored on the basis of those counts.

In Embodiment 1, the document set stored in the document set storage unit 5 consists of text data. As described later, the document set storage unit 5 can also store data in formats other than text data. The document set is acquired, manually or by a computer, from a mail server that accumulates e-mail, a database that stores various documents, Web sites on the Internet, and the like, and is stored in the document set storage unit 5.

In Embodiment 1, the character string expression creation unit 3 selects one character string pattern, picks out from the structured data element sets containing the corresponding number of elements while varying the combination and order, and applies the selected element sets one after another to the selected character string pattern. Character string expressions corresponding to the selected character string pattern are thereby created. The character string expression creation unit 3 performs the same processing for the other character string patterns to create character string expressions.

Furthermore, in Embodiment 1, when the structured data is the table shown in FIG. 2 and the number of elements that can be substituted into the character string pattern is three, the character string expression creation unit 3 can select an element set as follows. That is, the character string expression creation unit 3 selects one row and one column of the data structure (table), and extracts, as an element set, the element at the end of the selected row, the element at the end of the selected column, and the element at the intersection of the selected row and column.

Specifically, in FIG. 2, the character string expression creation unit 3 selects the "Company B" row and the "number of employees" column, and selects "Company B" at the end of the row, "number of employees" at the end of the column, and "2000" at the intersection. Such processing makes it easy to create appropriate character string expressions efficiently.
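
The row, column, and intersection selection can be sketched as follows, assuming the end of a row is its leftmost cell and the end of a column is its topmost cell. FIG. 2 is not reproduced here, so `SAMPLE_TABLE` is a hypothetical stand-in: the Company B values and the Company A employee count follow the text, while the remaining value is an assumption.

```python
# Hypothetical stand-in for the table of FIG. 2 (first row = column headers,
# first column = row labels). The value "850" is assumed for illustration.
SAMPLE_TABLE = [
    ["Company name", "Average annual income at age 35", "Number of employees"],
    ["Company A", "850", "1000"],
    ["Company B", "600", "2000"],
]


def header_cell_triples(table):
    """Returns (row label, column header, intersection cell) for every data cell."""
    header = table[0]
    triples = []
    for row in table[1:]:                  # select one row ...
        for col in range(1, len(header)):  # ... and one column
            triples.append((row[0], header[col], row[col]))
    return triples
```

For a table with r data rows and c data columns this yields r × c element sets, far fewer than enumerating every ordered triple of cells.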

In Embodiment 1, to determine whether a character string expression appears in the document set, the character string expression determination unit 4 determines whether any document in the document set contains a portion that exactly matches the character string expression. Specifically, the character string expression determination unit 4 determines whether a character code sequence identical to the character code sequence constituting the character string expression exists in a document in the document set. A character string expression that appears in the document set is one that has been judged correct as a result of querying the document set.

Furthermore, in Embodiment 1, when outputting a character string expression that appears in the document set, the character string expression determination unit 4 outputs the character string pattern and the element set that constitute the character string expression, as shown in FIG. 5. The content shown in FIG. 5 is stored as is by the output data storage unit 6. As a result, the output data storage unit 6 accumulates character string expressions judged to represent relationships between words appropriately, and the accumulated character string expressions can function as a database in which appropriate relationships between words and the sets of words satisfying those relationships are registered.

The database obtained in this way can be used as a dictionary for search query expansion, a dictionary for document analysis, a dictionary for expanding comparison relationships, and the like. In addition, to exhaustively extract "companies with an average annual income of 5 million yen or more", as mentioned in the Background Art section, it suffices to run a keyword search on this database with a keyword such as "average annual income" or "5 million yen". In this case, the character string expressions containing the keyword are extracted, so the exhaustive extraction the user wants is carried out. In other words, the database obtained in Embodiment 1 makes attribute search possible.
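
A keyword search over such a database might look like the sketch below. The record layout (a dictionary with `rule`, `relation`, and `elements` keys) and the sample records are hypothetical; the text only states that the character string pattern and the element set are stored.

```python
# Hypothetical records in the style of the stored output (pattern + element set).
SAMPLE_DATABASE = [
    {"rule": "the {B} of {A} is {C}",
     "relation": ("target", "attribute", "attribute value"),
     "elements": ("Company A", "average annual income", "6 million yen")},
    {"rule": "the {B} of {A} is {C}",
     "relation": ("target", "attribute", "attribute value"),
     "elements": ("Company B", "average annual income", "5 million yen")},
    {"rule": "the {B} of {A} is {C}",
     "relation": ("target", "attribute", "attribute value"),
     "elements": ("Company B", "number of employees", "2000")},
]


def keyword_search(database, keyword):
    """Returns the records in which any element contains the keyword."""
    return [record for record in database
            if any(keyword in element for element in record["elements"])]
```

Searching for "average annual income" returns every record that carries that attribute, from which the targets and attribute values can then be read off.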

The database obtained in Embodiment 1 can also be used for query expansion and for tagging for document retrieval. Query expansion means, for example, that when a certain term is searched for, terms of the same kind as that term are searched for at the same time. A tag for document retrieval is a tag attached to a document so that, for example, when travel agencies located in region X are searched for, unintended information, such as information on travel agencies that are planning trips to region X, is not retrieved. In this example, such tags include the location and the job type attached to the travel agencies in a document. Tags for document retrieval are not limited to this example.

Next, the operation of the relationship extraction apparatus 11 in Embodiment 1 of the present invention is described with reference to FIG. 6. FIG. 6 is a flowchart showing the operation of the relationship extraction apparatus in Embodiment 1 of the present invention. In Embodiment 1, the relationship extraction method of Embodiment 1 is carried out by operating the relationship extraction apparatus 11. The description of the relationship extraction method is therefore replaced by the following description of the operation of the relationship extraction apparatus 11. In the following description, FIGS. 1 to 5 are referred to as appropriate.

As shown in FIG. 6, the character string expression creation unit 3 first selects one character string pattern from among the character string patterns stored in the character string pattern storage unit 2 (step A1). Next, the character string expression creation unit 3 selects one piece of structured data from among the pieces of structured data stored in the structured data storage unit 1 (step A2), and then extracts from the selected structured data all element sets to be substituted into the character string pattern (step A3).

In step A3, the character string expression creation unit 3 extracts from the selected structured data all element sets that can be substituted into the character string pattern selected in step A1, varying the combination and order of the extracted elements.

Next, the character string expression creation unit 3 determines whether extraction of element sets has been completed for all pieces of structured data stored in the structured data storage unit 1 (step A4). If the determination in step A4 shows that extraction has not been completed for all pieces of structured data, the character string expression creation unit 3 executes steps A2 and A3 again. If, on the other hand, extraction has been completed for all pieces of structured data, the character string expression creation unit 3 executes step A5.

In step A5, the character string expression creation unit 3 applies the extracted element sets one after another to the character string pattern selected in step A1 to create character string expressions. Suppose, for example, that the character string pattern "the B of A is C" has been selected. This character string pattern requires three elements. Accordingly, if the structured data shown in FIG. 2 has been selected, the element sets extracted from it include (Company B, employees, 2000), (employees, 2000, Company B), (company name, Company A, Company B), (company name, average annual income at age 35, number of employees), and (600, 850, 3000).

Then, in step A5, the character string expression creation unit 3 applies the above element sets to the character string pattern. As a result, character string expressions such as "the [employees] of [Company B] is [2000]", "the [2000] of [employees] is [Company B]", and "the [Company A] of [company name] is [Company B]" are created. The character string expressions created in step A5 include some that are not correct Japanese, but these are removed in step A6.
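
Steps A3 and A5 can be illustrated with the following sketch, which enumerates ordered element sets and substitutes each into a rule. The `{A}`, `{B}`, `{C}` placeholder syntax and both function names are assumptions chosen for illustration.

```python
from itertools import permutations


def element_sets(elements, size):
    """Step A3: every ordered selection of `size` distinct elements."""
    return list(permutations(elements, size))


def apply_pattern(rule, element_set):
    """Step A5: substitutes an element set into a rule with {A}, {B}, {C}, ... placeholders."""
    names = "ABCDEFGH"  # placeholder names, assigned to the elements in order
    return rule.format(**dict(zip(names, element_set)))


def create_expressions(rule, elements, size):
    """Creates one character string expression per element set."""
    return [apply_pattern(rule, es) for es in element_sets(elements, size)]
```

For the three elements "Company B", "employees", and "2000", the rule "the {B} of {A} is {C}" produces six expressions, only one of which is well formed; the others are expected to be discarded when they fail to appear in the document set.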

Next, the character string expression determination unit 4 determines whether a created character string expression appears in the document set (step A6). If it determines in step A6 that the expression appears, the character string expression determination unit 4 outputs the character string pattern constituting the character string expression and the element set used, and has these data stored in the output data storage unit 6 (step A9). The character string expression determination unit 4 then executes step A7. If it determines in step A6 that the expression does not appear, the character string expression determination unit 4 likewise executes step A7.

In step A7, the character string expression determination unit 4 determines whether the determination of step A6 has been made for all character string expressions. If the determination in step A7 shows that step A6 has not been performed for all character string expressions, the character string expression determination unit 4 executes step A6 again. If, on the other hand, step A6 has been performed for all character string expressions, the character string expression determination unit 4 executes step A8.

In step A8, the character string expression determination unit 4 determines whether the processing of steps A1 to A7 and A9 has been performed for all character string patterns. If the determination in step A8 shows that the processing of steps A1 to A7 and A9 has not been completed for all character string patterns, the character string expression creation unit 3 executes step A1 again. If, on the other hand, the processing has been completed for all character string patterns, the processing in the relationship extraction apparatus 11 ends.
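
The overall flow of steps A1 to A9 can be summarized in the sketch below. It is a simplified illustration under several assumptions: patterns are given as (rule, relation, number of elements) tuples with `{A}`, `{B}`, `{C}` placeholders, each piece of structured data is flattened to a list of element strings, and documents are plain strings. The function name `extract_relations` is hypothetical.

```python
from itertools import permutations


def extract_relations(patterns, structured_data, documents):
    """Returns (rule, relation, element set) records whose expression appears in a document."""
    database = []                                    # plays the role of storage unit 6
    for rule, relation, arity in patterns:           # A1, repeated via A8
        candidate_sets = []
        for elements in structured_data:             # A2, repeated via A4
            candidate_sets.extend(permutations(elements, arity))  # A3
        for element_set in candidate_sets:           # A5 to A7
            expression = rule.format(**dict(zip("ABC", element_set)))  # A5
            # A6: exact match of the expression somewhere in the document set
            if any(expression in document for document in documents):
                database.append((rule, relation, element_set))         # A9
    return database
```

No dictionary is consulted at any point: an expression is kept solely because the same character sequence occurs in the document set.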

By executing steps A1 to A9, the database described above is built in the output data storage unit 6. Using this database makes attribute search, query expansion, and tagging for document retrieval possible.

The program in Embodiment 1 may be any program that causes a computer to execute steps A1 to A9 shown in FIG. 6. By installing this program on a computer and executing it, the relationship extraction apparatus 11 and the relationship extraction method in Embodiment 1 can be realized.

In this case, the CPU (Central Processing Unit) of the computer functions as the character string expression creation unit 3 and the character string expression determination unit 4 and performs the processing. Furthermore, when another computer (slave) is connected via a network or the like to the computer (master) on which the program is installed, the CPU of the slave may, on instruction from the master, function as either or both of the character string expression creation unit 3 and the character string expression determination unit 4.

Furthermore, the structured data storage unit 1, the character string pattern storage unit 2, the document set storage unit 5, and the output data storage unit 6 can be realized by a storage device, such as a hard disk, provided in a computer. The computer provided with the storage device may be a different computer connected, via a network or the like, to the computer on which the program of Embodiment 1 is installed.

As described above, in Embodiment 1, structured data is used as the input, and character string expressions are created from the character string patterns and the element sets of the structured data. Furthermore, the correctness of each character string expression is judged against the document set, and no dictionary is used in judging correctness. Therefore, according to Embodiment 1, relationships between words and the sets of words satisfying those relationships can be extracted without requiring a domain dictionary or the dictionaries used in morphological analysis and syntactic analysis.

(Embodiment 2)
Next, a relationship extraction apparatus, a relationship extraction method, and a program in Embodiment 2 of the present invention are described with reference to FIGS. 7 and 8. First, the configuration of the relationship extraction apparatus in Embodiment 2 is described with reference to FIG. 7. FIG. 7 is a block diagram showing the schematic configuration of the relationship extraction apparatus in Embodiment 2 of the present invention.

As shown in FIG. 7, the relationship extraction apparatus 12 in Embodiment 2 includes a threshold determination unit 7 in addition to the configuration of the relationship extraction apparatus 11 shown in FIG. 1. The threshold determination unit 7 obtains the frequency with which a character string expression output by the character string expression determination unit 4 appears in the document set. The threshold determination unit 7 then determines whether the obtained appearance frequency is equal to or greater than a preset threshold.

The output data storage unit 6 stores the constituent character string pattern and element set only for those character string expressions whose appearance frequency has been determined by the threshold determination unit 7 to be equal to or greater than the threshold. The threshold may be set as appropriate according to the performance required of the relationship extraction apparatus 12.

Furthermore, in Embodiment 2, the threshold determination may also be made using an index, other than the appearance frequency, that indicates the degree to which a character string expression is used. Apart from the points described above, the relationship extraction apparatus 12 in Embodiment 2 is configured in the same way as the relationship extraction apparatus 11 of Embodiment 1 shown in FIG. 1.

Next, the operation of the relationship extraction apparatus 12 in Embodiment 2 of the present invention is described with reference to FIG. 8. FIG. 8 is a flowchart showing the operation of the relationship extraction apparatus in Embodiment 2 of the present invention. In Embodiment 2 as well, the relationship extraction method of Embodiment 2 is carried out by operating the relationship extraction apparatus 12. The description of the relationship extraction method is therefore replaced by the following description of the operation of the relationship extraction apparatus 12. In the following description, FIG. 7 is referred to as appropriate.

In FIG. 8, steps B1 to B8 are the same as steps A1 to A8 of Embodiment 1 shown in FIG. 6. Embodiment 2 differs only in steps B9 and B10. The differences are described specifically below. Description of steps B1 to B8, which are the same as steps A1 to A8, is omitted.

When, in step B6, the character string expression determination unit 4 determines that a character string expression created in step B5 appears in the document set, the threshold determination unit 7 executes step B9. In step B9, the threshold determination unit 7 determines whether the frequency with which the character string expression created in step B5 appears in the documents is equal to or greater than the threshold.

If the determination in step B9 shows that the appearance frequency of the character string expression is not equal to or greater than the threshold, the character string expression determination unit 4 executes step B7. If, on the other hand, the appearance frequency is equal to or greater than the threshold, the output data storage unit 6 stores, out of the data that the character string expression determination unit 4 outputs on the basis of the determination result of step B6, only the character string patterns and element sets of character string expressions whose appearance frequency is equal to or greater than the threshold (step B10).
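
Steps B9 and B10 can be sketched as follows. Counting the total number of exact occurrences across all documents is an assumption made here; the text does not specify whether occurrences or containing documents are counted, and both function names are hypothetical.

```python
def appearance_frequency(expression, documents):
    """Total number of (non-overlapping) exact occurrences across all documents."""
    return sum(document.count(expression) for document in documents)


def filter_by_threshold(expressions, documents, threshold):
    """Step B9/B10: keeps only expressions whose frequency is at or above the threshold."""
    return [expression for expression in expressions
            if appearance_frequency(expression, documents) >= threshold]
```

Raising the threshold trades coverage for precision: expressions that occur only once, for example in a single erroneous sentence, are no longer stored.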

Thus, in Embodiment 2, a threshold is set for the appearance frequency of character string expressions, and only character string expressions with a high appearance frequency are selected and accumulated. This improves the accuracy with which relationships between words and the sets of words satisfying those relationships are extracted, and makes it possible to improve the reliability of dictionaries used for attribute search and the like.

The program in Embodiment 2 may be any program that causes a computer to execute steps B1 to B10 shown in FIG. 8. By installing this program on a computer and executing it, the relationship extraction apparatus 12 and the relationship extraction method can be realized in Embodiment 2, as in Embodiment 1.

In this case, the CPU (Central Processing Unit) of the computer functions as the character string expression creation unit 3, the character string expression determination unit 4, and the threshold determination unit 7 and performs the processing. Furthermore, when another computer (slave) is connected via a network to the computer (master) on which the program is installed, the CPU of the slave may, on instruction from the master, function as any or all of the character string expression creation unit 3, the character string expression determination unit 4, and the threshold determination unit 7.

Furthermore, the structured data storage unit 1, the character string pattern storage unit 2, the document set storage unit 5, and the output data storage unit 6 can be realized by a storage device, such as a hard disk, provided in a computer. The computer provided with the storage device may be a different computer connected, via a network or the like, to the computer on which the program of Embodiment 2 is installed.

(Embodiment 3)
Next, a relationship extraction apparatus, a relationship extraction method, and a program in Embodiment 3 of the present invention are described with reference to FIGS. 9 and 10. First, the configuration of the relationship extraction apparatus in Embodiment 3 is described with reference to FIG. 9. FIG. 9 is a block diagram showing the schematic configuration of the relationship extraction apparatus in Embodiment 3 of the present invention.

As shown in FIG. 9, the relationship extraction apparatus 13 in Embodiment 3 further includes an element set extraction unit 8 in addition to the configuration of the relationship extraction apparatus 11 shown in FIG. 1. The element set extraction unit 8 extracts, from the structured data, another element set whose elements stand in the same relationship to one another as those of the element set constituting a character string expression output by the character string expression determination unit 4. The element set extraction unit 8 also outputs the newly extracted element set to the output data storage unit 6, which stores it.

The processing executed by the element set extraction unit 8 is now described specifically. Suppose, for example, that the structured data is the table shown in FIG. 2 and that (Company B, number of employees, 2000) has been extracted as the element set of an output character string expression. Suppose also that the relationship between elements in the character string pattern is (target, attribute, attribute value). In this case, the element set extraction unit 8 performs the following processing to extract another element set whose elements stand in the same relationship as those of the already extracted element set.

The element set extraction unit 8 first identifies the row and the column that connect the elements of the already extracted element set, namely the row connecting "Company B" and "2000" and the column connecting "number of employees" and "2000". The element set extraction unit 8 then shifts the identified row by one in the vertical direction (column direction) of the table and shifts the identified column by one in the horizontal direction (row direction) of the table. It then extracts an element set from the shifted row and column such that the positional relationship of each element to the shifted row and column is the same as the positional relationship of each element to the row and column before the shift.

In this way, element sets having the same relationship as the element set (Company B, number of employees, 2000), such as (Company A, number of employees, 1000) and (Company B, average annual income at age 35, 600), are extracted. Suppose that a new element set extracted in this way is applied to the character string pattern of the character string expression that has already been output, and a new character string expression is created. In this case, the new character string expression does not appear in the document set, but it is considered to be just as appropriate as the character string expressions that do appear.
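
The shifting of the row and column can be sketched as below, generalized from a shift of one to every other row and column position. The sketch assumes that the target sits in the leftmost column and the attribute in the top row; the table is a hypothetical stand-in for FIG. 2 in which only the value "850" is not taken from the text.

```python
# Hypothetical stand-in for the table of FIG. 2; "850" is an assumed value.
SAMPLE_TABLE = [
    ["Company name", "Average annual income at age 35", "Number of employees"],
    ["Company A", "850", "1000"],
    ["Company B", "600", "2000"],
]


def sibling_element_sets(table, row, col):
    """Returns element sets with the same layout as the confirmed one at (row, col).

    The confirmed set is (table[row][0], table[0][col], table[row][col]);
    every other data cell yields a set with the same positional relationship.
    """
    siblings = []
    for r in range(1, len(table)):
        for c in range(1, len(table[0])):
            if (r, c) != (row, col):
                siblings.append((table[r][0], table[0][c], table[r][c]))
    return siblings
```

Starting from the confirmed set at row 2, column 2, that is (Company B, Number of employees, 2000), the sketch yields the sets for the remaining three cells, including those named in the text.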

Next, the operation of the relationship extraction apparatus 13 in Embodiment 3 of the present invention is described with reference to FIG. 10. FIG. 10 is a flowchart showing the operation of the relationship extraction apparatus in Embodiment 3 of the present invention. In Embodiment 3 as well, the relationship extraction method of Embodiment 3 is carried out by operating the relationship extraction apparatus 13. The description of the relationship extraction method is therefore replaced by the following description of the operation of the relationship extraction apparatus 13. In the following description, FIG. 9 is referred to as appropriate.

In FIG. 10, steps C1 to C8 are the same as steps A1 to A8 shown in FIG. 6 for Embodiment 1. Embodiment 3 differs only in steps C9 and C10. The differences are described in detail below. Description of steps C1 to C8, which are the same as steps A1 to A8, is omitted.

When the character string expression determination unit 4 determines in step C6 that the character string expression created in step C5 appears in the document set, the element set extraction unit 8 executes step C9. In step C9, the element set extraction unit 8 takes the character string expression that the character string expression determination unit 4 outputs based on the determination result of step C6, and extracts from the structured data another element set whose elements have the same relationship to one another as those of the element set of this character string expression.

After that, the character string expression (character string pattern and element set) output by the character string expression determination unit 4 based on the determination result of step C6 and the element set extracted in step C9 are stored in the output data storage unit 6 (step C10).

As described above, in Embodiment 3, element sets that could not be extracted, for example because they do not appear in the document set, are newly extracted based on an element set that was extracted and the relationship among its constituent elements. According to Embodiment 3, more element sets can therefore be registered in a dictionary used for attribute search and the like, which makes it possible to improve the reliability of the dictionary.

The program in Embodiment 3 may be any program that causes a computer to execute steps C1 to C10 shown in FIG. 10. By installing this program on a computer and executing it, the relationship extraction device 13 and the relationship extraction method can be realized in Embodiment 3 as in Embodiment 1.

In this case, the CPU (Central Processing Unit) of the computer functions as the character string expression creation unit 3, the character string expression determination unit 4, and the element set extraction unit 8, and performs the processing. Furthermore, when another computer (slave) is connected via a network or the like to the computer (master) on which the program is installed, the CPU of the slave may, under instruction from the master, function as any or all of the character string expression creation unit 3, the character string expression determination unit 4, and the element set extraction unit 8.

Further, the structured data storage unit 1, the character string pattern storage unit 2, the document set storage unit 5, and the output data storage unit 6 can be realized by a storage device such as a hard disk provided in a computer. The computer provided with the storage device may be a separate computer connected via a network or the like to the computer on which the program of Embodiment 3 is installed.

(Embodiment 4)
Next, a relationship extraction device, a relationship extraction method, and a program according to Embodiment 4 of the present invention will be described with reference to FIGS. 11 and 12. First, the configuration of the relationship extraction device in Embodiment 4 is described with reference to FIG. 11. FIG. 11 is a block diagram showing a schematic configuration of the relationship extraction device according to Embodiment 4 of the present invention.

As shown in FIG. 11, the relationship extraction device 14 in Embodiment 4 further includes a boundary detection unit 9 in addition to the configuration of the relationship extraction device 11 shown in FIG. 1. When the structured data is a table, the boundary detection unit 9 can detect a semantic boundary between rows of the data structure, between columns, or both.

For example, when there is a row in which the proportion of elements that are numbers is higher than in an adjacent row, the boundary detection unit 9 determines that a boundary exists between those two rows. Likewise, when there is a column in which the proportion of elements that are numbers is higher than in an adjacent column, the boundary detection unit 9 determines that a boundary exists between those two columns. In Embodiment 4, the character string expression creation unit 3 extracts element sets based on the semantic boundaries and on the relationship between elements in the character string pattern (see FIG. 4).

Specifically, if the structured data is the table shown in FIG. 2, the boundary detection unit 9 determines that a boundary exists between the first row and the second row. Suppose, for example, that the character string pattern "B of A is C" (「ＡのＢはＣである」), with (A, B, C) = (target, attribute, attribute value), is selected. In this case, because A, B, and C are semantically different types, the character string expression creation unit 3 extracts the elements from both one side and the other side of the boundary when extracting an element set.
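The numeric-ratio heuristic can be sketched in Python as follows. This is only an illustration under stated assumptions: the description does not say how large the difference between adjacent rows or columns must be, so the `min_gap` parameter, the function names, and the table contents (including the placeholder value 550) are hypothetical.

```python
# Hypothetical table modeled on the example; "550" is a placeholder value.
TABLE = [
    ["", "Number of employees", "Average annual income at age 35"],
    ["Company A", "1000", "550"],
    ["Company B", "2000", "600"],
]


def is_number(text):
    """True for plain numbers such as "2000", "1,000" or "3.5"."""
    return text.replace(",", "").replace(".", "", 1).isdigit()


def numeric_ratio(cells):
    """Fraction of the non-empty cells in one row or column that are numbers."""
    filled = [cell for cell in cells if cell.strip()]
    if not filled:
        return 0.0
    return sum(is_number(cell) for cell in filled) / len(filled)


def detect_boundaries(lines, min_gap=0.5):
    """Return every index i for which a boundary is assumed to lie between
    lines[i] and lines[i + 1]. min_gap is an assumed threshold."""
    ratios = [numeric_ratio(line) for line in lines]
    return [
        i for i in range(len(ratios) - 1)
        if abs(ratios[i] - ratios[i + 1]) >= min_gap
    ]


ROW_BOUNDARIES = detect_boundaries(TABLE)
# Columns are obtained by transposing the table.
COLUMN_BOUNDARIES = detect_boundaries([list(col) for col in zip(*TABLE)])
```

For this table the sketch places one boundary after the first row and one after the first column, which matches the statement that the boundary lies between the first and second rows.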

As another example, when the character string pattern "A such as B, C, D" (「Ｂ、Ｃ、ＤなどのＡ」), with (A, B, C, D) = (hypernym (broader term), hyponym (narrower term), hyponym, hyponym), is selected, the character string expression creation unit 3 extracts B, C, and D from the same side of the boundary because they are all hyponyms. Because A is a hypernym and therefore of a different type from B, C, and D, the character string expression creation unit 3 extracts it from the other side of the boundary.
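How a boundary narrows the candidates for this hypernym pattern can be sketched as below. The function name `hypernym_candidates`, the English rendering of the pattern, and the sample words are hypothetical; the sketch only assumes that the table has already been split into two sides by a detected boundary.

```python
from itertools import combinations

# Assumed English rendering of the pattern "A such as B, C, D".
PATTERN = "{0} such as {1}, {2}, {3}"


def hypernym_candidates(side_a, side_b, n_hyponyms=3):
    """Take the hypernym from one side of the boundary and all hyponyms
    from the other side, trying both orientations."""
    candidates = []
    for upper_side, lower_side in ((side_a, side_b), (side_b, side_a)):
        for hypernym in upper_side:
            for hyponyms in combinations(lower_side, n_hyponyms):
                candidates.append((hypernym, *hyponyms))
    return candidates


# Hypothetical single-column table: a heading above the boundary, cells below.
HEADING_SIDE = ["Fruit"]
BODY_SIDE = ["apple", "orange", "grape"]

CANDIDATES = hypernym_candidates(HEADING_SIDE, BODY_SIDE)
EXPRESSIONS = [PATTERN.format(*elements) for elements in CANDIDATES]
```

Without the boundary, every ordered choice of four cells would have to be tried. With it, only element sets that respect the same-side and other-side constraint are generated, which is the reduction in useless element sets that this embodiment aims at.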

Next, the operation of the relationship extraction device 14 according to Embodiment 4 of the present invention will be described with reference to FIG. 12. FIG. 12 is a flowchart showing the operation of the relationship extraction device according to Embodiment 4 of the present invention. In Embodiment 4 as well, the relationship extraction method according to Embodiment 4 is carried out by operating the relationship extraction device 14. The description of the relationship extraction method is therefore replaced by the following description of the operation of the relationship extraction device 14. In the following description, FIG. 11 is referred to as appropriate.

In FIG. 12, steps D1, D2, and D5 to D10 are the same as steps A1, A2, and A4 to A9 shown in FIG. 6 for Embodiment 1. Embodiment 4 differs from Embodiment 1 in steps D3 and D4. The differences are described in detail below. Description of steps D1, D2, and D5 to D10, which correspond to steps A1, A2, and A4 to A9, is omitted.

In Embodiment 4, when the character string expression creation unit 3 selects one piece of structured data (step D2), the boundary detection unit 9 detects semantic boundaries between rows of the data structure, between columns, or both (step D3). Next, the character string expression creation unit 3 extracts element sets as in step A3, but in Embodiment 4 it does so taking into account the semantic boundaries and the relationship between elements in the character string pattern (see FIG. 4) (step D4).

Thus, in Embodiment 4, the element set extraction process refers to the detected boundaries and to the relationship between elements (the type of each element) set in the character string pattern. According to Embodiment 4, extraction of useless element sets can therefore be avoided, and as a result the accuracy of element set extraction is improved.

In addition, because the number of element sets extracted by the character string expression creation unit 3 can be reduced, Embodiment 4 also shortens the time the character string expression creation unit 3 spends extracting element sets and the time the character string expression determination unit 4 spends determining character string expressions.

The program in Embodiment 4 may be any program that causes a computer to execute steps D1 to D10 shown in FIG. 12. By installing this program on a computer and executing it, the relationship extraction device 14 and the relationship extraction method can be realized in Embodiment 4 as in Embodiment 1.

In this case, the CPU (Central Processing Unit) of the computer functions as the character string expression creation unit 3, the character string expression determination unit 4, and the boundary detection unit 9, and performs the processing. Furthermore, when another computer (slave) is connected via a network or the like to the computer (master) on which the program is installed, the CPU of the slave may, under instruction from the master, function as any or all of the character string expression creation unit 3, the character string expression determination unit 4, and the boundary detection unit 9.

Further, the structured data storage unit 1, the character string pattern storage unit 2, the document set storage unit 5, and the output data storage unit 6 can be realized by a storage device such as a hard disk provided in a computer. The computer provided with the storage device may be a separate computer connected via a network or the like to the computer on which the program of Embodiment 4 is installed.

In Embodiment 4, the relationship extraction device 15 shown in FIG. 13 can also be used. FIG. 13 is a block diagram showing a schematic configuration of another example of the relationship extraction device according to Embodiment 4 of the present invention. The relationship extraction device 15 in the example of FIG. 13 includes, in addition to the boundary detection unit 9 shown in FIG. 11, the threshold determination unit 7 shown in FIG. 7 for Embodiment 2 and the element set extraction unit 8 shown in FIG. 8 for Embodiment 3.

With the relationship extraction device 15 shown in FIG. 13, the effects described in Embodiments 2 and 3 can also be obtained, and a large number of element sets can be stored at high speed in the output data storage unit 6 while maintaining the accuracy with which element sets and the relationships between their elements are extracted. The relationship extraction device 15 may instead include only two of the threshold determination unit 7, the element set extraction unit 8, and the boundary detection unit 9. Even in this case, effects corresponding to the configuration are obtained.

(Embodiment 5)
Next, a relationship extraction device, a relationship extraction method, and a program according to Embodiment 5 of the present invention will be described with reference to FIGS. 14 and 15. First, the configuration of the relationship extraction device in Embodiment 5 is described with reference to FIG. 14. FIG. 14 is a block diagram showing a schematic configuration of the relationship extraction device according to Embodiment 5 of the present invention.

As shown in FIG. 14, the relationship extraction device 16 in Embodiment 5 further includes a binary data conversion unit 10 in addition to the configuration of the relationship extraction device 11 shown in FIG. 1. In Embodiment 5, the structured data storage unit 1 also stores, as structured data, binary-format data other than text-format data, for example data that specifies text by an image rather than by character codes. Data that contains character codes, such as the HTML files mentioned in Embodiment 1, is included in the text-format data described above and is excluded from the binary-format data.

The binary data conversion unit 10 can perform conversion processing on binary-format structured data and convert it into text-format data. According to Embodiment 5, element sets can therefore also be extracted from binary-format structured data. The conversion processing performed by the binary data conversion unit 10 is not particularly limited, and an existing conversion method can be used.

Next, the operation of the relationship extraction device 16 according to Embodiment 5 of the present invention will be described with reference to FIG. 15. FIG. 15 is a flowchart showing the operation of the relationship extraction device according to Embodiment 5 of the present invention. In Embodiment 5 as well, the relationship extraction method according to Embodiment 5 is carried out by operating the relationship extraction device 16. The description of the relationship extraction method is therefore replaced by the following description of the operation of the relationship extraction device 16. In the following description, FIG. 14 is referred to as appropriate.

In FIG. 15, steps E2 to E10 are the same as steps A1 to A9 shown in FIG. 6 for Embodiment 1. Embodiment 5 differs from Embodiment 1 in that step E1 is executed. The difference is described in detail below. Description of steps E2 to E10, which are the same as steps A1 to A9, is omitted.

In Embodiment 5, the binary data conversion unit 10 first detects binary-format structured data in the structured data storage unit 1 and executes conversion processing on the detected structured data (step E1). As a result, all binary-format data stored in the structured data storage unit 1 is converted into text-format data. After that, steps E2 to E10 are executed in the same manner as steps A1 to A9 shown in FIG. 6 for Embodiment 1.

The program in Embodiment 5 may be any program that causes a computer to execute steps E1 to E10 shown in FIG. 15. By installing this program on a computer and executing it, the relationship extraction device 16 and the relationship extraction method can be realized in Embodiment 5 as in Embodiment 1.

In this case, the CPU (Central Processing Unit) of the computer functions as the character string expression creation unit 3, the character string expression determination unit 4, and the binary data conversion unit 10, and performs the processing. Furthermore, when another computer (slave) is connected via a network or the like to the computer (master) on which the program is installed, the CPU of the slave may, under instruction from the master, function as any or all of the character string expression creation unit 3, the character string expression determination unit 4, and the binary data conversion unit 10.

Further, the structured data storage unit 1, the character string pattern storage unit 2, the document set storage unit 5, and the output data storage unit 6 can be realized by a storage device such as a hard disk provided in a computer. The computer provided with the storage device may be a separate computer connected via a network or the like to the computer on which the program of Embodiment 5 is installed.

(Embodiment 6)
Next, a relationship extraction device, a relationship extraction method, and a program according to Embodiment 6 of the present invention will be described with reference to FIGS. 16 and 17. First, the configuration of the relationship extraction device in Embodiment 6 is described with reference to FIG. 16. FIG. 16 is a block diagram showing a schematic configuration of the relationship extraction device according to Embodiment 6 of the present invention.

Like the relationship extraction device 14 shown in FIG. 11 for Embodiment 4, the relationship extraction device 17 in Embodiment 6 includes a boundary detection unit 9. In Embodiment 6, however, the boundary detection unit 9 has the following function in addition to the functions described in Embodiment 4.

In Embodiment 6, based on the character string expressions output by the character string expression determination unit 4, the boundary detection unit 9 detects, for each element contained in those expressions, its position in the data structure, its frequency of appearance in the output character string expressions, and its position in the character string pattern when it appears. For example, elements that correspond to a target, an attribute, a hypernym (broader term), or a hyponym (narrower term) in a character string pattern tend to be item headings in a table and tend to appear in the same row or the same column. Embodiment 6 makes use of this tendency.

Specifically, for the element sets stored in the output data storage unit 6, the boundary detection unit 9 detects, element by element, whether the element corresponds to a target, an attribute, a hypernym, or a hyponym (its position in the character string pattern), the element's frequency of appearance, and its position in the table. The boundary detection unit 9 then uses the detected information to analyze the table along the row direction or the column direction. This identifies the rows or columns in which elements corresponding to a target, an attribute, a hypernym, or a hyponym appear frequently and the rows or columns in which they do not.

The boundary detection unit 9 then treats the position between the former and the latter as a boundary candidate, computes the ratio between the element appearance frequencies of the former and the latter, and, if the computed ratio exceeds a threshold, determines that the boundary candidate is a boundary. Boundaries can also be detected by processing other than that described above. The detected boundaries are of the same kind as the boundaries described in Embodiment 4.
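The frequency-ratio test can be sketched as follows. The description gives neither the threshold nor the rule for deciding which lines count as frequent, so `ratio_threshold`, `frequent_min`, the flooring of the denominator at 1, and the sample counts are all assumptions made for illustration.

```python
def boundaries_from_frequencies(freqs, ratio_threshold=3.0, frequent_min=1):
    """freqs[i] is the number of times the elements of row (or column) i
    appeared as a target, attribute, hypernym, or hyponym in the stored
    output. Returns every index i for which a boundary is assumed to lie
    between line i and line i + 1."""
    boundaries = []
    for i in range(len(freqs) - 1):
        high = max(freqs[i], freqs[i + 1])
        low = min(freqs[i], freqs[i + 1])
        # Boundary candidate: one line is frequent and the two lines differ.
        if high < frequent_min or high == low:
            continue
        # Assumption: floor the denominator at 1 to avoid division by zero.
        ratio = high / max(low, 1)
        if ratio > ratio_threshold:
            boundaries.append(i)
    return boundaries


# Hypothetical counts for a three-row table whose first row holds headings.
ROW_FREQUENCIES = [8, 1, 0]
BOUNDARIES = boundaries_from_frequencies(ROW_FREQUENCIES)
```

Here the first row is eight times as frequent as the second, so the candidate between them is accepted, while the candidate between the second and third rows is rejected because its ratio stays below the assumed threshold.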

Next, the operation of the relationship extraction device 17 according to Embodiment 6 of the present invention will be described. In Embodiment 6, the operation of the relationship extraction device 17 is the same as the operation of the relationship extraction device 14 shown in FIG. 12 for Embodiment 4, except for the operation of the boundary detection unit 9. That is, the relationship extraction device 17 executes steps D1, D2, and D4 to D10 shown in FIG. 12, that is, all steps except step D3.

For this reason, the operation of the boundary detection unit 9 is described below with reference to FIG. 17. FIG. 17 is a flowchart showing the operation of the boundary detection unit of the relationship extraction device according to Embodiment 6 of the present invention. In Embodiment 6 as well, the relationship extraction method according to Embodiment 6 is carried out by operating the relationship extraction device 17.

As shown in FIG. 17, once structured data has been selected in step D2 shown in FIG. 12, the boundary detection unit 9 first retrieves the element sets stored in the output data storage unit 6 (step D31). Next, for each element, the boundary detection unit 9 detects its position in the character string pattern, its frequency of appearance, and its position in the table (step D32).

Next, using the detected information, the boundary detection unit 9 finds the rows or columns in which elements corresponding to a target, an attribute, a hypernym, or a hyponym appear frequently and the rows or columns in which they do not, and identifies boundary candidates (step D33).

Next, for a row or column in which the specific elements appear frequently and a row or column in which they do not, the boundary detection unit 9 calculates the ratio of the appearance frequencies of the specific elements (step D34). The boundary detection unit 9 then determines whether the ratio calculated in step D34 exceeds a threshold (step D35).

If the result of the determination in step D35 is that the ratio does not exceed the threshold, the boundary detection unit 9 executes step D37, described later. If the result of the determination in step D35 is that the ratio exceeds the threshold, the boundary detection unit 9 determines that the boundary candidate under examination is a boundary (step D36).

Next, the boundary detection unit 9 notifies the character string expression creation unit 3 of information specifying the boundary and determines whether the determination has been completed for all boundary candidates (step D37). If the result of step D37 is that the determination has not been completed for all boundary candidates, the boundary detection unit 9 executes step D35 again.

If the result of step D37 is that the determination has been completed for all boundary candidates, the boundary detection unit 9 ends its processing. The character string expression creation unit 3 then executes step D4 shown in FIG. 12. In Embodiment 6, when the output data storage unit 6 does not yet store any element sets, the boundary detection unit 9 can instead execute the processing described in Embodiment 4.

As described above, Embodiment 6 allows boundaries to be detected with higher accuracy, so that extraction of useless element sets can be avoided more reliably and, as a result, the accuracy of element set extraction is improved further.

The program in Embodiment 6 may be any program that causes a computer to execute steps D1, D2, and D4 to D10 shown in FIG. 12 and steps D31 to D37 shown in FIG. 17. By installing this program on a computer and executing it, the relationship extraction device 17 and the relationship extraction method can be realized in Embodiment 6 as in Embodiment 1.

Further, the structured data storage unit 1, the character string pattern storage unit 2, the document set storage unit 5, and the output data storage unit 6 can be realized by a storage device such as a hard disk provided in a computer. The computer provided with the storage device may be a separate computer connected via a network or the like to the computer on which the program of Embodiment 6 is installed.

In Embodiments 1 to 6 described above, each relationship extraction device first selects one character string pattern and then creates and determines character string expressions for each selected character string pattern, but the present invention is not limited to this. For example, the relationship extraction device can instead first select structured data and then create and determine character string expressions for each selected piece of structured data. This point is explained with reference to FIG. 18, using another example of Embodiment 1.

FIG. 18 is a flowchart showing another example of the operation of the relationship extraction device according to Embodiment 1 of the present invention. In the example of FIG. 18, the character string expression creation unit 3 first selects one piece of structured data from among the pieces of structured data stored in the structured data storage unit 1 (step F1). Step F1 is the same as step A2 shown in FIG. 6.

Next, the character string expression creation unit 3 selects one character string pattern from among the character string patterns stored in the character string pattern storage unit 2 (step F2). Step F2 is the same as step A1 shown in FIG. 6.

The character string expression creation unit 3 then extracts, from the structured data selected in step F1, all element sets that can be fitted to the character string pattern selected in step F2 (step F3). Step F3 is the same as step A3 shown in FIG. 6.

Next, the character string expression creation unit 3 determines whether extraction of element sets has been completed for all character string patterns stored in the character string pattern storage unit 2 (step F4). If the result of step F4 is that extraction has not been completed for all character string patterns, the character string expression creation unit 3 executes steps F2 and F3 again. If the result of step F4 is that extraction has been completed for all character string patterns, the character string expression creation unit 3 executes step F5.

ステップＦ５では、文字列表現作成部３は、抽出された各要素集合を、ステップＦ２で選択された各文字列パターンに順次適用し、文字列表現を作成する。このように、図１８の例では、図６に示した例と異なり、構造化データ毎に、文字列表現が作成される。なお、作成される文字列表現は、図６に示したステップＡ５で作成される文字列表現と同様となる。 In step F5, the character string expression creating unit 3 sequentially applies each extracted element set to each character string pattern selected in step F2, thereby creating character string expressions. Thus, unlike the example shown in FIG. 6, in the example of FIG. 18 character string expressions are created for each piece of structured data. The character string expressions created are the same as those created in step A5 shown in FIG. 6.
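
As an illustrative sketch of step F5 only, the following Python fragment fills each character string pattern with each extracted element set of matching size. The `{0}`-style placeholder notation, the pairing of a pattern with its slot count, and the name `create_expressions` are assumptions made for illustration; the specification does not prescribe a pattern notation.

```python
def create_expressions(patterns, element_sets):
    """Sketch of step F5: apply each element set to each pattern.

    patterns     -- list of (pattern, n_slots) pairs; the "{0} ... {1}"
                    placeholder syntax is an assumption of this sketch
    element_sets -- list of tuples of words taken from structured data
    Returns a list of (expression, pattern, element_set) triples so that
    the pattern and element set can be output later if the expression
    is found in the document set.
    """
    expressions = []
    for pattern, n_slots in patterns:
        for elements in element_sets:
            # Only element sets with as many words as the pattern has
            # slots can be applied to that pattern.
            if len(elements) == n_slots:
                expressions.append((pattern.format(*elements), pattern, elements))
    return expressions
```

Keeping the pattern and element set alongside each generated expression reflects the later output step, in which the pattern and the element set, rather than only the expression text, are stored.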

次に、文字列表現判定部４は、文字列表現が文書集合中に出現しているかどうかの判定（ステップＦ６）、及び全ての文字列表現についてステップＦ６の判定が行われているかどうかの判定（ステップＦ７）を順次実行する。そして、文字列表現判定部４は、ステップＦ６において、出現していると判定する場合は、文字列表現を構成している文字列パターンと、使用されている要素集合とを出力し、これらのデータを出力データ記憶部６に記憶させる（ステップＦ９）。 Next, the character string expression determination unit 4 sequentially determines whether a character string expression appears in the document set (step F6) and whether the determination in step F6 has been made for all character string expressions (step F7). When the character string expression determination unit 4 determines in step F6 that a character string expression appears, it outputs the character string pattern constituting that expression and the element set used in it, and stores these data in the output data storage unit 6 (step F9).

ステップＦ６、Ｆ７、及びＦ９は、それぞれ、図６に示したステップＡ６、Ａ７、及びＡ９と同様ステップである。図１８の例においても、図６の例と同様に、作成された文字列表現が、文書集合中に出現しているかどうかを判定される。そして、出現している文字列表現は、出力データ記憶部６に記憶される。 Steps F6, F7, and F9 are the same as steps A6, A7, and A9 shown in FIG. 6, respectively. Also in the example of FIG. 18, as in the example of FIG. 6, it is determined whether the created character string expression appears in the document set. The appearing character string representation is stored in the output data storage unit 6.

その後、文字列表現判定部４は、ステップＦ７の判定の結果、全ての文字列表現について、ステップＦ６の判定が行われている場合は、ステップＦ８を実行する。ステップＦ８では、文字列表現判定部４は、全ての構造化データについてステップＦ１〜Ｆ７及びＦ９の処理が行われているかどうかを判定する。 Thereafter, the character string expression determination unit 4 executes Step F8 when the determination in Step F6 is performed for all character string expressions as a result of the determination in Step F7. In step F8, the character string expression determination unit 4 determines whether or not the processes of steps F1 to F7 and F9 are performed on all structured data.

ステップＦ８の判定の結果、全ての構造化データについてステップＦ１〜Ｆ７及びＦ９の処理が終了していない場合は、再度、文字列表現作成部３によってステップＦ１が実行される。一方、ステップＦ８の判定の結果、全ての構造化データについてステップＦ１〜Ｆ７及びＦ９の処理が終了している場合は、関係抽出装置１１における処理は終了する。 If the result of the determination in step F8 is that the processes of steps F1 to F7 and F9 have not been completed for all structured data, the character string expression creating unit 3 executes step F1 again. On the other hand, if the result of the determination in step F8 is that the processes of steps F1 to F7 and F9 have been completed for all structured data, the processing in the relationship extraction device 11 ends.
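
The flow of steps F1 to F9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: structured data is assumed to be a table (list of rows) whose first row and first column hold headers, only three-slot patterns are handled, the `{0}`-style placeholders are an illustrative notation, substring matching stands in for the appearance determination, and the names `extract_element_sets` and `extract_relations` are hypothetical.

```python
def extract_element_sets(table, n_slots):
    """Step F3 (sketch): for a 3-slot pattern, pair each row header and
    column header with the cell at their intersection. Treating the first
    row and first column as headers is an assumption of this sketch."""
    if n_slots != 3:
        return []  # other slot counts are outside this sketch
    header = table[0]
    sets = []
    for row in table[1:]:
        for col in range(1, len(header)):
            sets.append((row[0], header[col], row[col]))
    return sets


def extract_relations(structured_data_list, patterns, documents):
    """Run the FIG. 18 style loop: one structured data item at a time."""
    output = []  # stands in for the output data storage unit 6
    for table in structured_data_list:           # F1, repeated via F8
        per_pattern = []
        for pattern, n_slots in patterns:        # F2, repeated via F4
            per_pattern.append((pattern, extract_element_sets(table, n_slots)))  # F3
        # F5: create the expressions for this structured data item
        expressions = [
            (pattern.format(*elements), pattern, elements)
            for pattern, sets in per_pattern
            for elements in sets
        ]
        for text, pattern, elements in expressions:   # F6, repeated via F7
            # Assumption: an expression "appears" if it is a substring
            # of at least one document.
            if any(text in doc for doc in documents):
                output.append((pattern, elements))    # F9
    return output
```

The outer loop over structured data is what distinguishes this flow from the one in FIG. 6: expressions are created and checked for one piece of structured data before the next piece is selected.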

このように、関係抽出装置１１は、図１８に示すステップＦ１〜Ｆ９に沿って動作することもでき、この場合も実施の形態１で述べた場合と同様の効果が得られる。また、実施の形態２〜実施の形態６における関係抽出装置も、図１８の例と同様に、最初に、構造化データを選択し、選択した構造化データ毎に、文字列表現の作成及び判定を実行できる。 In this way, the relationship extraction device 11 can also operate along steps F1 to F9 shown in FIG. 18, and in this case as well, the same effects as those described in Embodiment 1 can be obtained. Similarly to the example of FIG. 18, the relationship extraction devices according to Embodiments 2 to 6 can also select structured data first and then create and determine character string expressions for each piece of selected structured data.

以上のように、本発明によれば、予め辞書を用意することなく、語と語との関係及びその関係を満たす語の集合を抽出することができる。また、本発明は、属性検索、検索クエリの展開、文書検索のためのタグ付け等に利用でき、産業上の利用可能性を有している。 As described above, according to the present invention, it is possible to extract a relationship between words and a set of words satisfying the relationship without preparing a dictionary in advance. Further, the present invention can be used for attribute search, search query expansion, tagging for document search, and the like, and has industrial applicability.

１構造化データ記憶部
２文字列パターン記憶部
３文字列表現作成部
４文字列表現判定部
５文書集合記憶部
６出力データ記憶部
７閾値判定部
８要素集合抽出部
９境界検出部
１０バイナリデータ変換部
１１関係抽出装置（実施の形態１）
１２関係抽出装置（実施の形態２）
１３関係抽出装置（実施の形態３）
１４、１５関係抽出装置（実施の形態４）
１６関係抽出装置（実施の形態５）
１７関係抽出装置（実施の形態６）
DESCRIPTION OF SYMBOLS
1 Structured data storage unit
2 Character string pattern storage unit
3 Character string expression creating unit
4 Character string expression determination unit
5 Document set storage unit
6 Output data storage unit
7 Threshold determination unit
8 Element set extraction unit
9 Boundary detection unit
10 Binary data conversion unit
11 Relationship extraction device (Embodiment 1)
12 Relationship extraction device (Embodiment 2)
13 Relationship extraction device (Embodiment 3)
14, 15 Relationship extraction device (Embodiment 4)
16 Relationship extraction device (Embodiment 5)
17 Relationship extraction device (Embodiment 6)

Claims

語と語との関係及びその関係を満たす語の集合を抽出する装置であって、
複数の語を要素として含む構造化データから、２以上の前記要素を要素集合として抽出し、抽出した前記要素集合を、予め設定された語と語との関係を表す文字列パターンに当てはめて、複数の文字列表現を作成する文字列表現作成部と、
作成された前記複数の文字列表現それぞれが文書集合中に出現しているかどうかを判定し、前記文書集合中に出現している文字列表現を、前記語と語との関係及び前記その関係を満たす語の集合として出力する文字列表現判定部と、
を備えることを特徴とする関係抽出装置。 A relationship extraction device for extracting relationships between words and sets of words satisfying those relationships, the device comprising:
a character string expression creating unit that extracts two or more elements as an element set from structured data including a plurality of words as elements, and applies the extracted element set to a preset character string pattern representing a relationship between words, thereby creating a plurality of character string expressions; and
a character string expression determination unit that determines whether each of the plurality of created character string expressions appears in a document set, and outputs the character string expressions appearing in the document set as the relationship between words and the set of words satisfying that relationship.

前記文字列表現判定部によって出力された前記文字列表現を構成している前記文字列パターン及び前記要素集合を記憶する、出力データ記憶部を更に備えている、請求項１に記載の関係抽出装置。 The relationship extraction device according to claim 1, further comprising an output data storage unit that stores the character string pattern and the element set constituting the character string expression output by the character string expression determination unit.

前記構造化データが、複数の前記要素が行列状に配置されたデータ構造を有し、
前記文字列表現作成部が、前記文字列パターンに当てはめ可能な要素の数が３である場合に、前記データ構造における行と列とを一つずつ選択し、選択した行の端にある要素と、選択した列の端にある要素と、選択した行及び列の交点にある要素とを、前記要素集合として抽出する、請求項１または２に記載の関係抽出装置。 The relationship extraction device according to claim 1 or 2, wherein the structured data has a data structure in which a plurality of the elements are arranged in a matrix, and
when the number of elements that can be applied to the character string pattern is three, the character string expression creating unit selects one row and one column in the data structure and extracts, as the element set, the element at the end of the selected row, the element at the end of the selected column, and the element at the intersection of the selected row and column.

前記文字列表現判定部によって出力された前記文字列表現の前記文書集合中での出現頻度を求め、求めた出現頻度が予め設定された閾値以上となっているかどうかを判定する、閾値判定部を更に備え、
前記出力データ記憶部が、前記閾値判定部によって前記出現頻度が前記閾値以上となっていると判定された前記文字列表現について、それを構成している前記文字列パターン及び前記要素集合を記憶する、請求項２に記載の関係抽出装置。 The relationship extraction device according to claim 2, further comprising a threshold determination unit that obtains the appearance frequency, in the document set, of the character string expression output by the character string expression determination unit and determines whether the obtained appearance frequency is equal to or greater than a preset threshold,
wherein the output data storage unit stores the character string pattern and the element set constituting each character string expression whose appearance frequency the threshold determination unit has determined to be equal to or greater than the threshold.

前記構造化データから、前記文字列表現判定部によって出力された前記文字列表現を構成している前記要素集合と各要素の関係が同一となる別の要素集合を抽出し、抽出した前記別の要素集合を前記出力データ記憶部に記憶させる、要素集合抽出部を更に備えている、請求項２に記載の関係抽出装置。 The relationship extraction device according to claim 2, further comprising an element set extraction unit that extracts, from the structured data, another element set whose elements stand in the same relationship to one another as those of the element set constituting the character string expression output by the character string expression determination unit, and stores the extracted other element set in the output data storage unit.

前記データ構造を構成する行と行との間、列と列との間、又は両者において、意味的な境界を検出する境界検出部を備え、
前記文字列表現作成部が、前記意味的な境界、及び前記文字列パターンが表す語と語との関係に基づいて、前記要素集合を抽出する、請求項３に記載の関係抽出装置。 The relationship extraction device according to claim 3, further comprising a boundary detection unit that detects a semantic boundary between rows, between columns, or both, of the data structure,
wherein the character string expression creating unit extracts the element set based on the semantic boundary and the relationship between words represented by the character string pattern.

前記境界検出部が、前記文字列表現判定部によって出力された前記文字列表現に基づいて、出力された前記文字列表現に含まれる前記要素それぞれにおける、前記データ構造での位置と、出力された前記文字列表現での出現頻度と、出現したときの前記文字列パターンでの位置とを検出し、検出した前記出現頻度と前記位置とを用いて、前記意味的な境界を検出する、請求項６に記載の関係抽出装置。 The relationship extraction device according to claim 6, wherein the boundary detection unit detects, based on the character string expressions output by the character string expression determination unit and for each of the elements included in the output character string expressions, its position in the data structure, its appearance frequency in the output character string expressions, and its position in the character string pattern when it appears, and detects the semantic boundary using the detected appearance frequency and positions.

前記境界検出部が、隣接する行よりも数字を要素としている割合が高い行が存在する場合に、当該二つの行と行との間に境界が存在していると判断し、更に、隣接する列よりも数字を要素としている割合が高い列が存在する場合に、当該二つの列と列との間に境界が存在していると判断する、請求項６に記載の関係抽出装置。 The boundary detection unit determines that there is a boundary between the two lines when there is a line having a higher percentage of numbers than the adjacent lines, and further adjacent The relation extraction device according to claim 6, wherein when there is a column having a higher ratio of numbers as elements than the columns, it is determined that a boundary exists between the two columns.

語と語との関係及びその関係を満たす語の集合を抽出するための方法であって、
（ａ）複数の語を要素として含む構造化データから、２以上の前記要素を要素集合として抽出し、抽出した前記要素集合を、予め設定された語と語との関係を表す文字列パターンに当てはめて、複数の文字列表現を作成する、ステップと、
（ｂ）前記（ａ）のステップで作成された前記複数の文字列表現それぞれが文書集合中に出現しているかどうかを判定し、前記文書集合中に出現している文字列表現を、前記語と語との関係及び前記その関係を満たす語の集合として出力する、ステップと、
を有することを特徴とする関係抽出方法。 A relationship extraction method for extracting relationships between words and sets of words satisfying those relationships, the method comprising:
(a) extracting two or more elements as an element set from structured data including a plurality of words as elements, and applying the extracted element set to a preset character string pattern representing a relationship between words, thereby creating a plurality of character string expressions; and
(b) determining whether each of the plurality of character string expressions created in step (a) appears in a document set, and outputting the character string expressions appearing in the document set as the relationship between words and the set of words satisfying that relationship.

（ｃ）前記（ｂ）のステップで出力された前記文字列表現を構成している前記文字列パターン及び前記要素集合を記憶する、ステップを更に有している、請求項９に記載の関係抽出方法。 The relationship extraction method according to claim 9, further comprising (c) storing the character string pattern and the element set constituting the character string expression output in step (b).

前記構造化データが、複数の前記要素が行列状に配置されたデータ構造を有し、
前記（ａ）のステップにおいて、前記文字列パターンに当てはめ可能な要素の数が３である場合に、前記データ構造における行と列とを一つずつ選択し、選択した行の端にある要素と、選択した列の端にある要素と、選択した行及び列の交点にある要素とを、前記要素集合として抽出する、請求項９または１０に記載の関係抽出方法。 The relationship extraction method according to claim 9 or 10, wherein the structured data has a data structure in which a plurality of the elements are arranged in a matrix, and
in step (a), when the number of elements that can be applied to the character string pattern is three, one row and one column in the data structure are selected, and the element at the end of the selected row, the element at the end of the selected column, and the element at the intersection of the selected row and column are extracted as the element set.

（ｄ）前記（ｂ）のステップで出力された前記文字列表現の前記文書集合中での出現頻度を求め、求めた出現頻度が予め設定された閾値以上となっているかどうかを判定する、ステップを更に有し、
前記（ｃ）のステップにおいて、前記（ｄ）のステップで前記出現頻度が前記閾値以上となっていると判定された前記文字列表現について、それを構成している前記文字列パターン及び前記要素集合を記憶する、請求項１０に記載の関係抽出方法。 The relationship extraction method according to claim 10, further comprising (d) obtaining the appearance frequency, in the document set, of the character string expression output in step (b) and determining whether the obtained appearance frequency is equal to or greater than a preset threshold,
wherein, in step (c), the character string pattern and the element set are stored for each character string expression whose appearance frequency has been determined in step (d) to be equal to or greater than the threshold.

（ｅ）前記構造化データから、前記（ｂ）のステップで出力された前記文字列表現を構成している前記要素集合と各要素の関係が同一となる別の要素集合を抽出し、抽出した前記別の要素集合を記憶する、ステップを更に有している、請求項１０に記載の関係抽出方法。 The relationship extraction method according to claim 10, further comprising (e) extracting, from the structured data, another element set whose elements stand in the same relationship to one another as those of the element set constituting the character string expression output in step (b), and storing the extracted other element set.

（ｆ）前記データ構造を構成する行と行との間、列と列との間、又は両者において、意味的な境界を検出する、ステップを更に有し、
前記（ａ）のステップにおいて、前記（ｆ）のステップで検出された前記意味的な境界、及び前記文字列パターンが表す語と語との関係に基づいて、前記要素集合を抽出する、請求項１１に記載の関係抽出方法。 The relationship extraction method according to claim 11, further comprising (f) detecting a semantic boundary between rows, between columns, or both, of the data structure,
wherein, in step (a), the element set is extracted based on the semantic boundary detected in step (f) and the relationship between words represented by the character string pattern.

前記（ｆ）のステップにおいて、既に実行された前記（ｂ）のステップで出力された前記文字列表現に基づいて、出力された前記文字列表現に含まれる前記要素それぞれにおける、前記データ構造での位置と、出力された前記文字列表現での出現頻度と、出現したときの前記文字列パターンでの位置とを検出し、検出した前記出現頻度と前記位置とを用いて、前記意味的な境界を検出する、請求項１４に記載の関係抽出方法。 In the step (f), the position in the data structure in each of the elements included in the output character string representation based on the character string representation output in the step (b) that has already been executed. , Detecting the appearance frequency in the output character string expression and the position in the character string pattern when it appears, and detecting the semantic boundary using the detected appearance frequency and the position The relationship extraction method according to claim 14.

前記（ｆ）のステップにおいて、隣接する行よりも数字を要素としている割合が高い行が存在する場合に、当該二つの行と行との間に境界が存在していると判断し、更に、隣接する列よりも数字を要素としている割合が高い列が存在する場合に、当該二つの列と列との間に境界が存在していると判断する、請求項１４に記載の関係抽出方法。 In the step (f), if there is a row having a higher ratio of numbers than adjacent rows, it is determined that a boundary exists between the two rows, and further, The relation extraction method according to claim 14, wherein when there is a column having a higher ratio of numbers as elements than adjacent columns, it is determined that a boundary exists between the two columns.

コンピュータによって、語と語との関係及びその関係を満たす語の集合を抽出させるためのプログラムであって、
前記コンピュータによって、
（ａ）複数の語を要素として含む構造化データから、２以上の前記要素を要素集合として抽出し、抽出した前記要素集合を、予め設定された語と語との関係を表す文字列パターンに当てはめて、複数の文字列表現を作成する、ステップと、
（ｂ）前記（ａ）のステップで作成された前記複数の文字列表現それぞれが文書集合中に出現しているかどうかを判定し、前記文書集合中に出現している文字列表現を、前記語と語との関係及び前記その関係を満たす語の集合として出力する、ステップと、
を実行させることを特徴とするプログラム。 A program for causing a computer to extract relationships between words and sets of words satisfying those relationships, the program causing the computer to execute:
(a) extracting two or more elements as an element set from structured data including a plurality of words as elements, and applying the extracted element set to a preset character string pattern representing a relationship between words, thereby creating a plurality of character string expressions; and
(b) determining whether each of the plurality of character string expressions created in step (a) appears in a document set, and outputting the character string expressions appearing in the document set as the relationship between words and the set of words satisfying that relationship.

（ｃ）前記（ｂ）のステップで出力された前記文字列表現を構成している前記文字列パターン及び前記要素集合を記憶する、ステップを、更に前記コンピュータによって実行させる、請求項１７に記載のプログラム。 The step of (c) storing the character string pattern and the element set constituting the character string expression output in the step (b) is further executed by the computer. program.

前記構造化データが、複数の前記要素が行列状に配置されたデータ構造を有し、
前記（ａ）のステップにおいて、前記文字列パターンに当てはめ可能な要素の数が３である場合に、前記データ構造における行と列とを一つずつ選択し、選択した行の端にある要素と、選択した列の端にある要素と、選択した行及び列の交点にある要素とを、前記要素集合として抽出する、請求項１７または１８に記載のプログラム。 The program according to claim 17 or 18, wherein the structured data has a data structure in which a plurality of the elements are arranged in a matrix, and
in step (a), when the number of elements that can be applied to the character string pattern is three, one row and one column in the data structure are selected, and the element at the end of the selected row, the element at the end of the selected column, and the element at the intersection of the selected row and column are extracted as the element set.

（ｄ）前記（ｂ）のステップで出力された前記文字列表現の前記文書集合中での出現頻度を求め、求めた出現頻度が予め設定された閾値以上となっているかどうかを判定する、ステップを、更に前記コンピュータによって実行させ、
前記（ｃ）のステップにおいて、前記（ｄ）のステップで前記出現頻度が前記閾値以上となっていると判定された前記文字列表現について、それを構成している前記文字列パターン及び前記要素集合を記憶する、請求項１８に記載のプログラム。 The program according to claim 18, further causing the computer to execute (d) obtaining the appearance frequency, in the document set, of the character string expression output in step (b) and determining whether the obtained appearance frequency is equal to or greater than a preset threshold,
wherein, in step (c), the character string pattern and the element set are stored for each character string expression whose appearance frequency has been determined in step (d) to be equal to or greater than the threshold.

（ｅ）前記構造化データから、前記（ｂ）のステップで出力された前記文字列表現を構成している前記要素集合と各要素の関係が同一となる別の要素集合を抽出し、抽出した前記別の要素集合を記憶する、ステップを、更に前記コンピュータによって実行させる、請求項１８に記載のプログラム。 The program according to claim 18, further causing the computer to execute (e) extracting, from the structured data, another element set whose elements stand in the same relationship to one another as those of the element set constituting the character string expression output in step (b), and storing the extracted other element set.

（ｆ）前記データ構造を構成する行と行との間、列と列との間、又は両者において、意味的な境界を検出する、ステップを、更に前記コンピュータによって実行させる、
前記（ａ）のステップにおいて、前記（ｆ）のステップで検出された前記意味的な境界、及び前記文字列パターンが表す語と語との関係に基づいて、前記要素集合を抽出する、請求項１９に記載のプログラム。 The program according to claim 19, further causing the computer to execute (f) detecting a semantic boundary between rows, between columns, or both, of the data structure,
wherein, in step (a), the element set is extracted based on the semantic boundary detected in step (f) and the relationship between words represented by the character string pattern.

前記（ｆ）のステップにおいて、既に実行された前記（ｂ）のステップで出力された前記文字列表現に基づいて、出力された前記文字列表現に含まれる前記要素それぞれにおける、前記データ構造での位置と、出力された前記文字列表現での出現頻度と、出現したときの前記文字列パターンでの位置とを検出し、検出した前記出現頻度と前記位置とを用いて、前記意味的な境界を検出する、請求項２２に記載のプログラム。 In the step (f), the position in the data structure in each of the elements included in the output character string representation based on the character string representation output in the step (b) that has already been executed. , Detecting the appearance frequency in the output character string expression and the position in the character string pattern when it appears, and detecting the semantic boundary using the detected appearance frequency and the position The program according to claim 22.

前記（ｆ）のステップにおいて、隣接する行よりも数字を要素としている割合が高い行が存在する場合に、当該二つの行と行との間に境界が存在していると判断し、更に、隣接する列よりも数字を要素としている割合が高い列が存在する場合に、当該二つの列と列との間に境界が存在していると判断する、請求項２２に記載のプログラム。 In the step (f), if there is a row having a higher ratio of numbers than adjacent rows, it is determined that a boundary exists between the two rows, and further, The program according to claim 22, wherein when there is a column having a higher ratio of numbers as elements than adjacent columns, it is determined that a boundary exists between the two columns.