JP5497105B2

JP5497105B2 - Document retrieval apparatus and method

Info

Publication number: JP5497105B2
Application number: JP2012133641A
Authority: JP
Inventors: ボリウ; ユーボコウ; ジェンチャンリイ; ユウジャオ
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2011-08-01
Filing date: 2012-06-13
Publication date: 2014-05-21
Anticipated expiration: 2032-06-13
Also published as: CN102915304B; JP2013033452A; CN102915304A

Description

The present invention relates to the field of information retrieval and, in particular, to a document retrieval apparatus and method.

With the advent of the information age, the number of searchable documents keeps growing. Effectively finding useful information in this vast body of documents has therefore become extremely important.

Information retrieval (IR) is a technique for retrieving specific information from a collection of documents (a document set). IR techniques are further classified into retrieval of information contained in documents, retrieval of the documents themselves, retrieval of metadata describing documents, and database retrieval. In database retrieval, text, speech, image, and data searches are performed on relational stand-alone databases or on networked hypertext databases such as Ethernet and content/document management systems.

When document retrieval is performed, a document retrieval system has two main tasks: 1) finding documents related to the user query, and 2) evaluating the matching results and ranking the documents by relevance. Keyword search plays an important role in many conventional document retrieval systems. These systems search for documents by considering several specific factors, such as the frequency and location of query terms in a document, hyperlinks pointing to the document, and document access information.

Recently, so-called "Semantic Web (SW)" technology has been proposed. It is intended to enable machines to understand the meaning of information (that is, its "semantics"). SW technologies, typified by the Resource Description Framework (RDF) and the Web Ontology Language (OWL), focus on providing formal descriptions of concepts and relations within a given knowledge domain. Using SW technology can therefore improve the accuracy of document retrieval.

In recent years, several methods have been developed that use ontologies to improve retrieval accuracy. An ontology formally describes the meaning of information in a machine-understandable way. This makes it easier to mine the meaning implied in queries and documents, which in turn makes it possible to address the polysemy and synonymy of natural language and to understand the contextual information of concepts in queries and documents.

Non-Patent Document 1 (P. Castells, M. Fernandes, and D. Vallet, "An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval", IEEE Transactions on Knowledge and Data Engineering, 2007) proposes a method for assigning relevance scores to documents. The method consists of three steps: 1) extracting concepts from the documents and the query, 2) calculating the similarity between the documents and the query using a vector space model, and 3) combining the score obtained in the previous step with a similarity score calculated by a keyword-based algorithm.

Non-Patent Document 2 (Tuukka Ruotsalo and Eero Hyvonen, "A Method for Determining Ontology-Based Semantic Relevance", Proceeding of DEXA 2007) proposes a method that uses the underlying domain ontology to calculate the mutual relevance of annotations, thereby extending the Term Frequency-Inverse Document Frequency method (hereinafter "TF-IDF method").

Patent Document 1 (WO2006001906 A3) proposes a graph-based ranking algorithm. The algorithm builds a graph for each document using natural language processing techniques and a domain ontology, and then ranks the nodes in order to perform text processing such as disambiguation and keyword extraction.

Patent Document 1: WO2006001906 A3

Non-Patent Document 1: P. Castells, M. Fernandes, and D. Vallet, "An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval", IEEE Transactions on Knowledge and Data Engineering, 2007
Non-Patent Document 2: Tuukka Ruotsalo and Eero Hyvonen, "A Method for Determining Ontology-Based Semantic Relevance", Proceeding of DEXA 2007

The prior-art methods above retrieve documents mainly by using the ontology concepts (classes and instances) obtained from the query and the documents. They do not consider the rich semantic information implied within a document, so their retrieval accuracy is poor. In fact, when determining the relevance between a document and a query, the implicit semantic information between the concepts in the document plays as important a role as the concepts themselves. Considering only the concepts in the query and the document cannot capture what the user is really asking for in a search.

To solve the above problem, the present invention provides a document retrieval apparatus and method that perform document retrieval using the semantic relevance information implied in a document.
Specifically, the document retrieval apparatus and method according to the present invention first construct a hypergraph describing the semantic information implied in a document, and then refine the hypergraph using a domain ontology. When document retrieval is performed for a specific query, the relevance score of each document for that query can then be calculated from its hypergraph, and the documents can be ranked by that score.

According to one aspect of the present invention, there is provided a document retrieval apparatus comprising: a hypergraph construction unit configured to construct, for a document in a target document set, a hypergraph describing the implicit semantic information contained in that document; and a document ranking unit configured to search the target document set for documents corresponding to a specific query and to rank the retrieved documents, based on the hypergraph constructed by the hypergraph construction unit.

The hypergraph construction unit preferably comprises: a concept extraction subunit configured to extract concepts from the document using domain ontology information and to calculate weights for the concepts; a hypergraph construction subunit configured to construct an initial hypergraph of the document; a hypergraph refinement subunit configured to refine the initial hypergraph using the domain ontology information; and a weighting subunit configured to assign weights to the nodes and edges of the refined hypergraph.

The hypergraph construction subunit is preferably configured to create a node for each concept in the concept set contained in the document to form a node set, to add, for each sentence of the document, an edge formed by the concept set contained in that sentence to form an edge set, and to construct the initial hypergraph from the node set and the edge set.

The hypergraph refinement subunit is preferably configured to: merge two nodes in the initial hypergraph when the concepts corresponding to the two nodes have the same meaning in the domain ontology; add an edge connecting any number of nodes in the initial hypergraph when the concepts corresponding to those nodes are directly related in the domain ontology; and merge two edges in the initial hypergraph when the concepts corresponding to the two edges are close to each other in the domain ontology or in the initial hypergraph.

The weighting subunit is preferably configured to assign a weight to the node corresponding to a specific concept based on the frequency with which that concept appears in the document, and to assign a weight to an edge based on the frequency with which the concepts of the edge appear in the document, the frequency with which the edge itself appears in the document, and the novelty of the edge, which is the sum of the novelty of the semantic relevance of any two nodes in the edge.

The novelty of the semantic relevance of two nodes is preferably 1 when the semantic distance between the two nodes in the domain ontology does not exceed the number of nodes in the edge minus 1; otherwise, it is preferably the semantic distance between the two nodes in the domain ontology divided by the number of nodes in the edge minus 1.

The document ranking unit preferably comprises: a minimum spanning tree generation subunit configured to generate a minimum spanning tree for a specific query using the hypergraph constructed by the hypergraph construction unit; a relevance calculation subunit configured to calculate a semantic relevance score of the generated minimum spanning tree; and a document ranking subunit configured to rank documents based on the semantic relevance score.

The minimum spanning tree generation subunit is preferably configured to generate the minimum spanning tree using a greedy algorithm.

The relevance calculation subunit is preferably configured to calculate, as the semantic relevance score, the average of the weights of all edges in the minimum spanning tree.

According to another aspect of the present invention, there is provided a document retrieval method comprising the steps of: constructing, for a document in a target document set, a hypergraph describing the implicit semantic information contained in that document; and, based on the constructed hypergraph, searching the target document set for documents corresponding to a specific query and ranking the retrieved documents.

The constructing step preferably comprises the steps of: extracting concepts from the document using domain ontology information and calculating weights for the concepts; constructing an initial hypergraph for the document; refining the initial hypergraph using the domain ontology information; and assigning weights to the nodes and edges of the refined hypergraph.

The step of constructing the initial hypergraph for the document preferably comprises the steps of: creating a node for each concept in the concept set contained in the document to form a node set; adding, for each sentence of the document, an edge formed by the concept set contained in that sentence to form an edge set; and constructing the initial hypergraph from the node set and the edge set.

The step of refining the initial hypergraph preferably comprises the steps of: merging two nodes in the initial hypergraph when the concepts corresponding to the two nodes have the same meaning in the domain ontology; adding an edge connecting any number of nodes in the initial hypergraph when the concepts corresponding to those nodes are directly related in the domain ontology; and merging two edges in the initial hypergraph when the concepts corresponding to the two edges are close to each other in the domain ontology or in the initial hypergraph.

The step of assigning weights preferably comprises the steps of: assigning a weight to the node corresponding to a specific concept based on the frequency with which that concept appears in the document; and assigning a weight to an edge based on the frequency with which the concepts of the edge appear in the document, the frequency with which the edge itself appears in the document, and the novelty of the edge (that is, the sum of the novelty of the semantic relevance of any two nodes in the edge).

The searching and ranking step preferably comprises the steps of: generating a minimum spanning tree for a specific query using the constructed hypergraph; calculating a semantic relevance score of the generated minimum spanning tree; and ranking documents based on the semantic relevance score.

The step of generating the minimum spanning tree preferably comprises generating the minimum spanning tree using a greedy algorithm.

The step of calculating the semantic relevance score preferably comprises calculating, as the semantic relevance score, the average of the weights of all edges in the minimum spanning tree.

The document retrieval apparatus and method proposed by the present invention improve retrieval accuracy by exploiting the rich semantic information implied in a document: a hypergraph is constructed for the document, the relevance score of the document for a specific query is calculated from it, and the document is ranked by the calculated score. This makes it possible to satisfy the user's actual search needs more effectively.

These and other features of the present invention will become more apparent from the following detailed description read with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating a document retrieval apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the hypergraph construction unit in the document retrieval apparatus according to an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating a hypergraph constructed by the hypergraph construction subunit.
FIG. 4(a) is a schematic diagram illustrating the hypergraph of FIG. 3 before and after a node merge operation.
FIG. 4(b) is a schematic diagram illustrating the hypergraph of FIG. 3 before and after an edge addition operation.
FIG. 4(c) is a schematic diagram illustrating a hypergraph before and after an edge merge operation.
FIG. 5 is a block diagram illustrating the document ranking unit in the document retrieval apparatus according to an embodiment of the present invention.
A further drawing is a flowchart illustrating a document retrieval method according to an embodiment of the present invention.
A further drawing is a flowchart illustrating the detailed steps of a document retrieval method according to an embodiment of the present invention.

Specific embodiments of the present invention are described below with reference to the accompanying drawings so that the principle and implementation of the invention become clearer. Note that the present invention is not limited to the specific embodiments described below. To keep the description simple, details of well-known techniques that are not directly related to the present invention are omitted.

First, Table 1 shows the meanings of the terms used in the description.

FIG. 1 is a block diagram illustrating a document retrieval apparatus 10 according to an embodiment of the present invention. As shown in FIG. 1, the document retrieval apparatus 10 of this embodiment includes a hypergraph construction unit 110 and a document ranking unit 120. For each document in the target document set, the hypergraph construction unit 110 constructs a hypergraph that models the implicit meaning contained in that document. Based on the hypergraphs constructed by the hypergraph construction unit 110, the document ranking unit 120 processes the documents in the target document set in response to a specific query and obtains the retrieval result. The structure and operation of the hypergraph construction unit 110 and the document ranking unit 120 are described in detail below with reference to FIGS. 2 to 5.

FIG. 2 is a block diagram illustrating the hypergraph construction unit 110 in the document retrieval apparatus 10 shown in FIG. 1. As illustrated, the hypergraph construction unit 110 includes a concept extraction subunit 1110, a hypergraph construction subunit 1120, a hypergraph refinement subunit 1130, and a weighting subunit 1140.

The concept extraction subunit 1110 uses concept recognition techniques to extract concepts from the target document based on the domain ontology, and then calculates weights for these concepts. The concept extraction subunit 1110 may calculate the weights using, for example, the known TF-IDF method.
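The TF-IDF weighting mentioned above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the concepts have already been recognized upstream, that each document is given as a list of concept labels, and that the raw-count times log(N/df) variant of TF-IDF is used. The function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {concept: weight} dict per document.

    docs: list of documents, each a list of concept labels (hypothetical
    output of an upstream concept recognizer).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that mention each concept.
    df = Counter(concept for doc in docs for concept in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence count of each concept in this document
        weights.append({c: tf[c] * math.log(n_docs / df[c]) for c in tf})
    return weights

docs = [["gene", "protein", "gene"], ["protein", "cell"]]
w = tfidf_weights(docs)
```

With this variant, a concept that appears in every document (here "protein") receives weight zero, which is the usual behavior of an unsmoothed inverse document frequency.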

The hypergraph construction subunit 1120 constructs an initial hypergraph for a specific document. When several concepts appear in the same context within one document, a direct semantic relevance is considered to exist between them. This direct semantic relevance can be regarded as implicit semantic information in the document. In the present invention, "the same context" means one and the same sentence.

The operation of the hypergraph construction subunit 1120 is as follows. First, for the concept set $C$ recognized in a document, the hypergraph construction subunit 1120 creates a node $v$ for each concept, forming a node set $V$. Each sentence in the document is assumed to contain a concept set $\{C_1, C_2, \ldots, C_n\}$, where $n$ is the number of nodes (concepts) contained in that sentence. Next, for each sentence, the hypergraph construction subunit 1120 adds an edge $e$ formed by the concept set $\{C_1, C_2, \ldots, C_n\}$, forming an edge set $E$. Finally, the hypergraph construction subunit 1120 forms the hypergraph, denoted $G(V, E)$.

FIG. 3 is a schematic diagram illustrating an example of a hypergraph constructed by the hypergraph construction subunit 1120. The example hypergraph in FIG. 3 contains seven nodes (nodes (1) to (7)) and five edges (the closed curves enclosing nodes (1) to (7)). Specifically, the five edges are: an edge composed of node (1) and node (2); an edge composed of node (1) and node (3); an edge composed of node (2) and node (4); an edge composed of node (3), node (5), and node (6); and an edge composed of node (3), node (5), node (7), and node […]. As described above, a single edge of a hypergraph can connect any number of nodes.
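The construction of the initial hypergraph can be sketched as follows. This is only an illustrative sketch: it assumes each sentence has already been reduced to a list of concept identifiers, represents an edge as a `frozenset` of nodes, counts how often each edge occurs, and skips sentences with fewer than two concepts. The function name `build_initial_hypergraph` is hypothetical, and the fifth edge of the FIG. 3 example is taken here as {3, 5, 7}, which is an assumption.

```python
from collections import Counter

def build_initial_hypergraph(sentences):
    """Build (V, E) from per-sentence concept lists.

    sentences: list of concept lists, one per sentence (hypothetical input).
    Returns the node set V and a Counter E mapping each edge (a frozenset of
    nodes) to the number of sentences that produced it.
    """
    nodes = set()
    edges = Counter()
    for concepts in sentences:
        concept_set = frozenset(concepts)
        nodes |= concept_set  # one node per distinct concept
        # A sentence with fewer than two concepts relates nothing; skipping
        # it is an assumption of this sketch.
        if len(concept_set) >= 2:
            edges[concept_set] += 1
    return nodes, edges

# Seven nodes and five edges, following the FIG. 3 example
# (fifth edge assumed to be {3, 5, 7}).
sentences = [[1, 2], [1, 3], [2, 4], [3, 5, 6], [3, 5, 7]]
V, E = build_initial_hypergraph(sentences)
```

Keeping a count per edge rather than a plain set is a design choice that preserves how often the same group of concepts co-occurs in a sentence.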

Authors usually do not write down semantic information that they assume to be well known (that is, they omit it from the document), so the hypergraph constructed by the hypergraph construction subunit 1120 is often incomplete. In the present invention, so that the constructed hypergraph can be processed by a computer, the hypergraph refinement subunit 1130 adds the omitted information to the hypergraph. The hypergraph refinement subunit 1130 uses the domain ontology to refine the initial hypergraph constructed by the hypergraph construction subunit 1120. Specifically, it performs two kinds of operations: node operations and edge operations.

A node operation (merge) means that, when the concepts corresponding to two nodes in the hypergraph have the same meaning in the domain ontology, the two nodes are merged and the edges of the two nodes are merged as well. FIG. 4(a) shows an example of a node operation on the hypergraph of FIG. 3. As shown in FIG. 4(a), when nodes (1) and (2) have the same meaning in the domain ontology, the hypergraph refinement subunit 1130 merges nodes (1) and (2) into node (1) and merges the edges of nodes (1) and (2) accordingly. As shown on the right side of FIG. 4(a), the hypergraph after the node operation contains only six nodes and four edges.
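A node merge of this kind might be sketched as below. The sketch assumes edges are stored as a set of `frozenset` objects, that the decision "same meaning in the domain ontology" has already been made by the caller, and that an edge collapsing to a single node is dropped. The function name `merge_nodes` is hypothetical, and the fifth edge of the FIG. 3 example is assumed to be {3, 5, 7}.

```python
def merge_nodes(nodes, edges, keep, drop):
    """Merge node `drop` into node `keep` and rewrite the edges accordingly."""
    new_nodes = {n for n in nodes if n != drop}
    new_edges = set()
    for edge in edges:
        renamed = frozenset(keep if n == drop else n for n in edge)
        # An edge that collapses to a single node no longer relates anything
        # (assumption of this sketch), so it is dropped.
        if len(renamed) >= 2:
            new_edges.add(renamed)
    return new_nodes, new_edges

nodes = {1, 2, 3, 4, 5, 6, 7}
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 4), (3, 5, 6), (3, 5, 7)]}
merged_nodes, merged_edges = merge_nodes(nodes, edges, keep=1, drop=2)
```

Under these assumptions the example yields six nodes and four edges, matching the counts stated for FIG. 4(a): the edge between nodes (1) and (2) disappears and the edge between nodes (2) and (4) becomes an edge between nodes (1) and (4).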

Edge operations (addition and merge) mean the following. When the concepts corresponding to any number of nodes are adjacent in the domain ontology (that is, directly related in the domain ontology), an edge connecting these nodes is added to the hypergraph. In addition, when the concepts corresponding to two edges are close to each other in the domain ontology or in the initial hypergraph, the two edges are merged in the hypergraph. FIG. 4(b) shows an example of an edge operation on the hypergraph of FIG. 3. As shown in FIG. 4(b), when node (4) and node (7) are directly related in the ontology, an edge connecting node (4) and node (7) is added. The hypergraph after the edge operation (addition) therefore contains seven nodes and six edges.
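The edge addition step could look roughly like this. The sketch assumes the domain ontology is consulted elsewhere and hands over a list of concept groups that are directly related; `ontology_links` and `add_ontology_edges` are hypothetical names, and the fifth edge of the FIG. 3 example is assumed to be {3, 5, 7}.

```python
def add_ontology_edges(edges, ontology_links):
    """Add an edge for each group of concepts directly related in the ontology.

    edges: set of frozensets (the current hyperedges).
    ontology_links: iterable of concept groups (hypothetical stand-in for a
    lookup against a real domain ontology).
    """
    refined = set(edges)
    node_set = set().union(*edges)  # every node that occurs in some edge
    for group in ontology_links:
        group = frozenset(group)
        # Only connect concepts that already occur in this document's hypergraph.
        if group <= node_set:
            refined.add(group)
    return refined

edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 4), (3, 5, 6), (3, 5, 7)]}
refined = add_ontology_edges(edges, ontology_links=[(4, 7), (4, 99)])
```

Here the link between nodes (4) and (7) is added, giving six edges as in FIG. 4(b), while the link to the unknown concept 99 is ignored because that concept does not occur in the document.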

FIG. 4(c) is a schematic diagram illustrating an example of an edge merge operation on a hypergraph. As shown in FIG. 4(c), the hypergraph initially contains two edges: an edge composed of node (1) and node (2), and an edge composed of node (1) and node (3). When the domain ontology indicates that node (2) and node (3) are directly related (that is, adjacent), the edge composed of node (1) and node (2) and the edge composed of node (1) and node (3) can be merged to form an edge composed of node (1), node (2), and node (3).
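An edge merge along the lines of FIG. 4(c) can be sketched as follows. The text does not fix a precise closeness test, so the predicate used here is an assumption: two edges are treated as close when they share a node and some pair of their remaining nodes is directly related in the ontology. `ONTOLOGY_ADJACENT`, `are_close`, and `merge_close_edges` are hypothetical names.

```python
# Hypothetical ontology adjacency: nodes 2 and 3 are directly related.
ONTOLOGY_ADJACENT = {frozenset({2, 3})}

def are_close(e1, e2):
    """Closeness test assumed for this sketch (not taken from the source)."""
    if not (e1 & e2):
        return False  # the edges must share at least one node
    return any(frozenset({a, b}) in ONTOLOGY_ADJACENT
               for a in e1 - e2 for b in e2 - e1)

def merge_close_edges(edges, close):
    """Repeatedly replace two close edges by their union until none remain."""
    edges = set(edges)
    merged = True
    while merged:
        merged = False
        for e1 in edges:
            for e2 in edges:
                if e1 != e2 and close(e1, e2):
                    edges = (edges - {e1, e2}) | {e1 | e2}
                    merged = True
                    break
            if merged:
                break  # restart the scan on the updated edge set
    return edges

result = merge_close_edges({frozenset({1, 2}), frozenset({1, 3})}, are_close)
```

Each merge strictly reduces the number of edges, so the loop always terminates; in the example the two edges collapse into the single edge {1, 2, 3}.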

The weighting subunit 1140 assigns weights to the nodes and edges of the refined hypergraph based on the importance of the semantic information within the document. Specifically, the weighting subunit 1140 performs the following operations.

(i) For a node corresponding to a specific concept, a weight is assigned to the node based on the frequency (number of occurrences) of that concept in the document. For example, the weight of a node $v$ is expressed as $\mathrm{weight}(v) = \mathrm{Freq}(t)$, where $t$ is the concept corresponding to node $v$ and $\mathrm{Freq}(t)$ is the frequency (number of occurrences) of concept $t$ in the document.

(ii) For a specific edge $e$, a weight is assigned to the edge based on the frequency (number of occurrences) in the document of the concepts corresponding to the edge (denoted $\mathrm{Freq}_{\mathrm{term}}(e)$), the frequency (number of occurrences) of the edge itself in the document (denoted $\mathrm{Freq}_{\mathrm{relation}}(e)$), and the novelty of the edge (denoted $\mathrm{Nov}(e)$).

For example, for each edge $e = \{v_1, v_2, \ldots, v_k\}$, where $k$ is the total number of nodes contained in the edge, the weight is given by

$$\mathrm{weight}(e) = \mathrm{Freq}_{\mathrm{term}}(e) \cdot \mathrm{Freq}_{\mathrm{relation}}(e) \cdot \mathrm{Nov}(e),$$

where

$$\mathrm{Freq}_{\mathrm{term}}(e) = \frac{\mathrm{weight}(v_1) + \mathrm{weight}(v_2) + \cdots + \mathrm{weight}(v_k)}{k}.$$

Given one document and two concepts, novelty expresses how far the document can shorten the semantic distance between the two concepts. The novelty of an edge is

$$\mathrm{Nov}(e) = \sum_{0 < i,\, j \le k} \mathrm{Nov}(\{v_i, v_j\}),$$

where, for any two concepts $v_i, v_j$ with semantic distance $D(\{v_i, v_j\})$,

$$\mathrm{Nov}(\{v_i, v_j\}) = \begin{cases} 1 & \text{if } D(\{v_i, v_j\}) \le k - 1, \\ \dfrac{D(\{v_i, v_j\})}{k - 1} & \text{otherwise.} \end{cases}$$

The novelty $\mathrm{Nov}(e)$ is significant because new information may shorten the semantic distance between two concepts.
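The edge-weight formula can be worked through in code as follows. This is a sketch under stated assumptions: the sum in Nov(e) is taken over unordered pairs of distinct nodes, node weights and the edge frequency are supplied by the caller, and the ontology distance is a hypothetical lookup table. The names `novelty`, `edge_weight`, and `DIST` are illustrative only.

```python
from itertools import combinations

def novelty(edge, distance):
    """Nov(e): sum of the pairwise novelty over the nodes of the edge."""
    k = len(edge)
    total = 0.0
    # Assumption: each unordered pair of distinct nodes is counted once.
    for vi, vj in combinations(sorted(edge), 2):
        d = distance(vi, vj)
        total += 1.0 if d <= k - 1 else d / (k - 1)
    return total

def edge_weight(edge, node_weight, freq_relation, distance):
    """weight(e) = Freq_term(e) * Freq_relation(e) * Nov(e)."""
    freq_term = sum(node_weight[v] for v in edge) / len(edge)
    return freq_term * freq_relation * novelty(edge, distance)

# Hypothetical ontology distances between the three concepts of one edge.
DIST = {frozenset({"a", "b"}): 1, frozenset({"a", "c"}): 2, frozenset({"b", "c"}): 4}

def distance(x, y):
    return DIST[frozenset({x, y})]

edge = frozenset({"a", "b", "c"})
node_weight = {"a": 3, "b": 1, "c": 2}  # occurrence counts of each concept
w = edge_weight(edge, node_weight, freq_relation=2, distance=distance)
```

For this edge, k = 3, so pairs at distance 1 and 2 contribute 1 each and the pair at distance 4 contributes 4 / 2 = 2, giving Nov(e) = 4. With Freq_term(e) = (3 + 1 + 2) / 3 = 2 and Freq_relation(e) = 2, the weight is 2 * 2 * 4 = 16.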

FIG. 5 is a block diagram showing the document ranking unit 120 in the document search apparatus 10 shown in FIG. 1. As shown in FIG. 5, the document ranking unit 120 includes a minimum spanning tree generation subunit 1210, a relevance calculation subunit 1220, and a document ranking subunit 1230.

The minimum spanning tree generation subunit 1210 generates a minimum spanning tree using the hypergraph constructed by the hypergraph construction unit 110. The minimum spanning tree generation subunit 1210 may, for example, use a greedy algorithm to generate the minimum spanning tree. In the greedy algorithm, any two nodes are connected by the shortest path, and the algorithm terminates once all the given nodes have been connected.
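One possible reading of this greedy procedure is a shortest-path heuristic: start from one query node and repeatedly attach the nearest query node that is not yet connected. The sketch below follows that reading. The function name, the representation of hyperedges as Python sets, and the use of hop count as path length are all assumptions; the text does not specify how path length is measured.

```python
from collections import deque


def greedy_query_tree(hyperedges, query_nodes):
    """Greedily connect the query nodes; return indices of the hyperedges used.

    hyperedges: list of sets of nodes.
    query_nodes: the "given nodes" that must end up connected.
    """
    query = list(dict.fromkeys(query_nodes))  # de-duplicate, keep order
    if not query:
        return []
    # Index: node -> indices of the hyperedges containing it.
    incident = {}
    for idx, edge in enumerate(hyperedges):
        for node in edge:
            incident.setdefault(node, []).append(idx)

    connected = {query[0]}      # nodes already in the tree
    remaining = set(query[1:])  # query nodes still to be attached
    used = []                   # chosen hyperedge indices

    while remaining:
        # Multi-source BFS from the current tree to the nearest remaining node.
        parent = {node: None for node in connected}
        queue = deque(connected)
        target = None
        while queue and target is None:
            node = queue.popleft()
            for idx in incident.get(node, []):
                for nxt in hyperedges[idx]:
                    if nxt not in parent:
                        parent[nxt] = (node, idx)
                        if nxt in remaining:
                            target = nxt
                            break
                        queue.append(nxt)
                if target is not None:
                    break
        if target is None:
            raise ValueError("query nodes are not connected in this hypergraph")
        # Walk back along the path, adding its nodes and hyperedges to the tree.
        node = target
        while parent[node] is not None:
            prev, idx = parent[node]
            if idx not in used:
                used.append(idx)
            connected.add(node)
            node = prev
        remaining -= connected
    return used
```

The loop ends when every query node has been attached, which matches the stated termination condition. A weighted variant would replace the breadth-first search with Dijkstra's algorithm.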

The relevance calculation subunit 1220 calculates a semantic relevance score for the generated minimum spanning tree. As an example, given a document Doc1 and a query $(q_1, q_2, \ldots, q_n)$, the minimum spanning tree that the minimum spanning tree generation subunit 1210 computes for the query is $T = \{r, (q_1, q_2, \ldots, q_n)\}$, where $r$ is the root of $T$ and $m$ is the number of edges of $T$. The relevance calculation subunit 1220 then calculates the semantic relevance score of document Doc1 for the query as the average weight of the edges $e_1, e_2, \ldots, e_m$ of $T$:

$$\mathrm{Score}(\mathrm{Doc1}) = \frac{\mathrm{weight}(e_1) + \mathrm{weight}(e_2) + \cdots + \mathrm{weight}(e_m)}{m}.$$

The document ranking subunit 1230 ranks the target documents based on their calculated semantic relevance scores and obtains the final result of the document search.
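The scoring and ranking steps can be sketched as follows. The function names and the input shape (a dict from document identifier to the edge weights of that document's tree) are assumptions for illustration, as is the treatment of a document with no tree as having score 0.

```python
def relevance_score(tree_edge_weights):
    """Score(Doc) = (weight(e_1) + ... + weight(e_m)) / m."""
    m = len(tree_edge_weights)
    if m == 0:
        return 0.0  # assumption: no tree for the query means no relevance
    return sum(tree_edge_weights) / m


def rank_documents(tree_weights_by_doc):
    """Return document identifiers sorted by descending relevance score."""
    scores = {
        doc: relevance_score(weights)
        for doc, weights in tree_weights_by_doc.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, `rank_documents({"A": [1.0], "B": [5.0]})` places `"B"` ahead of `"A"`, because a higher average edge weight means a higher rank.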

A specific application example of the document search apparatus 10 is described below.

A target document Doc1 contains the sentence "The computer science field of information retrieval studies how to store, index, retrieve and rank documents", and another target document Doc2 contains the sentence "In this paper, graph matching techniques are utilized to enhance information retrieval". A user looking for literature on improving information retrieval performance by representing documents as graphs enters a query containing three keywords: "information retrieval", "document", and "graph". In this case, a prior-art document search apparatus returns the same query-related score for Doc1 and Doc2, because the absolute distance between the keywords "information retrieval" and "document" in Doc1 is the same as the absolute distance between the keywords "information retrieval" and "graph" in Doc2.

The document search apparatus 10 according to the present invention returns a different result. Although the absolute distances are the same, the relative distance between the keywords "information retrieval" and "document" in Doc1 differs from the relative distance between the keywords "information retrieval" and "graph" in Doc2. Specifically, the relative distance in Doc1 is determined to be D("information retrieval", "document") = 1, while the relative distance in Doc2 is determined to be D("information retrieval", "graph") = 5. The concept frequency and the edge frequency are both 1 for the first and the second keyword group, but the novelty of the first keyword group is 1 whereas the novelty of the second keyword group is 5. As a result, the semantic relevance scores of Doc1 and Doc2 are as follows.

Score(Doc1) = Weight(e("information retrieval", "document")) = 1;
Score(Doc2) = Weight(e("information retrieval", "graph")) = 5.
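These two values follow from the edge-weight formula weight(e) = Freq_term(e) * Freq_relation(e) * Nov(e) given earlier, if each keyword pair is treated as an edge with $k = 2$ nodes so that $k - 1 = 1$. The value $k = 2$ is inferred from the example rather than stated in it. Writing $e_1$ for the Doc1 edge and $e_2$ for the Doc2 edge:

```latex
\begin{align*}
\mathrm{Nov}(e_1) &= 1 && (D = 1 \le k - 1 = 1), \\
\mathrm{weight}(e_1) &= 1 \cdot 1 \cdot 1 = 1, \\
\mathrm{Nov}(e_2) &= \frac{D}{k - 1} = \frac{5}{1} = 5 && (D = 5 > k - 1 = 1), \\
\mathrm{weight}(e_2) &= 1 \cdot 1 \cdot 5 = 5.
\end{align*}
```

Each tree here consists of a single edge ($m = 1$), so the average edge weight, and hence the score, equals the weight of that edge.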

Therefore, in the search result of the document search apparatus 10 according to the present invention, Doc2 is ranked higher than Doc1. In other words, Doc2 is determined to be the document the user is looking for.

FIG. 6 is a flowchart showing a document search method 60 according to an embodiment of the present invention. As shown in FIG. 6, the method 60 begins at step S610.

In step S620, for each document in the target document set, a hypergraph describing the implicit semantics contained in the document is constructed. FIG. 7(a) shows a specific example of the hypergraph construction process. As shown in FIG. 7, in step S6210, concepts are extracted from the target document by a concept recognition technique based on a domain ontology, and weights for these concepts are then calculated. The concept weights may be calculated using, for example, the known TF-IDF method.
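As a rough illustration of TF-IDF weighting applied to concepts, the sketch below computes one weight per concept for a single document. The function name, the input shape, and the smoothed inverse document frequency are assumptions; TF-IDF has many variants and the text does not say which one is intended.

```python
import math
from collections import Counter


def tf_idf(doc_concepts, corpus_concepts):
    """Return {concept: tf * idf} for one document.

    doc_concepts: list of concepts recognized in the document (with repeats).
    corpus_concepts: list of such lists, one per document in the target set.
    """
    tf = Counter(doc_concepts)
    n_docs = len(corpus_concepts)
    weights = {}
    for concept, count in tf.items():
        # Document frequency: number of documents containing the concept.
        df = sum(1 for concepts in corpus_concepts if concept in concepts)
        # Smoothed idf (an assumption of this sketch).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[concept] = count * idf
    return weights
```

A concept that occurs often in one document but in few other documents receives a larger weight than a concept that appears throughout the target set.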

Next, in step S6220, a hypergraph is constructed for each document. When multiple concepts appear in the same context within one document, these concepts are determined to have a direct semantic relevance to one another. Specifically, for the concept set C of a given document, a node v is created for each concept, forming a node set V. Then, for each sentence in the document that is found to contain a concept set {C_1, C_2, ..., C_n}, an edge e formed by the concept set {C_1, C_2, ..., C_n} is added, forming an edge set E. This yields a hypergraph denoted G(V, E).
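The construction of G(V, E) can be sketched as below: one node per concept and one hyperedge per sentence. The function name is hypothetical, and concept recognition is reduced to a case-insensitive substring match purely for illustration; the text relies on an ontology-based concept recognition technique instead. Skipping sentences with fewer than two concepts is also an assumption.

```python
def build_hypergraph(sentences, concepts):
    """Return (V, E): V is the node set, E a list of hyperedges (frozensets).

    sentences: list of sentence strings from one document.
    concepts: iterable of concept strings (e.g. taken from a domain ontology).
    """
    concepts = list(concepts)
    nodes = set()
    edges = []
    for sentence in sentences:
        text = sentence.lower()
        # Naive concept recognition by substring match (an assumption).
        found = frozenset(c for c in concepts if c.lower() in text)
        nodes |= found
        # One hyperedge per sentence that relates at least two concepts.
        if len(found) >= 2:
            edges.append(found)
    return nodes, edges
```

Because a hyperedge is a set of any size, a sentence mentioning three concepts yields a single edge over all three rather than three pairwise edges.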

Subsequently, in step S6230, the initial hypergraph is refined using the domain ontology. Specifically, the initial hypergraph is refined by performing the node operations and edge operations described above with reference to FIG. 4.
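One of the node operations, merging two nodes whose concepts have the same meaning in the domain ontology, can be sketched as follows. The function name and the `canonical` dictionary, which stands in for an ontology synonym lookup, are hypothetical. The edge operations are not shown.

```python
def merge_synonym_nodes(nodes, edges, canonical):
    """Merge nodes whose concepts share a meaning.

    nodes: set of concept strings.
    edges: list of hyperedges (iterables of concept strings).
    canonical: dict mapping a concept to its preferred synonym; concepts
        missing from the dict are kept unchanged (hypothetical lookup).
    """
    def rename(concept):
        return canonical.get(concept, concept)

    merged_nodes = {rename(n) for n in nodes}
    # Rewrite every hyperedge so that it refers to the merged nodes.
    # Repeated edges are kept so that edge frequencies can still be counted.
    merged_edges = [frozenset(rename(n) for n in edge) for edge in edges]
    return merged_nodes, merged_edges
```

After merging, occurrences of either synonym contribute to the same node, so the node weight reflects the combined frequency of the concept.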

Finally, in step S6240, weights are assigned to the nodes and edges of the refined hypergraph based on the importance of the semantic information in the document. For example, a node corresponding to a given concept can be weighted based on the frequency (number of occurrences) of that concept in the document, and an edge can be weighted based on the frequency (number of occurrences) of the concepts of the edge, the frequency (number of occurrences) of the edge in the document, and the novelty of the edge.

Returning to FIG. 6, after step S620 is completed, in step S630 the documents corresponding to a specific query are retrieved based on the hypergraphs generated in step S620. FIG. 7(b) shows a specific example of the document retrieval process. As shown in FIG. 7, in step S6310, a minimum spanning tree is first generated using the hypergraph generated in step S620. The minimum spanning tree can be generated using, for example, a greedy algorithm. In the greedy algorithm, any two nodes are connected by the shortest path, and the algorithm terminates once all the given nodes have been connected.

Next, in step S6320, a semantic relevance score is calculated for the generated minimum spanning tree. As an example, given a document Doc1 and a query $(q_1, q_2, \ldots, q_n)$, the minimum spanning tree that the minimum spanning tree generation subunit 1210 computes for the query is $T = \{r, (q_1, q_2, \ldots, q_n)\}$, where $r$ is the root of $T$ and $m$ is the number of edges of $T$. The relevance calculation subunit 1220 then calculates the semantic relevance score of document Doc1 for the query as the average weight of the edges $e_1, e_2, \ldots, e_m$ of $T$:

$$\mathrm{Score}(\mathrm{Doc1}) = \frac{\mathrm{weight}(e_1) + \mathrm{weight}(e_2) + \cdots + \mathrm{weight}(e_m)}{m}.$$

Finally, in step S6330, the documents are ranked based on the calculated semantic relevance scores, and the final result of the document search is obtained.

Returning to FIG. 6, after step S630 is completed, the method 60 ends at step S640.

The document search apparatus and method according to the present invention improve the accuracy of document search by exploiting the rich semantic information implicit in a document: a hypergraph is constructed for the document, a relevance score of the document for a specific query is calculated, and the document is ranked based on the calculated relevance score. The user's actual needs in a search can therefore be met more effectively.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the invention. Accordingly, the invention is not limited to the above embodiments, but only by the appended claims and their equivalents.

Further, some or all of the above embodiments can also be described as in the following appendices, but are not limited thereto.

(Appendix 1)
A document search apparatus comprising:
a hypergraph construction unit configured to construct, for a document in a target document set, a hypergraph describing implicit semantic information contained in the document; and
a document ranking unit configured to search the target document set for documents corresponding to a specific query based on the hypergraph constructed by the hypergraph construction unit, and to rank the retrieved documents.

(Appendix 2)
The document search apparatus according to Appendix 1, wherein the hypergraph construction unit comprises:
a concept extraction subunit configured to extract concepts from a document using domain ontology information and to calculate weights of the concepts;
a hypergraph construction subunit configured to construct an initial hypergraph of the document;
a hypergraph refinement subunit configured to refine the initial hypergraph using the domain ontology information; and
a weighting subunit configured to assign weights to the nodes and edges of the refined hypergraph.

(Appendix 3)
The document search apparatus according to Appendix 2, wherein the hypergraph construction subunit is configured to:
create a node for each concept in the concept set contained in the document to form a node set; and
add an edge formed by the concept set contained in each sentence of the document to form an edge set, thereby constructing an initial hypergraph composed of the node set and the edge set.

(Appendix 4)
The document search apparatus according to Appendix 2, wherein the hypergraph refinement subunit is configured to:
merge two nodes in the initial hypergraph when the concepts corresponding to the two nodes have the same meaning in the domain ontology;
add an edge connecting any number of nodes in the initial hypergraph when the concepts corresponding to those nodes are directly associated in the domain ontology; and
merge two edges in the initial hypergraph when the concepts corresponding to the two edges are close to each other in the domain ontology or in the initial hypergraph.

(Appendix 5)
The document search apparatus according to Appendix 2, wherein the weighting subunit is configured to:
assign a weight to the node corresponding to a specific concept based on the frequency of that concept in the document; and
assign a weight to an edge based on the frequency in the document of the concepts of the edge, the frequency of the edge in the document, and the novelty of the edge, the novelty of the edge being the sum of the novelties of the semantic relevance between any two nodes in the edge.

(Appendix 6)
The document search apparatus according to Appendix 5, wherein the novelty of the semantic relevance between two nodes is 1 when the semantic distance between the two nodes in the domain ontology does not exceed the number of nodes in the edge minus 1, and is otherwise the semantic distance between the two nodes in the domain ontology divided by the number of nodes in the edge minus 1.

(Appendix 7)
The document search apparatus according to Appendix 1, wherein the document ranking unit comprises:
a minimum spanning tree generation subunit configured to generate a minimum spanning tree for a specific query using the hypergraph constructed by the hypergraph construction unit;
a relevance calculation subunit configured to calculate a semantic relevance score for the generated minimum spanning tree; and
a document ranking subunit configured to rank documents based on the semantic relevance score.

(Appendix 8)
The document search apparatus according to Appendix 7, wherein the minimum spanning tree generation subunit is configured to generate the minimum spanning tree using a greedy algorithm.

(Appendix 9)
The document search apparatus according to Appendix 7, wherein the relevance calculation subunit is configured to calculate, as the semantic relevance score, the average of the weights of all edges in the minimum spanning tree.

(Appendix 10)
A document search method comprising:
constructing, for a document in a target document set, a hypergraph describing implicit semantic information contained in the document; and
searching the target document set for documents corresponding to a specific query based on the constructed hypergraph, and ranking the retrieved documents.

(Appendix 11)
The document search method according to Appendix 10, wherein the constructing step comprises:
extracting concepts from a document using domain ontology information and calculating weights for the concepts;
constructing an initial hypergraph for the document;
refining the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges of the refined hypergraph.

(Appendix 12)
The document search method according to Appendix 11, wherein the step of constructing an initial hypergraph for the document comprises:
creating a node for each concept in the concept set contained in the document to form a node set;
adding an edge formed by the concept set contained in each sentence of the document to form an edge set; and
constructing an initial hypergraph composed of the node set and the edge set.

(Appendix 13)
The document search method according to Appendix 11, wherein the step of refining the initial hypergraph comprises:
merging two nodes in the initial hypergraph when the concepts corresponding to the two nodes have the same meaning in the domain ontology;
adding an edge connecting any number of nodes in the initial hypergraph when the concepts corresponding to those nodes are directly associated in the domain ontology; and
merging two edges in the initial hypergraph when the concepts corresponding to the two edges are close to each other in the domain ontology or in the initial hypergraph.

(Appendix 14)
The document search method according to Appendix 11, wherein the step of assigning weights comprises:
assigning a weight to the node corresponding to a specific concept based on the frequency of that concept in the document; and
assigning a weight to an edge based on the frequency in the document of the concepts of the edge, the frequency of the edge in the document, and the novelty of the edge, the novelty of the edge being the sum of the novelties of the semantic relevance between any two nodes in the edge.

(Appendix 15)
The document search method according to Appendix 14, wherein the novelty of the semantic relevance between two nodes is 1 when the semantic distance between the two nodes in the domain ontology does not exceed the number of nodes in the edge minus 1, and is otherwise the semantic distance between the two nodes in the domain ontology divided by the number of nodes in the edge minus 1.

(Appendix 16)
The document search method according to Appendix 10, wherein the searching and ranking step comprises:
generating a minimum spanning tree for a specific query using the constructed hypergraph;
calculating a semantic relevance score for the generated minimum spanning tree; and
ranking documents based on the semantic relevance score.

(Appendix 17)
The document search method according to Appendix 16, wherein the step of generating the minimum spanning tree comprises generating the minimum spanning tree using a greedy algorithm.

(Appendix 18)
The document search method according to Appendix 16, wherein the step of calculating the semantic relevance score comprises calculating, as the semantic relevance score, the average of the weights of all edges in the minimum spanning tree.

10: Document search apparatus
110: Hypergraph construction unit
120: Document ranking unit
1110: Concept extraction subunit
1120: Hypergraph construction subunit
1130: Hypergraph refinement subunit
1140: Weighting subunit
1210: Minimum spanning tree generation subunit
1220: Relevance calculation subunit
1230: Document ranking subunit

Claims

A document search apparatus comprising:
a hypergraph construction unit configured to construct, for a document in a target document set, a hypergraph describing implicit semantic information contained in the document; and
a document ranking unit configured to search the target document set for documents corresponding to a specific query based on the hypergraph constructed by the hypergraph construction unit, and to rank the retrieved documents,
wherein the hypergraph construction unit comprises:
a concept extraction subunit configured to extract concepts from a document using domain ontology information and to calculate weights of the concepts;
a hypergraph construction subunit configured to construct an initial hypergraph of the document;
a hypergraph refinement subunit configured to refine the initial hypergraph using the domain ontology information; and
a weighting subunit configured to assign weights to the nodes and edges of the refined initial hypergraph.

The document search apparatus according to claim 1, wherein the hypergraph construction subunit is configured to:
create a node for each concept in the concept set contained in the document to form a node set; and
add an edge formed by the concept set contained in each sentence of the document to form an edge set, thereby constructing an initial hypergraph composed of the node set and the edge set.

The document search apparatus according to claim 1, wherein the hypergraph refinement subunit is configured to:
merge two nodes in the initial hypergraph when the concepts corresponding to the two nodes have the same meaning in the domain ontology;
add an edge connecting any number of nodes in the initial hypergraph when the concepts corresponding to those nodes are directly associated in the domain ontology; and
merge two edges in the initial hypergraph when the concepts corresponding to the two edges are close to each other in the domain ontology or in the initial hypergraph.

The document search apparatus according to claim 1, wherein the weighting subunit is configured to:
assign a weight to the node corresponding to a specific concept based on the frequency of that concept in the document; and
assign a weight to an edge based on the frequency in the document of the concepts of the edge, the frequency of the edge in the document, and the novelty of the edge, the novelty of the edge being the sum of the novelties of the semantic relevance between any two nodes in the edge.

The document search apparatus according to claim 4, wherein the novelty of the semantic relevance between two nodes is 1 when the semantic distance between the two nodes in the domain ontology does not exceed the number of nodes in the edge minus 1, and is otherwise the semantic distance between the two nodes in the domain ontology divided by the number of nodes in the edge minus 1.

A document search apparatus comprising:
a hypergraph construction unit configured to construct, for a document in a target document set, a hypergraph describing implicit semantic information contained in the document; and
a document ranking unit configured to search the target document set for documents corresponding to a specific query based on the hypergraph constructed by the hypergraph construction unit, and to rank the retrieved documents,
wherein the document ranking unit comprises:
a minimum spanning tree generation subunit configured to generate a minimum spanning tree for a specific query using the hypergraph constructed by the hypergraph construction unit;
a relevance calculation subunit configured to calculate a semantic relevance score for the generated minimum spanning tree; and
a document ranking subunit configured to rank documents based on the semantic relevance score.

The document search apparatus according to claim 6, wherein the minimum spanning tree generation subunit is configured to generate the minimum spanning tree using a greedy algorithm.

The document search apparatus according to claim 6, wherein the relevance calculation subunit is configured to calculate, as the semantic relevance score, the average of the weights of all edges in the minimum spanning tree.

A document search method comprising:
constructing, for a document in a target document set, a hypergraph describing implicit semantic information contained in the document; and
searching the target document set for documents corresponding to a specific query based on the constructed hypergraph, and ranking the retrieved documents,
wherein the step of constructing the hypergraph includes:
extracting concepts from the document using domain ontology information and calculating weights of the concepts;
constructing an initial hypergraph of the document;
refining the initial hypergraph using the domain ontology information; and
assigning weights to the nodes and edges of the refined initial hypergraph.

A document search method comprising:
constructing, for a document in a target document set, a hypergraph describing implicit semantic information contained in the document; and
searching the target document set for documents corresponding to a specific query based on the constructed hypergraph, and ranking the retrieved documents,
wherein the step of ranking the documents includes:
generating a minimum spanning tree for a specific query using the constructed hypergraph;
calculating a semantic relevance score for the generated minimum spanning tree; and
ranking documents based on the semantic relevance score.