JP2006059082A - Document summary system and method, computer-readable storage medium storing program, and program

Info

Publication number: JP2006059082A
Application number: JP2004239544A
Authority: JP
Inventors: Tatsunori Mori (森辰則); Masanori Nozawa (野澤正憲); Yoshiaki Asada (浅田義昭)
Original assignee: Yokohama National University NUC
Current assignee: Yokohama National University NUC
Priority date: 2004-08-19
Filing date: 2004-08-19
Publication date: 2006-03-02

Abstract

PROBLEM TO BE SOLVED: To generate, in a document summarization system that produces a summary document from a plurality of documents, a summary that includes generally important information as well as answers to a query in a balanced manner.

SOLUTION: For each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance degree, which is the importance of the sentence in general, is calculated (S503), and a query response sentence importance degree, which is the importance of the sentence as a response to the query, is likewise calculated (S504). The general-purpose sentence importance degree and the query response sentence importance degree are integrated to calculate an integrated sentence importance degree (S505). Then, based on the integrated sentence importance degree, important sentences are extracted from the sentences included in the plurality of documents to be summarized (S507), and the extracted important sentences are arranged in order to generate the summary (S508).

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a document summarization system that generates a summary document from a plurality of documents, and more particularly to a technique for generating a summary document that includes, in a balanced manner, not only answers to questions but also generally important information.

With the flood of documents available today, there is a demand for finding the required information efficiently. Techniques such as information retrieval and question answering are making it easier to obtain the set of documents related to an information need, or the answer itself, but ultimately the original documents still have to be examined. Complementary to these techniques is multi-document summarization applied to a set of retrieved documents. In particular, “summarization focused on the answer to a question” has attracted attention in recent years. It rests on the idea that a user has an information need during the retrieval process and that the need can be expressed as a question. In DUC 2004, organized by NIST, one of the tasks was multi-document summarization focused on a single question.

In multi-document summarization, a summary must have a certain length for the reader to grasp the content, so generating a separate summary document for each item the user wants to know ultimately increases the amount of text the user has to read. It is therefore desirable to generate a summary that gives an overview of the answers to several requests, together with their background knowledge, at once.

JP 2004-118545 A
JP 2001-265792 A
Tatsunori Mori, “Word importance calculation based on information gain ratio in document summarization for search result display”, Natural Language Processing, 2002, Vol. 9, No. 4, pp. 3-32
Tsutomu Hirao and two others, “Document Summarization Method Adapted to Questions and Its Evaluation”, Journal of Information Processing Society of Japan, 2001, Vol. 42, No. 9, pp. 2259-2269

In view of the above, the present invention proposes, as a sentence importance calculation method that can handle a plurality of questions, a technique that uses the answer scores of a question answering engine. This technique is then combined with a sentence importance calculation method designed for general-purpose summarization.

The document summarization system according to the present invention is a document summarization system that generates, from a plurality of documents to be summarized, a summary document including an answer that responds to a question, and is characterized by having the following elements:
(1) a general-purpose sentence importance calculation unit that calculates, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in general;
(2) a question-answer sentence importance calculation unit that calculates, for each sentence included in the plurality of documents to be summarized, a question-answer sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation unit that integrates the general-purpose sentence importance and the question-answer sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction unit that extracts important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence ordering unit that arranges the extracted important sentences in order to generate a summary document; and
(6) a summary document generation unit that outputs the generated summary document.

The general-purpose sentence importance calculation unit obtains, for each word in a sentence, a general-purpose word importance, which is the importance of the word in general, divides the sum of the general-purpose word importances of the words in the sentence by the length of the sentence, and uses the quotient as a factor in determining the general-purpose sentence importance of the sentence.

The general-purpose sentence importance calculation unit calculates the within-document word frequency of the words in a document to be summarized, obtains for each word in a sentence a value that uses the within-document frequency of that word as a weight, divides the sum of these values over the words in the sentence by the length of the sentence, and uses the quotient as a factor in determining the general-purpose sentence importance of the sentence.

The general-purpose sentence importance calculation unit calculates the inverse document frequency of each word from the documents that are candidates for summarization, obtains for each word in a sentence a value that uses the inverse document frequency of that word as a weight, divides the sum of these values over the words in the sentence by the length of the sentence, and uses the quotient as a factor in determining the general-purpose sentence importance of the sentence.

The general-purpose sentence importance calculation unit clusters the plurality of documents hierarchically and, in order to weight words whose occurrence distribution follows the cluster structure, obtains for each word in a document the sum of the information gain ratios of that word in the clusters to which the document belongs at each level of the hierarchy. For each word in a sentence, it obtains a value that uses this sum of information gain ratios, divides the sum of these values over the words in the sentence by the length of the sentence, and uses the quotient as a factor in determining the general-purpose sentence importance of the sentence.

The length of the sentence is any one of the number of characters, the number of words, the number of bunsetsu (phrase units), or the number of clauses in the sentence.
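The length-normalized scoring described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `tf_weights` and `general_sentence_importance` are hypothetical, the words are assumed to be segmented already, within-document word frequency is used as the weight, and the sentence length defaults to the word count (one of the length measures listed above).

```python
from collections import Counter


def tf_weights(document_words):
    """Word frequency within one document, used as a per-word weight."""
    return dict(Counter(document_words))


def general_sentence_importance(sentence_words, word_weight, length=None):
    """Sum the weights of the words in a sentence and divide by its length.

    `length` may be a character, word, bunsetsu, or clause count; when it is
    omitted, the number of words is used (an assumption of this sketch).
    """
    if length is None:
        length = len(sentence_words)
    if length <= 0:
        return 0.0
    total = sum(word_weight.get(word, 0.0) for word in sentence_words)
    return total / length


document = ["summary", "system", "summary", "query"]
weights = tf_weights(document)
score = general_sentence_importance(["summary", "query"], weights)
```

Passing a different weight table (for example inverse document frequencies, or summed information gain ratios) reuses the same normalization, which mirrors how the variants above differ only in the per-word weight.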

The question-answer sentence importance calculation unit calculates, for each word in a sentence, a score indicating how good the word is as an answer to the question, and calculates the question-answer sentence importance of the sentence based on these scores.

The integrated sentence importance calculation unit calculates the integrated sentence importance by apportioning the general-purpose sentence importance and the question-answer sentence importance according to a predetermined weight.
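One plausible reading of this weighting is a linear interpolation between the two importance values. The sketch below assumes that form; the name `integrate_importance` and the parameter `alpha` are illustrative and do not come from the source, which only states that a predetermined weight is used.

```python
def integrate_importance(general, question_answer, alpha=0.5):
    """Blend two per-sentence score tables with a fixed weight.

    general, question_answer: {sentence_id: score}
    alpha: share given to the general-purpose score (assumed to lie in [0, 1]).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    sentence_ids = general.keys() | question_answer.keys()
    return {
        sid: alpha * general.get(sid, 0.0)
        + (1.0 - alpha) * question_answer.get(sid, 0.0)
        for sid in sentence_ids
    }


integrated = integrate_importance(
    {"s1": 0.8, "s2": 0.2}, {"s1": 0.0, "s2": 1.0}, alpha=0.25
)
```

With `alpha` close to 1 the summary behaves like a general-purpose summary, and with `alpha` close to 0 it concentrates on sentences that answer the questions.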

The document summarization method according to the present invention is a document summarization method performed by a document summarization system that generates, from a plurality of documents to be summarized, a summary document including an answer that responds to a question, and is characterized by having the following steps:
(1) a general-purpose sentence importance calculation step of calculating, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in general;
(2) a question-answer sentence importance calculation step of calculating, for each sentence included in the plurality of documents to be summarized, a question-answer sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation step of integrating the general-purpose sentence importance and the question-answer sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction step of extracting important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence ordering step of arranging the extracted important sentences in order to generate a summary document; and
(6) a summary document generation step of outputting the generated summary document.

The computer-readable recording medium according to the present invention is characterized by recording a program that causes a computer serving as a document summarization system, which generates from a plurality of documents to be summarized a summary document including an answer that responds to a question, to execute the following processes:
(1) a general-purpose sentence importance calculation process of calculating, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in general;
(2) a question-answer sentence importance calculation process of calculating, for each sentence included in the plurality of documents to be summarized, a question-answer sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation process of integrating the general-purpose sentence importance and the question-answer sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction process of extracting important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence ordering process of arranging the extracted important sentences in order to generate a summary document; and
(6) a summary document generation process of outputting the generated summary document.

The program according to the present invention is characterized by causing a computer serving as a document summarization system, which generates from a plurality of documents to be summarized a summary document including an answer that responds to a question, to execute the following procedures:
(1) a general-purpose sentence importance calculation procedure of calculating, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in general;
(2) a question-answer sentence importance calculation procedure of calculating, for each sentence included in the plurality of documents to be summarized, a question-answer sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation procedure of integrating the general-purpose sentence importance and the question-answer sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction procedure of extracting important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence ordering procedure of arranging the extracted important sentences in order to generate a summary document; and
(6) a summary document generation procedure of outputting the generated summary document.

The document summarization system according to the present invention is a document summarization system that generates, from a plurality of documents to be summarized, a summary document for a plurality of questions. For each word in each sentence included in the plurality of documents to be summarized, the system calculates a plurality of scores, one per question, each indicating how good the word is as an answer to that question; normalizes the calculated scores within each set of scores that share the same question; selects, for each word in the sentence, the maximum of its normalized scores over the questions; and calculates the question-answer sentence importance of the sentence based on the selected maximum normalized scores.
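The normalize-then-take-the-maximum procedure for several questions can be sketched as below. Two details are assumptions made only for this illustration, since the passage does not fix them: scores are normalized by dividing by the largest score obtained for the same question, and a sentence receives the largest selected score among its words. All function names are hypothetical.

```python
def normalize_per_question(scores):
    """scores: {question_id: {word: raw_score}}.

    Each question's scores are divided by that question's largest score
    (the normalization rule is an assumption of this sketch).
    """
    normalized = {}
    for question_id, by_word in scores.items():
        top = max(by_word.values(), default=0.0)
        normalized[question_id] = {
            word: (score / top if top > 0 else 0.0)
            for word, score in by_word.items()
        }
    return normalized


def best_score_per_word(normalized):
    """For each word, keep the maximum normalized score over all questions."""
    best = {}
    for by_word in normalized.values():
        for word, score in by_word.items():
            best[word] = max(best.get(word, 0.0), score)
    return best


def qa_sentence_importance(sentence_words, best):
    """Sentence score = largest selected word score (assumed aggregation)."""
    return max((best.get(word, 0.0) for word in sentence_words), default=0.0)


raw = {
    "q1": {"Tokyo": 4.0, "Osaka": 2.0},
    "q2": {"2004": 10.0, "Tokyo": 1.0},
}
best = best_score_per_word(normalize_per_question(raw))
importance = qa_sentence_importance(["Osaka", "is", "large"], best)
```

Normalizing within each question keeps a question whose engine scores happen to be numerically large from dominating the others before the maximum is taken.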

The document summarization method according to the present invention is a document summarization method performed by a document summarization system that generates, from a plurality of documents to be summarized, a summary document for a plurality of questions, and is characterized by having the following steps:
(1) a step of calculating, for each word in each sentence included in the plurality of documents to be summarized, a plurality of scores, one per question, each indicating how good the word is as an answer to that question;
(2) a step of normalizing the calculated scores within each set of scores that share the same question;
(3) a step of selecting, for each word in the sentence, the maximum of its normalized scores over the questions; and
(4) a step of calculating the question-answer sentence importance of the sentence based on the selected maximum normalized scores.

The computer-readable recording medium according to the present invention is characterized by recording a program that causes a computer serving as a document summarization system, which generates from a plurality of documents to be summarized a summary document for a plurality of questions, to execute the following processes:
(1) a process of calculating, for each word in each sentence included in the plurality of documents to be summarized, a plurality of scores, one per question, each indicating how good the word is as an answer to that question;
(2) a process of normalizing the calculated scores within each set of scores that share the same question;
(3) a process of selecting, for each word in the sentence, the maximum of its normalized scores over the questions; and
(4) a process of calculating the question-answer sentence importance of the sentence based on the selected maximum normalized scores.

The program according to the present invention is characterized by causing a computer serving as a document summarization system, which generates from a plurality of documents to be summarized a summary document including an answer that responds to a question, to execute the following procedures:
(1) a procedure of calculating, for each word in each sentence included in the plurality of documents to be summarized, a plurality of scores, one per question, each indicating how good the word is as an answer to that question;
(2) a procedure of normalizing the calculated scores within each set of scores that share the same question;
(3) a procedure of selecting, for each word in the sentence, the maximum of its normalized scores over the questions; and
(4) a procedure of calculating the question-answer sentence importance of the sentence based on the selected maximum normalized scores.

In the present invention, the importance of a sentence in general and its importance as a response to a question are integrated, and important sentences are extracted according to the integrated importance. A summary document can therefore be produced that includes, in a balanced manner, not only answers to the questions but also generally important information. A summary document that includes answers to a plurality of questions can also be produced.

Embodiment 1.
First, an overview is given. Assume that the set of documents to be summarized has been obtained as the result of information retrieval or the like, and that the user's information need is given as a plurality of questions. The natural setting would be for the user to pose questions one at a time in a dialogue with the system, with the system each time summarizing the context that contains the answer while taking into account its relation to the earlier summaries. In this embodiment, however, it is assumed as a first approximation that the plurality of questions is given at once. Generating summaries within a dialogue with the user is also conceivable.

In this situation, the following are considered necessary for multi-document summarization:
1) extraction of important passages that takes the information need into account;
2) removal of passages that are redundant across documents; and
3) extraction of differences between documents.

The proposed method uses the following techniques for these:
(a) sentence importance calculation based on the output scores of a question answering engine;
(b) sentence importance calculation based on the information gain ratio of word occurrence distributions; and
(c) control of redundancy in the summary document based on MMR.
In addition, to ensure cohesion between the extracted sentences, the method adopts
(d) sentence importance smoothing based on the Hanning window function.

Specifically, (a) corresponds to the question-answer sentence importance calculation process (S504) described later, (b) to the general-purpose sentence importance calculation process (S503), (c) to the important sentence extraction process (S507), and (d) to the sentence importance smoothing process (S506).
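Techniques (c) and (d) are only named at this point, so the following is a generic sketch of each and not the procedure of this patent. It assumes a Hann (Hanning) window of odd width applied over the sentence scores of one document with zero padding at the ends, and a greedy MMR (Maximal Marginal Relevance) selection that uses Jaccard word overlap as the similarity measure. The names `hanning_smooth`, `jaccard`, `mmr_select`, and the parameter `lam` are hypothetical.

```python
import math


def hanning_smooth(scores, width=3):
    """Smooth sentence scores (in document order) with a normalized Hann window.

    Positions outside the document contribute 0 (an assumption of this sketch).
    """
    # Hann weights without the zero end points: width=3 gives 0.5, 1.0, 0.5.
    raw = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 1) / (width + 1))
           for k in range(width)]
    total = sum(raw)
    window = [w / total for w in raw]
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        value = 0.0
        for k, weight in enumerate(window):
            j = i + k - half
            if 0 <= j < len(scores):
                value += weight * scores[j]
        smoothed.append(value)
    return smoothed


def jaccard(words_a, words_b):
    """Word-set overlap, used here as a stand-in similarity measure."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def mmr_select(sentences, importance, k, lam=0.5):
    """Greedily pick k sentence IDs, trading importance against redundancy.

    sentences: {sentence_id: [words]}; importance: {sentence_id: score}.
    """
    selected = []
    candidates = set(sentences)
    while candidates and len(selected) < k:
        def mmr(sid):
            redundancy = max(
                (jaccard(sentences[sid], sentences[s]) for s in selected),
                default=0.0,
            )
            return lam * importance[sid] - (1.0 - lam) * redundancy
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Smoothing raises the score of sentences adjacent to a highly scored sentence, which tends to pull in surrounding context; the MMR step then penalizes candidates that largely repeat what has already been selected.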

The inputs to the document summarization system are a set of Japanese documents to be summarized (or their IDs), a set of questions corresponding to the information need, and the desired length of the extract (in characters or in sentences). The output is an extract of the document set (a sequence of sentences), that is, the summary document.

The documents to be summarized are selected from a database of candidate documents. FIG. 1 shows the configuration for selecting the documents to be summarized, which comprises a summarization target candidate document database 101, a summarization target document selection unit 102, and a summarization target document storage unit 103. To select the documents, the summarization target document selection unit 102 may search the documents in the summarization target candidate document database 101 according to a search condition and store the result in the summarization target document storage unit 103. Alternatively, the summarization target document selection unit 102 may accept documents specified directly by the operator and obtain the specified document IDs from the summarization target candidate document database 101.

FIG. 2 shows an example of the IDs of the documents to be summarized. As shown in the figure, the summarization target document storage unit 103 stores document IDs, and when the data of a document is needed, the document identified by its ID can be obtained from the summarization target candidate document database 101. Alternatively, the summarization target document storage unit 103 may store the document data in association with the document ID.

Input of the questions is described next. FIG. 3 shows the configuration for question input, which comprises a question input unit 301 and a question storage unit 302. The question input unit 301 accepts the questions, and the question storage unit 302 stores them as a set.

FIG. 4 shows an example of the question storage unit. It holds one record per question, with a question ID field and a question text field associated with each other.

Next, the overall processing of the document summarization system is described. FIG. 5 shows the overall processing flow. First, as preprocessing, an inverse document frequency (IDF value) calculation process (S501) and a document analysis process (S502) are performed. FIG. 6 shows the configuration for the inverse document frequency calculation process and the document analysis process.

The inverse document frequency (IDF value) calculation unit 601 reads the candidate documents stored in the summarization target candidate document database 101, calculates the inverse document frequency (IDF value) of each word that appears in them, and stores each IDF value in the inverse document frequency (IDF value) table 602 in association with the word. The IDF values are used in the general-purpose sentence importance calculation process (S503) described later.
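As an illustration of how such a table could be built, the sketch below uses the common logarithmic form log(N / df), where N is the number of candidate documents and df is the number of documents containing the word. The passage does not state the exact formula used by unit 601, so this form and the name `idf_table` are assumptions.

```python
import math


def idf_table(documents):
    """Build {word: IDF value} from a list of documents given as word lists.

    Uses log(N / df); the exact formula in the source is not specified here.
    """
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


table = idf_table([["a", "b"], ["a", "c"], ["a"]])
```

A word that appears in every candidate document, such as `"a"` in the example, receives a weight of zero, while words confined to few documents receive larger weights.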

The document analysis unit 603 reads the documents to be summarized from the summarization target document storage unit 103 and stores the analysis results in a sentence table 604, a word table 605, a sentence structure table 606, and a sentence source table 607.

FIG. 7 shows an example of the sentence table. It holds one record per sentence, with a sentence ID field and a sentence field associated with each other.

FIG. 8 shows an example of the word table. It holds one record per word, with a word ID field and a word field associated with each other.

FIG. 9 shows an example of the sentence structure table. It holds one record per sentence, with a sentence ID field followed by a sequence of word ID fields in order from the beginning of the sentence (first word ID, second word ID, third word ID, and so on), all associated with each other.

FIG. 10 shows an example of the sentence source table. It holds one record per sentence, with fields for the ID of the document the sentence comes from and the position of the sentence within that document (its ordinal number), associated with each other.

The sentence analysis process that generates these tables is now described in detail. FIG. 11 shows the sentence analysis processing flow. The following processing is repeated for each document to be summarized (S1101) and, within it, for each sentence in the document (S1102).

A sentence ID is assigned to the sentence (S1103), and the sentence ID and the sentence are stored in the sentence table 604 in association with each other (S1104). The sentence ID, the ID of the source document, and the position of the sentence (its ordinal number within the document) are also stored in the sentence source table 607 in association with each other (S1105).

Next, morphological analysis is performed on the sentence (S1106), a word ID is assigned to each word obtained (S1107), and the assigned word ID and the word are stored in the word table 605 in association with each other (S1108). Words that have already been assigned a word ID are skipped.

The sentence ID and the IDs of the words that make up the sentence, in order, are then stored in the sentence structure table (S1109). When all sentences have been processed (S1110), processing moves on to the next document, and it ends when all documents have been processed (S1111).
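The loop of steps S1101 to S1111 can be sketched as follows. The sketch makes several simplifying assumptions that are not in the source: documents arrive already split into sentences, whitespace splitting stands in for morphological analysis, IDs are consecutive integers, and the word table is kept as a word-to-ID mapping for fast lookup. The name `analyze_documents` is hypothetical.

```python
def analyze_documents(documents, tokenize=str.split):
    """documents: {document_id: [sentence, ...]} in document order.

    Returns the four tables as dictionaries.
    """
    sentence_table = {}   # sentence ID -> sentence text
    word_table = {}       # word -> word ID (inverse of the table in FIG. 8)
    structure_table = {}  # sentence ID -> ordered list of word IDs
    source_table = {}     # sentence ID -> (document ID, position in document)

    for document_id, sentences in documents.items():               # S1101
        for position, sentence in enumerate(sentences, start=1):   # S1102
            sentence_id = len(sentence_table) + 1                  # S1103
            sentence_table[sentence_id] = sentence                 # S1104
            source_table[sentence_id] = (document_id, position)    # S1105
            word_ids = []
            for word in tokenize(sentence):                        # S1106
                if word not in word_table:                         # skip known words
                    word_table[word] = len(word_table) + 1         # S1107, S1108
                word_ids.append(word_table[word])
            structure_table[sentence_id] = word_ids                # S1109
    return sentence_table, word_table, structure_table, source_table


tables = analyze_documents(
    {"D1": ["the cat sat", "the dog ran"], "D2": ["a cat ran"]}
)
```

For Japanese text the `tokenize` argument would be replaced by a morphological analyzer, since words are not separated by spaces.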

As shown in FIG. 5, the document analysis process (S502) is followed by a process that calculates the general-purpose sentence importance (S503: general-purpose sentence importance calculation process), a process that calculates the sentence importance as a response to the questions (S504: question-answer sentence importance calculation process), and a process that calculates the integrated sentence importance (S505: integrated sentence importance calculation process).

図１２は、汎用文重要度計算処理と質問応答文重要度計算処理と統合文重要度算出処理に係る構成を示す図である。汎用文重要度計算度１２０１、汎用文重要度テーブル１２０２、質問応答文重要度計算部１２０３、質問応答文重要度テーブル１２０４、統合文重要度算出部１２０５、及び統合文重要度テーブル１２０６の要素を有している。 FIG. 12 is a diagram illustrating a configuration related to general-purpose sentence importance calculation processing, question answer sentence importance calculation processing, and integrated sentence importance calculation processing. The elements of the general sentence importance calculation degree 1201, the general sentence importance degree table 1202, the question answer sentence importance calculation section 1203, the question answer sentence importance degree table 1204, the integrated sentence importance calculation section 1205, and the integrated sentence importance degree table 1206 are shown. Have.

まず、汎用文重要度計算度１２０１による汎用としての文重要度を計算する処理（Ｓ５０３：汎用文重要度計算処理）について説明する。この処理では、語の出現分布に関する情報利得比に基づく文の重要度計算を行う。 First, a process for calculating a sentence importance as a general sentence based on the general sentence importance calculation degree 1201 (S503: general sentence importance calculation process) will be described. In this process, sentence importance is calculated based on the information gain ratio regarding the word appearance distribution.

発明者は、検索結果文書の各々を要約する手法として、情報利得比に基づく語の重み付けを用いた重要文抽出手法を提案している。この手法では、検索結果文書間の類似性構造を階層的クラスタリングにより抽出し、その構造に則した出現分布を持つ語に高い重みをつけるために、情報利得比（ＩｎｆｏｒｍａｔｉｏｎＧａｉｎＲａｔｉｏ、ＩＧＲ）に基づく語の重要度計算を行なう。本実施例では、この手法を利用し、与えられた文書群に関する重要文の抽出を行なう。 The inventor has proposed an important sentence extraction technique using word weighting based on an information gain ratio as a technique for summarizing each search result document. In this method, a similarity structure between search result documents is extracted by hierarchical clustering, and based on an information gain ratio (IGR) in order to give high weight to words having an appearance distribution according to the structure. Perform word importance calculation. In the present embodiment, this technique is used to extract important sentences relating to a given document group.

Let Ci be the i-th partial cluster of C. The information gain ratio IGR(w, C) of the probability distribution of word w in cluster C is then obtained as follows.

... Formula (1)

... Formula (2)

... Formula (3)

... Formula (4)
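The formula bodies are not reproduced in this text. Going by the processing flow of FIGS. 19 to 21 described later in this section, Formulas (1) to (4) can be reconstructed as below. Only the correspondence between the two terms of Formula (2) and steps S2104 and S2108 is stated explicitly in the text. The assignment of the other three expressions to Formulas (1), (3), and (4) is an assumption. The flow writes IGR(w, C) as gain_r(w, C).

```latex
\begin{align}
\mathrm{IGR}(w, C) &= \frac{\mathrm{info}(w, C) - \mathrm{info}_{div}(w, C)}{\mathrm{split\_info}(C)} \tag{1} \\
\mathrm{info}(w, C) &= -p(w \mid C)\,\log_2 p(w \mid C) - \bigl(1 - p(w \mid C)\bigr)\log_2\bigl(1 - p(w \mid C)\bigr) \tag{2} \\
\mathrm{info}_{div}(w, C) &= \sum_i \frac{|C_i|}{|C|}\,\mathrm{info}(w, C_i) \tag{3} \\
\mathrm{split\_info}(C) &= -\sum_i \frac{|C_i|}{|C|}\,\log_2 \frac{|C_i|}{|C|} \tag{4}
\end{align}
```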

Two points require attention here.
1. When the target document group is the result of an information search, the contrast between those documents and the documents that were not retrieved carries important information for word weighting. A virtual cluster (corresponding to the summary target candidate document database 101) is therefore placed above the root cluster, at the top of the cluster structure. This virtual cluster contains a partial cluster to which the documents to be summarized belong (corresponding to the summary target document storage unit 103) and a partial cluster to which the remaining documents belong. In the virtual cluster, words related to the target document group as a whole receive high weights, so words related to the search request are weighted highly.
2. With hierarchical clustering, a word weight is obtained for every cluster at every level of the hierarchy, so these weights must be integrated. The present invention adopts the average of the word weights over all clusters to which a document belongs, and writes the resulting value for word w in document D as IGR_ave(w, D).

This weight is then combined with existing weighting schemes, such as the within-document term frequency (TF value) and the inverse document frequency (IDF value), to give the final word weight. As shown in the formula below, the importance Imp_IGR(Si) of each sentence Si based on this word weight is the sum of the weights of the nouns it contains, normalized by the sentence length (in words). In addition, to normalize sentence importance across documents, the sentence importance values within each document are converted into deviation values (T-scores). The result is denoted Imp^n_IGR(Si).

... Formula (5)
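The body of Formula (5) is not reproduced in this text. From the description above (sum of the weights of the contained nouns, divided by the sentence length in words), it can be sketched as follows. The symbol weight(w) for the final word weight and the set notation N(S_i) for the nouns of S_i are assumptions introduced for this sketch.

```latex
\begin{equation}
\mathrm{Imp}_{IGR}(S_i) = \frac{1}{|S_i|} \sum_{w \in N(S_i)} \mathrm{weight}(w) \tag{5}
\end{equation}
```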

The specific processing is described next. FIG. 13 shows the flow of the general-purpose sentence importance calculation process. The general-purpose sentence importance calculation unit 1201 performs, in order, a document clustering process (S1301), an information gain ratio sum (IGR_sum value) calculation process (S1302), and a general-purpose sentence importance derivation process (S1303).

The document clustering process (S1301) is described in detail. FIG. 14 shows the flow of the document clustering process. The TF·IDF value, which is the product of the within-document term frequency (TF value) and the inverse document frequency of the word (IDF value), is calculated (S1401). Based on these values, documents whose document vectors point in similar directions are treated as highly similar, and the documents are clustered hierarchically (S1402).
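One common way to compare the direction of two document vectors is cosine similarity. The sketch below assumes that measure and a sparse dictionary representation of the TF·IDF vectors. The text does not name the similarity measure or the clustering linkage, so both remain open choices, and the function name is hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {word: weight}."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


d1 = {"summary": 2.0, "query": 1.0}
d2 = {"summary": 4.0, "query": 2.0}   # same direction as d1, twice the length
d3 = {"weather": 3.0}                 # no words in common with d1
print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```

A hierarchical clustering step (S1402) would then repeatedly merge the most similar documents or clusters under whichever linkage rule is chosen.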

FIG. 15 shows the flow of the TF·IDF value calculation process (S1401). This process generates the TF·IDF value table.

FIG. 16 shows an example of the TF·IDF value table. The table has a document ID as its header and one record per word, each record associating a word ID with a TF·IDF value. One such table is provided for each document.

As shown in FIG. 15, the following processing is repeated for each document to be summarized (S1501). The document ID of the document is stored in the header of the TF·IDF value table (S1502), and the following processing is repeated for each word contained in the document (S1503).

The within-document term frequency (TF value) of the word is calculated (S1504), and the inverse document frequency (IDF value) of the word is read out (S1505). The TF value is multiplied by the IDF value to obtain their product (TF·IDF value) (S1506), and the product is stored in the record of the TF·IDF value table (S1507).

When all words have been processed (S1508), processing moves to the next document, and the process ends when all documents have been processed (S1509).
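The table construction of FIG. 15 can be sketched as follows. The text only says that the IDF value is read out, so the definition log(N / df) used here is an assumption, as are the raw-count TF, the function name, and the dictionary layout.

```python
import math
from collections import Counter


def tf_idf_tables(documents):
    """documents: {doc_id: [word, ...]} -> {doc_id: {word: TF * IDF}}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))   # each document counts once per word
    # Assumed IDF definition: log(N / df). The source does not give one.
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

    tables = {}
    for doc_id, words in documents.items():                        # S1501, S1502
        tf = Counter(words)                                        # S1504
        tables[doc_id] = {word: tf[word] * idf[word] for word in tf}  # S1505 to S1507
    return tables


docs = {"D1": ["cat", "cat", "mouse"], "D2": ["cat", "dog"]}
tables = tf_idf_tables(docs)
print(tables)
```

With this IDF definition, a word that occurs in every document ("cat" here) receives a weight of zero.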

The information gain ratio sum calculation process (S1302) of FIG. 13 is described next. FIG. 17 shows the flow of this process, which generates the information gain ratio sum table.

FIG. 18 shows the information gain ratio sum table. The table has a document ID as its header and one record per word, each record associating a word ID with an information gain ratio sum. One such table is provided for each document.

As shown in FIG. 17, the following processing is repeated for each document to be summarized (S1701), and the document ID of the document is stored in the header of the information gain ratio sum table (S1702). The following processing is then repeated for each word contained in the document (S1703).
The following processing is repeated for each level of the hierarchy (S1704) and, within each level, for each cluster it contains (S1705): the information gain ratio gain_r(w, C) of the word in the cluster is calculated (S1706). In this way, gain_r(w, C) is obtained for all clusters (S1707) at all levels (S1708). The calculation of gain_r(w, C) (S1706) is described later with reference to FIGS. 19 and 20.

Then, the information gain ratios of the word in the clusters, at each level, to which the document belongs are added together to obtain the sum of the word's information gain ratios (IGR_sum value) (S1709), and this sum is stored in the record of the information gain ratio sum table (S1710).

When all words have been processed (S1711), processing moves to the next document, and the process ends when all documents have been processed (S1712).

FIGS. 19 and 20 show the flow for calculating the information gain ratio of a word in a cluster. First, the information amount info(w, C) of word w in the cluster (called the parent cluster C) is calculated (S1901). The calculation of info(w, C) for a given cluster is described in detail with reference to FIG. 21.

Next, the following processing is repeated for each cluster below the cluster (called a child cluster Ci) (S1902). The information amount info(w, Ci) of word w in child cluster Ci is calculated (S1903), and the size |Ci| of the child cluster is divided by the size |C| of the parent cluster to obtain the cluster size ratio |Ci|/|C| (S1904). info(w, Ci) is then multiplied by the cluster size ratio to obtain the information amount of word w weighted by cluster size (S1905).

When all child clusters have been processed (S1906), the size-weighted information amounts of word w for all child clusters are added together, and the sum is taken as the information amount info_div(w, C) of word w in the lower level (S1907).

Next, the following processing is repeated for each child cluster Ci of the cluster (S1908). The size |Ci| of the child cluster is divided by the size |C| of the parent cluster to obtain the cluster size ratio |Ci|/|C| (S1909), and the base-2 logarithm of that ratio, log_2(|Ci|/|C|), is obtained (S1910). The logarithm is then multiplied by the cluster size ratio to obtain a product (S1911).

When all child clusters have been processed (S1912), the products are added together and the sign of the sum is inverted to give the split information amount split_info(C) (S1913).

The information amount info_div(w, C) of word w in the lower level is subtracted from the information amount info(w, C) of word w in the cluster (parent cluster C) to obtain the difference in the information amount of word w (S1914).

This difference is divided by the split information amount split_info(C), and the quotient is taken as the information gain ratio gain_r(w, C) of word w in cluster C (S1915).

The calculation of the information amount of a word in a cluster, performed in S1901 and S1903 above, is described in detail. FIG. 21 shows the flow of this calculation.

The probability that word w appears in cluster C (occurrence probability p(w|C)) is calculated (S2101), the base-2 logarithm of the occurrence probability, log_2 p(w|C), is calculated (S2102), and the logarithm is multiplied by the occurrence probability to obtain a product (S2103). The sign of the product is inverted to give the value of the first term (S2104). This value corresponds to the first term of Formula (2).

Next, the occurrence probability is subtracted from 1 to obtain the complementary probability 1 - p(w|C) (S2105), the base-2 logarithm log_2(1 - p(w|C)) is calculated (S2106), and the logarithm is multiplied by the complementary probability to obtain a product (S2107). The sign of the product is inverted to give the value of the second term (S2108). This value corresponds to the second term of Formula (2).

Finally, the values of the first and second terms are added, and the sum is taken as the information amount info(w, C) of word w in cluster C (S2109).
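The flows of FIGS. 19 to 21 can be sketched together as follows. Several points are assumptions made only for this example: a cluster is represented as a list of documents (each a list of words), the cluster size is its number of documents, p(w|C) is the share of word tokens in C that are w, 0·log 0 is treated as 0, and a parent with a single child is given a gain ratio of 0. The text does not specify any of these, and the function names are hypothetical.

```python
import math


def occurrence_probability(word, cluster):
    # Assumption: p(w|C) = occurrences of w / total word tokens in C.
    total = sum(len(doc) for doc in cluster)
    count = sum(doc.count(word) for doc in cluster)
    return count / total if total else 0.0


def info(word, cluster):
    """Information amount info(w, C): first term plus second term (S2101 to S2109)."""
    p = occurrence_probability(word, cluster)   # S2101
    result = 0.0
    for q in (p, 1.0 - p):                      # first term, then second term
        if q > 0.0:                             # convention: 0 * log2(0) = 0
            result -= q * math.log2(q)
    return result


def gain_ratio(word, children):
    """gain_r(w, C) for a parent cluster given as a list of non-empty child clusters."""
    parent = [doc for child in children for doc in child]
    ratios = [len(child) / len(parent) for child in children]             # |Ci| / |C|
    info_div = sum(r * info(word, c) for r, c in zip(ratios, children))   # S1902 to S1907
    split_info = -sum(r * math.log2(r) for r in ratios)                   # S1908 to S1913
    if split_info == 0.0:
        return 0.0  # assumption: a single child means there is no split to measure
    return (info(word, parent) - info_div) / split_info                   # S1914, S1915


# Two child clusters of two documents each; "a" occurs only in the first child.
child_1 = [["a", "a"], ["a", "b"]]
child_2 = [["b", "b"], ["b", "b"]]
print(gain_ratio("a", [child_1, child_2]))
```

A word concentrated in one child cluster, as "a" is here, gets a positive gain ratio, while a word spread evenly over the children gets a gain ratio of zero.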

Next, the general-purpose sentence importance derivation process (S1303) of FIG. 13 is described in detail. FIGS. 22 and 23 show the flow of this process. The process generates the TF·IDF·information gain ratio sum table as an intermediate result and finally generates the general-purpose sentence importance table.

FIG. 24 shows an example of the TF·IDF·information gain ratio sum table. The table has a document ID as its header and one record per word, each record associating a word ID with the product of the TF·IDF value and the information gain ratio sum. One such table is provided for each document.

FIG. 25 shows an example of the general-purpose sentence importance table. The table has one record per sentence, each record associating a sentence ID with a general-purpose sentence importance.

As shown in FIG. 22, the following processing is repeated for each document to be summarized (S2201), and the document ID of the document is stored in the header of the TF·IDF·information gain ratio sum table (S2202).

Next, the following processing is repeated for each word contained in the document (S2203). The TF·IDF value of the word is read from the TF·IDF value table identified by the document ID (S2204), the information gain ratio sum (IGR_sum value) of the word is read from the information gain ratio sum table identified by the document ID (S2205), and the two are multiplied to obtain their product (S2206). The product is stored in the record of the TF·IDF·information gain ratio sum table (S2207). When all words have been processed, processing moves to the next document (S2208), and when all documents have been processed, processing moves to the next step (S2209).

The following processing is repeated for all documents (S2210) and, within each document, for each sentence it contains (S2211). A sum variable is initialized to 0 (S2212), and the following processing is repeated for each word contained in the sentence (S2213). The product of the TF·IDF value and the information gain ratio sum for the word is obtained from the TF·IDF·information gain ratio sum table identified by the document ID of the document from which the sentence originates (S2214), and is added to the sum variable (S2215). Processing all the words in this way (S2216) yields the sum.

In the present embodiment, the product of the within-document term frequency, the inverse document frequency, and the information gain ratio sum is used as an example of the general-purpose word importance, that is, the importance of a word for general purposes. The sum of the general-purpose word importance values of the words in a sentence is divided by the length of the sentence, and the quotient serves as the basis for determining the general-purpose sentence importance of that sentence, as follows.

Next, the length |Si| of the sentence is calculated (S2217), and the value of the sum variable is divided by the sentence length to obtain a quotient (S2218). The quotient is stored in the general-purpose sentence importance table as the general-purpose sentence importance Imp_IGR(Si), in association with the sentence ID (S2219). When all sentences have been processed (S2220), the general-purpose sentence importance values Imp_IGR(Si) of all sentences in the document are converted into deviation values (T-scores) computed over that document, and these are stored anew in the general-purpose sentence importance table as the general-purpose sentence importance Imp^n_IGR(Si), in association with the sentence IDs (S2221). The process ends when this has been done for all documents (S2222).
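For one document, steps S2211 to S2221 can be sketched as follows. The sketch assumes the word weights (TF·IDF value times information gain ratio sum) are already available as a dictionary, that sentences are non-empty, and that the deviation value is the conventional 50 + 10 × (x - mean) / standard deviation with the population standard deviation. The function names are hypothetical.

```python
import statistics


def sentence_importance(sentences, word_weight):
    """Imp_IGR(Si): sum of word weights divided by sentence length (S2212 to S2218)."""
    scores = []
    for words in sentences:
        total = sum(word_weight.get(word, 0.0) for word in words)  # S2213 to S2216
        scores.append(total / len(words))                          # S2217, S2218
    return scores


def to_t_scores(values):
    """Deviation values within one document (S2221), assuming 50 + 10 * z."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0.0:
        return [50.0 for _ in values]   # all sentences equally important
    return [50.0 + 10.0 * (value - mean) / stdev for value in values]


weights = {"a": 2.0, "b": 0.0, "c": 4.0}
doc_sentences = [["a", "b"], ["c", "a"]]
raw = sentence_importance(doc_sentences, weights)
print(raw, to_t_scores(raw))
```

Converting to deviation values per document puts every document's sentences on a scale with mean 50, which is what makes the values comparable across documents.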

Next, the process by which the question answer sentence importance calculation unit 1203 calculates sentence importance as a question response (S504: question answer sentence importance calculation process) is described. In this process, sentence importance is calculated from the output scores of a question answering engine.

A question answering system is used to handle the given set of question sentences. In this system, processing with high computational cost, such as morphological analysis, syntactic analysis, and named entity (NE) extraction, is performed after a question sentence is given. The system therefore introduces answer search control based on A* search, together with an approximation of the score that can be computed at lower cost. These reduce the computation spent on irrelevant parts of the documents and allow real-time processing.

Given a single question sentence, the engine of the question answering system assigns to each word (morpheme) in the target documents a score representing its appropriateness as an answer. The score is calculated as the degree of matching between the rest of the question sentence and the rest of the sentence containing the morpheme, when the interrogative of the question sentence is identified with that morpheme. The measure of matching is a linear combination of (a) the number of shared character bigrams, (b) the number of shared morphemes, (c) the degree of agreement in case, (d) the degree of agreement in dependency relations, and (e) the degree of agreement between the NE type and the question type. The inventors regard this score as the importance of a word from the standpoint of answering the question, and propose calculating sentence importance from it. Using this score is expected to yield more accurate summaries than conventional methods that use only the words contained in the question and the question type.

The present embodiment assumes that several question sentences are given, so a set of scores is obtained for each morpheme, each score in the set corresponding to one question sentence. Because the range of each score varies with the complexity and type of the question, comparing scores across different question sentences is not inherently meaningful. A single importance value covering all the question sentences is nevertheless wanted for each morpheme, so the original scores are normalized to comparable values. When looking for the answer to one particular question sentence, the absolute value of each word's score does not matter; what matters is its relation to the scores of the other words. The present embodiment therefore treats the deviation of a score from the mean as the significant quantity, and converts the word scores for each question sentence into the deviation values (T-scores) given by Formula (6), which serve as the normalized scores. Here, x is the score value to be normalized and D is the set of score values (which contains x). The normalized scores have the same mean for every question sentence.

... Formula (6)
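The body of Formula (6) is not reproduced in this text. Assuming the conventional definition of the deviation value (mean 50, standard deviation 10), it would read as follows. The function name T and the symbols for the mean and standard deviation of D are assumptions introduced for this sketch.

```latex
\begin{equation}
T(x, D) = 50 + 10 \cdot \frac{x - \mu_D}{\sigma_D},
\qquad
\mu_D = \frac{1}{|D|} \sum_{y \in D} y,
\qquad
\sigma_D = \sqrt{\frac{1}{|D|} \sum_{y \in D} (y - \mu_D)^2} \tag{6}
\end{equation}
```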

Let score^n(w, q) denote the normalized score of morpheme w for question sentence q. The importance Imp^n_QA(Si) of sentence Si is then given by Formula (7), where Q is the set of given question sentences and W_Si is the set of morphemes appearing in sentence Si. If sentence importance is determined by whether the sentence contains the answer to any of the questions, the importance of a sentence is, as Formula (7) shows, the maximum score among the morphemes it contains.

... Formula (7)
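The body of Formula (7) is not reproduced in this text. From the definitions above and from the flow of FIGS. 31 and 32 described later in this section (maximum over the question sentences for each word, then maximum over the words of the sentence), it can be sketched as:

```latex
\begin{equation}
\mathrm{Imp}^{n}_{QA}(S_i) = \max_{q \in Q} \; \max_{w \in W_{S_i}} \mathrm{score}^{n}(w, q) \tag{7}
\end{equation}
```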

The specific processing is described next. FIG. 26 shows the flow of the question answer sentence importance calculation process. The question answer sentence importance calculation unit 1203 performs, in order, a score acquisition process (S2601), a score normalization process (S2602), and a question answer sentence importance derivation process (S2603).

The score acquisition process (S2601) is described in detail. FIG. 27 shows the flow of the score acquisition process. This process generates the question-sentence score tables.

FIG. 28 shows an example of a question-sentence score table. The table has a sentence ID and a question sentence ID as its header and one record per word in the sentence, each record storing a score.

As shown in FIG. 27, the following processing is repeated for each sentence (S2701) and, inside that loop, for each question sentence (S2702). First, the sentence ID and the question sentence ID are stored in the header of the question-sentence score table (S2703).

Next, the following processing is repeated for each word contained in the sentence (S2704). A score representing how good the word is as an answer to the question sentence, that is, a value indicating its degree of fit to the question sentence, is calculated for the word (S2705). This score is calculated by the question answering engine. The score is stored in the record of the question-sentence score table (S2706), and when all words have been processed (S2707), processing moves to the next question sentence. When all question sentences have been processed (S2708), processing moves to the next sentence, and the process ends when all sentences have been processed (S2709).

The score normalization process (S2602) is described in detail. FIG. 29 shows the flow of the score normalization process. This process generates the question-sentence normalized score tables.

FIG. 30 shows an example of a question-sentence normalized score table. The table has a sentence ID and a question sentence ID as its header and one record per word in the sentence, each record storing a normalized score.

The following processing is repeated for each question sentence (S2901). The group of question-sentence score tables whose question sentence ID matches is selected (S2902), and the scores contained in the selected tables are normalized word by word (S2903). At this point, the mean, the standard deviation, and so on of the word scores are obtained.

Next, the following processing is repeated for each sentence (S2904). The sentence ID and the question sentence ID are stored in the header of the question-sentence normalized score table (S2905), and the normalized score (deviation value) of each word is stored in the record of the question-sentence normalized score table (S2906). When all sentences have been processed (S2907), processing moves to the next question sentence, and the process ends when all question sentences have been processed (S2908).
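The per-question normalization of FIG. 29 can be sketched as follows. The nested dictionary layout, the function name, and the use of the population standard deviation with the conventional 50 + 10 × z deviation value are assumptions made for this example.

```python
import statistics


def normalize_scores(raw_scores):
    """raw_scores: {question_id: {(sentence_id, word): score}} -> same shape, as T-scores."""
    normalized = {}
    for question_id, scores in raw_scores.items():    # S2901, S2902
        values = list(scores.values())
        mean = statistics.mean(values)                 # S2903
        stdev = statistics.pstdev(values)
        normalized[question_id] = {
            key: 50.0 if stdev == 0.0 else 50.0 + 10.0 * (value - mean) / stdev
            for key, value in scores.items()           # S2904 to S2906
        }
    return normalized


# Two question sentences whose raw scores live on different scales.
raw = {
    "q1": {(1, "tokyo"): 1.0, (1, "is"): 3.0},
    "q2": {(1, "tokyo"): 10.0, (1, "is"): 30.0},
}
norm = normalize_scores(raw)
print(norm)
```

After normalization the two question sentences yield identical values for the same words, which illustrates why scores from different questions become comparable.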

質問応答文重要度導出処理（Ｓ２６０３）について詳述する。図３１と図３２は、質問応答文重要度導出処理フローを示す図である。本処理では、中間的に最大正規化スコアテーブルを生成し、最終的に質問応答文重要度テーブルを生成する。 The question response sentence importance level derivation process (S2603) will be described in detail. 31 and 32 are diagrams showing a question answer sentence importance level derivation process flow. In this process, a maximum normalized score table is generated in the middle, and a question response sentence importance level table is finally generated.

図３３は、最大正規化スコアテーブルの例を示す図である。ヘッダとして文ＩＤを有し、文内の単語毎にレコードを設け、正規化されたスコアのうち最大のもの（最大正規化スコア）を記憶するように構成されている。このテーブルは、文毎に設けられている。 FIG. 33 is a diagram illustrating an example of the maximum normalized score table. The header has a sentence ID, a record is provided for each word in the sentence, and the maximum score (maximum normalized score) among the normalized scores is stored. This table is provided for each sentence.

図３４は、質問応答文重要度テーブルの例を示す図である。文ＩＤと質問応答文重要度の項目を有し、それぞれを対応付けている。 FIG. 34 is a diagram illustrating an example of a question response sentence importance degree table. There are items of sentence ID and question answer sentence importance, and they are associated with each other.

図３１に示すように、文毎に以下の処理を繰り返し（Ｓ３１０１）、文に含まれる単語毎に以下の処理を繰り返す（Ｓ３１０２）。まず、最大正規化スコアテーブルのヘッダに、文ＩＤを記憶させ（Ｓ３１０３）、最大値候補の変数を初期化する（Ｓ３１０４）。例えば、正規化スコアが取り得る最低値以下を初期値とする。 As shown in FIG. 31, the following processing is repeated for each sentence (S3101), and the following processing is repeated for each word included in the sentence (S3102). First, the sentence ID is stored in the header of the maximum normalized score table (S3103), and the variable of the maximum value candidate is initialized (S3104). For example, the initial value is equal to or lower than the lowest value that can be taken by the normalized score.

次に、質問文毎に以下の処理を繰り返す（Ｓ３１０５）。文ＩＤと質問文ＩＤで特定される質問文別正規化スコアテーブルから当該単語の正規化スコアを取得し（Ｓ３１０６）、正規化スコアを最大値候補の変数と比較し、正規化スコアが大きい場合に正規化スコアを最大値候補の変数に代入する（Ｓ３１０７）。すべての質問文について処理すると（Ｓ３１０８）、最大正規化スコアテーブルのレコード（最大正規化スコアを示す）に、最大値候補の変数の値を記憶させる（Ｓ３１０９）。すべての単語について処理した時点で次に移行する（Ｓ３１１０）。 Next, the following processing is repeated for each question sentence (S3105). The normalized score of the word is obtained from the per-question normalized score table identified by the sentence ID and the question sentence ID (S3106). The normalized score is compared with the maximum value candidate variable, and if the normalized score is larger, it is assigned to that variable (S3107). When all question sentences have been processed (S3108), the value of the maximum value candidate variable is stored in the record of the maximum normalized score table (the record holding the maximum normalized score) (S3109). When all words have been processed, the process moves on to the next step (S3110).

最大値候補の変数を初期化する（Ｓ３１１１）。例えば、正規化スコアが取り得る最低値以下を初期値とする。そして、文に含まれる単語毎に以下の処理を繰り返す（Ｓ３１１２）。文ＩＤで特定される最大正規化スコアテーブルから当該単語の最大正規化スコアを取得し（Ｓ３１１３）、最大正規化スコアを最大値候補の変数と比較し、最大正規化スコアが大きい場合に最大正規化スコアを最大値候補の変数に代入する（Ｓ３１１４）。すべての単語について処理すると（Ｓ３１１５）、最大値候補の変数の値を質問応答文重要度（Ｉｍｐ^ｎ _ＱＡ（Ｓｉ））として質問応答文重要度テーブルに、文ＩＤと対応付けて記憶させる（Ｓ３１１６）。すべての文について処理した時点で終了する（Ｓ３１１７）。 The maximum value candidate variable is initialized (S3111). For example, the initial value is set at or below the lowest value the normalized score can take. Then, the following processing is repeated for each word in the sentence (S3112). The maximum normalized score of the word is obtained from the maximum normalized score table identified by the sentence ID (S3113). It is compared with the maximum value candidate variable, and if the maximum normalized score is larger, it is assigned to that variable (S3114). When all words have been processed (S3115), the value of the maximum value candidate variable is stored in the question answer sentence importance table as the question answer sentence importance Imp^n_QA(Si), in association with the sentence ID (S3116). The process ends when all sentences have been processed (S3117).
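
The two nested maximum searches above (S3101 to S3117) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the dictionary layout (`question_id -> word -> normalized score`) are assumptions introduced here to stand in for the per-question normalized score tables of a single sentence.

```python
from typing import Dict


def max_normalized_scores(per_question: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For one sentence, keep the largest normalized score of each word
    across all question sentences (the maximum normalized score table)."""
    best: Dict[str, float] = {}
    for scores_by_word in per_question.values():
        for word, score in scores_by_word.items():
            if word not in best or score > best[word]:
                best[word] = score
    return best


def qa_sentence_importance(per_question: Dict[str, Dict[str, float]]) -> float:
    """Question answer sentence importance of one sentence: the largest
    entry of its maximum normalized score table."""
    table = max_normalized_scores(per_question)
    # -inf plays the role of "a value at or below the lowest possible score".
    return max(table.values(), default=float("-inf"))


# Hypothetical scores of one sentence for two question sentences.
example = {
    "q1": {"Tokyo": 55.0, "1998": 40.0},
    "q2": {"Tokyo": 48.0, "1998": 62.5},
}
importance = qa_sentence_importance(example)
```

In this sketch a sentence receives a high importance as soon as one of its words is a strong answer candidate for any one of the questions.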

次に、統合文重要度算出部１２０５による統合した文重要度を算出する処理（Ｓ５０５：統合文重要度算出処理）について説明する。前述の式（５）と式（７）を統合した文重要度として、式（８）を考える。ここで、αは、文重要度Ｉｍｐ^ｎ _ＱＡのＩｍｐ^ｎ _ＩＧＲに対する重みである。つまり、αは、統合した文重要度を１とした場合に、統合した文重要度に占めるＩｍｐ^ｎ _ＱＡの重み付けを示す値である。従って、１−αは、統合した文重要度を１とした場合に、統合した文重要度に占めるＩｍｐ^ｎ _ＩＧＲの重み付けを示す値である。 Next, a process of calculating the integrated sentence importance by the integrated sentence importance calculation unit 1205 (S505: integrated sentence importance calculation process) will be described. Formula (8) is considered as the sentence importance obtained by integrating Formula (5) and Formula (7). Here, α is a weight with respect to Imp ⁿ _IGR of sentence importance Imp ⁿ _QA . That is, α is a value indicating the weight of Imp ⁿ _QA in the integrated sentence importance when the integrated sentence importance is 1. Therefore, 1−α is a value indicating the weight of Imp ⁿ _IGR in the integrated sentence importance when the integrated sentence importance is 1.

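$$\mathrm{Imp}^{n}(S_i) = \alpha \cdot \mathrm{Imp}^{n}_{QA}(S_i) + (1-\alpha) \cdot \mathrm{Imp}^{n}_{IGR}(S_i)$$
・・・式（８）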

... Formula (8)

更に、具体的な処理について説明する。図３５は、統合文重要度算出処理フローを示す図である。まず、質問応答文重要度と汎用文重要度の統合における質問応答文重要度の重みを特定する（Ｓ３５０１）。例えば、予め記憶している質問応答文重要度の重みαを読み込む。 Furthermore, specific processing will be described. FIG. 35 is a diagram showing an integrated sentence importance calculation process flow. First, the weight of the question answer sentence importance in the integration of the question answer sentence importance and the general sentence importance is specified (S3501). For example, the weight α of the question answer sentence importance stored in advance is read.

次に、質問応答文重要度と汎用文重要度の統合における汎用文重要度の重みを特定する（Ｓ３５０２）。この例では、１からαを引いて差を求める。 Next, the weight of the general-purpose sentence importance in the integration of the question answer sentence importance and the general-purpose sentence importance is specified (S3502). In this example, α is subtracted from 1 to obtain the difference.

そして、文毎に以下の処理を繰り返す（Ｓ３５０３）。質問応答文重要度テーブルから文ＩＤに対応する質問応答文重要度を読み込み（Ｓ３５０４）、質問応答文重要度の重みを質問応答文重要度に乗じて、統合文重要度における質問応答文重要度分（α・Ｉｍｐ^ｎ _ＱＡ（Ｓｉ））を求める（Ｓ３５０５）。 Then, the following processing is repeated for each sentence (S3503). The question answer sentence importance corresponding to the sentence ID is read from the question answer sentence importance table (S3504) and multiplied by its weight to obtain the question answer sentence importance portion of the integrated sentence importance, α·Imp^n_QA(Si) (S3505).

また、汎用文重要度テーブルから文ＩＤに対応する汎用文重要度を読み込み（Ｓ３５０６）、汎用文重要度の重みを汎用文重要度に乗じて、統合文重要度における汎用文重要度分（（１−α）・Ｉｍｐ^ｎ _ＩＧＲ（Ｓｉ））を求める（Ｓ３５０７）。 Likewise, the general-purpose sentence importance corresponding to the sentence ID is read from the general-purpose sentence importance table (S3506) and multiplied by its weight to obtain the general-purpose sentence importance portion of the integrated sentence importance, (1−α)·Imp^n_IGR(Si) (S3507).

これらの質問応答文重要度分と汎用文重要度分を加えて和を求め（Ｓ３５０８）、この和を、統合した文重要度（統合文重要度）として、統合文重要度テーブルに、文ＩＤと対応付けて記憶させる（Ｓ３５０９）。すべての文について処理した時点で終了する（Ｓ３５１０）。 The question answer sentence importance portion and the general-purpose sentence importance portion are added together (S3508), and the sum is stored in the integrated sentence importance table as the integrated sentence importance, in association with the sentence ID (S3509). The process ends when all sentences have been processed (S3510).

この処理により統合文重要度テーブルが生成される。図３６は、統合文重要度テーブルの例を示す図である。文毎にレコードを設け、文ＩＤと統合文重要度の項目を有し、それぞれを対応付けている。 An integrated sentence importance degree table is generated by this processing. FIG. 36 is a diagram illustrating an example of the integrated sentence importance degree table. A record is provided for each sentence, and has items of sentence ID and integrated sentence importance, which are associated with each other.
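
The integration steps S3501 to S3510 amount to a per-sentence weighted sum. The following is a minimal sketch under the assumption that both importance tables are available as dictionaries keyed by sentence ID; the function name and the sample values are illustrative only.

```python
from typing import Dict


def integrate_importance(
    qa_importance: Dict[str, float],
    igr_importance: Dict[str, float],
    alpha: float,
) -> Dict[str, float]:
    """Combine the two importance tables as
    alpha * Imp_QA(Si) + (1 - alpha) * Imp_IGR(Si) for every sentence ID."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    igr_weight = 1.0 - alpha  # weight of the general-purpose importance
    return {
        sentence_id: alpha * qa_importance[sentence_id]
        + igr_weight * igr_importance[sentence_id]
        for sentence_id in qa_importance
    }


# Hypothetical normalized importance values for two sentences.
integrated = integrate_importance(
    {"s1": 60.0, "s2": 40.0},
    {"s1": 50.0, "s2": 70.0},
    alpha=0.8,
)
```

With α = 0.8, sentence s1 (strong as an answer) ends up ranked above s2 (strong only in general importance), which is the balance the weight α is meant to control.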

図５に示すように、上述の処理に続いて、統合した文重要度を平滑化する処理（Ｓ５０６：文重要度平滑化処理）、文の再順位付けにより重要文を抽出する処理（Ｓ５０７：重要文抽出処理）、クラスタリングにより重要文を整列する処理（Ｓ５０８：重要文整列処理）、要約文書出力処理（Ｓ５０９）を行う。 As shown in FIG. 5, following the above-described processing, the integrated sentence importance is smoothed (S506: sentence importance smoothing), and the important sentence is extracted by sentence re-ranking (S507: (Important sentence extraction process), an important sentence alignment process by clustering (S508: important sentence alignment process), and a summary document output process (S509).

図３７は、文重要度平滑化処理と重要文抽出処理と重要文整列処理と要約文書出力処理に係る構成を示す図である。文重要度平滑化部３７０１、平滑化統合文重要度テーブル３７０２、重要度抽出部３７０３、重要文テーブル３７０４、重要文整列部３７０５、要約文書記憶部３７０６、及び要約文書出力部３７０７の要素を有している。 FIG. 37 shows the configuration relating to the sentence importance smoothing process, the important sentence extraction process, the important sentence alignment process, and the summary document output process. It comprises a sentence importance smoothing unit 3701, a smoothed integrated sentence importance table 3702, an importance extraction unit 3703, an important sentence table 3704, an important sentence alignment unit 3705, a summary document storage unit 3706, and a summary document output unit 3707.

まず、文重要度平滑化部３７０１による統合した文重要度を平滑化する処理（Ｓ５０６：文重要度平滑化処理）について説明する。本処理では、出力される要約における文間の結束性を維持するために、ハニング窓関数を用いて文の重要度の変化を平滑化するが、文重要度の統合の為に必須の処理ではなく、省略しても構わない。 First, the process in which the sentence importance smoothing unit 3701 smooths the integrated sentence importance (S506: sentence importance smoothing process) will be described. In this process, a Hanning window function is used to smooth the variation in sentence importance so as to maintain cohesion between sentences in the output summary. This process is not essential for integrating the sentence importance values and may be omitted.

Ｓ５０５までの処理では各文を独立に扱うため、対象文書数が多い時には多くの文書から少しずつ重要文を抽出し、文間の結束性が低下する傾向が見られる。要約文書長が長い場合には、文の重要度を考慮しつつも、文間の結束性を高める事が必要である。そこで、ある文数の範囲内で重要度が滑らかに変化するように、ハニング窓関数を用いた重要度の平滑化を行なう。窓幅Ｗの同関数を用いて平滑化した文重要度は式（９）により与えられる。なお、文書の先頭と末尾においては、その文が連続するものとして計算する。 Because each sentence is handled independently in the processing up to S505, when the number of target documents is large, important sentences tend to be extracted a few at a time from many documents, and cohesion between sentences tends to decrease. When the summary document is long, cohesion between sentences must be improved while still taking sentence importance into account. Therefore, the importance is smoothed with a Hanning window function so that it changes smoothly within a range of a certain number of sentences. The sentence importance smoothed with this function at window width W is given by Formula (9). At the beginning and end of a document, the calculation treats the first or last sentence as if it were repeated beyond the document boundary.

・・・式（９）

... Formula (9)

同手法が有効な典型的な状況は、一つの中程度の重要度の文Ｓｂが二つの重要度の高い文Ｓａ、Ｓｃに挟まれている場合である。このとき、文Ｓｂの重要度は同関数の平滑化により増加し、Ｓａ、Ｓｂ、Ｓｃという一連の文群が採用されやすくなる。ここにおいて、Ｓｂの採用は二つの重要文Ｓａ、Ｓｃの間の結束性を増加させる可能性がある。 A typical situation where this method is effective is a case where one medium importance sentence Sb is sandwiched between two high importance sentences Sa and Sc. At this time, the importance of the sentence Sb increases due to the smoothing of the function, and a series of sentence groups of Sa, Sb, and Sc is easily adopted. Here, the adoption of Sb may increase the cohesiveness between the two important sentences Sa and Sc.
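
The effect on a sentence Sb sandwiched between Sa and Sc can be illustrated with a small sketch. Formula (9) itself is not reproduced in this text, so the code below makes its own assumptions: it uses the standard Hann weight 0.5·(1 + cos(2πk/W)) for offsets |k| ≤ W/2, divides by the sum of the weights so that the scale of the importance is preserved, and repeats the first or last sentence at the document boundaries. The function name is hypothetical.

```python
import math
from typing import List


def hanning_smooth(scores: List[float], width: int) -> List[float]:
    """Smooth per-sentence importance values of one document with a
    Hann-shaped window of the given width (assumed form, see lead-in)."""
    if width <= 0 or not scores:
        return list(scores)
    half = width // 2
    last = len(scores) - 1
    smoothed = []
    for i in range(len(scores)):
        total = 0.0
        weight_sum = 0.0
        for k in range(-half, half + 1):
            weight = 0.5 * (1.0 + math.cos(2.0 * math.pi * k / width))
            j = min(max(i + k, 0), last)  # repeat the boundary sentence
            total += weight * scores[j]
            weight_sum += weight
        smoothed.append(total / weight_sum)
    return smoothed


# Sa = 70, Sb = 50, Sc = 70 with window width 4 (the width used for Long).
result = hanning_smooth([70.0, 50.0, 70.0], width=4)
```

Under these assumptions the middle sentence rises from 50 to about 60, so the run Sa, Sb, Sc becomes more likely to be selected together.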

次に、重要度抽出部３７０３による文の再順位付けにより重要文を抽出する処理（Ｓ５０７：重要文抽出処理）について説明する。本処理では、ＭＭＲを用いて、重要度を考慮しつつも冗長性が少なくなるように文を順位付けし、順位付けられた文集合から指定された要約長に相当する上位のｎ文を選択する。 Next, the process in which the importance extraction unit 3703 extracts important sentences by re-ranking sentences (S507: important sentence extraction process) will be described. In this process, MMR is used to rank sentences so that redundancy is reduced while importance is still taken into account, and the top n sentences corresponding to the specified summary length are selected from the ranked sentence set.

この重要文抽出において、Ｃａｒｂｏｎｅｌｌらが提案するＭＭＲと同種の冗長性制御機構を導入する。ＭＭＲは、本来、文書もしくはパッセージを単位として、順位づけを行なうものであり、初期順位は検索質問に対する文書の類似度を用いる。これを式（１０）のように文を単位とし、初期順位を文の重要度により与えるように変更する。 For this important sentence extraction, a redundancy control mechanism of the same kind as the MMR proposed by Carbonell et al. is introduced. MMR originally ranks documents or passages as units, and its initial ranking uses the similarity of each document to the search query. Here it is modified, as in Formula (10), so that the unit is a sentence and the initial ranking is given by sentence importance.

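$$\text{MMI-MS} = \operatorname*{arg\,max}_{S_i \in SS \setminus A} \Bigl[ \lambda \, \mathrm{Imp}^{n}_{c}(S_i) - (1-\lambda) \max_{S_j \in A} \mathrm{Sim}_{s}(S_i, S_j) \Bigr]$$
・・・式（１０）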

... Formula (10)

ここで、ＳＳは要約対象の文集合、Ａは既選択文の集合、Ｉｍｐ^ｎ _ｃ（Ｓｉ）は式（９）に定義される文Ｓｉの平滑化正規化重要度、Ｓｉｍ_ｓは文間の類似度を表す尺度、λは冗長度を制御する定数である。これをＭＭＩ−ＭＳ（Ｍａｘｉｍａｌ Ｍａｒｇｉｎａｌ Ｉｍｐｏｒｔａｎｃｅ − Ｍｕｌｔｉ−Ｓｅｎｔｅｎｃｅ）と呼ぶ。 Here, SS is the set of sentences to be summarized, A is the set of already selected sentences, Imp^n_c(Si) is the smoothed normalized importance of sentence Si defined by Formula (9), Sim_s is a measure of the similarity between sentences, and λ is a constant that controls redundancy. This is called MMI-MS (Maximal Marginal Importance - Multi-Sentence).

Ａに空集合を、冗長度制御変数λに適切な値を設定してから式（１０）を繰返し適用すると、冗長性を考慮した文の再順位づけがなされる。なお、本実施例では、Ｓｉｍ_ｓとして文ベクトルのｃｏｓｉｎｅ類似度を採用した。同ベクトルの各次元は、各文に含まれる名詞であり、その値は対応する名詞の重要度である。 When A is set to the empty set, the redundancy control variable λ is set to an appropriate value, and Formula (10) is applied repeatedly, the sentences are re-ranked with redundancy taken into account. In the present embodiment, the cosine similarity of sentence vectors is used as Sim_s. Each dimension of the vector corresponds to a noun contained in the sentence, and its value is the importance of that noun.
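
As a rough sketch of the sentence similarity Sim_s described above, a sentence vector can be represented as a mapping from nouns to importance values and compared by cosine similarity. The dictionary representation, the function name, and the sample nouns and weights are assumptions made for illustration.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse sentence vectors whose dimensions
    are nouns and whose values are the importance of each noun."""
    dot = sum(weight * v[noun] for noun, weight in u.items() if noun in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# The same two nouns, but weighted differently in two source documents.
sentence_a = {"election": 2.0, "Tokyo": 1.0}
sentence_b = {"election": 1.0, "Tokyo": 2.0}
similarity = cosine_similarity(sentence_a, sentence_b)
```

Because the values are noun importances rather than raw counts, two sentences with identical nouns can still score below 1 when the importances differ between their source documents.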

そして、順位づけられた文の列の上位より、与えられた要約長になるまで、文を選択する。 Then, sentences are selected from the top of the ranked sentence column until the given summary length is reached.

具体的には、以下のように処理する。図３８と図３９は、重要文抽出処理フローを示す図である。この処理において、重要文テーブルを用いる。 Specifically, the processing is as follows. FIGS. 38 and 39 show the important sentence extraction process flow. This process uses an important sentence table.

図４０は、重要文テーブルを示す図である。文毎にレコードを設け、文ＩＤと抽出フラグの項目を有し、それぞれを対応付けている。抽出された文を、ＯＮとして識別するように構成されている。 FIG. 40 is a diagram showing an important sentence table. A record is provided for each sentence, and items of sentence ID and extraction flag are associated with each other. The extracted sentence is configured to be identified as ON.

まず、既選択文集合Ａを空集合に初期化する（Ｓ３８０１）。具体的には、重要文テーブルのすべての抽出フラグをＯＦＦにする。次に、変数である既選択文長Ｌを０に初期化する（Ｓ３８０２）。 First, the selected sentence set A is initialized to an empty set (S3801). Specifically, all the extraction flags of the important sentence table are turned off. Next, the selected sentence length L, which is a variable, is initialized to 0 (S3802).

そして、要約対象の文書に含まれる文の集合ＳＳと既選択文集合Ａの差集合に含まれる文（Ｓｉ∈ＳＳ＼Ａ）毎に以下の処理を行う（Ｓ３８０３）。具体的には、重要文テーブルの抽出フラグがＯＦＦの文について処理する。 Then, the following processing is performed for each sentence in the difference between the set SS of sentences contained in the documents to be summarized and the selected sentence set A (Si ∈ SS \ A) (S3803). Specifically, the sentences whose extraction flag in the important sentence table is OFF are processed.

既選択文集合Ａに含まれる文（Ｓｊ∈Ａ）毎に以下の処理を繰り返す（Ｓ３８０４）。具体的には、重要文テーブルの抽出フラグがＯＮの文について処理する。差集合に含まれる文（Ｓｉ）と既選択文集合に含まれる文（Ｓｊ）の類似度（Ｓｉｍ_ｓ（Ｓｉ，Ｓｊ））を算出する（Ｓ３８０５）。既選択文集合に含まれる文（Ｓｊ）のすべてについて処理した時点で（Ｓ３８０６）、次に移行する。 The following processing is repeated for each sentence in the selected sentence set A (Sj ∈ A) (S3804). Specifically, the sentences whose extraction flag in the important sentence table is ON are processed. The similarity Sim_s(Si, Sj) between the sentence Si in the difference set and the sentence Sj in the selected sentence set is calculated (S3805). When all sentences Sj in the selected sentence set have been processed (S3806), the process moves on to the next step.

既選択文集合に含まれる各文との組み合わせによる類似度のうち、最大の類似度（ｍａｘＳｉｍ_ｓ（Ｓｉ，Ｓｊ））を選択し（Ｓ３８０７）、最大の類似度に、（１−冗長度制御変数λ）を乗じて積を求め、積を第二項の値（（１−λ）ｍａｘＳｉｍ_ｓ（Ｓｉ，Ｓｊ））とする（Ｓ３８０８）。 Among the similarities obtained in combination with each sentence in the selected sentence set, the maximum similarity max Sim_s(Si, Sj) is selected (S3807). The maximum similarity is multiplied by (1 − λ), where λ is the redundancy control variable, and the product is taken as the value of the second term, (1−λ) max Sim_s(Si, Sj) (S3808).

統合文重要度テーブルから、差集合に含まれる文（Ｓｉ）の統合文重要度（Ｉｍｐ^ｎ（Ｓｉ））を読み込み（Ｓ３８０９）、統合文重要度に冗長度制御変数λを乗じて積を求め、積を第一項の値（λＩｍｐ^ｎ（Ｓｉ））とする（Ｓ３８１０）。 The integrated sentence importance (Imp ⁿ (Si)) of the sentence (Si) included in the difference set is read from the integrated sentence importance table (S3809), and the product is obtained by multiplying the integrated sentence importance by the redundancy control variable λ. The product is set to the value of the first term (λImp ⁿ (Si)) (S3810).

そして、第一項の値から第二項の値を引いて差を求め、差を抽出評価値とする（Ｓ３８１１）。差集合に含まれる文（Ｓｉ）のすべてについて処理すると（Ｓ３８１２）、差集合に含まれる文（Ｓｉ）のうち、前記抽出評価値が最大となる文（Ｓｉ）を特定する（Ｓ３８１３）。 Then, a difference is obtained by subtracting the value of the second term from the value of the first term, and the difference is set as an extraction evaluation value (S3811). When all the sentences (Si) included in the difference set are processed (S3812), the sentence (Si) having the maximum extraction evaluation value is specified from the sentences (Si) included in the difference set (S3813).

前記抽出評価値が最大の文の長さ（｜Ｓｉ｜）を既選択文長Ｌに加え（Ｓ３８１４）、既選択文長Ｌが要約文書制限長を越えた場合には（Ｓ３８１５）、終了する。越えていない場合には、前記抽出評価値が最大の文（Ｓｉ）を既選択文集合Ａに加える（Ｓ３８１６）。具体的には、重要文テーブルの当該文の文ＩＤに対応する抽出フラグをＯＮにする。そして、処理を繰り返す。 The length (| Si |) of the sentence having the maximum extraction evaluation value is added to the selected sentence length L (S3814), and when the selected sentence length L exceeds the summary document limit length (S3815), the process ends. . If not, the sentence (Si) having the maximum extraction evaluation value is added to the selected sentence set A (S3816). Specifically, the extraction flag corresponding to the sentence ID of the sentence in the important sentence table is turned ON. Then, the process is repeated.
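
The selection loop S3801 to S3816 can be sketched as a greedy procedure. This is an illustration under stated assumptions rather than the actual implementation: the function and parameter names are hypothetical, the similarity measure is passed in as a callable, a candidate with no selected sentences to compare against is given a redundancy of 0, and ties are broken by sentence ID order.

```python
from typing import Callable, Dict, List


def select_sentences(
    importance: Dict[str, float],
    lengths: Dict[str, int],
    similarity: Callable[[str, str], float],
    lam: float,
    max_length: int,
) -> List[str]:
    """Greedy MMI-MS style selection: repeatedly pick the sentence that
    maximizes lam * importance - (1 - lam) * (max similarity to selected)."""
    selected: List[str] = []          # the selected sentence set A
    total_length = 0                  # the selected sentence length L
    remaining = set(importance)       # SS \ A
    while remaining:
        def score(sentence_id: str) -> float:
            redundancy = max(
                (similarity(sentence_id, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * importance[sentence_id] - (1.0 - lam) * redundancy

        best = max(sorted(remaining), key=score)
        total_length += lengths[best]
        if total_length > max_length:  # the limit check precedes the add
            break
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_similarity(x: str, y: str) -> float:
    """Hypothetical similarity: 'a' and 'b' are near duplicates."""
    return 1.0 if {x, y} == {"a", "b"} else 0.0


chosen = select_sentences(
    importance={"a": 1.0, "b": 0.9, "c": 0.5},
    lengths={"a": 10, "b": 10, "c": 10},
    similarity=toy_similarity,
    lam=0.5,
    max_length=20,
)
```

In the toy data, sentence b is more important than c but nearly duplicates a, so the second pick is c; the third candidate would exceed the length limit and ends the loop.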

次に、重要文整列部３７０５によるクラスタリングにより重要文を整列する処理（Ｓ５０８：重要文整列処理）について説明する。この処理では、原文書群のクラスタ構造と記事の日付順を考慮して選択した文を配置する。 Next, a process of arranging important sentences by clustering by the important sentence arranging unit 3705 (S508: important sentence arranging process) will be described. In this process, a sentence selected in consideration of the cluster structure of the original document group and the date order of the articles is arranged.

まず、原文書群は単リンククラスタリングにより分割される。得られたクラスタ群は日付順に並べられる。またクラスタ内の文書も日付順に並べられる。これにより、記事の列が得られる。なおクラスタの日付はその中に含まれる記事のうち最も古い日付により定義されるものとする。 First, the original document group is partitioned by single-link clustering. The resulting clusters are arranged in date order, and the documents within each cluster are also arranged in date order. This yields a sequence of articles. The date of a cluster is defined as the oldest date among the articles it contains.

先に選択された重要文は、上記の手法で得られた記事の並びの順序にしたがって出力される。これが要約文書である。同一記事から複数の文が選択されている時には元の記事内の文の順序に従う。 The important sentence selected previously is output according to the order of the articles arranged by the above method. This is a summary document. When multiple sentences are selected from the same article, the order of sentences in the original article is followed.
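
The ordering described above can be sketched as follows. The sketch rests on several assumptions not stated in the text: documents are clustered by linking any pair whose similarity reaches a threshold (which matches cutting a single-link clustering at that threshold), the document similarity function and the threshold are supplied by the caller, and a selected sentence is represented as a `(document_id, position_in_document)` pair. All names and sample dates are hypothetical.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple


def order_documents(
    dates: Dict[str, date],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Cluster documents by single link, sort clusters by their oldest
    article, and sort the documents inside each cluster by date."""
    docs = sorted(dates)
    parent = {doc: doc for doc in docs}

    def find(doc: str) -> str:
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path compression
            doc = parent[doc]
        return doc

    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for doc in docs:
        clusters.setdefault(find(doc), []).append(doc)
    ordered = sorted(clusters.values(), key=lambda c: min(dates[d] for d in c))
    return [d for cluster in ordered for d in sorted(cluster, key=lambda d: dates[d])]


def order_sentences(
    selected: List[Tuple[str, int]], doc_order: List[str]
) -> List[Tuple[str, int]]:
    """Arrange selected sentences by document order, then original position."""
    rank = {doc: i for i, doc in enumerate(doc_order)}
    return sorted(selected, key=lambda s: (rank[s[0]], s[1]))


article_dates = {
    "d1": date(1998, 1, 10),
    "d2": date(1998, 1, 5),
    "d3": date(1998, 1, 2),
    "d4": date(1998, 1, 7),
}
doc_order = order_documents(
    article_dates,
    similarity=lambda a, b: 1.0 if {a, b} == {"d1", "d3"} else 0.0,
    threshold=0.5,
)
summary_order = order_sentences(
    [("d2", 3), ("d1", 5), ("d1", 1), ("d3", 2)], doc_order
)
```

Note that d1 is placed before d2 even though it is newer, because its cluster contains the oldest article d3; this keeps related articles adjacent in the summary.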

簡単に処理フローを示す。要約文書出力処理（Ｓ５０９）について説明する。図４１は、重要文整列処理フローを示す図である。要約対象の文書について非階層型のクラスタリングを行う（Ｓ４１０１）。クラスタ間の順序付けを行い（Ｓ４１０２）、更にクラスタ内の文書間の順序付けを行う（Ｓ４１０３）。そして、順序に従って、要約対象の文書を特定し、当該文書を出所とする文を抽出し、要約文書記憶部に記憶させる（Ｓ４１０４）。この例では、日付により順序付けを行うが、他の基準により順序付けを行っても構わない。つまり、何らかの順に従って、重要文を一文ずつ並べる処理を行う。 The processing flow is outlined briefly. The summary document output process (S509) will be described. FIG. 41 shows the important sentence alignment process flow. Non-hierarchical clustering is performed on the documents to be summarized (S4101). The clusters are ordered (S4102), and the documents within each cluster are then ordered (S4103). Then, following this order, each document to be summarized is identified, the sentences originating from that document are extracted, and they are stored in the summary document storage unit (S4104). In this example the ordering is by date, but other criteria may be used. In short, the process arranges the important sentences one by one according to some ordering.

最後に、要約文書出力部３７０７により要約文書記憶部３７０６に記憶している要約文書を出力する（Ｓ５０９）。 Lastly, the summary document output unit 3707 outputs the summary document stored in the summary document storage unit 3706 (S509).

図４２は、要約文書の例を示す図である。太字の部分が質問の答えの一つである。 FIG. 42 is a diagram illustrating an example of a summary document. The bold part is one of the answers to the questions.

本システム全体についてまとめる。図４３は、文書要約システムの主要な要素を示す図である。本図では、要素を具体的な処理表現で示している。 The overall system is summarized here. FIG. 43 shows the main elements of the document summarization system. In this figure, the elements are expressed in terms of concrete processing.

以下、本発明の実験と評価について述べる。ここでは、評価型ワークショップであるＮＴＣＩＲ４ＴＳＣ３におけるＦｏｒｍａｌＲｕｎの課題により提案手法に基づくシステムを評価する。ＮＴＣＩＲＴＳＣは国立情報学研究所主催の文書自動要約に関する一連の評価型ワークショップである。ＮＴＣＩＲ４ＴＳＣ３の報告会は２００４年６月に開催された。ここでは、１）モデル抜粋との比較による抜粋の性能、ならびに、２）モデル要約との比較による質問に対する解の被覆率に基づき評価を行なう。モデル抜粋とモデル要約はタスクオーガナイザにより準備がなされ、ＦｏｒｍａｌＲｕｎの後に評価のために配布された。 Experiments and evaluation of the present invention are described below. Here, a system based on the proposed method is evaluated on the Formal Run task of NTCIR4 TSC3, an evaluation workshop. NTCIR TSC is a series of evaluation workshops on automatic text summarization organized by the National Institute of Informatics. The NTCIR4 TSC3 meeting was held in June 2004. The evaluation is based on 1) the performance of the excerpt in comparison with the model excerpt, and 2) the coverage of answers to the questions in comparison with the model summary. The model excerpts and model summaries were prepared by the task organizer and distributed for evaluation after the Formal Run.

同ＦｏｒｍａｌＲｕｎの課題は、３０トピックからなる。各トピックは、要約対象文書ＩＤのリスト（５〜１９文書）、トピックの表題（検索要求を簡潔に表現したもの）、生成すべき要約文書の長さ（文字数、ならびに、文数。いずれも短いもの（Ｓｈｏｒｔ、要約率約５％）と長いもの（Ｌｏｎｇ、要約率約１０％）の二種）、要約に含まれるべき事項を表した質問文のリスト（Ｓｈｏｒｔ用平均７．６文とＬｏｎｇ用平均１１．９文の二種）から構成される。要約対象文書は９８、９９年の毎日及び読売新聞の記事から選ばれている。なお、同ＦｏｒｍａｌＲｕｎでは、要約生成に際して質問文のリストを利用するか否かは参加者の判断に委ねられている点に注意されたい。本発明ではこれを積極的に利用している。 The Formal Run task consists of 30 topics. Each topic consists of a list of IDs of the documents to be summarized (5 to 19 documents), a topic title (a concise expression of the search request), the length of the summary document to be generated (in characters and in sentences, each in a short variant (Short, summarization rate of about 5%) and a long variant (Long, summarization rate of about 10%)), and a list of question sentences expressing the items that the summary should contain (two variants, averaging 7.6 sentences for Short and 11.9 sentences for Long). The documents to be summarized are selected from Mainichi and Yomiuri newspaper articles from 1998 and 1999. Note that in this Formal Run, whether to use the list of question sentences when generating a summary is left to each participant. The present invention makes active use of it.

提案システムの各種パラメタは、ＦｏｒｍａｌＲｕｎに先だって配布された例題５トピックにより手動で調整を行なった。Ｓｈｏｒｔ用にはハニング窓関数を適用せず、Ｌｏｎｇ用には窓幅４とした。二種類の文重要度Ｉｍｐ^ｎ _ＱＡならびにＩｍｐ^ｎ _ＩＧＲの混合比を決めるパラメタαの値は０．８（Ｓｈｏｒｔ用）ならびに０．７（Ｌｏｎｇ用）とした。ＭＭＩ−ＭＳ用のパラメタλは０．４＋０．５・１−Ｓｉｍ_ａｖｅとした。ここでＳｉｍ_ａｖｅはトピック毎の平均文間類似度である。 The parameters of the proposed system were tuned manually using five example topics distributed before the Formal Run. The Hanning window function was not applied for Short, and a window width of 4 was used for Long. The parameter α, which determines the mixing ratio of the two sentence importance values Imp^n_QA and Imp^n_IGR, was set to 0.8 for Short and 0.7 for Long. The parameter λ for MMI-MS was set to 0.4 + 0.5 · 1 − Sim_ave, where Sim_ave is the average inter-sentence similarity for each topic.

「重要文抽出の性能に関する評価」について述べる。 The evaluation of important sentence extraction performance is described first.

複数文書を対象とすると、同じ内容を表現する異なる文が存在することがあり、また、ある一つの文の内容が別の文書では２つ以上の文により記述されることがある。そのため、正解となるモデル抜粋ＭＥ中のｉ番目の文は、原文書の文ＩＤの集合Ａ_ｉ，ｊの集合ＭＳ_ｉにより表現される。一方で、ある抜粋は、文ＩＤの集合ＳＳにより表現される。この時、モデル抜粋ＭＥのｉ番目の文に対する、抜粋ＳＳの被覆率（Ｃｏｖｅｒａｇｅ）ｃ（ＳＳ、ＭＳ_ｉ）を式（１１）で定義する。さらに、モデル抜粋ＭＥ全体に対する抜粋の被覆率Ｃ（ＳＳ、ＭＥ）と精度を、それぞれ、式（１２）ならびに（１３）で定義する。 When multiple documents are targeted, different sentences may express the same content, and the content of a single sentence may be described by two or more sentences in another document. Therefore, the i-th sentence in the model excerpt ME, which serves as the correct answer, is represented by a set MS_i of sets A_{i,j} of sentence IDs in the original documents. An excerpt, on the other hand, is represented by a set SS of sentence IDs. The coverage c(SS, MS_i) of the excerpt SS with respect to the i-th sentence of the model excerpt ME is defined by Formula (11). Furthermore, the coverage C(SS, ME) and the precision of the excerpt with respect to the whole model excerpt ME are defined by Formulas (12) and (13), respectively.

・・・式（１１）

... Formula (11)

・・・式（１２）

... Formula (12)

・・・式（１３）

... Formula (13)

ただし、関数ｍｅｍｐ（ｅ、Ｓ）はｅが集合Ｓの要素であるときに１、それ以外は、０を返す関数である。本評価では、モデル抜粋として、モデル要約を元にタスクオーガナイザイが作成したものを使用する。また、各トピックに対するモデル要約は、５人の元新聞記者のうちの一人が作成したものである。 However, the function memp (e, S) returns 1 when e is an element of the set S, and returns 0 otherwise. In this evaluation, model excerpts created by the task organizer based on the model summary are used. The model summary for each topic was created by one of five former newspaper reporters.

提案システムの出力抜粋の平均被覆率（ＡｖｅｒａｇｅＣｏｖｅｒａｇｅ）ならびに平均精度（ＡｖｅｒａｇｅＰｒｅｃｉｓｉｏｎ）を図４４と図４５に示す。図中のラベル‘ＩＧＲ＋ＭＭＲ＋ＱＡ’は提案手法である。ラベル‘ＩＧＲ＋ＭＭＲ’ならびに‘ＩＧＲ＋ＭＭＲ＋ＱＢ’、‘ＩＧＲ＋ＭＭＲ＋ＱＢ＋ＮＥ’は我々が用意したベースラインである。‘ＩＧＲ＋ＭＭＲ’は提案手法において質問応答エンジンによる文重要度を使わない場合に相当する。‘ＩＧＲ＋ＭＭＲ＋ＱＢ’はＱｕｅｒｙ−ｂｉａｓｅｄ手法に基づくベースラインであり、式（７）の代わりに文重要度Ｉｍｐ^ｎ _ＱＢ（Ｓｉ）を用いる。Ｉｍｐ^ｎ _ＱＢ（Ｓｉ）は次の式（１４）の値をＴ−ｓｃｏｒｅにより正規化して得られたもので、質問文中に含まれる語に重みを与えるものである。‘ＩＧＲ＋ＭＭＲ＋ＱＢ＋ＮＥ’は‘ＩＧＲ＋ＭＭＲ＋ＱＡ’に加えて、固有表現（ＮＥ）の出現に重みを与えるものであり、式（１５）の文重要度をＴ−ｓｃｏｒｅにより正規化したＩｍｐ^ｎ _{ＱＢ＋ＮＥ}（Ｓｉ）に基づく。提案手法とこれらベースラインとの間の主な違いは質問応答エンジンの出力、すなわち、質問の答えに関する情報を使うか使わないかである。 FIGS. 44 and 45 show the average coverage (Average Coverage) and average precision (Average Precision) of the output excerpts of the proposed system. The label 'IGR+MMR+QA' in the figures denotes the proposed method. The labels 'IGR+MMR', 'IGR+MMR+QB', and 'IGR+MMR+QB+NE' denote baselines that we prepared. 'IGR+MMR' corresponds to the proposed method without the sentence importance from the question answering engine. 'IGR+MMR+QB' is a baseline based on the query-biased approach and uses the sentence importance Imp^n_QB(Si) in place of Formula (7). Imp^n_QB(Si) is obtained by normalizing the value of Formula (14) below with a T-score, and it gives weight to the words contained in the question sentences. 'IGR+MMR+QB+NE' additionally, on top of 'IGR+MMR+QA', gives weight to occurrences of named entities (NE), and is based on Imp^n_{QB+NE}(Si), obtained by normalizing the sentence importance of Formula (15) with a T-score. The main difference between the proposed method and these baselines is whether the output of the question answering engine, that is, information about the answers to the questions, is used.

・・・式（１４）

... Formula (14)

・・・式（１５）

... Formula (15)

一方、‘Ｌｅａｄ’はタスクオーガナイザが提供したＬｅａｄ手法（各文書の先頭部分を抽出する）によるベースライン、それ以外の点は他の参加システムである。ただし、トピック情報中の質問文群の利用については、先に述べたように参加グループの判断に委ねられている。そのため、Ｌｅａｄ法を含め質問文群を利用していないシステムが存在することに注意されたい。また、次節での評価と異なり、モデル抜粋以外には人間の作成した抜粋はタスクオーガナイザより提供されていない。 On the other hand, 'Lead' is a baseline based on the Lead method (extracting the head portion of each document) provided by the task organizer, and other points are other participating systems. However, the use of the question sentence group in the topic information is left to the judgment of the participating group as described above. Therefore, it should be noted that there are systems that do not use the question sentence group including the Lead method. Also, unlike the evaluation in the next section, human-generated excerpts other than model excerpts are not provided by the task organizer.

また、被験者を用いた主観評価により内容の平均被覆率を調べた。図４６に示す。 In addition, the average coverage of content was examined through subjective evaluation by human subjects. The results are shown in FIG. 46.

「質問に対する解の被覆率に基づく評価」について説明する。各トピックについて、Ｓｈｏｒｔ、Ｌｏｎｇの各要約文書字数に対して、モデル要約に含まれる質問文の解が提案システムの出力抜粋に含有される度合（解の平均被覆率）を調べた。図４７と図４８に示す。尺度としては、正解文字列そのものが現れる割合の平均値（ＥｘａｃｔＭａｔｃｈ）、ならびに、式（１６）により定義される正解文字列Ａｎｓ_ｉと文Ｓの間の編集距離ＥｄｉｔＤ（）に基づく尺度の平均値（ＥｄｉｔＤｉｓｔａｎｃｅ）の二種類がタスクオーガナイザにより提供されている。 The evaluation based on the coverage of answers to the questions is described next. For each topic and for each of the Short and Long summary lengths, we examined the degree to which the answers to the question sentences contained in the model summary are included in the output excerpt of the proposed system (the average coverage of answers). The results are shown in FIGS. 47 and 48. Two measures are provided by the task organizer: the average rate at which the correct answer string itself appears (Exact Match), and the average of a measure, defined by Formula (16), that is based on the edit distance EditD() between the correct answer string Ans_i and a sentence S (Edit Distance).

・・・式（１６）

... Formula (16)

ここで、関数Ｌｅｎ（）は文字列の長さを返す。図中のラベル‘Ｈｕｍａｎ’はモデル要約作成者とは別の人間が作成した要約である。 Here, the function Len () returns the length of the character string. The label 'Human' in the figure is a summary created by a person other than the model summary creator.

「二つの文重要度の混合比に関する評価」について説明する。 The “evaluation regarding the mixing ratio of two sentence importance levels” will be described.

二種類の文重要度の混合比が各種性能に与える影響について調べるため、他のパラメタは前述の通りに固定しつつ、パラメタαの値を０．０から１．０の範囲で変化させて同様の評価を行なった。図４９と図５０に抜粋の性能変化を、図５１と図５２に質問に対する解の平均被覆率を示す。 To examine how the mixing ratio of the two sentence importance values affects the various performance measures, the same evaluation was carried out while varying the parameter α from 0.0 to 1.0, with the other parameters fixed as described above. FIGS. 49 and 50 show the change in excerpt performance, and FIGS. 51 and 52 show the average coverage of answers to the questions.

考察する。「重要文抽出の性能」について説明する。 The results are now discussed, starting with the performance of important sentence extraction.

図４４によると、要約長が短いとき（‘Ｓｈｏｒｔ’）には、提案手法（ＩＧＲ＋ＭＭＲ＋ＱＡ）はＬｅａｄ手法には勝っているが、ベースラインＩＧＲ＋ＭＭＲ＋ＱＢ、ＩＧＲ＋ＭＭＲ＋ＱＢ＋ＮＥとはほぼ同等である。つまり、質問文中の語だけでも抜粋生成について十分な情報があり、あえて解を求める必要はなさそうである。一方、要約長が長いとき（‘Ｌｏｎｇ’）には、図４５に示すとおり、すべてのベースラインならびに他参加システムに対して、その優位性が示されている。ただし、質問文の情報を利用しない参加システムもあることに注意されたい。ＱＡエンジンを使わない‘ＩＧＲ＋ＭＭＲ’と比較すると性能の改善は著しく、ＱＡエンジンによる重み付けが非常に有効に機能していることがわかる。同様に被験者の主観評価に基づく被覆率評価においても、図４６が示す通り、提案手法の評価が高い。 According to FIG. 44, when the summary length is short ('Short'), the proposed method (IGR + MMR + QA) is superior to the Lead method, but is almost equivalent to the baseline IGR + MMR + QB and IGR + MMR + QB + NE. In other words, there is enough information about excerpt generation even with only the words in the question sentence, and there seems to be no need to seek a solution. On the other hand, when the summary length is long ('Long'), as shown in FIG. 45, the superiority is shown for all baselines and other participating systems. However, it should be noted that some participating systems do not use question information. Compared with 'IGR + MMR' which does not use the QA engine, the performance improvement is remarkable, and it can be seen that the weighting by the QA engine functions very effectively. Similarly, in the coverage rate evaluation based on the subjective evaluation of the subject, the evaluation of the proposed method is high as shown in FIG.

ベースラインＩＧＲ＋ＭＭＲ＋ＱＢが比較的良好な性能を示しているが、これは今回のタスク設定において多くの質問文を参照できたためであると考えられる。一方、ＩＧＲ＋ＭＭＲ＋ＱＢ＋ＮＥはＩＧＲ＋ＭＭＲ＋ＱＢよりも、むしろ、性能が悪くなっている。今回のタスクでは質問が複数あるために、質問型によるＮＥの選別をおこなっていない。そのため、有効な重みづけができなかった可能性がある。 The baseline IGR + MMR + QB shows relatively good performance, which is considered to be because many question sentences can be referred to in the task setting this time. On the other hand, the performance of IGR + MMR + QB + NE is worse than that of IGR + MMR + QB. In this task, since there are a plurality of questions, NE selection by question type is not performed. Therefore, there is a possibility that effective weighting could not be performed.

ところで、Ｌｏｎｇについては提案手法の抜粋精度が０．６８０と高いのに対して、抜粋被覆率は０．３９１と低い。これは、別の文書に由来する同一もしくは非常に似通った文を抽出する例が見受けられるためである。各システムが生成した各要約に存在するほぼ同一の文の数の平均値は図５３に示すとおりである。これは被験者による読み易さに関する主観評価の一部として調査されたものである。 By the way, for Long, the extraction accuracy of the proposed method is as high as 0.680, whereas the extraction coverage is as low as 0.391. This is because an example of extracting the same or very similar sentence from another document can be seen. An average value of the number of almost identical sentences existing in each summary generated by each system is as shown in FIG. This was investigated as part of a subjective assessment of readability by subjects.

この図によると、提案手法は冗長な文を消去しきれていないことがわかる。出力文書の冗長制御を行なっているＭＭＩ−ＭＳでは、名詞の重要度を成分とする文ベクトルの類似度を用いているが、各語の重要度は文書によって異なるために、全く同一の文であっても類似度が１にならない。文間類似度計算の精緻化が今後必要である。 This figure shows that the proposed method does not eliminate redundant sentences. MMI-MS, which performs redundant control of output documents, uses sentence vector similarity with noun importance as its component, but since the importance of each word varies from document to document, Even if there is, the similarity does not become 1. It is necessary to refine the similarity calculation between sentences.

「質問に対する解の被覆率に関する性能」について説明する。 The “performance related to the coverage of the answer to the question” will be described.

次に質問の解の被覆率について考察する。図４７と図４８によると、提案手法は各種ベースラインと比較して、Ｓｈｏｒｔ、Ｌｏｎｇの要約長のいずれにおいても、改善されていることがわかる。ただし、‘Ｈｕｍａｎ’で示される要約は、質問文を見ずに人間が作成した要約であるので注意されたい。 Next, the coverage of answers to the questions is considered. FIGS. 47 and 48 show that the proposed method improves on the various baselines at both the Short and Long summary lengths. Note, however, that the summaries labeled 'Human' were written by a person who did not see the question sentences.

The effect of mixing the two sentence importance levels will now be described.

Finally, consider the mixing ratio of the two types of sentence importance. FIG. 49 (coverage) and FIG. 50 (precision) show that, for the proposed method IGR+MMR+QA and the baselines IGR+MMR+QB and IGR+MMR+QB+NE, the sentence importance derived from the question (from the question sentence itself or from its answer) is the dominant one of the two. However, in every case the performance peak lies at α = 0.6 to 0.8, so it is better to take both importance levels into account. It is particularly interesting that in the answer coverage evaluation as well (FIG. 51 (Exact Match), FIG. 52 (Edit Distance)) the peak lies at a point other than α = 1.0. The question answering engine used has an MRR of about 0.5 on the question sets of NTCIR QAC1 and QAC2, evaluation workshops on Japanese question answering, so its accuracy is not sufficient; the IGR-based sentence importance is thought to compensate for this.
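The role of the mixing ratio can be sketched as a simple linear interpolation. This is an assumption for illustration: the discussion above implies that α = 1.0 corresponds to using only the question-derived importance, so α is applied to that term here. The sentence IDs and importance values are hypothetical.

```python
def integrated_importance(general, qa, alpha):
    """Mix the two sentence importance values with ratio alpha (0 <= alpha <= 1).

    Assumption: alpha weights the question-derived importance and
    (1 - alpha) weights the general-purpose importance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * qa + (1.0 - alpha) * general

def rank_sentences(importances, alpha):
    """Return sentence IDs sorted by integrated importance, highest first."""
    return sorted(
        importances,
        key=lambda sid: integrated_importance(*importances[sid], alpha),
        reverse=True,
    )

# Hypothetical (general, qa) importance pairs per sentence ID.
importances = {"s1": (0.9, 0.1), "s2": (0.2, 0.8), "s3": (0.6, 0.5)}

for alpha in (0.0, 0.7, 1.0):
    print(alpha, rank_sentences(importances, alpha))
```

Sweeping α over a grid in this way and evaluating each resulting summary is one way to produce curves of the kind discussed for FIGS. 49 to 52.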

Another interesting point is that, for the baselines IGR+MMR+QB and IGR+MMR+QB+NE, the answer coverage for Long summaries hardly changes with α. This suggests that, from the viewpoint of answer coverage, IGR-based weighting has properties similar to the bias introduced by the question sentences.

Embodiment 2.
In the above-described embodiment, the general-purpose sentence importance is obtained using the sum of the products of the TF·IDF value and the information gain ratio sum. However, multiplying by the information gain ratio sum is not essential to obtaining the effect of integrating the sentence importance levels; the sum of the TF·IDF values may be used in place of the sum of those products.

That is, the product of the in-document word frequency and the inverse document frequency is used as an example of the general-purpose word importance (the importance of a word in a general-purpose sense). The sum of the general-purpose word importance over the words included in a sentence is divided by the length of the sentence, and the quotient is used as an element for determining the general-purpose sentence importance of that sentence. For example, the quotient itself is used as the general-purpose sentence importance.
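The calculation just described can be sketched as follows. This is a minimal illustration, assuming that the TF and IDF values have already been computed and are available as lookup tables, and that the sentence length is the number of words; the names and numbers are hypothetical.

```python
def general_sentence_importance(sentence_words, tf, idf):
    """Sum of TF * IDF over the words of a sentence, divided by sentence length.

    tf:  in-document word frequency of each word (for the source document)
    idf: inverse document frequency of each word
    """
    if not sentence_words:
        return 0.0
    total = sum(tf.get(w, 0) * idf.get(w, 0.0) for w in sentence_words)
    return total / len(sentence_words)

# Hypothetical table contents for one document.
tf = {"earthquake": 3, "city": 1, "damage": 2}
idf = {"earthquake": 0.5, "city": 2.0, "damage": 1.0}

score = general_sentence_importance(["earthquake", "city", "damage"], tf, idf)
print(score)
```

Dividing by the sentence length keeps long sentences from scoring highly merely because they contain more words.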

This embodiment is realized by omitting S2201 to S2209 in the general-purpose sentence importance derivation process of FIGS. 22 and 23, acquiring in S2214 the TF·IDF value of the word from the TF·IDF value table identified by the document ID that is the source of the sentence ID, and adding that TF·IDF value to the sum variable in S2215.

Embodiment 3.
The in-document word frequency may also be used as an example of the general-purpose word importance (the importance of a word in a general-purpose sense). The sum of the general-purpose word importance over the words included in a sentence is divided by the length of the sentence, and the quotient is used as an element for determining the general-purpose sentence importance of that sentence. For example, the quotient itself is used as the general-purpose sentence importance.

In this embodiment, the information gain ratio sum calculation process (S1302) of FIG. 13 is unnecessary. In addition, a TF value table is generated (TF value calculation process) in place of the TF·IDF value table generated by the TF·IDF value calculation process of FIG. 15. In that case, the header is stored in the same way as in S1502, and each record stores the TF value in association with the word ID, in place of the TF·IDF value stored in S1507. That is, a TF value table storing TF values instead of TF·IDF values is generated.

The embodiment is then realized by omitting S2201 to S2209 in the general-purpose sentence importance derivation process of FIGS. 22 and 23, acquiring in S2214 the TF value of the word from the TF value table identified by the document ID that is the source of the sentence ID, and adding that TF value to the sum variable in S2215.

Embodiment 4.
The inverse document frequency may also be used as an example of the general-purpose word importance (the importance of a word in a general-purpose sense). The sum of the general-purpose word importance over the words included in a sentence is divided by the length of the sentence, and the quotient is used as an element for determining the general-purpose sentence importance of that sentence. For example, the quotient itself is used as the general-purpose sentence importance.

In this embodiment, the information gain ratio sum calculation process (S1302) of FIG. 13 is unnecessary. In addition, an IDF value table is generated (IDF value table generation process) in place of the TF·IDF value table generated by the TF·IDF value calculation process of FIG. 15. In that case, the header is stored in the same way as in S1502, and each record stores the IDF value in association with the word ID, in place of the TF·IDF value stored in S1507. That is, an IDF value table storing IDF values instead of TF·IDF values is generated.

The embodiment is then realized by omitting S2201 to S2209 in the general-purpose sentence importance derivation process of FIGS. 22 and 23, acquiring in S2214 the IDF value of the word from the IDF value table identified by the document ID that is the source of the sentence ID, and adding that IDF value to the sum variable in S2215.

Embodiment 5.
In the above examples, the number of words included in the sentence is used as the sentence length by which the sum of the general-purpose word importance is divided. However, the count of some other unit that makes up the sentence may be used instead, such as the number of characters, the number of phrases (bunsetsu), or the number of clauses included in the sentence.
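One way to make the length unit interchangeable is to pass the length measure as a function. The sketch below covers only word and character counts; counting phrases (bunsetsu) or clauses would require a syntactic analyzer and is left out. All names and values are hypothetical.

```python
def length_in_words(words):
    """Sentence length measured as the number of words."""
    return len(words)

def length_in_chars(words):
    """Sentence length measured as the total number of characters in its words."""
    return sum(len(w) for w in words)

def sentence_importance(words, word_importance, length_fn=length_in_words):
    """Sum of per-word importance divided by the chosen sentence length."""
    length = length_fn(words)
    if length == 0:
        return 0.0
    return sum(word_importance.get(w, 0.0) for w in words) / length

words = ["document", "summary"]
word_importance = {"document": 4.0, "summary": 2.0}  # hypothetical values

by_words = sentence_importance(words, word_importance)
by_chars = sentence_importance(words, word_importance, length_in_chars)
print(by_words, by_chars)
```

The choice of unit changes the scale of the scores, so scores computed with different units are not directly comparable.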

Embodiment 6.
In the above-described embodiments, in order to generate, from a plurality of documents to be summarized, a summary document including answers that respond to each of a plurality of question sentences, the question-answering sentence importance for the plurality of question sentences is integrated with the general-purpose sentence importance, and important sentences are extracted based on the integrated sentence importance. However, important sentences may also be extracted based on the question-answering sentence importance alone.

In this embodiment, the inverse document frequency calculation process (S501) and the general-purpose sentence importance calculation process (S503) are unnecessary. The sentence importance smoothing process (S506) or the important sentence extraction process (S507) is realized by using the question-answering sentence importance (corresponding to the sentence ID) obtained from the question-answering sentence importance table 1204 of FIG. 12, in place of the integrated sentence importance (corresponding to the sentence ID) obtained from the integrated sentence importance table 1206 of FIG. 37.

The document summarization system is a computer, and each element can execute its processing by means of a program. The program can also be stored on a storage medium and read from the storage medium by the computer.

FIG. 1 is a diagram showing the configuration for selecting the documents to be summarized.
FIG. 2 is a diagram showing an example of the document IDs to be summarized.
FIG. 3 is a diagram showing the configuration for question sentence input.
FIG. 4 is a diagram showing an example of the question sentence storage unit.
FIG. 5 is a diagram showing the overall processing flow.
FIG. 6 is a diagram showing the configuration for the inverse document frequency calculation process and the document analysis process.
FIG. 7 is a diagram showing an example of the sentence table.
FIG. 8 is a diagram showing an example of the word table.
FIG. 9 is a diagram showing an example of the sentence structure table.
FIG. 10 is a diagram showing an example of the sentence source table.
FIG. 11 is a diagram showing the sentence analysis processing flow.
FIG. 12 is a diagram showing the configuration for the general-purpose sentence importance calculation process, the question-answering sentence importance calculation process, and the integrated sentence importance calculation process.
FIG. 13 is a diagram showing the general-purpose sentence importance calculation processing flow.
FIG. 14 is a diagram showing the document clustering processing flow.
FIG. 15 is a diagram showing the TF·IDF value calculation processing flow.
FIG. 16 is a diagram showing an example of the TF·IDF value table.
FIG. 17 is a diagram showing the information gain ratio sum calculation processing flow.
FIG. 18 is a diagram showing the information gain ratio sum table.
FIG. 19 is a diagram showing the processing flow (1/2) for calculating the information gain ratio of a word in a cluster.
FIG. 20 is a diagram showing the processing flow (2/2) for calculating the information gain ratio of a word in a cluster.
FIG. 21 is a diagram showing the processing flow for calculating the information content of a word in a cluster.
FIG. 22 is a diagram showing the general-purpose sentence importance derivation processing flow (1/2).
FIG. 23 is a diagram showing the general-purpose sentence importance derivation processing flow (2/2).
FIG. 24 is a diagram showing an example of the TF·IDF value and information gain ratio sum table.
FIG. 25 is a diagram showing an example of the general-purpose sentence importance table.
FIG. 26 is a diagram showing the question-answering sentence importance calculation processing flow.
FIG. 27 is a diagram showing the score acquisition processing flow.
FIG. 28 is a diagram showing an example of the per-question score table.
FIG. 29 is a diagram showing the score normalization processing flow.
FIG. 30 is a diagram showing an example of the per-question normalized score table.
FIG. 31 is a diagram showing the question-answering sentence importance derivation processing flow (1/2).
FIG. 32 is a diagram showing the question-answering sentence importance derivation processing flow (2/2).
FIG. 33 is a diagram showing an example of the maximum normalized score table.
FIG. 34 is a diagram showing an example of the question-answering sentence importance table.
FIG. 35 is a diagram showing the integrated sentence importance calculation processing flow.
FIG. 36 is a diagram showing an example of the integrated sentence importance table.
FIG. 37 is a diagram showing the configuration for the sentence importance smoothing process, the important sentence extraction process, the important sentence alignment process, and the summary document output process.
FIG. 38 is a diagram showing the important sentence extraction processing flow (1/2).
FIG. 39 is a diagram showing the important sentence extraction processing flow (2/2).
FIG. 40 is a diagram showing the important sentence table.
FIG. 41 is a diagram showing the important sentence alignment processing flow.
FIG. 42 is a diagram showing an example of a summary document.
FIG. 43 is a diagram showing the main elements of the document summarization system.
FIG. 44 is a diagram showing the average coverage and average precision of extracts (Short).
FIG. 45 is a diagram showing the average coverage and average precision of extracts (Long).
FIG. 46 is a diagram showing the average coverage based on subjective evaluation by test subjects.
FIG. 47 is a diagram showing the average coverage of answers to the questions (Short).
FIG. 48 is a diagram showing the average coverage of answers to the questions (Long).
FIG. 49 is a diagram showing the change in extraction performance (coverage) with respect to changes in the sentence importance mixing ratio α.
FIG. 50 is a diagram showing the change in extraction performance (precision) with respect to changes in the sentence importance mixing ratio α.
FIG. 51 is a diagram showing the change in question answering performance (Exact Match) with respect to changes in the sentence importance mixing ratio α.
FIG. 52 is a diagram showing the change in question answering performance (Edit Distance) with respect to changes in the sentence importance mixing ratio α.
FIG. 53 is a diagram showing the average number of duplicate sentences.

Explanation of symbols

101 summarization target candidate document database, 102 summarization target document selection unit, 103 summarization target document storage unit, 301 question sentence input unit, 302 question sentence storage unit, 601 inverse document frequency (IDF value) calculation unit, 602 inverse document frequency (IDF value) table, 603 document analysis unit, 604 sentence table, 605 word table, 606 sentence structure table, 607 sentence source table, 1201 general-purpose sentence importance calculation unit, 1202 general-purpose sentence importance table, 1203 question-answering sentence importance calculation unit, 1204 question-answering sentence importance table, 1205 integrated sentence importance calculation unit, 1206 integrated sentence importance table, 3701 sentence importance smoothing unit, 3702 smoothed integrated sentence importance table, 3703 important sentence extraction unit, 3704 important sentence table, 3705 important sentence alignment unit, 3706 summary document storage unit, 3707 summary document output unit.

Claims

A document summarization system that generates, from a plurality of documents to be summarized, a summary document including an answer that responds to a question sentence, the system comprising the following elements:
(1) a general-purpose sentence importance calculation unit that calculates, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in a general-purpose sense;
(2) a question-answering sentence importance calculation unit that calculates, for each sentence included in the plurality of documents to be summarized, a question-answering sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation unit that integrates the general-purpose sentence importance and the question-answering sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction unit that extracts important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence alignment unit that aligns the extracted important sentences to generate a summary document; and
(6) a summary document generation unit that outputs the generated summary document.

The document summarization system according to claim 1, wherein the general-purpose sentence importance calculation unit obtains, for each word included in a sentence, a general-purpose word importance, which is the importance of the word in a general-purpose sense, divides the sum of the general-purpose word importance of the words included in the sentence by the length of the sentence, and uses the quotient as an element for determining the general-purpose sentence importance of the sentence.

The document summarization system according to claim 1, wherein the general-purpose sentence importance calculation unit calculates an in-document word frequency for the words included in the documents to be summarized, obtains, for each word included in a sentence, a value that uses the in-document word frequency of the word as a weight, divides the sum of those values over the words included in the sentence by the length of the sentence, and uses the quotient as an element for determining the general-purpose sentence importance of the sentence.

The document summarization system according to claim 1, wherein the general-purpose sentence importance calculation unit calculates an inverse document frequency for each word based on the documents that are candidates for summarization, obtains, for each word included in a sentence, a value that uses the inverse document frequency of the word as a weight, divides the sum of those values over the words included in the sentence by the length of the sentence, and uses the quotient as an element for determining the general-purpose sentence importance of the sentence.

The document summarization system according to claim 1, wherein the general-purpose sentence importance calculation unit hierarchically clusters a plurality of documents; obtains, for each word included in a document, as a weight for words whose occurrence distribution conforms to the cluster structure, the sum of the information gain ratios of the word in the clusters to which the document belongs at each level of the hierarchy; obtains, for each word included in a sentence, a value that uses the sum of the information gain ratios of the word; divides the sum of those values over the words included in the sentence by the length of the sentence; and uses the quotient as an element for determining the general-purpose sentence importance of the sentence.

The document summarization system according to any one of claims 2 to 5, wherein the length of the sentence is any one of the number of characters included in the sentence, the number of words included in the sentence, the number of phrases (bunsetsu) included in the sentence, or the number of clauses included in the sentence.

The document summarization system according to claim 1, wherein the question-answering sentence importance calculation unit calculates, for each word included in a sentence, a score indicating how good the word is as an answer to the question sentence, and calculates the question-answering sentence importance of the sentence based on the score.

The document summarization system according to claim 1, wherein the integrated sentence importance calculation unit calculates the integrated sentence importance by apportioning the general-purpose sentence importance and the question-answering sentence importance according to predetermined weights.

A document summarization method performed by a document summarization system that generates, from a plurality of documents to be summarized, a summary document including an answer that responds to a question sentence, the method comprising the following elements:
(1) a general-purpose sentence importance calculation processing step of calculating, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in a general-purpose sense;
(2) a question-answering sentence importance calculation processing step of calculating, for each sentence included in the plurality of documents to be summarized, a question-answering sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation processing step of integrating the general-purpose sentence importance and the question-answering sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction processing step of extracting important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence alignment processing step of aligning the extracted important sentences to generate a summary document; and
(6) a summary document generation processing step of outputting the generated summary document.

A computer-readable recording medium storing a program for causing a computer, which serves as a document summarization system that generates, from a plurality of documents to be summarized, a summary document including an answer that responds to a question sentence, to execute the following processes:
(1) a general-purpose sentence importance calculation process of calculating, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in a general-purpose sense;
(2) a question-answering sentence importance calculation process of calculating, for each sentence included in the plurality of documents to be summarized, a question-answering sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation process of integrating the general-purpose sentence importance and the question-answering sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction process of extracting important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence alignment process of aligning the extracted important sentences to generate a summary document; and
(6) a summary document generation process of outputting the generated summary document.

A program for causing a computer, which serves as a document summarization system that generates, from a plurality of documents to be summarized, a summary document including an answer that responds to a question sentence, to execute the following procedures:
(1) a general-purpose sentence importance calculation processing procedure of calculating, for each sentence included in the plurality of documents to be summarized, a general-purpose sentence importance, which is the importance of the sentence in a general-purpose sense;
(2) a question-answering sentence importance calculation processing procedure of calculating, for each sentence included in the plurality of documents to be summarized, a question-answering sentence importance, which is the importance of the sentence as a response to the question;
(3) an integrated sentence importance calculation processing procedure of integrating the general-purpose sentence importance and the question-answering sentence importance to calculate an integrated sentence importance;
(4) an important sentence extraction processing procedure of extracting important sentences from the sentences included in the plurality of documents to be summarized, based on the integrated sentence importance;
(5) an important sentence alignment processing procedure of aligning the extracted important sentences to generate a summary document; and
(6) a summary document generation processing procedure of outputting the generated summary document.

A document summarization system that generates, from a plurality of documents to be summarized, a summary document for a plurality of question sentences, wherein the system calculates, for each word included in the sentences of the plurality of documents to be summarized, a plurality of scores each indicating how good the word is as an answer to a respective one of the plurality of question sentences; normalizes the calculated scores within each set of scores that share the same question sentence; selects, for each word included in a sentence, the maximum of the normalized scores for the respective question sentences; and calculates the question-answering sentence importance of the sentence based on the selected maximum normalized scores.

A document summarization method performed by a document summarization system that generates, from a plurality of documents to be summarized, a summary document for a plurality of question sentences, the method comprising the following elements:
(1) a step of calculating, for each word included in the sentences of the plurality of documents to be summarized, a plurality of scores each indicating how good the word is as an answer to a respective one of the plurality of question sentences;
(2) a step of normalizing the calculated scores within each set of scores that share the same question sentence;
(3) a step of selecting, for each word included in a sentence, the maximum of the normalized scores for the respective question sentences; and
(4) a step of calculating the question-answering sentence importance of the sentence based on the selected maximum normalized scores.

A computer-readable recording medium storing a program for causing a computer, which serves as a document summarization system that generates, from a plurality of documents to be summarized, a summary document for a plurality of question sentences, to execute the following processes:
(1) a process of calculating, for each word included in the sentences of the plurality of documents to be summarized, a plurality of scores each indicating how good the word is as an answer to a respective one of the plurality of question sentences;
(2) a process of normalizing the calculated scores within each set of scores that share the same question sentence;
(3) a process of selecting, for each word included in a sentence, the maximum of the normalized scores for the respective question sentences; and
(4) a process of calculating the question-answering sentence importance of the sentence based on the selected maximum normalized scores.

A program for causing a computer, which serves as a document summarization system that generates, from a plurality of documents to be summarized, a summary document for a plurality of question sentences, to execute the following procedures:
(1) a procedure of calculating, for each word included in the sentences of the plurality of documents to be summarized, a plurality of scores each indicating how good the word is as an answer to a respective one of the plurality of question sentences;
(2) a procedure of normalizing the calculated scores within each set of scores that share the same question sentence;
(3) a procedure of selecting, for each word included in a sentence, the maximum of the normalized scores for the respective question sentences; and
(4) a procedure of calculating the question-answering sentence importance of the sentence based on the selected maximum normalized scores.
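
For illustration only, the four steps recited above can be sketched as follows. Several details are assumptions rather than part of the claim language: the raw scores are taken as given (for example from a question answering engine), normalization divides by the highest score for each question, and the sentence importance is the mean of the maximum normalized scores of its words. All names and values are hypothetical.

```python
def normalize_per_question(scores):
    """Step (2): normalize scores within each question's own set of scores.

    Assumption: normalization divides by the highest score for that question.
    """
    normalized = {}
    for question, word_scores in scores.items():
        top = max(word_scores.values(), default=0.0)
        normalized[question] = {
            w: (s / top if top > 0 else 0.0) for w, s in word_scores.items()
        }
    return normalized

def max_normalized_scores(normalized):
    """Step (3): for each word, keep the largest normalized score over all questions."""
    best = {}
    for word_scores in normalized.values():
        for w, s in word_scores.items():
            best[w] = max(best.get(w, 0.0), s)
    return best

def qa_sentence_importance(words, best):
    """Step (4): assumed here to be the mean of the maximum normalized scores."""
    if not words:
        return 0.0
    return sum(best.get(w, 0.0) for w in words) / len(words)

# Step (1): hypothetical raw scores per question and candidate answer word.
scores = {
    "q1": {"Tokyo": 8.0, "1964": 2.0},
    "q2": {"1964": 5.0, "Tokyo": 1.0},
}

best = max_normalized_scores(normalize_per_question(scores))
importance = qa_sentence_importance(["Tokyo", "hosted", "1964"], best)
print(best, importance)
```

Normalizing per question before taking the maximum keeps a question whose raw scores happen to be on a larger scale from dominating the others.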