JP2013174988A

JP2013174988A - Similar document retrieval support apparatus and similar document retrieval support program

Info

Publication number: JP2013174988A
Application number: JP2012038163A
Authority: JP
Inventors: Hisao Mase; 久雄間瀬; Kohei Toko; 航平藤稿
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-02-24
Filing date: 2012-02-24
Publication date: 2013-09-05
Anticipated expiration: 2032-02-24
Also published as: CN103294741A; CN103294741B; JP5324677B2

Abstract

PROBLEM TO BE SOLVED: To help a user carry out the retrieval work cycle efficiently, and to improve the efficiency and quality of retrieval work, by presenting to the user how strongly each factor that affects similar document retrieval accuracy influences that accuracy, together with measures for improving it. SOLUTION: A similar document retrieval support apparatus analyzes each factor over a set of pairs of past input documents and correct documents, associates ranges of factor values with retrieval accuracy, and stores the associations in a table. The apparatus performs the same factor analysis on a new input document through computer processing, looks up the table, and obtains the retrieval accuracy associated with the value range into which the factor value of the new input document falls. The retrieval accuracy and/or its deviation from the average retrieval accuracy over all past input documents is then presented to the user through the computer processing. In a more preferable case, measure information for improving the retrieval accuracy is also presented to the user.

Description

The present invention relates to a document search apparatus and a document search program for retrieving desired documents from a large document set. In particular, it relates to a similar document search support apparatus and a similar document search support program that take a sentence or document designated by a user as the search condition, search the target document set for documents whose content is similar or related to it, and output those documents in descending order of similarity or relevance.

The spread and falling cost of communication networks such as the Internet and of hardware such as PCs and mobile phones, faster CPUs, larger and cheaper memory and disks, and more capable, higher-performance software such as search systems and document editors have made it easy for the general public to access large amounts of document information. At the same time, it has become difficult to retrieve a desired document from a large document set quickly, accurately, and with little effort.

Keyword search is the common method for retrieving desired documents from a large document set. In keyword search, the user composes a keyword logical expression consisting of one or more keywords related to the desired documents and logical operators (AND, OR, NOT, and so on) that express the logical relationships among the keywords. The document search apparatus receives the logical expression from the user, retrieves from the target document set only those documents for which the expression is true, and presents them to the user.
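The keyword search described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search system: the nested-tuple expression format, the function names `matches` and `keyword_search`, and the sample documents are all assumptions made for this example.

```python
def matches(expr, words):
    """Evaluate a keyword logical expression against one document's word set.

    expr is a nested tuple (hypothetical format): ("term", word),
    ("not", expr), ("and", expr, ...) or ("or", expr, ...).
    """
    op, *args = expr
    if op == "term":
        return args[0] in words
    if op == "not":
        return not matches(args[0], words)
    if op == "and":
        return all(matches(a, words) for a in args)
    if op == "or":
        return any(matches(a, words) for a in args)
    raise ValueError(f"unknown operator: {op}")


def keyword_search(expr, documents):
    """Return the IDs of the documents for which the expression is true."""
    return [doc_id for doc_id, words in documents.items() if matches(expr, words)]


# Hypothetical documents, each reduced to its set of words.
documents = {
    "doc1": {"document", "search", "index"},
    "doc2": {"document", "editor"},
    "doc3": {"search", "engine", "image"},
}

# search AND (document OR engine) AND NOT image
query = (
    "and",
    ("term", "search"),
    ("or", ("term", "document"), ("term", "engine")),
    ("not", ("term", "image")),
)
hits = keyword_search(query, documents)
```

Note that the result is an unranked set of matches: a document either satisfies the expression or it does not.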

In keyword search, however, users often cannot work out what keyword logical expression would narrow the results down to a number they can browse. It is also difficult, in terms of accuracy, to give priority in the output to the result documents that reflect the user's search intent.

Recently, in the field of keyword search, a technique has become widespread that takes arbitrary text entered by the user, or an arbitrary document designated by the user, as the search condition, searches the target document set for documents whose content is similar or related to it, and outputs them in descending order of similarity or relevance. This technique is called similar document search. It is also called concept search, natural language search, natural sentence search, fuzzy search, or associative search.

Similar document search is realized through the following processing. First, feature words that characterize the content are extracted from each document in the target document set, and a weight reflecting its importance is calculated and assigned to each feature word. This yields a feature word vector consisting of one or more weighted feature words, which is stored in advance in a search index. Weighted feature words are extracted in the same way from the text entered or the document designated by the user (hereinafter collectively called the "input document") to generate its feature word vector. Next, the feature word vector generated from the input document is compared with the feature word vector of each target document, and the similarity between the two is calculated. The inner product of the two vectors, or the cosine of the angle between them, is often used as the similarity. Finally, the similarities are sorted in descending order and the top-ranked documents are output as documents similar to the input document.
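The matching and ranking steps above can be sketched as follows, using the cosine measure that the text names. This is a minimal sketch under stated assumptions: feature word vectors are represented as plain dictionaries mapping a feature word to its weight, the weights are taken as given, and the names `cosine`, `rank_similar`, and the sample index are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors {feature_word: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_similar(query_vec, index, top_n=10):
    """Score every indexed document and return the top_n by descending similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical search index: document ID -> weighted feature word vector.
index = {
    "doc_a": {"search": 2.0, "index": 1.0},
    "doc_b": {"battery": 3.0, "electrode": 1.5},
    "doc_c": {"search": 1.0, "similarity": 2.0},
}
query = {"search": 1.0, "similarity": 1.0}  # vector of the input document
ranking = rank_similar(query, index)
```

Because the score is a single number aggregated over all shared feature words, the ranking alone does not show which words drove a match.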

Patent Document 1: JP 2002-230032 A
Patent Document 2: JP 1995-192020 A
Patent Document 3: JP 2000-311173 A

Similar document search has the advantage that the user can specify any text that comes to mind, or a document at hand, directly as the search condition, so the effort of composing a keyword logical expression is unnecessary. Because documents are ranked and output starting from those most similar to the input document, it has the further advantage that the user reaches the desired documents quickly.

Similar document search, however, determines the similarity between the input document and each target document by comparing feature word vectors whose elements are many weighted feature words. Its drawback is therefore that the user has difficulty understanding the grounds for a result, that is, why a given document was output as a similar document. More specifically, similar document search has the following four problems.
・ Problem (1): The user cannot tell which feature words in the input document contributed to the output of the similar document search results, or by how much.
・ Problem (2): The user cannot tell how well the similar document search worked.
・ Problem (3): When the similar document search has not worked well, the user cannot tell what the cause is.
・ Problem (4): When the similar document search has not worked well, the user cannot tell what to do next, and how, to obtain better search results.

Patent Documents 1 and 2 are technical documents related to problem (1) above. The inventions described in them display search results as a table or graph whose axes are the search results and the items used in the search.

Patent Document 1 calculates a document relevance value for each of several judgment criteria and combines them into an overall document relevance value. When outputting the search results, it outputs a table whose two axes are the result documents and the judgment criteria and whose cells hold the overall relevance value of each result document and its relevance value for each criterion. From this table the user can understand how much each judgment criterion contributed to the output of each result document.

Patent Document 2 analyzes the input document and divides it into several different viewpoints, converts each viewpoint into a search command, calculates the similarity between the input document and each target document per viewpoint, and combines these similarities to output the search results. When outputting the search results, it uses designated viewpoints as axes and displays the degree of similarity between the search commands and the result documents in a two- or three-dimensional space. From this display the user can understand on the basis of which viewpoint each result document was output.

The inventions described in Patent Documents 1 and 2 solve problem (1) by displaying the search results in a table or graph whose axes are the search results and the items used in the search (viewpoints or judgment criteria). They say nothing, however, about any mechanism for solving the other problems (2), (3), and (4).

Regarding problem (2), for example, for the user to understand how well the similar document search worked, the similarity between the input document and the target documents must be analyzed in terms of various factors and presented in a form that lets the user judge the quality of the search factor by factor.

Patent Document 3 is a technical document related to problem (2). It describes a technique for improving search accuracy as follows. First, from past search results, the search accuracy corresponding to each value range of the similarity of retrieved documents is calculated in advance for each classification assigned to result documents. Next, for each result document retrieved for a new input document, the search accuracy corresponding to its similarity within its classification is identified. The similarity value of the result document is then replaced with the identified search accuracy, which is treated as a certainty, and the search results are re-sorted and displayed in descending order of certainty.
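The re-sorting described for Patent Document 3 can be sketched roughly as below. This is only an illustration of the idea as summarized here, not the method of that document: the table contents, the range boundaries, the convention that each range includes its upper bound, and the names `ACCURACY_BY_CLASS`, `certainty`, and `rerank` are all assumptions.

```python
import bisect

# Hypothetical precomputed table: classification ->
# (upper bounds of the similarity ranges, search accuracy for each range).
# Range i covers similarities above the previous bound, up to and including bounds[i].
ACCURACY_BY_CLASS = {
    "G06F": ([0.3, 0.6, 1.0], [0.10, 0.40, 0.80]),
    "H01M": ([0.3, 0.6, 1.0], [0.05, 0.20, 0.50]),
}


def certainty(classification, similarity):
    """Look up the search accuracy for this similarity within the classification."""
    bounds, accuracies = ACCURACY_BY_CLASS[classification]
    i = bisect.bisect_left(bounds, similarity)
    return accuracies[min(i, len(accuracies) - 1)]


def rerank(results):
    """results: list of (doc_id, classification, similarity) tuples.

    Replaces each similarity with its certainty and sorts by descending certainty.
    """
    scored = [(doc_id, certainty(cls, sim)) for doc_id, cls, sim in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Ordered by raw similarity: d1, d3, d2.
results = [("d1", "H01M", 0.90), ("d3", "G06F", 0.70), ("d2", "G06F", 0.55)]
reranked = rerank(results)
```

In this made-up data, d1 has the highest raw similarity but falls to second place because high similarities in its classification were less reliable in the past.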

However, the technique of Patent Document 3 merely replaces the similarity with the search accuracy, based on the correspondence between the two, and thereby corrects (re-sorts) the display order of the result documents. With the mechanism described there, the user therefore cannot understand what caused a search to go badly, or what to do next in light of that cause.

Similar document search often calls for running the search work cycle of "specify search conditions → execute search → grasp trends and factors in the search results → revise search conditions → search again" efficiently, that is, for making search work more efficient. This requires a mechanism that presents, along with the search results, various information about their grounds, causes, and remedies, so that the user can revise the search conditions for the next search efficiently and accurately.

The technique of Patent Document 3, however, goes no further than re-sorting the result documents based on the correspondence between similarity and search accuracy. It discloses no mechanism for efficiently running the search work cycle of grasping trends and factors in the search results, revising the search conditions, and searching again. Consequently, the technique of Patent Document 3 cannot solve problems (3) and (4).

Patent Document 3 also considers only the similarity value itself and the classification to which a result document belongs. Yet the similarity, which expresses the resemblance between documents quantitatively, is in general a value computed under the influence of several fine-grained factors. Examples of such factors are the quality and number of the feature words of the input document used for the search; variation in the content, structure, and length of the target documents; the number of distinct document authors and how unspecified they are; and the quality and variation of the feature words used in the target documents.

For this reason, analyzing only the relationship between the similarity value itself and the search accuracy cannot identify why a search went badly. Identifying the cause requires analyzing the relationship between the finer-grained factors and the search accuracy, clearly distinguishing the factors that raise search accuracy from those that lower it, and presenting them quantitatively to the user. Patent Document 3, however, says nothing about any technique for identifying why a search went badly, so its technique cannot solve problem (3).

The present invention was completed in light of the technical background and the examination of the prior art described above. Of the four problems of similar document search, it provides a technique that solves problems (3) and (4) in particular. That is, when a similar document search has not worked well, the present invention lets the user understand what the cause is, and what to do next, and how, to obtain better search results. By solving these problems, the present invention enables the user to run the search work cycle efficiently.

To solve problem (3), the present invention defines factors that affect the accuracy of similar document search and then, for a search result, calculates for each factor the search accuracy as seen from that factor and/or its deviation from the average accuracy, and presents these to the user. For example, the similar document search support apparatus and program according to the present invention use hardware resources to execute the following processing. First, each factor is analyzed over a set of pairs of past input documents and correct documents, and the value ranges of the factor are associated with search accuracy and stored in a table. Next, the same factor analysis is performed on a new input document. Then, by lookup in the table, the search accuracy corresponding to the value range into which the factor value of the new input document falls is identified, and the search accuracy and/or its deviation from the average search accuracy over all past input documents is presented to the user.
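The table construction and lookup described above can be sketched for a single factor as follows. This is a simplified sketch, not the apparatus's actual processing. It assumes that each past input document contributes one factor value and a 0/1 flag saying whether its correct document was retrieved, and that the search accuracy of a value range is the mean of those flags; the source text in this section does not define the accuracy measure. The range boundaries and the names `bucket`, `build_accuracy_table`, and `assess` are hypothetical.

```python
from collections import defaultdict


def bucket(value, edges):
    """Index of the value range: range i covers edges[i-1] <= value < edges[i]."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def build_accuracy_table(history, edges):
    """history: list of (factor_value, hit), hit = 1 if the correct document was retrieved.

    Returns ({range_index: accuracy}, overall_average_accuracy).
    """
    hits_by_range = defaultdict(list)
    for value, hit in history:
        hits_by_range[bucket(value, edges)].append(hit)
    table = {b: sum(h) / len(h) for b, h in hits_by_range.items()}
    overall = sum(hit for _, hit in history) / len(history)
    return table, overall


def assess(value, table, overall, edges):
    """Return (range_index, accuracy, deviation from the overall average)."""
    b = bucket(value, edges)
    accuracy = table.get(b)  # None when no past document fell into this range
    deviation = None if accuracy is None else accuracy - overall
    return b, accuracy, deviation


# Hypothetical factor: number of feature words in the input document.
history = [(5, 0), (8, 0), (9, 1), (12, 1), (15, 0), (22, 1), (25, 1), (30, 1)]
edges = [10, 20]  # ranges: below 10, 10 to 19, 20 and above
table, overall = build_accuracy_table(history, edges)
group, accuracy, deviation = assess(7, table, overall, edges)
```

A negative deviation, as for the new document here, marks the factor as one that tends to lower accuracy relative to the average; a positive one marks it as favorable.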

To further solve problem (4), the present invention prepares a countermeasure table. As information that helps the user obtain better similar document search results, the table stores, for each factor group and from the viewpoint of each factor, the countermeasure content describing what should be done, the operation method describing how to carry it out, and the screen information identifying the screen to move to in order to perform that operation. When the set of result documents is reported to the user, the factor values and the search accuracy and/or deviation values stored in the accuracy influence table are presented, and at least one of the countermeasure content, operation method, and screen information in the countermeasure table is displayed alongside the corresponding factor group.
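One way to picture the countermeasure table is as a lookup keyed by factor and factor group. The sketch below is an assumption-laden illustration: the key layout, the field names, the sample entry, and the choice to show advice only for factors with a negative deviation are all hypothetical and not taken from the source, which says only that the countermeasure information is displayed alongside the factor group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Countermeasure:
    content: str    # what should be done
    operation: str  # how to do it
    screen: str     # screen to move to in order to do it


# Hypothetical countermeasure table keyed by (factor name, factor group).
COUNTERMEASURES = {
    ("feature_word_count", 0): Countermeasure(
        content="Add feature words to the search condition",
        operation="Enable pre-editing and enter additional terms",
        screen="search_condition_edit",
    ),
}


def advise(rows):
    """rows: list of (factor, group, deviation) for the current search.

    Returns (factor, Countermeasure) pairs for factors that lower accuracy
    (negative deviation; this filter is an assumption of the sketch)
    and that have an entry in the table.
    """
    advice = []
    for factor, group, deviation in rows:
        measure = COUNTERMEASURES.get((factor, group))
        if deviation < 0 and measure is not None:
            advice.append((factor, measure))
    return advice


rows = [("feature_word_count", 0, -0.29), ("classification_depth", 2, 0.15)]
advice = advise(rows)
```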

The present invention lets the user grasp the grounds for similar document search results. That is, the user can understand how well the similar document search worked and, if it did not work well, what the cause is. Furthermore, if at least one of the countermeasure content, operation method, and screen information in the countermeasure table is displayed alongside the factor group when the search accuracy and/or deviation value is presented, the user can also understand what to do next, and how, to obtain better search results when the search has not worked well. As a result, the search work cycle can be run efficiently, search work time can be shortened, and high-quality search results can be obtained. Problems, configurations, and effects other than those described above will become clear from the following description of embodiments.

・ Diagram showing a configuration example of the input document designation screen.
・ Diagram showing a configuration example of the search condition editing screen corresponding to countermeasure information.
・ Diagram showing a configuration example of the summary display screen for similar document search results.
・ Diagram showing a configuration example of the detailed display screen for similar document search results (upper part of the screen).
・ Diagram showing a configuration example of the detailed display screen for similar document search results (lower part of the screen).
・ Schematic diagram of the functional block configuration of the similar document search support apparatus.
・ Diagram showing a configuration example of the search index 505.
・ Diagram showing a configuration example of the bibliography table 507.
・ Diagram showing a configuration example of the teacher document table 508.
・ Diagram showing a configuration example of the feature word table 510.
・ Diagram showing a configuration example of the search result table 512.
・ Diagram showing a configuration example of the factor table 514.
・ Diagram showing a configuration example of the feature word collation table 515.
・ Diagram showing an example of the processing method executed by the factor data extraction unit 513.
・ Diagram showing a configuration example of the search accuracy table 517.
・ Diagram showing an example of the processing method executed by the search accuracy analysis unit 516.
・ Diagram showing a specific example of the processing method executed by the search accuracy analysis unit 516.
・ Diagram showing a configuration example of the accuracy influence table 520.
・ Diagram showing an example of the processing method of the accuracy influence calculation unit 519.
・ Diagram showing a hardware configuration example of the similar document search support apparatus.
・ Diagram showing another configuration example of the detailed display screen for search results (upper part of the screen).
・ Diagram showing another configuration example of the detailed display screen for search results (lower part of the screen).
・ Diagram showing a configuration example of the countermeasure information presentation screen.
・ Diagram showing a configuration example of the countermeasure table 522.

Embodiments of the present invention are described below with reference to the drawings. The following embodiment assumes a similar patent search system that takes a patent document as the search condition and retrieves past patent documents similar to the invention described in it. Specifically, it assumes the use case of searching past patent documents for prior art against a patent application under examination, by inputting the entire application document and retrieving patent documents whose invention content is similar.

Embodiments of the present invention are not, however, limited to this use case. Although this embodiment uses patent documents as the search target, documents such as papers, newspaper articles, design documents, e-mails, and Web pages can also be searched.

As the grounds for similar document search results, this embodiment provides functions that let the user understand which feature words in the input document contributed to the output of the search results and by how much, how well the similar document search worked, what the cause is when it has not worked well, and what to do next, and how, to obtain better search results in that case.

First, the inputs and outputs of this system are described using example screens. FIG. 1 shows a configuration example of the input document designation screen of this system. On the input document designation screen 100, the user enters into the input area 101 the patent application number that identifies the document on which the search is to be based. When the "Search" button 103 is pressed after the patent application number has been entered, the similar document search is executed and the search results are output on a separate screen. When the "Clear" button 102 is pressed, the contents of the input area 101 are erased.

The input document designation screen 100 provides two search options: check box 104, for selecting whether to perform pre-editing, in which the feature words extracted from the input document and their weights are checked and corrected before the search is executed; and check box 105, for selecting whether to expand the feature words extracted from the input document into synonyms before executing the search. When the search button 103 is pressed with check box 104 and/or 105 checked, a screen for editing search conditions such as feature words and synonyms is displayed, as shown in FIG. 2. The detailed configuration of that screen is described later.

This embodiment assumes that the input document is designated by entering a document ID such as an application number, but the user may instead copy the text of a patent and paste it into the input area, or type text directly into the input area. Alternatively, the input document may be designated by selecting an arbitrary document from a list of documents, such as a set of document search results.

FIG. 3 shows a configuration example of the summary display screen 300 used to display similar document search results. The summary display screen 300 lists the documents retrieved as similar documents in descending order of their degree of similarity (similarity) to the input document. For each retrieved document it displays the rank 308 in the search results, the similarity 309, the document ID 310 (the application number), the title of the invention 311 (corresponding to the document title), and the applicant 312. Other bibliographic or text information, such as classifications or abstract text, may of course also be displayed.

In this embodiment, an "Abstract" button 301, which displays the abstract data of the documents selected with the selection check boxes 307, and a "Text" button 302, which displays their full text data, are provided at the top of the summary display screen 300. Pressing the "Back" button 304, also at the top of the screen, returns the display to the input document designation screen 100. Pressing the "Next" button 306 displays the next 10 result documents, and pressing the "Previous" button 305 displays the previous 10.

FIGS. 4A and 4B show a configuration example of the detailed display screen for similar document search results. This screen is displayed when the "Details" button 303 at the top of the summary display screen 300 (FIG. 3) is pressed. For reasons of space, table 400, displayed in the upper part of the screen, is shown in FIG. 4A, and table 470, displayed in the lower part, is shown in FIG. 4B. Table 400 shows the result of analyzing whether the similar document search output in FIG. 3 worked well and, if it did not, what the cause is.

Table 400 consists of the factors 410 that affect the accuracy of similar document search, the factor values 440 for those factors, and the "influence on search accuracy" 450 obtained for each factor. A factor 410 consists of the factor category 420 to which it belongs and the factor name 430. The factor value 440 consists of the value 441, which is the factor value for the current input document, and the field average 442 of the factor values over multiple teacher input documents. The "influence on search accuracy" 450 consists of the "corresponding factor group" 451 to which the value 441 belongs, the search accuracy 452 corresponding to that factor group 451, and the influence 453, which expresses how much the factor affects the accuracy of the similar document search as the degree of deviation from the average search accuracy over all teacher input documents. A factor whose influence 453 is positive can be regarded as contributing more to improved search accuracy the larger its absolute value, and a factor whose influence 453 is negative can be regarded as a greater cause of reduced search accuracy the larger its absolute value. By checking these influence values, the user can understand whether the search worked well and which factors are lowering the search accuracy. Only one of the search accuracy 452 and the influence 453 may be displayed instead of both.

Table 470 takes the weighted feature words extracted from the input document as its vertical axis and the search result documents as its horizontal axis. In table 470, the value corresponding to each feature word in a search result document 472 is represented by a fill whose darkness varies with the magnitude of the feature word's weight. On the vertical axis, all 20 feature words extracted from the input document are listed in descending order of weight; on the horizontal axis, the top 30 search result documents 472 are listed in descending order of similarity.

The data on the feature words 471 of the input document consists of the feature word heading 473; the "top hit count" 474, which is the number of the top 30 search result documents 472 in which the feature word was hit; the appearance frequency 475 of the feature word in the input document; the uniqueness 476, calculated from the number of documents in the document database in which the feature word appears; and the feature word weight 477, calculated from the appearance frequency 475 and the uniqueness 476.

Under similarity 479, the similarity value of each search result document 472 is represented by a fill whose darkness varies with its magnitude. Under classification 480, the classification assigned to the input document is compared with the classification assigned to each search result document 472, and a search result document 472 whose classification matches down to a lower level is filled more darkly. Under applicant 481, the applicant and inventor of the input document are compared with those of each search result document 472; a search result document 472 with the same inventor is filled darkly, and one with the same applicant is filled in a somewhat lighter shade.

When any of the elements 473 to 477 making up the feature words 471 of the input document is selected, the table is redisplayed with its rows sorted in descending order using the selected element as the key. Area 482 displays cells filled with a darkness corresponding to the magnitude of the weight Wij of feature word i in the search result document ranked j-th. A darker cell indicates a feature word that is given more importance in that search result document, and an uncolored cell indicates that the feature word does not appear in that search result document. Instead of the weight Wij, the cells may be filled according to the magnitude of the partial similarity Sij of feature word i within the similarity Sj of the search result document ranked j-th.

In this embodiment, the similarity between two documents is calculated as 100 times the cosine of the angle between their vectors of weighted feature words. The partial similarity Sij can therefore be calculated by multiplying the weight of feature word i in the input document by the weight of feature word i in search result document j, and dividing the product by the product of the magnitude of the feature word vector of the input document and the magnitude of the feature word vector of search result document j. By referring to table 470, the user can grasp visually and intuitively which feature words contribute to the output of the search results, and to what degree.
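The partial similarity described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names and the dictionary representation of a feature word vector are assumptions, and because the text does not say whether the factor of 100 is folded into each Sij, the sketch applies it only to the sum.

```python
import math


def partial_similarities(query, doc):
    """Per-term contributions to the cosine of two weighted feature word vectors.

    query and doc map feature word -> weight (a hypothetical representation).
    Only words present in both vectors contribute.
    """
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in doc.values()))
    if q_norm == 0 or d_norm == 0:
        return {}
    return {t: query[t] * doc[t] / (q_norm * d_norm) for t in query if t in doc}


def similarity(query, doc):
    """Cosine of the two vectors multiplied by 100 (assumed scaling of the sum)."""
    return 100 * sum(partial_similarities(query, doc).values())
```

Because the partial similarities add up to the cosine, shading each cell by Sij shows how much of a document's similarity is owed to each feature word.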

In addition, the details behind the influence degrees presented in table 400 (FIG. 4A) can be understood by referring to table 470 (FIG. 4B). For example, table 400 shows the value of the factor "total hit count" 432 as "166", which matches the total number of colored cells in table 470. A bird's-eye view of table 470 therefore shows at a glance how the 166 colored cells are distributed.

Likewise, the value of the factor "multi-hit feature word count" 436 is shown as "5", which matches the number of feature words in table 470 whose top hit count 474 is at or above a threshold (24 documents, corresponding to 80%, in this embodiment). The user can therefore see which feature words qualify as multi-hit feature words either by surveying table 470 as a whole, or by sorting table 470 in descending order of top hit count 474 and surveying the feature words at the top of the sorted table.

By displaying table 400, which shows the factors bearing on similar document search accuracy and their influence degrees, together with table 470, which shows how feature words match against search result documents, the user can relate the two as needed and understand the tendencies of the search results more accurately and more deeply.

Next, the configuration, data structures, and processing methods of a similar document search support system, including the processing that calculates the per-factor search accuracy and its influence degree (deviation) as shown in FIGS. 4A and 4B, are described with reference to the drawings.

FIG. 5 shows the functional block configuration of the similar document search support apparatus 500 according to this embodiment. The patent document data to be searched is stored in the document database 501 via the input device 530. The feature word extraction unit 502 extracts, from each patent document stored in the document database 501, the feature words, the weights representing their importance, and the appearance frequency and uniqueness used to calculate the weights. In this embodiment, the feature word extraction unit 502 refers to the word dictionary 503 to perform morphological analysis, which splits text into words, and extracts words whose part of speech is noun or verb as feature words. The search index generation unit 504 consolidates the per-document numerical data on feature words and weights obtained by the feature word extraction unit 502 and stores it in the search index 505 so that similar document search can be performed efficiently. The bibliographic extraction unit 506 extracts bibliographic information such as the publication date, filing date, patent classification, applicant, and inventor from each patent document stored in the document database 501, and stores it in the bibliographic table 507, document by document, split into bibliographic item names and bibliographic item values. The processing of the feature word extraction unit 502, the search index generation unit 504, and the bibliographic extraction unit 506 is already implemented in many commercially available similar document search systems and is not discussed further in this embodiment. The processing of these three units is executed in advance so that a similar document search can actually be run on a specified input document.

FIG. 6 shows a configuration example of the search index 505. In this embodiment, the search index 505 consists of a weight index 600, which has the documents in the document database 501 and the feature words as its two axes and the corresponding weights as its values; an appearance frequency index 610, which has the corresponding appearance frequencies as its values; and a uniqueness index 620, which consists of feature words and their uniqueness.

In this embodiment, the weight w of a feature word T in a document d is calculated as follows. First, the logarithm log TF of the appearance frequency TF of feature word T in document d is obtained. Next, the uniqueness IDF of feature word T is obtained as log(N/n), the logarithm of the number of documents N stored in the document database 501 divided by the number of documents n that contain feature word T. Finally, the weight is calculated as w = (1 + log TF) × log(N/n). When TF = 0, w is set to 0. This method is widely known as the TF-IDF method and is not discussed further.
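The weight calculation above can be written as a short function. This is a sketch under assumptions: the function and parameter names are illustrative, and the natural logarithm is used because the text does not state the base.

```python
import math


def tfidf_weight(tf, n_docs, doc_freq):
    """Weight w = (1 + log TF) * log(N / n), with w = 0 when TF = 0.

    tf: appearance frequency TF of the feature word in the document
    n_docs: number of documents N in the document database
    doc_freq: number of documents n that contain the feature word
    The natural logarithm is an assumption; the source leaves the base open.
    """
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / doc_freq)
```

A word that occurs in every document (n = N) receives a weight of 0 regardless of its frequency, since log(N/n) is then 0.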

FIG. 7 shows a configuration example of the bibliographic table 507. The bibliographic table 507 consists of a number 700, a document ID 701, a bibliographic item name 702, and a bibliographic item value 703. In this embodiment, the table stores, for each document, the publication date, the filing date, the patent classifications (IPC and theme), the applicant, and the inventor, but other bibliographic items may be stored as well.
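Because the bibliographic table holds one row per item name and item value, a document with several inventors or classifications occupies several rows. The following sketch groups such rows by document; the tuple layout mirrors the four columns above, while the function name and the sample item names are hypothetical.

```python
def group_bibliography(rows):
    """Group (number, doc_id, item_name, item_value) rows by document.

    Returns {doc_id: {item_name: [item_value, ...]}} so that repeated items
    (for example several inventors) are kept as a list.
    """
    grouped = {}
    for _number, doc_id, item_name, item_value in rows:
        grouped.setdefault(doc_id, {}).setdefault(item_name, []).append(item_value)
    return grouped
```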

Returning to the description of FIG. 5. The teacher document table 508 is data made up of a plurality of pairs, each pairing an input document for which the patent documents to be found (hereinafter "correct documents") are already known (hereinafter a "teacher input document") with the correct documents corresponding to it. The data is entered by a user or the system administrator via the input device 530.

FIG. 8 shows a configuration example of the teacher document table 508. The teacher document table 508 consists of a teacher data ID 801, a teacher input document ID 802, and a correct document ID 803, and stores a plurality of such associated entries. In this embodiment, a patent cited in a notice of reasons for refusal issued for a patent application previously examined by the patent office is defined as a "correct document" for the teacher input document. Of course, a user or the system administrator may define correct documents from any viewpoint of their own and use registered, accumulated associations between teacher input documents and correct documents, or correct documents may be specified under some other definition. One teacher input document may have more than one correct document. When there are several correct documents, only the one most similar to the input document may be treated as correct, or only the correct document ranked highest in the similar document search results may be used as the correct document.

Returning to the description of FIG. 5. The feature word collection unit 509 extracts the feature words for a teacher input document stored in the teacher document table 508, or for the new input document number 518 specified by the user via the input device 530, by referring to the search index 505, and stores the extracted result in the feature word table 510.

In this embodiment, the feature word and bibliographic data for the new input document number 518 and for the teacher input documents in the teacher document table 508 are all assumed to be stored in the search index 505 and the bibliographic table 507, respectively. The feature words for these input documents can therefore be collected easily by extracting the feature words corresponding to the input document, together with their weights and appearance frequencies, from the search index 505 and storing them in the feature word table 510.

The uniqueness can likewise be collected easily by extracting, from the search index 505, the uniqueness values corresponding to the extracted feature words and storing them in the feature word table 510. However, if the input document designation screen 100 shown in FIG. 1 allows the user to enter arbitrary text, no feature words for that text are stored in the search index 505. In that case, the entered text is passed to the feature word extraction unit 502, which extracts the feature words and assigns their weights.

FIG. 9 shows a configuration example of the feature word table 510. The feature word table 510 consists of a document ID 901, a heading 902, an appearance frequency 903, a uniqueness 904, and a weight 905.

Returning to the description of FIG. 5. The similar document search unit 511 refers to the search index 505 and searches for documents similar to the set of weighted feature words that the feature word collection unit 509 stored in the feature word table 510, calculating a similarity for each, and stores the top 30 search results in the search result table 512.

As described above, in this embodiment the similarity between two documents is calculated as 100 times the cosine of the angle between their vectors of weighted feature words. The similarity therefore takes a value between 0 and 100, and the closer it is to 100, the more similar the documents are. Treating a set of feature words as a vector and determining the similarity of two such sets from the angle between the vectors, or from their inner product, is widely known as the vector space model and is not discussed further.

FIG. 10 shows a configuration example of the search result table 512. The search result table 512 consists of an input document ID 1001, a search rank 1002, a similarity 1003, and a search result document ID 1004. When similar document search results are output, an option may be added that compares the filing and publication dates of the input document and the search result documents and retrieves only patents published before the filing date of the input document. By applying the processing of the feature word collection unit 509 and the similar document search unit 511 to every teacher input document stored in the teacher document table 508, the feature word table 510 and the search result table 512 come to hold the feature words and search results, respectively, for the plurality of teacher input documents.
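The construction of search result rows, including the optional date filter, might look like the following sketch. It assumes the similarities have already been computed; the function name, the argument layout, the strict "published before the filing date" comparison, and the tie-break on document ID are all assumptions made for illustration.

```python
from datetime import date


def build_result_rows(input_id, scores, pub_dates=None, filing_date=None, m=30):
    """Return up to m rows (input_id, rank, similarity, result_doc_id).

    scores: {doc_id: similarity}, already computed elsewhere.
    pub_dates / filing_date: when both are given, keep only documents
    published strictly before the input document's filing date.
    """
    candidates = list(scores.items())
    if pub_dates is not None and filing_date is not None:
        candidates = [
            (doc_id, s) for doc_id, s in candidates
            if doc_id in pub_dates and pub_dates[doc_id] < filing_date
        ]
    # Highest similarity first; ties broken by document ID for determinism.
    ranked = sorted(candidates, key=lambda pair: (-pair[1], pair[0]))[:m]
    return [(input_id, rank, s, doc_id)
            for rank, (doc_id, s) in enumerate(ranked, start=1)]
```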

Returning to the description of FIG. 5. The factor data extraction unit 513 refers to at least one of the feature word table 510, the search result table 512, and the bibliographic table 507, which hold the data obtained by applying the processing described above to each teacher input document, extracts the values 441 corresponding to the factors 410 shown in FIG. 4A, and stores them in the factor table 514. In addition, to generate table 470 of FIG. 4B, the factor data extraction unit 513 analyzes the correspondence between feature words and search result documents and stores it in the feature word matching table 515 together with the data on the feature words and their weights.

As also shown in FIGS. 4A and 4B, this embodiment uses the following eight factors as factors that affect similar document search accuracy. These factors fall broadly into three factor categories.

(Factor category 1) Feature word hit tendency
These factors concern the hit tendency between the feature words of the input document and the search result documents. That is, they can be calculated from the data of table 470 shown in FIG. 4B, which represents the hit status between feature words and search result documents (the data itself is stored in the feature word matching table 515). Specifically, there are the following six factors.

(Factor 1) Effective feature word count
The number of feature words whose top hit count 474 in table 470 is at or above a predetermined threshold (4 documents in this embodiment). When this value is small, few feature words are available as clues for similar document search, which may adversely affect search accuracy.

(Factor 2) Total hit count
The number of colored cells in table 470, in other words the sum of the top hit counts 474. When this value is small, few search result documents are hit by the feature words, which may adversely affect search accuracy. Conversely, when it is large, many search result documents are hit by the feature words and the similar documents cannot be narrowed down to a small number, which may also adversely affect search accuracy.

(Factor 3) High hit count
Of the colored cells in table 470, the number of (dark) cells whose value is at or above a predetermined threshold ("20" in this embodiment). When this value is small, the hit feature words carry little importance in the search result documents, so similar documents are hard to narrow down, which may adversely affect search accuracy.

(Factor 4) High hit rate
The high hit count divided by the total hit count. When this value is small, many feature words of the input document are unimportant in the search result documents, which may adversely affect search accuracy.

(Factor 5) Value average
The average of the values of the colored cells in table 470. When this value is small, many feature words of the input document are unimportant in the search result documents, which may adversely affect search accuracy.

(Factor 6) Multi-hit feature word count
The number of feature words of the input document that appear in more search result documents than a predetermined threshold (24 documents, corresponding to 80%, in this embodiment). Feature words that qualify as multi-hit feature words are often words commonly used in the technical field (classification) or words commonly used in general documents. When the multi-hit feature word count is large, related documents can be narrowed down roughly but not on the essential point of the document content (for a patent, the part that expresses the features of the invention, namely its novelty and inventive step), which may adversely affect search accuracy.
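The six factors of this category can all be derived from one matrix of feature word weights, as the following sketch illustrates. The matrix layout, the function name, and the dictionary keys are hypothetical. The sketch also assumes that a colored cell corresponds to a weight greater than 0, that every threshold is inclusive, and that the two ratios are 0 when there are no hits at all.

```python
def hit_tendency_factors(matrix, effective_min=4, high_value_min=20.0, multi_hit_min=24):
    """Compute factors 1 to 6 from a feature word by result document matrix.

    matrix: {feature_word: [weight in result doc 1, ..., weight in result doc M]},
    where 0 means the word does not occur in that document (assumption).
    """
    hits_per_word = {t: sum(1 for w in row if w > 0) for t, row in matrix.items()}
    hit_values = [w for row in matrix.values() for w in row if w > 0]
    total_hits = len(hit_values)
    high_hits = sum(1 for w in hit_values if w >= high_value_min)
    return {
        "effective_feature_words": sum(1 for h in hits_per_word.values() if h >= effective_min),
        "total_hits": total_hits,
        "high_hits": high_hits,
        "high_hit_rate": high_hits / total_hits if total_hits else 0.0,
        "value_average": sum(hit_values) / total_hits if total_hits else 0.0,
        "multi_hit_feature_words": sum(1 for h in hits_per_word.values() if h >= multi_hit_min),
    }
```

With the embodiment's settings (30 result documents), the defaults correspond to the thresholds of 4 documents, a cell value of 20, and 24 documents quoted above.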

(Factor category 2) Bibliographic hit tendency
These factors concern what the bibliographic information of the input document has in common with that of the search result documents. Because bibliographic information can be extracted easily from the bibliographic table 507, the commonality can be analyzed by comparing the two. Specifically, there is the following factor.

(Factor 7) Classification hit count
The number of search result documents whose assigned classification is shared with the classification assigned to the input document. Patent documents have several classification systems (IPC/FI, theme/F-term), each organized in multiple levels (section, subclass, main group, and so on). In this embodiment, the number of search result documents sharing a classification is calculated at the IPC main group level, but it may be calculated at another level.
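A minimal sketch of the classification hit count at the IPC main group level follows. It assumes IPC symbols written in the common form "G06F 17/30", where the part before the slash identifies the main group; the function names and the input layout are hypothetical.

```python
def main_group(ipc_symbol):
    """Reduce an IPC symbol such as 'G06F 17/30' to its main group key 'G06F17'."""
    return ipc_symbol.split("/")[0].replace(" ", "").upper()


def classification_hit_count(input_ipcs, result_ipcs_by_doc):
    """Count result documents sharing at least one IPC main group with the input.

    input_ipcs: IPC symbols assigned to the input document
    result_ipcs_by_doc: {result_doc_id: [IPC symbols]}
    """
    wanted = {main_group(code) for code in input_ipcs}
    return sum(
        1 for codes in result_ipcs_by_doc.values()
        if wanted & {main_group(code) for code in codes}
    )
```

Comparing at another level, such as the subclass, would only require a different key function.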

Besides the classification hit count, other possible factors relating to the bibliographic hit tendency include an "applicant hit count", which is the number of search result documents with the same inventor or applicant, and a "filing date divergence", which is either the number of search result documents whose filing date differs from that of the input document by a threshold or more, or the average of those differences. These factors may also be used.

(Factor category 3) Similarity
This factor concerns the similarity values of the search result documents with respect to the input document. Specifically, there is the following factor.

(Factor 8) Similarity decay rate
A numerical expression of how the similarity of the top-ranked search result documents decays as the rank falls. Specifically, the ratio of the similarity of the search result document at a predetermined rank R2 (30th in this embodiment) to the similarity of the search result document at a predetermined rank R1 (1st in this embodiment) is taken as the similarity decay rate of the search results. When the similarity decay rate is low, many similar documents with closely matched similarities are output, which may adversely affect search accuracy.
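The ratio defined above can be computed as in the following sketch. The function name and argument layout are assumptions, and raising an error for an out-of-range rank or a zero similarity at rank R1 is a choice made for the illustration, not something the text specifies.

```python
def similarity_decay_rate(similarities, r1=1, r2=30):
    """Ratio of the similarity at rank r2 to the similarity at rank r1.

    similarities: similarity values in rank order, so index 0 is rank 1.
    """
    n = len(similarities)
    if not (1 <= r1 <= n and 1 <= r2 <= n):
        raise ValueError("rank out of range for the given result list")
    top = similarities[r1 - 1]
    if top == 0:
        raise ValueError("similarity at rank r1 is zero")
    return similarities[r2 - 1] / top
```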

FIG. 11 shows a configuration example of the factor table 514. The factor table 514 consists of an input document ID 1101, the correct document ID 1102 for the input document, the classification 1103 assigned to the input document (in this embodiment, the theme classification assigned to the patent document), and the search rank 1104 of the correct document ID 1102 in the similar document search results. The fields from the effective feature word count 1105 through the similarity decay rate 1112 correspond to the factors described above and store the values (factor values) calculated for each input document ID 1101. As described later, the classification 1103 is used to filter teacher input documents when the influence of each factor on similar document search accuracy is to be calculated separately for each technical field.

FIG. 12 shows a configuration example of the feature word matching table 515. The feature word matching table 515 is divided into a part 1201 that stores data on the feature words of the input document and a part 1210 that stores the weights of the feature words in the search result documents. Part 1201 consists of the feature word heading 1202, the hit count 1203 of the feature word among the 30 search result documents, the appearance frequency 1204 of the feature word in the input document, the uniqueness 1205 of the feature word in the document database 501, and the feature word weight 1206. The feature word matching table 515 is also referred to when table 470 shown in FIG. 4B is displayed.

FIG. 13 shows an example of the processing executed by the factor data extraction unit 513. The processing consists of a feature word matching table generation process 1302 and processes 1303 to 1310. Process 1302 generates the feature word matching table 515, which stores data on how the feature words of the input document hit in the search result documents, so that the values of the factors belonging to the factor category "feature word hit tendency" can be extracted efficiently. Processes 1303 to 1310 calculate each factor value for each input document by referring to the feature word matching table 515 and other tables.

The factor data extraction unit 513 executes the following processing. In step 1301, the factor data extraction unit 513 determines whether any unprocessed input document remains; if none remains, the processing ends. If one remains, the factor data extraction unit 513 executes the feature word matching table generation process 1302.

The feature word matching table generation process 1302 consists of steps 1351 to 1356 described below. In step 1351, the factor data extraction unit 513 retrieves the heading, appearance frequency, uniqueness, and weight of each feature word of the input document from the feature word table 510 and stores them in the corresponding areas of the feature word matching table 515. In the next step 1352, the factor data extraction unit 513 extracts from the search result table 512 a predetermined number M (30 in this embodiment) of top-ranked search result documents for the input document. In the following step 1353, the factor data extraction unit 513 extracts the feature words and weights corresponding to each of the M extracted search result documents from the weight index 600 of the search index 505.

In the next step 1354, the factor data extraction unit 513 determines whether any unprocessed feature word of the input document remains. If none remains, the factor data extraction unit 513 proceeds to step 1303. If one remains, the factor data extraction unit 513 first, in step 1355, retrieves the weight of the feature word in each of the M search result documents that contain it, and stores each weight in the area of the feature word matching table 515 corresponding to that search result document and that feature word.

In step 1356, the factor data extraction unit 513 counts how many of the M search result documents contain the feature word, stores the count in the "number of hits 1203" area of the feature word matching table 515 (FIG. 12), and returns to step 1354.
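Steps 1351 to 1356 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary layout, and the rule that a document "contains" a feature word when its weight is greater than 0 are all assumptions made for the example.

```python
def build_matching_table(input_words, result_docs):
    """Build {feature word: {"weights": [...], "hits": n}} for one input document.

    input_words: feature words of the input document (hypothetical layout).
    result_docs: one {word: weight} dict per top-M search result document.
    """
    table = {}
    for word in input_words:
        # Step 1355: weight of the word in each result document (0.0 if absent).
        weights = [doc.get(word, 0.0) for doc in result_docs]
        # Step 1356: number of result documents that contain the word
        # (assumed here to mean "weight greater than 0").
        hits = sum(1 for w in weights if w > 0)
        table[word] = {"weights": weights, "hits": hits}
    return table
```

Each row of the resulting table corresponds to one feature word of the input document, with one weight per search result document and the hit count for that word.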

The effective feature word count calculation process 1303 calculates the value of the factor "number of effective feature words" and consists of step 1373. In step 1373, the factor data extraction unit 513 counts the feature words whose "number of hits 1203" in the feature word matching table 515 (FIG. 12) is equal to or greater than a pre-specified threshold (4 in this embodiment) and stores the count in the number-of-effective-feature-words area of the factor table 514.

The total hit count calculation process 1304 calculates the value of the factor "total number of hits" and consists of step 1374. In step 1374, the factor data extraction unit 513 sums the "number of hits 1203" values in the feature word matching table 515 (FIG. 12) and stores the sum in the total-number-of-hits area of the factor table 514.

The high hit count calculation process 1305 calculates the value of the factor "number of high hits" and consists of step 1375. In step 1375, the factor data extraction unit 513 counts, over all entries, the feature word weights that were extracted in step 1355 and stored in the feature word matching table 515 and that are equal to or greater than a pre-specified threshold (20 in this embodiment), and stores the count in the number-of-high-hits area of the factor table 514.

The high hit rate calculation process 1306 calculates the value of the factor "high hit rate" and consists of step 1376. In step 1376, the factor data extraction unit 513 divides the number of high hits obtained in step 1375 by the total number of hits obtained in step 1374 and stores the result in the high-hit-rate area of the factor table 514.

The value average calculation process 1307 calculates the value of the factor "value average" and consists of step 1377. In step 1377, the factor data extraction unit 513 takes the feature word weights that were extracted in step 1355 and stored in the feature word matching table 515, averages those that are greater than 0, and stores the average in the value-average area of the factor table 514.

The multi-hit feature word count calculation process 1308 calculates the value of the factor "number of multi-hit feature words" and consists of step 1378. In step 1378, the factor data extraction unit 513 counts the feature words whose "number of hits 1203" in the feature word matching table 515 (FIG. 12) is equal to or greater than a pre-specified threshold (24 in this embodiment) and stores the count in the number-of-multi-hit-feature-words area of the factor table 514.
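The six factors computed in processes 1303 to 1308 all derive from the same matching table, so they can be sketched in one function. The table layout (`{"weights": [...], "hits": n}` per feature word), the function name, and the handling of an empty table are assumptions for illustration; the default thresholds are the values given for this embodiment.

```python
def compute_factors(table, effective_min_hits=4, high_weight=20.0, multi_min_hits=24):
    """Compute the factors of processes 1303 to 1308 from a matching table.

    table: {feature word: {"weights": [...], "hits": n}} (hypothetical layout).
    """
    hits = [row["hits"] for row in table.values()]
    # Weights greater than 0, i.e. one entry per (feature word, document) hit.
    positive = [w for row in table.values() for w in row["weights"] if w > 0]
    total_hits = sum(hits)                                    # step 1374
    high_hits = sum(1 for w in positive if w >= high_weight)  # step 1375
    return {
        "effective_feature_words": sum(1 for h in hits if h >= effective_min_hits),  # 1373
        "total_hits": total_hits,
        "high_hits": high_hits,
        # Step 1376; returning 0.0 when there are no hits is an assumption.
        "high_hit_rate": high_hits / total_hits if total_hits else 0.0,
        # Step 1377: mean of the weights greater than 0.
        "value_average": sum(positive) / len(positive) if positive else 0.0,
        "multi_hit_feature_words": sum(1 for h in hits if h >= multi_min_hits),      # 1378
    }
```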

The classification hit count calculation process 1309 calculates the value of the factor "number of classification hits" and consists of step 1379. In step 1379, the factor data extraction unit 513 extracts from the bibliographic table 507 the IPC main groups of the input document and of each of the M search result documents, counts the search result documents that share at least one IPC main group with the input document, and stores the count in the number-of-classification-hits area of the factor table 514.

The similarity attenuation rate calculation process 1310 calculates the value of the factor "similarity attenuation rate" and consists of step 1380. In step 1380, the factor data extraction unit 513 calculates the ratio of the similarity of the search result document at a pre-specified rank R2 (30th in this embodiment) to the similarity of the search result document at a pre-specified rank R1 (1st in this embodiment) in the search result table 512, and stores the ratio in the similarity-attenuation-rate area of the factor table 514. The factor data extraction unit 513 then returns to step 1301.
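Steps 1379 and 1380 can be illustrated with two short functions. The function names, the representation of IPC main groups as plain strings, and the rank-ordered similarity list are assumptions for this sketch.

```python
def classification_hits(input_groups, result_groups):
    """Step 1379: count result documents sharing an IPC main group with the input.

    input_groups: IPC main groups of the input document (strings, hypothetical).
    result_groups: one list of IPC main groups per search result document.
    """
    input_set = set(input_groups)
    return sum(1 for groups in result_groups if input_set & set(groups))


def similarity_attenuation_rate(similarities, r1=1, r2=30):
    """Step 1380: similarity at rank r2 divided by similarity at rank r1.

    similarities: similarity scores ordered by search rank (index 0 is rank 1).
    """
    return similarities[r2 - 1] / similarities[r1 - 1]
```

A rate close to 1 means that the 30th result is almost as similar to the input as the 1st; a rate close to 0 means that similarity falls off sharply within the top results.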

Returning to FIG. 5, the search accuracy analysis unit 516 uses the factor data stored in the factor table 514 for the set of teacher input documents in the teacher document table 508 to calculate the search accuracy for each factor, and then calculates its difference (deviation) from the average search accuracy over all teacher input documents. The deviation is later presented to the user as an index of how strongly each factor affects search accuracy. The results are stored in the search accuracy table 517. In this embodiment, search accuracy is defined as "the proportion of input documents for which the search rank of the correct document is within a pre-specified threshold R (100th in this embodiment)". Other definitions may of course be used.
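The definition of search accuracy used in this embodiment is simple enough to state in code. The sketch below assumes that each teacher input document is represented only by the search rank of its correct document; the function name and the return value for an empty list are assumptions.

```python
def search_accuracy(ranks, threshold=100):
    """Proportion of input documents whose correct document ranks within threshold.

    ranks: search rank of the correct document, one entry per input document.
    threshold: the rank R (100 in this embodiment).
    """
    if not ranks:
        return 0.0  # assumption: an empty set of documents has accuracy 0
    return sum(1 for rank in ranks if rank <= threshold) / len(ranks)
```

For instance, ranks of 1, 50, 100, 101, and 500 give three documents within the threshold out of five.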

FIG. 14 shows a configuration example of the search accuracy table 517. The search accuracy table 517 consists of a factor ID 1401 that identifies the factor, a factor category 1402 that groups the factors, a factor name 1403, a factor group ID 1404 that identifies each factor group making up a factor, a factor group name 1405, the lower limit 1406 and the upper limit 1407 of the values the factor group can take, the search accuracy 1408 of the teacher input documents that belong to the factor group, and the "deviation from the accuracy mean 1409", which is the difference between the search accuracy 1408 and the search accuracy over all teacher input documents.

In the search accuracy table 517, the factor ID 1401, the factor category 1402, and the factor name 1403 are fixed in advance. In this embodiment each factor is divided into three groups, but it can also be divided into a number of groups specified by the user.

FIG. 15 shows an example of the processing method executed by the search accuracy analysis unit 516. FIG. 16 shows a concrete example of this processing method.

As shown in FIG. 15, the search accuracy analysis unit 516 first determines in step 1501 whether any factor remains unprocessed, and ends the processing if none remains. If one remains, then in step 1502 the search accuracy analysis unit 516 reads from the factor table 514 the input document ID 1101, the search rank 1104, and the factor value for the factor being processed (one of the values 1105 to 1112) for each document to be analyzed, and stores them temporarily in a two-dimensional array. An example of the result up to this point is shown in table 1600 at the left of FIG. 16.

In this embodiment, the search accuracy analysis unit 516 generates the search accuracy table 517 from all the teacher input documents stored in the teacher document table 508. It is also possible, however, to filter the teacher input documents by the classification 1103 of the factor table 514 and to generate the search accuracy table 517 only from the data of teacher input documents that carry a particular classification. Similar document search accuracy is thought to depend strongly on the technical field, so extracting and analyzing only the teacher input documents that satisfy a particular condition is considered effective. The filtering criterion need not be the classification 1103; the filing date or the applicant, for example, may be used instead.

Next, in step 1503, the search accuracy analysis unit 516 calculates the "accuracy mean": among all the input documents whose factor values were extracted, the proportion for which the search rank of the correct document is within the pre-specified threshold R (100th in this embodiment).

Next, in step 1504, the search accuracy analysis unit 516 sorts the two-dimensional array of input document IDs, search ranks, and factor values stored in step 1502 in ascending order, using the factor value as the key. An example of the result up to this point is shown in table 1610 at the center of FIG. 16.

Next, in step 1505, the search accuracy analysis unit 516 divides (groups) the two-dimensional array by factor value into a pre-specified number N of factor groups (3 in this embodiment). An example of the result up to this point is shown at 1612 to 1614 in table 1610 at the right of FIG. 16. In the example of FIG. 16, the factor groups "low" and "high" each consist of 5 input documents, and "medium" consists of 10 input documents. The number or proportion of input documents assigned to each factor group may be the same for all factor groups or may vary by factor group, and it may also be specified by the user.
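Steps 1504 and 1505 amount to a sort followed by a split into consecutive slices. The sketch below assumes that each record is a `(document ID, search rank, factor value)` tuple and that the group sizes are supplied explicitly, as in the 5 / 10 / 5 split of FIG. 16; the function name and record layout are hypothetical.

```python
def split_into_groups(records, sizes):
    """Sort records by factor value and cut them into consecutive groups.

    records: list of (doc_id, search_rank, factor_value) tuples (hypothetical layout).
    sizes: number of records per group, ordered from the lowest factor values up.
    """
    if sum(sizes) != len(records):
        raise ValueError("group sizes must add up to the number of records")
    ordered = sorted(records, key=lambda rec: rec[2])  # step 1504: ascending sort
    groups, start = [], 0
    for size in sizes:                                 # step 1505: grouping
        groups.append(ordered[start:start + size])
        start += size
    return groups
```

With 20 records, `split_into_groups(records, [5, 10, 5])` would reproduce the "low", "medium", and "high" proportions of the figure.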

In step 1506, the search accuracy analysis unit 516 determines whether any factor group remains unprocessed. If none remains, the search accuracy analysis unit 516 returns to step 1501 and moves on to the next factor. If one remains, the search accuracy analysis unit 516 first obtains, in step 1507, the upper and lower limits of the factor value for that factor group. An example of the result of this step is shown at 1614 in table 1610 at the right of FIG. 16.

Some factor values are discrete and others are continuous. For example, the number of effective feature words is a discrete, integer value, whereas the similarity attenuation rate is a continuous, real value.

When the upper and lower limits of the factor groups are determined, no value at the boundary between adjacent factor groups may be left belonging to neither group. If such a value exists, it must be decided which factor group it goes into. In FIG. 16, for example, the upper limit of the factor group "low" is "12" but the lower limit of the factor group "medium" is "14", so a factor value of "13" would belong to neither. This embodiment divides each factor into the three groups "low", "medium", and "high" and resolves the problem with a heuristic: every value that belongs to no group is included in "medium". As a result, as shown at 1614 in table 1610 at the right of FIG. 16, the lower limit of the factor group "medium" is "13" rather than "14". Other methods may of course be used, such as taking the average of the upper limit of "low" and the lower limit of "medium" and splitting the gap evenly.
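For an integer-valued factor, the "assign the gap to medium" heuristic can be sketched as below. The function name and the tuple representation are assumptions, and the `+ 1` / `- 1` arithmetic only makes sense for discrete integer factors; a continuous factor such as the similarity attenuation rate would need a different boundary rule, which the text does not specify.

```python
def close_gaps(low, medium, high):
    """Widen the "medium" range so that no integer value falls between groups.

    Each argument is the observed (lower, upper) pair of integer factor values
    for one group. Only "medium" is changed, and it is only ever widened.
    """
    new_lower = min(medium[0], low[1] + 1)   # absorb values between low and medium
    new_upper = max(medium[1], high[0] - 1)  # absorb values between medium and high
    return low, (new_lower, new_upper), high
```

With the boundary from FIG. 16 ("low" ends at 12, "medium" starts at 14), the lower limit of "medium" becomes 13, as described in the text.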

In step 1508, the search accuracy analysis unit 516 calculates the search accuracy for the search ranks that correspond to the factor values in the factor group, in the same way as in step 1503. In step 1509, the search accuracy analysis unit 516 subtracts the accuracy mean calculated in step 1503 from the search accuracy of the factor group calculated in step 1508 to obtain the deviation (difference) between the two. An example of the result up to this point is shown in table 1610 at the right of FIG. 16. In that table, the factor group "low" contains 5 teacher input documents, 2 of which have a search rank within 100th, so the search accuracy of the factor group "low" is 40% (2/5). There are 20 teacher input documents in total, and the accuracy mean (overall search accuracy) 1616 is 60% (12/20). The deviation 1617 of the factor group "low" from the accuracy mean is therefore -20% (= 40% - 60%). Likewise, the deviations 1617 of the factor groups "medium" and "high" are 0% and +20%, respectively.
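The arithmetic of steps 1503, 1508, and 1509 can be checked with a short sketch. The function names and the dictionary layout are assumptions; any rank lists passed in are illustrative inputs rather than data from FIG. 16.

```python
def accuracy(ranks, threshold=100):
    """Proportion of ranks that are within the threshold R."""
    return sum(1 for rank in ranks if rank <= threshold) / len(ranks)


def group_deviations(ranks_by_group, threshold=100):
    """Return {group: (search accuracy, deviation from the accuracy mean)}.

    ranks_by_group: {group name: [search rank of the correct document, ...]}.
    """
    all_ranks = [rank for ranks in ranks_by_group.values() for rank in ranks]
    mean = accuracy(all_ranks, threshold)                 # step 1503
    result = {}
    for name, ranks in ranks_by_group.items():
        group_accuracy = accuracy(ranks, threshold)       # step 1508
        result[name] = (group_accuracy, group_accuracy - mean)  # step 1509
    return result
```

If "low" holds 5 documents with 2 inside the threshold, "medium" 10 with 6 inside, and "high" 5 with 4 inside, the overall mean is 12/20 and the deviations come out at about -0.2, 0, and +0.2, matching the figures quoted above up to floating-point rounding.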

In step 1510, the search accuracy analysis unit 516 stores the upper limit, lower limit, search accuracy, and deviation calculated for the factor group in the corresponding factor group areas of the search accuracy table 517, and then returns to step 1506.

Returning to FIG. 5, for the new input document number 518 specified by the user, the accuracy impact calculation unit 519 matches the factor table 514, obtained through the same processing as for the teacher input documents, against the search accuracy table 517. The factor table 514 is obtained through (1) collection of the feature words and their weights by the feature word collection unit 509, (2) acquisition of the similar document search results by the similar document search unit 511, and (3) calculation of the factor values by the factor data extraction unit 513. Through this matching, the accuracy impact calculation unit 519 identifies, for each factor, the factor group into which the factor value of the new input document falls, then identifies the corresponding impact on search accuracy (the deviation from the accuracy mean) and stores it in the accuracy impact table 520.

FIG. 17 shows a configuration example of the accuracy impact table 520. The accuracy impact table 520 consists of a factor ID 1701, a factor category 1702, a factor name 1703, a factor value 1704, the applicable factor group 1705, the search accuracy 1706 of the applicable factor group, and the deviation 1707 of the search accuracy 1706 from the accuracy mean.

FIG. 18 shows an example of the processing method executed by the accuracy impact calculation unit 519. In step 1801, the accuracy impact calculation unit 519 determines whether any factor remains unprocessed, and ends the processing if none remains. If one remains, then in step 1802 the accuracy impact calculation unit 519 extracts from the factor table 514 the factor ID and factor value of that factor for the new input document. In step 1803, it compares the extracted factor value with the upper and lower limits of the factor groups of that factor in the search accuracy table 517 and identifies the factor group to which the factor value belongs. In step 1804, it reads the factor ID 1401, factor category 1402, factor name 1403, factor group name 1405, search accuracy 1408, and deviation from the accuracy mean 1409 of the identified factor group and stores them, together with the factor value, in the factor ID 1701, factor category 1702, factor name 1703, factor value 1704, applicable factor group 1705, search accuracy 1706, and deviation from the accuracy mean 1707 of the accuracy impact table 520, respectively.
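Step 1803 is a range lookup against the rows of the search accuracy table for one factor. The sketch below assumes that each row is a dictionary with hypothetical keys `group`, `lower`, `upper`, `accuracy`, and `deviation`, and that both limits are inclusive; returning `None` for a value outside every range is also an assumption, since the text does not say how such a value is handled.

```python
def find_factor_group(factor_value, rows):
    """Return the search accuracy table row whose range contains factor_value.

    rows: rows of one factor, each a dict with "lower" and "upper" limits
          (inclusive, hypothetical layout) plus the stored accuracy data.
    """
    for row in rows:
        if row["lower"] <= factor_value <= row["upper"]:
            return row
    return None  # assumption: no group matches a value outside every range
```

The returned row already carries the search accuracy and the deviation from the accuracy mean, so step 1804 reduces to copying its fields into the accuracy impact table together with the factor value.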

Returning to FIG. 5, the search result output unit 521 generates the output screens shown in FIGS. 4A and 4B from the feature word matching table 515 and the accuracy impact table 520 and presents them to the user through the output device 540. Table 400 of FIG. 4A can be generated directly from the accuracy impact table 520, and table 470 of FIG. 4B can be generated directly from the feature word matching table 515.

As described later, the countermeasure table 522 stores countermeasure information for the factors that lower similar document search accuracy (factors whose deviation from the accuracy mean is negative). The information describes what the user should do next to improve search accuracy with respect to each such factor, and it is stored in association with the factor so that it can be presented to the user.

As described above, by using the functional block configuration shown in FIG. 5, the similar document search support apparatus according to this embodiment can present to the user, as the basis for a similar document search result, the factors that affect search accuracy and the degree of their effect (the deviation from the accuracy mean).

FIG. 19 shows an example of the hardware configuration of the similar document search support apparatus according to this embodiment. The apparatus broadly consists of a processing device 1950 that executes computation, an input device 1930 with which the user enters operations or data, an output device 1940 that outputs the computation results to the user, and a storage device 1960 that stores the programs and data used in the processing by the processing device 1950.

The input device 1930 consists of a keyboard 1951 and a mouse 1952. The output device 1940 consists of an output monitor 1953. When input or output data is exchanged with another computer, it is sent and received over the network 1954.

The storage device 1960 consists of three kinds of areas. The first is a working area 1961 that temporarily holds data being processed by the processing device 1950. The second comprises the areas that store data: a document database storage area 1962, a word dictionary storage area 1963, a search index storage area 1964, a bibliographic table storage area 1965, a teacher document table storage area 1966, a search result table storage area 1967, a feature word table storage area 1968, a factor table storage area 1969, a feature word matching table storage area 1970, a search accuracy table storage area 1971, an accuracy impact table storage area 1972, and a countermeasure table storage area 1973. The third comprises the areas that store programs: a feature word extraction unit storage area 1974, a search index generation unit storage area 1975, a bibliographic extraction unit storage area 1976, a feature word collection unit storage area 1977, a similar document search unit storage area 1978, a factor data extraction unit storage area 1979, a search accuracy analysis unit storage area 1980, an accuracy impact calculation unit storage area 1981, and a search result output unit storage area 1982.

The processing device 1950 executes the prescribed processing by repeatedly loading the necessary programs and data from the storage device 1960, executing them, and storing the results in the storage device 1960.

Next, modifications of the embodiment described above are explained.

(Modification 1)
In the embodiment described above, when the search accuracy analysis unit 516 calculates the search accuracy for each factor from the teacher input documents, it groups the factor values into several factor groups and calculates the search accuracy of each factor group. The accuracy impact calculation unit 519 then matches the factor value obtained from the new input document against the factor groups and identifies the search accuracy of the applicable factor group.

In this modification, by contrast, the search accuracy is not found by identifying a factor group. Instead, the teacher input documents whose factor value equals or is close to the factor value obtained from the new input document are identified, and the search accuracy is calculated from those teacher input documents.

For example, in FIG. 16, if the factor value obtained from the new input document is "18", the embodiment described above treats it as belonging to the factor group "medium", which gives a search accuracy of 60% and a deviation of 0%. In this modification, the teacher input documents whose factor value is "18" or close to it are identified instead. Extracting 6 teacher input documents (30% of the total) whose factor values lie around "18" yields the 6 documents with factor values from "17" to "19" (#12 to #17 in table 1610 at the center of FIG. 16). The search accuracy for these 6 documents is 67% (4/6), and the deviation is +7% (67% - 60%).
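Modification 1 can be read as a nearest-neighbor estimate over the teacher input documents. The sketch below is one possible interpretation: it selects neighbors by absolute distance from the new factor value, which is an assumption (the text only says the values lie around the new value), and the function name, record layout, and tie-breaking by original order are likewise hypothetical.

```python
def neighborhood_accuracy(records, factor_value, fraction=0.3, threshold=100):
    """Search accuracy over the teacher documents nearest to factor_value.

    records: list of (factor_value, search_rank) tuples (hypothetical layout).
    fraction: share of all teacher documents to use (30% in the example).
    """
    k = max(1, round(len(records) * fraction))
    # sorted() is stable, so equally distant records keep their original order.
    nearest = sorted(records, key=lambda rec: abs(rec[0] - factor_value))[:k]
    return sum(1 for _, rank in nearest if rank <= threshold) / k
```

Subtracting the overall accuracy mean from the returned value gives the deviation, in the same way as for a factor group.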

This modification can be realized by having the accuracy impact calculation unit 519 perform two processes: extracting from the factor data stored in the factor table 514 a fixed number of teacher input documents whose factor value equals or is close to that of the new input document, as described above, and calculating the search accuracy from the search ranks of the extracted teacher input documents.

(Modification 2)
In the embodiment described above, the impact on search accuracy is calculated as a deviation for each of the eight factors, and the analysis assumes that the impact of each factor is independent of the others.

In this modification, by contrast, two or more factors are combined, and "integrated factor groups" are formed by combining the factor groups of each factor. That is, the search accuracy is calculated for each integrated factor group over the teacher input documents, and the applicable integrated factor group is identified from the combination of factor values obtained from the new input document. The corresponding search accuracy and deviation from the accuracy mean are then identified and presented to the user. Which factors are combined may be fixed in advance or selected by the user.

For example, suppose the factor "total number of hits" is combined with the factor "similarity attenuation rate". If each consists of three factor groups, nine (= 3 × 3) integrated factor groups are generated. In step 1504 of the processing method shown in FIG. 15, the search accuracy analysis unit 516 sorts the records by the first factor value and divides them into three groups, and then sorts each of those groups by the second factor value and divides it into three groups. Repeating this process generates the integrated factor groups. The subsequent processing is the same as in the embodiment.
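The nested sort-and-split of Modification 2 can be sketched for two factors as follows. Splitting each level into groups of nearly equal size is an assumption (the embodiment also allows unequal sizes), and the function names and record layout are hypothetical.

```python
def split_evenly(items, n):
    """Cut a list into n consecutive parts whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def integrated_groups(records, n=3):
    """Return {(i, j): records} for the n x n integrated factor groups.

    records: list of (doc_id, first_factor, second_factor) tuples (hypothetical).
    i indexes the group of the first factor, j the group of the second factor.
    """
    result = {}
    by_first = sorted(records, key=lambda rec: rec[1])
    for i, part in enumerate(split_evenly(by_first, n)):
        by_second = sorted(part, key=lambda rec: rec[2])
        for j, cell in enumerate(split_evenly(by_second, n)):
            result[(i, j)] = cell
    return result
```

Note that with this scheme the boundaries of the second factor differ from one first-factor group to the next, because each first-factor group is sorted and split on its own.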

(Extension 1)
Next, an extension of the embodiment described above is explained. In the embodiment, the impact on search accuracy is presented to the user for each factor in the display format shown in FIGS. 4A and 4B. From what is presented, the user can see which factors are raising or lowering the search accuracy.

However, not every user will know what countermeasure to take: what specifically to do to obtain better search results, and how to operate the system to do it. A user who does not know the countermeasure has to stop searching at that point, so the search work cannot proceed quickly and smoothly.

In this extension, therefore, for each factor that lowers similar document search accuracy (a factor whose deviation from the accuracy mean is negative), countermeasure information on what to do next to improve search accuracy with respect to that factor is presented to the user in association with the factor. Specifically, as in the functional block configuration shown in FIG. 5, a countermeasure table 522 storing the countermeasure information is provided, and in response to a request from the user the apparatus presents the "countermeasure content", which describes what to do next, and the "operation method", which describes how to do it.

FIGS. 20A and 20B show a configuration example of the detailed display screen for similar document search results in this extension. In FIGS. 20A and 20B, parts that correspond to FIGS. 4A and 4B carry the same reference numerals. FIG. 20A shows table 400, displayed in the upper part of the screen, and FIG. 20B shows table 470, displayed in the lower part. Table 400 of FIG. 20A differs from table 400 of FIG. 4A in that an item showing the countermeasure method 2001 for each factor has been added.

For example, a factor whose influence degree 453 is negative with a large absolute value (such as the classification hit count 437 or the effective feature word count 431) is a factor that lowers search accuracy. A user who wants to know how to improve search accuracy from the viewpoint of such a factor presses the "Countermeasure" link 2002 in the countermeasure method 2001 for that factor. Then, as in the example of FIG. 21, the countermeasure content 2103 and the operation method 2104 are displayed in association with the factor 2101 and the factor Group 2102. Further, when the "pre-edit screen" link 2105 in the operation method 2104 is pressed, the pre-edit screen, which is the screen for carrying out this countermeasure, is displayed as shown in FIG. 2. By following this navigation, the user can correct the search conditions appropriately without being at a loss about how to operate the system.

FIG. 22 shows an example of the configuration of the countermeasure table 522. The countermeasure table 522 consists of a factor ID 2201, a factor name 2202, a factor Group ID 2203, a factor Group name 2204, a countermeasure content 2205 describing what to do next, an operation method 2206 describing how to carry out the countermeasure content, and a transition destination screen 2207 to move to for the operation. The factor of the countermeasure selected in the table 400 of FIG. 20A can be matched to the data in the countermeasure table 522 using the factor name and the factor Group as keys, so retrieving the matching data from the countermeasure table 522 and displaying it in the form shown in FIG. 21 is straightforward.
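The keyed lookup described for the countermeasure table 522 can be pictured with a minimal Python sketch. The class name `CountermeasureRow`, the helper functions, and every row value below are hypothetical placeholders chosen for illustration; the specification only fixes the seven columns 2201 to 2207 and the use of factor name and factor Group as the key.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class CountermeasureRow:
    factor_id: str       # factor ID 2201
    factor_name: str     # factor name 2202
    group_id: str        # factor Group ID 2203
    group_name: str      # factor Group name 2204
    content: str         # countermeasure content 2205 (what to do next)
    operation: str       # operation method 2206 (how to do it)
    target_screen: str   # transition destination screen 2207


Key = Tuple[str, str]


def build_index(rows: Iterable[CountermeasureRow]) -> Dict[Key, CountermeasureRow]:
    """Index the table by (factor name, factor Group name)."""
    return {(r.factor_name, r.group_name): r for r in rows}


def find_countermeasure(index: Dict[Key, CountermeasureRow],
                        factor_name: str,
                        group_name: str) -> Optional[CountermeasureRow]:
    """Return the row for the selected factor, or None if none is defined."""
    return index.get((factor_name, group_name))


# Made-up sample rows; the real table contents are not given in the text.
rows = [
    CountermeasureRow("F01", "effective feature word count", "G1", "low",
                      "Add feature words",
                      "Add feature words on the feature word addition screen",
                      "search condition editing screen"),
    CountermeasureRow("F07", "classification hit count", "G1", "low",
                      "Review the classification condition",
                      "Edit the classification on the bibliographic condition screen",
                      "search condition editing screen"),
]
index = build_index(rows)
row = find_countermeasure(index, "effective feature word count", "low")
```

A dictionary keyed by the pair keeps the lookup to a single step, which matches the remark that retrieving the matching row is straightforward.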

Regarding the transition destination screen 2207, FIG. 21 lets the user jump to the destination screen through an anchor in the text. Alternatively, a separate "Screen transition" button may be displayed so that pressing it jumps to the transition destination screen defined in the countermeasure table 522.

FIG. 2 shows a configuration example of a search condition editing screen 200 for editing the search conditions (adding and deleting feature words, correcting weights, expanding synonyms, narrowing by bibliographic data, and so on) in order to improve similar document search accuracy.

The search condition editing screen 200 consists of a feature word editing sub-screen 201 for deleting feature words and correcting weights, a feature word addition sub-screen 202 for adding feature words, a synonym expansion sub-screen 203 for expanding synonyms, and a bibliographic condition editing sub-screen 204 for narrowing or broadening the search results based on bibliographic data such as classification, applicant, and filing date.

The feature word editing sub-screen 201 displays data on the feature words used in the search. When the selection check box 211 is selected (marked with an ×), the feature word is used in the search; when it is deselected (no ×), the feature word is no longer used. The value of the weight 212 can also be changed to any value on this sub-screen.
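The effect of the check box 211 and the weight 212 on the search condition can be sketched as follows. This is only an illustration under assumed names: `FeatureWord`, `active_query`, and the sample terms and weights are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FeatureWord:
    term: str
    weight: float           # corresponds to weight 212
    selected: bool = True   # corresponds to selection check box 211


def active_query(words: List[FeatureWord]) -> Dict[str, float]:
    """Return {term: weight} for the feature words that remain selected."""
    return {w.term: w.weight for w in words if w.selected}


# Hypothetical feature words of an input document.
words = [
    FeatureWord("alarm", 2.0),
    FeatureWord("sensor", 1.5),
    FeatureWord("device", 0.3),
]
words[2].selected = False   # deselect "device": it is no longer used
words[0].weight = 3.0       # change the weight of "alarm"
query = active_query(words)
```

Only the selected words with their current weights reach the re-executed search; deselected words stay in the list so that they can be selected again later.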

The feature word addition sub-screen 202 displays feature words that are contained in the input document but were not used in the search. Feature words contained in the search result documents can also be displayed. Here too, selecting the selection check box 221 adds a feature word to those used in the search, and the weight 222 of the added feature word can be changed to any value.

The synonym expansion sub-screen 203 displays synonym data for the feature words used in the search. The synonym data may be stored in the word dictionary 503 or stored separately as a synonym dictionary. When a feature word (here, "report") is selected from the feature word list 231, the table 232 on the right displays synonym candidates together with their confidence. Selecting the check box of a word that is appropriate as a synonym adds that word as a feature word.
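Synonym expansion of this kind might be sketched as below. The specification does not state what weight an added synonym receives, so scaling the original weight by the candidate's confidence is purely an assumption of this sketch; the function name, the dictionary layout, and the sample words and confidences are likewise hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

SynonymDict = Dict[str, List[Tuple[str, float]]]  # term -> [(synonym, confidence)]


def expand_synonyms(query: Dict[str, float],
                    synonym_dict: SynonymDict,
                    term: str,
                    chosen: Iterable[str]) -> Dict[str, float]:
    """Add the chosen synonyms of `term` to a copy of the query.

    Assumption: an added synonym gets weight = weight(term) * confidence.
    A synonym that is already in the query keeps its existing weight.
    """
    candidates = dict(synonym_dict.get(term, []))
    expanded = dict(query)
    for word in chosen:
        confidence = candidates[word]  # KeyError if the word is not a candidate
        expanded.setdefault(word, query[term] * confidence)
    return expanded


# Made-up synonym data for the feature word "report".
synonym_dict = {"report": [("notification", 0.5), ("alert", 0.25)]}
query = {"report": 2.0, "sensor": 1.5}
expanded = expand_synonyms(query, synonym_dict, "report", ["notification"])
```

The original query is left untouched, so the user can compare the results before and after the expansion.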

The bibliographic condition editing sub-screen 204 narrows the results by bibliographic data. When a bibliographic item (here, "Classification (IPC)") is selected from the bibliographic item list 241, the table 242 on the right displays the distribution of that item's values among the top-ranked search result documents as document counts. Selecting bibliographic values with the selection check boxes narrows the search results.
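The value distribution of table 242 and the subsequent narrowing can be illustrated with a short sketch. The record layout (a dictionary with an `ipc` list per document), the function names, and the sample documents are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, Iterable, List

Doc = Dict[str, object]


def value_distribution(results: List[Doc], item: str) -> Counter:
    """Count how many top-ranked documents carry each value of a bibliographic item."""
    return Counter(v for doc in results for v in doc.get(item, []))


def narrow(results: List[Doc], item: str, selected_values: Iterable[str]) -> List[Doc]:
    """Keep only the documents that carry at least one selected value."""
    selected = set(selected_values)
    return [doc for doc in results if selected & set(doc.get(item, []))]


# Hypothetical top-ranked search result documents with IPC classifications.
results = [
    {"id": "D1", "ipc": ["G06F", "G08B"]},
    {"id": "D2", "ipc": ["G08B"]},
    {"id": "D3", "ipc": ["H04M"]},
]
dist = value_distribution(results, "ipc")
narrowed = narrow(results, "ipc", ["G08B"])
```

Because the order of `results` is preserved, the narrowed list keeps the original similarity ranking.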

On the search condition editing screen 200, the user corrects the search conditions according to the suggestion made on the screen shown in FIG. 21 and re-executes the similar document search. For example, FIG. 21 suggests adding feature words and displays "Add feature words on the feature word addition screen" as the operation method 2206. The user therefore finds and adds appropriate feature words on the feature word addition sub-screen 202 and presses the search button 250 to re-execute the search. Although FIG. 2 shows the sub-screens combined into a single screen, only the necessary sub-screens may be presented to the user.

With the embodiment, modifications, and extended examples described above, the user can understand, as the basis for the search results, which feature words in the input document contributed to the similar document search results and by how much, how well the similar document search worked, what the cause is when it did not work well, and what to do next, and how, to obtain better results. Because the user can move smoothly to the next action, the cycle of the search work process runs efficiently and high-quality search results can be obtained.

501: document database, 502: feature word extraction unit, 503: word dictionary, 504: search index generation unit, 505: search index, 506: bibliographic data extraction unit, 507: bibliographic table, 508: teacher document table, 509: feature word collection unit, 510: feature word table, 511: similar document search unit, 512: search result table, 513: factor data extraction unit, 514: factor table, 515: feature word matching table, 516: search accuracy analysis unit, 517: search accuracy table, 518: new input document number, 519: accuracy influence degree calculation unit, 520: accuracy influence degree table, 521: search result output unit, 522: countermeasure table, 530: input device, 540: output device.

Claims

A similar document search support program that is a similar document search program causing a computer to execute
a feature word extraction processing step of analyzing search target documents stored in a document database, extracting feature words and weights indicating their importance, and storing them in a search index,
a similar document search processing step of extracting weighted feature words from an input document designated through an operation input to an input device, matching them against the weighted feature words stored in the search index, calculating a similarity between the input document and each search target document, and determining a search result document set in descending order of similarity, and
a search result output processing step of notifying a user of the search result document set,
the program comprising:
a feature word collection processing step of obtaining weighted feature words for each teacher input document constituting a teacher document table, the teacher document table holding a plurality of pairs each consisting of a teacher input document whose correct answer document is known and the correct answer document corresponding to that teacher input document, either by extracting them from the text of the teacher input document through the feature word extraction processing step or by collecting them from the search index, and storing them in a feature word table;
a factor data extraction processing step of identifying, based on the search result document set determined by the similar document search processing step for each teacher input document, the search rank of the correct answer document corresponding to each teacher input document, extracting a factor value of each teacher input document for each factor predefined as a factor affecting similar document search accuracy by referring to one or more of the feature word table, the search result document set, bibliographic information, and the search index for that teacher input document, and storing the factor values in a factor table;
a search accuracy analysis processing step of dividing the set of teacher input documents in the teacher document table into factor groups based on the distribution of the factor values stored in the factor table for one factor, or on a combination of the distributions of the factor values for a plurality of factors, calculating a search accuracy for each factor group from the search ranks of the correct answer documents for the teacher input documents belonging to that factor group, calculating as a deviation value the difference between the calculated search accuracy and the average search accuracy calculated over all teacher input documents, and storing the factor group, the range of factor values corresponding to the factor group, the search accuracy, and the deviation value in a search accuracy table; and
an influence degree calculation processing step of matching the factor values obtained for a new input document whose correct answer document is unknown against the value range of each factor group stored in the search accuracy table, extracting the search accuracy and the deviation value corresponding to the factor group whose value range is satisfied, and storing them in an influence degree table together with the factor values of the new input document,
wherein, in the search result output processing step, the factor values and the search accuracy and/or the deviation value for the new input document stored in the influence degree table are presented to the user.
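For illustration only, the search accuracy analysis and the influence degree lookup recited above can be sketched in Python for a single factor. The sketch assumes that search accuracy is the share of teacher input documents whose correct answer document ranks within `top_k`, and that factor groups are half-open value ranges supplied by the caller; the function names, the bin boundaries, and the teacher data are hypothetical and do not limit the claim.

```python
from typing import Dict, List, Optional, Tuple

TeacherRow = Tuple[float, int]   # (factor value, search rank of the correct answer document)
Range = Tuple[float, float]      # half-open value range [low, high)


def hit_rate(ranks: List[int], top_k: int) -> float:
    """Share of documents whose correct answer ranks within top_k."""
    return sum(1 for r in ranks if r <= top_k) / len(ranks)


def build_accuracy_table(teacher: List[TeacherRow],
                         bins: List[Range],
                         top_k: int) -> List[Dict[str, object]]:
    """Split teacher documents into factor groups and store accuracy and deviation."""
    overall = hit_rate([rank for _, rank in teacher], top_k)
    table = []
    for low, high in bins:
        ranks = [rank for value, rank in teacher if low <= value < high]
        if not ranks:
            continue  # no teacher document falls into this factor group
        acc = hit_rate(ranks, top_k)
        table.append({"range": (low, high), "accuracy": acc,
                      "deviation": acc - overall})
    return table


def influence(table: List[Dict[str, object]],
              factor_value: float) -> Optional[Dict[str, object]]:
    """Return the factor group row whose value range contains the new factor value."""
    for row in table:
        low, high = row["range"]
        if low <= factor_value < high:
            return row
    return None


# Made-up teacher data: (factor value, rank of the correct answer document).
teacher = [(1, 50), (2, 30), (3, 5), (4, 20), (6, 2), (7, 1), (8, 40), (9, 3)]
table = build_accuracy_table(teacher, bins=[(0, 5), (5, 10)], top_k=10)
new_doc = influence(table, factor_value=3)
```

In this made-up data the overall accuracy is 0.5, so a new input document with factor value 3 falls into a group with accuracy 0.25 and deviation -0.25, which would flag the factor as one that lowers search accuracy.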

A similar document search support program that is a similar document search program causing a computer to execute
a feature word extraction processing step of analyzing search target documents stored in a document database, extracting feature words and weights indicating their importance, and storing them in a search index,
a similar document search processing step of extracting weighted feature words from an input document designated through an operation input to an input device, matching them against the weighted feature words stored in the search index, calculating a similarity between the input document and each search target document, and determining a search result document set in descending order of similarity, and
a search result output processing step of notifying a user of the search result document set,
the program comprising:
a feature word collection processing step of obtaining weighted feature words for each teacher input document constituting a teacher document table, the teacher document table holding a plurality of pairs each consisting of a teacher input document whose correct answer document is known and the correct answer document corresponding to that teacher input document, either by extracting them from the text of the teacher input document through the feature word extraction processing step or by collecting them from the search index, and storing them in a feature word table;
a factor data extraction processing step of identifying, based on the search result document set determined by the similar document search processing step for each teacher input document, the search rank of the correct answer document corresponding to each teacher input document, extracting a factor value of each teacher input document for each factor predefined as a factor affecting similar document search accuracy by referring to one or more of the feature word table, the search result document set, bibliographic information, and the search index for that teacher input document, and storing the factor values in a factor table; and
an influence degree calculation processing step of identifying, with respect to the factor values obtained for a new input document whose correct answer document is unknown, a document group consisting of the teacher input documents that satisfy the factor value of the new input document for one factor or a neighborhood value thereof, or of the teacher input documents that satisfy all of the factor values of the new input document for a plurality of factors or neighborhood values thereof, calculating a search accuracy for the document group from the search ranks of the correct answer documents for the teacher input documents belonging to the document group, calculating as a deviation value the difference between the calculated search accuracy and the average search accuracy calculated over all teacher input documents, and storing the factor values, the search accuracy, and the deviation value in an influence degree table,
wherein, in the search result output processing step, the factor values and the search accuracy and/or the deviation value for the new input document stored in the influence degree table are presented to the user.
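The neighborhood-based variant recited above can be sketched in the same hedged way. Here "neighborhood" is assumed to mean an absolute difference within a per-factor tolerance, and search accuracy is again assumed to be the share of correct answer documents ranked within `top_k`; the factor names `hits` and `cls`, the tolerances, and the teacher data are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

TeacherRow = Tuple[Dict[str, float], int]  # (factor values, rank of the correct answer document)


def hit_rate(ranks: List[int], top_k: int) -> float:
    """Share of documents whose correct answer ranks within top_k."""
    return sum(1 for r in ranks if r <= top_k) / len(ranks)


def neighborhood_influence(teacher: List[TeacherRow],
                           new_values: Dict[str, float],
                           tolerance: Dict[str, float],
                           top_k: int) -> Optional[Dict[str, float]]:
    """Accuracy and deviation for teacher documents near the new document on every factor."""
    def is_near(values: Dict[str, float]) -> bool:
        return all(abs(values[f] - v) <= tolerance[f] for f, v in new_values.items())

    group = [rank for values, rank in teacher if is_near(values)]
    if not group:
        return None  # no teacher document lies in the neighborhood
    overall = hit_rate([rank for _, rank in teacher], top_k)
    acc = hit_rate(group, top_k)
    return {"accuracy": acc, "deviation": acc - overall, "group_size": len(group)}


# Made-up teacher data with two factors.
teacher = [
    ({"hits": 10, "cls": 3}, 2),
    ({"hits": 11, "cls": 4}, 15),
    ({"hits": 30, "cls": 3}, 1),
    ({"hits": 20, "cls": 2}, 4),
]
result = neighborhood_influence(teacher,
                                new_values={"hits": 10, "cls": 3},
                                tolerance={"hits": 2, "cls": 1},
                                top_k=10)
```

Unlike precomputed factor groups, this approach builds the comparison group around each new input document at query time, at the cost of scanning the teacher data on every request.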

The similar document search support program according to claim 1 or 2,
wherein the factors affecting the similar document search accuracy include at least one of the following (1) to (12):
(1) the total number of hits of the feature words of the input document in each of the top-ranked search result documents, which consist of a predetermined number of documents, or the ratio thereof;
(2) among the total hits of (1), the number of hits for which the weight of the feature word of the input document in the search result document is equal to or greater than a predetermined threshold, or the ratio thereof;
(3) among the total hits of (1), the partial similarity attributable to the feature words of the input document, or its share of the similarity of the search result document;
(4) the value obtained by dividing the number or ratio of (2) by the number or ratio of (1);
(5) the value obtained by dividing the number or ratio of (3) by the number or ratio of (1);
(6) the number, or the ratio, of feature words of the input document whose hit count in the top-ranked search result documents is equal to or greater than a predetermined threshold;
(7) the number, or the ratio, of feature words of the input document whose hit count in the top-ranked search result documents is equal to or less than a predetermined threshold;
(8) the rate at which the similarity of the top-ranked search result documents decays as the search rank goes down;
(9) the number, or the ratio, of the top-ranked search result documents that are assigned a classification assigned to the input document;
(10) the number, or the ratio, of all search target documents that are assigned a classification assigned to the input document;
(11) the number, or the ratio, of the top-ranked search result documents whose author is shared with the input document;
(12) the number, or the ratio, of the top-ranked search result documents whose publication date differs from that of the input document by no more than a predetermined threshold.

The similar document search support program according to claim 1 or 2,
wherein the search accuracy is the proportion of the teacher input documents whose correct answer document is ranked within a predetermined rank by the similar document search processing step.

The similar document search support program according to claim 1,
wherein the factor values in the factor table for the teacher input documents used in the search accuracy analysis processing step consist only of factor values for teacher input documents that satisfy a predetermined condition.

The similar document search support program according to claim 1 or 2,
wherein, in the search result output processing step, when the factor values and the search accuracy and/or the deviation value for the new input document stored in the influence degree table are presented to the user, a correspondence table is displayed alongside them, the correspondence table having the feature words of the new input document and the top-ranked search result documents for the new input document as its two axes, and having as its values either the weight Wij of feature word j of the new input document in top-ranked search result document i or the partial similarity Sij of feature word j of the new input document in top-ranked search result document i.

The similar document search support program according to claim 1 or 2,
wherein a countermeasure table is provided that stores, for each factor group from the viewpoint of each of the factors, as countermeasure information for the user to obtain better similar document search results, a countermeasure content describing what the user should do, an operation method describing how to carry out the countermeasure content, and screen information on the screen to move to in order to perform the operation method, and
in the search result output processing step, when the factor values and the search accuracy and/or the deviation value stored in the influence degree table are presented to the user, at least one of the countermeasure content, the operation method, and the screen information described in the countermeasure table is displayed in association with the factor group.

A similar document search support apparatus that is a similar document search apparatus having
an input device that accepts operation input and data input from a user,
a document database storing search target documents,
a feature word extraction unit that analyzes the search target documents stored in the document database and extracts feature words and weights indicating their importance,
a search index that stores the extracted weighted feature words,
a similar document search unit that extracts weighted feature words from an input document designated through an operation input to the input device, matches them against the weighted feature words stored in the search index, calculates a similarity between the input document and each search target document, and determines a search result document set in descending order of similarity, and
an output device that notifies the user of the search result document set,
the apparatus comprising:
a teacher document table holding a plurality of pairs each consisting of a teacher input document whose correct answer document is known and the correct answer document corresponding to that teacher input document;
a feature word collection unit that obtains weighted feature words for each teacher input document, either by having the feature word extraction unit extract them from the text of the teacher input document or by collecting them from the search index, and stores them in a feature word table;
a factor data extraction unit that identifies, based on the search result document set determined by the similar document search unit for each teacher input document, the search rank of the correct answer document corresponding to each teacher input document, extracts a factor value of each teacher input document for each factor predefined as a factor affecting similar document search accuracy by referring to one or more of the feature word table, the search result document set, bibliographic information, and the search index for that teacher input document, and stores the factor values in a factor table;
a search accuracy analysis unit that divides the set of teacher input documents in the teacher document table into factor groups based on the distribution of the factor values stored in the factor table for one factor, or on a combination of the distributions of the factor values for a plurality of factors, calculates a search accuracy for each factor group from the search ranks of the correct answer documents for the teacher input documents belonging to that factor group, calculates as a deviation value the difference between the calculated search accuracy and the average search accuracy calculated over all teacher input documents, and stores the factor group, the range of factor values corresponding to the factor group, the search accuracy, and the deviation value in a search accuracy table; and
an influence degree calculation unit that matches the factor values obtained for a new input document whose correct answer document is unknown against the value range of each factor group stored in the search accuracy table, extracts the search accuracy and the deviation value corresponding to the factor group whose value range is satisfied, and stores them in an influence degree table together with the factor values of the new input document,
wherein the factor values and the search accuracy and/or the deviation value for the new input document stored in the influence degree table are presented to the user through the output device.

利用者からの操作入力やデータ入力を受け付ける入力装置と、
検索対象文書を格納した文書データベースと、
前記文書データベースに格納された検索対象文書を解析して特徴語及びその重要度を示す重みを抽出する特徴語抽出部と、
前記抽出された重み付き特徴語を格納する検索インデクスと、
前記入力装置に対する操作入力を通じて指定のあった入力文書から対応する重み付き特徴語を抽出して、前記検索インデクスに格納された検索対象文書の重み付き特徴語と照合し、前記入力文書と前記検索対象文書との間の類似度を算出し、類似度の高い検索対象文書から順に検索結果文書集合として認定する類似文書検索部と、
前記検索結果文書集合を利用者に報知する出力装置と
を有する類似文書検索装置において、
正解文書が既知である教師入力文書、及び、前記教師入力文書に対応する前記正解文書の対を複数有する教師文書テーブルと、
前記教師文書テーブルを構成する教師入力文書の各々に対する重み付き特徴語を、前記特徴語抽出処理ステップによって教師入力文書内のテキストから抽出して、又は、前記検索インデクスから収集して、特徴語テーブルに格納する特徴語収集部と、
前記教師入力文書の各々について前記類似文書検索部により決定される検索結果文書集合に基づいて、各教師入力文書に対応する前記正解文書の検索順位を特定すると共に、前記各教師入力文書に対する前記特徴語テーブル、前記検索結果文書集合、書誌情報及び前記検索インデクスのうちの一つ以上を参照することにより、類似文書検索精度に影響を及ぼす要因として予め定義された要因の各々に対する前記各教師入力文書の要因値を抽出し、要因テーブルに格納する要因データ抽出部と、
前記正解文書が未知である新規入力文書に対して得られる前記要因値に対して、一つの要因にかかる新規入力文書に対する要因値、又は、その近傍値を満たす前記教師入力文書、又は、複数の要因にかかる新規入力文書に対する要因値、又は、その近傍値をすべて満たす前記教師入力文書から構成される文書群を特定し、前記文書群に属する前記教師入力文書に対する前記正解文書の検索順位から当該文書群に対する検索精度を算出し、前記教師入力文書全体に対して算出される検索精度平均値に対する、前記算出された検索精度の差を乖離値として算出し、前記要因値、前記検索精度及び前記乖離値を影響度テーブルに格納する影響度算出部とを有し、
前記出力装置を通じ、前記影響度テーブルに格納された新規入力文書に対する前記要因値並びに前記検索精度及び／又は前記乖離値を利用者に提示する
類似文書検索支援装置。 An input device for receiving operation input and data input from the user;
A document database storing search target documents;
A feature word extraction unit that analyzes a search target document stored in the document database and extracts a feature word and a weight indicating its importance;
A search index for storing the extracted weighted feature words;
A corresponding weighted feature word is extracted from an input document designated through an operation input to the input device, and is compared with a weighted feature word of a search target document stored in the search index, and the input document and the search A similar document search unit that calculates a similarity between the target documents and certifies a search result document set in order from a search target document having a high similarity;
In a similar document search device having an output device for notifying a user of the search result document set,
A teacher document table holding a plurality of pairs, each pair consisting of a teacher input document whose correct-answer document is known and the correct-answer document corresponding to that teacher input document;
A feature word collection unit that, for each teacher input document in the teacher document table, extracts weighted feature words from the text of the teacher input document by the feature word extraction processing step, or collects them from the search index, and stores them in the feature word table;
A factor data extraction unit that, based on the search result document set obtained by the similar document search unit for each teacher input document, identifies the search rank of the correct-answer document corresponding to that teacher input document, and that, by referring to one or more of the feature word table for that teacher input document, the search result document set, bibliographic information, and the search index, extracts the factor value of each teacher input document for each factor defined in advance as affecting similar document search accuracy, and stores the factor values in the factor table;
An influence calculation unit that, for the factor values obtained for a new input document whose correct-answer document is unknown, identifies a document group consisting either of the teacher input documents that match the new input document's factor value for one factor, or a neighborhood of that value, or of the teacher input documents that match all of the new input document's factor values for a plurality of factors, or neighborhoods of those values; calculates the search accuracy for the document group from the search results for the correct-answer documents of the teacher input documents belonging to the group; calculates, as a divergence value, the difference between that search accuracy and the average search accuracy calculated over all teacher input documents; and stores the factor value, the search accuracy, and the divergence value in the influence degree table;
The similar document search support device presenting to the user, through the output device, the factor values and the search accuracy and/or the divergence value for the new input document stored in the influence degree table.
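
The influence calculation described in this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `TeacherRecord`, `search_accuracy`, and `influence`, the use of a single absolute `tolerance` as the "neighborhood" of a factor value, and the top-N definition of accuracy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TeacherRecord:
    """One teacher input document (hypothetical layout of a factor table row)."""
    factor_values: dict   # factor name -> factor value
    correct_rank: int     # search rank of the known correct-answer document


def search_accuracy(records, top_n):
    """Share of records whose correct-answer document is ranked within top_n."""
    if not records:
        return None
    hits = sum(1 for r in records if r.correct_rank <= top_n)
    return hits / len(records)


def influence(records, new_factors, tolerance, top_n):
    """Return (group accuracy, divergence from the overall average).

    The document group holds the teacher records whose values for every
    factor in new_factors lie within `tolerance` of the new document's
    value (an assumed definition of "neighborhood").
    """
    group = [
        r for r in records
        if all(abs(r.factor_values[name] - value) <= tolerance
               for name, value in new_factors.items())
    ]
    group_acc = search_accuracy(group, top_n)
    overall_acc = search_accuracy(records, top_n)
    if group_acc is None or overall_acc is None:
        return None, None
    return group_acc, group_acc - overall_acc
```

Passing one entry in `new_factors` corresponds to the single-factor case in the claim; passing several entries corresponds to the case in which all factor values must match.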

The similar document search support device according to claim 8 or 9,
wherein the factors affecting similar document search accuracy include at least one of the following (1) to (12):
(1) The total number of hits of the feature words of the input document in the top-ranked search result documents, which consist of a pre-specified number of documents, or the ratio thereof
(2) Of the total hits in (1), the number of hits for which the weight of the input document's feature word in the search result document is at or above a pre-specified threshold, or the ratio thereof
(3) For the total hits in (1), the partial similarity attributable to the feature words of the input document, or the proportion of the search result document's similarity that it accounts for
(4) The number or ratio in (2) divided by the number or ratio in (1)
(5) The number or ratio in (3) divided by the number or ratio in (1)
(6) The number, or the ratio, of feature words of the input document whose individual hit count in the top-ranked search result documents is at or above a pre-specified threshold
(7) The number, or the ratio, of feature words of the input document whose individual hit count in the top-ranked search result documents is at or below a pre-specified threshold
(8) The rate at which the similarity of the top-ranked search result documents decays as the search rank falls
(9) The number, or the ratio, of top-ranked search result documents that have been assigned a classification assigned to the input document
(10) The number, or the ratio, of documents among all documents to be searched that have been assigned a classification assigned to the input document
(11) The number, or the ratio, of top-ranked search result documents that share an author with the input document
(12) The number, or the ratio, of top-ranked search result documents whose publication date differs from that of the input document by no more than a pre-specified threshold
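
Several of the count-based factors above can be computed directly from the top-ranked result documents. The following is a minimal sketch covering factors (1), (2), (4), (6), and (7) only. The function name `factor_values`, the representation of each result document as a word-to-weight mapping, and the use of separate high and low hit thresholds are assumptions made for the example.

```python
def factor_values(feature_words, top_results, weight_threshold,
                  high_hit_threshold, low_hit_threshold):
    """Compute factors (1), (2), (4), (6) and (7) for one input document.

    feature_words: feature words of the input document.
    top_results:   one dict per top-ranked result document, mapping a word
                   to its weight in that document (hypothetical layout).
    """
    # Weight of every (result document, feature word) pair where the word occurs.
    hit_weights = [doc[w] for doc in top_results
                   for w in feature_words if w in doc]

    total_hits = len(hit_weights)                                        # (1)
    heavy_hits = sum(1 for wt in hit_weights if wt >= weight_threshold)  # (2)
    heavy_share = heavy_hits / total_hits if total_hits else 0.0         # (4)

    # Number of top-ranked documents hit by each feature word.
    hits_per_word = [sum(1 for doc in top_results if w in doc)
                     for w in feature_words]
    frequent_words = sum(1 for c in hits_per_word if c >= high_hit_threshold)  # (6)
    rare_words = sum(1 for c in hits_per_word if c <= low_hit_threshold)       # (7)

    return {
        "total_hits": total_hits,
        "heavy_hits": heavy_hits,
        "heavy_share": heavy_share,
        "frequent_words": frequent_words,
        "rare_words": rare_words,
    }
```

The remaining factors need inputs that this sketch does not model, such as per-word partial similarities, classification codes, authors, and publication dates.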

The similar document search support device according to claim 8 or 9,
wherein the search accuracy is the proportion of the teacher input documents for which the similar document search unit ranks the correct-answer document corresponding to the teacher input document within a pre-specified rank.
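
The definition in this claim can be written as a formula. The notation is introduced here for illustration and does not appear in the original: T is the set of teacher input documents, r(t) is the search rank of the correct-answer document for teacher input document t, and N is the pre-specified rank.

```latex
\mathrm{Acc}_N(T) = \frac{\left|\{\, t \in T : r(t) \le N \,\}\right|}{|T|}
```

As a hypothetical example, if 150 of 200 teacher input documents have their correct-answer document ranked within N, then the search accuracy is 150 / 200 = 0.75.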

The similar document search support device according to claim 8,
wherein the factor values in the factor table for the teacher input documents used by the search accuracy analysis unit consist only of factor values for teacher input documents that satisfy a pre-specified condition.

The similar document search support device according to claim 8 or 9,
wherein, when the factor values and the search accuracy and/or the divergence value for the new input document stored in the influence degree table are presented to the user through the output device, a correspondence table is displayed alongside them, the correspondence table having the feature words of the new input document and the top-ranked search result documents for the new input document as its two axes, and having as its values either the weight Wij of feature word j of the new input document in top-ranked search result document i, or the partial similarity Sij contributed by feature word j of the new input document in top-ranked search result document i.
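
A correspondence table of this kind can be sketched as a plain-text grid. This is an illustrative sketch only: the function name `correspondence_table`, the `(document id, word-to-weight mapping)` input layout, the two-decimal formatting, and the choice of 0.00 for words that do not occur are assumptions, and the same code would apply to partial similarities Sij if those were supplied instead of weights.

```python
def correspondence_table(feature_words, top_results):
    """Render Wij as text: rows are result documents i, columns are feature words j.

    top_results: list of (document id, {word: weight}) pairs (hypothetical layout).
    """
    header = ["doc"] + list(feature_words)
    rows = [header]
    for doc_id, weights in top_results:
        # A feature word absent from the document is shown with weight 0.00.
        rows.append([doc_id] + [f"{weights.get(w, 0.0):.2f}" for w in feature_words])

    # Pad every column to the width of its longest cell.
    widths = [max(len(row[c]) for row in rows) for c in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(widths[c]) for c, cell in enumerate(row)).rstrip()
        for row in rows
    )
```

Calling `print(correspondence_table(words, results))` shows at a glance which feature words contribute to which top-ranked documents.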

The similar document search support device according to claim 8 or 9,
wherein a countermeasure table is provided that stores, for each factor group and from the viewpoint of each of the factors, countermeasure information for helping the user obtain better similar document search results, namely countermeasure content describing what the user should do, an operation method describing how to carry out the countermeasure content, and screen information identifying the screen to transition to in order to perform the operation method, and
wherein, when the search result output unit presents to the user the factor values and the search accuracy and/or the divergence value stored in the influence degree table, at least one of the countermeasure content, the operation method, and the screen information recorded in the countermeasure table is displayed in association with the factor group.
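
One possible shape for such a countermeasure table is a mapping from factor group to a small record. The sketch below is an assumption-laden illustration: the names `Countermeasure`, `COUNTERMEASURE_TABLE`, and `annotate`, the factor group key, and all of the placeholder strings are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Countermeasure:
    content: str     # what the user should do
    operation: str   # how to carry it out
    screen: str      # screen to transition to


# Placeholder entry only; the key and the texts are hypothetical.
COUNTERMEASURE_TABLE = {
    "feature_word_hits": Countermeasure(
        content="Review the feature words of the input document",
        operation="Edit the feature word list and search again",
        screen="feature_word_edit",
    ),
}


def annotate(influence_rows, table=COUNTERMEASURE_TABLE):
    """Attach the countermeasure for each row's factor group, or None if absent."""
    return [
        {**row, "countermeasure": table.get(row["factor_group"])}
        for row in influence_rows
    ]
```

Each annotated row then carries what the output step needs to show the countermeasure next to the factor value, search accuracy, and divergence value for its factor group.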