JP2020095521A

JP2020095521A - Information processor, method for processing information, and program

Info

Publication number: JP2020095521A
Application number: JP2018233606A
Authority: JP
Inventors: 琢麻蔵満; Takuma Kuramitsu
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2020-06-18
Anticipated expiration: 2038-12-13
Also published as: JP7284371B2

Abstract

To provide a mechanism that allows a search for the appropriate document that a user wants. SOLUTION: A document search device extracts text from an input document, performs morphological analysis to obtain a collection of words, determines a category score for each category of the input document, acquires category-related words from each category whose score is at least the category determination threshold, and extracts feature words of the input document by a method such as tf-idf. The document search device then writes the acquired feature words and the obtained category information (category name, related words, and score adjustment value) into a search query and searches the stored documents. SELECTED DRAWING: Figure 17

Description

本発明は、文書の検索方法に関する。 The present invention relates to a document search method.

入力として与えられた文書と類似する文書を出力するシステムを類似文書検索システムと呼ぶ。 A system that outputs a document similar to the document given as an input is called a similar document search system.

検索者が自らキーワードや検索クエリを入力して検索する検索システムと比べ、適切な検索クエリを検索者が思いつかない場合においても利用できるため、検索システムの仕組みに明るくないユーザにも利用しやすいという特徴がある。 Compared with a search system in which the searcher enters keywords or a search query, a similar document search system can be used even when the searcher cannot think of an appropriate query, which makes it easy to use for users who are unfamiliar with how search systems work.

類似文書検索システムの実装については、既に様々な製品の開発や研究がなされている。オープンソースの検索エンジンであるApache Luceneにおける「MoreLikeThis」と呼ばれるAPIは、文書集合における単語の出現頻度をもとに、入力文書に出現する単語において希少性が高い単語を特徴語として選出し、当該特徴語を用いた検索クエリを自動的に発行することで類似文書検索の機能を実現している。 Various products and research efforts already address the implementation of similar document search systems. The API called "MoreLikeThis" in Apache Lucene, an open source search engine, uses the frequency of words in the document set to select, from the words that appear in the input document, highly rare words as feature words, and realizes similar document search by automatically issuing a search query built from those feature words.
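The idea of picking rare words as feature words can be sketched as follows. This is a minimal illustration of tf-idf style ranking, not Lucene's actual MoreLikeThis implementation; the function name `select_feature_words`, the smoothing in the idf term, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def select_feature_words(input_tokens, corpus, top_k=3):
    """Rank the input document's words by tf * idf and return the top_k."""
    n_docs = len(corpus)
    # Document frequency: number of corpus documents that contain each word.
    df = Counter(word for doc in corpus for word in set(doc))
    tf = Counter(input_tokens)

    def tfidf(word):
        # Smoothed idf (an assumption) so unseen words do not divide by zero.
        idf = math.log((n_docs + 1) / (df[word] + 1)) + 1.0
        return tf[word] * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda w: (-tfidf(w), w))[:top_k]

corpus = [
    ["system", "proposal", "estimate"],
    ["system", "proposal", "portal"],
    ["system", "estimate", "scan"],
]
features = select_feature_words(["system", "proposal", "ocr", "ocr"], corpus, top_k=2)
```

Here "system" appears in every corpus document and is ranked last, while "ocr", which is frequent in the input but absent from the corpus, is ranked first.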

前述の単語の出現頻度をもとにした実装は、実際に様々なケースで有効に動作するが、システム利用者の目的によっては不十分なことがある。 The above-mentioned implementation based on the frequency of occurrence of words actually works effectively in various cases, but it may be insufficient depending on the purpose of the system user.

以下、具体例として、SIerにおける営業担当者が、「RFP（提案依頼書）」を入力とし、過去の「類似案件の見積書」を検索する場面を挙げ、２つの問題点を説明する。 Below, as a concrete example, two problems are described using a scenario in which a sales representative at a system integrator (SIer) inputs an RFP (request for proposal) and searches for past quotations for similar projects.

１つ目の問題点は、1つの入力文書に対して、類似文書として出力する文書集合が、検索者の意図に関わらず同じものが出力される点である。 The first problem is that, for a given input document, the same set of documents is output as similar documents regardless of the searcher's intention.

例えば、営業Ａは「顧客の業界」が似ている見積書を、営業Ｂは「提案の内容」が似ている見積書を取得したいと考えている場合、各々に出力されるべき文書集合は全く異なる。 For example, if salesperson A wants quotations for which the customer's industry is similar, and salesperson B wants quotations for which the content of the proposal is similar, the document sets that should be output for each are entirely different.

前述したように、単語の出現頻度から特徴語を抽出して検索クエリに使用する場合、例えば、プロジェクトメンバーの氏名や顧客の社名などが検索時のキーワードとして採用され、検索者の意図とは異なる観点の類似文書（同じ特徴語を多く含む文書）が出力されることがある。 As described above, when feature words are extracted based on word frequency and used in the search query, words such as the names of project members or the customer's company name may be adopted as search keywords, and similar documents from a viewpoint different from the searcher's intention (documents that share many of the same feature words) may be output.

２つ目の問題点は、意味的に似ている文章が含まれている文書においても、単語の表記が異なれば検索結果としてヒットしないという点である。 The second problem is that a document containing semantically similar sentences is not returned as a hit if the words are written differently.

例えば、文章Ａ「ＵＸを考慮したポータルサイトを作りたい」と、文章Ｂ「顧客満足度向上に向けてホームページを改修したい」は、意味的に似ているが、文章Ａの単語（「UX」、「考慮する」、「ポータルサイト」、「作る」）と、文章Ｂの単語（「顧客満足度」、「向上」、「ホームページ」、「改修する」）の表記は一致しない。 For example, sentence A, "I want to build a portal site that takes UX into account," and sentence B, "I want to renovate our homepage to improve customer satisfaction," are semantically similar, but the words of sentence A ("UX", "consider", "portal site", "build") and the words of sentence B ("customer satisfaction", "improve", "homepage", "renovate") do not match in their written form.

１つ目の問題を解決するためには、検索システムに対してなんらかの方法でユーザの検索意図を入力する手段が必要となる。 In order to solve the first problem, a means for inputting the user's search intention to the search system in some way is required.

特許文献１には、ユーザの検索意図を反映したクエリを生成するために、ユーザが入力した単語群について、関連する単語をユーザが選択可能な状態で列挙し、ユーザに改めて検索意図に近い関連語を選択させる方法が示されている。 Patent Document 1 describes a method that, in order to generate a query reflecting the user's search intention, lists words related to the group of words entered by the user in a user-selectable form and has the user select the related words closest to the search intention.

２つ目の問題を解決するためには、検索システムに、言葉の持つ意味や概念を考慮したロジックを組み込むことが必要となる。 In order to solve the second problem, it is necessary to incorporate logic that considers the meaning and concept of words into the search system.

特許文献２には、検索語と索引語の両方について、上位・下位・関連概念に展開して検索する方法が示されている。 Patent Document 2 describes a method of searching after expanding both the search terms and the index terms into broader, narrower, and related concepts.

２つ目の問題に対する他のアプローチとして、あらかじめ文書のカテゴリを定義し、特許文献３や非特許文献１に示されるように、機械学習を用いた文書分類技術を用いて、入力文書に該当するカテゴリの文書のスコアを向上させる方法が考えられる。 As another approach to the second problem, document categories can be defined in advance and, as shown in Patent Document 3 and Non-Patent Document 1, a machine-learning-based document classification technique can be used to raise the score of documents in the categories to which the input document belongs.

Patent Document 1: 特開２０００−８２０６７号公報 (JP 2000-82067 A)
Patent Document 2: 特開２００４−２９９０６号公報 (JP 2004-29906 A)
Patent Document 3: 特開２００４−３２６４６５号公報 (JP 2004-326465 A)
Non-Patent Document 1: 人工知能による文書分類, 情報の科学と技術, 66巻6号, 277-281 ("Document classification by artificial intelligence", Science and Technology of Information, Vol. 66, No. 6, pp. 277-281)

しかしながら、特許文献１に記載された手法は、ユーザにとって検索したいキーワードがある程度定まっている場合においては有効であると考えられるが、類似文書検索システムのユーザは、具体的な検索キーワードが定まっている状態ではない。 However, although the method described in Patent Document 1 is considered effective when the keywords the user wants to search for are determined to some extent, the user of a similar document search system does not have specific search keywords in mind.

入力文書から抽出した全ての特徴語に対して関連語を表示することも考えられるが、自動抽出した特徴語の中に、そもそもユーザの検索意図と関連する言葉が含まれていない可能性がある。 Displaying related words for every feature word extracted from the input document is conceivable, but the automatically extracted feature words may not include any words related to the user's search intention in the first place.

また、文書から抽出する特徴語の数を増やした場合、ユーザに提示する選択肢が膨大になる問題もある。 Further, when the number of characteristic words extracted from a document is increased, there is a problem that the choices presented to the user become huge.

また、特許文献２に記載された手法は、表記の異なる単語においても、意味的に近い単語をヒットさせることが可能になるが、前述の問題と同様に、ユーザにとって検索したいキーワードが定まっていない場合に、入力文書から抽出した特徴語をクエリとすると、検索者にとって不要な単語が拡張される可能性がある。 The method described in Patent Document 2 makes it possible to hit semantically close words even when they are written differently. However, as with the problem above, when the user has no fixed keywords and the feature words extracted from the input document are used as the query, the query may be expanded with words that the searcher does not need.

さらに、特許文献３や非特許文献１に記載された手法は、教師学習を行う際に、十分な教師データが必要になる。 Furthermore, the methods described in Patent Document 3 and Non-Patent Document 1 require a sufficient amount of teacher (training) data for supervised learning.

しかし、企業内で日々発生する様々な文書データにおいて、日常的に適切な分類やタグ付けがなされているケースは稀であり、検索システムに搭載したいと考えるカテゴリについて、大量の教師データを確保することは困難である。 However, among the various document data generated daily within a company, it is rare for appropriate classification and tagging to be performed routinely, and it is difficult to secure a large amount of teacher data for the categories one wishes to build into the search system.

そこで、本発明では、ユーザが所望する適切な文書の検索を行うことが可能な仕組みを提供することを目的とする。 Therefore, it is an object of the present invention to provide a mechanism that enables a user to search for an appropriate document.

上記目的を達成するための本発明は、文書を検索する情報処理装置であって、カテゴリに分類された文書及び当該カテゴリに属する検索文字列を記憶する記憶手段と、入力された文書から前記記憶手段により記憶された検索文字列を取得する取得手段と、前記取得手段により取得した検索文字列を用いて前記入力された文書のカテゴリを推定する推定手段と、前記推定手段により推定したカテゴリに属する検索文字列と前記入力された文書の特徴語とを用いて、前記記憶手段に記憶する文書を検索する検索手段と、を備えたことを特徴とする。 To achieve the above object, the present invention is an information processing apparatus for searching documents, comprising: storage means for storing documents classified into categories and search character strings belonging to those categories; acquisition means for acquiring, from an input document, search character strings stored by the storage means; estimation means for estimating the category of the input document using the search character strings acquired by the acquisition means; and search means for searching the documents stored in the storage means using the search character strings belonging to the category estimated by the estimation means and the feature words of the input document.

本発明によれば、ユーザが所望する適切な文書の検索を行うことができる、という効果を奏する。 According to the present invention, there is an effect that it is possible to search for an appropriate document desired by a user.

本発明の実施形態における類似文書検索システムの構成図である。 A block diagram of the similar document search system in an embodiment of the present invention.
本発明の実施形態における文書検索装置がユーザに提示するユーザインターフェースの一例である。 An example of the user interface presented to the user by the document search device in an embodiment of the present invention.
文書入力後のユーザインターフェースの一例である。 An example of the user interface after a document is input.
カテゴリ要素クリック後のユーザインターフェースの一例である。 An example of the user interface after a category element is clicked.
ユーザインターフェースにおける文書詳細画面の一例である。 An example of the document detail screen in the user interface.
検索意図入力後のユーザインターフェースの一例である。 An example of the user interface after the search intention is input.
本発明の実施形態におけるカテゴリ作成処理の流れを表すフローチャートである。 A flowchart showing the flow of the category creation process in an embodiment of the present invention.
カテゴリテーブルの一例である。 An example of the category table.
カテゴリ関連語テーブルの一例である。 An example of the category-related word table.
カテゴリスコア調整語テーブルの一例である。 An example of the category score adjustment word table.
本発明の実施形態における機械学習処理のフローチャートである。 A flowchart of the machine learning process in an embodiment of the present invention.
カテゴリ判定閾値調整処理のフローチャートである。 A flowchart of the category determination threshold adjustment process.
関連語・調整語学習処理のフローチャートである。 A flowchart of the related word/adjustment word learning process.
教師データテーブルの一例である。 An example of the teacher data table.
本発明の実施形態における文書登録処理のフローチャートである。 A flowchart of the document registration process in an embodiment of the present invention.
検索インデックスに登録する文書情報の一例である。 An example of the document information registered in the search index.
本発明の実施形態における検索クエリ生成処理のフローチャートである。 A flowchart of the search query generation process in an embodiment of the present invention.
検索クエリの一例である。 An example of a search query.
本発明の実施形態における検索スコアの計算方法である。 The method of calculating the search score in an embodiment of the present invention.

以下、図面を参照して、本発明の実施形態を詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

図１は、類似文書検索システムの構成図を表す。類似文書検索システム１００は、カテゴリＤＢ１０１、検索インデックス１０２、文書登録装置１０３、文書検索装置１０４、及びカテゴリ学習装置１０５から成る。 FIG. 1 shows a block diagram of a similar document search system. The similar document search system 100 includes a category DB 101, a search index 102, a document registration device 103, a document search device 104, and a category learning device 105.

図２は、文書検索装置１０４が、類似文書検索システム１００のユーザに表示するユーザインターフェースの一例を表している。 FIG. 2 shows an example of a user interface displayed by the document search device 104 to the user of the similar document search system 100.

文書検索画面２００は、対話型のインターフェースであり、利用者の検索クエリを入力するクエリ入力フォーム２１０と、会話の内容を表示する会話表示領域２２０とから成る。 The document search screen 200 is an interactive interface, and includes a query input form 210 for inputting a user's search query and a conversation display area 220 for displaying the content of conversation.

ユーザは、会話表示領域２２０に、文書ファイルをドラッグ＆ドロップすることにより、文書検索装置１０４に文書ファイルを入力できる。 The user can input the document file to the document search device 104 by dragging and dropping the document file onto the conversation display area 220.

図３は、文書ファイル（ファイル名が、Ａ製薬/文書管理システムＲＦＰ.pdf）を入力した後の文書検索画面２００の一例である。文書検索装置１０４は、入力文書の分析結果３１０、および、検索結果３２０をユーザに提示する。 FIG. 3 is an example of the document search screen 200 after a document file (file name is A Pharmaceutical/Document Management System RFP.pdf) is input. The document search device 104 presents the analysis result 310 of the input document and the search result 320 to the user.

入力文書の分析結果３１０には、類似の観点を表す検索軸（案件内容、業界）と、文書が属する検索軸ごとのカテゴリ（文書管理、スキャン、製薬、医療）と、入力文書における特徴語（ＭＲ、安全、提携、症例、添付、Ａ製薬、文書、ペーパーレス）とが含まれる。 The analysis result 310 of the input document contains the search axes that represent viewpoints of similarity (project content, industry), the categories to which the document belongs on each search axis (document management, scan, pharmaceutical, medical), and the feature words of the input document (MR, safety, partnership, case, attachment, A Pharmaceutical, document, paperless).

尚、各情報の抽出方法の詳細については、後述の検索クエリ生成処理において説明する。 The details of the method of extracting each information will be described in a search query generation process described later.

文書検索画面２００に表示するカテゴリ要素は、クリック操作やタッチ操作を受け付ける要素であり、ユーザは、カテゴリをクリックすることにより、該当するカテゴリに関連する単語（以下、カテゴリ関連語と呼ぶ）のうち、入力文書中に存在する単語の一覧を表示することができる。 The category elements displayed on the document search screen 200 accept click and touch operations. By clicking a category, the user can display a list of those words related to that category (hereinafter, category-related words) that are present in the input document.

図４は、カテゴリ３１１（スキャン）をクリックした後の文書検索画面２００を表しており、カテゴリ関連語の一覧３１２は、カテゴリ３１１（スキャン）の関連語として、「ＯＣＲ」、「原本」、「電子化」、「スキャン」の４つが入力文書中に含まれていることを示している。尚、カテゴリ関連語の登録方法については後述する。 FIG. 4 shows the document search screen 200 after the category 311 (scan) is clicked. The category-related word list 312 shows that four related words of the category 311 (scan) are contained in the input document: "OCR", "original", "digitization", and "scan". The method of registering category-related words is described later.

検索結果３２０は、文書検索装置１０４が類似文書と判定した文書の一覧を表示する。文書検索装置１０４は、入力文書のカテゴリ、カテゴリ関連語、特徴語をクエリとして検索処理を行い、検索スコアの高い順に文書を表示する。 The search result 320 displays a list of documents determined by the document search device 104 to be similar documents. The document search device 104 performs a search process using the category of the input document, the category-related word, and the characteristic word as a query, and displays the documents in descending order of search score.

各文書の情報として表示される情報は、ファイル名、検索スコアの他に、文書検索装置１０４が判定したカテゴリの情報が含まれる。 The information displayed as the information of each document includes, in addition to the file name and the search score, information on the category determined by the document search device 104.

検索結果３２０におけるファイル名はリンク要素であり、クリック操作やタッチ操作によって、文書詳細画面５００を表示することができる。 The file name in the search result 320 is a link element, and the document detail screen 500 can be displayed by a click operation or a touch operation.

図５はファイル名３２１のクリックによって表示される文書詳細画面５００を表している。 FIG. 5 shows the document detail screen 500 displayed by clicking the file name 321.

文書詳細画面５００は、文書ダウンロードボタン５１０と、カテゴリ表示領域５２０と、特徴語表示領域５３０と、文書プレビュー領域５４０からなる。 The document detail screen 500 includes a document download button 510, a category display area 520, a characteristic word display area 530, and a document preview area 540.

ユーザは文書ダウンロードボタン５１０を選択することで、文書詳細画面５００に表示している文書ファイルをダウンロードすることができる。 The user can download the document file displayed on the document detail screen 500 by selecting the document download button 510.

カテゴリ表示領域５２０における各カテゴリの要素は、前述と同様にユーザのクリック操作によってカテゴリ関連語を表示する機能を備えることに加え、カテゴリ情報の誤りをシステムに入力するためのカテゴリ削除ボタン５２１と、文書に追加で付与すべきカテゴリを追加するためのカテゴリ追加ボタン５２２を備える。 Each category element in the category display area 520 has, as described above, the function of displaying category-related words when clicked by the user, and in addition has a category delete button 521 for reporting an error in the category information to the system and a category add button 522 for adding a category that should additionally be assigned to the document.

類似文書検索システム１００のユーザが、カテゴリ削除ボタン５２１、および、カテゴリ追加ボタン５２２により、削除、または追加したカテゴリは、検索インデックス１０２に即座に適用される他、後述するカテゴリの学習処理に使用される。 A category that a user of the similar document search system 100 deletes or adds with the category delete button 521 or the category add button 522 is applied to the search index 102 immediately and is also used in the category learning process described later.

図３に戻って、ユーザの検索意図を文書検索装置１０４に入力する方法について説明する。 Returning to FIG. 3, a method of inputting the user's search intention into the document search device 104 will be described.

ユーザは入力文書の分析結果３１０に表示された情報から、入力文書は、案件内容が「文書管理」、および、「スキャン」カテゴリに属するものであり、顧客の業界が「製薬」、「医療」カテゴリに属するものであることが分かる。 From the information displayed in the analysis result 310 of the input document, the user can see that the input document belongs to the "document management" and "scan" categories in terms of project content, and to the "pharmaceutical" and "medical" categories in terms of the customer's industry.

例えば、ユーザの類似検索の意図が、「業界が似ていて違う提案内容」の見積書を検索することである場合、ユーザは、クエリ入力フォーム２１０に対して、「業界優先」、「スキャン不要」、「文書管理不要」など、検索軸やカテゴリを表す言葉と、当該要素の検索スコアを調整するための言葉とを合わせて入力することで、検索意図を文書検索装置１０４に入力することができる。 For example, when the intention of the user's similarity search is to find quotations for "a similar industry but different proposal content", the user can convey the search intention to the document search device 104 by entering in the query input form 210 a combination of a word denoting a search axis or category and a word for adjusting the search score of that element, such as "industry priority" (業界優先), "scan unnecessary" (スキャン不要), or "document management unnecessary" (文書管理不要).

図６は、「業界優先文書管理不要スキャン不要」をクエリ入力フォーム２１０から入力した後の文書検索画面２００を表している。 FIG. 6 shows the document search screen 200 after "業界優先文書管理不要スキャン不要" ("industry priority, document management unnecessary, scan unnecessary") is entered in the query input form 210.

文書検索装置１０４は、受け取ったクエリに該当するカテゴリの重みを調整した結果を、入力文書の表示領域３１０に提示する。 The document search device 104 presents the result of adjusting the weight of the category corresponding to the received query in the display area 310 of the input document.

カテゴリ要素６１１における「―」や、カテゴリ要素６１２における「＋」は、検索に使用する重みを表している。 The “−” in the category element 611 and the “+” in the category element 612 represent the weight used for the search.

以下、説明簡略化のため、「＋」は文書中に当該カテゴリ、およびカテゴリ関連語が含まれていることに対するスコアを２倍にすることを表し、「―」は文書中に当該カテゴリ、および、カテゴリ関連語が含まれていることに対するスコアを−２倍にすることを表すものとするが、スコアを調整するための言葉（必須、優先、強め、弱目、不要、＋、−など）に応じて、検索時におけるカテゴリごとの重みを細かく調整できるように設計してもよい。 Hereinafter, to simplify the description, "+" means that the score for the document containing the category and its category-related words is multiplied by 2, and "-" means that this score is multiplied by -2. The system may, however, be designed so that the weight of each category at search time can be adjusted more finely according to the word used to adjust the score (required, priority, stronger, weaker, unnecessary, +, -, and so on).
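A minimal sketch of how such an intent query might be turned into per-target multipliers is shown below. Only the factors 2 and -2 come from the description; the mapping of 優先 (priority) to "+" and 不要 (unnecessary) to "-" is inferred from the FIG. 6 example, and the function name `parse_intent`, the whitespace-separated query format, and the `known_targets` set are assumptions for illustration.

```python
# Hypothetical mapping from adjustment words to score multipliers.
MULTIPLIERS = {"優先": 2.0, "不要": -2.0}

def parse_intent(query, known_targets):
    """Return {target: multiplier} for tokens shaped like '<target><adjustment word>'."""
    weights = {}
    for token in query.split():
        for word, factor in MULTIPLIERS.items():
            target = token[: -len(word)]
            # Accept the token only if the remainder is a known axis or category.
            if token.endswith(word) and target in known_targets:
                weights[target] = factor
    return weights

weights = parse_intent(
    "業界優先 文書管理不要 スキャン不要",
    {"業界", "文書管理", "スキャン"},
)
```

A finer-grained design, as the text suggests, would simply add more entries to the mapping (for example a smaller positive factor for a "stronger" keyword).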

文書検索装置１０４は、調整したスコアで類似文書を検索し、検索結果表示領域３２０に検索結果を表示する。 The document search device 104 searches for a similar document with the adjusted score, and displays the search result in the search result display area 320.

図６における検索結果は、業界が「製薬」、「医療」カテゴリに属する文書であり、案件内容のカテゴリとして「ポータル」、「文書校正」、「ＥＤＩ」、「ＡＩ」など、入力文書の案件内容のカテゴリ（「文書管理」、「スキャン」）に属しない文書が提示されている。 The search results in FIG. 6 are documents that belong to the "pharmaceutical" or "medical" categories on the industry axis and whose project-content categories, such as "portal", "document proofreading", "EDI", and "AI", differ from those of the input document ("document management", "scan").

このように、検索軸、カテゴリを用いて検索意図を入力可能なユーザインターフェースとすることにより、単語単位でクエリを拡張する方式と比較して、ユーザは直感的な操作で検索意図に応じた文書を得ることが可能になる。 In this way, the user interface that allows the user to input the search intention using the search axis and the category allows the user to intuitively operate the document according to the search intention, as compared with the method of expanding the query in word units. Will be able to obtain.

次に、カテゴリ学習装置１０５における、カテゴリ学習処理について、図７〜図１４を用いて説明する。 Next, the category learning processing in the category learning device 105 will be described with reference to FIGS. 7 to 14.

本発明のカテゴリ分類機能は文書集合中に教師データが存在しない場合でも動作することを特徴とする。 The category classification function of the present invention is characterized in that it operates even when there is no teacher data in the document set.

基本的なアイデアは、カテゴリ関連語の取得処理と、機械学習による単語別の重み習得処理とを分離することである。 The basic idea is to separate the process of acquiring category-related words from the process of weight learning for each word by machine learning.

図７は、本発明におけるカテゴリ作成処理の流れを示すフローチャートである。 FIG. 7 is a flowchart showing the flow of the category creation processing in the present invention.

まず、ステップＳ７０１では、カテゴリごとのカテゴリ名・検索軸・出現率をカテゴリＤＢ１０１に登録する。 First, in step S701, the category name, search axis, and appearance rate for each category are registered in the category DB 101.

図８は、カテゴリＤＢ１０１において、カテゴリの情報を管理するためのテーブル（カテゴリテーブル８００）の一例であり、カテゴリごとのカテゴリ名、検索軸、出現率、及びカテゴリ判定閾値の４つ組の一覧を保持する。 FIG. 8 is an example of a table (category table 800) for managing category information in the category DB 101. A list of four sets of category name, search axis, appearance rate, and category determination threshold for each category is shown in FIG. Hold.

ここで、「出現率」は、類似文書検索システム１００に登録する文書集合において、該当するカテゴリが存在する確率である。 Here, the “appearance rate” is the probability that the corresponding category exists in the document set registered in the similar document search system 100.

類似文書検索システム１００に登録する文書集合について、十分な知見をもつユーザがいる場合は、適当な出現率を登録し、出現率が不明な場合は、デフォルトの値（例えば、０．３など）を格納する。 When a user with sufficient knowledge of the document set registered in the similar document search system 100 is available, an appropriate appearance rate is registered; when the appearance rate is unknown, a default value (for example, 0.3) is stored.

「カテゴリ判定閾値」は、後述の機械学習処理において定まる値であり、初期値は０である。 The “category determination threshold value” is a value determined in the machine learning process described later, and its initial value is 0.

なお、「出現率」は、カテゴリ判定閾値を求めるためのパラメータとして使用する値であり、厳密な値である必要はない。 The "appearance rate" is a value used as a parameter for obtaining the category determination threshold value, and does not need to be a strict value.

次に、ステップＳ７０２では、カテゴリ関連語をカテゴリＤＢ１０１に登録する。 Next, in step S702, category related words are registered in the category DB 101.

図９は、カテゴリＤＢ１０１において、カテゴリ関連語を管理するためのテーブル（カテゴリ関連語テーブル９００）の一例であり、カテゴリ名、関連語、及び重みの３つ組の一覧を保持する。 FIG. 9 is an example of a table (category related word table 900) for managing category related words in the category DB 101, and holds a list of three sets of category name, related word, and weight.

ここで、「重み」は後述の機械学習処理によって自動的に定まる値であり、初期値は１である。 Here, the “weight” is a value that is automatically determined by the machine learning process described later, and the initial value is 1.

カテゴリ関連語は、類似文書検索システム１００のユーザが、カテゴリＤＢ１０１に直接投入できる。 The user of the similar document search system 100 can directly input the category-related words into the category DB 101.

投入の際、Ｗｉｋｉｐｅｄｉａ等、既にカテゴリ体系が整理されている辞書から、該当するカテゴリやサブカテゴリに属する記事のタイトル名を抽出したり、類似文書検索システム１００に投入した文書集合において、カテゴリに含まれていると思われる単語を含む文書集合の特徴語を抽出したものを使用してもよい（参考：Elasticsearch Significant Terms Aggregation）。 When entering them, the user may use the titles of articles belonging to the relevant category or subcategory extracted from a dictionary whose category system is already organized, such as Wikipedia, or feature words extracted from the set of documents, among those loaded into the similar document search system 100, that contain words thought to belong to the category (reference: Elasticsearch Significant Terms Aggregation).

次に、ステップＳ７０３では、カテゴリ学習装置１０５における機械学習処理を実行する。 Next, in step S703, the machine learning process in the category learning device 105 is executed.

機械学習処理の詳細は後述する（図１１）が、この処理によって、カテゴリ判定閾値の調整、カテゴリ関連語の重み調整、及びカテゴリスコア調整語の生成が行われる。 Although details of the machine learning process will be described later (FIG. 11 ), the category determination threshold value adjustment, the category related word weight adjustment, and the category score adjustment word generation are performed by this process.

ここで、「カテゴリスコア調整語」は、カテゴリの判定に使用するための単語であり、カテゴリ学習装置１０５が機械的に獲得するものである。 Here, the “category score adjustment word” is a word used for category determination and is mechanically acquired by the category learning device 105.

図１０は、カテゴリＤＢ１０１における、カテゴリスコア調整語を管理するためのテーブル（カテゴリスコア調整語テーブル１０００）の一例であり、カテゴリ名、調整語、及び重みの３つ組の一覧を管理する。 FIG. 10 is an example of a table (category score adjustment word table 1000) for managing category score adjustment words in the category DB 101, and manages a list of three sets of category name, adjustment word, and weight.
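The three tables described so far (category table 800, category-related word table 900, and category score adjustment word table 1000) can be pictured as simple records. The sketch below uses hypothetical Python class and field names; the default values 0.3, 0, and 1 are the defaults and initial values stated in the text, while the initial weight of an adjustment word is not stated and is therefore left as a required field.

```python
from dataclasses import dataclass

@dataclass
class Category:
    """One row of the category table 800."""
    name: str
    search_axis: str
    appearance_rate: float = 0.3  # default when the rate is unknown
    threshold: float = 0.0        # category determination threshold, initially 0

@dataclass
class RelatedWord:
    """One row of the category-related word table 900."""
    category: str
    word: str
    weight: float = 1.0           # initial value, later tuned by learning

@dataclass
class AdjustmentWord:
    """One row of the category score adjustment word table 1000."""
    category: str
    word: str
    weight: float                 # acquired mechanically by the learning device

scan = Category(name="scan", search_axis="project content")
ocr = RelatedWord(category="scan", word="OCR")
```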

図１１は、カテゴリ学習装置１０５における機械学習処理のフローチャートを示している。 FIG. 11 shows a flowchart of the machine learning process in the category learning device 105.

カテゴリ学習装置１０５は、ステップＳ１１０１のカテゴリ判定閾値調整処理、及びステップＳ１１０２の関連語・調整語学習処理を、学習結果に変化が生じなくなる（ステップＳ１１０３で判定）か、所定の回数実行（ステップＳ１１０４で判定）するまで繰り返し行う。 The category learning device 105 repeats the category determination threshold adjustment process of step S1101 and the related word/adjustment word learning process of step S1102 until the learning result no longer changes (determined in step S1103) or the processes have been executed a predetermined number of times (determined in step S1104).
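The control flow of FIG. 11 can be sketched as a loop with two exit conditions. The function `run_learning`, its callback parameters, and the toy state used to exercise it are hypothetical; the two learning steps themselves are passed in as placeholders.

```python
def run_learning(adjust_thresholds, learn_weights, snapshot, max_iterations=10):
    """Repeat S1101 and S1102 until nothing changes or the iteration cap is reached."""
    for iteration in range(1, max_iterations + 1):
        before = snapshot()
        adjust_thresholds()       # S1101: category determination threshold adjustment
        learn_weights()           # S1102: related word / adjustment word learning
        if snapshot() == before:  # S1103: learning result no longer changes
            return iteration
    return max_iterations         # S1104: predetermined number of runs reached

# Toy state that converges once the threshold reaches 3.
state = {"threshold": 0}

def toy_adjust():
    state["threshold"] = min(state["threshold"] + 1, 3)

def toy_learn():
    pass

iterations = run_learning(toy_adjust, toy_learn, lambda: dict(state))
```

The snapshot callback returns a copy of the learned values so that the comparison detects in-place updates.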

図１２は、ステップＳ１１０１で実施するカテゴリ判定閾値調整処理のフローチャートを示している。 FIG. 12 shows a flowchart of the category determination threshold value adjustment processing executed in step S1101.

カテゴリ学習装置１０５は、まず、ステップＳ１２０１において、各文書について各カテゴリのスコアをカテゴリ関連語テーブル９００とカテゴリスコア調整語テーブル１０００に登録されているデータを用いて求める。文書dにおけるカテゴリcのスコアscore(d, c)は下記の式から算出する。 First, in step S1201, the category learning apparatus 105 obtains the score of each category for each document using the data registered in the category-related word table 900 and the category score adjustment word table 1000. The score score(d, c) of the category c in the document d is calculated from the following formula.

rw(x) = カテゴリ関連語 x の重み (weight of category-related word x)
aw(y) = カテゴリスコア調整語 y の重み (weight of category score adjustment word y)

$$\mathrm{score}(d, c) = \sum_{x} rw(x) + \sum_{y} aw(y)$$

x ∈ （文書 d に出現するカテゴリ c の関連語） (related words of category c that appear in document d)
y ∈ （文書 d に出現するカテゴリ c のカテゴリスコア調整語） (category score adjustment words of category c that appear in document d)

次に、ステップＳ１２０２において、各カテゴリについて、カテゴリスコアのＸパーセンタイル（Ｘ＝１００×（１-カテゴリの出現率））を取得する。 Next, in step S1202, the Xth percentile of the category score (X = 100 × (1 - appearance rate of the category)) is acquired for each category.

例えば、文書集合の件数が１０件であり、各文書における「金融」カテゴリ（出現率０．３）のスコアを昇順に並べたスコア列が、（０,０,０,０,０,１,３,５,７,８）であったとき、７０パーセンタイルである「３」を取得する。 For example, the number of documents is 10, and the score sequence in which the scores of the “finance” category (occurrence rate 0.3) in each document are arranged in ascending order is (0,0,0,0,0,1, 3, 5, 7, 8), the 70th percentile “3” is acquired.
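The worked example above can be reproduced with a nearest-rank percentile. The text does not say which percentile definition is intended; nearest-rank is an assumption chosen here because it yields the value 3 given in the example (linear interpolation would give a value between 3 and 5). The function name is hypothetical.

```python
import math

def nearest_rank_percentile(sorted_scores, x):
    """Return the X-th percentile of an ascending list using the nearest-rank rule."""
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(x / 100 * len(sorted_scores), 9)))
    return sorted_scores[rank - 1]

scores = [0, 0, 0, 0, 0, 1, 3, 5, 7, 8]  # "finance" category scores, ascending
appearance_rate = 0.3
x = 100 * (1 - appearance_rate)          # 70th percentile
threshold = nearest_rank_percentile(scores, x)
```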

その後、ステップＳ１２０３において、ステップＳ１２０２において取得した各値を、各カテゴリのカテゴリ判定閾値としてカテゴリテーブル８００を更新する。 After that, in step S1203, the category table 800 is updated with each value acquired in step S1202 as the category determination threshold value of each category.
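Steps S1201 to S1203 for a single category can be sketched end to end as follows. The dictionaries stand in for the rows of the category-related word table 900 and the category score adjustment word table 1000; the function names, the sample words and weights, and the nearest-rank percentile are assumptions for illustration. Each word contributes once per document, following the set-membership form of the score formula.

```python
import math

def category_score(doc_words, related, adjustment):
    """score(d, c): sum the weights of related and adjustment words present in d."""
    present = set(doc_words)
    return (sum(w for word, w in related.items() if word in present)
            + sum(w for word, w in adjustment.items() if word in present))

def update_threshold(documents, related, adjustment, appearance_rate):
    """S1201 to S1203: score every document, then take the percentile as threshold."""
    scores = sorted(category_score(d, related, adjustment) for d in documents)
    # Nearest-rank percentile with X = 100 * (1 - appearance rate).
    rank = max(1, math.ceil(round((1 - appearance_rate) * len(scores), 9)))
    return scores[rank - 1]

related = {"OCR": 1.0, "scan": 1.0, "original": 1.0}
adjustment = {"paperless": 0.5}
documents = [
    ["OCR", "scan", "paperless"],
    ["scan"],
    ["portal"],
    ["EDI"],
]
first_score = category_score(documents[0], related, adjustment)
new_threshold = update_threshold(documents, related, adjustment, appearance_rate=0.25)
```

Because no labeled documents are needed here, this part of the process can run before any teacher data exists.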

次に、図１３及び図１４を用いて、関連語・調整語学習処理について説明する。 Next, the related word/adjusted word learning process will be described with reference to FIGS. 13 and 14.

図１３は、関連語・調整語学習処理の流れを示すフローチャートである。 FIG. 13 is a flowchart showing the flow of the related word/adjusted word learning process.

図１４はカテゴリＤＢ１０１における教師データテーブル１４００の一例であり、カテゴリ名、正負区分、及び文書ＩＤの３つ組の一覧を管理する。 FIG. 14 is an example of the teacher data table 1400 in the category DB 101, which manages a list of three sets of category name, positive/negative classification, and document ID.

正負区分は、教師データにおける正例（ＴＲＵＥ）、負例（ＦＡＬＳＥ）のいずれかを表すフラグであり、文書ＩＤは、類似文書検索システム１００に登録した文書を一意に特定する値である。 The positive/negative classification is a flag that represents either a positive example (TRUE) or a negative example (FALSE) in the teacher data, and the document ID is a value that uniquely identifies the document registered in the similar document search system 100.

教師データテーブル１４００の各レコードは、ユーザが直接入力可能である。 Each record of the teacher data table 1400 can be directly input by the user.

また、前述のカテゴリ表示領域５２０において、ユーザが、カテゴリ削除ボタン５２１を押下した際には、該当する文書ＩＤとカテゴリの負例として自動登録され、カテゴリ追加ボタン５２２を押下してカテゴリを追加した際には、該当する文書ＩＤとカテゴリの正例として自動登録されるものである。 In the category display area 520 described above, when the user presses the category delete button 521, the corresponding document ID and category are automatically registered as a negative example, and the category add button 522 is pressed to add the category. At this time, it is automatically registered as a positive example of the corresponding document ID and category.

関連語・調整語学習処理は、これらの正例、負例の情報をもとに、ステップＳ１３０１からステップＳ１３１１までの処理で、カテゴリ関連語、および、カテゴリスコア調整語の重みを調整する処理である。 The related word/adjusted word learning process is a process of adjusting weights of the category related word and the category score adjustment word in the processes of steps S1301 to S1311, based on the information of these positive examples and negative examples. is there.

まず、ステップＳ１３０２において、カテゴリスコアがカテゴリテーブル８００の閾値以下である正例（False Negativeの文書集合）の特徴語を取得する。 First, in step S1302, characteristic words of a positive example (False Negative document set) whose category score is equal to or lower than the threshold value of the category table 800 are acquired.

ここで、特徴語は、ＪＬＨスコア等の指標を用いて、スコアが事前に定めた所定の値よりも高い単語を取得すればよい。 Here, as the characteristic word, a word having a score higher than a predetermined value may be acquired using an index such as the JLH score.
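As a rough sketch, the JLH score as it is commonly described for Elasticsearch's significant terms aggregation multiplies the absolute change in a word's frequency between a foreground set and a background set by the relative change. The function names, the zero-handling, and the toy counts below are assumptions; the patent itself does not define the index.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """(foreground % - background %) * (foreground % / background %), or 0 if not raised."""
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)

def feature_words(fg_counts, fg_total, bg_counts, bg_total, min_score):
    """Words whose JLH score in the foreground set exceeds a predetermined value."""
    return sorted(
        word for word, count in fg_counts.items()
        if jlh_score(count, fg_total, bg_counts.get(word, 0), bg_total) > min_score
    )

# Foreground: 4 misclassified documents; background: all 64 documents.
fg_counts = {"OCR": 2, "system": 4}
bg_counts = {"OCR": 8, "system": 64}
selected = feature_words(fg_counts, 4, bg_counts, 64, min_score=1.0)
```

In this toy data "system" occurs in every document of both sets and scores 0, while "OCR" is four times more frequent in the foreground set and is selected.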

特徴語の取得方法については、他にも様々な方法が考えられるが、本発明に関する部分ではないため、説明を省略する。 Various other methods can be considered for the method of acquiring the characteristic word, but the method is not related to the present invention, and thus the description thereof is omitted.

次に、ステップＳ１３０２で取得した各特徴語に対して、ステップＳ１３０４〜ステップＳ１３０６の処理を実施する。 Next, the processing of steps S1304 to S1306 is performed on each characteristic word acquired in step S1302.

まずステップＳ１３０４では、特徴語がカテゴリ関連語テーブル９００に存在するか否かを判定し、存在する場合は、ステップＳ１３０５で、該当するカテゴリ関連語の重みにプラスの補正を加える。 First, in step S1304, it is determined whether or not the characteristic word exists in the category related word table 900, and if it exists, a positive correction is added to the weight of the corresponding category related word in step S1305.

ここで加える値は、０．３などの固定の値を加算してもよいし、もとの値の１０％など、変動する値でもよい。 The value added here may be a fixed value such as 0.3, or may be a variable value such as 10% of the original value.

ステップＳ１３０４では、特徴語がカテゴリ関連語テーブル９００に存在しない場合は、ステップＳ１３０６で、カテゴリスコア調整語テーブル１０００において、該当する調整語の重みにプラスの補正を加える。 If, in step S1304, the feature word does not exist in the category-related word table 900, then in step S1306 a positive correction is added to the weight of the corresponding adjustment word in the category score adjustment word table 1000.

このとき、カテゴリスコア調整語テーブル１０００に、該当する調整語が存在しない場合は、新しくレコードを追加する。 At this time, if the corresponding adjustment word does not exist in the category score adjustment word table 1000, a new record is added.

Next, in step S1307, the feature words of the negative examples whose category score is greater than the category determination threshold in the category table 800 (the false-positive document set) are acquired, and the weights are learned in steps S1308 to S1311.

As in step S1302, the feature words are acquired using an index such as the JLH score.

In step S1309, if a related word matching the feature word exists in the category-related word table 900, step S1310 applies a negative correction to the weight of that related word.

As in step S1305, the value added here may be a fixed value such as -0.3, or a variable value such as 10% of the original value.

However, if the corrected weight in the category-related word table 900 would fall below 0, the weight is set to 0.

If step S1309 finds that the feature word does not exist in the category-related word table 900, step S1311 applies a negative correction to the weight of the corresponding adjustment word in the category score adjustment word table 1000.
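The weight updates for the false-negative and false-positive feature words can be sketched as follows for a single category. This is an illustrative sketch only: the two tables are modelled as plain dictionaries, the fixed step of 0.3 is the example value from the text, and starting a newly added adjustment word at weight 0 (including in the negative case, which the text does not spell out) is an assumption.

```python
def learn_weights(related, adjustment, fn_words, fp_words, step=0.3):
    """Adjust the word weights of one category in place.

    related    -- dict word -> weight (stands in for category-related word table 900)
    adjustment -- dict word -> weight (stands in for adjustment word table 1000)
    fn_words   -- feature words of the false-negative documents (S1302)
    fp_words   -- feature words of the false-positive documents (S1307)
    """
    for word in fn_words:                      # S1304 to S1306
        if word in related:                    # S1304
            related[word] += step              # S1305: positive correction
        else:
            # S1306: positive correction; a missing word gets a new record
            # (assumed to start from weight 0).
            adjustment[word] = adjustment.get(word, 0.0) + step
    for word in fp_words:                      # S1308 to S1311
        if word in related:                    # S1309
            # S1310: negative correction, never dropping below 0
            related[word] = max(0.0, related[word] - step)
        else:
            # S1311: negative correction (new record assumed if missing)
            adjustment[word] = adjustment.get(word, 0.0) - step
```

Only the category-related word weights are clamped at 0; adjustment word weights are allowed to become negative here, which lets them lower a category score.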

This concludes the description of the machine learning process in the category learning device 105.

The category determination threshold adjustment process in step S1101 operates even when no teacher data exists.

The related-word/adjustment-word learning process in step S1102 adjusts word weights using the feature words of positive and negative examples, and operates even with a small amount of teacher data.

Category score adjustment words acquired by machine learning may include words that do not appear, to a user of the similar document search system 100, to be directly related to the category. Displaying them on the document search screen 200 could therefore cause users to distrust the similar document search system 100.

This is why category-related words and category score adjustment words are kept separate: the list 312 of category-related words on the document search screen 200 displays only category-related words, so that unnecessary words are not shown.

Next, the document registration process in the document registration device 103 will be described with reference to FIGS. 15 and 16.

FIG. 15 is a flowchart of the document registration process, and FIG. 16 shows an example of the document information registered in the search index 102.

The document registration device 103 accepts a document and a document ID as input. First, in step S1501, the title and text of the document are extracted from the input document, and in step S1502, morphological analysis is performed to obtain a set of words.

Next, in step S1503, the category score of each category and the category-related words present in the input document are acquired.

Here, the category score of the input document d for category c is obtained by the expression score(d, c) given earlier.

In step S1504, if teacher data corresponding to the input document ID exists in the teacher data table 1400, the positive/negative label for each category is acquired.

In step S1505, the document information 1600 acquired in steps S1501 to S1504 is registered in the search index 102.

As described above, the document registration process in the document registration device 103 does not itself perform category determination using the category determination threshold; instead, it registers the score for each category and the positive/negative labels of the teacher data in the search index 102.

The positive/negative labels of the teacher data registered in the search index 102 are updated as appropriate when a category is added or deleted from the document search screen 200.
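The registration steps can be sketched as a function that assembles one index record. The record layout, the parameter names, and the `score_fn` callable are hypothetical: `score_fn` stands in for the expression score(d, c) defined earlier in the document, and text extraction and morphological analysis (S1501, S1502) are assumed to have already produced `title` and `words`.

```python
def build_document_record(doc_id, title, words, categories, score_fn, teacher_data):
    """Assemble the record to register in the search index (S1503 to S1505).

    words        -- word list from morphological analysis (S1502)
    categories   -- dict category name -> set of category-related words
    score_fn     -- callable (words, category name) -> category score
    teacher_data -- dict (doc_id, category name) -> True (positive) / False (negative)
    """
    word_set = set(words)
    record = {"id": doc_id, "title": title, "words": sorted(word_set), "categories": {}}
    for name, related_words in categories.items():
        entry = {
            "score": score_fn(words, name),                     # S1503
            "related_words": sorted(word_set & related_words),  # S1503
        }
        label = teacher_data.get((doc_id, name))                # S1504
        if label is not None:
            entry["label"] = label
        record["categories"][name] = entry
    return record
```

Note that no threshold is applied at this stage: the raw score and the optional label are stored, so a later change to the category determination threshold does not require re-registering documents.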

Finally, the document search process in the document search device 104 will be described with reference to FIGS. 17 to 19.

The document search process in the document search device 104 is divided into a process of generating a search query from an input document and a process of retrieving the documents that match the search query from the search index 102 in descending order of search score.

FIG. 17 is a flowchart of the search query generation process, and FIG. 18 shows an example of a generated search query.

The document search device 104 first extracts text from the input document in step S1701, and in step S1702 performs morphological analysis to obtain a set of words.

Next, in step S1703, the category score of each category is computed, and category-related words are acquired for the categories whose score is at or above the category determination threshold.

Next, in step S1704, the feature words of the input document are extracted. Here, the feature words are those words acquired in step S1702 whose tf-idf value is at or above a predetermined value.

The idf component of the tf-idf value can be obtained from the statistics of the word field in the search index 102.

Next, in step S1705, the feature words acquired in step S1704 and the category information acquired in step S1703 (category name, related words, and score adjustment value) are written out as a search query.

Here, the score adjustment value is an index of how much importance the corresponding category is given when the search score is calculated; the search query generation process sets it to an initial value of 1.

As described earlier, the user can update the score adjustment value at any time by entering a phrase expressing the search intent (such as "industry priority" or "no scan required") in the query input form 210 on the document search screen 200.
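A minimal sketch of the query generation steps follows. The query layout and all parameter names are hypothetical, the smoothed idf formula is one common variant chosen for the example (the text does not specify one), and taking only the related words that actually occur in the input document is likewise an assumption. Category scores are passed in precomputed, since score(d, c) is defined earlier in the document.

```python
import math
from collections import Counter


def build_search_query(words, doc_freq, num_docs, category_scores,
                       thresholds, related_words, tfidf_min):
    """Build a search query from an analysed input document (S1703 to S1705).

    words           -- word list from morphological analysis (S1702)
    doc_freq        -- dict word -> number of indexed documents containing it
    num_docs        -- number of documents in the search index
    category_scores -- dict category name -> category score of the input
    thresholds      -- dict category name -> category determination threshold
    related_words   -- dict category name -> set of category-related words
    """
    features = []
    for word, count in Counter(words).items():            # S1704
        df = doc_freq.get(word, 0)
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0    # assumed smoothed idf
        if count * idf >= tfidf_min:
            features.append(word)
    categories = []
    for name, score in category_scores.items():            # S1703
        if score >= thresholds[name]:
            present = sorted(set(words) & related_words[name])
            categories.append({"name": name, "related_words": present,
                               "score_adjustment": 1.0})    # initial value 1
    return {"feature_words": sorted(features), "categories": categories}  # S1705
```

A user phrase such as "industry priority" would later overwrite `score_adjustment` for the matching category; that mapping is outside this sketch.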

FIG. 19 shows how the search score is calculated in the document search process of the document search device 104. For each feature word in the search query 1800, the document search device 104 raises the search score of documents that have the same word in their "word" field.

Furthermore, for each category in the search query 1800, the search score is raised for documents that are positive examples in the teacher data (positive/negative label is true), or whose category score is at or above the category determination threshold and that are not negative examples in the teacher data (positive/negative label is false).

The amount by which the search score is raised is multiplied by the score adjustment value in the search query 1800. For documents that meet the same condition, the search score is likewise raised according to the number of matching related words.

Using the search score calculation described above, the document search device 104 computes the search score of each document in the search index 102, sorts the list of documents in descending order of search score, and returns it as the search result.
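The scoring rules above can be sketched as follows. The dictionary layouts for the query and the indexed documents are hypothetical, the uniform boost of 1.0 per match is an assumption (the text gives no magnitudes), and applying the score adjustment value to the related-word boost as well is one reading of "likewise".

```python
def search_score(query, doc, thresholds, boost=1.0):
    """Score one indexed document against a search query.

    query -- {"feature_words": [...], "categories": [{"name", "related_words",
             "score_adjustment"}, ...]}
    doc   -- {"id", "words": [...], "categories": {name: {"score",
             "related_words", optional "label"}}}
    """
    score = 0.0
    doc_words = set(doc["words"])
    for word in query["feature_words"]:
        if word in doc_words:              # same word in the "word" field
            score += boost
    for cat in query["categories"]:
        entry = doc["categories"].get(cat["name"])
        if entry is None:
            continue
        label = entry.get("label")         # True, False, or None (no teacher data)
        in_category = label is True or (
            entry["score"] >= thresholds[cat["name"]] and label is not False
        )
        if not in_category:
            continue
        adj = cat["score_adjustment"]
        score += boost * adj               # category match, scaled by adjustment
        matches = len(set(cat["related_words"]) & set(entry["related_words"]))
        score += boost * adj * matches     # one boost per matching related word
    return score


def search(query, index, thresholds):
    """Return (document id, score) pairs in descending order of search score."""
    scored = [(doc["id"], search_score(query, doc, thresholds)) for doc in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the threshold is applied at query time against the stored score, a negative label overrides a high score and a positive label overrides a low one, as the text describes.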

This concludes the description of the operation of each device in the similar document search system 100.

To keep the explanation simple, the method of setting category-related words has been described only briefly. However, the results of the machine learning process can also be used to make adding and deleting category-related words easier, for example by creating a UI that presents highly weighted category score adjustment words to the user as candidate category-related words.

In addition, because the search index 102 stores, for each document, the category scores and the positive/negative labels of the teacher data rather than the category determination result, the magnitude of the category score itself can be factored into the search score (for example, displaying results in descending order of the "medical" category score).

As described above, it goes without saying that the object of the present invention is also achieved by supplying a recording medium on which a program implementing the functions of the above embodiments is recorded to a system or apparatus, and having the computer (or CPU or MPU) of that system or apparatus read and execute the program stored on the recording medium.

In this case, the program itself read from the recording medium implements the novel functions of the present invention, and the recording medium on which the program is recorded constitutes the present invention.

Examples of recording media that can be used to supply the program include a flexible disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, DVD-ROM, magnetic tape, non-volatile memory card, ROM, EEPROM, and silicon disk.

It also goes without saying that the invention covers not only the case where the functions of the above embodiments are implemented by the computer executing the program it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program, and the functions of the above embodiments are implemented by that processing.

Furthermore, it goes without saying that the invention also covers the case where the program read from the recording medium is written to a memory provided on a function expansion board inserted in the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that function expansion board or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are implemented by that processing.

The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

It goes without saying that the present invention is also applicable where it is achieved by supplying a program to a system or apparatus.

In this case, the system or apparatus can enjoy the effects of the present invention by reading into it the recording medium storing the program for achieving the present invention.

Furthermore, the system or apparatus can enjoy the effects of the present invention by downloading and reading the program for achieving the present invention from a server, database, or the like on a network by means of a communication program.

All configurations that combine the above embodiments and their modifications are also included in the present invention.

100 Similar document search system
101 Category DB
102 Search index
103 Document registration device
104 Document search device
105 Category learning device

Claims

An information processing apparatus for searching documents, comprising:
storage means for storing documents classified into categories and search character strings belonging to the categories;
acquisition means for acquiring, from an input document, the search character strings stored by the storage means;
estimation means for estimating the category of the input document using the search character strings acquired by the acquisition means; and
search means for searching the documents stored in the storage means using the search character strings belonging to the category estimated by the estimation means and the feature words of the input document.

A method of controlling an information processing apparatus that searches documents, wherein
the information processing apparatus executes:
an acquisition step of acquiring, from an input document, search character strings stored by storage means that stores documents classified into categories and the search character strings belonging to the categories;
an estimation step of estimating the category of the input document using the search character strings acquired in the acquisition step; and
a search step of searching the documents stored in the storage means using the search character strings belonging to the category estimated in the estimation step and the feature words of the input document.

A program that causes a computer to function as:
acquisition means for acquiring, from an input document, search character strings stored by storage means that stores documents classified into categories and the search character strings belonging to the categories;
estimation means for estimating the category of the input document using the search character strings acquired by the acquisition means; and
search means for searching the documents stored in the storage means using the search character strings belonging to the category estimated by the estimation means and the feature words of the input document.