JP5286007B2

JP5286007B2 - Document search device, document search method, and document search program

Info

Publication number: JP5286007B2
Application number: JP2008239168A
Authority: JP
Inventors: 浩之戸田; 眞哉村田; 由美子松浦; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-09-18
Filing date: 2008-09-18
Publication date: 2013-09-11
Anticipated expiration: 2028-09-18
Also published as: JP2010072909A

Description

The present invention relates to a technique for retrieving, from a group of electronic documents, the electronic documents that match a search condition.

In systems that search document collections such as Web pages, documents are assigned scores that are independent of their content, and these scores are used to rank search results.

The simplest method is to weight useful documents by hand, but when the number of documents becomes enormous, achieving comprehensive coverage is very costly and impractical.

For this reason, one proposed technique uses the reference relationships between pages, as in Non-Patent Document 1. It scores documents by their citation counts, on the premise that a frequently cited document is useful.

Taking this idea further, PageRank (Non-Patent Document 2) has been proposed as a technique that embodies the idea that a document viewed from useful documents is itself more useful.

Non-Patent Document 1: Eugene Garfield, "Citation analysis as a tool in journal evaluation", Science, 178: pp. 471-479, 1972.
Non-Patent Document 2: S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web Search Engine", Computer Networks 30 (1-7): pp. 107-117, 1997.
Non-Patent Document 3: Masaya Murata, Hiroyuki Toda, Yumiko Matsuura, Ryoji Kataoka, "Proposal of a Query Expansion Method Using Access-Concentrated Sites in Search Results", Symposium on Database and Web Information Systems (DBWeb2007), 2007.

However, the techniques described above, which use the citation relationships between documents, rest on the idea that a document cited many times is important. Anyone who wants their own document to be rated as important can therefore inflate its importance (score) artificially by creating many documents that cite it.

In fact, many site operators try to push their own Web sites to the top of search results by mechanically generating pages that hyperlink to those sites. These techniques are called link spam, and they cause results the searcher did not intend to appear at the top of search results.

In view of this problem, the object of the present invention is to assign high importance (scores) to truly useful documents by emphasizing citations that come from meaningful citing documents.

The present invention is a technical idea devised to solve this problem. When calculating the score of any document, it uses weights based on the access information or bookmark information attached to the documents that cite it, so that documents cited by meaningful documents receive high scores.

Specifically, the invention according to claim 1 is a document search device that searches a group of electronic documents for electronic documents matching a search condition specified from a client terminal, comprising: search execution means for searching the group of electronic documents using the search condition and outputting search results; registration means for registering, based on a user operation, documents in the search results that the client terminal has accessed; aggregation means for aggregating the document registration information of a plurality of client terminals registered by the registration means to obtain a registration frequency for each document; and calculation means for calculating a score for each document using the registration frequencies and the citation relationships between documents; wherein the calculation means, when calculating the document score of a cited document, reflects a weighting based on the registration frequency of each citing document, and the search execution means outputs search results ranked according to the scores.

The invention according to claim 2 is a document search method executed by a device that searches a group of electronic documents for electronic documents matching a search condition specified from a client terminal, comprising: a first step in which search execution means searches the group of electronic documents using the search condition and outputs search results; a second step in which registration means registers, based on a user operation, documents in the search results that the client terminal has accessed; a third step in which aggregation means aggregates the document registration information of a plurality of client terminals registered by the registration means to obtain a registration frequency for each document; a fourth step in which calculation means calculates a score for each document using the registration frequencies and the citation relationships between documents; and a fifth step in which the search execution means outputs search results ranked according to the scores; wherein the fourth step, when calculating the document score of a cited document, reflects a weighting based on the registration frequency of each citing document.

The invention according to claim 5 is a document search program that causes a computer to execute each step of the document search method according to claim 3 or 4.

According to the inventions of claims 1 to 5, the score of a document is calculated using weights based on the access information, bookmark information, and the like of the citing documents. Truly useful documents therefore receive higher scores without being swayed by raw citation counts, and those scores are reflected in the search results.

(1) First Embodiment
FIG. 1 shows a document search device 1 according to the first embodiment of the present invention. The document search device 1 is communicably connected, via a network, to a client terminal (PC) 2 that specifies search conditions and to a content server S that stores a group of electronic documents. Normally, a plurality of client terminals 2 are connected to the document search device 1.

Here, the document search device 1 is assumed to be configured as a server (for example, a search engine) that searches content and the like held on the content server S on the Internet. The document search device 1 may also be, for example, a computer that can connect to a network and execute the document search processing logic, and it may be connected to a network other than the Internet, such as an in-house LAN (Local Area Network).

The client terminal 2 includes a browser 3 as its user interface. As long as a browser 3 capable of communicating with the document search device 1 is installed, the client terminal 2 may be a mobile terminal such as a mobile phone.

As shown in FIG. 1, the document search device 1 functions mainly as a search engine and includes a crawler 4 serving as information collection means, a document database 5, a citation relationship database 6, search response means 7, search execution means 8, a click log database 9, access aggregation means 10, an access information database 11, and link analysis means 12.

The functions of functional blocks 4 to 12 are realized when the control unit (a CPU: Central Processing Unit, or the like) of the document search device 1 loads a document search program. The document search device 1 also has the usual components of a computer: an input unit such as a keyboard and mouse (not shown), a rewritable memory (RAM) that temporarily stores processing data, a communication device used for the network connection to the client terminal 2 and the content server S, a storage unit such as a hard disk drive, and a display unit such as a display. Databases 5, 6, 9, and 11 are built on the hard disk drive.

The crawler 4 accesses the content server S through the communication device, collects the group of electronic documents to be searched, and stores it in the document database 5. It also stores data indicating the citation relationships between the documents of the group in the citation relationship database 6.

The search response means 7 receives, through the communication device, the search condition that the user entered with the browser 3 on the client terminal 2, and sends the search condition to the search execution means 8.

The search execution means 8 searches the document database 5 using the search condition received from the search response means 7 and returns the search results to the search response means 7. The search execution means 8 is implemented by being incorporated into a search engine program or the like.

The search response means 7 returns the search results received from the search execution means 8 to the client terminal 2. It then obtains from the client terminal 2 information on the documents in the search results that the user actually viewed (the access history) and stores it as a click log in the click log database 9. The search response means 7 is implemented by being incorporated into a search application or the like.

The access aggregation means 10 reads the click logs (access history) from the click log database 9, aggregates them to obtain access information for each document, and stores the access information in the access information database 11.

The link analysis means 12 calculates a score for each document using the access information and the citation relationships between documents stored in the citation relationship database 6, and stores the scores in the document database 5. In subsequent searches from the client terminal 2, the search execution means 8 ranks documents according to these scores and reflects the ranking in the search results. The score calculation process executed by functional blocks 4 to 12 is described below with reference to the flowchart of FIG. 2.

S01: First, the crawler 4 collects the group of electronic documents to be searched from the content server S over the network and stores it in the document database 5. At the same time, it analyzes the citation relationships between the collected documents and stores the resulting data in the citation relationship database 6.

Here, a citation relationship is assumed to be a hyperlink when the collected documents are Web pages, and reference information or the like when they are other documents such as academic literature.

Table 1 shows example data in the document database 5, and Table 2 shows example data in the citation relationship database 6. In S01, the crawler 4 stores the document information (document ID, title, body) in the document database 5; the score of each document and related values are stored in a later step (S04).
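
As a rough illustration of the two stores populated in S01, the following sketch models one record of each database. The field names, the sample documents, and the helper `citing_documents` are hypothetical; only the document ID, title, body, score, and citing/cited relationship come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentRecord:
    """One row of the document database (cf. Table 1). Field names are assumptions."""
    doc_id: str
    title: str
    body: str
    score: Optional[float] = None  # left empty in S01, filled in by S04


@dataclass(frozen=True)
class CitationRecord:
    """One row of the citation relationship database (cf. Table 2)."""
    source_id: str  # citing document (e.g. the page containing the hyperlink)
    target_id: str  # cited document (e.g. the hyperlink target)


# Hypothetical sample data: D1 and D2 both cite D3.
documents = {
    "D1": DocumentRecord("D1", "Portal", "links to D2 and D3"),
    "D2": DocumentRecord("D2", "Review", "links to D3"),
    "D3": DocumentRecord("D3", "Article", "no outgoing links"),
}
citations = [
    CitationRecord("D1", "D2"),
    CitationRecord("D1", "D3"),
    CitationRecord("D2", "D3"),
]


def citing_documents(doc_id, citation_rows):
    """Return the IDs of the documents that cite doc_id."""
    return {row.source_id for row in citation_rows if row.target_id == doc_id}
```

Keeping citations as separate source/target rows makes it straightforward to look up the citing documents of any document, which is what the later scoring step needs.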

S02: The search response means 7 runs searches against the group of electronic documents collected in S01 and stores the access history of the client terminal 2 for the documents in the search results in the click log database 9.

That is, when the search response means 7 accepts access from the client terminal 2, it presents a search interface screen in the browser 3, receives the search condition that the user enters on that screen with the keyboard or the like, and sends it to the search execution means 8.

The search execution means 8 then searches the document database 5 using the received search condition and returns a list of the matching documents to the search response means 7 as the search results.

The search response means 7 returns the received search results to the client terminal 2 and presents them to the user through the browser 3. It then obtains from the client terminal 2 information on the documents in the search results that the user actually viewed (the access history of the client terminal 2) and stores the obtained access history as a click log in the click log database 9. Table 3 shows example data in the click log database 9.
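
The click logging in S02 could be sketched as follows. The record layout (`client_id`, `query`, `doc_id`, `rank`, `clicked_at`) and the function `record_click` are assumptions made for illustration; the description only states that the access history for documents in the search results is stored as a click log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ClickLogEntry:
    """One click on a search result (cf. Table 3). All field names are hypothetical."""
    client_id: str
    query: str
    doc_id: str
    rank: int             # position of the document in the presented result list
    clicked_at: datetime


def record_click(click_log, client_id, query, doc_id, rank, clicked_at=None):
    """Append one access-history entry to the click log and return it."""
    when = clicked_at if clicked_at is not None else datetime.now(timezone.utc)
    entry = ClickLogEntry(client_id, query, doc_id, rank, when)
    click_log.append(entry)
    return entry
```

Storing the result position alongside the click is a design choice that would make it possible to correct for ranking bias later; a plain (client, document) pair would be enough for simple access counts.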

S03: The access aggregation means 10 reads the click logs from the click log database 9, aggregates them per document, and stores the result as access information in the access information database 11.

The simplest example of access information is the access frequency of a document. However, raw access counts carry a bias: documents shown higher in the search results are accessed more often. As in Non-Patent Document 3, a relative value, such as the relationship to the sites ranked immediately before and after in the search results, may therefore be used as the indicator of access frequency.

To exclude from the click log database 9 access histories generated by, for example, the information-gathering robots of external search sites, mechanical access patterns may also be detected and invalidated. The example described here uses the number of accesses to a document as the access information. Table 4 shows example data in the access information database 11 for this case.
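
A minimal sketch of the aggregation in S03, assuming the click log has been reduced to (client, document) pairs. The robot filter shown here, a per-client click threshold, is a stand-in chosen for illustration; the description does not specify how mechanical access patterns are detected.

```python
from collections import Counter


def aggregate_access_counts(click_log, max_clicks_per_client=100):
    """Count accesses per document, ignoring clients that look mechanical.

    click_log: iterable of (client_id, doc_id) pairs.
    max_clicks_per_client: hypothetical threshold above which a client is
        treated as a robot and all of its clicks are discarded.
    """
    entries = list(click_log)
    clicks_per_client = Counter(client for client, _ in entries)
    suspicious = {c for c, n in clicks_per_client.items() if n > max_clicks_per_client}
    return Counter(doc for client, doc in entries if client not in suspicious)
```

For example, with a threshold of 3, a client that clicked five times would be dropped entirely while ordinary clients are counted as usual.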

S04: The link analysis means 12 calculates the score of each document using the access information of each document read from the access information database 11 and the citation relationship data read from the citation relationship database 6.

As one example of score calculation, a weight based on the access information is first assigned to each document from the access information stored in the access information database 11 of Table 4. For example, a larger weight may be assigned to a document whose access information value is larger, that is, to a document that is accessed more often.
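
One way to turn access counts into weights is sketched below. The logarithmic damping is an assumption made for this example; the description only requires that documents accessed more often receive larger weights.

```python
import math


def access_weights(access_counts, all_doc_ids):
    """Assign each document a weight that grows with its access count.

    access_counts: mapping doc_id -> number of accesses.
    all_doc_ids: every document to weight; documents never accessed get 0.0.
    The log1p damping is a hypothetical choice that keeps a few very popular
    documents from dominating the later weighted sums.
    """
    return {doc: math.log1p(access_counts.get(doc, 0)) for doc in all_doc_ids}
```

Any monotonically increasing function of the count, including the raw count itself, would fit the description equally well.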

Next, the citation relationship data is read from the citation relationship database 6 of Table 2, and the documents that cite the document being scored are identified. The score is based on the citation count, but rather than a simple count it is calculated as a weighted sum of the weights, based on access information, assigned to the citing documents, as shown in Expression (1). Expression (1) is assumed to be defined in the program of the document search device 1.

Here, Score(D_i) denotes the score of document D_i, and W(D_j) denotes the weight of document D_j based on its access information. IN(D_i) denotes the set of documents that cite document D_i.
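
Written out from these definitions, a weighted sum over the citing documents takes the following form. This is a reconstruction from the symbol definitions, so the exact layout of the original Expression (1) may differ.

```latex
\mathrm{Score}(D_i) = \sum_{D_j \in \mathrm{IN}(D_i)} W(D_j) \tag{1}
```

With W(D_j) = 1 for every document, this reduces to the plain citation count.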

Consequently, the calculation of the score of document D_i takes into account the weight based on the access information of each document that cites D_i. That is, a document that is cited by more of the frequently accessed documents receives a higher score.
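
The weighted citation sum can be sketched as below, assuming citations are given as (citing, cited) pairs and weights as a mapping from document ID to W. The function name and the sample data are hypothetical.

```python
def weighted_citation_scores(citations, weights):
    """Score each document as the sum of the weights of the documents citing it.

    citations: iterable of (citing_id, cited_id) pairs.
    weights: mapping doc_id -> weight W based on access information.
    Duplicate pairs are collapsed so each citing document counts once.
    """
    scores = {doc: 0.0 for doc in weights}
    for citing, cited in set(citations):
        scores[cited] = scores.get(cited, 0.0) + weights.get(citing, 0.0)
    return scores


# Hypothetical data: G is cited by two well-visited pages, S by three
# machine-generated pages that nobody visits.
example_weights = {"A": 5.0, "B": 3.0, "G": 1.0, "S": 0.0,
                   "spam1": 0.0, "spam2": 0.0, "spam3": 0.0}
example_citations = [("A", "G"), ("B", "G"),
                     ("spam1", "S"), ("spam2", "S"), ("spam3", "S")]
example_scores = weighted_citation_scores(example_citations, example_weights)
```

In this example S has more citations than G, yet its score stays at zero because none of its citing pages is ever accessed, which is the link spam resistance the description aims for.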

When a score based on the PageRank described above is used, one possible method is to weight, based on the access information, the random jump term that the original PageRank algorithm provides as a term for random transitions. The PageRank algorithm is given by Expression (2) below.

Here, OUT(D_j) denotes the set of documents cited by document D_j, and d is the term representing a random jump, based on the assumption that, from document D_i, a document is viewed at random at a fixed rate. Part of this term is weighted using the weights based on the access information, as shown in Expression (3). Expressions (2) and (3) are also assumed to be defined in the program of the document search device 1.

Here, α is a coefficient that specifies the weight of the random jump term. When α = 0, no random jump is introduced, and the random jump of Expression (2) is distributed entirely according to the weight based on the access information of document D_i.
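
One form of the two expressions that is consistent with these definitions is sketched below. It is an assumed reading rather than the original notation: the number of documents N, the uniform jump share d/N, and the normalization of W by its total are all assumptions.

```latex
% Assumed form of Expression (2): PageRank with random-jump rate d over N documents
\mathrm{Score}(D_i) = \frac{d}{N}
  + (1 - d) \sum_{D_j \in \mathrm{IN}(D_i)} \frac{\mathrm{Score}(D_j)}{|\mathrm{OUT}(D_j)|}

% Assumed form of Expression (3): a share (1 - \alpha) of the jump follows the access weights
\mathrm{Score}(D_i) = d \left( \frac{\alpha}{N}
  + (1 - \alpha) \frac{W(D_i)}{\sum_{k} W(D_k)} \right)
  + (1 - d) \sum_{D_j \in \mathrm{IN}(D_i)} \frac{\mathrm{Score}(D_j)}{|\mathrm{OUT}(D_j)|}
```

Under this reading, α = 1 recovers the uniform jump of the first form, and α = 0 distributes the whole jump according to the access weights.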

As a result, the weight based on the access information of each citing document D_j is also taken into account when the score of document D_i is calculated with the PageRank algorithm, and documents for which this weight is large receive higher scores.
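
The following is a minimal sketch of a PageRank iteration whose random jump is biased by access weights. It is one possible reading, not the implementation of the description: the jump rate `d`, the mixing coefficient `alpha`, the normalization of the weights, and the handling of documents without outgoing citations are all assumptions.

```python
def weighted_jump_pagerank(out_links, weights, d=0.15, alpha=0.0, iterations=50):
    """PageRank-style scores with a random jump biased toward accessed documents.

    out_links: mapping doc_id -> iterable of cited doc_ids.
    weights: mapping doc_id -> access-based weight W (missing means 0).
    d: assumed probability of a random jump at each step.
    alpha: share of the jump that is uniform; alpha=0 follows the weights only.
    """
    docs = sorted(set(out_links) | {t for ts in out_links.values() for t in ts} | set(weights))
    n = len(docs)
    total_w = sum(weights.get(doc, 0.0) for doc in docs)
    jump = {}
    for doc in docs:
        share = weights.get(doc, 0.0) / total_w if total_w > 0 else 1.0 / n
        jump[doc] = alpha / n + (1.0 - alpha) * share
    scores = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        new_scores = {doc: d * jump[doc] for doc in docs}
        for doc in docs:
            targets = set(out_links.get(doc, ()))
            if targets:
                portion = (1.0 - d) * scores[doc] / len(targets)
                for target in targets:
                    new_scores[target] += portion
            else:
                # Assumption: a document citing nothing hands its score back
                # through the jump distribution so the total stays at 1.
                for target in docs:
                    new_scores[target] += (1.0 - d) * scores[doc] * jump[target]
        scores = new_scores
    return scores
```

With `alpha=1.0` the jump is uniform and a page cited by many machine-generated pages can outrank a page cited by one genuinely visited page; with `alpha=0.0` the generated pages receive no jump mass and pass nothing on.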

The link analysis means 12 then stores the score calculated in this way for each document in the document database 5, together with the access information and the like. The stored scores are used to rank search results in subsequent searches from the client terminal 2.

That is, for a search request from the client terminal 2 after the scores have been calculated, the search execution means 8 ranks documents according to, for example, the similarity between the search condition entered by the user and each document and the score of each document calculated in S04, and returns the specified number of top-ranked results to the search response means 7. The search response means 7 returns the received search results to the client terminal 2 and presents them to the user through the browser 3. The scores may be recalculated at regular intervals so that the latest scores are reflected in the search results.
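
How the similarity and the stored score are combined is not specified, so the sketch below assumes a simple linear blend. The parameter `mix`, the normalization by the largest score among the candidates, and the function name are hypothetical.

```python
def rank_results(candidates, link_scores, top_k=10, mix=0.5):
    """Rank matching documents by a blend of query similarity and stored score.

    candidates: mapping doc_id -> similarity to the search condition (0..1).
    link_scores: mapping doc_id -> score stored by the link analysis step.
    mix: assumed share given to similarity; the rest goes to the score.
    Returns the top_k document IDs, best first (ties broken by ID).
    """
    max_score = max((link_scores.get(doc, 0.0) for doc in candidates), default=0.0)

    def combined(doc_id):
        score = link_scores.get(doc_id, 0.0)
        normalized = score / max_score if max_score > 0 else 0.0
        return mix * candidates[doc_id] + (1.0 - mix) * normalized

    return sorted(candidates, key=lambda doc: (-combined(doc), doc))[:top_k]
```

For instance, a document with slightly lower similarity but a much higher stored score would move ahead of a close textual match that nobody cites.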

As described above, the document search device 1 according to this embodiment calculates document scores using weights based on the access information of the citing documents, which makes it possible to give higher scores to truly useful documents. Furthermore, scores close to the user's intuition can be assigned without being affected by deliberate manipulation, such as link spam, that tries to have a particular document rated unduly highly.

(2) Second Embodiment
FIG. 3 shows a document search device 13 according to the second embodiment of the present invention. In the document search device 13, the click log database 9, the access aggregation means 10, and the access information database 11 are omitted, and bookmark means 14, a bookmark database 15, bookmark aggregation means 16, and a bookmark information database 17 are provided instead. That is, the document search device 13 calculates scores using weights based on bookmark information in place of weights based on document access information.

Like the click log database 9 and the access information database 11, databases 15 and 17 are built on the hard disk drive. The bookmark means 14 is implemented by being incorporated into a bookmark application, the browser 3, or the like. The score calculation process based on bookmark information is described below with reference to the flowchart of FIG. 4.

S11: As in S01, the crawler 4 collects the group of electronic documents to be searched from the content server S and stores it in the document database 5, and stores the citation relationship data for the collected documents in the citation relationship database 6. The data in databases 5 and 6 is the same as in Tables 1 and 2.

Here, as in S02, the search execution means 8 returns a list of the documents matching the search condition received from the client terminal 2 to the search response means 7 as the search results. The search response means 7 returns the received search results to the client terminal 2 and presents them to the user through the browser 3. The user can select a document from the search results and view it in the browser 3.

S12: Based on a user operation, the bookmark means 14 bookmarks (registers) the viewed document in the bookmark database 15.

Here, the browser 3 is assumed to incorporate a function that lets the user register the document currently being viewed in the bookmark database 15 through the bookmark means 14.

With this registration operation by the user, the information needed to identify the document currently being viewed (such as its URL) is obtained through the bookmark means 14. This information and the registration date and time are written to the bookmark database 15, which builds up a bookmark history in the bookmark database 15. Table 5 shows example data stored in the bookmark database 15.
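
The registration in S12 might look like the following sketch. The `user_id` field and the function name are hypothetical; the description states only that identifying information such as the URL and the registration date and time are recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BookmarkEntry:
    """One bookmark history record (cf. Table 5). Field names are assumptions."""
    user_id: str
    url: str                 # information identifying the viewed document
    registered_at: datetime  # registration date and time


def register_bookmark(bookmark_db, user_id, url, registered_at=None):
    """Append a bookmark for the document currently being viewed and return it."""
    when = registered_at if registered_at is not None else datetime.now(timezone.utc)
    entry = BookmarkEntry(user_id, url, when)
    bookmark_db.append(entry)
    return entry
```

Keeping the timestamp with every record is what later allows old bookmarks to be discounted or ignored during aggregation.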

S13: The bookmark aggregation means 16 accesses the bookmark database 15, reads the bookmark history of each document, and stores the aggregated result as bookmark information in the bookmark information database 17. Table 6 shows example data in the bookmark information database 17.

The simplest example of bookmark information is the frequency with which a document is bookmarked. However, a simple registration count is cumulative, so bookmarks registered long ago may continue to count. Possible countermeasures include not using old bookmark information or reducing how strongly old bookmark information is reflected in the aggregated result. To exclude mechanical bookmark registrations from the aggregated result, mechanical registration patterns may also be detected and invalidated. The example described here uses the number of bookmark registrations of a document by individual users as the bookmark information.
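
The two countermeasures for old bookmarks mentioned above, ignoring them and discounting them, can be sketched together as follows. The exponential half-life decay, the age cut-off, and counting each user at most once per document are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def bookmark_information(bookmarks, now, half_life_days=180.0, max_age_days=None):
    """Aggregate bookmarks per document, discounting or dropping old ones.

    bookmarks: iterable of (user_id, doc_id, registered_at) tuples.
    half_life_days: assumed age at which a bookmark counts half as much.
    max_age_days: if given, bookmarks older than this are not used at all.
    Each user counts once per document, using the most recent registration.
    """
    latest = {}
    for user, doc, when in bookmarks:
        key = (user, doc)
        if key not in latest or when > latest[key]:
            latest[key] = when
    info = {}
    for (user, doc), when in latest.items():
        age_days = (now - when).total_seconds() / 86400.0
        if max_age_days is not None and age_days > max_age_days:
            continue
        info[doc] = info.get(doc, 0.0) + 0.5 ** (age_days / half_life_days)
    return info


# Hypothetical usage: one fresh bookmark, one 180 days old, one two years old.
reference_time = datetime(2008, 9, 18)
sample = [
    ("u1", "D1", reference_time),
    ("u2", "D1", reference_time - timedelta(days=180)),
    ("u3", "D2", reference_time - timedelta(days=720)),
]
sample_info = bookmark_information(sample, reference_time, max_age_days=365)
```

Setting `half_life_days` very large and leaving `max_age_days` unset approximates the plain registration count used in the example of the description.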

S14: The link analysis means 12 calculates the score of each document using the bookmark information of each document read from the bookmark information database 17 and the citation relationship data read from the citation relationship database 6.

As one example of score calculation, a weight based on the bookmark information is assigned to each document from the bookmark information stored in the bookmark information database 17 of Table 6. For example, a larger weight may be assigned to a document whose bookmark information value is larger, that is, to a document that many users have bookmarked.

The link analysis means 12 uses this weight based on bookmark information in place of the weight based on access information described above, calculates the score of each document in the same way as in the first embodiment, and stores it in the document database 5 together with the bookmark information and the like. For subsequent searches from the client terminal 2, the search execution means 8 determines the output order of the search results according to these scores and the like.

As described above, the document search device 13 according to this embodiment takes into account the weights based on the bookmark information of the citing documents when calculating document scores. That is, a document that is cited by more of the frequently bookmarked documents can be given a higher score, and the same effect as in the first embodiment is obtained.

なお、本発明は、コンピュータに前記第１．２実施形態の各処理ステップＳ０１〜０４．１１〜１４を実行させる文書検索プログラムとしても提供することができる。このプログラムは、各処理ステップＳ０１〜０４．１１〜１４の全ての処理ステップをコンピュータに実行させるものでもよく、あるいはその一部の処理を実行させるものであってもよい。 The present invention can also be provided as a document search program that causes a computer to execute the processing steps S01 to 04.11 to 14 of the first embodiment. This program may cause the computer to execute all the processing steps S01 to 04.11 to 14, or may execute a part of the processing.

このプログラムは、Ｗｅｂサイトなどからのダウンロードによってコンピュータに提供される。また、前記プログラムは、ＣＤ−ＲＯＭ，ＤＶＤ−ＲＯＭ，ＣＤ−Ｒ，ＣＤ−ＲＷ，ＤＶＤ−Ｒ，ＤＶＤ−ＲＷ，ＭＯ，ＨＤＤ，Ｂｌｕ−ｒａｙＤｉｓｋ（登録商標）などの記録媒体に格納してコンピュータに提供してもよい。 This program is provided to the computer by download from a website or the like. The program may also be stored on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, or Blu-ray Disc (registered trademark) and provided to the computer.

本発明の第１実施形態に係る文書検索装置の構成図。 Configuration diagram of the document search device according to the first embodiment of the present invention.
同スコア算出の処理フロー図。 Processing flow diagram of score calculation in the first embodiment.
本発明の第２実施形態に係る文書検索装置の構成図。 Configuration diagram of the document search device according to the second embodiment of the present invention.
同スコア算出の処理フロー図。 Processing flow diagram of score calculation in the second embodiment.

符号の説明 Explanation of symbols

１．１３…文書検索装置
２…クライアント端末
３…ブラウザ
４…クローラ（情報収集手段）
５…文書データベース
６…引用関係データベース
７…検索アクセス手段（検索アプリケーション）
８…検索実行手段（検索エンジン）
９…クリックログデータベース
１０…アクセス集計手段
１１…アクセス情報データベース
１２…リンク解析手段（スコア算出手段）
１４…ブックマーク手段（ブックマークアプリケーション）
１５…ブックマークデータベース
１６…ブックマーク集計手段
１７…ブックマーク情報データベース
Ｓ…コンテンツサーバ

1, 13 ... Document search device
2 ... Client terminal
3 ... Browser
4 ... Crawler (information collecting means)
5 ... Document database
6 ... Citation relationship database
7 ... Search access means (search application)
8 ... Search execution means (search engine)
9 ... Click log database
10 ... Access aggregation means
11 ... Access information database
12 ... Link analysis means (score calculation means)
14 ... Bookmark means (bookmark application)
15 ... Bookmark database
16 ... Bookmark aggregation means
17 ... Bookmark information database
S ... Content server

Claims

クライアント端末から指示された検索条件に該当する電子文書を電子文書群中から検索する文書検索装置であって、
前記検索条件を用いて電子文書群を検索して、検索結果を出力する検索実行手段と、
前記検索結果中、前記クライアント端末のアクセスした文書をユーザの操作に基づいて登録する登録手段と、
前記登録手段による複数のクライアント端末の文書の登録情報を集計して、文書毎の登録頻度を求める集計手段と、
前記登録頻度および文書間の引用関係を用いて各文書のスコアを算出する算出手段と、を備え、
前記算出手段が、引用先文書の文書スコアを算出する際に引用元文書の登録頻度に基づく重み付けを反映させ、
前記検索実行手段が、前記スコアに従ってランキングされた検索結果を出力することを特徴とする文書検索装置。 A document search device that searches a group of electronic documents for electronic documents matching a search condition specified from a client terminal, comprising:
search execution means for searching the group of electronic documents using the search condition and outputting a search result;
registration means for registering, based on a user operation, a document in the search result that the client terminal has accessed;
aggregation means for aggregating the document registration information of a plurality of client terminals produced by the registration means to obtain a registration frequency for each document; and
calculation means for calculating a score of each document using the registration frequency and the citation relationships between documents,
wherein the calculation means, when calculating the document score of a cited document, reflects a weighting based on the registration frequency of the citing document, and
the search execution means outputs search results ranked according to the score.

クライアント端末から指示された検索条件に該当する電子文書を電子文書群中から検索する装置の実行する文書検索装置であって、
検索実行手段が、前記検索条件を用いて電子文書群を検索して、検索結果を出力する第１ステップと、
登録手段が、前記検索結果中、前記クライアント端末のアクセスした文書をユーザの操作に基づいて登録する第２ステップと、
集計手段が、前記登録手段による複数のクライアント端末の文書の登録情報を集計して、文書毎の登録頻度を求める第３ステップと、
算出手段が、前記登録頻度および文書間の引用関係を用いて各文書のスコアを算出する第４ステップと、
前記検索実行手段が、前記スコアに従ってランキングされた検索結果を出力する第５ステップと、を有し、
前記第４ステップは、引用先文書の文書スコアを算出する際に引用元文書の登録頻度に基づく重み付けを反映させる
ことを特徴とする文書検索方法。 A document search method executed by a device that searches a group of electronic documents for electronic documents matching a search condition specified from a client terminal, the method comprising:
a first step in which search execution means searches the group of electronic documents using the search condition and outputs a search result;
a second step in which registration means registers, based on a user operation, a document in the search result that the client terminal has accessed;
a third step in which aggregation means aggregates the document registration information of a plurality of client terminals produced by the registration means to obtain a registration frequency for each document;
a fourth step in which calculation means calculates a score of each document using the registration frequency and the citation relationships between documents; and
a fifth step in which the search execution means outputs search results ranked according to the score,
wherein the fourth step, when calculating the document score of a cited document, reflects a weighting based on the registration frequency of the citing document.

請求項２に記載の文書検索方法の各ステップをコンピュータに実行させることを特徴とする文書検索プログラム。 A document search program that causes a computer to execute each step of the document search method according to claim 2.