JP5475704B2

JP5475704B2 - Document search apparatus, document search method, and document search program

Info

Publication number: JP5475704B2
Application number: JP2011032319A
Authority: JP
Inventors: 良彦数原; 潤鈴木; 宜仁安田; 義昌小池; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-02-17
Filing date: 2011-02-17
Publication date: 2014-04-16
Anticipated expiration: 2031-02-17
Also published as: JP2012173796A

Description

The present invention relates to an apparatus and a method for presenting document search results.

In a search system such as a web search system, the search score used for the final ranking is calculated from a large number of factors (referred to as score factors), such as a score based on the frequency of query terms, for example TF-IDF (Term Frequency-Inverse Document Frequency), and a score based on link analysis, for example PageRank (see Non-Patent Document 1). A widely used method then presents the ranked search results by arranging them in descending order of the calculated search score.

Here, a function that receives a large number of score factors as input and outputs a search score is called a ranking function. To achieve a ranking with high relevance, there is a technique that generates a ranking function from manually created training data (see Non-Patent Document 2).

Non-Patent Document 2 reduces the training data to ordered pairs of documents and generates a ranking function that ranks appropriately by minimizing the errors on those ordered pairs.
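The reduction to ordered pairs can be illustrated with a minimal sketch. The function name `ordered_pairs`, the `(doc_id, relevance_grade)` tuple layout, and the sample grades are assumptions made for illustration; they are not taken from Non-Patent Document 2.

```python
from itertools import combinations


def ordered_pairs(docs):
    """Return (better, worse) document-id pairs for one query.

    docs: list of (doc_id, relevance_grade) tuples (hypothetical layout).
    Pairs with equal grades carry no ordering information and are skipped.
    """
    pairs = []
    for (id_a, rel_a), (id_b, rel_b) in combinations(docs, 2):
        if rel_a > rel_b:
            pairs.append((id_a, id_b))
        elif rel_b > rel_a:
            pairs.append((id_b, id_a))
    return pairs


# Illustrative grades only.
example = [("A", 4), ("B", 3), ("C", 3), ("D", 1)]
pairs = ordered_pairs(example)
print(pairs)
```

A ranking error on a pair occurs when the learned function scores the second document of a pair above the first.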

The search result evaluation metric used in the document search apparatus of the present invention is described in Non-Patent Document 3 below.

Non-Patent Document 1: Hiroshi Takeno, Takashi Inoue, "Distributed high-speed information collection / full-text search system InfoBee/Evangelist", NTT R&D, Vol. 52, No. 2, 2003, pp. 78-84.
Non-Patent Document 2: Thorsten Joachims, "Optimizing Search Engines using Clickthrough Data", In Proceedings of the eighth ACM international conference on Knowledge Discovery and Data mining (KDD '02), 2002, pp. 133-142.
Non-Patent Document 3: Kalervo Jarvelin and Jaana Kekalainen, "Cumulated Gain-Based Evaluation of IR Techniques", ACM Transactions on Information Systems, Vol. 20, No. 4, 2002, pp. 422-446.

In the prior art, the margin in the feature space (the importance assigned to an ordering error) is constant for all ordered pairs. As a result, all ordered pairs are treated equally (the importance of the margin is the same for every combination of orders), without distinguishing between documents that should appear near the top of the search results and documents that belong further down. Consequently, a ranking function that achieves highly accurate ranking cannot be generated, particularly with respect to evaluation metrics that emphasize the top of the search results.

The present invention solves the above problem, and its object is to provide a document search apparatus, method, and program that improve the performance of ranking function generation and thereby improve the accuracy of search ranking.

To solve the above problem, the document search apparatus of the present invention comprises: a training data database storing training data that has relevance grades of document search results for N queries and M-dimensional feature representations; margin generation means that takes the training data as input, obtains combinations of a plurality of different relevance grades in each query, obtains the amount of change in a search result evaluation metric value when the order of each combination is changed, obtains for each query a margin representing the importance of each combination of relevance grades with the largest change in the metric value as a reference, and constructs a margin database having the N queries, the combinations of relevance grades, and the obtained margins; ranking function generation means that takes the data of the training data database and the margin database as input, generates a ranking model holding score factor weights for outputting search scores that cause documents with relatively high relevance grades in the training data to be presented near the top of the search results, and constructs a ranking model database; a document index database storing a document index created from documents collected in advance from Web pages; query processing means that acquires a search result set for an input search query from the document index database and calculates a score factor value matrix from the search result set and a plurality of score factors; search score calculation means that takes as input the score factor value matrix calculated by the query processing means and the data of the ranking model database, and calculates a search score vector by multiplying the score factor value matrix by the score factor weights that serve as the ranking model in the ranking model database corresponding to the input search query; and search result presentation means that presents the search results for the input query in descending order of the search scores calculated by the search score calculation means.

According to the present invention, an appropriate margin can be set for each combination of relevance judgments in each query on the basis of the search evaluation metric. This improves the performance of ranking function generation and thus the accuracy of search ranking.

FIG. 1 is a configuration diagram of the entire document search apparatus according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of the ranking function generation device that creates the ranking model DB 104 of FIG. 1.
FIG. 3 is a flowchart showing the processing flow of the margin generation function unit 120 of FIG. 2.
FIG. 4 is a flowchart showing the processing flow of the document search apparatus of FIG. 1.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to these embodiments. First, an overview of the overall configuration of one embodiment is given.

As shown in FIG. 1, the document search apparatus 100 of this embodiment includes a document index DB (database) 101 storing document index data created from documents collected in advance from Web pages, a ranking model DB 104 storing ranking model data, a query processing unit 150 serving as query processing means, a search score calculation unit 160 serving as search score calculation means, and a search result presentation unit 170 serving as search result presentation means.

As shown in FIG. 2, the ranking model DB 104 of FIG. 1 is constructed by the processing of the ranking function generation device 110 on the basis of the data stored in the training data DB 102, which has relevance grades of document search results for N queries and M-dimensional feature representations.

The ranking function generation device 110 of FIG. 2 includes a margin generation function unit 120, which takes the training data DB 102 as input, generates for each query an importance (margin) for each combination of relevance grades of the document search results, and constructs the margin DB 103, and a ranking function generation unit 130 serving as ranking function generation means, which generates a ranking model on the basis of the data in the training data DB 102 and the margin DB 103 and constructs the ranking model DB 104.

The margin generation function unit 120 and the margin DB 103 together constitute a margin generation unit 140 serving as margin generation means.

The document search apparatus 100 shown in FIGS. 1 and 2 is configured by, for example, a computer and includes ordinary computer hardware resources such as a ROM, a RAM, a CPU, an input device, an output device, a display device, a communication interface, a hard disk, a recording medium, and a drive for the recording medium.

Through the cooperation of these hardware resources with software resources (an OS, applications, and the like), the document search apparatus 100 implements, as shown in FIGS. 1 and 2, the document index DB 101, the training data DB 102, the margin DB 103, the ranking model DB 104, the ranking function generation unit 130, the margin generation unit 140, the query processing unit 150, the search score calculation unit 160, and the search result presentation unit 170.

The document index DB 101, the training data DB 102, the margin DB 103, and the ranking model DB 104 are assumed to be constructed in storage means such as a hard disk or RAM.

Next, the apparatus configured as described above is explained in detail.

First, in FIG. 2, the ranking function generation device 110 receives the training data in the training data DB 102 as input, outputs ranking model data, and constructs the ranking model DB 104.

An example of the data structure of the training data DB 102 is shown in Table 1.

In Table 1, each row represents the feature representation and the relevance grade of a search result document for a given query. A higher relevance grade indicates a more appropriate result for that query. Because a relevance grade is assigned to a combination of query and document, the same document may receive different grades for different queries. The relevance grades are, for example, multi-level values (for example, five levels) judged and assigned by human subjects. Each document is represented by an M-dimensional feature representation, and x_1, ..., x_M denote the feature values of each dimension for that document.
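One way to picture a row of the training data is the sketch below. The class name `TrainingRecord`, its field names, and the sample values are hypothetical; only the three kinds of information (query ID, relevance grade, M-dimensional features) come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRecord:
    query_id: int      # identifies the query the judgment belongs to
    relevance: int     # graded judgment, e.g. 1 to 5 (higher is better)
    features: tuple    # (x_1, ..., x_M) for this query-document combination


def group_by_query(records):
    """Collect the records of each query, as the per-query steps require."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.query_id].append(rec)
    return dict(grouped)


# Illustrative values only (M = 2).
records = [
    TrainingRecord(1, 4, (0.9, 0.2)),
    TrainingRecord(1, 3, (0.5, 0.4)),
    TrainingRecord(2, 1, (0.1, 0.7)),
]
by_query = group_by_query(records)
print(sorted(by_query))
```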

<Margin generation function unit 120>
The margin generation function unit 120 receives the training data DB 102, performs the processing of steps S121 to S129 shown in FIG. 3, and outputs the margin DB 103. An example of the data structure of the margin DB 103 is shown in Table 2.

Table 2 gives, for each query, the margin value for every combination of relevance grades. In this example the grades are considered in descending order: the upper entry is the higher relevance grade and the lower entry is the relatively lower one. For example, the first row of Table 2 holds the margin to be applied between a document with relevance grade 4 and a document with relevance grade 3 for the query with query ID 1.

First, in step S121 of FIG. 3, an unprocessed query q is selected from the training data DB 102.

Next, in step S122, the records whose query ID corresponds to query q are acquired from the training data DB 102, and the list of the corresponding documents arranged in order of relevance grade is denoted π_q.

Next, in step S123, the set R of relevance grades of the documents for query q is acquired, and the set of relevance grade combinations R_pair = {(r_i, r_j) | r_i, r_j ∈ R, r_i > r_j} is calculated. That is, all combinations of a higher grade and a lower grade are obtained.
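As a minimal sketch of step S123, the set R_pair can be enumerated from the grades observed for one query. The function name `relevance_pairs` is an assumption for illustration.

```python
from itertools import combinations


def relevance_pairs(relevances):
    """Return all (r_i, r_j) with r_i > r_j among the distinct grades."""
    grades = sorted(set(relevances), reverse=True)
    # combinations() keeps input order, so the first element of every
    # pair is the higher grade.
    return list(combinations(grades, 2))


r_pair = relevance_pairs([4, 4, 4, 3, 3, 2, 1, 1])
print(r_pair)
```

With four distinct grades, six combinations are produced, so the number of margins per query depends on the number of grade levels rather than on the number of documents.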

Next, in step S124, an unprocessed relevance grade pair (r_i, r_j) is taken from R_pair.

Next, in step S125, the highest-ranked document with relevance grade r_i and the lowest-ranked document with relevance grade r_j in π_q are swapped, and the resulting list is denoted π_q′. For example, suppose eight documents for a query are arranged in descending order of relevance grade, with grades (A, B, C, D, E, F, G, H) = (4, 4, 4, 3, 3, 2, 1, 1). The highest-ranked document with grade 4 is A, and the lowest-ranked document with grade 3 is E. Swapping these two documents gives (E, B, C, D, A, F, G, H) = (3, 4, 4, 3, 4, 2, 1, 1). Because the search result evaluation metric is determined by the positions of the search results and their relevance grades, its value does not change when documents at different positions with the same grade are swapped.
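The swap in step S125 can be reproduced with a short sketch using the eight-document example from the text. The function name `swap_top_and_bottom` and the list-of-tuples layout are assumptions for illustration.

```python
def swap_top_and_bottom(ranking, r_i, r_j):
    """Swap the topmost document graded r_i with the lowest one graded r_j.

    ranking: list of (doc_id, relevance) sorted by relevance, descending.
    Returns a new list; the input is left unchanged.
    """
    top = next(k for k, (_, rel) in enumerate(ranking) if rel == r_i)
    bottom = max(k for k, (_, rel) in enumerate(ranking) if rel == r_j)
    swapped = list(ranking)
    swapped[top], swapped[bottom] = swapped[bottom], swapped[top]
    return swapped


pi_q = list(zip("ABCDEFGH", [4, 4, 4, 3, 3, 2, 1, 1]))
pi_q_swapped = swap_top_and_bottom(pi_q, 4, 3)
print(pi_q_swapped)
```

Choosing the topmost r_i document and the lowest r_j document makes the two swapped positions as far apart as the pair of grades allows.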

Then, using a search result evaluation metric given in advance, for example Normalized Discounted Cumulative Gain (NDCG), the decrease in the metric ΔE(r_i, r_j) = Eval(π_q) − Eval(π_q′) is calculated (that is, the amount of change in the search result evaluation metric value is obtained). NDCG can be calculated using the technique of Non-Patent Document 3.
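The decrease ΔE can be sketched as follows. The NDCG variant used here (gain 2^rel − 1 with a log2(rank + 1) discount) is one common formulation and is an assumption; the description only refers to Non-Patent Document 3 for the metric, and other gain or discount choices would change the numbers but not the procedure.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def metric_drop(original, swapped, metric=ndcg):
    """Delta E = Eval(pi_q) - Eval(pi_q')."""
    return metric(original) - metric(swapped)


pi_q = [4, 4, 4, 3, 3, 2, 1, 1]            # grades of (A, ..., H)
pi_q_swapped = [3, 4, 4, 3, 4, 2, 1, 1]    # after swapping A and E
delta = metric_drop(pi_q, pi_q_swapped)
print(round(delta, 4))
```

Because π_q is already sorted by grade, Eval(π_q) is the maximum value of the metric and every swap of different grades yields a positive decrease.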

Next, in step S126, if an unprocessed pair (r_i, r_j) remains, the process returns to step S124; otherwise it proceeds to step S127.

In step S127, because the raw decrease in the search result metric may not be a suitable value for a margin, a preset maximum margin size E_max is assigned to the combination that showed the largest decrease. That is, the scale E_scale is calculated as

$$E_{scale} = \frac{E_{max}}{\max_{(r_i, r_j) \in R_{pair}} \Delta E(r_i, r_j)}$$

Next, in step S128, the margins for all combinations of relevance grades in query q are output to the margin DB 103. Each margin is the decrease multiplied by the scale calculated in step S127, that is, E_scale ΔE(r_i, r_j). A record consisting of the four items [q, r_i, r_j, E_scale ΔE(r_i, r_j)] is therefore output to the margin DB 103.
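Steps S127 and S128 can be sketched together. The sketch assumes the scale is chosen so that the largest decrease maps exactly to E_max, which follows from the statement that the combination with the largest decrease receives E_max; the function name `scale_margins` and the sample decreases are illustrative only.

```python
def scale_margins(query_id, drops, e_max=1.0):
    """Turn metric decreases into margin records [q, r_i, r_j, margin].

    drops: dict mapping (r_i, r_j) to Delta E(r_i, r_j) for one query.
    """
    largest = max(drops.values())
    # Guard against a query whose swaps never change the metric.
    e_scale = e_max / largest if largest > 0 else 0.0
    return [(query_id, r_i, r_j, e_scale * drop)
            for (r_i, r_j), drop in drops.items()]


# Made-up decreases for illustration.
records = scale_margins(1, {(4, 3): 0.1, (4, 1): 0.4, (3, 1): 0.2}, e_max=2.0)
for rec in records:
    print(rec)
```

Scaling per query keeps the margins of different queries on a comparable range even when their raw metric decreases differ.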

Next, in step S129, if an unprocessed query remains, the process returns to step S121; otherwise the process ends.

As described above, the operation of the margin generation function unit 120 makes it possible to set a margin size for each combination of relevance judgments in each query.
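Putting steps S121 to S129 together, a compact sketch of the whole margin generation loop might look as follows. It simplifies the training data to a dict of relevance grades per query (document identities are not needed for the metric), and it assumes the NDCG variant with gain 2^rel − 1 and a log2 discount; the names `build_margin_db` and `ndcg` are hypothetical.

```python
import math
from itertools import combinations


def ndcg(relevances):
    """NDCG with gain 2^rel - 1 and log2(rank + 1) discount (assumed variant)."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(k + 1) for k, r in enumerate(rels, 1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def build_margin_db(training, e_max=1.0, metric=ndcg):
    """training: dict mapping query_id to a list of relevance grades."""
    margin_db = []
    for query_id, grades in training.items():                 # S121, S129
        pi_q = sorted(grades, reverse=True)                    # S122
        pairs = combinations(sorted(set(pi_q), reverse=True), 2)  # S123
        drops = {}
        for r_i, r_j in pairs:                                 # S124, S126
            top = pi_q.index(r_i)                              # topmost r_i
            bottom = len(pi_q) - 1 - pi_q[::-1].index(r_j)     # lowest r_j
            swapped = list(pi_q)
            swapped[top], swapped[bottom] = swapped[bottom], swapped[top]
            drops[(r_i, r_j)] = metric(pi_q) - metric(swapped)  # S125
        if not drops:
            continue                                           # single-grade query
        largest = max(drops.values())
        e_scale = e_max / largest if largest > 0 else 0.0      # S127
        for (r_i, r_j), drop in drops.items():                 # S128
            margin_db.append((query_id, r_i, r_j, e_scale * drop))
    return margin_db


margin_db = build_margin_db({1: [4, 4, 4, 3, 3, 2, 1, 1]}, e_max=1.0)
for record in margin_db:
    print(record)
```

In this example the pair (4, 1) produces the largest decrease and therefore receives the full margin E_max, while pairs of neighboring low grades receive much smaller margins.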

<Ranking function generation unit 130>
The ranking function generation unit 130 receives the training data DB 102 and the margin DB 103 as input, and outputs the ranking model DB 104, which holds score factor weights that act so that documents with relatively high relevance grades in the training data are presented near the top of the search results.

An example of the data structure of the ranking model DB 104 is shown in Table 3.

The ranking model DB 104 holds the generated ranking model, that is, the weight information for the M-dimensional feature representation. In Table 3, w_1, ..., w_M denote the M weight values.

On the basis of the training data DB 102 given as input, the ranking function generation unit 130 generates a weight vector w so as to output search scores under which documents with relatively high relevance grades are presented near the top of the search results.

The ranking function generation unit 130 can use, for example, the technique of Non-Patent Document 2. The margin DB 103 generated by the margin generation function unit 120 is used in the objective function of the ranking function generation unit 130. To incorporate the margin into the hinge error used in Non-Patent Document 2, the error function is set as in equation (1) below.
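From the symbols that the description defines for equation (1), the error function appears to be a pairwise hinge loss whose fixed margin is replaced by the per-pair margin from the margin DB 103. One form consistent with those definitions is sketched here. The squared L2 norm in the regularization term and the restriction of the inner sum to document pairs with different relevance grades are assumptions; the description names only the individual symbols.

$$\hat{w} = \operatorname*{argmin}_{w} \sum_{q} \sum_{((1),(2))} \left[ E\big(rel_q(1), rel_q(2)\big) - z_q^{(1),(2)} \, w^{\top} \big( x_q^{(1)} - x_q^{(2)} \big) \right]_{+} + \lambda \lVert w \rVert^{2}$$

Under this reading, a pair contributes no error once the score difference in the correct direction exceeds its margin, so pairs with a large margin push the weights harder than pairs with a small one.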

In equation (1), x_q^(1) and x_q^(2) are the feature representation vectors of two different documents, document (1) and document (2), for query q. E(rel_q(1), rel_q(2)) is the size of the margin for the combination of the relevance grades of document (1) and document (2), and is obtained from the margin DB 103. λ is a regularization parameter that adjusts how closely the model fits the training data; setting λ to a large value suppresses overfitting to the training data. The value of λ is set in advance (for example, 1.0).

Here, rel_q(i) is the relevance grade of document i for query q. z_q^{(1),(2)} is the sign of the difference between the relevance grades of document (1) and document (2), and is calculated as z_q^{(1),(2)} ≡ sign(rel_q(1) − rel_q(2)), where sign is the sign function that returns 1 for a positive value, −1 for a negative value, and 0 for 0.

[·]_+ is an operation that returns its argument only when the argument is positive, and always returns 0 when the argument is less than 0.

Here, argmin_x f(x) is a function that returns the x at which the function f(x), which takes x as its argument, attains its minimum. w is set so that the sum of the errors shown in equation (1) over all ordered pairs in all queries is minimized. An optimization method such as a gradient method can be used with the training data DB 102 to search for the weight parameter w with equation (1) as the error function, and the weight parameter is obtained in this way.
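The search for w can be sketched with plain batch subgradient descent on a margin-augmented pairwise hinge loss. This is an illustrative stand-in, not the optimizer of the embodiment: the description says only that a gradient method or similar can be used, and the squared L2 regularizer, the learning rate, the epoch count, and the function name `train_ranking_weights` are all assumptions. Each pair is assumed to be pre-arranged so that the first vector belongs to the more relevant document (z = +1).

```python
def train_ranking_weights(pairs, dim, lam=1.0, lr=0.01, epochs=200):
    """Minimize sum_pairs [margin - w.(x_hi - x_lo)]_+ + lam * ||w||^2.

    pairs: list of (x_hi, x_lo, margin); x_hi is the feature vector of the
    document with the higher relevance grade, margin comes from the margin DB.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [2.0 * lam * wj for wj in w]        # gradient of the regularizer
        for x_hi, x_lo, margin in pairs:
            diff = [a - b for a, b in zip(x_hi, x_lo)]
            score_gap = sum(wj * dj for wj, dj in zip(w, diff))
            if margin - score_gap > 0:              # hinge term is active
                grad = [g - d for g, d in zip(grad, diff)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w


# Illustrative pairs (M = 2) with made-up margins.
pairs = [
    ((1.0, 0.0), (0.0, 1.0), 1.0),
    ((0.8, 0.1), (0.2, 0.3), 0.5),
]
w = train_ranking_weights(pairs, dim=2)
gaps = [sum(wj * (a - b) for wj, a, b in zip(w, x_hi, x_lo))
        for x_hi, x_lo, _ in pairs]
print(w, gaps)
```

A positive gap for a pair means the learned weights score the more relevant document higher; how close each gap comes to its margin depends on λ.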

Next, the document search apparatus 100 of FIG. 1 is described in detail with reference to the flowchart of FIG. 4.

<Query processing unit 150>
The query processing unit 150 receives a search query as input, acquires from the document index DB 101 the search result set (documents) containing the search query, and calculates a score factor value matrix from the search result set and a plurality of score factors (step S150).

Specifically, when N search results are acquired from the document index DB 101 and M score factors are used, the score factor value matrix is expressed as

$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1M} \\ d_{21} & d_{22} & \cdots & d_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{NM} \end{pmatrix}$$

Here, the i-th row of D holds the score factor values of the i-th search result. For example, d_23 is the third score factor value for the second document.
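How the matrix D might be filled can be sketched as follows. The two score factors (`term_frequency` and `link_score`), the dict layout of a search result, and the sample documents are hypothetical stand-ins; the description does not specify which score factors the embodiment computes or how the document index is queried.

```python
def term_frequency(query, doc):
    """Hypothetical score factor: occurrences of query terms in the text."""
    words = doc["text"].lower().split()
    return sum(words.count(term) for term in query.lower().split())


def link_score(query, doc):
    """Hypothetical score factor: a precomputed link-analysis score."""
    return doc["link_score"]


SCORE_FACTORS = [term_frequency, link_score]   # M = 2 in this sketch


def score_factor_matrix(query, results, factors=SCORE_FACTORS):
    """Return D as a list of N rows, each holding M score factor values."""
    return [[float(factor(query, doc)) for factor in factors]
            for doc in results]


results = [
    {"text": "ranking function for document search", "link_score": 0.3},
    {"text": "document search and document ranking", "link_score": 0.8},
]
D = score_factor_matrix("document search", results)
print(D)
```

Row i of the nested list corresponds to the i-th search result and column j to the j-th score factor, matching the layout of D.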

<Search score calculation unit 160>
The search score calculation unit 160 receives as input the score factor value matrix D output by the query processing unit 150, the data of the ranking model DB 104, and the input search query q_input.

The search score calculation unit 160 acquires the score factor weight w from the ranking model DB 104 and calculates a search score vector from the score factor weight w and the score factor value matrix D (step S160).

The search score vector s used for search ranking is obtained as the product of the score factor value matrix D and the score factor weight w.

That is, the search score s_i for the i-th document is calculated by

$$s_i = \sum_{j=1}^{M} d_{ij} w_j$$
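The product of D and w can be written out in a few lines. The function name `search_scores` and the numeric values are illustrative only.

```python
def search_scores(D, w):
    """Return s, where s[i] is the dot product of row i of D with w."""
    return [sum(d_ij * w_j for d_ij, w_j in zip(row, w)) for row in D]


# Illustrative values: N = 3 search results, M = 2 score factors.
D = [
    [2.0, 0.3],
    [3.0, 0.8],
    [1.0, 0.9],
]
w = [0.5, 2.0]
s = search_scores(D, w)
print(s)
```

Because the score is linear in the score factors, computing it for all N results is a single matrix-vector product.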

<Search result presentation unit 170>
The search result presentation unit 170 receives the calculated search score vector s and presents the search results for the query (displays them or outputs them as data) in descending order of the search score s_i (step S170).
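The presentation order of step S170 amounts to a sort by score. In this sketch the function name `rank_results` and the document identifiers are assumptions for illustration.

```python
def rank_results(doc_ids, scores):
    """Return (doc_id, score) tuples in descending order of score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(doc_ids[i], scores[i]) for i in order]


ranked = rank_results(["doc-a", "doc-b", "doc-c"], [1.6, 3.1, 2.3])
for position, (doc_id, score) in enumerate(ranked, start=1):
    print(position, doc_id, score)
```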

Needless to say, some or all of the functions of the means in the document search apparatus using the ranking function generation device with the margin generation function of this embodiment can be implemented as a computer program, and the present invention can be realized by executing that program on a computer. Likewise, the procedure of the document search method using the ranking function generation device with the margin generation function of this embodiment can be implemented as a computer program and executed by a computer. The program for realizing these functions on a computer can be recorded on a computer-readable recording medium, for example an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, such as the Internet or electronic mail.

Description of symbols
100 ... Document search apparatus
101 ... Document index DB
102 ... Training data DB
103 ... Margin DB
104 ... Ranking model DB
110 ... Ranking function generation device
120 ... Margin generation function unit
130 ... Ranking function generation unit
140 ... Margin generation unit
150 ... Query processing unit
160 ... Search score calculation unit
170 ... Search result presentation unit

Claims

A document search apparatus comprising:
a training data database storing training data that has relevance grades of document search results for N queries and M-dimensional feature representations;
margin generation means that takes the training data as input, obtains combinations of a plurality of different relevance grades in each query, obtains the amount of change in a search result evaluation metric value when the order of each combination is changed, obtains for each query a margin representing the importance of each combination of relevance grades with the largest change in the metric value as a reference, and constructs a margin database having the N queries, the combinations of relevance grades, and the obtained margins;
ranking function generation means that takes the data of the training data database and the margin database as input, generates a ranking model holding score factor weights for outputting search scores that cause documents with relatively high relevance grades in the training data to be presented near the top of the search results, and constructs a ranking model database;
a document index database storing a document index created from documents collected in advance from Web pages;
query processing means that acquires a search result set for an input search query from the document index database and calculates a score factor value matrix from the search result set and a plurality of score factors;
search score calculation means that takes as input the score factor value matrix calculated by the query processing means and the data of the ranking model database, and calculates a search score vector by multiplying the score factor value matrix by the score factor weights that serve as the ranking model in the ranking model database corresponding to the input search query; and
search result presentation means that presents the search results for the input query in descending order of the search scores calculated by the search score calculation means.

A document search method comprising:
a margin generation step in which margin generation means of a document search apparatus takes as input training data in a training data database storing training data that has relevance grades of document search results for N queries and M-dimensional feature representations, obtains combinations of a plurality of different relevance grades in each query, obtains the amount of change in a search result evaluation metric value when the order of each combination is changed, obtains for each query a margin representing the importance of each combination of relevance grades with the largest change in the metric value as a reference, and constructs a margin database having the N queries, the combinations of relevance grades, and the obtained margins;
a ranking function generation step in which ranking function generation means of the document search apparatus takes the data of the training data database and the margin database as input, generates a ranking model holding score factor weights for outputting search scores that cause documents with relatively high relevance grades in the training data to be presented near the top of the search results, and constructs a ranking model database;
a query processing step in which query processing means of the document search apparatus acquires a search result set for an input search query from a document index database storing a document index created from documents collected in advance from Web pages, and calculates a score factor value matrix from the search result set and a plurality of score factors;
a search score calculation step in which search score calculation means of the document search apparatus takes as input the score factor value matrix calculated by the query processing means and the data of the ranking model database, and calculates a search score vector by multiplying the score factor value matrix by the score factor weights that serve as the ranking model in the ranking model database corresponding to the input search query; and
a search result presentation step in which search result presentation means of the document search apparatus presents the search results for the input query in descending order of the search scores calculated by the search score calculation means.

A document search program for causing a computer to function as each of the means according to claim 1.