JP2010067011A

JP2010067011A - Method and device for detecting document group

Info

Publication number: JP2010067011A
Application number: JP2008232784A
Authority: JP
Inventors: Satoko Shiga; 聡子志賀; Tomoya Iwakura; 友哉岩倉; Takehisa Ando; 剛寿安藤; Aoshi Okamoto; 青史岡本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2008-09-11
Filing date: 2008-09-11
Publication date: 2010-03-25
Anticipated expiration: 2028-09-11
Also published as: JP5163379B2

Abstract

PROBLEM TO BE SOLVED: To automatically detect a document group that is useful to users, based on user behavior, by analyzing users' search logs for each document group according to a feature rule that characterizes the document group to be detected, calculating feature scores, and detecting a document group whose features satisfy a condition based on those scores. SOLUTION: An information extracting means 12 analyzes the search log stored in a search log DB 11a and identifies the document group to which each acquired document belongs. A feature counting means 13 analyzes the search log for each document group and tallies feature scores according to a predetermined feature rule. Based on the tallied feature scores, a document group determining means 14 detects a document group whose features satisfy a predetermined condition and registers it as an information providing document group candidate. A document group presenting means 15 presents the detected candidates to the user. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a document group detection method and a document group detection apparatus, and more particularly to a method and an apparatus for detecting a predetermined document group, that is, a set of documents provided on a network and managed by one or more computers.

In recent years, rapid technical progress has produced new technical terms daily, and encyclopedias and dictionaries printed on paper can no longer keep up with them. On networks, meanwhile, there are document groups, each a set of documents that explain such technical terms and is managed by one or more computers. The World Wide Web (hereinafter, WWW) provided on the Internet, currently the most widespread example, contains many such document groups. A document on the WWW is called a Web page, and a document group, or the location on the Internet where a document group is placed, is called a Web site. Hereinafter, a Web site that gathers documents explaining technical terms is called a dictionary site or a term explanation site, and each of its Web pages is called an explanation page. The explanation pages of dictionary sites are updated daily, so using them gives access to explanations of the latest terms.

There are also systems that use such dictionary sites to automatically attach, to terms appearing in the text of an arbitrary Web page, links to the corresponding explanation pages of a dictionary site. This processing is called autolinking.

FIG. 27 is a diagram showing an outline of an autolink system.
The autolink system includes a storage device that stores an autolink dictionary 91, and an autolink engine 90. In the autolink dictionary 91, each word to be autolinked is registered in association with the URL (Uniform Resource Locator) of the explanation page for that word. The autolink engine 90 analyzes a target HTML (Hyper Text Markup Language) document 93 and, when it detects in the text a word registered in the autolink dictionary 91, attaches a link to the URL associated with that word. The resulting HTML document (with links) 94 is output and used by the user.
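
The dictionary lookup performed by the autolink engine can be illustrated with a minimal Python sketch. It is not the implementation of the system in FIG. 27: the function name `autolink` is hypothetical, the input is assumed to be plain text rather than a full HTML document, and the longest-match-first policy is an assumption added for the example.

```python
import html
import re


def autolink(text: str, dictionary: dict[str, str]) -> str:
    """Wrap each registered word found in plain text with a link to its explanation page.

    `dictionary` plays the role of the autolink dictionary: word -> explanation page URL.
    Assumption: `text` is plain text, so no existing markup has to be skipped.
    """
    if not dictionary:
        return html.escape(text)
    # Try longer words first so a longer entry wins over a shorter one it contains.
    words = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in words))
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[pos:match.start()]))
        word = match.group(0)
        url = html.escape(dictionary[word], quote=True)
        parts.append(f'<a href="{url}">{html.escape(word)}</a>')
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

A real engine would parse the HTML document and avoid linking inside existing anchors or tags; the sketch only shows the word-to-URL substitution.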

However, the number of pages on the WWW is enormous and is still growing. Finding a Web page that explains a desired term among so many pages is not easy. WWW search engines are generally used as tools for finding Web pages related to a term, but they return many unwanted pages, and reaching the desired Web page is very difficult.

For this reason, an encyclopedia system has been proposed that finds information explaining a specific phrase on other Web pages and extracts and presents the explanatory part (see, for example, Patent Document 1).
Patent Document 1: JP 2003-85181 A

However, the conventional encyclopedia system has the problem that it does not identify whether a document provided on the network belongs to a document group that meets a predetermined condition.
A conventional encyclopedia system on the Internet can find a Web page that explains a term, but it does not consider what kind of site that Web page belongs to. For example, when the explanation of a term is linked to that term in a text, the link destination should preferably be a dictionary site, because information published on Web pages outside dictionary sites often carries no guarantee of neutrality or generality. Simply linking a term to some Web page that explains it therefore cannot guarantee the neutrality and generality of the explanation. Such pages must be excluded as link destinations as far as possible, and pages belonging to dictionary sites must be selected instead.

For these reasons, dictionary sites have been detected manually in conventional autolink systems. Entries (information associating a word with the URL of its explanation page) were then extracted from the detected dictionary sites and registered in the autolink dictionary 91. However, detecting appropriate dictionary sites among an enormous number of Web pages is not an easy task. Because the work is manual, dictionary maintenance is costly, and the provider of an autolink service cannot add dictionaries frequently. Automation of autolink dictionary creation has therefore been desired.

Beyond autolinking, automating the maintenance of dictionaries that associate terms with the URLs of related pages is an important problem in general.
The present invention has been made in view of these points, and its object is to provide a document group detection method and a document group detection apparatus that detect an appropriate document group containing the target documents provided on a network.

To solve the above problems, there is provided a document group detection method for detecting a predetermined document group, that is, a set of documents provided on a network and managed by one or more computers. In this method, information extraction means reads a search log from search log storage means and extracts the document group identification information recorded in the search log. For each document that was retrieved under an arbitrary search condition and acquired by a user on the basis of the search result, the search log records an address containing identification information of the document and identification information of the document group to which the document belongs. Next, feature counting means classifies the search log by the document group identification information extracted by the information extraction means. It then analyzes the search log classified for each document group and tallies, for each document group, feature scores according to a feature rule that characterizes the document group to be detected. Next, document group determination means determines, based on the feature scores tallied for each document group by the feature counting means, whether the document group satisfies the condition for a detection target document group defined by the feature rule, and registers each document group that satisfies the condition as an information providing document group candidate. Document group presentation means then presents to the user the identification information of the document groups registered as information providing document group candidates by the document group determination means.

According to this document group detection method, the search log is analyzed and the identification information of the document group to which each user-acquired document belongs is extracted. The search log is then classified by the extracted document group identification information, and feature scores according to the feature rule characterizing the detection target document group are tallied for each document group. Based on these feature scores, document groups having the characteristics of the detection target are selected and registered as information providing document group candidates. The registered candidates are presented to the user.

To solve the above problems, there is also provided a document group detection apparatus in which a computer executes the above document group detection method.

According to the disclosed document group detection method and document group detection apparatus, the viewers' search log is analyzed for each document group on the basis of a feature rule representing the detection target document group, and a feature score is calculated for each. Document groups whose features meet the condition are then detected on the basis of these feature scores. Document groups that are useful to viewers can thus be detected automatically on the basis of viewer behavior.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. An outline of the invention is given first, followed by the specific details.
FIG. 1 is a diagram showing an outline of the invention.

The document group detection apparatus 10 includes storage means, namely a search log database (hereinafter, DB) 11a, a feature rule DB 11b, a summary information DB 11c, a document group candidate DB 11d, and a document group DB 11e, together with information extraction means 12, feature counting means 13, document group determination means 14, and document group presentation means 15.

The search log DB 11a accumulates search logs that record, for each document a viewer acquired through a search, the search condition, the search result, access destination information for the document, information indicating the content of the document, and so on. The access destination information is, for example, the address from which the document was acquired. Such an address normally contains identification information that identifies the document and identification information that identifies the document group, that is, the set of documents containing it. Take as an example the access URL "http://aaa.com/rss.html" used to acquire a Web page (document) on the Internet. This access URL contains "rss.html", the identification information of the document, and "aaa.com", the identification information of the document group (the identification information of the server that manages the document group). The search condition holds the search keyword used when the document was searched for and other conditions in effect at that time. The feature rule DB 11b stores feature rules for detecting the target document group, based on the characteristics of that document group. The summary information DB 11c stores, for each document group, the tallied feature scores obtained by analyzing the search log of that document group according to the feature rules. The document group candidate DB 11d stores the identification information of document groups whose feature scores satisfy the criteria and which were selected as information providing document group candidates. The document group DB 11e stores the identification information of the document groups that the user has registered for use.

The information extraction means 12 reads the search log stored in the search log DB 11a and extracts the document group identification information from the address, recorded in the search log, from which each document was acquired. For example, it extracts the upper URL (domain name) indicating the Web site from the access URL of a Web page recorded in the search log. The extracted document group identification information is sent to the feature counting means 13 together with the search log.
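
As a minimal sketch of this extraction step, the following Python function splits an access URL into document group identification information (the host name) and document identification information (the last path segment). The function name `split_access_url` is hypothetical, and treating the full host name as the upper URL is an assumption; the document does not state here how subdomains are handled.

```python
from urllib.parse import urlsplit


def split_access_url(access_url: str) -> tuple[str, str]:
    """Return (document group id, document id) for an access URL.

    Assumption: the document group is identified by the host name and the
    document by the last path segment, as in "http://aaa.com/rss.html".
    """
    parts = urlsplit(access_url)
    if parts.hostname is None:
        raise ValueError(f"no host name in URL: {access_url!r}")
    document_id = parts.path.rsplit("/", 1)[-1]
    return parts.hostname, document_id
```

For the example URL in the text, `split_access_url("http://aaa.com/rss.html")` yields `("aaa.com", "rss.html")`.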

The feature counting means 13 classifies the search log by the extracted document group identification information. It then analyzes the search log classified for each document group and tallies feature scores according to feature rules that represent the characteristics of the detection target document group. The characteristics of the document group and the method of tallying the feature scores are defined in advance as feature rules and stored in the feature rule DB 11b. For example, if the document group being sought corresponds to a dictionary, that is, a set of documents explaining terms, it has characteristics specific to dictionaries. Suppose one such characteristic is that the titles of documents belonging to the document group often contain the phrase 「とは」 (roughly, "what is ..."). In that case the information indicating document content is extracted from the search log, and a feature score is tallied according to how often 「とは」 appears. When the feature score is defined as the number of occurrences, the occurrences of 「とは」 in the content information of the search log entries for the document group are counted. When it is defined as the occurrence rate, the rate at which 「とは」 appears in the content information of all search log entries for the document group is calculated. Detection of the feature phrase may be restricted to a narrower range, such as the title of the document. Feature scores need not be based on feature phrases appearing in the content information; for example, the result of analyzing the search keywords used in the searches can also be used. In this way, multiple feature scores representing the characteristics of a document group from different viewpoints are calculated for each document group.
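
The count and rate variants of this feature score can be sketched in Python as follows. The sketch assumes each search log entry has already been reduced to a (document group id, title) pair and counts an entry once if its title contains the phrase; both the input shape and the function name `title_phrase_scores` are assumptions made for illustration.

```python
from collections import defaultdict


def title_phrase_scores(entries, phrase="とは"):
    """Tally, per document group, how many titles contain `phrase` and at what rate.

    `entries` is an iterable of (group_id, title) pairs (an assumed log shape).
    Returns {group_id: (count, rate)}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group_id, title in entries:
        totals[group_id] += 1
        if phrase in title:
            hits[group_id] += 1
    return {g: (hits[g], hits[g] / totals[g]) for g in totals}
```

Keeping both values allows a later determination step to apply a threshold to either the count or the rate.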

The document group determination means 14 determines, based on the feature scores tallied for each document group by the feature counting means 13, whether the document group satisfies the condition for the detection target document group. Determination conditions such as thresholds are stored in advance in the feature rule DB 11b. In the example above, thresholds are set for the number of occurrences and the occurrence rate, and the condition is that the tallied result exceeds the threshold. A document group that satisfies the condition is selected as an information providing document group candidate and registered in the candidate list. The list of information providing document group candidates is stored in the document group candidate DB 11d. The determination may combine the results of any of the feature scores. Combining multiple feature scores increases the likelihood that an information providing document group candidate really is a detection target document group.
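
A minimal Python sketch of such a threshold-based determination follows. The text leaves open how multiple feature scores are combined; requiring every listed feature to exceed its threshold is one possible combination, chosen here as an assumption. The function name `select_candidates` and the feature names used in the test are hypothetical.

```python
def select_candidates(scores, thresholds):
    """Return the document groups whose feature scores all exceed their thresholds.

    scores:     {group_id: {feature_name: value}}
    thresholds: {feature_name: threshold}
    Assumption: features are combined with a logical AND; a missing score counts as 0.
    """
    return [
        group_id
        for group_id, features in scores.items()
        if all(features.get(name, 0) > limit for name, limit in thresholds.items())
    ]
```

Other combinations, such as a weighted sum compared against a single threshold, would fit the same description equally well.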

The document group presentation means 15 presents to the user the identification information of the document groups registered as information providing document group candidates by the document group determination means 14. It then registers, as target document groups, the identification information of the candidates that the user selects. The list of selected target document groups is stored in the document group DB 11e.

A document group detection method performed by the document group detection apparatus 10 configured in this way will now be described.
The search log DB 11a stores in advance, as a search log, the history of documents that viewers acquired by searching under given search conditions. For each acquired document, the search log records the address (containing the identification information of the document and of the document group to which it belongs), the search keyword used, and information indicating the content of the document (title, abstract, and so on). The feature rule DB 11b stores feature rules that define the characteristics of the detection target document group and the conditions under which a document group is judged to be a detection target.

On receiving a request to start processing, the document group detection apparatus 10 reads the search log stored in the search log DB 11a and extracts the document group identification information from the addresses. The feature counting means 13 classifies the search log by document group, analyzes it, and tallies for each document group the feature scores according to the feature rules representing the characteristics of the detection target document group. For example, the number of occurrences or the occurrence rate of predetermined words in search keywords, document titles, or document text is tallied as a feature score. The document group determination means 14 then determines whether the feature scores tallied for each document group by the feature counting means 13 satisfy the condition for the target document group, and registers each document group that does as an information providing document group candidate. The document group presentation means 15 presents the detected candidates to the user.

As described above, the disclosed document group detection apparatus 10 can detect the detection target document groups among documents that were actually retrieved, on the basis of the search log, which is a record of searches that viewers actually performed. Whether a document group is a detection target is determined by tallying feature scores for each document group identified from the search log and checking whether the tallied scores satisfy the condition for the target document group. Document groups that are useful to viewers can thus be detected automatically on the basis of viewer behavior.

The embodiment is described in detail below with reference to the drawings, taking as an example its application to a dictionary site detection system that detects dictionary sites by analyzing documents provided on the Internet. The detected dictionary sites are used as dictionary candidates for an autolink system or the like. In this embodiment, the document that a viewer acquires through a search is a Web page, and the document group is a Web site, that is, a set of Web pages. A group of Web pages is managed by one or more computers, and the identifier of that group of computers on the Internet is the domain. A Web site can therefore be identified by the domain common to the URLs of its Web pages.

FIG. 2 is a diagram showing a configuration example of the dictionary site detection system.
In the dictionary site detection system, a dictionary site detection server 100 that identifies dictionary sites, a client device 200 of a user who instructs dictionary site detection, and a search server 300 that generates and accumulates search logs are connected via a network 400.

The dictionary site detection server 100 is a document group detection apparatus. It acquires the search log accumulated by the search server 300 and analyzes it to detect dictionary site candidates. The client device 200 is the device of the creator of an autolink dictionary and includes a browser 210 and input means 220. The browser 210 displays detection results and other data acquired in HTML format from the dictionary site detection server 100 on a display device (not shown). The input means 220 accepts the creator's instructions and notifies the dictionary site detection server 100 of them. When a viewer enters a search request, the search server 300 sends the requester page information that includes the access URLs of multiple Web pages matching the requested search keyword. At this time it records, as a search log, information such as the access date and time, the user cookie, the access destination of the Web page the viewer selected, and the title of the access destination. The network 400 is, for example, the Internet.

The configuration of the dictionary site detection server 100 is as follows. The dictionary site detection server 100 includes a search log DB 110, an information extraction unit 120, a feature rule DB 130, a feature counting unit 140, a summary information DB 150, a dictionary site determination unit 160, a dictionary site candidate DB 170, a dictionary site presentation unit 180, and a dictionary site DB 190.

The search log DB 110 stores the search log acquired from the search server 300 via the network 400 as a search log table. In addition to the search date and time, the search log table contains the search condition, information on the access destination visited on the basis of the search result, and so on. Here, a search keyword is assumed to be recorded as the search condition. As information on the access destination, the access URL used to access the Web page is recorded, and as its content information, the title extracted from the Web page at that access URL is recorded. Details are described later.

The information extraction unit 120 is information extraction means that reads the search log stored in the search log DB 110 and extracts the information needed to grasp site features, including the URL of the Web site. It reads the search log stored in the search log DB 110 and extracts the upper URL (domain name) from the access URL contained in each search log entry. An access URL, which is the URL of a Web page, generally takes the form "http: (scheme name) // server name (domain name) / file name". A domain name also contains predefined character strings, and the domain name can be extracted from the access URL on the basis of this information. The URL consisting of this domain name corresponds to the URL of the Web site and is the upper URL relative to the access URL. The upper URL extracted from the access URL in this way is passed to the feature counting unit 140 together with the search keyword, the access URL, and the title of the access URL from the search log.

The feature rule DB 130 stores feature rules for detecting the target dictionary sites, based on the characteristics of dictionary sites. A feature rule defines the feature items used to judge a dictionary site, the feature score thresholds at which a site is judged to be a dictionary site, and so on. Here, the following feature rules are used as characteristics by which a dictionary site can be detected from an analysis of the search log. Feature rule 1: a dictionary site receives many accesses to the same site (domain) from different words. Feature rule 2: the title of each explanation page of a dictionary site contains a fixed phrase such as 「とは」 ("what is ..."), 「用語事典」 ("glossary"), or 「意味解説」 ("explanation of meaning"). Feature rule 3: in a dictionary site, Web pages satisfying feature rule 2 account for a high proportion of the site. Feature rule 4: there are multiple dictionary sites for the terms of a given field, and two dictionary sites in the same field hold similar terms and are therefore accessed with similar queries. Feature rule 4 is a rule for removing uncertain Web sites from the site candidates selected using feature rules 1, 2, and 3. Details of these feature rules are described later.
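
Feature rule 4 depends on some measure of how similar the queries reaching two sites are. This passage does not specify that measure, so the following Python sketch substitutes a common one, the Jaccard index over the sets of search keywords, purely as an illustrative assumption. The function name `query_overlap` is hypothetical.

```python
def query_overlap(keywords_a: set, keywords_b: set) -> float:
    """Jaccard index of two sites' search keyword sets (an assumed similarity measure).

    Returns a value in [0.0, 1.0]; two empty sets are treated as having no overlap.
    """
    union = keywords_a | keywords_b
    if not union:
        return 0.0
    return len(keywords_a & keywords_b) / len(union)
```

Under this assumption, a candidate site with no sufficiently similar partner site would be the kind of uncertain candidate that feature rule 4 is meant to remove.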

The feature tabulation unit 140 is a feature tabulation means that has a target extraction unit 141, a feature score totaling unit 142, and a similar site check unit 143. From the upper URL, search keyword, access URL, and access URL title acquired from the information extraction unit 120, the target extraction unit 141 extracts pairs of a target upper URL and a search keyword. It also extracts pairs of an access URL and the title of that access URL. This information is used by the feature score totaling unit 142 in the next stage. The feature score totaling unit 142 analyzes the site and search keyword pairs and the access URL and title pairs extracted by the target extraction unit 141, and tabulates a feature score for each feature item. For example, whether a Web site satisfies feature rule 1, "the same site receives many accesses from different words", can be verified from the number of distinct search keywords associated with the upper URL (identical keywords are not counted twice). The diversity of search keywords is therefore treated as a feature item, and the number of distinct search keywords is tabulated as its feature score. Feature scores representing the other feature rules are tabulated in the same way. The similar site check unit 143 detects similar sites in order to check feature rule 4, "there are multiple dictionary sites for the terms of a given field".

The tabulation information DB 150 stores the tabulation data used in the processing of the feature tabulation unit 140: an upper URL table 151, an access URL table 152, and tabulation information 153. In the upper URL table 151, each upper URL (domain name) extracted from the search log is registered in association with a search keyword. The domain name constitutes the URL of the Web site that contains the Web page recorded in the search log. In the access URL table 152, the URL of each Web page extracted from the search log is registered in association with the title of that Web page. The tabulation information 153 stores the totals, calculated by the feature score totaling unit 142, for the feature items that represent the characteristics of each Web site. Details are described later.

The dictionary site determination unit 160 is a document group determination means that determines whether an upper URL is a dictionary site on the basis of the feature item totals produced by the feature tabulation unit 140. For example, when the feature score for a given feature item is larger than its threshold, the site is judged likely to be a dictionary site. A site judged likely to be a dictionary site in this way is registered as a site candidate in the dictionary site candidate table, which is stored in the dictionary site candidate DB 170.

The dictionary site candidate DB 170 stores the dictionary site candidate table, in which the Web sites registered as site candidates by the dictionary site determination unit 160 are recorded together with their feature scores. Details are described later.

The dictionary site presentation unit 180 is a document group presentation means that, when the user makes a request via the client device 200, causes the client device 200 to display a list of the site candidates registered in the dictionary site candidate table of the dictionary site candidate DB 170. It also confirms a dictionary site candidate selected via the client device 200 as a dictionary site and registers it in the dictionary site table.

The dictionary site DB 190 stores the dictionary site table, which holds the sites that the user has registered as dictionary sites.
The hardware configuration of the dictionary site detection server is described next. FIG. 3 is a block diagram showing an example of the hardware configuration of the dictionary site detection server.

The dictionary site detection server 100 is controlled as a whole by a CPU (Central Processing Unit) 101. A RAM (Random Access Memory) 102, a hard disk drive (HDD) 103, and a communication interface 104 are connected to the CPU 101 via a bus 105.

The RAM 102 temporarily stores at least part of the OS (Operating System) program and the application programs executed by the CPU 101. The RAM 102 also stores various data needed for processing by the CPU 101. The HDD 103 stores the OS and the application programs. The communication interface 104 is connected to the network 400 and exchanges data with the client device 200 and the search server 300 via the network 400.

With this hardware configuration, the processing functions of the dictionary site detection server 100 can be realized. Instructions to the dictionary site detection server 100 are entered through the input means 220 of the client device 200 and sent over the network 400. For detection results and the like, the dictionary site detection server 100 generates display information and transmits it to the client device 200, which shows it on its display device.

The operation of the dictionary site detection system configured in this way and the procedure of the dictionary site detection process are described below.
First, the feature rules and feature items used for dictionary site detection are described. FIG. 4 is a diagram for explaining the feature rules of dictionary sites. (A) shows an example of the search keywords and pages of a dictionary site, and (B) shows an example of the search keywords and pages of a site that is not a dictionary site.

In the dictionary site example of (A), the dictionary site 500 has pages 511, 512, and 513 that explain terms. The figure shows that the search keyword "VPN" 521 led to the explanatory page "VPNとは" ("What is VPN") 511 of the dictionary site 500. Likewise, the search keyword "RSS" 522 led to the explanatory page "RSSとは" ("What is RSS") 512, and the search keyword "LAN" 523 led to the explanatory page "LANとは" ("What is LAN") 513.

In the non-dictionary example of (B), the ramen shop introduction site 501 has pages 531, 532, and 533. As in (A), the figure shows that the search keyword "ラーメン" ("ramen") 541 led to the page "人気ラーメン店一覧" ("List of popular ramen shops") 531, the search keyword "＊＊ラーメン" ("** ramen") 542 led to the page "ランキング" ("Ranking") 532, and the search keyword "ラーメン" 543 led to the page "ラーメンとは" ("What is ramen") 533.

Feature rule 1 of a dictionary site is that "the same site (domain) receives many accesses from different words". For example, when users search for the ramen shop introduction site 501, which is not a dictionary site, the search keyword usually contains the word "ramen". In the example in the figure, the search keywords 541, 542, and 543 all contain "ramen". In contrast, the search keywords for the dictionary site 500 are in most cases the terms explained on the individual explanatory pages, so they differ from one another. In the example in the figure, the search keyword 521 is "VPN", the search keyword 522 is "RSS", and the search keyword 523 is "LAN", all of which are different words.

To verify whether feature rule 1 is satisfied, pairs of an upper URL (site domain) and a search keyword are extracted and counted. If the same search keyword was used several times for the same upper URL, it is counted only once. This yields, for each Web site, the number of distinct search keywords used. Hereinafter, the feature item based on feature rule 1 is called "keyword", and the number of distinct search keywords tabulated for each upper URL (hereinafter simply the number of keywords) is the feature score S1. In the example in the figure, the feature score S1 based on the number of keywords is 3 for (A) and 1 for (B).
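The counting step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `keyword_counts`, the input layout (a list of upper URL and keyword pairs), and the domain `dict.example` are assumptions made for the example.

```python
from collections import defaultdict


def keyword_counts(pairs):
    """Return feature score S1: distinct search keywords per upper URL."""
    seen = defaultdict(set)
    for upper_url, keyword in pairs:
        # A set collapses repeated (upper URL, keyword) pairs into one entry.
        seen[upper_url].add(keyword)
    return {site: len(keywords) for site, keywords in seen.items()}


# Hypothetical log excerpt modeled on FIG. 4(A); the domain is made up.
pairs = [
    ("dict.example", "VPN"),
    ("dict.example", "RSS"),
    ("dict.example", "LAN"),
    ("dict.example", "VPN"),  # repeated search: counted only once
]
s1 = keyword_counts(pairs)
```

Using a set per upper URL makes the "count only once" rule implicit, so no separate duplicate check is needed.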

Feature rule 2 of a dictionary site is that "the title of each page contains a fixed phrase". For example, many page titles in the dictionary site 500 contain the phrase "とは" ("what is ..."). In the example in the figure, the titles of pages 511, 512, and 513 all contain "とは". In the ramen shop introduction site 501, by contrast, "とは" appears only in page 533. This characteristic can, however, also appear in pages that do not belong to a dictionary site. For example, the ramen shop introduction site 501 might consist of pages such as "What is Sapporo ramen", "What is Asahikawa ramen", and "What is Kumamoto ramen". A genuine dictionary site, however, should also satisfy feature rule 1. Judging whether feature rule 2 is satisfied in combination with feature rule 1 therefore raises the accuracy of the dictionary site candidates.

Here, to detect whether feature rule 2 is satisfied, the link between each search keyword and the title of the accessed Web page is examined from the search keyword side, and the Web pages containing a feature phrase (for example, "とは") are tallied. That is, feature rule 2 is interpreted as the number of search keywords that reached a Web page containing a feature phrase, and multiple Web pages containing a feature phrase that were reached from the same search keyword are counted as 1. The feature item obtained under feature rule 2 by tabulating the number of search keywords that reached a page containing a feature phrase is called the "title feature", and the number of title features tabulated for each upper URL is the feature score S2. In the example in the figure, when the rule of counting titles containing "とは" is applied, the feature score S2 based on the number of title features is 3 for (A) and 1 for (B).
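A minimal sketch of this tally is shown below. The function name `title_feature_counts`, the row layout, the phrase tuple, and the two domains are assumptions for illustration; the titles and keywords follow FIG. 4, and the last dictionary row is a hypothetical second page added to show that one keyword is counted once.

```python
FEATURE_PHRASES = ("とは",)  # assumed phrase list; the rule DB could hold more


def title_feature_counts(rows, phrases=FEATURE_PHRASES):
    """Return feature score S2 per upper URL.

    rows: iterable of (upper_url, search_keyword, page_title).
    A keyword is counted once per site, however many matching pages it reached.
    """
    hits = {}
    for upper_url, keyword, title in rows:
        if any(phrase in title for phrase in phrases):
            hits.setdefault(upper_url, set()).add(keyword)
    return {site: len(keywords) for site, keywords in hits.items()}


rows = [
    ("dict.example", "VPN", "VPNとは"),
    ("dict.example", "RSS", "RSSとは"),
    ("dict.example", "LAN", "LANとは"),
    ("dict.example", "LAN", "LANとは 用語事典"),  # hypothetical extra page
    ("ramen.example", "ラーメン", "人気ラーメン店一覧"),
    ("ramen.example", "＊＊ラーメン", "ランキング"),
    ("ramen.example", "ラーメン", "ラーメンとは"),
]
s2 = title_feature_counts(rows)
```

Sites with no matching title do not appear in the result, so a caller would treat a missing key as a score of 0.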

Feature rule 3 is that "pages satisfying feature rule 2 account for a high proportion of the site". For example, all of pages 511, 512, and 513 of the dictionary site 500 satisfy feature rule 2. In the ramen shop introduction site 501, by contrast, only page 533 satisfies feature rule 2.

To detect whether feature rule 3 is satisfied, the proportion is calculated between the "number of title features", the feature score S2 of feature rule 2 obtained by tabulating the search keywords that reached a Web page containing a feature phrase, and the "number of keywords", the feature score S1 of feature rule 1 obtained by tabulating all search keywords. This feature item is called the "content rate", and the content rate tabulated for each upper URL is the feature score S3. The content rate C is calculated by
Content rate (C) = {number of search keywords that reached a page containing a feature phrase (number of title features)}² / number of keywords ... (1)
In the example in the figure, the feature score S3 based on the content rate is 3.0 (3²/3) for (A) and 0.33 (1/3) for (B). The feature score S3 therefore indicates that (A) is the more dictionary-like site. In equation (1) the number of title features is weighted by squaring it, but the numerator of equation (1) may also be the number of title features as it is.
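Equation (1) can be written as a small function. This is a sketch under assumptions: the name `content_rate` and the `weighted` flag (which switches between the squared numerator and the plain numerator mentioned as an alternative) are not from the source.

```python
def content_rate(title_feature_count, keyword_count, weighted=True):
    """Feature score S3 following equation (1).

    weighted=True squares the number of title features;
    weighted=False uses it unchanged, the alternative noted in the text.
    """
    if keyword_count <= 0:
        raise ValueError("keyword_count must be positive")
    numerator = title_feature_count ** 2 if weighted else title_feature_count
    return numerator / keyword_count


# FIG. 4(A): S2 = 3 title features, S1 = 3 keywords.
s3_a = content_rate(3, 3)
s3_a_plain = content_rate(3, 3, weighted=False)
```

With the squared numerator, a site where every keyword reaches a feature-phrase page scores higher the more keywords it has, whereas the unweighted form caps at 1.0.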

In the process of determining whether a site is a dictionary site, a Web site that satisfies feature rule 1, feature rule 2, and feature rule 3 above is judged to be a dictionary site. Whether a feature rule is satisfied is determined by setting a threshold for each feature score in advance and comparing the tabulated feature score with that threshold. Experiments by the inventors have shown that dictionary sites can be found using feature rules 1, 2, and 3.
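The threshold comparison might look like the sketch below. The `Thresholds` class, the function name, the specific threshold values, and the choice of "greater than or equal to" as the comparison are all assumptions for illustration; the source only states that each score is compared with a preset threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    s1: float  # minimum number of keywords
    s2: float  # minimum number of title features
    s3: float  # minimum content rate


def is_dictionary_candidate(s1, s2, s3, limits):
    """True when all three feature scores reach their thresholds."""
    return s1 >= limits.s1 and s2 >= limits.s2 and s3 >= limits.s3


limits = Thresholds(s1=3, s2=2, s3=1.0)  # hypothetical values
# Scores as stated for FIG. 4: (A) 3, 3, 3.0 and (B) 1, 1, 0.33.
result_a = is_dictionary_candidate(3, 3, 3.0, limits)
result_b = is_dictionary_candidate(1, 1, 0.33, limits)
```

In practice the thresholds would be tuned against real search logs; the values here only separate the two sites in the figure.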

In the description above, the feature score based on feature rule 2 (feature score S2) and the feature score based on feature rule 3 (feature score S3) are calculated from the number of search keywords that reached a Web page containing a feature phrase, but the feature score can also be obtained from the number of Web pages whose titles contain a feature phrase. In this case, on the basis of feature rules 2 and 3, the number of access URL titles containing a feature phrase is tabulated from the search log. The proportion of access URL titles containing a feature phrase among all access URL titles is then calculated and used as the feature score S4. The feature score S1 based on feature rule 1 is combined with this feature score S4 to calculate an overall feature score S5. For example, the feature score S4 is calculated only for sites whose feature score S1 (number of keywords) is at or above a threshold. This excludes cases in which one keyword, or a small number of keywords, accounts for many accesses to Web pages whose titles contain a feature phrase. Alternatively, the overall feature score may be calculated by weighting the feature scores S1 and S4. Using arbitrary coefficients a and b, the overall feature score S5 can be calculated as
Overall feature score S5 = aS1 + bS4 ... (2)
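The alternative scoring can be sketched as follows. The function names, the default phrase tuple, the coefficient values, and the `min_keywords` gate (returning `None` for sites below the keyword threshold) are illustrative assumptions layered on equation (2).

```python
def title_ratio(titles, phrases=("とは",)):
    """Feature score S4: share of access URL titles containing a feature phrase."""
    if not titles:
        return 0.0
    hits = sum(1 for title in titles if any(p in title for p in phrases))
    return hits / len(titles)


def overall_score(s1, s4, a=1.0, b=1.0, min_keywords=0):
    """Overall feature score S5 = a*S1 + b*S4 (equation (2)).

    Returns None when S1 is below min_keywords, mirroring the example of
    scoring only sites with enough distinct keywords.
    """
    if s1 < min_keywords:
        return None
    return a * s1 + b * s4


# Titles from FIG. 4(A) and FIG. 4(B).
s4_a = title_ratio(["VPNとは", "RSSとは", "LANとは"])
s4_b = title_ratio(["人気ラーメン店一覧", "ランキング", "ラーメンとは"])
s5_a = overall_score(3, s4_a, a=1.0, b=2.0)  # hypothetical coefficients
```

Because S1 is a count and S4 is a ratio between 0 and 1, the coefficients a and b effectively set how much raw keyword diversity matters relative to title regularity.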

Feature rule 4 is that "there are multiple dictionary sites for the terms of a given field". In other words, "two dictionary sites in the same field explain similar terms, so they are accessed with similar queries".

FIG. 5 is a diagram showing an example of queries between sites. (A) is an example of queries for two dictionary sites.
In the example in the figure, the dictionary site A 503 and the dictionary site B 504 are both accessed with the search keywords "VPN" 551, "RSS" 552, and "Web2.0" 553. Dictionary sites in the same field can be presumed to have explanatory pages for similar terms. That is, two dictionary sites can be presumed to have similar queries.

Here, the proportion of all search keywords accounted for by search keywords shared with another site (common keywords) is calculated and used as the similarity score with that site. The similarity score R between site A and site B can be calculated by
R = (number of keywords common to site A and site B) / (total number of keywords for site B) ... (3)

In the example in the figure, the similarity score is R = 1.0 (3/3).
If the similarity score is at or above a certain threshold, the other site can be selected as a similar site. The similarity score is thus also one of the feature scores representing the characteristics of a Web site. For example, similarity scores are calculated between the sites registered as dictionary site candidates, and if any similarity score exceeds the threshold, the site is judged to have a similar site. On the basis of feature rule 4 above, a site that has a similar site can be presumed to be a dictionary site. The site selected as its similar site can also be presumed to be a dictionary site. Conversely, if no similar site can be detected for a site even though it satisfies feature rules 1, 2, and 3, the site is presumed not to be a dictionary site.
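Equation (3) reduces to a set intersection. The sketch below assumes each site's keywords are available as a set; the function name `similarity` and the handling of an empty keyword set (returning 0.0) are choices made for the example.

```python
def similarity(keywords_a, keywords_b):
    """Similarity score R of equation (3).

    R = |keywords common to A and B| / |keywords of B|.
    Note that the score is not symmetric: the denominator is site B's total.
    """
    a, b = set(keywords_a), set(keywords_b)
    if not b:
        return 0.0  # assumption: a site with no keywords is similar to nothing
    return len(a & b) / len(b)


# FIG. 5(A): both dictionary sites are reached with the same three keywords.
site_a = {"VPN", "RSS", "Web2.0"}
site_b = {"VPN", "RSS", "Web2.0"}
r = similarity(site_a, site_b)
```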

The case in which the sites are not both dictionary sites is described next. A blog site, for example, carries all kinds of information, so the search keywords that lead to it are highly varied. Under feature rule 1 alone, such a site may therefore be recognized as a dictionary site. In such cases, feature rule 4 can be used to detect that the site is not a dictionary site.

FIG. 6 is a diagram showing other examples of queries between sites. (B) is an example of queries for a dictionary site and a blog site. (C) is an example of queries for two blog sites.
The dictionary site A 503 and the blog site B 505 shown in (B) share the search keywords "VPN" 551 and "RSS" 552. However, only the dictionary site A 503 is accessed with the search keyword "Web2.0" 553, and the search keywords "amusement park" 554 and "summer vacation" 555 are used only for the blog site B 505. When the two sites are not both dictionary sites, the common keywords thus account for a low proportion of all search keywords. In other words, the similarity score is low.

In the example in the figure, the similarity score of the blog site B 505 as seen from the dictionary site A 503 is R = (2/4) = 0.5 according to equation (3).
The blog site A 506 and the blog site B 505 shown in (C) share the search keyword "RSS" 552. However, the search keywords "ramen" 557 and "France" 558 are used only for the blog site A 506, and the search keywords "amusement park" 554 and "summer vacation" 555 are used only for the blog site B 505. The similarity score is thus also low between two blog sites. In the example in the figure, the similarity score of the blog site B 505 as seen from the blog site A 506 is R = (1/4) = 0.25.

Applying feature rule 4 to the site candidates detected with feature rules 1, 2, and 3 thus acts as a filter that excludes candidates that are probably not dictionary sites.
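The filtering effect can be sketched as follows, using the keyword sets from FIG. 5 and FIG. 6. The function name, the site labels, the threshold of 0.8, and the rule "keep a candidate if at least one other candidate reaches the threshold" are assumptions for illustration; the score itself follows equation (3).

```python
def filter_by_similar_site(keywords_by_site, threshold):
    """Keep only candidates that have at least one similar site.

    keywords_by_site: dict mapping a site label to its set of search keywords.
    The score for (site, other) is |common keywords| / |keywords of other|.
    """
    kept = []
    for site, own in keywords_by_site.items():
        for other, theirs in keywords_by_site.items():
            if other == site or not theirs:
                continue
            if len(own & theirs) / len(theirs) >= threshold:
                kept.append(site)
                break  # one similar site is enough
    return kept


candidates = {
    "dictionary-a": {"VPN", "RSS", "Web2.0"},
    "dictionary-b": {"VPN", "RSS", "Web2.0"},
    "blog-a": {"RSS", "ramen", "France"},
    "blog-b": {"VPN", "RSS", "amusement park", "summer vacation"},
}
kept = filter_by_similar_site(candidates, threshold=0.8)  # hypothetical threshold
```

Because equation (3) divides by the other site's keyword total, the result depends on the direction of comparison, so the threshold has to be chosen with that asymmetry in mind.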

Next, the process of analyzing the search log with feature rules 1, 2, 3, and 4 and detecting dictionary site candidates is described using a concrete example. In the following description, feature rule 2 uses the title feature obtained by tabulating the number of search keywords that reached a page containing a feature phrase, and feature rule 3 uses the content rate given by equation (1).

The search server 300 generates the search log on the basis of search requests and requests to access search results. The dictionary site detection server 100 collects the search log from the search server 300 via the network 400 and stores it in the search log DB 110.

The search log table stored in the search log DB 110 is described next. FIG. 7 is a diagram showing an example of the search log table.
The search log table 1110 is a log obtained by merging the search history with the history of accesses to search results, and has the information items date and time 1111, session ID 1112, user cookie 1113, search keyword 1114, access URL 1115, and access URL title 1116.

The date and time 1111 is the date and time of the search request, the session ID 1112 is the session number at the time of the search request, and the user cookie 1113 is the cookie of the user at the time of the search request. The search keyword 1114 is the search keyword (query) entered with the search request. With a WWW search engine, for example, multiple search keywords may be specified, or conditions such as AND search or OR search may be attached. This information is recorded as the search history when the search server 300 accepts a search request from the user's client device. The search server 300 provides the user with search results (access URLs of Web pages) in response to the search request. Here, it is assumed that the search server 300 selects a given keyword from such search conditions as the search keyword and registers it in the search log. The keyword extraction may instead be performed by the information extraction unit 120.

The access URL 1115 is the access URL for which the user made an access request. The access URL title 1116 is the title extracted from the Web page obtained through that access URL. This information is recorded as the access history of the search results when the user requests access to a Web page on the basis of the search results. A single query may lead to several accesses to search results. The search log table 1110 is created by collating the search history with the access history and merging the corresponding entries. Here, the title of the access URL is recorded in the search log as the information on the access URL because the title is considered to reflect the content best. However, any information that captures the characteristics of the content at the access URL may be used instead of the title.
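One way the merge could work is sketched below. The source does not say how search history and access history entries are matched; joining on the session ID, with one query per session, is an assumption made purely for this example, as are the class name, field names, and sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchLogRow:
    """One row of the merged search log (columns follow FIG. 7)."""
    timestamp: str
    session_id: str
    user_cookie: str
    keyword: str
    access_url: str
    title: str


def merge_histories(searches, accesses):
    """Merge search and access history, joined on session ID (assumed key)."""
    by_session = {s["session_id"]: s for s in searches}
    rows = []
    for access in accesses:
        search = by_session.get(access["session_id"])
        if search is None:
            continue  # access without a matching search request
        rows.append(SearchLogRow(
            search["timestamp"], search["session_id"], search["user_cookie"],
            search["keyword"], access["access_url"], access["title"],
        ))
    return rows


# Hypothetical data: one query followed by two accesses to its results.
searches = [{"timestamp": "2008-09-01T10:00:00", "session_id": "s1",
             "user_cookie": "c1", "keyword": "RSS"}]
accesses = [
    {"session_id": "s1", "access_url": "http://aaa.com/rss.html", "title": "RSSとは"},
    {"session_id": "s1", "access_url": "http://bbb.com/rss.html", "title": "RSS"},
]
log_rows = merge_histories(searches, accesses)
```

A query with several accesses yields one merged row per access, which matches the statement that a single query may lead to multiple accesses.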

In response to a processing start request from the client device 200, the information extraction unit 120 analyzes the search log stored in the search log DB 110 and extracts the upper URL, which contains the domain name, from each access URL in the search log. For example, the domain name "aaa.com" is extracted from the access URL "http://aaa.com/rss.html" of search log entry No. 1 in the search log table 1110. In the same way, "aaa.com" is extracted as the upper URL from entry No. 2, "bbb.com" from entry No. 3, "ccc.co.jp" from entry No. 4, and "abc.com" from entry No. 5. The information extraction unit 120 passes each upper URL extracted in this way to the feature tabulation unit 140 together with the corresponding search keyword, access URL, and access URL title.
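The extraction of the upper URL can be sketched with Python's standard `urllib.parse` module. This simplification takes the host name of the access URL as the upper URL; it does not use the predefined domain strings mentioned in the text, and the function name `upper_url` and the second sample URL are assumptions.

```python
from urllib.parse import urlsplit


def upper_url(access_url):
    """Return the host part of an access URL, used here as the upper URL."""
    host = urlsplit(access_url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {access_url!r}")
    return host


# Access URL of search log entry No. 1 in the text.
domain = upper_url("http://aaa.com/rss.html")
# Hypothetical path on the domain of entry No. 4.
domain_jp = upper_url("http://ccc.co.jp/terms/vpn.html")
```

A site that spreads its pages over several subdomains would need extra grouping logic, which this sketch leaves out.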

Using the acquired upper URL, search keyword, access URL, and access URL title, the feature tabulation unit 140 tabulates the feature scores of multiple feature items based on the feature rules that characterize the upper URL. Dictionary site candidates can be detected even with a single feature score. However, combining multiple feature scores that correspond to different feature rules increases the confidence that a dictionary site candidate really is a dictionary site.

For the feature score tabulation process, the target extraction unit 141 first extracts each pair of a target upper URL and a search keyword and registers it in the upper URL table 151. It also extracts each pair of an access URL and its title and registers it in the access URL table 152. If an extracted pair of upper URL and search keyword is already registered in the upper URL table 151, it is not registered again. The same applies to the access URL table 152.
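The registration step with duplicate suppression can be sketched as below. The function name, the tuple layouts, and the sample rows are assumptions; the access table keeps the access URL, search keyword, and title, following the three columns described for FIG. 9.

```python
def build_tables(rows):
    """Build the upper URL table and the access URL table without duplicates.

    rows: iterable of (upper_url, keyword, access_url, title).
    """
    upper_table, access_table = [], []
    seen_upper, seen_access = set(), set()
    for upper_url, keyword, access_url, title in rows:
        upper_key = (upper_url, keyword)
        if upper_key not in seen_upper:  # skip already registered pairs
            seen_upper.add(upper_key)
            upper_table.append(upper_key)
        access_key = (access_url, keyword, title)
        if access_key not in seen_access:
            seen_access.add(access_key)
            access_table.append(access_key)
    return upper_table, access_table


# Hypothetical rows; only the first access URL appears in the text.
rows = [
    ("aaa.com", "RSS", "http://aaa.com/rss.html", "RSSとは"),
    ("aaa.com", "VPN", "http://aaa.com/vpn.html", "VPNとは"),
    ("aaa.com", "RSS", "http://aaa.com/rss.html", "RSSとは"),  # duplicate
]
upper_table, access_table = build_tables(rows)
```

Keeping lists alongside the sets preserves the order in which entries were first seen, which mirrors processing the log sequentially.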

FIG. 8 is a diagram showing an example of the upper URL table.
The upper URL table 1510 stores the URL of each site and the search keywords used to access the Web pages of that site, and has the fields upper URL 1511 and search keyword 1512. The upper URL 1511 stores, as the URL of the site, the upper URL extracted from the access URL 1115 of the search log table 1110. The search keyword 1512 stores the search keyword recorded in the search log entry from which the access URL 1115 was taken.

For example, the upper URL "aaa.com" extracted by the information extraction unit 120 from entry No. 1 of the search log table 1110 and its search keyword "RSS" are stored in entry No. 1 of the upper URL table 1510. The same processing is then applied to the remaining search log entries in order. However, if a later search log entry has the same combination of upper URL and search keyword as an earlier one, it is not registered.

図９は、アクセスＵＲＬテーブルの一例を示した図である。
アクセスＵＲＬテーブル１５２０は、ＷｅｂページのアクセスＵＲＬと、そのＷｅｂページのタイトルとが格納されるテーブルで、アクセスＵＲＬ１５２１と、検索キーワード１５２２と、アクセスＵＲＬのタイトル１５２３と、を有する。
アクセスＵＲＬ１５２１には、情報抽出部１２０によって検索ログテーブル１１１０から抽出されたアクセスＵＲＬが格納される。検索キーワード１５２２には、このＷｅｂページに辿り着いた検索キーワードが検索ログテーブル１１１０より抽出されて格納される。アクセスＵＲＬタイトル１５２３には、そのアクセスＵＲＬによって取得されるＷｅｂページのタイトルが格納される。また、同じ組み合わせがアクセスＵＲＬテーブル１５２０に格納されないことは、上位ＵＲＬテーブル１５１０と同様である。 FIG. 9 is a diagram showing an example of the access URL table.
The access URL table 1520 is a table in which an access URL of a web page and a title of the web page are stored. The access URL table 1520 includes an access URL 1521, a search keyword 1522, and an access URL title 1523.
The access URL 1521 stores the access URL extracted from the search log table 1110 by the information extraction unit 120. The search keyword 1522 stores the search keyword that led to this Web page, extracted from the search log table 1110. The access URL title 1523 stores the title of the Web page acquired by the access URL. As with the upper URL table 1510, the same combination is never stored twice in the access URL table 1520.

こうして上位ＵＲＬテーブル１５１０とアクセスＵＲＬテーブル１５２０が作成された後、特徴スコア集計部１４２が上位ＵＲＬごとに特徴スコアを集計する。ここでは、特徴スコアとして、キーワード数（特徴スコアＳ１）と、タイトル特徴数（特徴スコアＳ２）及び含有率（特徴スコアＳ３）を集計した後、類似サイトチェック部１４３で類似スコアを算出する。 After the upper URL table 1510 and the access URL table 1520 are thus created, the feature score totaling unit 142 totals the feature scores for each upper URL. Here, after the number of keywords (feature score S1), the number of title features (feature score S2), and the content rate (feature score S3) are tabulated as the feature score, the similar site check unit 143 calculates the similarity score.

キーワード数は、特徴ルール１に基づく特徴項目を調べるための特徴スコアＳ１で、上位ＵＲＬテーブル１５１０を解析し、上位ＵＲＬに対応する検索キーワードの数を集計して得る。例えば、上位ＵＲＬテーブル１５１０では、上位ＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」については、検索キーワード「ＲＳＳ」「ＶＰＮ」が検出されるので、この時点でのキーワード数は「２」になる。同様にして、上位ＵＲＬテーブル１５１０のすべての上位ＵＲＬについてキーワード数を集計する。キーワード数が多いほど、このサイトに対し、異なる言葉からたくさんアクセスがあることになる。 The number of keywords is a feature score S1 for examining feature items based on the feature rule 1, and is obtained by analyzing the upper URL table 1510 and totaling the number of search keywords corresponding to each upper URL. For example, in the upper URL table 1510, the search keywords “RSS” and “VPN” are detected for the upper URL “http://aaa.com/”, so the number of keywords at this point is “2”. Similarly, the number of keywords is totaled for all upper URLs in the upper URL table 1510. A larger number of keywords means that the site is reached frequently from many different search terms.
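The keyword count described above can be sketched as follows. This is a minimal illustration, not the implementation of the specification: it assumes the upper URL table is held as a list of (upper URL, search keyword) pairs, and the function name `count_keywords` and the `ddd.com` sample row are hypothetical.

```python
from collections import defaultdict

def count_keywords(upper_url_table):
    """Return {upper URL: number of distinct search keywords} (feature score S1)."""
    keywords_by_site = defaultdict(set)
    for upper_url, keyword in upper_url_table:
        keywords_by_site[upper_url].add(keyword)
    return {site: len(keywords) for site, keywords in keywords_by_site.items()}

# The aaa.com rows mirror the example in the text; the ddd.com row is made up.
upper_url_table = [
    ("http://aaa.com/", "RSS"),
    ("http://aaa.com/", "VPN"),
    ("http://ddd.com/", "RSS"),
]
s1 = count_keywords(upper_url_table)
```

Using a set per site keeps the count correct even if the table were to contain a repeated pair.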

タイトル特徴数は、特徴ルール２に基づく特徴項目を調べるための特徴スコアＳ２で、アクセスＵＲＬテーブル１５２０を解析し、上位ＵＲＬごとに、特徴語句がアクセスＵＲＬのタイトルに辿り着いた検索キーワードの数を集計する。なお、特徴語句の定義は、特徴ルールＤＢ１３０に格納されている情報を用いる。例えば、特徴語句として「とは」が定義されているとする。検索キーワードによって辿り着いたアクセスＵＲＬのタイトル１５２３に「とは」が含まれるものを検索する。図の例では、アクセスＵＲＬテーブル１５２０のＮｏ．１の検索キーワード「ＲＳＳ」に対応する「ＲＳＳとは」が検索される。同様にして、Ｎｏ．２の検索キーワード「ＶＰＮ」に対応する「ＶＰＮとは」、及びＮｏ．４の検索キーワード「パンダ」に対応する「パンダランドとは」が検索される。そして、対応する上位ＵＲＬをアクセスＵＲＬ１５２１から抽出し、上位ＵＲＬごとに特徴語句が出現したＷｅｂページに辿り着いた検索キーワードの数を集計する。この場合、Ｎｏ．１の「ＲＳＳ」、Ｎｏ．２の「ＶＰＮ」の上位ＵＲＬは「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」であるので、上位ＵＲＬは、「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」になる。したがって、上位ＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」の特徴スコアＳ２の「タイトル特徴数」は「２」になる。同様にして、アクセスＵＲＬテーブル１５２０のすべてのアクセスＵＲＬのタイトルを解析し、タイトル特徴数を集計する。タイトル特徴数が多いほど、このサイトには辞書の特徴を持つＷｅｂページが多いことになる。 The number of title features is a feature score S2 for examining feature items based on the feature rule 2. It is obtained by analyzing the access URL table 1520 and counting, for each upper URL, the number of search keywords that led to an access URL whose title contains a feature phrase. The feature phrases are defined by the information stored in the feature rule DB 130. For example, assume that “とは” (“what is”) is defined as a feature phrase. The access URL titles 1523 reached via the search keywords are searched for titles containing “とは”. In the example of the figure, “RSSとは” (“What is RSS”), corresponding to the search keyword “RSS” of No. 1 in the access URL table 1520, is found. Similarly, “VPNとは” (“What is VPN”), corresponding to the search keyword “VPN” of No. 2, and “パンダランドとは” (“What is Panda Land”), corresponding to the search keyword “パンダ” (“panda”) of No. 4, are found. The corresponding upper URL is then extracted from the access URL 1521, and for each upper URL the number of search keywords that led to a Web page in which the feature phrase appears is totaled. In this case, the access URLs for “RSS” of No. 1 and “VPN” of No. 2 both have the upper URL “http://aaa.com/”. Therefore, the “number of title features” (feature score S2) of the upper URL “http://aaa.com/” is “2”. 
Similarly, the titles of all access URLs in the access URL table 1520 are analyzed, and the number of title features is totaled. The greater the number of title features, the more Web pages with dictionary features on this site.
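The title feature count can be sketched in the same way. The sketch assumes the access URL table is a list of (access URL, search keyword, title) rows and that the feature phrases have already been loaded from the feature rule DB; the page paths, the `bbb.com` and `ccc.com` hosts, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

FEATURE_PHRASES = ["とは"]  # assumed to come from the feature rule DB

def upper_url(access_url):
    """Reduce an access URL to scheme + host, e.g. http://aaa.com/."""
    parts = urlsplit(access_url)
    return f"{parts.scheme}://{parts.netloc}/"

def count_title_features(access_url_table, phrases=FEATURE_PHRASES):
    """Return {upper URL: keywords that led to a title containing a feature phrase} as counts (S2)."""
    hits = defaultdict(set)
    for access_url, keyword, title in access_url_table:
        if any(phrase in title for phrase in phrases):
            hits[upper_url(access_url)].add(keyword)
    return {site: len(keywords) for site, keywords in hits.items()}

# Keywords and titles follow the text; URLs other than the aaa.com host are made up.
access_url_table = [
    ("http://aaa.com/rss.html", "RSS", "RSSとは"),
    ("http://aaa.com/vpn.html", "VPN", "VPNとは"),
    ("http://bbb.com/news.html", "VPN", "VPN news"),
    ("http://ccc.com/panda.html", "パンダ", "パンダランドとは"),
]
s2 = count_title_features(access_url_table)
```

Sites with no matching title simply do not appear in the result, so a caller would treat a missing entry as a count of zero.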

含有率は、特徴ルール３に基づく特徴項目を調べるための特徴スコアＳ３で、上記のキーワード数及びタイトル特徴数の集計結果から式（１）を用いて算出する。例えば、上記の例で、上位ＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」について、キーワード数「２」、タイトル特徴数「２」が得られているので、含有率＝２²（タイトル特徴数）／２（キーワード数）＝２．０となる。含有率が高いほど、特徴語句を含むページに辿り着いた検索が、全検索に占める割合が高いという解釈になる。このとき、キーワード数（特徴スコアＳ１）が閾値以下のものについては除外して含有率を算出するとしてもよい。 The content rate is a feature score S3 for examining a feature item based on the feature rule 3, and is calculated using formula (1) from the number of keywords and the number of title features totaled above. For example, for the upper URL “http://aaa.com/” in the above example, the number of keywords is “2” and the number of title features is “2”, so the content rate = 2² (number of title features) / 2 (number of keywords) = 2.0. A higher content rate is interpreted to mean that searches reaching a page containing a feature phrase account for a larger share of all searches. Sites whose number of keywords (feature score S1) is equal to or less than a threshold may be excluded when the content rate is calculated.
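As a rough sketch of this calculation: formula (1) itself is defined elsewhere in the document, so the code below assumes the form implied by the worked example, namely the square of the number of title features divided by the number of keywords. The function names and the `min_keywords` parameter are hypothetical.

```python
def content_rate(title_features, keywords):
    """Feature score S3, assuming formula (1) is title_features**2 / keywords."""
    return title_features ** 2 / keywords

def content_rates(s1, s2, min_keywords=0):
    """Compute S3 per site, skipping sites whose keyword count is <= min_keywords."""
    rates = {}
    for site, n_keywords in s1.items():
        if n_keywords <= min_keywords:
            continue  # excluded, as the text allows for low keyword counts
        rates[site] = content_rate(s2.get(site, 0), n_keywords)
    return rates

# Worked example from the text: 2 title features, 2 keywords.
example = content_rate(2, 2)
```

With the default `min_keywords=0`, a site with no keywords is skipped, which also avoids a division by zero.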

こうして、上位ＵＲＬごとに特徴スコアの集計結果が得られる。図１０は、特徴スコアの集計結果を示した集計テーブルの一例である。
集計テーブル１５３０は、上位ＵＲＬの特徴スコアが格納されるテーブルで、上位ＵＲＬ１５３１、タイトル特徴数１５３２、キーワード数１５３３及び含有率１５３４を有する。上位ＵＲＬ１５３１には、検索ログテーブル１１１０より抽出された上位ＵＲＬが格納される。タイトル特徴数１５３２には、この上位ＵＲＬで検索された特徴語句がタイトルに含まれるページに辿り着いた検索キーワードの数が格納される。キーワード数１５３３には、この上位ＵＲＬへのアクセスに用いられた検索キーワードの種類数が格納される。含有率１５３４には、式（１）に基づいて、この上位ＵＲＬに対する全検索キーワードのうち、特徴語句が含まれるタイトルが付されたページに辿り着いた検索キーワードの割合に応じたスコアが格納される。 In this way, the total result of the feature score is obtained for each upper URL. FIG. 10 is an example of a tabulation table showing the tabulation results of feature scores.
The aggregation table 1530 is a table in which the feature scores of each upper URL are stored, and includes the upper URL 1531, the title feature number 1532, the keyword number 1533, and the content rate 1534. The upper URL 1531 stores the upper URL extracted from the search log table 1110. The title feature number 1532 stores the number of search keywords that, for this upper URL, led to a page whose title contains a feature phrase. The number of keywords 1533 stores the number of distinct search keywords used to access this upper URL. The content rate 1534 stores a score, based on formula (1), that reflects the proportion of all search keywords for this upper URL that led to a page whose title contains a feature phrase.

次に、類似サイトチェック部１４３が、特徴ルール４に基づき、上記の処理で検出されたサイトに、類似サイトが存在するかどうかをチェックする。このため、集計テーブル１５３０において、所定の条件を満たす辞書サイトの可能性が高い候補について、他のサイトとの類似スコアを算出する。比較対象の他のサイトは、検索ログテーブル１１１０の解析により抽出されたサイトばかりでなく、既に辞書サイトとしてユーザが登録済みのサイトを用いてもよい。また、辞書サイトの可能性が高い候補についてばかりでなく、抽出されたすべてのサイトについて類似サイトチェックを行うとしてもよい。 Next, based on the feature rule 4, the similar site check unit 143 checks whether a similar site exists for each site detected by the above processing. To this end, similarity scores with other sites are calculated for those candidates in the aggregation table 1530 that satisfy a predetermined condition and are therefore likely to be dictionary sites. The other sites used for comparison need not be limited to sites extracted by analyzing the search log table 1110; sites that the user has already registered as dictionary sites may also be used. The similar site check may also be performed on all extracted sites rather than only on the likely dictionary site candidates.

具体例を挙げて説明する。まず辞書サイトの可能性の高い候補を選択するため、集計テーブル１５３０に登録される上位ＵＲＬのうち、キーワード数が一定値以上のサイトを含有率の高い順に並べ替える。これにより、特徴ルール１、特徴ルール２及び特徴ルール３を満たすサイトをサイト候補として抽出することができる。特徴ルール１（検索キーワード数が多い）と、特徴ルール３（含有率が高い）とが満たされれば、特徴ルール２が満たされることは自明である。 A specific example will be described. First, in order to select a candidate having a high possibility of a dictionary site, sites having a keyword number equal to or greater than a certain value are rearranged in descending order of content ratio among the upper URLs registered in the aggregation table 1530. As a result, sites satisfying the feature rule 1, the feature rule 2, and the feature rule 3 can be extracted as site candidates. It is obvious that the feature rule 2 is satisfied if the feature rule 1 (the number of search keywords is large) and the feature rule 3 (content ratio is high) are satisfied.

図１１は、集計テーブルを辞書サイトの可能性の高い順に並び変えた一例である。
集計テーブル（Ｂ）１５３５は、図１０に示した集計テーブル１５３０に登録されるサイトのうち、キーワード数が所定の数以上のサイトを抽出し、含有率の高い順に並び変えたものである。 FIG. 11 shows an example in which the aggregation table is rearranged in the descending order of the possibility of the dictionary site.
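The tabulation table (B) 1535 is obtained by extracting, from the sites registered in the tabulation table 1530 shown in FIG. 10, the sites whose number of keywords is equal to or greater than a predetermined number and rearranging them in descending order of content rate.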

図の例では、キーワード数が所定の数以上のサイト（上位ＵＲＬ）に関し、含有率が「２３０．００」と最も高いサイトのＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」がＮｏ．１に登録される。Ｎｏ．２，Ｎｏ．３と順次、含有率が低下する。 In the example of the figure, among the sites (upper URLs) whose number of keywords is equal to or greater than the predetermined number, the URL “http://aaa.com/” of the site with the highest content rate, “230.00”, is registered as No. 1. The content rate then decreases in order through No. 2 and No. 3.
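The filtering and sorting that produce the tabulation table (B) can be sketched as below. The row layout, the function name `rank_candidates`, and every number other than the content rate 230.00 of `http://aaa.com/` are assumptions made for illustration.

```python
def rank_candidates(rows, min_keywords):
    """Keep sites with enough keywords, then sort by content rate, highest first."""
    kept = [row for row in rows if row["keywords"] >= min_keywords]
    return sorted(kept, key=lambda row: row["content_rate"], reverse=True)

# Hypothetical aggregation table rows (only aaa.com's 230.00 appears in the text).
aggregation_table = [
    {"upper_url": "http://eee.com/", "keywords": 150, "content_rate": 95.0},
    {"upper_url": "http://aaa.com/", "keywords": 260, "content_rate": 230.0},
    {"upper_url": "http://zzz.com/", "keywords": 3, "content_rate": 3.0},
    {"upper_url": "http://ddd.com/", "keywords": 232, "content_rate": 180.0},
]
table_b = rank_candidates(aggregation_table, min_keywords=100)
ranked_urls = [row["upper_url"] for row in table_b]
```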

次に、並び変えられた順に、類似スコアを算出する処理が行われる。類似スコアは、集計テーブル（Ｂ）１５３５から選択された１つの上位ＵＲＬを基準サイトとして、他の上位ＵＲＬとの間で算出される。この基準サイトと比較する他の上位ＵＲＬを比較サイトと呼ぶ。基準サイトは、集計テーブル（Ｂ）１５３５から選択された１つの上位ＵＲＬ、比較サイトを他の上位ＵＲＬとする。なお、比較サイトは、既に辞書サイトとして登録されたサイトとしてもよい。類似スコアの算出は、式（３）を用いて行う。 Next, similarity scores are calculated in the rearranged order. A similarity score is calculated between one upper URL selected from the aggregation table (B) 1535, which serves as the reference site, and each of the other upper URLs. An upper URL compared with the reference site is called a comparison site. In other words, the reference site is one upper URL selected from the aggregation table (B) 1535, and the comparison sites are the other upper URLs. A comparison site may also be a site already registered as a dictionary site. The similarity score is calculated using formula (3).

例えば、含有率が最も高いサイト「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」を基準サイトとし、次に含有率の高いサイト「ｈｔｔｐ：／／ｄｄｄ．ｃｏｍ／」を比較サイトとして類似スコアを算出する。続いて、「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」を基準サイト、３番目に含有率の高いサイト「ｈｔｔｐ：／／ｅｅｅ．ｃｏｍ／」を比較サイトとして類似スコアを算出する。以下、集計テーブル（Ｂ）１５３５に選出された他のサイトを比較サイトとして、順次類似スコアを算出する。 For example, a similarity score is calculated using the site with the highest content rate, “http://aaa.com/”, as the reference site and the site with the next highest content rate, “http://ddd.com/”, as the comparison site. Subsequently, a similarity score is calculated using “http://aaa.com/” as the reference site and the site with the third highest content rate, “http://eee.com/”, as the comparison site. Similarity scores are then calculated in turn using each of the other sites selected in the tabulation table (B) 1535 as the comparison site.

こうして、類似スコアテーブルが得られる。図１２は、類似スコアテーブルの一例である。
類似スコアテーブル１５４０は、辞書サイトの可能性が高いサイトと他のサイトとの類似スコアが格納されるテーブルで、基準ＵＲＬ１５４１、比較ＵＲＬ１５４２、共通して出現するキーワード数１５４３、比較ＵＲＬに出現するキーワード数１５４４及び類似スコア１５４５を有する。 In this way, a similar score table is obtained. FIG. 12 is an example of a similarity score table.
The similarity score table 1540 is a table in which similarity scores between a site that is highly likely to be a dictionary site and other sites are stored. It includes the reference URL 1541, the comparison URL 1542, the number of commonly appearing keywords 1543, the number of keywords appearing in the comparison URL 1544, and the similarity score 1545.

基準ＵＲＬ１５４１には、基準サイトのＵＲＬが格納される。比較ＵＲＬ１５４２には、基準サイトとの類似スコアが算出された比較サイトのＵＲＬが格納される。共通して出現するキーワード数１５４３には、基準サイト（ＵＲＬ）のアクセスに用いられた検索キーワードと、比較サイト（ＵＲＬ）のアクセスに用いられた検索キーワードとを照合し、両方のサイトで共通に用いられていると判定された共通キーワードの数が格納される。比較ＵＲＬに出現するキーワード数１５４４には、比較サイト（ＵＲＬ）のアクセスに用いられた検索キーワードの総数が格納される。類似スコア１５４５には、共通して出現するキーワード数１５４３と、比較ＵＲＬに出現するキーワード数１５４４の値を式（３）に適用して得られた類似スコアが格納される。 The reference URL 1541 stores the URL of the reference site. The comparison URL 1542 stores the URL of the comparison site for which the similarity score with the reference site is calculated. For the number of keywords 1543 that appear in common, the search keyword used for accessing the reference site (URL) is matched with the search keyword used for accessing the comparison site (URL), and is common to both sites. The number of common keywords determined to be used is stored. The number of keywords 1544 appearing in the comparison URL stores the total number of search keywords used for accessing the comparison site (URL). The similarity score 1545 stores a similarity score obtained by applying the value of the number of commonly appearing keywords 1543 and the number of keywords 1544 appearing in the comparison URL to the expression (3).

図のＮｏ．１の例では、基準ＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」と、比較ＵＲＬ「ｈｔｔｐ：／／ｄｄｄ．ｃｏｍ／」との間で共通する検索キーワードの総数は、「１８９」である。また、比較ＵＲＬに出現する検索キーワードの総数は、「２３２」である。この値を式（３）に代入して、類似スコア、「０．８１４７（１８９／２３２）」が算出される。 In the example of No. 1 in the figure, the total number of search keywords common to the reference URL “http://aaa.com/” and the comparison URL “http://ddd.com/” is “189”, and the total number of search keywords appearing in the comparison URL is “232”. Substituting these values into formula (3) gives a similarity score of “0.8147” (189/232).
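The similarity calculation can be sketched as follows. Formula (3) is defined elsewhere in the document; the sketch assumes the form implied by this example, that is, the number of shared keywords divided by the number of keywords of the comparison site. The function name and the synthetic keyword sets are hypothetical.

```python
def similarity_score(reference_keywords, comparison_keywords):
    """Share of the comparison site's keywords that also lead to the reference site."""
    reference = set(reference_keywords)
    comparison = set(comparison_keywords)
    if not comparison:
        return 0.0  # nothing to compare against
    return len(reference & comparison) / len(comparison)

# Synthetic keyword sets reproducing the counts in the text: 189 shared out of 232.
comparison_site = {f"kw{i}" for i in range(232)}
reference_site = {f"kw{i}" for i in range(189)} | {"reference-only"}
score = similarity_score(reference_site, comparison_site)
```

Note that the measure is asymmetric: swapping the reference and comparison sites changes the denominator and generally the score.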

特徴集計部１４０による上記の処理手順が実行され、集計テーブル１５３０と、類似スコアテーブル１５４０が生成される。辞書サイト判定部１６０では、これらの情報を用いて、辞書サイト候補を選択する。上述のように、特徴ルール１に基づく判定は、集計テーブル１５３０のキーワード数１５３３の特徴スコアを用いて行うことができる。特徴ルール２に基づく判定は、集計テーブル１５３０のタイトル特徴数１５３２の特徴スコアを用いて行うことができる。特徴ルール３に基づく判定は、集計テーブル１５３０の含有率１５３４の特徴スコアを用いて行うことができる。そして、特徴ルール４に基づく判定は、類似スコアテーブル１５４０の類似スコア１５４５を用いて行うことができる。これらの特徴スコア及び類似スコアのうち、どれを利用して判定を行うかは、予め特徴ルールＤＢ１３０に格納しておく。ここでは、集計テーブル（Ｂ）１５３５に抽出されたサイトについて、類似スコアが閾値以上のものを辞書サイト候補として選択する。 The above-described processing procedure by the feature totaling unit 140 is executed, and a totaling table 1530 and a similarity score table 1540 are generated. The dictionary site determination unit 160 selects dictionary site candidates using these pieces of information. As described above, the determination based on the feature rule 1 can be performed using the feature score of the number of keywords 1533 in the aggregation table 1530. The determination based on the feature rule 2 can be performed by using the feature score of the title feature number 1532 in the aggregation table 1530. The determination based on the feature rule 3 can be performed using the feature score of the content rate 1534 of the aggregation table 1530. The determination based on the feature rule 4 can be performed using the similarity score 1545 of the similarity score table 1540. Which of these feature scores and similarity scores is used for determination is stored in the feature rule DB 130 in advance. Here, for the sites extracted in the tabulation table (B) 1535, those having a similarity score equal to or higher than the threshold are selected as dictionary site candidates.

集計テーブル（Ｂ）１５３５の最上位のサイトから順に、このサイトに関する類似スコアをチェックし、類似スコアが閾値以上のサイトが存在するか否かを判定する。そして、存在すれば、このサイトを辞書サイト候補とする。なお、このとき、同時に、この辞書サイト候補との間の類似スコアが閾値以上となった比較サイト（ＵＲＬ）も辞書サイト候補としてもよい。 In order from the highest site in the total table (B) 1535, the similarity score regarding this site is checked, and it is determined whether or not there is a site having a similarity score equal to or greater than a threshold. If it exists, this site is set as a dictionary site candidate. At this time, a comparison site (URL) whose similarity score with the dictionary site candidate is equal to or greater than a threshold value may be a dictionary site candidate.

例えば、集計テーブル（Ｂ）１５３５のＮｏ．１の上位ＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」に関する類似スコアを類似スコアテーブル１５４０から検索すると、Ｎｏ．１とＮｏ．２が検索される。いずれかの類似スコアが閾値を超えれば、「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」には、類似サイトがあり、辞書サイトである可能性が高いということになる。 For example, when the similarity score table 1540 is searched for similarity scores relating to the upper URL “http://aaa.com/” of No. 1 in the aggregation table (B) 1535, No. 1 and No. 2 are retrieved. If either similarity score exceeds the threshold, “http://aaa.com/” has a similar site and is therefore likely to be a dictionary site.
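The candidate selection can be sketched as a simple threshold test over the similarity score table. The row layout, the function name, the threshold value 0.8, and all scores other than 0.8147 are assumptions; the sketch uses "equal to or greater than" as the comparison, following the description of the selection rule.

```python
def select_candidates(ranked_sites, similarity_rows, threshold):
    """Return sites, in ranked order, that have at least one similar site."""
    candidates = []
    for site in ranked_sites:
        scores = [row["score"] for row in similarity_rows if row["reference"] == site]
        if any(score >= threshold for score in scores):
            candidates.append(site)
    return candidates

# Only the aaa.com / ddd.com score of 0.8147 comes from the text.
similarity_table = [
    {"reference": "http://aaa.com/", "comparison": "http://ddd.com/", "score": 0.8147},
    {"reference": "http://aaa.com/", "comparison": "http://eee.com/", "score": 0.35},
    {"reference": "http://eee.com/", "comparison": "http://aaa.com/", "score": 0.20},
]
ranked = ["http://aaa.com/", "http://ddd.com/", "http://eee.com/"]
selected = select_candidates(ranked, similarity_table, threshold=0.8)
```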

なお、辞書サイト検出サーバ１００では、辞書サイトである可能性の高い辞書サイト候補をユーザに提示するのみであり、辞書サイトの選択はユーザが行う。そこで、辞書サイトを選択するための参考情報として、辞書サイト候補とともに、算出された特徴スコア及び類似スコアをユーザに提示する。また、特徴スコアと類似スコアそれぞれに重み付けを行って、辞書らしさスコアを算出するとしてもよい。辞書らしさスコアは、集計された特徴スコアや類似スコアに基づいて算出されるこの辞書サイト候補が辞書サイトである確度を数値化したスコアになる。例えば、特徴スコアのうちの含有率スコアと、類似スコアを用いて、
辞書らしさスコア＝ α・含有率スコア＋ β・類似スコア・・・（４）
とすることができる。α及びβは、重み付けのための係数である。 Note that the dictionary site detection server 100 only presents the user with dictionary site candidates that are likely to be dictionary sites; the user makes the final selection. Therefore, the calculated feature scores and similarity score are presented to the user together with the dictionary site candidates as reference information for selecting a dictionary site. A dictionary-like score may also be calculated by weighting the feature scores and the similarity score. The dictionary-like score quantifies, on the basis of the totaled feature scores and similarity score, the likelihood that the dictionary site candidate is in fact a dictionary site. For example, using the content rate score among the feature scores and the similarity score, it can be defined as
dictionary-like score = α · content rate score + β · similarity score ... (4)
where α and β are weighting coefficients.
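Formula (4) is a plain weighted sum, which can be sketched as below. The weights shown are arbitrary illustrative values, since the text does not state how α and β are chosen, and the function name is hypothetical.

```python
def dictionary_like_score(content_rate_score, similarity_score, alpha=1.0, beta=1.0):
    """Formula (4): alpha * content rate score + beta * similarity score."""
    return alpha * content_rate_score + beta * similarity_score

# Scores taken from the running example; alpha and beta are made-up weights.
score = dictionary_like_score(230.0, 0.8147, alpha=0.01, beta=1.0)
```

Because the content rate is not bounded by 1 while the similarity score is, α would in practice need to be chosen so that one term does not swamp the other.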

こうしてスコア付きの辞書サイト候補一覧テーブルが作成される。なお、辞書サイト候補一覧テーブルは、サイト候補ＤＢ１７０に格納される。図１３は、辞書サイト候補一覧テーブルの一例である。 In this way, a dictionary site candidate list table with scores is created. The dictionary site candidate list table is stored in the site candidate DB 170. FIG. 13 is an example of a dictionary site candidate list table.

辞書サイト候補一覧テーブル１７１０は、辞書サイト判定部１６０によって辞書サイト候補に選択された辞書サイトが、その評価（類似スコアなど）とともに格納されるテーブルである。基準ＵＲＬ１７１１、比較ＵＲＬ１７１２、類似スコア１７１３、含有率スコア１７１４及び辞書らしさスコア１７１５の各情報が登録される。 The dictionary site candidate list table 1710 is a table in which dictionary sites selected as dictionary site candidates by the dictionary site determination unit 160 are stored together with their evaluation (similarity score or the like). Information of a reference URL 1711, a comparison URL 1712, a similarity score 1713, a content rate score 1714, and a dictionary-like score 1715 is registered.

基準ＵＲＬ１７１１には、辞書サイト候補として選択されたサイトのＵＲＬが登録される。比較ＵＲＬ１７１２には、基準ＵＲＬとの間の類似スコアが最も高かった比較ＵＲＬのＵＲＬが登録される。類似スコア１７１３には、その基準ＵＲＬと比較ＵＲＬとの類似スコアが登録される。含有率スコア１７１４には、この基準ＵＲＬの含有率スコアが登録される。そして、辞書らしさスコア１７１５には、式（４）により算出された辞書らしさスコアが登録される。 In the reference URL 1711, the URL of the site selected as the dictionary site candidate is registered. In the comparison URL 1712, the URL of the comparison URL having the highest similarity score with the reference URL is registered. In the similarity score 1713, a similarity score between the reference URL and the comparison URL is registered. In the content rate score 1714, the content rate score of the reference URL is registered. Then, the dictionary-like score 1715 is registered with the dictionary-like score calculated by the equation (4).

例えば、辞書サイト候補一覧テーブル１７１０のＮｏ．１の基準ＵＲＬ「ｈｔｔｐ：／／ａａａ．ｃｏｍ／」について、類似スコアテーブル１５４０の最も高い類似スコアの比較ＵＲＬ「ｈｔｔｐ：／／ｄｄｄ．ｃｏｍ」が抽出され、比較ＵＲＬ１７１２に登録される。また、類似スコア１５４５の値が類似スコア１７１３に転記される。さらに、集計テーブル１５３０から該当する含有率が抽出され、含有率スコア１７１４に転記される。辞書らしさスコア１７１５は、類似スコア１７１３の値と、含有率スコア１７１４の値とを式（４）に代入して算出された値が登録される。 For example, for the reference URL “http://aaa.com/” of No. 1 in the dictionary site candidate list table 1710, the comparison URL with the highest similarity score in the similarity score table 1540, “http://ddd.com”, is extracted and registered in the comparison URL 1712. The value of the similarity score 1545 is copied to the similarity score 1713. Further, the corresponding content rate is extracted from the aggregation table 1530 and copied to the content rate score 1714. The dictionary-like score 1715 is registered with the value calculated by substituting the value of the similarity score 1713 and the value of the content rate score 1714 into formula (4).
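Assembling one row of the candidate list can be sketched as follows. The dictionary keys, the function name, the weights, and the `eee.com` score are hypothetical; the sketch also assumes that the reference site has at least one row in the similarity score table.

```python
def build_candidate_row(reference, similarity_rows, content_rates, alpha, beta):
    """Pick the most similar comparison site and combine the scores via formula (4)."""
    own_rows = [row for row in similarity_rows if row["reference"] == reference]
    best = max(own_rows, key=lambda row: row["score"])  # highest similarity score
    rate = content_rates[reference]
    return {
        "reference_url": reference,
        "comparison_url": best["comparison"],
        "similarity_score": best["score"],
        "content_rate_score": rate,
        "dictionary_like_score": alpha * rate + beta * best["score"],
    }

similarity_table = [
    {"reference": "http://aaa.com/", "comparison": "http://ddd.com/", "score": 0.8147},
    {"reference": "http://aaa.com/", "comparison": "http://eee.com/", "score": 0.35},
]
row = build_candidate_row(
    "http://aaa.com/", similarity_table, {"http://aaa.com/": 230.0}, alpha=0.01, beta=1.0
)
```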

こうして生成された辞書サイト候補一覧テーブル１７１０に格納される辞書サイト候補と、その特徴スコアは、辞書サイト提示部１８０を介してクライアント装置２００に提供される。 The dictionary site candidates stored in the dictionary site candidate list table 1710 generated in this way and their characteristic scores are provided to the client device 200 via the dictionary site presentation unit 180.

クライアント装置２００からの表示要求を受けた辞書サイト提示部１８０は、サイト候補ＤＢ１７０に格納される辞書サイト候補一覧テーブル１７１０に基づいて、利用者に辞書サイト候補一覧を提示する。図１４は、辞書サイト候補一覧表示画面の一例である。 Upon receiving a display request from the client device 200, the dictionary site presenting unit 180 presents a dictionary site candidate list to the user based on the dictionary site candidate list table 1710 stored in the site candidate DB 170. FIG. 14 is an example of a dictionary site candidate list display screen.

辞書サイト候補一覧表示画面１８１０は、辞書サイト提示部１８０が生成した表示情報に基づいて、クライアント装置２００の表示装置に表示される。この辞書サイト候補一覧表示画面１８１０には、チェック欄１８１１、ＵＲＬ表示１８１２及びＳｃｏｒｅ（スコア）表示１８１３と、三つの操作ボタン、「類似サイトを表示」１８１４、「ＮＧサイトに登録」１８１５及び「辞書サイトに登録」１８１６とを有する。 The dictionary site candidate list display screen 1810 is displayed on the display device of the client device 200 based on the display information generated by the dictionary site presentation unit 180. The dictionary site candidate list display screen 1810 includes a check column 1811, URL display 1812 and Score (score) display 1813, three operation buttons, “display similar site” 1814, “register in NG site” 1815, and “dictionary”. "Register on site" 1816.

チェック欄１８１１は、操作ボタン１８１４，１８１５，１８１６の対象となる辞書サイト候補を選択するための欄である。ＵＲＬ表示１８１２は、辞書サイト候補に登録されたサイトのＵＲＬである。スコア表示１８１３は、この辞書サイトについて算出された類似スコアである。なお、類似スコア以外にも含有率スコア、辞書らしさスコアが表示されるとしてもよい。 The check column 1811 is a column for selecting dictionary site candidates that are the targets of the operation buttons 1814, 1815, and 1816. The URL display 1812 is a URL of a site registered as a dictionary site candidate. The score display 1813 is a similarity score calculated for this dictionary site. In addition to the similarity score, a content score and a dictionary-like score may be displayed.

操作ボタン１８１４，１８１５，１８１６は、チェック欄１８１１にチェックされたサイトに対し、それぞれ指定された操作を行う。「類似サイトを表示」１８１４が操作されると、チェックされているサイト候補の類似サイトとして検出されているサイトが表示される。「ＮＧサイトに登録」１８１５が操作されると、チェックされているサイト候補をＮＧサイトとして登録し、以降の処理でサイト候補一覧に加えない。ＮＧサイトの登録情報は、辞書サイトＤＢ１９０に格納される。「辞書サイトに登録」１８１６が操作されると、チェックされているサイト候補を辞書サイトに登録する。辞書サイトの登録情報は、辞書サイトＤＢ１９０に格納される。 The operation buttons 1814, 1815, and 1816 perform operations designated for the sites checked in the check column 1811. When “display similar sites” 1814 is operated, sites detected as similar sites of the checked site candidates are displayed. When “Register to NG site” 1815 is operated, the checked site candidate is registered as an NG site, and is not added to the site candidate list in the subsequent processing. The registration information of the NG site is stored in the dictionary site DB 190. When “register in dictionary site” 1816 is operated, the checked site candidate is registered in the dictionary site. The dictionary site registration information is stored in the dictionary site DB 190.

なお、上記の表示画面例では、サイトのＵＲＬのみを示したが、このサイトが有する具体的なアクセスＵＲＬのリストを表示するとしてもよい。このとき、さらに要求があれば、検索ログから指定されたサイトのＵＲＬを検索し、このサイトに関する検索ログのうち、タイトル特徴を含む検索ログだけを抽出して表示させることもできる。これらサイトのコンテンツを把握することができる情報を提供することにより、利用者の辞書サイト登録判断を助けることができる。 In the above display screen example, only the URL of the site is shown, but a list of specific access URLs possessed by this site may be displayed. At this time, if there is a further request, the URL of the designated site can be searched from the search log, and only the search log including the title feature can be extracted and displayed from the search logs related to this site. By providing information that can grasp the contents of these sites, it is possible to help the user to make a dictionary site registration decision.

例えば、辞書サイト候補のＮｏ．１とＮｏ．２にチェックし、「辞書サイトに登録」１８１６を操作すると、辞書サイトへの登録が行われ、登録が完了したことを通知するメッセージが表示される。 For example, when dictionary site candidates No. 1 and No. 2 are checked and “register in dictionary site” 1816 is operated, the sites are registered as dictionary sites and a message notifying that registration is complete is displayed.

図１５は、登録終了画面の一例である。
登録終了画面１８２０は、辞書サイトへ登録したサイトの表示１８２１と、このサイトを辞書サイト一覧に追加したことを通知するメッセージとが表示される。また、操作ボタン１８２２，１８２３，１８２４なども表示される。「辞書サイト候補一覧に戻る」１８２２が操作されると、表示画面は辞書サイト候補一覧表示画面１８１０に戻る。「登録済み辞書サイト一覧を見る」１８２３が操作されると、これまでに登録された辞書サイトが一覧表示される。「登録済みＮＧサイト一覧を見る」１８２４が操作されると、ＮＧサイトに登録されたサイトが一覧表示される。 FIG. 15 is an example of a registration end screen.
The registration end screen 1820 displays a display 1821 of a site registered in the dictionary site and a message notifying that this site has been added to the dictionary site list. In addition, operation buttons 1822, 1823, 1824 and the like are also displayed. When “return to dictionary site candidate list” 1822 is operated, the display screen returns to dictionary site candidate list display screen 1810. When “view registered dictionary site list” 1823 is operated, a list of dictionary sites registered so far is displayed. When “view registered NG site list” 1824 is operated, a list of sites registered in the NG site is displayed.

こうして辞書サイト候補から選択された辞書サイトが登録されると、この辞書サイトを用いてオートリンク辞書を作成することができる。辞書サイト候補は、利用者が過去に検索を行ったことがあるサイトで、かつ、予め登録された辞書サイトとしての特徴を有するものが自動的に抽出されるので、有用である可能性が高い。 When a dictionary site selected from the dictionary site candidates is registered in this way, an autolink dictionary can be created using this dictionary site. Dictionary site candidates are likely to be useful because they are extracted automatically from sites that users have reached through past searches and that have the pre-registered characteristics of a dictionary site.

このような辞書サイト検出システムにおける辞書サイト検出方法の処理手順について、フローチャートを用いて説明する。
図１６は、辞書サイト登録までの全体処理手順を示したフローチャートである。 The processing procedure of the dictionary site detection method in such a dictionary site detection system will be described using a flowchart.
FIG. 16 is a flowchart showing the entire processing procedure up to dictionary site registration.

［ステップＳ０１］検索サーバ３００が検索ログを作成する。図４に示したように、検索ログテーブル１１１０は、検索履歴と、検索結果のアクセスの履歴をマージしたログである。生成した検索ログテーブル１１１０は、辞書サイト検出サーバ１００が、所定の周期、あるいは利用者からの要求時等のタイミングで検索サーバ３００から検索ログを取得し、検索ログＤＢ１１０に格納しておく。 [Step S01] The search server 300 creates a search log. As shown in FIG. 4, the search log table 1110 is a log obtained by merging a search history and an access history of search results. The generated search log table 1110 is acquired by the dictionary site detection server 100 from the search server 300 at a predetermined cycle or timing such as a request from the user and stored in the search log DB 110.

［ステップＳ０２］辞書サイト検出サーバ１００が、検索ログＤＢ１１０に格納される検索ログテーブルに基づいて、辞書サイトを検出する辞書サイト検出処理を行う。辞書サイト検出処理では、予め特徴ルールＤＢ１３０に格納される特徴ルールに適合した辞書サイトの候補を抽出する。 [Step S02] The dictionary site detection server 100 performs a dictionary site detection process for detecting a dictionary site based on a search log table stored in the search log DB 110. In the dictionary site detection process, dictionary site candidates that match the feature rules stored in the feature rule DB 130 in advance are extracted.

［ステップＳ０３］ステップＳ０２で検出された辞書サイトの候補を、辞書を利用する利用者に提示する。
以上の処理手順が実行されることにより、辞書を利用する利用者は、辞書サイトの候補の一覧を取得することができる。この辞書サイト候補の中から辞書サイトとして登録するサイトを選択する。 [Step S03] The dictionary site candidates detected in step S02 are presented to the user who uses the dictionary.
By executing the above processing procedure, the user using the dictionary can obtain a list of dictionary site candidates. A site to be registered as a dictionary site is selected from the dictionary site candidates.

辞書サイト検出処理手順について説明する。
図１７は、辞書サイト検出処理の手順を示したフローチャートである。検索ログＤＢ１１０に検索ログテーブル１１１０が格納された後、処理が起動される。 The dictionary site detection processing procedure will be described.
FIG. 17 is a flowchart showing a procedure of dictionary site detection processing. After the search log table 1110 is stored in the search log DB 110, processing is started.

［ステップＳ２１］検索ログテーブル１１１０から検索ログを１行取り出す。
［ステップＳ２２］情報抽出部１２０が、ステップＳ２１で抽出された検索ログから上位ＵＲＬなどの情報を抽出する情報抽出処理を行う。 [Step S21] One line of the search log is extracted from the search log table 1110.
[Step S22] The information extraction unit 120 performs an information extraction process of extracting information such as the upper URL from the search log extracted in step S21.

［ステップＳ２３］特徴集計部１４０は、特徴集計処理を行うため、対象となる上位ＵＲＬと検索キーワードの組を抽出する対象抽出処理を行う。このとき、アクセスＵＲＬと、アクセスＵＲＬのタイトルの組も抽出しておく。 [Step S23] In order to perform the feature totaling process, the feature totaling unit 140 performs a target extraction process for extracting a target upper URL and search keyword pair. At this time, a combination of the access URL and the title of the access URL is also extracted.

［ステップＳ２４］検索ログテーブルの次の行に登録があるかどうかを判定する。次の行に登録があるときは、ステップＳ２１に戻って次の行の処理を行う。次の行に登録がないときは、次のステップへ処理を進める。 [Step S24] It is determined whether there is a registration in the next row of the search log table. When there is a registration in the next line, the process returns to step S21 and the next line is processed. If there is no registration on the next line, the process proceeds to the next step.

［ステップＳ２５］これまでの処理で、検索ログから検出された上位ＵＲＬに関し、辞書サイトとしての特徴をどの程度有しているかを集計する特徴集計処理を行う。
［ステップＳ２６］辞書サイト判定部１６０が、上位ＵＲＬごとに特徴集計結果を解析し、この上位ＵＲＬが辞書サイトであるか否かを判定する。 [Step S25] With respect to the upper URL detected from the search log in the process so far, a feature totaling process for totalizing how much the dictionary site has the characteristics is performed.
[Step S26] The dictionary site determination unit 160 analyzes the feature count result for each upper URL, and determines whether or not the upper URL is a dictionary site.

［ステップＳ２７］ステップＳ２６で辞書サイトと判定されたサイト候補を、サイト判定結果（スコア）とともに辞書サイト候補一覧に格納する。
このような処理手順が実行されることにより、検索ログテーブルに基づいて、検索が行われた上位ＵＲＬ（サイト）が抽出される。特徴ルールに基づいて、そのサイトが辞書サイトの特徴を有するかどうか、いくつかの特徴項目で評価される（スコアが算出される）。そして、スコアに基づいて辞書サイト候補であるかどうかが判定され、認められれば、辞書サイト候補に登録される。 [Step S27] The site candidate determined as the dictionary site in step S26 is stored in the dictionary site candidate list together with the site determination result (score).
By executing such a processing procedure, the upper URL (site) where the search has been performed is extracted based on the search log table. Based on the feature rule, whether or not the site has the characteristics of a dictionary site is evaluated by several feature items (score is calculated). Then, based on the score, it is determined whether or not it is a dictionary site candidate, and if it is recognized, it is registered as a dictionary site candidate.

次に、情報抽出処理（ステップＳ２２）、対象抽出処理（ステップＳ２３）、特徴集計処理（ステップＳ２５）及び辞書サイト判定処理（ステップＳ２６）の各処理を説明する。 Next, each process of an information extraction process (step S22), an object extraction process (step S23), a feature tabulation process (step S25), and a dictionary site determination process (step S26) will be described.

図１８は、情報抽出処理の手順を示したフローチャートである。
情報抽出部１２０は、ステップＳ２１で読み出された検索ログの1行を入力し、処理を開始する。 FIG. 18 is a flowchart showing a procedure of information extraction processing.
The information extraction unit 120 inputs one line of the search log read in step S21 and starts processing.

［ステップＳ２２１］入力された検索ログを解析し、アクセスＵＲＬ、検索キーワード、アクセスＵＲＬのタイトルを抽出する。
［ステップＳ２２２］ステップＳ２２１で抽出されたアクセスＵＲＬから、ドメイン名を抽出し、上位ＵＲＬを検出する。 [Step S221] The input search log is analyzed, and an access URL, a search keyword, and a title of the access URL are extracted.
[Step S222] The domain name is extracted from the access URL extracted in step S221, and the upper URL is detected.

［ステップＳ２２３］ステップＳ２２１で検索ログより抽出されたアクセスＵＲＬ、検索キーワード、アクセスＵＲＬのタイトルと、ステップＳ２２２で検出された上位ＵＲＬとを、特徴集計部１４０に引き渡す。 [Step S223] The access URL extracted from the search log in step S221, the search keyword, the title of the access URL, and the upper URL detected in step S222 are delivered to the feature counting unit 140.
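Steps S221 to S223 can be sketched as below. The sketch assumes each search log row is available as a dictionary with the hypothetical keys `access_url`, `keyword`, and `title`, and it takes scheme plus host as the upper URL; the specification only says that the domain name is extracted, so the exact rule is an assumption.

```python
from urllib.parse import urlsplit

def extract_info(log_row):
    """Extract the fields of one search log row and derive the upper URL."""
    # S221: pull the access URL, search keyword, and page title from the row.
    access_url = log_row["access_url"]
    keyword = log_row["keyword"]
    title = log_row["title"]
    # S222: derive the upper URL from the domain part of the access URL.
    parts = urlsplit(access_url)
    upper = f"{parts.scheme}://{parts.netloc}/"
    # S223: hand everything over to the feature totaling stage.
    return {"upper_url": upper, "access_url": access_url, "keyword": keyword, "title": title}

# Hypothetical log row; only the aaa.com host and the keyword appear in the text.
info = extract_info(
    {"access_url": "http://aaa.com/rss.html", "keyword": "RSS", "title": "RSSとは"}
)
```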

このような処理手順が実行されることにより、検索ログからＷｅｂサイトの上位ＵＲＬが検出される。特徴集計部１４０では、上位ＵＲＬと、検索ログから抽出された情報に基づいて、Ｗｅｂサイトを特徴付ける特徴項目のスコアを集計する。まず、対象抽出処理で、対象となる上位ＵＲＬに関する上位ＵＲＬテーブルと、そのＷｅｂページに関するアクセスＵＲＬテーブルとを生成する。 By executing such a processing procedure, the upper URL of the Web site is detected from the search log. The feature totaling unit 140 totals the scores of the feature items that characterize the website based on the upper URL and information extracted from the search log. First, in the target extraction process, an upper URL table related to the target upper URL and an access URL table related to the Web page are generated.

FIG. 19 is a flowchart showing the procedure of the target extraction process.
Processing starts when the extracted information is acquired from the information extraction unit 120.
[Step S231] The access URL, search keyword, and access URL title extracted from the search log, together with the detected upper URL, are received from the information extraction unit 120.

[Step S232] The upper URL table 1510 is searched to determine whether the pair of the upper URL and the search keyword acquired in step S231 is already registered. If it is not registered, the process proceeds to step S233. If it is registered, the process proceeds to step S234.

[Step S233] Because the received pair of upper URL and search keyword is not yet registered in the upper URL table 1510, the pair is newly registered.

[Step S234] The access URL table 1520 is searched to determine whether the pair of the access URL and its title acquired in step S231 is already registered. If it is not registered, the process proceeds to step S235. If it is registered, the process ends.

[Step S235] Because the received pair of access URL and title is not yet registered in the access URL table 1520, the pair is newly registered.

Through this procedure, the upper URL table 1510 and the access URL table 1520 are generated.
After the above procedure has been performed for every search log line and all extracted information has been registered in the upper URL table 1510 and the access URL table 1520, the feature counting process starts.
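The check-then-register logic of steps S231 to S235 can be sketched as follows. This is an illustration under assumptions: the two tables are modeled as Python sets of pairs, and the record keys (`upper_url`, `keyword`, `access_url`, `title`) are hypothetical names, not identifiers from this description.

```python
def register(record, upper_url_table, access_url_table):
    """Register one extracted record in both tables, skipping duplicates."""
    # S232/S233: register the (upper URL, search keyword) pair if it is new.
    keyword_pair = (record["upper_url"], record["keyword"])
    if keyword_pair not in upper_url_table:
        upper_url_table.add(keyword_pair)
    # S234/S235: register the (access URL, title) pair if it is new.
    title_pair = (record["access_url"], record["title"])
    if title_pair not in access_url_table:
        access_url_table.add(title_pair)


upper_url_table = set()   # stands in for upper URL table 1510
access_url_table = set()  # stands in for access URL table 1520
record = {
    "upper_url": "http://dict.example.com/",
    "keyword": "apple",
    "access_url": "http://dict.example.com/word/apple",
    "title": "What is an apple",
}
register(record, upper_url_table, access_url_table)
register(record, upper_url_table, access_url_table)  # second call changes nothing
```

With sets the explicit membership test is redundant, but it is kept so that each branch of the flowchart remains visible.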

FIG. 20 is a flowchart showing the procedure of the feature counting process.
[Step S251] Based on feature rule 1, a keyword count process is performed that counts the number of search keywords detected for each upper URL.

[Step S252] Based on feature rule 2, a title feature counting process is performed that counts, for each upper URL, the feature phrases contained in the titles of the access URLs under it.
[Step S253] Based on feature rule 3, a content rate counting process is performed that calculates, for each upper URL, the proportion of pages having a feature phrase among all pages.

[Step S254] Based on feature rule 4, a similar site check process is performed that checks, for each upper URL, whether a similar site exists.
Through these procedures, the scores of the feature items based on feature rules 1 to 4 are tallied. The details of each process are described next.

FIG. 21 is a flowchart showing the procedure of the keyword count process within the feature counting process.
[Step S2511] One upper URL is taken from the upper URL table 1510. The keyword count field for this upper URL in the aggregation table 1530 is set to the initial value (1).

[Step S2512] The upper URL table 1510 is searched for another search keyword entry registered with the same upper URL. If one is found, the process proceeds to step S2513. If none is found, the process proceeds to step S2514.

[Step S2513] Because another search keyword has been found for this upper URL, 1 is added to the keyword count field for this upper URL in the aggregation table 1530.

[Step S2514] It is determined whether any unprocessed search keyword remains for this upper URL, that is, whether the entire upper URL table 1510 has been searched. If none remains, the process proceeds to step S2515. If one remains, the process returns to step S2512 to look for the next search keyword.

[Step S2515] It is determined whether an unprocessed upper URL remains in the upper URL table 1510. If one remains, the process returns to step S2511 and the search keywords of the next upper URL are counted. If none remains, the process ends.
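The loop of steps S2511 to S2515 amounts to counting the distinct search keywords registered for each upper URL. A minimal sketch, assuming the upper URL table is available as an iterable of (upper URL, search keyword) pairs (a representation chosen here for illustration):

```python
from collections import Counter


def count_keywords(upper_url_table):
    """Return {upper URL: number of distinct search keywords}."""
    # Duplicated pairs are collapsed first, so each keyword counts once per upper URL.
    unique_pairs = set(upper_url_table)
    counts = Counter(upper_url for upper_url, _keyword in unique_pairs)
    return dict(counts)


table = [
    ("http://dict.example.com/", "apple"),
    ("http://dict.example.com/", "pear"),
    ("http://shop.example.com/", "apple"),
]
keyword_counts = count_keywords(table)
```

The flowchart starts each count at 1 and adds 1 for every further keyword found. Counting all pairs in a single pass gives the same totals.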

Through this procedure, the number of search keywords for each upper URL detected from the search log is registered in the keyword count field of the aggregation table 1530.
FIG. 22 is a flowchart showing the procedure of the title feature counting process within the feature counting process.

[Step S2521] One access URL is taken from the access URL table 1520.
[Step S2522] The upper URL is extracted from the access URL taken out in step S2521. The title feature count in the row for this upper URL in the aggregation table 1530 is initialized to 0.

[Step S2523] The access URL table 1520 is searched, and one search keyword having the same upper URL as the one extracted in step S2522 is taken out.
[Step S2524] All access URL titles of the Web pages reached via the search keyword taken out in step S2523 are extracted.

[Step S2525] It is determined whether any of the access URL titles extracted in step S2524 satisfies the feature, that is, whether any title contains a feature phrase that characterizes a dictionary. If such a title is found, the process proceeds to step S2526. If no such title is found, the process proceeds to step S2527.

[Step S2526] Because the title of an access URL under the upper URL reached with this search keyword satisfies the feature, 1 is added to the title feature count of the corresponding upper URL in the aggregation table 1530.

[Step S2527] It is determined whether the access URL table 1520 contains an unprocessed search keyword for this upper URL. If there is none, the process proceeds to step S2528. If there is one, the process returns to step S2523 and the next search keyword is processed.

[Step S2528] It is determined whether the access URL table 1520 contains an unprocessed access URL (an unprocessed upper URL) whose title has not yet been taken out. If there is one, the process returns to step S2521 and the next upper URL is processed.

Through this procedure, the title feature count field of the aggregation table 1530 holds, for each upper URL detected from the search log, the number of search keywords that led to an access URL title containing a feature phrase.
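The title feature count can be sketched as below. The sketch assumes that each log record still carries its upper URL, search keyword, and title together, which is a simplification of the two tables described above. The record keys and the sample feature phrase are hypothetical.

```python
def count_title_features(records, feature_phrases):
    """Return {upper URL: number of keywords that reached a title with a feature phrase}."""
    # Every upper URL starts with an empty set, matching the initialization to 0 in S2522.
    keywords_with_hit = {r["upper_url"]: set() for r in records}
    for r in records:
        # S2525: does the title contain any feature phrase?
        if any(phrase in r["title"] for phrase in feature_phrases):
            # S2526: a set ensures each keyword is counted once per upper URL.
            keywords_with_hit[r["upper_url"]].add(r["keyword"])
    return {url: len(kws) for url, kws in keywords_with_hit.items()}


records = [
    {"upper_url": "http://dict.example.com/", "keyword": "apple", "title": "What is an apple"},
    {"upper_url": "http://dict.example.com/", "keyword": "pear", "title": "What is a pear"},
    {"upper_url": "http://shop.example.com/", "keyword": "apple", "title": "Buy apples online"},
]
title_feature_counts = count_title_features(records, ["What is"])
```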

FIG. 23 is a flowchart showing the procedure of the content rate counting process within the feature counting process.
[Step S2531] One row (upper URL, title feature count, keyword count) is extracted from the aggregation table 1530.

[Step S2532] The title feature count and the keyword count extracted in step S2531 are applied to equation (1) to calculate the content rate, which is registered in the content rate 1534 of the aggregation table 1530.

[Step S2533] It is determined whether an unprocessed upper URL exists. If one exists, the process returns to step S2531 and the unprocessed upper URL is processed. If none exists, the process ends.

Through this procedure, the content rate of each upper URL is registered in the content rate 1534 of the aggregation table 1530.
In this way, the title feature count 1532, the keyword count 1533, and the content rate 1534 are registered in the aggregation table 1530 for each upper URL. The similar site check is then performed based on this aggregation table 1530.
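As an illustration of steps S2531 to S2533, the sketch below fills in a content rate for each row. Equation (1) is not reproduced in this part of the description, so the ratio used here (title feature count divided by keyword count) is an assumption inferred from the two inputs named in step S2532. The dictionary layout and key names are likewise hypothetical.

```python
def add_content_rates(aggregation_table):
    """Add a 'content_rate' entry to each row of {upper URL: row}."""
    for row in aggregation_table.values():
        keyword_count = row["keyword_count"]
        # Assumed form of equation (1): title feature count / keyword count.
        if keyword_count:
            row["content_rate"] = row["title_feature_count"] / keyword_count
        else:
            row["content_rate"] = 0.0
    return aggregation_table


table = {
    "http://dict.example.com/": {"title_feature_count": 3, "keyword_count": 4},
    "http://shop.example.com/": {"title_feature_count": 0, "keyword_count": 5},
}
add_content_rates(table)
```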

FIG. 24 is a flowchart showing the procedure of the similar site check process within the feature counting process.
[Step S2541] Rows of the aggregation table 1530 whose keyword count 1533 is at or above a threshold are extracted and sorted by content rate. An aggregation table (B) 1535 arranged in that sort order is generated.

[Step S2542] Among the unprocessed upper URLs registered in the aggregation table (B) 1535, the topmost one, that is, the one with the highest content rate, is taken out as the reference URL.

[Step S2543] All search keywords of the upper URL corresponding to the reference URL taken out in step S2542 are retrieved from the upper URL table 1510 and designated search keyword group (1).

[Step S2544] From the upper URLs registered in the aggregation table (B) 1535, an upper URL other than the reference URL is taken out as the comparison URL.
[Step S2545] All search keywords of the upper URL corresponding to the comparison URL taken out in step S2544 are retrieved from the upper URL table 1510 and designated search keyword group (2).

[Step S2546] The number of keywords common to search keyword group (1) and search keyword group (2) is counted.
[Step S2547] A similarity score is calculated from the number of common keywords counted in step S2546 and the number of keywords in search keyword group (2) obtained in step S2545. The similarity score is given by equation (3); here it is calculated as (number of common keywords) / (number of keywords in search keyword group (2)). The calculated similarity score is stored in the similarity score table 1540 together with the reference URL, the comparison URL, the number of keywords appearing in common, and the number of keywords appearing for the comparison URL.

[Step S2548] It is determined whether another comparison URL exists in the aggregation table (B) 1535. If none exists, the process proceeds to step S2549. If one exists, the process returns to step S2544 and continues with the next comparison URL.

[Step S2549] It is determined whether another reference URL exists in the aggregation table (B) 1535. If one exists, the process returns to step S2542 and the next reference URL is processed. If none exists, the process ends.

Through this procedure, the similarity scores between upper URLs are calculated.
The above processing generates the similarity score table 1540. The dictionary site determination process is then performed based on this information.
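The similar site check of FIG. 24 can be sketched as follows. The similarity score follows the ratio stated in step S2547, the number of common keywords divided by the number of keywords of the comparison URL. The data layout (a dictionary of rows and a dictionary of keyword sets) and all names are assumptions made for illustration.

```python
def similarity_scores(aggregation_table, keywords_by_url, min_keywords):
    """Return similarity score rows for every (reference URL, comparison URL) pair."""
    # S2541: keep upper URLs with enough keywords, sorted by descending content rate.
    ranked = sorted(
        (url for url, row in aggregation_table.items()
         if row["keyword_count"] >= min_keywords),
        key=lambda url: aggregation_table[url]["content_rate"],
        reverse=True,
    )
    rows = []
    for reference in ranked:                      # S2542
        group1 = keywords_by_url[reference]       # S2543
        for comparison in ranked:                 # S2544
            if comparison == reference:
                continue
            group2 = keywords_by_url[comparison]  # S2545
            common = len(group1 & group2)         # S2546
            # S2547: common keywords / keywords of the comparison URL.
            score = common / len(group2) if group2 else 0.0
            rows.append({
                "reference": reference,
                "comparison": comparison,
                "common": common,
                "comparison_keywords": len(group2),
                "score": score,
            })
    return rows


aggregation = {
    "http://a.example/": {"keyword_count": 3, "content_rate": 0.9},
    "http://b.example/": {"keyword_count": 2, "content_rate": 0.5},
}
keywords = {
    "http://a.example/": {"apple", "pear", "plum"},
    "http://b.example/": {"apple", "pear"},
}
score_rows = similarity_scores(aggregation, keywords, min_keywords=2)
```

Because the denominator is the keyword count of the comparison URL, the score is asymmetric: swapping reference and comparison generally gives a different value.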

FIG. 25 is a flowchart showing the procedure of the dictionary site determination process.
[Step S261] One row is selected from the top of the aggregation table (B) 1535, which is arranged in descending order of content rate, and its upper URL and content rate score are acquired. This upper URL serves as the reference URL.

[Step S262] Using the upper URL acquired in step S261 as the key, the similarity score table 1540 is searched for rows whose reference URL is this upper URL. Of the unprocessed rows found, the one with the highest similarity score is taken out. This yields a comparison URL and its similarity score.

[Step S263] It is determined whether the similarity score taken out in step S262 is at or above a threshold. If it is, the reference URL can be judged, under feature rule 4, to be a dictionary site for which a similar site exists, and the process proceeds to step S264. Because rows are processed in descending order of similarity score, once a score falls below the threshold no remaining comparison URL can have a higher score, so the process proceeds to step S268.

[Step S264] Because the similarity score of the comparison URL with respect to the reference URL meets the threshold, the reference URL is treated as a dictionary site candidate. A dictionary-likeness score is calculated from the content rate score and the similarity score according to equation (4).

[Step S265] The reference URL is registered in the dictionary site candidate list table 1710. If the reference URL is already registered as a dictionary site or as an NG site, it is not registered. When registering in the dictionary site candidate list table 1710, the comparison URL, the similarity score, the content rate score, and the dictionary-likeness score are registered along with the reference URL.

[Step S267] It is determined whether the similarity score table 1540 contains another comparison URL for this reference URL. If there is none, the process proceeds to step S268. If there is one, the process returns to step S262 and continues from the selection of the next comparison URL.

[Step S268] It is determined whether the similarity score table 1540 contains an unprocessed reference URL. If there is one, the process returns to step S261 and continues from the selection of the next reference URL. If there is none, the process ends.

Through the above procedure, a list of dictionary site candidates judged to be dictionary sites under the feature rules is obtained, together with scores indicating their likelihood (similarity score, content rate score, and dictionary-likeness score).
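The determination loop of FIG. 25 can be sketched as below. Equation (4) is not reproduced in this part of the description, so the `combine` parameter, whose default multiplies the two scores, is purely a placeholder assumption. The row layout, parameter names, and the sets of known and NG sites are also hypothetical.

```python
def detect_candidates(ranked_rates, score_rows, threshold,
                      known_sites=frozenset(), ng_sites=frozenset(),
                      combine=lambda rate, similarity: rate * similarity):
    """Return candidate rows for reference URLs that have a sufficiently similar site.

    ranked_rates: list of (upper URL, content rate), highest content rate first.
    combine: placeholder for equation (4); the product is an assumption.
    """
    candidates = []
    for reference, rate in ranked_rates:                           # S261
        rows = sorted((r for r in score_rows if r["reference"] == reference),
                      key=lambda r: r["score"], reverse=True)      # S262
        for row in rows:
            if row["score"] < threshold:                           # S263
                break  # remaining rows have lower scores
            if reference in known_sites or reference in ng_sites:  # S265
                continue
            candidates.append({
                "reference": reference,
                "comparison": row["comparison"],
                "similarity": row["score"],
                "content_rate": rate,
                "dictionary_score": combine(rate, row["score"]),   # S264
            })
    return candidates


ranked = [("http://a.example/", 0.9), ("http://b.example/", 0.5)]
score_rows = [
    {"reference": "http://a.example/", "comparison": "http://b.example/", "score": 1.0},
    {"reference": "http://b.example/", "comparison": "http://a.example/", "score": 0.4},
]
candidates = detect_candidates(ranked, score_rows, threshold=0.8)
```

Sorting the rows by descending similarity score is what allows the early `break` at the threshold, mirroring the reasoning given in step S263.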

In this way, with the dictionary site detection server 100, once the feature rules are defined in advance, the sites that have the characteristics of a dictionary site are automatically selected, based on the user's search log, from among the sites the user has searched, and are presented as dictionary site candidates. The evaluation (score) for each feature item is presented at the same time and can serve as a guide when selecting a dictionary site. Moreover, because dictionary sites are detected from sites actually accessed through the user's behavior, sites useful to the user are selected as dictionary site candidates. As a result, the time required for dictionary site detection, previously done by hand, can be greatly reduced, and mistakes such as overlooking a dictionary site can be prevented.

Next, as a second embodiment, a case in which a feature score S4 and an overall feature score S5 are used in the feature counting process is described. The overall flow of the feature counting process in the second embodiment is the same as the procedure shown in FIG. 20. However, the title feature counting process (feature rule 2) and the content rate counting process (feature rule 3) are performed differently. The feature counting process of the second embodiment is obtained by replacing the title feature counting process (step S252) and the content rate counting process (step S253) in the procedure of FIG. 20 with the procedure described below.

FIG. 26 is a flowchart showing the procedure for calculating the feature score S4 and the overall feature score S5 in the feature counting process.
[Step S2551] One access URL is taken from the access URL table 1520.

[Step S2552] The upper URL is extracted from the access URL taken out in step S2551. The counters for the number of titles of all access URLs retrieved for this upper URL (total title count) and for the number of access URL titles containing a feature phrase (featured title count) are initialized to 0.

[Step S2553] The access URL table 1520 is searched, and one title of an access URL having the same upper URL as the one extracted in step S2552 is taken out. At this point, the total title count for the upper URL is incremented.

[Step S2554] It is determined whether the access URL title taken out in step S2553 satisfies the feature, that is, whether it contains a feature phrase that characterizes a dictionary. If the title satisfies the feature, the process proceeds to step S2555. If it does not, the process proceeds to step S2556.

[Step S2555] Because the title of the access URL under this upper URL satisfies the feature, the featured title count for the upper URL is incremented.

[Step S2556] It is determined whether the access URL table 1520 contains an unprocessed title for this upper URL. If there is none, the process proceeds to step S2557. If there is one, the process returns to step S2553 and the next title is processed.

[Step S2557] It is determined whether the access URL table 1520 contains an unprocessed access URL (an unprocessed upper URL) whose title has not yet been taken out. If there is one, the process returns to step S2551 and the next upper URL is processed. If there is none, the process proceeds to the next step.

[Step S2558] Processing of all upper URLs is complete, and for each upper URL detected from the search log, the number of titles of all access URLs and the number of access URL titles containing a feature phrase have been tallied. From these totals, (featured title count) / (total title count) is calculated and taken as the feature score S4.

[Step S2559] The feature score S4 calculated in step S2558 and the feature score S1 tallied by the preceding keyword count process based on feature rule 1 are applied to equation (2) to calculate the overall feature score S5. Appropriate values are set in advance for the coefficients a and b.

Through the above processing, the feature score S4 and the overall feature score S5 are calculated. Although the calculation of the overall feature score S5 above uses equation (2) to combine the feature score S1 and the feature score S4, upper URLs whose feature score S1 is at or below a threshold may instead be excluded and the remainder scored by the feature score S4 alone. An upper URL with a high overall feature score S5 combines the characteristic of feature rule 1, being accessed with many search keywords, with the characteristic of feature rules 2 and 3, having a high proportion of titles that contain a feature phrase. Such an upper URL is therefore likely to be a dictionary site.
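The two scores of the second embodiment can be sketched as follows. The ratio for S4 follows step S2558. Equation (2) is not reproduced in this part of the description, so the weighted sum a * S1 + b * S4 used for S5 is an assumed form chosen only because the text names two coefficients a and b. The function names and sample values are hypothetical.

```python
def feature_score_s4(titles, feature_phrases):
    """S4 = featured title count / total title count for one upper URL."""
    total = len(titles)
    featured = sum(
        1 for title in titles
        if any(phrase in title for phrase in feature_phrases)
    )
    return featured / total if total else 0.0


def overall_score_s5(s1, s4, a=1.0, b=1.0):
    """Assumed linear form of equation (2): a * S1 + b * S4."""
    return a * s1 + b * s4


titles = ["What is an apple", "Apple pie recipe", "What is a pear", "What is a plum"]
s4 = feature_score_s4(titles, ["What is"])
s5 = overall_score_s5(s1=10, s4=s4, a=0.1, b=1.0)
```

Since S1 is a raw keyword count while S4 lies between 0 and 1, the coefficients would have to be chosen so that neither term dominates. The values above are arbitrary.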

In the similar site check process (step S254 in FIG. 20) that follows the feature counting process of the second embodiment, and in the subsequent dictionary site determination process (step S26 in FIG. 17), the "overall feature score S5" is used in place of the "content rate".

The above processing functions can be realized by a computer. In that case, a program describing the processing contents of the functions that the document group detection apparatus should have is provided. Executing the program on a computer realizes the above processing functions on that computer. The program describing the processing contents can be recorded on a computer-readable recording medium.

To distribute the program, portable recording media on which the program is recorded, such as a DVD (Digital Versatile Disc) or a CD-ROM (Compact Disc Read Only Memory), are sold, for example. The program can also be stored in a storage device of a server computer and transferred from the server computer to another computer via a network.

A computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. In addition, each time the program is transferred from the server computer, the computer can execute processing according to the received program.

Regarding the above embodiments, the following supplementary notes are further disclosed.
(Supplementary Note 1) A document group detection method for detecting a predetermined document group that is a set of documents provided on a network and is managed by one or more computers, the method comprising:
a procedure in which information extraction means reads, from search log storage means, a search log that records, for a document retrieved using arbitrary search conditions and acquired by a user based on the search results, an address having identification information of the document and identification information of the document group to which the document belongs, and extracts the identification information of the document group recorded in the search log;
a procedure in which feature counting means classifies the search log by the identification information of the document group extracted by the information extraction means, analyzes the search log classified for each document group, and tallies, for each document group, a feature score according to a feature rule characterizing the document group to be detected;
a procedure in which document group determination means determines, based on the feature score tallied for each document group by the feature counting means, whether the document group satisfies the conditions of the document group to be detected as defined by the feature rule, and registers a document group satisfying the conditions as an information providing document group candidate; and
a procedure in which document group presenting means presents to the user the identification information of the document group registered as the information providing document group candidate by the document group determination means.

(Supplementary Note 2) The document group detection method according to Supplementary Note 1, wherein the information extraction means extracts the identification information of the document group from the address recorded in the search log and thereby identifies the document group.

(Supplementary Note 3) The document group detection method according to Supplementary Note 1, wherein
the search log records the search keyword used when the document acquired by the user was retrieved, and
the feature counting means extracts from the search log, for each document group, the search keywords used when the documents included in the document group were retrieved, counts the number of distinct search keywords for each document group, and uses the counted number of distinct search keywords as one of the feature scores of the document group.

(Supplementary Note 4) The document group detection method according to Supplementary Note 3, wherein
the document group to be searched is a document group corresponding to a dictionary whose documents contain descriptions explaining given terms, and
the document group determination means performs the determination using the feature rule based on the characteristic that a document group corresponding to the dictionary is accessed with a plurality of different search keywords corresponding to the registered terms.

(Supplementary Note 5) The document group detection method according to Supplementary Note 3, wherein the search keyword used when the document acquired by the user was searched for and information indicating the content of the document are recorded in the search log, and
the feature counting means, for each document group: extracts the information indicating the content of the documents included in the document group; determines whether a feature phrase representing a characteristic of the search-target document group is included in the information indicating the content of each document; counts the number of documents determined to include the feature phrase and the total number of documents belonging to the document group, and calculates the proportion of the document group accounted for by documents including the feature phrase; counts the total number of search keywords of the document group; and, for each document group whose counted total number of search keywords exceeds a threshold, sets the proportion of documents including the feature phrase as one of the feature scores.
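The ratio calculation above can be sketched as follows, assuming that the "information indicating the content of the document" is the page title and that a feature phrase matches by simple substring containment. The data layout (`titles`, `keyword_total`) and both function names are hypothetical.

```python
def feature_phrase_ratio(titles, feature_phrases):
    """Proportion of documents whose title contains at least one feature phrase."""
    if not titles:
        return 0.0
    hits = sum(1 for title in titles if any(p in title for p in feature_phrases))
    return hits / len(titles)


def ratio_scores(groups, feature_phrases, keyword_threshold):
    """Score only the groups whose keyword total exceeds the threshold.

    groups: {group_id: {"titles": [...], "keyword_total": int}} (assumed layout).
    """
    return {
        group_id: feature_phrase_ratio(g["titles"], feature_phrases)
        for group_id, g in groups.items()
        if g["keyword_total"] > keyword_threshold
    }
```

Skipping groups below the keyword threshold keeps sparsely accessed sites, whose ratio would rest on very few documents, out of the scoring.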

(Supplementary Note 6) The document group detection method according to Supplementary Note 5, wherein the search-target document group is a document group corresponding to a dictionary, in which each document describes content explaining a predetermined term, and a phrase expected to appear in a sentence explaining the term is set as the feature phrase, and
the document group determination means performs the determination using the feature rule based on the characteristic that a document group corresponding to the dictionary is accessed with a plurality of different search keywords and that the proportion of documents including the feature phrase in the document group exceeds a reference value.

(Supplementary Note 7) The document group detection method according to Supplementary Note 5, wherein the feature counting means, for each document group, weights the proportion of documents including the feature phrase in the document group and the total number of keywords by respective predetermined coefficients, and sets the sum of the weighted proportion and the weighted total number of keywords as one of the feature scores.

(Supplementary Note 8) The document group detection method according to Supplementary Note 7, wherein the search-target document group is a document group corresponding to a dictionary, in which each document describes content explaining a predetermined term, and a phrase expected to appear in a sentence explaining the term is set as the feature phrase, and
the document group determination means performs the determination using the feature rule based on the characteristic that, in a document group corresponding to the dictionary, the proportion of documents including the feature phrase exceeds a reference value.

(Supplementary Note 9) The document group detection method according to Supplementary Note 1 or 3, wherein the search keyword used when the document acquired by the user was searched for and information indicating the content of the document are recorded in the search log, and
the feature counting means, for each document group: extracts from the search log the information indicating the content of the documents included in the document group and the search keywords; determines whether a feature phrase representing a characteristic of the search-target document group is included in the information indicating the content of each document; counts the number of search keywords corresponding to the documents determined to include the feature phrase; and sets the counted number of search keywords corresponding to documents including the feature phrase as one of the feature scores.

(Supplementary Note 10) The document group detection method according to Supplementary Note 9, wherein the search-target document group is a document group corresponding to a dictionary, in which each document describes content explaining a predetermined term, and a phrase expected to appear in a sentence explaining the term is set as the feature phrase, and
the document group determination means performs the determination using the feature rule based on the characteristic that, in a document group corresponding to the dictionary, the proportion of the search keywords corresponding to the document group that led to documents including the feature phrase exceeds a predetermined threshold.

(Supplementary Note 11) The document group detection method according to Supplementary Note 9, wherein the feature counting means calculates, for each document group, the proportion of all search keywords corresponding to the document group that is accounted for by the search keywords corresponding to documents including the feature phrase, as a content rate, and sets the calculated content rate as one of the feature scores.
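A minimal sketch of the content rate for a single document group follows. It assumes keywords are counted as distinct values and that each log record pairs a keyword with the title of the document it led to; the record shape and the name `content_rate` are hypothetical.

```python
def content_rate(records, feature_phrases):
    """Share of a group's distinct keywords that led to a feature-phrase document.

    records: iterable of (keyword, title) pairs for one document group
             (assumed record shape; the title stands in for the document content).
    """
    records = list(records)
    all_keywords = {keyword for keyword, _ in records}
    if not all_keywords:
        return 0.0
    hit_keywords = {
        keyword
        for keyword, title in records
        if any(phrase in title for phrase in feature_phrases)
    }
    return len(hit_keywords) / len(all_keywords)
```

Whether the text intends distinct keywords or raw keyword occurrences is not stated here; counting occurrences instead would only require replacing the sets with lists.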

(Supplementary Note 12) The document group detection method according to Supplementary Note 11, wherein the search-target document group is a document group corresponding to a dictionary, in which each document describes content explaining a predetermined term, and a phrase expected to appear in a sentence explaining the term is set as the feature phrase, and
the document group determination means performs the determination using the feature rule based on the characteristic that, in a document group corresponding to the dictionary, the content rate exceeds a predetermined threshold.

(Supplementary Note 13) The document group detection method according to Supplementary Note 1, 3, 5, 7, 9, or 11, wherein the search keyword used when the document acquired by the user was searched for is recorded in the search log, and
the feature counting means: takes one of the document groups corresponding to the identification information extracted by the information extraction means as a reference document group, and takes another document group, corresponding to identification information of a document group other than the reference document group, as a comparison document group; collates the search keywords relating to the reference document group with the search keywords relating to the comparison document group to detect the common keywords shared by both; sets the proportion of the search keywords relating to the comparison document group that is accounted for by the common keywords as a similarity score of the comparison document group; and sets the similarity score as one of the feature scores.
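The similarity score can be sketched with set operations, assuming keywords are compared as exact strings. The function name `similarity_score` is hypothetical.

```python
def similarity_score(reference_keywords, comparison_keywords):
    """Share of the comparison group's keywords that also occur in the reference group."""
    comparison = set(comparison_keywords)
    if not comparison:
        return 0.0
    common = set(reference_keywords) & comparison  # the common keywords
    return len(common) / len(comparison)
```

Note that the denominator is the comparison group's keyword set, so the score is not symmetric: swapping the reference and comparison groups generally gives a different value.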

(Supplementary Note 14) The document group detection method according to Supplementary Note 13, wherein the feature counting means extracts, from the plurality of document groups detected by the information extraction means, at least two document groups whose feature scores exceed a predetermined threshold, and calculates the similarity score with one of them as the reference document group and the others as comparison document groups.

(Supplementary Note 15) The document group detection method according to Supplementary Note 13, wherein the feature counting means extracts, from the plurality of document groups detected by the information extraction means, a document group whose feature score exceeds a predetermined threshold as the reference document group, and calculates the similarity score with a document group already registered as a document group to be used as the comparison document group.

(Supplementary Note 16) The document group detection method according to Supplementary Note 13, wherein the document group determination means performs the determination using the feature rule based on the characteristic that, if the reference document group is a document group corresponding to the dictionary, there exists a comparison document group whose similarity score is equal to or greater than a predetermined value.

(Supplementary Note 17) The document group detection method according to Supplementary Note 1, wherein the document group determination means, based on unauthorized document group information in which unauthorized document groups not permitted as information providing document group candidates are registered in advance, does not register a document group as an information providing document group candidate when that document group is registered as an unauthorized document group.

(Supplementary Note 18) The document group detection method according to Supplementary Note 1, wherein the document group determination means, for a document group determined to be an information providing document group candidate, weights each of the plurality of feature scores of that document group counted by the feature counting means, and calculates a likelihood score representing the probability that the document group is the document group to be detected.
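One plausible reading of the likelihood score is a weighted sum of the feature scores. The sketch below assumes that form; the text only says that each feature score is weighted, so the summation, the score names, and the function `likelihood_score` are assumptions.

```python
def likelihood_score(feature_scores, weights):
    """Weighted sum of the feature scores of one candidate document group.

    feature_scores, weights: dicts keyed by a feature name (hypothetical names).
    """
    missing = set(feature_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in feature_scores.items())
```

Candidates can then be sorted by this score so that the most dictionary-like sites are presented first.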

(Supplementary Note 19) The document group detection method according to Supplementary Note 1, wherein the presenting means presents, together with the information providing document group candidate, the feature score calculated for that information providing document group candidate.

(Supplementary Note 20) A document group detection apparatus for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the apparatus comprising:
search log storage means for storing a search log in which, for a document searched for using an arbitrary search condition and acquired by a user based on the search result, an address having identification information of the document and identification information of the document group to which the document belongs is recorded;
information extraction means for reading the search log from the search log storage means and extracting the identification information of the document group recorded in the search log;
feature counting means for classifying the search log by the identification information of the document group extracted by the information extraction means, analyzing the search log classified for each document group, and counting, for each document group, a feature score according to a feature rule that characterizes the document group to be detected;
document group determination means for determining, based on the feature score counted for each document group by the feature counting means, whether the document group satisfies the condition of the document group to be detected defined by the feature rule, and registering a document group that satisfies the condition as an information providing document group candidate; and
document group presenting means for presenting to a user the identification information of the document group registered as the information providing document group candidate by the document group determination means.
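The extraction, counting, and determination stages of the apparatus can be sketched end to end as below. This is only an illustration under several assumptions: log records are `(url, keyword)` pairs, the host name identifies the document group, the only feature score is the number of distinct keywords, and the feature rule is a simple minimum on that count. The name `detect_candidates` and its parameters are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def detect_candidates(search_log, min_keyword_types):
    """Return document group IDs that satisfy a keyword-variety rule.

    search_log: iterable of (url, keyword) pairs (assumed record shape).
    """
    # Information extraction: classify log records by document group (host name).
    keywords_by_group = defaultdict(set)
    for url, keyword in search_log:
        host = urlsplit(url).hostname
        if host:
            keywords_by_group[host].add(keyword)

    # Feature counting: one feature score per group (distinct keyword count).
    scores = {group: len(kws) for group, kws in keywords_by_group.items()}

    # Determination: register the groups that satisfy the rule as candidates.
    return sorted(g for g, score in scores.items() if score >= min_keyword_types)
```

The returned list corresponds to what the presenting stage would show to the user for confirmation.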

(Supplementary Note 21) A document group detection program for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the program causing a computer to function as:
information extraction means for reading, from search log storage means, a search log in which, for a document searched for using an arbitrary search condition and acquired by a user based on the search result, an address having identification information of the document and identification information of the document group to which the document belongs is recorded, and extracting the identification information of the document group recorded in the search log;
feature counting means for classifying the search log by the identification information of the document group extracted by the information extraction means, analyzing the search log classified for each document group, and counting, for each document group, a feature score according to a feature rule that characterizes the document group to be detected;
document group determination means for determining, based on the feature score counted for each document group by the feature counting means, whether the document group satisfies the condition of the document group to be detected defined by the feature rule, and registering a document group that satisfies the condition as an information providing document group candidate; and
document group presenting means for presenting to a user the identification information of the document group registered as the information providing document group candidate by the document group determination means.

A diagram showing an outline of the invention.
A diagram showing a configuration example of a dictionary site detection system.
A block diagram showing a hardware configuration example of a dictionary site detection server.
A diagram for explaining the feature rules of a dictionary site.
A diagram showing an example of queries between sites.
A diagram showing another example of queries between sites.
A diagram showing an example of a search log table.
A diagram showing an example of an upper-level URL table.
A diagram showing an example of an access URL table.
An example of an aggregation table showing the results of counting the feature scores.
An example in which the aggregation table is sorted in descending order of likelihood of being a dictionary site.
An example of a similarity score table.
An example of a dictionary site candidate list table.
An example of a dictionary site candidate list display screen.
An example of a registration completion screen.
A flowchart showing the overall processing procedure up to dictionary site registration.
A flowchart showing the procedure of the dictionary site detection processing.
A flowchart showing the procedure of the information extraction processing.
A flowchart showing the procedure of the target extraction processing.
A flowchart showing the procedure of the feature counting processing.
A flowchart showing the procedure of the keyword count processing in the feature counting processing.
A flowchart showing the procedure of the title feature counting processing in the feature counting processing.
A flowchart showing the procedure of the content rate counting processing in the feature counting processing.
A flowchart showing the procedure of the similar site check processing in the feature counting processing.
A flowchart showing the procedure of the dictionary site determination processing.
A flowchart showing the procedure for calculating the feature score S4 and the overall feature score S5 in the feature counting processing.
A diagram showing an outline of an auto-link system.

Explanation of symbols

10 Document group detection device
11a Search log DB
11b Feature rule DB
11c Aggregate information DB
11d Document group candidate DB
11e Document group DB
12 Information extraction means
13 Feature counting means
14 Document group determination means
15 Document group presenting means

Claims

A document group detection method for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the method comprising:
a step in which information extraction means reads, from search log storage means, a search log in which, for a document searched for using an arbitrary search condition and acquired by a user based on the search result, an address having identification information of the document and identification information of the document group to which the document belongs is recorded, and extracts the identification information of the document group recorded in the search log;
a step in which feature counting means classifies the search log by the identification information of the document group extracted by the information extraction means, analyzes the search log classified for each document group, and counts, for each document group, a feature score according to a feature rule that characterizes the document group to be detected;
a step in which document group determination means determines, based on the feature score counted for each document group by the feature counting means, whether the document group satisfies the condition of the document group to be detected defined by the feature rule, and registers a document group that satisfies the condition as an information providing document group candidate; and
a step in which document group presenting means presents to a user the identification information of the document group registered as the information providing document group candidate by the document group determination means.

The document group detection method according to claim 1, wherein the search keyword used when the document acquired by the user was searched for is recorded in the search log, and
the feature counting means extracts from the search log, for each document group, the search keywords used when the documents included in the document group were searched for, counts the number of distinct search keywords for each document group, and sets the counted number of distinct search keywords as one of the feature scores of the document group.

The document group detection method according to claim 1 or 2, wherein the search keyword used when the document acquired by the user was searched for and information indicating the content of the document are recorded in the search log, and
the feature counting means, for each document group: extracts from the search log the information indicating the content of the documents included in the document group and the search keywords; determines whether a feature phrase representing a characteristic of the search-target document group is included in the information indicating the content of each document; counts the number of search keywords corresponding to the documents determined to include the feature phrase; and sets the counted number of search keywords corresponding to documents including the feature phrase as one of the feature scores.

The document group detection method according to claim 3, wherein the feature counting means calculates, for each document group, the proportion of all search keywords corresponding to the document group that is accounted for by the search keywords corresponding to documents including the feature phrase, as a content rate, and sets the calculated content rate as one of the feature scores.

The document group detection method according to claim 1, 2, 3, or 4, wherein the search keyword used when the document acquired by the user was searched for is recorded in the search log, and
the feature counting means: takes one of the document groups corresponding to the identification information extracted by the information extraction means as a reference document group, and takes another document group, corresponding to identification information of a document group other than the reference document group, as a comparison document group; collates the search keywords relating to the reference document group with the search keywords relating to the comparison document group to detect the common keywords shared by both; sets the proportion of the search keywords relating to the comparison document group that is accounted for by the common keywords as a similarity score of the comparison document group; and sets the similarity score as one of the feature scores.

A document group detection apparatus for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the apparatus comprising:
search log storage means for storing a search log in which, for a document searched for using an arbitrary search condition and acquired by a user based on the search result, an address having identification information of the document and identification information of the document group to which the document belongs is recorded;
information extraction means for reading the search log from the search log storage means and extracting the identification information of the document group recorded in the search log;
feature counting means for classifying the search log by the identification information of the document group extracted by the information extraction means, analyzing the search log classified for each document group, and counting, for each document group, a feature score according to a feature rule that characterizes the document group to be detected;
document group determination means for determining, based on the feature score counted for each document group by the feature counting means, whether the document group satisfies the condition of the document group to be detected defined by the feature rule, and registering a document group that satisfies the condition as an information providing document group candidate; and
document group presenting means for presenting to a user the identification information of the document group registered as the information providing document group candidate by the document group determination means.