JP2004046312A

JP2004046312A - Site manager information extraction method and device, site manager information extraction program, and recording medium with the program recorded

Info

Publication number: JP2004046312A
Application number: JP2002199458A
Authority: JP
Inventors: Tetsuo Ikeda; 池田　哲夫; Kenichi Mori; 森　憲一; Tetsuji Sato; 佐藤　哲司
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-07-09
Filing date: 2002-07-09
Publication date: 2004-02-12

Abstract

PROBLEM TO BE SOLVED: To efficiently extract and collect, from a website, candidates for contact point information consisting of any one or more of the address, telephone number, and email address of the site manager.
SOLUTION: A site information generator 1 collects web pages in advance, rearranges the web page collection into a collection of sites, generates for every site a tree structure representing the internal site structure with the top page as its root, and stores the tree structure in a site information database (DB) 2. Based on the tree structure of each site stored in the site information DB 2, a manager contact point candidate information collector 3 uses only the top page, the top page and all pages of the next lower level, or the top page and those pages of the next lower level that meet prepared filtering conditions, to collect candidates for contact point information consisting of any one or more of the address, telephone number, and email address of the manager of each site, and then stores them in a site manager contact point candidate information DB 4.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、Ｗｅｂサイトに関するメタデータを抽出、収集する方法及び装置に関わるものである。
【０００２】
【従来の技術】
Ｗｅｂサイトとは、「一つの組織あるいは人が、作成・運営する一群のページ」のことを言う。
【０００３】
Ｗｅｂサイトに関するメタデータとは、Ｗｅｂサイトの運営者の電話番号、住所、電子メールアドレス、Ｗｅｂサイトの全ページ数、Ｗｅｂサイトのリンク構造、などＷｅｂサイトの特徴を記述するデータを言う。
【０００４】
Ｗｅｂサイトの運営者の電話番号、住所、電子メールアドレスを収集する技術は見当たらない。
【０００５】
【発明が解決しようとする課題】
従来、Ｗｅｂサイトの運営者の電話番号、住所、電子メールアドレスを収集するのは容易ではなかった。
【０００６】
その理由は、以下のとおりである。
【０００７】
・単純に、Ｗｅｂページの中から電話番号、住所、電子メールアドレスを抽出しても、サイト運営者のものとは限定されず、サイト運営者と関係のある、或いは興味のある人・組織の電話番号、住所、電子メールアドレスも多数抽出され、それらの中からサイト運営者のものを選り分けるのは容易ではない。
【０００８】
・サイトを抽出する技術は、既に提案されている。例えば、Ｗｅｎ−Ｓｙａｎ　Ｌｉ，ＯｋａｎＫｏｌａｋ，Ｑｕｏｃ　Ｖｕ，Ｈａｊｉｍｅ　Ｔａｋａｎｏ：Ｄｅｆｉｎｉｎｇ　ｌｏｇｉｃａｌ　ｄｏｍａｉｎｓ　ｉｎ　ａ　ｗｅｂ　ｓｉｔｅ．Ｈｙｐｅｒｔｅｘｔ２０００：１２３−１３２等がある。Ｌｉの論文における論理ドメイン（ｌｏｇｉｃａｌ　ｄｏｍａｉｎ）は、本発明におけるサイトに相当するものである。論理ドメインは、（ハイパーリンク構造ではなく）ディレクトリ構造に沿ってサイトをまとめたものであるため、その中のどこのページにサイト運営者の電話番号、住所、電子メールアドレス情報があるかは明らかでない。従って、従来のサイト抽出方法では、サイト運営者の電話番号、住所、電子メールアドレスをどう抽出すればよいかの指針は分からない。
【０００９】
本発明は、上記従来の技術の問題点を解決するため、サイト運営者に関する情報、例えば、住所、電話番号、電子メールアドレスのいずれか、あるいは二つ以上の情報というような情報の候補を効率的に抽出、収集することができるサイト運営者情報抽出方法及び装置を提供することを課題とする。
【００１０】
【課題を解決するための手段】
上記の課題を解決するため、本発明は、サイト運営者に関する情報の候補を抽出し収集する、サイト運営者情報抽出方法であって、予めＷｅｂページを収集しておき、Ｗｅｂページ集合をサイトの集合にまとめなおし、各サイト毎にトップページを頂点とするサイト内の内部構造である木構造を生成し、その木構造をサイト情報データベースに格納するサイト情報生成ステップと、サイト情報データベースに格納された各サイト毎の木構造をもとに、各サイトの運営者に関する情報の候補を抽出、収集し、サイト運営者情報データベースに格納する運営者情報収集ステップと、を有することを特徴とするサイト運営者情報抽出方法を解決の手段とする。
【００１１】
あるいは、上記のサイト運営者情報抽出方法において、前記運営者情報収集ステップでは、各サイトの木構造のうち、トップページのみを用いて、各サイトの運営者に関する情報の候補を抽出し収集することを特徴とするサイト運営者情報抽出方法を解決の手段とする。
【００１２】
あるいは、上記のサイト運営者情報抽出方法において、前記運営者情報収集ステップでは、各サイトの木構造のうち、トップページ及びトップページの一段下位の全ページを用いて、各サイトの運営者に関する情報の候補を抽出し収集することを特徴とするサイト運営者情報抽出方法を解決の手段とする。
【００１３】
あるいは、上記のサイト運営者情報抽出方法において、前記運営者情報収集ステップでは、各サイトの木構造のうち、トップページ及びトップページの一段下位のページのうち予め作成されたフィルタリング条件を満たすページのみを用いて、各サイトの運営者に関する情報の候補を抽出し収集することを特徴とするサイト運営者情報抽出方法を解決の手段とする。
【００１４】
あるいは、上記のサイト運営者情報抽出方法において、前記運営者情報収集ステップでは、トップページからトップページの一段下位のページを指すアンカー文字列を抽出し、トップページ及び一段下位のページのうち該アンカー文字列に予めフィルタリング条件として作成したサイト運営者に関するアンカー用語を含むページのみを用いることを特徴とするサイト運営者情報抽出方法を解決の手段とする。
【００１５】
あるいは、サイト運営者に関する情報の候補を抽出し収集する、サイト運営者情報抽出装置であって、予めＷｅｂページを収集しておき、Ｗｅｂページ集合をサイトの集合にまとめなおし、各サイト毎にトップページを頂点とするサイト内の内部構造である木構造を生成するサイト情報生成手段と、該生成された各サイトの木構造情報を格納するサイト情報データベースと、該サイト情報データベースに格納された各サイト毎の木構造をもとに、各サイトの運営者に関する情報の候補を抽出し収集する運営者情報収集手段と、該収集された各サイトの運営者に関する情報の候補を格納するサイト運営者情報データベースと、を有することを特徴とするサイト運営者情報抽出装置を解決の手段とする。
【００１６】
あるいは、上記のサイト運営者情報抽出装置において、前記運営者情報収集手段は、各サイトの木構造のうち、トップページのみを用いて、各サイトの運営者に関する情報の候補を抽出し収集するものであることを特徴とするサイト運営者情報抽出装置を解決の手段とする。
【００１７】
あるいは、上記のサイト運営者情報抽出装置において、前記運営者情報収集手段は、各サイトの木構造のうち、トップページ及びトップページの一段下位の全ページを用いて、各サイトの運営者に関する情報の候補を抽出し収集するものであることを特徴とするサイト運営者情報抽出装置を解決の手段とする。
【００１８】
あるいは、上記のサイト運営者情報抽出装置において、前記運営者情報収集手段は、各サイトの木構造のうち、トップページ及びトップページの一段下位のページのうち予め作成されたフィルタリング条件を満たすページを用いて、各サイトの運営者に関する情報の候補を抽出し収集するものであることを特徴とするサイト運営者情報抽出装置を解決の手段とする。
【００１９】
あるいは、上記のサイト運営者情報抽出装置において、前記運営者情報収集手段は、トップページからトップページの一段下位のページを指すアンカー文字列を抽出し、トップページ及び一段下位のページのうち該アンカー文字列に予めフィルタリング条件として作成したサイト運営者に関するアンカー用語を含むページのみを用いるものであることを特徴とするサイト運営者情報抽出装置を解決の手段とする。
【００２０】
あるいは、上記のサイト運営者情報抽出方法におけるステップを、コンピュータに実行させるためのプログラムとしたことを特徴とするサイト運営者情報抽出プログラムを解決の手段とする。
【００２１】
あるいは、上記のサイト運営者情報抽出方法におけるステップを、コンピュータに実行させるためのプログラムとし、該プログラムを、該コンピュータが読み取りできる記録媒体に記録したことを特徴とするサイト運営者情報抽出プログラムを記録した記録媒体を解決の手段とする。
【００２２】
本発明に先立ち本発明者の一人が発明した特願２００１−３８９４４５〜２００１−３８９４４８は、サイトの抽出を以下の手順で行うものである。
・サイトのトップページを機械学習機能を用いて決定する。
・サイトの内部構造（木構造）を機械学習機能を用いて決定する。木構造は、ディレクトリ構造ではなく、ハイパーリンクの親子関係を元に決定する。
【００２３】
この発明を実装したシステムを用いて実際にサイトを抽出し、サイト運営者の電話番号、住所、電子メールアドレス情報がどこに存在するかを確認したところ、以下の知見を得た。
・殆どの場合、サイト運営者の電話番号、住所、電子メールアドレスは、サイトのトップページかその一段下のページに存在する。
・サイトの一段下のページに存在する場合、トップページからそのページを指すハイパーリンクのアンカー文字列は、特徴的な用語（例えば、連絡先、所有者、運営者など）を含むことが多い。
【００２４】
この知見は、サイトの構造をディレクトリ構造に沿ってまとめる方法では、得ることが困難な知見である。
【００２５】
本発明では、この知見を活用し、Ｗｅｂサイト運営者に関する情報、例えば、電話番号、住所、電子メールアドレスというような情報の候補を、サイト内の木構造に着目してそのトップページ、あるいは一段下位のページを見ることにより、抽出、収集する。
【００２６】
【発明の実施の形態】
以下、本発明の実施の形態について図を用いて詳細に説明する。
【００２７】
《本発明による装置構成の実施形態例》
本発明による装置構成の一実施形態例を図１に示す。
【００２８】
サイト情報生成装置１は、予めＷｅｂページを収集しておき、Ｗｅｂページ集合をサイトの集合にまとめなおし、各サイト毎にトップページを頂点とするサイト内の内部構造（木構造）を生成し、その木構造をサイト情報データベース（以下、ＤＢ）２に格納する。
【００２９】
サイト情報生成装置１における、Ｗｅｂ集合をサイトの集合にまとめなおす手段や方法としては、前述の特願２００１−３８９４４５〜２００１−３８９４４８の手段や方法を用いる。
【００３０】
すなわち、まず、Ｗｅｂページ集合を収集し、このＷｅｂページ集合からＷｅｂサイトのトップページを推定し、サイトＩＤとトップページのＵＲＬを得る。次に、このＷｅｂページ集合について推定したトップページと、それにリンクしたページからサイト単位の木構造を推定してページ単位にこのページが属するサイトＩＤおよびこのサイトの木構造の深さ情報を得る。これらの情報を木構造として登録する。
【００３１】
Ｗｅｂページ集合を収集し、このＷｅｂページ集合からＷｅｂサイトのトップページを推定するに際しては、ページ集合の各ページが属する全てのサーバ名を抽出しておき、ページ集合の各ページについて、そのＵＲＬとサーバ名とディレクトリ階層と表層知識に基づくメタ情報を抽出しておき、人手でページタイプが付与された学習用ページ集合を元にページ分類木を獲得し、このページ分類木を基に、ページ集合の各ページのページタイプへの分類尤度を抽出しておく。そして、ページ集合の各ページが所属する各サーバ毎に、
・当該サーバに属し、ディレクトリ階層が０かつ当該階層に位置する特定ファイル名をもつページを当該サーバ名におけるトップページの一つに決定する、
・ページタイプへの分類尤度のトップページタイプ分類尤度を基にトップページが存在するディレクトリ階層を順次下げてトップページが存在するディレクトリ階層を決定する、
・決定されたディレクトリ階層に所属し、下位階層にファイルが存在するファイル名をもち、前記ページタイプへの分類尤度の和が最大のページをトップページとして各ディレクトリ階層毎に決定する、
・ディレクトリ階層の１段下のディレクトリ階層に属し、ページタイプのトップページ分類尤度が閾値以上のページをトップページとする、
ことを繰り返すことでトップページを推定する。
【００３２】
Ｗｅｂページ集合について推定したトップページとそれにリンクしたページからサイト単位の木構造を推定するに際しては、人手でリンクタイプが付与された学習用ページ集合を元にリンク分類木を獲得し、このリンク分類木を基に、前記ページ集合の各ページのリンクタイプへの分類尤度を抽出しておき、ページ集合が属する全てのサーバについて、サイトの木構造の推定がなされていないサーバに属するページ集合を取り出し、以下、
・上述のサイトＩＤとそのトップページのＵＲＬによるトップページの推定によりトップページ候補集合を獲得する、
・各トップページ候補を起点とし、リンクタイプ分類尤度を基に各ページの親ページを決定する、
・親ページが未決のページ集合の中からディレクトリの最も浅い階層に属しかつ上記で抽出されたページタイプ分類尤度のトップページ尤度とインデクスページ尤度およびメニューページ尤度の和が最大のページをトップページ候補として取り出し、このトップページ候補からリンクタイプ分類尤度を基に親ページを決定する、
ことを繰り返すことで親ページを推定し、親ページをリンク元とするリンクからサイトの木構造を推定する。
【００３３】
運営者連絡先候補情報収集装置３は、サイト情報ＤＢ２に格納された各サイト毎の木構造をもとに、各サイトの運営者の住所、電話番号、電子メールアドレスのいずれか、あるいは二つ以上の情報の候補を収集してサイト運営者連絡先候補情報ＤＢ４に格納する。
【００３４】
全体として、Ｗｅｂページ集合の中から、サイト運営者の連絡先候補情報を収集する。
【００３５】
サイト運営者の連絡先候補情報の精度（再現率）としてどの位の精度を求めるかと処理性能にどの程度のものを求めるかに応じて、運営者連絡先候補情報収集装置は３つの機能のいずれかを有する。（１）が最も精度が低く、次いで（３）が低く、（２）が最も精度が高い。（１）が最も高速で、次いで（３）が高速で、（２）が最も性能が低い。精度と性能はトレードオフの関係にある。
【００３６】
（１）トップページのみからの候補情報収集
日本語処理技術を用いて、トップページの中から住所、電話番号、電子メールアドレスのいずれか、あるいは二つ以上の情報の候補を収集してサイト運営者連絡先候補情報ＤＢ４に格納する。
【００３７】
（２）トップページ及び一段下の全ページからの候補情報収集
日本語処理技術を用いて、トップページ及び一段下の全ページの中から住所、電話番号、電子メールアドレスのいずれか、あるいは二つ以上の情報の候補を収集してサイト運営者連絡先候補情報ＤＢ４に格納する。
【００３８】
（３）トップページ、及び一段下のページのうちフィルタリング条件を満たす全ページからの候補情報収集
予めフィルタリング条件を定めておき、その条件を満たす一段下の全ページとトップページの中から住所、電話番号、電子メールアドレスのいずれか、あるいは二つ以上の情報の候補を収集してサイト運営者連絡先候補情報ＤＢ４に格納する。
【００３９】
なお、ページの中から住所、電話番号、電子メールアドレスを抽出する技術は、『佐藤理史「ワールドワイドウェブを利用した住所探索」情報処理学会論文誌、Ｖｏｌ．４２，Ｎｏ．０１』等での公知の技術を用いる。
【００４０】
また、フィルタリング条件は、用語辞書を予め作成し、用語辞書中の用語を含むか否かで処理をスキップするか否かを決定するという、公知の技術を用いる。
【００４１】
《本発明による方法の実施形態例を示すフローチャート》
運営者連絡先候補情報収集装置３によるトップページのみからのサイト運営者連絡先候補情報収集処理のフローチャートを図２に示す。
１．Ｓ０１では、サイト情報ＤＢ２からサイトを一つ取り出す。
２．Ｓ０２では、サイトからトップページを取り出す。
３．Ｓ０３では、トップページからサイト運営者連絡先候補情報を取り出す。
４．Ｓ０４では、全サイトに関してＳ０１以降の処理が終わったかを確認し、全サイトに関して終了しているならば、候補情報収集全体を終了させる。
【００４２】
運営者連絡先候補情報収集装置３によるトップページ及び一段下の全ページからのサイト運営者連絡先候補情報収集処理のフローチャートを図３に示す。
１．Ｓ１１では、サイト情報ＤＢ２からサイトを一つ取り出す。
２．Ｓ１２では、サイトからトップページを取り出す。
３．Ｓ１３では、トップページからサイト運営者連絡先候補情報を取り出す。
４．Ｓ１４では、サイト内部構造（木構造）における深さ１のページを一つ取り出す。
５．Ｓ１５では、そのページからサイト運営者連絡先候補情報を取り出す。
６．Ｓ１６では、サイト内の深さ１の全ページに関して、Ｓ１４，Ｓ１５の処理が終わったか確認し、全ページに関して終了しているならば、Ｓ１７に分岐し、終了していないならばＳ１４に分岐する。
７．Ｓ１７では、全サイトに関してＳ１１以降の処理が終わったかを確認し、全サイトに関して終了しているならば、候補情報収集全体を終了させる。
【００４３】
運営者連絡先候補情報収集装置３による、トップページ及び一段下のページのうちフィルタリング条件を満たす全ページからのサイト運営者連絡先候補情報収集処理のフローチャートを図４に示す。
１．Ｓ２１では、サイト情報ＤＢ２からサイトを一つ取り出す。
２．Ｓ２２では、サイトからトップページを取り出す。
３．Ｓ２３では、トップページからサイト運営者連絡先候補情報を取り出す。
４．Ｓ２４では、サイト内部構造（木構造）における深さ１のページを一つ取り出す。
５．Ｓ２５では、そのページに関して、スキップ判定情報を導く。
６．Ｓ２６では、スキップ条件を満たすかを判定し、満たすならばＳ２８に分岐し、満たさないならば、Ｓ２７に分岐する。
７．Ｓ２７では、そのページからサイト運営者連絡先候補情報を取り出す。
８．Ｓ２８では、サイト内の深さ１の全ページに関して、Ｓ２４以降の処理が終わったか確認し、全ページに関して終了しているならば、Ｓ２９に分岐し、終了していないならばＳ２４に分岐する。
９．Ｓ２９では、全サイトに関してＳ２１以降の処理が終わったかを確認し、全サイトに関して終了しているならば、候補情報収集全体を終了させる。
【００４４】
運営者連絡先候補情報収集装置３による、図２のＳ０３、図３のＳ１３，Ｓ１５、図４のＳ２３，Ｓ２７におけるサイト運営者連絡先候補情報の取り出し処理のフローチャートを図５に示す。
１．Ｓ３０１では、欲しい情報種類を取り出す。
２．Ｓ３０２では、欲しい情報種類が住所だった場合は、ページ内を公知の日本語解析技術により解析し、ページから住所を取り出す。
３．Ｓ３０３では、ページに住所が含まれていれば、Ｓ３０４に分岐し、含まれていなければＳ３１１に分岐する。
４．Ｓ３０４では、サイトのＩＤ、ページのＵＲＬ、情報種類（住所）と共に住所をサイト運営者連絡先候補情報ＤＢ４に格納する。
５．Ｓ３０５では、欲しい情報種類が電話番号だった場合は、ページ内を公知の日本語解析技術により解析し、ページから電話番号を取り出す。
６．Ｓ３０６では、ページに電話番号が含まれていれば、Ｓ３０７に分岐し、含まれていなければＳ３１１に分岐する。
７．Ｓ３０７では、サイトのＩＤ、ページのＵＲＬ、情報種類（電話番号）と共に電話番号をサイト運営者連絡先候補情報ＤＢ４に格納する。
８．Ｓ３０８では、欲しい情報種類が電子メールアドレスだった場合は、ページ内を公知の日本語解析技術により解析し、ページから電子メールアドレスを取り出す。
９．Ｓ３０９では、ページに電子メールアドレスが含まれていれば、Ｓ３１０に分岐し、含まれていなければＳ３１１に分岐する。
１０．Ｓ３１０では、サイトのＩＤ、ページのＵＲＬ、情報種類（電子メールアドレス）と共に電子メールアドレスをサイト運営者連絡先候補情報ＤＢ４に格納する。
１１．Ｓ３１１では、全ての欲しい情報種類に関してＳ３０１以降の処理が終わったかを確認し、全ての欲しい情報種類に関して終了しているならば、サイト運営者連絡先候補情報取り出し処理全体を終了させる。
【００４５】
運営者連絡先候補情報収集装置３による、図４のＳ２５におけるスキップ判定情報導出処理のフローチャートを図６を用いて示す。
【００４６】
スキップ判定情報導出処理においては、キーワードフィルタリング技術と呼ぶ公知の技術を利用する。キーワードフィルタリングにおいては、予めフィルタリング可否を判定するための辞書を人間が作成する。
【００４７】
人間が、サイト運営者の住所、電話番号、電子メールアドレスのいずれかが載っているページのサンプルを手作業で収集する。そのページを指しているリンクのアンカー文字列中の語（例：連絡先、問い合わせ先、運営者情報、等）を手作業で収集し、用語辞書を作成する。結果は「連絡先候補情報掲載ページへのアンカー用語辞書」と呼ぶ。
【００４８】
この辞書を用いて、スキップ判定情報導出処理を行う。
１．Ｓ５０１では、サイトのトップページから該当ページを指すハイパーリンクのアンカー文字列を取り出す。
２．Ｓ５０２では、アンカー文字列の中に「連絡先候補情報掲載ページへのアンカー用語辞書」の中の用語が含まれているか確認する。
３．Ｓ５０３では、アンカー文字列の中に辞書の用語が含まれていれば、Ｓ５０４に分岐し、含まれていなければＳ５０５に分岐する。
４．Ｓ５０４では、ページはスキップ不可であると設定する。
５．Ｓ５０５では、ページはスキップ可であると設定する。
【００４９】
《候補情報収集の実施形態例》
図７、図８および図９により、本発明によるサイト運営者連絡先候補情報収集の実施形態例を説明する。
【００５０】
図７は、トップページのみからのサイト運営者連絡先候補情報収集の例である。サイトのトップページの中に以下の情報が記載されているとする。
【００５１】
連絡先
東京都千代田区〜
０３（１１１１）１１１１
Ｗｅｂｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ
要求情報種類が住所、電話番号、電子メールアドレスの３つだったとすると、＜サイトＩＤ、トップページＵＲＬ、住所、“東京都千代田区〜”＞、＜サイトＩＤ、トップページＵＲＬ、電話番号、“０３（１１１１）１１１１”＞、＜サイトＩＤ、トップページＵＲＬ、電子メールアドレス、“Ｗｅｂｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ”＞の３タプルがサイト運営者連絡先候補情報ＤＢ４に格納される。
【００５２】
図８は、トップページ及び一段下の全ページからのサイト運営者連絡先候補情報収集の例である。サイトのトップページ、ページ２、ページ４にそれぞれ以下の情報が記載されているとする。
・トップページ
お問い合わせは
ｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ
・ページ２
電話での質問は
０３（２２２２）２２２２
・ページ４
連絡先
東京都千代田区〜
０３（１１１１）２２２２
ｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ
要求情報種類が住所、電話番号、電子メールアドレスの３つだったとすると、＜サイトＩＤ、トップページＵＲＬ、電子メールアドレス、“ｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ”＞、＜サイトＩＤ、ページ２ＵＲＬ、電話番号、“０３（２２２２）２２２２”＞、＜サイトＩＤ、ページ４ＵＲＬ、住所、“東京都千代田区〜”＞、＜サイトＩＤ、ページ４ＵＲＬ、電話番号、“０３（１１１１）２２２２”＞、＜サイトＩＤ、ページ４ＵＲＬ、電子メールアドレス、“ｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ”＞の５タプルがサイト運営者連絡先候補情報ＤＢ４に格納される。
【００５３】
図９は、トップページ、及び一段下のページのうちフィルタリング条件を満たす全ページからのサイト運営者連絡先候補情報収集の例である。
【００５４】
「連絡先候補情報掲載ページへのアンカー用語辞書」に「連絡先」は含まれているが、「社名」「由来」「沿革」「製品」「リンク集」は含まれていないとする。この時、ページ１からページ４のうちスキップされないのは、ページ３だけである。
【００５５】
ページ３に以下の情報が記載されているとする（トップページには、住所、電話番号、電子メールアドレスの記載はないものとする）。
【００５６】
連絡先
東京都千代田区〜
０３（１１１１）３３３３
ｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ
要求情報種類が住所、電話番号、電子メールアドレスの３つだったとすると、＜サイトＩＤ、ページ３ＵＲＬ、住所、“東京都千代田区〜”＞、＜サイトＩＤ、ページ３ＵＲＬ、電話番号、“０３（１１１１）３３３３”＞、＜サイトＩＤ、ページ３ＵＲＬ、電子メールアドレス、“ｍａｓｔｅｒ＠・・．・・．ｃｏ．ｊｐ”＞の３タプルがサイト運営者連絡先候補情報ＤＢ４に格納される。
【００５７】
なお、図１で示した装置における各部の一部もしくは全部の機能をコンピュータのプログラムで構成し、そのプログラムをコンピュータを用いて実行して本発明を実現することができること、あるいは、図２〜図６で示した処理のステップをコンピュータのプログラムで構成し、そのプログラムをコンピュータに実行させることができることは言うまでもなく、コンピュータでその機能を実現するためのプログラム、あるいは、コンピュータにその処理のステップを実行させるためのプログラムを、そのコンピュータが読み取り可能な記録媒体、例えば、ＦＤ（フレキシブルディスク）や、ＭＯ、ＲＯＭ、メモリカード、ＣＤ、ＤＶＤ、リムーバブルディスクなどに記録して、保存したり、配布したりすることが可能である。また、上記のプログラムをインターネットや電子メールなど、ネットワークを通して提供することも可能である。
【００５８】
【発明の効果】
以上のように本発明を用いると、サイト運営者に関する情報、例えば、住所、電話番号、電子メールアドレスのいずれか、あるいは二つ以上の情報の候補を効率的に抽出し収集することが可能になる。
【図面の簡単な説明】
【図１】本発明による装置構成の一実施形態例を示す図
【図２】本発明による方法の一実施形態例を示すトップページのみからのサイト運営者連絡先候補情報収集処理のフローチャート
【図３】本発明による方法の一実施形態例を示すトップページ及び一段下の全ページからのサイト運営者連絡先候補情報収集処理のフローチャート
【図４】本発明による方法の一実施形態例を示すトップページ及び一段下のページのうちフィルタリング条件を満たす全ページからのサイト運営者連絡先候補情報収集処理のフローチャート
【図５】上記サイト運営者連絡先候補情報収集処理におけるサイト運営者連絡先候補情報の取り出し処理のフローチャート
【図６】上記サイト運営者連絡先候補情報収集処理におけるスキップ判定情報導出処理のフローチャート
【図７】本発明によるトップページのみからのサイト運営者連絡先候補情報収集の例を説明する図
【図８】本発明によるトップページ及び一段下の全ページからのサイト運営者連絡先候補情報収集の例を説明する図
【図９】本発明によるトップページ及び一段下のページのうちフィルタリング条件を満たす全ページからのサイト運営者連絡先候補情報収集の例を説明する図
【符号の説明】
１…サイト情報生成装置
２…サイト情報ＤＢ
３…運営者連絡先候補情報収集装置
４…サイト運営者連絡先候補情報ＤＢ
[0001]
[Technical field of the invention]
The present invention relates to a method and an apparatus for extracting and collecting metadata about a Web site.
[0002]
[Prior art]
A Web site is "a group of pages created and operated by a single organization or person".
[0003]
Metadata about a Web site is data that describes the characteristics of the Web site, such as the telephone number, address, and e-mail address of the Web site operator, the total number of pages of the Web site, and the link structure of the Web site.
[0004]
No existing technology for collecting the telephone numbers, addresses, and e-mail addresses of Web site operators has been found.
[0005]
[Problems to be solved by the invention]
Conventionally, it has been difficult to collect telephone numbers, addresses, and e-mail addresses of Web site operators.
[0006]
The reason is as follows.
[0007]
・ Simply extracting telephone numbers, addresses, and e-mail addresses from Web pages does not limit the results to those of the site operator. Many telephone numbers, addresses, and e-mail addresses of persons and organizations that are related to, or of interest to, the site operator are also extracted, and it is not easy to pick out the site operator's own from among them.
[0008]
・ Site extraction technology has already been proposed. See, for example, Wen-Syan Li, Okan Kolak, Quoc Vu, Hajime Takano: Defining logical domains in a web site. Hypertext 2000: 123-132. The logical domain in Li's paper corresponds to the site in the present invention. A logical domain groups a site along the directory structure (rather than the hyperlink structure), so it is not clear which page within it contains the site operator's telephone number, address, and e-mail address information. Therefore, the conventional site extraction method does not provide any guidance on how to extract the telephone number, address, and e-mail address of the site operator.
[0009]
To solve the above problems of the related art, an object of the present invention is to provide a site operator information extraction method and apparatus that can efficiently extract and collect candidates for information about a site operator, for example, any one, or two or more, of an address, a telephone number, and an e-mail address.
[0010]
[Means for Solving the Problems]
To solve the above problem, the present invention provides a site operator information extraction method for extracting and collecting candidates for information about a site operator, the method comprising: a site information generation step of collecting Web pages in advance, regrouping the set of Web pages into a set of sites, generating for each site a tree structure that represents the internal structure of the site with the top page as its root, and storing the tree structure in a site information database; and an operator information collection step of extracting and collecting candidates for information about the operator of each site based on the tree structure of each site stored in the site information database, and storing the candidates in a site operator information database.
[0011]
Alternatively, in the above site operator information extraction method, the operator information collection step extracts and collects candidates for information about the operator of each site using only the top page in the tree structure of each site.
[0012]
Alternatively, in the above site operator information extraction method, the operator information collection step extracts and collects candidates for information about the operator of each site using the top page and all pages one level below the top page in the tree structure of each site.
[0013]
Alternatively, in the above site operator information extraction method, the operator information collection step extracts and collects candidates for information about the operator of each site using only the top page and those pages one level below the top page that satisfy a filtering condition created in advance.
[0014]
Alternatively, in the above site operator information extraction method, the operator information collection step extracts from the top page the anchor character strings of the links pointing to pages one level below the top page, and uses only the top page and those pages one level below it whose anchor character string contains an anchor term related to the site operator that was created in advance as a filtering condition.
[0015]
Alternatively, the present invention provides a site operator information extraction device for extracting and collecting candidates for information about a site operator, the device comprising: site information generation means for collecting Web pages in advance, regrouping the set of Web pages into a set of sites, and generating for each site a tree structure that represents the internal structure of the site with the top page as its root; a site information database for storing the generated tree structure information of each site; operator information collection means for extracting and collecting candidates for information about the operator of each site based on the tree structure of each site stored in the site information database; and a site operator information database for storing the collected candidates for information about the operator of each site.
[0016]
Alternatively, in the above site operator information extraction device, the operator information collection means extracts and collects candidates for information about the operator of each site using only the top page in the tree structure of each site.
[0017]
Alternatively, in the above site operator information extraction device, the operator information collection means extracts and collects candidates for information about the operator of each site using the top page and all pages one level below the top page in the tree structure of each site.
[0018]
Alternatively, in the above site operator information extraction device, the operator information collection means extracts and collects candidates for information about the operator of each site using the top page and those pages one level below the top page that satisfy a filtering condition created in advance.
[0019]
Alternatively, in the above site operator information extraction device, the operator information collection means extracts from the top page the anchor character strings of the links pointing to pages one level below the top page, and uses only the top page and those pages one level below it whose anchor character string contains an anchor term related to the site operator that was created in advance as a filtering condition.
[0020]
Alternatively, the present invention provides a site operator information extraction program that causes a computer to execute the steps of the above site operator information extraction method.
[0021]
Alternatively, the present invention provides a recording medium on which the site operator information extraction program, which causes a computer to execute the steps of the above site operator information extraction method, is recorded in a form readable by the computer.
[0022]
Japanese Patent Applications Nos. 2001-389445 to 2001-389448, invented by one of the present inventors prior to the present invention, extract sites by the following procedure.
・ Determine the top page of the site using a machine learning function.
・ Determine the internal structure (tree structure) of the site using a machine learning function. The tree structure is determined not from the directory structure but from the parent-child relationships of hyperlinks.
[0023]
When sites were actually extracted using a system implementing that invention and the location of the site operator's telephone number, address, and e-mail address information was checked, the following findings were obtained.
・ In most cases, the telephone number, address, and e-mail address of the site operator are on the top page of the site or on a page one level below it.
・ When the information is on a page one level below the top page, the anchor character string of the hyperlink pointing from the top page to that page often includes a characteristic term (for example, "contact", "owner", or "operator").
[0024]
These findings are difficult to obtain with a method that organizes the site structure along the directory structure.
[0025]
The present invention uses these findings: focusing on the tree structure within a site, it extracts and collects candidates for information about the Web site operator, such as a telephone number, address, and e-mail address, by looking at the top page or the pages one level below it.
[0026]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0027]
<< Embodiment Example of Device Configuration According to the Present Invention >>
FIG. 1 shows an embodiment of the device configuration according to the present invention.
[0028]
The site information generation device 1 collects Web pages in advance, regroups the set of Web pages into a set of sites, generates for each site the internal structure (tree structure) of the site with the top page as its root, and stores the tree structure in a site information database (hereinafter, DB) 2.
[0029]
As the means and method by which the site information generation device 1 regroups the Web page set into a set of sites, those of the aforementioned Japanese Patent Applications Nos. 2001-389445 to 2001-389448 are used.
[0030]
That is, first, a Web page set is collected, the top pages of Web sites are estimated from this Web page set, and site IDs and top page URLs are obtained. Next, a tree structure for each site is estimated from the estimated top pages and the pages linked to them, and, for each page, the ID of the site to which the page belongs and the depth of the page in that site's tree structure are obtained. This information is registered as the tree structure.
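The per-page information described above (site ID, URL, and depth in the site's tree) can be pictured as a small record type. The following is only a sketch: the names `PageNode` and `pages_at_depth`, and the `anchor_text` field, are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageNode:
    """One page in a site's tree structure (hypothetical record layout)."""
    url: str
    site_id: str
    depth: int             # 0 for the top page, 1 for pages one level below it
    anchor_text: str = ""  # anchor string of the link from the parent page (assumed field)
    children: List["PageNode"] = field(default_factory=list)


def pages_at_depth(root: PageNode, depth: int) -> List[PageNode]:
    """Return every page stored at the given depth, walking down from the top page."""
    if root.depth == depth:
        return [root]
    found: List[PageNode] = []
    for child in root.children:
        found.extend(pages_at_depth(child, depth))
    return found


# Example: a site whose top page links to a contact page one level below.
top = PageNode("http://example.jp/", "site-1", 0)
top.children.append(PageNode("http://example.jp/contact.html", "site-1", 1, "連絡先"))
```

Storing the depth on each node is what lets the later collection steps ask directly for "the top page" (depth 0) or "all pages of depth 1".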
[0031]
When collecting a Web page set and estimating the top pages of Web sites from it, the names of all servers to which the pages of the page set belong are extracted; for each page of the page set, its URL, server name, directory level, and meta information based on surface knowledge are extracted; a page classification tree is acquired from a training page set to which page types have been assigned manually; and, based on this page classification tree, the classification likelihood of each page of the page set for each page type is extracted. Then, for each server to which the pages of the page set belong, the following steps are repeated:
・ determine, as one of the top pages for the server name, a page that belongs to the server, is at directory level 0, and has one of the specific file names located at that level;
・ determine the directory level at which a top page exists by successively descending the directory levels, based on the top-page-type classification likelihood among the page type classification likelihoods;
・ for each directory level, determine as a top page the page that belongs to the determined directory level, has a file name for which files exist in a lower level, and has the largest sum of the page type classification likelihoods;
・ determine as a top page any page that belongs to the directory level one level below that directory level and whose top page classification likelihood is equal to or greater than a threshold.
The top pages are estimated by repeating these steps.
[0032]
When estimating the tree structure of each site from the top pages estimated for the Web page set and the pages linked to them, a link classification tree is acquired from a training page set to which link types have been assigned manually, and, based on this link classification tree, the classification likelihood of each page of the page set for each link type is extracted. Then, over all servers to which the page set belongs, the page set belonging to a server whose site tree structure has not yet been estimated is taken out, and the following steps are repeated:
・ obtain a set of top page candidates through the top page estimation described above, which yields site IDs and top page URLs;
・ starting from each top page candidate, determine the parent page of each page based on the link type classification likelihood;
・ from the set of pages whose parent page is still undetermined, take out as a top page candidate the page that belongs to the shallowest directory level and has the largest sum of the top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods extracted above, and determine parent pages from this top page candidate based on the link type classification likelihood.
By repeating these steps, the parent pages are estimated, and the tree structure of the site is estimated from the links whose link source is a parent page.
[0033]
Based on the tree structure of each site stored in the site information DB 2, the operator contact candidate information collection device 3 collects candidates for any one, or two or more, of the address, telephone number, and e-mail address of the operator of each site, and stores them in the site operator contact candidate information DB 4.
[0034]
As a whole, contact candidate information of site operators is collected from a Web page set.
[0035]
The operator contact candidate information collection device has one of the following three functions, chosen according to how much accuracy (recall) is required of the site operator contact candidate information and how much processing performance is required. (1) has the lowest accuracy, (3) the next lowest, and (2) the highest. (1) is the fastest, (3) the next fastest, and (2) the slowest. Accuracy and performance are in a trade-off relationship.
[0036]
(1) Collection of candidate information from the top page only
Using Japanese language processing technology, candidates for any one, or two or more, of an address, a telephone number, and an e-mail address are collected from the top page and stored in the site operator contact candidate information DB4.
[0037]
(2) Collection of candidate information from the top page and all pages one level below
Using Japanese language processing technology, candidates for any one, or two or more, of an address, a telephone number, and an e-mail address are collected from the top page and all pages one level below it, and stored in the site operator contact candidate information DB4.
[0038]
(3) Collection of candidate information from the top page and all pages one level below that satisfy the filtering condition
A filtering condition is determined in advance, and candidates for any one, or two or more, of an address, a telephone number, and an e-mail address are collected from the top page and all pages one level below it that satisfy the condition, and stored in the site operator contact candidate information DB4.
[0039]
As the technique for extracting an address, a telephone number, and an e-mail address from a page, a known technique is used, such as that described in Satoshi Sato, "Address Search Using the World Wide Web", Transactions of Information Processing Society of Japan, Vol. 42, No. 01.
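To make the shape of the extracted candidates concrete, the sketch below pulls telephone number and e-mail address candidates out of page text and returns them as (site ID, page URL, information type, value) tuples. The regular expressions are simple stand-ins chosen for illustration and are not the cited technique; address extraction is left out because it needs Japanese language analysis. The function name `extract_candidates` and both patterns are assumptions.

```python
import re
import unicodedata
from typing import List, Tuple

# Illustrative patterns only (assumptions, not the technique cited in the text).
PHONE_RE = re.compile(r"0\d{1,4}[-(]\d{1,4}[-)]\d{3,4}")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

PATTERNS = {"phone": PHONE_RE, "email": EMAIL_RE}

Candidate = Tuple[str, str, str, str]  # (site ID, page URL, information type, value)


def extract_candidates(site_id: str, page_url: str, text: str,
                       wanted=("phone", "email")) -> List[Candidate]:
    """Return one tuple per match for each requested information type."""
    # NFKC folds full-width digits, parentheses, and letters to their ASCII forms.
    normalized = unicodedata.normalize("NFKC", text)
    results: List[Candidate] = []
    for kind in wanted:
        for value in PATTERNS[kind].findall(normalized):
            results.append((site_id, page_url, kind, value))
    return results
```

Each tuple records which page a value came from, so the results stay candidates that can be ranked or checked later rather than being treated as confirmed operator contacts.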
[0040]
As the filtering condition, a known technique is used in which a term dictionary is created in advance, and whether to skip processing is determined based on whether or not a term in the term dictionary is included.
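A dictionary-based filter of this kind can be sketched in a few lines. The sample terms are the ones the original Japanese text gives as examples of anchor terms (連絡先, 問い合わせ先, 運営者情報); the names `ANCHOR_TERMS` and `should_skip` are hypothetical, and plain substring matching is an assumption about how "contains a term" is tested.

```python
# Sample anchor-term dictionary; a real dictionary would be compiled by hand
# from sample pages, as the text describes.
ANCHOR_TERMS = {"連絡先", "問い合わせ先", "運営者情報"}


def should_skip(anchor_text: str, terms=ANCHOR_TERMS) -> bool:
    """Skip a page unless the anchor string pointing to it contains a dictionary term."""
    return not any(term in anchor_text for term in terms)
```

With this dictionary, a link labelled 連絡先 is kept for extraction, while links labelled 製品 or 沿革 are skipped.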
[0041]
<< Flowchart showing an embodiment of the method according to the present invention >>
FIG. 2 shows a flowchart of the site operator contact candidate information collection processing from only the top page by the operator contact candidate information collection device 3.
1. In S01, one site is extracted from the site information DB2.
2. In S02, the top page is extracted from the site.
3. In S03, site operator contact candidate information is extracted from the top page.
4. In S04, it is confirmed whether the processing from S01 onward has been completed for all sites, and if it has been completed for all sites, the entire candidate information collection is terminated.
[0042]
FIG. 3 shows a flowchart of the site operator contact candidate information collection processing from the top page and all pages one level below it by the operator contact candidate information collection device 3.
1. In S11, one site is extracted from the site information DB2.
2. In S12, the top page is extracted from the site.
3. In S13, site operator contact candidate information is extracted from the top page.
4. In S14, one page of depth 1 in the site internal structure (tree structure) is extracted.
5. In S15, site operator contact candidate information is extracted from the page.
6. In S16, it is checked whether the processing of S14 and S15 has been completed for all pages of depth 1 in the site. If the processing has been completed for all pages, the process branches to S17, and if not, the process branches to S14.
7. In S17, it is confirmed whether the processing from S11 onward has been completed for all sites, and if it has been completed for all sites, the entire candidate information collection is terminated.
[0043]
FIG. 4 shows a flowchart of the site operator contact candidate information collection processing by the operator contact candidate information collection device 3 from all pages satisfying the filtering condition among the top page and the next lower page.
1. In S21, one site is extracted from the site information DB2.
2. In S22, the top page is extracted from the site.
3. In S23, the site operator contact information is extracted from the top page.
4. In S24, one page of depth 1 in the site internal structure (tree structure) is extracted.
5. In S25, skip determination information is derived for the page.
6. In S26, it is determined whether or not the skip condition is satisfied. If the skip condition is satisfied, the flow branches to S28, and if not, the flow branches to S27.
7. In S27, the site operator contact information is extracted from the page.
8. In S28, it is checked whether the processing after S24 has been completed for all pages having a depth of 1 in the site. If the processing has been completed for all pages, the flow branches to S29. If not, the flow branches to S24.
9. In S29, it is confirmed whether or not the processing of S21 and subsequent steps has been completed for all sites, and if the processing has been completed for all sites, the entire candidate information collection is terminated.
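A minimal sketch of S21 to S29 follows, assuming that each link on the top page is available as a pair of anchor string and linked page. The names `Page`, `Site`, `collect_filtered`, `extract`, and `can_skip` are hypothetical; `extract` stands in for the process of FIG. 5 and `can_skip` for the skip determination of FIG. 6.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

@dataclass
class Page:
    url: str
    text: str
    # Links found on this page: (anchor string, page one level below).
    links: List[Tuple[str, "Page"]] = field(default_factory=list)

@dataclass
class Site:
    site_id: str
    top_page: Page

Candidate = Tuple[str, str, str, str]  # (site ID, page URL, information type, value)

def collect_filtered(
    sites: Iterable[Site],
    extract: Callable[[str], Iterable[Tuple[str, str]]],
    can_skip: Callable[[str], bool],
) -> List[Candidate]:
    candidates: List[Candidate] = []
    for site in sites:                                   # S21; loop end is S29
        top = site.top_page                              # S22
        for info_type, value in extract(top.text):       # S23
            candidates.append((site.site_id, top.url, info_type, value))
        for anchor, page in top.links:                   # S24; loop end is S28
            if can_skip(anchor):                         # S25, S26
                continue
            for info_type, value in extract(page.text):  # S27
                candidates.append((site.site_id, page.url, info_type, value))
    return candidates
```

Compared with processing every page of depth 1, the only change is the `can_skip` test, which avoids analyzing pages whose anchor string gives no hint of contact information.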
[0044]
FIG. 5 shows a flowchart of the process for extracting the site operator contact candidate information in S03 of FIG. 2, S13 and S15 of FIG. 3, and S23 and S27 of FIG. 4.
1. In S301, a desired information type is extracted.
2. In S302, if the desired information type is an address, the inside of the page is analyzed by a known Japanese analysis technique, and the address is extracted from the page.
3. In S303, if the address is included in the page, the flow branches to S304, and if not, the flow branches to S311.
4. In S304, the address is stored in the site operator contact candidate information DB4 together with the site ID, the URL of the page, and the information type (address).
5. In S305, if the desired information type is a telephone number, the page is analyzed by a known Japanese analysis technique, and the telephone number is extracted from the page.
6. In S306, if the page contains a telephone number, the flow branches to S307; otherwise, the flow branches to S311.
7. In S307, the telephone number is stored in the site operator contact candidate information DB4 together with the site ID, the URL of the page, and the information type (telephone number).
8. In S308, if the desired information type is an e-mail address, the inside of the page is analyzed by a known Japanese analysis technique, and the e-mail address is extracted from the page.
9. In S309, if the page contains an e-mail address, the flow branches to S310; otherwise, the flow branches to S311.
10. In S310, the e-mail address is stored in the site operator contact candidate information DB4 together with the site ID, page URL, and information type (e-mail address).
11. In S311, it is checked whether or not the processing of S301 and subsequent steps has been completed for all desired information types. If the processing has been completed for all desired information types, the entire site operator contact candidate information extraction processing is terminated.
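The per-type extraction of S301 to S311 might be sketched as below. The text only refers to "a known Japanese analysis technique", so everything concrete here is an assumption: the regular expressions are deliberately simple illustrations, the address extractor is a stub that finds nothing, and the names `PHONE_RE`, `EMAIL_RE`, `EXTRACTORS`, and `extract_candidates` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple

# Illustrative patterns only (assumptions, not taken from the patent).
PHONE_RE = re.compile(r"0\d{1,4}[-\s(]{1,2}\d{1,4}[-\s)]{1,2}\d{4}")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# One extractor per information type; each maps page text to a list of values.
EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {
    "address": lambda text: [],  # stub: real address analysis is out of scope here
    "telephone number": PHONE_RE.findall,
    "e-mail address": EMAIL_RE.findall,
}

def extract_candidates(
    site_id: str,
    url: str,
    text: str,
    wanted_types: Sequence[str],
    extractors: Dict[str, Callable[[str], List[str]]] = EXTRACTORS,
) -> List[Tuple[str, str, str, str]]:
    records = []
    for info_type in wanted_types:                 # S301; loop end is S311
        for value in extractors[info_type](text):  # S302 / S305 / S308
            # S304 / S307 / S310: keep site ID, page URL, type, and value together.
            records.append((site_id, url, info_type, value))
    return records
```

When a page contains no value of a given type, the inner loop simply produces nothing, which corresponds to the branches from S303, S306, and S309 to S311.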
[0045]
FIG. 6 shows a flowchart of the skip determination information deriving process in S25 of FIG. 4 by the operator contact candidate information collection device 3.
[0046]
In the skip determination information deriving process, a known technology called a keyword filtering technology is used. In keyword filtering, a human creates in advance a dictionary for determining whether filtering is possible.
[0047]
A person manually collects sample pages that contain a site operator's address, telephone number, or e-mail address, then manually collects the words (e.g., contact information, operator information, etc.) that appear in the anchor character strings of links pointing to those pages, and compiles them into a term dictionary. The result is called the "anchor term dictionary for contact candidate information posting page".
[0048]
The skip determination information deriving process is performed using this dictionary.
1. In S501, an anchor character string of a hyperlink indicating the corresponding page is extracted from the top page of the site.
2. In S502, it is checked whether or not a term in the "anchor term dictionary for contact candidate information posting page" is included in the anchor character string.
3. In S503, if a dictionary term is included in the anchor character string, the flow branches to S504, and if not, the flow branches to S505.
4. In S504, it is set that the page cannot be skipped.
5. In S505, it is set that the page can be skipped.
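The decision made in S502 to S505 amounts to a dictionary lookup on the anchor character string. A minimal sketch, assuming case-insensitive substring matching (the text does not specify the matching rule) and a made-up two-term dictionary; `ANCHOR_TERMS` and `can_skip` are hypothetical names.

```python
from typing import Iterable

# Hypothetical dictionary contents; the real dictionary is compiled by hand.
ANCHOR_TERMS = ("contact information", "operator information")

def can_skip(anchor_text: str, terms: Iterable[str] = ANCHOR_TERMS) -> bool:
    """Return True if the linked page can be skipped (S505), False if not (S504)."""
    normalized = anchor_text.casefold()
    # S502, S503: does any dictionary term occur in the anchor string?
    contains_term = any(term.casefold() in normalized for term in terms)
    return not contains_term
```

For example, `can_skip("Company history")` is true, while `can_skip("Contact information")` is false, so only the latter page would be analyzed.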
[0049]
<< Example of embodiment of candidate information collection >>
Referring to FIGS. 7, 8, and 9, an embodiment of collecting site operator contact candidate information according to the present invention will be described.
[0050]
FIG. 7 is an example of collecting the site operator contact information from only the top page. It is assumed that the following information is described in the top page of the site.
[0051]
contact information
Chiyoda ward, Tokyo~
03 (1111) 1111
Webmaster@....co.jp
Assuming that there are three requested information types (address, telephone number, and e-mail address), the records <site ID, top page URL, address, "Chiyoda ward, Tokyo~">, <site ID, top page URL, telephone number, "03 (1111) 1111">, and <site ID, top page URL, e-mail address, "Webmaster@....co.jp"> are stored in the site operator contact candidate information DB4.
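One way to picture these records is as rows of a four-column table. The sketch below uses an in-memory SQLite database purely for illustration; the table name `contact_candidates`, the column names, the site ID `site-1`, and the URL are assumptions, and the e-mail record is omitted because its address is elided in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, for illustration only
conn.execute(
    "CREATE TABLE contact_candidates ("
    " site_id TEXT, page_url TEXT, info_type TEXT, value TEXT)"
)

# Records of the form <site ID, page URL, information type, value>.
rows = [
    ("site-1", "http://example.test/", "address", "Chiyoda ward, Tokyo~"),
    ("site-1", "http://example.test/", "telephone number", "03 (1111) 1111"),
]
conn.executemany("INSERT INTO contact_candidates VALUES (?, ?, ?, ?)", rows)

# All candidates collected for one site, in insertion order.
stored = conn.execute(
    "SELECT info_type, value FROM contact_candidates"
    " WHERE site_id = ? ORDER BY rowid",
    ("site-1",),
).fetchall()
```

Keeping the page URL and information type alongside each value lets later processing judge which candidate is most likely to be the operator's actual contact.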
[0052]
FIG. 8 is an example of collecting site operator contact information from the top page and all pages one step below. It is assumed that the following information is described on the top page, page 2 and page 4 of the site.
・ Top page
Contact us
master@....co.jp
・ Page 2
Questions by phone
03 (2222) 2222
・ Page 4
contact information
Chiyoda ward, Tokyo~
03 (1111) 2222
master@....co.jp
Assuming that there are three requested information types (address, telephone number, and e-mail address), the records <site ID, top page URL, e-mail address, "master@....co.jp">, <site ID, page 2 URL, telephone number, "03 (2222) 2222">, <site ID, page 4 URL, address, "Chiyoda ward, Tokyo~">, <site ID, page 4 URL, telephone number, "03 (1111) 2222">, and <site ID, page 4 URL, e-mail address, "master@....co.jp"> are stored in the site operator contact candidate information DB4.
[0053]
FIG. 9 is an example of collecting site operator contact candidate information from the top page and from those pages one level below it that satisfy the filtering condition.
[0054]
It is assumed that "contact information" is included in the "anchor term dictionary for contact candidate information posting page", whereas "company name", "origin", "history", "product", and "link collection" are not. In this case, of pages 1 to 4, only page 3 is not skipped.
[0055]
It is assumed that the following information is described on page 3 (the address, telephone number, and e-mail address are not described on the top page).
[0056]
contact information
Chiyoda ward, Tokyo~
03 (1111) 3333
master@....co.jp
Assuming that there are three requested information types (address, telephone number, and e-mail address), the records <site ID, page 3 URL, address, "Chiyoda ward, Tokyo~">, <site ID, page 3 URL, telephone number, "03 (1111) 3333">, and <site ID, page 3 URL, e-mail address, "master@....co.jp"> are stored in the site operator contact candidate information DB4.
[0057]
It goes without saying that some or all of the functions of the units in the apparatus shown in FIG. 1 can be implemented as a computer program and that the present invention can be realized by executing that program on a computer, and likewise that the processing steps shown in FIGS. 2 to 6 can be implemented as a computer program and executed by a computer. A program for realizing these functions on a computer, or a program for causing a computer to execute these processing steps, can be recorded on a computer-readable recording medium, for example an FD (flexible disk), an MO, a ROM, a memory card, a CD, a DVD, or a removable disk, and can be stored or distributed in that form. The program can also be provided through a network such as the Internet or by e-mail.
[0058]
[Effect of the invention]
As described above, the present invention makes it possible to efficiently extract and collect candidates for information about a site operator, for example any one or more of an address, a telephone number, and an e-mail address.
[Brief description of the drawings]
FIG. 1 is a diagram showing an embodiment of an apparatus configuration according to the present invention.
FIG. 2 is a flowchart of the process of collecting site operator contact candidate information from only the top page, showing an embodiment of the method according to the present invention.
FIG. 3 is a flowchart of the process of collecting site operator contact candidate information from the top page and all pages one level below it, showing an embodiment of the method according to the present invention.
FIG. 4 is a flowchart of the process of collecting site operator contact candidate information from the top page and those pages one level below it that satisfy the filtering condition, showing an embodiment of the method according to the present invention.
FIG. 5 is a flowchart of the process of extracting site operator contact candidate information in the site operator contact candidate information collection process.
FIG. 6 is a flowchart of the skip determination information derivation process in the site operator contact candidate information collection process.
FIG. 7 is a diagram illustrating an example of collecting site operator contact candidate information from only the top page according to the present invention.
FIG. 8 is a diagram illustrating an example of collecting site operator contact candidate information from the top page and all pages one level below it according to the present invention.
FIG. 9 is a diagram illustrating an example of collecting site operator contact candidate information from the top page and those pages one level below it that satisfy the filtering condition according to the present invention.
[Explanation of symbols]
1 ... Site information generation device
2 ... Site information DB
3 ... Operator contact candidate information collection device
4 ... Site operator contact candidate information DB

Claims

A site operator information extraction method for extracting and collecting candidates for information about site operators, comprising:
a site information generation step of collecting Web pages in advance, regrouping the set of Web pages into a set of sites, generating for each site a tree structure that represents the internal structure of the site with the top page as its root, and storing the tree structure in a site information database; and
an operator information collection step of extracting and collecting candidates for information about the operator of each site on the basis of the per-site tree structure stored in the site information database, and storing the candidates in a site operator information database.

The site operator information extraction method according to claim 1, wherein, in the operator information collection step, candidates for information about the operator of each site are extracted and collected using only the top page of the tree structure of each site.

The site operator information extraction method according to claim 1, wherein, in the operator information collection step, candidates for information about the operator of each site are extracted and collected using the top page and all pages one level below the top page in the tree structure of each site.

The site operator information extraction method according to claim 1, wherein, in the operator information collection step, candidates for information about the operator of each site are extracted and collected using only the top page and those pages one level below the top page in the tree structure of each site that satisfy a filtering condition prepared in advance.

The site operator information extraction method according to claim 4, wherein, in the operator information collection step, anchor character strings pointing to pages one level below the top page are extracted from the top page, and, among the top page and the pages one level below it, only those pages whose anchor character string contains an anchor term relating to site operators, prepared in advance as the filtering condition, are used.

A site operator information extraction device for extracting and collecting candidates for information about site operators, comprising:
site information generation means for collecting Web pages in advance, regrouping the set of Web pages into a set of sites, and generating for each site a tree structure that represents the internal structure of the site with the top page as its root;
a site information database for storing the generated tree structure information of each site;
operator information collection means for extracting and collecting candidates for information about the operator of each site on the basis of the per-site tree structure stored in the site information database; and
a site operator information database for storing the collected candidates for information about the operator of each site.

The site operator information extraction device according to claim 6, wherein the operator information collection means extracts and collects candidates for information about the operator of each site using only the top page of the tree structure of each site.

The site operator information extraction device according to claim 6, wherein the operator information collection means extracts and collects candidates for information about the operator of each site using the top page and all pages one level below the top page in the tree structure of each site.

The site operator information extraction device according to claim 6, wherein the operator information collection means extracts and collects candidates for information about the operator of each site using the top page and those pages one level below the top page in the tree structure of each site that satisfy a filtering condition prepared in advance.

The site operator information extraction device according to claim 9, wherein the operator information collection means extracts, from the top page, anchor character strings pointing to pages one level below the top page, and uses, among the top page and the pages one level below it, only those pages whose anchor character string contains an anchor term relating to site operators, prepared in advance as the filtering condition.

A site operator information extraction program, characterized in that the steps of the site operator information extraction method according to any one of claims 1 to 5 are implemented as a program to be executed by a computer.

A recording medium on which a site operator information extraction program is recorded, characterized in that the steps of the site operator information extraction method according to any one of claims 1 to 5 are implemented as a program to be executed by a computer, and the program is recorded on a recording medium readable by the computer.