JP3803961B2

JP3803961B2 - Database generation apparatus, database generation processing method, and database generation program

Info

Publication number: JP3803961B2
Application number: JP2001371636A
Authority: JP
Inventors: 克人別所; 成人岩瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-12-05
Filing date: 2001-12-05
Publication date: 2006-08-02
Anticipated expiration: 2021-12-05
Also published as: JP2003173280A

Description

【０００１】
【発明の属する技術分野】
本発明は、インターネット等のネットワーク上に分散配置され、店などの案内情報等を独立に管理・運営している複数のサーバ等からデータを収集し、検索・案内するためのデータベースを生成する装置及び方法、並びにそのプログラムに関する。
【０００２】
【従来の技術】
店の案内情報などのデータは、いくつかの組織において独立に作成され、必要に応じて更新される場合が多い。一つの組織が所有しているデータ集合が全ての店の案内情報をカバーしているわけではないので、独立に作成・更新されているこれらのデータ集合を統合すれば、より充実した情報検索サービスを行うことができる。各組織が保有するデータ集合は、インターネット等のネットワークに接続されたコンピュータ内に保管され、閲覧に供される。以後、このようなコンピュータを情報サーバと呼ぶことにする。
【０００３】
複数の情報サーバからデータ集合を収集し、データベースを生成する従来の技術においては、複数の情報サーバから収集したデータ集合を単純にマージしたものをデータベースとしていた。生成されたデータベース中の各データには、該データが存在する情報サーバ内の元データへのリンク情報が付与されており、ユーザが端末を用いてデータベースからデータを検索した際は、端末に表示されたデータに付随するリンク情報により、該データの元データにアクセスを行うことができる。図１１は、データベースから、例えば業種が「中華」で住所が「新宿区神楽坂」である店を検索したときの、従来の検索結果表示画面の一例を示したものである。ユーザがリンク情報を画面上でクリックすることにより、リンク先の店の詳細画面が表示される。
【０００４】
【発明が解決しようとする課題】
いくつかの組織において独立に作成されたデータ集合では、同一店舗でも名義や住所が異なる形式、表現で登録されることが多い。従って、複数の情報サーバから収集したデータ集合を単純にマージしてデータベースを生成する従来の技術では、重複する同一店舗を一つにまとめることができず、検索結果の店舗群の中に、同一店舗が複数混在して表示されることがある。このような場合、検索結果が冗長に多くなり、ユーザは不必要なデータの中身まで吟味し、それが既に見たデータと同じかどうか判断するといった煩雑な作業を強いられることになる。例えば、図１１の検索結果表示画面では、１番目の店舗と４番目の店舗が同一であり、３番目の店舗と６番目の店舗が同一である。
【０００５】
また、店などの情報を検索するユーザにとって特に興味のあるのは、店の新しい情報や、新規に出来たお店などの情報である。このため、データ集合の収集とデータベースの生成を定期的に実行する場合、生成されたデータベースからデータを検索するユーザにとっては、表示されたデータの内、どのデータが更新されたものであるか、または新規のものであるかの情報がついていると、新しい情報を迅速に取得することが出来る。しかしながら、従来技術においては、各データにこのような更新情報は付与されない。
【０００６】
本発明の目的は、ネットワーク上に分散して存在している複数の情報サーバ等からデータ集合を収集して、冗長性がないようにデータを統合し、かつデータの更新情報が付加されたデータベースを生成することを可能とするデータベース生成装置及び方法、並びにそのためのデータベース生成プログラムを提供することにある。
【０００７】
【課題を解決するための手段】
本発明のデータベース生成装置は、過去に生成されたデータベース（旧データベース）を記憶する記憶手段と、複数地点から、名義や住所などの属性の値を含むデータ、該データの識別ＩＤ、更新日時などの情報を収集するデータ収集手段と、前記収集された各データから属性の値を抽出し、各データが、前記抽出した属性の値、識別ＩＤ、更新日時などからなる構成のデータベース（新データベース）を生成する属性情報抽出手段と、前記生成された新データベース内の属性の値が同一とみなせるデータ集合を同一グループに分類する名寄せ手段と、新データベースと前記旧データベースとの間で、属性の値が同一とみなせるデータ同士を同一と判断して両データベース間の各データを対応付けする結合手段と、前記新データベース中のデータの識別ＩＤや更新日時などの情報と、前記データと対応付けされた前記旧データベース中のデータの識別ＩＤや更新日時などの情報とを比較することによって、前記新データベース内の該当データに更新情報を付与する更新情報付与手段とを有することを特徴とする。
【０００８】
名寄せ手段では、生成された新データベースにおいて、重複するデータが一つにされている。このため、このデータベースからユーザの要求に合致するデータを検索し表示したとき、同一店舗のデータが複数個表示されることはなく、検索結果の把握がより容易に行える。また、結合手段は、生成した新データベースと、前回生成した旧データベースとの間で、同一店舗等のデータを特定し、更新情報付与手段では、それらの識別ＩＤ（例えば名称）や更新日時などを比較することにより、データの更新情報を導出するので、最終的に生成されたデータベースは、データの更新情報が付与された上で、データを表示することが可能である。
【０００９】
次に、本発明のデータベース生成処理方法は、過去に生成されたデータベース（旧データベース）を記憶する記憶装置を備えたデータベース生成装置が、複数地点から、名義や住所などの属性の値を含むデータ、該データの識別ＩＤ、更新日時などの情報を収集するデータ収集過程と、前記収集された各データから属性の値を抽出し、各データが、前記抽出した属性の値、識別ＩＤ、更新日時などからなる構成のデータベース（新データベース）を生成して記憶装置に格納する属性情報抽出過程と、前記記憶装置に格納された新データベース内の属性の値が同一とみなせるデータ集合を同一グループに分類する名寄せ過程と、前記記憶装置に格納された新データベースと過去に生成して記憶装置に保持されている旧データベースとの間で、属性の値が同一とみなせるデータ同士を同一と判断して両データベース間の各データを対応付けする結合過程と、前記新データベース中のデータの識別ＩＤや更新日時などの情報と、前記データと対応付けされた前記旧データベース中のデータの識別ＩＤや更新日時などの情報とを比較することによって、記憶装置に格納された新データベース内の該当データに更新情報を付与する更新情報付与過程とを実行することを特徴とする。
【００１０】
次に、本発明のデータベース生成プログラムは、複数地点から、名義や住所などの属性の値を含むデータ、該データの識別ＩＤ、更新日時などの情報を収集するデータ収集プロセスと、前記収集された各データから属性の値を抽出し、各データが、前記抽出した属性の値、識別ＩＤ、更新日時などからなる構成のデータベース（新データベース）を生成する属性情報抽出プロセスと、前記属性情報抽出プロセスで生成された新データベース内の属性の値が同一とみなせるデータ集合を同一グループに分類する名寄せプロセスと、前記新データベースと過去に生成した旧データベースとの間で、属性の値が同一とみなせるデータ同士を同一と判断して両データベース間の各データを対応付けする結合プロセスと、前記新データベース中のデータの識別ＩＤや更新日時などの情報と、前記データと対応付けされた前記旧データベース中のデータの識別ＩＤや更新日時などの情報とを比較することによって、前記新データベース内の該当データに更新情報を付与する更新情報付与プロセスとをコンピュータに実行させるプログラムであることを特徴とする。
【００１１】
【発明の実施の形態】
以下に、本発明の一実施例について、図面を参照して説明する。
図１は、本発明の一実施の形態のデータベース生成装置の構成例を示す図である。データベース生成装置１０は、インターネット等のネットワーク４０に接続されるものであり、該ネットワーク４０を介して、店の案内情報などのデータ集合を管理・運営している複数の情報サーバ２０と、ユーザが使用するユーザ端末３０とに接続している。ネットワーク４０に接続された個々の情報サーバ２０はそのＵＲＬ（uniform resource locator）によって識別される。各情報サーバ２０は、その内部にデータ集合をもち、当該データ集合を、他の情報サーバとは独立して管理・運営している。したがって、同一店舗の案内情報などが、複数の情報サーバ２０内に存在することが多々ある。データベース生成装置１０自体も、その内部にデータ集合をもって、管理・運営するという形態をとっていてもよい。ユーザ端末３０としては、典型的には、ＷＷＷソフトウェア（ＷＷＷブラウザ）が組み込まれたパーソナルコンピュータ（パソコン）や携帯端末が使用される。各ユーザは、該ユーザ端末３０を用いて情報検索などを行うほか、必要ならデータベース生成装置１０に対して要望等を通知する。
【００１２】
データベース生成装置１０は、データ収集手段１１、属性情報抽出手段１２、名寄せ手段１３、結合手段１４、更新情報付与手段１５の各処理手段、及び、データベース格納部１６を具備する。データベース格納部１６には、過去（ここでは前回とする）に生成したデータベース（旧データベース）１７と新規に生成したデータベース（新データベース）１８が格納される。データベース生成装置１０は、所謂コンピュータで実現されるものであり、各処理手段１１〜１５はＣＰＵやその内蔵メモリ（ＲＡＭ、ＲＯＭ等）が受け持ち、データベース格納部１６はハードディスク、その他の外部記憶装置などが受け持つ。
【００１３】
なお、データベース生成装置１０自体、ユーザ端末３０から検索要求を受けて情報検索サービスを実施してもよい。この場合、図１では省略したが、データベース生成装置１０は情報検索手段も具備することになる。また、情報検索装置は該データベース生成装置１０とは別構成として、データベース生成装置１０で生成したデータベースを別の情報検索装置で利用することでもよい。
【００１４】
図２は、本発明の一実施形態のデータベース生成方法のフローチャートを示す図である。以下、図２のフローチャートに従って、本データベース生成装置１０の動作を詳しく説明する。
【００１５】
データベース生成装置１０では、データ収集手段１１において、一定期間や特定日時ごと（例えば、１日、１週間、毎月曜日など）に、各情報サーバ２０にアクセスし、各情報サーバ２０内のデータ（データ集合）を収集する（ステップ１１１）。ここで、各データは一つのファイルであり、全てのファイルがあるディレクトリ配下にあるものとする。このディレクトリの所在は、データベース生成装置１０の管理者と各情報サーバ２０の管理者との間であらかじめ取り決めがなされており、データ収集手段１１は、各情報サーバ２０の該ディレクトリ配下のファイル群をダウンロードし、例えばＲＡＭやハードディスク等に一時的に格納する。ここで、ファイルとともに、データの名称となるファイル名（これが当該データのリンク情報となる）とファイルの更新日時の情報も取得する。
【００１６】
図３は、情報サーバＡ及びＢからダウンロードしたデータの一例を示したものである。この例では、同一店舗「紅蘭亭」のデータが情報サーバＡにもＢにも登録されているものとし、そのデータを示したものである。図３に示すように、情報サーバＡとＢでは、同一店舗「紅蘭亭」でも、名義や住所等が異なる形式、表現で登録されている。
【００１７】
次に、属性情報抽出手段１２において、上記データ収集手段１１で収集した各データから名義や住所などの該データを特徴付ける属性の値を抽出する（ステップ１１２）。各データファイルは典型的にはＨＴＭＬ文書やＸＭＬ文書であり、ユーザはユーザ端末３０を用いてＷＷＷソフトウェア（ＷＷＷブラウザ）から該当ファイルのＵＲＬにアクセスすることにより、その内容を閲覧することができるものである。各データファイルの内容が、どういった属性からなり、各属性がどのようなフォーマットで記述されているかといったフォーマット情報は、各情報サーバ２０ごとに決められている。ここでは、各情報サーバに対応したデータファイルフォーマット解析ルーチンを属性情報抽出手段１２が保持しているとする。属性情報抽出手段１２は、各情報サーバに対応したデータファイルフォーマット解析ルーチンにより、データファイルから名義や住所などの属性値を抽出する。次に、属性情報抽出手段１２では、抽出した名義や住所などの属性値と該データが存在する情報サーバ名及び該データのリンク情報及び更新日時の情報等からなるデータ（レコード）を作成し、このようなデータが集積したデータベースを生成してデータベース格納部１６に格納する。この新たに生成されたデータベースを新データベース１８とする。また、前回（１日前、１週間前など）、同様に各情報サーバ２０からデータを収集して生成し、後述の名寄せ、結合、更新情報付与等の処理を施したデータベースを旧データベース１７とする。
【００１８】
図４は、新たに生成されたデータベース（新データベース）１８の一例を示したもので、（ａ）は情報サーバＡのデータ、（ｂ）は情報サーバＢのデータである。ここでは、抽出する属性として業種、名義、住所をとっている。業種体系は情報サーバ２０ごとに一般に異なっている。また、同一店舗のデータでも、情報サーバが異なれば、名義や住所の表記には揺れがある。
【００１９】
なお、情報サーバ２０が、店のデータファイルの他に、各店の名義や住所、電話番号、リンク情報などの基本情報のみが記載されているデータのリストからなるファイルをもっている場合、データ収集手段１１において、データファイル群の代わりに、そのようなリストファイルをダウンロードしてもよい。この場合、属性情報抽出手段１２においては、リストファイルの各データから名義、住所、リンク情報などを抽出し、抽出したリンク情報をもとに、再び情報サーバ２０にアクセスし、データファイルの更新日時情報を取得する。そして、同様に図４のような新データベース１８を生成する。
【００２０】
次に、名寄せ手段１３において、新データベース１８内の名義や住所などの属性の値が同一とみなせるデータ（レコード）を同一グループに分類する（ステップ１１３）。即ち、同一店舗として名寄せする。
【００２１】
例えば、図４に示した新データベース１８の任意の２データ間において、名義及び住所の属性の値同士を照合し、マッチしたレコード同士を同一グループに分類する。名義文字列や住所文字列の照合方法には例えば次のようなものが考えられる。一つには完全一致したときマッチするとみなす方法（完全一致と呼ぶ）があり、また、両方に共通して含まれる文字の数の割合がある閾値以上のときマッチするとみなす方法（文字単位一致と呼ぶ）がある。他には、文字列を単語分割して両方に共通して含まれる単語の数の割合がある閾値以上のときマッチするとみなす方法（単語単位一致と呼ぶ）がある。いずれの方法も、漢数字を算用数字に変換したり、英字を大文字に統一化するといった表記の揺れを解消する処理を事前に行うことにより、より照合の精度を高めることが可能である。照合の結果、図４の例では、１番目と４番目のデータ（レコード）がマッチし、３番目と６番目のデータがマッチする。このマッチしたレコード同士を同一グループに分類する。ここで、各グループを通常のデータと区別して、名寄せデータと呼ぶことにする。
【００２２】
名寄せ手段１３では、各名寄せデータの名義や住所の属性値として、例えば当該名寄せデータに含まれるデータの名義や住所の属性値から一つだけ選んで、その値そのものを用いるか、あるいは正規化した値に変換する。また、各データの業種名は、データベース生成装置１０独自の業種体系における対応する業種名に変換する。
【００２３】
図４について、こうして更新された新データベース１８の一例を図５に示す。例えば、データベース生成装置１０独自の業種体系では、業種として「和食」、「中華」などがあり、図４におけるデータの業種名はいずれも「中華」に変換される。図５において、同一グループに分類された１番目と４番目のデータの業種名はともに「中華」に変換されるので、名寄せデータとしての業種名も「中華」となる。３番目と６番目のデータに関しても同様である。また、名寄せデータの名義や住所の属性値としては、１番目と４番目のデータでは、名義は「紅蘭亭」を選択し、住所は「新宿区神楽坂１−２−３」を選択している。同様に、３番目と６番目のデータでは、名義は「大竹亭」を選択し、住所は「新宿区神楽坂３−８−６」を選択している。なお、図５中の新データベース１８の「更新情報」の欄は後述の更新情報付与手段１５で書き替えられるもので、ここでは全て空（ＮＵＬＬ）としておく。
【００２４】
ここで、どの属性値同士をどの照合方法で照合させるかといった照合ルールは、名寄せ手段１３を実現するプログラム内に記述してもよいし、データベース生成装置１０内の、プログラムが参照する外付けテーブルに記述して、データベース生成装置１０の管理者が、この外付けテーブルを自由に変更できるようにしておいてもよい。
【００２５】
図６は、このような外付けテーブルの内容の一例である。図６（ａ）では、データが一致する基準を記述する。この例では、照合項目として名義と住所を指定している。名義の照合結果の評価値が９０点以上かつ住所の照合結果の評価値が８０点以上の場合、あるいは名義の照合結果の評価値が８０点以上かつ住所の照合結果の評価値が９０点以上の場合、２データが一致すると判定する。図６（ｂ）では、名義の照合方法を記述する。ここでは、照合方法として完全一致、文字単位一致、単語単位一致を指定しており、各方法による照合を行う。完全一致の照合処理で一致したならば評価値１００とし、一致しなければ評価値０とする。文字単位一致の照合結果の評価値は一致した文字の数の割合に１００を乗じたものとする。単語単位一致の照合結果の評価値も一致した単語の数の割合に１００を乗じたものとする。一番高い評価値を返した照合方法の評価値を名義の評価値とする。図６（ｃ）では、住所の照合方法を同様に記述する。ここでは、照合方法として完全一致、単語単位一致を指定している。一番高い評価値を返した照合方法の評価値を住所の評価値とする。
【００２６】
次に、結合手段１４において、データベース格納部１６にある、名寄せ後の新データベース１８と、前回各情報サーバ２０からデータを収集して、生成した旧データベース１７との間で、名義や住所などの属性の値が同一とみなせる名寄せデータ同士を同一と判断してリンク付けし、対応付けする（ステップ１１４）。例えば、新旧データベース１７、１８内の同一と判断された両データに、同一なデータであることを示す情報を付与するなどしてリンク付けし、対応付けする。
【００２７】
ここでは、情報サーバ２０において、同一データのリンク情報が時の経過とともに変わり得るという前提であるものとする。各データの更新情報を導出するにあたっては、新データと旧データの更新日時などを比較する必要があり、そのためには、新旧データベースにおいて、どのデータが同一かを判断しなければならない。リンク情報が不変であれば、リンク情報が同一かで判断できるが、リンク情報が変わり得るという前提のもとでは、データがもつ名義や住所の属性値が同一かで判断する必要があるわけである。ここで、同一データであっても時の経過とともに、名義などが微妙に変更される場合もありうるので、照合は、表記の揺れを考慮して行う。具体的には、例えば完全一致以外に文字単位一致や単語単位一致といった照合方法で行う。基本的には名寄せの場合と同様である。また、照合の対象となる項目を、例えば名義のみにすると、同一店の住所が変更しても、新旧のデータはマッチすることになる。このように、どのような条件で新旧のデータを同一視するかは、照合ルールを変更することにより調節可能である。図７に、外付けテーブルに記述する照合ルールにおけるデータ一致基準の一例を示す。ここでは照合項目として名義のみを指定した例を示している。名義の照合方法の記述は、例えば図６と同様にすればよい。
【００２８】
図８は、旧データベース１７の一例である。便宜上、図８では、各データは前々回から更新がなかったとしている。結合手段１４では、図５に示した新データベース１８の各名寄せデータと同一な旧データベース１７の名寄せデータを、名義のみあるいは名義及び住所の属性値同士を照合することによって特定する。その結果、図５の新データベース１８の１番目、２番目、３番目の名寄せデータがそれぞれ、図８の旧データベース１７の１番目、２番目、３番目の名寄せデータにリンク付けされる。図５の新データベース１８の４番目の名寄せデータにリンク付けされる名寄せデータは、図８の旧データベース１７には存在しない。なお、リンク付けされた名寄せデータ内の同一の対応情報サーバをもつデータ同士も、同一のデータとしてリンク付けされる。以後、図５、図８の各データを上から何番目かで表現する。
【００２９】
次に、更新情報付与手段１５において、新データベース１８のデータのリンク情報や更新日時の情報と、結合手段１４により該データと同一と判断された旧データベース１７中のデータのリンク情報や更新日時の情報とを比較することにより、新データベース１８中の該当データに更新情報を設定・付与する（ステップ１１５）。即ち、新データベース１８中のデータとリンク付けされた旧データベース１７のデータがあり、かつリンク情報または更新日時が変更されているとき、該データは更新されたものと判断し、いずれも変更されていないとき、該データは更新なしと判断し、新データベース１８中の該当データの更新情報を「更新」あるいは「更新なし」とする。また、新データベース１８中のデータとリンク付けされた旧データベース１７のデータがない場合、該データは新規に作成されたものと判断し、新データベース１８中の該当データの更新情報を「新規」とする。
【００３０】
例えば、図５の新データベース１８の１番目のデータは、リンク付けされた図８の旧データベース１７の１番目のデータと、リンク情報が同じで、更新日時が変わっているので、当該データは更新されたものと判断する。
【００３１】
図５の新データベース１８の２番目のデータは、リンク付けされた図８の旧データベース１７の２番目のデータと比べ、リンク情報も更新日時も不変なので、当該データは更新されていないものと判断する。図５の新データベース１８の４番目のデータについても同様である。
【００３２】
図５の新データベース１８の３番目のデータは、リンク付けされた図８の旧データベース１７の３番目のデータと比べ、更新日時は変わらないが、リンク情報が変わっているので、当該データは更新されたものと判断する。
【００３３】
図５の新データベース１８の５番目のデータは、名寄せデータとしては、図８の旧データベース１７の３番目の名寄せデータとリンクしているが、データとしてリンク付けされたデータは図８の旧データベース１７にないので、新規に作成されたものと判断する。
【００３４】
図５の新データベース１８の６番目のデータにリンク付けされたデータは、図８の旧データベース１７にないので、当該データは新規に作成されたものと判断する。
【００３５】
このようにして、図５の新データベース１８と図８の旧データベース１７の場合、図９に示すような更新情報の付与された新データベース１８が最終的に生成される。更新情報付与手段１５では、この最終的に生成された新データベース１８でもって旧データベース１７を上書きする。
【００３６】
以上によりデータベースの生成が終了する。最終的に生成されたデータベースにユーザ端末３０からアクセスし、ユーザの要求に合致するデータを検索し表示したときには、名寄せデータの業種、名義、住所の情報と、該データが存在する情報サーバ２０内のファイルへのリンク情報及び更新情報が表示される。図１０は、図９の生成データベースにより、業種が「中華」で住所が「新宿区神楽坂」である店を検索したときの検索結果の表示例である。ユーザはこのリンク情報を画面上でクリックすることにより、リンク先のファイルの内容である店の詳細情報にアクセスすることができる。
【００３７】
以上、本発明の典型的な一実施例について述べたが、名寄せ前の旧データベースを保持しておき、結合手段１４のリンク付けを、名寄せ後の新旧データベース（図５及び図８）間で実行するのではなく、名寄せ前の新旧データベース（図４及び図４相当の古いデータベース）間で実行してもよい。例えば、この場合、対応情報サーバが同一なデータ同士を照合させる。
【００３８】
情報サーバ２０において、同一データのリンク情報が時の経過とともに変わりえても、各データにとって恒久的に不変なＩＤ情報がデータ中に含まれている場合は次のように処理を行うこともできる。属性情報抽出手段１２において、このＩＤ情報を抽出し、結合手段１４におけるリンク付けを、新データベース中の各データに対し、当該データと同一のＩＤ情報をもつ旧データベース中のデータをリンク付けることによって行う。
【００３９】
また、情報サーバ２０において、同一データのリンク情報が常に不変であれば、結合手段１４のリンク付けは必要でない。更新情報付与手段１５において、生成したデータベース（名寄せ前のものでも名寄せ後のものでもよい）中のデータの更新日時が、前にデータ集合を収集した時点以降ならば、該データは更新されたデータか新規データであることが分かる。さらに、該データの対応情報サーバとリンク情報がともに同一であるデータが、前に生成したデータベース中にあれば、該データは更新されたデータであり、なければ新規データであることが判明する。
【００４０】
上記に挙げた以外にも、本発明は特許請求の範囲の記載内で、様々な変更や拡張が可能である。例えば、名寄せ手段や名寄せ過程をなくして、各データの更新情報のみを付与する構成も考えられる。
【００４１】
なお、図１で示した装置における各部の一部もしくは全部での処理機能をコンピュータのプログラムで構成し、そのプログラムをコンピュータを用いて実行して本発明を実現することができること、あるいは、図２で示した処理手順をコンピュータのプログラムで構成し、そのプログラムをコンピュータに実行させることができることは言うまでもない。また、コンピュータでその処理機能を実現するためのプログラム、あるいは、コンピュータにその処理手順を実行させるためのプログラムを、そのコンピュータが読み取り可能な記録媒体、例えば、ＦＤや、ＭＯ、ＲＯＭ、メモリカード、ＣＤ、ＤＶＤ、リムーバブルディスクなどに記録して、保存したり、提供したりすることができるとともに、インターネット等のネットワークを通してそのプログラムを配布したりすることが可能である。
【００４２】
【発明の効果】
以上説明したように、本発明によれば、生成されたデータベースからユーザの要求に合致するデータを検索したとき、重複データがなく、かつデータの更新情報が付加された形で検索結果を表示することが可能となる。
【図面の簡単な説明】
【図１】本発明の一実施形態のデータベース生成装置の構成を示す図である。
【図２】本発明の一実施形態のデータベース生成方法のフローチャート図である。
【図３】情報サーバからダウンロードしたデータの一例を示す図である。
【図４】属性情報抽出手段で生成された新データベースの一例を示す図である。
【図５】名寄せ手段で更新された新データベースの一例を示す図である。
【図６】名寄せ手段で適用される照合ルールの一例を示す図である。
【図７】結合手段で適用される照合ルールの一例を示す図である。
【図８】前回生成した旧データベースの一例を示す図である。
【図９】更新情報付与手段で更新情報が付与された新データベースの一例を示す図である。
【図１０】本発明により生成されたデータベースからの検索結果の表示画面の一例を示す図である。
【図１１】従来のデータベース生成技術により生成したデータベースからの検索結果の表示画面の例を示す図である。
【符号の説明】
１０データベース生成装置
１１データ収集手段
１２属性情報抽出手段
１３名寄せ手段
１４結合手段
１５更新情報付与手段
１６データベース格納部
１７旧データベース
１８新データベース
２０情報サーバ
３０ユーザ端末
４０ネットワーク
[0001]
[Technical field to which the invention belongs]
The present invention relates to an apparatus and a method that collect data from a plurality of servers and the like, which are distributed over a network such as the Internet and independently manage and operate guide information for stores and the like, and that generate a database for search and guidance, and to a program therefor.
[0002]
[Prior art]
Data such as store guide information is often created independently by several organizations and updated as necessary. Because the data set owned by any one organization does not cover the guide information of every store, integrating these independently created and updated data sets makes it possible to offer a more complete information retrieval service. The data set held by each organization is stored in a computer connected to a network such as the Internet and made available for browsing. Hereinafter, such a computer is referred to as an information server.
[0003]
In the conventional technique of collecting data sets from a plurality of information servers and generating a database, the data sets collected from the information servers are simply merged to form the database. Each data item in the generated database is given link information to the original data in the information server where that data resides, and when a user searches the database from a terminal, the user can access the original data through the link information attached to the data displayed on the terminal. FIG. 11 shows an example of a conventional search result display screen when, for example, stores whose business type is “Chinese” and whose address is “Shinjuku-ku Kagurazaka” are searched for in the database. When the user clicks the link information on the screen, the detail screen of the linked store is displayed.
[0004]
[Problems to be solved by the invention]
In data sets created independently by several organizations, even the same store is often registered with its name and address in different formats and expressions. Therefore, the conventional technique of generating a database by simply merging data sets collected from a plurality of information servers cannot combine duplicate entries for the same store into one, and the same store may appear several times among the stores in a search result. In such a case, the search result becomes redundantly long, and the user is forced into the tedious work of examining the contents of unnecessary data and judging whether it is the same as data already seen. For example, in the search result display screen of FIG. 11, the first store and the fourth store are the same, and the third store and the sixth store are the same.
[0005]
Further, what particularly interests a user searching for information on stores and the like is new information about a store and information about newly opened stores. For this reason, when data set collection and database generation are executed periodically, a user searching the generated database can obtain new information quickly if each displayed data item carries information indicating whether it has been updated or is new. However, in the prior art, such update information is not given to each data item.
[0006]
An object of the present invention is to provide a database generation apparatus and method, and a database generation program therefor, that make it possible to collect data sets from a plurality of information servers and the like distributed over a network, integrate the data without redundancy, and generate a database to which data update information is added.
[0007]
[Means for Solving the Problems]
The database generation apparatus of the present invention comprises: storage means for storing a database generated in the past (old database); data collection means for collecting, from a plurality of locations, data including attribute values such as name and address, together with information such as the identification ID and update date and time of the data; attribute information extraction means for extracting attribute values from each collected data item and generating a database (new database) in which each data item consists of the extracted attribute values, the identification ID, the update date and time, and the like; name identification means for classifying into the same group the data in the generated new database whose attribute values can be regarded as the same; combining means for judging data whose attribute values can be regarded as the same to be identical between the new database and the old database, and associating the data of the two databases with each other; and update information adding means for adding update information to the corresponding data in the new database by comparing information such as the identification ID and update date and time of data in the new database with information such as the identification ID and update date and time of the data in the old database associated with that data.
[0008]
The name identification means merges duplicate data into one in the generated new database. Therefore, when data matching a user's request is retrieved from this database and displayed, data for the same store is not displayed more than once, and the search result is easier to grasp. Further, the combining means identifies data for the same store or the like between the generated new database and the previously generated old database, and the update information adding means derives the update information of the data by comparing their identification IDs (for example, names) and update dates and times, so the finally generated database can display data with update information attached.
[0009]
Next, in the database generation processing method of the present invention, a database generation apparatus including a storage device that stores a database generated in the past (old database) executes: a data collection process of collecting, from a plurality of locations, data including attribute values such as name and address, together with information such as the identification ID and update date and time of the data; an attribute information extraction process of extracting attribute values from each collected data item, generating a database (new database) in which each data item consists of the extracted attribute values, the identification ID, the update date and time, and the like, and storing it in the storage device; a name identification process of classifying into the same group the data in the new database stored in the storage device whose attribute values can be regarded as the same; a combining process of judging data whose attribute values can be regarded as the same to be identical between the new database stored in the storage device and the old database generated in the past and held in the storage device, and associating the data of the two databases with each other; and an update information adding process of adding update information to the corresponding data in the new database stored in the storage device by comparing information such as the identification ID and update date and time of data in the new database with information such as the identification ID and update date and time of the data in the old database associated with that data.
[0010]
Next, the database generation program of the present invention is a program that causes a computer to execute: a data collection process of collecting, from a plurality of locations, data including attribute values such as name and address, together with information such as the identification ID and update date and time of the data; an attribute information extraction process of extracting attribute values from each collected data item and generating a database (new database) in which each data item consists of the extracted attribute values, the identification ID, the update date and time, and the like; a name identification process of classifying into the same group the data in the new database generated by the attribute information extraction process whose attribute values can be regarded as the same; a combining process of judging data whose attribute values can be regarded as the same to be identical between the new database and an old database generated in the past, and associating the data of the two databases with each other; and an update information adding process of adding update information to the corresponding data in the new database by comparing information such as the identification ID and update date and time of data in the new database with information such as the identification ID and update date and time of the data in the old database associated with that data.
[0011]
[Embodiments of the invention]
An embodiment of the present invention will be described below with reference to the drawings.
FIG. 1 is a diagram illustrating a configuration example of a database generation apparatus according to an embodiment of the present invention. The database generation apparatus 10 is connected to a network 40 such as the Internet and, via the network 40, to a plurality of information servers 20 that manage and operate data sets such as store guide information and to user terminals 30 used by users. Each information server 20 connected to the network 40 is identified by its URL (uniform resource locator). Each information server 20 holds a data set internally and manages and operates it independently of the other information servers. Consequently, guide information for the same store often exists in more than one information server 20. The database generation apparatus 10 itself may also hold, manage, and operate a data set internally. The user terminal 30 is typically a personal computer or a portable terminal with WWW software (a WWW browser) installed. Each user uses the user terminal 30 to perform information retrieval and the like and, if necessary, to notify the database generation apparatus 10 of requests and the like.
[0012]
The database generation apparatus 10 includes processing means, namely data collection means 11, attribute information extraction means 12, name identification means 13, combining means 14, and update information adding means 15, and a database storage unit 16. The database storage unit 16 stores a database 17 generated in the past (here, on the previous run; the old database) and a newly generated database 18 (the new database). The database generation apparatus 10 is realized by a so-called computer: the processing means 11 to 15 are implemented by a CPU and its built-in memory (RAM, ROM, etc.), and the database storage unit 16 by a hard disk or other external storage device.
[0013]
Note that the database generation apparatus 10 itself may receive search requests from the user terminal 30 and provide an information retrieval service. In this case, although omitted from FIG. 1, the database generation apparatus 10 also includes information retrieval means. Alternatively, the information retrieval apparatus may be configured separately from the database generation apparatus 10, and the database generated by the database generation apparatus 10 may be used by that separate information retrieval apparatus.
[0014]
FIG. 2 is a diagram showing a flowchart of the database generation method according to the embodiment of the present invention. Hereinafter, the operation of the database generation apparatus 10 will be described in detail according to the flowchart of FIG.
[0015]
In the database generation apparatus 10, the data collection means 11 accesses each information server 20 at fixed intervals or at specific dates and times (for example, every day, every week, or every Monday) and collects the data (data set) in each information server 20 (step 111). Here, each data item is assumed to be one file, and all the files are assumed to be under a certain directory. The location of this directory is agreed in advance between the administrator of the database generation apparatus 10 and the administrator of each information server 20; the data collection means 11 downloads the files under that directory of each information server 20 and temporarily stores them in, for example, RAM or on a hard disk. Along with each file, it also acquires the file name, which serves as the name of the data (and becomes the link information of the data), and the update date and time of the file.
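The collection step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes that each information server's agreed directory is already reachable as a local path (in practice the files would be fetched over the network), and the names `CollectedFile` and `collect_files` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class CollectedFile:
    server: str        # name of the information server the file came from
    link: str          # file name, used as the link information of the data
    updated: datetime  # update date and time of the file
    content: str       # raw file contents (for example HTML or XML)


def collect_files(server: str, directory: Path) -> list[CollectedFile]:
    """Gather every file directly under the agreed directory (assumed local)."""
    collected = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories; only data files are collected
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        collected.append(
            CollectedFile(server, path.name, mtime, path.read_text(encoding="utf-8"))
        )
    return collected
```

Keeping the file name and the update date and time next to the contents matters because the later update information step compares exactly these two values between runs.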
[0016]
FIG. 3 shows an example of data downloaded from information servers A and B. In this example, data for the same store “Koran-tei” is assumed to be registered in both information servers A and B, and that data is shown. As shown in FIG. 3, information servers A and B register the same store “Koran-tei” with its name, address, and so on in different formats and expressions.
[0017]
Next, the attribute information extraction means 12 extracts, from each data item collected by the data collection means 11, the attribute values that characterize the data, such as name and address (step 112). Each data file is typically an HTML or XML document whose contents a user can view by accessing the file's URL from WWW software (a WWW browser) on the user terminal 30. The format information, that is, which attributes each data file contains and in what format each attribute is written, is fixed for each information server 20. Here, the attribute information extraction means 12 is assumed to hold a data file format analysis routine for each information server. The attribute information extraction means 12 extracts attribute values such as name and address from a data file using the data file format analysis routine for the corresponding information server. Next, the attribute information extraction means 12 creates data (records) each consisting of the extracted attribute values such as name and address, the name of the information server where the data resides, the link information of the data, the update date and time, and so on, generates a database in which such records are accumulated, and stores it in the database storage unit 16. This newly generated database is the new database 18. The database that was generated the previous time (one day earlier, one week earlier, etc.) by collecting data from each information server 20 in the same way, and then subjected to the name identification, combining, and update information adding processing described later, is the old database 17.
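One way to picture the per-server analysis routines is a lookup table of parser functions keyed by server name. The sketch below is only an illustration under invented assumptions: the two file formats (tagged fields for server A, `Label: value` lines for server B), the field labels, and the names `Record`, `PARSERS`, and `extract_record` are all hypothetical and do not come from the patent.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    server: str
    link: str
    updated: str
    business_type: str
    name: str
    address: str
    update_info: Optional[str] = None  # left empty until the update step


def parse_server_a(text: str) -> dict[str, str]:
    """Hypothetical format for server A: <business>, <name>, <address> tags."""
    fields = {}
    for key, tag in (("business_type", "business"), ("name", "name"), ("address", "address")):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[key] = match.group(1).strip() if match else ""
    return fields


def parse_server_b(text: str) -> dict[str, str]:
    """Hypothetical format for server B: one 'Label: value' pair per line."""
    labels = {"Category": "business_type", "Shop": "name", "Location": "address"}
    fields = {key: "" for key in labels.values()}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() in labels:
            fields[labels[label.strip()]] = value.strip()
    return fields


# One format analysis routine per information server, looked up by server name.
PARSERS: dict[str, Callable[[str], dict[str, str]]] = {
    "A": parse_server_a,
    "B": parse_server_b,
}


def extract_record(server: str, link: str, updated: str, text: str) -> Record:
    """Build one new-database record from a collected file."""
    fields = PARSERS[server](text)
    return Record(server=server, link=link, updated=updated, **fields)
```

Supporting another information server then only requires registering one more routine in `PARSERS`.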
[0018]
FIG. 4 shows an example of the newly generated database (new database) 18, where (a) is the data of information server A and (b) is the data of information server B. Here, business type, name, and address are taken as the attributes to be extracted. The business type classification generally differs from one information server 20 to another. In addition, even for data on the same store, the notation of the name and address varies between information servers.
[0019]
If the information server 20 has, in addition to the store data files, a file consisting of a list of entries that contain only basic information for each store, such as name, address, telephone number, and link information, the data collection means 11 may download that list file instead of the group of data files. In this case, the attribute information extraction means 12 extracts the name, address, link information, and so on from each entry of the list file, accesses the information server 20 again based on the extracted link information, and acquires the update date and time of the data file. A new database 18 like that of FIG. 4 is then generated in the same way.
[0020]
Next, the name identification means 13 classifies data (records) in the new database 18 whose attribute values such as name and address can be regarded as the same into the same group (step 113). In other words, it identifies them as the same store (name identification).
[0021]
For example, for any two data items in the new database 18 shown in FIG. 4, the values of the name and address attributes are compared with each other, and records that match are classified into the same group. The following methods, for example, can be used to compare name strings and address strings. One regards the strings as matching when they are exactly identical (called exact match). Another regards them as matching when the proportion of characters contained in both is at or above a certain threshold (called character-unit match). A third divides the strings into words and regards them as matching when the proportion of words contained in both is at or above a certain threshold (called word-unit match). With any of these methods, matching accuracy can be improved by first eliminating notational variation, for example by converting kanji numerals to Arabic numerals or unifying alphabetic characters to upper case. As a result of the comparison, in the example of FIG. 4, the first and fourth data items (records) match, and the third and sixth data items match. The matched records are classified into the same group. Here, each group is called name identification data to distinguish it from ordinary data.
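The three comparison methods and the preliminary normalization described above might be sketched as follows. Several points are assumptions made only for illustration, since the text does not fix them: the proportion is computed as shared items (counted as multisets) divided by the length of the longer string, words are split on whitespace in place of a real word segmenter, kanji numerals are converted digit by digit (positional forms such as 十 are not handled), and the 0.8 threshold is arbitrary.

```python
from collections import Counter

# Digit-by-digit kanji numeral table (a simplification, see the note above).
KANJI_DIGITS = str.maketrans("〇一二三四五六七八九", "0123456789")


def normalize(text: str) -> str:
    """Reduce notational variation before comparison."""
    return text.translate(KANJI_DIGITS).upper().strip()


def _shared_ratio(items_a: list[str], items_b: list[str]) -> float:
    """Assumed definition: shared items (as multisets) over the longer side."""
    longer = max(len(items_a), len(items_b))
    if longer == 0:
        return 0.0
    shared = sum((Counter(items_a) & Counter(items_b)).values())
    return shared / longer


def exact_match(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


def char_ratio(a: str, b: str) -> float:
    return _shared_ratio(list(normalize(a)), list(normalize(b)))


def word_ratio(a: str, b: str) -> float:
    # Whitespace splitting stands in for a real word segmenter.
    return _shared_ratio(normalize(a).split(), normalize(b).split())


def char_unit_match(a: str, b: str, threshold: float = 0.8) -> bool:
    return char_ratio(a, b) >= threshold


def word_unit_match(a: str, b: str, threshold: float = 0.8) -> bool:
    return word_ratio(a, b) >= threshold
```

Because both strings pass through the same `normalize` function, even the crude digit conversion affects the two sides equally; a production system would likely need a proper numeral parser and a morphological analyzer for Japanese text.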
[0022]
The name identification means 13 sets the name and address attribute values of each name identification data item by, for example, selecting just one of the name and address attribute values of the data items it contains and either using that value as is or converting it to a normalized value. In addition, the business type name of each data item is converted into the corresponding business type name in the business type classification unique to the database generation apparatus 10.
[0023]
FIG. 5 shows an example of the new database 18 of FIG. 4 updated in this way. For example, the business type classification unique to the database generation apparatus 10 includes business types such as “Japanese food” and “Chinese”, and the business type names of the data in FIG. 4 are all converted to “Chinese”. In FIG. 5, the business type names of the first and fourth data items, which were classified into the same group, are both converted to “Chinese”, so the business type name of the name identification data is also “Chinese”. The same applies to the third and sixth data items. As for the name and address attribute values of the name identification data, for the first and fourth data items the name “Koran-tei” and the address “Shinjuku-ku Kagurazaka 1-2-3” are selected. Similarly, for the third and sixth data items, the name “Otaketei” and the address “Shinjuku-ku Kagurazaka 3-8-6” are selected. Note that the “update information” column of the new database 18 in FIG. 5 is rewritten by the update information adding means 15 described later and is left entirely empty (NULL) here.
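Grouping matched records and building one name identification entry per group could look like the sketch below. It is an illustration under assumptions: records are plain dictionaries, grouping is done with a union-find structure (the patent does not name an algorithm), the representative name and address are simply taken from the first member of each group, and `BUSINESS_TYPE_MAP`, `group_records`, and `merge_group` are hypothetical names with illustrative contents.

```python
from typing import Callable

# Hypothetical mapping from each server's business type names to the
# generator's own classification; the entries are illustrative only.
BUSINESS_TYPE_MAP = {"中華料理": "中華", "中国料理": "中華", "日本料理": "和食"}


def group_records(
    records: list[dict], same: Callable[[dict, dict], bool]
) -> list[list[int]]:
    """Group record indices so that matching records share a group."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of matching records.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if same(records[i], records[j]):
                parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def merge_group(records: list[dict], group: list[int]) -> dict:
    """Build one name identification entry from a group of records."""
    first = records[group[0]]  # simplest choice: use the first member's values
    business = first["business_type"]
    return {
        "business_type": BUSINESS_TYPE_MAP.get(business, business),
        "name": first["name"],
        "address": first["address"],
        "members": [records[i] for i in group],
        "update_info": None,  # filled in later by the update information step
    }
```

Union-find makes the grouping transitive: if record 1 matches record 2 and record 2 matches record 3, all three end up in one group even when records 1 and 3 do not match directly. Whether that behavior is wanted is a design choice the text leaves open.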
[0024]
Here, the matching rules, that is, which attribute values are compared with which matching method, may be written in the program that implements the name identification means 13, or they may be written in an external table in the database generation apparatus 10 that the program refers to, so that the administrator of the database generation apparatus 10 can freely change this external table.
[0025]
FIG. 6 is an example of the contents of such an external table. FIG. 6(a) describes the criteria for judging that data match. In this example, name and address are specified as the items to compare. Two data items are judged to match when the evaluation value of the name comparison is 90 points or more and the evaluation value of the address comparison is 80 points or more, or when the evaluation value of the name comparison is 80 points or more and the evaluation value of the address comparison is 90 points or more. FIG. 6(b) describes the name matching methods. Here, exact match, character-unit match, and word-unit match are specified, and comparison is performed by each method. Exact match gives an evaluation value of 100 if the strings match and 0 if they do not. The evaluation value of character-unit match is the proportion of matched characters multiplied by 100. The evaluation value of word-unit match is likewise the proportion of matched words multiplied by 100. The highest evaluation value returned by any of the methods is taken as the evaluation value of the name. FIG. 6(c) describes the address matching methods in the same way. Here, exact match and word-unit match are specified. The highest evaluation value returned by either method is taken as the evaluation value of the address.
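The rule table of FIG. 6 can be expressed as plain data that a program evaluates, which is what makes it editable by an administrator. The sketch below follows the thresholds and method lists given in the text; the exact way a proportion is computed (shared items over the longer side, whitespace word splitting) is an assumption, and the names `MATCH_CRITERIA`, `METHODS`, `attribute_score`, and `is_same` are hypothetical.

```python
from collections import Counter


def _shared_score(items_a: list[str], items_b: list[str]) -> float:
    """Assumed scoring: shared items over the longer side, scaled to 0-100."""
    longer = max(len(items_a), len(items_b))
    if longer == 0:
        return 0.0
    shared = sum((Counter(items_a) & Counter(items_b)).values())
    return 100.0 * shared / longer


def exact_score(a: str, b: str) -> float:
    return 100.0 if a == b else 0.0


def char_score(a: str, b: str) -> float:
    return _shared_score(list(a), list(b))


def word_score(a: str, b: str) -> float:
    return _shared_score(a.split(), b.split())


# Matching methods per attribute, as in FIG. 6(b) and FIG. 6(c).
METHODS = {
    "name": [exact_score, char_score, word_score],
    "address": [exact_score, word_score],
}

# Match criteria as in FIG. 6(a): satisfying any one row means a match.
MATCH_CRITERIA = [
    {"name": 90, "address": 80},
    {"name": 80, "address": 90},
]


def attribute_score(attribute: str, a: str, b: str) -> float:
    """Evaluation value of an attribute: the best score over its methods."""
    return max(method(a, b) for method in METHODS[attribute])


def is_same(rec_a: dict, rec_b: dict, criteria: list[dict] = MATCH_CRITERIA) -> bool:
    needed = {attr for row in criteria for attr in row}
    scores = {attr: attribute_score(attr, rec_a[attr], rec_b[attr]) for attr in needed}
    return any(
        all(scores[attr] >= minimum for attr, minimum in row.items())
        for row in criteria
    )
```

Passing a different `criteria` list, for example `[{"name": 80}]`, changes which records count as the same without touching the comparison code.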
[0026]
Next, the combining means 14 compares the new database 18 after name identification, held in the database storage unit 16, with the old database 17 that was generated from the data previously collected from the information servers 20. Name identification data whose attribute values, such as name and address, can be regarded as the same are determined to be the same and are linked to each other (step 114). For example, the data determined to be the same in the old and new databases 17 and 18 are associated by adding information indicating that they are the same.
[0027]
Here, it is assumed that the link information of the same data can change over time in the information server 20. To derive the update information of each data item, the update dates and times of the new data and the old data must be compared, and for that purpose it is necessary to determine which data in the old and new databases are the same. If the link information never changed, sameness could be decided by checking whether the link information is identical; under the assumption that it can change, sameness must instead be decided from attribute values such as the name and address. Even for the same data, the name may change slightly over time, so the collation takes variations in notation into account. Specifically, a collation method such as character unit matching or word unit matching is used in addition to complete matching, basically as in name identification. Further, if the only collation item is the name, for example, the old and new data will match even when the address of the same store has changed. In this way, the conditions under which old and new data are regarded as the same can be adjusted by changing the collation rule. FIG. 7 shows an example of the data matching criteria in a collation rule described in the external table; here only the name is specified as the collation item. The description of the name verification method may be the same as in FIG.
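The effect of choosing the collation items can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the function names, the 80-point threshold, the character-overlap scoring, and the sample store records are assumptions for illustration and are not taken from FIG. 7.

```python
from collections import Counter


def char_score(a, b):
    # Assumed character unit score: shared characters / longer length * 100.
    if not a or not b:
        return 0.0
    matched = sum((Counter(a) & Counter(b)).values())
    return 100.0 * matched / max(len(a), len(b))


def link_old_and_new(new_db, old_db, items=("name",), threshold=80.0):
    # Map each new-database index to the first old-database index whose
    # collation items all reach the (assumed) threshold.
    links = {}
    for i, new in enumerate(new_db):
        for j, old in enumerate(old_db):
            if all(char_score(new[k], old[k]) >= threshold for k in items):
                links[i] = j
                break
    return links


# Hypothetical store that kept its name but moved to a new address.
old_db = [{"name": "Kagurazaka Hanten", "address": "Shinjuku-ku Kagurazaka 1"}]
new_db = [{"name": "Kagurazaka Hanten", "address": "Chiyoda-ku Iidabashi 2"}]

by_name = link_old_and_new(new_db, old_db, items=("name",))
by_both = link_old_and_new(new_db, old_db, items=("name", "address"))
```

With the name as the only collation item the moved store is still linked to its old entry (`by_name` contains one link), whereas requiring the address as well leaves it unlinked (`by_both` is empty).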
[0028]
FIG. 8 is an example of the old database 17. For convenience, it is assumed in FIG. 8 that each data item had not been updated since the previous time. The combining means 14 identifies, for each name identification data in the new database 18 shown in FIG. 5, the same name identification data in the old database 17 by comparing the attribute values of the name only, or of the name and the address. As a result, the first, second, and third name identification data in the new database 18 in FIG. 5 are linked to the first, second, and third name identification data in the old database 17 in FIG. 8, respectively. No name identification data linked to the fourth name identification data in the new database 18 in FIG. 5 exists in the old database 17. Note that, within linked name identification data, data items from the same information server are also linked to each other as the same data. Hereinafter, each data item in FIG. 5 and FIG. 8 is referred to by its number from the top.
[0029]
Next, the update information giving means 15 compares the link information and update date and time of each data item in the new database 18 with the link information and update date and time of the data in the old database 17 that the combining means 14 determined to be the same, and sets and assigns update information to the corresponding data in the new database 18 (step 115). That is, when data linked to a data item in the new database 18 exists in the old database 17, the data is determined to have been updated if the link information or the update date and time has changed, and not to have been updated if neither has changed; the update information of the corresponding data in the new database 18 is set to “update” or “no update” accordingly. When no data linked to the data item in the new database 18 exists in the old database 17, the data is determined to have been newly created, and the update information of the corresponding data in the new database 18 is set to “new”.
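The decision made in step 115 can be sketched in Python as below. The function name, the dictionary keys `link` and `updated`, and the sample URLs and dates are assumptions for illustration; only the three outcomes and the conditions that lead to them come from the description.

```python
def update_info(new_rec, old_rec):
    # old_rec is the linked old-database item, or None when no link exists.
    if old_rec is None:
        return "new"
    changed = (new_rec["link"] != old_rec["link"]
               or new_rec["updated"] != old_rec["updated"])
    return "update" if changed else "no update"


# Hypothetical records; the URLs and dates are placeholders.
old = {"link": "http://server-a.example/shop1.html", "updated": "2001-10-01"}
same = {"link": "http://server-a.example/shop1.html", "updated": "2001-10-01"}
redated = {"link": "http://server-a.example/shop1.html", "updated": "2001-11-15"}
moved = {"link": "http://server-a.example/new/shop1.html", "updated": "2001-10-01"}

results = [update_info(r, old) for r in (same, redated, moved)]
unlinked = update_info(same, None)
```

Note that a change in either field alone is enough to produce “update”, which mirrors the cases of changed update date and time and changed link information described in the text.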
[0030]
For example, the first data in the new database 18 in FIG. 5 has the same link information as the linked first data in the old database 17 in FIG. 8, but its update date and time has changed, so it is determined to have been updated.
[0031]
The second data in the new database 18 in FIG. 5 has the same link information and update date and time as the linked second data in the old database 17 in FIG. 8, so it is determined not to have been updated. The same applies to the fourth data in the new database 18.
[0032]
The third data in the new database 18 in FIG. 5 has the same update date and time as the linked third data in the old database 17 in FIG. 8, but its link information has changed, so it is determined to have been updated.
[0033]
The fifth data in the new database 18 in FIG. 5 is linked, as name identification data, to the third name identification data in the old database 17 in FIG. 8, but no data linked to it as an individual data item exists in the old database 17, so it is determined to have been newly created.
[0034]
Since the data linked to the sixth data in the new database 18 in FIG. 5 is not in the old database 17 in FIG. 8, it is determined that the data has been newly created.
[0035]
In this way, in the case of the new database 18 in FIG. 5 and the old database 17 in FIG. 8, the new database 18 to which update information is given as shown in FIG. 9 is finally generated. The update information giving means 15 overwrites the old database 17 with the finally generated new database 18.
[0036]
This completes the database generation. When the finally generated database is accessed from the user terminal 30 and data matching the user's request is retrieved and displayed, the business type, name, and address of the name identification data are shown together with the link information to the file on the information server 20 where the data resides and the update information. FIG. 10 is a display example of the search result when a store whose business type is “Chinese” and whose address is “Shinjuku-ku Kagurazaka” is searched for using the generated database. By clicking the link information on the screen, the user can access the detailed information of the store, which is the content of the linked file.
[0037]
A typical embodiment of the present invention has been described above. However, the old database before name identification may be retained, and the linking by the combining means 14 may be executed between the old and new databases before name identification (FIG. 4 and the old database corresponding to FIG. 4) instead of between the old and new databases after name identification (FIGS. 5 and 8). In this case, for example, data items from the same information server are collated with each other.
[0038]
Even if the link information of the same data can change over time in the information server 20, the following processing is possible when each data item contains ID information that never changes. The attribute information extraction means 12 extracts this ID information, and the combining means 14 performs the linking by associating each data item in the new database with the data in the old database that has the same ID information.
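When such an invariant ID is available, the linking reduces to an exact key lookup, as in the following Python sketch. The function name, the `id` key, and the sample records are assumptions for illustration.

```python
def link_by_id(new_db, old_db):
    # Index the old database by its invariant ID, then look up each new item.
    old_index = {rec["id"]: j for j, rec in enumerate(old_db)}
    return {i: old_index[rec["id"]]
            for i, rec in enumerate(new_db)
            if rec["id"] in old_index}


# Hypothetical records: the link changed, but the ID did not.
old_db = [{"id": "A-001", "link": "http://server-a.example/shop1.html"}]
new_db = [
    {"id": "A-001", "link": "http://server-a.example/new/shop1.html"},
    {"id": "A-002", "link": "http://server-a.example/shop2.html"},
]

links = link_by_id(new_db, old_db)
```

Here the first new item is linked to the old item despite its changed link information, and the second new item has no counterpart, so a later step would treat it as newly created.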
[0039]
Further, if the link information of the same data never changes in the information server 20, the linking by the combining means 14 is unnecessary. In the update information giving means 15, if the update date and time of a data item in the generated database (which may be the database before or after name identification) is later than the time at which the data sets were previously collected, the data is found to be either updated data or new data. Further, if data having the same information server and the same link information as that data is found in the previously generated database, the data is determined to be updated data; otherwise it is determined to be new data.
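This variation can be sketched in Python under stated assumptions. The function name, the record keys, and the sample values are hypothetical, and treating data whose update date and time is not later than the previous collection time as “no update” is an inference made for this sketch rather than something the description states.

```python
from datetime import datetime


def classify(rec, last_collected, old_keys):
    # old_keys: set of (server, link) pairs present in the previous database.
    if rec["updated"] <= last_collected:
        return "no update"  # assumption: unchanged since the last collection
    return "update" if (rec["server"], rec["link"]) in old_keys else "new"


last_collected = datetime(2001, 11, 1)
old_keys = {("server-a", "/shop1.html")}

# Hypothetical records collected in the current run.
unchanged = {"server": "server-a", "link": "/shop1.html",
             "updated": datetime(2001, 10, 1)}
revised = {"server": "server-a", "link": "/shop1.html",
           "updated": datetime(2001, 11, 15)}
created = {"server": "server-a", "link": "/shop9.html",
           "updated": datetime(2001, 11, 20)}

labels = [classify(r, last_collected, old_keys)
          for r in (unchanged, revised, created)]
```

No attribute collation is needed in this case because the pair of information server and link information serves as a stable key.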
[0040]
In addition to the above, the present invention can be modified and extended in various ways within the scope of the claims. For example, a configuration is also conceivable in which the name identification means and the name identification process are omitted and only the update information of each data item is given.
[0041]
Needless to say, the processing functions of some or all of the components of the apparatus shown in FIG. 1 can be implemented as a computer program and the program executed on a computer to realize the present invention, or the processing procedure shown in FIG. 5 can be implemented as a computer program and executed by a computer. In addition, a program for realizing the processing functions on a computer, or a program for causing a computer to execute the processing procedure, can be recorded on a computer-readable recording medium such as an FD, an MO, a ROM, a memory card, a CD, a DVD, or a removable disk, and can be stored and provided in that form; the program can also be distributed through a network such as the Internet.
[0042]
[Effects of the Invention]
As described above, according to the present invention, when data matching a user's request is retrieved from the generated database, the search results can be displayed without duplicate data and with update information attached to each data item.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a database generation apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a database generation method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of data downloaded from an information server.
FIG. 4 is a diagram showing an example of a new database generated by attribute information extraction means.
FIG. 5 is a diagram showing an example of a new database updated by name identification means.
FIG. 6 is a diagram illustrating an example of a collation rule applied by a name identification unit.
FIG. 7 is a diagram illustrating an example of a collation rule applied by a combining unit.
FIG. 8 is a diagram showing an example of an old database generated last time.
FIG. 9 is a diagram showing an example of a new database to which update information is added by update information adding means.
FIG. 10 is a diagram showing an example of a search result display screen from a database generated according to the present invention.
FIG. 11 is a diagram showing an example of a search result display screen from a database generated by a conventional database generation technique.
[Explanation of symbols]
10 Database generation device
11 Data collection means
12 Attribute information extraction means
13 Name identification means
14 Combining means
15 Update information giving means
16 Database storage unit
17 Old database
18 New database
20 Information server
30 User terminal
40 Network

Claims

A database generation device that collects data from a plurality of points and generates a database, comprising:
storage means for storing a database generated in the past (hereinafter referred to as the old database);
data collection means for collecting, from the plurality of points, data including values of attributes such as name and address, an identification ID of the data, and update date and time information;
attribute information extraction means for extracting attribute values from each collected data item and generating a database (hereinafter referred to as the new database) in which each data item consists of at least the extracted attribute values, the identification ID, and the update date and time;
name identification means for classifying, into the same group, sets of data in the generated new database whose attribute values can be regarded as the same;
combining means for determining that data whose attribute values can be regarded as the same between the new database and the old database are the same, and associating the data between the two databases; and
update information giving means for giving update information to the corresponding data in the new database by comparing information such as the identification ID and update date and time of data in the new database with information such as the identification ID and update date and time of the data in the old database associated with that data.

The database generation device according to claim 1, wherein the update information giving means gives the corresponding data in the new database update information indicating “update” when the identification ID or update date and time information of the data in the new database does not match that of the corresponding data in the old database, and update information indicating “no update” when they match, and gives the corresponding data in the new database update information indicating “new” when no data associated with the data in the new database exists in the old database.

A database generation processing method in a database generation device that comprises a storage device storing a database generated in the past, collects data from a plurality of points, automatically generates a new database, and stores it in the storage device, wherein
the database generation device executes:
a data collection process of collecting, from the plurality of points, data including values of attributes such as name and address, an identification ID of the data, and update date and time information;
an attribute information extraction process of extracting attribute values from each collected data item, generating a database (hereinafter referred to as the new database) in which each data item consists of at least the extracted attribute values, the identification ID, and the update date and time, and storing it in the storage device;
a name identification process of classifying, into the same group, sets of data in the new database stored in the storage device whose attribute values can be regarded as the same;
a combining process of determining that data whose attribute values can be regarded as the same between the new database stored in the storage device and the old database generated in the past and held in the storage device are the same, and associating the data between the two databases; and
an update information giving process of giving update information to the corresponding data in the new database stored in the storage device by comparing information such as the identification ID and update date and time of data in the new database with information such as the identification ID and update date and time of the data in the old database associated with that data.

The database generation processing method according to claim 3, wherein
the database generation device omits at least one of the name identification process and the combining process, and, in the update information giving process, at least when the combining process is omitted, recognizes the correspondence between data in the new database and data in the old database on the basis of invariant information in the data.

A database generation program that causes a computer to execute processing for collecting data from a plurality of points and generating a database, the program causing the computer to execute:
a data collection process of collecting, from the plurality of points, data including values of attributes such as name and address, an identification ID of the data, and update date and time information;
an attribute information extraction process of extracting attribute values from each collected data item and generating a database (hereinafter referred to as the new database) in which each data item consists of at least the extracted attribute values, the identification ID, and the update date and time;
a name identification process of classifying, into the same group, sets of data in the generated new database whose attribute values can be regarded as the same;
a combining process of determining that data whose attribute values can be regarded as the same between the new database and an old database generated in the past are the same, and associating the data between the two databases; and
an update information giving process of giving update information to the corresponding data in the new database by comparing information such as the identification ID and update date and time of data in the new database with information such as the identification ID and update date and time of the data in the old database associated with that data.