JP2018152015A - Storage controller, storage control program and storage control method

Info

Publication number: JP2018152015A
Application number: JP2017049658A
Authority: JP
Inventors: Satoru Tanaka (田中 哲)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-03-15
Filing date: 2017-03-15
Publication date: 2018-09-27

Abstract

PROBLEM TO BE SOLVED: To reduce the time required for data crawling. SOLUTION: A storage control device according to an embodiment, which stores data acquired from web pages, includes a storage unit and a control unit. The storage unit stores, in association with first data acquired from a web page, tag structure information indicating the position within that web page from which the first data was acquired. The control unit acquires second data from the web page based on the tag structure information stored in the storage unit, and stores the acquired second data in the storage unit in association with the tag structure information. SELECTED DRAWING: Figure 1

Description

Embodiments of the present invention relate to a storage control device, a storage control program, and a storage control method.

Crawler tools are known as tools for collecting information published on the Internet. A crawler tool crawls the home pages published on Web sites on the Internet (hereinafter referred to as Web sites) and saves their contents in units of URLs (Uniform Resource Locators), that is, page by page. In doing so, the crawler tool analyzes the documents of each Web site and saves the data it obtains. For this document analysis, a method of determining related information based on the document structure and a method of analyzing the logical structure of a document have been proposed.

Japanese Laid-open Patent Publication No. 2007-164788; Japanese Laid-open Patent Publication No. 11-242676

However, the conventional techniques described above analyze the documents of every Web site each time data is crawled, so the processing takes a long time.

In one aspect, an object is to provide a storage control device, a storage control program, and a storage control method that can reduce the time required for crawling data.

In a first aspect, a storage control device that stores data acquired from web pages includes a storage unit and a control unit. The storage unit stores, in association with first data acquired from a web page, tag structure information indicating the position within the source web page from which the first data was acquired. The control unit newly acquires second data from the web page based on the tag structure information stored in the storage unit, and stores the newly acquired second data in the storage unit in association with the tag structure information.

According to one embodiment of the present invention, the time required for crawling data can be reduced.

FIG. 1 is a block diagram illustrating an example of the configuration of the storage control device according to the embodiment.
FIG. 2 is a diagram illustrating an example of the target storage unit.
FIG. 3 is a diagram illustrating an example of the item storage unit.
FIG. 4 is a diagram illustrating an example of the page storage unit.
FIG. 5 is a diagram illustrating an example of the extracted data storage unit.
FIG. 6 is a flowchart illustrating an example of the operation of the storage control device according to the embodiment.
FIG. 7 is a diagram illustrating an example of a computer that executes the storage control program.

Hereinafter, a storage control device, a storage control program, and a storage control method according to embodiments will be described with reference to the drawings. In the embodiments, components having the same function are denoted by the same reference numerals, and redundant description is omitted. The storage control device, storage control program, and storage control method described in the following embodiments are merely examples and do not limit the embodiments. The following embodiments may also be combined as appropriate to the extent that they do not contradict one another.

FIG. 1 is a block diagram illustrating an example of the configuration of the storage control device according to the embodiment. The data acquisition device 100 illustrated in FIG. 1 is connected to the Internet via, for example, a network N, and crawls Web sites 300 on the Internet designated by an administrator. The data acquisition device 100 acquires the web pages published on a Web site 300 and accumulates them in a database. That is, the data acquisition device 100 is an example of the storage control device.

For example, to acquire tourist information for a certain region, the data acquisition device 100 crawls the sites of tourist spots and the tourist information sites set up by prefectures, and acquires data such as the address, telephone number, and description of each tourist spot.

In doing so, the data acquisition device 100 stores, in association with the data acquired from a web page (such as the address, telephone number, and description of each tourist spot), tag structure information indicating the position within the source web page from which the data was acquired. The tag structure information indicates the position, in the hierarchical structure of tags, of the tags related to the data acquired from the web page (details are described later).

Examples of documents that contain tag structure information include documents written in a markup language, such as HTML (HyperText Markup Language) documents and XML (Extensible Markup Language) documents. The following description takes, as an example, the case of crawling web pages that use HTML documents.

On the first crawl of a web page, or when the layout of the web page itself has been updated, the data acquisition device 100 stores tag structure information in association with the data acquired from the web page. On the next crawl of the web page, the data acquisition device 100 uses the stored tag structure information to newly acquire data from the position in the web page indicated by that information, and stores the newly acquired data in association with the tag structure information.

As a result, when the data acquisition device 100 crawls a web page again after the first crawl, or after the crawl that followed a layout update of the web page itself, it can acquire the data without analyzing the documents of each Web site, so the time required for crawling data can be reduced.
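The reuse of stored position information described above can be sketched as follows. This is a minimal illustration only: the in-memory dictionary `tag_structure_store` and the callables `extract_at` and `analyze` are hypothetical names introduced here, not elements named in the publication.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory stand-in for the stored tag structure information,
# keyed by target URL.
tag_structure_store: Dict[str, str] = {}


def crawl_page(
    url: str,
    html: str,
    extract_at: Callable[[str, str], Optional[str]],
    analyze: Callable[[str], Tuple[str, str]],
) -> Tuple[str, bool]:
    """Return (data, used_full_analysis) for one visit to `url`."""
    position = tag_structure_store.get(url)
    if position is not None:
        data = extract_at(html, position)
        if data is not None:
            # Fast path: the stored position still yields data.
            return data, False
    # First visit, or the layout changed: analyze the whole document and
    # remember where the data was found for the next visit.
    data, position = analyze(html)
    tag_structure_store[url] = position
    return data, True
```

Under these assumptions, only the first visit (or a visit after a layout change) pays for the full analysis; later visits go straight to the remembered position.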

Next, the configuration of the data acquisition device 100 will be described. As illustrated in FIG. 1, the data acquisition device 100 includes an input unit 101, an output unit 102, a communication unit 110, a storage unit 120, and a control unit 130. In addition to the functional units illustrated in FIG. 1, the data acquisition device 100 may include various functional units found in known computers.

The input unit 101 is, for example, an input device such as a keyboard or a mouse, and receives input of various kinds of information from the administrator of the data acquisition device 100. For example, the administrator of the data acquisition device 100 enters the URLs of the sites to be crawled, the data items to be acquired, and so on through the input unit 101, which outputs the input to the control unit 130. The input unit 101 may also be a reader/writer for an SD (Secure Digital) memory card or the like. In that case, the input unit 101 outputs to the control unit 130 the URLs of the sites to be crawled, the data items to be acquired, and so on that it reads from the SD memory card. The input unit 101 may include both an input device and a reader/writer for an SD memory card or the like.

The output unit 102 is, for example, a display device for displaying various kinds of information, and is implemented by, for example, a liquid crystal display. The output unit 102 may also be a reader/writer for an SD memory card or the like. When output data is input from the control unit 130, the output unit 102 displays the output data or writes it to the memory card. The input unit 101 and the output unit 102 may be integrated, for example as a single device that provides both functions, such as a reader/writer for an SD memory card. The output unit 102 may also include both a display device and an SD card reader/writer.

The communication unit 110 is implemented by, for example, a NIC (Network Interface Card). The communication unit 110 is a communication interface that is connected to, for example, the Internet by wire or wirelessly via the network N and handles communication of information with the server of a Web site 300 on the Internet. The communication unit 110 receives the contents of web pages, such as HTML documents and image files, from the Web site 300 on the Internet, and outputs the received web page contents to the control unit 130. The communication unit 110 also transmits page requests and the like input from the control unit 130 to the Web site 300 on the Internet.

The storage unit 120 is implemented by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 120 includes a target storage unit 121, an item storage unit 122, a page storage unit 123, an extracted data storage unit 124, and a notification destination storage unit 125. The storage unit 120 also stores information used for processing in the control unit 130.

The target storage unit 121 stores the URL of a Web site 300 that is the target of the crawl process for acquiring data (hereinafter referred to as the target URL) in association with the tag structure information of the portion of the HTML document from which data was extracted. That is, the target storage unit 121 stores, in association with the data acquired from a web page, tag structure information indicating the position within the source web page from which the data was acquired.

FIG. 2 is a diagram illustrating an example of the target storage unit 121. As illustrated in FIG. 2, the target storage unit 121 has the items "URLID", "target URL", and "tag structure information". The "tag structure information" has one entry per source data item, such as "title" and "address". Although not illustrated, the tag structure information also has entries for other data items such as telephone number, update date, position information, and description. The target storage unit 121 stores, for example, one record per target URL.

"URLID" identifies the target URL. "Target URL" is the URL of the HTML document to be accessed in the crawl process. The target URL is entered, for example, by the administrator using the input device of the input unit 101. "Tag structure information" is the information that specifies the position, within the HTML document of the target URL, of the portion from which data was extracted. "Title" indicates the position of the title in the tag hierarchy of the target HTML document, using a combination of one or more of the tag name, the order of the tag within the document, and the tag hierarchy. "Address" indicates the position of the address in the tag hierarchy of the target HTML document in the same way, using a combination of one or more of the tag name, the order of the tag within the document, and the tag hierarchy.

The example in the first row of FIG. 2 shows the tag structure information for "title" and "address" in the HTML document at the target URL "http://aaaa.bbb.ccc/ddd/eee/001.html", whose URLID is "1". The tag structure information for "title" is expressed as, for example, "<DIV class="title"> </DIV>, order: 1, /title/". Here, "<DIV class="title"> </DIV>" is the name of the tag indicating the title, extracted using, for example, a CSS (Cascading Style Sheets) selector. "order: 1" indicates the first of the tags indicating a title in the HTML document. "/title/" indicates the tag hierarchy of the tag indicating the title of the HTML document. The data extracted from the HTML document as the title is the portion enclosed by the DIV tags.

Similarly, the tag structure information for "address" is expressed as, for example, "<DIV class="address"> </DIV>, order: 1, /info/address/". Here, "<DIV class="address"> </DIV>" is the name of the tag indicating the address, extracted using, for example, a CSS selector. "order: 1" indicates the first of the tags indicating an address in the HTML document. "/info/address/" indicates the tag hierarchy of the tag indicating the address in the HTML document. The data extracted from the HTML document as the address is the portion enclosed by the DIV tags. The tag structure information of the portion from which data was extracted may be specified using one or more of the tag name, the tag order, and the tag hierarchy.
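One way to hold such an entry in a program is to split it into its three parts. The sketch below assumes the comma-separated layout shown in the examples (tag name, then "order: N", then the hierarchy path); the `TagStructure` class and `parse_tag_structure` function are hypothetical names introduced for illustration, and the publication does not prescribe a serialization format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagStructure:
    tag: str    # tag name part, e.g. '<DIV class="address"> </DIV>'
    order: int  # 1-based order among the matching tags in the document
    path: str   # tag hierarchy, e.g. '/info/address/'


def parse_tag_structure(text: str) -> TagStructure:
    """Split 'TAG, order: N, /path/' into its three parts."""
    # Split from the right so that commas inside the tag part survive.
    tag, order_part, path = text.rsplit(",", 2)
    order = int(order_part.split(":", 1)[1])
    return TagStructure(tag.strip(), order, path.strip())
```

For example, `parse_tag_structure('<DIV class="address"> </DIV>, order: 1, /info/address/')` yields a record whose `order` is 1 and whose `path` is `/info/address/`.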

The tag name may also be expressed using regular expressions. In the example in the second row of FIG. 2, the name of the tag indicating the address is expressed as "/<DIV.*>(.+)</DIV>/ /住所：(.+)$/", where "住所：" means "Address:". With these regular expressions, the portion enclosed by DIV tags, or the portion that follows "住所：", is the data extracted as the address. Furthermore, the position-specifying information for the extraction target portion may combine a CSS selector with a regular expression.
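The two regular expressions in this example can be tried out with a short sketch. It assumes the patterns are applied in the order written and that the first match wins; that ordering, and the function name `extract_address`, are illustrative assumptions rather than details stated in the publication.

```python
import re
from typing import Optional

# The two patterns from the example, without the surrounding "/" delimiters.
DIV_PATTERN = re.compile(r"<DIV.*>(.+)</DIV>")
LABEL_PATTERN = re.compile(r"住所：(.+)$", re.MULTILINE)  # "住所：" = "Address:"


def extract_address(text: str) -> Optional[str]:
    """Return the text inside DIV tags, or the text following '住所：'."""
    for pattern in (DIV_PATTERN, LABEL_PATTERN):
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    return None
```

Note that `<DIV.*>` is greedy and case-sensitive as written, so this sketch is only suitable for small, single-element fragments like those in the example.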

As in the example in the third row of FIG. 2, the tag structure information of the portion from which data was extracted may also be expressed using a cutout method. In this case, the position-specifying information for the title is expressed, for example, using a CSS selector as "div#left h2, order: 3, /tps/table/". The position-specifying information for the address is expressed, for example, using a CSS selector and a regular expression as "#infoContent @<h3>所在地</h3>\s+?<p>(.+?)</p>@is, order: 5, /info/address/", where "所在地" means "Location".

Returning to FIG. 1, the item storage unit 122 stores the definitions of the data items to be extracted from the page contents of the target URL. FIG. 3 is a diagram illustrating an example of the item storage unit 122. As illustrated in FIG. 3, the item storage unit 122 has the items "item ID", "data name", "data type", and "cutout method". The item storage unit 122 stores, for example, one record per data name.

"Item ID" identifies a data item, that is, a data name. "Data name" is the name of the data to be extracted when a Web site 300 is crawled. Examples of data names include title, address, telephone number, update date, position information, and description. "Data type" is the type used for the extracted data when it is stored in the extracted data storage unit 124. Examples of data types include text, number, date, and latitude/longitude. "Cutout method" is the method used to cut out, that is, extract, the data from the page contents of the target URL. Examples of cutout methods include CSS selectors and regular expressions.
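A data item definition of this kind, and the type conversion applied before storage, might be modeled as below. The field names, the type labels ("text", "number", "date", "latlon"), and the assumed input formats (ISO-style dates, "lat,lon" pairs) are all hypothetical choices made for this sketch; the publication does not specify them.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Tuple, Union


@dataclass(frozen=True)
class ItemDefinition:
    item_id: int
    data_name: str       # e.g. "title", "address", "update date"
    data_type: str       # assumed labels: "text", "number", "date", "latlon"
    cutout_method: str   # e.g. "css" or "regex"


def convert(value: str, data_type: str) -> Union[str, int, date, Tuple[float, float]]:
    """Convert an extracted string to the type named in its definition."""
    if data_type == "number":
        return int(value)
    if data_type == "date":
        # Assumes dates appear as YYYY-MM-DD; real pages vary.
        return datetime.strptime(value, "%Y-%m-%d").date()
    if data_type == "latlon":
        lat, lon = (float(part) for part in value.split(","))
        return (lat, lon)
    return value  # "text" and anything unrecognized stay as strings
```

Keeping the conversion separate from the extraction means a changed page layout affects only where the string is found, not how it is stored.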

Returning to FIG. 1, the page storage unit 123 stores, for each target URL, the page contents acquired by accessing it in the crawl process, that is, HTML documents, image files, and the like. FIG. 4 is a diagram illustrating an example of the page storage unit 123. As illustrated in FIG. 4, the page storage unit 123 has the items "URLID", "target URL", and "storage area". The page storage unit 123 stores, for example, one record per target URL.

"URLID" identifies the target URL. "Target URL" is the URL of the HTML document accessed in the crawl process. "Storage area" indicates the storage area in which the acquired HTML document, image files, and the like are stored. The storage area field holds, for example, a directory in the file system of the storage unit 120, and the HTML document, image files, and the like are stored in that directory. Alternatively, the page storage unit 123 may store the acquired HTML document and image files directly in the storage area field.

Returning to FIG. 1, the extracted data storage unit 124 stores the data of the extraction target portions extracted from HTML documents. That is, the extracted data storage unit 124 is a database that stores the data collected by the crawl process. FIG. 5 is a diagram illustrating an example of the extracted data storage unit 124. As illustrated in FIG. 5, the extracted data storage unit 124 has the items "URLID", "title", "address", "telephone number", "update date", "position information", and "description". The extracted data storage unit 124 stores, for example, one record per URLID.

"URLID" identifies the target URL. "Title" is one of the data items extracted from the HTML document of the target URL and is the title of that HTML document. "Address" is one of the data items extracted from the HTML document of the target URL and is the address given in that HTML document. "Telephone number" is one of the data items extracted from the HTML document of the target URL and is the telephone number given in that HTML document. "Update date" is one of the data items extracted from the HTML document of the target URL and is the update date given in that HTML document. "Position information" is a latitude and longitude. The latitude and longitude are obtained, for example, by using an external API (Application Programming Interface) service based on the address extracted from the HTML document of the target URL. If the HTML document itself gives a latitude and longitude, that value may be used as the position information. "Description" is one of the data items extracted from the HTML document of the target URL; for example, if the HTML document of the target URL is about a tourist spot, it is the description of the tourist spot in the document. If the HTML document does not give an address, the address may instead be one obtained from an external API service using, for example, the tourist spot name given in the title.

Returning to FIG. 1, the notification destination storage unit 125 stores information on notification destinations, such as the administrator's e-mail address. Based on the notification destination storage unit 125, the control unit 130 sends notifications, such as the crawl results for a Web site 300, to a predetermined destination such as the administrator.

The control unit 130 is implemented, for example, by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing a program stored in an internal storage device, using a RAM as a work area. The control unit 130 may instead be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 130 includes a setting unit 131, a crawl unit 132, an extraction unit 133, and an output control unit 134, and implements or executes the information processing functions and operations described below. The internal configuration of the control unit 130 is not limited to the configuration illustrated in FIG. 1 and may be any other configuration that performs the information processing described below.

The setting unit 131 configures settings such as the target URLs of the Web sites 300 to be crawled and the definitions of the data items to be extracted by the crawl. For example, the setting unit 131 accepts the target URL of a Web site 300 to be crawled when the administrator enters it through the input unit 101. The setting unit 131 assigns a URLID to the accepted target URL and registers it in the target storage unit 121.

The setting unit 131 also accepts, for example through the administrator's operation of the input unit 101, the data name, data type, and cutout method for each extraction target portion. The setting unit 131 associates the accepted data name, data type, and cutout method with one another to generate a data item definition, and stores the generated definition in the item storage unit 122. In other words, the setting unit 131 registers the definitions of the data items to be extracted by the crawl in the item storage unit 122.

The crawl unit 132 refers to the target storage unit 121 and accesses the home page that contains the target URLs, for example, the top page of a Web site 300 that publishes tourist information. That is, the crawl unit 132 transmits a page request to the server of the Web site 300 via the communication unit 110 and receives the page contents from that server via the communication unit 110.

The crawl unit 132 accesses the home page containing the target URLs regularly or irregularly, that is, at an interval specified in advance by the administrator or at an arbitrary time. The specified interval can be any interval, such as one day, one week, or one month.

The crawl unit 132 refers to the target storage unit 121 and selects, from all the links in the home page, the target URLs whose page contents are to be acquired. For example, the crawl unit 132 selects the target URL of the page for each tourist spot. The crawl unit 132 acquires the page contents from each selected target URL and stores them in the page storage unit 123. The crawl unit 132 then outputs acquisition completion information, indicating that acquisition of the page contents is complete, to the extraction unit 133.

When acquisition completion information is input from the crawl unit 132, the extraction unit 133 refers to the tag structure information in the target storage unit 121 and extracts the data of each data item of the extraction target portions from the page contents of the target URL stored in the page storage unit 123. The extraction unit 133 associates the extracted data with the URLID and stores it in the extracted data storage unit 124 in accordance with the data item definitions in the item storage unit 122. After storing the extracted data in the extracted data storage unit 124, the extraction unit 133 outputs extraction completion information to the output control unit 134.

For example, to extract the "title" data from the HTML document at the target URL "http://aaaa.bbb.ccc/ddd/eee/001.html", whose URLID is "1", the extraction unit 133 looks up the tag structure information for "title" in the target storage unit 121 (see FIG. 2). The extraction unit 133 thereby obtains the tag structure information "<DIV class="title"> </DIV>, order: 1, /title/", which gives the position from which the "title" data is to be extracted in the web page of the target URL. Based on this tag structure information, the extraction unit 133 extracts the data (the title) in the portion enclosed by the DIV tags of the first matching tag in the HTML document, located at "/title/" in the tag hierarchy. The extraction unit 133 then extracts, one after another and in the same way as for "title", the data for the other entries in the tag structure information, such as "address". Finally, the extraction unit 133 stores the extracted data in the extracted data storage unit 124.
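A lookup of this kind, by tag name, class, and order, can be sketched with the standard-library HTML parser. The sketch deliberately ignores the hierarchy component ("/title/") and matches on tag name, class, and document order only; the class `NthTagText` and function `extract_by_position` are hypothetical names, and the publication does not say how the position lookup is implemented.

```python
from html.parser import HTMLParser
from typing import List, Optional


class NthTagText(HTMLParser):
    """Collect the text of the n-th <tag class="cls"> element (1-based)."""

    def __init__(self, tag: str, cls: str, order: int) -> None:
        super().__init__()
        self.tag, self.cls, self.order = tag, cls, order
        self.seen = 0            # matching start tags seen so far
        self.depth = 0           # nesting depth inside the target element
        self.done = False        # True once the target element has closed
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Inside the target: track nested tags of the same name.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.seen += 1
            if self.seen == self.order:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_by_position(html: str, tag: str, cls: str, order: int) -> Optional[str]:
    """Return the text of the order-th matching element, or None if absent."""
    parser = NthTagText(tag.lower(), cls, order)  # HTMLParser lowercases tag names
    parser.feed(html)
    parser.close()
    if parser.seen < order:
        return None
    return "".join(parser.parts).strip()
```

Returning `None` when no element is found at the stored position gives the caller a simple signal that the stored tag structure information no longer matches the page.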

Note that no tag structure information is registered when a target URL stored in the target storage unit 121 is visited for the first time. In addition, when the layout of the web page at the target URL has been updated, data cannot be obtained with the previous tag structure information. Therefore, on the first visit, when tag structure information is unregistered, and after a layout update, when data cannot be obtained from the registered tag structure information, the extraction unit 133 analyzes the web page using the cut-out method in the item storage unit 122. Through this analysis, the extraction unit 133 acquires the data of the items defined in the item storage unit 122.

For example, when extracting the data of a data item in the extraction target portion, the extraction unit 133 uses the method specified as the cut-out method in the item storage unit 122. For example, when the tag hierarchy indicating the address is defined as “/info/address/”, the extraction unit 133 extracts the address by using a CSS selector written as “.address”. In this case, the extraction unit 133 can cut out, as the address, an item whose tag contains “address”.

The extraction unit 133 may also extract the address by using, for example, a regular expression whose first line is “.info”, whose second line is “/<DIV.*>(.+)</DIV>/”, and whose third line is “/住所：(.+)$/” (“住所” means “address”). In this case, the extraction unit 133 can cut out, as the address, the character string that follows the string “住所：” within the hierarchy contained in a DIV tag whose class is “info”.
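The regular-expression cut-out can be sketched in Python as follows. This is a simplified illustration, not the patterns of the embodiment verbatim: the function name `cut_out_address` is hypothetical, the “.info” narrowing and the DIV pattern are merged into a single regular expression, and the sketch assumes that the “info” DIV contains no nested DIV elements.

```python
import re


def cut_out_address(html):
    """Return the text following the "住所：" label inside <div class="info">, or None."""
    # Steps 1 and 2 (merged): narrow the search to the DIV whose class is "info".
    block = re.search(
        r'<div[^>]*\bclass="info"[^>]*>(.*?)</div>',
        html,
        re.IGNORECASE | re.DOTALL,
    )
    if block is None:
        return None
    # Step 3: take everything after the label up to the end of that line.
    match = re.search(r"住所：(.+)$", block.group(1), re.MULTILINE)
    return match.group(1).strip() if match else None
```

For pages with nested or irregular markup, an HTML parser is generally more robust than regular expressions; the regular expression is used here only because the passage above describes that cut-out method.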

Data extracted on the first visit or after a layout update is handled in the same way: the extraction unit 133 associates the extracted data with the URL ID and stores it in the extracted data storage unit 124 in accordance with the data item definitions in the item storage unit 122. In addition, the extraction unit 133 obtains tag structure information for each item of the extracted data and stores it in the target storage unit 121.

For example, when the extraction unit 133 extracts data (the title) for the “title” item with item ID “1” in the item storage unit 122 (see FIG. 3) using the “CSS selector” cut-out method, it obtains information on the tags related to the extracted data. Specifically, the extraction unit 133 obtains the order of the tag in the HTML document and “/title/”, which indicates the hierarchical structure of the tag indicating the title of the HTML document. The extraction unit 133 stores the obtained tag information in the target storage unit 121 as tag structure information.

When extraction completion information is input from the extraction unit 133, the output control unit 134 refers to the extracted data storage unit 124 and outputs the extracted data as output data to the output unit 102 for display. When the output unit 102 is a reader/writer for an SD memory card or the like, the output control unit 134 may output the extracted data as output data to the output unit 102 so that it is stored on the SD memory card or the like.

In addition, when the data acquired and extracted by a past crawl process differs from the data acquired and extracted by the current crawl process, the output control unit 134 notifies a predetermined destination, such as the administrator's e-mail address, of the change on the basis of the notification destination storage unit 125. In the notification, for example, the portions of the data acquired by the current crawl process that have changed from the past (the differences) may be shown in a different display color. This allows the administrator to easily learn of data changes on the Web site 300.

Next, an operation example of the crawl process in the data acquisition device 100, which is an example of the storage control device, will be described. FIG. 6 is a flowchart illustrating an operation example of the storage control device according to the embodiment. It is assumed that the content of the web pages at the target URLs in the target storage unit 121 has already been acquired and that the content of the crawled web pages is stored in the page storage unit 123.

As shown in FIG. 6, when the process starts, the extraction unit 133 sequentially reads the target URLs from the target storage unit 121 and designates the web page to be handled by the crawl process (S1).

Next, the extraction unit 133 determines whether this is the first visit according to whether tag structure information for the target URL is unregistered in the target storage unit 121. The extraction unit 133 also determines whether the web page itself has been updated according to whether data could be obtained from the tag structure information registered in the target storage unit 121 (S2).

If this is the first visit or the web page itself has been updated (S2: YES), the extraction unit 133 analyzes the web page using the cut-out method in the item storage unit 122 (S3) and extracts the data of the items defined in the item storage unit 122 (S4). Next, the extraction unit 133 stores the extracted data in the storage unit 120 (S5). Specifically, the extraction unit 133 associates the extracted data with the URL ID and stores it in the extracted data storage unit 124 in accordance with the data item definitions in the item storage unit 122. In addition, the extraction unit 133 obtains tag structure information for each item of the extracted data and stores it in the target storage unit 121.

If this is not the first visit and the web page itself has not been updated (S2: NO), data can be extracted from the web page by referring to the tag structure information in the target storage unit 121.

Therefore, when the determination in S2 is negative, the extraction unit 133 acquires the tag structure information of the web page (target URL) from the target storage unit 121 (S6). Next, on the basis of the tag structure information, the extraction unit 133 extracts the data of each data item in the extraction target portion from the page content of the target URL stored in the page storage unit 123 (S7).
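The branch at S2 can be summarized in a short Python sketch. The names `extract_page`, `analyze_page`, and `extract_with_structure` are hypothetical, and a plain dictionary stands in for the target storage unit 121; the sketch only shows how stored tag structure information lets the full analysis (S3 to S5) be skipped in favor of direct extraction (S6 and S7).

```python
def extract_page(url_id, page_html, tag_structures, analyze_page, extract_with_structure):
    """Return (data, analyzed), where analyzed is True if the full analysis ran.

    tag_structures: dict mapping URL ID -> stored tag structure information
    analyze_page(html) -> (data, structure): full analysis with the cut-out method
    extract_with_structure(html, structure) -> data, or None if nothing is found
    """
    structure = tag_structures.get(url_id)
    if structure is not None:
        # S6-S7: reuse the stored positions instead of analyzing the document.
        data = extract_with_structure(page_html, structure)
        if data:
            return data, False
    # S3-S5: first visit, or the stored positions no longer yield data
    # (treated here as a layout update). Analyze and remember the positions.
    data, structure = analyze_page(page_html)
    tag_structures[url_id] = structure
    return data, True
```

Treating an empty result from `extract_with_structure` as a layout update mirrors the determination in S2, which is based on whether data could be obtained from the registered tag structure information.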

Next, the extraction unit 133 associates the extracted data with the URL ID, stores it in the extracted data storage unit 124 in accordance with the data item definitions in the item storage unit 122, and thereby updates the data in the storage unit 120 (S8). Specifically, the extraction unit 133 compares the newly extracted data with the data previously stored in the extracted data storage unit 124 and, if the data has changed, stores the newly extracted data (the content of the change) in the extracted data storage unit 124.

Next, on the basis of the notification destination storage unit 125, the output control unit 134 sends a result notification to a predetermined destination such as the administrator's e-mail address (S9). Specifically, the output control unit 134 determines whether the data acquired and extracted by the past crawl process (the earlier data in the extracted data storage unit 124) differs from the data acquired and extracted by the current crawl process (the data stored in the extracted data storage unit 124 this time). If they differ, the output control unit 134 notifies the notification destination set in the notification destination storage unit 125 of the changed portion (the difference).
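The comparison in S8 and the notification content in S9 can be sketched as follows. The function names `detect_changes` and `format_notification`, the dictionary representation of extracted items, and the message layout are assumptions for illustration; actual delivery (for example by e-mail) is left out.

```python
def detect_changes(previous, current):
    """Return {item: (old, new)} for every item whose value differs between crawls."""
    changes = {}
    for item in previous.keys() | current.keys():
        old, new = previous.get(item), current.get(item)
        if old != new:
            changes[item] = (old, new)
    return changes


def format_notification(url, changes):
    """Build a plain-text summary of the changed items for the notification destination."""
    lines = [f"Changes detected at {url}:"]
    for item in sorted(changes):
        old, new = changes[item]
        lines.append(f"- {item}: {old!r} -> {new!r}")
    return "\n".join(lines)
```

An empty dictionary from `detect_changes` means the two crawls produced the same data, in which case no change notification would be needed.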

As described above, the data acquisition device 100 includes the storage unit 120 and the control unit 130. The storage unit 120 stores, in association with data acquired from a web page, tag structure information indicating the position within that web page from which the data was acquired. The control unit 130 newly acquires data from the web page on the basis of the tag structure information stored in the storage unit 120 and causes the storage unit 120 to store the newly acquired data in association with the tag structure information. Thus, for example, when the data acquisition device 100 visits a web page again after the first visit, or after a visit that followed a layout update of the web page itself, it can acquire data without analyzing the document for each website. By omitting the document analysis in this way, the data acquisition device 100 can shorten the time required to crawl data.

The components of each unit illustrated in the drawings do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the units is not limited to the illustrated one, and all or some of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the crawl unit 132, the extraction unit 133, and the output control unit 134 may be integrated into a single output control unit.

Furthermore, all or any part of the various processing functions performed by each device may be executed on a CPU (or a microcomputer such as an MPU or an MCU (Micro Controller Unit)). Needless to say, all or any part of the various processing functions may also be executed on a program analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU), or on hardware using wired logic.

The various processes described in the above embodiment can be realized by having a computer execute a program prepared in advance. An example of a computer that executes a program having the same functions as the above embodiment is therefore described below. FIG. 7 is a diagram illustrating an example of a computer that executes the storage control program.

As shown in FIG. 7, the computer 200 includes a CPU 201 that executes various arithmetic processes, an input device 202 that accepts data input, and a monitor 203. The computer 200 also includes a medium reading device 204 that reads programs and the like from a storage medium, an interface device 205 for connecting to various devices, and a communication device 206 for connecting to other information processing devices and the like by wire or wirelessly. The computer 200 further includes a RAM 207 that temporarily stores various kinds of information, and a hard disk device 208. The devices 201 to 208 are connected to a bus 209.

The hard disk device 208 stores a storage control program having the same functions as the processing units shown in FIG. 1, namely the setting unit 131, the crawl unit 132, the extraction unit 133, and the output control unit 134. The hard disk device 208 also stores various data for implementing the target storage unit 121, the item storage unit 122, the page storage unit 123, the extracted data storage unit 124, and the storage control program. The input device 202 has the same function as the input unit 101 and accepts, for example, input of various kinds of information such as target URLs, definitions, and management information from the administrator of the computer 200. The monitor 203 has the same function as the output unit 102 and displays, for example, various screens such as a management information screen, a reception screen, and a data display screen to the administrator of the computer 200. A printing device or the like, for example, is connected to the interface device 205. The communication device 206 has, for example, the same function as the communication unit 110 shown in FIG. 1, is connected to the network N, and exchanges various kinds of information with the Web site 300 on the Internet.

The CPU 201 performs various processes by reading each program stored in the hard disk device 208, loading it into the RAM 207, and executing it. These programs can cause the computer 200 to function as the setting unit 131, the crawl unit 132, the extraction unit 133, and the output control unit 134 shown in FIG. 1.

Note that the storage control program does not necessarily have to be stored in the hard disk device 208. For example, the computer 200 may read and execute a program stored in a storage medium readable by the computer 200. Storage media readable by the computer 200 include, for example, portable recording media such as CD-ROMs, DVD discs, and USB (Universal Serial Bus) memory, semiconductor memory such as flash memory, and hard disk drives. Alternatively, the storage control program may be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 200 may read the storage control program from that device and execute it.

The following supplementary notes are further disclosed with respect to the above embodiment.

(Supplementary note 1) A storage control device that stores data acquired from a web page, comprising:
a storage unit that stores, in association with first data acquired from a web page, tag structure information indicating the position, within the web page from which the first data was acquired, of the source of the first data; and
a control unit that newly acquires second data from the web page on the basis of the tag structure information stored in the storage unit, and causes the storage unit to store the newly acquired second data in association with the tag structure information.

(Supplementary note 2) The storage control device according to supplementary note 1, wherein the control unit newly acquires third data from the web page on the basis of the tag structure information after a predetermined interval from the acquisition of the second data.

(Supplementary note 3) The storage control device according to supplementary note 1, wherein the control unit, upon detecting a change based on a comparison between the first data acquired from the web page and the second data acquired from the web page, causes the content of the detected change to be stored.

(Supplementary note 4) The storage control device according to supplementary note 1, wherein the control unit, upon detecting a change based on a comparison between the first data acquired from the web page and the second data acquired from the web page, notifies a predetermined destination of the content of the detected change.

(Supplementary note 5) A storage control program that causes a computer to execute a process of storing data acquired from a web page, the process comprising:
storing, in association with first data acquired from a web page, tag structure information indicating the position, within the web page from which the first data was acquired, of the source of the first data; and
newly acquiring second data from the web page on the basis of the stored tag structure information, and storing the newly acquired second data in association with the tag structure information.

(Supplementary note 6) The storage control program according to supplementary note 5, wherein the storing process newly acquires third data from the web page on the basis of the tag structure information after a predetermined interval from the acquisition of the second data.

(Supplementary note 7) The storage control program according to supplementary note 5, wherein the storing process, upon detecting a change based on a comparison between the first data acquired from the web page and the second data newly acquired from the web page, stores the content of the detected change.

(Supplementary note 8) The storage control program according to supplementary note 5, wherein the storing process, upon detecting a change based on a comparison between the first data acquired from the web page and the second data newly acquired from the web page, notifies a predetermined destination of the content of the detected change.

(Supplementary note 9) A storage control method in which a computer executes a process of storing data acquired from a web page, the process comprising:
storing, in association with first data acquired from a web page, tag structure information indicating the position, within the web page from which the first data was acquired, of the source of the first data; and
newly acquiring second data from the web page on the basis of the stored tag structure information, and storing the newly acquired second data in association with the tag structure information.

(Supplementary note 10) The storage control method according to supplementary note 9, wherein the storing process newly acquires third data from the web page on the basis of the tag structure information after a predetermined interval from the acquisition of the second data.

(Supplementary note 11) The storage control method according to supplementary note 9, wherein the storing process, upon detecting a change based on a comparison between the first data acquired from the web page and the second data newly acquired from the web page, stores the content of the detected change.

(Supplementary note 12) The storage control method according to supplementary note 9, wherein the storing process, upon detecting a change based on a comparison between the first data acquired from the web page and the second data newly acquired from the web page, notifies a predetermined destination of the content of the detected change.

Description of Symbols
100 … Data acquisition device
101 … Input unit
102 … Output unit
110 … Communication unit
120 … Storage unit
121 … Target storage unit
122 … Item storage unit
123 … Page storage unit
124 … Extracted data storage unit
125 … Notification destination storage unit
130 … Control unit
131 … Setting unit
132 … Crawl unit
133 … Extraction unit
134 … Output control unit
200 … Computer
201 … CPU
202 … Input device
203 … Monitor
204 … Medium reading device
205 … Interface device
206 … Communication device
207 … RAM
208 … Hard disk device
209 … Bus
300 … Web site
N … Network

Claims

A storage control device that stores data acquired from a web page, comprising:
a storage unit that stores, in association with first data acquired from a web page, tag structure information indicating the position, within the web page from which the first data was acquired, of the source of the first data; and
a control unit that newly acquires second data from the web page on the basis of the tag structure information stored in the storage unit, and causes the storage unit to store the newly acquired second data in association with the tag structure information.

The storage control device according to claim 1, wherein the control unit newly acquires third data from the web page on the basis of the tag structure information after a predetermined interval from the acquisition of the second data.

The storage control device according to claim 1, wherein the control unit, upon detecting a change based on a comparison between the first data acquired from the web page and the second data acquired from the web page, causes the content of the detected change to be stored.

The storage control device according to claim 1, wherein the control unit, upon detecting a change based on a comparison between the first data acquired from the web page and the second data acquired from the web page, notifies a predetermined destination of the content of the detected change.

A storage control program that causes a computer to execute a process of storing data acquired from a web page, the process comprising:
storing, in association with first data acquired from a web page, tag structure information indicating the position, within the web page from which the first data was acquired, of the source of the first data; and
newly acquiring second data from the web page on the basis of the stored tag structure information, and storing the newly acquired second data in association with the tag structure information.

A storage control method in which a computer executes a process of storing data acquired from a web page, the process comprising:
storing, in association with first data acquired from a web page, tag structure information indicating the position, within the web page from which the first data was acquired, of the source of the first data; and
newly acquiring second data from the web page on the basis of the stored tag structure information, and storing the newly acquired second data in association with the tag structure information.