CN102360368B

CN102360368B - Web data extraction method based on visual customization of extraction template

Info

Publication number: CN102360368B
Application number: CN201110301775.9A
Authority: CN
Inventors: 李庆忠; 闫中敏; 彭朝晖; 蔡益清
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2011-10-09
Filing date: 2011-10-09
Publication date: 2014-07-02
Anticipated expiration: 2031-10-09
Also published as: CN102360368A

Abstract

The invention discloses a Web data extraction method based on visual customization of an extraction template. The method comprises the following steps: A. template page preprocessing: converting and displaying the source code of the template page; B. visual customization of the extraction template: providing a drag-select function in the user interface, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in a domain model, thereby establishing the extraction template; C. setting the batch page extraction frequency: batch-extracting the crawled HTML (Hypertext Markup Language) pages once every 8 hours; and D. batch page extraction: batch-extracting the crawled HTML pages with the corresponding extraction template, converting the semi-structured data into structured data, and storing the structured data in a local database.

Description

Web data extraction method based on visual customization of an extraction template

Technical field

The present invention relates to the extraction of Web pages and belongs to the field of computer applications; in particular, it relates to a Web data extraction method based on visual customization of an extraction template.

Background technology

With the rapid development of Internet technology, the number of websites and web pages on the Web is growing explosively, making the Web a huge, widely distributed data source. Text, tables, and multimedia files such as pictures and video are the main forms of Web information. Web data extraction extracts semantically consistent, structured numerical knowledge from Web data according to certain rules and builds a numerical knowledge element repository to meet users' data query and data analysis needs. Much work has been carried out in the data extraction field on automatically converting input Web pages into structured data. Web data extraction mainly produces structured data, which is convenient for subsequent analysis, mining, and processing. It is therefore of vital importance to many Web data analysis and mining applications.

A Web data extraction task can be formally defined in terms of its input and output. The input can be unstructured data, such as free text, or the semi-structured documents that are ubiquitous on the Web.

Given these technical requirements, current Web page data extraction still has the following shortcomings:

1. Because Web data is heterogeneous and lacks structure, analysis- and mining-oriented Web data applications, such as market intelligence analysis, must spend a great deal of effort handling Web data sources in different formats.

2. The output of a Web data extraction task can be a relational table with many records or a data object with a complex structure. In some Web data extraction tasks, an attribute may be missing from a record, or an attribute may have multiple values in one record. In addition, when the attributes of the semi-structured data in a Web page appear in a non-unique order or are misspelled, the extraction task becomes more complex and difficult.

Summary of the invention

The object of the present invention is to solve the above problems by providing a Web data extraction method based on visual customization of an extraction template, which has the advantage of visual, user-friendly interaction.

To achieve this object, the present invention adopts the following technical scheme:

A Web data extraction method based on visual customization of an extraction template, comprising the following steps:

A. template page preprocessing;

B. visual customization of the extraction template;

C. setting the batch page extraction frequency;

D. batch page extraction.

Step A, template page preprocessing, is the conversion and display of the template page source code: the program analyzes the HTML source code of the template page in memory, parses its DOM tree structure, converts it into XML format, and shows it in the user interface on the display. In step A, the conversion and display of the template page source code specifically comprise the following steps:

A1. analyzing the HTML source code of the provided template page and converting it into a page file that conforms to the XML standard;

A2. parsing the complete DOM structure of the page and presenting it in the user interface;

A3. adding the necessary JS control code to the converted page, while preserving the original page structure, in order to enable page annotation;

A4. displaying the XML-format page produced by the above steps in the user interface, for the user to carry out visual customization of the template.
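Step A1 can be illustrated with a minimal sketch. The patent does not name a parser or library, so the following is only one possible approach using the Python standard library; the class name `HtmlToXml`, the function `html_to_xml`, and the short `VOID_TAGS` list are assumptions made for illustration. Note that `html.parser` lower-cases tag names, whereas the XPaths quoted later in this document use upper-case names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML (abbreviated, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToXml(HTMLParser):
    """Builds an element tree from HTML, closing tags so the result is well-formed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)          # void element: close immediately
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                          # stray closing tag: ignore it
        while self.open_tags:               # close unclosed inner elements first
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:
            self.builder.data(data)

    def to_tree(self):
        while self.open_tags:               # close whatever is still open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_xml(html):
    """Return a well-formed XML string for a (possibly sloppy) HTML page."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.to_tree(), encoding="unicode")
```

The resulting string can be parsed by any XML parser, which is what makes the positional XPath addressing used in steps B and D possible.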

Step B, visual customization of the extraction template, means that the user interface provides a drag-select function, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in the domain model, thereby establishing the extraction template. In step B, visual customization of the extraction template specifically comprises the following steps:

B1. after opening the template page shown on the display, the user drags with the mouse to select a data item to be extracted; the program analyzes and records the XPath of the selected data item;

B2. if the data item also has a corresponding page-tag in the page, the user drag-selects this page-tag as well; the program records the XPath of the page-tag and its text content and combines them with the XPath of the selected data item into one extraction rule; if the data item has no corresponding page-tag, nothing further needs to be selected;

B3. according to the domain model, the user selects an attribute tag for the extraction rule formed in steps B1 and B2; the attribute tag belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; by indicating the semantics of that data item, the attribute tag completes the mapping from the page data item to a column of the data table;

B4. repeating steps B1 to B3 until all data to be extracted have been marked, and saving the resulting set of extraction rules as a page extraction template.
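Steps B1 to B3 can be sketched as follows. This is an illustrative sketch only: the record type `ExtractionRule`, its field names, and the helper `xpath_of` are hypothetical, and the user's drag selection is assumed to have already been resolved to an element of the parsed page.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class ExtractionRule:
    attribute: str                      # attribute tag chosen from the domain model (B3)
    item_xpath: str                     # XPath of the selected data item (B1)
    label_xpath: Optional[str] = None   # XPath of the page-tag, if one exists (B2)
    label_text: Optional[str] = None    # text content of the page-tag (B2)


def xpath_of(root, target):
    """Return a positional path such as /html/body[1]/table[1]/tr[1]/td[2]."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps, node = [], target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "/" + "/".join([root.tag] + steps[::-1])
```

A rule for a data item with a page-tag would then be built as `ExtractionRule("work place", xpath_of(root, item), xpath_of(root, label), label.text)`, where `item` and `label` are the two elements the user selected.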

Step C, setting the batch page extraction frequency, means that the crawled HTML pages are batch-extracted once every 8 hours.
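The 8-hour frequency of step C could be enforced with a simple timing check. The sketch below is an assumption about how such a schedule might be implemented; the names `EXTRACTION_INTERVAL`, `is_extraction_due`, and `run_periodically`, and the 60-second polling period, are hypothetical.

```python
import time
from datetime import datetime, timedelta

EXTRACTION_INTERVAL = timedelta(hours=8)   # frequency stated in step C


def is_extraction_due(last_run, now):
    """True if no batch has run yet or at least 8 hours have passed since the last one."""
    return last_run is None or now - last_run >= EXTRACTION_INTERVAL


def run_periodically(extract_batch, poll_seconds=60):
    """Call extract_batch() whenever a batch is due; runs until interrupted."""
    last_run = None
    while True:
        now = datetime.now()
        if is_extraction_due(last_run, now):
            extract_batch()
            last_run = now
        time.sleep(poll_seconds)
```

In a deployed system the same effect is often achieved with an operating-system scheduler instead of a long-running loop.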

Step D, batch page extraction, means using the corresponding extraction template to batch-extract the large number of crawled HTML pages, converting the semi-structured data in them into structured data, and saving it to a local database.

In step D, batch page extraction specifically comprises the following steps:

D1. converting the current page to be extracted into a standard XML file;

D2. using the extraction rules recorded in the extraction template, which are essentially XPaths, to extract the required data items;

D3. saving each extracted data item into the corresponding column of the database table, according to the data label of its extraction rule.
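Steps D1 to D3 can be sketched for the simple case in which every XPath still points at the right element. This is a minimal sketch under several assumptions: the page has already been converted to XML, the rules are given as a hypothetical mapping from column name to XPath, the table name `jobs` is invented, and SQLite stands in for the unspecified local database.

```python
import sqlite3
import xml.etree.ElementTree as ET


def select_text(root, xpath):
    """Evaluate an absolute positional path such as /html/body[1]/p[2]."""
    steps = xpath.strip("/").split("/")
    if steps[0].split("[")[0] != root.tag:
        return None
    # ElementTree only accepts relative paths, so drop the root step.
    node = root if len(steps) == 1 else root.find("./" + "/".join(steps[1:]))
    if node is None:
        return None
    return "".join(node.itertext()).strip()


def extract_and_store(conn, page_xml, rules):
    """rules maps a column name (data label) to the XPath of its data item."""
    root = ET.fromstring(page_xml)                                    # D1 result
    row = {col: select_text(root, xp) for col, xp in rules.items()}   # D2
    cols = ", ".join(f'"{c}"' for c in row)
    marks = ", ".join("?" for _ in row)
    conn.execute(f"INSERT INTO jobs ({cols}) VALUES ({marks})",       # D3
                 list(row.values()))
    return row
```

Column names are interpolated into the SQL text here, so they should come only from the trusted domain model, never from page content.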

Step D2 specifically comprises the following steps:

D2-1. selecting an extraction rule that has not yet been used;

D2-2. if the extraction rule records no page-tag information, directly reading the text content at the XPath of the data item, marking the rule as used, and going to step D2-8; if the rule does record page-tag information, going to step D2-3;

D2-3. extracting the text at the XPath of the page-tag; if the extraction succeeds, going to step D2-4; if it fails, the data item corresponding to this page-tag is missing or displaced in the current page, and the method goes to step D2-7;

D2-4. comparing the extracted text with the page-tag text recorded in the extraction rule; if they match, extracting the data at the data item XPath recorded in the rule, marking the rule as used, and going to step D2-8; if they do not match, the data item corresponding to this page-tag is missing or displaced in the current page, and the method goes to step D2-5;

D2-5. checking whether the extracted text matches the page-tag of any unused extraction rule; if such a rule exists, the text is treated as a page-tag and the method goes to step D2-6; otherwise it goes to step D2-7;

D2-6. from the XPaths of the page-tag and the data item recorded in that extraction rule, calculating the XPath of the data item that corresponds to this text as a page-tag, and extracting the data; if the extracted data is non-empty, marking that extraction rule as used; then going to step D2-7;

D2-7. performing an expanded search in the page, based on the XPath of the original page-tag, to find this page-tag; if it is not found, the data item corresponding to this page-tag is missing from the current page; if it is found, calculating the XPath of the corresponding data item from the XPaths of the page-tag and the data item recorded in the extraction rule, and extracting the data; finally marking the original extraction rule as used and going to step D2-8;

D2-8. repeating the above steps until all extraction rules have been used.

Step D2-3 serves to handle the situation in which the attributes of the semi-structured data in a Web page appear in a non-unique order or are misspelled; an expanded search ensures that no data is lost.
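The control flow of steps D2-1 to D2-8 can be sketched as below. The text does not specify how the data item XPath is "calculated" from a relocated page-tag, nor how far the expanded search reaches, so two simplifications are assumed and labeled in the code: the page-tag and its data item are sibling elements whose index offset is preserved, and the expanded search scans the whole page. Rules are plain dictionaries with the hypothetical keys `name`, `item_xpath`, `label_xpath`, and `label_text`.

```python
import re
import xml.etree.ElementTree as ET

LAST_STEP = re.compile(r"^(.*)/(\w+)\[(\d+)\]$")


def node_at(root, xpath):
    steps = xpath.strip("/").split("/")
    return root if len(steps) == 1 else root.find("./" + "/".join(steps[1:]))


def text_at(root, xpath):
    node = node_at(root, xpath)
    return None if node is None else "".join(node.itertext()).strip()


def xpath_of(root, target):
    parents = {child: parent for parent in root.iter() for child in parent}
    steps, node = [], target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "/" + "/".join([root.tag] + steps[::-1])


def item_xpath_for(rule, label_xpath):
    """ASSUMPTION: tag and item are siblings; keep the recorded index offset."""
    label_idx = int(LAST_STEP.match(rule["label_xpath"]).group(3))
    item_tag, item_idx = LAST_STEP.match(rule["item_xpath"]).group(2, 3)
    parent, _, found_idx = LAST_STEP.match(label_xpath).groups()
    return f"{parent}/{item_tag}[{int(found_idx) + int(item_idx) - label_idx}]"


def search_label(root, label_text):
    """ASSUMPTION: the expanded search of D2-7 is simplified to a whole-page scan."""
    for node in root.iter():
        if (node.text or "").strip() == label_text:
            return xpath_of(root, node)
    return None


def extract(root, rules):
    result = {}                                    # a rule is "used" once its name is here
    for rule in rules:                             # D2-1 and D2-8
        if rule["name"] in result:
            continue
        if rule["label_xpath"] is None:            # D2-2: no page-tag recorded
            result[rule["name"]] = text_at(root, rule["item_xpath"])
            continue
        found = text_at(root, rule["label_xpath"])           # D2-3
        if found == rule["label_text"]:                      # D2-4: tag matches
            result[rule["name"]] = text_at(root, rule["item_xpath"])
            continue
        for other in rules:                                  # D2-5 and D2-6
            if (found is not None and other["name"] not in result
                    and other["label_text"] == found):
                value = text_at(root, item_xpath_for(other, rule["label_xpath"]))
                if value:
                    result[other["name"]] = value
        where = search_label(root, rule["label_text"])       # D2-7
        result[rule["name"]] = (
            text_at(root, item_xpath_for(rule, where)) if where else None)
    return result
```

With a template whose row held the tags "Experience", "Language", and "Education" in that order, a page that omits the language cells yields the shifted education value and records the language item as missing, rather than storing the education value in the language column.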

Beneficial effects of the present invention:

1. For each data source, the invention uses a visual user customization method to design a parameterized, configurable wrapper with visual, user-friendly interaction, and automatically extracts the large number of collected Web pages according to the wrapper.

2. Because the content and structure of Web pages change frequently, previously generated extraction rules may become invalid. The invention addresses how to effectively improve the adaptive capability of Web data extraction, so that the corresponding extraction rules can be automatically adjusted and updated as the target web pages change.

3. The data extraction method of the invention is widely applicable and highly precise, adapts to changes in web pages, and can greatly improve extraction efficiency.

Brief description of the drawings

Fig. 1 is a flow chart of the Web data extraction method based on visual customization of an extraction template;

Fig. 2 is a flow chart of template page preprocessing;

Fig. 3 is a flow chart of visual customization of the page extraction template;

Fig. 4 is the overall flow chart of page extraction;

Fig. 5 is a detailed flow chart of the extraction process;

Fig. 6 is a schematic diagram of a detail page of a certain website used as the page template;

Fig. 7 is a schematic diagram of the extraction process applied to a web page of the website.

Embodiment

The invention is further described below with reference to the accompanying drawings and embodiments.

As shown in Fig. 1, a Web data extraction method based on visual customization of an extraction template comprises the following steps:

A. template page preprocessing;

B. visual customization of the extraction template;

C. setting the batch page extraction frequency;

D. batch page extraction.

Step A, template page preprocessing, is the conversion and display of the template page source code: the program analyzes the HTML source code of the template page in memory, parses its DOM tree structure, converts it into XML format, and shows it in the user interface on the display.

Step B, visual customization of the extraction template, means that the user interface provides a drag-select function, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in the domain model, thereby establishing the extraction template.

Step D, batch page extraction, means using the corresponding extraction template to batch-extract the large number of crawled HTML pages, converting the semi-structured data in them into structured data, and saving it to a local database.

As shown in Fig. 2, the conversion and display of the template page source code in step A specifically comprise steps A1 to A4 described above.

As shown in Fig. 3, in step B, visual customization of the extraction template specifically comprises the following steps:

B2. if the data item also has a corresponding page-tag in the page, the user drag-selects this page-tag as well; the program records the XPath of the page-tag and its text content and combines them with the XPath of the selected data item into one extraction rule; if the data item has no corresponding page-tag, nothing further needs to be selected;

B3. according to the domain model, the user selects an attribute tag for the extraction rule formed in steps B1 and B2; the attribute tag belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; the attribute tag indicates the semantics of that data item, and in essence it completes the mapping from the page data item to a column of the data table;

As shown in Fig. 4, in step D, batch page extraction specifically comprises the following steps:

D2. using the extraction rules recorded in the extraction template, which are essentially XPaths, to extract the required data items;

D3. saving each extracted data item into the corresponding column of the database table, according to the data label of its extraction rule.

As shown in Fig. 5, step D2 specifically comprises the following steps:

D2-1. selecting an extraction rule that has not yet been used;

D2-6. from the XPaths of the page-tag and the data item recorded in the extraction rule, calculating the XPath of the data item that corresponds to this text as a page-tag, and extracting the data; if the extracted data is non-empty, marking the corresponding extraction rule as used; then going to step D2-7;

D2-8. repeating the above steps until all extraction rules have been used.

In another embodiment of the present invention, a certain website is chosen as the data source. Its detail page is used as the page template for customizing the extraction template; a screenshot of the main data region of the page is shown in Fig. 6.

Suppose the data that the user manually marks for extraction are the parts enclosed by rectangles in the figure.

We can obtain the following 10 extraction rules:

1. data label: position title;

Page-tag: empty;

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2];

2. data label: recruitment company;

Page-tag: empty;

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]

3. data label: date issued;

Page-tag: date issued;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]

4. data label: work place;

Page-tag: work place;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]

5. data label: the number of recruits;

Page-tag: the number of recruits;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]

6. data label: working experience;

Page-tag: length of service;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]

7. data label: language requirement;

Page-tag: language requirement;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]

8. data label: educational background;

Page-tag: educational requirement;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]

Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]

9. data label: level of salary;

Page-tag: salary scope;

Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]

With the extraction template formed by these 9 extraction rules, we can batch-extract similar web pages from this website.
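An extraction template of this kind can be stored as a small structured file. The sketch below assumes a JSON layout; the patent does not state a storage format, so the key names and the functions `save_template` and `load_template` are hypothetical. The two entries reproduce rules 1 and 3 from the list above.

```python
import json

# Rules 1 and 3 from the embodiment; the dictionary keys are illustrative names.
RULES = [
    {
        "data_label": "position title",
        "page_tag": None,
        "page_tag_xpath": None,
        "item_xpath": "/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2]",
    },
    {
        "data_label": "date issued",
        "page_tag": "date issued",
        "page_tag_xpath": "/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]",
        "item_xpath": "/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]",
    },
]


def save_template(rules, path):
    """Write the rule set to disk as one page extraction template."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rules, f, ensure_ascii=False, indent=2)


def load_template(path):
    """Read a previously saved page extraction template."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A rule without a page-tag simply stores empty values for the two page-tag fields, mirroring the "empty" entries in rules 1 and 2.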

Suppose we extract from another web page of the same website (Fig. 7):

We can see that this page lacks 2 of the data items we want to extract: language requirement and level of salary. By analyzing the page code, we find that extraction rules 1 to 6 are still valid and can be used directly. When we apply the 7th extraction rule, "language requirement", we find that the text at the page-tag XPath in the current page is "educational background", which does not match the "language requirement" recorded in the extraction rule. However, the page-tag "educational background" exists in extraction rule 8, so the data item following it, "junior college", is extracted. An expanded search for the page-tag "language requirement" is then carried out in the page; since this page-tag does not exist in the page, it is not found. Thus, although the structure of the extracted page differs from that of the page used to create the template, the data on the page can still be correctly identified and extracted.

Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical scheme of the invention without creative effort still fall within the scope of protection of the invention.

Claims

1. A Web data extraction method based on visual customization of an extraction template, characterized in that it comprises the following steps:

A. template page preprocessing;

B. visual customization of the extraction template;

C. setting the batch page extraction frequency;

D. batch page extraction;

A4. displaying the XML-format page produced by the above steps in the user interface, for the user to carry out visual customization of the template;

B2. if the data item also has a corresponding page-tag in the page, the user drag-selects this page-tag as well; the program records the XPath of the page-tag and the text content of the page-tag and combines them with the XPath of the selected data item into one extraction rule; if the data item has no corresponding page-tag, nothing further needs to be selected;

2. The Web data extraction method based on visual customization of an extraction template as claimed in claim 1, characterized in that step C, setting the batch page extraction frequency, means that the crawled HTML pages are batch-extracted once every 8 hours.

3. The Web data extraction method based on visual customization of an extraction template as claimed in claim 1, characterized in that step D, batch page extraction, means using the corresponding extraction template to batch-extract the large number of crawled HTML pages, converting the semi-structured data in them into structured data, and saving it to a local database;

D2. using the XPaths recorded in the extraction template to extract the required data items;

D3. saving each extracted data item into the corresponding column of the database table, according to the data label of its extraction rule;

Step D2 specifically comprises the following steps:

D2-1. selecting an extraction rule that has not yet been used;

D2-7. performing an expanded search in the page, based on the XPath of the original page-tag, to find this page-tag; if it is not found, the data item corresponding to this page-tag is missing from the current page; if it is found, calculating the XPath of the corresponding data item from the XPaths of the page-tag and the data item recorded in the extraction rule, and extracting the data; finally marking the original extraction rule as used and going to step D2-8;

D2-8. repeating the above steps until all extraction rules have been used.

4. The Web data extraction method based on visual customization of an extraction template as claimed in claim 3, characterized in that step D2-3 serves to handle the situation in which the attributes of the semi-structured data in a Web page appear in a non-unique order or are misspelled, an expanded search ensuring that no data is lost.