CN102360368B - Web data extraction method based on visual customization of extraction template - Google Patents

Web data extraction method based on visual customization of extraction template

Info

Publication number
CN102360368B
Authority
CN
China
Prior art keywords
page
data
tag
template
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110301775.9A
Other languages
Chinese (zh)
Other versions
CN102360368A (en)
Inventor
李庆忠
闫中敏
彭朝晖
蔡益清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN201110301775.9A
Publication of CN102360368A
Application granted
Publication of CN102360368B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Web data extraction method based on visual customization of an extraction template. The method comprises the following steps: A. preprocessing of the template page: converting and displaying the source code of the template page; B. visual customization of the extraction template: a drag-to-select function is provided in the user interface, and the user sets the correspondence between the attribute labels and data values on the template page and the attributes in a domain model, thereby establishing the extraction template; C. setting the batch extraction frequency for pages: the crawled HTML (Hypertext Markup Language) pages are extracted in batches once every 8 hours; and D. batch extraction of pages: the crawled HTML pages are extracted in batches with the corresponding extraction template, and the semi-structured data are converted into structured data and stored in a local database.

Description

Web data extraction method based on visual customization of an extraction template
Technical field
The present invention relates to the extraction of Web pages and belongs to the field of computer applications; in particular, it relates to a Web data extraction method based on visual customization of an extraction template.
Background art
With the rapid development of Internet technology, the number of websites and web pages on the Web is growing explosively, making the Web a huge, widely distributed data source. Text, tables, and multimedia files such as pictures and video are the main forms of Web information. Web data extraction extracts semantically consistent, structured numerical knowledge from Web data according to certain rules and builds a repository of numerical knowledge elements, in order to meet users' data query and data analysis needs. A great deal of work has been done in the data extraction field on automatically converting input Web pages into structured data. Web data extraction mainly produces structured data, which are convenient for subsequent analysis, mining, and processing. It is therefore of vital importance to many Web data analysis and mining applications.
A Web data extraction task can be formally defined in terms of its input and output. The input may be unstructured data, such as free text, or the semi-structured documents that are ubiquitous on the Web.
Given these requirements, current Web page data extraction still has the following shortcomings:
1. Because Web data are heterogeneous and lack structure, Web data applications oriented toward analysis and mining, such as market intelligence analysis, must spend considerable effort handling Web data sources in different formats.
2. The output of a Web data extraction task may be a relational table with many records or a data object with a complex structure. In some tasks an attribute may be missing, or one attribute of a record may have several values. In addition, when the attribute order in the semi-structured data of a Web page is not fixed, or the data contain spelling errors, the extraction task becomes more complex and difficult.
Summary of the invention
The object of the invention is to solve the above problems by providing a Web data extraction method based on visual customization of an extraction template, which has the advantage of visual, user-friendly interaction.
To achieve this object, the invention adopts the following technical solution:
A Web data extraction method based on visual customization of an extraction template, comprising the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the batch extraction frequency for pages;
D. batch extraction of pages;
Step A, template page preprocessing, is the conversion and display of the template page source code: a program in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts it into XML format, and displays it in the user interface on the display. In step A, the conversion and display of the template page source code specifically comprise the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. Analyze the complete DOM structure of the page and display it in the user interface;
A3. Add the necessary JS control code to the converted page, while preserving the original structure of the page, so that the page can be annotated;
A4. Display the XML-format page produced by the above steps in the user interface, for the user to carry out visual customization of the template.
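Step A1 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class name HtmlTreeBuilder, the function html_to_xml, and the VOID_TAGS set are assumptions made for this example, and only the standard library is used. The sketch turns loosely written HTML into a well-formed XML string by rebuilding the element tree.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlTreeBuilder(HTMLParser):
    """Rebuilds an ElementTree from HTML so it can be serialized as XML."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []          # currently open elements

    def handle_starttag(self, tag, attrs):
        attrib = {name: (value or "") for name, value in attrs}
        parent = self.stack[-1] if self.stack else self.root
        if parent is None:
            element = ET.Element(tag, attrib)
            self.root = element
        else:
            element = ET.SubElement(parent, tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):          # text after a child belongs to the child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    """Return a well-formed XML serialization of the given HTML source."""
    builder = HtmlTreeBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

The unclosed `<br>` and `<img>` tags of the input come out as self-closing XML elements, so the result can be parsed by any XML tool. A production system would more likely rely on a dedicated HTML cleaning library; the source does not say which tool is used.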
Step B, visual customization of the extraction template, means that the user interface provides a drag-to-select function, and the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby establishing the extraction template. In step B, visual customization of the extraction template specifically comprises the following steps:
B1. After opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program analyzes and records the XPATH path of the selected data item;
B2. If the data item also has a corresponding page label in the page, the user also drags to select that page label; the program records the XPATH path of the page label and the text content of the label, and combines them with the XPATH path of the selected data item into one extraction rule. If the data item has no corresponding page label, nothing further needs to be selected;
B3. According to the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. The attribute label belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; it indicates the semantics of that data item, which completes the mapping from the page data item to a column of the data table;
B4. Repeat steps B1 to B3 until all the data to be extracted have been annotated, and save the resulting set of extraction rules as a page extraction template.
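One possible in-memory shape for the rules produced by steps B1 to B4 is sketched below in Python. The names ExtractionRule, positional_xpath, and make_rule are hypothetical, and the mouse selection of the user interface is replaced by two elements picked directly from a small sample page; the source does not specify how the program derives the XPATH path.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractionRule:
    """One rule of the template (field names are illustrative assumptions)."""
    attribute: str                      # attribute label chosen in B3
    item_xpath: str                     # XPATH of the data item, recorded in B1
    label_xpath: Optional[str] = None   # XPATH of the page label, recorded in B2
    label_text: Optional[str] = None    # text content of the page label (B2)


def positional_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return an absolute positional path such as /html/body[1]/div[2]."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


def make_rule(root, attribute, item, label=None) -> ExtractionRule:
    """Combine a selected data item and an optional page label into a rule."""
    if label is None:
        return ExtractionRule(attribute, positional_xpath(root, item))
    return ExtractionRule(
        attribute,
        positional_xpath(root, item),
        positional_xpath(root, label),
        (label.text or "").strip(),
    )


# Usage: the two <td> elements stand in for what the user drags over.
page = ET.fromstring(
    "<html><body><table><tr>"
    "<td>Work location</td><td>Jinan</td>"
    "</tr></table></body></html>"
)
label_cell, item_cell = page.findall(".//td")
template: List[ExtractionRule] = [
    make_rule(page, "work_location", item_cell, label_cell)  # B4: collect rules
]
```

Keeping the label text next to the label path is what later allows the extraction step to detect that a label has moved or disappeared.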
Step C, setting the batch extraction frequency for pages, sets batch extraction of the crawled HTML pages to run once every 8 hours.
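The 8-hour frequency of step C could be expressed as in the following small Python sketch. The names EXTRACTION_INTERVAL, is_due, and next_run are assumptions for illustration; the source does not describe the scheduling mechanism itself.

```python
from datetime import datetime, timedelta

EXTRACTION_INTERVAL = timedelta(hours=8)   # frequency stated in step C


def is_due(last_run: datetime, now: datetime) -> bool:
    """True when at least one full interval has passed since the last batch."""
    return now - last_run >= EXTRACTION_INTERVAL


def next_run(last_run: datetime) -> datetime:
    """Time at which the next batch extraction should start."""
    return last_run + EXTRACTION_INTERVAL
```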
Step D, batch extraction of pages, means that the corresponding extraction template is used to extract in batches from the large number of crawled HTML pages, converting the semi-structured data in them into structured data and saving it to a local database;
In step D, batch extraction of pages specifically comprises the following steps:
D1. Convert the current page to be extracted into a standard XML file;
D2. Use the extraction rules recorded in the extraction template, that is, the XPATH paths, to extract the required data items;
D3. According to the data label corresponding to each extraction rule, save the extracted data items into the corresponding columns of the database table;
Step D2 specifically comprises the following steps:
D2-1. Select an extraction rule that has not yet been used;
D2-2. If the extraction rule records no page label information, read the text content directly at the XPATH path of the data item, mark the extraction rule as used, and go to step D2-8; if the extraction rule does record page label information, go to step D2-3;
D2-3. Extract the text at the XPATH path of the page label; if the extraction succeeds, go to step D2-4; if it fails, the data item corresponding to this page label is missing or displaced in the current page, so go to step D2-7;
D2-4. Compare the extracted text with the page label text recorded in the extraction rule; if they match, extract the data at the XPATH of the data item recorded in the extraction rule, mark the extraction rule as used, and go to step D2-8; if they do not match, the data item corresponding to this page label is missing or displaced in the current page, so go to step D2-5;
D2-5. Check whether the extracted text matches a page label recorded in one of the extraction rules; if such an extraction rule exists, treat the text as a page label and go to step D2-6; otherwise go to step D2-7;
D2-6. Based on the XPATHs of the page label and the data item recorded in that extraction rule, calculate the XPATH of the data item that corresponds to this text when it is taken as the page label, and extract the data; if the extracted data are non-empty, mark that extraction rule as used; go to step D2-7;
D2-7. Starting from the XPATH path of the original page label, perform an extended search in the page for that page label. If it is not found, the data item corresponding to this label is missing from the current page. If it is found, calculate the XPATH of the corresponding data item from the XPATHs of the page label and the data item recorded in the extraction rule, and extract the data. Finally, mark the original extraction rule as used and go to step D2-8;
D2-8. Repeat the above steps until all extraction rules have been used.
Step D2-3 handles the case in which the attribute order of the semi-structured data in a Web page is not fixed or the data contain spelling errors: an extended search ensures that no data are lost.
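The control flow of steps D2-1 to D2-8 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the Rule class and all function names are hypothetical; a page label and its data item are assumed to be same-tag siblings, so that only the index of the last path step differs; and the extended search of D2-7 is reduced to a scan of the whole document in document order.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:  # hypothetical representation of one extraction rule
    attribute: str
    item_xpath: str
    label_xpath: Optional[str] = None
    label_text: Optional[str] = None


def node_at(root, xpath):
    # ElementTree only accepts paths relative to the root element,
    # so "/html/body[1]/..." is reduced to "body[1]/...".
    return root.find(xpath.split("/", 2)[2])


def text_of(element):
    return (element.text or "").strip() if element is not None else None


def sibling_offset(rule):
    # Assumption: label and data item are same-tag siblings, so only the
    # index of the last path step differs (for example td[3] and td[4]).
    l_head, l_last = rule.label_xpath.rsplit("/", 1)
    i_head, i_last = rule.item_xpath.rsplit("/", 1)
    l = re.fullmatch(r"(\w+)\[(\d+)\]", l_last)
    i = re.fullmatch(r"(\w+)\[(\d+)\]", i_last)
    if not (l and i) or l_head != i_head or l.group(1) != i.group(1):
        return None
    return int(i.group(2)) - int(l.group(2))


def item_for_label(parents, label_el, rule):
    # Apply the recorded label-to-item offset at the label's actual position.
    offset, parent = sibling_offset(rule), parents.get(label_el)
    if offset is None or parent is None:
        return None
    same = [c for c in parent if c.tag == label_el.tag]
    k = same.index(label_el) + offset
    return text_of(same[k]) if 0 <= k < len(same) else None


def extract(root, rules):
    parents = {child: parent for parent in root.iter() for child in parent}
    result, used = {}, set()
    for n, rule in enumerate(rules):                  # D2-1 and D2-8
        if n in used:
            continue
        used.add(n)                                   # rule is marked as used
        if rule.label_xpath is None:                  # D2-2: no page label
            result[rule.attribute] = text_of(node_at(root, rule.item_xpath))
            continue
        label_el = node_at(root, rule.label_xpath)    # D2-3
        found = text_of(label_el)
        if found == rule.label_text:                  # D2-4: label matches
            result[rule.attribute] = text_of(node_at(root, rule.item_xpath))
            continue
        if label_el is not None:                      # D2-5: another rule's label?
            for m, other in enumerate(rules):
                if m not in used and other.label_text == found:
                    value = item_for_label(parents, label_el, other)  # D2-6
                    if value:
                        result[other.attribute] = value
                        used.add(m)
                    break
        for element in root.iter():                   # D2-7: simplified search
            if text_of(element) == rule.label_text:
                value = item_for_label(parents, element, rule)
                if value:
                    result[rule.attribute] = value
                break
    return result
```

An attribute whose label cannot be found anywhere is simply absent from the returned dictionary, which corresponds to the "missing data item" outcome of step D2-7.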
Beneficial effects of the present invention:
1. For each data source, the invention uses a visual, user-driven customization method to design a parameterized, configurable wrapper with visual, user-friendly interaction, and automatically extracts data from the large volume of collected Web pages according to the wrapper.
2. Because the content and structure of Web pages change frequently, previously generated extraction rules can become invalid. The invention addresses how to improve the adaptability of Web data extraction effectively, so that the corresponding extraction rules are adjusted and updated automatically according to changes in the target pages.
3. The data extraction method of the invention is widely applicable and highly accurate, adapts to changes in Web pages, and can greatly improve extraction efficiency.
Description of the drawings
Fig. 1 is the flow chart of the Web data extraction method based on visual customization of an extraction template;
Fig. 2 is the flow chart of template page preprocessing;
Fig. 3 is the flow chart of visual customization of the page extraction template;
Fig. 4 is the overall flow chart of page extraction;
Fig. 5 is the detailed flow chart of the extraction process;
Fig. 6 is a schematic diagram of a detail page of a certain website used as the page template;
Fig. 7 is a schematic diagram of the extraction process applied to a web page of the website.
Embodiments
The invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a Web data extraction method based on visual customization of an extraction template comprises the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the batch extraction frequency for pages;
D. batch extraction of pages.
Step A, template page preprocessing, is the conversion and display of the template page source code: a program in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts it into XML format, and displays it in the user interface on the display.
Step B, visual customization of the extraction template, means that the user interface provides a drag-to-select function, and the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby establishing the extraction template.
Step C, setting the batch extraction frequency for pages, sets batch extraction of the crawled HTML pages to run once every 8 hours.
Step D, batch extraction of pages, means that the corresponding extraction template is used to extract in batches from the large number of crawled HTML pages, converting the semi-structured data in them into structured data and saving it to a local database.
As shown in Fig. 2, in step A, the conversion and display of the template page source code specifically comprise the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. Analyze the complete DOM structure of the page and display it in the user interface;
A3. Add the necessary JS control code to the converted page, while preserving the original structure of the page, so that the page can be annotated;
A4. Display the XML-format page produced by the above steps in the user interface, for the user to carry out visual customization of the template.
As shown in Fig. 3, in step B, visual customization of the extraction template specifically comprises the following steps:
B1. After opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program analyzes and records the XPATH path of the selected data item;
B2. If the data item also has a corresponding page label in the page, the user also drags to select that page label; the program records the XPATH path of the page label and the text content of the label, and combines them with the XPATH of the selected data item into one extraction rule. If the data item has no corresponding page label, nothing further needs to be selected;
B3. According to the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2. The attribute label belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; it indicates the semantics of that data item, and in essence completes the mapping from the page data item to a column of the data table;
B4. Repeat steps B1 to B3 until all the data to be extracted have been annotated, and save the resulting set of extraction rules as a page extraction template.
As shown in Fig. 4, in step D, batch extraction of pages specifically comprises the following steps:
D1. Convert the current page to be extracted into a standard XML file;
D2. Use the extraction rules recorded in the extraction template, which are essentially XPATH paths, to extract the required data items;
D3. According to the data label corresponding to each extraction rule, save the extracted data items into the corresponding columns of the database table.
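Step D3 can be illustrated with a short Python sketch that uses the standard sqlite3 module as a stand-in for the local database. The table name job, the column names in DOMAIN_MODEL, and the functions create_table and save_record are assumptions made for this example; the source names neither the database product nor the schema.

```python
import sqlite3

# Hypothetical domain model: one table column per attribute label.
DOMAIN_MODEL = ("job_title", "company", "location")


def create_table(conn: sqlite3.Connection) -> None:
    columns = ", ".join(f'"{name}" TEXT' for name in DOMAIN_MODEL)
    conn.execute(f"CREATE TABLE IF NOT EXISTS job ({columns})")


def save_record(conn: sqlite3.Connection, record: dict) -> None:
    """Store one extracted page; attributes missing from the page become NULL."""
    values = [record.get(name) for name in DOMAIN_MODEL]
    placeholders = ", ".join("?" for _ in DOMAIN_MODEL)
    conn.execute(f"INSERT INTO job VALUES ({placeholders})", values)
    conn.commit()
```

Because the column list comes from a fixed domain model rather than from page content, only the extracted values are passed as query parameters.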
As shown in Fig. 5, step D2 specifically comprises the following steps:
D2-1. Select an extraction rule that has not yet been used;
D2-2. If the extraction rule records no page label information, read the text content directly at the XPATH path of the data item, mark the extraction rule as used, and go to step D2-8; if the extraction rule does record page label information, go to step D2-3;
D2-3. Extract the text at the XPATH path of the page label; if the extraction succeeds, go to step D2-4; if it fails, the data item corresponding to this page label is missing or displaced in the current page, so go to step D2-7;
D2-4. Compare the extracted text with the page label text recorded in the extraction rule; if they match, extract the data at the XPATH of the data item recorded in the extraction rule, mark the extraction rule as used, and go to step D2-8; if they do not match, the data item corresponding to this page label is missing or displaced in the current page, so go to step D2-5;
D2-5. Check whether the extracted text matches a page label recorded in one of the extraction rules; if such an extraction rule exists, treat the text as a page label and go to step D2-6; otherwise go to step D2-7;
D2-6. Based on the XPATHs of the page label and the data item recorded in that extraction rule, calculate the XPATH of the data item that corresponds to this text when it is taken as the page label, and extract the data; if the extracted data are non-empty, mark that extraction rule as used; go to step D2-7;
D2-7. Starting from the XPATH path of the original page label, perform an extended search in the page for that page label. If it is not found, the data item corresponding to this label is missing from the current page. If it is found, calculate the XPATH of the corresponding data item from the XPATHs of the page label and the data item recorded in the extraction rule, and extract the data. Finally, mark the original extraction rule as used and go to step D2-8;
D2-8. Repeat the above steps until all extraction rules have been used.
Step D2-3 handles the case in which the attribute order of the semi-structured data in a Web page is not fixed or the data contain spelling errors: an extended search ensures that no data are lost.
In another embodiment of the invention, a certain website is chosen as the data source. A detail page serves as the page template for customizing the extraction template; a screenshot of the main data region of the page is shown in Fig. 6.
Suppose the data that the user manually marks for extraction are the parts enclosed in rectangular boxes in the figure.
The following 9 extraction rules are then obtained:
1. Data label: job title;
Page label: none;
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2]
2. Data label: recruiting company;
Page label: none;
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]
3. Data label: date issued;
Page label: date issued;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]
4. Data label: work location;
Page label: work location;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]
5. Data label: number of recruits;
Page label: number of recruits;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]
6. Data label: working experience;
Page label: length of service;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]
7. Data label: language requirement;
Page label: language requirement;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]
8. Data label: educational background;
Page label: educational requirement;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
9. Data label: salary level;
Page label: salary range;
Page label XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPATH: /HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
Using the extraction template formed by these 9 extraction rules, similar web pages from the same website can be extracted in batches.
Suppose a web page of the same website (Fig. 7) is to be extracted:
This page lacks two of the data items to be extracted: language requirement and salary level. Analysis of the page code shows that extraction rules 1 to 6 remain valid and can be applied directly. When extraction rule 7, "language requirement", is applied, the text found at the XPATH of the page label in the current page is the education label, which does not match the "language requirement" recorded in the extraction rule. However, this education page label exists in extraction rule 8, so the data item that follows it, "junior college", is extracted. An extended search of the page for the page label "language requirement" is then carried out; because the label does not exist in the page, the search finds nothing. In this way, although the structure of the page being extracted differs from that of the page on which the template was built, the data on the page are still correctly identified and extracted.
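The recalculation in this example can be made concrete with a small Python sketch. It assumes, as the rules listed above suggest, that a page label and its data item are sibling cells of the same tag, so that the recorded offset between their last path steps can be reapplied wherever the label is actually found. The function name shift_item_xpath is hypothetical, and the source does not state the resulting path explicitly; the value TD[4] follows only from this sibling assumption.

```python
import re

# Splits a positional path into (everything up to the last "/", tag, index).
_LAST_STEP = re.compile(r"(.*/)(\w+)\[(\d+)\]$")


def shift_item_xpath(label_xpath: str, item_xpath: str, found_label_xpath: str) -> str:
    """Move a rule's data-item path to where its label was actually found."""
    label = _LAST_STEP.match(label_xpath)
    item = _LAST_STEP.match(item_xpath)
    found = _LAST_STEP.match(found_label_xpath)
    if not (label and item and found):
        raise ValueError("paths must end in a positional step such as TD[3]")
    if label.group(1) != item.group(1) or label.group(2) != item.group(2):
        raise ValueError("label and data item are not same-tag siblings")
    offset = int(item.group(3)) - int(label.group(3))
    return f"{found.group(1)}{item.group(2)}[{int(found.group(3)) + offset}]"


ROW = "/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]"
# Rule 8 records its label at TD[5] and its data item at TD[6]; on the new page
# the education label sits where rule 7 expected its own label, at TD[3].
shifted = shift_item_xpath(ROW + "/TD[5]", ROW + "/TD[6]", ROW + "/TD[3]")
```

Under these assumptions the education value is read from the cell immediately after the relocated label.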
Although the specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that, on the basis of the technical solution of the invention, various modifications or variations that can be made without creative effort still fall within the scope of protection of the invention.

Claims (1)

1. A Web data extraction method based on visual customization of an extraction template, characterized in that it comprises the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the batch extraction frequency for pages;
D. batch extraction of pages;
Step A, template page preprocessing, is the conversion and display of the template page source code: a program in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts it into XML format, and displays it in the user interface on the display; in step A, the conversion and display of the template page source code specifically comprise the following steps:
A1. Analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. Analyze the complete DOM structure of the page and display it in the user interface;
A3. Add the necessary JS control code to the converted page, while preserving the original structure of the page, so that the page can be annotated;
A4. Display the XML-format page produced by the above steps in the user interface, for the user to carry out visual customization of the template;
Step B, visual customization of the extraction template, means that the user interface provides a drag-to-select function, and the user sets the correspondence between the attribute labels and data values on the template page and the attributes in the domain model, thereby establishing the extraction template; in step B, visual customization of the extraction template specifically comprises the following steps:
B1. After opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program analyzes and records the XPATH path of the selected data item;
B2. If the data item also has a corresponding page label in the page, the user also drags to select that page label; the program records the XPATH path of the page label and the text content of the page label, and combines them with the XPATH path of the selected data item into one extraction rule; if the data item has no corresponding page label, nothing further needs to be selected;
B3. According to the domain model, the user selects an attribute label for the extraction rule formed in steps B1 and B2; the attribute label belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; it indicates the semantics of that data item, which completes the mapping from the page data item to a column of the data table;
B4. Repeat steps B1 to B3 until all the data to be extracted have been annotated, and save the resulting set of extraction rules as a page extraction template.
2. The Web data extraction method based on visual customization of an extraction template as claimed in claim 1, characterized in that step C, setting the batch extraction frequency for pages, sets batch extraction of the crawled HTML pages to run once every 8 hours.
3. The Web data extraction method based on visual customization of an extraction template as claimed in claim 1, characterized in that step D, batch extraction of pages, means that the corresponding extraction template is used to extract in batches from the large number of crawled HTML pages, converting the semi-structured data in them into structured data and saving it to a local database;
In step D, batch extraction of pages specifically comprises the following steps:
D1. Convert the current page to be extracted into a standard XML file;
D2. Use the XPATH paths recorded in the extraction template to extract the required data items;
D3. According to the data label corresponding to each extraction rule, save the extracted data items into the corresponding columns of the database table;
Step D2 specifically comprises the following steps:
D2-1. Select an extraction rule that has not yet been used;
D2-2. If the extraction rule records no page label information, read the text content directly at the XPATH path of the data item, mark the extraction rule as used, and go to step D2-8; if the extraction rule does record page label information, go to step D2-3;
D2-3. Extract the text at the XPATH path of the page label; if the extraction succeeds, go to step D2-4; if it fails, the data item corresponding to this page label is missing or displaced in the current page, so go to step D2-7;
D2-4. Compare the extracted text with the page label text recorded in the extraction rule; if they match, extract the data at the XPATH of the data item recorded in the extraction rule, mark the extraction rule as used, and go to step D2-8; if they do not match, the data item corresponding to this page label is missing or displaced in the current page, so go to step D2-5;
D2-5. Check whether the extracted text matches a page label recorded in one of the extraction rules; if such an extraction rule exists, treat the text as a page label and go to step D2-6; otherwise go to step D2-7;
D2-6. Based on the XPATHs of the page label and the data item recorded in that extraction rule, calculate the XPATH of the data item that corresponds to this text when it is taken as the page label, and extract the data; if the extracted data are non-empty, mark that extraction rule as used; go to step D2-7;
D2-7. Starting from the XPATH path of the original page label, perform an extended search in the page for that page label; if it is not found, the data item corresponding to this page label is missing from the current page; if it is found, calculate the XPATH of the corresponding data item from the XPATHs of the page label and the data item recorded in the extraction rule, and extract the data; finally, mark the original extraction rule as used and go to step D2-8;
D2-8. Repeat the above steps until all extraction rules have been used.
4. The Web data extraction method based on visual customization of an extraction template as claimed in claim 3, characterized in that step D2-3 handles the case in which the attribute order of the semi-structured data in a Web page is not fixed or the data contain spelling errors, an extended search ensuring that no data are lost.
CN201110301775.9A 2011-10-09 2011-10-09 Web data extraction method based on visual customization of extraction template Active CN102360368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110301775.9A CN102360368B (en) 2011-10-09 2011-10-09 Web data extraction method based on visual customization of extraction template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110301775.9A CN102360368B (en) 2011-10-09 2011-10-09 Web data extraction method based on visual customization of extraction template

Publications (2)

Publication Number Publication Date
CN102360368A CN102360368A (en) 2012-02-22
CN102360368B true CN102360368B (en) 2014-07-02

Family

ID=45585697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110301775.9A Active CN102360368B (en) 2011-10-09 2011-10-09 Web data extraction method based on visual customization of extraction template

Country Status (1)

Country Link
CN (1) CN102360368B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990140B2 (en) * 2012-06-08 2015-03-24 Microsoft Technology Licensing, Llc Transforming data into consumable content
US9595298B2 (en) 2012-07-18 2017-03-14 Microsoft Technology Licensing, Llc Transforming data to create layouts
CN103020189B (en) * 2012-12-03 2016-08-10 深圳中兴网信科技有限公司 Data processing equipment and data processing method
CN103116448A (en) * 2013-01-30 2013-05-22 浪潮电子信息产业股份有限公司 Extract method for visualizing information
CN104182412B (en) * 2013-05-24 2017-08-04 ***通信集团安徽有限公司 A kind of web page crawl method and system
CN105447184B (en) * 2015-12-15 2019-06-11 北京百分点信息科技有限公司 Information extraction method and device
CN106021485B (en) * 2016-05-19 2019-05-14 中国传媒大学 A kind of polynary attribute cinematic data visualization system
CN107437158B (en) * 2016-05-26 2021-08-10 北京京东尚科信息技术有限公司 Data query method, device and computer readable storage medium
CN106202348A (en) * 2016-07-04 2016-12-07 中山大学 A kind of web page form information extraction method
CN108121743A (en) * 2016-11-30 2018-06-05 中移(苏州)软件技术有限公司 A kind of generation of generic web pages masterplate and application method, system
CN106648677B (en) * 2016-12-28 2019-08-02 中国科学院南京地理与湖泊研究所 A kind of water environment domain model integrates the visible customization method of template
US10380228B2 (en) 2017-02-10 2019-08-13 Microsoft Technology Licensing, Llc Output generation based on semantic expressions
CN106980921B (en) * 2017-03-02 2021-01-26 上海歌略软件科技有限公司 User-defined risk analysis method
CN107609144A (en) * 2017-09-21 2018-01-19 浪潮软件股份有限公司 A kind of analysis result processing method, apparatus and system
CN107608949B (en) * 2017-10-16 2019-04-16 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN108334634A (en) * 2018-02-27 2018-07-27 北京中关村科金技术有限公司 A kind of method, apparatus, equipment and the storage medium of extraction data information
CN110309364B (en) * 2018-03-02 2023-03-28 腾讯科技(深圳)有限公司 Information extraction method and device
CN108984683B (en) * 2018-06-29 2021-06-25 北京百度网讯科技有限公司 Method, system, equipment and storage medium for extracting structured data
TWI682287B (en) * 2018-10-25 2020-01-11 財團法人資訊工業策進會 Knowledge graph generating apparatus, method, and computer program product thereof
CN109753596B (en) * 2018-12-29 2021-05-25 中国科学院计算技术研究所 Information source management and configuration method and system for large-scale network data acquisition
CN111782737B (en) * 2020-08-12 2024-05-28 中国工商银行股份有限公司 Information processing method, device, equipment and storage medium
CN112199960B (en) * 2020-11-12 2021-05-25 北京三维天地科技股份有限公司 Standard knowledge element granularity analysis system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588371A (en) * 2004-09-08 2005-03-02 孟小峰 Forming method for package device
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588371A (en) * 2004-09-08 2005-03-02 孟小峰 Forming method for package device
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
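基于DOM树的可适应性Web信息抽取 (Adaptive Web information extraction based on the DOM tree); 李朝 et al.; 《计算机科学》 (Computer Science); 2009-07-31; vol. 36, no. 7; p. 203 *
网页结构化信息抽取技术方法研究 (Research on techniques and methods for structured information extraction from Web pages); 郝爱峰; 《山西电子技术》 (Shanxi Electronic Technology); 2008-04-30; no. 4; section 2 *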

Also Published As

Publication number Publication date
CN102360368A (en) 2012-02-22

Similar Documents

Publication Publication Date Title
CN102360368B (en) Web data extraction method based on visual customization of extraction template
Mühleisen et al. Web Data Commons-Extracting Structured Data from Two Large Web Corpora.
US20110087708A1 (en) Business object based operational reporting and analysis
Kongdenfha et al. Rapid development of spreadsheet-based web mashups
Baumgartner et al. Web Data Extraction System.
CN103678509B (en) Generate the method and device of web page template
CN102646039A (en) Software interface generating system and method based on extensible markup language (XML) Schema
Mirza et al. Practicability of dataspace systems
Vercoustre et al. A descriptive language for information object reuse through virtual documents
CN108959580A (en) A kind of optimization method and system of label data
Razali et al. Mapping of annual extreme wind speed analysis from 12 stations in Peninsular Malaysia
KR20190131778A (en) Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL
CN103914488A (en) Document collection, identification, association, search and display system
CN108959356A (en) A kind of intelligence adapted TV university Data application system Data Mart method for building up
CN102819616B (en) Cloud online real-time multi-dimensional analysis system and method
CN102486792A (en) Method and system for reorganizing and displaying universal forum page
Della Penna et al. Visual extraction of information from web pages
Della Penna et al. A spatial relation-based framework to perform visual information extraction
May et al. Semantic technologies enhancing links and linked data for archaeological resources
Diaz et al. User-driven automation of web form filling
CN112711404A (en) Method for generating special topic webpage template once and automatically releasing special topic webpage
Zheng et al. Design and implementation of news collecting and filtering system based on RSS
Németh et al. Metadata Management and Future Plans to Generate Linked Open Data in the Hungarian Web Archiving Pilot Project.
Mukherjee et al. AHA: Asset harvester assistant
Alexander Value based design of former southeastern plantations using Millway Plantation as a research site

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant