CN102360368B - Web data extraction method based on visual customization of extraction template - Google Patents
- Publication number
- CN102360368B (application number CN201110301775.9A)
- Authority
- CN
- China
- Prior art keywords
- page
- data
- tag
- template
- data item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
- Document Processing Apparatus (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Web data extraction method based on visual customization of an extraction template. The method comprises the following steps: A. template page preprocessing: converting and displaying the source code of the template page; B. visual customization of the extraction template: providing a drag-to-select function in the user interface, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in a domain model, thereby establishing the extraction template; C. setting the batch page extraction frequency: performing batch extraction of the crawled HTML (Hypertext Markup Language) pages once every 8 hours; and D. batch page extraction: extracting the crawled HTML pages in batches with the corresponding extraction template, converting the semi-structured data into structured data, and storing the structured data in a local database.
Description
Technical field
The present invention relates to the extraction of data from Web pages and belongs to the field of computer applications; in particular, it relates to a Web data extraction method based on visual customization of an extraction template.
Background art
With the rapid development of Internet technology, the number of websites and web pages on the Web is growing explosively, making the Web a huge, widely distributed data source. Text, tables, and multimedia files such as pictures and video are the main forms of Web information. Web data extraction extracts semantically consistent, structured numerical knowledge from Web data according to certain rules and builds a numerical knowledge-element repository to meet users' data query and data analysis needs. Much work has been carried out in the data extraction field on automatically converting input Web pages into structured data. Web data extraction is mainly used to produce structured data, which is convenient for subsequent analysis, mining, and processing. It is therefore of vital importance to many Web data analysis and mining applications.
A Web data extraction task can be defined formally in terms of its input and output. The input can be unstructured data, such as free text, or the semi-structured documents that are ubiquitous on the Web.
Given the above technical requirements, current Web page data extraction still has the following shortcomings:
1. Because data on the Web is heterogeneous and lacks structure, analysis- and mining-oriented Web data applications, such as market intelligence analysis, must spend considerable cost on handling Web data sources in different formats.
2. The output of a Web data extraction task can be a relational table with many records or a data object with a complex structure. In some tasks an attribute may be missing, or an attribute may have multiple values within one record. In addition, when the order of attributes in the semi-structured data of a Web page is not fixed or attribute names are misspelled, the extraction task becomes more complex and difficult.
Summary of the invention
The object of the present invention is to solve the above problems by providing a Web data extraction method based on visual customization of an extraction template, which has the advantage of visual, user-friendly interaction.
To achieve this object, the present invention adopts the following technical scheme:
A Web data extraction method based on visual customization of an extraction template, comprising the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the batch page extraction frequency;
D. batch page extraction.
Step A, template page preprocessing, is the conversion and display of the template page source code: a program in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts the page into XML format, and shows it in the user interface on the display. In step A, the conversion and display of the template page source code specifically comprise the following steps:
A1. analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. analyze the complete DOM structure of the page and show it in the user interface;
A3. add the necessary JS control code to the converted page, while preserving the original page structure, so that the page can be annotated;
A4. show the XML-format page produced by the above steps in the user interface, so that the user can use it for visual customization of the template.
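Steps A1 and A2 can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` and `xml.etree.ElementTree` modules; the patent does not name a parser, and the class `HtmlToXml`, the function `html_to_xml`, and the upper-casing of tag names (chosen to match the XPATH style used later in the example) are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        name = tag.upper()
        self.builder.start(name, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(name)  # void elements are closed immediately
        else:
            self.open_tags.append(name)

    def handle_startendtag(self, tag, attrs):
        name = tag.upper()
        self.builder.start(name, {k: (v or "") for k, v in attrs})
        self.builder.end(name)

    def handle_endtag(self, tag):
        name = tag.upper()
        if name not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything left open inside this element, then the element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == name:
                break

    def handle_data(self, data):
        if self.open_tags:  # text outside the root element is dropped
            self.builder.data(data)


def html_to_xml(html: str) -> ET.Element:
    """Parse HTML source and return the root of an XML-serialisable tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close whatever the page never closed
        parser.builder.end(parser.open_tags.pop())
    return parser.builder.close()
```

Calling `ET.tostring(html_to_xml(source))` then yields XML text that can be re-parsed and addressed with positional paths. This sketch does not apply the full HTML error-recovery rules (for example, it does not auto-close a `<p>` before a `<div>`), so a production system would probably rely on a dedicated HTML-cleaning library instead.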
Step B, visual customization of the extraction template, means providing a drag-to-select function in the user interface, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in the domain model, thereby establishing the extraction template. In step B, the visual customization of the extraction template specifically comprises the following steps:
B1. after opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program analyzes and records the XPATH path of the selected data item;
B2. if the data item has a corresponding page tag in the page, the user also selects this page tag by dragging; the program records the XPATH path and the text content of the page tag, which together with the XPATH path of the selected data item form one extraction rule; if the data item has no corresponding page tag, nothing further is selected;
B3. according to the domain model, the user selects an attribute tag for the extraction rule formed in steps B1 and B2; the attribute tag belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; the attribute tag indicates the semantics of that data item, and this in effect completes the mapping from the page data item to a column in the data table;
B4. repeat steps B1 to B3 until all data to be extracted has been marked, then save the resulting set of extraction rules as a page extraction template.
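The extraction rule built in steps B1 to B3 and the template saved in step B4 can be pictured as a small data structure. The following is only a sketch: the class name `ExtractionRule`, its field names, and the JSON storage format are assumptions, since the patent does not specify how a template is stored.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ExtractionRule:
    attribute: str                     # attribute tag chosen from the domain model (B3)
    item_xpath: str                    # XPATH of the selected data item (B1)
    label_text: Optional[str] = None   # text of the page tag, if one exists (B2)
    label_xpath: Optional[str] = None  # XPATH of the page tag, if one exists (B2)


def save_template(rules: List[ExtractionRule]) -> str:
    """Serialise the rule set as JSON (B4); the format is an assumption."""
    return json.dumps([asdict(rule) for rule in rules],
                      ensure_ascii=False, indent=2)


def load_template(text: str) -> List[ExtractionRule]:
    """Rebuild the rule set from the JSON produced by save_template."""
    return [ExtractionRule(**fields) for fields in json.loads(text)]
```

A rule without a page tag simply leaves `label_text` and `label_xpath` empty, which is the case that step D2-2 tests for during extraction.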
Step C, setting the batch page extraction frequency, specifies that batch extraction of the crawled HTML pages is performed once every 8 hours.
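As a minimal sketch of the 8-hour setting, the interval can be kept as a single constant. The names `EXTRACTION_INTERVAL`, `next_run`, and `is_due` are hypothetical, and how the job is actually triggered (timer, cron, or a scheduler service) is not stated in the patent.

```python
from datetime import datetime, timedelta

# The interval given in step C; everything else here is illustrative.
EXTRACTION_INTERVAL = timedelta(hours=8)


def next_run(last_run: datetime) -> datetime:
    """Time at which the next batch extraction should start."""
    return last_run + EXTRACTION_INTERVAL


def is_due(last_run: datetime, now: datetime) -> bool:
    """True once at least 8 hours have passed since the last batch run."""
    return now - last_run >= EXTRACTION_INTERVAL
```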
Step D, batch page extraction, means using the corresponding extraction template to extract in batches from the large number of crawled HTML pages, converting the semi-structured data in them into structured data, and saving it to a local database.
In step D, batch page extraction specifically comprises the following steps:
D1. convert the current page to be extracted into a standard XML file;
D2. use the extraction rules recorded in the extraction template, which are in essence XPATH paths, to extract the required data items;
D3. according to the data tag corresponding to each extraction rule, save the extracted data items in the corresponding columns of the database table.
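Step D3 can be sketched with Python's built-in `sqlite3` module. The table name `job_posting` and the column names in `COLUMNS` are hypothetical stand-ins for the attributes of a domain model; the patent only states that extracted items are saved in the corresponding columns of a table in a local database.

```python
import sqlite3

# Hypothetical domain-model attributes; each one becomes a table column.
COLUMNS = ("position_title", "company", "work_place")


def create_table(conn: sqlite3.Connection) -> None:
    """Create the target table if it does not exist yet."""
    column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS job_posting ({column_defs})")


def store_record(conn: sqlite3.Connection, record: dict) -> None:
    """Save one extracted page; attributes missing from the page become NULL."""
    values = [record.get(name) for name in COLUMNS]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO job_posting ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        values,
    )
    conn.commit()
```

Storing a missing attribute as NULL keeps one row per page even when a data item is absent, which matches the missing-item case handled in step D2.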
Step D2 specifically comprises the following steps:
D2-1. select an extraction rule that has not yet been used;
D2-2. if the extraction rule records no page tag information, read the text content directly at the XPATH path of the data item, mark the rule as used, and go to step D2-8; if the rule records page tag information, go to step D2-3;
D2-3. extract the text at the XPATH path of the page tag; if the extraction succeeds, go to step D2-4; if it fails, the data item corresponding to this page tag is missing or displaced in the current page, so go to step D2-7;
D2-4. compare the extracted text with the page tag text recorded in the extraction rule; if they match, extract the data at the data item XPATH recorded in the rule, mark the rule as used, and go to step D2-8; if they do not match, the data item corresponding to this page tag is missing or displaced in the current page, so go to step D2-5;
D2-5. check whether the extracted text matches the page tag of any of the original extraction rules; if such a rule exists, treat the text as a page tag and go to step D2-6; otherwise go to step D2-7;
D2-6. from the XPATHs of the page tag and data item recorded in the matching extraction rule, compute the XPATH of the data item that corresponds to this text as a page tag, and extract the data; if the extracted data is non-empty, mark the matching rule as used; go to step D2-7;
D2-7. perform an expanded search in the page, starting from the XPATH path of the original page tag, to find this page tag; if it is not found, the data item corresponding to this tag is missing from the current page; if it is found, compute the XPATH of the corresponding data item from the XPATHs of the page tag and data item recorded in the extraction rule, and extract the data; finally mark the original rule as used and go to step D2-8;
D2-8. repeat the above steps until all extraction rules have been used.
Step D2-3 handles the case in which the attribute order of the semi-structured data in a Web page is not fixed or attribute names are misspelled, and ensures through an expanded search that no data is lost.
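The control flow of steps D2-1 to D2-8 can be sketched as follows, using the limited XPath support of Python's `xml.etree.ElementTree`. Two points are assumptions rather than details from the patent: the data item XPATH is recomputed by keeping the sibling-index offset between a page tag and its data item (`shift_item_xpath`), and the expanded search of D2-7 is simplified to a scan of the whole document (`find_label`). All names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    attribute: str
    item_xpath: str
    label_text: Optional[str] = None
    label_xpath: Optional[str] = None


def select(root, xpath):
    """Resolve an absolute positional path such as /HTML/BODY[1]/DIV[2]."""
    steps = xpath.strip("/").split("/")
    if steps[0].split("[")[0] != root.tag:
        return None
    return root if len(steps) == 1 else root.find("./" + "/".join(steps[1:]))


def text_at(root, xpath):
    elem = select(root, xpath) if xpath else None
    return None if elem is None else "".join(elem.itertext()).strip()


STEP = re.compile(r"^(.*)/([^/\[]+)\[(\d+)\]$")  # parent path, tag, index


def shift_item_xpath(rule, new_label_xpath):
    """Assumption: tag and item are siblings, so their index offset is kept."""
    label, item, new = (STEP.match(p or "") for p in
                        (rule.label_xpath, rule.item_xpath, new_label_xpath))
    if not (label and item and new) or label.group(1) != item.group(1):
        return None
    index = int(new.group(3)) + int(item.group(3)) - int(label.group(3))
    return f"{new.group(1)}/{item.group(2)}[{index}]" if index >= 1 else None


def walk(elem, path):
    """Yield (positional XPATH, element) pairs in document order."""
    yield path, elem
    seen = {}
    for child in elem:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")


def find_label(root, label_text):
    """Simplified expanded search: scan the whole page for the tag text."""
    for path, elem in walk(root, "/" + root.tag):
        if "".join(elem.itertext()).strip() == label_text:
            return path
    return None


def extract(root, rules):
    record, used = {}, set()
    by_label = {r.label_text: r for r in rules if r.label_text}
    for rule in rules:                                       # D2-1, D2-8
        if rule.attribute in used:
            continue
        if rule.label_text is None:                          # D2-2
            record[rule.attribute] = text_at(root, rule.item_xpath)
            used.add(rule.attribute)
            continue
        found = text_at(root, rule.label_xpath)              # D2-3
        if found == rule.label_text:                         # D2-4: match
            record[rule.attribute] = text_at(root, rule.item_xpath)
            used.add(rule.attribute)
            continue
        other = by_label.get(found) if found else None       # D2-5
        if other is not None and other.attribute not in used:    # D2-6
            value = text_at(root, shift_item_xpath(other, rule.label_xpath))
            if value:
                record[other.attribute] = value
                used.add(other.attribute)
        moved_to = find_label(root, rule.label_text)         # D2-7
        record[rule.attribute] = (
            text_at(root, shift_item_xpath(rule, moved_to)) if moved_to else None
        )
        used.add(rule.attribute)
    return record
```

A missing data item ends up as `None` in the returned record, while a data item whose page tag has moved into another rule's position is still attributed to the right rule.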
Beneficial effects of the present invention:
1. For each data source, the invention uses a visual user-customization method to design a parameterized, configurable wrapper with visual, user-friendly interaction, and automatically extracts data from the large number of collected Web pages according to the wrapper.
2. Because the content and structure of Web pages change frequently, previously generated extraction rules can become invalid. The invention addresses how to improve the adaptability of Web data extraction effectively, so that the corresponding extraction rules can be adjusted and updated automatically according to changes in the target pages.
3. The data extraction method of the invention is widely applicable and highly accurate, adapts to changes in web pages, and can greatly improve extraction efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the Web data extraction method based on visual customization of an extraction template;
Fig. 2 is a flowchart of template page preprocessing;
Fig. 3 is a flowchart of visual customization of the page extraction template;
Fig. 4 is the overall flowchart of page extraction;
Fig. 5 is a detailed flowchart of the extraction process;
Fig. 6 is a schematic diagram of a detail page of a website used as the page template;
Fig. 7 is a schematic diagram of the extraction process applied to a web page of the website.
Embodiment
The invention is further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a Web data extraction method based on visual customization of an extraction template comprises the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the batch page extraction frequency;
D. batch page extraction.
Step A, template page preprocessing, is the conversion and display of the template page source code: a program in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts the page into XML format, and shows it in the user interface on the display.
Step B, visual customization of the extraction template, means providing a drag-to-select function in the user interface, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in the domain model, thereby establishing the extraction template.
Step C, setting the batch page extraction frequency, specifies that batch extraction of the crawled HTML pages is performed once every 8 hours.
Step D, batch page extraction, means using the corresponding extraction template to extract in batches from the large number of crawled HTML pages, converting the semi-structured data in them into structured data, and saving it to a local database.
As shown in Fig. 2, the conversion and display of the template page source code in step A specifically comprise the following steps:
A1. analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. analyze the complete DOM structure of the page and show it in the user interface;
A3. add the necessary JS control code to the converted page, while preserving the original page structure, so that the page can be annotated;
A4. show the XML-format page produced by the above steps in the user interface, so that the user can use it for visual customization of the template.
As shown in Fig. 3, the visual customization of the extraction template in step B specifically comprises the following steps:
B1. after opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program analyzes and records the XPATH path of the selected data item;
B2. if the data item has a corresponding page tag in the page, the user also selects this page tag by dragging; the program records the XPATH path and the text content of the page tag, which together with the XPATH of the selected data item form one extraction rule; if the data item has no corresponding page tag, nothing further is selected;
B3. according to the domain model, the user selects an attribute tag for the extraction rule formed in steps B1 and B2; the attribute tag belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; the attribute tag indicates the semantics of that data item, and in essence this completes the mapping from the page data item to a column in the data table;
B4. repeat steps B1 to B3 until all data to be extracted has been marked, then save the resulting set of extraction rules as a page extraction template.
As shown in Fig. 4, batch page extraction in step D specifically comprises the following steps:
D1. convert the current page to be extracted into a standard XML file;
D2. use the extraction rules recorded in the extraction template, which are in essence XPATH paths, to extract the required data items;
D3. according to the data tag corresponding to each extraction rule, save the extracted data items in the corresponding columns of the database table.
As shown in Fig. 5, step D2 specifically comprises the following steps:
D2-1. select an extraction rule that has not yet been used;
D2-2. if the extraction rule records no page tag information, read the text content directly at the XPATH path of the data item, mark the rule as used, and go to step D2-8; if the rule records page tag information, go to step D2-3;
D2-3. extract the text at the XPATH path of the page tag; if the extraction succeeds, go to step D2-4; if it fails, the data item corresponding to this page tag is missing or displaced in the current page, so go to step D2-7;
D2-4. compare the extracted text with the page tag text recorded in the extraction rule; if they match, extract the data at the data item XPATH recorded in the rule, mark the rule as used, and go to step D2-8; if they do not match, the data item corresponding to this page tag is missing or displaced in the current page, so go to step D2-5;
D2-5. check whether the extracted text matches the page tag of any of the original extraction rules; if such a rule exists, treat the text as a page tag and go to step D2-6; otherwise go to step D2-7;
D2-6. from the XPATHs of the page tag and data item recorded in the matching extraction rule, compute the XPATH of the data item that corresponds to this text as a page tag, and extract the data; if the extracted data is non-empty, mark the matching rule as used; go to step D2-7;
D2-7. perform an expanded search in the page, starting from the XPATH path of the original page tag, to find this page tag; if it is not found, the data item corresponding to this tag is missing from the current page; if it is found, compute the XPATH of the corresponding data item from the XPATHs of the page tag and data item recorded in the extraction rule, and extract the data; finally mark the original rule as used and go to step D2-8;
D2-8. repeat the above steps until all extraction rules have been used.
Step D2-3 handles the case in which the attribute order of the semi-structured data in a Web page is not fixed or attribute names are misspelled, and ensures through an expanded search that no data is lost.
In another embodiment of the invention, a certain website is used as the data source. A detail page of the website serves as the page template for customizing the extraction template; a screenshot of the main data region of the page is shown in Fig. 6.
Suppose the data that the user manually marks for extraction are the parts enclosed by rectangles in the figure.
The following 9 extraction rules are obtained:
1. data label: position title;
Page-tag: empty;
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[3]/TD[2];
2. data label: recruitment company;
Page-tag: empty;
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[1]/TBODY[1]/TR[2]/TD[1]/TABLE[1]/TBODY[1]/TR[1]/TD[1]/STRONG[1]
3. data label: date issued;
Page-tag: date issued;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[1]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[2]
4. data label: work place;
Page-tag: work place;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[3]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[4]
5. data label: the number of recruits;
Page-tag: the number of recruits;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[5]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[1]/TD[6]
6. data label: working experience;
Page-tag: length of service;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[1]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[2]
7. data label: language requirement;
Page-tag: language requirement;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[3]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[4]
8. data label: educational background;
Page-tag: educational requirement;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
9. data label: level of salary;
Page-tag: salary scope;
Page-tag XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[5]
Data item XPATH:/HTML/BODY[1]/DIV[2]/DIV[1]/DIV[2]/TABLE[3]/TBODY[1]/TR[2]/TD[6]
Using the extraction template formed by these 9 extraction rules, batch extraction can be performed on similar web pages from this website.
Suppose a web page from the same website (Fig. 7) is to be extracted:
This page lacks 2 of the data items to be extracted: language requirement and salary level. Analysis of the page code shows that extraction rules 1 to 6 are still valid and can be applied directly. When extraction rule 7, "language requirement", is applied, the text at the page tag XPATH in the current page is "educational background", which does not match the "language requirement" recorded in the rule. However, the page tag "educational background" exists in extraction rule 8, so the data item that follows it, "junior college", is extracted. An expanded search for the page tag "language requirement" is then performed in the page; because this tag does not exist in the page, it is not found. In this way, although the structure of the extracted page differs from that of the page used to create the template, the data on the page can still be correctly identified and extracted.
Although specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that, on the basis of the technical scheme of the invention, modifications or variations that can be made without creative effort still fall within the scope of protection of the invention.
Claims (1)
1. A Web data extraction method based on visual customization of an extraction template, characterized in that it comprises the following steps:
A. template page preprocessing;
B. visual customization of the extraction template;
C. setting the batch page extraction frequency;
D. batch page extraction;
step A, template page preprocessing, is the conversion and display of the template page source code: a program in memory analyzes the HTML source code of the template page, parses its DOM tree structure, converts the page into XML format, and shows it in the user interface on the display; in step A, the conversion and display of the template page source code specifically comprise the following steps:
A1. analyze the HTML source code of the supplied template page and convert it into a page file that conforms to the XML standard;
A2. analyze the complete DOM structure of the page and show it in the user interface;
A3. add the necessary JS control code to the converted page, while preserving the original page structure, so that the page can be annotated;
A4. show the XML-format page produced by the above steps in the user interface, so that the user can use it for visual customization of the template;
step B, visual customization of the extraction template, means providing a drag-to-select function in the user interface, with which the user sets the correspondence between the attribute tags and data values on the template page and the attributes in the domain model, thereby establishing the extraction template; in step B, the visual customization of the extraction template specifically comprises the following steps:
B1. after opening the template page shown on the display, the user drags the mouse to select a data item to be extracted; the program analyzes and records the XPATH path of the selected data item;
B2. if the data item has a corresponding page tag in the page, the user also selects this page tag by dragging; the program records the XPATH path and the text content of the page tag, which together with the XPATH path of the selected data item form one extraction rule; if the data item has no corresponding page tag, nothing further is selected;
B3. according to the domain model, the user selects an attribute tag for the extraction rule formed in steps B1 and B2; the attribute tag belongs to the previously established domain model and matches the semantics of the data item corresponding to the extraction rule; the attribute tag indicates the semantics of that data item, and this in effect completes the mapping from the page data item to a column in the data table;
B4. repeat steps B1 to B3 until all data to be extracted has been marked, then save the resulting set of extraction rules as a page extraction template.
2. The Web data extraction method based on visual customization of an extraction template according to claim 1, characterized in that step C, setting the batch page extraction frequency, specifies that batch extraction of the crawled HTML pages is performed once every 8 hours.
3. The Web data extraction method based on visual customization of an extraction template according to claim 1, characterized in that step D, batch page extraction, means using the corresponding extraction template to extract in batches from the large number of crawled HTML pages, converting the semi-structured data in them into structured data, and saving it to a local database;
in step D, batch page extraction specifically comprises the following steps:
D1. convert the current page to be extracted into a standard XML file;
D2. use the XPATH paths recorded in the extraction template to extract the required data items;
D3. according to the data tag corresponding to each extraction rule, save the extracted data items in the corresponding columns of the database table;
step D2 specifically comprises the following steps:
D2-1. select an extraction rule that has not yet been used;
D2-2. if the extraction rule records no page tag information, read the text content directly at the XPATH path of the data item, mark the rule as used, and go to step D2-8; if the rule records page tag information, go to step D2-3;
D2-3. extract the text at the XPATH path of the page tag; if the extraction succeeds, go to step D2-4; if it fails, the data item corresponding to this page tag is missing or displaced in the current page, so go to step D2-7;
D2-4. compare the extracted text with the page tag text recorded in the extraction rule; if they match, extract the data at the data item XPATH recorded in the rule, mark the rule as used, and go to step D2-8; if they do not match, the data item corresponding to this page tag is missing or displaced in the current page, so go to step D2-5;
D2-5. check whether the extracted text matches the page tag of any of the original extraction rules; if such a rule exists, treat the text as a page tag and go to step D2-6; otherwise go to step D2-7;
D2-6. from the XPATHs of the page tag and data item recorded in the matching extraction rule, compute the XPATH of the data item that corresponds to this text as a page tag, and extract the data; if the extracted data is non-empty, mark the matching rule as used; go to step D2-7;
D2-7. perform an expanded search in the page, starting from the XPATH path of the original page tag, to find this page tag; if it is not found, the data item corresponding to this page tag is missing from the current page; if it is found, compute the XPATH of the corresponding data item from the XPATHs of the page tag and data item recorded in the extraction rule, and extract the data; finally mark the original rule as used and go to step D2-8;
D2-8. repeat the above steps until all extraction rules have been used.
4. The Web data extraction method based on visual customization of an extraction template according to claim 3, characterized in that step D2-3 handles the case in which the attribute order of the semi-structured data in a Web page is not fixed or attribute names are misspelled, and ensures through an expanded search that no data is lost.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110301775.9A CN102360368B (en) | 2011-10-09 | 2011-10-09 | Web data extraction method based on visual customization of extraction template |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110301775.9A CN102360368B (en) | 2011-10-09 | 2011-10-09 | Web data extraction method based on visual customization of extraction template |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102360368A (en) | 2012-02-22 |
CN102360368B (en) | 2014-07-02 |
Family
ID=45585697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110301775.9A Active CN102360368B (en) | 2011-10-09 | 2011-10-09 | Web data extraction method based on visual customization of extraction template |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102360368B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8990140B2 (en) * | 2012-06-08 | 2015-03-24 | Microsoft Technology Licensing, Llc | Transforming data into consumable content |
US9595298B2 (en) | 2012-07-18 | 2017-03-14 | Microsoft Technology Licensing, Llc | Transforming data to create layouts |
CN103020189B (en) * | 2012-12-03 | 2016-08-10 | 深圳中兴网信科技有限公司 | Data processing equipment and data processing method |
CN103116448A (en) * | 2013-01-30 | 2013-05-22 | 浪潮电子信息产业股份有限公司 | Extract method for visualizing information |
CN104182412B (en) * | 2013-05-24 | 2017-08-04 | *** Communications Group Anhui Co., Ltd. | A kind of web page crawl method and system |
CN105447184B (en) * | 2015-12-15 | 2019-06-11 | 北京百分点信息科技有限公司 | Information extraction method and device |
CN106021485B (en) * | 2016-05-19 | 2019-05-14 | 中国传媒大学 | A kind of polynary attribute cinematic data visualization system |
CN107437158B (en) * | 2016-05-26 | 2021-08-10 | 北京京东尚科信息技术有限公司 | Data query method, device and computer readable storage medium |
CN106202348A (en) * | 2016-07-04 | 2016-12-07 | 中山大学 | A kind of web page form information extraction method |
CN108121743A (en) * | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | A kind of generation of generic web pages masterplate and application method, system |
CN106648677B (en) * | 2016-12-28 | 2019-08-02 | 中国科学院南京地理与湖泊研究所 | A kind of water environment domain model integrates the visible customization method of template |
US10380228B2 (en) | 2017-02-10 | 2019-08-13 | Microsoft Technology Licensing, Llc | Output generation based on semantic expressions |
CN106980921B (en) * | 2017-03-02 | 2021-01-26 | 上海歌略软件科技有限公司 | User-defined risk analysis method |
CN107609144A (en) * | 2017-09-21 | 2018-01-19 | 浪潮软件股份有限公司 | A kind of analysis result processing method, apparatus and system |
CN107608949B (en) * | 2017-10-16 | 2019-04-16 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model |
CN108334634A (en) * | 2018-02-27 | 2018-07-27 | 北京中关村科金技术有限公司 | A kind of method, apparatus, equipment and the storage medium of extraction data information |
CN110309364B (en) * | 2018-03-02 | 2023-03-28 | 腾讯科技(深圳)有限公司 | Information extraction method and device |
CN108984683B (en) * | 2018-06-29 | 2021-06-25 | 北京百度网讯科技有限公司 | Method, system, equipment and storage medium for extracting structured data |
TWI682287B (en) * | 2018-10-25 | 2020-01-11 | 財團法人資訊工業策進會 | Knowledge graph generating apparatus, method, and computer program product thereof |
CN109753596B (en) * | 2018-12-29 | 2021-05-25 | 中国科学院计算技术研究所 | Information source management and configuration method and system for large-scale network data acquisition |
CN111782737B (en) * | 2020-08-12 | 2024-05-28 | 中国工商银行股份有限公司 | Information processing method, device, equipment and storage medium |
CN112199960B (en) * | 2020-11-12 | 2021-05-25 | 北京三维天地科技股份有限公司 | Standard knowledge element granularity analysis system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1588371A (en) * | 2004-09-08 | 2005-03-02 | 孟小峰 | Forming method for package device |
CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
Non-Patent Citations (4)
Title |
---|
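Adaptive Web information extraction based on DOM tree; 李朝 et al.; 《计算机科学》 (Computer Science); 2009-07-31; Vol. 36, No. 7; 203-203 * |
李朝 et al. Adaptive Web information extraction based on DOM tree. 《计算机科学》 (Computer Science). 2009, Vol. 36, No. 7, 203-203. |
Research on techniques and methods for structured information extraction from web pages; 郝爱峰; 《山西电子技术》 (Shanxi Electronic Technology); 2008-04-30; No. 4; Section 2 * |
郝爱峰. Research on techniques and methods for structured information extraction from web pages. 《山西电子技术》 (Shanxi Electronic Technology). 2008, No. 4, Section 2. |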
Also Published As
Publication number | Publication date |
---|---|
CN102360368A (en) | 2012-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102360368B (en) | Web data extraction method based on visual customization of extraction template | |
Mühleisen et al. | Web Data Commons-Extracting Structured Data from Two Large Web Corpora. | |
US20110087708A1 (en) | Business object based operational reporting and analysis | |
Kongdenfha et al. | Rapid development of spreadsheet-based web mashups | |
Baumgartner et al. | Web Data Extraction System. | |
CN103678509B (en) | Generate the method and device of web page template | |
CN102646039A (en) | Software interface generating system and method based on extensible markup language (XML) Schema | |
Mirza et al. | Practicability of dataspace systems | |
Vercoustre et al. | A descriptive language for information object reuse through virtual documents | |
CN108959580A (en) | A kind of optimization method and system of label data | |
Razali et al. | Mapping of annual extreme wind speed analysis from 12 stations in Peninsular Malaysia | |
KR20190131778A (en) | Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL | |
CN103914488A (en) | Document collection, identification, association, search and display system | |
CN108959356A (en) | A kind of intelligence adapted TV university Data application system Data Mart method for building up | |
CN102819616B (en) | Cloud online real-time multi-dimensional analysis system and method | |
CN102486792A (en) | Method and system for reorganizing and displaying universal forum page | |
Della Penna et al. | Visual extraction of information from web pages | |
Della Penna et al. | A spatial relation-based framework to perform visual information extraction | |
May et al. | Semantic technologies enhancing links and linked data for archaeological resources | |
Diaz et al. | User-driven automation of web form filling | |
CN112711404A (en) | Method for generating special topic webpage template once and automatically releasing special topic webpage | |
Zheng et al. | Design and implementation of news collecting and filtering system based on RSS | |
Németh et al. | Metadata Management and Future Plans to Generate Linked Open Data in the Hungarian Web Archiving Pilot Project. | |
Mukherjee et al. | AHA: Asset harvester assistant | |
Alexander | Value based design of former southeastern plantations using Millway Plantation as a research site | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | C14 | Grant of patent or utility model | |
 | GR01 | Patent grant | |