Detailed Description of the Embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Embodiment 1
This embodiment provides a method and a device for generating a web page template.
Fig. 1 shows a flowchart of a method for generating a web page template according to an embodiment of the present invention. With reference to Fig. 1, the method comprises:
Step 102: build a visual-effect framework for marking a webpage;
In one implementation, the visual-effect framework may comprise: content areas, a mask positioned over the selected content area, and a marking menu, the marking menu comprising menu items for multiple content types.
The visual-effect framework of the webpage can be built by obtaining the source code of the webpage, for example an HTML (HyperText Markup Language) document, appending a style sheet file, for example a CSS (Cascading Style Sheets) file, to the HTML document, and adding a JS (JavaScript) script to the HTML document. Specifically, the JS script makes the mask and the marking menu appear over a content area when that content area is detected as selected, and the rules defined in the style sheet file can determine how the mask and the marking menu are displayed.
With this visual-effect framework, each content area of the webpage has a visual effect when the webpage is displayed in a browser. When a content area is selected (for example, the mouse is detected moving over the content area or, on a touch screen, a tap on the content area or a swipe gesture over it is detected), a mask appears over the content area. The marking menu may appear over the content area at the same time as the mask, or in response to a trigger; for example, right-clicking the selected content area brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as text", and "Mark as date", and may further include "Save mark" and "End marking".
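As a non-limiting illustration, the appending of a style sheet and a script to the obtained HTML document can be sketched as follows. The function name inject_annotation_assets and the file names mark.css and mark.js are hypothetical and do not come from the embodiment; the sketch only shows one possible way to attach the two resources and does not implement the mask or the menu themselves.

```python
def inject_annotation_assets(html: str, css_href: str, js_src: str) -> str:
    """Return a copy of the HTML source with a stylesheet link and a script tag added.

    Hypothetical helper: the stylesheet is placed before </head> and the script
    before </body>; if either closing tag is missing, the resource is simply
    prepended or appended to the document.
    """
    link = f'<link rel="stylesheet" href="{css_href}">'
    script = f'<script src="{js_src}"></script>'

    # Insert the stylesheet link just before the closing head tag.
    head_end = html.lower().find("</head>")
    if head_end != -1:
        html = html[:head_end] + link + html[head_end:]
    else:
        html = link + html

    # Insert the script just before the last closing body tag.
    body_end = html.lower().rfind("</body>")
    if body_end != -1:
        html = html[:body_end] + script + html[body_end:]
    else:
        html = html + script
    return html


page = "<html><head><title>t</title></head><body><p>x</p></body></html>"
annotated = inject_annotation_assets(page, "mark.css", "mark.js")
```

In such a sketch, the rules controlling how the mask and the marking menu are displayed would live in the style sheet, and the selection handling would live in the script, as described above.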
Step 104: obtain indications that mark each content area of the webpage;
In the embodiments of the present invention, marking is performed at the client, which may be operated by a user, an operator, or an administrator. The webpage can be marked with a mouse: move the mouse over a content area, right-click, and then click a content type menu item to complete the marking of that content area. On a touch screen, the content type can instead be selected by a touch operation on the menu item. As shown in Fig. 2, clicking "Mark as title" marks the corresponding content area as a title; as shown in Fig. 3, clicking "Mark as text" marks the corresponding content area as text.
Step 106: record the correspondence between content areas and marking indications to obtain a web page template.
Each time a content area is marked and the "Save mark" menu item is selected, the correspondence between that content area and the selected content type is stored in the web page template. Once every content area that needs marking has been marked, selecting the "End marking" menu item completes the marking and yields the web page template corresponding to this webpage (also called a web page content template).
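The recording behaviour of step 106 can be illustrated with a minimal sketch. The class TemplateRecorder, its method names, and the region identifiers are hypothetical; the sketch assumes only what is stated above, namely that a mark is stored when "Save mark" is chosen and that "End marking" yields the template.

```python
class TemplateRecorder:
    """Hypothetical recorder for content-area / content-type correspondences."""

    def __init__(self):
        self._pending = None   # the most recent mark, not yet saved
        self._saved = {}       # region identifier -> content type
        self.finished = False

    def mark(self, region_id, content_type):
        # Corresponds to choosing a content type menu item for a selected area.
        self._pending = (region_id, content_type)

    def save_mark(self):
        # Corresponds to the "Save mark" menu item.
        if self._pending is None:
            raise ValueError("no mark to save")
        region_id, content_type = self._pending
        self._saved[region_id] = content_type
        self._pending = None

    def end_marking(self):
        # Corresponds to the "End marking" menu item; returns the template.
        self.finished = True
        return dict(self._saved)


recorder = TemplateRecorder()
recorder.mark("region-1", "title")
recorder.save_mark()
recorder.mark("region-2", "text")
recorder.save_mark()
recorder.mark("region-3", "date")   # never saved, so it is not recorded
template = recorder.end_marking()
```

Whether an unsaved mark should be discarded or saved automatically at the end is a design choice the embodiment leaves open; the sketch discards it.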
As can be seen, according to the technical solution of this embodiment, a web page template can be defined simply by selecting content areas of the webpage in the visual-effect framework and operating on them visually, which improves the efficiency of generating web page templates. Moreover, because the webpage content is presented intuitively, the content types in the page structure are easy to determine, which improves the accuracy of the generated web page template.
The above scheme generates, from one webpage, the web page template corresponding to that webpage. A resource website may contain many webpages, which are generally generated from the same page design template, so their structures are largely the same and differ only slightly. For example, some webpages may contain comments and others may not, but all of them contain a title, an author, a publication time, and text. Performing the above steps on every webpage to generate a web page template would still involve a heavy workload.
Therefore, to further improve the efficiency of generating web page templates, the method may further comprise: collecting statistics on a plurality of web page templates generated from a plurality of webpages of the same resource website, and extracting the portions common to those web page templates to generate a final web page template. Specifically, all webpages of the resource website can be sampled to obtain a plurality of webpages; a plurality of web page templates is then generated according to the above method; finally, the portions common to the web page templates are extracted to generate the final web page template (also called the web page template of the resource website). Here, each correspondence between a content area and a content type counts as one portion of a web page template.
For example, for the 360 website, the HTML document of the homepage can first be obtained from the homepage URL of the website (http://www.360.cn/). Analysis of this HTML document then finds that the website contains a number of sub-pages (for example, 1000), from which 50 sub-pages are extracted according to a predetermined algorithm (for example, a random algorithm). Visually marking these 50 sub-pages produces 50 web page templates. Finally, the portions common to these 50 web page templates are extracted to generate the web page template corresponding to the 360 website.
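The sampling and the extraction of common portions can be sketched as follows. This is an illustration under stated assumptions: each web page template is modelled as a dictionary from a content-area key to a content type, random sampling stands in for the "predetermined algorithm", and the names sample_pages and common_template, as well as the example URLs and keys, are hypothetical.

```python
import random


def sample_pages(urls, k, seed=None):
    """Randomly pick up to k sub-page URLs (random sampling is one possible algorithm)."""
    rng = random.Random(seed)
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


def common_template(templates):
    """Keep only the (content area, content type) pairs present in every template."""
    if not templates:
        return {}
    common = set(templates[0].items())
    for template in templates[1:]:
        common &= set(template.items())
    return dict(common)


sub_pages = [f"http://www.example.com/page{i}.html" for i in range(1000)]
chosen = sample_pages(sub_pages, 50, seed=0)

templates = [
    {"area-a": "title", "area-b": "author", "area-c": "text"},
    {"area-a": "title", "area-b": "author", "area-c": "text", "area-d": "comment"},
    {"area-a": "title", "area-b": "author", "area-c": "text"},
]
final_template = common_template(templates)
```

A strict intersection drops the comment area, which appears in only one sampled page. A statistical variant that keeps portions found in most, rather than all, templates would also be consistent with the description, but is not shown here.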
In addition, in the embodiments of the present invention, to make it easier to locate and present content areas in the webpage, a hash value attribute can be added to the tag to which each content area belongs; correspondingly, what the web page template stores is the correspondence between the hash value of the tag to which a content area belongs and the selected content type. In this case, before the step of building the visual-effect framework for marking the webpage, the method for generating a web page template may further comprise the following steps:
First, obtain the source code of the webpage and generate a DOM (Document Object Model) tree of the webpage from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the webpage.
The hash value may comprise a hierarchy hash value of the tag in the DOM tree and a hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical position of the current tag in the DOM tree, and the hash value of the tag itself can be calculated from the attribute nodes of the current tag.
In a specific implementation, the hash values of the tags can be calculated by a server. As shown in Fig. 10, the server 210 is located in a search engine 200, the search engine 200 is communicatively connected to a plurality of third-party website servers 300 (three are shown in the figure), and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may comprise:
First, the client 100 adds an index attribute to each tag of the webpage;
Then, the client 100 sends the source code of the webpage, with the index attributes added, to the server 210;
Next, the server 210 calculates the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
When the present invention is implemented, the operation of the client may comprise the following steps:
First, a plug-in that generates the visual-effect framework is installed on the client, and a webpage on a third-party website server 300 is accessed;
Then, in one implementation, when the mouse moves over a content area of the webpage, a light-blue mask appears over the content area to indicate that it is selected; right-clicking brings up a selection menu, from which the content type of the content area, such as title or text, can be chosen;
Finally, after marking is complete, the client generates the web page template.
The client can send the generated web page template to the server, and the server can use this web page template to collect information when performing targeted crawling of webpage content.
A detailed flow of the method for generating a web page template according to an embodiment of the present invention is given below. With reference to Fig. 4, the method comprises:
Step 402: the client obtains the source code of the webpage and generates a DOM tree of the webpage from the source code;
Step 404: the client adds an index attribute to each tag of the DOM tree, where the DOM tree can be traversed using a depth-first algorithm;
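The depth-first indexing of step 404 can be sketched as below. The sketch assumes well-formed markup so that Python's standard xml.etree.ElementTree parser can stand in for a browser DOM; real-world HTML would generally need a more tolerant parser. The function name add_index_attributes and the choice of pre-order numbering from 0 are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def add_index_attributes(root):
    """Number every element in depth-first (pre-order) order via an 'index' attribute.

    Returns the number of elements that were indexed.
    """
    stack = [root]
    index = 0
    while stack:
        node = stack.pop()
        node.set("index", str(index))
        index += 1
        # Push children in reverse so the first child is visited next.
        stack.extend(reversed(list(node)))
    return index


source = (
    "<html><head><title>t</title></head>"
    "<body><div><p>a</p></div><span>b</span></body></html>"
)
root = ET.fromstring(source)
count = add_index_attributes(root)
indexed = [(element.tag, element.get("index")) for element in root.iter()]
```

An explicit stack is used instead of recursion so that deeply nested pages do not run into the interpreter's recursion limit.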
Step 406: the client sends the source code of the webpage, with the index attributes added, to the server. The content sent is, for example:
Step 408: the server receives the source code with the index attributes added, analyzes the source code, calculates the hash values of the tags across the entire page structure, and returns all the calculated hash values to the client;
The hash values calculated by the server correspond to the tag indexes and can be packaged in JSON format and returned to the client. The JSON content has, for example, the format: {tag index value: {hash 1: hash1, hash 2: hash2} ...}.
Step 410: the client receives the JSON data returned by the server and, using the correspondence between tag index values and hash values, adds two attribute values to each corresponding tag: the hierarchy hash value frame_hash of the tag in the DOM tree, and the hash value self_hash of the tag itself;
For example, a div tag to which the hash value attributes have been added looks as follows:
<div frame_hash="46131321231613" self_hash="174461815164" index="45">
content
</div>
Here, the frame_hash value is calculated from the hierarchical position of the current tag in the DOM tree. For example:
To calculate the frame_hash of a div tag, MD5 can be applied to the string "html body div" to compute a hash value. Various algorithms can be used, and the embodiments of the present invention do not limit the specific algorithm.
The self_hash value is calculated from the attribute nodes of the current tag. For example, if a div tag has a class attribute and an id attribute, MD5 can be applied to the string "class:name id:author" to compute a hash value. Again, various algorithms can be used, and the embodiments of the present invention do not limit the specific algorithm.
In this way, a node of the DOM tree can be located from its frame_hash and self_hash.
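One possible server-side calculation of the two hash values, together with the JSON packaging described for step 408, is sketched below. Several details are assumptions rather than part of the embodiment: the page is parsed with xml.etree.ElementTree (well-formed markup only), attributes are sorted by name before being joined as "name:value" pairs, the index attribute is excluded from self_hash, and MD5 digests are rendered as hexadecimal strings, whereas the example values above are shown as decimal digits. The function names are hypothetical.

```python
import hashlib
import json
import xml.etree.ElementTree as ET


def md5_hex(text):
    """MD5 digest of a string, as hexadecimal text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def compute_hashes(root):
    """Map each tag's index value to its frame_hash and self_hash.

    Assumes every element already carries an 'index' attribute.
    """
    result = {}

    def visit(node, ancestors):
        path = ancestors + [node.tag]
        # "name:value" pairs, sorted by name; 'index' is bookkeeping, not content.
        attrs = " ".join(
            f"{name}:{value}"
            for name, value in sorted(node.attrib.items())
            if name != "index"
        )
        result[node.get("index")] = {
            "frame_hash": md5_hex(" ".join(path)),   # e.g. "html body div"
            "self_hash": md5_hex(attrs),             # e.g. "class:name id:author"
        }
        for child in node:
            visit(child, path)

    visit(root, [])
    return result


page = ET.fromstring(
    '<html index="0"><body index="1">'
    '<div index="2" class="name" id="author">content</div>'
    "</body></html>"
)
hashes = compute_hashes(page)
payload = json.dumps(hashes)          # what the server might return
received = json.loads(payload)        # what the client would parse
```

Note that sibling tags with the same name and the same attributes would receive identical hash pairs under this scheme; how such collisions are handled is not specified above and is not addressed in the sketch.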
Step 412: based on the hash value attributes, the client adds visual effects to the page elements of the webpage. In one implementation, when the mouse moves over an element, a light-blue mask appears over the element to indicate that its content area is selected; right-clicking the selected content area brings up menu items such as "Mark as title" and "Mark as text".
Step 414: as each content area of the webpage is marked, the client records the correspondence between the hash values of the tag to which the content area belongs and the marked content type, and generates the web page template. The content of the web page template is, for example:
frame_hash:243092489 self_hash:49348393 title
frame_hash:434389298 self_hash:23439438 author
frame_hash:023473843 self_hash:34934932 text
frame_hash:483928384 self_hash:23487388 date
Step 416: the client sends the generated web page template to the server. The server saves the web page template and, when performing targeted crawling of this website, uses the template to extract the title, text, and other content of the webpages.
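To illustrate how a saved template of this form could be used for extraction, the following sketch parses template lines and pulls the marked content out of a page. It assumes that the page being processed already carries frame_hash and self_hash attributes on its tags and that the markup is well formed; the function names parse_template and extract_content and the sample page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One template entry per line: two hash values followed by a content type.
LINE_RE = re.compile(
    r"frame_hash:([0-9a-f]+)\s*self_hash:([0-9a-f]+)\s+(.+)", re.IGNORECASE
)


def parse_template(text):
    """Turn template lines into {(frame_hash, self_hash): content_type}."""
    template = {}
    for line in text.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            template[(match.group(1), match.group(2))] = match.group(3)
    return template


def extract_content(root, template):
    """Collect the text of every element whose hash pair appears in the template."""
    extracted = {}
    for node in root.iter():
        key = (node.get("frame_hash"), node.get("self_hash"))
        if key in template:
            extracted[template[key]] = "".join(node.itertext()).strip()
    return extracted


template = parse_template(
    "frame_hash:243092489 self_hash:49348393 title\n"
    "frame_hash:434389298 self_hash:23439438 author\n"
)
page = ET.fromstring(
    "<html><body>"
    '<h1 frame_hash="243092489" self_hash="49348393">Hello</h1>'
    '<p frame_hash="1" self_hash="2">not marked</p>'
    '<span frame_hash="434389298" self_hash="23439438">Ann</span>'
    "</body></html>"
)
content = extract_content(page, template)
```

If several elements share a hash pair, this sketch keeps only the last one encountered; a production extractor would need a policy for that case.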
An embodiment of the present invention also provides a device for generating a web page template. With reference to Fig. 5, the device comprises a visual-effect framework builder 10, a marking indication acquirer 20, and a web page template generator 30, wherein:
The visual-effect framework builder 10 is adapted to build a visual-effect framework for marking a webpage. In one implementation, the visual-effect framework comprises: content areas, a mask positioned over the selected content area, and a marking menu, the marking menu comprising menu items for multiple content types. The visual-effect framework builder 10 can build the visual-effect framework of the webpage by obtaining the source code of the webpage, for example an HTML document, appending a style sheet file, for example a CSS file, to the HTML document, and adding a JS script to the HTML document.
The marking indication acquirer 20 is adapted to obtain indications that mark each content area of the webpage. The webpage can be marked with a mouse or a touch screen; for example, the mouse is moved over a content area and right-clicked, and a content type menu item is then clicked to complete the marking of that content area. The marking indication acquirer 20 can detect the marking operation and obtain the content type selected through the right-click menu.
The web page template generator 30 is adapted to record the correspondence between content areas and marking indications to obtain a web page template. After the marking indication acquirer 20 obtains the selected content type, the web page template generator 30 can record the correspondence between the content area and the selected content type, thereby generating the web page template.
Optionally, the device further comprises a statistics unit (not shown), adapted to collect statistics on a plurality of web page templates generated from a plurality of webpages of the same resource website and to extract the portions common to those web page templates to generate a final web page template.
In the embodiments of the present invention, to make it easier to locate and present content areas in the webpage, a hash value attribute can be added to the tag to which each content area belongs. Therefore, the device for generating a web page template may further comprise a DOM tree generator, a hash value acquirer, and a hash value attribute adder. The DOM tree generator obtains the source code of the webpage and generates a DOM tree of the webpage from the source code; the hash value acquirer obtains the hash value of the tag corresponding to each node in the DOM tree; the hash value attribute adder adds a hash value attribute to each tag of the webpage. The hash value may comprise a hierarchy hash value of the tag in the DOM tree and a hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical position of the current tag in the DOM tree, and the hash value of the tag itself can be calculated from the attribute nodes of the current tag. Correspondingly, the web page template generator 30 obtains the web page template by recording the correspondence between the hash value of the tag to which a content area belongs and the selected content type.
In a specific implementation, the hash values of the tags can be calculated by a server. In this case, the hash value acquirer obtains the hash values of the tags as follows: it adds an index attribute to each tag of the webpage; sends the source code of the webpage, with the index attributes added, to the server so that the server can calculate the hash values of the tags; and receives the correspondence between tag index values and hash values sent by the server.
It should be noted that the steps of the method in Embodiment 1 can be split or selected as required, and so can the modules of the device in Embodiment 1. For example, steps 102 and 104 form a method for providing visual marking for a webpage, and the visual-effect framework builder 10 and the marking indication acquirer 20 form a device for providing visual marking for a webpage.
Embodiment 2
This embodiment provides a method and a device for providing visual marking for a webpage.
Fig. 6 shows a flowchart of a method for providing visual marking for a webpage according to an embodiment of the present invention. With reference to Fig. 6, the method comprises:
Step 602: build a visual-effect framework that marks the webpage by means of a mask positioned over content areas of the webpage;
The visual-effect framework may comprise: content areas, a mask positioned over the selected content area, and a marking menu, the marking menu comprising menu items for multiple content types.
The visual-effect framework of the webpage can be built by obtaining the source code of the webpage, for example an HTML document, appending a style sheet file, for example a CSS file, to the HTML document, and adding a JS (JavaScript) script to the HTML document. Specifically, the JS script makes the mask and the marking menu appear over a content area when that content area is detected as selected, and the rules in the style sheet file can determine how the mask and the marking menu are displayed.
With this visual-effect framework, each content area of the webpage has a visual effect when the webpage is displayed in a browser. When a content area is selected (for example, the mouse is detected moving over the content area or, on a touch screen, a tap on the content area or a swipe gesture over it is detected), a mask appears over the content area. The marking menu may appear over the content area at the same time as the mask, or in response to a trigger; for example, right-clicking the selected content area brings up the content type menu items. As shown in Fig. 2 and Fig. 3, the content type menu items may include "Mark as title", "Mark as text", and "Mark as date", and may further include "Save mark" and "End marking".
Step 604: obtain indications, made on the mask, that mark each content area of the webpage.
The indication may be the content type, selected through the marking menu, that corresponds to the selected content area. In the embodiments of the present invention, marking is performed at the client, which may be operated by a user, an operator, or an administrator. The webpage can be marked with a mouse: move the mouse over a content area, right-click, and then click a content type menu item to complete the marking of that content area. On a touch screen, the content type can instead be selected by a touch operation on the menu item. As shown in Fig. 2, clicking "Mark as title" marks the corresponding content area as a title; as shown in Fig. 3, clicking "Mark as text" marks the corresponding content area as text.
As can be seen, according to the technical solution of this embodiment, building the visual-effect framework allows the webpage to be marked visually, which improves the efficiency of marking. Moreover, because the webpage content is presented intuitively, the content types in the page structure are easy to determine, which improves the accuracy of marking.
In addition, in the embodiments of the present invention, to make it easier to locate and present content areas in the webpage, a hash value attribute can be added to the tag to which each content area belongs. In this case, before the step of building the visual-effect framework for marking the webpage, the method for providing visual marking for a webpage may further comprise the following steps:
First, obtain the source code of the webpage and generate a DOM tree of the webpage from the source code;
Then, obtain the hash value of the tag corresponding to each node in the DOM tree;
Finally, add a hash value attribute to each tag of the webpage.
The hash value may comprise a hierarchy hash value of the tag in the DOM tree and a hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical position of the current tag in the DOM tree, and the hash value of the tag itself can be calculated from the attribute nodes of the current tag.
In a specific implementation, the hash values of the tags can be calculated by a server. As shown in Fig. 10, the server 210 is located in a search engine 200, the search engine 200 is communicatively connected to a plurality of third-party website servers 300 (three are shown in the figure), and the server 210 can cooperate with the client 100 to generate web page templates. In this case, obtaining the hash value of the tag corresponding to each node in the DOM tree may comprise:
First, the client 100 adds an index attribute to each tag of the webpage;
Then, the client 100 sends the source code of the webpage, with the index attributes added, to the server 210;
Next, the server 210 calculates the hash values of the tags;
Finally, the server 210 sends the correspondence between tag index values and hash values to the client 100.
When the present invention is implemented, the marking operation of the client may comprise the following steps. First, a plug-in that generates the visual-effect framework is installed on the client, and a webpage on a third-party website server 300 is accessed. Then, in one implementation, when the mouse moves over a content area of the webpage, a light-blue mask appears over the content area to indicate that it is selected; right-clicking brings up a selection menu, from which the content type of the content area, such as title or text, can be chosen. These steps are repeated until the marking of the webpage is complete.
An embodiment of the present invention also provides a device for providing visual marking for a webpage. With reference to Fig. 7, the device comprises a visual-effect framework builder 10 and a marking indication acquirer 20, wherein:
The visual-effect framework builder 10 is adapted to build a visual-effect framework that marks the webpage by means of a mask positioned over content areas of the webpage. In one implementation, the visual-effect framework comprises: content areas, a mask positioned over the selected content area, and a marking menu, the marking menu comprising menu items for multiple content types. The visual-effect framework builder 10 can build the visual-effect framework of the webpage by obtaining the source code of the webpage, for example an HTML document, appending a style sheet file, for example a CSS file, to the HTML document, and adding a JS script to the HTML document.
The marking indication acquirer 20 is adapted to obtain indications, made on the mask, that mark each content area of the webpage, the indication being the content type, selected through the marking menu, that corresponds to the selected content area. The webpage can be marked with a mouse or a touch screen; for example, the mouse is moved over a content area and right-clicked, and a content type menu item is then clicked to complete the marking of that content area. The marking indication acquirer 20 can detect the marking operation and obtain the content type selected through the right-click menu.
In the embodiments of the present invention, to make it easier to locate and present content areas in the webpage, a hash value attribute can be added to the tag to which each content area belongs. Therefore, the device for providing visual marking for a webpage may further comprise a DOM tree generator, a hash value acquirer, and a hash value attribute adder. The DOM tree generator obtains the source code of the webpage and generates a DOM tree of the webpage from the source code; the hash value acquirer obtains the hash value of the tag corresponding to each node in the DOM tree; the hash value attribute adder adds a hash value attribute to each tag of the webpage. The hash value may comprise a hierarchy hash value of the tag in the DOM tree and a hash value of the tag itself. The hierarchy hash value can be calculated from the hierarchical position of the current tag in the DOM tree, and the hash value of the tag itself can be calculated from the attribute nodes of the current tag.
In a specific implementation, the hash values of the tags can be calculated by a server. In this case, the hash value acquirer obtains the hash values of the tags as follows: it adds an index attribute to each tag of the webpage; sends the source code of the webpage, with the index attributes added, to the server so that the server can calculate the hash values of the tags; and receives the correspondence between tag index values and hash values sent by the server.
Embodiment 3
This embodiment provides a method and a device for extracting webpage content according to a visual template.
Fig. 10 shows a structural diagram of a system for extracting webpage content according to a visual template, according to an embodiment of the present invention. With reference to Fig. 10, the system comprises a client 100, a search engine 200, and a plurality of third-party website servers 300 (three are shown in the figure). The search engine 200 comprises a server 210 and is communicatively connected to the third-party website servers 300. The server 210 can cooperate with the client 100 to generate web page templates, and the search engine 200 can extract webpage content according to a web page template, that is, extract the structured content of webpages on the third-party website servers 300 according to the web page template.
Fig. 8 shows a flowchart of a method for extracting webpage content according to a visual template, according to an embodiment of the present invention. With reference to Fig. 8, the method comprises:
Step 802: when performing targeted crawling of a target website, look up whether a web page template generated by visual marking and corresponding to the target website is recorded in a web page template library;
The web page template library stores web page templates generated by visual marking. Such a web page template may be one generated according to the scheme of Embodiment 1 or Embodiment 2. The library stores a plurality of web page templates, each of which can be identified by the homepage URL of a website, so whether a corresponding web page template exists in the library can be looked up by the homepage URL of the target website.
Step 804: if a web page template generated by visual marking and corresponding to the target website is recorded in the web page template library, extract content from the target website according to that web page template.
The URLs of all links in the homepage can be extracted according to the homepage URL, those that lead out to other websites removed, and the remaining URLs placed in a scheduling queue; content is then extracted, according to the web page template, from the webpage corresponding to each URL in the scheduling queue. The content extraction can be performed by a web crawler, which may be a web spider, a web robot, a search robot, or a web crawling script.
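The template lookup and the construction of the scheduling queue can be sketched with the Python standard library as follows. The sketch is an illustration only: the template library is modelled as a dictionary keyed by homepage URL, "leading to another website" is interpreted as having a different host name, duplicate links are dropped, and the names LinkCollector, build_schedule_queue, and find_template, as well as the example URLs, are hypothetical. No network access is performed; the homepage HTML is passed in as a string.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all anchor tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_schedule_queue(home_url, home_html):
    """Queue the homepage links that stay on the same host, without duplicates."""
    collector = LinkCollector()
    collector.feed(home_html)
    home_host = urlparse(home_url).netloc
    queue, seen = deque(), set()
    for href in collector.hrefs:
        url = urljoin(home_url, href)          # resolve relative links
        if urlparse(url).netloc != home_host:  # drop links to other websites
            continue
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue


def find_template(template_library, home_url):
    """Return the template recorded for this homepage URL, or None."""
    return template_library.get(home_url)


library = {"http://www.example.com/": {("243092489", "49348393"): "title"}}
homepage = (
    '<a href="/a.html">A</a>'
    '<a href="http://other.example/x">X</a>'
    '<a href="b.html">B</a>'
    '<a href="/a.html">A again</a>'
)
template = find_template(library, "http://www.example.com/")
queue = build_schedule_queue("http://www.example.com/", homepage)
```

Each URL taken from the queue would then be fetched and processed with the template found for the site; fetching and extraction are outside the scope of this sketch.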
In the technical solution of this embodiment, webpage content is extracted using a web page template generated by visual marking. Because such a web page template is more accurate, content extraction based on it is more accurate as well.
Fig. 9 shows the structure drawing of device that carries out according to an embodiment of the invention web page contents extraction according to visual template.With reference to Fig. 9, described device comprises web page template storehouse 902, finger 904 and content extraction device 906, wherein:
Web page template storehouse 902 is suitable for preserving the web page template generating according to visual mark, and described web page template can identify with the URL of webpage, also can identify with the homepage URL of website.
When being suitable for directed crawl targeted website, searches finger 904 web page template generating according to visual mark that whether records corresponding described targeted website in web page template storehouse.
When content extraction device 906 is suitable for recording the web page template generating according to visual mark of corresponding described targeted website in web page template storehouse, according to described web page template, content extraction is carried out in described targeted website.Content extraction device 906 can extract the URL of all external linkages in homepage according to homepage URL, remove the part of wherein jumping out to other website, and remaining URL is put into scheduling queue; Then, according to described web page template, webpage corresponding to URL in scheduling queue carried out respectively to content extraction.Content extraction device 906 can be that Web Spider, spiders, searching machine people or network capture shell script etc.
Optionally, the device further comprises a device for generating a web page template, which may comprise the visual effect framework builder 10, the mark instruction acquirer 20 and the web page template generator 30 of Embodiment 1; for the connections between these modules and their working principles, refer to the description in Embodiment 1.
In summary, according to the technical solution of the embodiments of the present invention, a visual effect framework for marking a web page is built, so that the text of the web page template does not need to be edited; web page content can be marked simply by selecting content areas in the visual effect framework and operating on them visually, which improves the efficiency of marking and, in turn, the efficiency of generating the web page template. Moreover, because the web page content is presented intuitively, the content types of the page structure can be determined easily without professional knowledge of web page design, which improves the accuracy of marking and, in turn, the accuracy of the generated web page template. Further, because the web page template is more accurate, the content obtained by crawling web pages according to this template is also more accurate.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description above of a specific language is given in order to disclose the best mode of the invention.
In the specification provided herein, numerous specific details are described. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for extracting web page content according to a visual template according to embodiments of the invention. The invention may also be implemented as an apparatus or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.