CN102314497B

CN102314497B - Method and equipment for identifying body contents of markup language files

Info

Publication number: CN102314497B
Application number: CN201110249348.0A
Authority: CN
Inventors: 李伟刚; 秦玄铮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-08-26
Filing date: 2011-08-26
Publication date: 2014-12-10
Anticipated expiration: 2031-08-26
Also published as: CN102314497A

Abstract

The invention aims to provide a method and equipment for identifying body contents of markup language files. The method comprises the following steps of: acquiring a plurality of markup language files to be processed by using template providing equipment; obtaining one or more groups of markup language files according to relevant information of the markup language files; comparing and analyzing contents of corresponding nodes in each DOM (Document Object Model) tree which corresponds to each markup language file in each group of at least one group of markup language files to obtain a body content node comprising the body contents of the group of markup language files; and obtaining a content marking template for identifying the body contents of the group of markup language files according to the obtained body content node. Compared with the prior art, the invention has the advantages that: body contents are obtained according to structural information of markup language files independent of specific contents in the markup language files, so that the body content identifying accuracy of different types of webpage is ensured.

Description

A method and equipment for identifying the body content of markup language files

Technical field

The present invention relates to Internet technology, and in particular to technology for identifying the body content of markup language files.

Background art

With the development and widespread adoption of mobile Internet technology, more and more users access web pages through mobile terminals such as smartphones. Because the screen of a mobile terminal is small, an HTML page designed for browsing on a computer must be filtered before it is shown on the mobile terminal, so that only the body content of the page is retained for the user to read. Here, body content means the content carried by a page that distinguishes it from other similar pages. For example, a news page contains a headline, the news text, links to other news, friendly links, advertisements and so on, but its body content is the headline and the news text. In the prior art, the body content of an HTML page is usually identified by matching keywords against the page content. The drawback of this approach is that it is not general: its regular expressions must be customized for each specific type of web page, otherwise the recognition accuracy drops.

Therefore, how to identify the body content of markup language files such as HTML with a general method has become a problem in urgent need of a solution.

Summary of the invention

The object of the present invention is to provide a method and equipment for identifying the body content of markup language files.

According to one aspect of the present invention, a computer-implemented method for identifying the body content of markup language files is provided, wherein the method comprises the following steps:

a. obtaining a plurality of markup language files to be processed;

b. obtaining one or more groups of markup language files according to relevant information of the plurality of markup language files;

c. comparing and analyzing the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content nodes that contain the body content of that group;

d. obtaining, according to the obtained body content nodes, a content identification template for identifying the body content of that group of markup language files.

According to another aspect of the present invention, equipment for identifying the body content of markup language files is also provided, wherein the equipment comprises:

a file acquisition device, for obtaining a plurality of markup language files to be processed;

a first acquisition device, for obtaining one or more groups of markup language files according to relevant information of the plurality of markup language files;

a comparative analysis device, for comparing and analyzing the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content nodes that contain the body content of that group;

a template acquisition device, for obtaining, according to the obtained body content nodes, a content identification template for identifying the body content of that group of markup language files.

As described above, compared with the prior art, the present invention provides a general method for obtaining a content identification template that identifies the body content of a given class of markup language files. The method does not depend on the specific content of the markup language files; it obtains the body content from their structural information, and the content identification template is then applied to extract the body content of files of that class, thereby ensuring the accuracy of body content identification for different types of web pages.

Brief description of the drawings

Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is a schematic diagram of equipment for identifying the body content of markup language files according to one aspect of the present invention;

Fig. 2 is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 3 is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 3A is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 3B is an example diagram for identifying the body content of markup language files according to the present invention;

Fig. 4 is a schematic diagram of equipment for identifying the body content of markup language files according to a preferred embodiment of the present invention;

Fig. 5 is a flow chart of a method for identifying the body content of markup language files according to another aspect of the present invention;

Fig. 6 is a flow chart of a method for identifying the body content of markup language files according to a preferred embodiment of the present invention.

In the drawings, the same or similar reference numerals denote the same or similar components.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of equipment for identifying the body content of markup language files according to one aspect of the present invention. The template providing equipment 1 comprises a file acquisition device 11, a first acquisition device 12, a comparative analysis device 13 and a template acquisition device 14. Here, the template providing equipment 1 includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. The cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.

As shown in Fig. 1, the file acquisition device 11 obtains a plurality of markup language files to be processed.

Specifically, the file acquisition device 11 obtains the markup language files corresponding to Internet web pages from the web page library of the template providing equipment 1 according to predetermined file acquisition rules, which include but are not limited to:

1) obtaining the markup language files corresponding to web pages whose historical click count exceeds a click threshold;

2) obtaining the markup language files corresponding to web pages whose cumulative number of visits through mobile terminals exceeds a predetermined number.

Here, the web page library stores the markup language files corresponding to web pages and the historical visit information of those pages; the web page library includes but is not limited to a relational database, memory storage, disk storage and the like.

Alternatively, the file acquisition device 11 reads the plurality of markup language files directly from third-party equipment through an agreed communication method, either periodically or when triggered by a predetermined condition or event.

Here, a markup language is a computer text encoding that combines text with other information related to the text, and expresses details of document structure and data processing. Markup language files include but are not limited to:

- HyperText Markup Language (HTML) files;

- Extensible HyperText Markup Language (XHTML) files;

- Extensible Markup Language (XML) files.

In one example, the file acquisition device 11 performs statistical analysis on the page-related information in the web page library of the template providing equipment 1 to obtain the number of times each web page has been visited by users through mobile terminals, and accordingly obtains the HTML files corresponding to the web pages whose visit count exceeds a predetermined visit count. This predetermined visit count varies with actual requirements and the specific application: for example, in an application with relatively few users it may be in the tens of thousands to hundreds of thousands, while in an application with many users it may be in the hundreds of thousands to millions. Those skilled in the art can determine it according to actual requirements and the specific application.
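The visit-count rule in this example can be illustrated with a minimal sketch. The `PageRecord` structure, the field names and the sample threshold are assumptions made for illustration only; the specification does not prescribe how the web page library is represented.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical entry of the web page library."""
    url: str
    html: str
    mobile_visits: int  # cumulative visits made through mobile terminals


def select_files(page_library, min_mobile_visits):
    """Return the HTML of pages whose mobile visit count exceeds the threshold."""
    return [p.html for p in page_library if p.mobile_visits > min_mobile_visits]


# Illustrative data; the threshold of 50,000 is only an example value.
library = [
    PageRecord("http://example.com/a", "<html>a</html>", 120_000),
    PageRecord("http://example.com/b", "<html>b</html>", 800),
]
selected = select_files(library, min_mobile_visits=50_000)
```

Only the first page is selected here, because only its visit count exceeds the chosen threshold.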

In another example, the file acquisition device 11 periodically sends a request for markup language files to third-party equipment by calling a configured application programming interface (API), and receives the plurality of markup language files that the third-party equipment returns in response to the request.

Those skilled in the art will understand that the above ways of obtaining a plurality of markup language files are only examples; other existing or future ways of obtaining a plurality of markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Subsequently, the first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of the plurality of markup language files obtained by the file acquisition device 11.

Specifically, for the plurality of markup language files obtained by the file acquisition device 11, the first acquisition device 12, for example, obtains the relevant information of those files and clusters them accordingly to obtain one or more groups of markup language files; or it obtains the relevant information of only some of the files and clusters those files to obtain one or more groups of markup language files. The relevant information of the plurality of markup language files includes but is not limited to:

1) Information about the Document Object Model (DOM) tree of a markup language file. Here, a DOM tree is the tree-structured data obtained by parsing a markup language file; each node in the tree corresponds to a tag and its content in the markup language file, and the data in the markup language file can be manipulated through the DOM tree. The relevant information of the plurality of markup language files includes but is not limited to:

A) Relevant information of the DOM trees corresponding to the plurality of markup language files. Specifically, when the relevant information of the markup language files includes the relevant information of their corresponding DOM trees, the first acquisition device 12 can cluster the markup language files according to that DOM tree information to obtain one or more groups of markup language files. The relevant information of a DOM tree includes but is not limited to:

i) The number of nodes in the DOM tree. Specifically, when the relevant information of the DOM tree includes its number of nodes, the first acquisition device 12 can cluster the markup language files according to this number, for example by placing files that have the same number of nodes, or whose number of nodes falls within the same predetermined range, into the same group;

ii) The topology information of the DOM tree. Specifically, when the relevant information of the DOM tree includes its topology information, which includes but is not limited to the distribution of the tree nodes in the DOM tree, the first acquisition device 12 clusters markup language files that have the same tree node distribution into the same group.
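Item i) above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it counts only element nodes (start tags) using Python's standard `html.parser`, and the `bucket_size` parameter standing in for the "predetermined range" is a hypothetical choice.

```python
from collections import defaultdict
from html.parser import HTMLParser


class NodeCounter(HTMLParser):
    """Counts element nodes by counting start tags (text nodes are ignored)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_nodes(html):
    counter = NodeCounter()
    counter.feed(html)
    counter.close()
    return counter.count


def cluster_by_node_count(files, bucket_size=10):
    """Group file names whose node counts fall into the same fixed-width range."""
    groups = defaultdict(list)
    for name, html in files.items():
        groups[count_nodes(html) // bucket_size].append(name)
    return dict(groups)


files = {
    "a": "<html><body><p>x</p></body></html>",
    "b": "<html><body><p>y</p></body></html>",
    "c": "<html><body>" + "<div></div>" * 20 + "</body></html>",
}
groups = cluster_by_node_count(files)
```

Files "a" and "b" have three element nodes each and land in the same range, while "c" has 22 and forms its own group.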

Those skilled in the art will understand that each of the above items of DOM tree information can be used by the first acquisition device 12 on its own to obtain one or more groups of markup language files, and that several of them can also be combined for this purpose.

Those skilled in the art will also understand that the above DOM tree information is only an example; other existing or future DOM tree information, if applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference.

B) Resource information in the plurality of markup language files. Specifically, when the relevant information of the markup language files includes the resource information in those files, this resource information includes but is not limited to:

i) Link information in the markup language files, including but not limited to the number of links in each file and the similarity of the link anchor texts;

ii) Picture information in the markup language files, including but not limited to the number of pictures in each file and the similarity of picture names and descriptions.

In this case, the first acquisition device 12 can cluster the plurality of markup language files according to the resource information to obtain one or more groups of markup language files.

Those skilled in the art will understand that each of the above items of relevant information of markup language files can be used by the first acquisition device 12 on its own to obtain one or more groups of markup language files, and that several of them can also be combined for this purpose.

Those skilled in the art will also understand that the above relevant information of markup language files is only an example; other existing or future relevant information of markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference.

In one example, the first acquisition device 12 parses a plurality of HTML files to generate their corresponding DOM trees, and then clusters the HTML files according to the topology information of each DOM tree, which includes but is not limited to the distribution of the tree nodes of the DOM tree.

Taking Fig. 2 and Fig. 3 as an example, the DOM trees of some of the HTML files clustered by the first acquisition device 12 have the topology shown in Fig. 2, and the DOM trees of the other HTML files have the topology shown in Fig. 3. The first acquisition device 12 thus obtains two groups of HTML files, G1 and G2, where the HTML files in G1 have the topology shown in Fig. 2 and the HTML files in G2 have the topology shown in Fig. 3. Preferably, the DOM tree topologies of the HTML files clustered into one group need not be fully identical; it is sufficient that the distribution of the trunk nodes of their DOM trees is the same. For example, the DOM tree T1 corresponding to HTML file F1 is shown in Fig. 3A and the DOM tree T2 corresponding to HTML file F2 is shown in Fig. 3B. As the figures show, T1 and T2 both have the trunk topology shown in Fig. 3, so F1 and F2 are both clustered into group G2.
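The trunk-based grouping described above can be sketched as follows. This is a minimal illustration, assuming that "trunk nodes" can be approximated by the tag structure down to a fixed depth; the `depth` parameter, the `Node` and `TreeBuilder` helpers and the sample pages are hypothetical and are not taken from the figures.

```python
from collections import defaultdict
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element-only tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def trunk_signature(node, depth):
    """Hashable description of the tag structure down to the given depth."""
    if depth == 0:
        return node.tag
    return (node.tag, tuple(trunk_signature(c, depth - 1) for c in node.children))


def cluster_by_topology(files, depth=3):
    groups = defaultdict(list)
    for name, html in files.items():
        builder = TreeBuilder()
        builder.feed(html)
        builder.close()
        groups[trunk_signature(builder.root, depth)].append(name)
    return list(groups.values())


files = {
    "F1": "<html><body><div><p>a</p></div><div><p>b</p><p>c</p></div></body></html>",
    "F2": "<html><body><div><p>x</p></div><div><p>y</p></div></body></html>",
    "F3": "<html><body><table></table></body></html>",
}
groups = cluster_by_topology(files, depth=3)
```

F1 and F2 differ below the second `div`, but their structure down to the chosen depth is the same, so they fall into one group; F3 has a different trunk and forms a separate group.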

In another example, the first acquisition device 12 counts the <a> tags in each of a plurality of HTML files to obtain the number of hyperlinks in each HTML file, and clusters the HTML files accordingly. Preferably, the similarity of the anchor text of the hyperlinks can also be taken into account when clustering the HTML files, so as to obtain several groups of HTML files in which the files of each group have the same number of hyperlinks and the similarity of their anchor texts exceeds a predetermined similarity threshold.
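A minimal sketch of this link-based grouping test is given below. The specification does not state how anchor-text similarity is measured; the Jaccard overlap of the anchor-text sets used here, and the threshold values, are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the anchor text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self.anchors.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self.anchors[-1] += data


def anchor_texts(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.anchors]


def anchor_similarity(a, b):
    """Jaccard overlap of two anchor-text collections (an assumed measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def same_group(html1, html2, threshold=0.5):
    """Same link count and anchor-text similarity above the threshold."""
    a1, a2 = anchor_texts(html1), anchor_texts(html2)
    return len(a1) == len(a2) and anchor_similarity(a1, a2) > threshold


p1 = '<a href="/">Home</a><a href="/n">News</a><a href="/1">Story one</a>'
p2 = '<a href="/">Home</a><a href="/n">News</a><a href="/2">Story two</a>'
p3 = '<a href="/">Home</a>'
```

With these sample pages, `p1` and `p2` have three links each and share two of four distinct anchor texts (similarity 0.5), while `p3` is excluded by its link count alone.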

Those skilled in the art will understand that the above ways of obtaining groups of markup language files are only examples; other existing or future ways of obtaining groups of markup language files, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Then, the comparative analysis device 13 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content nodes that contain the body content of that group.

Specifically, the comparative analysis device 13 takes at least one of the one or more groups of markup language files obtained by the first acquisition device 12, for example obtains the markup language files in each group, parses them to obtain their corresponding DOM trees, and compares and analyzes the content of corresponding nodes and their subtree nodes across the DOM trees to obtain the body content nodes containing the body content of the group. The methods of comparative analysis include but are not limited to:

1) Based on the number of characters of non-link text in the content of a corresponding node and its subtree nodes in each DOM tree: if, in more than a preset proportion of the DOM trees, the number of characters of non-link text in the content of the corresponding node and its subtree nodes exceeds a character count threshold, the comparative analysis device 13 determines that the node is a body content node containing body content;

2) Based on the proportion of the total display space occupied by the content of a corresponding node in each DOM tree when displayed: if, in more than a preset proportion of the DOM trees, the proportion of display space occupied by the content of the corresponding node exceeds a proportion threshold, the comparative analysis device 13 determines that the node is a body content node containing body content;

3) Based on the similarity of the content of a corresponding node and its subtree nodes across the DOM trees: if the mutual similarity of that content across the DOM trees is below a similarity threshold, the comparative analysis device 13 determines that the node is a body content node containing body content.

In one example, the comparative analysis device 13 obtains a group of HTML files and parses two HTML files in the group to obtain two DOM trees, T3 and T4, where T3 is shown in Fig. 3A and T4 is shown in Fig. 3B.

The comparative analysis device 13 then traverses the two DOM trees and compares and analyzes the content of corresponding nodes and their subtree nodes. For example, it obtains the number of characters in the content of node N4 and its subtree nodes N6 and N7 in T3, say 2500, and the number of characters in the content of the corresponding node N4' and its subtree node N6' in T4, say 2000. Both character counts exceed the predetermined character count threshold of 1500, so the comparative analysis device 13 takes this node as a body content node containing the body content of the group of HTML files.
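The character-count test in this example can be sketched as follows. The `Node` structure, the way text is distributed over the nodes, and the `tree_ratio` default are assumptions for illustration; only the totals of 2500 and 2000 characters and the threshold of 1500 come from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical simplified DOM node."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def non_link_chars(node):
    """Characters of text in a node and its subtree, skipping <a> subtrees."""
    if node.tag == "a":
        return 0
    return len(node.text) + sum(non_link_chars(c) for c in node.children)


def is_body_node(corresponding_nodes, char_threshold=1500, tree_ratio=0.8):
    """True if the threshold is exceeded in more than tree_ratio of the trees."""
    hits = sum(1 for n in corresponding_nodes if non_link_chars(n) > char_threshold)
    return hits / len(corresponding_nodes) > tree_ratio


# N4 with subtree nodes N6 and N7 in T3: 500 + 1200 + 800 = 2500 characters.
n4 = Node("div", "x" * 500, [Node("p", "x" * 1200), Node("p", "x" * 800)])
# N4' with subtree node N6' in T4: 100 + 1900 = 2000 characters; link text is ignored.
n4_prime = Node("div", "y" * 100, [Node("p", "y" * 1900, [Node("a", "related link")])])

result = is_body_node([n4, n4_prime])
```

Both nodes exceed the threshold, so the node position is accepted as a body content node under these assumptions.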

In another example, the comparative analysis device 13 obtains a group of HTML files and parses two HTML files in the group to obtain two DOM trees, T3 and T4, where T3 is shown in Fig. 3A and T4 is shown in Fig. 3B. The comparative analysis device 13 then traverses the two DOM trees and compares and analyzes the content of corresponding nodes and their subtree nodes. For example, it obtains the height and width set for displaying the content of node N3 in T3, together with the height and width of the displayed web page corresponding to the HTML file, and from these determines that the content of this node occupies 30% of the display space of the page. In the same way, it determines that the content of the corresponding node N3' in T4 occupies 35% of the display space. Both proportions exceed the predetermined proportion threshold of 20%, so the comparative analysis device 13 takes this node as a body content node containing the body content of the group of HTML files.
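The display-space test can be sketched in the same way. The specification mentions heights and widths but not the exact formula; the area ratio below, the pixel dimensions and the `tree_ratio` default are assumptions chosen so that the proportions match the 30% and 35% of the example.

```python
def display_ratio(node_width, node_height, page_width, page_height):
    """Share of the page area occupied by the node content (assumed formula)."""
    return (node_width * node_height) / (page_width * page_height)


def is_body_node_by_area(ratios, ratio_threshold=0.20, tree_ratio=0.8):
    """True if the threshold is exceeded in more than tree_ratio of the trees."""
    hits = sum(1 for r in ratios if r > ratio_threshold)
    return hits / len(ratios) > tree_ratio


# Hypothetical dimensions in pixels for N3 in T3 and N3' in T4.
r_n3 = display_ratio(600, 1000, 1000, 2000)
r_n3_prime = display_ratio(700, 1000, 1000, 2000)
result = is_body_node_by_area([r_n3, r_n3_prime])
```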

Those skilled in the art will understand that the above ways of comparative analysis are only examples; other existing or future ways of comparative analysis, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

It should be noted here that the numerical values in the above examples are illustrative only. They are given to help the reader understand the present invention, are not real data from practical application, and should not be regarded as limiting the scope of protection of the present patent application in any way. Unless otherwise stated, numerical values appearing elsewhere serve the same purpose and, for brevity, this is not repeated.

It should also be noted that the specific DOM trees corresponding to markup language files in the above examples are illustrative only. They are given to aid understanding of the present invention, are not real DOM trees from practical application, and should not be regarded as limiting the scope of protection of the present patent application in any way. Unless otherwise stated, DOM trees appearing elsewhere serve the same purpose and, for brevity, this is not repeated.

Subsequently, the template acquisition device 14 obtains, according to the obtained body content nodes, a content identification template for identifying the body content of the group of markup language files.

Specifically, for each body content node obtained by the comparative analysis device 13 that contains the body content of the group of markup language files, the template acquisition device 14 writes, for example, the node's number in the DOM tree under an agreed numbering scheme, or the node's path information in the DOM tree, into the content identification template corresponding to the group. The path information may be, for example, an XPath, which is a path expression by which the corresponding tree node can be located in the DOM tree. The content identification template describes the information of each body content node containing body content; it can be stored in a file system as a template file, or in a relational database as a data table.

In one example, as shown in Fig. 3A, the comparative analysis device 13 determines that the body content nodes containing the body content of a certain group of markup language files are N1, N4 and N5, and the numbering rule for body content nodes numbers the tree nodes of the DOM tree from top to bottom and from left to right. The template acquisition device 14 therefore determines, according to this numbering rule, that the numbers corresponding to N1, N4 and N5 are 1, 4 and 5 respectively, and writes them into the content identification template file.
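The top-to-bottom, left-to-right numbering rule corresponds to a breadth-first traversal, as the following sketch shows. Fig. 3A is not reproduced here, so the tree layout in `children` is a hypothetical one, chosen only to be consistent with the node names and numbers in the example; numbering the root as 0 is likewise an assumption.

```python
from collections import deque


def number_nodes(root, children):
    """Number nodes level by level, left to right, starting with the root at 0."""
    numbers = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        numbers[node] = len(numbers)
        queue.extend(children.get(node, []))
    return numbers


# Hypothetical layout: R0 is the root; each entry lists children left to right.
children = {
    "R0": ["N1", "N2"],
    "N1": ["N3"],
    "N2": ["N4", "N5"],
    "N4": ["N6", "N7"],
}
numbers = number_nodes("R0", children)
template = [numbers[n] for n in ("N1", "N4", "N5")]
```

Under this layout the body content nodes N1, N4 and N5 receive the numbers 1, 4 and 5, which are the values written into the template in the example.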

In another example, as shown in Fig. 3A, the comparative analysis device 13 determines that the body content nodes containing the body content of a certain group of markup language files are N3 and N4. The template acquisition device 14 therefore obtains, for these body content nodes, their corresponding XPaths in the DOM tree: the XPath of N3 is "/R0/N1/N3" and the XPath of N4 is "/R0/N2/N4". It then writes these XPaths into the relational database that holds the content identification template corresponding to the group of markup language files.
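Deriving such paths and storing them in a relational table can be sketched as follows. The tree layout, the table name `content_template`, its columns and the group identifier "G2" are hypothetical; the paths use the node labels of the example rather than real element names, whereas an XPath over an actual HTML document would typically use tag names and positional predicates.

```python
import sqlite3


def build_paths(root, children):
    """Map every node to its slash-separated path from the root."""
    paths = {root: "/" + root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            paths[child] = paths[node] + "/" + child
            stack.append(child)
    return paths


# Hypothetical layout consistent with the paths given in the example.
children = {
    "R0": ["N1", "N2"],
    "N1": ["N3"],
    "N2": ["N4", "N5"],
    "N4": ["N6", "N7"],
}
paths = build_paths("R0", children)

# Store the template rows in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content_template (group_id TEXT, xpath TEXT)")
conn.executemany(
    "INSERT INTO content_template VALUES (?, ?)",
    [("G2", paths[n]) for n in ("N3", "N4")],
)
stored = [row[0] for row in conn.execute(
    "SELECT xpath FROM content_template ORDER BY xpath")]
```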

Those skilled in the art will understand that the above ways of obtaining a content identification template are only examples; other existing or future ways of obtaining a content identification template, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, the file acquisition device 11, the first acquisition device 12, the comparative analysis device 13 and the template acquisition device 14 work continuously. Specifically, the file acquisition device 11 continuously obtains markup language files to be processed; the first acquisition device 12 continuously obtains one or more groups of markup language files according to the relevant information of those files; the comparative analysis device 13 continuously compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group, to obtain the body content nodes containing the body content of that group; and the template acquisition device 14 continuously obtains, according to the obtained body content nodes, the content identification template for identifying the body content of that group. Those skilled in the art will understand that "continuously" means that the devices keep performing, respectively, the acquisition of markup language files, the acquisition of groups of markup language files, the comparative analysis of each group, and the acquisition of the content identification template, until a predetermined stop condition is met, for example until the file acquisition device 11 has not obtained any markup language files for a long period.

In a preferred embodiment (see Fig. 1), the comparative analysis device 13 comprises a similarity acquiring unit (not shown) and a node acquiring unit (not shown). The similarity acquiring unit compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group, to obtain the similarity of that content; the node acquiring unit then determines the body content nodes according to the similarity.

This preferred embodiment is described in detail below with reference to Fig. 1. The file acquisition device 11 obtains a plurality of markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of those files; and the template acquisition device 14 obtains, according to the obtained body content nodes, the content identification template for identifying the body content of the group. These processes are the same as those performed by the file acquisition device 11, the first acquisition device 12 and the template acquisition device 14 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

Specifically, the similarity acquiring unit compares and analyzes the content of corresponding nodes and their subtree nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files obtained by the first acquisition device 12, to obtain the similarity of that content. The methods of obtaining the content similarity include but are not limited to:

1) performing string comparison on the text content of the corresponding nodes and their subtree nodes in the DOM trees to determine the similarity of the content, where the higher the degree of string matching, the higher the similarity of the content, and vice versa;

2) performing word segmentation on the text content of the corresponding nodes and their subtree nodes in the DOM trees, and counting the tokens that are identical across the text content of the corresponding nodes to determine the similarity of the content, where the fewer identical tokens there are, the lower the similarity of the content, and vice versa. Here, the word segmentation algorithms include but are not limited to forward maximum matching, reverse maximum matching, bidirectional maximum matching, language model methods, shortest-path algorithms and the like.

Subsequently, the node acquiring unit uses the similarity of the content of the node and its subtree nodes obtained by the similarity acquiring unit to determine whether the node is a body content node containing body content, for example according to the rule that the content is body content if the similarity is below a preset similarity threshold and is non-body content otherwise.
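Method 1) and the decision rule can be illustrated with a minimal sketch. It uses `difflib.SequenceMatcher` from the Python standard library as one possible string comparison and averages the pairwise ratios; the specification does not name a particular string-matching measure, so both choices, and the function names, are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(texts):
    """Average SequenceMatcher ratio over all pairs of corresponding-node texts."""
    pairs = list(combinations(texts, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def is_body_content(texts, threshold=0.4):
    """Content that differs from page to page (low similarity) is body content."""
    return mean_pairwise_similarity(texts) < threshold


# A navigation bar repeated verbatim on every page of the group.
navigation = ["Home | News | Sports", "Home | News | Sports"]
nav_is_body = is_body_content(navigation)
```

The repeated navigation text has a similarity of 1.0 and is therefore classified as non-body content, whereas texts with little in common across pages would fall below the threshold.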

In one example, the similarity acquiring unit obtains the text content of corresponding nodes and their subtree nodes in the DOM trees corresponding to a certain group of HTML files, performs word segmentation on each text content using the forward maximum matching algorithm to obtain 3000 distinct tokens, and analyzes the distribution of these tokens across the text contents. It determines that more than a preset number of tokens, say 1500, occur in every text content, and accordingly obtains the similarity of the text contents, say 0.7. The node acquiring unit then uses the similarity obtained by the similarity acquiring unit: since the similarity is higher than the preset similarity threshold of 0.4, it determines that this node does not contain the body content of the group of HTML files.
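The segmentation-based approach can be sketched as follows. Forward maximum matching greedily takes the longest dictionary word at each position. The dictionary, the unspaced English sample strings and the way the shared-token count is turned into a similarity (tokens common to all texts divided by all distinct tokens) are assumptions for illustration; the specification does not give the exact formula.

```python
def forward_max_match(text, dictionary, max_len=7):
    """Greedy left-to-right segmentation, preferring the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens


def shared_token_similarity(token_lists):
    """Share of distinct tokens that occur in every text (assumed measure)."""
    sets = [set(tokens) for tokens in token_lists]
    all_tokens = set().union(*sets)
    common = set.intersection(*sets)
    return len(common) / len(all_tokens) if all_tokens else 0.0


dictionary = {"news", "home", "sports", "finance"}
tokens_1 = forward_max_match("newshomesports", dictionary)
tokens_2 = forward_max_match("newshomefinance", dictionary)
similarity = shared_token_similarity([tokens_1, tokens_2])
```

Here two of the four distinct tokens occur in both texts, giving a similarity of 0.5; with a threshold of 0.4 such a node would be treated as non-body content.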

Those skilled in the art will understand that the above ways of obtaining node content similarity and of obtaining body content nodes containing body content are merely examples; other existing or future ways of obtaining node content similarity or of obtaining body content nodes, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

In another preferred embodiment (with reference to Fig. 1), the template acquisition device 14 comprises a path information acquiring unit (not shown) and a template generation unit (not shown), wherein the path information acquiring unit obtains, according to the body content node, the path information corresponding to the body content node; subsequently, the template generation unit adds the path information to the content identification template, to obtain the content identification template.

This preferred embodiment is described in detail with reference to Fig. 1, wherein the file acquisition device 11 obtains multiple markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of the multiple markup language files; and the comparative analysis device 13 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content node containing the body content of that group. The detailed processes are the same as those performed by the file acquisition device 11, the first acquisition device 12 and the comparative analysis device 13 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

Specifically, the path information acquiring unit takes the body content node, obtained by the comparative analysis device 13, that contains the body content of a group of markup language files, and obtains the path information of this node from the DOM tree in which the node is located, wherein the path information may be expressed in ways including but not limited to:

- XPath;

- A combination of XPath and a regular expression, where a regular expression is a single character string used to describe or match a series of character strings that conform to a certain syntactic rule.

Subsequently, the template generation unit writes the path information obtained by the path information acquiring unit into the content identification template used to identify the body content of this group of markup language files, to obtain this content identification template.

In one example, as shown in Fig. 3A, the body content nodes obtained by the comparative analysis device 13 that contain the body content of a group of markup language files are N6 and N7. The path information acquiring unit obtains, according to these body content nodes, the corresponding path information "/R0/N2/N4/N[6-7]{1}". Subsequently, the template generation unit writes this path information into a content identification template file, to obtain the template for identifying the body content of this group of markup language files.
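As an illustration of this example, the sketch below derives the path of each body content node by walking up to the root, then checks that the combined path expression from the example covers both nodes. The `Node` class and `node_path` helper are hypothetical, and the tree shape R0 → N2 → N4 → {N6, N7} is inferred from the path string rather than taken from Fig. 3A itself.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None


def node_path(node):
    """Build a '/'-separated path from the root down to the given node."""
    parts = []
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return "/" + "/".join(reversed(parts))


# Assumed fragment of the tree in the example.
r0 = Node("R0")
n2 = Node("N2", r0)
n4 = Node("N4", n2)
n6 = Node("N6", n4)
n7 = Node("N7", n4)

paths = [node_path(n6), node_path(n7)]

# The path information from the example, read here as a regular expression.
pattern = re.compile(r"/R0/N2/N4/N[6-7]{1}")
matches = [bool(pattern.fullmatch(p)) for p in paths]
```

Reading the last path step as a regular expression lets one template entry cover several sibling body content nodes instead of listing each path separately.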

Those skilled in the art will understand that the above ways of obtaining path information and of obtaining the content identification template are merely examples; other existing or future ways of obtaining path information or of obtaining the content identification template, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

In another preferred embodiment (with reference to Fig. 1), the template providing device 1 further comprises a second acquisition device (not shown), wherein the second acquisition device obtains, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files; then, the comparative analysis device 13 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of the obtained at least one group of markup language files, to obtain the body content node. This preferred embodiment is described in detail with reference to Fig. 1, wherein the file acquisition device 11 obtains multiple markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of the multiple markup language files; and the template acquisition device 14 obtains, according to the obtained body content node, the content identification template for identifying the body content of that group of markup language files. The detailed processes are the same as those performed by the file acquisition device 11, the first acquisition device 12 and the template acquisition device 14 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

Specifically, the second acquisition device obtains the groups of markup language files according to the predetermined rule, for example obtaining all groups provided by the first acquisition device 12, or obtaining only those groups in which the number of markup language files exceeds a predetermined quantity. Then, the comparative analysis device 13 performs the comparative analysis on each group obtained by the second acquisition device, to obtain for each group the body content node containing the body content of that group. The predetermined rule comprises obtaining the groups of markup language files based on at least any one of the following:

1) The number of markup language files in the group;

Specifically, when the predetermined rule is based on the number of markup language files in the group: only when the group contains many markup language files, e.g. more than a certain file quantity threshold, can the body content node containing the body content of the group be obtained accurately by comparing and analyzing the node content of each markup language file; otherwise the body content node obtained will be inaccurate. Therefore, the second acquisition device obtains only those groups whose number of markup language files exceeds this file quantity threshold;

2) The number of nodes of the DOM trees corresponding to the markup language files, etc.;

Specifically, when the predetermined rule is based on the number of nodes of the DOM trees corresponding to the markup language files in the group: if the number of nodes of each DOM tree is small, e.g. below a certain node quantity threshold, the corresponding markup language files contain little content and their body content need not be extracted. Therefore, the second acquisition device obtains only those groups in which the number of nodes of each DOM tree exceeds this node quantity threshold.

Those skilled in the art will understand that each of the items listed above may be used by the second acquisition device on its own to obtain groups of markup language files, and that several of them may also be combined for that purpose.

Those skilled in the art will also understand that the above predetermined rules are merely examples; other existing or future predetermined rules, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

In one example, the first acquisition device 12 obtains 3 groups of HTML files, and the second acquisition device directly extracts these 3 groups. In another example, the first acquisition device 12 obtains 4 groups of markup language files, G3, G4, G5 and G6, containing 120, 50, 5 and 150 markup language files respectively; the second acquisition device extracts the 2 groups whose number of markup language files exceeds the predetermined quantity, namely G3 and G6, where the predetermined quantity may for example be set to 100.
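The two predetermined rules can be sketched as a simple filter. The function name `select_groups`, the data layout (one DOM node count per file) and the node counts themselves are assumptions made for illustration; only the group sizes and the threshold of 100 come from the example.

```python
def select_groups(groups, min_files=100, min_nodes=0):
    """Keep groups with more than min_files files whose DOM trees all exceed min_nodes nodes."""
    selected = {}
    for name, node_counts in groups.items():
        enough_files = len(node_counts) > min_files
        enough_nodes = all(count > min_nodes for count in node_counts)
        if enough_files and enough_nodes:
            selected[name] = node_counts
    return selected


# One entry per file: the node count of that file's DOM tree (made-up values).
groups = {
    "G3": [500] * 120,
    "G4": [500] * 50,
    "G5": [500] * 5,
    "G6": [500] * 150,
}
selected = select_groups(groups, min_files=100)
```

With these inputs only G3 and G6 remain, matching the example; passing a non-zero `min_nodes` combines both rules.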

In another preferred embodiment (with reference to Fig. 1), the template providing device 1 further comprises a template annotation device (not shown), wherein the template annotation device marks, in the content identification template and according to the body content contained in the body content node, the body content relevant information corresponding to the body content node; the body content relevant information comprises at least any one of the following:

- The type information of the body content;

- The display priority of the body content.

This preferred embodiment is described in detail with reference to Fig. 1, wherein the file acquisition device 11 obtains multiple markup language files to be processed; the first acquisition device 12 obtains one or more groups of markup language files according to the relevant information of the multiple markup language files; the comparative analysis device 13 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content node containing the body content of that group; and the template acquisition device 14 obtains, according to the obtained body content node, the content identification template for identifying the body content of that group. The detailed processes are the same as those performed by the file acquisition device 11, the first acquisition device 12, the comparative analysis device 13 and the template acquisition device 14 in the embodiment described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

Specifically, the template annotation device marks, according to the body content contained in the body content node obtained by the comparative analysis device 13 and in its subtree nodes, and for example according to a predetermined annotation rule, the body content relevant information corresponding to this body content node in the content identification template in which this body content node is recorded; this body content relevant information comprises at least any one of the following:

1) The type information of the body content, wherein the type information includes but is not limited to title content block, body content block, navigation content block, etc.;

2) The display priority of the body content; for example, body content with a higher display priority is displayed preferentially, nearer the front of the web page.

In one example, the plain text content in the body content contained in a body content node exceeds 5000 characters, and this plain text content accounts for 85% of the display of that body content. From this information the template annotation device determines that the type information of this body content is body content block and, according to this type information, that this body content has a high display priority. The template annotation device then writes the relevant information of this body content into the content identification template file in which this body content node is recorded, as shown in Table 1 below.

Table 1

Content node information	Content type information	Display priority
/R0/N1/N3	T1	High
/R0/N1/N9/N20	T3	Low
/R0/N1/N[6-7]{1}	T6	Medium
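A predetermined annotation rule of the kind used in the example might look like the sketch below. The function name `annotate`, the cut-off values and the fallback labels are hypothetical; the text only states that more than 5000 plain-text characters occupying 85% of the display led to the type "body content block" with high priority, and it does not define the type codes T1, T3 and T6 in Table 1.

```python
def annotate(path, text_chars, text_display_ratio,
             min_chars=5000, min_ratio=0.8):
    """Return one template row for a node, using assumed cut-off values."""
    if text_chars > min_chars and text_display_ratio >= min_ratio:
        return {"path": path, "type": "body content block", "priority": "high"}
    # Fallback labels are placeholders, not taken from the text.
    return {"path": path, "type": "unclassified", "priority": "low"}


row = annotate("/R0/N1/N3", text_chars=5200, text_display_ratio=0.85)
other = annotate("/R0/N1/N9/N20", text_chars=300, text_display_ratio=0.10)
```

Each returned dictionary corresponds to one row of a table such as Table 1, so a template file could be produced by collecting these rows.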

Preferably, non-body content node information may also be marked in the template file, together with the content type information, display priority, etc. of the non-body content corresponding to that non-body content node information.

Those skilled in the art will also understand that the above content relevant information and the ways of marking it are merely examples; other existing or future content relevant information or ways of marking it, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

Fig. 4 is a schematic diagram of a device for identifying body content of markup language files according to a preferred embodiment of the present invention, wherein the first acquisition device 12' further comprises a screening unit 121' and a clustering unit 122'. Here, devices 11' and 13' shown in Fig. 4 are the same as devices 11 and 13 described above with reference to Fig. 1; for brevity, they are incorporated herein by reference and not repeated.

Specifically, the screening unit 121' screens the multiple markup language files according to a predetermined screening condition, to obtain at least one markup language file that meets the predetermined screening condition; then, the clustering unit 122' clusters the at least one markup language file according to the relevant information of the corresponding DOM trees, to obtain the one or more groups of markup language files; finally, the template acquisition device 14' obtains, according to the obtained body content node, the content identification template corresponding to this predetermined screening condition.

More specifically, the screening unit 121' screens, based on the predetermined screening condition, the multiple markup language files obtained by the file acquisition device 11', to obtain at least one markup language file that meets this predetermined screening condition. Preferably, the predetermined screening condition includes but is not limited to at least any one of the following:

1) The network address of the markup language file;

Specifically, if the predetermined screening condition is based on the network address of the markup language file, where the network address includes but is not limited to a URL address, an IP address, etc., the screening unit 121' may screen the markup language files according to their network addresses or a regular expression over the network address;

2) The website to which the markup language file belongs;

Specifically, if the predetermined screening condition is based on the website to which the markup language file belongs, for example whether the markup language files come from the same website or from websites of the same type, the screening unit 121' may for example screen HTML files according to whether they come from news websites.

Those skilled in the art will understand that each of the above predetermined screening conditions may be used by the screening unit 121' on its own to screen the multiple markup language files, and that several of them may also be combined for that purpose.

Those skilled in the art will also understand that the above screening conditions are merely examples; other existing or future screening conditions, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

Then, the clustering unit 122' clusters the markup language files obtained by the screening unit 121' according to the relevant information of their corresponding DOM trees, to obtain the one or more groups of markup language files corresponding to this predetermined screening condition.

Finally, the template acquisition device 14' obtains, according to the body content nodes that the comparative analysis device 13' obtained for each of these one or more groups of markup language files, one or more content identification templates in one-to-one correspondence with the groups, and takes these one or more content identification templates as the content identification templates corresponding to this predetermined screening condition.

In one example, the predetermined screening condition C1 is that the uniform resource locator (URL) address of the HTML file matches the regular expression http://www.abc.com/news*.*html. According to this condition, the screening unit 121' screens the 150 HTML files obtained by the file acquisition device 11' and obtains the 70 HTML files whose URL addresses match this regular expression. Then, the clustering unit 122' clusters these 70 HTML files according to their DOM tree relevant information, to obtain 3 groups of HTML files corresponding to the predetermined screening condition C1. The template acquisition device 14' obtains, according to the body content nodes that the comparative analysis device 13' obtained for each of these 3 groups, 3 content identification template files corresponding to the 3 groups, and takes these 3 content identification template files as the content identification templates corresponding to the predetermined screening condition C1.
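The screening step of this example can be sketched with Python's `re` module. The pattern is copied verbatim from the example; the sample URLs and the helper name `screen` are made up for illustration.

```python
import re

# Pattern copied from the example. Read as a regular expression, the
# unescaped dots match any character and "s*" matches zero or more "s".
CONDITION_C1 = re.compile(r"http://www.abc.com/news*.*html")


def screen(urls, condition=CONDITION_C1):
    """Return the URLs that match the screening condition in full."""
    return [url for url in urls if condition.fullmatch(url)]


# Hypothetical URLs for illustration only.
urls = [
    "http://www.abc.com/news/2011/story1.html",
    "http://www.abc.com/news_list.html",
    "http://www.abc.com/sports/match.html",
    "http://www.xyz.com/news/story.html",
]
kept = screen(urls)
```

Here the first two URLs pass and the last two are rejected. A stricter condition would escape the literal dots, but that would be a change to the expression given in the example.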

Those skilled in the art will also understand that the above ways of screening and clustering markup language files are merely examples; other existing or future ways of screening or clustering markup language files, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

Preferably, the template providing device 1 further comprises a screening condition acquisition device (not shown), a template selection device (not shown) and a body content identification device (not shown), wherein the screening condition acquisition device obtains the predetermined screening condition met by other markup language files whose body content is to be identified; then, the template selection device selects the content identification template corresponding to the predetermined screening condition met by these other markup language files; then, the body content identification device identifies the body content of the other markup language files according to the selected content identification template.

Specifically, the screening condition acquisition device obtains other markup language files whose body content is to be identified from a third-party device, for example when triggered by a predetermined condition or event, or periodically, and matches them against each predetermined screening condition to obtain the screening condition that such a markup language file meets. Then, according to the screening condition obtained by the screening condition acquisition device, the template selection device obtains the corresponding one or more content identification templates from the template acquisition device 14', extracts the body content node information, such as XPath, from each content identification template, and matches this node information against the DOM tree corresponding to the other markup language file according to a predetermined matching rule, to obtain the content identification template corresponding to the other markup language file, wherein the matching rule includes but is not limited to:

1) If, for every item of body content node information in the content identification template, a corresponding tree node can be found in the DOM tree of the other markup language file, the other markup language file matches this content identification template;

2) If, for every item of body content node information marked as required in the content identification template, a corresponding tree node can be found in the DOM tree of the other markup language file, the other markup language file matches this content identification template.

Then, the body content identification device extracts each item of body content node information from the content identification template obtained by the template selection device, looks up the body content nodes in the DOM tree of the other markup language file according to this body content node information, and obtains the body content from each such node and its subtree nodes.
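The two matching rules and the subsequent extraction can be sketched with the limited XPath support of Python's `xml.etree.ElementTree`. The template layout (a list of entries with `xpath` and `required` keys), the function names and the sample page are assumptions; the sketch also assumes well-formed markup, which real HTML often is not.

```python
import xml.etree.ElementTree as ET


def matches(root, template, required_only=True):
    """Rule 2 (default): every required entry resolves; rule 1: every entry resolves."""
    entries = [e for e in template if e["required"]] if required_only else template
    return all(root.find(e["xpath"]) is not None for e in entries)


def extract_body(root, template):
    """Collect the text of each body content node together with its subtree."""
    results = []
    for entry in template:
        node = root.find(entry["xpath"])
        if node is not None:
            pieces = (t.strip() for t in node.itertext())
            results.append(" ".join(p for p in pieces if p))
    return results


# Hypothetical page and template for illustration.
page = ET.fromstring(
    "<html><body><div id='nav'>Home</div>"
    "<div id='main'><h1>Title</h1><p>Story text.</p></div></body></html>"
)
template = [
    {"xpath": "./body/div[@id='main']", "required": True},
    {"xpath": "./body/div[@id='related']", "required": False},
]
body = extract_body(page, template)
```

Under rule 2 the page matches because the only required node is present; under rule 1 it does not, since the optional `related` node is missing.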

Those skilled in the art will also understand that the above ways of obtaining screening conditions, selecting templates and obtaining body content are merely examples; other existing or future ways of obtaining screening conditions, selecting templates or obtaining body content, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.

Those skilled in the art will also understand that the first acquisition device and the second acquisition device described above are merely examples; in practice they may be two independent modules or may be integrated into a single module.

Fig. 5 is a flow chart of a method for identifying body content of markup language files according to one aspect of the present invention. The template providing device 1 includes but is not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud formed by multiple servers. Here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.

As shown in Fig. 5, in step S1, the template providing device 1 obtains multiple markup language files to be processed.

Specifically, in step S1, the template providing device 1 obtains, according to a predetermined file acquisition rule, multiple markup language files corresponding to Internet web pages from the web page library of the template providing device 1, the predetermined file acquisition rule not being limited to any particular rule.

Alternatively, in step S1, the template providing device 1 reads the multiple markup language files directly from a third-party device through an agreed communication mode, either when triggered by a predetermined condition or event, or periodically.

Here, the markup language files include but are not limited to:

- HyperText Markup Language (HTML) files;

- Extensible HyperText Markup Language (XHTML) files;

- Extensible Markup Language (XML) files.

In one example, in step S1, the template providing device 1 statistically analyzes the web page relevant information in its web page library to obtain the number of times each web page has been accessed by users through mobile terminals, and accordingly obtains the HTML files corresponding to the web pages whose access count exceeds a predetermined access quantity. This predetermined access quantity should vary with actual demand and the specific application: for example, in an application with fewer users it may be tens of thousands to hundreds of thousands, and in an application with more users it may be hundreds of thousands to millions. Those skilled in the art can determine it according to actual demand and the specific application.

In another example, in step S1, the template providing device 1 periodically sends a request for markup language files to a third-party device by calling a predefined application programming interface (API), and receives the multiple markup language files that the third-party device returns in response to this request.

Subsequently, in step S2, the template providing device 1 obtains one or more groups of markup language files according to the relevant information of the multiple markup language files that it obtained in step S1.

Specifically, in step S2, the template providing device 1 takes the multiple markup language files that it obtained in step S1 and, for example, obtains the relevant information of the multiple markup language files and clusters them accordingly, to obtain one or more groups of markup language files; or it obtains the relevant information of some of the markup language files and clusters those files, to obtain one or more groups of markup language files. The relevant information of the multiple markup language files includes but is not limited to:

a) The relevant information of the DOM trees corresponding to the multiple markup language files. Specifically, when the relevant information of the multiple markup language files comprises the relevant information of their corresponding DOM trees, in step S2 the template providing device 1 clusters the multiple markup language files according to the relevant information of these DOM trees, to obtain one or more groups of markup language files, wherein the relevant information of the DOM tree includes but is not limited to:

i) The number of nodes of the DOM tree. Specifically, when the relevant information of the DOM tree comprises its number of nodes, in step S2 the template providing device 1 clusters the multiple markup language files according to this number of nodes, for example clustering markup language files that have the same number of nodes, or whose number of nodes falls within the same predetermined interval, into the same group;

ii) The topology information of the DOM tree. Specifically, when the relevant information of the DOM tree comprises its topology information, where the topology information includes but is not limited to the distribution of the tree nodes in the DOM tree, in step S2 the template providing device 1 clusters markup language files that have the same tree node distribution into the same group.

Those skilled in the art will understand that each of the above items of DOM tree relevant information may be used by the template providing device 1 on its own to obtain one or more groups of markup language files, and that several of them may also be combined for that purpose.

In this case, in step S2, the template providing device 1 clusters the multiple markup language files according to this resource information, to obtain one or more groups of markup language files.

Those skilled in the art will understand that each of the above items of relevant information of the markup language files may be used by the template providing device 1 on its own to obtain one or more groups of markup language files, and that several of them may also be combined for that purpose.

In one example, in step S2, the template providing device 1 parses multiple HTML files to generate their corresponding DOM trees, and then clusters the HTML files according to the topology information of each DOM tree, where the topology information includes but is not limited to the distribution of the tree nodes of the DOM tree.

Taking Fig. 2 and Fig. 3 as an example, in step S2 the template providing device 1 finds by clustering that the DOM trees corresponding to some of the HTML files have the topology shown in Fig. 2, while the DOM trees corresponding to the other HTML files have the topology shown in Fig. 3. The template providing device 1 thus obtains 2 groups of HTML files, G1 and G2, where the HTML files in G1 have the topology shown in Fig. 2 and those in G2 have the topology shown in Fig. 3. Preferably, the DOM tree topologies of the HTML files clustered into one group need not be fully identical; only the distribution of the trunk nodes of their DOM trees needs to be consistent. For example, the DOM tree T1 corresponding to HTML file F1 is shown in Fig. 3A and the DOM tree T2 corresponding to HTML file F2 is shown in Fig. 3B; as can be seen from the figures, T1 and T2 both have the DOM tree topology shown in Fig. 3, so F1 and F2 are both clustered into G2.
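One way to sketch topology-based clustering is to reduce each DOM tree to a signature of its tag structure and group files with equal signatures. Treating "trunk nodes" as the nodes down to a fixed depth is an assumption made here for illustration, as are the function names and sample documents; the sketch also assumes well-formed markup that `xml.etree.ElementTree` can parse.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def topology_signature(element, max_depth):
    """Nested tuple of tag names, ignoring structure below max_depth."""
    if max_depth == 0:
        return element.tag
    children = tuple(topology_signature(c, max_depth - 1) for c in element)
    return (element.tag, children)


def cluster_by_topology(documents, max_depth=2):
    """Group document names whose trees share the same trunk signature."""
    groups = defaultdict(list)
    for name, markup in documents.items():
        root = ET.fromstring(markup)
        groups[topology_signature(root, max_depth)].append(name)
    return list(groups.values())


# Hypothetical documents: F1 and F2 differ only below the trunk.
documents = {
    "F1": "<html><body><div><p>a</p></div><ul><li>x</li></ul></body></html>",
    "F2": "<html><body><div><p>b</p><p>c</p></div><ul><li>y</li></ul></body></html>",
    "F3": "<html><body><table><tr><td>z</td></tr></table></body></html>",
}
clusters = cluster_by_topology(documents, max_depth=2)
```

With `max_depth=2` F1 and F2 fall into one group even though their `div` subtrees differ, while a larger depth would separate them; the depth therefore controls how strictly the topologies must agree.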

In another example, in step S2, the template providing device 1 counts the <a> tags in each of multiple HTML files to obtain the number of hypertext links in each HTML file, and clusters the HTML files accordingly. Preferably, the HTML files may also be clustered in combination with the similarity of the anchor text of these hypertext links, to obtain several groups of HTML files in which the HTML files of each group have the same number of hypertext links and the content similarity of their anchor text exceeds a predetermined similarity threshold.
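Counting `<a>` tags and grouping by the count can be sketched with the standard-library `html.parser`. The class and function names and the sample documents are hypothetical, and the anchor-text similarity refinement mentioned above is left out of this sketch.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts opening <a> tags while the document is parsed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_links(markup):
    parser = LinkCounter()
    parser.feed(markup)
    parser.close()
    return parser.count


def cluster_by_link_count(documents):
    """Map each link count to the names of the documents that have it."""
    groups = defaultdict(list)
    for name, markup in documents.items():
        groups[count_links(markup)].append(name)
    return dict(groups)


# Hypothetical documents for illustration.
documents = {
    "F1": '<p><a href="/x">x</a> <a href="/y">y</a></p>',
    "F2": '<div><a href="/m">m</a><a href="/n">n</a></div>',
    "F3": "<p>no links</p>",
}
link_groups = cluster_by_link_count(documents)
```

Exact equality of link counts is a strict grouping key; bucketing the counts into intervals would be a natural relaxation, in the same spirit as the node-count intervals described for DOM trees.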

Then, in step S3, the template providing device 1 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group of markup language files, to obtain the body content node containing the body content of that group.

Specifically, in step S3, the template providing device 1 takes at least one of the one or more groups of markup language files that it obtained in step S2, for example obtaining the markup language files of each group in turn, parses these markup language files to obtain their corresponding DOM trees, and compares and analyzes the content of the corresponding nodes and their subtree nodes across the DOM trees, to obtain the body content node containing the body content of that group, wherein the method of comparative analysis includes but is not limited to:

1) According to the number of characters of the non-link text in the content of the corresponding node and its subtree nodes in each DOM tree: if, in more than a preset proportion of the DOM trees, the number of characters of the non-link text in the content of this corresponding node and its subtree nodes exceeds a certain character quantity threshold, then in step S3 the template providing device 1 judges this node to be a body content node containing body content;

2) According to the proportion of the full display space that the content of the corresponding node in each DOM tree occupies when displayed: if, in more than a preset proportion of the DOM trees, the proportion of display space occupied by the content of this corresponding node exceeds a certain proportion threshold, then in step S3 the template providing device 1 judges this node to be a body content node containing body content;

3) According to the similarity of the content of the corresponding node and its subtree nodes across the DOM trees: if the mutual similarity of the content of this corresponding node and its subtree nodes across the DOM trees is below a certain similarity threshold, then in step S3 the template providing device 1 judges this node to be a body content node containing body content.

In one example, in step S3, template provides equipment 1 to obtain one group of html file, and 2 html files in this group html file are resolved, and obtains two dom tree T3 and T4, and wherein as shown in Figure 3A, T4 as shown in Figure 3 B for T3;

Then, in step S3, the template providing equipment 1 traverses the two DOM trees and compares and analyzes the content of corresponding nodes and their subtree nodes. For instance, it obtains the number of characters in the content of node N4 and its subtree nodes N6 and N7 in T3, e.g. 2500, and the number of characters in the content of the corresponding node N4' and its subtree node N6' in T4, e.g. 2000. Both counts exceed the predetermined character-count threshold of 1500, so in step S3 the template providing equipment 1 takes this node as a body content node containing the body content of this group of HTML files.
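The character-count criterion of this example can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the pages are assumed to be well-formed XHTML so that the standard `xml.etree.ElementTree` parser can stand in for a real HTML parser, and the names `non_link_chars`, `is_body_node` and `min_ratio` are hypothetical.

```python
import xml.etree.ElementTree as ET


def non_link_chars(node):
    """Count text characters in a node and its subtree, skipping <a> elements."""
    if node.tag == "a":
        return 0
    total = len((node.text or "").strip())
    for child in node:
        total += non_link_chars(child)
        # Text that follows a child still belongs to the current node.
        total += len((child.tail or "").strip())
    return total


def is_body_node(trees, path, char_threshold=1500, min_ratio=1.0):
    """True if the node at `path` exceeds the threshold in enough DOM trees.

    min_ratio is an assumed stand-in for the "preset proportion" of DOM trees.
    """
    hits = 0
    for root in trees:
        node = root.find(path)
        if node is not None and non_link_chars(node) > char_threshold:
            hits += 1
    return bool(trees) and hits / len(trees) >= min_ratio


# Two toy pages: a link-only navigation block and a long text block.
pages = [
    "<html><body><div><a href='#'>home</a></div><div>" + "x" * 2500 + "</div></body></html>",
    "<html><body><div><a href='#'>home</a></div><div>" + "y" * 2000 + "</div></body></html>",
]
trees = [ET.fromstring(p) for p in pages]
print(is_body_node(trees, "./body/div[2]"))  # long text block
print(is_body_node(trees, "./body/div[1]"))  # navigation block
```

With the counts 2500 and 2000 from the example, the second `div` passes the 1500-character threshold in both trees, while the link-only block contributes no non-link characters.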

In another example, in step S3, the template providing equipment 1 obtains a group of HTML files and parses two HTML files in the group to obtain two DOM trees, T3 and T4, where T3 is shown in Fig. 3A and T4 in Fig. 3B. Then, in step S3, the template providing equipment 1 traverses the two DOM trees and compares and analyzes the content of corresponding nodes and their subtree nodes. For instance, it obtains the height and width at which the content of node N3 in T3 is displayed, together with the height and width at which the web page corresponding to that HTML file is displayed, and derives from them that the content of this node occupies 30% of the display space of the page. Likewise, it finds that the content of the corresponding node N3' in T4 occupies 35% of the display space. Both proportions exceed the predetermined proportion threshold of 20%, so in step S3 the template providing equipment 1 takes this node as a body content node containing the body content of this group of HTML files.
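The display-space criterion reduces to a ratio of areas followed by a threshold check. The sketch below is illustrative only: the pixel dimensions are hypothetical values chosen to reproduce the 30% and 35% figures, and the function names and the `min_share` parameter are assumptions rather than anything defined by the source.

```python
def display_ratio(node_width, node_height, page_width, page_height):
    """Share of the rendered page area taken up by one node's content."""
    return (node_width * node_height) / (page_width * page_height)


def exceeds_display_threshold(ratios, ratio_threshold=0.20, min_share=1.0):
    """True if at least min_share of the DOM trees exceed the ratio threshold."""
    if not ratios:
        return False
    hits = sum(1 for r in ratios if r > ratio_threshold)
    return hits / len(ratios) >= min_share


# Hypothetical pixel sizes: N3 covers 30% of its page, N3' covers 35%.
ratio_t3 = display_ratio(600, 500, 1000, 1000)
ratio_t4 = display_ratio(700, 500, 1000, 1000)
print(exceeds_display_threshold([ratio_t3, ratio_t4]))
```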

Subsequently, in step S4, the template providing equipment 1 obtains, according to the obtained body content nodes, a content identification template for identifying the body content of this group of markup language files.

Specifically, in step S4, for each body content node obtained in step S3 that contains body content of this group of markup language files, the template providing equipment 1 writes either the number that corresponds to the node in the DOM tree under a pre-agreed numbering scheme, or the path information of the node in the DOM tree, into the content identification template corresponding to this group of markup language files. The path information may be, for example, an XPath, that is, a path expression by which the corresponding tree node can be located in the DOM tree. The content identification template describes the information of each body content node that contains body content; it may be stored as a template file in a file system, or as a data table in a relational database.

In one example, as shown in Fig. 3A, in step S3 the template providing equipment 1 finds that the body content nodes containing the body content of a group of markup language files are N1, N4 and N5, and the numbering rule for body content nodes is to number the tree nodes of the DOM tree from top to bottom and from left to right. Accordingly, in step S4, the template providing equipment 1 determines under this numbering rule that the numbers corresponding to N1, N4 and N5 are 1, 4 and 5 respectively, and writes them into the content identification template file.
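A top-to-bottom, left-to-right numbering corresponds to a breadth-first traversal. The sketch below assumes a tree shape for Fig. 3A: the parents of N3, N4, N6 and N7 follow the paths quoted elsewhere in the description, whereas placing N5 under N2 and numbering the root as 0 are assumptions made only so that the example runs.

```python
from collections import deque

# Assumed reconstruction of Fig. 3A (the position of N5 is hypothetical).
TREE = {
    "R0": ["N1", "N2"],
    "N1": ["N3"],
    "N2": ["N4", "N5"],
    "N4": ["N6", "N7"],
}


def number_nodes(tree, root):
    """Number nodes level by level, left to right, starting with the root at 0."""
    numbers = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        numbers[node] = len(numbers)
        queue.extend(tree.get(node, []))
    return numbers


numbers = number_nodes(TREE, "R0")
template_numbers = [numbers[name] for name in ("N1", "N4", "N5")]
print(template_numbers)  # [1, 4, 5]
```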

In another example, as shown in Fig. 3A, in step S3 the template providing equipment 1 finds that the body content nodes containing the body content of a group of markup language files are N3 and N4. Accordingly, in step S4, the template providing equipment 1 obtains the XPath of each of those body content nodes in the DOM tree: the XPath of N3 is "/R0/N1/N3" and the XPath of N4 is "/R0/N2/N4". It then writes those XPaths into the relational database that holds the content identification template corresponding to this group of markup language files.
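As an illustration of deriving such paths and storing them in a relational table, the following sketch walks parent links up to the root and writes the result into an in-memory SQLite database. The table name `content_template`, the column names and the group identifier `G1` are hypothetical; only the two paths come from the example.

```python
import sqlite3

# Parent links implied by the two paths quoted in the example.
PARENT = {"N1": "R0", "N2": "R0", "N3": "N1", "N4": "N2"}


def path_of(node, parent):
    """Build the root-to-node path by following parent links upward."""
    parts = [node]
    while node in parent:
        node = parent[node]
        parts.append(node)
    return "/" + "/".join(reversed(parts))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content_template (group_id TEXT, xpath TEXT)")
for name in ("N3", "N4"):
    conn.execute(
        "INSERT INTO content_template VALUES (?, ?)",
        ("G1", path_of(name, PARENT)),
    )

rows = [r[0] for r in conn.execute(
    "SELECT xpath FROM content_template ORDER BY xpath")]
print(rows)  # ['/R0/N1/N3', '/R0/N2/N4']
```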

Preferably, the above steps run continuously. Specifically, in step S1, the template providing equipment 1 continuously obtains multiple markup language files to be processed; subsequently, in step S2, it continuously obtains one or more groups of markup language files according to the relevant information of those files; then, in step S3, it continuously compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group, to obtain the body content nodes containing the body content of that group; then, in step S4, it continuously obtains, according to the obtained body content nodes, the content identification template for identifying the body content of that group. Those skilled in the art will understand that "continuously" means that the steps respectively keep obtaining markup language files, obtaining groups of markup language files, comparing and analyzing each group, and obtaining content identification templates for identifying body content, until a predetermined stop condition is met, for example the template providing equipment 1 has stopped obtaining markup language files for a long time.

In a preferred embodiment (with reference to Fig. 5), step S3 comprises step S31 (not shown) and step S32 (not shown). In step S31, the template providing equipment 1 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group, to obtain the similarity of that content; subsequently, in step S32, the template providing equipment 1 determines the body content nodes according to the similarity.

This preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains multiple markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of those files; in step S4, it obtains, according to the obtained body content nodes, the content identification template for identifying the body content of the group of markup language files. The detailed processes are identical to those of steps S1, S2 and S4 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S31, the template providing equipment 1 compares and analyzes the content of corresponding nodes and their subtree nodes in the DOM trees corresponding to the markup language files in each of the at least one group obtained in step S2, to obtain the similarity of that content. Methods of obtaining the content similarity include, but are not limited to:

2) Segmenting the text content of the corresponding node and its subtree nodes in each DOM tree into words, and determining the similarity of the content by counting the words that the text contents of the corresponding nodes have in common: the fewer the common words, the lower the similarity of the content, and vice versa. Word segmentation algorithms include, but are not limited to, forward maximum matching, reverse maximum matching, bidirectional maximum matching, language-model methods and shortest-path methods. Subsequently, in step S32, the template providing equipment 1 determines whether the node is a body content node containing body content according to the similarity, obtained in step S31, of the content of the node and its subtree nodes, for example by the rule that content whose similarity is below a preset similarity threshold is body content and any other content is non-body content.

In one example, in step S31, the template providing equipment 1 obtains the text content of corresponding nodes and their subtree nodes in the DOM trees corresponding to a group of HTML files, segments each text content with the forward maximum matching algorithm to obtain 3000 distinct words, and statistically analyzes how those words are distributed across the text contents. It finds that more than a preset number of words, e.g. 1500, occur in every text content, and accordingly obtains the similarity of these text contents, e.g. 0.7. Subsequently, in step S32, because the similarity obtained in step S31 for the content of this node and its subtree nodes is higher than the preset similarity threshold of 0.4, the template providing equipment 1 determines that this node does not contain the body content of this group of HTML files.
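The segmentation-and-counting idea can be sketched as follows. Forward maximum matching is implemented in its usual greedy form; the similarity measure (shared words divided by all distinct words) is an assumption, since the description does not give a formula and this sketch does not claim to reproduce the 0.7 figure. The vocabulary, the sample strings and all function names are hypothetical.

```python
def forward_max_match(text, vocab, max_len=7):
    """Greedy left-to-right segmentation: take the longest vocabulary word
    that starts at the current position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def shared_word_similarity(texts, vocab):
    """Assumed measure: words present in every text / all distinct words."""
    token_sets = [set(forward_max_match(t, vocab)) for t in texts]
    if not token_sets:
        return 0.0
    all_words = set().union(*token_sets)
    shared = set.intersection(*token_sets)
    return len(shared) / len(all_words) if all_words else 0.0


VOCAB = {"about", "contact", "login", "stocks", "rise", "rain", "likely"}
SIMILARITY_THRESHOLD = 0.4  # threshold value taken from the example

footer_sim = shared_word_similarity(["aboutcontactlogin", "aboutcontactlogin"], VOCAB)
body_sim = shared_word_similarity(["stocksrise", "rainlikely"], VOCAB)
print(footer_sim < SIMILARITY_THRESHOLD)  # repeated boilerplate
print(body_sim < SIMILARITY_THRESHOLD)    # page-specific text
```

Under the rule described above, content whose similarity falls below the threshold is treated as body content, so the page-specific text qualifies and the repeated block does not.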

In another preferred embodiment (with reference to Fig. 5), step S4 comprises step S41 (not shown) and step S42 (not shown). In step S41, the template providing equipment 1 obtains, according to the body content nodes, the path information corresponding to those nodes; subsequently, in step S42, the template providing equipment 1 adds the path information to the content identification template, to obtain the content identification template.

This preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains multiple markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of those files; in step S3, it compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group, to obtain the body content nodes containing the body content of that group. The detailed processes are identical to those of steps S1, S2 and S3 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S41, for each body content node obtained in step S3 that contains body content of a group of markup language files, the template providing equipment 1 obtains the path information of the node from the DOM tree in which the node is located. Ways of expressing this path information include, but are not limited to:

- XPath;

Subsequently, in step S42, the template providing equipment 1 writes the path information it obtained in step S41 into the content identification template for identifying the body content of this group of markup language files, to obtain that content identification template.

In one example, as shown in Fig. 3A, the body content nodes obtained by the template providing equipment 1 in step S3 as containing the body content of a group of markup language files are N6 and N7. In step S41, the template providing equipment 1 obtains, according to those body content nodes, the corresponding path information "/R0/N2/N4/N[6-7]{1}"; subsequently, in step S42, it writes this path information into a content identification template file, to obtain the template for identifying the body content of this group of markup language files.
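The path information in this example covers two sibling nodes with a single expression. The description does not define the syntax, so the sketch below simply assumes that the expression can be read as a regular expression over node paths (with the spacing inside the braces removed); under that assumption it matches the paths of N6 and N7 and nothing else. The helper name `matches_template` is hypothetical.

```python
import re

# Pattern quoted in the example; treating it as a regular expression is an assumption.
PATH_PATTERN = re.compile(r"/R0/N2/N4/N[6-7]{1}")


def matches_template(path):
    """True if a node path is covered by the stored path information."""
    return PATH_PATTERN.fullmatch(path) is not None


candidates = ("/R0/N2/N4/N6", "/R0/N2/N4/N7", "/R0/N2/N5", "/R0/N1/N3")
covered = [p for p in candidates if matches_template(p)]
print(covered)  # ['/R0/N2/N4/N6', '/R0/N2/N4/N7']
```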

In another preferred embodiment (with reference to Fig. 5), the process further comprises step S5 (not shown). In step S5, the template providing equipment 1 obtains, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files; then, in step S3, the template providing equipment 1 compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of the at least one group so obtained, to obtain the body content nodes.

This preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains multiple markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of those files; in step S4, it obtains, according to the obtained body content nodes, the content identification template for identifying the body content of the group of markup language files. The detailed processes are identical to those of steps S1, S2 and S4 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S5, the template providing equipment 1 obtains groups of markup language files according to a predetermined rule, for example all the groups of markup language files it produced in step S2, or only those groups whose number of markup language files exceeds a predetermined number. Then, in step S3, the template providing equipment 1 performs the comparison and analysis on each group of markup language files obtained in step S5, obtaining for every group the body content nodes that contain that group's body content. The predetermined rule comprises obtaining the groups of markup language files based on at least any one of the following:

1) the number of markup language files in the group;

Specifically, when the predetermined rule is based on the number of markup language files in the group: only when the group contains many markup language files, e.g. more than a file-count threshold, can comparing and analyzing the node content of those files yield the body content nodes of the group with good accuracy; otherwise the body content nodes obtained will be inaccurate. Therefore, in step S5, the template providing equipment 1 obtains only those groups whose number of markup language files exceeds the file-count threshold;

2) the number of nodes of the DOM trees corresponding to the markup language files in the group;

Specifically, when the predetermined rule is based on the number of nodes of the DOM trees corresponding to the markup language files in the group: if every DOM tree has few nodes, e.g. fewer than a node-count threshold, the corresponding markup language files also contain little content, and there is no need to extract their body content. Therefore, in step S5, the template providing equipment 1 obtains only those groups in which the number of nodes of each DOM tree exceeds the node-count threshold.

Those skilled in the art will understand that each of the above items may be used on its own by the template providing equipment 1 to obtain groups of markup language files, and that several of them may also be combined for that purpose.

In one example, in step S2 the template providing equipment 1 obtains 3 groups of HTML files, and in step S5 it simply takes all 3 groups. In another example, in step S2 the template providing equipment 1 obtains 4 groups of markup language files, G3, G4, G5 and G6, containing 120, 50, 5 and 150 markup language files respectively; in step S5 it takes the 2 groups whose number of markup language files exceeds the predetermined number, namely G3 and G6, where the predetermined number may be set to 100, for example.
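The group selection in the second example amounts to a simple filter on group size. A minimal sketch, in which the function name `select_groups` and the dictionary layout are assumptions:

```python
def select_groups(group_sizes, min_files=100):
    """Keep only the groups whose file count exceeds the predetermined number."""
    return [name for name, size in group_sizes.items() if size > min_files]


group_sizes = {"G3": 120, "G4": 50, "G5": 5, "G6": 150}
selected = select_groups(group_sizes)
print(selected)  # ['G3', 'G6']
```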

In another preferred embodiment (with reference to Fig. 5), the process further comprises step S6 (not shown). In step S6, the template providing equipment 1 marks, in the content identification template, body-content-related information corresponding to the body content node, according to the body content that the node contains. The body-content-related information comprises at least any one of the following:

- the type information of the body content;

- the display priority of the body content.

This preferred embodiment is described in detail below with reference to Fig. 5. In step S1, the template providing equipment 1 obtains multiple markup language files to be processed; in step S2, it obtains one or more groups of markup language files according to the relevant information of those files; in step S3, it compares and analyzes the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of at least one group, to obtain the body content nodes containing the body content of that group; in step S4, it obtains, according to the obtained body content nodes, the content identification template for identifying the body content of that group. The detailed processes are identical to those of steps S1, S2, S3 and S4 in the embodiment described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S6, according to the body content contained in the body content node obtained in step S3 and in its subtree nodes, the template providing equipment 1 marks the body-content-related information corresponding to that node in the content identification template to which the node belongs, for example according to a predetermined marking rule. The body-content-related information comprises at least any one of the items listed above, namely the type information of the body content and its display priority.

In one example, the plain-text content within the body content of a body content node exceeds 5000 characters, and when displayed this plain-text content occupies 85% of the display of that body content. In step S6, the template providing equipment 1 determines from this information that the type of the body content is a body content block and, according to that type, that the body content has a high display priority. Then, in step S6, the template providing equipment 1 writes the related information of the body content into the content identification template file to which the body content node belongs, as shown in Table 2 below.

Table 2

Fig. 6 is a flow chart of a method for identifying the body content of markup language files according to a preferred embodiment of the present invention, in which step S2' further comprises step S21' and step S22'. Steps S1' and S3' shown in Fig. 6 are identical to steps S1 and S3 described above with reference to Fig. 5 and, for brevity, are incorporated here by reference and not repeated.

Specifically, in step S21', the template providing equipment 1 screens the multiple markup language files according to a predetermined filtering condition, to obtain at least one markup language file that satisfies the predetermined filtering condition; then, in step S22', the template providing equipment 1 clusters the at least one markup language file according to the relevant information of its corresponding DOM tree, to obtain the one or more groups of markup language files; finally, in step S4', the template providing equipment 1 obtains, according to the obtained body content nodes, the content identification template corresponding to the predetermined filtering condition.

More specifically, in step S21', the template providing equipment 1 screens the multiple markup language files it obtained in step S1' based on a predetermined filtering condition, to obtain at least one markup language file that satisfies the condition. Preferably, the predetermined filtering condition includes, but is not limited to, at least any one of the following:

1) the network address of the markup language file;

Specifically, if the predetermined filtering condition is based on the network address of the markup language file, where the network address includes but is not limited to a URL, an IP address and the like, then in step S21' the template providing equipment 1 screens the markup language files according to their network addresses or according to a regular expression over the network addresses;

2) the website to which the markup language file belongs;

Specifically, if the predetermined filtering condition is based on the website to which the markup language file belongs, for example whether the markup language files come from the same website or from websites of the same type, then in step S21' the template providing equipment 1 may, for example, screen HTML files according to whether they come from a news website.

Those skilled in the art will understand that each of the above predetermined filtering conditions may be used on its own in step S21' by the template providing equipment 1 to screen the multiple markup language files, and that several of them may also be combined for that purpose.

Then, in step S22', the template providing equipment 1 clusters the markup language files it obtained in step S21' according to the relevant information of their corresponding DOM trees, to obtain the one or more groups of markup language files corresponding to the predetermined filtering condition;

Finally, in step S4', according to the body content nodes it obtained in step S3' for each of the one or more groups of markup language files, the template providing equipment 1 obtains one or more content identification templates in one-to-one correspondence with the groups, and takes these one or more content identification templates as the content identification templates corresponding to the predetermined filtering condition.

In one example, the predetermined filtering condition C1 is that the uniform resource locator (URL) of an HTML file matches the regular expression http://www.abc.com/news*.*html. In step S21', the template providing equipment 1 screens the 150 HTML files it has obtained according to this predetermined filtering condition, obtaining the 70 HTML files whose URLs match the regular expression. Then, in step S22', the template providing equipment 1 clusters these 70 HTML files according to their DOM tree relevant information, obtaining 3 groups of HTML files corresponding to the predetermined filtering condition C1. In step S4', according to the body content nodes it obtained in step S3' for each of these 3 groups, the template providing equipment 1 obtains 3 content identification template files corresponding to the 3 groups, and takes these 3 files as the content identification templates corresponding to the predetermined filtering condition C1.
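The screening and clustering of this example can be sketched together. The URL expression is the one quoted in the example, used verbatim as a Python regular expression. Grouping pages by an identical tag sequence is a deliberately crude, assumed stand-in for clustering on DOM tree information such as node count and topology; the sample pages are well-formed so that `xml.etree.ElementTree` can parse them, and the names `signature` and `filter_and_cluster` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

URL_RULE = re.compile(r"http://www.abc.com/news*.*html")  # condition C1 as quoted


def signature(root):
    """Structural fingerprint of a DOM tree: its tag names in document order."""
    return tuple(element.tag for element in root.iter())


def filter_and_cluster(pages):
    """pages maps URL -> markup. Returns lists of URLs, one list per group."""
    groups = defaultdict(list)
    for url, markup in pages.items():
        if URL_RULE.fullmatch(url):  # screening step
            groups[signature(ET.fromstring(markup))].append(url)  # grouping step
    return list(groups.values())


pages = {
    "http://www.abc.com/news/1.html": "<html><body><h1>t</h1><p>x</p></body></html>",
    "http://www.abc.com/news/2.html": "<html><body><h1>u</h1><p>y</p></body></html>",
    "http://www.abc.com/news/list.html": "<html><body><ul><li>a</li></ul></body></html>",
    "http://www.abc.com/sports/3.html": "<html><body><h1>v</h1><p>z</p></body></html>",
}
groups = filter_and_cluster(pages)
print(len(groups))
```

Here the sports page is screened out by the URL condition, and the remaining three pages fall into two structural groups: the two article pages and the list page.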

Preferably, the process further comprises step S7' (not shown), step S8' (not shown) and step S9' (not shown). In step S7', the template providing equipment 1 obtains the predetermined filtering condition satisfied by other markup language files whose body content is to be identified; then, in step S8', it selects the content identification template corresponding to the predetermined filtering condition that those other markup language files satisfy; then, in step S9', it identifies the body content of the other markup language files according to the selected content identification template.

Specifically, in step S7', the template providing equipment 1 obtains other markup language files whose body content is to be identified from a third-party device, for example when triggered by a predetermined condition or event, or periodically, and matches them against each predetermined filtering condition to obtain the filtering condition that each such markup language file satisfies. Then, in step S8', according to the filtering condition obtained in step S7', the template providing equipment 1 retrieves the corresponding one or more content identification templates that it obtained in step S4', extracts the body content node information, such as the XPath, from each content identification template, and matches that node information against the DOM tree corresponding to the other markup language files according to a predetermined matching rule, to obtain the content identification template corresponding to those other markup language files. The matching rule includes, but is not limited to:

Then, in step S9', the template providing equipment 1 extracts each piece of body content node information from the content identification template it obtained in step S8', locates the body content nodes in the DOM tree of the other markup language files according to that node information, and obtains the body content from those nodes and their subtree nodes.
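Applying a stored template to a new page can be sketched as below. This is a simplified illustration under several assumptions: templates are kept in a hypothetical dictionary keyed by the URL expression of their filtering condition, the node paths use the limited XPath subset supported by `xml.etree.ElementTree`, the page is well-formed markup, and the additional matching rule used to choose among several templates is omitted.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template store: filtering condition (URL expression) -> node paths.
TEMPLATES = {
    r"http://www.abc.com/news*.*html": ["./body/div[2]"],
}


def extract_body(markup, paths):
    """Collect the text of every node, including its subtree, matched by the paths."""
    root = ET.fromstring(markup)
    parts = []
    for path in paths:
        for node in root.findall(path):
            words = [t.strip() for t in node.itertext() if t.strip()]
            if words:
                parts.append(" ".join(words))
    return parts


def identify_body(url, markup):
    """Select the template whose filtering condition the URL satisfies, then apply it."""
    for rule, paths in TEMPLATES.items():
        if re.fullmatch(rule, url):
            return extract_body(markup, paths)
    return []


page = ("<html><body><div>menu</div>"
        "<div><h1>Title</h1><p>Story text.</p></div></body></html>")
result = identify_body("http://www.abc.com/news/42.html", page)
print(result)  # ['Title Story text.']
```

A page whose URL satisfies no stored filtering condition yields an empty result in this sketch, which leaves the handling of unmatched pages to the caller.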

To those skilled in the art, it is evident that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced by the invention. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" evidently does not exclude other units or steps, and the singular does not exclude the plural. Several units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.

Claims

1. A computer-implemented method for identifying the body content of markup language files, wherein the method comprises the following steps:

a. obtaining multiple markup language files to be processed;

b1. screening the multiple markup language files according to a predetermined filtering condition, to obtain at least one markup language file that satisfies the predetermined filtering condition;

b2. clustering the at least one markup language file according to the relevant information of the DOM tree corresponding to the at least one markup language file, to obtain one or more groups of markup language files;

d. obtaining, according to the obtained body content nodes, a content identification template for identifying the body content of the group of markup language files, and thereby obtaining one or more content identification templates corresponding to the predetermined filtering condition, for identifying the body content of other markup language files that satisfy the predetermined filtering condition.

2. The method according to claim 1, wherein the relevant information of the DOM tree comprises at least any one of the following:

- the number of nodes of the DOM tree;

- the topology information of the DOM tree.

3. The method according to claim 1, wherein step c specifically comprises:

- comparing and analyzing the content of corresponding nodes in the DOM trees corresponding to the markup language files in each group, to obtain the similarity of the content;

- determining the body content nodes according to the similarity.

4. The method according to claim 1, wherein the step, in step d, of obtaining the content identification template for identifying the body content of the group of markup language files specifically comprises:

- obtaining the path information of the body content nodes in the DOM tree;

- adding the path information to the content identification template corresponding to the group of markup language files, to obtain the content identification template for identifying the body content of the group of markup language files.

5. The method according to claim 1, wherein the method further comprises:

- obtaining, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files;

wherein step c specifically comprises:

- comparing and analyzing the content of corresponding nodes in the DOM trees corresponding to the markup language files in each of the at least one group of markup language files so obtained, to obtain the body content nodes.

6. The method according to claim 5, wherein the predetermined rule comprises obtaining the at least one group of markup language files based on at least any one of the following:

- the number of markup language files in the group of markup language files;

- the number of nodes of the DOM trees corresponding to the markup language files in the group of markup language files.

7. The method according to any one of claims 1 to 6, wherein the method further comprises:

- marking, according to the body content contained in the body content node, body-content-related information corresponding to the body content node in the content identification template for identifying the body content of the group of markup language files;

wherein the body-content-related information comprises at least any one of the following:

- the type information of the body content;

- the display priority of the body content.

8. The method according to claim 1, wherein the predetermined filtering condition screens the multiple markup language files based on at least any one of the following:

- the network address of the markup language file;

- the website to which the markup language file belongs.

9. The method according to claim 8, wherein the method further comprises:

- obtaining the predetermined filtering condition satisfied by other markup language files whose body content is to be identified;

- selecting the content identification template corresponding to the predetermined filtering condition satisfied by the other markup language files;

- identifying the body content of the other markup language files according to the selected content identification template.

10. The method according to claim 1, wherein the markup language files comprise at least any one of the following:

- HTML files;

- XHTML files;

- XML files.

11. Equipment for identifying the body content of markup language files, wherein the equipment comprises:

a first acquisition device, for:

- screening the multiple markup language files according to a predetermined filtering condition, to obtain at least one markup language file that satisfies the predetermined filtering condition;

- clustering the at least one markup language file according to the relevant information of the DOM tree corresponding to the at least one markup language file, to obtain one or more groups of markup language files;

a template acquisition device, for obtaining, according to the obtained body content nodes, a content identification template for identifying the body content of the group of markup language files, and thereby obtaining one or more content identification templates corresponding to the predetermined filtering condition, for identifying the body content of other markup language files that satisfy the predetermined filtering condition.

12. The equipment according to claim 11, wherein the relevant information of the DOM tree comprises at least any one of the following:

- the number of nodes of the DOM tree;

- topology information of the DOM tree.
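
As an illustration of clustering on these two kinds of DOM tree information, the sketch below groups files whose trees have the same node count and the same tag-only topology. Exact matching is an assumption made for brevity (the claims do not fix a distance measure), and the file names and helper names are hypothetical. The standard-library parser used here requires well-formed markup.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def node_count(root):
    """Number of element nodes in the DOM tree."""
    return sum(1 for _ in root.iter())


def topology(root):
    """Tag-only structure of the tree as a nested tuple (text is ignored)."""
    return (root.tag, tuple(topology(child) for child in root))


def cluster(files):
    """Group file names whose trees share node count and topology."""
    groups = defaultdict(list)
    for name, markup in files.items():
        root = ET.fromstring(markup)
        groups[(node_count(root), topology(root))].append(name)
    return sorted(groups.values())


FILES = {
    "a.html": "<html><body><h1>A</h1><p>x</p></body></html>",
    "b.html": "<html><body><h1>B</h1><p>y</p></body></html>",
    "c.html": "<html><body><ul><li>1</li></ul></body></html>",
}
groups = cluster(FILES)
```

All three trees have four nodes, so node count alone would not separate them; the topology distinguishes `c.html` from the other two.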

13. The equipment according to claim 11, wherein the comparison and analysis device comprises:

a similarity acquiring unit, configured to compare and analyze the content of corresponding nodes in the DOM trees corresponding to the markup language files of each group, to obtain the similarity of the content;

a node acquiring unit, configured to determine the body content node according to the similarity.
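
The two units can be sketched as follows, assuming the files of a group share one DOM topology so that corresponding nodes sit at the same positional path. The similarity measure (`difflib.SequenceMatcher`), the 0.5 threshold, and the rule that low similarity marks a body content node are illustrative assumptions; the rule follows the description's notion that body content is what differs between otherwise similar pages. At least two files are required.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations


def node_texts(node, path="0"):
    """Yield (positional path, own text) for every node of the tree."""
    yield path, (node.text or "").strip()
    for i, child in enumerate(node):
        yield from node_texts(child, f"{path}/{i}")


def body_content_paths(markups, threshold=0.5):
    """Return paths whose content has low mean pairwise similarity."""
    tables = [dict(node_texts(ET.fromstring(m))) for m in markups]
    result = []
    for path in tables[0]:
        texts = [table.get(path, "") for table in tables]
        if not any(texts):
            continue  # purely structural node without text of its own
        pairs = list(combinations(texts, 2))
        similarity = sum(
            SequenceMatcher(None, a, b).ratio() for a, b in pairs
        ) / len(pairs)
        if similarity < threshold:
            result.append(path)
    return result


GROUP = [
    "<html><body><div>Home | News</div><p>Storm hits coast</p></body></html>",
    "<html><body><div>Home | News</div><p>Budget vote delayed</p></body></html>",
]
paths = body_content_paths(GROUP)
```

Here the navigation `div` is identical in both files and is discarded, while the `p` node, whose text differs, is reported as the body content node.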

14. The equipment according to claim 11, wherein the operation, performed by the template acquisition device, of obtaining the content identification template for identifying the body content of the group of markup language files specifically comprises:

15. The equipment according to claim 11, wherein the equipment further comprises:

a second acquisition device, configured to obtain, according to a predetermined rule, at least one group of markup language files from the one or more groups of markup language files;

wherein the comparison and analysis device is specifically configured to:

16. The equipment according to claim 15, wherein the predetermined rule comprises obtaining the at least one group of markup language files based on at least any one of the following:

- the number of markup language files in the group of markup language files;

17. The equipment according to any one of claims 11 to 16, wherein the equipment further comprises:

a template annotation device, configured to annotate, according to the body content contained in the body content node, body content related information corresponding to the body content node in the content identification template used to identify the body content of the group of markup language files;

wherein the body content related information comprises at least any one of the following:

- type information of the body content;

- a display priority of the body content.

18. The equipment according to claim 11, wherein the predetermined filtering condition screens the plurality of markup language files based on at least any one of the following:

- the URL of the markup language file;

- the website to which the markup language file belongs.

19. The equipment according to claim 18, wherein the equipment further comprises:

a filtering condition acquisition device, configured to obtain the predetermined filtering condition satisfied by other markup language files whose body content is to be identified;

a template selection device, configured to select the content identification template corresponding to the predetermined filtering condition satisfied by the other markup language files;

a body content recognition device, configured to identify the body content of the other markup language files according to the selected content identification template.

20. The equipment according to claim 11, wherein the markup language file comprises at least any one of the following:

- an HTML file;

- an XHTML file;

- an XML file.