CN102955796B - Method for deriving record templates based on frequent subtrees

Info

Publication number: CN102955796B
Application number: CN201110245084.1A
Authority: CN
Inventors: 徐鹏; 陈正
Original assignee: Microsoft Technology Licensing LLC
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2011-08-16
Filing date: 2011-08-16
Publication date: 2017-06-27
Anticipated expiration: 2031-08-16
Also published as: CN102955796A

Abstract

The invention discloses a method for deriving record templates based on frequent subtrees. The method includes mining maximal frequent subtrees and closed frequent subtrees from a web page, and grouping the mined maximal and closed frequent subtrees. A template subtree, which is a template containing only the essential nodes, is identified within the groups. Optional nodes are identified and noise is eliminated based on the identified template subtree, and structured information is extracted from each node based on the template subtree.

Description

Method for deriving record templates based on frequent subtrees

Technical field

The present invention relates to the derivation of record templates, and more specifically to a method for deriving record templates based on frequent subtrees.

Background

Computer networking technology is becoming an important part of daily life. The World Wide Web is becoming a part of people's daily lives, used for work, entertainment, research, and so on. It can be said that the World Wide Web is currently the largest source of information. A large amount of online information takes the form of structured data entities generated from templates, such as the product descriptions in a product list on a shopping page. Typically, such a structured data entity is called a data record. To give all data records in a data record list a similar appearance and layout, the data records are usually generated with a record template developed in advance. Specifically, each data record in the list is generated dynamically by filling values from an underlying database into the fields of the same pre-developed record template.

However, a record template, presented here in the form of a tag tree or a tag forest, is explicitly defined in the server-side code but hidden in the HTML source code at the client. Because a record template makes it easy to extract structured data from web pages and to merge data from different websites, deriving these record templates from web pages is very useful for the client. Deriving record templates also benefits applications such as product search, meta search, and data fusion.

Thus, a method that can derive record templates from web pages is needed.

Summary of the invention

According to one embodiment, the present invention describes a method for deriving a data record template from a web page. First, maximal frequent subtrees and closed frequent subtrees are mined from the web page. To do so, the frequent 1-subtrees of the Document Object Model (DOM) tree of the web page are computed first; a frequent 1-subtree is a frequent subtree of the DOM tree that has only one node. Then, based on the computed frequent 1-subtrees, all derived subtrees, ordered by frequency, are enumerated by rightmost extension. Rightmost extension grows a frequent subtree by iteratively attaching a new subtree to a node on the rightmost branch of the frequent subtree. All derived subtrees are then divided into maximal frequent subtrees and closed frequent subtrees, which completes the mining. The mined maximal and closed frequent subtrees are then grouped. A template subtree (TEN), which is a template containing only the essential nodes, is identified within the groups. With the template subtree, optional nodes can be identified and noise eliminated. Finally, structured information is extracted from each node based on the template subtree, and the data record template is thereby derived.

According to another embodiment, the invention further describes a method for mining subtree structures. First, the frequent 1-subtrees of the DOM tree of a web page are computed; a frequent 1-subtree is a frequent subtree of the DOM tree that has only one node. Then, based on the computed frequent 1-subtrees, all derived subtrees, ordered by frequency, are enumerated by rightmost extension. Rightmost extension grows a frequent subtree by iteratively attaching a new subtree to a node on the rightmost branch of the frequent subtree. All derived subtrees are then divided into maximal frequent subtrees and closed frequent subtrees.

According to yet another embodiment, the invention further describes a method for deriving a data record template from subtree structures. First, the maximal and closed frequent subtrees mined from a web page are grouped. Then a template subtree (TEN), which is a template containing only the essential nodes, is identified within the groups. With the template subtree, optional nodes can be identified and noise eliminated. Finally, structured information is extracted from each node based on the template subtree, and the data record template is thereby derived.

Brief description of the drawings

The above and other features, properties, and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings and embodiments. In the drawings, identical reference numerals denote identical features throughout, wherein:

Fig. 1 is a flowchart of the method for mining subtree structures according to the invention;

Fig. 2 is a flowchart of the method for deriving a data record template from subtree structures according to the invention;

Fig. 3 is a flowchart of the method for deriving a data record template from a web page according to the invention.

Detailed description

The present invention uses an unsupervised algorithm, the Robust Web Data Extraction (RWDE) algorithm, which is designed to identify data records in complex web pages and derive the corresponding record templates. RWDE processes a web page as a Document Object Model (DOM) tree rather than as a string of tags. The DOM is a programming interface for HTML and XML documents; it provides a structured representation of a document and allows the document's content and presentation to be changed. In the present invention, RWDE is based on an analysis of the frequent subtrees of the DOM tree of a web page. These frequent subtrees are generated from single nodes by rightmost extension, which grows a frequent subtree by attaching new nodes only to its rightmost branch. Because of optional nodes and nodes from which structured data can be extracted, records generally do not match one another exactly. In practice, one or more maximal frequent subtrees and multiple closed frequent subtrees can be generated from the main record list of a web page. The present invention shows that among these maximal and closed frequent subtrees there must exist a template subtree (TEN) that contains only the essential nodes of the record template. To find this template subtree, the invention proposes a metric called weighted F1 that assesses how likely each frequent subtree is to be the template subtree. When nodes from which structured data can be extracted are present, multiple maximal frequent subtrees are usually generated from the data record list. Structured data information can be identified by aligning these maximal frequent subtrees. In general, a record template consists of essential nodes, optional nodes, and nodes from which structured data can be extracted. By identifying these three kinds of nodes, the record template of a web page can be derived.

The RWDE algorithm used by the present invention is based on mining closed frequent subtrees and maximal frequent subtrees. Given the frequent subtrees of the DOM tree of a web page, RWDE derives the record template through the following steps: (1) group the frequent subtrees; (2) identify optional nodes and eliminate noise; (3) identify the nodes from which structured data can be extracted.

Referring now to Fig. 1, Fig. 1 depicts a method 100 for mining subtree structures according to the invention. RWDE uses CMTreeMiner to mine the maximal and closed frequent subtrees of the DOM tree of a web page. CMTreeMiner discovers only closed frequent subtrees and maximal frequent subtrees, not all frequent subtrees. In step 102, CMTreeMiner first computes the frequent 1-subtrees, which are frequent subtrees having only one node. Then, in step 104, based on the computed frequent 1-subtrees, the derived subtrees, ordered by frequency, are enumerated by rightmost extension, which grows a frequent subtree by iteratively attaching a new subtree to a node on its rightmost branch. Through this enumeration, all derived subtrees in the tree database can be listed.
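The candidate-generation part of rightmost extension can be illustrated with a minimal sketch. It assumes an ordered tree encoded as a preorder list of (depth, label) pairs and attaches one new node at a time; the encoding and the function name `rightmost_extensions` are illustrative assumptions, not part of CMTreeMiner, and support counting of the candidates is omitted.

```python
def rightmost_extensions(tree, labels):
    """Return every tree obtained by attaching one new node to the rightmost branch.

    tree   -- preorder list of (depth, label) pairs; the root has depth 0
    labels -- candidate labels for the new node (hypothetical input)
    """
    last_depth = tree[-1][0]  # depth of the last node in preorder (end of the rightmost branch)
    candidates = []
    # A new node at depth d becomes the last child of the rightmost-branch node at depth d - 1.
    for depth in range(1, last_depth + 2):
        for label in labels:
            candidates.append(tree + [(depth, label)])
    return candidates


pattern = [(0, "ul"), (1, "li")]
extensions = rightmost_extensions(pattern, ["a", "img"])
```

Because a new node is only ever appended at the end of the preorder sequence, each ordered tree is generated exactly once, which is why enumeration can proceed without duplicate checks.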

CMTreeMiner was originally designed to mine maximal and closed frequent subtrees from a database consisting of multiple trees. To improve the efficiency of CMTreeMiner and apply it in RWDE, which needs to mine maximal and closed frequent subtrees from a single tree, the present invention makes the following modifications to CMTreeMiner:

1) Frequent 1-subtrees are created using tag paths. The tag path of a node in a tree is the sequence of nodes from the root of the tree to that node. When creating frequent 1-subtrees, the invention uses the tag path of a node, rather than the node's original label, as the label for that node. The reason for this modification is that, in a DOM tree, the data records in a data record list are usually located under the same parent node, so the roots of the data records share the same tag path. This modification improves the efficiency of CMTreeMiner.

2) Occurrence-based support, rather than transaction-based support, is used to determine whether a frequent subtree is a closed frequent subtree or a maximal frequent subtree. The reason for this modification is that the input to RWDE is a single DOM tree, for which the transaction-based support of any subtree is either 0 or 1.
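The two modifications can be illustrated together with a small sketch. It assumes a DOM tree given as nested `(tag, children)` tuples; the helper names and the `min_support` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tag_paths(node, prefix=()):
    """Yield the tag path (tuple of tags from the root) of every node in the tree."""
    tag, children = node
    path = prefix + (tag,)
    yield path
    for child in children:
        yield from tag_paths(child, path)


def occurrence_support(trees):
    """Count every occurrence of each tag path across all trees."""
    counts = Counter()
    for tree in trees:
        counts.update(tag_paths(tree))
    return counts


def transaction_support(trees):
    """Count each tag path at most once per tree (transaction)."""
    counts = Counter()
    for tree in trees:
        counts.update(set(tag_paths(tree)))
    return counts


def frequent_one_subtrees(tree, min_support=2):
    """Single-node patterns labeled by tag path that occur at least min_support times."""
    return {p: s for p, s in occurrence_support([tree]).items() if s >= min_support}


dom = ("html", [("body", [
    ("ul", [("li", [("a", [])]), ("li", [("a", [])]), ("li", [])]),
    ("div", []),
])])
frequent = frequent_one_subtrees(dom)
```

With a single input tree, `transaction_support` can never exceed 1 for any path, so it cannot separate frequent patterns from infrequent ones, whereas the occurrence-based count does.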

In step 106, all derived subtrees are divided into maximal frequent subtrees and closed frequent subtrees. A current subtree is defined as maximal when it has only one frequent proper supertree and that supertree satisfies the following conditions: (1) the supertree can only be generated by adding a node as the parent of the root of the current subtree; (2) the support of the supertree is less than the support of the current subtree.

Referring now to Fig. 2, Fig. 2 depicts a method 200 for deriving a data record template from subtree structures according to the invention. A web page usually contains multiple data record lists, but in most web pages one of them is the main data record list. The main data record list contains most of the content of the page; it typically appears in the center of the page and occupies most of its area. RWDE aims to derive the data record template of the main data record list.

First, the present invention puts forward the following proposition.

For a data record list generated with a record template, there always exists a closed frequent subtree or maximal frequent subtree that occurs in every data record. This closed or maximal frequent subtree is defined as the template subtree, TEN (that is, the template containing only the essential nodes).

A record template consists of essential nodes, optional nodes, and nodes from which structured data can be extracted. In real-world web pages, optional nodes and nodes from which structured data can be extracted are always at the bottom of the record template, and all child nodes of these nodes are also optional nodes or nodes from which structured data can be extracted. The root of a record template therefore cannot be optional; in other words, there must be at least one essential node that occurs in every data record. Because all frequent subtrees can be enumerated by rightmost extension, the TEN is certain to be generated. When the current tree in the tree expansion process is the TEN, adding a new optional node or a node from which structured data can be extracted reduces its support, because such a node does not occur in every data record. It follows that the TEN is a closed or maximal frequent subtree.

In step 202, the maximal and closed frequent subtrees mined from the web page are grouped. Specifically, they are grouped based on the belonging relation between maximal frequent subtrees and closed frequent subtrees, for which the present invention gives the following definition.

A closed frequent subtree P_c of a tree T belongs to a maximal frequent subtree P_m of the same tree if and only if, for every node N_c ∈ P_c, there exists a node N_m ∈ P_m such that the occurrences of N_m are a subset of the occurrences of N_c.

Belonging relations are identified by comparing each pair of a closed frequent subtree and a maximal frequent subtree. Some of these relations can be detected during tree expansion, because all closed frequent subtrees generated in the course of generating a maximal frequent subtree necessarily belong to that maximal frequent subtree. After step 202, the frequent subtrees are divided into several groups, each containing one maximal frequent subtree and multiple closed frequent subtrees that belong to it.
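The belonging test and the resulting grouping can be sketched as follows. The sketch assumes each frequent subtree is represented as a dictionary that maps a pattern node to the set of DOM node ids at which it occurs; this representation and the names `belongs_to` and `group_by_maximal` are illustrative assumptions.

```python
def belongs_to(closed, maximal):
    """True if every node of the closed subtree has a node of the maximal subtree
    whose occurrences are a subset of its own occurrences."""
    return all(
        any(occ_m <= occ_c for occ_m in maximal.values())
        for occ_c in closed.values()
    )


def group_by_maximal(closed_subtrees, maximal_subtrees):
    """Map each maximal subtree name to the names of the closed subtrees that belong to it."""
    return {
        m_name: [c_name for c_name, c in closed_subtrees.items() if belongs_to(c, m)]
        for m_name, m in maximal_subtrees.items()
    }


# Hypothetical occurrence data: integers stand for DOM node ids.
maximal_subtrees = {"M1": {"li": {1, 2, 3}, "a": {11, 12, 13}}}
closed_subtrees = {
    "C1": {"li": {1, 2, 3, 4}},   # a smaller pattern that also matches node 4
    "C2": {"td": {50, 51}},       # unrelated pattern
}
groups = group_by_maximal(closed_subtrees, maximal_subtrees)
```

This pairwise comparison is the straightforward form of the check; the text notes that part of it can be avoided by recording relations during tree expansion.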

In step 204, based on the frequent subtree groups generated in step 202, the optional nodes of the record template are identified and the noise around the data record list is eliminated. Before describing this step, the invention first states an observation about maximal frequent subtrees and the main data record list of a web page.

In a typical web page, among all maximal frequent subtrees mined from the DOM tree of the page, there is at least one maximal frequent subtree all of whose occurrences are main data records.

For the main data record list, multiple frequent subtrees can be generated because the data records in the list generally follow the same record template. In a typical web page, the tag tree of the main data records differs from other subtrees. Therefore, once the tag tree has grown to the point where no more of its own nodes can be added, nodes from other subtrees cannot be added either, and a maximal frequent subtree is generated. Because this maximal frequent subtree closely resembles the record template and, in most cases, the template of the main data records is not similar to other subtrees, the occurrences of these maximal frequent subtrees are main data records.

For each maximal frequent subtree, a frequent subtree group is generated by step 202. There is thus at least one frequent subtree group whose maximal frequent subtree matches some main data records, and the TEN of the main data records lies within such a group. By identifying the TEN in these groups, the optional nodes of the main data record template can be identified and the noise around the main data record list eliminated.

RWDE does not filter for the frequent subtree groups whose maximal frequent subtrees match main data records; instead, it computes the weighted F1 for every frequent subtree in every group. RWDE can derive the template of a data record list if the list satisfies the following condition: one or more maximal frequent subtrees match some of the data records in the list, and all occurrences of these maximal frequent subtrees are data records in that list. RWDE outputs all templates it derives. Preprocessing or postprocessing steps can be used to find the template of the main data record list. The method described below applies to deriving the templates of all data record lists that satisfy the above condition.

The optional-node identification procedure is shown below in code form.

As the above procedure shows, the root occurrences (RO) are first grouped by parent id (row 1). Then the groups that contain the root of a maximal frequent subtree are selected, and all positions in these groups are added to the candidate record root positions, CRRP (rows 2-6). This step assumes that the occurrences of maximal frequent subtrees are data records. A subtree whose root position is in CRRP is probably a record, because its occurrences share the same parent with the occurrences of the maximal frequent subtrees and are similar to them. Based on CRRP, the weighted F1 metric is computed for each frequent subtree, and the frequent subtree with the highest weighted F1 is taken as the TEN (rows 7-13). Finally, the following nodes of the maximal frequent subtree are identified as optional nodes: nodes for which the selected TEN has no matching node (rows 14-18).
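The first part of this procedure (grouping root occurrences by parent and collecting CRRP) might be sketched as follows. This is not the original listing; it is a minimal sketch that assumes each root occurrence is a `(parent_id, position)` pair, where `position` is taken to be the child index under that parent. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def candidate_record_root_positions(root_occurrences, maximal_roots):
    """Collect CRRP: all root positions that share a parent with a maximal-subtree root.

    root_occurrences -- iterable of (parent_id, position) for all frequent subtree roots
    maximal_roots    -- set of (parent_id, position) for maximal frequent subtree roots
    """
    # Group root occurrences by the id of their parent node.
    groups = defaultdict(list)
    for parent_id, position in root_occurrences:
        groups[parent_id].append(position)

    # Keep only the groups that contain at least one maximal-subtree root.
    crrp = []
    for parent_id, positions in groups.items():
        if any((parent_id, pos) in maximal_roots for pos in positions):
            crrp.extend((parent_id, pos) for pos in positions)
    return sorted(crrp)


roots = [(10, 0), (10, 1), (10, 2), (20, 0)]
crrp = candidate_record_root_positions(roots, maximal_roots={(10, 1)})
```

Positions under parent 20 are dropped here because no maximal frequent subtree is rooted there, which mirrors the assumption that occurrences of maximal frequent subtrees are data records.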

Within a subtree group, a frequent subtree with fewer nodes generally has a larger support. In real-world web pages, not every position in CRRP is the root of a data record. CRRP therefore contains noise, which is usually located at the two ends of the data record list, for example pagination labels, advertisements, and sorting labels. Because such noise has a structure similar to a data record (at least the same root label) and shares the same parent with the data records, it is placed in the same group. For this reason, a metric called weighted F1 is defined to assess the noise around the data record list and to identify the TEN. A frequent subtree with fewer occurrences at positions outside CRRP and more occurrences in the middle of CRRP has a higher weighted F1. Ideally, the frequent subtree with the highest weighted F1 in a subtree group is the TEN of the corresponding data record list. For convenience, the invention refers to the frequent subtree with the highest weighted F1 in a subtree group as the HWFT.

Weight function: w_i = -20i²/n² + 20i/n - 2    (1)

The positions in CRRP are first sorted in ascending order. Let n be the length of CRRP; for each position with index i, its weight w_i is computed by formula (1). When i = n/2, w_i reaches its maximum of 3; when a position is at the head or tail of the position list (i = 0 or i = n), w_i takes the negative value -2. The closer a position is to the middle of CRRP, the larger its w_i. For higher accuracy, the weight function could be learned from sample web pages, but the present invention uses the heuristic function above.
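Formula (1) can be checked numerically with a short sketch; the function name `position_weight` is an illustrative choice.

```python
def position_weight(i, n):
    """Weight of the position with index i in a sorted CRRP of length n, per formula (1)."""
    return -20 * i**2 / n**2 + 20 * i / n - 2


n = 10
middle = position_weight(n // 2, n)   # index at the middle of the list
head = position_weight(0, n)          # first position
tail = position_weight(n, n)          # i = n, the tail case named in the text
```

The function is a downward-opening parabola in i/n, so positions near either end receive negative weights while positions near the middle receive positive ones.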

Weighted precision:

For each position p_i at which the root of a frequent subtree occurs, ψ(p_i) equals w_i if the position is in CRRP, where i is the index of p_i in the sorted CRRP; otherwise ψ(p_i) equals 0. The weighted precision of each frequent subtree in a subtree group is then computed by formula (2), where RootOccurrences is the set of root occurrences of the current frequent subtree.

The traditional definition of recall is as follows:

The weighted F1 can now be obtained from recall and weighted precision:

When noise exists at either end of the data record list, the weighted F1 can correctly assess the noise and find the TEN. When no noise exists at the ends of the data record list, the weighted F1 is still effective as long as there is no frequent subtree that matches only the data records whose w_i is greater than 0. Empirically, this extreme case seldom occurs.
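The following sketch shows how a weighted F1 of this kind could be computed. The formulas for weighted precision, recall, and F1 are not reproduced in this text, so the forms used here are assumptions: weighted precision is taken as the sum of ψ(p_i) divided by the number of root occurrences, recall as the fraction of CRRP positions matched, and F1 as their harmonic mean. All function names are hypothetical.

```python
def weight(i, n):
    """Formula (1): weight of index i in a sorted CRRP of length n."""
    return -20 * i**2 / n**2 + 20 * i / n - 2


def weighted_f1(root_occurrences, crrp):
    """Weighted F1 of one frequent subtree (assumed formulas, see lead-in).

    root_occurrences -- non-empty list of positions where the subtree's root occurs
    crrp             -- non-empty list of candidate record root positions
    """
    ordered = sorted(crrp)
    n = len(ordered)
    index = {pos: i for i, pos in enumerate(ordered)}
    # psi(p) is the position weight inside CRRP and 0 outside it.
    psi = [weight(index[p], n) if p in index else 0.0 for p in root_occurrences]
    precision = sum(psi) / len(root_occurrences)                    # assumed formula (2)
    recall = sum(1 for p in root_occurrences if p in index) / n     # assumed recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)            # assumed F1


crrp = list(range(11))                                    # positions 0 and 10 play the role of noise
f1_template = weighted_f1(list(range(1, 10)), crrp)       # matches only the inner positions
f1_too_general = weighted_f1(list(range(11)), crrp)       # also matches both ends
f1_outside = weighted_f1(list(range(1, 10)) + [100, 101], crrp)  # also occurs outside CRRP
```

Under these assumed formulas, the subtree that skips the two end positions scores slightly higher than the one matching every position, and occurrences outside CRRP lower the score, which agrees with the behavior described above.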

Note that the noise discussed above is noise that resembles data records. Noise that does not resemble data records (that is, noise with a different root label) is eliminated during the mining of closed and maximal frequent subtrees.

Referring again to Fig. 2, in step 206 the nodes from which structured data can be extracted are identified. Such nodes are also frequently included in data record templates. Extraction here refers to a logical OR: when only one of several nodes or node sets that share the same parent node and sibling nodes can appear in each record, these nodes or node sets are called nodes from which structured data can be extracted. For records that contain different mutually exclusive nodes, these nodes can match nodes of different frequent subtrees. In the subtree expansion process, when the node to be added is such a node, there is always more than one extension branch. Ideally, more than one frequent subtree is generated; these frequent subtrees belong to different subtree groups, but the HWFTs of these groups are usually identical.

Based on the above analysis, nodes from which structured data can be extracted are generally identified by aligning the maximal frequent subtrees (with optional information) whose corresponding HWFTs are identical. The alignment is based not on node labels but on occurrences: if the occurrences of two frequent subtrees are identical, the two frequent subtrees match. Optional nodes in the frequent subtrees that have the same sibling nodes and parent node but never occur together as child nodes of the same node in the DOM tree can be regarded as nodes from which structured data can be extracted. At present, RWDE handles only the simple case in which two leaf nodes are mutually exclusive.
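The final test described above (same parent, same siblings, never occurring together) can be sketched for the simple two-leaf case. The sketch assumes each optional node is described by its template parent, its set of essential siblings, and the set of DOM parent ids under which it occurs; this layout and the name `mutually_exclusive_pairs` are illustrative assumptions.

```python
from itertools import combinations


def mutually_exclusive_pairs(optional_nodes):
    """Return pairs of optional nodes that share parent and siblings but never co-occur.

    optional_nodes -- dict: name -> (parent, siblings, parent_occurrences), where
                      parent_occurrences is the set of DOM parent ids containing the node
    """
    pairs = []
    for a, b in combinations(sorted(optional_nodes), 2):
        parent_a, siblings_a, occ_a = optional_nodes[a]
        parent_b, siblings_b, occ_b = optional_nodes[b]
        same_context = parent_a == parent_b and siblings_a == siblings_b
        # Disjoint occurrence sets mean the two nodes never appear under the same DOM node.
        if same_context and occ_a.isdisjoint(occ_b):
            pairs.append((a, b))
    return pairs


siblings = frozenset({"a", "img"})
optional_nodes = {
    "span.price": ("li", siblings, {1, 3}),
    "span.sold_out": ("li", siblings, {2, 4}),
    "em": ("li", siblings, {1, 2}),
}
pairs = mutually_exclusive_pairs(optional_nodes)
```

In this hypothetical data, `em` co-occurs with each of the other two nodes in at least one record, so it remains a plain optional node, while the two `span` nodes form an either-or pair.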

Referring to Fig. 3, Fig. 3 shows a method 300 for deriving a data record template from a web page according to the invention. In step 302, maximal frequent subtrees and closed frequent subtrees are mined from the web page. Step 302 includes steps 304, 306, and 308. In step 304, the frequent 1-subtrees are computed; a frequent 1-subtree is a frequent subtree that has only one node in the DOM tree of the web page. In step 306, based on the computed frequent 1-subtrees, the derived subtrees, ordered by frequency, are enumerated by rightmost extension, which grows a frequent subtree by iteratively attaching a new subtree to a node on its rightmost branch. In step 308, all derived subtrees are divided into maximal frequent subtrees and closed frequent subtrees. In step 310, the maximal and closed frequent subtrees mined from the web page are grouped. In step 312, a template subtree (TEN), which is a template containing only the essential nodes, is identified within the groups, and optional nodes are identified and noise eliminated based on the template subtree. In step 314, the nodes from which structured data can be extracted are identified based on the template subtree.

In addition to the data record templates described above, the present invention can also be applied to nested data records and forest-like record templates.

A nested data record is a data record that contains one or more nested lists with repeating patterns. RWDE can derive both the record template and the templates of the nested lists, and it can also identify the relationship between them. This is because, in RWDE, subtree expansion starts from every frequent tag path. For a record list containing more than two data records, the node whose occurrences are the roots of the data records is necessarily frequent, and expansion starting from that node generates the record template. Expansion can also start at the root of a nested list, but such expansion would ordinarily generate no maximal frequent subtree: if the node whose occurrences are the parents of the nested list roots is added to the frequent subtree, a new frequent subtree is generated, so these frequent subtrees are not maximal. However, under the second modification to CMTreeMiner described above, RWDE defines one or more of these frequent subtrees as "maximal", because the support of the nested list root is greater than the support of its parent. Nested list templates can therefore also be derived, and their nesting relationship is easily identified from the positions of their occurrences.

When the record template is a forest, multiple frequent patterns (subtrees with optional and extraction information) can be derived from the data record list, and the record template is a sequence of such patterns. Various rules can be used to detect the correct sequence of patterns. In the RWDE implementation of the present invention, the sequence of frequent patterns that matches more nodes in the data record list and has more support is selected as the record template.

The embodiments above are provided to enable persons skilled in the art to make or use the present invention. Persons skilled in the art may make various modifications or changes to the above embodiments without departing from the inventive concept; the scope of protection of the invention is therefore not limited by the above embodiments but should be the widest scope consistent with the inventive features recited in the claims.

Claims

1. A method for deriving a data record template from a web page, the method comprising:

mining maximal frequent subtrees and closed frequent subtrees from the web page, including:

computing frequent 1-subtrees, a frequent 1-subtree being a frequent subtree that has only one node in the Document Object Model (DOM) tree of the web page;

based on the frequent 1-subtrees, enumerating derived subtrees, ordered by frequency, by rightmost extension, the rightmost extension including growing a frequent subtree by iteratively attaching a new subtree to a node on the rightmost branch of the frequent subtree;

dividing all derived subtrees into maximal frequent subtrees and closed frequent subtrees, wherein the derived subtrees are ordered by frequency;

grouping the maximal frequent subtrees and closed frequent subtrees mined from the web page;

identifying a template subtree (TEN) within the groups, the template subtree being a template that contains only the essential nodes, and identifying optional nodes and eliminating noise based on the template subtree; and

extracting structured data from each node based on the template subtree, to derive the data record template.

2. The method of claim 1, characterized in that the derived data record template is a data record template for the main data record list of the web page.

3. The method of claim 1, characterized in that eliminating noise further includes eliminating the noise around a data record list.

4. The method of claim 1, characterized in that grouping the maximal frequent subtrees and closed frequent subtrees mined from the web page further includes grouping based on the belonging relation between maximal frequent subtrees and closed frequent subtrees, such that each group includes one maximal frequent subtree and multiple closed frequent subtrees.

5. The method of claim 4, characterized in that identifying the template subtree within the groups further includes computing a weighted metric for each maximal frequent subtree and closed frequent subtree within a group, wherein the maximal frequent subtree or closed frequent subtree with the highest weighted metric is identified as the template subtree.

6. The method of claim 1, characterized in that the optional nodes include the following nodes of a maximal frequent subtree: nodes for which the template subtree has no matching node.

7. The method of claim 5, characterized in that extracting structured data from each node based on the template subtree further includes aligning the maximal frequent subtrees whose groups have the same frequent subtree with the highest weighted metric, to identify the nodes from which structured data can be extracted.

8. The method of claim 7, characterized in that the nodes from which structured data can be extracted include optional nodes in the frequent subtrees that have the same sibling nodes and parent node but never occur together as child nodes of the same node in the DOM tree.

9. A method for deriving a data record template from subtree structures, the method comprising:

extracting structured data from each node based on a template subtree, to derive the data record template.

10. The method of claim 9, characterized in that the maximal frequent subtrees and closed frequent subtrees are mined from the Document Object Model (DOM) tree of the web page.

11. The method of claim 9, characterized in that eliminating noise further includes eliminating the noise around a data record list.

12. The method of claim 9, characterized in that grouping the maximal frequent subtrees and closed frequent subtrees mined from the web page further includes grouping based on the belonging relation between maximal frequent subtrees and closed frequent subtrees, such that each group includes one maximal frequent subtree and multiple closed frequent subtrees.

13. The method of claim 12, characterized in that identifying the template subtree within the groups further includes computing a weighted metric for each maximal frequent subtree and closed frequent subtree within a group, wherein the maximal frequent subtree or closed frequent subtree with the highest weighted metric is identified as the template subtree.

14. The method of claim 9, characterized in that the optional nodes include the following nodes of a maximal frequent subtree: nodes for which the template subtree has no matching node.

15. The method of claim 13, characterized in that extracting structured data from each node based on the template subtree further includes aligning the maximal frequent subtrees whose groups have the same frequent subtree with the highest weighted metric, to identify the nodes from which structured data can be extracted.

16. The method of claim 15, characterized in that the nodes from which structured data can be extracted include optional nodes in the frequent subtrees that have the same sibling nodes and parent node but never occur together as child nodes of the same node in the DOM tree.