CN102436472A - Multi-category WEB object extraction method based on a relationship mechanism

Info

Publication number
CN102436472A
Authority
CN
China
Prior art keywords
web object
web
classification
relationship
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102948467A
Other languages
Chinese (zh)
Other versions
CN102436472B (en)
Inventor
陈小武
赵沁平
蒋恺
马永焘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN 201110294846
Publication of CN102436472A
Application granted
Publication of CN102436472B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a multi-category WEB object extraction method based on a relationship mechanism. A multi-category WEB object relationship library is built from Wikipedia data; it contains WEB objects, categories, relationships between objects, and the category hierarchy. Within this library, the relationship weights between WEB object categories are computed iteratively, and a core relationship template is derived for each WEB object category. A WEB page is converted into an HTML (HyperText Markup Language) tag tree, and WEB object record blocks are extracted from the page according to the sizes and features of the tag tree nodes. Template matching determines the category to which a WEB object record block belongs; then, according to the core relationship template of that category, a voting strategy extracts the core WEB object of the record block and its related WEB objects. Information visualization is used to display the various relationships of WEB objects in the library. The method can be widely applied in fields such as Internet data mining and information retrieval.

Description

A multi-category WEB object extraction method based on a relationship mechanism
Technical field
The invention belongs to the technical field of computer networks, information retrieval, and information integration, and specifically concerns a multi-category WEB object extraction method based on a relationship mechanism.
Background art
WEB information extraction is an effective means of retrieving information from the massive Internet, and WEB object extraction is widely used in vertical search engines. The concept of a WEB object emerged together with vertical search engines and aims to solve problems of traditional search engines such as redundant results and low precision. Microsoft defines a WEB object as "a basic data object of the WEB, whose related information is collected, indexed, and ranked". WEB objects are presented at two levels: the object block level and the attribute level. A block-level WEB object shows the user only the text record block related to the object; the user must read the record to determine the object's specific attributes. An attribute-level WEB object additionally contains the object's attribute information, which is obtained by further extraction from the object's text record block. Methods for extracting WEB object blocks fall into those based on WEB document structure and those based on the visual information of WEB documents.
Lerman et al. of the University of Southern California proposed a method that extracts information automatically according to WEB document structure. The method learns the structure of similar documents from a given website. It assumes that child nodes sharing the same parent node express strongly correlated information, uses structural similarity to distinguish nodes that express different objects, and extracts information from a document according to this content and positional assumptions.
Gupta et al. remove advertisements by maintaining a continuously updated list of advertisement servers, and remove link lists by counting the number of links and the number of non-link words. However, this method cannot identify relevant images and can easily delete relevant link lists. Moreover, its parameter thresholds must be tuned manually for different web pages to achieve the best extraction results.
The InfoDiscover system proposed by Lin and Ho first divides a web page into content blocks according to its TABLE tags, then extracts words as features, computes the entropy of each word, and from these computes the entropy of each content block. Finally, it separates relevant content blocks from irrelevant ones by setting an entropy threshold. Although the above methods achieve some success, each targets a single website and is therefore limited.
Liu, Grossman, et al. of the University of Illinois at Chicago proposed a method for extracting WEB object lists from structured WEB pages. The method has three steps: building the HTML tag tree, mining data regions, and identifying data records. It preprocesses the tags by repairing unmatched HTML tags so that every tag in the original WEB document is matched, and then converts the WEB document into an HTML tag tree.
Kovacevic et al. use position to divide a page into header, footer, left, right, and center regions. The drawback is that this page template does not fit all web pages, and such a partition can hardly guarantee the semantic consistency of each region. Cai et al. of Microsoft Research Asia segment a WEB document according to visual features such as color, text regions, and font size, thereby generating a visual structure tree of the document.
At the 2008 World Wide Web Conference, Yao et al. proposed a method for extracting global templates for classes of WEB entities. The method first requires the user to supply some attributes of a category, then uses these attributes as keywords and iteratively analyzes the results returned by a search engine, thereby obtaining the attributes and attribute aliases that are defined on the Web for WEB objects of the given category. One run of the method yields the description template of only one category, and the user must supply prior knowledge, which limits multi-category WEB object extraction to some extent.
For WEB object visualization, Keim, Mansmann, et al. of the University of Konstanz, Germany, proposed a hierarchical ring algorithm. In this algorithm the levels are laid out as radial concentric rings; an inner ring represents the parent nodes of the ring outside it, and all rings are divided into sectors according to the data type of the innermost nodes, which helps to show the hierarchical information belonging to each inner node. The algorithm is well suited to expressing grouping information but poorly suited to showing large volumes of data, where user interaction (such as tooltips and information filtering) is needed as an aid. In 2008, Herr and Holloway of Indiana University implemented a mosaic view to visualize editing activity in Wikipedia. Each article is shown as a yellow dot whose size indicates the article's edit frequency; the most frequently edited articles are shown as their corresponding pictures, and recently and frequently edited articles are shown as red dots. This method reflects the overall state and hot topics of Wikipedia, but the visualization lacks interaction with the user, so users can hardly obtain detailed information. In 2007, Holloway and Börner of Indiana University, USA, designed a Wikipedia visualization tool that shows, from a macroscopic perspective, attributes such as the categories and edit times covered by Wikipedia pages. The tool defines and computes the similarity between Wikipedia categories; each point represents a Wikipedia page, all pages are laid out according to their similarity, and different colors indicate the categories to which pages belong.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the invention is to provide a multi-category WEB object extraction method based on a relationship mechanism that extracts multi-category WEB object blocks from both structured and unstructured WEB pages and that, through visualization, lets users intuitively browse the relationships between WEB objects, the relationships between WEB object categories, and the membership relationships between WEB objects and categories.
To achieve this object, the invention adopts the following technical scheme. A multi-category WEB object relationship library is built from Wikipedia data; it contains WEB object categories, WEB objects, WEB object relationships, and the associated inheritance hierarchy, from which the relationships between WEB object categories are constructed. The relationship weights between WEB object categories are computed iteratively, and the core relationship templates between WEB object categories are extracted. A WEB page is converted into an HTML tag tree, and the amount of text under a tag tree node is taken as the node's size. Tag tree nodes with small size or low text support are filtered out; structured nodes and unstructured nodes are extracted according to, respectively, the size similarity between sibling nodes and the text support of a node; and the node with the largest size is selected as the WEB object record block. Template matching classifies the WEB object record block to obtain the category of the WEB object, and, using the core relationship template of that category, a voting strategy extracts the core WEB object of the record block and its related WEB objects. The various relationships of WEB objects are visualized so that users can intuitively browse the relationships between WEB objects, the relationships between WEB object categories, and the membership relationships between WEB objects and categories.
In learning the multi-category core relationship templates, the first step is to generate the relationships between categories. To this end, the invention builds a multi-category WEB object relationship library from Wikipedia data, containing WEB object categories, WEB objects, WEB object relationships, and the associated inheritance hierarchy. For every WEB object relationship, an inter-category relationship is established between the categories of the relationship's subject and object. Each inter-category relationship carries a weight whose value is the frequency with which the relationship's object is used to describe its subject. Because the numbers of objects and object relationships are huge, fairly comprehensive inter-category relationships can be obtained. The second step is to extract the core relationship templates from the generated inter-category relationships. The invention proposes a weight computation for relationships between WEB object categories and an iterative algorithm that yields the core relationship templates. All inter-category relationships with the same subject category are sorted in descending order of weight; each time, the relationship with the largest remaining weight is added to the core relationship set and the information redundancy of the set is computed. When the redundancy of the core relationship set exceeds a threshold and the weights of all remaining relationships are below a specified frequency, the core relationship set of the subject category is considered obtained. Iterating this procedure over every WEB object category yields core relationship templates in which categories describe one another.
In extracting WEB object record blocks, the size of nodes in the HTML tag tree is chosen as the basis for judging the page type and extracting the WEB object record block. Based on observation of a large number of actual WEB pages, a set of premises for distinguishing and extracting structured and unstructured pages is given, and from these premises the rules for page type judgment and record block extraction are derived. There are three main rules. First, on every web page the main content occupies the body of the page; therefore, among sibling nodes on the same level of the HTML tag tree, nodes with markedly smaller size are filtered out, which coarsely filters the page. Second, an unstructured page describes a WEB object in long passages of prose, so the corresponding HTML tag tree node contains a large amount of text and punctuation. The concept of text support is introduced to measure this characteristic: when the text support of a node exceeds a threshold, the node is judged to be an unstructured node. Third, most structured pages are generated from templates, so child nodes at the same position of each object in an object list have similar sizes. The similarity of sizes between nodes is computed from their variance; when more than two consecutive sibling nodes have child nodes of similar size, these siblings are judged to be structured nodes that form a WEB object list.
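The first and third rules can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the patented procedure: node sizes are given as plain character counts, the similarity test compares the variance of sibling sizes with their squared mean, and the names `filter_small_siblings` and `similar_runs` as well as the cutoffs `ratio`, `max_cv`, and `min_run` are hypothetical.

```python
from statistics import mean, pvariance

def filter_small_siblings(sizes, ratio=0.1):
    """Coarse filter: keep the indices of sibling nodes whose text size is not
    markedly smaller than the largest sibling (ratio is a hypothetical cutoff)."""
    largest = max(sizes, default=0)
    return [i for i, s in enumerate(sizes) if largest and s >= ratio * largest]

def similar_runs(sizes, max_cv=0.2, min_run=3):
    """Return (start, end) index ranges (end exclusive) of consecutive siblings
    whose sizes are approximately equal, judged by variance relative to the mean.
    min_run=3 mirrors "more than two" consecutive siblings."""
    runs, start = [], 0
    for end in range(1, len(sizes) + 1):
        window = sizes[start:end]
        m = mean(window)
        ok = m > 0 and pvariance(window) / (m * m) <= max_cv ** 2
        if not ok:
            # The run that ended just before the offending node was still similar.
            if end - 1 - start >= min_run:
                runs.append((start, end - 1))
            start = end - 1
    if len(sizes) - start >= min_run:
        runs.append((start, len(sizes)))
    return runs

# Hypothetical sibling sizes: a small header, four list items, a small footer.
sizes = [5, 100, 104, 98, 101, 7]
kept = filter_small_siblings(sizes)
runs = similar_runs(sizes)
```

Here the four middle siblings survive the coarse filter and form one run of similar size, which is the pattern the third rule associates with a template-generated object list.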
In extracting attribute-level WEB objects, the core relationship template of every category is known, so once the category of a WEB object is known, its related attributes can be extracted according to that category's template. Attribute-level extraction has two steps: object classification and object extraction. In the classification stage, the text in the WEB object record block is first segmented into words, and the nouns are matched against object names in the WEB object relationship library to obtain the set of categories of all objects in the record block. This category set constitutes a local template describing the record block. Matching the local template against the core relationship templates determines the category of the WEB object. Given the WEB object's category, a voting strategy extracts the core WEB object and its related WEB objects from the record block according to that category's core relationship template.
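The classification stage can be sketched as follows. The text does not specify how a local template is scored against a core relationship template, so the overlap fraction used here is an assumption chosen for illustration; the function name `classify_block` and the sample categories are hypothetical.

```python
def classify_block(local_template, core_templates):
    """Return the category whose core template shares the largest fraction of
    its object categories with the block's local template (None if no overlap).

    local_template: set of categories found in the record block.
    core_templates: dict mapping a category to the set of object categories
                    in its core relationship template.
    The overlap-fraction score is an assumed matching rule, not the source's.
    """
    best, best_score = None, 0.0
    for category, template in core_templates.items():
        if not template:
            continue
        score = len(local_template & template) / len(template)
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical core templates and a local template from one record block.
core_templates = {
    "Film": {"Person", "Company", "Country"},
    "Book": {"Person", "Publisher"},
}
label = classify_block({"Person", "Company", "Year"}, core_templates)
```

With these sample sets the block matches two of the three "Film" template categories and one of the two "Book" categories, so it is labeled "Film".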
In visualizing relationships between objects, the massive information and complex relationships of the WEB object library form a huge knowledge network. To let users browse the relationships between objects intuitively, an object relationship visualization method is provided. The visualization shows macroscopic information such as the distribution of WEB objects and the relationships between categories, and also shows specific details such as the popularity of a WEB object and the particulars of object relationships.
Compared with existing methods and techniques, the invention has the following beneficial effects: 1. it extracts multi-category WEB object blocks from both structured and unstructured WEB pages, which solves the poor applicability of any single extraction method; 2. its visualization method comprehensively reflects the hierarchy between Wikipedia categories, the associations between entries, and the associations between categories, and balances local and global information, so that users obtain a relatively complete picture while still being able to locate the information that interests them.
Description of drawings:
Fig. 1 is a schematic diagram of the overall system structure of the invention;
Fig. 2 is a schematic diagram of the WEB object relationship structure of the invention;
Fig. 3 is a flow diagram of remapping WEB objects to layer-3 categories in the invention;
Fig. 4 is a flowchart of the inter-category relationship generation method of the invention;
Fig. 5 is a flowchart of the inter-category core relationship extraction method of the invention;
Fig. 6 is a flowchart of the WEB object block extraction method of the invention;
Fig. 7 is a flowchart of the WEB object record block classification method of the invention.
Embodiment:
The invention is described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, the schematic diagram of the overall system structure, the multi-category WEB object extraction method based on a relationship mechanism proposed by the invention mainly comprises the following modules: the WEB object relationship library module, the data persistence module, the relationship template service module, the WEB object record block extraction module, the WEB object record block classification module, and the attribute-level WEB object extraction module.
The WEB object relationship library stores the raw Wikipedia data together with the object, category, and relationship information produced by processing it. It mainly contains WEB objects, WEB object categories, WEB object relationships, membership relationships between WEB objects and categories, and inter-category relationships. The library is the basis of all subsequent operations. Because the data provided by Wikipedia is very large, the invention optimizes the library by building indexes and splitting tables to improve database access efficiency.
The data persistence module accesses the database through the Hibernate tool, isolating the logical and physical characteristics of the database. Hibernate allows the data access layer to be built quickly and makes characteristics such as inter-table relationships and table structures transparent to the business logic, which simplifies development of the upper-layer business logic.
The core relationship template service module provides services related to core relationship templates, including inter-category relationship generation, core relationship template learning, and core relationship template matching. It is the most important module, and within it the learning of core relationship templates is the key step.
The WEB object extraction module extracts WEB object record blocks and attribute-level WEB objects from web pages. It calls the core relationship template service to classify record blocks and calls the data persistence layer to store the extracted WEB objects in the WEB object relationship library. It comprises four submodules: the WEB object record block extraction module, the text segmentation module, the record block classification module, and the attribute-level WEB object extraction module.
Referring to Fig. 2, the schematic diagram of the WEB object relationship structure, the WEB object relationship library contains WEB object categories, WEB objects, membership relationships, and association relationships. WEB object categories come from the Wikipedia category system and classify WEB objects hierarchically. WEB objects come from the individual Wikipedia entries, each of which is described by its own WEB page.
The inheritance hierarchy between categories and the membership relationships between categories and entries come from the Wikipedia category system. In Wikipedia, every article belongs to at least one category, articles in the same category usually cover the same or similar topics, and a category can in turn belong to a higher-level parent category. The result is a hierarchical category system that contains both membership relationships between entities and categories and membership relationships between subcategories and parent categories.
A relationship between WEB objects is one that arises when the body text of a Wikipedia entry hyperlinks to another entry. The entry being described is the subject of the WEB object relationship, and the entry reached through the hyperlink is its object. Because relationships in body text are complex and carry no semantic information, the invention cannot determine the semantics of a relationship; a relationship here therefore means only that some relationship exists between a subject and an object. Each relationship between WEB objects is identified by the combination of its subject and object.
A relationship between WEB object categories is a relationship between the categories, or the parent categories of the categories, to which two related WEB objects belong. It cannot be obtained directly; it must be computed from the WEB object relationships and the membership relationships between WEB objects and their categories. The analysis and computation of relationships between WEB object categories are described in detail in the next section. A relationship between WEB objects gives only the relationship between individual instances, whereas a relationship between WEB object categories gives statistical information about the relationship between two categories. The weight of an inter-category relationship is obtained by counting WEB object relationships and can be used to judge how strong the relationship between the categories is.
Learning the core relationship templates involves three steps. First, all WEB objects are remapped to layer-3 categories. (In the category definitions of Chinese Wikipedia, each category lies at some depth from the top-level category; because the scale and granularity of the WEB objects in each layer-3 category are moderate, layer-3 categories are chosen as the target categories for WEB object mapping.) Second, relationships between layer-3 categories are established from the WEB object relationships. Third, the iterative algorithm proposed by the invention extracts the core relationship templates from all inter-category relationships. To remap a WEB object, the method starts from the leaf node of the inheritance tree where the object resides, traverses all inheritance paths upward to layer-3 category nodes, and maps the object to every layer-3 category node reached. However, the inheritance hierarchy of the Wikipedia category system is deep and contains multiple inheritance, so a WEB object can be remapped to a large number of irrelevant categories. For example, the category "computer" would be remapped to irrelevant categories such as "world history" and "Western art". Experiments by the inventors show that these irrelevant categories arise mainly because multiple inheritance widens the traversal, so that irrelevant categories are reached along deep paths, whereas mappings that reflect reality usually lie on short traversal paths. Limiting the length of the traversal path during remapping therefore improves the remapping accuracy effectively. The invention accordingly sets a threshold τ: if the shortest of all mapping paths from a WEB object to layer-3 categories has length l, the object is remapped only to those categories whose path length is less than l + τ.
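The threshold rule can be shown in a few lines. This is a sketch under assumptions: the path lengths to layer-3 categories are taken as already computed, the function name `select_categories` and the sample categories are hypothetical, and the default `tau=3` follows the value used in the Fig. 3 walkthrough.

```python
def select_categories(path_lengths, tau=3):
    """Keep the layer-3 categories whose path length is below l + tau,
    where l is the shortest path length found for this WEB object.

    path_lengths: dict mapping a layer-3 category to the length of the
                  shortest inheritance path from the object to it.
    """
    if not path_lengths:
        return set()
    shortest = min(path_lengths.values())
    return {cat for cat, length in path_lengths.items() if length < shortest + tau}

# Hypothetical path lengths for one WEB object.
lengths = {"Information technology": 2, "Applied science": 3, "World history": 7}
targets = select_categories(lengths)
```

In this example the shortest path has length 2, so only categories reached in fewer than 5 steps are kept and the distant "World history" mapping is discarded.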
Referring to Fig. 3, the flow diagram of remapping WEB objects to layer-3 categories, the detailed process is as follows. (1) Load all WEB objects and membership relationships into memory through Hibernate. The WEB object relationship library contains a huge number of WEB objects and inheritance relationships, and under frequent database access the access time would become the bottleneck of the program; loading the data into memory and indexing it with hash tables greatly increases running speed. (2) Traverse each WEB object and perform the operations below. Text processing in the invention takes the WEB object as its processing unit, so the processing of each WEB object can be regarded as a single operation, and the traversal of objects must be stable and efficient. (3) Check whether any WEB object remains unprocessed. If none remains, all WEB objects have been remapped and the algorithm ends. Otherwise, take out one WEB object, set its path length to 0, and push it onto the temporary stack. (4) For each element in the temporary stack, look up all of its parent categories, set each parent's path length to the element's path length plus 1, and add the parent elements to the temporary stack. If a parent element is a layer-3 category, put it into the result stack and remove it from the temporary stack. If all elements have been popped from the temporary stack, the current object is finished and the process jumps to step (5); otherwise step (4) is repeated until the temporary stack is empty. (5) Sort the result stack by path length, and select the categories whose path length is less than the shortest path length plus 3 as the layer-3 categories to which the WEB object is remapped in this round.
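Steps (3) to (5) for a single WEB object can be sketched as below. This is an illustrative reading, not the patent's implementation: the category graph is a plain in-memory dict instead of data loaded through Hibernate, a dictionary of best path lengths stands in for the result stack and also guards against cycles, and the names `remap_to_layer3`, `parents`, and `layer3` plus the sample graph are hypothetical.

```python
def remap_to_layer3(obj, parents, layer3, tau=3):
    """Map one WEB object to layer-3 categories by upward traversal.

    parents: dict mapping a node (object or category) to its parent categories.
    layer3:  set of categories that lie on layer 3.
    Returns the layer-3 categories whose path length is < shortest + tau.
    """
    shortest = {}            # layer-3 category -> shortest path length found
    seen = {obj: 0}          # best known path length per intermediate node
    stack = [(obj, 0)]       # temporary stack of (node, path length)
    while stack:
        node, length = stack.pop()
        for parent in parents.get(node, ()):
            new_len = length + 1
            if parent in layer3:
                # Layer-3 categories go to the result and are not expanded further.
                if new_len < shortest.get(parent, float("inf")):
                    shortest[parent] = new_len
            elif new_len < seen.get(parent, float("inf")):
                seen[parent] = new_len
                stack.append((parent, new_len))
    if not shortest:
        return set()
    limit = min(shortest.values()) + tau
    return {cat for cat, n in shortest.items() if n < limit}

# Hypothetical category graph with one short and one long inheritance path.
parents = {
    "Python": ["Programming languages"],
    "Programming languages": ["Computing", "Languages"],
    "Computing": ["Technology"],
    "Languages": ["Linguistics topics"],
    "Linguistics topics": ["Deep1"],
    "Deep1": ["Deep2"],
    "Deep2": ["Culture"],
}
result = remap_to_layer3("Python", parents, layer3={"Technology", "Culture"})
```

In the sample graph "Technology" is reached in 3 steps and "Culture" in 6, so with the limit of shortest path plus 3 only "Technology" is kept.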
Referring to Fig. 4, the flowchart of the inter-category relationship generation method: after the WEB objects have been remapped, relationships between WEB object categories are established from the existing WEB object relationships. Again for efficiency, the WEB object relationships and the inheritance hierarchy are loaded into memory. All WEB object relationships are traversed and processed as follows. First, check whether any WEB object relationship remains unprocessed; if not, all object relationships have been processed and the inter-category relationship generation ends. If a relationship remains, obtain all parent categories of the two related WEB objects. For each pair formed from these parent categories, query whether an inter-category relationship already exists. If it exists, increase its weight by one; otherwise create a new inter-category relationship with weight 1.
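The counting procedure above can be sketched with a counter keyed by (subject category, object category). The function name `build_category_relations`, the argument names, and the sample data are hypothetical; the categories of each object are assumed to be the layer-3 categories obtained by remapping.

```python
from collections import Counter
from itertools import product

def build_category_relations(object_relations, categories_of):
    """Derive weighted inter-category relationships from object relationships.

    object_relations: iterable of (subject, object) pairs between WEB objects.
    categories_of:    dict mapping a WEB object to its categories.
    Returns a Counter mapping (subject category, object category) to a weight.
    """
    weights = Counter()
    for subject, obj in object_relations:
        pairs = product(categories_of.get(subject, ()), categories_of.get(obj, ()))
        for pair in pairs:
            # A missing key starts at 0, so the first occurrence yields weight 1.
            weights[pair] += 1
    return weights

# Hypothetical object relationships and category memberships.
relations = [
    ("Alan Turing", "Enigma"),
    ("Ada Lovelace", "Analytical Engine"),
    ("Alan Turing", "Cambridge"),
]
categories_of = {
    "Alan Turing": ["People"],
    "Ada Lovelace": ["People"],
    "Enigma": ["Machines"],
    "Analytical Engine": ["Machines"],
    "Cambridge": ["Places"],
}
weights = build_category_relations(relations, categories_of)
```

Two of the sample relationships link a person to a machine, so the ("People", "Machines") relationship ends with weight 2 and ("People", "Places") with weight 1.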
Inter-category relationships differ in strength, so the relationships obtained by the above process must be analyzed further to extract the core relationships between categories. Distinguishing core relationships matters both for acquiring relationships and for identifying objects. On the one hand, it helps to find the most important relationships among the many relationships of the WEB objects in a category, so that users can be given a high-quality relationship service. On the other hand, the combination of a category's core relationships can be regarded as a description pattern for objects of that category, so that an unknown object can be classified by its relationship pattern and its related objects extracted according to the core relationship template. The larger the frequency weight of the relationship between two categories, the closer the two categories are, and the more Wikipedia editors tend to use the relationship's object to describe its subject. A simple approach is therefore to select inter-category relationships whose weight exceeds a threshold as core relationships. Given a category cat(sub)_i and a category cat(obj)_j, the inter-category relationship is written (cat(sub)_i, cat(obj)_j, freq_ij); if freq_ij is greater than the threshold κ (0.8 in the invention), the relationship is considered a core relationship of its subject.
However, frequency alone cannot guarantee that all core relationships are extracted. For example, Wikipedia contains some unpopular categories for which domain knowledge is lacking; such categories have fewer WEB object relationships than others, which lowers their frequency values. As a result, core relationships with low frequency values are not extracted.
The meaning of a core relationship can also be considered from another angle. A core relationship is a relationship that exists universally between WEB object instances of two categories and is a salient feature of the subject; that is, the set of objects of all core relationships of a subject category plays the main role in identifying that subject category. A non-core relationship, being a relationship that exists only occasionally between category instances, contributes very little to identifying the subject category. The core relationships of a category can therefore be understood as a subset of that category's relationships whose information is rich enough to represent and identify all relationship information of the category. To describe the information richness of this subset, the present invention introduces the concepts of entropy and redundancy from information theory. Redundancy represents the degree to which information entropy is reduced because the inter-category relationships of the same subject occur with different probabilities. In other words, for identifying a given relationship subject, redundancy expresses the proportion of all inter-category relationships of the subject category that is redundant relative to the subject's core relationship set. The present invention therefore uses redundancy to measure how well the core relationship set identifies all inter-category relationships. Given the set $R_{all}$ of all inter-category relationships of a category, the probability $p(r)$ that a relationship $r$ exists in this set is computed as:
$$p(r) = \frac{f_{ij}}{\sum_{k=0}^{p} f_{ik}},$$
where $f_{ij}$ is the frequency of occurrence of this relationship and $\sum_{k=0}^{p} f_{ik}$ is the sum of the frequencies of occurrence of all relationships.
For a subset $R_{sub}$ of $R_{all}$, the information redundancy of $R_{sub}$ can be expressed as:
$$redundancy(R_{sub}) = 1 - \frac{H(R_{sub})}{\log |R_{all}|},$$
where $H(R_{sub}) = -\sum_{r \in R_{sub}} p(r) \log p(r)$ is the information entropy of $R_{sub}$, $p(r)$ is the probability that relationship $r$ exists in the subset $R_{sub}$, and $|R_{all}|$ is the number of elements in $R_{all}$.
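The two formulas above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the entropy is taken to be the standard Shannon entropy, and $p(r)$ is normalized over all relationships of the category rather than over the subset, which is one possible reading of the text.

```python
import math

def relation_probabilities(freqs):
    """p(r) = f_r / (sum of the frequencies of all relationships)."""
    total = sum(freqs.values())
    return {r: f / total for r, f in freqs.items()}

def redundancy(subset, probs, n_all):
    """redundancy(R_sub) = 1 - H(R_sub) / log|R_all|.

    Assumption: H is the Shannon entropy over the relationships in
    `subset`, using probabilities normalized over the whole set.
    n_all must be greater than 1 so that log|R_all| is non-zero.
    """
    h = -sum(probs[r] * math.log(probs[r]) for r in subset if probs[r] > 0)
    return 1.0 - h / math.log(n_all)

# Four equally frequent relationships: the full set carries no redundancy.
freqs = {"a": 1, "b": 1, "c": 1, "d": 1}
probs = relation_probabilities(freqs)
full_set_redundancy = redundancy(list(freqs), probs, len(freqs))
single_redundancy = redundancy(["a"], probs, len(freqs))
```

With equal frequencies the entropy of the full set equals $\log |R_{all}|$, so the redundancy is zero; a smaller subset has lower entropy and therefore higher redundancy.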
The present invention provides an iterative algorithm that extracts the core relationship set from all inter-category relationships of a category. All relationships are first sorted by frequency weight. In each round, one relationship is taken from the remaining elements of the relationship set and added to the current core relationship set, and the algorithm checks whether the set satisfies the termination condition. Both frequency weight and redundancy are taken into account: when a relationship $r$ with frequency $freq_{tf}$ is newly added in an iteration, if $freq_{tf} < \kappa$ and $redundancy(R_{sub}) > \lambda$, then $r$ is considered unfit to join the current core relationship set and the iteration stops.
Referring to Fig. 5, the flow chart of the inter-category core relationship extraction method of the present invention, the present invention proposes an iterative algorithm for extracting the core relationships between categories. In each iteration, the relationship with the largest weight among all inter-category relationships of the same subject is added to the core relationship set, and the information redundancy of the core relationship set is used to evaluate the termination condition. The concrete steps are as follows:
(1) Load the WEB object categories and the inter-category relationships into memory. As with the preceding modules, the aim is to improve program access speed.
(2) Process each WEB object category as follows, taking one unprocessed category at a time. Check whether an unprocessed category exists. If none exists, end the process; otherwise go to step (3).
(3) Obtain all inter-category relationships whose subject is the current category A, sort them by weight in descending order, and store them in a queue.
(4) Remove the relationship at the head of the queue and add it to the core relationship set. Compute the information redundancy of the current core relationship set and check whether the set satisfies the termination condition. If it does, go to step (5); otherwise repeat step (4).
(5) Save the current core relationship set and return to step (2).
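The loop over a single category's relationships can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function names, the sample relationship names, and the threshold values are hypothetical, the weights are taken to be raw frequencies, and the redundancy is evaluated on the candidate set that would result from adding the next relationship.

```python
import math

def redundancy(subset, freqs):
    """1 - H(subset) / log|R_all|, with p(r) normalized over all relationships."""
    total = sum(freqs.values())
    h = -sum((freqs[r] / total) * math.log(freqs[r] / total) for r in subset)
    return 1.0 - h / math.log(len(freqs))

def extract_core_relations(freqs, kappa, lam):
    """Pick core relationships of one subject category.

    freqs: relationship -> positive frequency (used as the weight here).
    kappa, lam: the frequency and redundancy thresholds from the text.
    """
    if len(freqs) < 2:          # log|R_all| would be zero
        return list(freqs)
    queue = sorted(freqs, key=freqs.get, reverse=True)  # descending weight
    core = []
    for r in queue:
        candidate = core + [r]
        # Termination: low frequency AND redundancy above the threshold.
        if freqs[r] < kappa and redundancy(candidate, freqs) > lam:
            break
        core = candidate
    return core

# Hypothetical relationships of one subject category.
freqs = {"born_in": 50, "works_at": 30, "likes": 2, "met": 1}
core = extract_core_relations(freqs, kappa=5, lam=0.1)
```

In this example the two frequent relationships are kept and the loop stops at the first low-frequency one. Processing all categories, as in steps (2) and (5), would simply call the function once per category.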
Based on observation of a large number of real WEB pages, the present invention states a series of assumptions, centered on node size in the HTML tag tree, for discriminating between and extracting from structured and unstructured pages, and derives from them the rules for page type judgment and object record block extraction. The assumptions are as follows:
(1) For all web pages, the main content of a page occupies the body of the page, so WEB object blocks are located on the larger nodes of the HTML tag tree corresponding to the WEB page.
(2) Based on extensive observation, web pages that contain WEB object record blocks fall into two types, structured pages and unstructured pages. That is, if a WEB page contains WEB object blocks, it either contains a single WEB object block described by a long passage of text, or it is a list of several WEB object blocks.
(3) In a structured WEB page, WEB object block nodes are siblings and share the same parent node.
(4) Structured pages are overwhelmingly generated from templates. Therefore, if such a page contains a WEB object list, the child nodes at the same position in each WEB object block node have similar sizes.
Referring to Fig. 6, the flow chart of the WEB object block extraction method of the present invention shows the extraction flow for WEB object blocks. To extract WEB object record blocks, a WEB page is first loaded into memory. Because different WEB pages use different encodings, all WEB pages are then converted to UTF-8. Some pages may still be garbled after transcoding; in that case the program throws an exception and processing of the current page is skipped. The program then removes useless tags from the WEB page. Because WEB object discovery and extraction take the HTML tag as the basic unit, removing useless tags also serves as a preliminary denoising step. The Tidy tool is used to pre-process the WEB page and convert it into an XML document. An XML document is a standard structured document, so a tag tree can be generated from the HTML page through the DOM interface. On this basis, the size of each node is computed recursively from the bottom up. Text support and the size similarity of sibling nodes are then used to select all candidate article nodes and WEB object list nodes. Finally, the largest node among the candidate WEB object record block nodes is returned as the result. The detailed steps are as follows:
(1) WEB page pre-processing. First, the Tidy tool is used to format the WEB page: tags missing from the HTML document are completed and special characters are converted, so that the page is finally converted into an XML document. Useless tags are then removed from the XML document, and the DOM is used to build the tag tree. In this tag tree, a pair of tags is treated as one tag node of the tree, and the sub-tags inside a tag are treated as child nodes of that tag node.
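As an illustration of the tag tree construction, the following Python sketch builds a simple tree from an already well-formed XML string using the standard `xml.dom.minidom` module. It assumes the Tidy conversion has already taken place and does not call Tidy itself; the set of useless tags and the dictionary layout of a tree node are assumptions made for this example.

```python
from xml.dom import minidom

# Assumption: which tags count as "useless" is not listed in the text.
USELESS_TAGS = {"script", "style"}

def build_tag_tree(xml_text):
    """Parse well-formed XML and return a nested dict tag tree."""
    doc = minidom.parseString(xml_text)
    return _to_node(doc.documentElement)

def _to_node(element):
    # Sub-tags become child nodes; useless tags are dropped with their content.
    children = [
        _to_node(child)
        for child in element.childNodes
        if child.nodeType == child.ELEMENT_NODE
        and child.tagName.lower() not in USELESS_TAGS
    ]
    # Text directly inside this tag, excluding text inside sub-tags.
    own_text = "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()
    return {"tag": element.tagName.lower(), "text": own_text, "children": children}

page = "<html><body><script>x()</script><div>Hello<p>World</p></div></body></html>"
tree = build_tag_tree(page)
```

Keeping the text that belongs directly to a tag separate from the text of its sub-tags matches the way the node size is later defined.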
(2) Node size computation. Because different tags play different roles, different tags are given different weights. Let $n$ denote the number of characters or punctuation marks in a node and $w$ denote a weight. The size $size_{node}$ of a node is then obtained recursively from:
$$size_{node} = (n_{words} \cdot w_{words} + n_{punctuation} \cdot w_{punctuation}) \cdot w_{tag} + \sum_{cn \in children} size_{cn},$$
where $n_{words}$ and $n_{punctuation}$ are the numbers of characters and punctuation marks contained in the node but not in any of its child nodes, $w_{words}$, $w_{punctuation}$ and $w_{tag}$ are the weights of characters, punctuation and the tag respectively, and $\sum_{cn \in children} size_{cn}$ is the sum of the sizes of all child nodes. Ideally every tag would be given its own weight. In practice, experiments show that a tag-group strategy (similar tags are grouped, and tags in the same group are given the same weight) achieves good results and reduces the complexity of defining tag weights.
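The recursive size formula can be sketched in Python as below. The weights, the tag groups, the punctuation set, and the choice to count every non-space, non-punctuation character as one "word" unit are all assumptions made for illustration; the patent does not give concrete values.

```python
import string
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""                      # text directly inside this tag
    children: list = field(default_factory=list)

# Hypothetical weights; same-group tags would share one weight.
W_WORDS = 1.0
W_PUNCT = 2.0
TAG_WEIGHTS = {"p": 1.5, "div": 1.0, "a": 0.5}
PUNCT = set(string.punctuation) | set("，。；：！？、")

def node_size(node):
    """size = (n_words*w_words + n_punct*w_punct) * w_tag + sum of child sizes."""
    n_punct = sum(1 for ch in node.text if ch in PUNCT)
    n_words = sum(1 for ch in node.text if not ch.isspace() and ch not in PUNCT)
    own = (n_words * W_WORDS + n_punct * W_PUNCT) * TAG_WEIGHTS.get(node.tag, 1.0)
    return own + sum(node_size(child) for child in node.children)

leaf = Node("p", "ab,")
root = Node("div", "x", [leaf])
leaf_size = node_size(leaf)   # (2*1.0 + 1*2.0) * 1.5
root_size = node_size(root)   # (1*1.0) * 1.0 + leaf size
```

A production version would cache sizes during a single bottom-up pass instead of recomputing them on each call.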
(3) WEB page main content filtering. The concept of text support is introduced to describe the distinctive character of WEB page content. The text support $textsupport$ is expressed as:
$$textsupport_{node} = \frac{wsize_{node} + psize_{node}}{lsize_{node}},$$
where $wsize_{node}$, $psize_{node}$ and $lsize_{node}$ are the sizes of the text, the punctuation and the hyperlinks respectively. For deep-level nodes, if the text support is less than the threshold $\epsilon$ (0.2 in the present invention), the node is filtered out.
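A small Python sketch of the text support test follows. The function names are hypothetical, and treating a node without hyperlinks as having infinite text support is an assumption, since the text does not say how a zero hyperlink size is handled.

```python
EPSILON = 0.2  # threshold stated in the text

def text_support(wsize, psize, lsize):
    """textsupport = (wsize + psize) / lsize."""
    if lsize == 0:
        # Assumption: a node with no hyperlinks is never link-dominated.
        return float("inf")
    return (wsize + psize) / lsize

def is_filtered(wsize, psize, lsize, epsilon=EPSILON):
    """True when a deep-level node should be dropped as link-heavy noise."""
    return text_support(wsize, psize, lsize) < epsilon

content_support = text_support(30, 6, 12)
nav_filtered = is_filtered(1, 1, 20)       # mostly hyperlinks
plain_filtered = is_filtered(10, 0, 0)     # no hyperlinks at all
```

Navigation bars and link lists have a large hyperlink size relative to their text, so they fall below the threshold, while body text does not.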
(4) WEB object block extraction. If the WEB page contains an object list, the parent node of the list is called the list node. If the WEB page is an unstructured page containing a single WEB object block, the parent node of that block is called the article node. Detecting list nodes is converted into detecting size templates. In general, list nodes are generated automatically by a program from a template, so the child nodes of a list node differ little in size. Given a group of child nodes $\{child_i\}_{i=1}^{n}$, the present invention uses the variance to measure the similarity among all the child nodes:
$$sim = \frac{\sum_{n \in \{child_i\}_{i=1}^{n}} \left(size_n - average_{\{child_i\}_{i=1}^{n}}\right)^2}{\left|\{child_i\}_{i=1}^{n}\right|},$$
where $size_n$ is the size of one child node, $average_{\{child_i\}_{i=1}^{n}}$ is the average size of all child nodes in the group, and $|\{child_i\}_{i=1}^{n}|$ is the number of child nodes. The smaller the value, the higher the similarity between the nodes. Therefore, when a group of adjacent nodes have the same number of child nodes and the $sim$ of every group of child nodes is less than the threshold $\sigma$ (64 in the present invention), these adjacent nodes are considered to form a list, and their parent node is marked as a potential list node.
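The list test can be sketched in Python as below. The grouping used here is one reading of the text combined with the assumption about same-position child nodes: for adjacent sibling nodes with the same number of children, the children at the same position across the siblings form one group. The function names and sample sizes are hypothetical.

```python
SIGMA = 64  # threshold stated in the text

def size_variance(sizes):
    """Population variance of a group of child node sizes (the sim value)."""
    mean = sum(sizes) / len(sizes)
    return sum((s - mean) ** 2 for s in sizes) / len(sizes)

def is_potential_list(siblings, sigma=SIGMA):
    """siblings: one list of child sizes per adjacent sibling node."""
    if len(siblings) < 2:
        return False
    child_counts = {len(child_sizes) for child_sizes in siblings}
    if len(child_counts) != 1 or 0 in child_counts:
        return False  # siblings must have the same, non-zero child count
    # Compare children position by position across the siblings.
    return all(size_variance(group) < sigma for group in zip(*siblings))

records = [[100, 40], [104, 44], [98, 38]]   # template-like records
mixed = [[100, 40], [300, 44]]               # first children differ widely
looks_like_list = is_potential_list(records)
looks_like_mixed = is_potential_list(mixed)
```

When the function returns true, the common parent of the sibling nodes would be marked as a potential list node.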
Detecting article nodes is relatively simple. Nodes with a high text support value and many punctuation marks can be regarded as article nodes. Detection starts from the root node. Once a node is found whose text support is greater than 1.5 and whose punctuation count is greater than 6, it is marked as an article node and its child nodes are not examined further.
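The top-down search for article nodes can be sketched as follows. The `Node` structure, with precomputed text support and punctuation count, is a hypothetical simplification for this example; the two thresholds are those given in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    text_support: float     # assumed to be computed beforehand
    punct_count: int
    children: list = field(default_factory=list)

def find_article_nodes(node, min_support=1.5, min_punct=6):
    """Return article nodes, searching from the root downward."""
    if node.text_support > min_support and node.punct_count > min_punct:
        return [node]       # stop here: children are not examined
    found = []
    for child in node.children:
        found.extend(find_article_nodes(child, min_support, min_punct))
    return found

article = Node("article", 3.0, 12, [Node("p", 5.0, 8)])
nav = Node("nav", 0.1, 0)
root = Node("body", 1.0, 12, [nav, article])
names = [n.name for n in find_article_nodes(root)]
```

Because the search stops at the first qualifying node on each path, the paragraph inside the article is not reported separately.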
Attribute-level WEB object extraction is implemented in two steps: WEB object record block classification and WEB object extraction. Classifying a WEB object record block requires constructing a two-dimensional object support matrix and a subject contribution matrix, computing on that basis the match value between the local template of the web page and each core relationship template, and choosing the best-matching category as the classification of the WEB object record block. Referring to Fig. 7, the flow chart of the WEB object record block classification method of the present invention, the flow is as follows:
(1) Segment the text in the WEB record block into words and extract the substantive nouns. For word segmentation the present invention constructs a stop-word list that records words which occur frequently in WEB pages but have no value for classification, such as country names.
(2) Match all nouns by looking up the WEB object corresponding to each noun in the WEB object relationship library. WEB object names in the library, which is built on Wikipedia data, do not distinguish Traditional from Simplified Chinese. Therefore, if the first match misses, the noun is converted between Traditional and Simplified Chinese and matched again. Because the library holds more than 260,000 WEB objects and records their various aliases, most nouns in a WEB object record block can be expected to match a corresponding WEB object.
(3) Construct the local category relationship template from the categories of the matched WEB objects. Because multiple inheritance exists, each WEB object may map to several categories. This step does not resolve that; all categories are simply gathered together to form the local category relationship template.
(4) From the local category relationship template and the core relationship template, compute the object support matrix, the subject contribution matrix, and the template matching ratio. The matching ratio refers to the ratio relationship between the object support matrix and the subject support matrix and reflects the template matching degree of entity words.
(5) Multiply the object support, the subject contribution and the template matching ratio to obtain the template matching degree of every matched category.
(6) Sort all matched categories by template matching degree and take the best-matching template as the classification result of the WEB object record block.
Here, the object support is the probability that a given category B describes category A. The subject contribution is the importance of one relationship object relative to all relationship objects of a relationship subject. The template matching ratio is the ratio of the number of relationships matched between the local WEB object template and the core relationship template to the number of all relationships in the core relationship template.
Once the category of the WEB object block is known, objects identified in the original WEB object record block that do not conform to the core relationship template are filtered out. What remains is the set of WEB objects that conform to the category's core relationship template, $\{Relwo_i\}_{i=0}^{m}$.
A voting strategy is used to identify the core WEB object $wo_{desc}$ from $\{Relwo_i\}_{i=0}^{m}$. First, the WEB object $wo_{freq}$ that occurs most often in the WEB object block is removed from $\{Relwo_i\}_{i=0}^{m}$. Experiments show that the most frequently occurring WEB object has a high probability of being the core WEB object. Then, for each sentence in the WEB object record block, in the context in which any object of $\{Relwo_i\}_{i=0}^{m}$ occurs, if an object of $\{Relwo_i\}_{i=0}^{m}$ (including $wo_{freq}$) also appears, a positive vote is cast for that object; otherwise a negative vote is cast. After all sentences have been voted on, all WEB objects are sorted by the votes received, and the WEB object that has the most votes and belongs to the category is judged to be the core WEB object. These steps complete the extraction from the WEB page, yielding the category of the WEB page and the WEB objects in the page that are related to one another.
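The voting step leaves some details open, so the following Python sketch shows only one possible reading. The assumptions are: a sentence is the voting context, object mentions are found by plain substring matching, sentences are split on common sentence-ending punctuation, and an object receives a positive vote when another object of the set (including $wo_{freq}$) occurs in the same sentence and a negative vote otherwise. All names and the sample text are hypothetical.

```python
import re

def vote_core_object(text, related_objects):
    """Return (wo_freq, votes) for the remaining objects of the set."""
    # Most frequent object in the block; removed from the candidates.
    wo_freq = max(related_objects, key=text.count)
    candidates = [o for o in related_objects if o != wo_freq]
    votes = {o: 0 for o in candidates}
    sentences = [s for s in re.split(r"[.!?。！？]", text) if s.strip()]
    for sentence in sentences:
        present = [o for o in related_objects if o in sentence]
        for obj in candidates:
            if obj not in sentence:
                continue
            # Another object of the set (wo_freq included) in the same context?
            has_company = any(other != obj for other in present)
            votes[obj] += 1 if has_company else -1
    return wo_freq, votes

text = ("Beijing hosts Beihang University. Beihang University teaches in Beijing. "
        "Beijing is large. Haidian is quiet.")
wo_freq, votes = vote_core_object(text, ["Beijing", "Beihang University", "Haidian"])
ranked = sorted(votes, key=votes.get, reverse=True)
```

Under this reading, the top-ranked candidate that also belongs to the block's category would be reported as the core WEB object; the category check is omitted from the sketch.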
The WEB object relationship visualization of the present invention borrows the idea of the Venn diagram. Closed squares represent the categories in Wikipedia, and the points inside a square represent the entries that belong to that category; the radius of a point is proportional to the number of relationships that the entry has with other entries. Categories are connected to one another through the Radial algorithm to express the associations between categories. Because each category has a very large number of entries, it is difficult to show the mutual relationships of all entries at once, so the present invention uses interaction to make up for this limitation. When the mouse hovers over the point corresponding to an entry, all points that have a relationship with the selected entry are highlighted, and edges are drawn between the related points.
The above is merely a basic description of the present invention. Any equivalent transformation made according to the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A multi-category WEB object extraction method based on a relationship mechanism, characterized by comprising the following steps:
(1) for the purpose of learning core relationship templates, constructing a multi-category WEB object relationship library from Wikipedia data;
(2) iteratively computing the relationship weights between WEB object categories and extracting the core relationship templates of the WEB object categories;
(3) extracting WEB object blocks from structured and unstructured WEB pages according to the sizes and characteristics of the tag tree nodes;
(4) performing attribute-level WEB object extraction in the WEB object blocks with a voting strategy, according to the core relationship templates of the WEB object categories;
(5) displaying the various relationships of the WEB objects in the multi-category WEB object relationship library by means of information visualization.
2. The multi-category WEB object extraction method based on a relationship mechanism according to claim 1, characterized in that: the WEB object relationship library in step (1) comprises WEB objects, WEB object relationships, WEB object categories, and inter-category relationships. The WEB objects come from the concrete entries in Wikipedia, each of which is described by its own WEB page; the WEB object categories come from the Wikipedia category system and describe the category to which a WEB object belongs; the hierarchical inheritance relationships between categories, and the membership relationships between categories and entries, come from the Wikipedia category system.
3. The multi-category WEB object extraction method based on a relationship mechanism according to claim 1, characterized in that: the core relationship template learning method based on the multi-category WEB object relationship library in step (2) comprises the following steps:
(2.1) based on the category definitions in Chinese Wikipedia, each category has a different depth from the top-level category; because the scale and granularity of each type of WEB object are moderate in the many categories of the third level, the third-level categories are selected as the target categories for WEB object category mapping;
(2.2) taking the level depth of a category in Chinese Wikipedia as the level depth of all WEB objects under that category, and remapping WEB objects with a deeper category level to third-level categories;
(2.3) establishing relationships between the third-level categories according to the relationships between WEB objects, these relationships including both core relationships and non-core relationships between categories;
(2.4) using an iterative algorithm to filter out the non-core relationships between categories and construct the core relationship templates between categories.
4. The multi-category WEB object extraction method based on a relationship mechanism according to claim 1, characterized in that: the extraction of WEB object blocks from structured and unstructured pages in step (3) is based on the following assumptions:
(A) for all web pages, the main content of the page occupies far more space than other information, so WEB object blocks are located on the larger nodes of the HTML tag tree corresponding to the WEB page;
(B) if a WEB page contains WEB object blocks, then for an unstructured WEB page the WEB object block is a single text-description WEB object block, and for a structured WEB page it is a list of several WEB object blocks;
(C) in a structured WEB page, WEB object block nodes are siblings and share the same parent node;
(D) in a structured WEB page that contains a WEB object list, the child nodes at the same position in each WEB object block node have similar sizes.
5. The multi-category WEB object extraction method based on a relationship mechanism according to claim 1, characterized in that: the extraction of WEB object blocks from structured and unstructured pages in step (3) comprises the following steps:
(3.1) WEB page pre-processing, in which the HTML page is converted into a tag tree;
(3.2) node size computation, in which size is divided into two types, character size and punctuation size; to reflect correctly the different importance of punctuation and characters, punctuation and characters are given different weights in the actual computation;
(3.3) WEB page main content filtering, in which the defined node filtering rules are used to extract the body content from the tag tree with known node sizes and to filter out irrelevant nodes;
(3.4) WEB object block extraction, in which the WEB page type is judged: if the WEB page contains an object list, the parent node of the list is the list node; if the WEB page is an unstructured page containing only one WEB object block, the parent node of that WEB object block is the article node.
6. The multi-category WEB object extraction method based on a relationship mechanism according to claim 1, characterized in that: the attribute-level WEB object extraction in step (4) comprises the following steps:
(4.1) segmenting the text in the WEB object record block into words and matching the nouns in it against the object names in the WEB object relationship library to obtain the category set of all nouns in the WEB object record block, this category set forming the local template that describes the WEB object record block;
(4.2) taking object support and subject contribution into account, performing template matching between the local template and the core relationship templates, and judging the category of the WEB object according to the template matching ratio;
(4.3) once the WEB object category is known, using a voting strategy to extract the core WEB object and its related WEB objects from the object record block according to the core relationship template of that category.
7. The multi-category WEB object extraction method based on a relationship mechanism according to claim 6, characterized in that: the object support in step (4.2) is the probability that a given category B describes category A; the subject contribution is the importance of one relationship object relative to all relationship objects of a relationship subject; the template matching ratio is the ratio of the number of relationships matched between the local WEB object template and the core relationship template to the number of all relationships in the core relationship template.
8. The multi-category WEB object extraction method based on a relationship mechanism according to claim 6, characterized in that: using a voting strategy in step (4.3) to extract the core WEB object and its related WEB objects from the object record block according to the core relationship template of the category comprises the following steps: removing the most frequently occurring WEB object from the WEB object set; voting on the remaining objects in the WEB object set, taking each sentence in the WEB object record block as the unit, where a positive vote is cast for an object if other objects of the WEB object set are present and a negative vote is cast otherwise; after all sentences have been voted on, sorting all WEB objects by the votes received, the WEB object that has the most votes and belongs to the category being judged to be the core WEB object.
9. The multi-category WEB object extraction method based on a relationship mechanism according to claim 1, characterized in that: the WEB object relationship visualization in step (5) allows the user to browse the various relationships between objects intuitively; the object relationship visualization provided not only shows the distribution of WEB objects and the relationships between categories, but also reflects in detail the popularity of the WEB objects and the details of the object relationships.
CN 201110294846 2011-09-30 2011-09-30 Multi- category WEB object extract method based on relationship mechanism Active CN102436472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110294846 CN102436472B (en) 2011-09-30 2011-09-30 Multi- category WEB object extract method based on relationship mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110294846 CN102436472B (en) 2011-09-30 2011-09-30 Multi- category WEB object extract method based on relationship mechanism

Publications (2)

Publication Number Publication Date
CN102436472A true CN102436472A (en) 2012-05-02
CN102436472B CN102436472B (en) 2013-10-30

Family

ID=45984535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110294846 Active CN102436472B (en) 2011-09-30 2011-09-30 Multi- category WEB object extract method based on relationship mechanism

Country Status (1)

Country Link
CN (1) CN102436472B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833554A (en) * 2009-03-09 2010-09-15 富士通株式会社 Method and equipment for producing extraction template and method and equipment for extracting content on web pages
CN101996190A (en) * 2009-08-12 2011-03-30 北京大学 Method and device for extracting information from webpage
CN102073654A (en) * 2009-11-20 2011-05-25 富士通株式会社 Methods and equipment for generating and maintaining web content extraction template

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046236A (en) * 2019-03-20 2019-07-23 腾讯科技(深圳)有限公司 A kind of search method and device of unstructured data
CN110046236B (en) * 2019-03-20 2022-12-20 腾讯科技(深圳)有限公司 Unstructured data retrieval method and device
CN112667940A (en) * 2020-10-15 2021-04-16 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN112667940B (en) * 2020-10-15 2022-02-18 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning

Also Published As

Publication number Publication date
CN102436472B (en) 2013-10-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant