CN102254014A

CN102254014A - Adaptive information extraction method for webpage characteristics

Info

Publication number: CN102254014A
Application number: CN 201110205137
Authority: CN
Inventors: 金海�; 李毅; 赵峰; 严奉伟
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2011-07-21
Filing date: 2011-07-21
Publication date: 2011-11-23
Anticipated expiration: 2031-07-21
Also published as: CN102254014B

Abstract

The invention discloses a method for extracting information from academic homepages. The method comprises the following steps: (1) finding academic homepages on the Internet; (2) crawling and analyzing each academic homepage, using a heuristic strategy to reduce the crawling of irrelevant pages and thereby speed up analysis; (3) parsing each page into a Document Object Model (DOM) tree and partitioning it according to the attributes and content of its elements to obtain a list of cohesive text units; (4) identifying the text units with information recognizers, each of which recognizes only one information type, and extracting subfields from the text; (5) performing association analysis on the extraction results, using the associations between pieces of information to resolve ambiguities and fill in missing fields; and (6) matching the extraction results against a database and eliminating redundant data, the extraction results being stored in a semantic database as semantic data. By combining heuristic rules, machine learning methods, and a conditional probability model, the method extracts academic information from academic homepages efficiently and accurately.

Description

An adaptive information extraction method for webpage characteristics

Technical field

The invention belongs to the field of information extraction systems and specifically relates to an adaptive information extraction method for webpage characteristics. The method is particularly suited to extracting the author name, email address, institution information, and published articles from academic homepages.

Background art

The arrival of the information age has gradually made the network the main channel through which people share and obtain information, and all kinds of information are published on the Internet as web pages for people to read. However, with the explosive growth of Internet information, people find it increasingly difficult to locate the information they need. On the one hand, the volume of information is huge; on the other hand, the way information is presented is highly flexible and unconstrained, which raises the cost of identifying the target information. Web information extraction has therefore become a field well worth researching in the information age.

Web information extraction grew out of traditional text information extraction. Unlike plain text, web content is expressed in Hypertext Markup Language (HTML) and comprises text, pictures, and other multimedia information, and tags may be nested within one another to form a tree structure. The main goal of a web information extraction task is to extract target information from semi-structured web text. Web information usually has the following characteristics: (1) dispersion: information is not concentrated on one website but is published on different websites by different people; (2) heterogeneity: even similar information may be presented in different ways on different websites; (3) redundancy: the same information may be repeated on multiple websites. Given these characteristics, a web information extraction system needs strong adaptability and discrimination ability.

Early research on web information extraction concentrated on rule-based methods, from scripted extraction based on regular expressions to the dedicated extraction languages developed later. The core idea is to extract specific patterns that contain the target information. Such systems differ mainly in how the patterns are obtained. Some systems build the patterns manually; the benefit is that the patterns are more accurate, but a complex extraction task requires many patterns, so the labor cost is high. To reduce the cost of building patterns, pattern learning systems based on automatic training were proposed. Such a system takes a set of training samples in which the target information blocks have been marked manually, automatically induces possible matching patterns from the samples, and applies the patterns to actual extraction tasks after verification and screening. This approach provides a degree of automatic extraction, but because it still relies on rules underneath, it cannot reach high accuracy on complex extraction tasks. In recent years extraction methods have gradually shifted to machine learning models, and methods originally used in natural language understanding have been applied to information extraction with good results.

An academic homepage is a website on which a researcher in an academic field presents his or her basic personal information and research results. Different authors build different page templates according to their own preferences. Although page styles vary, academic homepages usually contain similar information, such as the author name, institution information, contact details, projects, and article information. Using an information extraction system to gather this information is very valuable.

Summary of the invention

The purpose of the invention is to provide an adaptive information extraction method for webpage characteristics. The method can extract the required information from academic homepages of different styles and is characterized by strong adaptability, high accuracy, and strong extensibility.

The adaptive information extraction method for webpage characteristics provided by the invention is characterized in that the method comprises the following steps:

Step 1: search the Internet for websites whose type is academic homepage;

Step 2: analyze each academic homepage found, treating a page of the academic homepage as a set of two-tuples (L, C), where L is the URL of a link and C is the context of the link, and check whether L and C contain keywords; if they do, proceed to Step 3, otherwise filter out the link;

Step 3: analyze the link to obtain the document tree structure of the page, and partition the page according to the attributes and content of the tree nodes into text units T, which form the text unit set {T₁, T₂, ..., Tₙ};

Step 4: extract four target fields from the text unit set {T₁, T₂, ..., Tₙ}, namely the author name N, the email address M, the institution information U, and the article information set {P₁, P₂, ..., Pₙ}, as the preliminary extraction result;

Step 5: perform association analysis on the preliminary extraction result obtained in Step 4, using the associations between pieces of information to resolve ambiguities and fill in missing fields, to obtain the extraction result, which is stored in the result database;

Step 6: match the elements of the article information set {P₁, P₂, ..., Pₙ} against the records in the result database and eliminate redundant data;

Step 7: output the extraction result.

The adaptive information extraction method for webpage characteristics provided by the invention combines machine learning algorithms, a probability model, and rule-based methods, and can extract the author name, email address, institution information, and published articles from academic homepages of different styles. Specifically, the invention has the following effects and advantages:

(1) Strong adaptability

Academic homepages are written by many different researchers, and their content and layout vary widely. The invention handles the lack of a uniform page format well and adapts automatically to the various cases;

(2) High accuracy

The core algorithms of the invention are based on machine learning algorithms and a probability model, combined with heuristic rules, and achieve very high accuracy in extracting each target field;

(3) Strong extensibility

The invention can be extended to extract other fields from a page, and its recognition process can also be used to solve other similar problems; the extension process is simple and highly general.

Description of drawings

Fig. 1 is the overall flow chart of the extraction process of the invention;

Fig. 2 is the flow chart of author name extraction in the invention;

Fig. 3 is the flow chart of email address extraction in the invention;

Fig. 4 is the flow chart of institution information extraction in the invention;

Fig. 5 is the flow chart of article information extraction in the invention.

Embodiment

The invention is described in detail below with reference to the accompanying drawings and an example.

The adaptive information extraction method for webpage characteristics provided by the invention comprises the following steps:

(1) Search the Internet for websites whose type is academic homepage. This process is divided into two stages: a search stage and a decision stage.

In the search stage, a data set of author names is first derived from existing bibliographic data as seed data. Each author name in the data set is then used as a keyword to query a search engine, which returns the search results as a list. Each search result usually consists of a title, a link, and a short snippet of summary text. The search engine usually returns several pages of results; the links and summary texts of the first page of search results are stored in a candidate result list.

In the decision stage, the search results in the candidate result list are first filtered according to their links and summary texts. The filtering uses a database, referred to as the blocked-link database, that contains the confusing websites which frequently appear in search results. The filtering strategy has two steps. First, each search result is checked against the blocked-link database, and any result found in the database is excluded directly. Then, for each remaining search result, its link is checked for the pattern "~" + author name; if the pattern is present the result is kept, otherwise it is excluded directly. The search results that pass these two filtering steps are then processed one by one as follows: a page request is sent to the link, and a support vector machine classification algorithm decides whether the returned page is the author's academic homepage. If it is, the page is saved as the author's academic homepage and the decision ends; otherwise the same operation is applied to the next search result.
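The two rule-based filtering steps of the decision stage can be sketched as follows. This is a minimal illustration, not the patented implementation: `BLOCKED_HOSTS` is a hypothetical stand-in for the blocked-link database, the substring test between the "~" user name and the author name is an assumed heuristic, and the subsequent support vector machine check of the fetched page is omitted.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the blocked-link database.
BLOCKED_HOSTS = {"en.wikipedia.org", "dblp.org"}


def looks_like_homepage(url, author_name):
    """True if the URL path has a '~user' segment related to the author name."""
    tokens = [t for t in author_name.lower().split() if len(t) >= 2]
    # Normalise the full-width tilde that sometimes appears in scraped URLs.
    path = urlparse(url).path.replace("\uff5e", "~")
    for segment in path.split("/"):
        if segment.startswith("~") and len(segment) > 1:
            user = segment[1:].lower()
            # Assumed heuristic: the user name and a name part overlap.
            if any(t in user or user in t for t in tokens):
                return True
    return False


def filter_candidates(results, author_name):
    """results: list of (url, snippet); returns URLs passing both filters."""
    kept = []
    for url, _snippet in results:
        if urlparse(url).netloc.lower() in BLOCKED_HOSTS:
            continue  # step 1: drop results found in the blocked-link database
        if looks_like_homepage(url, author_name):
            kept.append(url)  # step 2: keep links of the form "~" + author name
    return kept
```

Each URL that survives would then be fetched and passed to the page classifier in order, stopping at the first page classified as the author's academic homepage.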

(2) Analyze the author's academic homepage. An author's academic homepage is usually a complete website containing many subpages, some of which contain the target information the system needs while others are entirely irrelevant. To improve crawling efficiency and to prevent subsequent modules from parsing too many useless pages in depth and wasting computing resources, the invention uses a filtering algorithm based on a heuristic strategy. The algorithm treats a page as a set of two-tuples (L, C), where L is the URL of a link and C is the context of the link. The algorithm checks whether L and C contain keywords such as publication, paper, or research; if they do, the link is parsed further (proceeding to step (3)), otherwise the link is filtered out.
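As a minimal sketch of this heuristic filter, each link can be represented as a (L, C) pair and kept only when a keyword occurs in either part. The keyword tuple uses the three examples named above; the function names and the case-insensitive substring test are assumptions made for illustration.

```python
# Keywords named in the description; a real list would likely be longer.
KEYWORDS = ("publication", "paper", "research")


def keep_link(url, context, keywords=KEYWORDS):
    """True if the link URL (L) or its context (C) mentions a keyword."""
    text = (url + " " + context).lower()
    return any(keyword in text for keyword in keywords)


def filter_links(links):
    """links: iterable of (L, C) pairs; returns the pairs worth parsing."""
    return [(url, context) for url, context in links if keep_link(url, context)]
```

For example, a link whose anchor text is "Publications" is kept, while a link to a photo gallery is dropped before any page request is made.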

(3) Analyze the page to be parsed to obtain the document tree structure of the web page, and partition the page according to the attributes and content of the document tree nodes into several small units, referred to as text units T. The result of the partition is the text unit set {T₁, T₂, ..., Tₙ}. The steps are as follows.

(a) First, parse the page with an HTML parser to obtain the document tree of the page. The nodes of the document tree correspond to the HTML tags in the page, and the document tree represents the relations between the HTML tags in the page as a tree structure.

(b) Then partition the page. HTML tags can be divided into block-level elements and inline elements. Common block-level elements include BR, DIV, H1, H2, LI, UL, TH, TD, TR, and TABLE; common inline elements include SPAN, BOLD, A, FONT, and IMG. An HTML page can be regarded as a set of block-level elements, between which two kinds of relation exist: parent-child and sibling. Block-level elements and inline elements can be nested within one another. The document tree presents these relations in the form of tree nodes. A node of the document tree that contains a block-level element is called a block-level node, and all other nodes are called non-block-level nodes. The nodes of the document tree are traversed, and the page is partitioned according to the category of each node. The partitioning steps are as follows:

(b1) Initially, the text unit set is empty;

(b2) Perform a depth-first traversal of the document tree to find all block-level nodes; for each block-level node Ni, create a text unit Ti and assign the page content corresponding to Ni to Ti;

(b3) For each block-level node Ni, determine whether it has non-block-level child nodes in the document tree; if so, assign the page content corresponding to all of its non-block-level child nodes to Ti as well;

(b4) Add Ti to the text unit set;

(b5) End.

(c) When the traversal ends, the partition of the page is complete and the text unit set {T₁, T₂, ..., Tₙ} is obtained.
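The partitioning procedure in steps (b1) to (b5) can be sketched with the standard-library HTML parser. This is an illustrative reading under stated assumptions: `BLOCK_TAGS` is the block-level list given above, `Node`, `TreeBuilder`, and `partition` are hypothetical names, empty units are skipped, and text that is not inside any listed block-level element is ignored.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"br", "div", "h1", "h2", "li", "ul", "th", "td", "tr", "table"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never have children


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and text strings


class TreeBuilder(HTMLParser):
    """Builds a simple document tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy HTML.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def collect_inline_text(node):
    """Text directly under a node plus text of its non-block descendants."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in BLOCK_TAGS:
            parts.extend(collect_inline_text(child))
    return parts


def partition(html):
    """Depth-first traversal that yields one text unit per block-level node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    units = []  # (b1) the text unit set starts empty

    def visit(node):
        if node.tag in BLOCK_TAGS:
            text = " ".join(collect_inline_text(node))  # (b2) and (b3)
            if text:
                units.append(text)  # (b4)
        for child in node.children:
            if isinstance(child, Node):
                visit(child)

    visit(builder.root)
    return units
```

Because inline children such as B or A are folded into their enclosing block, an obfuscated address like `E-mail: <b>hanj</b> at cs.uiuc.edu` stays in a single text unit, which is what the later recognizers rely on.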

(4) Extract four target fields from the text unit set {T₁, T₂, ..., Tₙ}, namely the author name N, the email address M, the institution information U, and the article information set {P₁, P₂, ..., Pₙ}, as the preliminary extraction result;

The extraction methods for the different types of target field are introduced separately below:

The extraction process for the author name N is shown in Fig. 2. Its basic steps are as follows:

(a1) Use a support vector machine classification algorithm to classify the text units in the text unit set {T₁, T₂, ..., Tₙ}, and keep the text units classified as author name as the set T_Name;

(a2) Use an author-name database to match author-name text in T_Name. The author-name database is prepared in advance; it collects and organizes common English male and female given names and a number of Chinese pinyin names. This database is used to match a set of candidate author names from T_Name;

(a3) Extract the text of the title of the author's academic homepage. In most cases the title contains the author name XXX in the form "XXX's Homepage"; extract the author name XXX from the title;

(a4) Match the author name XXX obtained in (a3) against the candidate author names obtained in (a2), and output the name with the highest degree of matching with XXX as the author name N.
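Steps (a3) and (a4) can be sketched as below. The regular expression for "XXX's Homepage" titles and the use of `difflib.SequenceMatcher` as the matching-degree measure are assumptions made for illustration; the source does not specify how the matching degree is computed.

```python
import re
from difflib import SequenceMatcher


def name_from_title(title):
    """Extract XXX from titles such as "XXX's Homepage" or "XXX's Home Page"."""
    match = re.match(r"\s*(.+?)\s*['\u2019]\s*s\s+home\s*page", title, re.IGNORECASE)
    return match.group(1) if match else None


def pick_author_name(title, candidates):
    """Return the candidate that best matches the name found in the title."""
    title_name = name_from_title(title)
    if not candidates:
        return title_name
    if title_name is None:
        return candidates[0]
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, c.lower(), title_name.lower()).ratio(),
    )
```

With candidates taken from the page body and from publication entries, the title acts as a tie-breaker that separates the page owner from co-authors.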

The extraction process for the email address M is shown in Fig. 3. Its basic steps are as follows:

(b1) First, use a support vector machine classifier to find the set T_Email of possible email candidate text units in the text unit set {T₁, T₂, ..., Tₙ}. The input features of the support vector machine include symbols common in email information, such as "Email", "@", and ".". These feature symbols are searched for in each candidate text unit to generate a feature vector. The support vector machine algorithm judges each email candidate text unit in T_Email according to its feature vector; if the classification result is positive, step (b2) is carried out, otherwise the unit is filtered out directly.

(b2) Remove the unnecessary parts of the email candidate text unit, such as the indicative prefix "Email:". Removing this information helps the subsequent steps obtain a valid email address.

(b3) Next, match the email candidate text unit with a fuzzy-matching state machine algorithm. A standard email address has the following fields: a user name, "@", one or more provider domain names each followed by ".", and a top-level domain, i.e. username@(provider-domain.)+top-level-domain. The algorithm creates a matching node for each field and uses the state machine to enumerate the possible matching forms, generating many different matching results, usually several dozen.

(b4) Compare each matching result of the email candidate text unit against the fields, choose the result with the greatest degree of matching as the final result, and convert it into the standard, valid email format for output according to the standard email fields.
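As a deliberately simpler stand-in for the fuzzy-matching state machine, the sketch below normalises common obfuscations by substitution and then validates the result against the standard field layout. It does not enumerate alternative matches or score them; the obfuscation forms handled ("at", "(at)", "[at]", "dot", and so on) and all names are assumptions.

```python
import re

PREFIX = re.compile(r"^\s*e-?mail\s*[:\-]?\s*", re.IGNORECASE)
AT = re.compile(r"\s*(?:@|\(at\)|\[at\]|\bat\b)\s*", re.IGNORECASE)
DOT = re.compile(r"\s*(?:\(dot\)|\[dot\]|\bdot\b|\.)\s*", re.IGNORECASE)
# user name, "@", one or more "domain." parts, then a top-level domain
VALID = re.compile(r"^[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")


def normalise_email(text):
    """Return a standard-form address, or None if the text cannot be one."""
    body = PREFIX.sub("", text.strip())   # (b2) drop the "Email:" prefix
    body = AT.sub("@", body, count=1)     # first separator becomes "@"
    body = DOT.sub(".", body)             # every dot form becomes "."
    return body if VALID.match(body) else None
```

A full state machine would keep several competing readings of the same text and pick the best-scoring one, which matters when words such as "at" also appear in the user name or surrounding text.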

The extraction process for the institution information U is shown in Fig. 4. Its basic steps are as follows:

(c1) First, collect data on universities and research institutes worldwide from the Internet, including the name of each institution and the link to its homepage, and build an institution homepage database. Build an inverted index for the database. The inverted index supports fast keyword search and can quickly determine the entries that contain a given set of keywords.

(c2) Use a support vector machine classifier to find the set T_U of possible institution information text units in the text unit set {T₁, T₂, ..., Tₙ}. Convert each institution information text unit in T_U to plain text, search the index with it as the keyword, and obtain the three top-ranked search results. Perform fuzzy matching between these three search results and the corresponding institution information text unit. If a match is found, the text is determined to correspond to that institution, and the matching result with the highest degree of matching is output; if none of them matches, proceed to (c3).

(c3) Search using the URL of the homepage. An academic website is usually a subsite of the institution's website, so the domain name of the homepage is matched against the institution homepage database. If a matching record exists, the author is considered to belong to that institution, and the matching record is output as the result.
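Steps (c1) to (c3) can be sketched as follows. The two-entry `INSTITUTIONS` table and its homepage URLs are hypothetical sample data, the ranking by shared keyword count is an assumed retrieval score, and `difflib.SequenceMatcher` with a 0.6 threshold is an assumed fuzzy-matching criterion.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical miniature institution homepage database: name -> homepage URL.
INSTITUTIONS = {
    "University of Illinois at Urbana-Champaign": "http://www.uiuc.edu/",
    "Huazhong University of Science and Technology": "http://www.hust.edu.cn/",
}


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(names):
    """(c1) Inverted index: keyword -> set of institution names."""
    index = defaultdict(set)
    for name in names:
        for tok in tokens(name):
            index[tok].add(name)
    return index


def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_institution(text, page_url, index, threshold=0.6):
    # (c2) Rank names by shared keywords, keep the top three, fuzzy-match them.
    hits = defaultdict(int)
    for tok in tokens(text):
        for name in index.get(tok, ()):
            hits[name] += 1
    top3 = sorted(hits, key=lambda n: (-hits[n], n))[:3]
    best = max(top3, key=lambda n: similarity(text, n), default=None)
    if best is not None and similarity(text, best) >= threshold:
        return best
    # (c3) Fall back to matching the homepage domain against institution domains.
    host = urlparse(page_url).netloc.lower()
    for name, home in INSTITUTIONS.items():
        domain = urlparse(home).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

The domain fallback is what allows a page that only says "Department of Computer Science" to be attributed to an institution at all.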

The extraction process for the article information {P₁, P₂, ..., Pₙ} is shown in Fig. 5. Its basic steps are as follows:

(a) First, use a support vector machine classification algorithm to classify the text units and select the text units that may contain article information. The accuracy of the classification algorithm is closely tied to the final recognition accuracy for article information; the classifier must filter out easily confused, similar-looking information such as course information, patents, and projects. The accuracy of the classification algorithm depends mainly on two aspects: the training samples and the choice of features. The training set is built iteratively: misclassified samples are repeatedly added to the training set to correct the original model. The feature vector consists of a set of discriminative vocabulary terms. After screening by the classification algorithm, irrelevant text units are excluded and the candidate article information text units are obtained.

(b) Then perform sequence labeling on the candidate article information text units to extract each subfield of the candidate text, including author name, title, conference or journal name, and time. The sequence labeling algorithm is based on a conditional random field model, which uses the following features:

1. Text features

A) The token itself, including its original form and its stem form

B) Capitalization features, including initial capital, all capitals, and single capital letter

C) Numeric features: digits, a mixture of digits and letters, Roman numerals

D) Punctuation features: comma, quotation mark, full stop, etc.

E) HTML tag features: start, middle, and end of a tag

2. Pattern features

A) Time pattern, 19XX or 20XX

B) Page-range pattern, XXX-XXX

3. Dictionary features

Author name, geographic location, publisher, time, conference or journal name, institution name

4. Term features

Vocabulary commonly used in bibliographic data, such as pp, editor, and volume

These features are extracted from the candidate article information text units. The feature functions in the conditional random field model are Boolean-valued, i.e. each function outputs yes or no. The model then computes the most probable labeling of the candidate article information text unit. Tokens with the same label are merged into the corresponding subfield, such as the author name field, the title field, the conference or journal field, and the time field, and each of these fields then receives its own subsequent processing.
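A per-token feature function for some of the text, pattern, and term features listed above might look like the sketch below. It only produces the feature dictionary that a conditional random field toolkit would consume; the feature names, the three-word `TERMS` set, and the regular expressions are assumptions, and stem, HTML tag, and dictionary features are omitted.

```python
import re

# Terms named in the description; a real list would be larger.
TERMS = {"pp", "editor", "volume"}


def token_features(token):
    """Boolean features for one token (plus its lower-cased form)."""
    return {
        # The token itself; a CRF toolkit would binarise it as "lower=<value>".
        "lower": token.lower(),
        "init_cap": token[:1].isupper(),
        "all_caps": len(token) > 1 and token.isalpha() and token.isupper(),
        "single_cap": len(token) == 1 and token.isupper(),
        "is_digit": token.isdigit(),
        "alnum_mix": bool(re.search(r"[A-Za-z]", token)) and bool(re.search(r"\d", token)),
        "is_roman": bool(re.fullmatch(r"[IVXLCDM]+", token)),
        "is_punct": bool(re.fullmatch(r"[^\w\s]+", token)),
        "is_year": bool(re.fullmatch(r"(?:19|20)\d{2}", token)),
        "is_page_range": bool(re.fullmatch(r"\d+-\d+", token)),
        "is_term": token.lower().strip(".") in TERMS,
    }
```

Applying `token_features` to every token of a candidate text unit gives the observation sequence over which the model assigns labels such as author, title, and time.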

(c) The author name field contains the whole author list and needs to be split into individual authors. The splitting algorithm is based on heuristic rules, relying mainly on the length of the names, abbreviated forms, and punctuation marks. The result of the split is stored in an array.
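One possible set of heuristic rules for splitting the author list is sketched below: split on commas, semicolons, "and", and "&", then reattach a fragment made only of initials to the surname that precedes it. These particular rules are assumptions chosen to illustrate the use of punctuation and abbreviated forms; the source does not list its rules.

```python
import re

# A fragment made only of initials, e.g. "J." or "J. W."
INITIALS = re.compile(r"(?:[A-Z]\.\s*)+|[A-Z]")


def split_authors(field):
    """Split an author-list field into an array of individual names."""
    text = field.strip().rstrip(",;")
    text = re.sub(r"\s*[,;]?\s*\band\b\s*", ", ", text)  # ", and " -> ", "
    text = re.sub(r"\s*&\s*", ", ", text)
    parts = [p.strip() for p in re.split(r"[;,]", text) if p.strip()]
    authors = []
    for part in parts:
        if authors and INITIALS.fullmatch(part):
            # "Han, J." style: the initials belong to the previous surname.
            authors[-1] = part + " " + authors[-1]
        else:
            authors.append(part)
    return authors
```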

The title field must be trimmed into a standard form before it can serve as the final result. The main purpose of trimming is to remove invalid characters from the beginning and end of the field, such as punctuation marks and boundary errors.

In practice a conference or journal name can be written in several ways, such as a capitalized abbreviation or a customary short form. The conference or journal field as extracted therefore cannot be used directly as the final result and must be matched against a database. The conference and journal database collects common conference and journal names together with their abbreviated forms. First, the capitalized abbreviation in the field to be identified is extracted and looked up in the database. If it matches, the matched full name is fuzzy-matched against the input field to guard against errors caused by conflicting abbreviations; if the fuzzy match succeeds, the result is output directly. Otherwise, an index is built over the conference and journal names, the field to be matched is searched in the index, and the search results are fuzzy-matched against the field. If a match is found, the result is output.
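The two-stage lookup can be sketched as follows. `VENUES` is a hypothetical two-entry excerpt of the conference and journal database, `difflib.SequenceMatcher` with a 0.5 threshold is an assumed fuzzy-matching criterion, and the index search of the second stage is replaced here by a linear scan over the full names.

```python
import re
from difflib import SequenceMatcher

# Hypothetical excerpt of the venue database: abbreviation -> full name.
VENUES = {
    "SIGMOD": "ACM SIGMOD International Conference on Management of Data",
    "VLDB": "International Conference on Very Large Data Bases",
}


def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_venue(field, threshold=0.5):
    # Stage 1: look up capitalized abbreviations, then confirm by fuzzy match
    # so that an abbreviation shared by two venues is not accepted blindly.
    for abbr in re.findall(r"\b[A-Z]{2,}\b", field):
        full = VENUES.get(abbr)
        if full and similarity(field, full) >= threshold:
            return full
    # Stage 2: compare against every full name (stand-in for an index search).
    best = max(VENUES.values(), key=lambda name: similarity(field, name))
    return best if similarity(field, best) >= threshold else None
```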

The time field is handled by a rule-based method that uses regular expressions to find valid time patterns in the input text. A valid time pattern takes one of two forms: the first is a four-digit number beginning with 19 or 20; the second begins with the capitalized abbreviation of the conference or journal name, followed by a quotation mark and the time. These two patterns handle most cases encountered in practice, and the recognition accuracy exceeds 99 percent.
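The two time patterns can be expressed as regular expressions, as in this sketch. The exact expressions and the century pivot used to expand a two-digit year are assumptions; the source only describes the two forms in words.

```python
import re

# Form 1: a four-digit number beginning with 19 or 20.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
# Form 2: capitalized abbreviation, a quotation mark, then a two-digit year.
ABBR_YEAR = re.compile(r"\b[A-Z]{2,}\s*['\u2019]\s*(\d{2})\b")


def extract_year(text, pivot=30):
    """Return the publication year found in text, or None."""
    match = YEAR.search(text)
    if match:
        return int(match.group(0))
    match = ABBR_YEAR.search(text)
    if match:
        yy = int(match.group(1))
        # Assumed pivot: two-digit years below 30 are read as 20YY.
        return (2000 if yy < pivot else 1900) + yy
    return None
```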

(5) Fill in missing fields and resolve ambiguities in the preliminary extraction result obtained in step (4) (comprising the author name N, the email address M, the institution information U, and the article information set {P₁, P₂, ..., Pₙ}) to obtain the final extraction result, which is stored in the result database.

The information contained in real pages may be incomplete or non-standard to some degree, and several results may be identified for the same item of information, which requires further judgment. This process uses the associations between pieces of information to complete the extraction result and to make a further judgment on ambiguous results. The associations include the following cases:

(a) the association between the author name and the email user name;

(b) the association between the institution information and the homepage domain name;

(c) the association between the author name and the author lists in the article information.

Based on these associations, the extraction result can be completed. For example, when the institution information is missing, the homepage link can be looked up in the database to obtain the corresponding institution information. For ambiguity resolution, when several email addresses are present, the correspondence between the author name and the user name can be used to exclude incorrect results.
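Association (a) can be illustrated with a small sketch that picks, among several extracted addresses, the one whose user name is closest to a plausible rendering of the author name. The set of user-name variants and the `difflib.SequenceMatcher` score are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def username_variants(name):
    """Plausible user names derived from an author name (assumed forms)."""
    parts = name.lower().split()
    if not parts:
        return set()
    first, last = parts[0], parts[-1]
    return {
        first, last, first + last, last + first,
        first[0] + last, last + first[0], first + "." + last,
    }


def pick_email(author_name, emails):
    """Choose the address whose user name best matches the author name."""
    variants = username_variants(author_name)

    def score(email):
        user = email.split("@", 1)[0].lower()
        return max(
            (SequenceMatcher(None, user, v).ratio() for v in variants),
            default=0.0,
        )

    return max(emails, key=score) if emails else None
```

For "Jiawei Han" the variant "hanj" is generated, so an address such as hanj@cs.uiuc.edu outranks a generic webmaster address found on the same page.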

(6) Match the elements of the article information set {P₁, P₂, ..., Pₙ} against the records in the result database and eliminate redundant data.

Although the extraction process is essentially complete after the association analysis, the result may still contain repeated, redundant information. This step matches the extraction result against the records in the result database. When a matching record is found, the two are compared by fuzzy matching; if a field is missing from the record in the result database, that field is filled in. If no matching record is found in the result database, the extraction result is added to the result database.
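The match, complete, or append logic of this step can be sketched as below. Records are modelled as plain dictionaries, and matching on title similarity with a 0.9 threshold is an assumption; the source does not state which fields are compared.

```python
from difflib import SequenceMatcher


def same_article(a, b, threshold=0.9):
    """Fuzzy comparison of two records by title."""
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold


def merge_into(database, extracted):
    """Merge extracted records into database (a list of dicts), in place."""
    for record in extracted:
        match = next((d for d in database if same_article(d, record)), None)
        if match is None:
            database.append(dict(record))  # no match: add as a new record
        else:
            for field, value in record.items():
                if not match.get(field):   # fill only fields that are missing
                    match[field] = value
    return database
```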

(7) Output the extraction result.

Example:

Take the process of extracting information from the academic homepage http://www.cs.uiuc.edu/~hanj/ as an example. First, Jiawei Han is used as the search keyword in a search engine. The results from Wikipedia and DBLP are excluded according to the blocked-link database, page requests are then sent for the three top-ranked results, and, after judgment by the classifier, the first search result is selected as this author's academic homepage.

The page is parsed with an HTML parser to obtain its sublinks, and the following subpages are selected for further analysis according to the link keywords and context:

http://www.cs.uiuc.edu/homes/hanj/pubs/index.htm

https://agora.cs.illinois.edu/display/cs591han/Research+Publications+-+Data+Mining+Researc?h+Group+at+CS％2C+UIUC

Each page to be analyzed is partitioned into text units. Taking the homepage as an example, the following result is obtained:

The text units above are classified with a support vector machine and judged to be, respectively, author name, irrelevant data, university information, email address, and article information. Each unit is then processed by the extraction flow that corresponds to its class, and irrelevant data is discarded directly.

The author name extraction finds the homepage title (Jiawei Han), the author name in the body text (Jiawei Han), and the author names contained in the article information (Jiawei Han, Xiaofei He, Deng Cai); after cross-matching, Jiawei Han is determined to be the final result.

To extract the email address, the prefix part (E-mail) is removed first; the fuzzy-matching automaton is then used to enumerate all possible email matching results, such as:

hanj (user name) at (separator) cs (domain name) . (dot) uiuc (domain name) . (dot) edu (domain name)

The results are scored according to their degree of matching, and the best result is chosen and converted into the valid email format for output.

The institution information extraction searches the institution index with the text unit classified as institution information. In this example the search keyword is "Univ. of Illinois at Urbana-Champaign", and the first record in the search results is "University of Illinois at Urbana-Champaign". Fuzzy matching judges that the two agree, so the result can be output directly.

For article information, the sequence labeling algorithm is used to label the text and identify the author names and other subfields in it. For example, the article information found earlier is labeled as follows:

<author>Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han,</author> <title>Graph Cube: On Warehousing and OLAP Multidimensional Networks,</title> <meeting>Proc. of 2011 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'11),</meeting> <place>Athens, Greece,</place> <time>June 2011</time>

Once each subfield has been identified, the recognition of the article information is complete. Results with missing fields or ambiguity are then completed and judged according to the associations between pieces of information, and the results are merged into the result database.

The invention is not limited to the embodiment described above. Persons skilled in the art may, on the basis of the content disclosed here, implement the invention in various other embodiments; therefore, any design that adopts the structure and ideas of the invention with simple changes or modifications falls within the scope of protection of the invention.

Claims

1. An adaptive information extraction method for webpage characteristics, characterized in that the method comprises the following steps:

Step 7: output the extraction result.

2. The information extraction method according to claim 1, characterized in that Step 1 is divided into two stages: a search stage and a decision stage;

In the search stage, a data set of author names is first derived from existing bibliographic data as seed data; each author name in the data set is then used as a keyword to query a search engine; the search engine returns the search results as a list, each search result consisting of a title, a link, and summary text; and the links and summary texts of the first page of the returned search results are stored in a candidate result list;

In the decision stage, the candidate result list is first filtered according to the links and summary texts of the search results in the following manner: each link is checked against the blocked-link database, and any result found in the database is excluded directly; then, for each remaining search result, its link is checked for the pattern "~" + author name, and the result is kept if the pattern is present and excluded directly otherwise; the search results that pass these two filtering steps are then processed one by one as follows: a page request is sent to the link, and a support vector machine classification algorithm decides whether the returned page is the author's academic homepage; if it is, the page is saved as the author's academic homepage and the decision ends; otherwise the same operation is applied to the next search result.

3. The information extraction method according to claim 1, characterized in that Step 3 comprises the following process:

(3.1) first, parse the page with an HTML parser to obtain the document tree of the page, the nodes of the document tree corresponding to the HTML tags in the page and the document tree representing the relations between the HTML tags in the page as a tree structure;

(3.2) then partition the page to obtain the text unit set {T₁, T₂, ..., Tₙ}.

4. The information extraction method according to claim 3, characterized in that step (3.2) partitions the page by the following process:

(b1) initially, the text unit set is empty;

(b4) add Ti to the text unit set;

(b5) end.