CN102254014A - Adaptive information extraction method for webpage characteristics - Google Patents

Adaptive information extraction method for webpage characteristics

Info

Publication number
CN102254014A
Authority
CN
China
Prior art keywords
page
result
information
academic
text unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110205137
Other languages
Chinese (zh)
Other versions
CN102254014B (en)
Inventor
金海
李毅
赵峰
严奉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN 201110205137
Publication of CN102254014A
Application granted
Publication of CN102254014B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting information from academic homepages. The method comprises the following steps: (1) finding an academic homepage on the Internet; (2) crawling and analyzing the academic homepage, using a heuristic strategy to reduce the crawling of irrelevant pages and so speed up analysis; (3) parsing each page into a Document Object Model (DOM) tree and dividing it according to the attributes and contents of its elements to obtain a list of cohesive text units; (4) identifying the text units with information recognizers, each of which recognizes only one information type, and extracting subfields from the text; (5) performing association analysis on the extraction results, using the associations between pieces of information to resolve ambiguity and to fill in missing fields; and (6) matching the extraction results against a database and eliminating redundant data, the extraction results being stored in a semantic database as semantic data. By combining heuristic rules, machine learning and a conditional probability model, the method extracts academic information from academic homepages efficiently and accurately.

Description

An adaptive information extraction method for webpage characteristics
Technical Field
The invention belongs to the field of information extraction systems and specifically relates to an adaptive information extraction method for webpage characteristics. The method is particularly suited to extracting the author's name, email address, institution information, published articles and similar information from academic homepages.
Background
With the arrival of the information age, the network has gradually become the main way for people to share and obtain information, and all kinds of information are published on the Internet as webpages for people to read. However, with the explosive growth of Internet information, finding the required information on the Internet has become more and more difficult. On the one hand, the amount of information is huge; on the other hand, the way information is presented is very flexible. Both increase the cost of identifying the target information. Webpage information extraction has therefore become a field worth studying in the information age.
Webpage information extraction grew out of traditional text information extraction. Unlike plain text, webpage content is expressed in the Hypertext Markup Language (HTML) and includes text, pictures and other multimedia, and tags may be nested inside one another to form a tree structure. The main purpose of webpage information extraction is to extract target information from semi-structured webpage text. Webpage information usually has the following characteristics: (1) dispersion: the information is not concentrated on one website but is published on different websites by different people; (2) heterogeneity: even similar information may be presented in different ways on different websites; (3) redundancy: the same information may be repeated on several websites. Given these characteristics, a webpage information extraction system needs strong adaptability and discrimination ability.
Early research on webpage information extraction concentrated on rule-based methods, from scripted extraction based on regular expressions to the dedicated extraction languages developed later, whose core idea is to extract specific patterns that contain the target information. Such systems differ mainly in how the patterns are obtained. Some systems rely on manually written patterns. The benefit is that the patterns are more accurate, but a complex extraction task needs many patterns, so the labor cost is high. To reduce the cost of obtaining patterns, pattern learning systems based on automatic training were proposed. Such a system accepts a set of training samples in which the target information blocks have been labeled by hand, and the learning system automatically summarizes possible matching patterns from the samples. After verification and screening, the patterns are used in the actual extraction task. This approach provides a degree of automatic extraction, but because it still depends on rules underneath, it cannot reach high accuracy on complex extraction tasks. In recent years, extraction methods have gradually shifted to machine learning models, and some methods originally used in natural language understanding have been applied to information extraction with good results.
An academic homepage is a website that a researcher in an academic field uses to present personal information and research results. Different authors build different page templates according to their own preferences. Although page styles vary, academic homepages usually contain similar information, such as the author's name, institution information, contact details, projects and article information. Collecting this information with an information extraction system is very valuable.
Summary of the Invention
The purpose of the invention is to provide an adaptive information extraction method for webpage characteristics. The method can extract the required information from academic homepages of different styles and is characterized by strong adaptability, high accuracy and strong extensibility.
The adaptive information extraction method for webpage characteristics provided by the invention is characterized in that it comprises the following steps:
Step 1: search the Internet for websites that are academic homepages;
Step 2: analyze each academic homepage found, treating its pages as a set of two-tuples (L, C), where L is the URL of a link and C is the context of the link, and check whether L and C contain a keyword; if so, proceed to step 3, otherwise filter out the link;
Step 3: analyze the link to obtain the document tree structure of the page, and divide the page into text units T according to the attributes and contents of the tree nodes, forming the text unit set {T_1, T_2, ..., T_n};
Step 4: extract four target fields from the text unit set {T_1, T_2, ..., T_n}, namely the author's name N, the email address M, the institution information U and the article information set {P_1, P_2, ..., P_n}, as the preliminary extraction result;
Step 5: perform association analysis on the preliminary extraction result of step 4, using the associations between pieces of information to resolve ambiguity and to fill in missing fields, obtain the extraction result, and store it in the result database;
Step 6: match the elements of the article information set {P_1, P_2, ..., P_n} against the records in the result database and eliminate redundant data;
Step 7: output the extraction result.
The method combines machine learning algorithms, a probability model and rule-based methods, and can extract the author's name, email address, institution information, published articles and similar information from academic homepages of different styles. Specifically, the invention has the following effects and advantages:
(1) Strong adaptability
Academic homepages are written by many different researchers, so their content and layout vary widely. The invention handles inconsistent page formats well and adapts automatically to these variations.
(2) High accuracy
The core algorithms of the invention are based on machine learning algorithms and a probability model, combined with heuristic rules, and can reach high accuracy in extracting each target field.
(3) Strong extensibility
The invention can be extended to extract other fields from a page, and its recognition process can also be used to solve other similar problems. The extension process is simple and general.
Brief Description of the Drawings
Fig. 1 is the overall flowchart of the extraction process of the invention;
Fig. 2 is the flowchart for extracting the author's name;
Fig. 3 is the flowchart for extracting the email address;
Fig. 4 is the flowchart for extracting the institution information;
Fig. 5 is the flowchart for extracting the article information.
Detailed Description
The invention is described in detail below with reference to the drawings and an example.
The adaptive information extraction method for webpage characteristics provided by the invention comprises the following steps:
(1) Search the Internet for websites that are academic homepages. This process has two stages: a search stage and a decision stage.
In the search stage, a data set of author names is first exported from existing bibliographic data as seed data. Each author name in the data set is then used as a keyword query in a search engine. The search engine returns its results as a list, and each result usually consists of a title, a link and a short snippet. The search engine usually returns several pages of results; the links and snippets of the first page of results are stored in a candidate result list.
In the decision stage, the results in the candidate result list are first filtered by link and snippet. The filtering uses a database of confusable websites that frequently appear in search results, referred to as the blocked-link database. Filtering has two steps. First, each result is checked against the blocked-link database, and any result found there is excluded. Then, for each remaining result, the link is checked for the pattern "~" + author name; the result is kept if the pattern is present and excluded otherwise. The results that pass both filters are then processed in order: a page request is sent to the link, and a support vector machine (SVM) classifier decides whether the returned page is the author's academic homepage. If it is, the page is saved as the author's academic homepage and the decision ends; otherwise the same operation is applied to the next result.
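The two-step filter of the decision stage can be sketched as follows. This is a minimal illustration, not the patent's implementation: the contents of `BLOCKED_HOSTS`, the function name `is_candidate`, and the rule that the "~" segment counts as matching when it shares a prefix with a name token are all assumptions. The SVM page classifier that follows the filter is not shown.

```python
from urllib.parse import urlparse

# Hypothetical blocked-link database; the patent does not list its contents.
BLOCKED_HOSTS = {"en.wikipedia.org", "dblp.uni-trier.de"}

def is_candidate(url, author_name, blocked_hosts=BLOCKED_HOSTS):
    """Return True if a search result survives both filtering steps."""
    parsed = urlparse(url)
    # Step 1: drop results whose host is in the blocked-link database.
    if parsed.netloc.lower() in blocked_hosts:
        return False
    # Step 2: require a "~user" path segment related to the author's name.
    tilde = parsed.path.find("~")
    if tilde == -1:
        return False
    user = parsed.path[tilde + 1:].split("/")[0].lower()
    if not user:
        return False
    # Assumption: "related" means one string is a prefix of the other.
    name_tokens = author_name.lower().split()
    return any(user.startswith(t) or t.startswith(user) for t in name_tokens)
```

For the homepage used in the example later in this description, `is_candidate("http://www.cs.uiuc.edu/~hanj/", "Jiawei Han")` keeps the link because "hanj" starts with the surname token "han".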
(2) Analyze the author's academic homepage. An academic homepage is usually a complete website with many subpages; some contain the target information the system needs, and others are entirely irrelevant. To improve crawling efficiency and to avoid having later modules parse too many useless pages in depth and waste computing resources, the invention uses a filtering algorithm based on a heuristic strategy. The algorithm treats a page as a set of two-tuples (L, C), where L is the URL of a link and C is the context of the link. It checks whether L and C contain keywords such as publication, paper or research. If they do, the link is parsed further (step (3)); otherwise the link is filtered out.
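A minimal sketch of this heuristic link filter is given below. The keyword tuple uses the three keywords named in the text; the function name `filter_links` and the case-insensitive substring test are assumptions made for illustration.

```python
# Keywords named in the description; a real deployment would likely use more.
KEYWORDS = ("publication", "paper", "research")

def filter_links(links, keywords=KEYWORDS):
    """Keep the (L, C) pairs whose URL L or context C mentions any keyword."""
    kept = []
    for url, context in links:
        haystack = (url + " " + context).lower()
        if any(keyword in haystack for keyword in keywords):
            kept.append((url, context))
    return kept
```

Because the test is a substring match, the context "Publications" passes on the keyword "publication", while a link whose URL and context mention none of the keywords is dropped.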
(3) Analyze the page to be parsed to obtain the document tree structure of the webpage, and divide the page into several small units, called text units T, according to the attributes and contents of the document tree nodes. The result of the division is the text unit set {T_1, T_2, ..., T_n}. The steps are as follows.
(a) First, parse the page with an HTML parser to obtain the document tree of the page. Each node of the document tree corresponds to an HTML tag in the page, and the document tree shows the relations between the HTML tags of the page as a tree structure.
(b) Then divide the page. HTML tags can be divided into block-level elements and inline elements. Common block-level elements include BR, DIV, H1, H2, LI, UL, TH, TD, TR and TABLE; common inline elements include SPAN, BOLD, A, FONT and IMG. An HTML page can be regarded as a set of block-level elements, between which two kinds of relation exist: parent-child and sibling. Block-level elements and inline elements can be nested inside one another. The document tree presents these relations in the form of tree nodes. A node of the document tree that contains a block-level element is called a block-level node, and the other nodes are called non-block-level nodes. The nodes of the document tree are traversed, and the page is divided according to the category of each node, as follows:
(b1) initially, the text unit set is empty;
(b2) traverse the document tree depth-first to find all block-level nodes; for each block-level node N_i, create a text unit T_i and assign the page content of N_i to T_i;
(b3) for each block-level node N_i, check whether it has non-block-level child nodes in the document tree; if so, assign the page content of all of those child nodes to T_i as well;
(b4) add T_i to the text unit set;
(b5) end.
(c) When the traversal ends, the division of the page is complete and the text unit set {T_1, T_2, ..., T_n} is obtained.
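The division into text units can be approximated with Python's standard `html.parser` module, as sketched below. This is a simplification under stated assumptions: instead of building a document tree, it streams through the tags and closes the current text unit whenever a block-level tag opens or closes, so text inside inline children stays with the enclosing block. The class and function names are hypothetical, and `BLOCK_TAGS` contains only the block-level elements listed above.

```python
from html.parser import HTMLParser

# Block-level tags as listed in the description (lowercase, as the parser reports them).
BLOCK_TAGS = {"br", "div", "h1", "h2", "li", "ul", "th", "td", "tr", "table"}

class TextUnitSplitter(HTMLParser):
    """Collect text units, starting a new unit at every block-level boundary."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buffer = []

    def _flush(self):
        # Join buffered text and normalize whitespace; skip empty units.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.units.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text of inline elements (span, a, font, ...) lands in the same buffer.
        self._buffer.append(data)

def split_text_units(html):
    splitter = TextUnitSplitter()
    splitter.feed(html)
    splitter.close()
    splitter._flush()  # emit any trailing text outside block-level tags
    return splitter.units
```

A tree-based implementation would follow steps (b1) to (b5) more literally; the streaming variant is shown only because it needs no third-party parser.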
(4) Extract four target fields from the text unit set {T_1, T_2, ..., T_n}, namely the author's name N, the email address M, the institution information U and the article information set {P_1, P_2, ..., P_n}, as the preliminary extraction result.
The extraction methods for the different types of target field are described separately below.
The extraction process for the author's name N is shown in Fig. 2. Its basic steps are as follows:
(a1) use a support vector machine classifier to classify the text units in the set {T_1, T_2, ..., T_n}, and keep the text units classified as author name, forming the set T_name;
(a2) use the author-name dictionary database to match author-name text fragments in T_name. The author-name dictionary database is prepared in advance and collects common English given names and some Chinese pinyin names. Matching against it yields a set of candidate author names;
(a3) extract the text of the title of the author's academic homepage. In most cases the title contains the author's name XXX in the form "XXX's Homepage"; extract the name XXX from the title;
(a4) match the name XXX obtained in (a3) against the candidate author names obtained in (a2), and output the candidate with the highest degree of match to XXX as the author's name N.
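Steps (a3) and (a4) can be sketched as below. The regular expression for the title pattern and the use of `difflib.SequenceMatcher` as the matching-degree measure are assumptions; the patent does not specify how the degree of match is computed. The SVM classification of step (a1) and the dictionary lookup of step (a2) are not shown, so the candidate list is passed in directly.

```python
import re
from difflib import SequenceMatcher

def name_from_title(title):
    """Pull XXX out of a title such as "XXX's Homepage"; return None if absent."""
    match = re.match(r"\s*(.+?)\s*'s\s+home\s*page", title, flags=re.IGNORECASE)
    return match.group(1) if match else None

def pick_author_name(title, candidates):
    """Return the candidate name most similar to the name found in the title."""
    title_name = name_from_title(title)
    if title_name is None or not candidates:
        return None
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, c.lower(), title_name.lower()).ratio(),
    )
```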
The extraction process for the email address M is shown in Fig. 3. Its basic steps are as follows:
(b1) First, use a support vector machine classifier to find the set T_email of possible email candidate text units in the text unit set {T_1, T_2, ..., T_n}. The input features of the support vector machine include symbols that commonly occur in email information, such as "Email", "@" and ".". These feature symbols are looked up in each text unit to generate a feature vector. The support vector machine judges each email candidate text unit in T_email by its feature vector; if the classification result is positive, step (b2) is carried out, otherwise the text unit is filtered out.
(b2) Remove the superfluous parts of the email candidate text unit, such as the indicative prefix "Email:". Removing them helps the following steps obtain a valid email address.
(b3) Next, match the email candidate text unit with a fuzzy-matching state machine algorithm. A standard email address has the following fields: user name @ (provider domain name .)+ top-level domain. The algorithm creates a matching node for each field and uses the state machine to enumerate the possible matching forms, generating many different matching results, usually several dozen.
(b4) Compare each field with the matching results of the email candidate text unit, choose the result with the highest degree of match as the final result, and convert it into a valid standard email address for output according to the standard email fields.
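The sketch below illustrates steps (b2) to (b4) in a much reduced form. It swaps in plain regular-expression normalization for the fuzzy-matching state machine described above, so it handles only a few common obfuscations ("at", "(at)", "[at]", "dot", and so on); the function name and the accepted spellings are assumptions.

```python
import re

def normalize_email(text):
    """Turn text such as "E-mail: hanj at cs.uiuc.edu" into a standard address."""
    # (b2) Drop an indicative prefix such as "Email:" or "E-mail:".
    text = re.sub(r"^\s*e-?mail\s*:?\s*", "", text, flags=re.IGNORECASE)
    # Rewrite spelled-out separators to their symbols.
    text = re.sub(r"\s*(\(at\)|\[at\]|\bat\b)\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*(\(dot\)|\[dot\]|\bdot\b)\s*", ".", text, flags=re.IGNORECASE)
    # (b3)/(b4) Accept: user name @ (domain label .)+ top-level domain.
    match = re.search(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+", text)
    return match.group(0).lower() if match else None
```

A state machine that enumerates and scores alternative segmentations, as the description proposes, would tolerate more variation than this fixed set of substitutions.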
The extraction process for the institution information U is shown in Fig. 4. Its basic steps are as follows:
(c1) First, collect data on universities and research institutes worldwide from the Internet, including each institution's name and the link to its homepage, and build an institution homepage database. Build an inverted index over the database. The inverted index supports fast keyword search and can quickly determine which entries contain a given set of keywords.
(c2) Use a support vector machine classifier to find the set T_U of possible institution information text units in the text unit set {T_1, T_2, ..., T_n}. Convert each institution information text unit in T_U to plain text and search the index with it as the keyword query, obtaining the top three search results. Fuzzily match the top three results against the corresponding institution information text unit. If a match is found, the text is determined to correspond to that institution and the result with the highest degree of match is output; if none of them matches, go to step (c3).
(c3) Use the URL of the homepage. An academic website is usually a subsite of its institution's website, so the domain name of the homepage is matched against the institution homepage database. If a matching record exists, the author is considered to belong to that institution, and the matching record is output as the result.
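Steps (c1) to (c3) can be sketched as follows. The two-entry `INSTITUTIONS` table, its URLs, the 0.6 similarity threshold and the use of `difflib.SequenceMatcher` for fuzzy matching are all hypothetical choices made for illustration; the SVM classification that selects the institution text unit is not shown.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical institution homepage database: name -> homepage URL.
INSTITUTIONS = {
    "University of Illinois at Urbana-Champaign": "http://www.uiuc.edu/",
    "Huazhong University of Science and Technology": "http://www.hust.edu.cn/",
}

def tokens(text):
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_index(institutions):
    """(c1) Inverted index: token -> set of institution names containing it."""
    index = defaultdict(set)
    for name in institutions:
        for tok in tokens(name):
            index[tok].add(name)
    return index

def match_institution(text, homepage_url, institutions=INSTITUTIONS, threshold=0.6):
    index = build_index(institutions)
    # (c2) Rank names by shared tokens, keep the top three, then fuzzy-match.
    hits = defaultdict(int)
    for tok in tokens(text):
        for name in index.get(tok, ()):
            hits[name] += 1
    top3 = sorted(hits, key=hits.get, reverse=True)[:3]
    scored = [(SequenceMatcher(None, text.lower(), n.lower()).ratio(), n) for n in top3]
    if scored and max(scored)[0] >= threshold:
        return max(scored)[1]
    # (c3) Fallback: the homepage is usually a subsite of the institution's site.
    host = urlparse(homepage_url).netloc.lower()
    for name, url in institutions.items():
        domain = urlparse(url).netloc.lower()
        domain = domain[4:] if domain.startswith("www.") else domain
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

Rebuilding the index on every call keeps the sketch short; a real system would build it once and reuse it.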
The extraction process for the article information {P_1, P_2, ..., P_n} is shown in Fig. 5. Its basic steps are as follows:
(a) First, use a support vector machine classifier to classify the text units and select those that may contain article information. The accuracy of the classifier is closely tied to the final recognition accuracy of the article information: the classifier must filter out easily confused content such as course information, patents and projects. The accuracy of the classifier depends mainly on two aspects: the choice of training samples and the choice of features. The training set is built iteratively, by repeatedly adding misclassified samples to it to correct the original model. The feature vector consists of a set of words with discriminating power. After screening by the classifier, irrelevant text units are excluded and the candidate article information text units are obtained.
(b) Then perform sequence labeling on each candidate article information text unit to extract its subfields, including author names, title, conference or journal name, and time. The sequence labeling algorithm is based on a conditional random field (CRF) model, which uses the following features:
1. Text features
a) The token itself, in its original form and its stem form
b) Capitalization features: initial capital, all capitals, single capital letter
c) Digit features: all digits, a mixture of digits and letters, Roman numerals
d) Punctuation features: comma, quotation mark, full stop, etc.
e) HTML tag features: start, middle and end of a tag
2. Pattern features
a) Time pattern: 19XX or 20XX
b) Page-range pattern: XXX-XXX
3. Dictionary features
Author names, geographic locations, publishers, times, conference and journal names, institution names
4. Terminology features
Words commonly used in bibliographic data, such as pp, editor and volume
These features are extracted from each candidate article information text unit. The feature functions of the conditional random field model are Boolean-valued, that is, each function outputs yes or no. The model then computes the most probable labeling of the candidate article information text unit. Tokens with the same label are merged into the corresponding subfield, such as the author name field, the title field, the conference/journal field and the time field, and each of these fields is then processed further.
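The Boolean feature functions for a single token might look like the sketch below. It covers only part of the feature list (capitalization, digits, punctuation, the two pattern features and the terminology feature); the feature names, the `TERMS` set and the function name are assumptions, and the conditional random field model that consumes these features is not shown.

```python
import re

# Terminology examples taken from the description.
TERMS = {"pp", "editor", "volume"}

def token_features(token):
    """Return Boolean features for one token, mirroring the feature groups above."""
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": len(token) > 1 and token.isupper(),
        "single_cap": len(token) == 1 and token.isupper(),
        "all_digits": token.isdigit(),
        "alnum_mix": any(c.isdigit() for c in token) and any(c.isalpha() for c in token),
        "is_punct": bool(token) and all(not c.isalnum() for c in token),
        "year_pattern": re.fullmatch(r"(19|20)\d{2}", token) is not None,
        "page_pattern": re.fullmatch(r"\d+-\d+", token) is not None,
        "is_term": token.lower().strip(".") in TERMS,
    }
```

Each dictionary value is True or False, which matches the yes/no output of the feature functions described above.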
(c) The author name field contains the whole author list and must be split into individual authors. The splitting algorithm is based on heuristic rules that rely mainly on name length, abbreviated forms and punctuation marks. The result of the split is stored in an array.
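A minimal splitting sketch is shown below. It uses only the punctuation cue (commas, semicolons and the word "and"); the length and abbreviation heuristics mentioned above are omitted, so inverted "Last, First" lists would be split incorrectly. The function name is hypothetical.

```python
import re

def split_authors(field):
    """Split an author-list string on commas, semicolons, and the word "and"."""
    parts = re.split(r"\s*(?:,|;|\band\b)\s*", field)
    return [p.strip() for p in parts if p.strip()]
```

The word-boundary anchors keep names such as "Anderson" intact, since "and" is matched only as a separate word.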
The title field must be normalized and trimmed before it can serve as a final result. The main purpose of trimming is to remove invalid characters at the start and end, such as punctuation marks and boundary errors.
Conference and journal names appear in many forms in practice, such as uppercase abbreviations and common informal names. The conference/journal field as extracted cannot be used directly as the final result; it must be matched against a database. The conference and journal database collects common conference and journal names together with their abbreviations. First, the uppercase abbreviation in the field is extracted and looked up in the database. If it matches, the matched full name is fuzzily matched against the input field, to prevent errors caused by conflicting abbreviations. If that also matches, the result is output directly. Otherwise, an index is built over the conference and journal names, the field is searched in the index, and the search results are fuzzily matched against the field. If a match is found, the result is output.
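The abbreviation lookup with a fuzzy confirmation step can be sketched as follows. The `VENUES` table is a hypothetical excerpt, and the confirmation rule (the share of full-name words that some word of the field abbreviates as a prefix, with a 0.5 threshold) is an assumption standing in for the unspecified fuzzy matching. The index-based fallback described above is not shown.

```python
import re

# Hypothetical excerpt of the conference/journal database: abbreviation -> full name.
VENUES = {
    "SIGMOD": "ACM SIGMOD International Conference on Management of Data",
    "VLDB": "International Conference on Very Large Data Bases",
}

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def match_venue(field, venues=VENUES, threshold=0.5):
    """Look up uppercase abbreviations, then confirm against the full name."""
    field_words = [w for w in words(field) if len(w) >= 2]
    for abbr in re.findall(r"\b[A-Z]{2,}\b", field):
        full = venues.get(abbr)
        if full is None:
            continue
        full_words = words(full)
        # A full-name word is covered if a field word equals it or abbreviates it.
        covered = sum(1 for fw in full_words if any(fw.startswith(w) for w in field_words))
        if covered / len(full_words) >= threshold:
            return full
    return None
```

The confirmation step is what guards against conflicting abbreviations: a field that merely mentions "VLDB" without any words of the full name is rejected.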
The time field is handled by a rule-based method that uses regular expressions to find valid time patterns in the input text. A valid time pattern takes one of two forms: the first is a four-digit number beginning with 19 or 20; the second begins with the uppercase abbreviation of a conference or journal name, followed by a quotation mark and the year. These two patterns cover most cases found in practice, and recognition accuracy exceeds 99 percent.
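The two time patterns can be written as regular expressions along the following lines. The exact expressions, the function name and the `pivot` rule for expanding two-digit years are assumptions; the patent states only the two pattern shapes.

```python
import re

# Pattern 1: a four-digit number beginning with 19 or 20.
YEAR = re.compile(r"\b(19|20)\d{2}\b")
# Pattern 2: an uppercase abbreviation, a quotation mark, then a two-digit year.
ABBR_YEAR = re.compile(r"\b[A-Z]{2,}\s*'\s*(\d{2})\b")

def extract_year(text, pivot=50):
    """Return the year found in `text` as an int, or None."""
    match = YEAR.search(text)
    if match:
        return int(match.group(0))
    match = ABBR_YEAR.search(text)
    if match:
        two_digits = int(match.group(1))
        # Assumption: two-digit years below `pivot` mean 20xx, the rest 19xx.
        return (2000 if two_digits < pivot else 1900) + two_digits
    return None
```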
(5) Fill in missing fields and resolve ambiguity in the preliminary extraction result obtained in step (4) (the author's name N, the email address M, the institution information U and the article information set {P_1, P_2, ..., P_n}) to obtain the final extraction result, and store it in the result database.
The information contained in real pages may be incomplete or irregular to some degree, and several results may be identified for the same information item, which then requires further judgment. This process uses the associations between pieces of information to complete the extraction result and to decide among ambiguous results. The associations include the following:
(a) the association between the author's name and the email user name;
(b) the association between the institution information and the homepage domain name;
(c) the association between the author's name and the author lists in the article information.
Based on these associations, the extraction result can be completed. For example, when the institution information is missing, the homepage link can be looked up in the database to obtain the corresponding institution information. For ambiguity resolution, when several email addresses are found, the correspondence between the author's name and the email user name can be used to exclude the wrong results.
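Association (a) can be used to choose among several email addresses roughly as follows. Counting how many name tokens occur in the user name is an assumed scoring rule chosen for brevity; the function name is hypothetical.

```python
def pick_email(author_name, emails):
    """Choose the address whose user name best matches the author's name."""
    name_tokens = author_name.lower().split()

    def score(email):
        user = email.split("@", 1)[0].lower()
        # Count the name tokens that occur inside the user name.
        return sum(1 for tok in name_tokens if tok in user)

    best = max(emails, key=score, default=None)
    # Reject the best candidate if it shares nothing with the author's name.
    return best if best is not None and score(best) > 0 else None
```

With the example homepage, "hanj" contains the surname token "han", so that address is preferred over a generic one such as a webmaster address.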
(6) Match the elements of the article information set {P_1, P_2, ..., P_n} against the records in the result database and eliminate redundant data.
After association analysis the extraction itself is complete, but the result may still contain repeated, redundant information. This step matches the extraction result against the records in the result database. When a matching record is found, the two are compared fuzzily; if the record in the result database lacks a field that the extraction result has, that field is filled in. If no matching record is found in the result database, the extraction result is added to the result database.
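The match-then-complete logic of this step can be sketched as below. Representing records as dictionaries, matching on the title only, and the 0.9 similarity threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def merge_record(record, database, threshold=0.9):
    """Merge `record` into `database` (a list of dicts), matching fuzzily on title."""
    for existing in database:
        ratio = SequenceMatcher(
            None, record["title"].lower(), existing["title"].lower()
        ).ratio()
        if ratio >= threshold:
            # Same article: fill in only the fields the stored record lacks.
            for field, value in record.items():
                if not existing.get(field):
                    existing[field] = value
            return existing
    # No match found: add the extraction result as a new record.
    database.append(dict(record))
    return database[-1]
```

Filling only empty fields means an existing value is never overwritten by a later extraction, which keeps the merge idempotent for repeated runs over the same page.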
(7) Output the extraction result.
Example:
Take the extraction of information from the academic homepage http://www.cs.uiuc.edu/~hanj/ as an example. First, Jiawei Han is used as the search keyword in a search engine. The results from Wikipedia and DBLP are excluded according to the blocked-link database, page requests are then sent for the top three results, and the classifier selects the first search result as this author's academic homepage.
The page is parsed with an HTML parser to obtain its sublinks, and the following subpages are selected for further analysis according to the link keywords and context:
http://www.cs.uiuc.edu/homes/hanj/pubs/index.htm
https://agora.cs.illinois.edu/display/cs591han/Research+Publications+-+Data+Mining+Researc?h+Group+at+CS%2C+UIUC
Each page to be analyzed is divided into text units. Taking the homepage as an example, the following result is obtained:
[Figures BDA0000077464020000101 and BDA0000077464020000111: table of the text units obtained from the homepage; images not reproduced]
A support vector machine classifies the text units above as author name, irrelevant data, university information, email address and article information respectively. Each unit is then processed by the extraction flow for its class, and irrelevant data is discarded.
The extraction of the author's name finds, respectively, the homepage title part (Jiawei Han), the author's name in the text (Jiawei Han), and the author names contained in the article information (Jiawei Han, Xiaofei He, Deng Cai). Cross-matching determines Jiawei Han as the final result.
The extraction of the email address first removes the prefix part (E-mail:), then uses the fuzzy-matching automaton to enumerate all possible email matching results, such as:
hanj (user name) at (separator) cs (domain name) . (dot) uiuc (domain name) . (dot) edu (domain name)
The results are scored by their degree of match, and the best result is chosen and converted into a valid email address for output.
The extraction of the institution information searches the institution index with the text unit classified as institution information. In this example the query is "Univ. of Illinois at Urbana-Champaign", and the first record in the search results is "University of Illinois at Urbana-Champaign". Fuzzy matching judges that the two agree, so the result can be output directly.
The article information is labeled with the sequence labeling algorithm to identify the author names and other subfields in it. For example, the article information found earlier is labeled as follows:
<author>Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han,</author><title>Graph Cube: On Warehousing and OLAP Multidimensional Networks,</title><meeting>Proc. of 2011 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'11),</meeting><place>Athens, Greece,</place><time>June 2011</time>
Once each subfield has been identified, the recognition of the article information is complete. Results with missing or ambiguous fields are then completed and decided according to the associations between pieces of information, and the result is merged into the result database.
The invention is not limited to the embodiment above. A person skilled in the art may, based on the content disclosed by the invention, implement the invention in various other embodiments. Therefore, any design that adopts the structure and idea of the invention with only simple changes or modifications falls within the scope of protection of the invention.

Claims (4)

1. An adaptive information extraction method for webpage characteristics, characterized in that the method comprises the following steps:
Step 1: search the Internet for websites whose type is academic homepage;
Step 2: analyze the academic homepage found, treating a page of the academic homepage as a set of two-tuples (L, C), where L is the URL of a link and C is the context of the link; then check whether L and C contain keywords; if they do, proceed to Step 3, otherwise filter out the link;
Step 3: analyze the link to obtain the document tree structure of the page, and divide the page according to the attributes and content of the tree nodes into text units T, forming the text unit set {T1, T2, ..., Tn};
Step 4: extract four target fields from the text unit set {T1, T2, ..., Tn}, namely the author name N, the e-mail address M, the institution information U, and the article information set {P1, P2, ..., Pn}, as the preliminary extraction result;
Step 5: perform association analysis on the preliminary extraction result obtained in Step 4, use the associations between the pieces of information to disambiguate, complete the missing fields to obtain the extraction result, and store it in the result database;
Step 6: match the elements of the article information set {P1, P2, ..., Pn} against the records in the result database and eliminate redundant data;
Step 7: output the extraction result.
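The keyword check of Step 2 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the claimed implementation: the names `Link`, `KEYWORDS`, `keep_link`, and `filter_links` are hypothetical, the keyword list is invented because the claim does not enumerate the keywords, and the sketch assumes a link is kept when a keyword occurs in either L or C.

```python
from typing import Iterable, List, NamedTuple, Sequence


class Link(NamedTuple):
    """One two-tuple (L, C) taken from a page of the academic homepage."""
    url: str      # L: the URL of the link
    context: str  # C: the context of the link, e.g. its anchor text


# Hypothetical keyword list; the claim does not say which keywords are used.
KEYWORDS = ("publication", "paper", "research")


def keep_link(link: Link, keywords: Sequence[str] = KEYWORDS) -> bool:
    """Return True if a keyword occurs in the URL or in the context.

    Assumption: a match in either L or C is enough to keep the link.
    """
    haystack = f"{link.url} {link.context}".lower()
    return any(keyword in haystack for keyword in keywords)


def filter_links(links: Iterable[Link]) -> List[Link]:
    """Drop every link that fails the keyword check, so it is never crawled."""
    return [link for link in links if keep_link(link)]
```

Links that fail the check are discarded before any page request is sent, which is how the heuristic reduces the crawling of irrelevant pages.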
2. The information extraction method according to claim 1, characterized in that Step 1 is divided into two stages: a search stage and a decision stage;
In the search stage, a data set of author names is first derived from existing literature data as seed data; each author name in the data set is then used as a keyword for retrieval in a search engine; the search engine returns the retrieval results as a list, each retrieval result consisting of a title, a link feature, and summary text; the link features and summary texts of the first page of retrieval results are stored in a candidate result list;
In the decision stage, the candidate result list is first filtered according to the link features and summary texts of the retrieval results, as follows: first, check whether the link is present in a blocked-link database, and directly exclude any result found in that database; then, for the remaining retrieval results, check whether the link feature exhibits the pattern "~" + author name; if so, keep the result, otherwise exclude it directly. Each retrieval result that passes these two filtering steps is then processed in turn as follows: send a page request according to its link feature, and use a support vector machine classification algorithm to judge whether the returned page is the author's academic homepage; if so, save it directly as the author's academic homepage and end the judgment; otherwise, perform the same operation on the next retrieval result.
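The two filtering checks of the decision stage can be sketched in Python as follows. This is a hedged illustration only: the function name `passes_prefilter` is hypothetical, the blocked-link database is modeled as a plain set of host names, and the "~" + author name pattern is assumed to mean a tilde immediately followed by any token of the author's name in the URL path. The claim fixes none of these details. The subsequent page request and support vector machine classification are not shown.

```python
from typing import Set
from urllib.parse import urlparse


def passes_prefilter(url: str, author_name: str, blocked_hosts: Set[str]) -> bool:
    """Apply the two decision-stage filters to one candidate link.

    Assumptions (not taken from the claim):
      * the blocked-link database is a set of lower-case host names;
      * the "~" + author name pattern matches when the URL path contains
        "~" directly followed by any whitespace-separated name token.
    """
    parsed = urlparse(url)

    # Filter 1: exclude links found in the blocked-link database.
    if parsed.netloc.lower() in blocked_hosts:
        return False

    # Filter 2: keep only links of the form ".../~<name token>...".
    path = parsed.path.lower()
    tokens = author_name.lower().split()
    return any("~" + token in path for token in tokens)
```

Only candidates for which `passes_prefilter` returns `True` would go on to the more expensive fetch-and-classify step, so the cheap string checks run first.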
3. The information extraction method according to claim 1, characterized in that Step 3 comprises the following process:
(3.1) first, parse the page with an HTML parser to obtain the document tree of the page; each node of the document tree corresponds to an HTML tag in the page, and the document tree represents the relationships between the HTML tags of the page as a tree structure;
(3.2) then divide the page to obtain the text unit set {T1, T2, ..., Tn}.
4. The information extraction method according to claim 3, characterized in that step (3.2) divides the page by the following process:
(b1) initially, the text unit set is empty;
(b2) perform a depth-first traversal of the document tree to find all block-level nodes; for each block-level node Ni, generate a text unit Ti and assign the content corresponding to Ni in the page to Ti;
(b3) for each block-level node Ni, judge whether it has non-block-level child nodes in the document tree; if so, also assign the content corresponding to all of its non-block-level child nodes in the page to Ti;
(b4) add Ti to the text unit set;
(b5) end.
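Steps (b1) to (b5) can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions rather than the patented implementation: the class `TextUnitSplitter`, the function `split_into_text_units`, and the short `BLOCK_TAGS` list are hypothetical, and the parser's event stream (which visits tags in document order) stands in for an explicit depth-first traversal of the document tree. Text inside non-block-level tags such as `<a>` or `<b>` is merged into the text unit of the enclosing block-level node, while nested block-level nodes receive their own text unit.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Hypothetical, deliberately short list of block-level tags (assumption).
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "tr", "td", "h1", "h2", "h3"}


class TextUnitSplitter(HTMLParser):
    """Collect one text unit Ti per block-level node Ni, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.units: List[List[str]] = []              # (b1) starts out empty
        self.stack: List[Tuple[str, List[str]]] = []  # open block-level nodes

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            fragments: List[str] = []                 # (b2) new text unit Ti
            self.units.append(fragments)              # (b4) add Ti to the set
            self.stack.append((tag, fragments))

    def handle_endtag(self, tag):
        # Close the node; also closes block nodes left open inside it.
        if tag in BLOCK_TAGS and any(t == tag for t, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        # (b2)/(b3): text directly in Ni or in its non-block-level children
        # goes to the unit of the innermost open block-level node.
        text = data.strip()
        if text and self.stack:
            self.stack[-1][1].append(text)


def split_into_text_units(html: str) -> List[str]:
    parser = TextUnitSplitter()
    parser.feed(html)
    parser.close()
    return [" ".join(fragments) for fragments in parser.units if fragments]
```

For example, `split_into_text_units('<div><p>Jane Doe, <b>Professor</b></p></div>')` yields a single unit for the `<p>` node; the outer `<div>` has no text of its own, so its empty unit is dropped.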
CN 201110205137 2011-07-21 2011-07-21 Adaptive information extraction method for webpage characteristics Expired - Fee Related CN102254014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110205137 CN102254014B (en) 2011-07-21 2011-07-21 Adaptive information extraction method for webpage characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110205137 CN102254014B (en) 2011-07-21 2011-07-21 Adaptive information extraction method for webpage characteristics

Publications (2)

Publication Number Publication Date
CN102254014A true CN102254014A (en) 2011-11-23
CN102254014B CN102254014B (en) 2013-06-05

Family

ID=44981278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110205137 Expired - Fee Related CN102254014B (en) 2011-07-21 2011-07-21 Adaptive information extraction method for webpage characteristics

Country Status (1)

Country Link
CN (1) CN102254014B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004046312A (en) * 2002-07-09 2004-02-12 Nippon Telegr & Teleph Corp <Ntt> Site manager information extraction method and device, site manager information extraction program, and recording medium with the program recorded
CN101620608A (en) * 2008-07-04 2010-01-06 全国组织机构代码管理中心 Information collection method and system
US20100083095A1 (en) * 2008-09-29 2010-04-01 Nikovski Daniel N Method for Extracting Data from Web Pages
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411630A (en) * 2011-12-22 2012-04-11 南京烽火星空通信发展有限公司 Attribute searching method
CN103218362A (en) * 2012-01-19 2013-07-24 中兴通讯股份有限公司 Method and system for constructing domain ontology
CN102662969A (en) * 2012-03-11 2012-09-12 复旦大学 Internet information object positioning method based on webpage structure semantic meaning
CN103577578B (en) * 2012-03-30 2017-04-05 北京奇虎科技有限公司 A kind of tab file analysis method and device
CN103577578A (en) * 2012-03-30 2014-02-12 奇智软件(北京)有限公司 Marker file parsing method and device
CN102663123A (en) * 2012-04-20 2012-09-12 哈尔滨工业大学 Semantic attribute automatic extraction method on basis of pseudo-seed attributes and random walk sort and system for implementing same
CN102663123B (en) * 2012-04-20 2014-09-03 哈尔滨工业大学 Semantic attribute automatic extraction method on basis of pseudo-seed attributes and random walk sort and system for implementing same
CN102841920A (en) * 2012-06-30 2012-12-26 北京百度网讯科技有限公司 Method and device for extracting webpage frame information
CN102841920B (en) * 2012-06-30 2017-05-10 北京百度网讯科技有限公司 Method and device for extracting webpage frame information
CN102932400A (en) * 2012-07-20 2013-02-13 北京网康科技有限公司 Method and device for identifying uniform resource locator primary links
CN102932400B (en) * 2012-07-20 2015-06-17 北京网康科技有限公司 Method and device for identifying uniform resource locator primary links
CN102867064A (en) * 2012-09-28 2013-01-09 用友软件股份有限公司 Associated field query device and associated field query method
CN102867064B (en) * 2012-09-28 2015-12-02 用友网络科技股份有限公司 Associate field inquiry unit and associate field querying method
CN103793285A (en) * 2012-10-29 2014-05-14 百度在线网络技术(北京)有限公司 Method and platform server for processing online anomalies
CN103051895B (en) * 2012-12-07 2016-04-13 浙江大学 The method and apparatus that a kind of context model is selected
CN103051895A (en) * 2012-12-07 2013-04-17 浙江大学 Method and device of context model selection
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system
CN104252530B (en) * 2014-09-10 2017-09-15 北京京东尚科信息技术有限公司 A kind of unit crawler capturing method and system
CN104331438A (en) * 2014-10-24 2015-02-04 北京奇虎科技有限公司 Method and device for selectively extracting content of novel webpage
CN104331438B (en) * 2014-10-24 2018-04-17 北京奇虎科技有限公司 To novel web page contents selectivity abstracting method and device
CN104376108A (en) * 2014-11-26 2015-02-25 克拉玛依红有软件有限责任公司 Unstructured natural language information extraction method based on 6W semantic annotation
CN104376108B (en) * 2014-11-26 2017-06-06 克拉玛依红有软件有限责任公司 A kind of destructuring natural language information abstracting method based on the semantic marks of 6W
CN104699797A (en) * 2015-03-18 2015-06-10 浪潮集团有限公司 Webpage data structured analytic method and device
CN104699797B (en) * 2015-03-18 2018-02-23 浪潮集团有限公司 A kind of web page data structured analysis method and device
CN105095400A (en) * 2015-07-07 2015-11-25 清华大学 Method for finding personal homepage
CN106469176A (en) * 2015-08-20 2017-03-01 百度在线网络技术(北京)有限公司 A kind of method and apparatus for extracting text snippet
CN106469176B (en) * 2015-08-20 2019-08-16 百度在线网络技术(北京)有限公司 It is a kind of for extracting the method and apparatus of text snippet
CN106708913A (en) * 2015-11-12 2017-05-24 财团法人资讯工业策进会 Intelligent product storage system and method thereof
CN106708913B (en) * 2015-11-12 2020-03-27 财团法人资讯工业策进会 Intelligent product storage system and method thereof
CN106484920A (en) * 2016-11-21 2017-03-08 北京恒华伟业科技股份有限公司 A kind of abstracting method of evaluation document index
CN108241680A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 The method and apparatus for obtaining the amount of reading of webpage
CN108241680B (en) * 2016-12-26 2020-10-13 北京国双科技有限公司 Method and device for acquiring reading amount of webpage
CN106681596A (en) * 2017-01-03 2017-05-17 北京百度网讯科技有限公司 Information display method and device
US10289963B2 (en) 2017-02-27 2019-05-14 International Business Machines Corporation Unified text analytics annotator development life cycle combining rule-based and machine learning based techniques
CN109117435A (en) * 2017-06-22 2019-01-01 索意互动(北京)信息技术有限公司 A kind of client, server, search method and its system
CN107808000A (en) * 2017-11-13 2018-03-16 哈尔滨工业大学(威海) A kind of hidden web data collection and extraction system and method
CN110020366B (en) * 2017-12-07 2021-06-15 北大方正集团有限公司 Mailbox information extraction method and device
CN110020366A (en) * 2017-12-07 2019-07-16 北大方正集团有限公司 Mailbox message abstracting method and device
CN108153851B (en) * 2017-12-21 2021-06-18 北京工业大学 General forum subject post page information extraction method based on rules and semantics
CN108153851A (en) * 2017-12-21 2018-06-12 北京工业大学 A kind of rule-based and semantic universal forum topic post page info abstracting method
CN109033282A (en) * 2018-07-11 2018-12-18 山东邦尼信息科技有限公司 A kind of Web page text extracting method and device based on extraction template
CN109033282B (en) * 2018-07-11 2021-07-23 山东邦尼信息科技有限公司 Webpage text extraction method and device based on extraction template
CN109657180A (en) * 2018-12-11 2019-04-19 中科国力(镇江)智能技术有限公司 It is a kind of intelligence web page contents automatically obscure extraction system
CN110189210A (en) * 2019-06-05 2019-08-30 浙江米奥兰特商务会展股份有限公司 Purchaser's information collecting method, device, equipment and the storage medium that foreign trade is brought together
CN110781497A (en) * 2019-10-21 2020-02-11 新华三信息安全技术有限公司 Method for detecting web page link and storage medium
CN114116757A (en) * 2020-08-31 2022-03-01 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and readable storage medium
CN114116757B (en) * 2020-08-31 2022-10-18 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and readable storage medium
CN113268573A (en) * 2021-05-19 2021-08-17 上海博亦信息科技有限公司 Extraction method of academic talent information
CN113254751A (en) * 2021-06-24 2021-08-13 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information
CN113434797A (en) * 2021-06-29 2021-09-24 中国电信集团***集成有限责任公司 Webpage information extraction method and device
CN113434797B (en) * 2021-06-29 2024-05-31 ***数智科技有限公司 Webpage information extraction method and device

Also Published As

Publication number Publication date
CN102254014B (en) 2013-06-05

Similar Documents

Publication Publication Date Title
CN102254014B (en) Adaptive information extraction method for webpage characteristics
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN111723215B (en) Device and method for establishing biotechnological information knowledge graph based on text mining
US8751218B2 (en) Indexing content at semantic level
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN101464898B (en) Method for extracting feature word of text
US8868621B2 (en) Data extraction from HTML documents into tables for user comparison
CN105893611B (en) Method for constructing interest topic semantic network facing social network
CN104268148B (en) A kind of forum page Information Automatic Extraction method and system based on time string
CN103678412B (en) A kind of method and device of file retrieval
CN101079025B (en) File correlation computing system and method
Schenker Graph-theoretic techniques for web content mining
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN106649666A (en) Left-right recursion-based new word discovery method
CN109145260A (en) A kind of text information extraction method
CN102609427A (en) Public opinion vertical search analysis system and method
Döhmen et al. Multi-hypothesis CSV parsing
CN109165373B (en) Data processing method and device
CN108153851B (en) General forum subject post page information extraction method based on rules and semantics
CN109657114B (en) Method for extracting webpage semi-structured data
CN106970938A (en) Web page towards focusing is obtained and information extraction method
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN106649557A (en) Semantic association mining method for defect report and mail list
CN104346382B (en) Use the text analysis system and method for language inquiry
WO2016099422A2 (en) Content sensitive document ranking method by analyzing the citation contexts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Jin Hai

Inventor after: Li Yi

Inventor after: Zhao Feng

Inventor after: Yan Fengwei

Inventor after: Chen Heng

Inventor before: Jin Hai

Inventor before: Li Yi

Inventor before: Zhao Feng

Inventor before: Yan Fengwei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: JIN HAI LI YI ZHAO FENG YAN FENGWEI TO: JIN HAI LI YI ZHAO FENG YAN FENGWEI CHEN HENG

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130605

Termination date: 20200721