CN102591612B

CN102591612B - General webpage text extraction method based on punctuation continuity and system thereof

Info

Publication number: CN102591612B
Application number: CN201110446701.4A
Authority: CN
Inventors: 胡海斌; 赵庸; 张雪峰
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2011-12-27
Filing date: 2011-12-27
Publication date: 2014-12-03
Anticipated expiration: 2031-12-27
Also published as: CN102591612A

Abstract

The invention discloses a general webpage text extraction method based on punctuation continuity, and a system thereof. The method comprises the following processing steps: reading in a file and converting it to Unicode; removing noise tag information; generating an html tag tree; processing text format information; extracting text nodes to generate a text sentence sequence; re-dividing the text sequence blocks into sentences using commonly used punctuation marks; and extracting the longest continuous text using the punctuation continuity principle. By using the continuity of punctuation to obtain webpage text, the method offers fast processing, strong adaptability and strong generality.

Description

General webpage text extraction method based on punctuation continuity and system thereof

Technical field

The present invention relates to the field of computer technology, and in particular to a general webpage text extraction method based on punctuation continuity and a system thereof.

Background art

With the rapid development of the internet, more and more enterprises and individuals publish information online. Thousands of webpages are produced on the internet every day, people can share large amounts of information across the boundaries of time and space, and the internet has become the largest information source in the world. In this vast ocean of information, how to help people extract useful information quickly has become an important problem.

The webpage, as the most widespread information carrier on the internet, contains most internet information and has become the most common means by which search engines and ordinary users obtain information. However, taking the whole webpage as the unit of information acquisition is not enough, because a webpage often contains information on several topics, such as navigation blocks, advertisement blocks, copyright notice blocks and information blocks. For the person seeking information, the information block is often the only object of interest; the remaining information is noise.

Much research already exists on how to remove webpage noise and extract information blocks automatically:

1. Information extraction based on the Document Object Model (DOM)

HTML is a standard, a specification, that uses tags to mark up the various parts of the webpage to be displayed. By extracting the tags in an HTML document, a DOM tree can be generated, and specific nodes in the tree (Table, Div, P, etc.) are then processed to obtain the useful information of the webpage. For example, the study "Research on a statistics-based webpage text information extraction method" assumes that the text information (useful information) of a webpage resides in one Table node; it identifies that Table node by counting the Chinese text in each node, and extracts the text inside it as the useful text of the webpage. Studies of this type also include "Webpage text information extraction method based on tag windows". DOM-based extraction methods have several problems: many webpages are not well-formed, so the resulting DOM tree may be irregular; HTML, as a markup language, is concerned with how a webpage is displayed and generally not with its block structure or semantics; and the layout of webpages often differs between websites (the body text is not necessarily contained in one Table node).

2. Vision-based information extraction

From a human perspective, when a user looks at a Web page, he or she naturally treats a semantic block as a single object, regardless of how the internal structure of the page is described. In general, when distinguishing semantic blocks, the user is helped by visual cues such as background color, font color and size, borders, and the spacing between logical blocks. Therefore, if the visual cues of the Web page are fully used and combined with the DOM tree to partition the page semantically, some shortcomings of using the DOM tree alone can be offset. A representative of this class of methods is "VIPS: a vision-based page segmentation algorithm". Vision-based text extraction must obtain the visual cues of the page, which is computationally expensive. If the visual cues are controlled by separate files (for example, CSS (cascading style sheet) files), then after fetching the webpage the related control files must be fetched as well, which requires multiple requests and lowers efficiency. For webpages whose styling is poor, the accuracy of vision-based text extraction is also lower.

3. Methods based on rule formulation and machine learning

This approach is based on machine learning and usually uses classification methods from data mining: a set of attributes related to webpage body text is defined, and a large webpage training set (the larger the better) is used to train a classifier that judges whether a given block of a webpage is a body-text block; the trained classifier is then used to obtain the body text of a webpage. During training, the body-text blocks of the webpages in the training set must be labeled, which is a very laborious process. Moreover, the rules of different websites often differ, so obtaining a general rule is very difficult, and for the same reason the accuracy of webpage text extraction is low.

Of the three extraction methods above: the DOM-based statistical method suits websites with a good style and fairly consistent layout. However, because developers differ, the complexity of html tag usage varies and website layouts are highly diverse; the experimental webpages in existing studies are mostly from well-structured portal websites, so the generality of the method is poor. The vision-based method requires a large amount of computation, and its visual heuristic rules are not necessarily general (for example, heuristic rules for titles, such as whether the font of block A is larger than that of block B, or whether the font colors of block A and block B differ, cannot be fully general across websites). The vision-based method is also severely limited for webpages whose layout is controlled by CSS, and since more and more webpage layouts are controlled by CSS, the method has few practical applications and its generality is weak. The machine-learning-based method has two main difficulties. First, the size of the webpage training set is directly related to the extraction accuracy of the classifier, and the body-text regions of the webpages must be labeled manually, which is a heavy workload. Second, no study has yet shown whether there exists a general rule set that can determine the body-text region of a webpage with high accuracy.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a general webpage text extraction method based on punctuation continuity and a system thereof, which use the continuity of punctuation to obtain the webpage body text and feature fast processing, strong adaptability and strong generality.

The technical solution adopted by the present invention to solve its technical problem is a general webpage text extraction method based on punctuation continuity, comprising the following steps:

Reading in a file, and converting the file into html source code in the form of a Unicode character stream;

Preprocessing the html source code: using preset noise tags, removing from the html source code the character strings that do not help webpage text extraction;

Generating an html tag tree: using a preset parsing tool, representing the html source code in the form of a tag tree;

Processing the text format information in the tag tree: replacing the corresponding format information with preset specific characters;

Extracting the text nodes: using a filtering algorithm, generating the sequence of text nodes in the html tag tree;

Defining a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and re-dividing the text of the text node sequence from the previous step using the marks in set P: for each character in a text node, if it is a punctuation mark in set P, a separator or space character is added after it as a separation mark;

Using the continuity of punctuation, extracting the character block with the highest punctuation continuity and returning it as the body text.

The process of extracting the character block with the highest punctuation continuity comprises the following steps:

a. Splitting the character string processed in the previous step, using the separator or space character as the split point, to obtain the string array A = [s_1, s_2, s_3, ..., s_n], where each s_n is a short sentence;

b. Traversing array A, adding each short sentence s_m in A that ends with a punctuation mark in set P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], and recording the sequence number m of the short sentence;

c. Computing in turn the differences between the subscript sequence numbers of adjacent elements in B: j-i, k-j, .... If k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k; the set of short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j is then taken as the current longest punctuation-continuous string set and cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};

d. Repeating step c; if the length of the currently obtained punctuation-continuous string set is greater than the length of L, replacing L with the currently obtained set;

e. After array B has been processed, the text in set L is the webpage body text.

The general webpage text extraction method based on punctuation continuity of the present invention uses the continuity of punctuation to obtain the body text of a webpage. The basis of the method is as follows: punctuation marks are important symbols for sentence breaking and semantic segmentation in Chinese; a Chinese article without punctuation can hardly let the reader correctly understand what the article means to express, so punctuation is indispensable in Chinese articles, and therefore indispensable in webpage body text. Moreover, punctuation marks usually appear in succession in the body text of a webpage. It can therefore be judged that the text with the highest punctuation continuity is usually the body text of the webpage. Punctuation continuity refers to the continuity with which punctuation is present in each block of text, after the text appearing in the webpage has been divided into blocks.

First, the html source file from which text is to be extracted is converted into a Unicode character stream. The text of the great majority of webpages can be stored in the Unicode character set, and a unified encoding benefits the subsequent character processing.
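The conversion step can be illustrated with a minimal Java sketch. It assumes the caller already knows the page's declared charset (for example from an HTTP header or a meta tag); the class name `UnicodeReader` and the method `toUnicode` are hypothetical names chosen for illustration and do not come from the patent.

```java
import java.nio.charset.Charset;

final class UnicodeReader {
    /**
     * Decodes the raw bytes of a page into a Java String, which is a
     * Unicode (UTF-16) character sequence. The charset is an input here;
     * detecting it is outside the scope of this sketch.
     */
    static String toUnicode(byte[] raw, Charset declared) {
        String text = new String(raw, declared);
        // Drop a leading byte order mark (U+FEFF) if the decoder left one in.
        if (!text.isEmpty() && text.charAt(0) == 0xFEFF) {
            text = text.substring(1);
        }
        return text;
    }
}
```

A caller would pass the bytes read from the file together with a charset such as `StandardCharsets.UTF_8` or `Charset.forName("GBK")`, where the latter depends on the charsets installed with the runtime.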

Next, noise tag information is removed. Some noise tag blocks in the html source code do not help webpage text extraction and can even interfere with it; they need to be deleted in the preprocessing stage. For example, script blocks (<(no)?script.*?</(no)?script>) are generally used for auxiliary functions, and comment blocks (<!--.*?-->) are the developer's comments on the webpage source code. Others, such as the select blocks of drop-down lists, the style blocks that control formatting and the marquee blocks of scrolling text, likewise do not help webpage text extraction.
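A minimal sketch of this preprocessing with `java.util.regex` follows. The script and comment patterns are the two given above; the select, style and marquee patterns are assumptions written by analogy, and the class name `NoiseStripper` is hypothetical.

```java
import java.util.regex.Pattern;

final class NoiseStripper {
    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE also covers <SCRIPT> and similar.
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;

    private static final Pattern[] NOISE = {
        Pattern.compile("<(no)?script.*?</(no)?script>", FLAGS), // pattern from the description
        Pattern.compile("<!--.*?-->", FLAGS),                    // pattern from the description
        Pattern.compile("<select.*?</select>", FLAGS),           // assumed by analogy
        Pattern.compile("<style.*?</style>", FLAGS),             // assumed by analogy
        Pattern.compile("<marquee.*?</marquee>", FLAGS)          // assumed by analogy
    };

    /** Returns the html source with every noise block deleted. */
    static String strip(String html) {
        String out = html;
        for (Pattern p : NOISE) {
            out = p.matcher(out).replaceAll("");
        }
        return out;
    }
}
```

The reluctant quantifier `.*?` keeps each match as short as possible, so two separate script blocks are removed individually rather than together with the text between them.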

Next, the html tag tree is generated. html, the Hypertext Markup Language, is a subset of the Standard Generalized Markup Language; with parsing tools such as neko or htmlparser, html source code can easily be represented in the form of a tag tree.

Next, the text format control information in the tag tree is processed. Line-break format tags such as P and BR are replaced with a special character so that the line-break information is preserved. As for character information such as font and color, this method is not concerned with preserving all the font information of the original text, so tags of this kind, such as FONT and STRONG, are deleted (because they might affect subsequent processing).
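The effect of this step can be sketched in Java as follows. As a simplifying assumption, the sketch works on serialized markup with regular expressions rather than on the nodes of the tag tree, and it uses a newline as the special character; `FormatTagHandler` and `BREAK_MARK` are hypothetical names.

```java
import java.util.regex.Pattern;

final class FormatTagHandler {
    // Hypothetical stand-in for the "special character" that records a line break.
    static final String BREAK_MARK = "\n";

    // Opening, closing or self-closing P and BR tags, with optional attributes.
    private static final Pattern LINE_BREAK_TAGS =
        Pattern.compile("</?(p|br)(\\s[^>]*)?/?>", Pattern.CASE_INSENSITIVE);

    // Opening or closing FONT and STRONG tags; the enclosed text is kept.
    private static final Pattern FONT_TAGS =
        Pattern.compile("</?(font|strong)(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Replaces line-break tags with BREAK_MARK and deletes font-style tags. */
    static String normalize(String markup) {
        String out = LINE_BREAK_TAGS.matcher(markup).replaceAll(BREAK_MARK);
        return FONT_TAGS.matcher(out).replaceAll("");
    }
}
```

Only the tags themselves are touched, so the words inside a FONT or STRONG element stay in place for the later sentence splitting.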

Next, the text nodes are extracted. What webpage text extraction produces is a set of text nodes; the algorithm filters out the sequence of text nodes in the html tag tree for subsequent processing.

Next, a set of punctuation marks commonly used in articles is defined, P = {. : ; " " ...}, and the text of the text node sequence from the previous step is re-divided using the marks in set P: for each character in a text node, if it is a punctuation mark in set P, a separator character (a space character) is added after it as a separation mark.
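A minimal Java sketch of this re-division is given below. The exact contents of P are an assumption: the sketch uses the Chinese full stop, full-width colon and semicolon, curly double quotation marks and the ellipsis character, written as Unicode escapes so the source stays ASCII. The names `PunctuationSplitter`, `SEPARATOR` and `mark` are hypothetical.

```java
final class PunctuationSplitter {
    // Assumed set P: U+3002 full stop, U+FF1A colon, U+FF1B semicolon,
    // U+201C and U+201D double quotation marks, U+2026 ellipsis.
    static final String P = "\u3002\uFF1A\uFF1B\u201C\u201D\u2026";

    // The separation mark added after each punctuation mark.
    static final char SEPARATOR = ' ';

    /** Returns the text with SEPARATOR inserted after every mark in P. */
    static String mark(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c);
            if (P.indexOf(c) >= 0) {
                sb.append(SEPARATOR);
            }
        }
        return sb.toString();
    }
}
```

After this pass, splitting the result on the separator yields short sentences that each end with at most one mark from P, which is the input the continuity step expects.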

Finally, using the continuity of punctuation, the character block with the highest punctuation continuity is extracted and returned as the body text.

A general webpage text extraction system based on punctuation continuity comprises:

A read-in module, for reading in a file and converting the file into html source code in the form of a Unicode character stream;

A noise tag information removal module, for preprocessing the html source code and, using preset noise tags, removing from the html source code the character strings that do not help webpage text extraction;

An html tag tree generation module, for generating the html tag tree by representing the html source code in the form of a tag tree using a preset parsing tool;

A text format information processing module, for processing the text format information in the tag tree by replacing the corresponding format information with preset specific characters;

A text node extraction and text sentence sequence generation module, for extracting the text nodes and, using a filtering algorithm, generating the sequence of text nodes in the html tag tree;

A module for re-dividing text sequence blocks into sentences using common punctuation, for defining a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and re-dividing the text of the text node sequence from the previous step using the marks in set P: for each character in a text node, if it is a punctuation mark in set P, a separator or space character is added after it as a separation mark;

A module for extracting the longest continuous text using the punctuation continuity principle, for using the continuity of punctuation to extract the character block with the highest punctuation continuity and return it as the body text.

The beneficial effect of the invention is that general webpage text extraction is realized through the processing steps of reading in a file and converting it to Unicode; removing noise tag information; generating an html tag tree; processing text format information; extracting text nodes to generate a text sentence sequence; re-dividing text sequence blocks into sentences using common punctuation; and extracting the longest continuous text using the punctuation continuity principle. Compared with the prior art, it has the following advantages:

1. Punctuation is a necessary part of webpage body text, so the method has very high generality.

2. The punctuation-based approach only processes text strings and does not need to analyze the various object information of the webpage, so it has a considerable performance advantage and is suitable for real-time webpage text extraction.

3. Even if the page structure is complex and contains many kinds of interfering information, the method can still effectively extract the body part of the webpage, so the method is highly targeted.

4. The webpage text with the longest punctuation continuity is taken as the body text of the webpage, which also ensures the accuracy of webpage text extraction.

The present invention is described in further detail below with reference to the drawings and embodiments; however, the general webpage text extraction method based on punctuation continuity and system thereof of the present invention are not limited to the embodiments.

Brief description of the drawings

Fig. 1 is a schematic diagram of a news page of a website;

Fig. 2 is a schematic flow chart of the method of the present invention;

Fig. 3 is a schematic structural diagram of the html tag tree of the present invention.

Embodiment

Embodiment: referring to Fig. 1, which is a schematic diagram of a news page of a website, it can be seen that continuous punctuation necessarily appears in the body text of the news item. The extraction method based on punctuation continuity has a point in common with vision-based text extraction: the body-text block identified by the vision-based method is also the block with the strongest punctuation continuity.

Referring to Fig. 2, the general webpage text extraction method based on punctuation continuity of the present invention comprises the following steps:

Step S1: reading in a file, and converting the file into html source code in the form of a Unicode character stream; this corresponds to the "read in file, convert to Unicode" box of Fig. 2;

Step S2: preprocessing the html source code and, using preset noise tags, removing from the html source code the character strings that do not help webpage text extraction; this corresponds to the "remove noise tag information" box of Fig. 2;

Step S3: generating an html tag tree by representing the html source code in the form of a tag tree using a preset parsing tool; this corresponds to the "generate html tag tree" box of Fig. 2;

Step S4: processing the text format information in the tag tree by replacing the corresponding format information with preset specific characters; this corresponds to the "process text format information" box of Fig. 2;

Step S5: extracting the text nodes and, using a filtering algorithm, generating the sequence of text nodes in the html tag tree; this corresponds to the "extract text nodes to generate text sentence sequence" box of Fig. 2;

Step S6: defining a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and re-dividing the text of the text node sequence from step S5 using the marks in set P: for each character in a text node, if it is a punctuation mark in set P, a separator or space character is added after it as a separation mark; this corresponds to the "re-divide text sequence blocks into sentences using common punctuation" box of Fig. 2;

Step S7: using the continuity of punctuation, extracting the character block with the highest punctuation continuity and returning it as the body text; this corresponds to the "extract the longest continuous text using the punctuation continuity principle" box of Fig. 2.

The process of extracting the character block with the highest punctuation continuity comprises the following steps:

Step a. Splitting the character string processed in step S6, using the separator or space character as the split point, to obtain the string array A = [s_1, s_2, s_3, ..., s_n], where each s_n is a short sentence;

Step b. Traversing array A, adding each short sentence s_m in A that ends with a punctuation mark in set P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], and recording the sequence number m of the short sentence;

Step c. Computing in turn the differences between the subscript sequence numbers of adjacent elements in B: j-i, k-j, .... If k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k; the set of short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j is then taken as the current longest punctuation-continuous string set and cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};

Step d. Repeating step c; if the length of the currently obtained punctuation-continuous string set is greater than the length of L, replacing L with the currently obtained set;

Step e. After array B has been processed, the text in set L is the webpage body text.

In removing the noise tag information: some noise tag blocks in the html source code do not help webpage text extraction and can even interfere with it; they need to be deleted in the preprocessing stage. For example, script blocks (<(no)?script.*?</(no)?script>) are generally used for auxiliary functions, and comment blocks (<!--.*?-->) are the developer's comments on the webpage source code. Others, such as the select blocks of drop-down lists, the style blocks that control formatting and the marquee blocks of scrolling text, likewise do not help webpage text extraction.

In generating the html tag tree: html, the Hypertext Markup Language, is a subset of the Standard Generalized Markup Language; with parsing tools such as neko or htmlparser, html source code can easily be represented in the form of a tag tree, as shown in Fig. 3.

The specific algorithm is expressed in Java as follows:
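The patent's own Java listing is not reproduced in this text. As a stand-in only, steps a to e of step S7 might be sketched as follows. Several points are assumptions rather than statements from the patent: the set P is taken to be the Chinese marks written as Unicode escapes, the "length" compared in step d is taken to be the number of short sentences a run spans, ties keep the earlier run, and `PunctuationContinuity`, `longestRun` and `threshold` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PunctuationContinuity {
    // Assumed set P: U+3002, U+FF1A, U+FF1B, U+201C, U+201D, U+2026.
    static final String P = "\u3002\uFF1A\uFF1B\u201C\u201D\u2026";

    /** Returns the longest run s_i..s_j whose punctuated sentences are at most threshold apart. */
    static List<String> longestRun(String marked, int threshold) {
        // Step a: cut the string on whitespace separators to obtain array A.
        String trimmed = marked.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        String[] a = trimmed.split("\\s+");

        // Step b: B records the index m of every short sentence ending with a mark in P.
        List<Integer> b = new ArrayList<>();
        for (int m = 0; m < a.length; m++) {
            if (P.indexOf(a[m].charAt(a[m].length() - 1)) >= 0) {
                b.add(m);
            }
        }
        if (b.isEmpty()) {
            return Collections.emptyList();
        }

        // Steps c and d: close a run when adjacent indices in B differ by more
        // than the threshold (or B is exhausted), and keep the longest run seen.
        int bestStart = b.get(0);
        int bestEnd = b.get(0);
        int runStart = b.get(0);
        for (int x = 1; x <= b.size(); x++) {
            boolean gap = x == b.size() || b.get(x) - b.get(x - 1) > threshold;
            if (gap) {
                int runEnd = b.get(x - 1);
                if (runEnd - runStart > bestEnd - bestStart) {
                    bestStart = runStart;
                    bestEnd = runEnd;
                }
                if (x < b.size()) {
                    runStart = b.get(x);
                }
            }
        }

        // Step e: the short sentences of the best run form the body text.
        return new ArrayList<>(Arrays.asList(a).subList(bestStart, bestEnd + 1));
    }
}
```

The returned run includes short sentences that carry no mark from P as long as they lie between two punctuated sentences within the threshold, which matches the set s_i, s_{i+1}, ..., s_j of step c. The threshold value itself is not given in the description, so it is left as a parameter.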

The general webpage text extraction system based on punctuation continuity of the present invention comprises:

Webpage text extraction helps remove webpage noise and helps users obtain exactly the information they need. It is significant for the topic information extraction and subsequent indexing of search engines, particularly vertical search engines such as news search engines and blog search engines. Webpage templates vary endlessly; extraction methods based on statistics or on vision each have their own limitations and perform poorly in terms of generality. Extraction methods based on intelligent classifiers require a huge webpage training set, and building a highly accurate classifier for the great variety of webpages is very difficult.

Punctuation is an indispensable part of Chinese articles, and using the continuity of punctuation to locate the body-text region of a webpage is a process that accords with the logic of natural language. The general webpage text extraction method based on punctuation continuity and system thereof of the present invention feature fast processing, strong adaptability and strong generality.

The above embodiment only serves to further illustrate the general webpage text extraction method based on punctuation continuity and system thereof of the present invention, but the present invention is not limited to the embodiment; any simple modification, equivalent change or adaptation made to the above embodiment in accordance with the technical essence of the present invention falls within the scope of protection of the technical solution of the present invention.

Claims

1. A general webpage text extraction method based on punctuation continuity, characterized by comprising the following steps:

Using the continuity of punctuation, extracting the character block with the highest punctuation continuity and returning it as the body text;

a. Splitting the character string processed in the previous step, using the separator or space character as the split point, to obtain the string array A = [s_1, s_2, s_3, ..., s_n], wherein every element of string array A is a short sentence;

e. After array B has been processed, the text in set L is the webpage body text.

2. A general webpage text extraction system based on punctuation continuity, characterized by comprising:

A module for extracting the longest continuous text using the punctuation continuity principle, for using the continuity of punctuation to extract the character block with the highest punctuation continuity and return it as the body text;

The module for extracting the longest continuous text using the punctuation continuity principle comprises:

A splitting submodule, for splitting the character string processed in the previous step, using the separator or space character as the split point, to obtain the string array A = [s_1, s_2, s_3, ..., s_n], wherein every element of string array A is a short sentence;

A query submodule, for traversing array A, adding each short sentence s_m in A that ends with a punctuation mark in set P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], and recording the sequence number m of the short sentence;

A calculation submodule, for computing in turn the differences between the subscript sequence numbers of adjacent elements in B: j-i, k-j, .... If k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k; the set of short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j is then taken as the current longest punctuation-continuous string set and cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};

A replacement submodule, for replacing L with the currently obtained punctuation-continuous string set if the length of that set is greater than the length of L;

A webpage body text extraction module, for taking the text in set L as the webpage body text after array B has been processed.