CN102591612B - General webpage text extraction method based on punctuation continuity and system thereof - Google Patents


Info

Publication number: CN102591612B
Authority: CN (China)
Prior art keywords: punctuation, text, character, continuity, module
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN201110446701.4A
Other languages: Chinese (zh)
Other versions: CN102591612A (en)
Inventors: 胡海斌, 赵庸, 张雪峰
Current Assignee: Xiamen Meiya Pico Information Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Xiamen Meiya Pico Information Co Ltd
Application filed by: Xiamen Meiya Pico Information Co Ltd
Priority to: CN201110446701.4A
Application publication: CN102591612A
Granted publication: CN102591612B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a general method, and a corresponding system, for extracting the body text of a web page based on punctuation continuity. The method comprises the following processing steps: reading in a file and converting it to Unicode; removing noise tag information; generating an HTML tag tree; processing text format information; extracting text nodes to generate a sequence of text sentences; re-splitting the text sequence blocks into sentences using commonly used punctuation marks; and extracting the longest continuous text using the principle of punctuation continuity. Because it relies on the continuity of punctuation to obtain the body text, the method offers fast processing, strong adaptability, and broad generality.

Description

A general web page body text extraction method based on punctuation continuity, and a system thereof
Technical Field
The present invention relates to the field of computer technology, and in particular to a general web page body text extraction method based on punctuation continuity and a system thereof.
Background
With the rapid development of the Internet, more and more enterprises and individuals publish information online. Thousands of web pages are created on the Internet every day, people can share large amounts of information across boundaries of time and space, and the Internet has become the largest information source in the world. In this vast ocean of information, helping people extract useful information quickly has become an important problem.
Web pages, as the most widespread information carrier on the Internet, contain most Internet information and are the most common means by which search engines and ordinary users obtain information. However, treating the whole page as the unit of information is not enough, because a page often contains information on several topics, such as navigation blocks, advertising blocks, copyright notice blocks, and content blocks. For someone seeking information, the content block is usually the only object of interest; the remaining information is noise.
Much research already exists on removing web page noise and automatically extracting the content block:
1. Information extraction based on the Document Object Model (DOM)
HTML is a standard that uses tags to mark the various parts of a web page to be displayed. From the tags in an HTML document a DOM tree can be generated, and specific nodes in the tree (Table, Div, P, and so on) are then processed to obtain the useful information of the page. For example, "Research on a Statistics-Based Web Page Body Text Extraction Method" assumes that the body text (the useful information) of a page resides in a single Table node; it identifies that Table node by collecting statistics on the Chinese text in each node and extracts the text inside it as the useful text of the page. Other research of this type includes "A Web Page Body Text Extraction Method Based on a Tag Window". DOM-based body text extraction has several problems: many web pages are not well-formed, so the resulting DOM tree may be irregular; HTML, as a markup language, is concerned with how a page is displayed and generally not with the partitioning or semantics of the page; and the layout of pages often differs between websites (the body text is not necessarily contained in a single Table node).
2. Vision-based information extraction
From a human point of view, a user looking at a web page naturally treats a semantic block as a single object, regardless of how the internal structure of the page is described. In general, users rely on visual cues to distinguish semantic blocks, such as background color, font color and size, borders, and the spacing between logical blocks. If the visual cues of the page are fully exploited and combined with the DOM tree to partition the page semantically, some shortcomings of using the DOM tree alone can be overcome. The representative of this class of methods is "VIPS: a Vision-based Page Segmentation Algorithm". Vision-based body text extraction needs to obtain the visual factors of the page, which is computationally expensive. If the visual factors are controlled by separate files (for example, by CSS (Cascading Style Sheets) files), then obtaining a page also requires obtaining its control files, so multiple requests are needed and efficiency is low. Moreover, when the page style is not well designed, the accuracy of vision-based extraction is also lower.
3. Methods based on rule formulation and machine learning
This approach is based on machine learning and typically uses classification techniques from data mining: a set of attributes related to the body text is defined, a large training set of web pages (the larger the better) is used to train a classifier that judges whether a given block of a page is a body text block, and the trained classifier is then used to obtain the body text of a page. These methods require the body text blocks in the training set to be labeled during training, which is a very labor-intensive process. In addition, the rules of different websites often differ, so obtaining a general rule is very difficult, and for the same reason the accuracy of body text extraction is low.
Of the three kinds of methods above, the DOM-based statistical methods work well for sites with a good, consistent style and layout. However, because developers differ, the use of HTML tags varies in complexity and site layouts vary widely; the test pages in existing research mostly come from well-structured portal sites, so the generality of these methods is poor. Methods based on visual information are computationally expensive, and their visual heuristics are not necessarily general (for example, heuristics for titles such as whether the font of block A is larger than that of block B, or whether the font colors of blocks A and B differ, do not carry over fully from one website to another). Such methods are also severely limited on pages whose layout is controlled by CSS, and since more and more page layouts are controlled by CSS, these methods see little practical use and are not very general. Methods based on machine learning face two main difficulties. First, the size of the training set directly affects the extraction accuracy of the classifier, and the body text regions must be labeled manually, which is a heavy workload. Second, no research has yet shown whether a general rule set exists that can identify the body text region with high accuracy.
Summary of the Invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a general web page body text extraction method based on punctuation continuity, and a system thereof, which use the continuity of punctuation to obtain the body text of a web page and offer fast processing, strong adaptability, and broad generality.
The technical solution adopted by the present invention to solve its technical problem is a general web page body text extraction method based on punctuation continuity, comprising the following steps:
Reading in a file and converting it into HTML source code in the form of a Unicode character stream;
Preprocessing the HTML source code: using preset noise tags, removing character strings in the HTML source code that do not help body text extraction;
Generating an HTML tag tree: using a preset parsing tool, representing the HTML source code as a tag tree;
Processing the text format information in the tag tree: replacing the corresponding format information with preset special characters;
Extracting the body text nodes: using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree;
Defining a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and re-dividing the text of the text node sequence from the previous step at the marks in P: for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a division marker;
Using the continuity of punctuation, extracting the character block with the highest punctuation continuity and returning it as the body text.
The process of extracting the character block with the highest punctuation continuity comprises the following steps:
a. Split the character string produced by the previous step at the separator or space characters to obtain the string array A = [s_1, s_2, s_3, ..., s_n], where each s_n is a short sentence;
b. Traverse array A and add every short sentence s_m in A that ends with a punctuation mark in P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], recording the index m of the short sentence;
c. Compute in turn the differences between the indices of successive elements of B: j-i, k-j, ... If k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k; the set of short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j is then the current longest punctuation-continuous string set and is cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};
d. Repeat step c; if the length of the punctuation-continuous string set just obtained is greater than the length of L, replace L with it;
e. After array B has been processed, the text in set L is the body text of the web page.
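Steps a to e can be sketched in Python as follows. This is a minimal illustration, not the patent's own code: the function name, the default threshold of 2, and the choice to measure "length" as the number of short sentences in a run are all assumptions, since the text fixes none of them.

```python
def longest_punctuation_run(sentences, punct, threshold=2):
    """Return the longest run of short sentences whose punctuated members
    are at most `threshold` positions apart (steps b to e)."""
    # Step b: indices m of the short sentences that end with a mark in P.
    b = [m for m, s in enumerate(sentences) if s and s[-1] in punct]
    if not b:
        return []
    best_start, best_end = b[0], b[0]   # cached set L, stored as index bounds
    start = prev = b[0]
    for cur in b[1:]:
        # Step c: an index gap above the threshold ends the current run.
        if cur - prev > threshold:
            start = cur
        prev = cur
        # Step d: keep the longer of the cached run and the current run.
        if prev - start > best_end - best_start:
            best_start, best_end = start, prev
    # Step e: the sentences between the bounds form the body text.
    return sentences[best_start:best_end + 1]


# Array A as it might look after step a (hypothetical page content).
A = ["Home", "About", "First sentence.", "Second one;", "aside",
     "Third.", "Footer", "Copyright", "Contact", "Lone."]
body = longest_punctuation_run(A, set(".;:"), threshold=2)
```

Here the punctuated indices are 2, 3, 5 and 9; the gap from 5 to 9 exceeds the threshold, so the run from index 2 to index 5 is kept, including the unpunctuated "aside" that lies inside it.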
The general web page body text extraction method of the present invention uses the continuity of punctuation to obtain the body text of a web page. The rationale is as follows. Punctuation marks are important symbols for sentence breaking and semantic segmentation in Chinese; a Chinese article without punctuation can hardly convey its intended meaning correctly. Punctuation is therefore indispensable in Chinese articles, and hence indispensable in the body text of a web page. Moreover, punctuation marks usually occur in succession in the body text. It follows that the text with the highest punctuation continuity is usually the body text of the page. Here, punctuation continuity refers to how continuously punctuation occurs across blocks once the text appearing in the page has been divided into blocks.
First, the HTML source file from which the body text is to be extracted is converted into a Unicode character stream. The characters of most web pages fall within the Unicode character set, and a unified encoding simplifies the subsequent character handling.
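The conversion to Unicode might look like the sketch below. The text does not say how the source encoding is determined, so trying a short list of candidate encodings in order is an assumption, as are the function name and the particular candidates (UTF-8, then GB18030 for Chinese pages).

```python
def to_unicode(raw, candidates=("utf-8", "gb18030")):
    """Decode raw page bytes into a Unicode string, trying each
    candidate encoding in order (the candidate list is an assumption)."""
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters instead of failing.
    return raw.decode(candidates[0], errors="replace")


page_bytes = "<p>正文。</p>".encode("gb18030")
html = to_unicode(page_bytes)
```

A production system would more likely read the charset from the HTTP headers or the page's meta tag before falling back to guessing.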
Next, noise tag information is removed. Some tag blocks in the HTML source code do not help body text extraction and can even interfere with it; these must be deleted in the preprocessing stage. For example, script blocks (<(no)?script.*?</(no)?script>) generally serve auxiliary functions, and comment blocks (<!--.*?-->) are developers' annotations on the page source. Others, such as the select blocks of drop-down lists, the style blocks that control formatting, and the marquee blocks of scrolling text, likewise do not help body text extraction.
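A possible form of this preprocessing step is sketched below. The script and comment patterns are the two given in the text; the select, style and marquee patterns are written by analogy and are assumptions, as are the case-insensitive and dot-matches-newline flags.

```python
import re

# The first two patterns come from the text; the other three are assumed
# by analogy for the select, style and marquee blocks it names.
NOISE_PATTERNS = [
    r"<(no)?script.*?</(no)?script>",
    r"<!--.*?-->",
    r"<select.*?</select>",
    r"<style.*?</style>",
    r"<marquee.*?</marquee>",
]
NOISE_RES = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in NOISE_PATTERNS]


def strip_noise(html):
    """Delete every noise block from the HTML source string."""
    for rx in NOISE_RES:
        html = rx.sub("", html)
    return html


sample = ("<p>Body.</p><script>var a = 1;</script><!-- note -->"
          "<STYLE>p{}</STYLE><select><option>x</option></select>")
cleaned = strip_noise(sample)
```

Removing these blocks before building the tag tree keeps script source and option labels from showing up later as text nodes.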
Next, the HTML tag tree is generated. HTML is the HyperText Markup Language, an application of the Standard Generalized Markup Language; with parsing tools such as neko or htmlparser, HTML source code can easily be represented as a tag tree.
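The text names the Java tools neko and htmlparser. As a stand-in, the sketch below builds a comparable tag tree with `html.parser` from the Python standard library. The nested-dictionary node layout, the `#root` and `#text` labels, and the list of void tags are illustrative assumptions.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (an illustrative, incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Build a simple tag tree of nested dictionaries."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append({"tag": "#text", "text": data})


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


tree = build_tree("<div><p>Hello.<br>World.</p></div>")
```

Because stray closing tags are ignored and unclosed tags are popped implicitly, the builder tolerates some of the malformed markup that the background section warns about.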
Next, the text format control information in the tag tree is processed. Line-break tags such as P and BR are replaced with a special character so that the line-break information is preserved. Font and color information is not preserved, since this method does not aim to keep the full font information of the original text; tags of this kind, such as FONT and STRONG, are deleted (because they might interfere with subsequent processing).
Next, the body text nodes are extracted. Body text extraction operates on a set of text nodes: the algorithm filters out the sequence of text nodes in the HTML tag tree for subsequent processing.
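The two steps above, format handling and text node extraction, can be illustrated together. In this sketch the newline used as the "special character" and the class and function names are assumptions; only P, BR, FONT and STRONG are taken from the text. The event-based `html.parser` is used in place of a walk over a prebuilt tree, which yields the same node sequence for this purpose.

```python
from html.parser import HTMLParser

LINE_BREAK_TAGS = {"p", "br"}        # line-break tags named in the text
INLINE_FORMAT_TAGS = {"font", "strong"}  # format tags named in the text
LINE_BREAK_CHAR = "\n"               # assumed choice of special character


class TextNodeCollector(HTMLParser):
    """Collect text nodes in document order, dropping inline format tags."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._buf = []

    def flush(self):
        text = "".join(self._buf).strip()
        self._buf.clear()
        if text:
            self.nodes.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_FORMAT_TAGS:
            return  # tag deleted; its text merges with the neighbours
        self.flush()
        if tag in LINE_BREAK_TAGS:
            self.nodes.append(LINE_BREAK_CHAR)

    def handle_endtag(self, tag):
        if tag not in INLINE_FORMAT_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return collector.nodes


nodes = text_nodes('<p><font color="red">Title</font></p>'
                   '<p>First.<br>Second <strong>bold</strong> end.</p>')
```

Deleting the FONT and STRONG tags rather than treating them as boundaries keeps "Second bold end." as one text node, so a sentence is not cut apart by inline formatting.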
Next, a set of punctuation marks commonly used in articles is defined, P = {. : ; " " ...}, and the text of the text node sequence from the previous step is re-divided at the marks in P: for each character in a text node, if it is a punctuation mark in P, a separator character (a space character) is added after it as a division marker.
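The re-division step can be sketched as below. The exact contents of P are not fully recoverable from the text, so the set used here is illustrative only, and the vertical bar is a hypothetical separator chosen for readability in place of the space character the text mentions.

```python
# Illustrative punctuation set P (assumed; the source shows it only partially).
PUNCT = set("。:;.:;”\"")
SEPARATOR = "|"  # hypothetical separator; the text allows a separator or a space


def mark_sentence_ends(text, punct=PUNCT, sep=SEPARATOR):
    """Insert the separator after every character that belongs to P."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in punct:
            out.append(sep)
    return "".join(out)


marked = mark_sentence_ends("今天天气很好。我们去公园;然后回家。")
short_sentences = [s for s in marked.split(SEPARATOR) if s]
```

Splitting the marked string at the separator yields the short sentences that the continuity step then works on.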
Finally, using the continuity of punctuation, the character block with the highest punctuation continuity is extracted and returned as the body text.
A general web page body text extraction system based on punctuation continuity comprises:
A read-in module, used to read in a file and convert it into HTML source code in the form of a Unicode character stream;
A noise tag removal module, used to preprocess the HTML source code by removing, according to preset noise tags, character strings in the HTML source code that do not help body text extraction;
An HTML tag tree generation module, used to generate the HTML tag tree by representing the HTML source code as a tag tree with a preset parsing tool;
A text format processing module, used to process the text format information in the tag tree by replacing the corresponding format information with preset special characters;
A text node extraction module, used to extract the body text nodes and, using a filtering algorithm, generate the sequence of text nodes in the HTML tag tree;
A punctuation-based re-splitting module, used to define a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and to re-divide the text of the text node sequence from the previous step at the marks in P: for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a division marker;
A longest-continuous-text extraction module, used to extract, by means of punctuation continuity, the character block with the highest punctuation continuity and return it as the body text.
The beneficial effects of the present invention are as follows. General web page body text extraction is achieved through the processing steps of reading in a file and converting it to Unicode; removing noise tag information; generating the HTML tag tree; processing text format information; extracting text nodes to generate a sequence of text sentences; re-splitting the text sequence blocks into sentences using commonly used punctuation; and extracting the longest continuous text using the principle of punctuation continuity. Compared with the prior art, this has the following advantages:
1. Punctuation is a necessary part of the body text of a web page, so the method is highly general.
2. Punctuation-based processing operates only on text strings and does not need to analyze the various objects in the page, which gives it a considerable performance advantage and makes it suitable for real-time body text extraction.
3. Even when the page structure is complex and contains many kinds of interfering information, the method can still extract the body of the page effectively; the method is strongly targeted.
4. The page text with the longest punctuation continuity is taken as the body text, which also safeguards the accuracy of extraction.
The present invention is described in further detail below with reference to the drawings and an embodiment; however, the general web page body text extraction method based on punctuation continuity and the system thereof are not limited to the embodiment.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a web news page;
Fig. 2 is a flow chart of the method of the invention;
Fig. 3 is a structural diagram of the HTML tag tree of the invention.
Detailed Description
In the embodiment, referring to Fig. 1, which is a schematic diagram of a web news page, it can be seen that consecutive punctuation marks inevitably appear in the body of a news article. Body text extraction based on punctuation continuity shares a common point with vision-based body text extraction: in the vision-based method, the body text block is the block with the strongest punctuation continuity.
Referring to Fig. 2, the general web page body text extraction method based on punctuation continuity of the present invention comprises the following steps:
Step S1: read in a file and convert it into HTML source code in the form of a Unicode character stream; this corresponds to the "read in file, convert to Unicode" box of Fig. 2;
Step S2: preprocess the HTML source code, using preset noise tags to remove character strings in the HTML source code that do not help body text extraction; this corresponds to the "remove noise tag information" box of Fig. 2;
Step S3: generate the HTML tag tree, using a preset parsing tool to represent the HTML source code as a tag tree; this corresponds to the "generate HTML tag tree" box of Fig. 2;
Step S4: process the text format information in the tag tree, replacing the corresponding format information with preset special characters; this corresponds to the "process text format information" box of Fig. 2;
Step S5: extract the body text nodes, using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree; this corresponds to the "extract text nodes to generate text sentence sequence" box of Fig. 2;
Step S6: define a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and re-divide the text of the text node sequence from step S5 at the marks in P: for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a division marker; this corresponds to the "re-split text sequence blocks into sentences using common punctuation" box of Fig. 2;
Step S7: using the continuity of punctuation, extract the character block with the highest punctuation continuity and return it as the body text; this corresponds to the "extract the longest continuous text using the principle of punctuation continuity" box of Fig. 2.
The process of extracting the character block with the highest punctuation continuity comprises the following steps:
Step a. Split the character string produced by step S6 at the separator or space characters to obtain the string array A = [s_1, s_2, s_3, ..., s_n], where each s_n is a short sentence;
Step b. Traverse array A and add every short sentence s_m in A that ends with a punctuation mark in P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], recording the index m of the short sentence;
Step c. Compute in turn the differences between the indices of successive elements of B: j-i, k-j, ... If k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k; the set of short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j is then the current longest punctuation-continuous string set and is cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};
Step d. Repeat step c; if the length of the punctuation-continuous string set just obtained is greater than the length of L, replace L with it;
Step e. After array B has been processed, the text in set L is the body text of the web page.
The general web page body text extraction method of the present invention uses the continuity of punctuation to obtain the body text of a web page. The rationale is as follows. Punctuation marks are important symbols for sentence breaking and semantic segmentation in Chinese; a Chinese article without punctuation can hardly convey its intended meaning correctly. Punctuation is therefore indispensable in Chinese articles, and hence indispensable in the body text of a web page. Moreover, punctuation marks usually occur in succession in the body text. It follows that the text with the highest punctuation continuity is usually the body text of the page. Here, punctuation continuity refers to how continuously punctuation occurs across blocks once the text appearing in the page has been divided into blocks.
First, the HTML source file from which the body text is to be extracted is converted into a Unicode character stream. The characters of most web pages fall within the Unicode character set, and a unified encoding simplifies the subsequent character handling.
Next, noise tag information is removed. Some tag blocks in the HTML source code do not help body text extraction and can even interfere with it; these noise tag blocks must be deleted in the preprocessing stage. For example, script blocks (<(no)?script.*?</(no)?script>) generally serve auxiliary functions, and comment blocks (<!--.*?-->) are developers' annotations on the page source. Others, such as the select blocks of drop-down lists, the style blocks that control formatting, and the marquee blocks of scrolling text, likewise do not help body text extraction.
Next, the HTML tag tree is generated. HTML is the HyperText Markup Language, an application of the Standard Generalized Markup Language; with parsing tools such as neko or htmlparser, HTML source code can easily be represented as a tag tree, as shown in Fig. 3.
Next, the text format control information in the tag tree is processed. Line-break tags such as P and BR are replaced with a special character so that the line-break information is preserved. Font and color information is not preserved, since this method does not aim to keep the full font information of the original text; tags of this kind, such as FONT and STRONG, are deleted (because they might interfere with subsequent processing).
Next, the body text nodes are extracted. Body text extraction operates on a set of text nodes: the algorithm filters out the sequence of text nodes in the HTML tag tree for subsequent processing.
Next, a set of punctuation marks commonly used in articles is defined, P = {. : ; " " ...}, and the text of the text node sequence from the previous step is re-divided at the marks in P: for each character in a text node, if it is a punctuation mark in P, a separator character (a space character) is added after it as a division marker.
The specific algorithm is expressed in Java as follows:
Finally, using the continuity of punctuation, the character block with the highest punctuation continuity is extracted and returned as the body text.
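As an illustration only, steps S1 to S7 can be strung together in a compact Python sketch. It is written in Python rather than the Java the text mentions and is not the patent's own listing. Several simplifications are assumptions: tags are turned into node boundaries with a regular expression instead of a real tag tree, the punctuation set and the separator character are illustrative, the threshold defaults to 2, and run length is counted in short sentences.

```python
import re

NOISE = re.compile(
    r"<(no)?script.*?</(no)?script>|<!--.*?-->|<style.*?</style>"
    r"|<select.*?</select>|<marquee.*?</marquee>",
    re.IGNORECASE | re.DOTALL,
)
INLINE = re.compile(r"</?(font|strong)\b[^>]*>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
PUNCT = "。:;.:;”\""   # illustrative punctuation set P (assumed)
SEP = "\x1f"             # hypothetical separator character


def extract_body(raw, encoding="utf-8", threshold=2):
    html = raw.decode(encoding, errors="replace")   # S1: to Unicode
    html = NOISE.sub("", html)                      # S2: remove noise blocks
    html = INLINE.sub("", html)                     # S4: drop inline format tags
    text = TAG.sub(SEP, html)                       # S3/S5 stand-in: tags end a node
    # S6: add the separator after every punctuation mark in P.
    text = re.sub("([%s])" % re.escape(PUNCT), lambda m: m.group(1) + SEP, text)
    a = [s.strip() for s in text.split(SEP) if s.strip()]   # step a
    b = [m for m, s in enumerate(a) if s[-1] in PUNCT]       # step b
    if not b:
        return ""
    best = (b[0], b[0])
    start = prev = b[0]
    for cur in b[1:]:                               # steps c and d
        if cur - prev > threshold:
            start = cur
        prev = cur
        if prev - start > best[1] - best[0]:
            best = (start, prev)
    return "".join(a[best[0]:best[1] + 1])          # step e


page = ("<html><head><style>p{}</style><script>var x=1;</script></head><body>"
        "<div><a>Home</a><a>News</a><a>About</a></div>"
        "<p>今天天气很好。我们去公园;<strong>然后</strong>回家。</p>"
        "<div><a>Contact</a><a>Copyright</a></div></body></html>").encode("utf-8")
body_text = extract_body(page)
```

On this hypothetical page the navigation and footer links carry no punctuation, so only the three short sentences of the paragraph form a continuous run and are returned.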
The general web page body text extraction system based on punctuation continuity of the present invention comprises:
A read-in module, used to read in a file and convert it into HTML source code in the form of a Unicode character stream;
A noise tag removal module, used to preprocess the HTML source code by removing, according to preset noise tags, character strings in the HTML source code that do not help body text extraction;
An HTML tag tree generation module, used to generate the HTML tag tree by representing the HTML source code as a tag tree with a preset parsing tool;
A text format processing module, used to process the text format information in the tag tree by replacing the corresponding format information with preset special characters;
A text node extraction module, used to extract the body text nodes and, using a filtering algorithm, generate the sequence of text nodes in the HTML tag tree;
A punctuation-based re-splitting module, used to define a set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and to re-divide the text of the text node sequence from the previous step at the marks in P: for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a division marker;
A longest-continuous-text extraction module, used to extract, by means of punctuation continuity, the character block with the highest punctuation continuity and return it as the body text.
Extracting the body text of a web page helps remove page noise and helps users obtain exactly the information they need. It matters greatly for topic extraction and subsequent indexing in search engines, particularly vertical search engines such as news search engines and blog search engines. Page templates vary endlessly; body text extraction methods based on statistics or on vision each have their limitations and perform poorly in terms of generality. Extraction methods based on intelligent classifiers need a huge training set of web pages, and building a highly accurate classifier for the great variety of web pages is very difficult.
Punctuation is an indispensable part of Chinese articles, and using the continuity of punctuation to locate the body text region of a web page is a process that accords with the logic of natural language. The general web page body text extraction method based on punctuation continuity of the present invention, and the system thereof, offer fast processing, strong adaptability, and broad generality.
The above embodiment serves only to further illustrate the general web page body text extraction method based on punctuation continuity of the present invention and the system thereof; the present invention is not limited to the embodiment. Any simple modification, equivalent change, or adaptation made to the above embodiment in accordance with the technical essence of the present invention falls within the scope of protection of the technical solution of the present invention.

Claims (2)

1. based on the successional generic web pages context extraction method of punctuate, it is characterized in that: comprise the steps:
Read in file, and the file reading in is converted into the html source code of the character stream form of Unicode;
Html source code is carried out to pre-service, by preset noise token, remove some that exist in html source code and extract the character string without help for Web page text;
Generate html labelled tree, by preset analytical tool, html source code representation is become to the form of labelled tree;
Text format information in labelled tree is processed, with preset specific character, removed to replace corresponding format information;
Extract the node of text and adopt filter algorithm to generate the literal node sequence in html labelled tree;
The conventional punctuate set P={ of an article of definition.:; " " ..., the word literal node sequence in previous step being carried out again with the node of gathering in P is divided, and for the character in literal node, if gather the punctuate in P, after punctuate, adds separator or space character as separated sign;
Utilize the continuity of punctuate, extract the highest character block of punctuate continuity, be returned as text;
Wherein the process of extracting the character block with the highest punctuation continuity comprises the following steps:
a. taking the separator or space character as the split point, splitting the character string produced by the previous step to obtain the character string array A = [s_1, s_2, s_3, ..., s_n], wherein every element of array A is a short sentence;
b. traversing array A, adding each short sentence s_m in A that ends with a punctuation mark in set P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], and recording the sequence number m of that short sentence;
c. computing in turn the differences j-i, k-j, ... between the subscript sequence numbers of adjacent elements in array B; if k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k, so the short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j are taken as the current longest punctuation-continuous string set and cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};
d. repeating step c; if the length of the currently obtained punctuation-continuous string set is greater than the length of L, replacing L with the currently obtained set;
e. after array B has been fully processed, the text in set L is the web page text.
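Steps a to e can be illustrated with a short Python sketch. It is only one possible reading of the claim, under stated assumptions: the function name `longest_punctuated_run`, the default threshold of 2, the exact membership of `P` (including the full-width forms), and the choice to measure the "length" of a run by its total character count are all illustrative and are not specified in the claim.

```python
from typing import List

# Assumed punctuation set P. The claim lists only ". : ; quotation marks ...",
# so the exact membership (including the full-width forms) is illustrative.
P = set(".:;\"。：；“”")


def longest_punctuated_run(A: List[str], threshold: int = 2) -> List[str]:
    """Return the longest punctuation-continuous run of short sentences in A.

    A         -- short sentences obtained by splitting at the separator (step a)
    threshold -- assumed maximum index gap between two punctuated sentences
                 that still counts as continuous (step c)
    """
    # Step b: record the index m of every short sentence ending with a mark in P.
    B = [m for m, s in enumerate(A) if s and s[-1] in P]

    best: List[str] = []   # the cached set L of steps c and d
    run_start = 0          # position in B where the current run begins
    for pos in range(1, len(B) + 1):
        # Step c: a run ends when B is exhausted or the index gap exceeds the threshold.
        if pos == len(B) or B[pos] - B[pos - 1] > threshold:
            candidate = A[B[run_start]:B[pos - 1] + 1]
            # Step d: keep the candidate only if it is longer than the cached run.
            # Length is measured in characters here, which is an assumption.
            if sum(len(s) for s in candidate) > sum(len(s) for s in best):
                best = candidate
            run_start = pos
    # Step e: once B has been processed, the cached run is taken as the page text.
    return best


sentences = [
    "Home", "News", "Login",
    "It rained all day.", "The river rose;", "a short aside", "and the town flooded.",
    "Copyright 2011", "Contact", "About us", "All rights reserved.",
]
body = longest_punctuated_run(sentences, threshold=2)
```

In this example the navigation and footer entries carry little or no punctuation, so the run from "It rained all day." to "and the town flooded." is selected. The unpunctuated "a short aside" stays inside the run because the index gap around it does not exceed the threshold.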
2. A general web page text extraction system based on punctuation continuity, characterized by comprising:
A reading module, used to read in a file and convert the file into html source code in the form of a Unicode character stream;
A noise tag removal module, used to preprocess the html source code by using preset noise tags to remove the character strings in the html source code that do not help in extracting the web page text;
An html tag tree generation module, used to generate an html tag tree by using a preset parsing tool to represent the html source code in the form of a tag tree;
A text format information processing module, used to process the text format information in the tag tree by replacing the corresponding format information with preset specific characters;
A text node extraction module, used to extract the text nodes and apply a filtering algorithm to generate the text node sequence of the html tag tree;
A punctuation-based re-segmentation module, used to define the set of punctuation marks commonly used in articles, P = {. : ; " " ...}, and to re-divide the text node sequence from the preceding module using the punctuation marks in set P: for each character in a text node, if the character is a punctuation mark in set P, a separator or space character is added after the punctuation mark as a division marker;
A longest-continuous-text extraction module, used to exploit the continuity of punctuation by extracting the character block with the highest punctuation continuity and returning it as the web page text;
Wherein the longest-continuous-text extraction module comprises:
A splitting submodule, which takes the separator or space character as the split point and splits the character string produced by the preceding module to obtain the character string array A = [s_1, s_2, s_3, ..., s_n], wherein every element of array A is a short sentence;
A query submodule, which traverses array A, adds each short sentence s_m in A that ends with a punctuation mark in set P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], and records the sequence number m of that short sentence;
A calculation submodule, which computes in turn the differences j-i, k-j, ... between the subscript sequence numbers of adjacent elements in array B; if k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k, so the short sentences s_i, s_{i+1}, s_{i+2}, ..., s_j are taken as the current longest punctuation-continuous string set and cached as L = {s_i, s_{i+1}, s_{i+2}, ..., s_j};
A replacement submodule, which, if the length of the currently obtained punctuation-continuous string set is greater than the length of L, replaces L with the currently obtained set;
A web page text extraction submodule, which, after array B has been fully processed, takes the text in set L as the web page text.
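The modules that precede the longest-continuous-text extraction module (noise removal, text node extraction, and punctuation-based re-segmentation) might be sketched as follows. This is a simplified illustration, not the patented implementation: it uses the event-driven `html.parser` module from the Python standard library instead of building an explicit tag tree, and the names `NOISE_TAGS`, `SEP`, `P`, `TextNodeCollector`, `extract_text_nodes`, and `split_into_short_sentences` are hypothetical. The choice of noise tags, the separator character, the handling of non-breaking spaces as "format information", and the membership of `P` are all assumptions.

```python
import re
from html.parser import HTMLParser
from typing import List

NOISE_TAGS = {"script", "style", "noscript"}  # assumed "preset noise tags"
SEP = "\n"                                    # assumed separator character
P = ".:;\"。：；“”"                            # assumed punctuation set P


class TextNodeCollector(HTMLParser):
    """Collects the text nodes of a page while skipping noise elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.nodes: List[str] = []
        self._noise_depth = 0  # greater than zero while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth:
            self._noise_depth -= 1

    def handle_data(self, data):
        # Assumed format handling: non-breaking spaces become plain spaces.
        text = data.replace("\xa0", " ").strip()
        if text and not self._noise_depth:
            self.nodes.append(text)


def extract_text_nodes(html_source: str) -> List[str]:
    """Return the sequence of non-empty text nodes outside noise elements."""
    collector = TextNodeCollector()
    collector.feed(html_source)
    collector.close()  # flush any text still buffered by the parser
    return collector.nodes


def split_into_short_sentences(nodes: List[str]) -> List[str]:
    """Add SEP after every mark in P, then split the text at SEP."""
    pattern = "([" + re.escape(P) + "])"
    marked = [re.sub(pattern, r"\1" + SEP, node) for node in nodes]
    return [s.strip() for s in SEP.join(marked).split(SEP) if s.strip()]


page = (
    "<html><head><style>p{color:red;}</style>"
    "<script>var a = 1;</script></head>"
    "<body><div>Home</div>"
    "<p>It rained all day. The river rose; the town flooded.</p>"
    "</body></html>"
)
nodes = extract_text_nodes(page)
short_sentences = split_into_short_sentences(nodes)
```

The resulting list of short sentences corresponds to the array A that the splitting submodule produces. Treating every text node as its own unit is a design simplification; inline formatting tags inside a sentence would split it into several nodes, which the format-information module of the claimed system is presumably meant to smooth over.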
CN201110446701.4A 2011-12-27 2011-12-27 General webpage text extraction method based on punctuation continuity and system thereof Active CN102591612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110446701.4A CN102591612B (en) 2011-12-27 2011-12-27 General webpage text extraction method based on punctuation continuity and system thereof

Publications (2)

Publication Number Publication Date
CN102591612A CN102591612A (en) 2012-07-18
CN102591612B true CN102591612B (en) 2014-12-03

Family

ID=46480349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110446701.4A Active CN102591612B (en) 2011-12-27 2011-12-27 General webpage text extraction method based on punctuation continuity and system thereof

Country Status (1)

Country Link
CN (1) CN102591612B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577171B (en) * 2012-07-30 2018-11-13 腾讯科技(深圳)有限公司 A kind of method and mobile terminal of display web page contents
CN102937958B (en) * 2012-08-06 2016-03-16 厦门市美亚柏科信息股份有限公司 A kind of web data record extraction method based on incomplete Sub-tree Matching
CN103631799A (en) * 2012-08-23 2014-03-12 深圳市世纪光速信息技术有限公司 Network group image aggregating method and system and image searching method and system
CN102902790B (en) * 2012-09-29 2017-06-06 北京奇虎科技有限公司 Web page classification system and method
CN103049536A (en) * 2012-11-01 2013-04-17 广州汇讯营销咨询有限公司 Webpage main text content extracting method and webpage text content extracting system
CN103838790A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Webpage data extraction method
CN103744636A (en) * 2013-12-30 2014-04-23 上海斐讯数据通信技术有限公司 Text composition method for adapting to window size
CN106649560B (en) * 2016-11-03 2019-09-24 中国电子科技集团公司第二十八研究所 A kind of Web page text extracting method and device
CN106528509B (en) * 2016-11-11 2020-04-03 政和科技股份有限公司 Webpage information extraction method and device
CN106951505B (en) * 2017-03-16 2021-02-02 北京搜狐新媒体信息技术有限公司 Webpage information obtaining method and system
CN107967243A (en) * 2017-11-22 2018-04-27 语联网(武汉)信息技术有限公司 A kind of processing method for supporting that user independently makes pauses in reading unpunctuated ancient writings
CN111131000B (en) * 2019-12-24 2022-01-25 北京达佳互联信息技术有限公司 Information transmission method, device, server and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101237465A (en) * 2007-01-30 2008-08-06 中国科学院声学研究所 A webpage context extraction method based on quick Fourier conversion
CN101470728A (en) * 2007-12-25 2009-07-01 北京大学 Method and device for automatically abstracting text of Chinese news web page
CN101706807A (en) * 2009-11-27 2010-05-12 清华大学 Method for automatically acquiring new words from Chinese webpages
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN101763425A (en) * 2010-01-12 2010-06-30 苏州阔地网络科技有限公司 Universal method for capturing webpage contents of any webpage

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020143821A1 (en) * 2000-12-15 2002-10-03 Douglas Jakubowski Site mining stylesheet generator

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
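Content Extraction from Web Pages Based on Chinese Punctuation Number; Mingqiu Song et al.; International Conference on Wireless Communications, Networking and Mobile Computing, 2007 (WiCom 2007); 2007-09-25; pp. 5573-5575 *
Mingqiu Song et al. Content Extraction from Web Pages Based on Chinese Punctuation Number. International Conference on Wireless Communications, Networking and Mobile Computing, 2007 (WiCom 2007). 2007, pp. 5573-5575. *
Wu Qi et al. (吴麒等). Web page body content extraction algorithm based on weight optimization (基于权值优化的网页正文内容提取算法). Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》). 2011, pp. 32-37. *
Web page body content extraction algorithm based on weight optimization (基于权值优化的网页正文内容提取算法); Wu Qi et al. (吴麒等); Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》); 2011-04-15; pp. 32-37 *
Research on a statistics-based method for extracting web page body text information (基于统计的网页正文信息抽取方法的研究); Sun Chengjie et al. (孙承杰等); Journal of Chinese Information Processing (《中文信息学报》); 2004-09-25; pp. 17-22 *
Sun Chengjie et al. (孙承杰等). Research on a statistics-based method for extracting web page body text information (基于统计的网页正文信息抽取方法的研究). Journal of Chinese Information Processing (《中文信息学报》). 2004, pp. 17-22. *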

Also Published As

Publication number Publication date
CN102591612A (en) 2012-07-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20120718

Assignee: Xiaoma Baoli (Xiamen) Network Technology Co.,Ltd.

Assignor: XIAMEN MEIYA PICO INFORMATION Co.,Ltd.

Contract record no.: X2023350000077

Denomination of invention: A General Web Page Text Extraction Method and System Based on Punctuation Continuity

Granted publication date: 20141203

License type: Common License

Record date: 20230313
