CN102591612B - General webpage text extraction method based on punctuation continuity and system thereof - Google Patents
- Publication number: CN102591612B
- Application number: CN201110446701.4A
- Authority: CN (China)
- Prior art keywords: punctuation, text, character, continuity, module
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Information Transfer Between Computers (AREA)
Abstract
The invention discloses a general web page body text extraction method based on punctuation continuity, and a system thereof. The method comprises the following processing steps: reading in a file and converting it to Unicode; removing noise markup; generating an HTML tag tree; processing text formatting information; extracting text nodes to generate a text sentence sequence; re-splitting the text sequence blocks into sentences using common punctuation marks; and extracting the longest continuous text using the punctuation continuity principle. By using the continuity of punctuation to obtain the body text of a web page, the method offers fast processing, strong adaptability, and strong generality.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a general web page body text extraction method based on punctuation continuity and a system thereof.
Background art
With the rapid development of the Internet, more and more enterprises and individuals publish information online. Thousands of web pages are created on the Internet every day, people can share large amounts of information across the boundaries of time and space, and the Internet has become the largest information source in the world. In this vast ocean of information, helping people extract useful information quickly has become an important problem.
Web pages, as the most widespread information carrier on the Internet, contain most Internet information and have become the most common means by which search engines and ordinary users obtain information. However, treating a whole web page as the unit of information is not enough, because a web page often contains information on several topics, such as navigation blocks, advertisement blocks, copyright notice blocks, and content blocks. For the person obtaining the information, the content block is often the only thing of interest; the remaining information is noise.
Many studies have addressed how to remove web page noise and extract content blocks automatically:
1. Information extraction based on the Document Object Model (DOM)
HTML is a standard that uses markup tags to mark the various parts of a web page to be displayed. From the tags extracted from an HTML document a DOM tree can be generated, and specific nodes in the tree (Table, Div, P, etc.) are then processed to obtain the useful information of the page. For example, "Research on a Statistics-Based Web Page Text Information Extraction Method" assumes that the text information (useful information) of a web page lies within a single Table node; it identifies that Table node from statistics on the Chinese text in each node and extracts the text inside it as the useful text of the page. Similar work includes "Web Page Text Information Extraction Method Based on a Tag Window". DOM-based body text extraction has several problems: many web pages are not well-formed, so the resulting DOM tree may be irregular; HTML, as a markup language, is concerned with how a page is displayed and generally not with the blocks or semantics of the page; and the layouts of different websites often differ (the body text is not necessarily contained in one Table node).
2. Vision-based information extraction
From a human point of view, when a user looks at a web page, he or she naturally treats a semantic block as a single object, regardless of how the internal structure of the page is described. In general, users distinguish semantic blocks with the help of visual cues such as background color, font color and size, borders, and the spacing between logical blocks. Making full use of the visual cues of a web page, and combining them with the DOM tree to partition the page semantically, can therefore make up for some shortcomings of using the DOM tree alone. The representative of this class of methods is "VIPS: A Vision-Based Page Segmentation Algorithm". Vision-based body text extraction needs to obtain the visual cues of the page, which is computationally expensive. If the visual cues are controlled by separate files (for example, Cascading Style Sheets (CSS) files), the related control files must be fetched in addition to the page itself, which requires multiple requests and is inefficient. Moreover, for pages whose styling is poor, the accuracy of vision-based body text extraction is also low.
3. Methods based on rule formulation and machine learning
This approach is based on machine learning and typically uses classification techniques from data mining: a set of attributes related to web page body text is defined, a large web page training set (the larger the better) is used to train a classifier that judges whether a given block of a page is a body text block, and the trained classifier is then used to obtain the body text of a page. These methods require the body text blocks in the training set to be labeled, which is very laborious. In addition, the rules of different websites often differ, so obtaining a general rule is very difficult, and for this reason the accuracy of body text extraction is low.
Of the three kinds of extraction methods above, the DOM-based statistical method works well for websites with good style and fairly consistent layout. However, because developers differ, the use of HTML tags varies in complexity and website layouts vary widely; the experimental pages in existing studies mostly come from well-structured portal sites, so the generality of the method is poor. The vision-based method requires a large amount of computation, and its visual heuristic rules are not necessarily general across websites (for example, heuristic rules for titles, such as whether the font of block A is larger than that of block B, or whether the font colors of blocks A and B differ, do not carry over fully to different websites). The vision-based method is also severely limited on pages whose layout is controlled by CSS, and more and more page layouts are now controlled by CSS, so the method has few practical applications and weak generality. The machine learning method has two main difficulties. First, the size of the web page training set directly affects the extraction accuracy of the classifier, and the body text regions must be labeled manually, which is heavy work. Second, no research has yet shown whether a general rule set exists that can identify the body text region of a web page with high accuracy.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a general web page body text extraction method based on punctuation continuity, and a system thereof, which use the continuity of punctuation to obtain the body text of a web page and feature fast processing, strong adaptability, and strong generality.
The technical solution adopted by the present invention to solve its technical problem is a general web page body text extraction method based on punctuation continuity, comprising the following steps:
Read in a file, and convert the file into HTML source code in the form of a Unicode character stream;
Preprocess the HTML source code: using preset noise tags, remove strings in the HTML source code that do not help body text extraction;
Generate an HTML tag tree: using a preset parsing tool, represent the HTML source code as a tag tree;
Process the text formatting information in the tag tree, replacing the corresponding formatting information with preset special characters;
Extract the text nodes, using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree;
Define the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, and use the marks in P to further divide the text of the text node sequence from the previous step: for each character in a text node, if it is a punctuation mark in P, add a separator or space character after it as a separation marker;
Using the continuity of punctuation, extract the character block with the highest punctuation continuity and return it as the body text.
The process of extracting the character block with the highest punctuation continuity comprises the following steps:
a. Split the string processed in the previous step at the separator or space characters, obtaining the string array $A = [s_1, s_2, s_3, \ldots, s_n]$, where each $s_n$ is a short sentence;
b. Traverse array A and add each short sentence $s_m$ in A that ends with a punctuation mark in set P to the punctuated-sentence array $B = [s_i, s_j, s_k, \ldots, s_n]$, recording the index m of the short sentence;
c. Compute in turn the differences between the indices of adjacent elements in B, $j-i$, $k-j$, ...; if $k-j$ is greater than a threshold, there is no continuity between short sentences $s_j$ and $s_k$, so the short sentence set $\{s_i, s_{i+1}, s_{i+2}, \ldots, s_j\}$ is the current longest punctuation-continuous string set and is cached as $L = \{s_i, s_{i+1}, s_{i+2}, \ldots, s_j\}$;
d. Repeat step c; if the length of the punctuation-continuous string set currently obtained is greater than the length of L, replace L with it;
e. After array B has been processed, the text in set L is the body text of the web page.
The general web page body text extraction method based on punctuation continuity of the present invention uses the continuity of punctuation to obtain the body text of a web page. The rationale is as follows: punctuation marks are important symbols for sentence breaking and semantic segmentation in Chinese; a Chinese article without punctuation can hardly convey its intended meaning correctly, so punctuation is indispensable in Chinese articles and is therefore an indispensable part of web page body text. Moreover, punctuation marks usually occur continuously in the body text of a web page. It can therefore be judged that the text with the highest punctuation continuity is usually the body text of the page. Here, punctuation continuity refers to how continuously punctuation appears within each block after the text on the page has been divided into blocks.
First, the HTML source file from which body text is to be extracted is converted into a Unicode character stream. The text of most web pages can be represented in the Unicode character set, and a uniform encoding simplifies subsequent character processing.
Next, noise markup is removed. Some tag blocks in the HTML source code do not help body text extraction and may even interfere with it, so they are deleted in the preprocessing stage. For example, script blocks (<(no)?script.*?</(no)?script>) are generally used for auxiliary functions, and comment blocks (<!--.*?-->) are developers' comments on the page source. Others, such as the select blocks of drop-down lists, style blocks that control formatting, and marquee blocks for scrolling text, likewise do not help body text extraction.
Next, the HTML tag tree is generated. HTML, the HyperText Markup Language, is a subset of the Standard Generalized Markup Language; with parsing tools such as neko or htmlparser, HTML source code can easily be represented as a tag tree.
Next, the text formatting control information in the tag tree is processed. Line-break tags such as P and BR are replaced with special characters so that the line-break information is preserved. As for text information such as font and color, this method is not concerned with preserving all the font information of the original text, so tags such as FONT and STRONG are deleted (because they might affect subsequent processing).
Next, the text nodes are extracted. What body text extraction extracts is a set of text nodes; the algorithm filters out the sequence of text nodes in the HTML tag tree for subsequent processing.
Next, the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, is defined, and the marks in P are used to further divide the text of the text node sequence from the previous step: for each character in a text node, if it is a punctuation mark in P, a separator (space character) is added after it as a separation marker.
Finally, using the continuity of punctuation, the character block with the highest punctuation continuity is extracted and returned as the body text.
A general web page body text extraction system based on punctuation continuity comprises:
A reading module, which reads in a file and converts it into HTML source code in the form of a Unicode character stream;
A noise markup removal module, which preprocesses the HTML source code and, using preset noise tags, removes strings in the HTML source code that do not help body text extraction;
An HTML tag tree generation module, which uses a preset parsing tool to represent the HTML source code as a tag tree;
A text formatting processing module, which processes the text formatting information in the tag tree, replacing the corresponding formatting information with preset special characters;
A text node extraction and text sentence sequence generation module, which extracts the text nodes, using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree;
A re-splitting module, which uses common punctuation to re-split the text sequence blocks into sentences: it defines the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, and uses the marks in P to further divide the text of the text node sequence from the previous step; for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a separation marker;
A longest continuous text extraction module, which uses the continuity of punctuation to extract the character block with the highest punctuation continuity and return it as the body text.
The beneficial effects of the present invention are as follows. General web page body text extraction is achieved through the processing steps of reading in a file and converting it to Unicode; removing noise markup; generating an HTML tag tree; processing text formatting information; extracting text nodes to generate a text sentence sequence; re-splitting the text sequence blocks into sentences using common punctuation; and extracting the longest continuous text using the punctuation continuity principle. Compared with the prior art, the invention has the following advantages:
1. Punctuation is a necessary part of web page body text, so the method is highly general.
2. The punctuation-based processing operates only on text strings and does not need to analyze the various objects of the web page, which gives it a considerable performance advantage and makes it suitable for real-time body text extraction.
3. Even if the page structure is complex and contains many kinds of interfering information, the method can still extract the body of the page effectively, so the method is well targeted.
4. The web page text with the longest punctuation continuity is taken as the body text of the page, which also helps ensure the accuracy of body text extraction.
The present invention is described in further detail below with reference to the drawings and an embodiment; however, the general web page body text extraction method based on punctuation continuity and the system thereof of the present invention are not limited to the embodiment.
Brief description of the drawings
Fig. 1 is a schematic diagram of a news web page;
Fig. 2 is a schematic flowchart of the method of the present invention;
Fig. 3 is a schematic diagram of the structure of the HTML tag tree of the present invention.
Embodiment
Referring to Fig. 1, which is a schematic diagram of a news web page, it can be seen that consecutive punctuation marks necessarily appear in the body text of the news item. Body text extraction based on punctuation continuity has a point in common with vision-based body text extraction: in the vision-based method, the body text block is the block with the strongest punctuation continuity.
Referring to Fig. 2, the general web page body text extraction method based on punctuation continuity of the present invention comprises the following steps:
Step S1: read in a file, and convert the file into HTML source code in the form of a Unicode character stream; this corresponds to the "read in file, convert to Unicode" box of Fig. 2;
Step S2: preprocess the HTML source code, using preset noise tags to remove strings in the HTML source code that do not help body text extraction; this corresponds to the "remove noise markup" box of Fig. 2;
Step S3: generate an HTML tag tree, using a preset parsing tool to represent the HTML source code as a tag tree; this corresponds to the "generate HTML tag tree" box of Fig. 2;
Step S4: process the text formatting information in the tag tree, replacing the corresponding formatting information with preset special characters; this corresponds to the "process text formatting information" box of Fig. 2;
Step S5: extract the text nodes, using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree; this corresponds to the "extract text nodes to generate text sentence sequence" box of Fig. 2;
Step S6: define the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, and use the marks in P to further divide the text of the text node sequence from step S5: for each character in a text node, if it is a punctuation mark in P, add a separator or space character after it as a separation marker; this corresponds to the "re-split text sequence blocks into sentences using common punctuation" box of Fig. 2;
Step S7: using the continuity of punctuation, extract the character block with the highest punctuation continuity and return it as the body text; this corresponds to the "extract the longest continuous text using the punctuation continuity principle" box of Fig. 2.
The process of extracting the character block with the highest punctuation continuity comprises the following steps:
Step a. Split the string processed in step S6 at the separator or space characters, obtaining the string array $A = [s_1, s_2, s_3, \ldots, s_n]$, where each $s_n$ is a short sentence;
Step b. Traverse array A and add each short sentence $s_m$ in A that ends with a punctuation mark in set P to the punctuated-sentence array $B = [s_i, s_j, s_k, \ldots, s_n]$, recording the index m of the short sentence;
Step c. Compute in turn the differences between the indices of adjacent elements in B, $j-i$, $k-j$, ...; if $k-j$ is greater than a threshold, there is no continuity between short sentences $s_j$ and $s_k$, so the short sentence set $\{s_i, s_{i+1}, s_{i+2}, \ldots, s_j\}$ is the current longest punctuation-continuous string set and is cached as $L = \{s_i, s_{i+1}, s_{i+2}, \ldots, s_j\}$;
Step d. Repeat step c; if the length of the punctuation-continuous string set currently obtained is greater than the length of L, replace L with it;
Step e. After array B has been processed, the text in set L is the body text of the web page.
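Steps a to e can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the patent's own listing: the class and method names, the contents of the punctuation string `P`, and the choice to measure the "length" of a candidate set by its span of indices in A are all assumptions made for the example. The threshold is left as a parameter because the text does not give a value.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationContinuity {
    // Assumed punctuation set P: ASCII . : ; " plus the full-width period,
    // colon, semicolon and a closing quote, written as Unicode escapes.
    static final String P = ".:;\"\u3002\uff1a\uff1b\u201d";

    // a: short sentences from step a; threshold: largest index gap still
    // treated as continuous (hypothetical parameter).
    static List<String> longestContinuousBlock(String[] a, int threshold) {
        // Step b: record the index m of every short sentence ending in a mark from P.
        List<Integer> b = new ArrayList<>();
        for (int m = 0; m < a.length; m++) {
            String s = a[m];
            if (!s.isEmpty() && P.indexOf(s.charAt(s.length() - 1)) >= 0) {
                b.add(m);
            }
        }
        // Steps c and d: walk adjacent index differences; a difference above
        // the threshold (or the end of B) closes the current run.
        int bestStart = 0, bestEnd = -1; // empty result if B is empty
        int runStart = 0;                // position in b where the run began
        for (int x = 1; x <= b.size(); x++) {
            boolean gap = x == b.size() || b.get(x) - b.get(x - 1) > threshold;
            if (gap) {
                int start = b.get(runStart), end = b.get(x - 1);
                // Keep the run only if it is strictly longer than the cached L.
                if (end - start > bestEnd - bestStart) {
                    bestStart = start;
                    bestEnd = end;
                }
                runStart = x;
            }
        }
        // Step e: L is every short sentence from s_i through s_j inclusive.
        List<String> l = new ArrayList<>();
        for (int m = bestStart; m <= bestEnd; m++) {
            l.add(a[m]);
        }
        return l;
    }
}
```

Note that the returned block includes unpunctuated short sentences that lie between two punctuated ones within the threshold, matching the set $\{s_i, s_{i+1}, \ldots, s_j\}$ described in step c. Ties are resolved in favor of the earlier run, since step d replaces L only when the new set is strictly longer.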
The general web page body text extraction method based on punctuation continuity of the present invention uses the continuity of punctuation to obtain the body text of a web page. The rationale is as follows: punctuation marks are important symbols for sentence breaking and semantic segmentation in Chinese; a Chinese article without punctuation can hardly convey its intended meaning correctly, so punctuation is indispensable in Chinese articles and is therefore an indispensable part of web page body text. Moreover, punctuation marks usually occur continuously in the body text of a web page. It can therefore be judged that the text with the highest punctuation continuity is usually the body text of the page. Here, punctuation continuity refers to how continuously punctuation appears within each block after the text on the page has been divided into blocks.
First, the HTML source file from which body text is to be extracted is converted into a Unicode character stream. The text of most web pages can be represented in the Unicode character set, and a uniform encoding simplifies subsequent character processing.
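The conversion to a Unicode character stream might look like the sketch below. It assumes the page's charset name has already been determined elsewhere (for example from an HTTP header or a meta tag, which the text does not specify); the class name, the `charsetName` parameter, and the UTF-8 fallback are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HtmlDecoder {
    // Decodes the raw bytes of a page into a Java String, which holds
    // Unicode text regardless of the original byte encoding.
    static String decode(byte[] raw, String charsetName) {
        Charset cs;
        try {
            cs = Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            // Covers null, malformed and unsupported charset names;
            // falling back to UTF-8 is an assumption of this sketch.
            cs = StandardCharsets.UTF_8;
        }
        return new String(raw, cs);
    }
}
```

Once the page is a `String`, the later steps can compare characters against the punctuation set without caring how the page was originally encoded.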
Next, noise markup is removed. Some noise tag blocks in the HTML source code do not help body text extraction and may even interfere with it, so they are deleted in the preprocessing stage. For example, script blocks (<(no)?script.*?</(no)?script>) are generally used for auxiliary functions, and comment blocks (<!--.*?-->) are used by developers to annotate the page source. Others, such as the select blocks of drop-down lists, style blocks that control formatting, and marquee blocks for scrolling text, likewise do not help body text extraction.
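A minimal sketch of this preprocessing step using `java.util.regex` is shown below. The script and comment patterns are the ones quoted above; the style, select and marquee patterns, the class name, and the choice of the `CASE_INSENSITIVE` and `DOTALL` flags are assumptions added for the example.

```java
import java.util.regex.Pattern;

class NoiseRemover {
    // DOTALL lets ".*?" span line breaks; CASE_INSENSITIVE matches <SCRIPT> too.
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    private static final Pattern[] NOISE = {
        Pattern.compile("<(no)?script.*?</(no)?script>", FLAGS), // script blocks
        Pattern.compile("<!--.*?-->", FLAGS),                    // comment blocks
        Pattern.compile("<style.*?</style>", FLAGS),             // assumed pattern
        Pattern.compile("<select.*?</select>", FLAGS),           // assumed pattern
        Pattern.compile("<marquee.*?</marquee>", FLAGS),         // assumed pattern
    };

    // Deletes every noise block from the HTML source and returns the rest.
    static String strip(String html) {
        String out = html;
        for (Pattern p : NOISE) {
            out = p.matcher(out).replaceAll("");
        }
        return out;
    }
}
```

The reluctant quantifier `.*?` stops at the first closing tag, so two separate script blocks are removed individually rather than swallowing the body text between them.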
Next, the HTML tag tree is generated. HTML, the HyperText Markup Language, is a subset of the Standard Generalized Markup Language; with parsing tools such as neko or htmlparser, HTML source code can easily be represented as a tag tree, as shown in Fig. 3.
Next, the text formatting control information in the tag tree is processed. Line-break tags such as P and BR are replaced with special characters so that the line-break information is preserved. As for text information such as font and color, this method is not concerned with preserving all the font information of the original text, so tags such as FONT and STRONG are deleted (because they might affect subsequent processing).
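The effect of this formatting step can be illustrated with the sketch below. For brevity it works on the source string with regular expressions instead of walking a tag tree, which is a simplification and not the procedure the text describes; the class name and the use of a newline as the "special character" are likewise assumptions.

```java
class FormatNormalizer {
    // Assumed marker that stands in for the preserved line-break information.
    static final String BREAK = "\n";

    static String normalize(String html) {
        // Line-break tags (<p>, </p>, <br>, <br/>) become the break marker.
        String out = html.replaceAll("(?i)</?p(\\s[^>]*)?>|<br\\s*/?>", BREAK);
        // Font and emphasis tags are dropped; the text inside them is kept.
        return out.replaceAll("(?i)</?(font|strong)(\\s[^>]*)?>", "");
    }
}
```

Removing FONT and STRONG tags before text nodes are collected keeps a sentence that is partly bold in one piece, so its closing punctuation stays attached to the sentence it ends.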
Next, the text nodes are extracted. What body text extraction extracts is a set of text nodes; the algorithm filters out the sequence of text nodes in the HTML tag tree for subsequent processing.
Next, the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, is defined, and the marks in P are used to further divide the text of the text node sequence from the previous step: for each character in a text node, if it is a punctuation mark in P, a separator (space character) is added after it as a separation marker.
The specific algorithm is expressed in Java as follows:
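A minimal sketch of this splitting step is given here as an illustration; it is not the patent's own listing. The class name, the contents of `P`, and the use of a newline as the separator are assumptions (a newline is used so that text containing ordinary spaces is not split on them).

```java
class PunctuationSplitter {
    // Assumed punctuation set P: ASCII . : ; " plus the full-width period,
    // colon, semicolon and a closing quote, written as Unicode escapes.
    static final String P = ".:;\"\u3002\uff1a\uff1b\u201d";

    // Assumed separator appended after each punctuation mark.
    static final char SEP = '\n';

    // Copies the text, adding SEP after every character that belongs to P.
    static String mark(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c);
            if (P.indexOf(c) >= 0) {
                sb.append(SEP);
            }
        }
        return sb.toString();
    }
}
```

Splitting the result with `mark(text).split("\n")` then yields short sentences whose final character is a punctuation mark wherever one was present, which is the property the continuity step relies on.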
Finally, using the continuity of punctuation, the character block with the highest punctuation continuity is extracted and returned as the body text.
The general web page body text extraction system based on punctuation continuity of the present invention comprises:
A reading module, which reads in a file and converts it into HTML source code in the form of a Unicode character stream;
A noise markup removal module, which preprocesses the HTML source code and, using preset noise tags, removes strings in the HTML source code that do not help body text extraction;
An HTML tag tree generation module, which uses a preset parsing tool to represent the HTML source code as a tag tree;
A text formatting processing module, which processes the text formatting information in the tag tree, replacing the corresponding formatting information with preset special characters;
A text node extraction and text sentence sequence generation module, which extracts the text nodes, using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree;
A re-splitting module, which uses common punctuation to re-split the text sequence blocks into sentences: it defines the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, and uses the marks in P to further divide the text of the text node sequence from the previous step; for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a separation marker;
A longest continuous text extraction module, which uses the continuity of punctuation to extract the character block with the highest punctuation continuity and return it as the body text.
Extracting the body text of a web page helps remove page noise and helps users obtain exactly the information they need. For search engines, especially vertical search engines such as news search engines and blog search engines, it is significant for topic information extraction and subsequent indexing. Web page templates vary endlessly, and body text extraction methods based on statistics or vision each have their limitations and perform poorly in terms of generality. Body text extraction methods based on intelligent classifiers require a huge web page training set, and building a highly accurate classifier for the great variety of web pages is very difficult.
Punctuation is an indispensable part of Chinese articles, and using punctuation continuity to locate the body text region of a web page is a process consistent with the logic of natural language. The general web page body text extraction method based on punctuation continuity and the system thereof of the present invention feature fast processing, strong adaptability, and strong generality.
The above embodiment serves only to further illustrate the general web page body text extraction method based on punctuation continuity and the system thereof of the present invention, but the present invention is not limited to the embodiment; any simple modification, equivalent change, or refinement made to the above embodiment in accordance with the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
Claims (2)
1. A general web page body text extraction method based on punctuation continuity, characterized in that it comprises the following steps:
Read in a file, and convert the file into HTML source code in the form of a Unicode character stream;
Preprocess the HTML source code: using preset noise tags, remove strings in the HTML source code that do not help body text extraction;
Generate an HTML tag tree: using a preset parsing tool, represent the HTML source code as a tag tree;
Process the text formatting information in the tag tree, replacing the corresponding formatting information with preset special characters;
Extract the text nodes, using a filtering algorithm to generate the sequence of text nodes in the HTML tag tree;
Define the set of punctuation marks commonly used in an article, P = {. : ; " " ...}, and use the marks in P to further divide the text of the text node sequence from the previous step: for each character in a text node, if it is a punctuation mark in P, add a separator or space character after it as a separation marker;
Using the continuity of punctuation, extract the character block with the highest punctuation continuity and return it as the body text;
The process of extracting the character block with the highest punctuation continuity comprises the following steps:
a. Split the string processed in the previous step at the separator or space characters, obtaining the string array $A = [s_1, s_2, s_3, \ldots, s_n]$, where all elements of array A are short sentences;
b. Traverse array A and add each short sentence $s_m$ in A that ends with a punctuation mark in set P to the punctuated-sentence array $B = [s_i, s_j, s_k, \ldots, s_n]$, recording the index m of the short sentence;
c. Compute in turn the differences between the indices of adjacent elements in B, $j-i$, $k-j$, ...; if $k-j$ is greater than a threshold, there is no continuity between short sentences $s_j$ and $s_k$, so the short sentence set $\{s_i, s_{i+1}, s_{i+2}, \ldots, s_j\}$ is the current longest punctuation-continuous string set and is cached as $L = \{s_i, s_{i+1}, s_{i+2}, \ldots, s_j\}$;
d. Repeat step c; if the length of the punctuation-continuous string set currently obtained is greater than the length of L, replace L with it;
e. After array B has been processed, the text in set L is the body text of the web page.
2. A general web page text extraction system based on punctuation continuity, characterized by comprising:
A read-in module, which reads in a file and converts it into HTML source code in the form of a Unicode character stream;
A noise tag removal module, which preprocesses the HTML source code and uses preset noise tags to remove character strings in the source that do not help web page text extraction;
An HTML tag tree generation module, which uses a preset parsing tool to represent the HTML source code as a tag tree;
A text format processing module, which processes the text format information in the tag tree, replacing each piece of format information with a corresponding preset character;
A text node extraction and text sentence block generation module, which extracts the text nodes and applies a filtering algorithm to generate the text node sequence of the HTML tag tree;
A punctuation re-segmentation module, which defines a set P = {. : ; " " ...} of punctuation marks commonly used in articles and re-divides the text of the text node sequence from the previous step at the punctuation marks in P: for each character in a text node, if it is a punctuation mark in P, a separator or space character is added after it as a division marker;
A longest continuous text extraction module, which uses the continuity of punctuation to extract the character block with the highest punctuation continuity and returns it as the body text;
The longest continuous text extraction module comprises:
A splitting submodule, which splits the character string produced by the previous step at each separator or space character, obtaining the string array A = [s_1, s_2, s_3, ..., s_n], in which every element is a short sentence;
A query submodule, which traverses array A, adds each short sentence s_m that ends with a punctuation mark in set P to the punctuated-sentence array B = [s_i, s_j, s_k, ..., s_n], and records the index m of that short sentence;
A calculation submodule, which computes in turn the index differences j-i, k-j, ... between adjacent elements of B; if k-j is greater than a threshold, there is no continuity between short sentences s_j and s_k, so the short-sentence set s_i, s_(i+1), s_(i+2), ..., s_j is taken as the current longest punctuation-continuous string set and cached as L = {s_i, s_(i+1), s_(i+2), ..., s_j};
A replacement submodule, which replaces L with the newly obtained punctuation-continuous string set if the length of that set is greater than the length of L;
A web page text extraction module, which, after array B has been fully processed, takes the text in set L as the web page body text.
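The modules that precede the longest continuous text extraction module can be sketched with the Python standard library. This is only an illustrative sketch under stated assumptions: the claim does not name the parsing tool, the noise tags, the format tags, or the separator, so `NOISE`, `BLOCK_TAGS`, `SPLIT_MARKS`, the newline separator, and the function name `html_to_short_sentences` are hypothetical choices, and `html.parser` stands in for the "preset parsing tool" without building an explicit tag tree.

```python
import re
from html.parser import HTMLParser

# Hypothetical noise tags: scripts, styles and comments never carry body text.
NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I)
# Hypothetical format tags that are replaced by the separator character.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3"}
# Hypothetical punctuation set P used for re-segmentation.
SPLIT_MARKS = "。：；！？.:;!?"


class TextCollector(HTMLParser):
    """Collects text nodes, emitting a newline wherever a format tag occurs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_short_sentences(raw, encoding="utf-8"):
    # Read-in module: decode the bytes into a Unicode string.
    source = raw.decode(encoding, errors="replace")
    # Noise tag removal module: strip scripts, styles and comments.
    source = NOISE.sub("", source)
    # Tag parsing, format processing and text node extraction.
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    text = "".join(collector.parts)
    # Punctuation re-segmentation: add a separator after every mark in P.
    text = re.sub("([" + re.escape(SPLIT_MARKS) + "])", r"\1\n", text)
    return [s.strip() for s in text.split("\n") if s.strip()]
```

The returned list plays the role of array A, so it can be handed directly to the splitting and query submodules described above.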
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110446701.4A CN102591612B (en) | 2011-12-27 | 2011-12-27 | General webpage text extraction method based on punctuation continuity and system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102591612A (en) | 2012-07-18 |
CN102591612B (en) | 2014-12-03 |
Family ID=46480349
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201110446701.4A | Active | CN102591612B (en) | 2011-12-27 | 2011-12-27 | General webpage text extraction method based on punctuation continuity and system thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102591612B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577171B (en) * | 2012-07-30 | 2018-11-13 | 腾讯科技(深圳)有限公司 | A kind of method and mobile terminal of display web page contents |
CN102937958B (en) * | 2012-08-06 | 2016-03-16 | 厦门市美亚柏科信息股份有限公司 | A kind of web data record extraction method based on incomplete Sub-tree Matching |
CN103631799A (en) * | 2012-08-23 | 2014-03-12 | 深圳市世纪光速信息技术有限公司 | Network group image aggregating method and system and image searching method and system |
CN102902790B (en) * | 2012-09-29 | 2017-06-06 | 北京奇虎科技有限公司 | Web page classification system and method |
CN103049536A (en) * | 2012-11-01 | 2013-04-17 | 广州汇讯营销咨询有限公司 | Webpage main text content extracting method and webpage text content extracting system |
CN103838790A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Webpage data extraction method |
CN103744636A (en) * | 2013-12-30 | 2014-04-23 | 上海斐讯数据通信技术有限公司 | Text composition method for adapting to window size |
CN106649560B (en) * | 2016-11-03 | 2019-09-24 | 中国电子科技集团公司第二十八研究所 | A kind of Web page text extracting method and device |
CN106528509B (en) * | 2016-11-11 | 2020-04-03 | 政和科技股份有限公司 | Webpage information extraction method and device |
CN106951505B (en) * | 2017-03-16 | 2021-02-02 | 北京搜狐新媒体信息技术有限公司 | Webpage information obtaining method and system |
CN107967243A (en) * | 2017-11-22 | 2018-04-27 | 语联网(武汉)信息技术有限公司 | A kind of processing method for supporting that user independently makes pauses in reading unpunctuated ancient writings |
CN111131000B (en) * | 2019-12-24 | 2022-01-25 | 北京达佳互联信息技术有限公司 | Information transmission method, device, server and terminal |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101237465A (en) * | 2007-01-30 | 2008-08-06 | 中国科学院声学研究所 | A webpage context extraction method based on quick Fourier conversion |
CN101470728A (en) * | 2007-12-25 | 2009-07-01 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
CN101706807A (en) * | 2009-11-27 | 2010-05-12 | 清华大学 | Method for automatically acquiring new words from Chinese webpages |
CN101727461A (en) * | 2008-10-13 | 2010-06-09 | 中国科学院计算技术研究所 | Method for extracting content of web page |
CN101763425A (en) * | 2010-01-12 | 2010-06-30 | 苏州阔地网络科技有限公司 | Universal method for capturing webpage contents of any webpage |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020143821A1 (en) * | 2000-12-15 | 2002-10-03 | Douglas Jakubowski | Site mining stylesheet generator |
Non-Patent Citations (3)
Title |
---|
Mingqiu Song et al. Content Extraction from Web Pages Based on Chinese Punctuation Number. International Conference on Wireless Communications, Networking and Mobile Computing, 2007 (WiCom 2007). 2007-09-25, pp. 5573-5575. * |
吴麒 (Wu Qi) et al. 基于权值优化的网页正文内容提取算法 [Web page body content extraction algorithm based on weight optimization]. 华南理工大学学报(自然科学版) [Journal of South China University of Technology (Natural Science Edition)]. 2011-04-15, pp. 32-37. * |
孙承杰 (Sun Chengjie) et al. 基于统计的网页正文信息抽取方法的研究 [Research on a statistics-based method for extracting web page body information]. 中文信息学报 [Journal of Chinese Information Processing]. 2004-09-25, pp. 17-22. * |
Also Published As
Publication number | Publication date |
---|---|
CN102591612A (en) | 2012-07-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102591612B (en) | General webpage text extraction method based on punctuation continuity and system thereof | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN101251855B (en) | Equipment, system and method for cleaning internet web page | |
CN105630941B (en) | Web body matter abstracting methods based on statistics and structure of web page | |
CN102184189B (en) | Webpage core block determining method based on DOM (Document Object Model) node text density | |
CN104598577B (en) | A kind of extracting method of Web page text | |
CN103473263B (en) | News event development process-oriented visual display method | |
CN102253930B (en) | A kind of method of text translation and device | |
CN103927397B (en) | Recognition method for Web page link blocks based on block tree | |
CN106446072B (en) | The treating method and apparatus of web page contents | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
CN102298638A (en) | Method and system for extracting news webpage contents by clustering webpage labels | |
CN106021392A (en) | News key information extraction method and system | |
CN104063380A (en) | Method and device for converting picture files into webpage files | |
CN106055667A (en) | Method for extracting core content of webpage based on text-tag density | |
CN103049536A (en) | Webpage main text content extracting method and webpage text content extracting system | |
CN102929902A (en) | Character splitting method and device based on Chinese retrieval | |
CN104657375A (en) | Image-text theme description method, device and system | |
CN105677638A (en) | Web information extraction method | |
CN112667940A (en) | Webpage text extraction method based on deep learning | |
CN105740355B (en) | Webpage context extraction method and device based on aggregation text density | |
CN103440315A (en) | Web page cleaning method based on theme | |
CN107436931B (en) | Webpage text extraction method and device | |
CN106528509B (en) | Webpage information extraction method and device | |
CN103064966A (en) | Method for extracting regular noise from single record web pages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20120718; Assignee: Xiaoma Baoli (Xiamen) Network Technology Co.,Ltd.; Assignor: XIAMEN MEIYA PICO INFORMATION Co.,Ltd.; Contract record no.: X2023350000077; Denomination of invention: A General Web Page Text Extraction Method and System Based on Punctuation Continuity; Granted publication date: 20141203; License type: Common License; Record date: 20230313 |