CN102541874B - Webpage text content extracting method and device - Google Patents

Info

Publication number
CN102541874B
Authority
CN
China
Prior art keywords
content blocks
webpage
content
density
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010591506
Other languages
Chinese (zh)
Other versions
CN102541874A (en)
Inventor
周奕
周宇煜
吴淑燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd
Priority to CN 201010591506
Publication of CN102541874A
Application granted
Publication of CN102541874B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and device for extracting the body text content of a webpage. The method comprises: obtaining two webpages that belong to the same-level directory under the same site; and, for each obtained webpage: dividing the webpage into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density meets the corresponding preset condition; extracting, from the selected content blocks, those whose text content is inconsistent with the text content of every content block selected from the other webpage; and determining the extracted content blocks to be the body content of the webpage. This technical solution addresses the low accuracy of prior-art extraction of webpage body content.

Description

Webpage text content extracting method and device
Technical field
The present invention relates to the field of Internet information processing technology, and in particular to a method and device for extracting the body content of a webpage.
Background art
With the rapid development of Internet technology, the information available on webpages has become increasingly abundant. To make better use of it, people keep seeking techniques that organize and exploit online information effectively. Unlike traditional text, however, a webpage is neither tidy nor clean: it contains a large amount of noise content, for example scripts added to enhance user interactivity, navigation links added to make browsing easier, and advertising links added for commercial reasons. This noise reduces both the efficiency and the accuracy of webpage information retrieval. Accurately extracting the body content of a webpage not only filters out the interference that navigation information, advertising information, copyright information, related links and similar content cause to retrieval results; it also makes it possible to perform automatic word segmentation, named entity recognition, automatic summarization, automatic classification and automatic clustering on the webpage.
Fig. 1 is a flowchart of a prior-art method for extracting the body content of a webpage. The procedure is as follows:
Step 11: for a single webpage, determine the total number of characters and the number of Chinese characters in line i and line (i+1);
Step 12: calculate the text density of lines i and (i+1), which can be obtained by dividing the number of Chinese characters by the total number of characters;
Step 13: compare the calculated text density with a preset threshold;
Step 14: if the text density is not less than the preset threshold, determine that lines i and (i+1) are body content; if the text density is less than the preset threshold, determine that lines i and (i+1) are not body content;
Step 15: if lines i and (i+1) are determined to be body content, determine in the same way whether lines i, (i+1) and (i+2) are body content;
Step 16: if lines i and (i+1) are determined not to be body content, determine in the same way whether lines (i+2) and (i+3) are body content;
Step 17: repeat the above steps until all lines of the webpage have been traversed.
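The prior-art procedure in steps 11 to 17 can be sketched roughly as follows. This is a minimal illustration rather than code from the patent: the function names, the use of the basic CJK Unified Ideographs range to detect Chinese characters, and the choice to resume scanning right after a detected run of body lines are all assumptions made for the example.

```python
def is_chinese(ch: str) -> bool:
    # Assumption: "Chinese character" means the basic CJK Unified Ideographs range.
    return "\u4e00" <= ch <= "\u9fff"


def text_density(lines: list[str]) -> float:
    """Number of Chinese characters divided by total number of characters (step 12)."""
    total = sum(len(line) for line in lines)
    if total == 0:
        return 0.0
    chinese = sum(1 for line in lines for ch in line if is_chinese(ch))
    return chinese / total


def prior_art_body_lines(lines: list[str], threshold: float) -> list[int]:
    """Return the indices of the lines judged to be body content."""
    body = []
    i = 0
    while i + 1 < len(lines):
        end = i + 2  # the window is lines[i:end], initially lines i and i+1
        if text_density(lines[i:end]) >= threshold:  # steps 13 and 14
            # Step 15: keep adding the next line while the window stays dense.
            while end < len(lines) and text_density(lines[i:end + 1]) >= threshold:
                end += 1
            body.extend(range(i, end))
            i = end  # assumption: resume scanning after the detected run
        else:
            i += 2  # step 16: move on to lines i+2 and i+3
    return body


page = [
    "<a>home</a>",
    "<a>login</a>",
    "今天天气很好我们去公园",
    "大家都很开心地玩耍",
    "copyright 2010",
]
print(prior_art_body_lines(page, threshold=0.7))
```

With a lower threshold such as 0.5, the trailing copyright line in this toy page is absorbed into the body run, which illustrates how sensitive a pure text-density test is to its threshold.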
In this method, consecutive lines whose text density is not less than the preset threshold are regarded as body content. Many webpages today, however, contain highly interfering non-body content such as personal information, short articles and disclaimers. The text density of such content is high and is likely to exceed the preset threshold, so it may be mistaken for body content, which lowers the accuracy of body content extraction.
Summary of the invention
Embodiments of the present invention provide a method and device for extracting the body content of a webpage, in order to solve the prior-art problem of low accuracy when extracting webpage body content.
The technical solution of the embodiments of the present invention is as follows:
A method for extracting the body content of a webpage comprises the steps of: obtaining two webpages that belong to the same-level directory under the same website; and, for each obtained webpage: dividing the webpage into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition; extracting, from the selected content blocks, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage; and determining the extracted content blocks to be the body content of the webpage.
A device for extracting the body content of a webpage comprises: an obtaining unit, configured to obtain two webpages that belong to the same-level directory under the same website; a division unit, configured to divide each webpage obtained by the obtaining unit into content blocks; a first determining unit, configured to determine, for each webpage obtained by the obtaining unit, the label density and/or link density of each content block produced by the division unit; a selection unit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition; an extraction unit, configured to extract, for each webpage obtained by the obtaining unit, from the content blocks selected by the selection unit, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage; and a second determining unit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks extracted by the extraction unit to be the body content of the webpage.
In the technical solution of the embodiments of the present invention, webpages that belong to the same-level directory under the same website are generated from the same template, so their structure is similar or identical. For two such webpages, the embodiments therefore first select candidate body content blocks according to label density and/or link density, and then remove from the selected blocks the non-body content blocks whose text content is identical in the two webpages. What remains is the body content, which effectively improves the accuracy of webpage body content extraction.
Description of drawings
Fig. 1 is a schematic flowchart of a prior-art method for extracting webpage body content;
Fig. 2 is a schematic flowchart of the method for extracting webpage body content in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a specific implementation of the method for extracting webpage body content in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the device for extracting webpage body content in an embodiment of the present invention.
Detailed description of the embodiments
The main implementation principle, the specific embodiments and the beneficial effects that the technical solution of the embodiments of the present invention can achieve are set forth in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart of the method for extracting webpage body content in an embodiment of the present invention. The procedure is as follows:
Step 21: obtain two webpages that belong to the same-level directory under the same website;
The embodiments of the present invention observe that different pages in the same-level directory under the same website are usually generated from the same HyperText Markup Language (HTML) template, so the structure of such pages is identical or similar. For example, different pages in the same-level directory under the same website may all contain personal information, a disclaimer or a copyright notice with identical content; the position of this content may differ from page to page, but the content itself is the same.
Step 22: for each obtained webpage, divide the webpage into content blocks;
When dividing a webpage into content blocks, the webpage first needs normalization pre-processing so that it conforms to the HTML specification. The pre-processed webpage is then structured to generate a Document Object Model (DOM) tree, which gives a structured HTML representation of the webpage. Based on the <table> or <div> tags in the generated DOM tree, the webpage is visually partitioned into content blocks. The content blocks may be, but are not limited to being, divided by multi-level partitioning. With two-level partitioning, for example, the webpage is first divided into first-level content blocks, and each first-level content block is then divided into second-level content blocks. Other multi-level partitioning schemes are similar and are not described further here.
After the webpage has been divided into content blocks, each content block may be, but is not limited to being, identified by the number of the webpage and the number of the block at each level. For example, when two-level partitioning is used, a content block is identified as $C_i(j, k)$, where i indicates that the block belongs to the i-th webpage, j indicates the j-th first-level content block of that webpage, and k indicates the k-th second-level content block within that first-level block. In other words, $C_i(j, k)$ identifies the k-th second-level content block in the j-th first-level content block of the i-th webpage.
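A two-level partition keyed by (j, k) might be sketched as below. This is only an illustration under several assumptions: it uses Python's standard `html.parser` tokenizer instead of a full DOM tree, it assumes the input is already well-formed (the normalization step is not shown), and the class name `BlockSplitter`, the function `split_blocks`, and the convention of filing text that sits directly in a first-level block under k = 0 are hypothetical choices.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}


class BlockSplitter(HTMLParser):
    """Groups text by first-level (j) and second-level (k) <div>/<table> blocks."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current nesting depth of <div>/<table> elements
        self.j = 0        # index of the current first-level block
        self.k = 0        # index of the current second-level block inside j
        self.blocks = {}  # (j, k) -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1
            if self.depth == 1:
                self.j += 1
                self.k = 0
            elif self.depth == 2:
                self.k += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth >= 1:
            # Assumption: text directly inside a first-level block gets k = 0.
            key = (self.j, self.k if self.depth >= 2 else 0)
            self.blocks.setdefault(key, []).append(text)


def split_blocks(html: str) -> dict[tuple[int, int], str]:
    """Return a mapping (j, k) -> text for one webpage."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return {key: " ".join(parts) for key, parts in parser.blocks.items()}


page = (
    "<div><div>Nav</div><div>Hello <b>world</b></div></div>"
    "<div><table>Footer</table></div>"
)
print(split_blocks(page))
```

For webpage i, each key (j, k) in the returned mapping plays the role of the identifier $C_i(j, k)$.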
Step 23: for each obtained webpage, determine the label density and/or link density of each content block;
The embodiments of the present invention propose three options: determine the label density of each content block and select candidate body content blocks according to label density; determine the link density of each content block and select candidate body content blocks according to link density; or determine both the label density and the link density of each content block and select candidate body content blocks according to both.
Here, the label density of a content block is the ratio of the number of labels (HTML tags) in the block to the number of text characters in the block, and the link density of a content block is the ratio of the number of links in the block to the number of text characters in the block.
Let the text content of content block $C_i(j, k)$ be $T_i(j, k)$, its number of text characters be $N_i(j, k)$, its number of labels be $Q_i(j, k)$ and its number of links be $P_i(j, k)$. The label density $Y_i(j, k)$ and the link density $X_i(j, k)$ are then determined as follows:
$$Y_i(j, k) = \frac{Q_i(j, k)}{N_i(j, k)}$$
$$X_i(j, k) = \frac{P_i(j, k)}{N_i(j, k)}$$
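The two ratios can be written directly in code. The sketch below assumes a hypothetical `Block` record holding the text and the two counts, takes the number of text characters to be `len(text)`, and returns infinity for a block with no text so that such a block can never pass a "not greater than threshold" test; none of these choices comes from the patent.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str        # T_i(j, k)
    tag_count: int   # Q_i(j, k), number of labels (HTML tags)
    link_count: int  # P_i(j, k), number of links


def label_density(block: Block) -> float:
    """Y = Q / N, where N is the number of text characters."""
    n = len(block.text)
    return block.tag_count / n if n else float("inf")


def link_density(block: Block) -> float:
    """X = P / N, where N is the number of text characters."""
    n = len(block.text)
    return block.link_count / n if n else float("inf")


nav = Block(text="abcdefghij", tag_count=2, link_count=1)
print(label_density(nav), link_density(nav))
```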
Step 24: for each obtained webpage, select the content blocks whose label density and/or link density satisfies the corresponding preset condition;
If candidate body content blocks are selected according to label density, the process may be, but is not limited to, the following:
First obtain the label density threshold of each content block, then select the content blocks whose label density is not greater than the corresponding label density threshold. The selected content blocks are the content blocks that satisfy the preset condition, that is, the candidate body content blocks;
If candidate body content blocks are selected according to link density, the process may be, but is not limited to, the following:
First obtain the link density threshold of each content block, then select the content blocks whose link density is not greater than the corresponding link density threshold. The selected content blocks are the content blocks that satisfy the preset condition, that is, the candidate body content blocks;
If candidate body content blocks are selected according to both label density and link density, the process may be, but is not limited to, the following:
First obtain the label density threshold and the link density threshold of each content block, then select the content blocks whose label density is not greater than the corresponding label density threshold and whose link density is not greater than the corresponding link density threshold. The selected content blocks are the content blocks that satisfy the preset condition, that is, the candidate body content blocks.
The label density threshold may be, but is not limited to being, obtained as follows:
For each content block, determine the label density variance of the block from its label density, and determine the label density threshold of the block from that variance;
The link density threshold may be, but is not limited to being, obtained as follows:
For each content block, determine the link density variance of the block from its link density, and determine the link density threshold of the block from that variance;
Let the label density of content block $C_i(j, k)$ be $Y_i(j, k)$ and its link density be $X_i(j, k)$. The label density variance $D(Y_i(j, k))$ of the block can be determined from $Y_i(j, k)$, and the link density variance $D(X_i(j, k))$ from $X_i(j, k)$. The label density threshold $B(Y)$ of the block can then be determined from $D(Y_i(j, k))$, and the link density threshold $B(X)$ from $D(X_i(j, k))$.
Compare the label density $Y_i(j, k)$ with the label density threshold $B(Y)$. If $Y_i(j, k)$ is greater than $B(Y)$, set the value of $Y_i(j, k)$ to 0; if $Y_i(j, k)$ is not greater than $B(Y)$, set the value of $Y_i(j, k)$ to 1, that is:
$$Y_i(j, k) = \begin{cases} 0, & Y_i(j, k) > B(Y) \\ 1, & Y_i(j, k) \le B(Y) \end{cases}$$
Compare the link density $X_i(j, k)$ with the link density threshold $B(X)$. If $X_i(j, k)$ is greater than $B(X)$, set the value of $X_i(j, k)$ to 0; if $X_i(j, k)$ is not greater than $B(X)$, set the value of $X_i(j, k)$ to 1, that is:
$$X_i(j, k) = \begin{cases} 0, & X_i(j, k) > B(X) \\ 1, & X_i(j, k) \le B(X) \end{cases}$$
If candidate body content blocks are selected according to label density, the content blocks whose $Y_i(j, k)$ is 1 are chosen as candidate body content blocks, that is, as the content blocks that satisfy the corresponding preset condition;
If candidate body content blocks are selected according to link density, the content blocks whose $X_i(j, k)$ is 1 are chosen as candidate body content blocks, that is, as the content blocks that satisfy the corresponding preset condition;
If candidate body content blocks are selected according to both label density and link density, calculate $X_i(j, k) \times Y_i(j, k)$ for content block $C_i(j, k)$. If the result is 1, the content block is chosen as a candidate body content block, that is, as a content block that satisfies the corresponding preset condition. After the comparisons above, $X_i(j, k)$ and $Y_i(j, k)$ each take the value 1 or 0, so the product $X_i(j, k) \times Y_i(j, k)$ is 1 only when both values are 1.
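The binarize-and-multiply selection can be illustrated as follows. The text does not give the formula that turns a density variance into a threshold, so this sketch simply takes the thresholds $B(Y)$ and $B(X)$ as inputs; the function names and the dictionary layout are hypothetical.

```python
def indicator(density: float, threshold: float) -> int:
    """Return 0 if the density exceeds the threshold, otherwise 1."""
    return 0 if density > threshold else 1


def select_candidates(
    densities: dict[tuple[int, int], tuple[float, float]],
    label_threshold: float,  # B(Y), assumed to be given
    link_threshold: float,   # B(X), assumed to be given
) -> list[tuple[int, int]]:
    """Keep the blocks (j, k) for which the product of both indicators is 1."""
    selected = []
    for key, (y, x) in densities.items():
        if indicator(x, link_threshold) * indicator(y, label_threshold) == 1:
            selected.append(key)
    return selected


# (j, k) -> (label density Y, link density X) for one webpage
page_densities = {
    (1, 1): (0.50, 0.40),  # navigation bar: many tags and links
    (1, 2): (0.05, 0.02),  # article text
    (2, 1): (0.05, 0.30),  # list of related links
}
print(select_candidates(page_densities, label_threshold=0.1, link_threshold=0.1))
```

Selecting on a single density corresponds to using only one of the two indicators instead of their product.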
In the embodiments of the present invention, candidate body content blocks are selected according to label density and/or link density rather than body content being determined from text density. Part of the highly interfering non-body content is therefore removed first, which effectively improves the accuracy of webpage body content extraction.
Step 25: for each obtained webpage, extract, from the selected content blocks, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage;
The embodiments of the present invention observe that, between different pages in the same-level directory under the same website, the body text differs greatly while the noise content is identical. After the candidate body content blocks have been selected, the content blocks whose text content is identical in the two pages can therefore be removed as well. These content blocks are noise content and hence non-body content; the remaining candidate body content blocks are the body content of the webpage.
The content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage may be, but are not limited to being, extracted by polling, that is, by comparing the blocks pair by pair. For example:
Suppose the candidate body content blocks selected for webpage 1 are $C_1(1,1)$ and $C_1(1,2)$, and the candidate body content blocks selected for webpage 2 are $C_2(1,3)$, $C_2(2,2)$ and $C_2(3,1)$. First, the text content $T_1(1,1)$ of $C_1(1,1)$ is compared with the text content $T_2(1,3)$ of $C_2(1,3)$, and the result is inconsistent. Next, $T_1(1,1)$ is compared with the text content $T_2(2,2)$ of $C_2(2,2)$, and the result is consistent. $C_1(1,1)$ and $C_2(2,2)$ are therefore confirmed to be non-body content, so $C_1(1,1)$ is removed from the candidate content blocks of webpage 1 and $C_2(2,2)$ is removed from the candidate content blocks of webpage 2;
Then the text content $T_1(1,2)$ of $C_1(1,2)$ is compared with the text content $T_2(1,3)$ of $C_2(1,3)$, and the result is inconsistent; $T_1(1,2)$ is compared with the text content $T_2(3,1)$ of $C_2(3,1)$, and the result is again inconsistent. $C_1(1,2)$, $C_2(1,3)$ and $C_2(3,1)$ are therefore confirmed to be body content: the body content of webpage 1 is $C_1(1,2)$, and the body content of webpage 2 is $C_2(1,3)$ and $C_2(3,1)$.
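The pairwise comparison in this example can be reproduced with a short sketch. It assumes that two blocks are "consistent" when their texts are equal after whitespace normalization, which is only one possible reading; the function names and the sample texts are made up for illustration.

```python
def normalize(text: str) -> str:
    # Assumption: collapse runs of whitespace before comparing texts.
    return " ".join(text.split())


def extract_body(
    own: dict[tuple[int, int], str],
    other: dict[tuple[int, int], str],
) -> list[tuple[int, int]]:
    """Keep the blocks of `own` whose text matches no candidate block of `other`."""
    body = []
    for key, text in own.items():
        # Poll every candidate block of the other webpage in turn.
        if all(normalize(text) != normalize(t) for t in other.values()):
            body.append(key)
    return body


# Candidate blocks keyed by (j, k), mirroring the example in the text
page1 = {
    (1, 1): "Disclaimer: views are the author's own.",
    (1, 2): "Article about trains.",
}
page2 = {
    (1, 3): "Article about ships, part one.",
    (2, 2): "Disclaimer:  views are the author's own.",
    (3, 1): "Article about ships, part two.",
}
print(extract_body(page1, page2))
print(extract_body(page2, page1))
```

Because only the texts are compared, a shared block is removed even when it occupies a different position (a different (j, k)) in the two webpages.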
Although the noise content is identical between different pages in the same-level directory under the same website, its position in the page may differ. For example, personal information may be located at the upper left of page 1 and at the lower left of page 2. If identical subtrees were searched for according to the coordinates of nodes in the DOM tree, both content and position would have to match, so noise content with identical content but a different position might be mistaken for body content. The embodiments of the present application instead use the polling approach described above to extract the webpage body content from the candidate body content blocks. Noise content that is identical in content but different in position is thereby removed, and only the content blocks whose text content differs are extracted as body content, which effectively improves the accuracy of webpage body content extraction.
Step 26: for each obtained webpage, determine the extracted content blocks to be the body content of the webpage.
As the above procedure shows, in the technical solution of the embodiments of the present invention, webpages that belong to the same-level directory under the same website are generated from the same template, so their structure is similar or identical. For two such webpages, the embodiments therefore first select candidate body content blocks according to label density and/or link density, and then remove from the selected blocks the non-body content blocks whose text content is identical in the two webpages. What remains is the body content, which effectively improves the accuracy of webpage body content extraction.
A more detailed embodiment is given below.
Fig. 3 is a flowchart of a specific implementation of the method for extracting webpage body content in an embodiment of the present invention. The procedure is as follows:
Step 31: obtain webpage 1 and webpage 2, which belong to the same-level directory under the same website;
Step 32: apply normalization pre-processing to each obtained webpage so that it conforms to the HTML specification;
Step 33: structure each pre-processed webpage to generate a DOM tree;
Step 34: based on the <table> or <div> tags in the generated DOM tree, visually partition the webpage into content blocks;
Step 35: calculate the label density and the link density of each content block;
Step 36: for each content block, compare the label density with the label density threshold and compare the link density with the link density threshold;
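Step 37: determine the content blocks whose label density is not greater than the corresponding label density threshold and whose link density is not greater than the corresponding link density threshold to be the candidate body content blocks;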
Step 38: by polling, compare the candidate body content blocks of webpage 1 with the candidate body content blocks of webpage 2 for similarity;
Step 39: for each webpage, according to the comparison results, extract the content blocks whose text content is inconsistent with the text content of every candidate body content block in the other webpage; the extracted content blocks are the body content of the webpage.
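Steps 31 to 39 can be strung together in a compact end-to-end sketch. It is deliberately simplified and rests on several assumptions: only one level of `<div>`/`<table>` blocks is used, the input is taken to be well-formed already, the thresholds are passed in rather than derived from a variance, the opening tag of a block is counted among its labels, and "consistent" means exact text equality. All names are hypothetical.

```python
from html.parser import HTMLParser


class BlockStats(HTMLParser):
    """Collects text, tag count and link count for each top-level <div>/<table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.blocks = []  # one dict per top-level block

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "table"):
            self.depth += 1
            if self.depth == 1:
                self.blocks.append({"text": [], "tags": 0, "links": 0})
        if self.depth >= 1:
            self.blocks[-1]["tags"] += 1       # every start tag counts as a label
            if tag == "a":
                self.blocks[-1]["links"] += 1  # every <a> counts as a link

    def handle_endtag(self, tag):
        if tag in ("div", "table") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth >= 1 and data.strip():
            self.blocks[-1]["text"].append(data.strip())


def candidates(html: str, label_threshold: float, link_threshold: float) -> list[str]:
    """Steps 33 to 37: partition, compute densities, keep low-density blocks."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        text = " ".join(block["text"])
        n = len(text)
        if n == 0:
            continue  # assumption: blocks without text are never candidates
        if block["tags"] / n <= label_threshold and block["links"] / n <= link_threshold:
            kept.append(text)
    return kept


def extract_pair(html1, html2, label_threshold, link_threshold):
    """Steps 38 and 39: drop the candidate blocks whose text appears in both pages."""
    c1 = candidates(html1, label_threshold, link_threshold)
    c2 = candidates(html2, label_threshold, link_threshold)
    return [t for t in c1 if t not in c2], [t for t in c2 if t not in c1]


nav = '<div><a href="/">Home</a><a href="/n">News</a></div>'
footer = "<div>Copyright notice shared by every page of the site.</div>"
page1 = nav + "<div>First article body text, long enough to be dense.</div>" + footer
page2 = nav + footer + "<div>Second article body text, different from the first.</div>"
print(extract_pair(page1, page2, label_threshold=0.1, link_threshold=0.05))
```

In this toy input the navigation block is rejected by the density test, while the shared copyright block passes that test and is removed only by the cross-page comparison.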
Correspondingly, an embodiment of the present invention provides a device for extracting the body content of a webpage. As shown in Fig. 4, the device comprises an obtaining unit 41, a division unit 42, a first determining unit 43, a selection unit 44, an extraction unit 45 and a second determining unit 46, wherein:
The obtaining unit 41 is configured to obtain two webpages that belong to the same-level directory under the same website;
The division unit 42 is configured to divide each webpage obtained by the obtaining unit 41 into content blocks;
The first determining unit 43 is configured to determine, for each webpage obtained by the obtaining unit 41, the label density and/or link density of each content block produced by the division unit 42;
The selection unit 44 is configured to select, for each webpage obtained by the obtaining unit 41, the content blocks whose label density and/or link density satisfies the corresponding preset condition;
The extraction unit 45 is configured to extract, for each webpage obtained by the obtaining unit 41, from the content blocks selected by the selection unit 44, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage;
The second determining unit 46 is configured to determine, for each webpage obtained by the obtaining unit 41, the content blocks extracted by the extraction unit 45 to be the body content of the webpage.
Preferably, for each content block produced by the division unit 42, the first determining unit 43 determines the ratio of the number of labels in the content block to the number of text characters in the content block to be the label density of the content block, and
determines the ratio of the number of links in the content block to the number of text characters in the content block to be the link density of the content block.
Preferably, the selection unit 44 comprises an obtaining subunit, a selection subunit and a determining subunit, wherein:
The obtaining subunit is configured to obtain, for each webpage obtained by the obtaining unit 41, the label density threshold and/or link density threshold of each content block produced by the division unit 42;
The selection subunit is configured to select, for each webpage obtained by the obtaining unit 41, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;
The determining subunit is configured to determine, for each webpage obtained by the obtaining unit 41, the content blocks selected by the selection subunit to be the content blocks that satisfy the preset condition.
More preferably, for each content block produced by the division unit 42, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and
determines the link density threshold of the content block according to the link density variance of the content block.
Preferably, the division unit 42 comprises a pre-processing subunit, a generation subunit and a division subunit, wherein:
The pre-processing subunit is configured to apply normalization pre-processing to each webpage obtained by the obtaining unit 41;
The generation subunit is configured to generate, for each webpage obtained by the obtaining unit 41, the corresponding DOM tree from the webpage pre-processed by the pre-processing subunit;
The division subunit is configured to divide each webpage obtained by the obtaining unit 41 into content blocks according to the DOM tree generated by the generation subunit.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (8)

1. A method for extracting the body content of a webpage, characterized by comprising:
obtaining two webpages that belong to the same-level directory under the same website;
for each obtained webpage, performing the following:
dividing the webpage into content blocks;
determining the label density and/or link density of each content block, wherein the label density is the ratio of the number of labels in the content block to the number of text characters in the content block, and the link density is the ratio of the number of links in the content block to the number of text characters in the content block; and
selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition;
extracting, from the selected content blocks, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage;
determining the extracted content blocks to be the body content of the webpage.
2. The method for extracting the body content of a webpage according to claim 1, characterized in that selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition comprises:
obtaining the label density threshold and/or link density threshold of each content block;
selecting the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;
determining the selected content blocks to be the content blocks that satisfy the preset condition.
3. The method for extracting the body content of a webpage according to claim 2, characterized in that obtaining the label density threshold of each content block comprises:
for each content block, determining the label density variance of the content block according to its label density, and determining the label density threshold of the content block according to the determined label density variance;
and obtaining the link density threshold of each content block comprises:
for each content block, determining the link density variance of the content block according to its link density, and determining the link density threshold of the content block according to the determined link density variance.
4. The method for extracting the body content of a webpage according to claim 1, characterized in that dividing the webpage into content blocks comprises:
applying normalization pre-processing to the webpage;
generating the corresponding Document Object Model (DOM) tree from the pre-processed webpage;
dividing the webpage into content blocks according to the generated DOM tree.
5. A webpage text content extracting device, characterized by comprising:
An obtaining unit, configured to obtain two webpages belonging to the same level of directory under the same website;
A division unit, configured to divide each webpage obtained by the obtaining unit into content blocks;
A first determining unit, configured to determine, for each webpage obtained by the obtaining unit, the label density and/or link density of each content block obtained by the division unit, wherein: the label density is the ratio of the number of labels in the content block to the number of words of text in it, and the link density is the ratio of the number of links in the content block to the number of words of text in it;
A selecting unit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition;
An extraction unit, configured to extract, for each webpage obtained by the obtaining unit, from the content blocks selected by the selecting unit, those content blocks whose text content is inconsistent with the text content of every content block selected from the other webpage;
A second determining unit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks extracted by the extraction unit as the body text of the webpage.
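The units of this claim form a pipeline: compute densities, select blocks by a preset condition, then keep only blocks whose text does not also appear among the blocks selected from the other page. The sketch below illustrates that flow under stated assumptions: the `Block` type, the fixed thresholds `max_label` and `max_link`, and the use of a non-whitespace character count as the text length are all hypothetical. The claim allows "and/or" for the two densities; this sketch requires both conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    tag_count: int
    link_count: int


def text_length(block):
    # Assumption: text size is the count of non-whitespace characters.
    return max(len("".join(block.text.split())), 1)


def label_density(block):
    return block.tag_count / text_length(block)


def link_density(block):
    return block.link_count / text_length(block)


def select_blocks(blocks, max_label, max_link):
    """Keep blocks whose densities do not exceed the given thresholds."""
    return [
        b for b in blocks
        if label_density(b) <= max_label and link_density(b) <= max_link
    ]


def extract_body(page_blocks, other_page_blocks, max_label=0.1, max_link=0.05):
    """Return selected blocks whose text matches no selected block of the other page."""
    mine = select_blocks(page_blocks, max_label, max_link)
    others = {b.text for b in select_blocks(other_page_blocks, max_label, max_link)}
    return [b for b in mine if b.text not in others]
```

Comparing against a sibling page from the same directory is what removes shared boilerplate, such as a footer, that has low densities and would otherwise pass the selection step.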
6. The webpage text content extracting device as claimed in claim 5, characterized in that the selecting unit specifically comprises:
An obtaining subunit, configured to obtain, for each webpage obtained by the obtaining unit, the label density threshold and/or link density threshold of each content block obtained by the division unit;
A selecting subunit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;
A determining subunit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks selected by the selecting subunit as the content blocks satisfying the preset condition.
7. The webpage text content extracting device as claimed in claim 6, characterized in that, for each content block obtained by the division unit, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and
Determines the link density threshold of the content block according to the link density variance of the content block.
8. The webpage text content extracting device as claimed in claim 6, characterized in that the division unit specifically comprises:
A preprocessing subunit, configured to perform normalization preprocessing on each webpage obtained by the obtaining unit;
A generating subunit, configured to generate, for each webpage obtained by the obtaining unit, a corresponding Document Object Model (DOM) tree according to the webpage preprocessed by the preprocessing subunit;
A dividing subunit, configured to divide each webpage obtained by the obtaining unit into content blocks according to the DOM tree generated by the generating subunit.
CN 201010591506 2010-12-16 2010-12-16 Webpage text content extracting method and device Active CN102541874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010591506 CN102541874B (en) 2010-12-16 2010-12-16 Webpage text content extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010591506 CN102541874B (en) 2010-12-16 2010-12-16 Webpage text content extracting method and device

Publications (2)

Publication Number Publication Date
CN102541874A (en) 2012-07-04
CN102541874B (en) 2013-11-06

Family

ID=46348795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010591506 Active CN102541874B (en) 2010-12-16 2010-12-16 Webpage text content extracting method and device

Country Status (1)

Country Link
CN (1) CN102541874B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064827A (en) * 2013-01-16 2013-04-24 盘古文化传播有限公司 Method and device for extracting webpage content
CN103970755B (en) * 2013-01-28 2018-12-11 腾讯科技(深圳)有限公司 A kind of recognition methods of listing of novel item, device and system
CN103116760A (en) * 2013-02-18 2013-05-22 人民搜索网络股份公司 Method and device for identifying text-missing web pages
CN103258000B (en) * 2013-03-29 2017-02-08 北界无限(北京)软件有限公司 Method and device for clustering high-frequency keywords in webpages
CN103309961B (en) * 2013-05-30 2015-07-15 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
CN103425765A (en) * 2013-08-06 2013-12-04 优视科技有限公司 Method and device for extracting webpage text and method and system for webpage preview
CN105335382B (en) * 2014-06-27 2018-11-16 优视科技有限公司 The extracting method and device of Web page text
WO2015165324A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage text extraction method and device, and webpage advertisement handling method and device
CN104268192B (en) * 2014-09-20 2018-08-07 广州猎豹网络科技有限公司 A kind of webpage information extracting method, device and terminal
CN104484451B (en) * 2014-12-25 2017-12-19 北京国双科技有限公司 The extracting method and device of Webpage information
CN105095466A (en) * 2015-07-31 2015-11-25 山东大学 Web text information extraction method
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106855859B (en) * 2015-12-08 2020-11-10 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN107103012A (en) * 2016-01-28 2017-08-29 阿里巴巴集团控股有限公司 Recognize method, device and the server of violated webpage
CN105808644A (en) * 2016-02-25 2016-07-27 浪潮软件集团有限公司 Method and device for determining text node
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN106227858B (en) * 2016-07-28 2019-06-25 北京橘子文化传媒有限公司 A kind of accurate extracting method of mobile Internet webpage or media platform article content
CN106960057A (en) * 2017-04-05 2017-07-18 上海威固信息技术有限公司 A kind of method that Web page text is extracted based on information density
CN110020312B (en) * 2017-12-11 2022-09-06 北京京东尚科信息技术有限公司 Method and device for extracting webpage text
CN110020247B (en) * 2017-12-22 2021-05-14 中移(苏州)软件技术有限公司 Webpage key module extraction method and device
CN108763591B (en) * 2018-06-21 2021-01-08 湖南星汉数智科技有限公司 Webpage text extraction method and device, computer device and computer readable storage medium
CN110334300A (en) * 2019-07-10 2019-10-15 哈尔滨工业大学 Text aid reading method towards the analysis of public opinion
CN112749528A (en) * 2019-10-31 2021-05-04 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763740A (en) * 2003-09-18 2006-04-26 富士通株式会社 Info web piece extracting method and device
CN1786947A (en) * 2004-12-07 2006-06-14 国际商业机器公司 System, method and program for extracting web page core content based on web page layout

Also Published As

Publication number Publication date
CN102541874A (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN102541874B (en) Webpage text content extracting method and device
Sun et al. Dom based content extraction via text density
CN101727461B (en) Method for extracting content of web page
CN107590219A (en) Webpage personage subject correlation message extracting method
CN103853760B (en) Method and device for extracting contents of bodies of web pages
CN105022803B (en) A kind of method and system for extracting Web page text content
CN102915361B (en) Webpage text extracting method based on character distribution characteristic
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN107992542A (en) A kind of similar article based on topic model recommends method
CN102270206A (en) Method and device for capturing valid web page contents
CN103810251B (en) Method and device for extracting text
EP2425353A1 (en) Method and apparatus for identifying synonyms and using synonyms to search
CN104598577A (en) Extraction method for webpage text
CN103473338A (en) Webpage content extraction method and webpage content extraction system
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN104699667A (en) Semantic dictionary-based improved word similarity calculating method and device
CN102117289A (en) Method and device for extracting comment content from webpage
CN104572934A (en) Webpage key content extracting method based on DOM
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN103440315A (en) Web page cleaning method based on theme
CN102799638B (en) In-page navigation generation method facing barrier-free access to webpage contents
CN103942211A (en) Text page recognition method and device
CN107436931B (en) Webpage text extraction method and device
CN103455572B (en) Obtain the method and device of video display main body in webpage
CN106528509A (en) Webpage information extracting method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant