CN102541874B

CN102541874B - Webpage text content extracting method and device

Info

Publication number: CN102541874B
Application number: CN 201010591506
Authority: CN
Inventors: 周奕; 周宇煜; 吴淑燕
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2010-12-16
Filing date: 2010-12-16
Publication date: 2013-11-06
Anticipated expiration: 2030-12-16
Also published as: CN102541874A

Abstract

The invention discloses a method and device for extracting the body text content of a webpage. The method comprises: acquiring two webpages that belong to the same-level directory under the same website; and, for each acquired webpage: dividing the webpage into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density meets the corresponding preset condition; among the selected content blocks, extracting those whose text content is inconsistent with the text content of every content block selected from the other webpage; and determining the extracted content blocks as the body text content of the webpage. This technical scheme addresses the low accuracy of prior-art methods for extracting the body text content of a webpage.

Description

Webpage text content extracting method and device

Technical field

The present invention relates to the field of internet information processing technology, and in particular to a method and device for extracting webpage body content.

Background technology

With the rapid development of Internet technology, the information on webpages has become increasingly abundant, and people continue to pursue techniques that can organize and use network information effectively. Unlike traditional text, however, webpages are neither tidy nor clean: they contain a large amount of noise content, such as scripts added to enhance user interactivity, navigation links added to make browsing easier, and advertisement links added for commercial reasons. This noise content reduces both the efficiency and the accuracy of webpage information retrieval. Accurate extraction of webpage body content not only filters out the interference of navigation information, advertising information, copyright information, related links and similar content with retrieval results, but also supports further processing of the webpage such as automatic word segmentation, named entity recognition, automatic summarization, automatic classification and automatic clustering.

As shown in Figure 1, the prior-art method for extracting webpage body content proceeds as follows:

Step 11: for a single webpage, determine the total number of characters and the number of Chinese characters in lines i and (i+1);

Step 12: calculate the text density of lines i and (i+1), which may be obtained by dividing the number of Chinese characters by the total number of characters;

Step 13: compare the calculated text density with a preset threshold;

Step 14: if the text density is not less than the preset threshold, determine that lines i and (i+1) are body content; if the text density is less than the preset threshold, determine that lines i and (i+1) are not body content;

Step 15: if lines i and (i+1) are determined to be body content, determine in the same manner whether lines i, (i+1) and (i+2) are body content;

Step 16: if lines i and (i+1) are determined not to be body content, determine in the same manner whether lines (i+2) and (i+3) are body content;

Step 17: repeat the above steps until all lines of the webpage have been traversed.
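The prior-art procedure of Steps 11 to 17 can be illustrated with a minimal Python sketch. The function names, the threshold value of 0.5, the Unicode range used to recognize Chinese characters, and the choice to resume scanning right after a grown window are all assumptions made for illustration; the prior art as described does not fix them.

```python
def is_chinese(ch):
    # Assumption: the CJK Unified Ideographs range stands in for "Chinese character".
    return "\u4e00" <= ch <= "\u9fff"


def text_density(lines):
    # Steps 11-12: Chinese character count divided by total character count.
    text = "".join(lines)
    if not text:
        return 0.0
    return sum(1 for ch in text if is_chinese(ch)) / len(text)


def prior_art_body_lines(lines, threshold=0.5):
    """Return the indices of lines judged to be body content."""
    body = set()
    i = 0
    while i + 1 < len(lines):
        j = i + 1
        # Steps 13-14 and 16: a pair below the threshold is skipped.
        if text_density(lines[i:j + 1]) < threshold:
            i += 2
            continue
        # Step 15: keep adding the next line while the window stays dense.
        while j + 1 < len(lines) and text_density(lines[i:j + 2]) >= threshold:
            j += 1
        body.update(range(i, j + 1))
        i = j + 1  # assumption: continue after the window that was accepted
    return sorted(body)
```

Because the decision depends only on text density, any dense non-body passage (a disclaimer, for instance) would be accepted by this sketch as well, which is the weakness the following paragraph describes.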

In the above method, consecutive lines are treated as body content whenever their text density is not less than the preset threshold. Many webpages today, however, contain highly disruptive non-body content, such as personal information, short articles and disclaimers. The text density of such non-body content is high and is likely to exceed the preset threshold, so it may be mistaken for body content, which lowers the accuracy of body content extraction.

Summary of the invention

The embodiments of the present invention provide a method and device for extracting webpage body content, in order to solve the problem in the prior art that webpage body content is extracted with low accuracy.

The technical scheme of the embodiments of the present invention is as follows:

A method for extracting webpage body content comprises the steps of: obtaining two webpages that belong to the same-level directory under the same website; for each obtained webpage, performing the following: dividing the webpage into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition; among the selected content blocks, extracting those whose text content is inconsistent with the text content of every content block selected in the other webpage; and determining the extracted content blocks as the body content of the webpage.

A device for extracting webpage body content comprises: an obtaining unit, configured to obtain two webpages that belong to the same-level directory under the same website; a dividing unit, configured to divide each webpage obtained by the obtaining unit into content blocks; a first determining unit, configured to determine, for each webpage obtained by the obtaining unit, the label density and/or link density of each content block produced by the dividing unit; a selecting unit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition; an extracting unit, configured to extract, for each webpage obtained by the obtaining unit and from the content blocks selected by the selecting unit, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage; and a second determining unit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks extracted by the extracting unit as the body content of the webpage.

In the technical scheme of the embodiments of the present invention, webpages that belong to the same-level directory under the same website are generated from the same template, so their page structures are similar or identical. For two such webpages, the embodiments therefore first select candidate body content blocks according to label density and/or link density, and then remove from the selected blocks the non-body content blocks whose text content is identical in both webpages. The blocks that remain are the body content blocks, which effectively improves the accuracy of webpage body content extraction.

Description of drawings

Fig. 1 is a schematic flowchart of the prior-art method for extracting webpage body content;

Fig. 2 is a schematic flowchart of the method for extracting webpage body content in an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a specific implementation of the method for extracting webpage body content in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the device for extracting webpage body content in an embodiment of the present invention.

Embodiment

The main implementation principle, specific implementations and achievable beneficial effects of the technical scheme of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Figure 2, the method for extracting webpage body content in an embodiment of the present invention proceeds as follows:

Step 21: obtain two webpages that belong to the same-level directory under the same website;

The embodiments of the present invention observe that different pages in the same-level directory under the same website are normally generated from the same Hypertext Markup Language (HTML) template, so their page structures are identical or similar. For example, different pages in the same-level directory under the same website all contain personal information, a disclaimer or a copyright statement with identical content. The position of this content may differ from page to page, but the content itself is the same.

Step 22: divide each obtained webpage into content blocks;

To divide a webpage into content blocks, the webpage is first normalized in a pre-processing step so that it conforms to the HTML language standard. The pre-processed webpage is then structured into a Document Object Model (DOM) tree, which gives the structured HTML representation of the webpage. Based on the <table> or <div> tags in the generated DOM tree, the webpage is visually partitioned into content blocks. The partitioning may be, but is not limited to, multi-level. With two-level partitioning, for example, the webpage is first divided into level-one content blocks, and each level-one content block is then divided into level-two content blocks. Other multi-level partitioning schemes are similar and are not described further here.

After a webpage has been divided into content blocks, each content block may be identified by, but is not limited to, the webpage number and the block numbers at each level. For example, when two-level partitioning is used, a content block is identified as C_i(j, k), where i indicates that the block belongs to the i-th webpage, j indicates that it belongs to the j-th level-one content block of that webpage, and k indicates that it is the k-th level-two content block of that level-one block. In other words, C_i(j, k) identifies the k-th level-two content block within the j-th level-one content block of the i-th webpage.
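The two-level partitioning and the (j, k) numbering described above can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the input is taken to be already normalized, well-formed HTML; nesting of <div> and <table> tags stands in for the visual partitioning; text that sits directly in a level-one block is filed under k = 0; and the names `BlockSplitter` and `split_blocks` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}


class BlockSplitter(HTMLParser):
    """Collect text per (j, k): level-one block j, level-two block k."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current nesting depth of block tags
        self.j = 0        # number of the current level-one block
        self.k = 0        # number of the current level-two block
        self.blocks = {}  # (j, k) -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1
            if self.depth == 1:
                self.j += 1
                self.k = 0
            elif self.depth == 2:
                self.k += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth >= 1:
            # Assumption: text directly inside a level-one block gets k = 0.
            key = (self.j, self.k if self.depth >= 2 else 0)
            self.blocks.setdefault(key, []).append(text)


def split_blocks(html):
    """Return {(j, k): text} for one webpage."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return {key: " ".join(parts) for key, parts in parser.blocks.items()}
```

For a page i, the key (j, k) returned by `split_blocks` plays the role of the identifier C_i(j, k). Blocks nested deeper than two levels are folded into their enclosing level-two block in this sketch.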

Step 23: for each obtained webpage, determine the label density and/or link density of each content block;

In the embodiments of the present invention, candidate body content blocks may be selected in any of three ways: by determining the label density of each content block and selecting according to label density; by determining the link density of each content block and selecting according to link density; or by determining both and selecting according to label density and link density together.

Here, the label density of a content block is the ratio of the number of labels (that is, HTML tags) in the content block to its text word count, and the link density of a content block is the ratio of the number of links in the content block to its text word count.

Let the text content of content block C_i(j, k) be T_i(j, k), its text word count be N_i(j, k), its label count be Q_i(j, k), and its link count be P_i(j, k). The label density Y_i(j, k) and the link density X_i(j, k) are then determined as follows:

Y_{i} (j, k) = \frac{Q_{i} (j, k)}{N_{i} (j, k)}

X_{i} (j, k) = \frac{P_{i} (j, k)}{N_{i} (j, k)}
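As a rough illustration of the two ratios above, the following Python sketch computes Y and X for the HTML fragment of a single content block. Several choices here are assumptions that the text does not specify: only start tags are counted as labels Q, each <a> start tag is counted as one link P, the text word count N is taken as the number of non-whitespace characters, and a block with no text is given infinite density so that it can never pass a threshold test. The names `TagLinkCounter` and `densities` are hypothetical.

```python
from html.parser import HTMLParser


class TagLinkCounter(HTMLParser):
    """Count labels (Q), links (P) and text size (N) in one block."""

    def __init__(self):
        super().__init__()
        self.labels = 0
        self.links = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        self.labels += 1
        if tag == "a":
            self.links += 1

    def handle_data(self, data):
        # Assumption: N counts non-whitespace characters.
        self.words += len("".join(data.split()))


def densities(block_html):
    """Return (label density Y, link density X) for one content block."""
    counter = TagLinkCounter()
    counter.feed(block_html)
    counter.close()
    if counter.words == 0:
        return float("inf"), float("inf")
    return counter.labels / counter.words, counter.links / counter.words
```

A navigation bar made of short anchors yields high values for both ratios, while a long paragraph wrapped in a single tag yields values near zero, which is the contrast the selection step relies on.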

Step 24: for each obtained webpage, select the content blocks whose label density and/or link density satisfies the corresponding preset condition;

If candidate body content blocks are selected according to label density, the process may be, but is not limited to, the following:

First obtain the label density threshold of each content block, then select the content blocks whose label density is not greater than the corresponding label density threshold. The selected content blocks are the content blocks that satisfy the preset condition, that is, the candidate body content blocks;

If candidate body content blocks are selected according to link density, the process may be, but is not limited to, the following:

First obtain the link density threshold of each content block, then select the content blocks whose link density is not greater than the corresponding link density threshold. The selected content blocks are the content blocks that satisfy the preset condition, that is, the candidate body content blocks;

If candidate body content blocks are selected according to both label density and link density, the process may be, but is not limited to, the following:

First obtain the label density threshold and the link density threshold of each content block, then select the content blocks whose label density is not greater than the corresponding label density threshold and whose link density is not greater than the corresponding link density threshold. The selected content blocks are the content blocks that satisfy the preset condition, that is, the candidate body content blocks.

The label density threshold may be obtained in, but is not limited to, the following manner:

For each content block, determine the label density variance of the content block according to its label density, and determine the label density threshold of the content block according to that label density variance;

The link density threshold may be obtained in, but is not limited to, the following manner:

For each content block, determine the link density variance of the content block according to its link density, and determine the link density threshold of the content block according to that link density variance;

Let the label density of content block C_i(j, k) be Y_i(j, k) and its link density be X_i(j, k). From the label density Y_i(j, k), the label density variance D(Y_i(j, k)) of content block C_i(j, k) can be determined, and from the link density X_i(j, k), the link density variance D(X_i(j, k)) can be determined. From the label density variance D(Y_i(j, k)), the label density threshold B(Y) of content block C_i(j, k) can then be determined, and from the link density variance D(X_i(j, k)), the link density threshold B(X) of content block C_i(j, k) can be determined.
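The text states that each threshold is determined from a density variance but does not give the formula. Purely as a hypothetical placeholder, the sketch below takes the densities of all content blocks of one webpage and sets the threshold to their mean plus one standard deviation (the square root of the population variance). Both the pooling over all blocks of a page and the mean-plus-deviation rule are assumptions introduced here, not the rule used by the invention.

```python
from statistics import mean, pvariance


def hypothetical_threshold(densities):
    """Assumed rule: mean of the densities plus the square root of their variance."""
    values = list(densities)
    if not values:
        raise ValueError("at least one density is required")
    variance = pvariance(values)  # population variance D(.)
    return mean(values) + variance ** 0.5
```

Any other monotone function of the variance could be substituted; the later selection step only needs a number B(Y) or B(X) to compare against.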

Compare the label density Y_i(j, k) with the label density threshold B(Y). If Y_i(j, k) is greater than B(Y), set the value of Y_i(j, k) to 0; if Y_i(j, k) is not greater than B(Y), set the value of Y_i(j, k) to 1. That is:

Y_{i}(j, k) = \begin{cases} 0, & Y_{i}(j, k) > B(Y) \\ 1, & Y_{i}(j, k) \leq B(Y) \end{cases}

Compare the link density X_i(j, k) with the link density threshold B(X). If X_i(j, k) is greater than B(X), set the value of X_i(j, k) to 0; if X_i(j, k) is not greater than B(X), set the value of X_i(j, k) to 1. That is:

X_{i}(j, k) = \begin{cases} 0, & X_{i}(j, k) > B(X) \\ 1, & X_{i}(j, k) \leq B(X) \end{cases}

If candidate body content blocks are selected according to label density, the content blocks for which Y_i(j, k) is 1 are selected as candidate body content blocks, that is, as the content blocks that satisfy the corresponding preset condition;

If candidate body content blocks are selected according to link density, the content blocks for which X_i(j, k) is 1 are selected as candidate body content blocks, that is, as the content blocks that satisfy the corresponding preset condition;

If candidate body content blocks are selected according to both label density and link density, X_i(j, k) * Y_i(j, k) is calculated for content block C_i(j, k). If the result is 1, the content block is selected as a candidate body content block, that is, as a content block that satisfies the corresponding preset condition. After the comparison above, X_i(j, k) and Y_i(j, k) each take the value 1 or 0, so the product X_i(j, k) * Y_i(j, k) is 1 only when both X_i(j, k) and Y_i(j, k) are 1.
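The indicator values and their product can be sketched in Python as follows. The function names and the dictionary layout (block identifier mapped to a pair of label density and link density) are assumptions for illustration, and the thresholds are passed in as plain numbers because the text does not fix how they are computed. Passing `None` for a threshold skips that criterion, which covers the label-only and link-only variants.

```python
def indicator(density, threshold):
    # 0 when the density exceeds the threshold, otherwise 1.
    return 0 if density > threshold else 1


def select_candidates(blocks, b_y=None, b_x=None):
    """blocks maps (j, k) to (label density Y, link density X).

    Returns the identifiers of the candidate body content blocks.
    """
    selected = []
    for key, (y, x) in blocks.items():
        y_flag = 1 if b_y is None else indicator(y, b_y)
        x_flag = 1 if b_x is None else indicator(x, b_x)
        if x_flag * y_flag == 1:  # both criteria hold
            selected.append(key)
    return selected
```

With blocks `{(1, 1): (0.375, 0.25), (1, 2): (0.1, 0.0), (2, 1): (0.05, 0.3)}` and thresholds `b_y=0.2`, `b_x=0.1`, only block (1, 2) has both indicators equal to 1.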

In the embodiments of the present invention, candidate body content blocks are selected according to label density and/or link density rather than body content being determined according to text density. This removes part of the highly disruptive non-body content at an early stage and therefore effectively improves the accuracy of webpage body content extraction.

Step 25: for each obtained webpage, extract from the selected content blocks those whose text content is inconsistent with the text content of every content block selected in the other webpage;

The embodiments of the present invention observe that the body content of different pages in the same-level directory under the same website differs considerably, whereas their noise content is identical. After the candidate body content blocks have been selected, the content blocks whose text content is identical in the two pages can therefore be removed as well. These content blocks are noise content and hence non-body content; the remaining candidate body content blocks are the webpage body content.

The content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage may be extracted by, but are not limited to, polling (pairwise comparison), for example:

Suppose the candidate body content blocks selected for webpage 1 are C₁(1,1) and C₁(1,2), and the candidate body content blocks selected for webpage 2 are C₂(1,3), C₂(2,2) and C₂(3,1). First, the text content T₁(1,1) of content block C₁(1,1) is compared with the text content T₂(1,3) of content block C₂(1,3), and the result is inconsistent. T₁(1,1) is then compared with the text content T₂(2,2) of content block C₂(2,2), and the result is consistent. Content blocks C₁(1,1) and C₂(2,2) are thus confirmed to be non-body content, so C₁(1,1) is removed from the candidate content blocks of webpage 1 and C₂(2,2) is removed from the candidate content blocks of webpage 2;

Next, the text content T₁(1,2) of content block C₁(1,2) is compared with the text content T₂(1,3) of content block C₂(1,3), and the result is inconsistent. T₁(1,2) is then compared with the text content T₂(3,1) of content block C₂(3,1), and the result is again inconsistent. Content blocks C₁(1,2), C₂(1,3) and C₂(3,1) are thus confirmed to be body content: the body content of webpage 1 is C₁(1,2), and the body content of webpage 2 is C₂(1,3) and C₂(3,1).

Although the noise content of different pages in the same-level directory under the same website is identical, its position in the page may differ. For example, personal information may be located at the upper left of page 1 and at the lower left of page 2. If identical subtrees were searched for according to the coordinates of nodes in the DOM tree, both content and position would have to match, so noise content with identical content but a different position could be mistaken for body content. The embodiments of the present invention instead use the polling approach described above to extract the webpage body content from the candidate body content blocks. Noise content with identical content but a different position is therefore removed, and only the content blocks whose text content differs are extracted as body content, which effectively improves the accuracy of webpage body content extraction.
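The polling comparison and the worked example above can be sketched in Python. The sketch assumes that two blocks are "consistent" when their texts are equal after whitespace normalization; the text also speaks of a similarity comparison without defining it, so exact equality is a simplifying assumption. The function names and sample texts are hypothetical. Because blocks are matched by text alone and never by their (j, k) position, noise that moves around the page is still removed.

```python
def normalize(text):
    # Collapse runs of whitespace so layout differences do not matter.
    return " ".join(text.split())


def extract_body(candidates_1, candidates_2):
    """Each argument maps (j, k) to the text of a candidate block.

    Returns the body blocks of webpage 1 and of webpage 2.
    """
    texts_1 = {normalize(t) for t in candidates_1.values()}
    texts_2 = {normalize(t) for t in candidates_2.values()}
    # Keep a block only if no candidate block of the other page has the same text.
    body_1 = {k: t for k, t in candidates_1.items() if normalize(t) not in texts_2}
    body_2 = {k: t for k, t in candidates_2.items() if normalize(t) not in texts_1}
    return body_1, body_2


page_1 = {(1, 1): "All rights reserved.", (1, 2): "Article A text"}
page_2 = {(1, 3): "Article B, part one", (2, 2): "All rights  reserved.",
          (3, 1): "Article B, part two"}
body_1, body_2 = extract_body(page_1, page_2)
```

Here `body_1` keeps only block (1, 2) and `body_2` keeps blocks (1, 3) and (3, 1), mirroring the outcome of the example with C₁(1,2), C₂(1,3) and C₂(3,1). Using sets makes the comparison linear in the number of blocks while giving the same result as comparing every pair in turn.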

Step 26: for each obtained webpage, determine the extracted content blocks as the body content of the webpage.

As the above process shows, in the technical scheme of the embodiments of the present invention, webpages that belong to the same-level directory under the same website are generated from the same template, so their page structures are similar or identical. For two such webpages, the embodiments therefore first select candidate body content blocks according to label density and/or link density, and then remove from the selected blocks the non-body content blocks whose text content is identical in both webpages. The blocks that remain are the body content blocks, which effectively improves the accuracy of webpage body content extraction.

A more detailed embodiment is given below.

As shown in Figure 3, a specific implementation of the method for extracting webpage body content in an embodiment of the present invention proceeds as follows:

Step 31: obtain webpage 1 and webpage 2, which belong to the same-level directory under the same website;

Step 32: normalize each obtained webpage in a pre-processing step so that it conforms to the HTML language standard;

Step 33: structure each pre-processed webpage and generate its DOM tree;

Step 34: visually partition each webpage into content blocks according to the <table> or <div> tags in the generated DOM tree;

Step 35: calculate the label density and the link density of each content block;

Step 36: for each content block, compare the label density with the label density threshold, and compare the link density with the link density threshold;

Step 37: determine the content blocks whose label density is not greater than the corresponding label density threshold and whose link density is not greater than the corresponding link density threshold as candidate body content blocks;

Step 38: by polling, compare the candidate body content blocks of webpage 1 with the candidate body content blocks of webpage 2 for similarity;

Step 39: for each webpage, according to the comparison results, extract the content blocks whose text content is inconsistent with the text content of every candidate body content block in the other webpage; the extracted content blocks are the body content of the webpage.

Correspondingly, an embodiment of the present invention provides a device for extracting webpage body content. As shown in Figure 4, the device comprises an obtaining unit 41, a dividing unit 42, a first determining unit 43, a selecting unit 44, an extracting unit 45 and a second determining unit 46, wherein:

The obtaining unit 41 is configured to obtain two webpages that belong to the same-level directory under the same website;

The dividing unit 42 is configured to divide each webpage obtained by the obtaining unit 41 into content blocks;

The first determining unit 43 is configured to determine, for each webpage obtained by the obtaining unit 41, the label density and/or link density of each content block produced by the dividing unit 42;

The selecting unit 44 is configured to select, for each webpage obtained by the obtaining unit 41, the content blocks whose label density and/or link density satisfies the corresponding preset condition;

The extracting unit 45 is configured to extract, for each webpage obtained by the obtaining unit 41 and from the content blocks selected by the selecting unit 44, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage;

The second determining unit 46 is configured to determine, for each webpage obtained by the obtaining unit 41, the content blocks extracted by the extracting unit 45 as the body content of the webpage.

Preferably, for each content block produced by the dividing unit 42, the first determining unit 43 determines the ratio of the number of labels in the content block to its text word count as the label density of the content block, and determines the ratio of the number of links in the content block to its text word count as the link density of the content block.

Preferably, the selecting unit 44 specifically comprises an obtaining subunit, a selecting subunit and a determining subunit, wherein:

The obtaining subunit is configured to obtain, for each webpage obtained by the obtaining unit 41, the label density threshold and/or link density threshold of each content block produced by the dividing unit 42;

The selecting subunit is configured to select, for each webpage obtained by the obtaining unit 41, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;

The determining subunit is configured to determine, for each webpage obtained by the obtaining unit 41, the content blocks selected by the selecting subunit as the content blocks that satisfy the preset condition.

More preferably, for each content block produced by the dividing unit 42, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and determines the link density threshold of the content block according to the link density variance of the content block.

Preferably, the dividing unit 42 specifically comprises a pre-processing subunit, a generating subunit and a dividing subunit, wherein:

The pre-processing subunit is configured to normalize, in a pre-processing step, each webpage obtained by the obtaining unit 41;

The generating subunit is configured to generate, for each webpage obtained by the obtaining unit 41, the corresponding DOM tree from the webpage pre-processed by the pre-processing subunit;

The dividing subunit is configured to divide, for each webpage obtained by the obtaining unit 41, the webpage into content blocks according to the DOM tree generated by the generating subunit.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A method for extracting webpage body content, characterized by comprising:

Obtaining two webpages that belong to the same-level directory under the same website;

For each obtained webpage, performing the following:

Dividing the webpage into content blocks;

Determining the label density and/or link density of each content block, wherein the label density is the ratio of the number of labels in the content block to its text word count, and the link density is the ratio of the number of links in the content block to its text word count; and

Selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition;

Among the selected content blocks, extracting the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage;

Determining the extracted content blocks as the body content of the webpage.

2. The method for extracting webpage body content as claimed in claim 1, characterized in that selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition specifically comprises:

Obtaining the label density threshold and/or link density threshold of each content block;

Selecting the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;

Determining the selected content blocks as the content blocks that satisfy the preset condition.

3. The method for extracting webpage body content as claimed in claim 2, characterized in that obtaining the label density threshold of each content block specifically comprises:

For each content block, determining the label density variance of the content block according to its label density, and determining the label density threshold of the content block according to the determined label density variance;

And obtaining the link density threshold of each content block specifically comprises:

For each content block, determining the link density variance of the content block according to its link density, and determining the link density threshold of the content block according to the determined link density variance.

4. The method for extracting webpage body content as claimed in claim 1, characterized in that dividing the webpage into content blocks specifically comprises:

Normalizing the webpage in a pre-processing step;

Generating the corresponding Document Object Model (DOM) tree from the pre-processed webpage;

Dividing the webpage into content blocks according to the generated DOM tree.

5. A device for extracting webpage body content, characterized by comprising:

An obtaining unit, configured to obtain two webpages that belong to the same-level directory under the same website;

A dividing unit, configured to divide each webpage obtained by the obtaining unit into content blocks;

A first determining unit, configured to determine, for each webpage obtained by the obtaining unit, the label density and/or link density of each content block produced by the dividing unit, wherein the label density is the ratio of the number of labels in the content block to its text word count, and the link density is the ratio of the number of links in the content block to its text word count;

A selecting unit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition;

An extracting unit, configured to extract, for each webpage obtained by the obtaining unit and from the content blocks selected by the selecting unit, the content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage;

A second determining unit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks extracted by the extracting unit as the body content of the webpage.

6. The device for extracting webpage body content as claimed in claim 5, characterized in that the selecting unit specifically comprises:

An obtaining subunit, configured to obtain, for each webpage obtained by the obtaining unit, the label density threshold and/or link density threshold of each content block produced by the dividing unit;

A selecting subunit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;

A determining subunit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks selected by the selecting subunit as the content blocks that satisfy the preset condition.

7. The device for extracting webpage body content as claimed in claim 6, characterized in that, for each content block produced by the dividing unit, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and determines the link density threshold of the content block according to the link density variance of the content block.

8. The device for extracting webpage body content as claimed in claim 6, characterized in that the dividing unit specifically comprises:

A pre-processing subunit, configured to normalize, in a pre-processing step, each webpage obtained by the obtaining unit;

A generating subunit, configured to generate, for each webpage obtained by the obtaining unit, the corresponding Document Object Model (DOM) tree from the webpage pre-processed by the pre-processing subunit;

A dividing subunit, configured to divide, for each webpage obtained by the obtaining unit, the webpage into content blocks according to the DOM tree generated by the generating subunit.