CN105512225A - Method and device for extracting main content from a webpage

Info

Publication number: CN105512225A
Application number: CN201510857404.7A
Authority: CN
Inventors: 喻春霖; 李小磊; 万巍
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Apabi Technology Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Apabi Technology Co Ltd
Priority date: 2015-11-30
Filing date: 2015-11-30
Publication date: 2016-04-20

Abstract

The invention provides a method and device for extracting main content from a webpage. The method comprises the following steps: removing useless tags from a webpage to be extracted to obtain a content text; splitting the content text into a plurality of lines; determining a start line and an end line of the main content according to the character counts of consecutive lines; and extracting the text between the start line and the end line as the main content of the webpage to be extracted. The method departs from the conventional approach of performing complex parsing on the data content. It relies on a simple regularity, largely avoids the extraction problems caused by differences between webpages and by webpage redesigns, and thereby improves the efficiency of main content extraction.

Description

Method and device for extracting main content from a webpage

Technical field

The present invention relates to the field of network technology, and in particular to a method and device for extracting main content from a webpage.

Background art

With the development of computer technology, abundant Internet resources have brought great convenience to people's daily information needs. Correspondingly, it is often necessary to obtain the relevant content of a webpage for subsequent processing, for example analyzing the webpage content to learn what information the webpage publishes. However, data of all kinds and structures may be mixed together on a webpage, which makes capturing the text information on the webpage considerably more difficult.

Traditional capture approaches are highly targeted and highly limited: each extracts content only from one specific webpage. When the target changes to another webpage, the processing program must be redesigned, and because development takes time, the results are delayed. If a previously captured website is redesigned, the original capture approach may no longer apply and must again be redesigned. These processes involve a large amount of repetitive, inefficient work.

Therefore, how to extract the main content of a webpage quickly and effectively has become a major issue in efficient webpage content extraction.

Summary of the invention

The technical problem to be solved by the present invention is that existing methods for capturing the main content of a webpage are highly targeted and highly limited, and therefore cannot be applied to webpages of various types.

To this end, embodiments of the present invention provide the following technical solutions:

A method for extracting main content from a webpage comprises the following steps:

removing useless tags from a webpage to be extracted to obtain a content text;

splitting the content text into a plurality of lines;

determining a start line and an end line of the main content according to the character counts of consecutive lines;

extracting the text between the start line and the end line as the main content of the webpage to be extracted.

Preferably, the step of splitting the content text into a plurality of lines comprises:

splitting the content text into a plurality of lines according to the line-break tag;

comparing the total number of lines after splitting with a preset threshold;

when the total number of lines after splitting is less than the preset threshold, splitting the lines according to the paragraph tag.

Preferably, between the step of splitting the content text into a plurality of lines and the step of determining the start line and the end line of the main content according to the character counts of consecutive lines, the method further comprises:

deleting the line-break tags and paragraph tags in the content text.

Preferably, the step of determining the start line and the end line of the main content according to the character counts of consecutive lines comprises:

counting, starting from the first line of the content text, the number of characters in a predetermined number of adjacent lines;

when that number of characters is not less than a preset value, determining the first of the predetermined number of lines to be the start line;

after the start line, when the number of characters in a predetermined number of lines is less than the preset value, determining the last of the predetermined number of lines to be the end line.

A device for extracting main content from a webpage comprises:

a content text acquiring unit, configured to remove useless tags from a webpage to be extracted to obtain a content text;

a splitting unit, configured to split the content text into a plurality of lines;

a start line and end line determining unit, configured to determine a start line and an end line of the main content according to the character counts of consecutive lines;

a main content extracting unit, configured to extract the text between the start line and the end line as the main content of the webpage to be extracted.

Preferably, the splitting unit comprises:

a first splitting subunit, configured to split the content text into a plurality of lines according to the line-break tag;

a comparing subunit, configured to compare the total number of lines after splitting with a preset threshold;

a second splitting subunit, configured to split the lines according to the paragraph tag when the total number of lines after splitting is less than the preset threshold.

Preferably, the device further comprises:

a deleting unit, configured to delete the line-break tags and paragraph tags in the content text.

Preferably, the start line and end line determining unit comprises:

a character counting subunit, configured to count, starting from the first line of the content text, the number of characters in a predetermined number of adjacent lines;

a start line determining subunit, configured to determine the first of the predetermined number of lines to be the start line when that number of characters is not less than a preset value;

an end line determining subunit, configured to determine, after the start line, the last of the predetermined number of lines to be the end line when the number of characters in a predetermined number of lines is less than the preset value.

The technical solution of the present invention has the following advantages:

The method and device for extracting main content from a webpage provided by the present invention depart from the tradition, in webpage content extraction, of performing complex parsing on the data content. Extraction instead relies on a simple regularity: the content text in the HTML corresponding to the webpage is first extracted and useless tags are deleted, the text is then split into lines, and the main content is determined according to the character counts of consecutive lines. This largely avoids the extraction problems caused by differences between webpages or by webpage redesigns, and improves the efficiency of main content extraction.

Brief description of the drawings

In order to illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for extracting main content from a webpage in Embodiment 1 of the present invention;

Fig. 2 is an example of a content text after useless tags have been deleted in Embodiment 1 of the present invention;

Fig. 3 is an example of the content text in Embodiment 1 of the present invention after it has been split into lines and all tags have been deleted;

Fig. 4 is a flowchart of a method for determining the start line and the end line of the main content according to the character counts of consecutive lines in Embodiment 1 of the present invention;

Fig. 5 is a schematic block diagram of a device for extracting main content from a webpage in Embodiment 2 of the present invention.

Detailed description of the embodiments

The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

In addition, the technical features involved in the different embodiments described below can be combined with one another as long as they do not conflict.

Embodiment 1

As shown in Fig. 1, this embodiment provides a method for extracting main content from a webpage, comprising the following steps:

S1: removing useless tags from a webpage to be extracted to obtain a content text, where the useless tags do not include line-break tags and paragraph tags;

S2: splitting the content text into a plurality of lines;

S3: determining a start line and an end line of the main content according to the character counts of consecutive lines;

S4: extracting the text between the start line and the end line as the main content of the webpage to be extracted.

In-depth study and comparison of existing mainstream websites, combined with the general rules, design ideas, and design styles of website design, showed that the way websites present their content follows discernible patterns. Building on this research, the method of this embodiment is a method suitable for extracting the main content from most detail pages. It departs from the tradition, in webpage content extraction, of performing complex parsing on the data content. Extraction relies on a simple regularity, which largely avoids the extraction problems caused by differences between webpages or by webpage redesigns and improves the efficiency of main content extraction.

Specifically, in step S1, the webpage to be extracted may be fetched from a detail-page address entered by the user, for example the detail link URL of a news webpage. The page link URL is accessed, the corresponding HTML is obtained and parsed, and the <body> part of the page is captured. A regular expression is then used to remove all tags other than line-break tags and paragraph tags, including script and style tags together with their content, which yields the content text, for example the content text shown in Fig. 2.
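Step S1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation used in the embodiment: the function name get_content_text and the regular expressions are hypothetical, the paragraph tag is assumed to be <p>, and the line-break (line feed) tag is assumed to be <br> (newline characters are left untouched as well). Fetching the URL is omitted, so the function takes the HTML as a string.

```python
import re

# Hypothetical patterns; the patent does not give its regular expressions.
_BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.I | re.S)
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
# Any tag except <p>, </p>, <br>, <br/> (assumed paragraph / line-break tags).
_OTHER_TAG = re.compile(r"<(?!/?(?:p|br)\b)[^>]*>", re.I)


def get_content_text(html: str) -> str:
    """Return the <body> text with only paragraph and line-break tags left."""
    match = _BODY.search(html)
    body = match.group(1) if match else html
    # Drop script and style elements together with their content.
    body = _SCRIPT_STYLE.sub("", body)
    # Drop every remaining tag except the ones needed for line splitting.
    return _OTHER_TAG.sub("", body)
```

Keeping the paragraph and line-break tags at this stage matters because step S2 still needs them to split the text into lines.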

Specifically, in step S2, the step of splitting the content text into a plurality of lines comprises:

first, splitting the content text into a plurality of lines according to the line-break tag;

then, comparing the total number of lines after splitting with a preset threshold;

finally, when the total number of lines after splitting is less than the preset threshold, splitting the lines according to the paragraph tag.

If the total number of lines after splitting is not less than the preset threshold, the lines remain split according to the line-break tag. Only when the total is less than the preset threshold are the lines split according to the paragraph tag. The reason is that if splitting on line breaks yields fewer lines than a certain value, the webpage markup is considered to be compressed, that is, the server removed ignorable characters such as spaces and line breaks between tags before sending the page to the browser. In that case, splitting on line breaks (line-break tags) is ineffective, because the whole content text contains essentially no line breaks. In this embodiment, the preset threshold may be 10.
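The fallback rule above can be illustrated with a short sketch. The name split_into_lines is hypothetical, and two readings are assumed here rather than taken from the text: the line break is treated as the newline character, and the paragraph tag as a closing </p>. The threshold of 10 is the value given in this embodiment.

```python
import re

LINE_THRESHOLD = 10  # preset threshold from this embodiment


def split_into_lines(content_text: str, threshold: int = LINE_THRESHOLD) -> list[str]:
    """Split on line breaks; fall back to paragraph tags for compressed markup."""
    # First pass: split on line breaks (assumed: newline characters).
    lines = content_text.split("\n")
    if len(lines) < threshold:
        # Too few lines: the markup was probably compressed, so split on
        # paragraph tags instead (assumed: closing </p> tags).
        lines = re.split(r"</p\s*>", content_text, flags=re.I)
    return lines
```

For a compressed page such as `<p>a</p><p>b</p>`, the first pass yields a single line, so the paragraph-tag split is used instead.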

In this embodiment, after the content text has been split into a plurality of lines according to either the line-break tag or the paragraph tag, as the situation requires, each line is traversed in turn and a regular expression is used to remove the HTML tags in every line. That is, the line-break tags and paragraph tags in the content text are deleted. The content shown in Fig. 2 then becomes the content shown in Fig. 3.
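A minimal sketch of this per-line cleanup follows. The name strip_tags and the pattern are hypothetical, and trimming surrounding whitespace is an added assumption that the text does not state.

```python
import re

_TAG = re.compile(r"<[^>]*>")  # hypothetical pattern matching any HTML tag


def strip_tags(lines: list[str]) -> list[str]:
    """Remove the remaining tags from every line.

    Surrounding whitespace is also trimmed (an assumption), so that
    indentation does not inflate the character counts used later.
    """
    return [_TAG.sub("", line).strip() for line in lines]
```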

Specifically, as shown in Fig. 4, step S3, that is, the step of determining the start line and the end line of the main content according to the character counts of consecutive lines, comprises:

S31: counting, starting from the first line of the content text, the number of characters in a predetermined number of adjacent lines;

S32: when that number of characters is not less than a preset value, determining the first of the predetermined number of lines to be the start line of the main content;

S33: after the start line, when the number of characters in a predetermined number of lines is less than the preset value, determining the last of the corresponding predetermined number of lines to be the end line of the main content.

In this embodiment, the lines are traversed, and if the number of characters in 4 adjacent lines is greater than or equal to 120, the first of these 4 lines is taken as the start line of the main content. For example, lines 3 to 6 in Fig. 3 contain more than 120 characters, so line 3 is determined to be the start line of the main content. After the start line has been determined, the traversal continues, and if the number of characters in 4 adjacent lines is less than 120, the last of these 4 lines is taken as the end line of the main content. For example, lines 15 to 18 in Fig. 3 contain fewer than 120 characters, so line 18 is determined to be the end line of the main content.
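The sliding-window rule of step S3 can be sketched as below, using the values from this embodiment (4 adjacent lines, 120 characters). Several points are assumptions rather than statements from the text: the character count is read as the total over the 4 lines, indices are 0-based, the function name find_main_range is hypothetical, and when no sparse window is found after the start line the range is assumed to run to the last line.

```python
WINDOW = 4       # number of adjacent lines examined together
MIN_CHARS = 120  # preset character count


def find_main_range(lines, window=WINDOW, min_chars=MIN_CHARS):
    """Return (start, end) as 0-based inclusive indices, or None if no start."""

    def window_chars(i):
        # Total characters in the `window` lines starting at index i.
        return sum(len(line) for line in lines[i:i + window])

    last = len(lines) - window  # last index at which a full window fits

    start = None
    for i in range(last + 1):
        if window_chars(i) >= min_chars:
            start = i  # first line of the first dense window
            break
    if start is None:
        return None

    for i in range(start + 1, last + 1):
        if window_chars(i) < min_chars:
            return start, i + window - 1  # last line of the first sparse window

    # Assumption: no sparse window found, so the content runs to the last line.
    return start, len(lines) - 1
```

With 20 lines in which lines 3 to 14 are dense and the rest are short, this returns (2, 17), which corresponds to lines 3 to 18 in 1-based numbering, the same shape as the Fig. 3 example.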

Specifically, in step S4, the text from the start line to the end line, including both the start line and the end line, is the main content extracted from the webpage. For example, lines 3 to 18 in Fig. 3 are the main content extracted from the news webpage corresponding to the link entered by the user.
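Because both boundary lines are included, the slice has to be handled with some care. The following sketch uses a hypothetical helper, extract_main_content, which takes 1-based line numbers as in the Fig. 3 example. Joining the lines with newline characters is an assumption.

```python
def extract_main_content(lines, start_line, end_line):
    """Join lines start_line..end_line (1-based, both inclusive)."""
    # Python slices exclude the upper bound, so end_line is used as is
    # while start_line is shifted down by one.
    return "\n".join(lines[start_line - 1:end_line])
```

Calling extract_main_content(lines, 3, 18) on a list of at least 18 lines returns 16 lines, from line 3 through line 18.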

Embodiment 2

As shown in Fig. 5, this embodiment provides a device for extracting main content from a webpage, comprising:

a content text acquiring unit U1, configured to remove useless tags from a webpage to be extracted to obtain a content text;

a splitting unit U2, configured to split the content text into a plurality of lines;

a start line and end line determining unit U3, configured to determine a start line and an end line of the main content according to the character counts of consecutive lines;

a main content extracting unit U4, configured to extract the text between the start line and the end line as the main content of the webpage to be extracted.

The device of this embodiment departs from the tradition, in webpage content extraction, of performing complex parsing on the data content. Extraction relies on a simple regularity, which largely avoids the extraction problems caused by differences between webpages or by webpage redesigns and improves the efficiency of main content extraction.

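Specifically, the splitting unit U2 comprises:

a first splitting subunit, configured to split the content text into a plurality of lines according to the line-break tag;

a comparing subunit, configured to compare the total number of lines after splitting with a preset threshold;

a second splitting subunit, configured to split the lines according to the paragraph tag when the total number of lines after splitting is less than the preset threshold.

Specifically, the device further comprises:

a deleting unit, configured to delete the line-break tags and paragraph tags in the content text. That is, after the content text has been split into a plurality of lines according to the line-break tag or the paragraph tag, a regular expression is used to remove the line-break tags and paragraph tags in the content text.

Specifically, the start line and end line determining unit U3 comprises:

a character counting subunit, configured to count, starting from the first line of the content text, the number of characters in a predetermined number of adjacent lines;

a start line determining subunit, configured to determine the first of the predetermined number of lines to be the start line when that number of characters is not less than a preset value;

an end line determining subunit, configured to determine, after the start line, the last of the predetermined number of lines to be the end line when the number of characters in a predetermined number of lines is less than the preset value.

Once the start line and the end line of the main content have been determined, the text between the start line and the end line can be extracted as the main content of the webpage to be extracted.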
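One way to picture how the units fit together is a small class that composes them. This is an illustrative sketch only: the class name MainContentExtractor and its field names are hypothetical, each unit is modeled as a plain callable, the deleting unit is assumed to be folded into the splitting callable, and returning an empty string when no start line is found is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MainContentExtractor:
    # U1: turns raw HTML into the content text.
    acquire_content_text: Callable[[str], str]
    # U2 (plus the deleting unit): turns the content text into clean lines.
    split_lines: Callable[[str], list[str]]
    # U3: returns (start, end) as 0-based inclusive indices, or None.
    find_range: Callable[[list[str]], Optional[tuple[int, int]]]

    def extract(self, html: str) -> str:
        """U4: return the text between the start line and the end line."""
        lines = self.split_lines(self.acquire_content_text(html))
        found = self.find_range(lines)
        if found is None:
            return ""  # assumption: no main content detected
        start, end = found
        return "\n".join(lines[start:end + 1])
```

Passing the units in as callables keeps each one replaceable, for example to try a different character threshold in U3 without touching the others.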

The above embodiments are merely examples given for clarity of description and do not limit the possible implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all implementations exhaustively here. Obvious changes or variations derived from the above remain within the scope of protection of the present invention.

Claims

1. A method for extracting main content from a webpage, characterized by comprising the following steps:

removing useless tags from a webpage to be extracted to obtain a content text;

splitting the content text into a plurality of lines;

determining a start line and an end line of the main content according to the character counts of consecutive lines;

extracting the text between the start line and the end line as the main content of the webpage to be extracted.

2. The method according to claim 1, characterized in that the step of splitting the content text into a plurality of lines comprises:

splitting the content text into a plurality of lines according to the line-break tag;

comparing the total number of lines after splitting with a preset threshold;

when the total number of lines after splitting is less than the preset threshold, splitting the lines according to the paragraph tag.

3. The method according to claim 1 or 2, characterized in that, between the step of splitting the content text into a plurality of lines and the step of determining the start line and the end line of the main content according to the character counts of consecutive lines, the method further comprises:

deleting the line-break tags and the paragraph tags in the content text.

4. The method according to any one of claims 1 to 3, characterized in that the step of determining the start line and the end line of the main content according to the character counts of consecutive lines comprises:

counting, starting from the first line of the content text, the number of characters in a predetermined number of adjacent lines;

when that number of characters is not less than a preset value, determining the first of the predetermined number of lines to be the start line;

after the start line, when the number of characters in a predetermined number of lines is less than the preset value, determining the last of the predetermined number of lines to be the end line.

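5. A device for extracting main content from a webpage, characterized by comprising:

a content text acquiring unit, configured to remove useless tags from a webpage to be extracted to obtain a content text;

a splitting unit, configured to split the content text into a plurality of lines;

a start line and end line determining unit, configured to determine a start line and an end line of the main content according to the character counts of consecutive lines;

a main content extracting unit, configured to extract the text between the start line and the end line as the main content of the webpage to be extracted.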

6. The device according to claim 5, characterized in that the splitting unit comprises:

a first splitting subunit, configured to split the content text into a plurality of lines according to the line-break tag;

a comparing subunit, configured to compare the total number of lines after splitting with a preset threshold;

a second splitting subunit, configured to split the lines according to the paragraph tag when the total number of lines after splitting is less than the preset threshold.

7. The device according to claim 5 or 6, characterized by further comprising:

a deleting unit, configured to delete the line-break tags and the paragraph tags in the content text.

8. The device according to any one of claims 5 to 7, characterized in that the start line and end line determining unit comprises:

a character counting subunit, configured to count, starting from the first line of the content text, the number of characters in a predetermined number of adjacent lines;

a start line determining subunit, configured to determine the first of the predetermined number of lines to be the start line when that number of characters is not less than a preset value;

an end line determining subunit, configured to determine, after the start line, the last of the predetermined number of lines to be the end line when the number of characters in a predetermined number of lines is less than the preset value.