CN105512225A - Method and device extracting main content from webpage - Google Patents
Method and device extracting main content from webpage
- Publication number
- CN105512225A (application CN201510857404.7A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- line
- content text
- main contents
- start line
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method and a device for extracting main content from a webpage. The method comprises the following steps: removing useless tags from a webpage to be extracted to obtain content text; splitting the content text into a plurality of lines; determining a start line and an end line of the main content according to the character counts of consecutive lines; and extracting the text between the start line and the end line as the main content of the webpage to be extracted. The method departs from the conventional approach of performing complex parsing of the data content. It relies on a simple rule, avoids the extraction problems caused by differences between webpages and by webpage redesigns, and thereby improves the efficiency of main content extraction.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a method and a device for extracting main content from a webpage.
Background technology
With the development of computer technology, abundant Internet resources have brought great convenience to people's daily information needs. Correspondingly, it is often necessary to obtain web page content for subsequent processing, for example analyzing the content to learn what information a given webpage publishes. However, data of many kinds and structures may be mixed together on a webpage, which makes capturing the text information on the webpage considerably more troublesome.
Traditional crawling approaches are highly targeted and highly limited: each one extracts content only from one specific webpage. Switching to another webpage requires redesigning the processing program, and because development takes time, results are delayed. If a previously crawled website is redesigned, the original crawling approach may no longer apply and must be designed again. These processes require a large amount of repetitive, inefficient work.
Therefore, how to extract the main content of a webpage quickly and effectively has become a major issue in efficient web content extraction.
Summary of the invention
Therefore, the technical problem to be solved by the present invention is that existing methods for capturing the main content of a webpage are highly targeted and highly limited, and so cannot be applied to the various different types of webpages.
To this end, embodiments of the present invention provide the following technical solution:
A method for extracting main content from a webpage, comprising the following steps:
Removing the useless tags from the webpage to be extracted to obtain content text;
Splitting the content text into multiple lines;
Determining the start line and the end line of the main content according to the character counts of consecutive lines;
Extracting the text between the start line and the end line as the main content of the webpage to be extracted.
Preferably, the step of splitting the content text into multiple lines comprises:
Splitting the content text into multiple lines according to line-break tags;
Comparing the total number of lines after splitting with a preset threshold;
When the total number of lines after splitting is less than the preset threshold, splitting the lines according to paragraph tags instead.
Preferably, between the step of splitting the content text into multiple lines and the step of determining the start line and the end line of the main content according to the character counts of consecutive lines, the method further comprises:
Deleting the line-break tags and paragraph tags in the content text.
Preferably, the step of determining the start line and the end line of the main content according to the character counts of consecutive lines comprises:
Starting from the first line of the content text, counting the characters of a predetermined number of adjacent lines;
When the character count is not less than a preset value, determining that the first of these lines is the start line;
After the start line, when the character count of a predetermined number of lines is less than the preset value, determining that the last of these lines is the end line.
A device for extracting main content from a webpage, comprising:
A content text acquisition unit, configured to remove the useless tags from the webpage to be extracted to obtain content text;
A splitting unit, configured to split the content text into multiple lines;
A start line and end line determination unit, configured to determine the start line and the end line of the main content according to the character counts of consecutive lines;
A main content extraction unit, configured to extract the text between the start line and the end line as the main content of the webpage to be extracted.
Preferably, the splitting unit comprises:
A first splitting subunit, configured to split the content text into multiple lines according to line-break tags;
A comparison subunit, configured to compare the total number of lines after splitting with a preset threshold;
A second splitting subunit, configured to split the lines according to paragraph tags when the total number of lines after splitting is less than the preset threshold.
Preferably, the device further comprises:
A deletion unit, configured to delete the line-break tags and paragraph tags in the content text.
Preferably, the start line and end line determination unit comprises:
A character counting subunit, configured to count, starting from the first line of the content text, the characters of a predetermined number of adjacent lines;
A start line determination subunit, configured to determine, when the character count is not less than a preset value, that the first of these lines is the start line;
An end line determination subunit, configured to determine, after the start line, when the character count of a predetermined number of lines is less than the preset value, that the last of these lines is the end line.
The technical solution of the present invention has the following advantages:
The method and device for extracting main content from a webpage provided by the present invention depart from the tradition, in webpage content extraction, of performing complex parsing of the data content. Extraction follows a simple rule: first obtain the content text from the HTML of the webpage and delete the useless tags, then split it into lines, and then determine the main content according to the character counts of consecutive lines. This largely avoids the extraction problems caused by differences between webpages or by webpage redesigns, and improves the efficiency of main content extraction.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for extracting main content from a webpage in Embodiment 1 of the present invention;
Fig. 2 is an example of content text after the useless tags have been deleted in Embodiment 1 of the present invention;
Fig. 3 is an example of the content text in Embodiment 1 of the present invention after line splitting is complete and all tags have been deleted;
Fig. 4 is a flowchart of a method for determining the start line and the end line of the main content according to the character counts of consecutive lines in Embodiment 1 of the present invention;
Fig. 5 is a block diagram of a device for extracting main content from a webpage in Embodiment 2 of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
In addition, the technical features involved in the different embodiments described below may be combined with one another as long as they do not conflict.
Embodiment 1
As shown in Fig. 1, this embodiment provides a method for extracting main content from a webpage, comprising the following steps:
S1: remove the useless tags from the webpage to be extracted to obtain content text, where the useless tags do not include line-break tags and paragraph tags;
S2: split the content text into multiple lines;
S3: determine the start line and the end line of the main content according to the character counts of consecutive lines;
S4: extract the text between the start line and the end line as the main content of the webpage to be extracted.
In-depth study and comparison of existing mainstream websites, combined with the general rules, design ideas and design styles of website design, shows that the way websites present content follows discernible patterns. Building on these findings, the method of this embodiment is a main content extraction method that applies to most detail pages. It departs from the tradition, in webpage content extraction, of performing complex parsing of the data content. Extraction follows a simple rule, which largely avoids the extraction problems caused by differences between webpages or by webpage redesigns and improves the efficiency of main content extraction.
Specifically, in step S1, the webpage to be extracted may be fetched from a detail-page address entered by the user, for example the detail link URL of a news page. The link URL is accessed, the corresponding HTML is obtained and parsed, and the <body> part of the page is captured. Regular expressions are then used to remove the script and style tags together with their contents, as well as all other tags except line-break tags and paragraph tags, which yields the content text, for example the content text shown in Fig. 2.
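Step S1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `get_content_text` and the specific regular expressions are assumptions, and it treats `<br>` as the line-break tag and `<p>` as the paragraph tag. Fetching the page from its URL is left out.

```python
import re

# All names and patterns below are illustrative assumptions; the patent only
# states that regular expressions are used.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.I | re.S)
# <script>...</script> and <style>...</style>, including their contents.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
# Any tag other than <br> and <p> (opening, closing, or self-closing).
USELESS_TAG_RE = re.compile(r"<(?!/?(?:br|p)\b)[^>]*>", re.I)


def get_content_text(html):
    """Return the <body> text with every tag except <br> and <p> removed."""
    match = BODY_RE.search(html)
    body = match.group(1) if match else html  # fall back to the whole document
    body = SCRIPT_STYLE_RE.sub("", body)
    return USELESS_TAG_RE.sub("", body)
```

A regex-based pass like this is simple but can mis-handle unusual markup (for example a literal `<` inside text), so a real crawler might prefer an HTML parser for this step.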
Specifically, in step S2, the step of splitting the content text into multiple lines comprises:
First, splitting the content text into multiple lines according to line-break tags;
Then, comparing the total number of lines after splitting with a preset threshold;
Finally, when the total number of lines after splitting is less than the preset threshold, splitting the lines according to paragraph tags instead.
If the total number of lines after splitting is not less than the preset threshold, the lines remain split according to line-break tags. Only when the total is less than the preset threshold are the lines split according to paragraph tags. The reason is that if the total number of lines obtained by splitting on newlines is below a certain value, the webpage markup is considered to be compressed, that is, ignorable characters such as spaces and newlines between tags were removed before the server sent the page to the browser. In that case, splitting on newlines (line-break tags) leaves the whole content text essentially unsplit. In this embodiment, the preset threshold may be 10.
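The fallback logic of step S2 might look like the sketch below. The text is not explicit about whether the line-break marker is a newline character or a `<br>` tag, so this sketch assumes both count; splitting paragraphs on the closing `</p>` tag is likewise an assumption, as is the function name `split_into_lines`.

```python
import re

# Assumption: a "line break" is either a <br> tag or a newline character.
LINE_BREAK_RE = re.compile(r"<br\s*/?>|\r?\n", re.I)
# Assumption: paragraphs are split on the closing </p> tag.
PARAGRAPH_RE = re.compile(r"</p\s*>", re.I)
MIN_LINES = 10  # example threshold given in the embodiment


def split_into_lines(content_text, min_lines=MIN_LINES):
    """Split on line breaks; fall back to paragraph tags if too few lines result."""
    lines = LINE_BREAK_RE.split(content_text)
    if len(lines) < min_lines:
        # Too few lines suggests compressed markup, so split on paragraphs.
        lines = PARAGRAPH_RE.split(content_text)
    return lines
```

For compressed markup such as `<p>first</p><p>second</p>`, the first split yields a single line, so the paragraph split is used instead.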
In this embodiment, after the content text has been split into multiple lines according to either line-break tags or paragraph tags, as the situation requires, the method further comprises traversing the lines in turn and using a regular expression to remove the HTML tags from each line, that is, deleting the line-break tags and paragraph tags in the content text. The content shown in Fig. 2 then becomes that shown in Fig. 3.
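The per-line tag removal can be sketched as below. The name `strip_tags` and the pattern are assumptions; trimming surrounding whitespace is also an assumption, added so that indentation does not inflate the later character counts.

```python
import re

# Matches any remaining HTML tag, including <br> and <p> variants.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(lines):
    """Remove leftover tags from each line and trim surrounding whitespace."""
    return [TAG_RE.sub("", line).strip() for line in lines]
```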
Specifically, as shown in Fig. 4, step S3, namely determining the start line and the end line of the main content according to the character counts of consecutive lines, comprises:
S31: starting from the first line of the content text, count the characters of a predetermined number of adjacent lines;
S32: when the character count is not less than a preset value, determine that the first of these lines is the start line of the main content;
S33: after the start line, when the character count of a predetermined number of lines is less than the preset value, determine that the last of those lines is the end line of the main content.
In this embodiment, the lines are traversed; if the character count of 4 adjacent lines is greater than or equal to 120, the first of these 4 lines is taken to be the start line of the main content. For example, the character count of lines 3-6 in Fig. 3 is greater than 120, so the start line of the main content is determined to be line 3. After the start line has been determined, traversal continues; if the character count of 4 adjacent lines is less than 120, the last of these 4 lines is taken to be the end line of the main content. For example, the character count of lines 15-18 in Fig. 3 is less than 120, so line 18 is determined to be the end line of the main content.
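One way to read steps S31 to S33 is as a sliding window over the lines, sketched below. Several points are assumptions rather than statements from the text: the character count is taken as the total over the window (the text does not say whether it is a total or a per-line count), indices are 0-based, `None` is returned when no dense window exists, and the main content is assumed to run to the last line when no sparse window follows the start line. The function names are hypothetical.

```python
WINDOW = 4       # example from the embodiment: 4 adjacent lines
MIN_CHARS = 120  # example from the embodiment: 120 characters


def window_chars(lines, i, window):
    """Total number of characters in lines[i : i + window]."""
    return sum(len(line) for line in lines[i:i + window])


def find_main_content(lines, window=WINDOW, min_chars=MIN_CHARS):
    """Return 0-based inclusive (start, end) indices, or None if no start is found."""
    last = len(lines) - window  # last index at which a full window still fits
    start = None
    for i in range(last + 1):
        if window_chars(lines, i, window) >= min_chars:
            start = i  # first line of the first dense window
            break
    if start is None:
        return None
    for i in range(start + 1, last + 1):
        if window_chars(lines, i, window) < min_chars:
            return start, i + window - 1  # last line of the first sparse window
    return start, len(lines) - 1  # assumption: no sparse window, run to the end
```

Note that, as in the Fig. 3 example, the end line is the last line of the first sparse window, so a few short trailing lines can be included in the result.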
Specifically, in step S4, the text from the start line to the end line, inclusive of both, is the main content extracted from the webpage. For example, lines 3 to 18 in Fig. 3 are the main content extracted from the news page corresponding to the link entered by the user.
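Because both boundary lines are included, the extraction in step S4 reduces to an inclusive slice. The sketch below uses 1-based line numbers to mirror the Fig. 3 example; the function name and the choice to join lines with newlines are assumptions.

```python
def extract_main_content(lines, start_line, end_line):
    """Join lines start_line..end_line (1-based, both inclusive) into one text."""
    return "\n".join(lines[start_line - 1:end_line])
```

With the example values, `extract_main_content(lines, 3, 18)` would return lines 3 through 18, sixteen lines in total.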
Embodiment 2
As shown in Fig. 5, this embodiment provides a device for extracting main content from a webpage, comprising:
A content text acquisition unit U1, configured to remove the useless tags from the webpage to be extracted to obtain content text;
A splitting unit U2, configured to split the content text into multiple lines;
A start line and end line determination unit U3, configured to determine the start line and the end line of the main content according to the character counts of consecutive lines;
A main content extraction unit U4, configured to extract the text between the start line and the end line as the main content of the webpage to be extracted.
The device of this embodiment departs from the tradition, in webpage content extraction, of performing complex parsing of the data content. Extraction follows a simple rule, which largely avoids the extraction problems caused by differences between webpages or by webpage redesigns and improves the efficiency of main content extraction.
Specifically, the splitting unit U2 comprises:
A first splitting subunit, configured to split the content text into multiple lines according to line-break tags;
A comparison subunit, configured to compare the total number of lines after splitting with a preset threshold;
A second splitting subunit, configured to split the lines according to paragraph tags when the total number of lines after splitting is less than the preset threshold.
Specifically, the device further comprises:
A deletion unit, configured to delete the line-break tags and paragraph tags in the content text. That is, after the content text has been split into multiple lines according to line-break tags or paragraph tags, a regular expression is used to remove the line-break tags and paragraph tags from the content text.
Specifically, the start line and end line determination unit U3 comprises:
A character counting subunit, configured to count, starting from the first line of the content text, the characters of a predetermined number of adjacent lines;
A start line determination subunit, configured to determine, when the character count is not less than a preset value, that the first of these lines is the start line;
An end line determination subunit, configured to determine, after the start line, when the character count of a predetermined number of lines is less than the preset value, that the last of these lines is the end line.
Once the start line and the end line of the main content have been determined, the text between them can be extracted as the main content of the webpage to be extracted.
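The unit structure of Fig. 5 could be mirrored in software as a small class that composes three pluggable callables, with the extraction unit U4 as the method that ties them together. This is only one possible arrangement: the class name, field names, and the toy stand-in functions are all hypothetical, and the stand-ins do not implement the real rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MainContentDevice:
    acquire_content_text: Callable[[str], str]             # plays the role of U1
    split_lines: Callable[[str], List[str]]                # plays the role of U2
    locate_bounds: Callable[[List[str]], Tuple[int, int]]  # plays the role of U3

    def extract(self, html: str) -> List[str]:             # plays the role of U4
        lines = self.split_lines(self.acquire_content_text(html))
        start, end = self.locate_bounds(lines)  # 0-based, both inclusive
        return lines[start:end + 1]


# Toy stand-ins, only to show how the units plug together.
device = MainContentDevice(
    acquire_content_text=lambda html: html.replace("<div>", "").replace("</div>", ""),
    split_lines=lambda text: text.split("\n"),
    locate_bounds=lambda lines: (1, 2),
)
```

Keeping the units as separate callables makes it easy to swap in a different splitting rule or boundary heuristic without touching the rest.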
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived from the above remain within the scope of protection of the invention.
Claims (8)
1. A method for extracting main content from a webpage, characterized by comprising the following steps:
Removing the useless tags from the webpage to be extracted to obtain content text;
Splitting said content text into multiple lines;
Determining the start line and the end line of the main content according to the character counts of consecutive lines;
Extracting the text between said start line and said end line as the main content of said webpage to be extracted.
2. The method of claim 1, characterized in that said step of splitting said content text into multiple lines comprises:
Splitting said content text into multiple lines according to line-break tags;
Comparing the total number of lines after splitting with a preset threshold;
When the total number of lines after splitting is less than said preset threshold, splitting the lines according to paragraph tags instead.
3. The method of claim 1 or 2, characterized in that, between said step of splitting said content text into multiple lines and said step of determining the start line and the end line of the main content according to the character counts of consecutive lines, the method further comprises:
Deleting said line-break tags and said paragraph tags in said content text.
4. The method according to any one of claims 1-3, characterized in that said step of determining the start line and the end line of the main content according to the character counts of consecutive lines comprises:
Starting from the first line of said content text, counting the characters of a predetermined number of adjacent lines;
When the character count is not less than a preset value, determining that the first of these lines is said start line;
After said start line, when the character count of a predetermined number of lines is less than said preset value, determining that the last of these lines is said end line.
5. A device for extracting main content from a webpage, characterized by comprising:
A content text acquisition unit, configured to remove the useless tags from the webpage to be extracted to obtain content text;
A splitting unit, configured to split said content text into multiple lines;
A start line and end line determination unit, configured to determine the start line and the end line of the main content according to the character counts of consecutive lines;
A main content extraction unit, configured to extract the text between said start line and said end line as the main content of said webpage to be extracted.
6. The device of claim 5, characterized in that said splitting unit comprises:
A first splitting subunit, configured to split said content text into multiple lines according to line-break tags;
A comparison subunit, configured to compare the total number of lines after splitting with a preset threshold;
A second splitting subunit, configured to split the lines according to paragraph tags when the total number of lines after splitting is less than said preset threshold.
7. The device of claim 5 or 6, characterized by further comprising:
A deletion unit, configured to delete said line-break tags and said paragraph tags in said content text.
8. The device according to any one of claims 5-7, characterized in that said start line and end line determination unit comprises:
A character counting subunit, configured to count, starting from the first line of said content text, the characters of a predetermined number of adjacent lines;
A start line determination subunit, configured to determine, when the character count is not less than a preset value, that the first of these lines is said start line;
An end line determination subunit, configured to determine, after said start line, when the character count of a predetermined number of lines is less than said preset value, that the last of these lines is said end line.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510857404.7A CN105512225A (en) | 2015-11-30 | 2015-11-30 | Method and device extracting main content from webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510857404.7A CN105512225A (en) | 2015-11-30 | 2015-11-30 | Method and device extracting main content from webpage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105512225A true CN105512225A (en) | 2016-04-20 |
Family
ID=55720207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510857404.7A Pending CN105512225A (en) | 2015-11-30 | 2015-11-30 | Method and device extracting main content from webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512225A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101673306A (en) * | 2009-10-19 | 2010-03-17 | 中国科学院计算技术研究所 | Website information query method and system thereof |
US20130204867A1 (en) * | 2010-07-30 | 2013-08-08 | Hewlett-Packard Development Company, Lp. | Selection of Main Content in Web Pages |
CN103258000A (en) * | 2013-03-29 | 2013-08-21 | 北界创想(北京)软件有限公司 | Method and device for clustering high-frequency keywords in webpages |
CN103425765A (en) * | 2013-08-06 | 2013-12-04 | 优视科技有限公司 | Method and device for extracting webpage text and method and system for webpage preview |
CN104615728A (en) * | 2015-02-09 | 2015-05-13 | 浪潮集团有限公司 | Webpage main text extraction method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107247742A (en) * | 2017-05-17 | 2017-10-13 | 武汉工程大学 | A kind of text message abstracting method based on web page characteristics |
CN110750960A (en) * | 2018-07-05 | 2020-02-04 | 武汉斗鱼网络科技有限公司 | Configuration file analysis method, storage medium, electronic device and system |
CN110795933A (en) * | 2019-09-30 | 2020-02-14 | 奇安信科技集团股份有限公司 | Method and device for identifying and processing webpage text |
CN110795933B (en) * | 2019-09-30 | 2023-10-31 | 奇安信科技集团股份有限公司 | Webpage text recognition processing method and device |
CN111310418A (en) * | 2020-02-25 | 2020-06-19 | 深圳市元征科技股份有限公司 | Text extraction method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6653334B2 (en) | Information extraction method and device | |
CN106095979B (en) | URL merging processing method and device | |
Oh et al. | Advanced evidence collection and analysis of web browser activity | |
CN107766328B (en) | Text information extraction method of structured text, storage medium and server | |
CN109543126B (en) | Webpage text information extraction method based on block character ratio | |
CN104750704B (en) | A kind of webpage URL address sorts recognition methods and device | |
CN105512225A (en) | Method and device extracting main content from webpage | |
CN104572934B (en) | A kind of webpage key content abstracting method based on DOM | |
CN104899219B (en) | Pseudo- static state URL's screens out method, system and web page crawl method, system | |
WO2014153457A1 (en) | Merging web page style addresses | |
CN102207974B (en) | Method for combining context web pages | |
CN103984749A (en) | Focused crawler method based on link analysis | |
CN112612761B (en) | Data cleaning method, device, equipment and storage medium | |
CN106294885A (en) | A kind of data collection towards isomery webpage and mask method | |
CN103942211A (en) | Text page recognition method and device | |
CN105528357A (en) | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures | |
CN106934049B (en) | News question selection analysis method and device | |
CN105335408B (en) | A kind of extended method and related system of search term white list | |
CN106897289A (en) | The optimization method and device of information search | |
CN103455572B (en) | Obtain the method and device of video display main body in webpage | |
CN101673263B (en) | Method for searching video content | |
CN103258021B (en) | The character terminal characteristic extracting method that a kind of Behavior-based control is analyzed | |
CN103853777A (en) | Method and device for accessing websites through keywords | |
CN111125704B (en) | Webpage Trojan horse recognition method and system | |
CN108287831B (en) | URL classification method and system and data processing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160420 |