CN105512225A - Method and device extracting main content from webpage - Google Patents

Method and device extracting main content from webpage

Publication number
CN105512225A
Authority
CN
China
Prior art keywords
webpage
line
content text
main content
start line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510857404.7A
Other languages
Chinese (zh)
Inventor
喻春霖
李小磊
万巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Apabi Technology Co Ltd
Priority to CN201510857404.7A
Publication of CN105512225A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and device for extracting main content from a webpage. The method comprises the following steps: removing useless tags from the webpage to be extracted to obtain a content text; splitting the content text into multiple lines; determining a start line and an end line of the main content according to the character count of multiple consecutive lines; and extracting the text between the start line and the end line as the main content of the webpage to be extracted. The method departs from the conventional approach of performing complex parsing of the data content. It relies on a simple regularity, largely avoids the extraction problems caused by differences between webpages and by webpage redesigns, and thereby improves the efficiency of main content extraction.

Description

A method and device for extracting main content from a webpage
Technical field
The present invention relates to the field of network technology, and in particular to a method and device for extracting main content from a webpage.
Background
With the development of computer technology, abundant Internet resources have brought great convenience to people's daily access to information. Correspondingly, it is often necessary to obtain the content of a webpage for subsequent processing, for example analyzing the content to learn what information the page publishes. However, data of many kinds and structures may be mixed together on a webpage, which makes capturing the text on the page considerably more difficult.
Traditional scraping approaches are highly targeted and therefore limited: each one extracts from a single specific webpage, and switching to another webpage requires redesigning the processing program. Development takes a certain amount of time, so results are delayed. If a previously scraped website is redesigned, the original scraping approach may no longer apply and must itself be redesigned, which entails a large amount of repetitive, inefficient work.
How to extract the main content of a webpage quickly and effectively has therefore become a central problem in efficient webpage content extraction.
Summary of the invention
The technical problem to be solved by the present invention is therefore that existing methods for scraping the main content of a webpage are highly targeted and limited, and so cannot be applied to webpages of different types.
To this end, embodiments of the present invention provide the following technical solutions:
A method for extracting main content from a webpage, comprising the following steps:
Removing useless tags from the webpage to be extracted to obtain a content text;
Splitting the content text into multiple lines;
Determining the start line and the end line of the main content according to the character count of multiple consecutive lines;
Extracting the text between the start line and the end line as the main content of the webpage to be extracted.
Preferably, the step of splitting the content text into multiple lines comprises:
Splitting the content text into multiple lines according to line-break tags;
Comparing the total number of lines after splitting with a predetermined threshold;
When the total number of lines after splitting is less than the predetermined threshold, splitting the lines according to paragraph tags instead.
Preferably, between the step of splitting the content text into multiple lines and the step of determining the start line and the end line of the main content according to the character count of multiple consecutive lines, the method further comprises:
Deleting the line-break tags and paragraph tags in the content text.
Preferably, the step of determining the start line and the end line of the main content according to the character count of multiple consecutive lines comprises:
Starting from the first line of the content text, counting the characters in a predetermined number of adjacent lines;
When that character count is not less than a preset value, taking the first of those lines as the start line;
After the start line, when the character count of a predetermined number of lines is less than the preset value, taking the last of those lines as the end line.
A device for extracting main content from a webpage, comprising:
A content text acquiring unit, for removing useless tags from the webpage to be extracted to obtain a content text;
A splitting unit, for splitting the content text into multiple lines;
A start line and end line determining unit, for determining the start line and the end line of the main content according to the character count of multiple consecutive lines;
A main content extraction unit, for extracting the text between the start line and the end line as the main content of the webpage to be extracted.
Preferably, the splitting unit comprises:
A first splitting subunit, for splitting the content text into multiple lines according to line-break tags;
A comparison subunit, for comparing the total number of lines after splitting with a predetermined threshold;
A second splitting subunit, for splitting the lines according to paragraph tags when the total number of lines after splitting is less than the predetermined threshold.
Preferably, the device further comprises:
A deleting unit, for deleting the line-break tags and paragraph tags in the content text.
Preferably, the start line and end line determining unit comprises:
A character counting subunit, for counting, starting from the first line of the content text, the characters in a predetermined number of adjacent lines;
A start line determining subunit, for taking the first of those lines as the start line when that character count is not less than a preset value;
An end line determining subunit, for taking, after the start line, the last of a predetermined number of lines as the end line when the character count of those lines is less than the preset value.
The technical solution of the present invention has the following advantages:
The method and device for extracting main content from a webpage provided by the invention break with the tradition, in webpage content extraction, of performing complex parsing of the data content. Extraction instead relies on a simple regularity: the content text is first obtained from the html corresponding to the webpage and useless tags are deleted, the text is then split into lines, and the main content is determined according to the character count of multiple consecutive lines. This largely avoids the extraction problems caused by differences between webpages and by webpage redesigns, and improves the efficiency of main content extraction.
Brief description of the drawings
To illustrate the specific embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for extracting main content from a webpage in Embodiment 1 of the invention;
Fig. 2 is an example of a content text after useless tags have been deleted in Embodiment 1 of the invention;
Fig. 3 is an example of the content text in Embodiment 1 of the invention after it has been split into lines and all remaining tags have been deleted;
Fig. 4 is a flow chart of a method for determining the start line and the end line of the main content according to the character count of multiple consecutive lines in Embodiment 1 of the invention;
Fig. 5 is a block diagram of a device for extracting main content from a webpage in Embodiment 2 of the invention.
Detailed description
The technical solution of the invention is described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
In addition, the technical features involved in the different embodiments described below may be combined with one another as long as they do not conflict.
Embodiment 1
As shown in Fig. 1, this embodiment provides a method for extracting main content from a webpage, comprising the following steps:
S1: removing useless tags from the webpage to be extracted to obtain a content text, where the useless tags do not include line-break tags and paragraph tags;
S2: splitting the content text into multiple lines;
S3: determining the start line and the end line of the main content according to the character count of multiple consecutive lines;
S4: extracting the text between the start line and the end line as the main content of the webpage to be extracted.
In-depth study and comparison of existing mainstream websites, combined with the general rules, design ideas and design styles of website design, show that the way websites present their content follows discernible patterns. Building on that study, the method of this embodiment was distilled as a way of extracting the main content that is applicable to most detail pages. The method breaks with the tradition, in webpage content extraction, of performing complex parsing of the data content. It relies on a simple regularity, largely avoids the extraction problems caused by differences between webpages and by webpage redesigns, and improves the efficiency of main content extraction.
Specifically, in step S1 the webpage to be extracted may be fetched from a detail page address entered by the user, for example the detail link url of a news page. The page link url is accessed, and the corresponding html is obtained and parsed. The <body> part of the page is then captured, and regular expressions are used to remove the script and style tags together with their contents, along with the other tags apart from line-break tags and paragraph tags. The result is the content text, for example the content text shown in Fig. 2.
Specifically, in step S2 the step of splitting the content text into multiple lines comprises:
First, splitting the content text into multiple lines according to line-break tags;
Then, comparing the total number of lines after splitting with a predetermined threshold;
Finally, when the total number of lines after splitting is less than the predetermined threshold, splitting the lines according to paragraph tags instead.
If the total number of lines after splitting is not less than the predetermined threshold, the lines remain split according to line-break tags. Only when the total number of lines after splitting is less than the predetermined threshold are the lines split according to paragraph tags. The reason is that if splitting on newline characters yields fewer lines than a certain value, the page markup is taken to be compressed, that is, the server removed insignificant characters such as spaces and newlines between tags before sending the page to the browser. Splitting on newline characters (line-break tags) would then leave the content text essentially unsplit. In this embodiment the predetermined threshold may be 10.
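The fallback logic of step S2 can be illustrated with a short sketch. It assumes that the line-break tag is the newline character (as the explanation of compressed markup suggests) and that paragraphs are delimited by a closing </p> tag; neither detail is spelled out in the text, and the names split_into_lines and LINE_THRESHOLD are hypothetical.

```python
import re

LINE_THRESHOLD = 10  # example threshold value given in the embodiment


def split_into_lines(content_text: str, threshold: int = LINE_THRESHOLD) -> list[str]:
    """Split on newlines; fall back to paragraph tags for compressed markup."""
    # Assumption: the line-break tag is the newline character.
    lines = content_text.split("\n")
    if len(lines) < threshold:
        # Too few lines: the markup was probably compressed, so split on
        # closing paragraph tags instead (assumed delimiter: </p>).
        lines = re.split(r"</p\s*>", content_text, flags=re.IGNORECASE)
    return lines
```

Note that the fallback is taken whenever the newline split yields fewer lines than the threshold, so a genuinely short page is also split by paragraph tags.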
In this embodiment, after the content text has been split into multiple lines according to line-break tags or paragraph tags, as the situation requires, the method further comprises traversing the lines in turn and using a regular expression to remove the html tags from each line. In other words, the line-break tags and paragraph tags in the content text are deleted. The content shown in Fig. 2 then becomes the content shown in Fig. 3.
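A minimal sketch of this per-line cleanup follows. The pattern TAG and the function name strip_tags are hypothetical, since the patent does not give its regular expression; the pattern simply removes anything that looks like an html tag.

```python
import re

# Assumption: a generic "anything between angle brackets" pattern stands in
# for the unspecified regular expression of the embodiment.
TAG = re.compile(r"<[^>]+>")


def strip_tags(lines: list[str]) -> list[str]:
    """Remove every remaining html tag from each line."""
    return [TAG.sub("", line) for line in lines]
```

Removing the tags before counting characters matters for the next step, because markup would otherwise inflate the character count of short navigation lines.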
Specifically, as shown in Fig. 4, step S3, determining the start line and the end line of the main content according to the character count of multiple consecutive lines, comprises:
S31: starting from the first line of the content text, counting the characters in a predetermined number of adjacent lines;
S32: when that character count is not less than a preset value, taking the first of those lines as the start line of the main content;
S33: after the start line, when the character count of a predetermined number of lines is less than the preset value, taking the last of those lines as the end line of the main content.
In this embodiment, the lines are traversed; if the character count of 4 adjacent lines is greater than or equal to 120, the first of those 4 lines is taken as the start line of the main content. For example, lines 3-6 in Fig. 3 contain more than 120 characters, so line 3 is determined to be the start line of the main content. After the start line has been determined, the traversal continues; if the character count of 4 adjacent lines is less than 120, the last of those 4 lines is taken as the end line of the main content. For example, lines 15-18 in Fig. 3 contain fewer than 120 characters, so line 18 is determined to be the end line of the main content.
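The sliding-window test of steps S31 to S33 might look like the sketch below. Several points are assumptions rather than statements of the patent: the character count is taken as the combined total over the 4 adjacent lines, the search for the end line begins with the window that starts one line after the start line, and the last line is used as the end line when no sparse window is found. Indices are 0-based, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

WINDOW = 4       # number of adjacent lines used in the embodiment
MIN_CHARS = 120  # preset character count used in the embodiment


def window_chars(lines: List[str], i: int, window: int) -> int:
    """Total characters in the `window` lines starting at index i (assumed total)."""
    return sum(len(line) for line in lines[i:i + window])


def find_main_content_bounds(
    lines: List[str], window: int = WINDOW, min_chars: int = MIN_CHARS
) -> Optional[Tuple[int, int]]:
    """Return 0-based (start, end) indices of the main content, or None."""
    last_window_start = len(lines) - window
    start = None
    # S31/S32: the first window that is dense enough marks the start line.
    for i in range(last_window_start + 1):
        if window_chars(lines, i, window) >= min_chars:
            start = i
            break
    if start is None:
        return None
    # S33: the first sparse window after the start marks the end line.
    for i in range(start + 1, last_window_start + 1):
        if window_chars(lines, i, window) < min_chars:
            return start, i + window - 1
    # Assumption: with no sparse window, the content runs to the last line.
    return start, len(lines) - 1
```

For a page with two short navigation lines, twelve 35-character lines and four short trailing lines, the sketch returns (2, 14), that is, lines 3 to 15 in 1-based numbering. If the preset value is instead meant per line, window_chars would need to be replaced by a per-line check.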
Specifically, in step S4 the text from the start line to the end line, including both the start line and the end line, is the main content extracted from the webpage. For example, lines 3 to 18 in Fig. 3 are the main content extracted from the news page corresponding to the link entered by the user.
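Because both boundary lines belong to the main content, the extraction is an inclusive slice. The sketch below uses 1-based line numbers to match the example of lines 3 to 18; the function name extract_main_content and the choice to join the lines with newline characters are assumptions.

```python
def extract_main_content(lines: list[str], start_line: int, end_line: int) -> str:
    """Join lines start_line..end_line (1-based, both inclusive)."""
    # A 1-based inclusive range [start_line, end_line] corresponds to the
    # 0-based half-open slice [start_line - 1, end_line).
    return "\n".join(lines[start_line - 1:end_line])
```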
Embodiment 2
As shown in Fig. 5, this embodiment provides a device for extracting main content from a webpage, comprising:
A content text acquiring unit U1, for removing useless tags from the webpage to be extracted to obtain a content text;
A splitting unit U2, for splitting the content text into multiple lines;
A start line and end line determining unit U3, for determining the start line and the end line of the main content according to the character count of multiple consecutive lines;
A main content extraction unit U4, for extracting the text between the start line and the end line as the main content of the webpage to be extracted.
The device of this embodiment breaks with the tradition, in webpage content extraction, of performing complex parsing of the data content. It relies on a simple regularity, largely avoids the extraction problems caused by differences between webpages and by webpage redesigns, and improves the efficiency of main content extraction.
Specifically, the splitting unit U2 comprises:
A first splitting subunit, for splitting the content text into multiple lines according to line-break tags;
A comparison subunit, for comparing the total number of lines after splitting with a predetermined threshold;
A second splitting subunit, for splitting the lines according to paragraph tags when the total number of lines after splitting is less than the predetermined threshold.
Specifically, the device further comprises:
A deleting unit, for deleting the line-break tags and paragraph tags in the content text. That is, after the content text has been split into multiple lines according to line-break tags or paragraph tags, a regular expression is used to remove the line-break tags and paragraph tags from the content text.
Specifically, the start line and end line determining unit U3 comprises:
A character counting subunit, for counting, starting from the first line of the content text, the characters in a predetermined number of adjacent lines;
A start line determining subunit, for taking the first of those lines as the start line when that character count is not less than a preset value;
An end line determining subunit, for taking, after the start line, the last of a predetermined number of lines as the end line when the character count of those lines is less than the preset value.
Once the start line and the end line of the main content have been determined, the text between them can be extracted as the main content of the webpage to be extracted.
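One way to picture how units U1 to U4 fit together is as four interchangeable callables wired into a pipeline. The sketch below is purely illustrative: the class name MainContentExtractor, its field names and the run method are hypothetical, the optional deleting unit is omitted, and the patent does not prescribe any particular software structure for the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MainContentExtractor:
    """Hypothetical composition of units U1-U4 as plain callables."""

    acquire_content_text: Callable[[str], str]            # U1
    split_lines: Callable[[str], List[str]]               # U2
    find_bounds: Callable[[List[str]], Tuple[int, int]]   # U3
    extract: Callable[[List[str], int, int], str]         # U4

    def run(self, html: str) -> str:
        # Each unit consumes the output of the previous one.
        text = self.acquire_content_text(html)
        lines = self.split_lines(text)
        start, end = self.find_bounds(lines)
        return self.extract(lines, start, end)
```

A usage example with trivial stand-in units: constructing the class with lambda html: html, lambda text: text.split("\n"), lambda lines: (1, 2) and lambda lines, s, e: "\n".join(lines[s:e + 1]) and calling run on "a\nb\nc\nd" yields the second and third lines.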
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the possible embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious changes or variations derived from the above remain within the scope of protection of the invention.

Claims (8)

1. A method for extracting main content from a webpage, characterized by comprising the following steps:
Removing useless tags from the webpage to be extracted to obtain a content text;
Splitting said content text into multiple lines;
Determining the start line and the end line of the main content according to the character count of multiple consecutive lines;
Extracting the text between said start line and said end line as the main content of said webpage to be extracted.
2. The method of claim 1, characterized in that said step of splitting said content text into multiple lines comprises:
Splitting said content text into multiple lines according to line-break tags;
Comparing the total number of lines after splitting with a predetermined threshold;
When the total number of lines after splitting is less than said predetermined threshold, splitting the lines according to paragraph tags instead.
3. The method of claim 1 or 2, characterized in that, between said step of splitting said content text into multiple lines and said step of determining the start line and the end line of the main content according to the character count of multiple consecutive lines, the method further comprises:
Deleting said line-break tags and said paragraph tags in said content text.
4. The method of any one of claims 1-3, characterized in that said step of determining the start line and the end line of the main content according to the character count of multiple consecutive lines comprises:
Starting from the first line of said content text, counting the characters in a predetermined number of adjacent lines;
When that character count is not less than a preset value, taking the first of those lines as said start line;
After said start line, when the character count of a predetermined number of lines is less than said preset value, taking the last of those lines as said end line.
5. A device for extracting main content from a webpage, characterized by comprising:
A content text acquiring unit, for removing useless tags from the webpage to be extracted to obtain a content text;
A splitting unit, for splitting said content text into multiple lines;
A start line and end line determining unit, for determining the start line and the end line of the main content according to the character count of multiple consecutive lines;
A main content extraction unit, for extracting the text between said start line and said end line as the main content of said webpage to be extracted.
6. The device of claim 5, characterized in that said splitting unit comprises:
A first splitting subunit, for splitting said content text into multiple lines according to line-break tags;
A comparison subunit, for comparing the total number of lines after splitting with a predetermined threshold;
A second splitting subunit, for splitting the lines according to paragraph tags when the total number of lines after splitting is less than said predetermined threshold.
7. The device of claim 5 or 6, characterized by further comprising:
A deleting unit, for deleting said line-break tags and said paragraph tags in said content text.
8. The device of any one of claims 5-7, characterized in that said start line and end line determining unit comprises:
A character counting subunit, for counting, starting from the first line of said content text, the characters in a predetermined number of adjacent lines;
A start line determining subunit, for taking the first of those lines as said start line when that character count is not less than a preset value;
An end line determining subunit, for taking, after said start line, the last of a predetermined number of lines as said end line when the character count of those lines is less than said preset value.
CN201510857404.7A 2015-11-30 2015-11-30 Method and device extracting main content from webpage Pending CN105512225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510857404.7A CN105512225A (en) 2015-11-30 2015-11-30 Method and device extracting main content from webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510857404.7A CN105512225A (en) 2015-11-30 2015-11-30 Method and device extracting main content from webpage

Publications (1)

Publication Number Publication Date
CN105512225A true CN105512225A (en) 2016-04-20

Family

ID=55720207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510857404.7A Pending CN105512225A (en) 2015-11-30 2015-11-30 Method and device extracting main content from webpage

Country Status (1)

Country Link
CN (1) CN105512225A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101673306A (en) * 2009-10-19 2010-03-17 中国科学院计算技术研究所 Website information query method and system thereof
US20130204867A1 (en) * 2010-07-30 2013-08-08 Hewlett-Packard Development Company, Lp. Selection of Main Content in Web Pages
CN103258000A (en) * 2013-03-29 2013-08-21 北界创想(北京)软件有限公司 Method and device for clustering high-frequency keywords in webpages
CN103425765A (en) * 2013-08-06 2013-12-04 优视科技有限公司 Method and device for extracting webpage text and method and system for webpage preview
CN104615728A (en) * 2015-02-09 2015-05-13 浪潮集团有限公司 Webpage main text extraction method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247742A (en) * 2017-05-17 2017-10-13 武汉工程大学 A kind of text message abstracting method based on web page characteristics
CN110750960A (en) * 2018-07-05 2020-02-04 武汉斗鱼网络科技有限公司 Configuration file analysis method, storage medium, electronic device and system
CN110795933A (en) * 2019-09-30 2020-02-14 奇安信科技集团股份有限公司 Method and device for identifying and processing webpage text
CN110795933B (en) * 2019-09-30 2023-10-31 奇安信科技集团股份有限公司 Webpage text recognition processing method and device
CN111310418A (en) * 2020-02-25 2020-06-19 深圳市元征科技股份有限公司 Text extraction method and device

Similar Documents

Publication Publication Date Title
JP6653334B2 (en) Information extraction method and device
CN106095979B (en) URL merging processing method and device
Oh et al. Advanced evidence collection and analysis of web browser activity
CN107766328B (en) Text information extraction method of structured text, storage medium and server
CN109543126B (en) Webpage text information extraction method based on block character ratio
CN104750704B (en) A kind of webpage URL address sorts recognition methods and device
CN105512225A (en) Method and device extracting main content from webpage
CN104572934B (en) A kind of webpage key content abstracting method based on DOM
CN104899219B (en) Pseudo- static state URL's screens out method, system and web page crawl method, system
WO2014153457A1 (en) Merging web page style addresses
CN102207974B (en) Method for combining context web pages
CN103984749A (en) Focused crawler method based on link analysis
CN112612761B (en) Data cleaning method, device, equipment and storage medium
CN106294885A (en) A kind of data collection towards isomery webpage and mask method
CN103942211A (en) Text page recognition method and device
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN106934049B (en) News question selection analysis method and device
CN105335408B (en) A kind of extended method and related system of search term white list
CN106897289A (en) The optimization method and device of information search
CN103455572B (en) Obtain the method and device of video display main body in webpage
CN101673263B (en) Method for searching video content
CN103258021B (en) The character terminal characteristic extracting method that a kind of Behavior-based control is analyzed
CN103853777A (en) Method and device for accessing websites through keywords
CN111125704B (en) Webpage Trojan horse recognition method and system
CN108287831B (en) URL classification method and system and data processing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160420