CN105740355A

CN105740355A - Aggregated text density based webpage body text extraction method and apparatus

Info

Publication number: CN105740355A
Application number: CN201610050995.1A
Authority: CN
Inventors: 刘忠; 陈发君; 黄金才; 朱承; 修保新; 程光权; 陈超; 冯旸赫
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2016-01-26
Filing date: 2016-01-26
Publication date: 2016-07-06
Anticipated expiration: 2036-01-26
Also published as: CN105740355B

Abstract

The present invention provides a webpage body text extraction method and apparatus based on aggregated text density. The method segments the text content of a webpage by splitting the webpage HTML at its tags, which effectively separates the different types of text in the content. No site-specific extraction rules need to be customized, so the method is highly general; no complex text mining is required, so the method is simple and efficient and extracts the body text of various types of webpages accurately.

Description

Webpage body text extraction method and apparatus based on aggregated text density

Technical field

The present invention relates to the technical field of web crawlers, and in particular to a webpage body text extraction method and apparatus based on aggregated text density.

Background art

With the rapid development of the information society, the Internet has become an important source of information. Internet users generally view web content directly in a browser. In addition, much Internet-based information processing work (such as information retrieval, data mining, and machine translation) takes the information content of webpages as its data, and mainly processes the body text of webpages. However, besides the information they carry (such as the body content), most webpages also contain a great deal of noise, for example site navigation, related links, advertisements, copyright notices, and scripts. How to extract the body text of a webpage accurately and efficiently, neither omitting body text nor mixing in noise, has become an important topic in network information extraction and application, and has high application value and practical significance.

The prior art offers several extraction methods for this problem:

1) Body text extraction based on the DOM tree structure

First, non-standard structures or information in the webpage's HTML file are repaired (for example, a start tag <h1> closed by a mismatched end tag </a>) to produce a standard HTML file. The HTML file is then parsed into a DOM (Document Object Model) tree. Finally, the DOM tree is traversed to identify and remove non-body information, and the body text is extracted according to rules such as page layout and text density. The page structures of many websites are now increasingly complex and increasingly non-standard, which can cause construction of the DOM tree, and therefore of the body text extraction template, to fail. The subsequent construction and traversal of the DOM tree has high time and space complexity, low efficiency, and low speed. Noise identification requires manually maintained, regularly updated information (such as lists of advertisement servers), so it cannot be automated.

2) Rule-based body text extraction

Extraction rules, such as regular expressions or XPath, are specified manually for a particular website. The advantage is high accuracy. The drawbacks are that the rules are not general and cannot be extended: they can only parse webpages of a fixed website or fixed format, formulating them is time-consuming and laborious, and once the page layout changes it is difficult to notice in time and update the rules.

3) Text block extraction based on webpage segmentation

Text blocks in the webpage are separated out using dividing lines in the HTML tags and some visual information (such as text color, font size, and text information). Because the HTML styles of different websites differ, there is no unified segmentation method and generality is hard to guarantee; many auxiliary manual rules must be added.

4) Body text extraction based on data mining and machine learning

This method comprises the following steps: linearly reconstructing the webpage code so that the logical order of the body paragraphs is not destroyed by tag nesting rules; filtering out HTML noise tags; parsing and storing the body text fragments in units of <table> tags; and clustering the paragraphs with a text clustering algorithm to finally generate the body text. Its problem is that it complicates a simple problem: body text extraction becomes very complex, which hinders wide use.
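To make the drawback of approach 2) concrete, the following minimal Python sketch hard-codes a regular-expression rule for one hypothetical page layout. The `id="article"` container, the function name `extract_by_rule`, and both sample pages are assumptions invented for illustration; they do not come from the patent or from any real website.

```python
import re
from typing import Optional

# Hypothetical site-specific rule: the body is assumed to sit in <div id="article">.
ARTICLE_RULE = re.compile(r'<div id="article">(.*?)</div>', re.DOTALL)

def extract_by_rule(html: str) -> Optional[str]:
    """Return the text matched by the fixed rule, or None if the rule no longer fits."""
    match = ARTICLE_RULE.search(html)
    return match.group(1).strip() if match else None

# Two invented layouts of the "same" page, before and after a redesign.
old_layout = '<div id="nav">Home</div><div id="article">Body text here.</div>'
new_layout = '<div id="nav">Home</div><section class="post">Body text here.</section>'

print(extract_by_rule(old_layout))  # the rule matches the layout it was written for
print(extract_by_rule(new_layout))  # the same rule finds nothing after the redesign
```

The rule is accurate while the layout holds, but it silently stops matching once the container changes, which is the maintenance burden described above.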

Summary of the invention

In view of the technical problems of the prior art mentioned in the background above, the object of the present invention is to provide a webpage body text extraction method and apparatus based on aggregated text density.

The present invention provides a webpage body text extraction method based on aggregated text density, comprising the following steps:

Step S100: obtaining the HTML source file text of a webpage, deleting first tags of no value, and removing special characters from the text to obtain a sample text;

Step S200: replacing all second tags in the sample text with newlines to generate multiple newline-delimited texts, and converting these texts into a queue T in which every two adjacent texts are separated by a newline character;

Step S300: splitting queue T into multiple subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the multiple text blocks;

Step S400: selecting the text with the greatest text length from queue B as the webpage body text.

The index threshold is a preset number of newlines between any two subqueues, and the text threshold is a preset number of text characters contained in a subqueue.

Further, in step S200 the second tags are replaced using regular expressions, and the replacement rule is R[("i", n)], where "i" is a second tag and n is the number of newlines that replace this tag.

Further, step S300 comprises the following steps:

Step S310: traversing queue T in a loop, denoting the current element as Tc; if the number of valid characters in the current element Tc is less than the text length threshold, adding the text of Tc to queue B and continuing to traverse queue T; if the number of valid characters in Tc is greater than the text threshold, recording Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of Tcv;

Step S320: traversing queue T starting from the element after the current valid text Tcv, ignoring space or blank elements until the next valid text Ncv is found; if the difference between the position indices of Ncv and Tcv in queue T is less than the index threshold, appending the text of Ncv to the temporary text block Temp and assigning Ncv to the valid text Tcv;

Step S330: continuing to traverse queue T from the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indices of Ncv_i+2 and the current valid text Tcv in queue T is greater than the index threshold, putting a copy of the temporary text block Temp into queue B, assigning Ncv_i+2 to the current element Tc, and continuing the loop traversal of queue T.

Further, the first tags are HTML tags of no value.

Another aspect of the present invention provides an apparatus for the above method, namely a webpage body text extraction apparatus based on aggregated text density, comprising:

a webpage HTML file acquisition module, configured to obtain the HTML source file text of a webpage, delete first tags of no value, and remove special characters from the text to obtain a sample text;

a newline segmentation module, configured to replace all second tags in the sample text with newlines to generate multiple newline-delimited texts, and to convert these texts into a queue T in which every two adjacent texts are separated by a newline character;

a queue conversion module, configured to split queue T into multiple subqueues according to a text threshold and an index threshold, merge all texts in each subqueue into one text block, and form a queue B from the multiple text blocks;

a body text selection module, configured to select the text with the greatest text length from queue B as the webpage body text.

Further, the second tags are replaced using regular expressions, and the replacement rule is R[("i", n)], where "i" is a second tag and n is the number of newlines that replace this tag.

Further, the queue conversion module comprises:

a first loop module, configured to traverse queue T in a loop, denoting the current element as Tc; if the number of valid characters in the current element Tc is less than the text length threshold, the text of Tc is added to queue B and traversal of queue T continues; if the number of valid characters in Tc is greater than the text threshold, Tc is recorded as the current valid text Tcv and a temporary text block Temp is created whose value is the text of Tcv;

a second loop module, configured to traverse queue T starting from the element after the current valid text Tcv, ignoring space or blank elements until the next valid text Ncv is found; if the difference between the position indices of Ncv and Tcv in queue T is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the valid text Tcv;

a queue B forming module, configured to continue traversing queue T from the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indices of Ncv_i+2 and the current valid text Tcv in queue T is greater than the index threshold, a copy of the temporary text block Temp is put into queue B, Ncv_i+2 is assigned to the current element Tc, and the loop traversal of queue T continues.

Technical effects of the present invention:

The webpage body text extraction method based on aggregated text density provided by the present invention does not require customized site-specific extraction rules and is therefore highly general; it does not use complex text mining, so it is simple and efficient and extracts the body text of all kinds of webpages accurately. The method obtains the webpage body text by cleaning the acquired webpage HTML according to its tags, converting it, and then aggregating the result. It neither customizes site-specific rules, which avoids rules with poor generality, nor generates and traverses a DOM tree, which avoids low efficiency. Practical tests show that the method extracts webpage body text accurately and efficiently and is applicable to all kinds of websites.

The webpage body text extraction apparatus based on aggregated text density provided by the present invention does not use complex text mining; it is simple and efficient and extracts the body text of all kinds of webpages accurately.

The above and other aspects of the present invention will become apparent from the following description of the various embodiments of the webpage body text extraction method and apparatus based on aggregated text density according to the present invention.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the webpage body text extraction method based on aggregated text density provided by the present invention;

Fig. 2 is a schematic structural diagram of the webpage body text extraction apparatus based on aggregated text density provided by the present invention.

Detailed description of the invention

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of it.

Referring to Fig. 1, the webpage body text extraction method based on aggregated text density provided by the present invention comprises the following steps:

Step S100: obtaining the HTML source file text of a webpage, deleting first tags of no value, and removing special characters from the text to obtain a sample text;

Step S200: replacing all second tags in the sample text with newlines to generate multiple newline-delimited texts, and converting these texts into a queue T in which every two adjacent texts are separated by a newline character;

Step S300: splitting queue T into multiple subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B made up of the multiple text blocks, where the index threshold is a preset number of newlines between any two subqueues and the text threshold is a preset number of text characters contained in a subqueue;

Step S400: selecting the text with the greatest text length from queue B as the webpage body text.

The present invention starts with the replacement and deletion of tags and divides the text in the source file into different subqueues according to the number of text characters and the number of newlines, thereby separating the body text from text serving other purposes. The method requires no manually configured extraction rules for specific webpages; tag replacement according to the tag conditions is sufficient to extract the body text, which improves efficiency.

The first tags of no value may be any of the common HTML tags that carry no body content. These include, but are not limited to, comments (<!--...-->, <!...>), scripts (<script...>...</script>), the head (<head...>...</head>), styles (<link.../>), and editing controls (<input.../>).

Special characters are then removed. Some characters are represented by special character sequences in the webpage source file; for example, a space is written as "&nbsp;" in webpage source code. Such special characters carry no specific meaning here and are deleted. Specifically, in this step each element text in queue T is checked and the common special characters are removed from it; these include, but are not limited to, the space ("&nbsp;"), the greater-than sign ("&gt;"), the less-than sign ("&lt;"), and the quotation mark ("&quot;").
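The cleaning in step S100 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the regular expressions, the names `FIRST_TAG_PATTERNS`, `SPECIAL_ENTITIES`, and `clean_html`, and the choice to clean the whole source string at once are all assumptions; only the categories of tags and the four character entities come from the description.

```python
import re

# Assumed patterns for the "first tags" named in the description.
FIRST_TAG_PATTERNS = [
    r"<!--.*?-->",                # HTML comments
    r"<!.*?>",                    # declarations such as <!DOCTYPE ...>
    r"<script\b.*?</script\s*>",  # scripts, including their content
    r"<head\b.*?</head\s*>",      # the head section, including its content
    r"<link\b[^>]*>",             # style links
    r"<input\b[^>]*>",            # editing (input) controls
]

# The special characters listed in the description.
SPECIAL_ENTITIES = ["&nbsp;", "&gt;", "&lt;", "&quot;"]

def clean_html(html: str) -> str:
    """Delete first tags and special characters, returning the sample text."""
    for pattern in FIRST_TAG_PATTERNS:
        html = re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)
    for entity in SPECIAL_ENTITIES:
        html = html.replace(entity, "")
    return html

sample = ('<!DOCTYPE html><html><head><title>t</title></head><body>'
          '<!-- ad --><script>var a = 1;</script>'
          '<div>Hello&nbsp;world</div><input type="text"/></body></html>')
print(clean_html(sample))
```

The remaining tags (here `<html>`, `<body>`, and `<div>`) are deliberately left in place, since they are the second tags that step S200 converts.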

The second tags here are the other common HTML tags that remain after the first tags of no value have been deleted. Once every second tag in the HTML text has been replaced with a certain number of newlines, the text containing the body content in the sample text is separated from the content delimited by other tags.

Preferably, step S200 includes the following: the text tags among the second tags are replaced with newlines according to the following rule. The replacement rule, expressed as a correspondence, is R[("i", n)], where "i" is a second tag and n is the number of newlines that replace this tag. For example, with R: [("div", 5), ("tr", 5), ("h1", 9), ("br", 5), ("span", 4), ("table", 2)], the replacement is performed using regular expressions.

Specifically, every element of R is a key-value pair. The key of an element is a tag name, such as div, tr, or h1, all of which are common second tags. The value of an element is the number of newlines substituted for the tag during conversion. The first element of R, ("div", 5), means that when the detected second tag is div, its start tag or end tag is replaced with 5 newline characters ("\n"). Any second tag not listed in R is replaced with a single newline character. The replacement follows visual effect: the larger the visual gap a second tag produces, the more newlines replace it. Afterwards, the texts in the webpage text that are separated by newlines form a list T, in which every two adjacent texts are separated by a newline.
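A minimal Python sketch of this replacement is shown below. The rule table `R` uses the example values from the description; the tag-matching regular expression, the function names `tags_to_newlines` and `to_queue`, and the decision to build queue T by splitting on "\n" (so that consecutive newlines produce empty elements whose count is reflected in the position index) are assumptions made for illustration.

```python
import re
from typing import List

# Example rule table R from the description: tag name -> number of newlines.
R = {"div": 5, "tr": 5, "h1": 9, "br": 5, "span": 4, "table": 2}

# Assumed pattern matching any start or end tag and capturing its name.
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def tags_to_newlines(sample_text: str) -> str:
    """Replace every remaining (second) tag with newlines according to R."""
    def repl(match):
        name = match.group(1).lower()
        return "\n" * R.get(name, 1)  # tags not listed in R become one newline
    return TAG_RE.sub(repl, sample_text)

def to_queue(sample_text: str) -> List[str]:
    """Build queue T; empty elements mark the gaps left by replaced tags."""
    return tags_to_newlines(sample_text).split("\n")

T = to_queue("<p>a</p><span>b</span>")
print(T.index("a"), T.index("b"))
```

Under this representation, texts separated by visually "large" tags end up far apart in T, while texts separated only by minor tags stay close, which is what the index threshold in step S300 relies on.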

Step S300 is the text aggregation step. After the above steps, the tags have split the webpage text into small text blocks separated from one another by newlines; this step gathers small text blocks whose physical positions are adjacent into a single text block.

Specifically, step S300 comprises the following steps:

Step S310: traverse queue T in a loop, denoting the current element as Tc. If the number of valid characters in the current element Tc is less than the specified text length threshold (for example, 4), add the text of Tc to queue B and continue traversing queue T. If the number of valid characters in Tc is greater than the specified threshold, Tc is a valid text: record Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv.

Step S320: next, traverse queue T starting from the element after Tcv, ignoring space or blank text elements until the next valid text Ncv is found. If the difference between the position indices of Ncv and Tcv in queue T is less than the specified index threshold (for example, 7), append the text of Ncv to the temporary text block Temp and assign Ncv to the valid text Tcv.

Step S330: then continue traversing queue T from the next valid element Ncv_i+2 after the next valid text Ncv. If the difference between the position indices of Ncv_i+2 and the current valid text Tcv in queue T is greater than the specified index threshold, put a copy of the text block Temp into queue B, assign Ncv_i+2 to the current element Tc, and continue the loop traversal of queue T.
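One possible reading of steps S310 to S330 is sketched below in Python. Several points are interpretations rather than statements from the description and are labeled as such: valid characters are counted as non-whitespace characters, an element whose length equals the text threshold is treated as valid, an index difference equal to the index threshold closes the block, blank short elements are skipped instead of being added to B, and the last block is flushed when the queue ends. The names `effective_len` and `aggregate` are hypothetical.

```python
from typing import List

def effective_len(text: str) -> int:
    # Assumption: "valid characters" means non-whitespace characters.
    return sum(1 for ch in text if not ch.isspace())

def aggregate(T: List[str], text_threshold: int = 4,
              index_threshold: int = 7) -> List[str]:
    """Merge nearby elements of queue T into text blocks (queue B)."""
    B: List[str] = []
    i = 0
    while i < len(T):
        if effective_len(T[i]) < text_threshold:
            # S310: a short element goes to B as is (blank ones are skipped).
            if T[i].strip():
                B.append(T[i].strip())
            i += 1
            continue
        # S310: T[i] is the current valid text Tcv and seeds Temp.
        temp = T[i].strip()
        tcv_index = i
        j = i + 1
        while j < len(T):
            if not T[j].strip():
                j += 1                    # S320: ignore blank elements
                continue
            if j - tcv_index < index_threshold:
                temp += T[j].strip()      # S320: close enough, append to Temp
                tcv_index = j             # S320: Ncv becomes the new Tcv
                j += 1
            else:
                break                     # S330: gap too large, close the block
        B.append(temp)                    # S330: put a copy of Temp into B
        i = j                             # S330: the far element becomes Tc
    return B

T = ["", "Body paragraph one.", "", "Body paragraph two."] + [""] * 9 + ["Copyright notice"]
print(aggregate(T))
```

In the example, the two paragraphs are 2 positions apart and are merged, while the copyright line is 10 positions beyond the second paragraph and starts a new block.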

Step S400 is the body text selection step. After step S300, related texts have been gathered together (for example, body text, advertisements, and links). The element with the longest text in queue B is obtained; the text of this element is the body text, and the extraction of the body text is then complete.
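Selecting the longest element can be sketched in a few lines of Python. The function name `select_body` and the behavior for an empty queue (returning an empty string) are assumptions; the description only states that the longest element of queue B is taken as the body text.

```python
from typing import List

def select_body(B: List[str]) -> str:
    """Return the longest text block in queue B as the body text."""
    # Assumption: an empty queue yields an empty string rather than an error.
    return max(B, key=len, default="")

B = ["Home", "First paragraph. Second paragraph.", "Copyright notice"]
print(select_body(B))
```

If two blocks have the same length, `max` returns the first one encountered; the description does not say how ties should be resolved.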

This step is adopted because, in a typical webpage: 1) the body text is contiguous and is not split apart by noise such as advertisements; 2) the text blocks of the body are relatively long and close to one another; 3) the body text should be the longest content. Valid text in the webpage is thus gathered effectively, which avoids complex steps and algorithms as well as the tedium of specifying different extraction rules for different webpages, and improves the efficiency of webpage text extraction.

Referring to Fig. 2, another aspect of the present invention provides an apparatus for the above method, comprising:

a webpage HTML file acquisition module 100, configured to obtain the HTML source file text of a webpage, delete first tags of no value, and remove special characters from the text to obtain a sample text;

a newline segmentation module 200, configured to replace all second tags in the sample text with newlines to generate multiple newline-delimited texts, and to convert these texts into a queue T in which every two adjacent texts are separated by a newline character;

a queue conversion module 300, configured to split queue T into multiple subqueues according to a text threshold and an index threshold, merge all texts in each subqueue into one text block, and form a queue B from the multiple text blocks;

a body text selection module 400, configured to select the text with the greatest text length from queue B as the webpage body text.

The apparatus does not require extraction rules designed for specific webpages and needs no manual intervention, so it can effectively improve extraction efficiency.
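How the four modules might chain together is sketched below as one compact, self-contained Python pipeline. The function names (`acquire`, `split_lines`, `convert`, `select`, `extract_body`), the regular expressions, the default thresholds of 4 and 7 (the example values from the description), and the sample page are all assumptions for illustration; this is a sketch of the data flow between modules 100 to 400, not the apparatus itself.

```python
import re
from typing import List

R = {"div": 5, "tr": 5, "h1": 9, "br": 5, "span": 4, "table": 2}

def acquire(html: str) -> str:
    """Module 100: delete first tags and special characters (assumed patterns)."""
    for pat in (r"<!--.*?-->", r"<!.*?>", r"<script\b.*?</script\s*>",
                r"<head\b.*?</head\s*>", r"<link\b[^>]*>", r"<input\b[^>]*>"):
        html = re.sub(pat, "", html, flags=re.I | re.S)
    return re.sub(r"&(nbsp|gt|lt|quot);", "", html)

def split_lines(sample: str) -> List[str]:
    """Module 200: replace second tags with newlines and build queue T."""
    text = re.sub(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>",
                  lambda m: "\n" * R.get(m.group(1).lower(), 1), sample)
    return [line.strip() for line in text.split("\n")]

def convert(T: List[str], text_threshold: int = 4,
            index_threshold: int = 7) -> List[str]:
    """Module 300: merge nearby elements of T into text blocks (queue B)."""
    B, i = [], 0
    while i < len(T):
        if len(T[i]) < text_threshold:
            if T[i]:
                B.append(T[i])            # short, non-blank element
            i += 1
            continue
        temp, last, j = T[i], i, i + 1
        # Keep scanning while elements are blank or within the index threshold.
        while j < len(T) and (not T[j] or j - last < index_threshold):
            if T[j]:
                temp, last = temp + T[j], j
            j += 1
        B.append(temp)
        i = j
    return B

def select(B: List[str]) -> str:
    """Module 400: pick the longest text block."""
    return max(B, key=len, default="")

def extract_body(html: str) -> str:
    return select(convert(split_lines(acquire(html))))

page = ("<html><head><title>x</title></head><body>"
        "<div><span>Home</span><span>News</span></div>"
        "<div><p>First paragraph of the article body.</p>"
        "<p>Second paragraph of the article body.</p></div>"
        "<div>Copyright 2016</div></body></html>")
print(extract_body(page))
```

In this invented page the two navigation items are 8 positions apart in queue T (two `span` tags at 4 newlines each), so they are not merged, whereas the two body paragraphs are only 2 positions apart and form the longest block.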

Preferably, the second tags are replaced using regular expressions, and the replacement rule is R[("i", n)], where "i" is a second tag and n is the number of newlines that replace this tag. Extraction under this rule effectively separates invalid text from the body text and avoids the two becoming mixed and hard to split.

Preferably, the queue conversion module comprises:

a first loop module, configured to traverse queue T in a loop, denoting the current element as Tc; if the number of valid characters in the current element Tc is less than the text length threshold, the text of Tc is added to queue B and traversal of queue T continues; if the number of valid characters in Tc is greater than the text threshold, Tc is recorded as the current valid text Tcv and a temporary text block Temp is created whose value is the text of Tcv;

a second loop module, configured to traverse queue T starting from the element after the current valid text Tcv, ignoring space or blank elements until the next valid text Ncv is found; if the difference between the position indices of Ncv and Tcv in queue T is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the valid text Tcv;

a queue B forming module, configured to continue traversing queue T from the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indices of Ncv_i+2 and the current valid text Tcv in queue T is greater than the index threshold, a copy of the temporary text block Temp is put into queue B, Ncv_i+2 is assigned to the current element Tc, and the loop traversal of queue T continues.

With these modules, omission of body text contained in separate text fragments can be effectively avoided; this is particularly suitable when tags also occur within the body text.

Those skilled in the art will appreciate that the scope of the present invention is not limited to the examples discussed above, and that changes and modifications may be made to them without departing from the scope of the present invention as defined by the appended claims. Although the present invention has been illustrated and described in detail in the drawings and the description, such illustration and description are merely illustrative or schematic, and not restrictive. The present invention is not limited to the disclosed embodiments.

From a study of the drawings, the description, and the claims, those skilled in the art can understand and implement variations of the disclosed embodiments when practicing the present invention. In the claims, the term "comprising" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not mean that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope of the present invention.

Claims

1. A webpage body text extraction method based on aggregated text density, comprising the following steps:

Step S100: obtaining the HTML source file text of a webpage, deleting first tags of no value, and removing special characters from the text to obtain a sample text;

Step S200: replacing all second tags in the sample text with newlines to generate multiple newline-delimited texts, and converting these texts into a queue T in which every two adjacent texts are separated by a newline character;

Step S300: splitting the queue T into multiple subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the multiple text blocks;

Step S400: selecting the text with the greatest text length from the queue B as the webpage body text;

wherein the index threshold is a preset number of newlines between any two of the subqueues, and the text threshold is a preset number of text characters contained in a subqueue.

2. The webpage body text extraction method based on aggregated text density according to claim 1, characterized in that in step S200 the second tags are replaced using regular expressions, and the replacement rule is R[("i", n)], where "i" is a second tag and n is the number of newlines that replace this tag.

3. The webpage body text extraction method based on aggregated text density according to claim 2, characterized in that step S300 comprises the following steps:

Step S310: traversing the queue T in a loop, denoting the current element as Tc; if the number of valid characters in the current element Tc is less than the text length threshold, adding the text of the current element Tc to the queue B and continuing to traverse the queue T; if the number of valid characters in the current element Tc is greater than the text threshold, recording the current element Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of the current valid text Tcv;

Step S320: traversing the queue T starting from the element after the current valid text Tcv, ignoring space or blank elements until the next valid text Ncv is found; if the difference between the position indices of the next valid text Ncv and the valid text Tcv in the queue T is less than the index threshold, appending the text of the next valid text Ncv to the temporary text block Temp and assigning the next valid text Ncv to the valid text Tcv;

Step S330: continuing to traverse the queue T from the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indices of Ncv_i+2 and the current valid text Tcv in the queue T is greater than the index threshold, putting a copy of the temporary text block Temp into the queue B, assigning Ncv_i+2 to the current element Tc, and continuing the loop traversal of the queue T.

4. The webpage body text extraction method based on aggregated text density according to claim 3, characterized in that the first tags are HTML tags of no value.

5. A webpage body text extraction apparatus based on aggregated text density for use with the method according to any one of claims 1 to 4, characterized by comprising:

a webpage HTML file acquisition module, configured to obtain the HTML source file text of a webpage, delete first tags of no value, and remove special characters from the text to obtain a sample text;

a newline segmentation module, configured to replace all second tags in the sample text with newlines to generate multiple newline-delimited texts, and to convert these texts into a queue T in which every two adjacent texts are separated by a newline character;

a queue conversion module, configured to split the queue T into multiple subqueues according to a text threshold and an index threshold, merge all texts in each subqueue into one text block, and form a queue B from the multiple text blocks;

a body text selection module, configured to select the text with the greatest text length from the queue B as the webpage body text.

6. The webpage body text extraction apparatus based on aggregated text density according to claim 5, characterized in that the second tags are replaced using regular expressions, and the replacement rule is R[("i", n)], where "i" is a second tag and n is the number of newlines that replace this tag.

7. The webpage body text extraction apparatus based on aggregated text density according to claim 5, characterized in that the queue conversion module comprises:

a first loop module, configured to traverse the queue T in a loop, denoting the current element as Tc; if the number of valid characters in the current element Tc is less than the text length threshold, the text of the current element Tc is added to the queue B and traversal of the queue T continues; if the number of valid characters in the current element Tc is greater than the text threshold, the current element Tc is recorded as the current valid text Tcv and a temporary text block Temp is created whose value is the text of the current valid text Tcv;

a second loop module, configured to traverse the queue T starting from the element after the current valid text Tcv, ignoring space or blank elements until the next valid text Ncv is found; if the difference between the position indices of the next valid text Ncv and the valid text Tcv in the queue T is less than the index threshold, the text of the next valid text Ncv is appended to the temporary text block Temp and the next valid text Ncv is assigned to the valid text Tcv;

a queue B forming module, configured to continue traversing the queue T from the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indices of Ncv_i+2 and the current valid text Tcv in the queue T is greater than the index threshold, a copy of the temporary text block Temp is put into the queue B, Ncv_i+2 is assigned to the current element Tc, and the loop traversal of the queue T continues.