CN105740355A - Aggregated text density based webpage body text extraction method and apparatus - Google Patents

Publication number
CN105740355A
Authority
CN
China
Prior art keywords
text
queue
null
ncv
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610050995.1A
Other languages
Chinese (zh)
Other versions
CN105740355B (en)
Inventor
刘忠
陈发君
黄金才
朱承
修保新
程光权
陈超
冯旸赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201610050995.1A
Publication of CN105740355A
Application granted
Publication of CN105740355B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986: Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a webpage body text extraction method and apparatus based on aggregated text density. In the method, the text content of a webpage is segmented by splitting the webpage HTML at its tags, which effectively separates the different kinds of text in the content. No site-specific extraction rules need to be customized, so the method is highly general; no complex text mining is required, so the method is simple and efficient, and it extracts the body text of all kinds of webpages accurately.

Description

Webpage body text extraction method and apparatus based on aggregated text density
Technical Field
The present invention relates to the technical field of web crawlers, and in particular to a webpage body text extraction method and apparatus based on aggregated text density.
Background
With the rapid development of the information society, the Internet has become an important source of information. Users generally view webpage content directly in a browser. In addition, many Internet-based information-processing tasks (such as information retrieval, data mining, and machine translation) take webpage content as their data and mainly process the text of webpages. Besides the information they carry (such as the body content), however, most webpages also contain a great deal of noise, for example site navigation, related links, advertisements, copyright notices, and scripts. Extracting the body text of a webpage accurately and efficiently, neither omitting body text nor admitting noise, has become an important topic in web information extraction and has high practical value.
The prior art offers several extraction methods for this problem:
1) Body text extraction based on the DOM tree structure
The non-standard structures or information in the webpage's HTML file are first repaired (for example, a start tag <h1> whose end tag is </a>) to produce a standard HTML file. The HTML file is then parsed into a DOM (Document Object Model) tree. Finally, the DOM tree is traversed to identify and remove non-body information, and the body text is extracted according to rules such as page layout and text density. The page structure of many websites is now increasingly complex and increasingly non-standard, which can cause construction of the DOM tree, and therefore of the extraction template, to fail. Building and then traversing the DOM tree also has high time and space complexity, so the approach is inefficient and slow. Noise identification requires manually maintained and updated information (such as lists of advertisement servers) and cannot be automated.
2) Rule-based body text extraction
Extraction rules, such as regular expressions or XPath, are specified manually for a particular website. The advantage is high accuracy. The drawback is a lack of generality and extensibility: only pages of a fixed website or a fixed format can be parsed, writing the rules is time-consuming and laborious, and once the page layout changes it is hard to notice in time and update the rules.
3) Text block extraction based on webpage segmentation
Separator lines in the HTML tags and visual information (such as text colour, font size, and text information) are used to segment the text blocks of a webpage. Because different websites use different HTML styles, there is no unified segmentation method and generality is hard to guarantee; many auxiliary manual rules must be added.
4) Body text extraction based on data mining and machine learning
The method comprises the following steps: linearly reconstruct the webpage code so that the logical order of the text paragraphs is not disrupted by tag nesting; filter out HTML noise tags; parse and store the text paragraphs in units of <table> tags; cluster the paragraphs with a text clustering algorithm and finally generate the body text. The problem is that this turns a simple problem into a complex one: body text extraction becomes very complicated, which hinders large-scale use.
Summary of the Invention
In view of the technical problems of the prior art described in the background above, the object of the present invention is to provide a webpage body text extraction method and apparatus based on aggregated text density.
The present invention provides a webpage body text extraction method based on aggregated text density, comprising the following steps:
Step S100: obtain the HTML source file text of a webpage, delete the worthless first tags, and reject the special characters in the text to obtain a sample text;
Step S200: replace all second tags in the sample text with newlines to generate newline-separated text, and convert the newline-separated text into queue T, in which every two adjacent texts are separated by newline characters;
Step S300: split queue T into multiple subqueues according to a text threshold and an index threshold, merge all text in each subqueue into one text block, and form queue B from the resulting text blocks;
Step S400: select the text block with the greatest text length from queue B as the webpage body text.
The index threshold is a preset number of newlines between any two subqueues, and the text threshold is a preset number of text characters contained in a subqueue.
Further, in step S200 the second tags are replaced using regular expressions, and the substitution rule is R[("i", n)], where "i" is the second tag and n is the number of newlines that replace the tag.
Further, step S300 comprises the following steps:
Step S310: loop over queue T and denote the current element Tc. If the number of valid characters in Tc is less than the text threshold, add the text of Tc to queue B and continue traversing queue T. If the number of valid characters in Tc is greater than the text threshold, mark Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv;
Step S320: continue traversing queue T from the element after the current valid text Tcv, ignoring space or newline elements, until the next valid text Ncv is found. If the position-index difference in queue T between Ncv and Tcv is less than the index threshold, append the text of Ncv to the temporary text block Temp and assign Ncv to Tcv;
Step S330: continue traversing queue T from the next valid element Ncv_{i+2} after the next valid text Ncv. If the position-index difference in queue T between Ncv_{i+2} and the current valid text Tcv is greater than the index threshold, put a copy of the temporary text block Temp into queue B, assign Ncv_{i+2} to the current element Tc, and continue the loop over queue T.
Further, the first tags are worthless HTML tags.
Another aspect of the present invention provides a webpage body text extraction apparatus based on aggregated text density for use with the method described above, comprising:
a webpage HTML file acquisition module, for obtaining the HTML source file text of a webpage, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text;
a newline segmentation module, for replacing all second tags in the sample text with newlines to generate newline-separated text and converting the newline-separated text into queue T, in which every two adjacent texts are separated by newline characters;
a queue conversion module, for splitting queue T into multiple subqueues according to a text threshold and an index threshold, merging all text in each subqueue into one text block, and forming queue B from the resulting text blocks;
a body text selection module, for selecting the text block with the greatest text length from queue B as the webpage body text.
Further, the second tags are replaced using regular expressions, and the substitution rule is R[("i", n)], where "i" is the second tag and n is the number of newlines that replace the tag.
Further, the queue conversion module includes:
a first loop module, for looping over queue T and denoting the current element Tc; if the number of valid characters in Tc is less than the text threshold, the text of Tc is added to queue B and traversal of queue T continues; if the number of valid characters in Tc is greater than the text threshold, Tc is marked as the current valid text Tcv and a temporary text block Temp is created whose value is the text of Tcv;
a second loop module, for continuing to traverse queue T from the element after the current valid text Tcv, ignoring space or newline elements, until the next valid text Ncv is found; if the position-index difference in queue T between Ncv and Tcv is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to Tcv;
a queue B forming module, for continuing to traverse queue T from the next valid element Ncv_{i+2} after the next valid text Ncv; if the position-index difference in queue T between Ncv_{i+2} and the current valid text Tcv is greater than the index threshold, a copy of the temporary text block Temp is put into queue B, Ncv_{i+2} is assigned to the current element Tc, and the loop over queue T continues.
Technical effects of the present invention:
The webpage body text extraction method based on aggregated text density provided by the invention needs no site-specific extraction rules and is therefore highly general. It needs no complex text mining, so it is simple and efficient, and it extracts the body text of all kinds of webpages accurately. The method obtains the body text by cleaning and converting the fetched webpage HTML according to its tags and then aggregating the result. It neither customizes site-specific rules, which generalize poorly, nor builds and traverses a DOM tree, which is inefficient. Practical testing shows that the method extracts webpage body text accurately and efficiently and is applicable to all kinds of websites.
The webpage body text extraction apparatus based on aggregated text density provided by the invention likewise needs no complex text mining; it is simple and efficient and extracts the body text of all kinds of webpages accurately.
The above and other aspects of the invention will become apparent from the following description of the embodiments of the webpage body text extraction method and apparatus based on aggregated text density.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the webpage body text extraction method based on aggregated text density provided by the invention;
Fig. 2 is a schematic structural diagram of the webpage body text extraction apparatus based on aggregated text density provided by the invention.
Detailed Description
The accompanying drawings, which form part of this application, provide a further understanding of the invention. The illustrative embodiments and their descriptions explain the invention and do not unduly limit it.
Referring to Fig. 1, the webpage body text extraction method based on aggregated text density provided by the invention comprises the following steps:
Step S100: obtain the HTML source file text of a webpage, delete the worthless first tags, and reject the special characters in the text to obtain a sample text;
Step S200: replace all second tags in the sample text with newlines to generate newline-separated text, and convert the newline-separated text into queue T, in which every two adjacent texts are separated by newline characters;
Step S300: split queue T into multiple subqueues according to a text threshold and an index threshold, merge all text in each subqueue into one text block, and form queue B from the resulting text blocks. The index threshold is a preset number of newlines between any two subqueues, and the text threshold is a preset number of text characters contained in a subqueue;
Step S400: select the text block with the greatest text length from queue B as the webpage body text.
The invention starts from the replacement and deletion of tags and divides the text of the source file into different subqueues according to the number of text characters and the number of newlines, thereby separating the body text from text that serves other purposes. The method does not require extraction rules to be set manually for a specific webpage; replacing tags according to the tag conditions is enough to extract the body text, which improves efficiency.
The worthless first tags can be any of the commonly used worthless HTML tags. These include, but are not limited to, comments (<!--...-->, <!...>), scripts (<script...>...</script>), the head (<head...>...</head>), style links (<link.../>), and form-editing elements (<input.../>).
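The tag deletion in step S100 can be sketched with regular expressions. The patent names the kinds of first tag but not the expressions that match them, so the patterns and the function name `delete_worthless_tags` below are illustrative assumptions rather than the patented implementation.

```python
import re

# Hypothetical patterns for the "worthless" first tags listed above.
# The source gives the tag kinds only; these expressions are assumptions.
WORTHLESS_PATTERNS = [
    r"<!--.*?-->",                      # HTML comments
    r"<!(?!--)[^>]*>",                  # declarations such as <!DOCTYPE ...>
    r"<script\b[^>]*>.*?</script\s*>",  # scripts, including their content
    r"<head\b[^>]*>.*?</head\s*>",      # document head, including its content
    r"<link\b[^>]*>",                   # style links
    r"<input\b[^>]*>",                  # form-editing elements
]
_WORTHLESS_RE = re.compile("|".join(WORTHLESS_PATTERNS),
                           re.IGNORECASE | re.DOTALL)


def delete_worthless_tags(html: str) -> str:
    """Remove every match of the first-tag patterns from the HTML source."""
    return _WORTHLESS_RE.sub("", html)
```

The `\b` after each tag name keeps `<head>` from also matching `<header>`; regex-based stripping of this kind is a simplification and will not handle every malformed page.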
Rejecting special characters: some text in a webpage source file is represented by special characters; for example, a space appears as "&nbsp;" in the webpage source code. Such characters carry no specific meaning for the body text and are deleted here. Specifically, this step checks the text of each element in queue T and rejects the common special characters in it. These special characters include, but are not limited to, the space ("&nbsp;"), the greater-than sign ("&gt;"), the less-than sign ("&lt;"), and the quotation mark ("&quot;").
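A minimal sketch of this rejection step follows. It deletes the four entity references named above rather than decoding them, which is how the description reads; the names `SPECIAL_ENTITIES` and `reject_special_characters` are hypothetical, and the list would need extending for real pages.

```python
import re

# Entities named explicitly in the description; the source says the list is
# open-ended, so this tuple is only an illustrative subset.
SPECIAL_ENTITIES = ("&nbsp;", "&gt;", "&lt;", "&quot;")
_ENTITY_RE = re.compile("|".join(re.escape(e) for e in SPECIAL_ENTITIES))


def reject_special_characters(text: str) -> str:
    """Delete the listed entity references instead of decoding them."""
    return _ENTITY_RE.sub("", text)
```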
The second tags are the other common HTML tags that remain after the worthless first tags have been deleted. Once every second tag in the HTML text has been replaced with a certain number of newlines, the text containing the body content is separated from the content delimited by other tags.
Preferably, step S200 comprises replacing the text tags among the second tags with newlines according to the following rule. The substitution rule is a correspondence R[("i", n)], where "i" is the second tag and n is the number of newlines that replace the tag. For example: R: [("div", 5), ("tr", 5), ("h1", 9), ("br", 5), ("span", 4), ("table", 2)]. The replacement is performed with regular expressions.
Specifically, every element of R is a key-value pair. The key is a tag name, such as div, tr, or h1, all of which are common second tags. The value is the number of newlines that replace the tag during conversion. The first element of R, ("div", 5), means that when a div tag is detected, its start tag or end tag is replaced with 5 newline characters ("\n"). Any second tag that does not appear in R is replaced with a single newline character. The replacement principle is based on visual effect: the larger the visual gap a second tag produces, the more newlines replace it. The texts of the webpage that are separated by newlines then form list T, in which every two adjacent texts are separated by newlines.
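The substitution rule can be sketched as a dictionary lookup inside a regular-expression replacement. The table `R` is taken from the example above; the tag-matching expression, the function names, and the decision to split on single newlines (so that empty items in queue T record how many newlines separate two texts) are assumptions made for illustration.

```python
import re
from typing import List

# Rule table R from the example above: tag name -> number of newlines.
R = {"div": 5, "tr": 5, "h1": 9, "br": 5, "span": 4, "table": 2}

# Assumed pattern: any start or end tag, capturing the tag name.
_TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def tags_to_newlines(sample_text: str) -> str:
    """Replace each start or end tag with the newline count given by R."""
    def repl(match):
        # Tags not listed in R become a single newline.
        return "\n" * R.get(match.group(1).lower(), 1)
    return _TAG_RE.sub(repl, sample_text)


def build_queue_t(sample_text: str) -> List[str]:
    """Split on single newlines; empty items preserve the size of each gap."""
    return tags_to_newlines(sample_text).split("\n")
```

With this representation, the position-index difference between two texts in queue T equals the number of newlines between them, which is what the index threshold in step S300 compares against.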
Step S300 is the text aggregation step. After the preceding steps, the tags have split the webpage text into small text blocks separated by newlines; this step gathers small text blocks that are physically adjacent into a single text block.
Specifically, step S300 comprises the following steps:
Step S310: loop over queue T and denote the current element Tc. If the number of valid characters in Tc is less than the specified text threshold (for example, 4), add the text of Tc to queue B and continue traversing queue T. If the number of valid characters in Tc is greater than the specified threshold, Tc is a valid text: mark Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv.
Step S320: continue traversing queue T from the element after Tcv, ignoring space or newline elements, until the next valid text Ncv is found. If the position-index difference in queue T between Ncv and Tcv is less than the specified index threshold (for example, 7), append the text of Ncv to the temporary text block Temp and assign Ncv to Tcv.
Step S330: continue traversing queue T from the next valid element Ncv_{i+2} after the next valid text Ncv. If the position-index difference in queue T between Ncv_{i+2} and the current valid text Tcv is greater than the specified index threshold, put a copy of the text block Temp into queue B, assign Ncv_{i+2} to the current element Tc, and continue the loop over queue T.
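Steps S310 to S330 can be sketched as a two-level loop. The description leaves several details open, so the following choices are assumptions: "valid characters" are counted as non-whitespace characters, a text whose count equals the text threshold is treated as valid, an index difference equal to the index threshold ends the block, blank items are not copied into queue B, and any non-blank item counts as the "next valid text" in S320. The function names are hypothetical.

```python
from typing import List


def effective_length(text: str) -> int:
    """Count non-whitespace characters (assumed meaning of 'valid characters')."""
    return sum(1 for ch in text if not ch.isspace())


def aggregate(queue_t: List[str], text_threshold: int = 4,
              index_threshold: int = 7) -> List[str]:
    """Merge nearby text items of queue T into the text blocks of queue B."""
    queue_b: List[str] = []
    i, n = 0, len(queue_t)
    while i < n:
        tc = queue_t[i]
        if effective_length(tc) < text_threshold:
            # S310: short items go to B on their own; blank items are skipped.
            if tc.strip():
                queue_b.append(tc)
            i += 1
            continue
        # S310: Tc is a valid text Tcv and seeds the temporary block Temp.
        temp, tcv_index = tc, i
        j = i + 1
        while j < n:
            if not queue_t[j].strip():
                j += 1                      # S320: ignore blank items
                continue
            if j - tcv_index < index_threshold:
                temp += queue_t[j]          # S320: close enough, append Ncv
                tcv_index = j               # Ncv becomes the new Tcv
                j += 1
            else:
                break                       # S330: gap too large, block ends
        queue_b.append(temp)                # S330: put the block into B
        i = j                               # resume the outer loop at that item
    return queue_b
```

The default thresholds 4 and 7 are the example values given in steps S310 and S320.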
Step S400 is the body text selection step. After step S300, related texts have been gathered together (for example, body text, advertisements, and links). The element of queue B with the longest text is obtained; the text of this element is the body text, and the extraction is complete.
This step relies on three properties of typical webpages: 1) the body text is contiguous and is not interrupted by noise such as advertisements; 2) the text blocks of the body text are long and close to one another; 3) the body text is the longest content. The body text of a webpage is therefore gathered effectively without complex steps and algorithms and without the burden of specifying different extraction rules for different webpages, which improves the efficiency of webpage text extraction.
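The selection in step S400 reduces to taking the longest element of queue B. A minimal sketch, with `select_body` as a hypothetical name and the empty-queue behaviour as an assumption:

```python
from typing import List


def select_body(queue_b: List[str]) -> str:
    """Return the longest text block in queue B, or '' if B is empty."""
    return max(queue_b, key=len, default="")
```

When two blocks have the same length, `max` returns the first one; the source does not say how ties should be resolved.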
Referring to Fig. 2, another aspect of the present invention provides an apparatus for the above method, comprising:
a webpage HTML file acquisition module 100, for obtaining the HTML source file text of a webpage, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text;
a newline segmentation module 200, for replacing all second tags in the sample text with newlines to generate newline-separated text and converting the newline-separated text into queue T, in which every two adjacent texts are separated by newline characters;
a queue conversion module 300, for splitting queue T into multiple subqueues according to a text threshold and an index threshold, merging all text in each subqueue into one text block, and forming queue B from the resulting text blocks;
a body text selection module 400, for selecting the text block with the greatest text length from queue B as the webpage body text.
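One way to picture how the four modules fit together is a small class that chains four stages. The class name, field names, and the use of plain callables are assumptions for illustration; the patent describes the modules only functionally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BodyTextExtractor:
    """Chains four stages standing in for modules 100, 200, 300 and 400."""
    acquire: Callable[[str], str]                     # 100: raw HTML -> sample text
    split_newlines: Callable[[str], List[str]]        # 200: sample text -> queue T
    convert_queue: Callable[[List[str]], List[str]]   # 300: queue T -> queue B
    select_body: Callable[[List[str]], str]           # 400: queue B -> body text

    def extract(self, html: str) -> str:
        """Run the stages in order and return the selected body text."""
        sample_text = self.acquire(html)
        queue_t = self.split_newlines(sample_text)
        queue_b = self.convert_queue(queue_t)
        return self.select_body(queue_b)
```

Keeping each stage as an injected callable lets the tag rules or thresholds be changed without touching the pipeline itself.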
The apparatus needs no extraction rules designed for a specific webpage and no manual intervention, and it effectively improves extraction efficiency.
Preferably, the second tags are replaced using regular expressions, and the substitution rule is R[("i", n)], where "i" is the second tag and n is the number of newlines that replace the tag. Extracting according to this rule effectively separates invalid text from the body text, so that the two do not become mixed and hard to split.
Preferably, the queue conversion module includes:
a first loop module, for looping over queue T and denoting the current element Tc; if the number of valid characters in Tc is less than the text threshold, the text of Tc is added to queue B and traversal of queue T continues; if the number of valid characters in Tc is greater than the text threshold, Tc is marked as the current valid text Tcv and a temporary text block Temp is created whose value is the text of Tcv;
a second loop module, for continuing to traverse queue T from the element after the current valid text Tcv, ignoring space or newline elements, until the next valid text Ncv is found; if the position-index difference in queue T between Ncv and Tcv is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to Tcv;
a queue B forming module, for continuing to traverse queue T from the next valid element Ncv_{i+2} after the next valid text Ncv; if the position-index difference in queue T between Ncv_{i+2} and the current valid text Tcv is greater than the index threshold, a copy of the temporary text block Temp is put into queue B, Ncv_{i+2} is assigned to the current element Tc, and the loop over queue T continues.
With this module, body text contained in separate text fragments is not omitted, which is particularly useful when tags also occur within the body text.
Those skilled in the art will understand that the scope of the invention is not restricted to the examples discussed above, and that changes and modifications may be made to them without departing from the scope of the invention as defined by the appended claims. Although the invention has been illustrated and described in detail in the drawings and the description, such illustration and description are illustrative or exemplary only, and not restrictive. The invention is not limited to the disclosed embodiments.
From a study of the drawings, the specification, and the claims, those skilled in the art can understand and implement variations of the disclosed embodiments when practising the invention. In the claims, the term "comprising" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not mean that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope of the invention.

Claims (7)

1. A webpage body text extraction method based on aggregated text density, comprising the following steps:
Step S100: obtain the HTML source file text of a webpage, delete the worthless first tags, and reject the special characters in the text to obtain a sample text;
Step S200: replace all second tags in the sample text with newlines to generate newline-separated text, and convert the newline-separated text into queue T, in which every two adjacent texts are separated by newline characters;
Step S300: split the queue T into multiple subqueues according to a text threshold and an index threshold, merge all text in each subqueue into one text block, and form queue B from the multiple text blocks;
Step S400: select the text block with the greatest text length from the queue B as the webpage body text;
wherein the index threshold is a preset number of newlines between any two of the subqueues, and the text threshold is a preset number of text characters contained in a subqueue.
2. The webpage body text extraction method based on aggregated text density according to claim 1, characterized in that in step S200 the second tags are replaced using regular expressions, and the substitution rule is R[("i", n)], where "i" is the second tag and n is the number of newlines that replace the tag.
3. The webpage body text extraction method based on aggregated text density according to claim 2, characterized in that step S300 comprises the following steps:
Step S310: loop over queue T and denote the current element Tc; if the number of valid characters in the current element Tc is less than the text threshold, add the text of Tc to the queue B and continue traversing the queue T; if the number of valid characters in the current element Tc is greater than the text threshold, mark Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of the current valid text Tcv;
Step S320: continue traversing the queue T from the element after the current valid text Tcv, ignoring space or newline elements, until the next valid text Ncv is found; if the position-index difference in queue T between the next valid text Ncv and the valid text Tcv is less than the index threshold, append the text of Ncv to the temporary text block Temp and assign Ncv to the valid text Tcv;
Step S330: continue traversing the queue T from the next valid element Ncv_{i+2} after the next valid text Ncv; if the position-index difference in the queue T between Ncv_{i+2} and the current valid text Tcv is greater than the index threshold, put a copy of the temporary text block Temp into the queue B, assign Ncv_{i+2} to the current element Tc, and continue the loop over queue T.
4. The webpage body text extraction method based on aggregated text density according to claim 3, characterized in that the first tags are worthless HTML tags.
5. A webpage body text extraction apparatus based on aggregated text density for use with the method according to any one of claims 1 to 4, characterized by comprising:
a webpage HTML file acquisition module, for obtaining the HTML source file text of a webpage, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text;
a newline segmentation module, for replacing all second tags in the sample text with newlines to generate newline-separated text and converting the newline-separated text into queue T, in which every two adjacent texts are separated by newline characters;
a queue conversion module, for splitting the queue T into multiple subqueues according to a text threshold and an index threshold, merging all text in each subqueue into one text block, and forming queue B from the multiple text blocks;
a body text selection module, for selecting the text block with the greatest text length from the queue B as the webpage body text.
6. The webpage body text extraction apparatus based on aggregated text density according to claim 5, characterized in that the second tags are replaced using regular expressions, and the substitution rule is R[("i", n)], where "i" is the second tag and n is the number of newlines that replace the tag.
7. The webpage body text extraction apparatus based on aggregated text density according to claim 5, characterized in that the queue conversion module includes:
a first loop module, for looping over queue T and denoting the current element Tc; if the number of valid characters in the current element Tc is less than the text threshold, the text of Tc is added to the queue B and traversal of the queue T continues; if the number of valid characters in the current element Tc is greater than the text threshold, Tc is marked as the current valid text Tcv and a temporary text block Temp is created whose value is the text of the current valid text Tcv;
a second loop module, for continuing to traverse the queue T from the element after the current valid text Tcv, ignoring space or newline elements, until the next valid text Ncv is found; if the position-index difference in queue T between the next valid text Ncv and the valid text Tcv is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the valid text Tcv;
a queue B forming module, for continuing to traverse the queue T from the next valid element Ncv_{i+2} after the next valid text Ncv; if the position-index difference in the queue T between Ncv_{i+2} and the current valid text Tcv is greater than the index threshold, a copy of the temporary text block Temp is put into the queue B, Ncv_{i+2} is assigned to the current element Tc, and the loop over queue T continues.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610050995.1A CN105740355B (en) 2016-01-26 2016-01-26 Webpage context extraction method and device based on aggregation text density

Publications (2)

Publication Number Publication Date
CN105740355A (en) 2016-07-06
CN105740355B (en) 2019-03-26

Family

ID=56246654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610050995.1A Active CN105740355B (en) 2016-01-26 2016-01-26 Webpage context extraction method and device based on aggregation text density

Country Status (1)

Country Link
CN (1) CN105740355B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855324A (en) * 2012-09-11 2013-01-02 北京云泓道元信息技术有限公司 Automatic extracting method and device for network information
CN103425765A (en) * 2013-08-06 2013-12-04 优视科技有限公司 Method and device for extracting webpage text and method and system for webpage preview
CN104598577A (en) * 2015-01-14 2015-05-06 晶赞广告(上海)有限公司 Extraction method for webpage text
US20150205769A1 (en) * 2012-06-25 2015-07-23 Beijing Qihoo Technology Company Limited System and method for recognizing non-body text in webpage
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951401A (en) * 2017-03-14 2017-07-14 深圳市茁壮网络股份有限公司 A kind of document text recognition method and device
CN107273491A (en) * 2017-06-15 2017-10-20 华中师范大学 Webpage splitting method, device and electronic equipment
CN107273491B (en) * 2017-06-15 2020-07-24 华中师范大学 Webpage segmentation method and device and electronic equipment
CN107766477A (en) * 2017-09-30 2018-03-06 武汉汉思信息技术有限责任公司 Page structure data extraction method, terminal device and storage medium
CN111563387A (en) * 2019-02-12 2020-08-21 阿里巴巴集团控股有限公司 Sentence similarity determining method and device and sentence translation method and device
CN111563387B (en) * 2019-02-12 2023-05-02 阿里巴巴集团控股有限公司 Sentence similarity determining method and device, sentence translating method and device
CN113537091A (en) * 2021-07-20 2021-10-22 东莞市盟大塑化科技有限公司 Webpage text recognition method and device, electronic equipment and storage medium
CN113537091B (en) * 2021-07-20 2024-05-03 东莞盟大集团有限公司 Webpage text recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN105740355B (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109543126B (en) Webpage text information extraction method based on block character ratio
CN109857956B (en) News webpage key information automatic extraction method based on label and block characteristics
CN105022803B (en) A kind of method and system for extracting Web page text content
US20100030752A1 (en) System, methods and applications for structured document indexing
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
Zheng et al. Template-independent news extraction based on visual consistency
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN103064827A (en) Method and device for extracting webpage content
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN106503211B (en) Method for automatically generating mobile version facing information publishing website
US8205153B2 (en) Information extraction combining spatial and textual layout cues
CN106557565A (en) A kind of text message extracting method based on website construction
CN106021392A (en) News key information extraction method and system
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN110704570A (en) Continuous page layout document structured information extraction method
CN112667940B (en) Webpage text extraction method based on deep learning
CN107145591B (en) Title-based webpage effective metadata content extraction method
Du et al. Managing knowledge on the Web–Extracting ontology from HTML Web
CN109657114B (en) Method for extracting webpage semi-structured data
CN103049536A (en) Webpage main text content extracting method and webpage text content extracting system
CN107590288B (en) Method and device for extracting webpage image-text blocks
CN117312711A (en) Search engine optimization method and system based on AI analysis
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction
CN113392354B (en) Webpage text analysis method, system, medium and electronic equipment
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant