WO2017080090A1 - Web page text extraction and comparison method - Google Patents

Web page text extraction and comparison method

Info

Publication number
WO2017080090A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
text
tags
sub
module
Prior art date
Application number
PCT/CN2015/100180
Other languages
English (en)
French (fr)
Inventor
孙燕群
Original Assignee
孙燕群
Application filed by 孙燕群
Publication of WO2017080090A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986: Document structures and storage, e.g. HTML extensions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562: Bookmark management

Definitions

  • The invention relates to computer network technology, and in particular to a web page text extraction and comparison method.
  • The main web page text extraction methods fall into the following groups: DOM-based methods, statistics-based methods, block-based methods, and other methods.
  • The Document Object Model (DOM) is a standard interface specification developed by the W3C. Because DOM nodes are organized in a tree hierarchy, once the tree has been built, operations on the web page can be expressed as operations on the tree. Although the W3C standard allows any web page structure to be converted into a DOM tree, many real web pages do not follow the standard. DOM-based methods therefore usually need a preprocessing module before the page can be abstracted into a DOM tree.
  • The DOM-based web page text extraction method was originally intended to improve PDA applications and remove advertisement content.
  • The DOM method abstracts the content of the web page into corresponding objects, converts them into nodes, and then organizes the nodes by parent-child relationships to form a tree structure.
  • Web pages from the same website mostly share the same structure.
  • For example, the <body> tag of a Yahoo News page consists of two tags, <iframe> and <div>, so pages built on this template can be grouped into one class.
  • Clustering similar DOM trees requires computing their similarity.
  • The similarity of two simple DOM trees is computed in two steps. The first step checks whether the root nodes of the two trees are the same: if they are not, it returns 0; if they are, it goes on to compare the leaf nodes of the two trees.
  • The second step compares the names and attributes of the leaf nodes of the two DOM trees and returns the number of identical nodes.
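The two-step DOM tree comparison described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` class and the function names are hypothetical, and the sketch assumes that two roots are "the same" when their tag names match and that two leaves are identical when both name and attributes match.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node: tag name, attributes, children."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def leaves(node):
    """Yield every leaf (childless) node in the subtree rooted at `node`."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def dom_similarity(a, b):
    """Return 0 if the roots differ, else the number of matching leaves."""
    # Step 1: different root tags mean the trees are not comparable.
    if a.name != b.name:
        return 0
    # Step 2: count leaves of `a` that have an unused twin among the leaves of `b`.
    remaining = [(n.name, tuple(sorted(n.attrs.items()))) for n in leaves(b)]
    count = 0
    for leaf in leaves(a):
        key = (leaf.name, tuple(sorted(leaf.attrs.items())))
        if key in remaining:
            remaining.remove(key)
            count += 1
    return count
```

Each leaf of the second tree is matched at most once, so repeated identical leaves are not over-counted.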
  • The statistics-based method is mainly used to extract the body of news web pages.
  • It rests on the assumption that the body of a web page can only be located in a <table> tag node.
  • The basic steps are as follows. The first step removes page noise and represents the page as a tree according to its tags. The second step processes each <table> node, removing the HTML tags inside it to obtain a tag-free string. The third step compares the number of characters in each node; the node with the most characters is usually taken as the body of the page.
  • The advantages of this method are that it exploits the characteristics of news pages, generalizes well, is simple to implement, needs neither per-page templates nor sample learning, and has low time complexity.
  • The disadvantage is that the algorithm only works when all the body text is placed in a single <table> node, and it performs poorly on pages whose body spans multiple <table> nodes. With the rise of Weibo, light blogs, and similar services, more and more pages have complex formats and short texts, which makes the limitations of this method more apparent.
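As a rough sketch of the statistics-based baseline described above, the following uses Python's standard `html.parser` to collect the tag-free text of every <table> and pick the longest one. The class and function names are illustrative assumptions, the noise-removal step is omitted, and text in nested tables is attributed to the innermost table only.

```python
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collect the tag-free text of each <table> element."""

    def __init__(self):
        super().__init__()
        self.open_tables = []   # stack of text buffers for currently open tables
        self.table_texts = []   # text of finished tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append([])

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.table_texts.append("".join(self.open_tables.pop()))

    def handle_data(self, data):
        # Text belongs to the innermost open table, if any.
        if self.open_tables:
            self.open_tables[-1].append(data.strip())


def largest_table_text(html):
    """Return the text of the <table> with the most characters ('' if none)."""
    collector = TableTextCollector()
    collector.feed(html)
    collector.close()
    return max(collector.table_texts, key=len, default="")
```

The sketch also makes the stated weakness visible: a short article sitting next to a long navigation or advertisement table would lose the comparison.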
  • The technical problem to be solved by the present invention is to provide a web page text extraction and comparison method based on topic-similarity blocking; results show that the method achieves a considerable improvement in accuracy.
  • The present invention provides a web page text extraction and comparison method, comprising the following steps:
  • Step A: determine whether the web page is a body-text page, based on specific tags of the web page.
  • Step B: identify parallel web pages.
  • Step C: in Chinese web pages the body usually contains Chinese punctuation, while titles contain little or none. A threshold on the number of Chinese punctuation marks is therefore set and used to judge the text in each <p> tag of the page: if the number of Chinese punctuation marks in the text exceeds the threshold, the text can be added to the body. The text of several consecutive <p> tags (with one or two other tags allowed between the <p> tags) is then obtained and, after the same test, added to the body.
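The punctuation test in Step C can be illustrated with a short sketch. The punctuation set, the threshold value of 2, and the function names are assumptions made for the example; the source does not give concrete values. The handling of "one or two other tags" between consecutive <p> tags is left out for brevity.

```python
import re

# Assumed set of full-width Chinese punctuation marks to count.
CHINESE_PUNCTUATION = ",。、;:?!“”‘’()《》"


def is_body_paragraph(text, threshold=2):
    """True if the text contains more Chinese punctuation marks than the threshold."""
    count = sum(text.count(ch) for ch in CHINESE_PUNCTUATION)
    return count > threshold


def extract_body(html, threshold=2):
    """Join the text of all <p> elements that pass the punctuation test."""
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    kept = []
    for raw in paragraphs:
        text = re.sub(r"<[^>]+>", "", raw).strip()  # drop inline tags
        if is_body_paragraph(text, threshold):
            kept.append(text)
    return "\n".join(kept)
```

A regular expression is enough for this illustration; a real page would be better served by a proper HTML parser.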
  • Step A may further comprise the following sub-steps:
  • Step 1: preprocess the web page and construct an HTML tree;
  • Step 2: prune the HTML tree;
  • Step 3: obtain the web page topic;
  • Step 4: extract the string content of each block;
  • Step 5: calculate the distance between the topic S and the content y of a block;
  • Step 6: compare the edit distance L with max(p, q).
  • Step 2 may further include the following sub-step: dividing the page into blocks according to the <table> tag and removing leaf nodes that contain neither text nor link information.
  • Step 5 may further include: performing Chinese word segmentation, with the Levenshtein distance used as shown in formula (2) and formula (3):
  • Step B may further include a feature information extraction sub-step and a support vector machine classification sub-step.
  • The feature information extraction sub-step further includes:
  • Establishing feature information: the feature information includes the HTML tag structure information of the web page and the content-based text length information, sentence count information, and digit sequence information;
  • HTML tags are divided into three classes according to their role in page layout, display, and linking: structure tags, format tags, and irrelevant tags:
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
  • The similarity of the classified HTML tag sequences is calculated using an improved edit distance:
  • The edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
  • Edit operations include replacing one character with another, inserting a character, and deleting a character;
  • Based on the tag classes, the improved edit distance is defined as the minimum operation cost of converting the differently typed tags of one string into those of another string by deletion, insertion, and replacement.
  • The present invention also provides a web page text extraction and comparison system, comprising the following modules:
  • Module A: used to determine whether a web page is a body-text page, based on specific tags of the web page;
  • Module B: used to identify parallel web pages.
  • Module A may further comprise the following sub-modules:
  • Preprocessing sub-module: used to preprocess the web page and construct an HTML tree;
  • Pruning sub-module: used to prune the HTML tree;
  • Topic acquisition sub-module: used to obtain the web page topic;
  • Block extraction sub-module: used to extract the string content of each block;
  • Distance calculation sub-module: used to calculate the distance between the topic S and the content y of a block;
  • Distance comparison sub-module: used to compare the edit distance L with max(p, q).
  • The pruning sub-module may further be used to divide the page into blocks according to the <table> tag and remove leaf nodes that contain neither text nor link information.
  • The distance calculation sub-module may further be used to perform Chinese word segmentation, with the Levenshtein distance used as shown in formula (2) and formula (3):
  • Module B may further include the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module.
  • The feature information extraction sub-module is used to:
  • Establish feature information: the feature information includes the HTML tag structure information of the web page and the content-based text length information, sentence count information, and digit sequence information;
  • HTML tags are divided into three classes according to their role in page layout, display, and linking: structure tags, format tags, and irrelevant tags:
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
  • The similarity of the classified HTML tag sequences is calculated using an improved edit distance:
  • The edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
  • Edit operations include replacing one character with another, inserting a character, and deleting a character;
  • Based on the tag classes, the improved edit distance is defined as the minimum operation cost of converting the differently typed tags of one string into those of another string by deletion, insertion, and replacement.
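  • Compared with the conventional web page blocking algorithm, the web page text extraction and comparison method of the present invention, which is based on topic-similarity blocking, has the following advantages:
  • (1) It can extract pages with short body text. Content length does not affect the correctness of the selection, because every block takes part in the calculation regardless of length and none is ignored.
  • (2) It handles complex pages with nested <table> tags. Because an HTML tree is built, every <table> tag is processed consistently.
  • (3) It reduces the amount of computation. No cluster analysis is needed (clustering is very time consuming) and the entropy of blocks need not be calculated; the judgment is made by analyzing the page itself.
  • (4) It adds a degree of semantic information. Because the semantic information of the title tags and the body is used effectively, the extracted body is more semantically relevant.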
  • The "topic" in the topic-similarity-blocking method of the invention refers to the title and tags of the web page.
  • To avoid ignoring short-text blocks on the mobile Internet, the algorithm does not calculate the entropy of content blocks; it mainly uses the similarity between the topic and each content block as the criterion for extracting blocks.
  • It mainly relies on the following features of web pages:
  • The web page format has a tree structure.
  • Web page tags are usually nested in pairs, so a page can be converted into an HTML tree structure. DOM-based web page text extraction methods in fact also take advantage of this feature.
  • The method of the present invention builds the HTML tree mainly to cut off useless branches and reduce the amount of calculation.
  • Web pages are usually laid out in blocks.
  • Each web page basically includes the following blocks: a classification block, a navigation block, a body-text block, a related-links block, and an advertisement block.
  • Because web page tags are usually nested in pairs, tags can be used to divide a page into blocks.
  • The <table></table> tag has good layout properties, and most web pages today use the <table> tag to lay out the page as finally presented to the user.
  • The web page text extraction method builds on this and uses the <table> tag to parse the page.
  • The topic and the content are related.
  • A web page usually has a title and a number of tags that summarize the body at a high level, so the topic reflects the characteristics of the body and represents the key content of the page. Earlier web page extraction methods did not take this into account.
  • The method of the present invention uses the relationship between the topic and the body as an important indicator for text extraction. On the mobile Internet in particular, page structures are increasingly diverse, content length varies, there is much interfering advertisement information, and short body text is easily drowned out by advertisements, so the similarity between the topic and the page content must be considered during extraction.
  • The similarity measure used in the present invention is the edit distance (i.e., the Levenshtein distance).
  • The Levenshtein distance is the minimum number of insertions, deletions, and substitutions required to convert the original string a into the target string b.
  • The Levenshtein formula is shown in equation (1):
  • where a and b are strings, i is the length of string a, and j is the length of string b.
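The image of equation (1) is not reproduced in this text. For reference, the standard Levenshtein recurrence, which matches the description above, can be written as follows; the notation lev and the indicator bracket are the usual textbook conventions rather than symbols taken from the source. Here i and j range over prefix lengths, and the distance between the full strings is the value at i = |a|, j = |b|.

```latex
\[
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
  \min\left\{
    \begin{array}{l}
      \operatorname{lev}_{a,b}(i-1,j) + 1,\\
      \operatorname{lev}_{a,b}(i,j-1) + 1,\\
      \operatorname{lev}_{a,b}(i-1,j-1) + \mathbf{1}[a_i \neq b_j]
    \end{array}
  \right. & \text{otherwise.}
\end{cases}
\]
```

The three inner terms correspond to deletion, insertion, and substitution; the substitution term adds nothing when the two characters match.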
  • The basic idea of the web page text extraction method based on topic-similarity blocking is as follows: convert the web page into an HTML tree structure; extract the topic of the web page; extract content blocks using the web page tags; and compute the Levenshtein edit distance L between the topic and each content block.
  • When the distance L is smaller than the length p of a content block, the block is regarded as body content of the web page. When the distance L is greater than or equal to the length of the content block, the block is ignored.
  • The algorithm of the present invention includes three main steps: constructing an HTML tree, extracting the web page topic, and calculating the similarity between the topic and each block.
  • The basic steps of the algorithm are as follows:
  • Step 1: preprocess the web page and construct an HTML tree. The page is normalized and finally mapped to a tree structure, subject to the following checks:
  • Each start tag has a corresponding end tag, for example <body> with </body> and <head> with </head>.
  • Tags are nested correctly, for example <a><b></b></a>. Only correctly nested tags can be traversed correctly.
  • Step 2: prune the HTML tree. Because blocks are split according to the <table> tag, some leaf nodes contain neither text nor link information; these useless branches are removed to reduce the amount of computation.
  • Step 3: obtain the web page topic. Get the content of the page title, the headings at each level (<h1> to <h6>), and the <meta> tags. For Chinese pages, the ICTCLAS word segmentation system developed by the Chinese Academy of Sciences can be used to segment this text; function words, stop words, and the like are then removed, leaving a sequence Stitle that contains only content words.
  • Step 4: extract the string content of each block. The leaf-level subtree of the HTML tree, that is, the subtree corresponding to the innermost <table> tag, is merged into one block, the HTML markup in the block is removed, and the string content y of the block is obtained.
  • Step 5: calculate the distance between the topic S and the content y of a block. For Chinese, word segmentation is required, again using the Chinese Academy of Sciences word segmentation system from Step 3.
  • The Levenshtein distance specifically used in the present invention is shown in formulas (2) and (3):
  • Step 6: compare the edit distance L with max(p, q). If L < max(p, q), the block is body information and is extracted; otherwise it is treated as interference and ignored. The body information of the web page is obtained in the end.
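Steps 5 and 6 can be sketched as below. This is an illustration under stated assumptions: it uses the plain Levenshtein distance over word sequences rather than the specific variant of formulas (2) and (3), which are not reproduced in this text, and it expects the topic and block to be already segmented into word lists (the source uses ICTCLAS for that). The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_body_block(topic_words, block_words):
    """Step 6: keep the block when L < max(p, q)."""
    p, q = len(topic_words), len(block_words)
    return levenshtein(topic_words, block_words) < max(p, q)
```

Since the edit distance can never exceed max(p, q), the test rejects exactly those blocks for which no alignment with the topic saves a single edit.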
  • The web page text extraction and comparison method of the present invention further includes the identification of parallel web pages.
  • Parallel web page identification mainly comprises two parts: feature information extraction and support vector machine classification.
  • The feature information mainly includes the HTML tag structure information of the web page and the content-based text length information, sentence count information, and digit sequence information.
  • HTML tags are divided into three classes, structure tags, format tags, and irrelevant tags, according to their function in page layout, display, and linking:
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are deleted when calculating structural symmetry.
  • The similarity of the classified HTML tag sequences is calculated using the improved edit distance.
  • The edit distance between two strings is the minimum number of edit operations required to convert one string into the other.
  • Edit operations consist of replacing one character with another, inserting a character, and deleting a character.
  • Based on the tag classes, the improved edit distance is defined as the minimum operation cost of converting the differently typed tags of one string into those of another string by deletion, insertion, and replacement.
  • The cost of a delete or insert operation is 1, the cost of a replacement within a class is 0, and the cost of a replacement between classes is 1.5, that is:
  • The lower-right element M[A, B] is the improved edit distance of S1 and S2, from which the tag structure information Dt is obtained:
  • The improved edit distance matrix is shown in Table 1.
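A minimal sketch of the improved edit distance with the costs stated above (insert or delete 1, in-class replacement 0, between-class replacement 1.5) might look like this. The function names are hypothetical, the tag sets repeat the lists above with "table" and "dfn" assumed for two garbled entries, and any tag not listed as a structure or format tag is treated as irrelevant and dropped, which is a simplifying assumption. The formula for Dt is not reproduced in the text, so the sketch stops at M[A, B].

```python
STRUCTURE = set("blockquote body dir div dt h head hr li menu p q table "
                "tbody td tfoot th thead tr ul".split())
FORMAT = set("abbr acronym b big center cite code dfn em font i pre s small "
             "span strike strong style sub sup tt u".split())


def classify(tags):
    """Map tags to class labels 'S' or 'F'; irrelevant tags are dropped."""
    classes = []
    for tag in tags:
        if tag in STRUCTURE:
            classes.append("S")
        elif tag in FORMAT:
            classes.append("F")
    return classes


def improved_edit_distance(tags1, tags2):
    """Return M[A][B], the class-aware edit distance of two tag sequences."""
    s1, s2 = classify(tags1), classify(tags2)
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = float(i)          # delete every tag of s1
    for j in range(cols):
        m[0][j] = float(j)          # insert every tag of s2
    for i in range(1, rows):
        for j in range(1, cols):
            replace = 0.0 if s1[i - 1] == s2[j - 1] else 1.5
            m[i][j] = min(m[i - 1][j] + 1,            # deletion
                          m[i][j - 1] + 1,            # insertion
                          m[i - 1][j - 1] + replace)  # replacement
    return m[-1][-1]
```

Because a between-class replacement (1.5) is cheaper than a deletion plus an insertion (2), the replacement branch is actually used when two pages differ in one tag class.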
  • The content surface features are information that is directly related to the content but is not vocabulary. They mainly include the sentence count information, text length information, and digit sequence information of the text pair, and are calculated as follows:
  • The matching relationship matrix C is used to build the maximum matching length matrix D of the strings; the element D[i, j] is calculated as follows:
  • The element D[0, 0] of the final matrix D is the maximum matching length Z.
  • The calculated matching relationship matrix C is shown in Table 2.
  • The web page text extraction and comparison method of the present invention uses a support vector machine (SVM) for classification.
  • The SVM algorithm is an implementation of statistical learning theory.
  • It is based on Vapnik-Chervonenkis (VC) dimension theory and the principle of structural risk minimization.
  • By introducing a kernel function, the sample vectors are mapped to a high-dimensional feature space, in which the optimal separating surface is constructed to obtain a linear optimal decision function.
  • The advantage of the SVM is that the kernel function addresses the dimensionality problem, decoupling the computational complexity of the learning algorithm from the dimension of the samples.
  • Here sgn[.] is the sign function, the non-negative variable αi is a Lagrange multiplier, and b is the offset value of the hyperplane.
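The decision function itself is not reproduced in this text. The standard kernel SVM decision function, which is consistent with the symbols explained above, is shown below; the training labels y_i, the training samples x_i, and the kernel K are standard notation assumed here rather than symbols taken from the source.

```latex
\[
f(x) = \operatorname{sgn}\!\left[\, \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \,\right]
\]
```

Only the support vectors have non-zero multipliers, so the sum in practice runs over a small subset of the training samples.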
  • Web pages within two levels of the locally mirrored path are selected from the preprocessed source-language and target-language documents to form candidate parallel web page pairs.
  • Dt reflects the web page structure information and is extracted from the preprocessed web page; Di, Ds, and Dn reflect the web page content information and are extracted from the web page body.
  • A web page text extraction and comparison method that includes bilingual sentence alignment is also provided.
  • The bilingual sentence alignment step is as follows: after the document-level bilingual parallel web pages are obtained, their body text is extracted and split into sentences to form sentence pairs (Si, Tj), which are the candidates for alignment.
  • C and B are {c1, c2, ..., cn} and {b1, b2, ..., bn}, respectively, where ci and bi are words after word segmentation. Assuming that there are K pairs of words that are translations of each other, the similarity of (Si, Tj) is:
  • stf(cm, bm) is the number of occurrences of the mutually translated words in the sentence pair;
  • the two sentence-count terms are the numbers of sentences in the source-language Si and the target-language Tj, respectively;
  • idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text; the two length terms are the lengths of the sentences in the source-language Si and the target-language Tj, respectively;
  • one penalty factor penalizes different alignment modes to different degrees, to prevent the algorithm from merging too many sentences together; the other is a penalty factor determined by length.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

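A web page text extraction and comparison method, comprising the following steps: Step A: determine whether a web page is a body-text page, based on specific tags of the web page; Step B: identify parallel web pages. Step A further comprises the following sub-steps: Step 1: preprocess the web page and construct an HTML tree; Step 2: prune the HTML tree; Step 3: obtain the web page topic; Step 4: extract the string content of each block; Step 5: calculate the distance between the topic S and the content y of a block; Step 6: compare the edit distance L with max(p, q). The method has the following advantages: it can extract pages with short body text, and content length does not affect the correctness of the selection, because every block takes part in the calculation regardless of length and none is ignored. For complex pages with nested <table> tags, it ensures that every <table> tag is processed consistently.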

Description

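Web page text extraction and comparison method
Technical field
The present invention relates to computer network technology, and in particular to a web page text extraction and comparison method.
Background art
There are many web page text extraction methods, some designed specifically for review pages or news pages, but the present invention concerns text extraction for most general web pages. Overall, the main web page text extraction methods at present fall into the following directions: DOM-based methods, statistics-based methods, block-based methods, and other methods.
The Document Object Model (DOM) is a standard interface specification defined by the W3C. Because DOM nodes are organized in a tree hierarchy, once the tree structure has been built, operations on the web page can be converted into operations on the tree. Although any web page structure can, under the W3C standard, be converted into a DOM tree, many web pages do not in fact follow the standard. DOM methods therefore usually need a preprocessing module to abstract the page into a DOM tree.
1. DOM-based web page text extraction
The DOM-based web page text extraction method was originally intended to improve PDA applications and remove advertisement content. The DOM method first abstracts the page content into corresponding objects and converts them into nodes; it then organizes the nodes by parent-child relationships to form a tree structure.
Web pages from the same website mostly share the same structure. For example, the <body> tag of Yahoo News pages consists of the two tags <iframe> and <div>, so pages built on such a template can be clustered into one class. Clustering similar DOM trees requires computing similarity. The similarity of two simple DOM trees is computed as follows: the first step checks whether the root nodes of the two trees are the same; if they are not, it returns 0; if they are, it goes on to compare the leaf nodes of the two trees. The second step compares the names and attributes of the leaf nodes of the two DOM trees and returns the number of identical nodes in the two trees.
2. Statistics-based web page text extraction
The statistics-based method is mainly used to extract the body of news web pages. It rests on the assumption that the body of a web page can only be located in a <table> tag node. The basic steps are: the first step removes page noise and represents the page as a tree according to its tags; the second step processes each <table> node, removing the HTML tags inside it to obtain a tag-free string; the third step compares the number of characters in each node, and the node with the most characters is usually taken as the body of the page. The advantages of this method are that it exploits the characteristics of news pages, generalizes well, is simple to implement, needs neither per-page templates nor sample learning, and has low time complexity. The disadvantage is that the algorithm only works when all the body text is placed in a single <table> node, and it performs poorly on pages whose body spans multiple <table> nodes. With the rise of Weibo, light blogs, and similar services, more and more pages with complex formats and short texts are being produced, which makes the limitations of this method more apparent.
Extraction and comparison results of existing methods: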
Figure PCTCN2015100180-appb-000001
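Overall, current algorithms for web page text extraction and web page similarity calculation still mainly target traditional Internet pages. Neither line of research has seriously considered the new characteristics of mobile Internet page content, which shows in the following shortcomings:
(1) Mobile Internet page structures are becoming more complex and involve more and more emerging technologies, so the limitations of the traditional web page text extraction algorithms described in Section 2.2 are increasingly apparent.
(2) Because there are so many short-text pages, the theoretical basis of some of the text similarity algorithms described in Section 2.3 no longer holds; their accuracy drops, and they can no longer meet the needs of large-scale data use.
Summary of the invention
The technical problem to be solved by the present invention is to provide a web page text extraction and comparison method based on topic-similarity blocking; results show that the method achieves a considerable improvement in accuracy.
To solve the above technical problem, the present invention provides a web page text extraction and comparison method, comprising the following steps:
Step A: determine whether the web page is a body-text page, based on specific tags of the web page;
Step B: identify parallel web pages.
Step C: in Chinese web pages the body usually contains Chinese punctuation, while titles contain little or none. A threshold on the number of Chinese punctuation marks is therefore set and used to judge the text in each <p> tag of the page: if the number of Chinese punctuation marks in the text exceeds the threshold, the text can be added to the body. The text of several consecutive <p> tags (with one or two other tags allowed between the <p> tags) is then obtained and, after the same test, added to the body.
Step A may further comprise the following sub-steps:
Step 1: preprocess the web page and construct an HTML tree;
Step 2: prune the HTML tree;
Step 3: obtain the web page topic;
Step 4: extract the string content of each block;
Step 5: calculate the distance between the topic S and the content y of a block;
Step 6: compare the edit distance L with max(p, q).
Step 2 may further include the following sub-step: dividing the page into blocks according to the <table> tag and removing leaf nodes that contain neither text nor link information.
Step 5 may further include: performing Chinese word segmentation, with the Levenshtein distance used as shown in formula (2) and formula (3):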
Figure PCTCN2015100180-appb-000002
Figure PCTCN2015100180-appb-000003
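Step B may further include a feature information extraction sub-step and a support vector machine classification sub-step.
The feature information extraction sub-step further includes:
Establishing feature information: the feature information includes the HTML tag structure information of the web page and the content-based text length information, sentence count information, and digit sequence information;
HTML tags are divided into three classes according to their role in page layout, display, and linking: structure tags, format tags, and irrelevant tags:
Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
The similarity of the classified HTML tag sequences is calculated using an improved edit distance:
The edit distance between two strings is the minimum number of edit operations required to convert one string into the other;
Edit operations include replacing one character with another, inserting a character, and deleting a character;
Based on the tag classes, the improved edit distance is defined as the minimum operation cost of converting the differently typed tags of one string into those of another string by deletion, insertion, and replacement.
To solve the above technical problem, the present invention also provides a web page text extraction and comparison system, comprising the following modules:
Module A: used to determine whether a web page is a body-text page, based on specific tags of the web page;
Module B: used to identify parallel web pages.
Module A may further comprise the following sub-modules:
Preprocessing sub-module: used to preprocess the web page and construct an HTML tree;
Pruning sub-module: used to prune the HTML tree;
Topic acquisition sub-module: used to obtain the web page topic;
Block extraction sub-module: used to extract the string content of each block;
Distance calculation sub-module: used to calculate the distance between the topic S and the content y of a block;
Distance comparison sub-module: used to compare the edit distance L with max(p, q).
The pruning sub-module may further be used to divide the page into blocks according to the <table> tag and remove leaf nodes that contain neither text nor link information.
The distance calculation sub-module may further be used to perform Chinese word segmentation, with the Levenshtein distance used as shown in formula (2) and formula (3):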
Figure PCTCN2015100180-appb-000004
Figure PCTCN2015100180-appb-000005
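Module B may further comprise the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module;
The feature information extraction sub-module is used for:
Establishing feature information: the feature information comprises HTML tag structure information of the webpage and content-based text length information, sentence count information, and numeric sequence information;
Dividing HTML tags, according to their layout, display, and linking functions in the webpage, into three classes, namely structural tags, format tags, and irrelevant tags:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are removed when computing structural symmetry.
Computing the similarity of the classified HTML tag sequences with a modified edit distance:
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
The edit operations are replacing one character with another, inserting a character, and deleting a character;
Based on the tag classes, the modified edit distance is defined as the minimum operation cost of converting the differently classed tags of one string into those of another string by deletion, insertion, and substitution.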
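The beneficial technical effects of the invention are as follows. Comparing traditional webpage blocking algorithms with the topic-similarity-blocking body text extraction method of the invention, the latter has the following advantages:
(1) It can extract pages with short body text. The length of the content does not affect the correctness of the selection, because every block takes part in the computation regardless of its length and none is ignored.
(2) It can handle complex pages with nested <table> tags. Because an HTML tree is built, every <table> tag is guaranteed consistent treatment.
(3) It reduces computation. No cluster analysis is needed (clustering is very time-consuming) and the entropy of blocks is not computed; analyzing the page itself is enough to reach a decision.
(4) It adds a degree of semantic information. Because the semantic information linking the title tags and the body is used effectively, the extracted body text is more semantically relevant.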
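Detailed Description
The embodiments of the invention are described in detail below with reference to examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and put into practice.
In the topic-similarity-blocking body text extraction and comparison method of the invention, the "topic" means the title and tags of the webpage. To keep short-text blocks on mobile Internet pages from being ignored, the algorithm does not compute the entropy of content blocks; instead it mainly uses the similarity between the topic and a content block as the criterion for extracting the block. Specifically, it relies on the following characteristics of webpages:
First, webpage markup has a tree structure. More and more pages are built to XML standards, and tags usually appear nested and in pairs, so a page can be converted into an HTML tree; DOM-based body text extraction in fact exploits the same property. In the method of the invention the HTML tree is built mainly so that useless branches can be pruned, which reduces computation.
Second, webpages are usually laid out in blocks. Although mobile Internet page formats are complex, in terms of content each page basically includes the following blocks: a category block, a navigation block, a body block, a related links block, an advertising block, and so on. Given this property, and given that tags usually appear nested and in pairs, tags can be used to partition the page into blocks. In practice, owing to the widespread use of DIV+CSS and the good layout properties of the <table></table> tag, most pages today are laid out with <table> tags when finally presented to the user. The topic-similarity-blocking method takes this as its basis and parses pages by <table> tags.
Third, topic and content are related. A page usually has a title and several tags that summarize the body at a high level, so the topic best reflects the features of the body and represents the key content of the page. Previous body text extraction methods did not take this into account. The method of the invention uses the relationship between topic and body as a key indicator for extraction. In particular, as mobile Internet page structures become more diverse, content length varies, and advertising noise is heavy, short-text content is easily drowned out by advertising, so taking topic-content similarity into account during extraction is essential. The similarity measure used by the invention is the edit distance (Levenshtein distance), that is, the minimum number of insertions, deletions, and substitutions needed to transform the source string (a) into the target string (b). The Levenshtein formula is given in formula (1):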
Figure PCTCN2015100180-appb-000006
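Note: a and b are strings, i is the length of string a, and j is the length of string b. Building on the three points above, the basic idea of the topic-similarity-blocking body text extraction method is as follows: convert the page into an HTML tree; extract the page topic; extract content blocks using the page tags; compute the edit distance (Levenshtein distance) L between the topic and each content block. When L is less than the length p of the content block, the block is taken to be body content and is extracted; when L is greater than or equal to the length of a content block, that content is ignored.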
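Formula (1) survives here only as a figure reference. For orientation, the standard textbook recurrence for the Levenshtein distance is given below; it is presumably what the figure shows, but that is an assumption, and the notation lev is introduced here only for illustration.

```latex
\mathrm{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
  \min\left\{
    \begin{array}{l}
      \mathrm{lev}_{a,b}(i-1,j)+1\\
      \mathrm{lev}_{a,b}(i,j-1)+1\\
      \mathrm{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}
    \end{array}
  \right. & \text{otherwise.}
\end{cases}
```

Here the indicator term is 0 when the i-th character of a equals the j-th character of b and 1 otherwise, and the distance between the full strings is the value at i = |a|, j = |b|.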
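In yet another embodiment, following the basic idea of the topic-similarity-blocking body text extraction method, the algorithm comprises three main steps: constructing the HTML tree, extracting the page topic, and computing the similarity between the topic and each block. In addition, because webpages are semi-structured, preprocessing is required, and to reduce computation the constructed tree is pruned. Specifically, the basic steps of the algorithm are as follows:
Step 1: preprocess the webpage and construct the HTML tree. The page is normalized and finally mapped into a tree structure, through the following sub-steps:
(1) Any "<" and ">" appearing outside the <table>-related tags of the page are replaced with &lt; and &gt;, and the closing markers that are missing because of non-standard markup, such as those for <li> and <hr>, are added.
(2) All tag attribute values are placed in quotation marks, e.g. <a href="www.hust.edu.cn">.
(3) Tags are matched in pairs: every start tag has a corresponding end tag, e.g. <body> with </body> and <head> with </head>.
(4) Tags are nested correctly, e.g. <a>...<b>...</b>...</a>. Only correctly nested tags can be processed correctly by iteration.
(5) Useless markup such as form and img is removed. Using the normalized tag information, the HTML tree for the page is built recursively.
Step 2: prune the HTML tree. Because blocks are formed by <table> tags, some leaf nodes contain neither text nor link information; these useless branches are removed to reduce computation.
Step 3: obtain the page topic. The page Title, the headings <h1> to <h6>, and the content of the <meta> tags are collected. For Chinese, the ICTCLAS word segmentation system from the Chinese Academy of Sciences can be used to segment this content; function words, stop words, and the like are then removed, leaving a sequence Stitle that contains only content words.
Step 4: extract the string content of each block. The leaf nodes of the HTML tree, that is, the subtree of each innermost <table> tag, are merged into one block; the HTML markup inside the block is removed, giving the string content Y of the block.
Step 5: compute the distance between the topic S and the content y of a block. Chinese text must first be segmented into words, again with the Chinese Academy of Sciences segmentation system used in Step 3. The Levenshtein distance used in the invention is given by formulas (2) and (3):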
Figure PCTCN2015100180-appb-000011
Figure PCTCN2015100180-appb-000012
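Step 6: compare the edit distance L with max(p,q). If L<max(p,q), the block contains body text and is extracted; otherwise it is identified as noise and ignored. The result is the body text of the page.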
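Steps 5 and 6 can be sketched as below. Because formulas (2) and (3) are available only as figure references, the sketch falls back on the plain Levenshtein distance over token sequences and assumes that p and q are the lengths of the topic and block token sequences; the function names are hypothetical, and word segmentation is assumed to have been done already.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def extract_body(topic_tokens, blocks):
    """Keep the blocks whose distance L to the topic is below max(p, q)."""
    p = len(topic_tokens)  # assumption: p is the topic length
    kept = []
    for tokens in blocks:
        q = len(tokens)    # assumption: q is the block length
        if levenshtein(topic_tokens, tokens) < max(p, q):
            kept.append(tokens)
    return kept
```

With the topic ["网页", "正文", "提取"], a block that shares no token with it has L equal to max(p,q) and is discarded, while a block that contains the topic words has a smaller L and is kept.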
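In addition, the webpage body text extraction and comparison method of the invention includes the identification of parallel webpages.
Parallel webpage identification in the invention consists of two parts: feature information extraction and support vector machine classification.
1. Feature information extraction
The feature information mainly comprises HTML tag structure information of the page and content-based text length information, sentence count information, and numeric sequence information.
(1) Tag structure features
The main content of bilingual parallel webpages is mutually translated, but the presentation of the pages often differs considerably. To avoid wrongly excluding parallel pages because of such differences in form, and to increase the alignment similarity of structural tags between parallel pages, HTML tags are divided by function in the page (layout, display, linking, and so on) into three classes, namely structural tags, format tags, and irrelevant tags:
Structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.;
Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.;
Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are removed when computing structural symmetry.
The similarity of the classified HTML tag sequences is computed with a modified edit distance.
The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other, where the edit operations are replacing one character with another, inserting a character, and deleting a character. Based on the tag classes, the modified edit distance is defined as the minimum operation cost of converting the differently classed tags of one string into those of another string by deletion, insertion, and substitution. Deletion and insertion each cost 1, substitution within a class costs 0, and substitution between classes costs 1.5, that is:
Insertion: Ct(t)=1;
Deletion: Cd(t)=1;
Substitution: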
Figure PCTCN2015100180-appb-000013
For the HTML tag sequences W=[w0,w1,…wa,…wA] and Z=[z0,z1,…zb,…zB], the modified edit distance matrix M is computed by dynamic programming; the element M[a,b] is computed as:
Figure PCTCN2015100180-appb-000014
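The bottom-right element M[A,B] of the matrix is the modified edit distance between W and Z, and the tag structure information Dt is:
Dt=M[A,B]/Max(A+1,B+1)
For example, for the HTML tag sequences W=[div, style, style, div, style, style, p, p, div, div] and Z=[div, table, tr, td, span, span, td, tr, table, div], the modified edit distance matrix is shown in Table 1; the modified edit distance is 3 and the tag structure information Dt=0.3.
Table 1: Modified edit distance matrix M of W and Z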
Figure PCTCN2015100180-appb-000015
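The modified edit distance and the feature Dt can be sketched as follows, using the stated costs (insertion and deletion 1, substitution within a class 0, substitution between classes 1.5). The boundary conditions of the matrix are the usual ones for edit distance and are an assumption, since the formula for M[a,b] is available only as a figure reference; the function names are hypothetical, and the tag sets repeat the lists given in the text.

```python
STRUCTURAL = {"blockquote", "body", "dir", "div", "dt", "h", "head", "hr",
              "li", "menu", "p", "q", "table", "tbody", "td", "tfoot",
              "th", "thead", "tr", "ul"}
FORMAT = {"abbr", "acronym", "b", "big", "center", "cite", "code", "dfn",
          "em", "font", "i", "pre", "s", "small", "span", "strike",
          "strong", "style", "sub", "sup", "tt", "u"}


def tag_class(tag):
    """Map a tag name to its class; anything unlisted is irrelevant."""
    if tag in STRUCTURAL:
        return "structural"
    if tag in FORMAT:
        return "format"
    return "irrelevant"


def modified_edit_distance(w, z):
    """Class-aware edit distance between two lists of tag names."""
    rows, cols = len(w) + 1, len(z) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for a in range(rows):
        m[a][0] = float(a)          # delete a tags
    for b in range(cols):
        m[0][b] = float(b)          # insert b tags
    for a in range(1, rows):
        for b in range(1, cols):
            same = tag_class(w[a - 1]) == tag_class(z[b - 1])
            m[a][b] = min(m[a - 1][b] + 1.0,
                          m[a][b - 1] + 1.0,
                          m[a - 1][b - 1] + (0.0 if same else 1.5))
    return m[-1][-1]


def tag_structure_feature(w, z):
    """Dt: distance normalised by the longer sequence length."""
    w = [t for t in w if tag_class(t) != "irrelevant"]
    z = [t for t in z if tag_class(t) != "irrelevant"]
    longest = max(len(w), len(z))
    return modified_edit_distance(w, z) / longest if longest else 0.0
```

Applied to the example sequences W and Z in the text, the sketch gives a distance of 3.0 (two cross-class substitutions at 1.5 each) and Dt = 0.3, in agreement with the stated values.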
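(2) Surface features of content
To reduce dependence on bilingual dictionaries, surface content features are defined as information that is directly related to the content but does not involve lexical translation. They mainly comprise the sentence count information, text length information, and numeric sequence information of a text pair, computed as follows:
1) Sentence count information Ds:
Ds=Min(SS,ST)/Max(SS,ST)
2) Text length information Di:
Di=|LS-LT|/Max(LS,LT)
3) Numeric sequence information Dn:
Dn=1-Z/Max(m,n)
Here m and n are the numbers of numerals appearing in the source-language text and the target-language text respectively, and Z is the maximum match length, computed in detail as follows:
Suppose the numeric sequences extracted from the source-language and target-language text pair are X=[x1,x2,…,xi,…,xm] and Y=[y1,y2,…,yj,…,yn]. From these an m*n matching matrix C is built, with element c[i,j]: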
Figure PCTCN2015100180-appb-000016
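Using the matrix C, the maximum match length matrix D is built. The element D[i,j] is computed according to the following rules:
a. The iteration runs from right to left and from bottom to top.
b. The element D[i,j] is:
D[i,j]=Max(C[i,j]+D[i+1,j+1],D[i,j+1],D[i+1,j])
The element D[0,0] finally produced in the matrix D is the maximum match length Z.
To illustrate the computation of the co-occurring numeric sequence information, take the numeric sequences X=[4,5,34,5,2,45,8,12] and Y=[4,7,34,8,78,9,5,2,12]. The resulting matching matrix C is shown in Table 2 and the maximum match matrix D in Table 3, giving a maximum match length Z of 5 and a numeric sequence information value Dn of 1-5/9=0.44.
Table 2: Matching matrix C of X and Y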
Figure PCTCN2015100180-appb-000017
Table 3: Maximum match matrix D of X and Y
Figure PCTCN2015100180-appb-000018
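The surface features can be sketched as follows. The maximum match length is computed as a longest common subsequence filled from the bottom-right corner, which is how the description of matrix D reads; the entry C[i,j] is assumed to be 1 when x_i equals y_j and 0 otherwise, since its formula is available only as a figure reference. Function names are hypothetical.

```python
def max_match_length(x, y):
    """Z: longest common subsequence length, filled from bottom-right."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]   # extra zero row and column
    for i in range(m - 1, -1, -1):              # bottom to top
        for j in range(n - 1, -1, -1):          # right to left
            c = 1 if x[i] == y[j] else 0        # assumed meaning of C[i,j]
            d[i][j] = max(c + d[i + 1][j + 1], d[i][j + 1], d[i + 1][j])
    return d[0][0]


def numeric_sequence_feature(x, y):
    """Dn = 1 - Z / Max(m, n); 0.0 when neither text has numerals."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0
    return 1 - max_match_length(x, y) / longest


def sentence_count_feature(ss, st):
    """Ds = Min(SS, ST) / Max(SS, ST) for positive sentence counts."""
    return min(ss, st) / max(ss, st)


def text_length_feature(ls, lt):
    """Text length feature |LS - LT| / Max(LS, LT) for positive lengths."""
    return abs(ls - lt) / max(ls, lt)
```

For the example sequences X and Y in the text, the common subsequence 4, 34, 5, 2, 12 gives Z = 5 and Dn = 1 - 5/9, roughly 0.44, in agreement with the stated values.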
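The webpage body text extraction and comparison method of the invention uses the support vector machine (SVM) classification algorithm. SVM is one implementation of statistical learning theory. It is built on the VC dimension (Vapnik-Chervonenkis dimension) theory of statistical learning and the structural risk minimization principle: by introducing a kernel function, sample vectors are mapped into a high-dimensional feature space, in which an optimal separating hyperplane is constructed to obtain a linear optimal decision function. The advantage of SVM is that the kernel function neatly handles the dimensionality problem, so that the computational complexity of the learning algorithm is not tied directly to the dimension of the samples.
Let {(xi,yi), i=1,…,S} be the SVM training set of S data points, where xi∈Rn and yi∈{-1,1}. The optimal decision function is: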
Figure PCTCN2015100180-appb-000019
Here Sgn[.] is the sign function, the non-negative variables αi are the Lagrange multipliers, and b is the bias of the hyperplane.
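The decision function itself survives here only as a figure reference. For orientation, the standard kernel SVM decision function that matches the symbols just defined is shown below; it is presumably what the figure contains, but that is an assumption, and the kernel symbol K is introduced here for illustration.

```latex
f(x) = \operatorname{Sgn}\!\left[\, \sum_{i=1}^{S} \alpha_i \, y_i \, K(x_i, x) + b \,\right]
```

Only the training points with non-zero multipliers (the support vectors) contribute to the sum, and K(x_i, x) is the kernel that implicitly maps the samples into the high-dimensional feature space.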
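From the preprocessed source-language and target-language documents, pages whose mirrored local paths differ by no more than two levels are selected to form candidate parallel page pairs. For each page pair, the HTML tag sequence information Dt, text length information Di, sentence count information Ds, and numeric sequence information Dn are computed to form the feature vector xi∈Rn (n=4) of the SVM classifier. Dt reflects the structure of the page and is extracted from the preprocessed page; Di, Ds, and Dn reflect the content of the page and are extracted from the page body text.
The SVM is trained on a training set made up of known parallel page pairs and non-parallel page pairs, and is then used to decide whether a page pair of unknown class is parallel. A result yi=1 means that the page pair is parallel; yi=-1 means that it is not.
In a further embodiment of the invention, a webpage body text extraction and comparison method that includes bilingual sentence alignment is also provided.
The bilingual sentence alignment step of the method is as follows. Once document-level bilingual parallel webpage documents have been obtained, suppose that body text extraction and sentence splitting of the parallel pages yield sentence pairs (Si,Tj), with candidate sentence alignments C and B being {c1,c2,…,cn} and {b1,b2,…,bn} respectively, where ci and bi are words after segmentation. Assuming there are K pairs of mutually translated words, the similarity of (Si,Tj) is computed as follows:
Figure PCTCN2015100180-appb-000020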
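Here stf(cm,bm) is the number of times the mutually translated word pair occurs in the sentence pair; |Si| and |Tj| are the numbers of sentences in the source-language Si and the target-language Tj respectively; idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text;
Figure PCTCN2015100180-appb-000021
Figure PCTCN2015100180-appb-000022
are the lengths of the sentences in the source-language Si and the target-language Tj respectively; Matching(|Si|,|Tj|) is a penalty factor that penalizes different alignment modes to different degrees, to prevent the algorithm from grouping more sentences together;
Figure PCTCN2015100180-appb-000023
is a penalty factor determined by length.
On the basis of the similarity function Sim(Si,Tj), dynamic programming is used to find the optimal sentence alignment path and obtain the bilingual parallel corpus.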
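Comparing traditional webpage blocking algorithms with the topic-similarity-blocking body text extraction method of the invention, the latter has the following advantages:
(1) It can extract pages with short body text. The length of the content does not affect the correctness of the selection, because every block takes part in the computation regardless of its length and none is ignored.
(2) It can handle complex pages with nested <table> tags. Because an HTML tree is built, every <table> tag is guaranteed consistent treatment.
(3) It reduces computation. No cluster analysis is needed (clustering is very time-consuming) and the entropy of blocks is not computed; analyzing the page itself is enough to reach a decision.
(4) It adds a degree of semantic information. Because the semantic information linking the title tags and the body is used effectively, the extracted body text is more semantically relevant.
The foregoing is the primary implementation of this intellectual property and places no limit on other forms of implementing this new product and/or new method. Those skilled in the art may use this information and modify the above content to achieve similar implementations. However, all modifications or adaptations based on the new product of the invention fall within the reserved rights.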

Claims (10)

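  1. A webpage body text extraction and comparison method, characterized by comprising the following steps:
    Step A: based on specific tags of the webpage, determine whether the webpage is a body text page;
    Step B: identify parallel webpages;
    Step C: for Chinese webpages, set a threshold on the number of Chinese punctuation marks and use the threshold to judge the text in each <p> tag of the webpage: if the number of Chinese punctuation marks in the text exceeds the set threshold, add the text to the body.
  2. The webpage body text extraction and comparison method according to claim 1, characterized in that Step A further comprises the following sub-steps:
    Step 1: preprocess the webpage and construct an HTML tree;
    Step 2: prune the HTML tree;
    Step 3: obtain the webpage topic;
    Step 4: extract the string content of each block;
    Step 5: compute the distance between the topic S and the content y of a block;
    Step 6: compare the edit distance L with max(p,q).
  3. The webpage body text extraction and comparison method according to claim 1 or 2, characterized in that Step 2 further comprises the following sub-step: partition the page into blocks by <table> tags and remove leaf nodes that contain neither text nor link information.
  4. The webpage body text extraction and comparison method according to any one of claims 1 to 3, characterized in that Step 5 further comprises: segmenting Chinese text into words, with the Levenshtein distance used given by formulas (2) and (3):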
    Figure PCTCN2015100180-appb-100001
    Figure PCTCN2015100180-appb-100002
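  5. The webpage body text extraction and comparison method according to any one of claims 1 to 4, characterized in that Step B further comprises a feature information extraction sub-step and a support vector machine classification sub-step;
    the feature information extraction sub-step further comprises:
    establishing feature information: the feature information comprises HTML tag structure information of the webpage and content-based text length information, sentence count information, and numeric sequence information;
    dividing HTML tags, according to their layout, display, and linking functions in the webpage, into three classes, namely structural tags, format tags, and irrelevant tags:
    structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
    format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
    irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are removed when computing structural symmetry.
    computing the similarity of the classified HTML tag sequences with a modified edit distance:
    the edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
    the edit operations are replacing one character with another, inserting a character, and deleting a character;
    based on the tag classes, the modified edit distance is defined as the minimum operation cost of converting the differently classed tags of one string into those of another string by deletion, insertion, and substitution;
    the webpage body text extraction and comparison method further comprises a bilingual sentence alignment step;
    the bilingual sentence alignment step is as follows: once document-level bilingual parallel webpage documents have been obtained, suppose that body text extraction and sentence splitting of the parallel pages yield sentence pairs (Si,Tj), with candidate sentence alignments C and B being {c1,c2,…,cn} and {b1,b2,…,bn} respectively, where ci and bi are words after segmentation; assuming there are K pairs of mutually translated words, the similarity of (Si,Tj) is computed as follows:
    Figure PCTCN2015100180-appb-100003
    where stf(cm,bm) is the number of times the mutually translated word pair occurs in the sentence pair;
    |Si| and |Tj| are the numbers of sentences in the source-language Si and the target-language Tj respectively;
    idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text;
    Figure PCTCN2015100180-appb-100004
    Figure PCTCN2015100180-appb-100005
    are the lengths of the sentences in the source-language Si and the target-language Tj respectively;
    Matching(|Si|,|Tj|) is a penalty factor that penalizes different alignment modes to different degrees, to prevent the algorithm from grouping more sentences together;
    Figure PCTCN2015100180-appb-100006
    is a penalty factor determined by length;
    on the basis of the similarity function Sim(Si,Tj), dynamic programming is used to find the optimal sentence alignment path and obtain the bilingual parallel corpus.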
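  6. A webpage body text extraction and comparison system, characterized by comprising the following modules:
    Module A: for determining, based on specific tags of the webpage, whether the webpage is a body text page;
    Module B: for identifying parallel webpages.
  7. The webpage body text extraction and comparison system according to claim 6, characterized in that Module A further comprises the following sub-modules:
    a preprocessing sub-module: for preprocessing the webpage and constructing an HTML tree;
    a pruning sub-module: for pruning the HTML tree;
    a topic acquisition sub-module: for obtaining the webpage topic;
    a block extraction sub-module: for extracting the string content of each block;
    a distance computation sub-module: for computing the distance between the topic S and the content y of a block;
    a distance comparison sub-module: for comparing the edit distance L with max(p,q).
  8. The webpage body text extraction and comparison system according to claim 6 or 7, characterized in that the pruning sub-module is further used to partition the page into blocks by <table> tags and remove leaf nodes that contain neither text nor link information.
  9. The webpage body text extraction and comparison system according to any one of claims 6 to 8, characterized in that the distance computation sub-module is further used to segment Chinese text into words, with the Levenshtein distance used given by formulas (2) and (3):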
    Figure PCTCN2015100180-appb-100007
    Figure PCTCN2015100180-appb-100008
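  10. The webpage body text extraction and comparison system according to any one of claims 6 to 9, characterized in that Module B further comprises the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module;
    the feature information extraction sub-module is used for:
    establishing feature information: the feature information comprises HTML tag structure information of the webpage and content-based text length information, sentence count information, and numeric sequence information;
    dividing HTML tags, according to their layout, display, and linking functions in the webpage, into three classes, namely structural tags, format tags, and irrelevant tags:
    structural tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul;
    format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u;
    irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are removed when computing structural symmetry.
    computing the similarity of the classified HTML tag sequences with a modified edit distance:
    the edit distance between two strings is the minimum number of edit operations needed to transform one string into the other;
    the edit operations are replacing one character with another, inserting a character, and deleting a character;
    based on the tag classes, the modified edit distance is defined as the minimum operation cost of converting the differently classed tags of one string into those of another string by deletion, insertion, and substitution;
    the webpage body text extraction and comparison system further comprises a bilingual sentence alignment module;
    the bilingual sentence alignment module is used as follows: once document-level bilingual parallel webpage documents have been obtained, suppose that body text extraction and sentence splitting of the parallel pages yield sentence pairs (Si,Tj), with candidate sentence alignments C and B being {c1,c2,…,cn} and {b1,b2,…,bn} respectively, where ci and bi are words after segmentation; assuming there are K pairs of mutually translated words, the similarity of (Si,Tj) is computed as follows:
    Figure PCTCN2015100180-appb-100009
    where stf(cm,bm) is the number of times the mutually translated word pair occurs in the sentence pair;
    |Si| and |Tj| are the numbers of sentences in the source-language Si and the target-language Tj respectively;
    idtf(cm) is the ratio of the total number of occurrences of cm in Si to the number of occurrences of cm in the text;
    Figure PCTCN2015100180-appb-100010
    Figure PCTCN2015100180-appb-100011
    are the lengths of the sentences in the source-language Si and the target-language Tj respectively;
    Matching(|Si|,|Tj|) is a penalty factor that penalizes different alignment modes to different degrees, to prevent the algorithm from grouping more sentences together;
    Figure PCTCN2015100180-appb-100012
    is a penalty factor determined by length;
    on the basis of the similarity function Sim(Si,Tj), dynamic programming is used to find the optimal sentence alignment path and obtain the bilingual parallel corpus.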
PCT/CN2015/100180 2015-11-14 2015-12-31 Webpage body text extraction and comparison method WO2017080090A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510793525.XA CN106528583A (zh) 2015-11-14 2015-11-14 Webpage body text extraction and comparison method
CN201510793525.X 2015-11-14

Publications (1)

Publication Number Publication Date
WO2017080090A1 true WO2017080090A1 (zh) 2017-05-18

Family

ID=58348780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/100180 WO2017080090A1 (zh) 2015-11-14 2015-12-31 Webpage body text extraction and comparison method

Country Status (2)

Country Link
CN (1) CN106528583A (zh)
WO (1) WO2017080090A1 (zh)

Also Published As

Publication number Publication date
CN106528583A (zh) 2017-03-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15908220

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15908220

Country of ref document: EP

Kind code of ref document: A1