WO2017080090A1 - Method for extracting and comparing web page text - Google Patents

Method for extracting and comparing web page text

Info

Publication number
WO2017080090A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
text
tags
sub
module
Prior art date
Application number
PCT/CN2015/100180
Other languages
English (en)
Chinese (zh)
Inventor
孙燕群
Original Assignee
孙燕群
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 孙燕群
Publication of WO2017080090A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562 Bookmark management

Definitions

  • The invention relates to computer network technology, and in particular to a method for extracting and comparing web page text.
  • The main web page text extraction methods are DOM-based methods, statistics-based methods, block-based methods, and other methods.
  • The Document Object Model (DOM) is a standard interface specification developed by the W3C. Because DOM nodes are organized in a tree hierarchy, once the tree is built, operations on the web page become operations on the tree. Although a web page can be converted into a DOM tree according to the W3C standards, in practice many web pages do not follow the standard. A DOM-based method therefore usually needs a preprocessing module before the web page can be abstracted into a DOM tree.
  • The DOM-based web page text extraction method was originally intended to improve PDA applications and to remove advertising content.
  • The DOM method abstracts the content of the web page into corresponding objects, converts them into nodes, and organizes the nodes by parent-child relationship to form a tree structure.
  • Web pages from the same website mostly share the same structure.
  • For example, the <body> tag of a Yahoo News page is composed of two tags, <iframe> and <div>, so such web page templates can be grouped into one class.
  • Clustering similar DOM trees requires calculating their similarity.
  • The similarity of two simple DOM trees is calculated in two steps. The first step checks whether the root nodes of the two trees are the same: if they are not, it returns 0; if they are, it goes on to compare the leaf nodes of the two trees.
  • The second step compares the names and attributes of the leaf nodes of the two DOM trees and returns the number of identical nodes in the two trees.
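  • The two-step comparison above can be sketched as follows. This is a minimal illustration, not the method of any particular DOM library: the `Node` class, the `leaves` helper, and the rule that each leaf is matched at most once are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal DOM node: tag name, attributes, and child nodes (hypothetical type)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def leaves(node):
    """Return the leaf nodes of the tree rooted at node, left to right."""
    if not node.children:
        return [node]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


def simple_tree_similarity(t1, t2):
    """Step 1: return 0 if the roots differ. Step 2: count identical leaves."""
    if t1.name != t2.name:
        return 0
    remaining = [(leaf.name, tuple(sorted(leaf.attrs.items()))) for leaf in leaves(t2)]
    matches = 0
    for leaf in leaves(t1):
        key = (leaf.name, tuple(sorted(leaf.attrs.items())))
        if key in remaining:
            remaining.remove(key)  # assumption: each leaf of t2 is matched at most once
            matches += 1
    return matches
```

  • For instance, two trees rooted at `body` that share an `iframe` leaf and a `div` leaf with the same attributes would score 2, while trees with different roots score 0.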
  • The statistics-based method is mainly used to extract the body of news web pages.
  • Its underlying assumption is that the body of a web page is located only in a <table> tag node.
  • The basic steps are as follows. First, remove noise from the page and represent the web page as a tree according to its tags. Second, process each <table> node by removing the HTML tags inside it, which yields a tag-free string. Third, compare the number of characters in each node; usually the node with the most characters is the body of the web page.
  • The advantages of this method are that it exploits the characteristics of news web pages, generalizes well, is simple to implement, does not need a separate template for each kind of web page, requires no sample learning, and has low time complexity.
  • The disadvantage is that the algorithm only applies when all body text in the web page sits in a single <table> node; it performs poorly on web pages whose body is spread over multiple <table> nodes. With the rise of Weibo, light blogs, and similar services, more and more pages with complex formats and short texts have appeared, which makes the limitations of this method more obvious.
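  • A rough sketch of the statistics-based method described above, using Python's standard `html.parser`. The class and function names are hypothetical, and the noise-removal step is omitted; the sketch only strips tags inside each <table> and picks the table with the most characters.

```python
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collect the tag-free text found inside each <table> element."""

    def __init__(self):
        super().__init__()
        self.open_tables = []   # text buffers of the currently open tables
        self.table_texts = []   # finished tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append([])

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.table_texts.append("".join(self.open_tables.pop()))

    def handle_data(self, data):
        # Text counts toward every enclosing table, as it would in a tree of nodes.
        for buffer in self.open_tables:
            buffer.append(data.strip())


def largest_table_text(html):
    """Return the text of the <table> with the most characters, or "" if none."""
    parser = TableTextCollector()
    parser.feed(html)
    parser.close()
    return max(parser.table_texts, key=len, default="")
```

  • Because text counts toward every enclosing table, an outer layout table always wins over the tables nested inside it, which illustrates the limitation noted above for pages with several or nested <table> nodes.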
  • The problem to be solved by the present invention is to provide a web page text extraction and comparison method based on topic similarity; results show that the method of the present invention achieves a large improvement in accuracy.
  • The present invention provides a web page text extraction and comparison method, comprising the following steps:
  • Step A: determining whether the web page is a text page based on specific tags of the web page.
  • Step B: identifying parallel web pages.
  • Step C: for a Chinese web page, the body usually contains Chinese punctuation, whereas the title contains little or none. A threshold on the number of Chinese punctuation marks is therefore used to judge the Chinese text in each <p> tag of the web page: if the number of Chinese punctuation marks exceeds the given threshold, the text is added to the body. The text of consecutive <p> tags (one or two other tags may lie between the <p> tags) is then obtained and added to the body by the same test.
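  • The punctuation test of Step C might look like the sketch below. The punctuation set, the default threshold of 2, the regular expressions, and the function names are all assumptions chosen for illustration; the handling of consecutive <p> tags separated by other tags is not shown.

```python
import re

# Common Chinese punctuation marks; the exact set is an assumption.
CHINESE_PUNCTUATION = "，。！？；：、“”‘’（）《》"

# Naive <p> matcher, adequate only for simple, well-formed markup.
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def count_chinese_punctuation(text):
    """Count the characters of text that are Chinese punctuation marks."""
    return sum(1 for ch in text if ch in CHINESE_PUNCTUATION)


def extract_body_paragraphs(html, threshold=2):
    """Keep the text of each <p> whose punctuation count exceeds threshold."""
    body = []
    for inner in P_TAG.findall(html):
        text = ANY_TAG.sub("", inner).strip()
        if count_chinese_punctuation(text) > threshold:
            body.append(text)
    return body
```

  • A title-like paragraph with no punctuation is dropped, while a sentence-rich paragraph passes the threshold and is kept as body text.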
  • Step A may further comprise the following sub-steps:
  • Step 1: preprocessing the web page and constructing an HTML tree;
  • Step 2: pruning the HTML tree;
  • Step 3: obtaining the web page theme;
  • Step 4: extracting the string content in each block;
  • Step 5: calculating the distance between the theme S and the content y in a block;
  • Step 6: comparing the edit distance L with max(p, q).
  • Step 2 may further include the following sub-steps: dividing the page into blocks according to the <table> tag, and removing leaf nodes that contain neither text nor link information.
  • Step 5 may further include: segmenting Chinese text into words, and using the Levenshtein distance as shown in formula (2) and formula (3):
  • Step B may further include a feature information extraction sub-step and a support vector machine classification sub-step.
  • The feature information extraction sub-step further includes the following:
  • The feature information includes the HTML tag structure information of the web page and content-based text length information, sentence count information, and number sequence information.
  • HTML tags are divided into three types according to their layout, display, and link functions: structure tags, format tags, and irrelevant tags.
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul.
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u.
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
  • The edit distance between two strings is the minimum number of edit operations required to convert one string into the other.
  • Edit operations are replacing one character with another, inserting a character, and deleting a character.
  • The improved edit distance is defined as the minimum operation cost of converting one string of tags of different types into another string by deletion, insertion, and replacement.
  • The present invention also provides a web page text extraction and comparison system, comprising the following modules:
  • Module A: used to determine whether a web page is a text page based on specific tags of the web page.
  • Module B: used to identify parallel web pages.
  • Module A may further comprise the following sub-modules:
  • Pre-processing sub-module: used to preprocess the web page and construct an HTML tree;
  • Pruning sub-module: used to prune the HTML tree;
  • Block extraction sub-module: used to extract the string content in each block;
  • Distance calculation sub-module: used to calculate the distance between the theme S and the content y in a block;
  • Distance comparison sub-module: used to compare the edit distance L with max(p, q).
  • The pruning sub-module may be further used to divide the page into blocks according to the <table> tag and to remove leaf nodes that contain neither text nor link information.
  • The distance calculation sub-module may be further used to segment Chinese text into words; the Levenshtein distance used is as shown in formula (2) and formula (3):
  • Module B may further include the following sub-modules: a feature information extraction sub-module and a support vector machine classification sub-module.
  • The feature information extraction sub-module works as follows:
  • The feature information includes the HTML tag structure information of the web page and content-based text length information, sentence count information, and number sequence information.
  • HTML tags are divided into three types according to their layout, display, and link functions: structure tags, format tags, and irrelevant tags.
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul.
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u.
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend; these are deleted when calculating structural symmetry.
  • The edit distance between two strings is the minimum number of edit operations required to convert one string into the other.
  • Edit operations are replacing one character with another, inserting a character, and deleting a character.
  • The improved edit distance is defined as the minimum operation cost of converting one string of tags of different types into another string by deletion, insertion, and replacement.
  • Compared with the traditional web page blocking algorithm, the web page text extraction and comparison method of the present invention, which is based on topic-similarity blocking, has the following advantages:
  • Cluster analysis, which is very time consuming, is not required. The entropy of each block does not need to be calculated either; the judgment can be made by analyzing the web page itself.
  • The theme referred to in the topic-similarity-based method is the title and tags of the web page.
  • The algorithm of the present invention does not calculate the entropy of content blocks; it mainly uses the similarity between the theme and each content block as the basis for deciding which blocks to extract.
  • The main features of a web page are as follows:
  • First, the web page format has a tree structure.
  • Web page tags are usually nested in pairs, so a page can be converted into an HTML tree structure; DOM-based web page text extraction methods also take advantage of this feature.
  • The method of the present invention builds the HTML tree mainly to cut off useless branches and reduce the amount of computation.
  • Second, web pages are usually arranged in blocks.
  • Each web page basically includes the following blocks: a classification block, a navigation block, a text block, a related-link block, and an advertisement block.
  • Because web page tags are usually nested in pairs, tags can be used to divide the web page into blocks.
  • The <table></table> tag pair has good layout characteristics, and most web pages now use <table> tags to lay out the page as it is finally presented to the user.
  • The web page text extraction method relies on this and uses the <table> tag to parse the web page.
  • Third, the theme and the content are related.
  • A web page usually has a title and a number of tags that give a high-level summary of the body, so the theme reflects the characteristics of the body and represents the key content of the page. Earlier web page extraction methods did not take this into account.
  • The method of the present invention uses the relationship between the theme and the body text as an important indicator for text extraction. The structure of mobile Internet web pages is increasingly diverse, content length varies, advertising interference is abundant, and short-text content is easily drowned out by advertisements, so considering the similarity between the theme and the page content is indispensable in web page extraction.
  • The similarity measure used in the present invention is the edit distance (i.e., the Levenshtein distance).
  • The Levenshtein distance is the minimum number of insertions, deletions, and substitutions required to convert the original string a into the target string b.
  • The Levenshtein formula is shown in equation (1):
  • Here a and b are strings, i is the length of string a, and j is the length of string b.
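  • For reference, the textbook dynamic-programming computation of the Levenshtein distance is sketched below. It follows the definition given above (insertions, deletions, and substitutions each cost 1); the function name and the row-by-row layout are choices made for this example and are not taken from equation (1).

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))        # distances from "" to each prefix of b
    for i, item_a in enumerate(a, start=1):
        current = [i]                         # distance from a[:i] to ""
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]
```

  • The function accepts any two sequences, so it works on character strings as well as on lists of segmented words.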
  • The basic idea of the web page text extraction method based on topic-similarity blocking is as follows: convert the web page into an HTML tree; extract the theme of the web page; extract the content blocks using the web page tags; and compute the Levenshtein edit distance L between the theme and each content block.
  • When the distance L is smaller than the length p of a content block, the block is regarded as web page body content. When the distance L is greater than or equal to the length of the content block, the block is ignored.
  • The algorithm of the present invention includes three main stages: constructing the HTML tree, extracting the web page theme, and calculating the similarity between the theme and each block.
  • The basic steps of the algorithm are as follows:
  • Step 1: Web page preprocessing and construction of the HTML tree. The web page is normalized and mapped into a tree structure, which includes the following sub-steps:
  • Ensure that each start tag has a corresponding end tag, such as <body> with </body> and <head> with </head>.
  • Ensure that tags are nested correctly, such as <a><b></b></a>. Only correctly nested tags can be traversed correctly.
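  • The two conditions of Step 1 can be checked with a tag stack, as in the sketch below. It only detects unbalanced or crossed tags and does not repair them; the class name, the function name, and the list of void elements are assumptions made for this example.

```python
from html.parser import HTMLParser

# Void elements never take an end tag in HTML, so they are not pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record end tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> open and close at once

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(tag)


def is_well_nested(html):
    """True if every start tag is closed and no tags cross each other."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.stack
```

  • A page that fails this check would need to be normalized, for example by a tidying tool, before the HTML tree is built.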
  • Step 2: Pruning the HTML tree. Blocks are delimited by the <table> tag, and some leaf nodes contain neither text nor link information; these useless branches are removed to reduce the amount of computation.
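  • A small sketch of the pruning in Step 2 on a hypothetical in-memory tree. The `Node` type and the rule that an <a> element counts as link information are assumptions for illustration; a branch is dropped when, after pruning its children, it holds no text, no link, and no surviving child.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal HTML tree node (hypothetical type used only in this sketch)."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def prune(node):
    """Return node with useless branches removed, or None if nothing remains."""
    kept = [child for child in (prune(c) for c in node.children) if child is not None]
    node.children = kept
    has_text = bool(node.text.strip())
    is_link = node.tag == "a"   # assumption: an <a> element carries link information
    if not kept and not has_text and not is_link:
        return None
    return node
```

  • Applied to a <table> whose rows hold a text cell, an empty cell, and a cell containing a link, the sketch keeps the first and third rows and drops the second.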
  • Step 3: Get the web page theme. Obtain the content of the page <title>, of the headings at each level (<h1> to <h6>), and of the <meta> tag. For Chinese text, the ICTCLAS word segmentation system developed by the Chinese Academy of Sciences can be used to segment these strings; function words, stop words, and the like are then removed, leaving only the sequence S_title of content words.
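  • Collecting the raw theme strings of Step 3 could be sketched as follows with the standard `html.parser`. Word segmentation and stop-word removal are left out, and the choice of which <meta> names to read (`keywords` and `description`) is an assumption, as are the class and function names.

```python
from html.parser import HTMLParser

THEME_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
META_NAMES = {"keywords", "description"}   # which <meta> names to read is an assumption


class ThemeCollector(HTMLParser):
    """Collect the text of <title>, <h1>..<h6>, and selected <meta> content."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a title or heading tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in THEME_TAGS:
            self.depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in META_NAMES and attributes.get("content"):
                self.parts.append(attributes["content"].strip())

    def handle_endtag(self, tag):
        if tag in THEME_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_theme(html):
    """Return the theme strings of the page in document order."""
    collector = ThemeCollector()
    collector.feed(html)
    collector.close()
    return collector.parts
```

  • The returned strings would then be segmented and filtered to obtain the content-word sequence used as the theme.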
  • Step 4: Extract the string content in each block. The leaf-level blocks of the HTML tree, that is, the subtree under each innermost <table> tag, are each merged into one block; the HTML markup in the block is removed, leaving the string content y of the block.
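  • One way to sketch Step 4 is to treat every <table> that contains no other <table> as a block and keep only its text. The class and function names are hypothetical, and joining text fragments with a single space is a choice made for the example.

```python
from html.parser import HTMLParser


class InnermostTableBlocks(HTMLParser):
    """Collect the tag-free text of each <table> that contains no other <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one [text_parts, has_nested_table] entry per open table
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1][1] = True   # the enclosing table is not innermost
            self.stack.append([[], False])

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            parts, has_nested = self.stack.pop()
            if not has_nested:
                self.blocks.append(" ".join(parts))

    def handle_data(self, data):
        # Text is credited only to the nearest enclosing table.
        if self.stack and data.strip():
            self.stack[-1][0].append(data.strip())


def extract_blocks(html):
    """Return the text content of each innermost <table>, in document order."""
    parser = InnermostTableBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

  • Each returned string plays the role of the content y of one block in the following steps.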
  • Step 5: Calculate the distance between the theme S and the content y within a block. For Chinese text, the content must first be segmented into words, again using the Chinese Academy of Sciences word segmentation system from Step 3.
  • The Levenshtein distance specifically used in the present invention is as shown in formulas (2) and (3):
  • Step 6: Compare the edit distance L with max(p, q). If L < max(p, q), the block is body information and is extracted; otherwise it is treated as interference and ignored. The result is the body information of the web page.
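  • Steps 5 and 6 can be illustrated with the sketch below. It uses the plain word-level Levenshtein distance as a stand-in for formulas (2) and (3), and it assumes that p and q are the lengths of the two word sequences being compared; both points are assumptions, as are the function names.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance between two sequences (stand-in for formulas (2)-(3))."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def is_body_block(theme_words, block_words):
    """Step 6: keep the block when the edit distance L is below max(p, q)."""
    distance = levenshtein(theme_words, block_words)   # L
    p, q = len(block_words), len(theme_words)          # assumed meaning of p and q
    return distance < max(p, q)


def extract_body(theme_words, blocks):
    """Return the blocks (lists of words) judged to be body information."""
    return [block for block in blocks if is_body_block(theme_words, block)]
```

  • Since the Levenshtein distance between two sequences never exceeds the length of the longer one, the test rejects exactly those blocks whose distance reaches that upper bound.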
  • The web page text extraction and comparison method of the present invention further includes the identification of parallel web pages.
  • Parallel web page identification mainly comprises two parts: feature information extraction and support vector machine classification.
  • The feature information mainly includes the HTML tag structure information of the web page and content-based text length information, sentence count information, and number sequence information.
  • According to functional characteristics such as web page layout, display, and linking, HTML tags are divided into three types: structure tags, format tags, and irrelevant tags.
  • Structure tags: blockquote, body, dir, div, dt, h, head, hr, li, menu, p, q, table, tbody, td, tfoot, th, thead, tr, ul, etc.
  • Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u, etc.
  • Irrelevant tags: applet, base, basefont, bdo, br, button, del, kbd, link, meta, samp, script, var, a, fieldset, form, input, isindex, label, legend, etc.; these are deleted when calculating structural symmetry.
  • The similarity of the classified HTML tag sequences is calculated using the improved edit distance.
  • The edit distance between two strings is the minimum number of edit operations required to convert one string into the other.
  • Edit operations are replacing one character with another, inserting a character, and deleting a character.
  • The improved edit distance is defined as the minimum operation cost of converting one string of tags of different types into another string by deletion, insertion, and replacement.
  • The cost of a delete or insert operation is 1, the cost of a replacement within the same class is 0, and the cost of a replacement between classes is 1.5, that is:
  • The lower-right element M[A, B] is the improved edit distance of S_1 and S_2, from which the tag structure information D_t is obtained:
  • The improved edit distance matrix is shown in Table 1.
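  • The improved edit distance described above can be sketched as a weighted dynamic program over tag classes. The costs (1 for insert and delete, 0 for an in-class replacement, 1.5 for a between-class replacement) come from the text; treating `h` as `h1` to `h6`, dropping unknown tags together with irrelevant ones, and all names in the code are assumptions.

```python
STRUCTURE_TAGS = {"blockquote", "body", "dir", "div", "dt", "h", "h1", "h2", "h3",
                  "h4", "h5", "h6", "head", "hr", "li", "menu", "p", "q", "table",
                  "tbody", "td", "tfoot", "th", "thead", "tr", "ul"}
FORMAT_TAGS = {"abbr", "acronym", "b", "big", "center", "cite", "code", "dfn", "em",
               "font", "i", "pre", "s", "small", "span", "strike", "strong", "style",
               "sub", "sup", "tt", "u"}


def tag_class(tag):
    """Map a tag to its class; irrelevant and unknown tags map to None."""
    if tag in STRUCTURE_TAGS:
        return "structure"
    if tag in FORMAT_TAGS:
        return "format"
    return None


def improved_edit_distance(tags1, tags2):
    """Weighted edit distance between two tag sequences, compared by class."""
    s1 = [c for c in map(tag_class, tags1) if c]   # irrelevant tags are deleted first
    s2 = [c for c in map(tag_class, tags2) if c]
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = float(i)                         # i deletions
    for j in range(cols):
        m[0][j] = float(j)                         # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            replace = 0.0 if s1[i - 1] == s2[j - 1] else 1.5
            m[i][j] = min(m[i - 1][j] + 1.0,       # delete
                          m[i][j - 1] + 1.0,       # insert
                          m[i - 1][j - 1] + replace)
    return m[-1][-1]                               # lower-right element of the matrix
```

  • With these costs a between-class replacement (1.5) is cheaper than a delete followed by an insert (2), so the distance favors aligning tags even when their classes differ.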
  • The content surface features are information directly related to the content but not to its vocabulary. They mainly include the sentence count information, text length information, and number sequence information of the text pair, and are calculated as follows:
  • The matching relationship matrix C is used to build the maximum matching length matrix D of the strings; element D[i, j] is calculated as follows:
  • The element D[0, 0] of the finally generated matrix D is the maximum matching length Z.
  • The calculated matching relationship matrix C is shown in Table 2.
  • The web page text extraction and comparison method of the present invention uses the support vector machine (SVM) algorithm for classification.
  • The SVM algorithm is an implementation of statistical learning theory.
  • SVM is based on the theory of the Vapnik-Chervonenkis (VC) dimension and the principle of structural risk minimization.
  • By introducing a kernel function, the sample vectors are mapped into a high-dimensional feature space, where the optimal separating surface is constructed to obtain a linear optimal decision function.
  • The advantage of SVM is that the kernel function addresses the dimensionality problem: the computational complexity of the learning algorithm is not tied directly to the sample dimension.
  • sgn[·] is the sign function;
  • the non-negative variable α_i is a Lagrange multiplier;
  • b is the offset of the hyperplane.
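  • The symbols defined above match the standard kernel SVM decision function, f(x) = sgn[Σ α_i y_i K(x_i, x) + b]. The sketch below evaluates that standard form; it is not taken from the patent's own formula, the RBF kernel is an arbitrary choice, training is not shown, and the support vectors, multipliers, and offset in the usage are made-up values.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def svm_decide(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sgn[sum_i alpha_i * y_i * K(x_i, x) + b] for a trained SVM."""
    score = sum(alpha * y * kernel(sv, x)
                for sv, y, alpha in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1
```

  • In parallel web page identification, x would be the feature vector of a candidate page pair and the two output classes would stand for parallel and non-parallel.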
  • Candidate parallel web page pairs are formed by selecting, from the preprocessed source-language and target-language documents, web pages that lie within two levels of each other in the locally mirrored path.
  • Dt reflects the web page structure information and is extracted from the preprocessed web page; Di, Ds and Dn reflect the web page content information and are extracted from the web page body.
  • A web page text extraction and comparison method that includes bilingual sentence alignment is also provided.
  • The sentence alignment step is as follows: after the document-level bilingual parallel web page documents are obtained, the body text of the parallel web pages is extracted and split into sentences to form candidate sentence pairs (S_i, T_j), which are then aligned.
  • Let C and B be {c_1, c_2, ..., c_n} and {b_1, b_2, ..., b_n}, respectively, where c_i and b_i are the words obtained after word segmentation. Assuming there are K pairs of mutually translated words, the similarity of (S_i, T_j) is:
  • Here stf(c_m, b_m) is the number of occurrences of the mutually translated words in the sentence pair;
  • two further terms are the numbers of sentences in the source language S_i and the target language T_j, respectively;
  • idtf(c_m) is the ratio of the total number of occurrences of c_m in S_i to the number of occurrences of c_m in the text; two further terms are the lengths of the sentences in the source language S_i and the target language T_j, respectively;
  • one penalty factor penalizes the different alignment modes to different degrees, to prevent the algorithm from merging too many sentences together; the other is a penalty factor determined by length.
  • Compared with the traditional web page blocking algorithm, the web page text extraction method of the present invention, which is based on topic-similarity blocking, has the following advantages:
  • Cluster analysis, which is very time consuming, is not required. The entropy of each block does not need to be calculated; the judgment can be made by analyzing the web page itself.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method for extracting and comparing web page text, comprising: A: determining whether a web page is a text page according to a specific tag of the web page; B: identifying parallel web pages. Step A further comprises the following sub-steps: 1, preprocessing the web page and constructing an HTML tree; 2, pruning the HTML tree; 3, obtaining the web page theme; 4, extracting the string content in the blocks; 5, calculating the distance between a theme S and the content y in a block; 6, comparing the edit distance L with max(p, q). The web page text extraction and comparison method has the following advantages: web pages with short text can be extracted, and selection accuracy is not affected by the length of the content. Whatever the length of the text, it takes part in the calculation and is not ignored. All "table" tags can be handled consistently when a web page with complicated "table" nesting is processed.
PCT/CN2015/100180 2015-11-14 2015-12-31 Method for extracting and comparing web page text WO2017080090A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510793525.XA CN106528583A (zh) 2015-11-14 2015-11-14 Web page text extraction and comparison method
CN201510793525.X 2015-11-14

Publications (1)

Publication Number Publication Date
WO2017080090A1 (fr) 2017-05-18

Family

ID=58348780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/100180 WO2017080090A1 (fr) 2015-11-14 2015-12-31 Method for extracting and comparing web page text

Country Status (2)

Country Link
CN (1) CN106528583A (fr)
WO (1) WO2017080090A1 (fr)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019794A (zh) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 文本资源的分类方法、装置、存储介质及电子装置
CN110196968A (zh) * 2019-06-06 2019-09-03 北京林业大学 一种基于特定字符串查找的简体中文编码方式自动识别***及方法
CN110795933A (zh) * 2019-09-30 2020-02-14 奇安信科技集团股份有限公司 一种网页正文的识别处理方法及装置
CN110874428A (zh) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 电商页面的结构化数据提取装置、方法及可读存储介质
CN111241446A (zh) * 2020-01-13 2020-06-05 杭州安恒信息技术股份有限公司 一种web网页的正文内容提取方法、装置、设备及介质
CN111708900A (zh) * 2020-06-17 2020-09-25 北京明略软件系统有限公司 标签同义词的扩充方法、扩充装置、电子设备及存储介质
CN112101004A (zh) * 2020-09-23 2020-12-18 电子科技大学 基于条件随机场与句法分析的通用网页人物信息提取方法
CN112269906A (zh) * 2020-10-14 2021-01-26 西安邮电大学 网页正文的自动抽取方法及装置
CN112287254A (zh) * 2020-11-23 2021-01-29 武汉虹旭信息技术有限责任公司 网页结构化信息提取方法、装置、电子设备及存储介质
CN112668309A (zh) * 2020-11-25 2021-04-16 紫光云技术有限公司 一种融合压缩dom树结构向量的网络行为预测模型
CN113033220A (zh) * 2021-04-15 2021-06-25 沈阳雅译网络技术有限公司 一种基于莱文斯坦比的文言文-现代文翻译系统构建方法
CN113065086A (zh) * 2021-04-23 2021-07-02 深圳壹账通智能科技有限公司 网页正文提取方法、装置、电子设备及存储介质
CN113434797A (zh) * 2021-06-29 2021-09-24 中国电信集团系统集成有限责任公司 一种网页信息提取方法及装置
CN113486228A (zh) * 2021-07-02 2021-10-08 燕山大学 基于md5三叉树和改进birch算法的互联网论文数据自动抽取算法
CN113569119A (zh) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 一种基于多模态机器学习的新闻网页正文抽取系统及方法
CN117573959A (zh) * 2023-10-17 2024-02-20 北京国科众安科技有限公司 一种基于网页xpath获取新闻正文的通用方法

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920434B (zh) * 2018-06-06 2022-08-30 武汉酷犬数据科技有限公司 一种通用的网页主题内容提取方法和系统
WO2020026366A1 (en) * 2018-07-31 2020-02-06 株式会社 AI Samurai Patent evaluation determination method, patent evaluation determination device, and patent evaluation determination program
CN109543126B (zh) * 2018-11-19 2022-04-29 四川长虹电器股份有限公司 基于块文字占比的网页正文信息提取方法
CN112214737B (zh) * 2020-11-10 2022-06-24 山东比特智能科技股份有限公司 以图片为主的欺诈网页的识别方法、系统、装置和介质
CN112528205B (zh) * 2020-12-22 2021-10-29 中科院计算技术研究所大数据研究院 一种网页主体信息提取方法、装置及存储介质
CN112765940B (zh) * 2021-01-20 2024-04-19 南京万得资讯科技有限公司 一种基于主题特征和内容语义的网页去重方法
CN113449078A (zh) * 2021-06-25 2021-09-28 完美世界控股集团有限公司 相似新闻识别方法、设备、系统及存储介质
CN114239590B (zh) * 2021-12-01 2023-09-19 马上消费金融股份有限公司 一种数据处理方法及装置
CN115238208A (zh) * 2022-06-28 2022-10-25 北京关键科技股份有限公司 一种基于符号特征的数据检索方法及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197849A (zh) * 2007-12-21 2008-06-11 腾讯科技(深圳)有限公司 将互联网页面转换为无线应用协议页面的转换方法和装置
CN102663023A (zh) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 一种提取网页内容的实现方法
EP2562656A1 (en) * 2010-10-14 2013-02-27 JVC KENWOOD Corporation Filtering device and filtering method
CN103064966A (zh) * 2012-12-31 2013-04-24 中国科学院计算技术研究所 一种从单记录网页中抽取规律噪音的方法

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019794A (zh) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 文本资源的分类方法、装置、存储介质及电子装置
CN110019794B (zh) * 2017-11-07 2023-04-25 腾讯科技(北京)有限公司 文本资源的分类方法、装置、存储介质及电子装置
CN110196968A (zh) * 2019-06-06 2019-09-03 北京林业大学 一种基于特定字符串查找的简体中文编码方式自动识别系统及方法
CN110196968B (zh) * 2019-06-06 2023-04-07 北京林业大学 一种基于特定字符串查找的简体中文编码方式自动识别系统及方法
CN110795933A (zh) * 2019-09-30 2020-02-14 奇安信科技集团股份有限公司 一种网页正文的识别处理方法及装置
CN110795933B (zh) * 2019-09-30 2023-10-31 奇安信科技集团股份有限公司 一种网页正文的识别处理方法及装置
CN110874428A (zh) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 电商页面的结构化数据提取装置、方法及可读存储介质
CN111241446B (zh) * 2020-01-13 2023-10-31 杭州安恒信息技术股份有限公司 一种web网页的正文内容提取方法、装置、设备及介质
CN111241446A (zh) * 2020-01-13 2020-06-05 杭州安恒信息技术股份有限公司 一种web网页的正文内容提取方法、装置、设备及介质
CN111708900A (zh) * 2020-06-17 2020-09-25 北京明略软件系统有限公司 标签同义词的扩充方法、扩充装置、电子设备及存储介质
CN111708900B (zh) * 2020-06-17 2023-08-25 北京明略软件系统有限公司 标签同义词的扩充方法、扩充装置、电子设备及存储介质
CN112101004A (zh) * 2020-09-23 2020-12-18 电子科技大学 基于条件随机场与句法分析的通用网页人物信息提取方法
CN112101004B (zh) * 2020-09-23 2023-03-21 电子科技大学 基于条件随机场与句法分析的通用网页人物信息提取方法
CN112269906A (zh) * 2020-10-14 2021-01-26 西安邮电大学 网页正文的自动抽取方法及装置
CN112269906B (zh) * 2020-10-14 2023-04-14 西安邮电大学 网页正文的自动抽取方法及装置
CN112287254A (zh) * 2020-11-23 2021-01-29 武汉虹旭信息技术有限责任公司 网页结构化信息提取方法、装置、电子设备及存储介质
CN112287254B (zh) * 2020-11-23 2023-10-27 武汉虹旭信息技术有限责任公司 网页结构化信息提取方法、装置、电子设备及存储介质
CN112668309B (zh) * 2020-11-25 2023-03-07 紫光云技术有限公司 一种融合压缩dom树结构向量的网络行为预测方法
CN112668309A (zh) * 2020-11-25 2021-04-16 紫光云技术有限公司 一种融合压缩dom树结构向量的网络行为预测模型
CN113033220A (zh) * 2021-04-15 2021-06-25 沈阳雅译网络技术有限公司 一种基于莱文斯坦比的文言文-现代文翻译系统构建方法
CN113065086A (zh) * 2021-04-23 2021-07-02 深圳壹账通智能科技有限公司 网页正文提取方法、装置、电子设备及存储介质
CN113434797A (zh) * 2021-06-29 2021-09-24 中国电信集团系统集成有限责任公司 一种网页信息提取方法及装置
CN113434797B (zh) * 2021-06-29 2024-05-31 ***数智科技有限公司 一种网页信息提取方法及装置
CN113569119A (zh) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 一种基于多模态机器学习的新闻网页正文抽取系统及方法
CN113486228A (zh) * 2021-07-02 2021-10-08 燕山大学 基于md5三叉树和改进birch算法的互联网论文数据自动抽取算法
CN113486228B (zh) * 2021-07-02 2022-05-10 燕山大学 基于md5三叉树和改进birch算法的互联网论文数据自动抽取算法
CN117573959A (zh) * 2023-10-17 2024-02-20 北京国科众安科技有限公司 一种基于网页xpath获取新闻正文的通用方法
CN117573959B (zh) * 2023-10-17 2024-04-05 北京国科众安科技有限公司 一种基于网页xpath获取新闻正文的通用方法

Also Published As

Publication number Publication date
CN106528583A (zh) 2017-03-22

Similar Documents

Publication Publication Date Title
WO2017080090A1 (en) Extraction and comparison method for webpage text
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
KR102237702B1 (ko) 엔티티 관계 데이터 생성 방법, 장치, 기기 및 저장 매체
CN109145260B (zh) 一种文本信息自动提取方法
CN110413787B (zh) 文本聚类方法、装置、终端和存储介质
CN101079025B (zh) 一种文档相关度计算系统和方法
CN110770735A (zh) 具有嵌入式数学表达式的文档的编码转换
CN112380864B (zh) 一种基于回译的文本三元组标注样本增强方法
CN104750820A (zh) 一种语料库的过滤方法及装置
CN101114281A (zh) 开放式文档同构引擎系统
CN111046660B (zh) 一种识别文本专业术语的方法及装置
CN102779135A (zh) 跨语言获取搜索资源的方法和装置及对应搜索方法和装置
CN111737623A (zh) 网页信息提取方法及相关设备
CN105574066A (zh) 网页正文提取比对方法及其系统
CN107463571A (zh) 网页消重方法
Jain et al. Context sensitive text summarization using k means clustering algorithm
CN112765999A (zh) 机器翻译双语对照方法及系统
CN108763192B (zh) 用于文本处理的实体关系抽取方法及装置
CN107145591B (zh) 一种基于标题的网页有效元数据内容提取方法
CN106372232B (zh) 基于人工智能的信息挖掘方法和装置
Zanibbi et al. Math search for the masses: Multimodal search interfaces and appearance-based retrieval
CN110705285B (zh) 一种政务文本主题词库构建方法、装置、服务器及可读存储介质
CN117312711A (zh) 一种基于ai分析的搜索引擎优化方法及系统
CN111859887A (zh) 一种基于深度学习的科技新闻自动写作系统
CN105426388A (zh) 一种网页正文提取比对装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15908220

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15908220

Country of ref document: EP

Kind code of ref document: A1