WO2011085588A1 - Webpage contents grabbing method which can be general adapted to any webpage - Google Patents

Info

Publication number
WO2011085588A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
node
variable
client
array
Prior art date
Application number
PCT/CN2010/076100
Other languages
French (fr)
Chinese (zh)
Inventor
胡加明
Original Assignee
苏州阔地网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州阔地网络科技有限公司
Publication of WO2011085588A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/151 - Transformation

Definitions

  • a general method for crawling webpage content that can be used for any webpage

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a webpage content grabbing method that can be adapted to any webpage. The method involves: a user entering a website address and a conditional expression describing the content to be grabbed; creating a subpage and sending the address and the conditional expression to it; the subpage requesting a server, obtaining the full content of the webpage at that address, and embedding a segment of JavaScript into that content; the JavaScript program transforming the conditional expression into an array variable; the JavaScript program traversing the array to find all nodes that match the condition; and the JavaScript program reading the innerHTML or outerHTML attribute values of those nodes to obtain the corresponding webpage content.

Description

A general method for crawling webpage content that can be used for any webpage

Technical Field

The invention belongs to the field of network technology and specifically relates to a general method for crawling webpage content that can be applied to any webpage.

Background

In the Internet era, abundant online resources have greatly enriched people's access to information. However, as the volume of information grows explosively, ordinary users struggle to pull the content they find valuable from other webpages on the Internet into their own pages. Traditional crawling methods carry a high technical barrier: to capture one block of content from one webpage, the page's data usually has to go through a complex parsing step before the required content can be extracted. When the target changes to a different webpage, the parsing code has to be redesigned. The process involves a great deal of repeated work and is inefficient. Because all parsing relies on hand-written parsing code rather than the system's native functions, parsing errors are common, and such complex operations are beyond most ordinary users.

Summary of the Invention

The object of the present invention is to provide a general method for crawling webpage content that can be used for any webpage, improving the efficiency and accuracy of webpage content crawling.
To achieve this object, the invention adopts the following technical solution: a general method for crawling webpage content that can be used for any webpage, comprising the following steps:
1) The client inputs a target URL to be crawled and a conditional expression, and a subpage that displays the full content of the webpage is generated on the client;
2) The client parses the conditional expression into an array variable of node tag names and conditions;
3) A unit variable (one element of the array variable) is taken, the nodes matching the unit variable's tag name are located, and each node is checked against the condition set by that unit variable. If the condition is met, the next unit variable of the array is taken. The loop repeats until the length of the array variable is 0, and finally all nodes that satisfy the condition of the last unit variable are saved to an array variable;
4) The client obtains the innerHTML or outerHTML attribute value of every node in the saved array variable.
In the above technical solution, step 1) comprises the following process:
1a) The client inputs a URL and a conditional expression, and a JavaScript script on the client checks whether the entered URL is valid. If it is valid, the process continues to the next step; otherwise the user is prompted to re-enter the URL;
1b) The client generates the corresponding subpage and assigns the URL to it. The subpage independently sends a request to the server, and the server obtains the full content of the webpage at that URL, where the content is one or more HTML elements in the webpage.
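As an illustration only, the validity check of step 1a) might be sketched as follows. The source does not give the actual rule; the detailed description only names two accepted shapes, http://XXX and XXX.XXX. The function name `isPlausibleUrl`, the acceptance of https, and the example addresses are assumptions made for this sketch.

```javascript
// Hypothetical helper: checks only the two shapes named in the description,
// http://XXX and XXX.XXX. It is not a full URL validator.
function isPlausibleUrl(input) {
  const text = String(input).trim();
  if (text === "" || /\s/.test(text)) {
    return false; // empty input or embedded whitespace
  }
  // Shape 1: http://XXX. Accepting https as well is an assumption of this sketch.
  if (/^https?:\/\/[^\s/]+/i.test(text)) {
    return true;
  }
  // Shape 2: XXX.XXX, meaning dot-separated non-empty labels, optionally followed by a path.
  return /^[^\s./]+(\.[^\s./]+)+(\/.*)?$/.test(text);
}

console.log(isPlausibleUrl("http://example.com")); // true
console.log(isPlausibleUrl("www.example.com"));    // true
console.log(isPlausibleUrl("example"));            // false
```

A check of this kind only filters out obviously malformed input before the subpage is created; whether the address actually resolves is left to the subpage's own request.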
In the above technical solution, step 2) comprises the following process: the client first parses the conditional expression into an array variable with a parent-child structure, in which each array unit is the parent node of the unit that follows it, and then parses each array unit into an array holding a node tag name and a node condition.
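A minimal sketch of step 2), assuming the "/"-separated syntax of the example expression /div[@class=list] given in the detailed description, with at most one [@attribute=value] condition per unit. The function name `parseConditionExpression` and the object shape it returns are hypothetical, chosen for illustration.

```javascript
// Assumed syntax: "/tag[@attr=value]/tag...". Values containing "/" are not
// supported, because the expression is split on "/" first.
function parseConditionExpression(expression) {
  return expression
    .split("/")                        // parent/child separator
    .filter((part) => part.length > 0) // drop the empty piece before a leading "/"
    .map((part) => {
      const m = /^([^\[\]]+)(?:\[@([^=\]]+)=([^\]]*)\])?$/.exec(part);
      if (m === null) {
        throw new Error("Unsupported unit: " + part);
      }
      // A unit without [@attr=value] carries no condition.
      const condition = m[2] === undefined ? null : { attribute: m[2], value: m[3] };
      return { tagName: m[1], condition };
    });
}

console.log(JSON.stringify(parseConditionExpression("/div[@class=list]/a")));
```

The order of the resulting array carries the parent-child relationship: each entry describes the parent of the entry after it.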
In the above technical solution, step 3) comprises the following process: one unit variable of the array variable is taken as the current unit variable. If it is the first unit variable, the nodes matching its node tag name are located in the subpage. If it is not the first, the nodes matching its node tag name are located among the child nodes of all nodes that satisfied the previous unit variable's condition. All located nodes are traversed and checked against the condition set by the current unit variable. If the condition is met, the next unit variable of the array is taken. Step 3) is repeated until the length of the array variable is 0. Finally, all nodes that satisfy the condition of the last unit variable are saved to an array variable nodes, where nodes is the chosen array variable name.
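The loop of step 3) can be sketched on plain objects, so that it runs outside a browser. The node shape, the helper names `satisfies`, `descendants`, and `matchUnits`, and the sample tree are all assumptions of this sketch; the source describes the procedure but gives no code.

```javascript
// Plain objects stand in for DOM nodes.
// Assumed node shape: { id, tagName, attributes: {...}, children: [...] }.
function satisfies(node, unit) {
  if (node.tagName !== unit.tagName) return false;
  if (unit.condition === null) return true;
  return node.attributes[unit.condition.attribute] === unit.condition.value;
}

// All descendants of a node, in document order.
function descendants(node) {
  const out = [];
  for (const child of node.children) {
    out.push(child, ...descendants(child));
  }
  return out;
}

function matchUnits(page, units) {
  const remaining = units.slice(); // consumed until its length is 0
  let nodes = [];
  let first = true;
  while (remaining.length > 0) {
    const unit = remaining.shift(); // current unit variable
    // First unit: search the whole page. Later units: only children of the previous matches.
    const pool = first ? descendants(page) : nodes.flatMap((n) => n.children);
    nodes = pool.filter((n) => satisfies(n, unit));
    first = false;
  }
  return nodes;
}

const leaf = (id) => ({ id, tagName: "a", attributes: {}, children: [] });
const page = {
  id: "root", tagName: "html", attributes: {}, children: [
    { id: "list", tagName: "div", attributes: { class: "list" }, children: [leaf("first"), leaf("second")] },
    { id: "other", tagName: "div", attributes: { class: "other" }, children: [leaf("third")] },
  ],
};
const units = [
  { tagName: "div", condition: { attribute: "class", value: "list" } },
  { tagName: "a", condition: null },
];
// Logs the ids of the two <a> nodes under the div whose class is "list".
console.log(matchUnits(page, units).map((n) => n.id));
```

If no node satisfies a unit, every later pool is empty and the result is an empty array, which corresponds to stopping early in the described procedure.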
In the above technical solution, step 4) comprises the following process: all nodes in the variable nodes are traversed, and each node's innerHTML or outerHTML attribute value is saved to a new array variable.
The principle of the invention is as follows. The method does away with the complex parsing of data content that traditional information crawling requires. It uses JavaScript to call the system's built-in functions to read the content of the webpage directly, and it fully separates the crawling conditions from the program. A user who crawls webpages with this method therefore gets stable, reliable results without parsing errors, does not need to care how the crawling process is implemented, and does not need to modify the program code of the crawling process. Only the crawling condition has to be changed. Because the condition can be written as an XPath expression or as an XPath-like expression from jQuery (a JavaScript framework), the learning cost for users is low, the method is easy to master, and the development efficiency of crawling programs improves considerably.
Compared with the prior art, the invention has the following advantages:
(1) Any content in a webpage can be crawled simply by modifying the conditional expression, without writing separate parsing code for each webpage;
(2) Because the conditional expression is user-defined, it is flexible and can be written in XPath or any other expression syntax;
(3) Because the invention uses the system's JavaScript functions to parse the HTML page, it avoids the omissions and errors that arise when regular-expression matching fails to account for every case;
(4) The invention makes it much easier for ordinary users to crawl Internet resources for their own use, improving efficiency.

Brief Description of the Drawings

Figure 1 and Figure 2 are detailed flowcharts of an embodiment of the present invention.

Detailed Description

A general method for crawling webpage content that can be used for any webpage comprises the following steps:
1) The client inputs a target URL to be crawled and a conditional expression, and a subpage that displays the full content of the webpage is generated on the client;
2) The client parses the conditional expression into an array variable of node tag names and conditions;
3) The array is traversed, the nodes that meet the conditions are located in the subpage, and all nodes that satisfy the last condition are saved to an array variable;
4) The client obtains the innerHTML or outerHTML attribute value of every node in the saved array variable.
The specific steps of the method are described in further detail below:
Referring to Figure 1 and Figure 2, the user enters a valid URL (for example, www.***.com) and a conditional expression (for example, /div[@class=list]) on the client, and a request is sent to the back-end server. A server-side program analyzes the entered URL string and checks whether the URL is valid (whether it has the form http://XXX or XXX.XXX). If it is valid, the process continues to the next step; otherwise the user is prompted to re-enter the URL. The client generates the corresponding subpage from the URL and assigns the URL to it. The subpage independently requests the server, obtains the full content of the webpage at that URL, and displays it, where the webpage content is one or more HTML elements in the webpage. JavaScript program code is embedded in the subpage.
The client reads the value of the conditional expression from the parent window and parses it into an array variable with a parent-child structure, in which each array unit is the parent node of the unit that follows it, and then parses each array unit into an array holding a node tag name and a node condition. Specifically, assume the conditional expression uses the "/" symbol as the separator between parent and child nodes, with the parent node before "/" and the child node after it. The array variable can then be produced by calling the built-in JavaScript method split(), for example "conditional expression".split("/"), which converts the conditional expression into an array variable a with a parent-child relationship (a is the chosen array variable name).
The array variable a is traversed, and each of its unit variables is parsed into an array variable b (b is the chosen array variable name) consisting of a node tag name (either a standard tag name or a tag name made up of custom special symbols) and a node matching condition. The array variable b is then traversed, and the subpage's JavaScript program uses the getElementsByTagName() method to find all node objects corresponding to the node tag name. All node objects are traversed, and the JavaScript program uses the getAttribute() method to obtain each node object's attribute value and compares it with the node condition. If the matching condition is met, the next unit variable of array variable a is taken. The loop continues until the length of array variable a becomes 0, meaning that every node in the conditional expression has been searched, and all nodes matched by the last unit of the array variable are saved to an array variable match (match is the chosen array variable name).
The client traverses the array variable match, takes out each unit variable, and crawls the corresponding webpage content by reading its innerHTML or outerHTML attribute value.
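The embodiment names the DOM calls it relies on: split(), getElementsByTagName(), getAttribute(), and the innerHTML and outerHTML properties. The following is a hedged end-to-end sketch built on those calls, not the applicant's code. The function name `grabContent`, its parameters, and the single [@attribute=value] condition syntax are assumptions. Because getElementsByTagName() returns all descendants, the sketch filters by parentNode so that later units accept only direct children of the previous matches.

```javascript
// Sketch only: grabContent and its parameters are hypothetical names.
// `doc` is expected to be a DOM Document, for example the subpage's document.
function grabContent(doc, expression, useOuter) {
  const a = expression.split("/").filter((s) => s !== ""); // array variable a
  let match = [];
  a.forEach((unit, index) => {
    // Array variable b: [tagName, attributeName, attributeValue].
    const parts = /^([^\[]+)(?:\[@([^=]+)=([^\]]*)\])?$/.exec(unit);
    if (parts === null) throw new Error("Unsupported unit: " + unit);
    const b = [parts[1], parts[2], parts[3]];
    const scopes = index === 0 ? [doc] : match;
    const found = [];
    for (const scope of scopes) {
      for (const el of Array.from(scope.getElementsByTagName(b[0]))) {
        // Later units only accept direct children of the previous matches.
        if (index > 0 && el.parentNode !== scope) continue;
        if (b[1] === undefined || el.getAttribute(b[1]) === b[2]) found.push(el);
      }
    }
    match = found;
  });
  // Step 4: read innerHTML or outerHTML from every matched node.
  return match.map((el) => (useOuter ? el.outerHTML : el.innerHTML));
}

// In a browser, the function could be called on the current document.
if (typeof document !== "undefined") {
  console.log(grabContent(document, "/div[@class=list]/a", false));
}
```

The sketch iterates over a with forEach instead of consuming it until its length is 0; the visiting order is the same, and the input array is left intact.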

Claims

1. A general method for crawling webpage content that can be used for any webpage, characterized in that it comprises the following steps:
1) The client inputs a target URL to be crawled and a conditional expression, and a subpage that displays the full content of the webpage is generated on the client;
2) The client parses the conditional expression into an array variable of node tag names and conditions;
3) The array is traversed, the nodes that meet the conditions are located in the subpage, and all nodes that satisfy the last condition are saved to an array variable;
4) The client obtains the innerHTML or outerHTML attribute value of every node in the saved array variable.
2. The general method for crawling webpage content that can be used for any webpage according to claim 1, characterized in that step 1) comprises the following process:
1a) The client inputs a URL and a conditional expression, and a JavaScript script on the client checks whether the entered URL is valid. If it is valid, the process continues to the next step; otherwise the user is prompted to re-enter the URL;
1b) The client generates the corresponding subpage and assigns the URL to it. The subpage independently requests the server, and the server obtains the full content of the webpage at that URL.
3. The general method for crawling webpage content that can be used for any webpage according to claim 1 or 2, characterized in that, in step 1), the conditional expression is entered using one of: XPath, a CSS selector, or rules defined in the program.
4. The general method for crawling webpage content that can be used for any webpage according to claim 3, characterized in that step 3) comprises the following process:
3a) The client takes one unit variable of the array variable and searches the subpage for nodes by the unit variable's node tag name;
3b) Each node found is compared with the condition set for it, and if the condition is met, the process moves to the next step;
3c) Step 3a) is repeated until the array length is 0;
3d) All nodes that satisfy the condition of the last unit variable are saved to an array variable nodes, where nodes is the chosen array variable name.
5. The general method for crawling webpage content that can be used for any webpage according to claim 4, characterized in that step 4) comprises the following process: all nodes in the variable nodes are traversed, and each node's innerHTML or outerHTML attribute value is saved to a new array variable.
6. The general method for crawling webpage content that can be used for any webpage according to claim 1 or 2, characterized in that step 2) comprises the following process: the client parses the conditional expression into an array variable containing node tag names and node conditions.
7. The general method for crawling webpage content that can be used for any webpage according to claim 2, characterized in that a JavaScript program is added to the subpage, and the JavaScript program parses the conditional expression into a JavaScript array variable containing node tag names and node conditions.
8. The general method for crawling webpage content that can be used for any webpage according to claim 7, characterized in that the JavaScript program uses the getElementsByTagName() method to find node objects by node tag name, and uses the getAttribute() method to obtain a node's attribute value, which is then compared with the node condition.
PCT/CN2010/076100 2010-01-12 2010-08-18 Webpage contents grabbing method which can be general adapted to any webpage WO2011085588A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010002563A CN101763425A (en) 2010-01-12 2010-01-12 Universal method for capturing webpage contents of any webpage
CN201010002563.6 2010-01-12

Publications (1)

Publication Number Publication Date
WO2011085588A1 (en) 2011-07-21

Family

ID=42494589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/076100 WO2011085588A1 (en) 2010-01-12 2010-08-18 Webpage contents grabbing method which can be general adapted to any webpage

Country Status (2)

Country Link
CN (1) CN101763425A (en)
WO (1) WO2011085588A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763425A (en) * 2010-01-12 2010-06-30 苏州阔地网络科技有限公司 Universal method for capturing webpage contents of any webpage
CN103002061B (en) * 2011-09-16 2015-06-24 阿里巴巴集团控股有限公司 Method and device for mutual conversion of long domain names and short domain names
CN103139260B (en) * 2011-11-30 2015-09-30 国际商业机器公司 For reusing the method and system of HTML content
CN103164195B (en) * 2011-12-13 2017-06-23 阿里巴巴集团控股有限公司 Selector technique of expression and device based on browser
CN102591612B (en) * 2011-12-27 2014-12-03 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof
CN103838747B (en) * 2012-11-22 2017-07-07 富士通株式会社 Network service construction method and equipment and webpage data extracting method and equipment
CN105677862A (en) * 2016-01-08 2016-06-15 上海数道信息科技有限公司 Method and device for grabbing webpage content
CN109032917B (en) * 2017-06-09 2021-06-18 北京金山云网络技术有限公司 Page debugging method and system, mobile terminal and computer terminal
CN107463713A (en) * 2017-08-24 2017-12-12 四川长虹电器股份有限公司 The method of fast verification CSS selector
CN109508181A (en) * 2017-09-14 2019-03-22 韩真 A kind of method of efficient semantization front end selection subscheme
CN107729475B (en) * 2017-10-16 2021-07-02 深圳视界信息技术有限公司 Webpage element acquisition method, device, terminal and computer-readable storage medium
CN109063110A (en) * 2018-07-28 2018-12-21 安徽捷兴信息安全技术有限公司 A kind of grasping means and device using application message in store
CN110276039B (en) * 2019-06-27 2021-09-28 北京金山安全软件有限公司 Page element path generation method and device and electronic equipment
CN110795647A (en) * 2019-10-29 2020-02-14 维沃移动通信有限公司 Website prompting method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101206664A (en) * 2007-12-17 2008-06-25 张尧森 Method for interception and incorporation of web page information unit
CN101320370A (en) * 2008-05-16 2008-12-10 崔志明 Deep layer web page data source sort management method based on query interface connection drawing
CN101520796A (en) * 2009-02-16 2009-09-02 深圳市腾讯计算机***有限公司 Method and system for extracting uniform resource locators from web page content
CN101763425A (en) * 2010-01-12 2010-06-30 苏州阔地网络科技有限公司 Universal method for capturing webpage contents of any webpage

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019925A (en) * 2011-09-26 2013-04-03 阿里巴巴集团控股有限公司 Selector acquisition method and device
CN103019925B (en) * 2011-09-26 2015-02-18 阿里巴巴集团控股有限公司 Selector acquisition method and device

Also Published As

Publication number Publication date
CN101763425A (en) 2010-06-30

Similar Documents

Publication Publication Date Title
WO2011085588A1 (en) Webpage contents grabbing method which can be general adapted to any webpage
US7873663B2 (en) Methods and apparatus for converting a representation of XML and other markup language data to a data structure format
TWI592807B (en) Method and device for web style address merge
US8566702B2 (en) Methods and systems of outputting content of interest
WO2017124692A1 (en) Method and apparatus for searching for conversion relationship between form pages and target pages
US20110145299A1 (en) Offline Gadgets IDE
JP2013506175A (en) Management of application state information by unified resource identifier (URI)
Szeredi et al. The semantic web explained: The technology and mathematics behind web 3.0
Danielsen et al. Validation and interactivity of web API documentation
US20130232424A1 (en) User operation detection system and user operation detection method
JP4935399B2 (en) Security operation management system, method and program
WO2011017929A1 (en) Method and apparatus for positioning effective information quickly by mobile phone browser
WO2015109928A1 (en) Method, device and system for loading recommendation information and detecting url
JP2008134906A (en) Business process definition generation method, device and program
US20120072824A1 (en) Content acquisition documents, methods, and systems
US8875094B2 (en) System and method for implementing intelligent java server faces (JSF) composite component generation
EP2431891A1 (en) Methods and systems of outputting content of interest
JP2009259248A (en) Method and unit for tagging images included in web page and providing web retrieval service by using the result and computer-readable recording medium
CN110851678A (en) Method and device for crawling data
TWI610190B (en) A method of client side page processing and server side page generating thereof for reducing html tags
Jin Image information collection system based on Python Web crawler technology
AU2022203715B2 (en) Extracting explainable corpora embeddings
EP2431893A1 (en) Methods and systems for identifying content elements
Mayer Service integration-a web of things perspective
JP2010146111A (en) Multilingual database server and multilingual database system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10842871

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10842871

Country of ref document: EP

Kind code of ref document: A1