WO2011085588A1

WO2011085588A1 - Webpage contents grabbing method which can be general adapted to any webpage

Info

Publication number: WO2011085588A1
Application number: PCT/CN2010/076100
Authority: WO
Inventors: 胡加明
Original assignee: 苏州阔地网络科技有限公司
Priority date: 2010-01-12
Filing date: 2010-08-18
Publication date: 2011-07-21
Also published as: CN101763425A

Abstract

A general-purpose method for grabbing content from any webpage is disclosed. The method involves: a user inputting a website address and a condition expression describing the content to be grabbed; creating a sub-page and sending the website address and the condition expression to it; the sub-page requesting a server, obtaining the full webpage content of the website, and embedding a segment of JavaScript code into that content; the JavaScript code transforming the condition expression into an array variable; the JavaScript code traversing the array to find all nodes that match the conditions; and the JavaScript code reading the innerHTML or outerHTML attribute values of all matching nodes to obtain the corresponding webpage content.

Description

A general method for grabbing webpage content that can be used for any webpage

Technical field

The invention belongs to the field of network technology, and in particular relates to a general method for grabbing webpage content that can be used for any webpage.

Background

In the Internet era, abundant online resources have greatly enriched people's access to information. However, as the volume of information grows explosively, ordinary Internet users struggle with how to easily and conveniently grab the content that is valuable to them from any other webpage on the Internet and use it on their own webpages. Traditional grabbing methods have high technical barriers: to grab a particular block of content from a particular webpage, one usually has to perform a complicated parse of that page's data before the needed content can finally be extracted, and as soon as a different webpage is to be grabbed, the parsing code has to be designed all over again. The process involves a great deal of repeated work and is inefficient. Moreover, because all of the parsing relies on hand-written parsing code rather than on the system's native functions, parsing errors occur easily, and ordinary Internet users find such complicated operations very difficult to carry out.

Summary of the invention

The object of the present invention is to provide a general method for grabbing webpage content that can be used for any webpage, improving both the efficiency and the accuracy of webpage content grabbing.

To achieve this object, the technical solution adopted by the present invention is a general method for grabbing webpage content that can be used for any webpage, comprising the following steps:

1) The client inputs a target URL to be grabbed and a conditional expression, and a subpage that displays the entire content of the webpage is generated on the client;

2) The client parses the conditional expression into an array variable of node tag names and conditions;

3) Take a unit variable from the array variable, find the corresponding nodes according to the tag name in the unit variable, and determine whether these nodes satisfy the condition set by the unit variable; if they do, go on to take the next unit variable of the array. Repeat this loop until the length of the array variable is 0, and finally save all nodes that satisfy the condition set by the last unit variable to an array variable;

4) The client reads the innerHTML attribute value or the outerHTML attribute value of every node in the saved array variable.

In the above technical solution, step 1) includes the following process:

1a) The client enters a URL and a conditional expression. A JavaScript script on the client checks whether the entered URL is valid; if it is, the process continues to the next step, otherwise the user is prompted to re-enter the URL;

1b) The client generates the corresponding subpage and assigns the URL to it. The subpage independently requests the server, and the server obtains the entire content of the webpage at that URL, where the content consists of one or more HTML elements in the webpage.
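
The validity check in step 1a can be sketched as follows. This is only an illustrative sketch: the function name isPlausibleUrl and the two regular expressions are assumptions, chosen to accept the two forms mentioned later in the detailed description (http://XXX and XXX.XXX). The source does not specify the actual test that is applied.

```javascript
// Hypothetical helper (not from the source): returns true when the input
// looks like "http://XXX" (https is also accepted here) or like "XXX.XXX".
function isPlausibleUrl(input) {
  const text = String(input).trim();
  // Form 1: explicit scheme followed by a non-empty host and an optional path.
  const withScheme = /^https?:\/\/[^\s\/]+(\/\S*)?$/;
  // Form 2: bare host made of dot-separated labels, with an optional path.
  const bareHost = /^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(\/\S*)?$/;
  return withScheme.test(text) || bareHost.test(text);
}

// Usage: prompt again when the check fails (the prompt itself is omitted).
console.log(isPlausibleUrl("http://example.com/news")); // true
console.log(isPlausibleUrl("not a url"));               // false
```

A pattern check like this only filters obviously malformed input; whether the page actually exists is discovered when the subpage requests it.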

In the above technical solution, step 2) includes the following process: the client first parses the conditional expression into an array variable with a parent-child structure, in which each array unit is the parent node of the unit that follows it, and then parses each array unit into an array holding a node tag name and a node condition.
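
A minimal sketch of this two-stage parse is shown below, assuming the "/" separator and the tag[@attribute=value] form used in the example of the detailed description. The function name parseExpression and the object shape { tag, condition } are illustrative assumptions; the source only says that each unit holds a node tag name and a node condition.

```javascript
// Hypothetical parser (names and result shape are assumptions).
// Stage 1: split on "/" so that each unit is the parent of the next one.
// Stage 2: split each unit into a tag name and an optional attribute condition.
function parseExpression(expression) {
  return expression
    .split("/")
    .filter((step) => step.trim() !== "") // drop the empty unit before a leading "/"
    .map((step) => {
      const m = /^([^\[\]]+?)(?:\[@([^=\]]+)=([^\]]*)\])?$/.exec(step.trim());
      if (!m) throw new Error("Unsupported step: " + step);
      if (m[2] === undefined) return { tag: m[1], condition: null };
      const value = m[3].replace(/^["']|["']$/g, ""); // tolerate quoted values
      return { tag: m[1], condition: { attr: m[2], value: value } };
    });
}

const steps = parseExpression("/div[@class=list]/a");
console.log(JSON.stringify(steps));
// [{"tag":"div","condition":{"attr":"class","value":"list"}},{"tag":"a","condition":null}]
```

Only a single attribute-equals-value condition per unit is handled here; a fuller XPath or selector grammar would need a real parser.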

In the above technical solution, step 3) includes the following process: take one unit variable of the array variable as the current unit variable. If it is the first unit variable, find the corresponding nodes in the subpage according to the node tag name of the current unit variable; if it is not the first unit variable, find the corresponding nodes, according to the node tag name of the current unit variable, among the child nodes of all nodes that satisfied the condition of the previous unit variable. Traverse all the nodes found and determine whether they satisfy the condition set by the current unit variable; if they do, go on to take the next unit variable of the array. Repeat step 3 until the length of the array variable is 0. Finally, save all nodes that satisfy the condition set by the last unit variable to an array variable nodes, where nodes is the chosen name of the array variable.
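
The loop of step 3 can be sketched as below. To keep the sketch runnable outside a browser, el() builds a small stand-in for DOM elements that offers only getAttribute() and getElementsByTagName(); in a real page the root would be the subpage's document. The names el and findNodes and the { tag, condition } step shape are assumptions. Note also that getElementsByTagName() returns all descendants, which is broader than the direct child nodes the text speaks of.

```javascript
// Minimal stand-in for a DOM element (an assumption for this sketch only).
function el(tag, attrs = {}, children = []) {
  return {
    tagName: tag.toUpperCase(),
    children: children,
    getAttribute: (name) => (name in attrs ? attrs[name] : null),
    getElementsByTagName(name) {
      const found = [];
      for (const child of children) {
        if (child.tagName === name.toUpperCase()) found.push(child);
        found.push(...child.getElementsByTagName(name)); // descendants, document order
      }
      return found;
    },
  };
}

// Consume the step array one unit at a time until its length is 0.
function findNodes(root, steps) {
  let current = [root];
  const queue = steps.slice(); // copy, so the caller's array is left intact
  while (queue.length > 0) {
    const { tag, condition } = queue.shift();
    const next = new Set(); // a Set avoids duplicates from nested scopes
    for (const scope of current) {
      for (const node of scope.getElementsByTagName(tag)) {
        if (!condition || node.getAttribute(condition.attr) === condition.value) {
          next.add(node);
        }
      }
    }
    current = [...next];
  }
  return current; // the nodes matching the last unit
}

const page = el("html", {}, [
  el("body", {}, [
    el("div", { "class": "list" }, [el("a", { href: "/1" }), el("a", { href: "/2" })]),
    el("div", { "class": "nav" }, [el("a", { href: "/home" })]),
  ]),
]);
const nodes = findNodes(page, [
  { tag: "div", condition: { attr: "class", value: "list" } },
  { tag: "a", condition: null },
]);
console.log(nodes.map((n) => n.getAttribute("href"))); // the two links under the list div: /1 and /2
```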

In the above technical solution, step 4) includes the following process: traverse all nodes in the variable nodes, read the innerHTML attribute value or the outerHTML attribute value of each node, and save the values to a new array variable.
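
Step 4 amounts to a simple loop over the matched nodes. In the sketch below, the function name collectHtml and the useOuter flag are assumptions, and plain objects carrying innerHTML and outerHTML properties stand in for real DOM elements so that the example runs outside a browser.

```javascript
// Hypothetical helper: copies innerHTML (or outerHTML) of each node
// into a new array, leaving the nodes array itself unchanged.
function collectHtml(nodes, useOuter = false) {
  const result = [];
  for (let i = 0; i < nodes.length; i++) {
    result.push(useOuter ? nodes[i].outerHTML : nodes[i].innerHTML);
  }
  return result;
}

// Stand-ins for matched elements; in a browser these would be Element objects.
const matched = [
  { innerHTML: "Item 1", outerHTML: "<li>Item 1</li>" },
  { innerHTML: "Item 2", outerHTML: "<li>Item 2</li>" },
];
console.log(collectHtml(matched));       // inner markup of each node
console.log(collectHtml(matched, true)); // markup including the node's own tag
```

The indexed loop is used instead of array methods because a live collection returned by the DOM is array-like but not a true array.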

The principle of the invention is as follows. The method does away with the complicated parsing of data content that traditional information grabbing requires: it uses JavaScript to call functions built into the system to read the content of the webpage directly, and it completely separates the grabbing conditions from the program. Users who grab webpages with this method therefore get stable, reliable results without parsing errors. They do not need to care how the grabbing process is implemented or modify its program code; they only need to modify the grabbing conditions. Because the conditions can be written as XPath expressions or as the XPath-like expressions used in jQuery (a JavaScript framework), the learning cost for users is low, the method is easy to master, and the development efficiency of grabbing programs is greatly improved.

Compared with the prior art, the invention has the following advantages:

(1) Any content in a webpage can be grabbed simply by modifying the conditional expression, without writing separate parsing code for each webpage;

(2) Because the conditional expression is user-defined, it is flexible and can be written in XPath or in any other expression form;

(3) Because the invention uses the system's JavaScript functions to parse the HTML page, it avoids the omissions and errors that arise when regular expressions are used for matching and the matching conditions have not been fully thought through;

(4) The invention makes it much easier for ordinary Internet users to grab Internet resources for their own use, and improves efficiency.

Drawings

Figures 1 and 2 are detailed flowcharts of an embodiment of the present invention.

Detailed description

A general method for grabbing webpage content that can be used for any webpage comprises the following steps:

1) The client inputs a target URL to be grabbed and a conditional expression, and a subpage that displays the entire content of the webpage is generated on the client;

2) The client parses the conditional expression into an array variable of node tag names and conditions;

3) Traverse the array, find the nodes in the subpage that satisfy the conditions, and save all nodes that satisfy the last condition to an array variable;

4) The client reads the innerHTML attribute value or the outerHTML attribute value of every node in the saved array variable.

The specific steps of the method are described in further detail below:

Referring to Figures 1 and 2, the user enters a valid URL (for example, www.***.com) and a conditional expression (for example, /div[@class=list]) on the client. A request is sent to the back-end server, and a server-side program analyzes the entered URL string and checks whether the URL is valid (that is, whether it has a form such as http://XXX or XXX.XXX). If the check passes, the process continues to the next step; otherwise the user is prompted to re-enter the URL. The client generates the corresponding subpage from the URL and assigns the URL to it. The subpage independently requests the server, obtains the entire content of the webpage at that URL, and displays it, where the webpage content consists of one or more HTML elements in the webpage; JavaScript program code is also embedded in the subpage.

The client reads the value of the conditional expression from the parent window and parses it into an array variable with a parent-child structure, in which each array unit is the parent node of the unit that follows it; each array unit is then parsed into an array holding a node tag name and a node condition. Specifically, assume the conditional expression uses the "/" symbol as the separator between parent and child nodes, with the parent node before the "/" and the child node after it. JavaScript can then call the system method split(), for example "conditional expression".split("/"), to convert the conditional expression into an array variable a that has the parent-child relationship (a is the chosen name of the array variable).

The array variable a is traversed, and each of its unit variables is parsed into an array variable b (b is the chosen name of the array variable) consisting of a node tag name (either a standard tag name or a tag name made up of custom special symbols) and a node matching condition. The array variable b is then traversed: the JavaScript program in the subpage uses the getElementsByTagName() method to find all node objects that correspond to the node tag name. It traverses these node objects, uses the getAttribute() method to read the attribute value of each one, and compares the value with the node condition; if the matching condition is met, it goes on to take the next unit variable of array variable a. The loop continues until the length of array variable a becomes 0, that is, until all nodes in the conditional expression have been looked up, and all nodes matched by the last unit of the array variable are saved to an array variable match (match is the chosen name of the array variable).

The client traverses the array variable match, takes out each unit variable, and grabs the corresponding webpage content by reading its innerHTML attribute value or outerHTML attribute value.

Claims

1. A general method for grabbing webpage content that can be used for any webpage, characterized in that it comprises the following steps:

1) The client inputs a target URL to be grabbed and a conditional expression, and a subpage that displays the entire content of the webpage is generated on the client;

2) The client parses the conditional expression into an array variable of node tag names and conditions;

3) Traverse the array, find the nodes in the subpage that satisfy the conditions, and save all nodes that satisfy the last condition to an array variable;

4) The client reads the innerHTML attribute value or the outerHTML attribute value of every node in the saved array variable.

2. The general method for grabbing webpage content that can be used for any webpage according to claim 1, characterized in that step 1) includes the following process:

1a) The client enters a URL and a conditional expression. A JavaScript script on the client checks whether the entered URL is valid; if it is, the process continues to the next step, otherwise the user is prompted to re-enter the URL;

1b) The client generates the corresponding subpage and assigns the URL to it. The subpage independently requests the server, and the server obtains the entire content of the webpage at that URL.

3. The general method for grabbing webpage content that can be used for any webpage according to claim 1 or 2, characterized in that, in step 1), the conditional expression is entered in one of the following forms: XPath, a CSS selector, or a rule defined in the program.

4. The general method for grabbing webpage content that can be used for any webpage according to claim 3, characterized in that step 3) includes the following process:

3a) The client takes one unit variable of the array variable and looks up nodes in the subpage according to the node tag name of the unit variable;

3b) The nodes found are compared with the condition set for them; if the condition is satisfied, the process moves on to the next step;

3c) Step 3a) is repeated until the array length is 0;

3d) All nodes that satisfy the condition set by the last unit variable are saved to an array variable nodes, where nodes is the chosen name of the array variable.

5. The general method for grabbing webpage content that can be used for any webpage according to claim 4, characterized in that step 4) includes the following process: traverse all nodes in the variable nodes, read the innerHTML attribute value or the outerHTML attribute value of each node, and save the values to a new array variable.

6. The general method for grabbing webpage content that can be used for any webpage according to claim 1 or 2, characterized in that step 2) includes the following process: the client parses the conditional expression into an array variable containing node tag names and node conditions.

7. The general method for grabbing webpage content that can be used for any webpage according to claim 2, characterized in that a JavaScript program is added to the subpage, and the JavaScript program parses the conditional expression into a JavaScript array variable containing node tag names and node conditions.

8. The general method for grabbing webpage content that can be used for any webpage according to claim 7, characterized in that the JavaScript program uses the getElementsByTagName() method to look up node objects according to the node tag name, and uses the getAttribute() method to read the attribute values of the nodes, which are then compared with the node conditions.