CN101763425A - Universal method for capturing webpage contents of any webpage - Google Patents
- Publication number: CN101763425A (application number CN201010002563A)
- Authority: CN (China)
- Prior art keywords: webpage, node, variable, client, array
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a universal method for capturing the content of any webpage and belongs to the technical field of networks. The method comprises the following steps: the user enters the address of the website to be captured and a conditional expression, a subpage is created, and the address and the conditional expression are passed to the subpage; the subpage sends a request to a server, obtains the full content of the webpage at that address, and embeds a segment of JavaScript program in the webpage content; the JavaScript program converts the conditional expression into an array variable, traverses the array, and finds all nodes that satisfy the conditions; the JavaScript program then obtains the innerHTML or outerHTML attribute value of each node and thereby the corresponding webpage content. With this method, the user can capture any content of a webpage simply by modifying the conditional expression, without writing page-specific parsing code for each webpage.
Description
Technical field
The invention belongs to the field of network technology and specifically relates to a general method for capturing webpage content that can be applied to any webpage.
Background art
In the Internet age, abundant online resources have greatly enriched people's access to information. However, as the volume of information grows explosively, an ordinary Internet user who wants to capture valuable content from any other webpage and reuse it on his or her own page faces a real difficulty. Conventional capture techniques have a high technical barrier: to capture one block of content from a particular webpage, the page's data must first go through a complicated parsing process before the required content can be extracted. As soon as the target changes to another webpage, the parsing code must be redesigned. The process involves a great deal of repeated work and is inefficient. Because all parsing relies on hand-written parsing code rather than on the system's native functions, parsing errors are common, and ordinary users find such complicated operations difficult to carry out.
The present method does away with the complicated parsing of page data that conventional capture requires. It uses JavaScript to call functions built into the system to read the content of the webpage directly, and it separates the capture conditions completely from the program. A user who captures webpages with this method obtains stable and reliable results without parsing errors, does not need to know how the capture process is implemented, and does not need to modify the capture code; only the capture condition has to be changed. Because the condition can be written as an XPath expression or as an XPath-like expression of the kind used in jQuery (a JavaScript framework), the learning cost is low, the method is easy to master, and the development efficiency of capture programs improves considerably.
Summary of the invention
The technical problem addressed by the invention is to provide a general method for capturing the content of any webpage.
To solve this problem, the invention adopts the following technical scheme. A general method for capturing the content of any webpage comprises the following steps:
1) the client receives a target web address and a conditional expression for the content to be captured, and generates a subpage that displays the full content of the target webpage;
2) the client parses the conditional expression into an array variable of node tag names and conditions;
3) the client takes an element of the array variable, finds the corresponding nodes by the element's tag name, and checks whether those nodes satisfy the condition set by the element; if they do, it takes the next element of the array. The loop continues until the length of the array variable is 0. Finally, all nodes that satisfy the condition of the last element are saved to an array variable;
4) the client obtains the innerHTML or outerHTML property value of every node in the saved array variable.
Step 1) comprises the following process: 1a) a web address and a conditional expression are entered at the client, and the client uses a JavaScript script to check whether the address is valid; if it is, the method proceeds to the next step, otherwise the user is prompted to re-enter the address; 1b) the client generates the corresponding subpage and passes the address to it; the subpage sends its own request to the server, and the server obtains the full content of the webpage at that address, the content being one or more HTML elements of the webpage.
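The address check in step 1a) might be sketched as follows. The description only says that a valid address has the form http://xxx or xxx.xxx, so the function name isValidAddress and both regular expressions are assumptions made for illustration, not part of the disclosed method.

```javascript
// Hypothetical helper: accepts the two address forms named in the description.
// The exact patterns are assumptions; the patent does not specify them.
function isValidAddress(input) {
  const text = String(input).trim();
  if (text === "") return false;
  // Form 1: an explicit http:// (or https://) scheme followed by non-space text.
  const withScheme = /^https?:\/\/\S+$/i;
  // Form 2: a bare host such as "example.com", optionally followed by a path.
  const bareHost = /^[a-z0-9-]+(\.[a-z0-9-]+)+(\/\S*)?$/i;
  return withScheme.test(text) || bareHost.test(text);
}
```

A caller would prompt the user to re-enter the address whenever the function returns false, and create the subpage otherwise.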
Step 2) comprises the following process: the client first parses the conditional expression into an array variable with a parent-child structure, in which each array element is the parent node of the element that follows it, and then parses each array element into an array consisting of a node tag name and a node condition.
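A minimal sketch of step 2) is given below, assuming the conditional expression uses "/" between parent and child and an XPath-like [@attribute=value] suffix for the condition, as in the example /div[@class=list]. The function name parseExpression and the object shape { tagName, condition } are illustrative assumptions; the patent leaves the exact syntax open.

```javascript
// Hypothetical parser for expressions such as "/div[@class=list]/a".
// Returns one entry per path segment, ordered from parent to child.
function parseExpression(expression) {
  return expression
    .split("/")                                   // parent/child separator
    .filter((segment) => segment.length > 0)      // drop the empty leading piece
    .map((segment) => {
      // tag name, optionally followed by [@attribute=value]
      const match = /^([^\[\]]+)(?:\[@([^=\]]+)=([^\]]*)\])?$/.exec(segment);
      if (!match) throw new Error("Unsupported segment: " + segment);
      const [, tagName, attribute, value] = match;
      return {
        tagName,
        condition: attribute === undefined
          ? null                                  // segment has no condition
          : { attribute, value: value.replace(/^['"]|['"]$/g, "") },
      };
    });
}

// parseExpression("/div[@class=list]/a") yields two entries:
// a div whose class must equal "list", then an a with no condition.
```

One limitation of splitting on "/" first is that an attribute value containing "/" (a URL, for instance) would be cut apart; a fuller parser would need to tokenize the brackets before splitting.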
Step 3) comprises the following process: one element of the array variable is taken as the current element. If the current element is the first element, the corresponding nodes are found in the subpage by the current element's node tag name. If it is not the first element, the corresponding nodes are found by the current element's node tag name among the child nodes of all nodes that satisfied the condition of the previous element. All nodes found are traversed and checked against the condition set by the current element; if they satisfy it, the next element of the array is taken. Step 3) is repeated until the length of the array variable is 0. Finally, all nodes that satisfy the condition of the last element are saved to an array variable nodes (nodes being the chosen array variable name).
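The loop in step 3) can be sketched as below. The sketch assumes each parsed element is an object of the form { tagName, condition }, where condition is either null or { attribute, value }; that shape and the function name findMatchingNodes are illustrative assumptions. The root argument is any object exposing getElementsByTagName(), such as the subpage's document.

```javascript
// Hypothetical matcher: narrows the node set one path segment at a time.
function findMatchingNodes(root, steps) {
  let current = [root];
  for (const step of steps) {
    const next = new Set();                       // avoids duplicate nodes
    for (const parent of current) {
      const candidates = Array.from(parent.getElementsByTagName(step.tagName));
      for (const node of candidates) {
        const ok = step.condition === null ||
          node.getAttribute(step.condition.attribute) === step.condition.value;
        if (ok) next.add(node);
      }
    }
    current = Array.from(next);
    if (current.length === 0) break;              // nothing left to refine
  }
  return current;                                 // nodes matching the last segment
}
```

Note that getElementsByTagName() returns all descendants with the given tag, not only direct children, so this sketch matches more loosely than a strict parent-child reading of the text; restricting to direct children would need an extra check on parentNode.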
Step 4) comprises the following process: all nodes in the variable nodes are traversed, and the innerHTML or outerHTML property value of each node is obtained and saved to a new array variable.
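Step 4) amounts to reading one property from each saved node. A minimal sketch, in which the function name collectHtml and the useOuter flag are assumptions introduced for illustration:

```javascript
// Hypothetical collector: reads innerHTML (default) or outerHTML from each node.
function collectHtml(nodes, useOuter = false) {
  const captured = [];
  for (const node of nodes) {
    captured.push(useOuter ? node.outerHTML : node.innerHTML);
  }
  return captured;
}
```

innerHTML gives only the markup inside each matched node, while outerHTML also includes the node's own tag, so the choice depends on whether the wrapping element is wanted in the captured content.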
Compared with the prior art, the invention has the following advantages:
(1) any content in a webpage can be captured by simply modifying the conditional expression, without writing page-specific parsing code for each webpage;
(2) because the conditional expression is user-defined, it is flexible and may be written in XPath or in any other expression form;
(3) because the system's JavaScript functions are used to parse the HTML page, the omissions and errors that arise when regular-expression matching conditions are not thought through carefully do not occur;
(4) ordinary Internet users can capture Internet resources for their own use much more conveniently and efficiently.
Description of the drawings
Fig. 1 and Fig. 2 are detailed flowcharts of the invention.
Embodiment
A general method for capturing the content of any webpage comprises the following steps:
1) the client receives a target web address and a conditional expression for the content to be captured, and generates a subpage that displays the full content of the target webpage;
2) the client parses the conditional expression into an array variable of node tags and conditions;
3) the array is traversed to find the nodes in the subpage that satisfy the conditions, and all nodes that satisfy the last condition are saved to an array variable;
4) the client obtains the innerHTML or outerHTML property value of every node in the saved array variable.
The concrete steps of the method are described further below:
Referring to Fig. 1 and Fig. 2, the user enters a valid web address (for example, www.***.com) and a conditional expression (for example, /div[@class=list]) at the client, and a request is sent to the background server. The server analyzes the address string and checks whether the address is valid, that is, whether it has the form http://xxx or xxx.xxx. If the check passes, the method proceeds to the next step; otherwise the user is prompted to re-enter the address. The client generates the corresponding subpage from the address and passes the address to it. The subpage sends its own request to the server, obtains the full content of the webpage at that address and displays it, the content being one or more HTML elements of the webpage, and JavaScript program code is embedded in the subpage. The client obtains the value of the conditional expression from the parent window and parses it into an array variable with a parent-child structure, in which each array element is the parent node of the element that follows it; each array element is then parsed into an array consisting of a node tag name and a node condition.
The concrete procedure is as follows. Assume that the conditional expression uses the "/" symbol as the separator between parent and child nodes: what precedes "/" is the parent node and what follows it is the child node. The conditional expression can then be converted into an array variable a with parent-child relationships (a being the chosen array variable name) by calling the built-in JavaScript method split(), as in "conditional expression".split("/"). The array variable a is traversed, and each of its elements is parsed into an array variable b (b being the chosen array variable name) made up of a node tag name (either a standard tag name or a tag formed from user-defined special symbols) and a node matching condition. The array variable b is traversed, and the JavaScript program in the subpage uses getElementsByTagName() to find all node objects for the node tag. All node objects are traversed; the JavaScript program uses getAttribute() to obtain the attribute value of each node object and compares it with the node condition. If the matching condition is met, the next element of the array variable a is taken. The loop continues until the length of the array variable a becomes 0, that is, until all nodes in the conditional expression have been looked up, and all nodes matched by the last element of the array are saved to an array variable match (match being the chosen array variable name).
The client traverses the array variable match, takes out each element, and captures the corresponding webpage content by obtaining its innerHTML or outerHTML property value.
Claims (8)
1. A general method for capturing the content of any webpage, comprising the following steps:
1) the client receives a target web address and a conditional expression for the content to be captured, and generates a subpage that displays the full content of the target webpage;
2) the client parses the conditional expression into an array variable of node tags and conditions;
3) the array is traversed to find the nodes in the subpage that satisfy the conditions, and all nodes that satisfy the last condition are saved to an array variable;
4) the client obtains the innerHTML or outerHTML property value of every node in the saved array variable.
2. The general method for capturing the content of any webpage according to claim 1, characterized in that step 1) comprises the following process: 1a) a web address and a conditional expression are entered at the client, and the client uses a JavaScript script to check whether the address is valid; if it is, the method proceeds to the next step, otherwise the user is prompted to re-enter the address; 1b) the client generates the corresponding subpage and passes the address to it; the subpage sends its own request to the server, and the server obtains the full content of the webpage at that address.
3. The general method for capturing the content of any webpage according to claim 1 or 2, characterized in that the user may write the conditional expression using XPath, CSS selectors, or rules defined in the program.
4. The general method for capturing the content of any webpage according to claim 1 or 2, characterized in that step 2) comprises the following process: the client parses the conditional expression into an array variable comprising node tag names and node conditions.
5. The general method for capturing the content of any webpage according to claim 2, characterized in that a section of JavaScript program is added to the subpage, and this JavaScript program parses the conditional expression into an array variable of the JavaScript program comprising node tags and node conditions.
6. The general method for capturing the content of any webpage according to claim 3, characterized in that step 3) comprises the following process: 3a) the client takes one element of the array variable and searches the subpage for nodes by the element's node tag name; 3b) the nodes found are compared with the condition set for them, and if the condition is satisfied the method proceeds to the next step; 3c) step 3a) is repeated until the array length is 0; 3d) all nodes that satisfy the condition of the last element are saved to an array variable nodes (nodes being the chosen array variable name).
7. The general method for capturing the content of any webpage according to claim 3, characterized in that the JavaScript program uses getElementsByTagName() to find node objects by node tag, uses getAttribute() to obtain the attribute value of each node, and compares that value with the node condition.
8. The general method for capturing the content of any webpage according to claim 4, characterized in that step 4) comprises the following process: all nodes in the variable nodes are traversed, and the innerHTML or outerHTML property value of each node is obtained and saved to a new array variable.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010002563A CN101763425A (en) | 2010-01-12 | 2010-01-12 | Universal method for capturing webpage contents of any webpage |
PCT/CN2010/076100 WO2011085588A1 (en) | 2010-01-12 | 2010-08-18 | Webpage contents grabbing method which can be general adapted to any webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010002563A CN101763425A (en) | 2010-01-12 | 2010-01-12 | Universal method for capturing webpage contents of any webpage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101763425A (en) | 2010-06-30 |
Family
ID=42494589
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010002563A (Pending; published as CN101763425A) | Universal method for capturing webpage contents of any webpage | 2010-01-12 | 2010-01-12 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN101763425A (en) |
WO (1) | WO2011085588A1 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011085588A1 (en) * | 2010-01-12 | 2011-07-21 | 苏州阔地网络科技有限公司 | Webpage contents grabbing method which can be general adapted to any webpage |
CN102591612A (en) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | General webpage text extraction method based on punctuation continuity and system thereof |
CN103002061A (en) * | 2011-09-16 | 2013-03-27 | 阿里巴巴集团控股有限公司 | Method and device for mutual conversion of long domain names and short domain names |
CN103139260A (en) * | 2011-11-30 | 2013-06-05 | 国际商业机器公司 | Method and system for reusing hypertext markup language (HTML) content |
CN103164195A (en) * | 2011-12-13 | 2013-06-19 | 阿里巴巴集团控股有限公司 | Selector presenting method based on browser and device |
CN103838747A (en) * | 2012-11-22 | 2014-06-04 | 富士通株式会社 | Network service construction method and device and webpage data extraction method and device |
CN105677862A (en) * | 2016-01-08 | 2016-06-15 | 上海数道信息科技有限公司 | Method and device for grabbing webpage content |
CN107463713A (en) * | 2017-08-24 | 2017-12-12 | 四川长虹电器股份有限公司 | The method of fast verification CSS selector |
CN107729475A (en) * | 2017-10-16 | 2018-02-23 | 深圳视界信息技术有限公司 | Web page element acquisition method, device, terminal and computer-readable recording medium |
CN109032917A (en) * | 2017-06-09 | 2018-12-18 | 北京金山云网络技术有限公司 | Page adjustment method and system, mobile terminal and computer end |
CN109063110A (en) * | 2018-07-28 | 2018-12-21 | 安徽捷兴信息安全技术有限公司 | A kind of grasping means and device using application message in store |
CN109508181A (en) * | 2017-09-14 | 2019-03-22 | 韩真 | A kind of method of efficient semantization front end selection subscheme |
CN110276039A (en) * | 2019-06-27 | 2019-09-24 | 北京金山安全软件有限公司 | Page element path generation method and device and electronic equipment |
CN110795647A (en) * | 2019-10-29 | 2020-02-14 | 维沃移动通信有限公司 | Website prompting method and device |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103019925B (en) * | 2011-09-26 | 2015-02-18 | 阿里巴巴集团控股有限公司 | Selector acquisition method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100559374C (en) * | 2007-12-17 | 2009-11-11 | 杭州阔地网络科技有限公司 | The intercepting of info web unit, the method that merges |
CN101320370B (en) * | 2008-05-16 | 2011-06-01 | 苏州普达新信息技术有限公司 | Deep layer web page data source sort management method based on query interface connection drawing |
CN101520796A (en) * | 2009-02-16 | 2009-09-02 | 深圳市腾讯计算机***有限公司 | Method and system for extracting uniform resource locators from web page content |
CN101763425A (en) * | 2010-01-12 | 2010-06-30 | 苏州阔地网络科技有限公司 | Universal method for capturing webpage contents of any webpage |
2010
- 2010-01-12: CN application CN201010002563A, published as CN101763425A (active, Pending)
- 2010-08-18: WO application PCT/CN2010/076100, published as WO2011085588A1 (active, Application Filing)
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011085588A1 (en) * | 2010-01-12 | 2011-07-21 | 苏州阔地网络科技有限公司 | Webpage contents grabbing method which can be general adapted to any webpage |
CN103002061A (en) * | 2011-09-16 | 2013-03-27 | 阿里巴巴集团控股有限公司 | Method and device for mutual conversion of long domain names and short domain names |
CN103002061B (en) * | 2011-09-16 | 2015-06-24 | 阿里巴巴集团控股有限公司 | Method and device for mutual conversion of long domain names and short domain names |
US10318616B2 (en) | 2011-11-30 | 2019-06-11 | International Business Machines Corporation | Method and system for reusing HTML content |
CN103139260A (en) * | 2011-11-30 | 2013-06-05 | 国际商业机器公司 | Method and system for reusing hypertext markup language (HTML) content |
CN103139260B (en) * | 2011-11-30 | 2015-09-30 | 国际商业机器公司 | For reusing the method and system of HTML content |
US10678994B2 (en) | 2011-11-30 | 2020-06-09 | International Business Machines Corporation | Method and system for reusing HTML content |
US9507759B2 (en) | 2011-11-30 | 2016-11-29 | International Business Machines Corporation | Method and system for reusing HTML content |
CN103164195A (en) * | 2011-12-13 | 2013-06-19 | 阿里巴巴集团控股有限公司 | Selector presenting method based on browser and device |
CN103164195B (en) * | 2011-12-13 | 2017-06-23 | 阿里巴巴集团控股有限公司 | Selector technique of expression and device based on browser |
CN102591612A (en) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | General webpage text extraction method based on punctuation continuity and system thereof |
CN102591612B (en) * | 2011-12-27 | 2014-12-03 | 厦门市美亚柏科信息股份有限公司 | General webpage text extraction method based on punctuation continuity and system thereof |
CN103838747A (en) * | 2012-11-22 | 2014-06-04 | 富士通株式会社 | Network service construction method and device and webpage data extraction method and device |
CN103838747B (en) * | 2012-11-22 | 2017-07-07 | 富士通株式会社 | Network service construction method and equipment and webpage data extracting method and equipment |
CN105677862A (en) * | 2016-01-08 | 2016-06-15 | 上海数道信息科技有限公司 | Method and device for grabbing webpage content |
CN109032917A (en) * | 2017-06-09 | 2018-12-18 | 北京金山云网络技术有限公司 | Page adjustment method and system, mobile terminal and computer end |
CN109032917B (en) * | 2017-06-09 | 2021-06-18 | 北京金山云网络技术有限公司 | Page debugging method and system, mobile terminal and computer terminal |
CN107463713A (en) * | 2017-08-24 | 2017-12-12 | 四川长虹电器股份有限公司 | The method of fast verification CSS selector |
CN109508181A (en) * | 2017-09-14 | 2019-03-22 | 韩真 | A kind of method of efficient semantization front end selection subscheme |
CN107729475A (en) * | 2017-10-16 | 2018-02-23 | 深圳视界信息技术有限公司 | Web page element acquisition method, device, terminal and computer-readable recording medium |
CN109063110A (en) * | 2018-07-28 | 2018-12-21 | 安徽捷兴信息安全技术有限公司 | A kind of grasping means and device using application message in store |
CN110276039A (en) * | 2019-06-27 | 2019-09-24 | 北京金山安全软件有限公司 | Page element path generation method and device and electronic equipment |
CN110795647A (en) * | 2019-10-29 | 2020-02-14 | 维沃移动通信有限公司 | Website prompting method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2011085588A1 (en) | 2011-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101763425A (en) | | Universal method for capturing webpage contents of any webpage |
CN104125209B (en) | | Malice website prompt method and router |
US10691507B2 (en) | | API learning |
CN104410711A (en) | | Cross-domain network resource request method and device for client |
US20120210243A1 (en) | | Web co-navigation |
KR20190039230A (en) | | Method and system for server-side rendering of native content for presentations |
CN112287273B (en) | | Method, system and storage medium for classifying website list pages |
US11055373B2 (en) | | Method and apparatus for generating information |
CN103810268B (en) | | Search result recommendation information loading method, device and system and URL detection method, device and system |
CN103577427A (en) | | Browser kernel based web page crawling method and device and browser containing device |
US20130232424A1 (en) | | User operation detection system and user operation detection method |
US20210064453A1 (en) | | Automated application programming interface (api) specification construction |
CN110602269B (en) | | Method for converting domain name |
CN103593434A (en) | | Application recommendation method and device and server equipment |
CN102831190B (en) | | A kind of method that CML files are browsed in low side devices |
US20190132378A1 (en) | | Identifying an http resource using multi-variant http requests |
CN103793508B (en) | | A kind of loading recommendation information, the methods, devices and systems of network address detection |
US9058399B2 (en) | | System and method for providing network resource identifier shortening service to computing devices |
JP5309121B2 (en) | | Information processing method, program, information processing system |
CN102915318A (en) | | Method and device for positioning and searching information in browser |
CN102314494A (en) | | Method and equipment for processing webpage contents |
CN101894109A (en) | | Database building method and device |
US8584007B2 (en) | | Information processing method, information processing apparatus, and program |
CN103825772A (en) | | Method for identifying user click behavior and gateway equipment |
CN104331512B (en) | | A kind of BBS pages automatic acquiring method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C12 | Rejection of a patent application after its publication | |
| | RJ01 | Rejection of invention patent application after publication | Open date: 20100630 |