CN101763425A - Universal method for capturing webpage contents of any webpage - Google Patents

Info

Publication number: CN101763425A
Authority: CN (China)
Prior art keywords: webpage, node, variable, client, array
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201010002563A
Other languages: Chinese (zh)
Inventor: 胡加明
Current Assignee: Suzhou Codyy Network Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Suzhou Codyy Network Technology Co Ltd
Priority date: 2010-01-12 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2010-01-12
Publication date: 2010-06-30
Application filed by: Suzhou Codyy Network Technology Co Ltd
Priority to: CN201010002563A
Publication of: CN101763425A
Priority to: PCT/CN2010/076100 (WO2011085588A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a universal method for capturing webpage content from any webpage, and belongs to the technical field of networks. The method comprises the following steps: a user inputs the URL of the website to be captured and a conditional expression, and a sub-page is created to which the URL and the conditional expression are passed; the sub-page sends a request to a server, obtains the entire content of the webpage at that URL, and embeds a segment of JavaScript program into the webpage content; the JavaScript program converts the conditional expression into an array variable, traverses the array, and finds all nodes that meet the conditions; and the JavaScript program obtains the innerHTML or outerHTML property value of each of these nodes, thereby obtaining the corresponding webpage content. With this method, a user can capture any content of a webpage simply by modifying the conditional expression, without writing webpage-parsing code for each webpage.

Description

A general method for capturing webpage content that can be used with any webpage
Technical field
The invention belongs to the field of network technology, and specifically relates to a general method for capturing webpage content that can be used with any webpage.
Background art
In the Internet age, abundant online resources have greatly enriched and simplified people's access to information. However, as the volume of information grows explosively, ordinary users find it hard to capture the content they value from arbitrary webpages on the Internet and reuse it on their own pages. Traditional capture approaches have a high technical barrier: to capture one block of content from a webpage, the page data usually has to go through a complicated parsing pass before the needed content can be extracted, and as soon as a different webpage is targeted the parsing code has to be redesigned. This process repeats a large amount of work and is inefficient. Because all of the parsing relies on hand-written code rather than on the system's native functions, parsing errors are common, and ordinary users can hardly carry out such complicated operations.
The present method does away with the complicated parsing of page data used in conventional capture. It uses JavaScript to call functions built into the system to read the content of the webpage directly, and it completely separates the capture conditions from the program. A user who captures webpages this way gets stable and reliable results without parsing errors, does not need to know how the capture is implemented, and does not need to modify the capture code; only the capture condition has to be changed. Because the capture condition can be written as an XPath expression or as an XPath-like expression of the kind used in jQuery (a JavaScript framework), the learning cost for users is low, the method is easy to master, and the development efficiency of capture programs is greatly improved.
Summary of the invention
The technical problem to be solved by the invention is to provide a general method for capturing webpage content that can be used with any webpage.
To solve the above technical problem, the invention adopts the following technical solution: a general method for capturing webpage content that can be used with any webpage, comprising the following steps:
1) the client inputs a target URL to be captured and a conditional expression, and a sub-page that displays the entire content of the webpage is generated on the client;
2) the client parses the conditional expression into an array variable of node tag names and conditions;
3) an element of the array variable is obtained, the corresponding nodes are found according to the tag name of the element, and it is judged whether these nodes meet the condition set by the element; if they do, the next element of the array is obtained. This loops until the length of the array variable is 0. Finally, all nodes that meet the condition set by the last element are saved to an array variable;
4) the client obtains the innerHTML or outerHTML property value of every node in the saved array variable.
Step 1) comprises the following process: 1a) the client inputs a URL and a conditional expression, and uses a JavaScript script to judge whether the input URL is valid; if the check passes, the method continues to the next step, otherwise the user is prompted to re-enter the URL; 1b) the client generates the corresponding sub-page and passes the URL to it; the sub-page independently sends a request to the server, and the server obtains the entire content of the webpage at that URL, where the content is one or more HTML elements in the webpage.
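The validity check in 1a) might look roughly like the sketch below. The patent does not give the exact rule; the embodiment only mentions forms such as http://xxx or xxx.xxx, so the two regular expressions and the helper names isValidUrl and normalizeUrl are assumptions made for illustration.

```javascript
// Hypothetical helper: accept input that looks like "http://xxx"
// (https is also allowed here) or like a bare host "xxx.xxx".
function isValidUrl(input) {
  const text = String(input).trim();
  const withScheme = /^https?:\/\/[^\s\/]+/i;                  // e.g. http://example.com/list
  const bareHost = /^[^\s\/.:]+(\.[^\s\/.:]+)+(\/\S*)?$/;      // e.g. www.example.com
  return withScheme.test(text) || bareHost.test(text);
}

// Hypothetical helper: add a scheme to a bare host so that the
// sub-page has a complete address to request.
function normalizeUrl(input) {
  const text = String(input).trim();
  return /^https?:\/\//i.test(text) ? text : "http://" + text;
}
```

A check of this kind only filters obviously malformed input; whether the address actually resolves is only known once the sub-page requests it.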
Step 2) comprises the following process: the client first parses the conditional expression into an array variable with a parent-child structure, in which each array element is the parent node of the element that follows it, and then parses each array element into an array holding a node tag name and a node condition;
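As a minimal sketch of step 2), assume the expression uses "/" between parent and child levels and an optional [@attribute=value] condition, as in the embodiment's example /div[@class=list]. The function names parseExpression and parseStep, the { tag, attr, value } shape, and the accepted syntax are illustrative assumptions, not the patent's own definitions.

```javascript
// Split an expression such as "/div[@class=list]/ul/li" into
// parent-to-child levels, then split each level into a tag name and
// an optional attribute condition.
function parseExpression(expression) {
  return expression
    .split("/")                          // one entry per level
    .filter((part) => part.length > 0)   // drop the empty entry before a leading "/"
    .map((part) => parseStep(part));     // tag name and condition for each level
}

// Assumed syntax for one level: tag or tag[@attr=value]; quotes
// around the value are optional.
function parseStep(step) {
  const pattern = /^([A-Za-z][\w-]*|\*)(?:\[@([\w:-]+)=(["']?)(.*?)\3\])?$/;
  const m = pattern.exec(step.trim());
  if (!m) throw new Error("Unsupported step: " + step);
  const hasCondition = m[2] !== undefined;
  return {
    tag: m[1],
    attr: hasCondition ? m[2] : null,
    value: hasCondition ? m[4] : null,
  };
}
```

Because the split happens on every "/", this sketch cannot handle attribute values that themselves contain "/", such as URLs in an href condition.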
Step 3) comprises the following process: an element of the array variable is obtained as the current element. If the current element is the first element, the corresponding nodes are found in the sub-page according to the node tag name of the current element. If it is not the first element, the corresponding nodes are found, according to the node tag name of the current element, among the child nodes of all nodes that met the condition of the previous element. All nodes found are traversed to judge whether they meet the condition set by the current element; if they do, the next element of the array is obtained. Step 3) is repeated until the length of the array variable is 0. Finally, all nodes that meet the condition set by the last element are saved to an array variable nodes (nodes is the chosen name of the array variable);
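The loop in step 3) can be sketched as follows, assuming each array element has already been parsed into a hypothetical { tag, attr, value } object and that root is the sub-page's document (or any object offering getElementsByTagName). The function name findMatchingNodes is an illustrative choice. Note that getElementsByTagName returns all descendants with the given tag, so this sketch matches descendants at each level rather than direct children only.

```javascript
// steps: array of { tag, attr, value } objects, parent level first.
// root: an object exposing getElementsByTagName(), e.g. a document.
function findMatchingNodes(root, steps) {
  if (steps.length === 0) return [];
  let contexts = [root];            // nodes that satisfied the previous level
  const pending = steps.slice();    // working copy, consumed until its length is 0
  while (pending.length > 0) {
    const step = pending.shift();   // current element
    const matched = [];
    for (const context of contexts) {
      const candidates = Array.from(context.getElementsByTagName(step.tag));
      for (const node of candidates) {
        // A level without a condition accepts every node with the tag.
        const ok = step.attr === null || node.getAttribute(step.attr) === step.value;
        if (ok && !matched.includes(node)) matched.push(node);
      }
    }
    contexts = matched;             // parent set for the next level
  }
  return contexts;                  // the array the text calls "nodes"
}
```

If any level matches nothing, the following levels have no parents to search, so the function returns an empty array.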
Step 4) comprises the following process: all nodes in the variable nodes are traversed, and the innerHTML or outerHTML property value of each node is obtained and saved to a new array variable.
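Step 4) amounts to reading one property from each saved node. A minimal sketch, in which the function name collectHtml and the property parameter are assumptions for illustration:

```javascript
// nodes: array of matched nodes; property: "innerHTML" or "outerHTML".
function collectHtml(nodes, property = "innerHTML") {
  if (property !== "innerHTML" && property !== "outerHTML") {
    throw new Error("property must be innerHTML or outerHTML");
  }
  const contents = [];              // the new array variable
  for (const node of nodes) {
    contents.push(node[property]);
  }
  return contents;
}
```

In a browser, innerHTML gives the markup inside an element, while outerHTML also includes the element's own tag, so the choice depends on whether the captured block should keep its wrapper.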
Compared with the prior art, the invention has the following advantages:
(1) only the conditional expression needs to be modified to capture any content in a webpage; there is no need to write webpage-parsing code for each webpage;
(2) because the conditional expression is user-defined, it is flexible and can be written in XPath or any other expression syntax;
(3) because the system's JavaScript functions are used to parse the HTML page, the method avoids the omissions and errors that arise when regular-expression matching is used for parsing and the matching conditions are not fully thought through;
(4) it makes it much easier for ordinary users to capture Internet resources for their own use, which improves efficiency.
Description of drawings
Fig. 1 and Fig. 2 are detailed flowcharts of the invention.
Embodiment
A general method for capturing webpage content that can be used with any webpage, comprising the following steps:
1) the client inputs a target URL to be captured and a conditional expression, and a sub-page that displays the entire content of the webpage is generated on the client;
2) the client parses the conditional expression into an array variable of node tags and conditions;
3) the array is traversed, the qualifying nodes are found in the sub-page, and all nodes that meet the last condition are saved to an array variable;
4) the client obtains the innerHTML or outerHTML property value of every node in the saved array variable.
The specific steps of the method are described further below:
Referring to Fig. 1 and Fig. 2, the user enters a valid URL (for example: www.***.com) and a conditional expression (for example: /div[@class=list]) on the client and sends a request to the back-end server. The server analyzes the input URL string and checks whether the URL is valid (whether it has a form such as http://xxx or xxx.xxx). If the check passes, the method continues to the next step; otherwise the user is prompted to re-enter the URL.
The client generates the corresponding sub-page according to the URL and passes the URL to it. The sub-page independently sends a request to the server, then obtains and displays the entire content of the webpage at that URL, where the webpage content is one or more HTML elements in the webpage, and the JavaScript program code is embedded into the sub-page.
The client obtains the value of the conditional expression from the parent window and parses it into an array variable with a parent-child structure, in which each array element is the parent node of the element that follows it, and then parses each array element into an array holding a node tag name and a node condition. Specifically, suppose the conditional expression uses the "/" symbol as the separator between parent and child nodes, with the parent node before "/" and the child node after it. The conditional expression can then be converted into an array variable a with parent-child relationships (a is the chosen name of the array variable) by calling the system's JavaScript split() method, for example: "conditional expression".split("/"). The array variable a is traversed, and each of its elements is parsed into an array variable b (b is the chosen name of the array variable) consisting of a node tag name (either a standard tag name or a tag made up of user-defined special symbols) and a node matching condition.
The array variable b is traversed, and the JavaScript program of the sub-page uses the getElementsByTagName() method to find all node objects corresponding to the node tag. All of these node objects are traversed; the JavaScript program uses the getAttribute() method to obtain the attribute value of each node object and compares it with the node condition. If the matching condition is met, the next element of the array variable a is obtained. This loops until the length of the array variable a becomes 0, that is, until all nodes in the conditional expression have been searched, and all nodes matched by the last element of the array are saved to an array variable match (match is the chosen name of the array variable).
The client traverses the array variable match, takes out each element, and captures the corresponding webpage content by obtaining its innerHTML or outerHTML property value.

Claims (8)

1. the method for a general capturing webpage contents that can be used for any webpage may further comprise the steps:
1) a target network address and conditional expression to be grasped of client input is at the subpage frame of display web page all the elements of client generation;
2) client resolves to conditional expression the array variable of node label and condition
3) the traversal array is found out qualified node at subpage frame, and all nodes that will meet last condition are saved in an array variable;
4) client is obtained the innerHTML property value or the outerHTML property value of all nodes in the array variable of preservation.
2. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 1, it is characterized in that: wherein step 1) comprises following process: 1a) a client network address of input and a conditional expression, client judges with the javascript script whether the network address of input is legal, if check result is legal, continue next step, otherwise network address is re-entered in prompting; 1b) client generates corresponding subpage frame, and gives a network address to subpage frame, and subpage frame individual requests server, server obtain webpage all the elements of corresponding network address.
3. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 1 or 2 is characterized in that: the user can be with the regular send a letter here write condition expression formula that defines in xpath, css selector switch or the program.
4. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 1 or 2 is characterized in that: step 2 wherein) comprise following process: client resolves to conditional expression the array variable that comprises node label name and node condition;
5. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 2, it is characterized in that: increase by one section javascript program at subpage frame, this javascript program resolves to conditional expression the array variable that comprises node label and node condition of javascript program.
6. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 3, it is characterized in that: wherein step 3) comprises following process: 3a) client is obtained an element variable of array variable, and searches node at subpage frame according to the node label name of element variable; 3b) node and the imposing a condition of finding of this node correspondence compared,, then enter next step if eligible; 3c) repeating step 3a), till array length is 0; 3d) all nodes that impose a condition that will meet last element variable are saved in an array variable nodes (the array variable name of nodes for setting).
7. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 3, it is characterized in that: the javascript program uses the method for getElementsByTagName () to search node object according to node label, the javascript program uses getAttribute () method to obtain the property value of node, compares with the node condition again.
8. the method for a kind of general capturing webpage contents that can be used for any webpage as claimed in claim 4, it is characterized in that: wherein step 4) comprises following process: all nodes among the traversal variable nodes, and the innerHTML property value or the outerHTML property value that obtain node are saved in a new array variable.
CN201010002563A 2010-01-12 2010-01-12 Universal method for capturing webpage contents of any webpage Pending CN101763425A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201010002563A CN101763425A (en) 2010-01-12 2010-01-12 Universal method for capturing webpage contents of any webpage
PCT/CN2010/076100 WO2011085588A1 (en) 2010-01-12 2010-08-18 Webpage contents grabbing method which can be general adapted to any webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010002563A CN101763425A (en) 2010-01-12 2010-01-12 Universal method for capturing webpage contents of any webpage

Publications (1)

Publication Number Publication Date
CN101763425A true CN101763425A (en) 2010-06-30

Family

ID=42494589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010002563A Pending CN101763425A (en) 2010-01-12 2010-01-12 Universal method for capturing webpage contents of any webpage

Country Status (2)

Country Link
CN (1) CN101763425A (en)
WO (1) WO2011085588A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011085588A1 (en) * 2010-01-12 2011-07-21 苏州阔地网络科技有限公司 Webpage contents grabbing method which can be general adapted to any webpage
CN102591612A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof
CN103002061A (en) * 2011-09-16 2013-03-27 阿里巴巴集团控股有限公司 Method and device for mutual conversion of long domain names and short domain names
CN103139260A (en) * 2011-11-30 2013-06-05 国际商业机器公司 Method and system for reusing hypertext markup language (HTML) content
CN103164195A (en) * 2011-12-13 2013-06-19 阿里巴巴集团控股有限公司 Selector presenting method based on browser and device
CN103838747A (en) * 2012-11-22 2014-06-04 富士通株式会社 Network service construction method and device and webpage data extraction method and device
CN105677862A (en) * 2016-01-08 2016-06-15 上海数道信息科技有限公司 Method and device for grabbing webpage content
CN107463713A (en) * 2017-08-24 2017-12-12 四川长虹电器股份有限公司 The method of fast verification CSS selector
CN107729475A (en) * 2017-10-16 2018-02-23 深圳视界信息技术有限公司 Web page element acquisition method, device, terminal and computer-readable recording medium
CN109032917A (en) * 2017-06-09 2018-12-18 北京金山云网络技术有限公司 Page adjustment method and system, mobile terminal and computer end
CN109063110A (en) * 2018-07-28 2018-12-21 安徽捷兴信息安全技术有限公司 A kind of grasping means and device using application message in store
CN109508181A (en) * 2017-09-14 2019-03-22 韩真 A kind of method of efficient semantization front end selection subscheme
CN110276039A (en) * 2019-06-27 2019-09-24 北京金山安全软件有限公司 Page element path generation method and device and electronic equipment
CN110795647A (en) * 2019-10-29 2020-02-14 维沃移动通信有限公司 Website prompting method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019925B (en) * 2011-09-26 2015-02-18 阿里巴巴集团控股有限公司 Selector acquisition method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100559374C (en) * 2007-12-17 2009-11-11 杭州阔地网络科技有限公司 The intercepting of info web unit, the method that merges
CN101320370B (en) * 2008-05-16 2011-06-01 苏州普达新信息技术有限公司 Deep layer web page data source sort management method based on query interface connection drawing
CN101520796A (en) * 2009-02-16 2009-09-02 深圳市腾讯计算机***有限公司 Method and system for extracting uniform resource locators from web page content
CN101763425A (en) * 2010-01-12 2010-06-30 苏州阔地网络科技有限公司 Universal method for capturing webpage contents of any webpage

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011085588A1 (en) * 2010-01-12 2011-07-21 苏州阔地网络科技有限公司 Webpage contents grabbing method which can be general adapted to any webpage
CN103002061A (en) * 2011-09-16 2013-03-27 阿里巴巴集团控股有限公司 Method and device for mutual conversion of long domain names and short domain names
CN103002061B (en) * 2011-09-16 2015-06-24 阿里巴巴集团控股有限公司 Method and device for mutual conversion of long domain names and short domain names
US10318616B2 (en) 2011-11-30 2019-06-11 International Business Machines Corporation Method and system for reusing HTML content
CN103139260A (en) * 2011-11-30 2013-06-05 国际商业机器公司 Method and system for reusing hypertext markup language (HTML) content
CN103139260B (en) * 2011-11-30 2015-09-30 国际商业机器公司 For reusing the method and system of HTML content
US10678994B2 (en) 2011-11-30 2020-06-09 International Business Machines Corporation Method and system for reusing HTML content
US9507759B2 (en) 2011-11-30 2016-11-29 International Business Machines Corporation Method and system for reusing HTML content
CN103164195A (en) * 2011-12-13 2013-06-19 阿里巴巴集团控股有限公司 Selector presenting method based on browser and device
CN103164195B (en) * 2011-12-13 2017-06-23 阿里巴巴集团控股有限公司 Selector technique of expression and device based on browser
CN102591612A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof
CN102591612B (en) * 2011-12-27 2014-12-03 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof
CN103838747A (en) * 2012-11-22 2014-06-04 富士通株式会社 Network service construction method and device and webpage data extraction method and device
CN103838747B (en) * 2012-11-22 2017-07-07 富士通株式会社 Network service construction method and equipment and webpage data extracting method and equipment
CN105677862A (en) * 2016-01-08 2016-06-15 上海数道信息科技有限公司 Method and device for grabbing webpage content
CN109032917A (en) * 2017-06-09 2018-12-18 北京金山云网络技术有限公司 Page adjustment method and system, mobile terminal and computer end
CN109032917B (en) * 2017-06-09 2021-06-18 北京金山云网络技术有限公司 Page debugging method and system, mobile terminal and computer terminal
CN107463713A (en) * 2017-08-24 2017-12-12 四川长虹电器股份有限公司 The method of fast verification CSS selector
CN109508181A (en) * 2017-09-14 2019-03-22 韩真 A kind of method of efficient semantization front end selection subscheme
CN107729475A (en) * 2017-10-16 2018-02-23 深圳视界信息技术有限公司 Web page element acquisition method, device, terminal and computer-readable recording medium
CN109063110A (en) * 2018-07-28 2018-12-21 安徽捷兴信息安全技术有限公司 A kind of grasping means and device using application message in store
CN110276039A (en) * 2019-06-27 2019-09-24 北京金山安全软件有限公司 Page element path generation method and device and electronic equipment
CN110795647A (en) * 2019-10-29 2020-02-14 维沃移动通信有限公司 Website prompting method and device

Also Published As

Publication number Publication date
WO2011085588A1 (en) 2011-07-21

Similar Documents

Publication Publication Date Title
CN101763425A (en) Universal method for capturing webpage contents of any webpage
CN104125209B (en) Malice website prompt method and router
US10691507B2 (en) API learning
CN104410711A (en) Cross-domain network resource request method and device for client
US20120210243A1 (en) Web co-navigation
KR20190039230A (en) Method and system for server-side rendering of native content for presentations
CN112287273B (en) Method, system and storage medium for classifying website list pages
US11055373B2 (en) Method and apparatus for generating information
CN103810268B (en) Search result recommendation information loading method, device and system and URL detection method, device and system
CN103577427A (en) Browser kernel based web page crawling method and device and browser containing device
US20130232424A1 (en) User operation detection system and user operation detection method
US20210064453A1 (en) Automated application programming interface (api) specification construction
CN110602269B (en) Method for converting domain name
CN103593434A (en) Application recommendation method and device and server equipment
CN102831190B (en) A kind of method that CML files are browsed in low side devices
US20190132378A1 (en) Identifying an http resource using multi-variant http requests
CN103793508B (en) A kind of loading recommendation information, the methods, devices and systems of network address detection
US9058399B2 (en) System and method for providing network resource identifier shortening service to computing devices
JP5309121B2 (en) Information processing method, program, information processing system
CN102915318A (en) Method and device for positioning and searching information in browser
CN102314494A (en) Method and equipment for processing webpage contents
CN101894109A (en) Database building method and device
US8584007B2 (en) Information processing method, information processing apparatus, and program
CN103825772A (en) Method for identifying user click behavior and gateway equipment
CN104331512B (en) A kind of BBS pages automatic acquiring method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100630