CN111090797A - Data acquisition method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN111090797A
Authority
CN
China
Prior art keywords
webpage
target
path information
terminal
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911198993.7A
Other languages
Chinese (zh)
Other versions
CN111090797B (en)
Inventor
张冠龙
孙慧生
高勇
蒋旭曦
朱宏雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201911198993.7A
Publication of CN111090797A
Application granted
Publication of CN111090797B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application relates to a data acquisition method and device for webpage elements, computer equipment, and a storage medium. The method comprises the following steps: acquiring first webpage element path information of a first target webpage; when at least two webpage elements of the same type in the first target webpage are triggered, acquiring path information of the triggered first webpage elements; acquiring first similar path information with a similar path structure according to the path information of the triggered first webpage elements; and determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements. The method can acquire the webpage data of target elements in batches across webpages with different webpage structures.

Description

Data acquisition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of web page element processing technologies, and in particular, to a method and an apparatus for acquiring data of a web page element, a computer device, and a storage medium.
Background
With the prevalence of browsers, more and more web applications have emerged, and they contain a large amount of valuable web page data, for example, commodity list information of e-commerce websites, blog article lists, and trending microblog data. Different web pages have different web page structures, so how to acquire web page data in batches is a problem to be solved in web page data capture.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data acquisition method, an apparatus, a computer device and a storage medium for web page elements, which are capable of acquiring web page data of target elements in batches for web page structures of different web pages.
A data acquisition method for web page elements comprises the following steps: acquiring first webpage element path information of a first target webpage; when at least two webpage elements of the same type in a first target webpage are triggered, acquiring path information of the triggered first webpage elements; acquiring first similar path information with similar path structures according to the path information of the first webpage element; determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In one embodiment, obtaining first web page element path information of a first target web page includes: traversing the DOM tree structure of the first target webpage, and generating the first webpage element path information according to the traversal result.
In one embodiment, the method for acquiring data of a web page element further includes: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; and generating a mask layer for the first web page element according to the boundary value. In this case, acquiring the path information of the triggered first webpage element includes: acquiring the path information of the triggered first webpage element according to the mask layer.
In one embodiment, the method for acquiring data of a web page element further includes: acquiring page turning information in a first target webpage; acquiring a second target webpage according to the page turning information; acquiring second webpage element path information of a second target webpage; when at least two webpage elements of the same type in a second target webpage are triggered, acquiring path information of the triggered second webpage elements; acquiring second similar path information with a similar path structure according to the triggered path information of the second webpage element; and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In one embodiment, determining a plurality of first target elements in a first target webpage according to the first similar path information and the first webpage element path information includes: acquiring the same-level elements and parent-level elements of the triggered first webpage elements of the same type from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements of the same type, the same-level elements, and the parent-level elements as the first target elements.
In one embodiment, obtaining web page data for a plurality of first target elements includes: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating to extract data of preset parameters in webpage elements of the first target webpage; and acquiring webpage data of a plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters, and acquiring the webpage data of the plurality of first target elements according to the configuration information includes: acquiring the text data and/or the link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
An apparatus for data acquisition of web page elements, the apparatus comprising: a first acquisition module, configured to acquire first webpage element path information of a first target webpage; a second acquisition module, configured to acquire the path information of the triggered first webpage elements when at least two webpage elements of the same type in the first target webpage are triggered; a third acquisition module, configured to acquire first similar path information with a similar path structure according to the path information of the first webpage elements; and a fourth acquisition module, configured to determine, according to the first similar path information and the first webpage element path information, a plurality of first target elements in the first target webpage, and to acquire webpage data of the plurality of first target elements.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the above embodiments when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above embodiments.
With the above method, device, computer equipment, and storage medium for acquiring the data of webpage elements, the first webpage element path information of the first target webpage is acquired. When at least two webpage elements of the same type in the first target webpage are triggered, the path information of the triggered first webpage elements is acquired, and first similar path information with a similar path structure is acquired according to that path information. A plurality of first target elements in the first target webpage are then determined according to the first similar path information and the first webpage element path information, and finally the webpage data of the plurality of first target elements is obtained. Therefore, for webpages with different webpage structures, the method can acquire the webpage data of elements of the same type in a webpage in batches, based on the webpage element path information of the webpage and the path information of the same-type elements in the webpage.
Drawings
FIG. 1 is a diagram of an application environment for a method for data retrieval of web page elements, according to an embodiment;
FIG. 2 is a flowchart illustrating a method for obtaining data of web page elements according to an embodiment;
FIG. 3 is a flowchart illustrating a data retrieving method for a web page element according to another embodiment;
FIG. 4 is a schematic interface diagram of an RPA designer in one embodiment;
FIG. 5 is a schematic interface diagram of a web interface corresponding to FIG. 4;
FIG. 6 is a schematic representation of an interface of an RPA designer in another embodiment;
FIG. 7 is a schematic interface diagram of a web interface corresponding to FIG. 6;
FIG. 8 is a schematic representation of an interface of an RPA designer in yet another embodiment;
FIG. 9 is a schematic diagram of an interface for a target web page in one embodiment;
FIG. 10 is a schematic diagram of an interface of a target web page in another embodiment;
FIG. 11 is a block diagram of an apparatus for data retrieval of web page elements, according to one embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data acquisition method of the webpage elements is applied to the application environment shown in FIG. 1. The server 110 is used for implementing the data acquisition method of a web page element of the present application. The server 110 may be a computer device supporting an RPA (Robotic Process Automation) designer. The server 110 is communicatively coupled to the terminal device 120. The terminal device 120 is a user terminal device used by a consumer of web page data information, and may present web pages with different web page structures. When the terminal device 120 displays the first target webpage 121, the server 110 obtains the first webpage element path information of the first target webpage 121. When a user triggers at least two web page elements of the same type in the first target web page 121, for example, two titles of the same title type, the server 110 obtains the path information of the triggered first web page elements, obtains first similar path information with a similar path structure according to that path information, determines a plurality of first target elements in the first target web page according to the first similar path information and the first webpage element path information, and finally obtains the web page data of the plurality of first target elements. The server 110 is also communicatively connected to the terminal device 130. The terminal device 130 is a terminal device used by a developer who performs data processing on web page data information. The developer uses the terminal device 130 to perform corresponding operations on the server 110. The web page data of the plurality of first target elements obtained by the server 110 may be displayed in the display interface 131 of the terminal device 130 for the developer to preview.
The server 110 may be implemented as a server cluster composed of a plurality of servers, and the terminal device 120 may be a notebook, a desktop, other mobile devices, and the like.
In one embodiment, as shown in fig. 2, a data obtaining method for a web page element is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
S101, acquiring first webpage element path information of a first target webpage.
In this embodiment, when the terminal device opens the first target webpage, the server obtains first webpage element path information of the first target webpage in the terminal device. The obtaining manner may be that the server parses a DOM (Document Object Model) tree structure of the first target web page, and obtains the first web page element path information of the first target web page according to the DOM tree structure of the first target web page. The first web page element path information is used to identify path information for all web page elements in the first target web page.
In one embodiment, step S101 includes: traversing the DOM tree structure of the first target webpage, and generating the first webpage element path information according to the traversal result.
In this embodiment, the server generates the first web page element path information of the first target web page by traversing the DOM tree structure of the first target web page. Specifically, when a user at the terminal clicks a target element A and a target element B in the first target webpage with the mouse, the server traverses the DOM tree of the entire first target webpage upward, layer by layer, to generate first webpage element path information that can uniquely identify the first target webpage. For example, the first webpage element path information is html->body->div->table.
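The traversal described above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the Node and TreeBuilder classes and the element_paths function are hypothetical names, the markup is assumed to be well formed with every element closed, and the choice to add an index such as tr[1] only when a parent has several children with the same tag is an assumption made to match the path examples in the text.

```python
from html.parser import HTMLParser


class Node:
    """Minimal DOM-like node (hypothetical helper for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def element_paths(node, prefix=()):
    """Return one path string per element, in document order."""
    paths = []
    for child in node.children:
        same_tag = [c for c in node.children if c.tag == child.tag]
        segment = child.tag
        if len(same_tag) > 1:
            # 1-based position among siblings that share the same tag
            index = next(i for i, c in enumerate(same_tag, 1) if c is child)
            segment = f"{child.tag}[{index}]"
        path = prefix + (segment,)
        paths.append("->".join(path))
        paths.extend(element_paths(child, path))
    return paths


builder = TreeBuilder()
builder.feed("<html><body><div><table><tr></tr><tr></tr></table></div></body></html>")
builder.close()
paths = element_paths(builder.root)
# The last two entries are html->body->div->table->tr[1] and ...->tr[2]
```

The list of paths plays the role of the first webpage element path information: one entry per element, from which later steps can look up matching elements.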
S103, when at least two webpage elements of the same type in the first target webpage are triggered, acquiring the path information of the triggered first webpage element.
In this embodiment, when at least two web page elements of the same type in the first target web page are triggered, the server acquires the path information of the triggered first web page elements. The at least two web page elements of the same type may be triggered sequentially or simultaneously; the server only needs to detect that at least two web page elements of the same type are in the triggered state. Web page elements of the same type are web page elements identified as the same type in the first target web page, and there are multiple triggered first webpage elements. A web page element may be triggered manually in the first target web page on the terminal device. Alternatively, after the server reads the first target webpage, a developer may trigger web page elements of the first target webpage on the server through a terminal device in communication connection with the server. In addition, the server can directly read the path information of all the webpage elements in the first target webpage from the terminal device displaying the first target webpage, so that when a first webpage element is triggered, the server can directly read the path information of that element.
S105, acquiring first similar path information with similar path structure according to the path information of the first webpage element.
In this embodiment, there are multiple triggered first web page elements, and the server obtains the first similar path information according to their path information. The first similar path information comprises the path information with a similar path structure among the path information of the plurality of first webpage elements. For example, the path information of the first web page element A is html->body->div->table->tr[1], and the path information of the first webpage element B is html->body->div->table->tr[2]. In this case, the first similar path information includes html->body->div->table->tr.
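One way to derive a similar path from two triggered paths is sketched below in Python. The text does not specify the comparison rule, so this sketch assumes that two paths are similar when they have the same tags at every level and differ only in sibling indexes; the function names split_segment and similar_path are hypothetical.

```python
import re


def split_segment(segment):
    """Split 'tr[2]' into ('tr', 2); a segment without an index gives (tag, None)."""
    match = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", segment)
    if match is None:
        raise ValueError(f"unrecognised path segment: {segment!r}")
    tag, index = match.groups()
    return tag, int(index) if index is not None else None


def similar_path(path_a, path_b, sep="->"):
    """Return the shared path structure, or None if the structures differ.

    Assumption of this sketch: an index is dropped wherever the two paths
    disagree on it, and kept where they agree.
    """
    segs_a = path_a.split(sep)
    segs_b = path_b.split(sep)
    if len(segs_a) != len(segs_b):
        return None
    result = []
    for a, b in zip(segs_a, segs_b):
        tag_a, idx_a = split_segment(a)
        tag_b, idx_b = split_segment(b)
        if tag_a != tag_b:
            return None
        result.append(a if idx_a == idx_b else tag_a)
    return sep.join(result)


shared = similar_path(
    "html->body->div->table->tr[1]",
    "html->body->div->table->tr[2]",
)
# shared == "html->body->div->table->tr"
```

Returning None for paths with different tags or depths gives the caller a simple way to reject a pair of clicks that do not point at elements of the same type.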
S107, determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In this embodiment, the server determines a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information. One way to do so is to acquire, from the first webpage element path information, all path information that matches the first similar path information, and then take the webpage elements corresponding to all the matched path information as the first target elements. For example, the first similar path information includes html->body->div->table->tr. In this case, the webpage elements corresponding to all path information in the first webpage element path information whose path structure has the prefix html->body->div->table->tr are taken as the first target elements. Finally, the webpage data of all the first target elements is acquired, thereby acquiring in batches the webpage data of all elements with similar paths in the first target webpage.
In a specific implementation process, the first target webpage is a webpage list. The target element A and the target element B under the webpage list are clicked with the mouse. The target element A and the target element B are the triggered first webpage elements of the same type. The server traverses the DOM tree of the whole webpage list upward, layer by layer, and generates first element path information that can uniquely identify the webpage list: html->body->div->table. The path information of the target element A is html->body->div->table->tr[1], and the path information of the target element B is html->body->div->table->tr[2]. Therefore, the obtained first similar path information is html->body->div->table->tr. According to the first element path information of the web page list and the first similar path information, all elements with similar paths under the web page list, that is, the plurality of first target elements, can be retrieved. Finally, the webpage data of the plurality of first target elements is acquired, thereby acquiring in batches the webpage data of the webpage elements with similar paths in the webpage list.
With the above data acquisition method of the webpage elements, the first webpage element path information of the first target webpage is acquired. When at least two webpage elements of the same type in the first target webpage are triggered, the path information of the triggered first webpage elements is acquired, and first similar path information with a similar path structure is acquired according to that path information. A plurality of first target elements in the first target webpage are then determined according to the first similar path information and the first webpage element path information, and finally the webpage data of the plurality of first target elements is obtained. Therefore, for webpages with different webpage structures, the method can acquire the webpage data of elements of the same type in a webpage in batches, based on the webpage element path information of the webpage and the path information of the same-type elements in the webpage. In addition, because the method obtains the webpage data in batches by analyzing the webpage element path information of the webpage, it is suitable for obtaining webpage data from any webpage structure.
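The retrieval step in this example can be sketched in Python as a filter over the page's path list. This is an illustration under assumptions, not the method of the application: matches_similar and find_targets are hypothetical names, the path list is made up for the example, and the sketch matches only elements at the same depth as the similar path, whereas a prefix-based reading would also include their descendants.

```python
import re


def matches_similar(path, similar, sep="->"):
    """True when `path` has the structure of `similar`.

    A segment of `similar` without an index (e.g. 'tr') accepts the same
    tag with any index (e.g. 'tr[3]'); all other segments must be equal.
    """
    segs = path.split(sep)
    pattern = similar.split(sep)
    if len(segs) != len(pattern):
        return False
    for seg, pat in zip(segs, pattern):
        if seg == pat:
            continue
        if "[" in pat or re.fullmatch(re.escape(pat) + r"\[\d+\]", seg) is None:
            return False
    return True


def find_targets(all_paths, similar):
    """Select every path in the page that matches the similar path."""
    return [p for p in all_paths if matches_similar(p, similar)]


# Illustrative path list for a page with a three-row table and a paragraph.
all_paths = [
    "html",
    "html->body",
    "html->body->div",
    "html->body->div->table",
    "html->body->div->table->tr[1]",
    "html->body->div->table->tr[2]",
    "html->body->div->table->tr[3]",
    "html->body->div->p",
]
targets = find_targets(all_paths, "html->body->div->table->tr")
# targets holds the three tr paths, including tr[3], which was never clicked
```

The point of the example is the third row: clicking two rows is enough to select every row that shares their path structure.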
In one embodiment, before step S103, the method further includes the steps of: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; and generating a mask layer for the first web page element based on the boundary value. In this case, step S103 includes: acquiring the path information of the triggered first webpage element according to the mask layer.
In this embodiment, the positions of the web page elements in the first target web page are identified by means of a coordinate system. When a first webpage element in the first target webpage is triggered, the coordinate value of the first webpage element in the first target webpage is obtained, and the boundary value of the first webpage element is obtained from the coordinate value. A mask layer for the first web page element is then generated according to the boundary value, so that no link jump occurs when content carrying a jump link attribute in the first web page element is triggered. The server then acquires the path information of the triggered first webpage element according to the mask layer: when the server identifies the mask layer of the first web page element, it obtains the path information of that element. Thus, even if the first webpage element contains content with a jump link attribute, its path information can still be acquired when the element is triggered, which avoids the situation where the path information cannot be acquired because triggering the element causes a jump.
In a specific implementation process, when the mouse slides over a webpage element of the first target webpage, namely a triggered first webpage element, the boundary value of the rectangular frame of the first webpage element is obtained from the (x, y) coordinate value of the element on the current first target webpage (the coordinates of the first target webpage are two-dimensional coordinates expressed on an x axis and a y axis). The mask layer for the first web page element is generated from the boundary value and drawn as a frame around the element, which ensures that target content with an href attribute (the attribute that specifies the uniform resource locator (URL) of the hyperlink target) does not jump on click while the mouse slides over the first web page element to capture the target content.
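The boundary and mask computation can be sketched in Python as plain geometry. The text does not say how the boundary value is derived from the coordinate value, so this sketch assumes the element's top-left corner (x, y) plus a width and height are known; the Mask class and the build_mask and handle_click functions are hypothetical names, and a real implementation would draw the mask in the browser rather than model it as data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mask:
    """Rectangular mask over one element, tagged with that element's path."""

    left: float
    top: float
    right: float
    bottom: float
    path: str

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def build_mask(x, y, width, height, path):
    """Derive the boundary values from the top-left corner and the size (assumed inputs)."""
    return Mask(left=x, top=y, right=x + width, bottom=y + height, path=path)


def handle_click(masks, x, y):
    """Resolve a click to an element path instead of following any link.

    Later masks are treated as being on top, so they are checked first.
    """
    for mask in reversed(masks):
        if mask.contains(x, y):
            return mask.path
    return None


row_mask = build_mask(10, 20, 100, 30, "html->body->div->table->tr[1]")
hit = handle_click([row_mask], 50, 30)    # inside the rectangle: path is returned
miss = handle_click([row_mask], 0, 0)     # outside the rectangle: None
```

Because the click is resolved against the mask and never reaches the underlying element, an href inside the element is not followed, which is the behaviour the paragraph above describes.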
In one embodiment, as shown in fig. 3, after step S107, the method further includes the steps of:
S109, acquiring page turning information in the first target webpage.
S111, acquiring a second target webpage according to the page turning information.
S113, acquiring second webpage element path information of the second target webpage.
S115, when at least two web page elements of the same type in the second target web page are triggered, acquiring the path information of the triggered second web page elements.
S117, acquiring second similar path information with a similar path structure according to the path information of the triggered second webpage elements.
S119, determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In this embodiment, the first target web page includes page turning information. The page turning information is used to instruct the web page to jump from the current web page to another web page. Wherein, the page turning information may be jump link indication information. The server acquires a second target webpage according to the page turning information in the first target webpage, and further executes operations similar to the steps S101 to S107 for the second target webpage to acquire webpage data of a target element in the second target webpage. Specifically, when at least two web page elements of the same type in the second target web page are triggered, the server acquires the path information of the triggered second web page element, and acquires second similar path information with similar path structures according to the path information of the triggered second web page element. And finally, determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
For example, web page data of a target element in a first target web page is obtained through one or more entry addresses. Taking the article list address of the first web page, https://www.cnblogs.com/#p1, as an example, the web page data of the target elements in the first web page is obtained; specifically, the capture of the web page data of the article list of the first web page is completed according to the operations of step S101 to step S107. The next-level webpage, namely the second target webpage, is then entered according to the page turning information of the entry webpage, namely the page turning information of the first webpage, such as the link https://www.cnblogs.com/#p2. The operations of step S109 to step S119 are performed to capture the web page data of the target elements in the second target web page. This cycle repeats until all page turning information has been processed.
In one embodiment, step S107 includes: acquiring the same-level elements and parent-level elements of the triggered first webpage elements of the same type from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements of the same type, the same-level elements, and the parent-level elements as the first target elements.
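The page turning loop in this example can be sketched in Python as follows. Everything except the two URLs is an assumption of the sketch: crawl_pages, load_page, and extract are hypothetical names, pages are modelled as in-memory dictionaries with a "next" key standing in for the page turning information, the item values are made-up placeholders, and the seen set and max_pages limit are safety guards that the text does not mention.

```python
def crawl_pages(entry_url, load_page, extract, max_pages=100):
    """Follow page turning information from the entry page until none is left.

    load_page(url) returns a page object; extract(page) returns the data
    captured from that page. Both are supplied by the caller.
    """
    results = []
    seen = set()
    url = entry_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = load_page(url)
        results.extend(extract(page))
        url = page.get("next")  # page turning information; None on the last page
    return results


# In-memory stand-in for real pages; the item values are placeholders.
PAGES = {
    "https://www.cnblogs.com/#p1": {
        "items": ["article 1", "article 2"],
        "next": "https://www.cnblogs.com/#p2",
    },
    "https://www.cnblogs.com/#p2": {
        "items": ["article 3"],
        "next": None,
    },
}

results = crawl_pages(
    "https://www.cnblogs.com/#p1",
    PAGES.__getitem__,
    lambda page: page["items"],
)
# results == ["article 1", "article 2", "article 3"]
```

Tracking visited addresses keeps the loop from running forever if a page's page turning information points back to a page that was already processed.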
In this embodiment, the first similar path information is path information with a similar path structure determined according to the triggered path information of the first web page elements of the same type. The first web page element path information is a set of path information for web page elements of the first target web page. According to the first similar path information and the first webpage element path information, the same-level element and the parent-level element of the triggered first webpage element of the same type can be determined. Specifically, all path information matched with the similar path information is acquired from the first webpage element path information, and the same-level element and the parent-level element of the triggered first webpage element of the same type are acquired from the first target webpage according to the all path information.
In one embodiment, step S107 includes: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating to extract data of preset parameters in webpage elements of the first target webpage; and acquiring webpage data of a plurality of first target elements according to the configuration information.
In this embodiment, the web page elements of the first target web page include data corresponding to a plurality of parameters. In a specific implementation process, the acquired target element is often an html element, and there is web page data corresponding to attributes such as title attribute information, href attribute information, class attribute information, and the like, so that the finally acquired data can be configured in advance. In this embodiment, the server obtains the web page data of the preset parameter in the plurality of first target elements according to the configuration information of the first target web page.
In one embodiment, the preset parameters include text parameters and/or link parameters. Acquiring the webpage data of the plurality of first target elements according to the configuration information includes: acquiring the text data and/or the link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
In this embodiment, the preset parameters include a text parameter and/or a link parameter. The configuration information is used for indicating webpage data for extracting the text parameter and/or the link parameter in the webpage element of the first target webpage. And the server extracts the text data and/or the link data from the plurality of first target elements according to the indication parameters in the configuration information.
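Configuration-driven extraction of text and link data can be sketched in Python as below. The shape of the configuration and of the elements is an assumption of the sketch: extract_fields is a hypothetical name, the configuration is modelled as a dictionary with boolean "text" and "link" keys, each element is modelled as a dictionary with "text" and "attrs" entries, and the sample values are placeholders.

```python
def extract_fields(elements, config):
    """Extract only the parameters that the configuration asks for.

    The link is taken from the href attribute and is skipped for elements
    that do not have one.
    """
    rows = []
    for element in elements:
        row = {}
        if config.get("text"):
            row["text"] = element.get("text", "")
        if config.get("link") and "href" in element.get("attrs", {}):
            row["link"] = element["attrs"]["href"]
        rows.append(row)
    return rows


# Placeholder target elements: one with a link, one without.
elements = [
    {"text": "First title", "attrs": {"href": "/post/1", "class": "title"}},
    {"text": "Second title", "attrs": {"class": "title"}},
]

rows = extract_fields(elements, {"text": True, "link": True})
# rows[0] has both "text" and "link"; rows[1] has only "text"
```

Keeping the selection in a configuration object means the same target elements can be re-extracted with different parameters without repeating the path matching.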
For the data acquisition method of the web page element, a specific implementation scenario is given below to further detail the data acquisition method of the web page element:
A server that implements the data acquisition method for web page elements of the present application is a computer device that supports the operation of an RPA (Robotic Process Automation) designer. The RPA designer can therefore use the data acquisition method for web page elements to extract different types of web page data from different web pages. For example, for the commodity list of a conventional e-commerce website, related information such as the name, price, description, reviews, or sales volume of each commodity can be extracted in one pass. Specifically, as shown in fig. 4, the display interface of the RPA designer prompts the developer to select data table 1 in the target web page. As shown in fig. 5, after opening the target web page, the developer clicks with the mouse to select the first title. After the RPA designer reads the path information and web page data of the first title, as shown in fig. 6, the display interface of the RPA designer prompts the developer to select data table 2 in the target web page. As shown in fig. 7, the developer returns to the target web page and clicks with the mouse to select the second title, and the RPA designer reads the path information and web page data of the second title. The first title and the second title are the triggered first web page elements of the same type in the target web page. Having determined that the first title and the second title are web page elements of the same type, the RPA designer executes the data acquisition method for web page elements to acquire the web page data of all title-class elements in the target web page that have similar path information. In addition, the RPA designer may provide configuration options that let the developer choose which parameters to extract from the web page elements. As shown in fig. 8, the developer can check the parameters to be extracted, such as the text parameter and the link parameter.
The RPA designer then extracts the web page data in the target elements according to the parameters selected by the developer. For example, the developer may configure the extraction to capture the text of each target element and, if the element has an href attribute, to optionally capture the link as well.
Further, if the developer wants to crawl more types of web page data, the developer can click to continue selecting. The selection of web page elements in the target web page is illustrated in figs. 9 and 10. For example, after the titles of the commodities are captured in the first pass, the prices of the commodities can be captured next.
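As an illustration of how successive selections might be combined, the following Python sketch pairs a captured title category with a captured price category by position. The function name `combine_columns` and the positional pairing are assumptions made for this example; the application does not specify how the captured categories are aligned.

```python
from itertools import zip_longest
from typing import Dict, List, Optional


def combine_columns(columns: Dict[str, List[str]]) -> List[Dict[str, Optional[str]]]:
    """Pair the n-th value of every captured category into one record.

    Hypothetical helper: categories shorter than the longest one are
    padded with None so that no captured value is dropped.
    """
    names = list(columns)
    return [
        dict(zip(names, values))
        for values in zip_longest(*columns.values(), fillvalue=None)
    ]


if __name__ == "__main__":
    captured = {
        "title": ["Phone", "Case"],   # first selection pass
        "price": ["1999.00"],         # second selection pass
    }
    for record in combine_columns(captured):
        print(record)
```

Padding with `None` rather than truncating is a design choice of the sketch: it keeps a title visible even when its price was not captured.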
In summary, the RPA designer provides a visual way to capture web page data and to filter the captured results, so that users (such as developers) can capture data more conveniently and efficiently. In addition, compared with conventional crawlers, which capture web page data with different regular-expression matching rules for different web pages, the data acquisition method for web page elements used by the RPA designer has a wider range of application, and makes the flow of web page data through the RPA process simple and efficient.
The present application further provides a data acquiring apparatus for web page elements, as shown in fig. 11, the apparatus includes a first acquiring module 10, a second acquiring module 20, a third acquiring module 30, and a fourth acquiring module 40.
The first obtaining module 10 is configured to obtain first webpage element path information of a first target webpage.
The second obtaining module 20 is configured to obtain path information of the triggered first web page element when at least two web page elements of the same type in the first target web page are triggered.
The third obtaining module 30 is configured to obtain first similar path information with a similar path structure according to the path information of the first webpage element.
The fourth obtaining module 40 is configured to determine, according to the first similar path information and the first webpage element path information, a plurality of first target elements in the first target webpage, and obtain webpage data of the plurality of first target elements.
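The cooperation of the third and fourth obtaining modules can be sketched in Python as follows. The sketch assumes that path information takes an XPath-like form such as `/html/body/ul/li[2]/a`, and that a "similar path structure" means equal tag sequences that differ only in positional indexes. Both are assumptions for illustration, as are all function names.

```python
import re
from typing import List, Optional, Tuple, Union

Step = Tuple[str, Optional[int]]                   # (tag, positional index or None)
PatternStep = Tuple[str, Union[int, str, None]]    # index may also be WILDCARD
WILDCARD = "*"


def split_path(path: str) -> List[Step]:
    """Split '/html/body/ul/li[2]/a' into (tag, index) steps."""
    steps: List[Step] = []
    for part in path.strip("/").split("/"):
        m = re.fullmatch(r"([A-Za-z][A-Za-z0-9]*)(?:\[(\d+)\])?", part)
        if m is None:
            raise ValueError(f"unsupported path step: {part!r}")
        steps.append((m.group(1), int(m.group(2)) if m.group(2) else None))
    return steps


def similar_path(path_a: str, path_b: str) -> Optional[List[PatternStep]]:
    """Generalize two triggered paths; None if their tag sequences differ."""
    a, b = split_path(path_a), split_path(path_b)
    if len(a) != len(b):
        return None
    pattern: List[PatternStep] = []
    for (tag_a, idx_a), (tag_b, idx_b) in zip(a, b):
        if tag_a != tag_b:
            return None
        # Keep the index where both paths agree, otherwise accept any index.
        pattern.append((tag_a, idx_a if idx_a == idx_b else WILDCARD))
    return pattern


def matches(pattern: List[PatternStep], path: str) -> bool:
    """Check whether a page path fits the generalized pattern."""
    steps = split_path(path)
    if len(steps) != len(pattern):
        return False
    return all(
        tag == p_tag and (p_idx == WILDCARD or p_idx == idx)
        for (tag, idx), (p_tag, p_idx) in zip(steps, pattern)
    )


if __name__ == "__main__":
    page_paths = [
        "/html/body/ul/li[1]/a",
        "/html/body/ul/li[2]/a",
        "/html/body/ul/li[3]/a",
        "/html/body/ul/li[1]/span",
    ]
    pattern = similar_path(page_paths[0], page_paths[1])
    print([p for p in page_paths if matches(pattern, p)])
```

Requiring two triggered elements is what lets the sketch decide which index to generalize: a single path would not reveal which step varies across the list.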
In one embodiment, the first obtaining module 10 may include (not shown in fig. 11):
a first generation unit, configured to traverse the DOM tree structure of the first target webpage and generate the first webpage element path information according to the traversal result.
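The traversal performed by the first generation unit can be illustrated with a minimal Python sketch that walks an HTML document and emits one indexed path per element. It uses the standard-library `html.parser` module in place of a browser DOM, and the path format (`/html[1]/body[1]/...`) is an assumption, since the application does not fix a representation for path information.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Elements that never have a closing tag and therefore never have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collect an indexed path for every element start tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []                 # steps of the open elements
        self.counters: List[Dict[str, int]] = [{}] # child-tag counts per open element
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        seen = self.counters[-1]
        seen[tag] = seen.get(tag, 0) + 1
        step = f"{tag}[{seen[tag]}]"
        self.paths.append("/" + "/".join(self.stack + [step]))
        if tag not in VOID_TAGS:
            self.stack.append(step)
            self.counters.append({})

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("[")[0] == tag:
                del self.stack[i:]
                del self.counters[i + 1:]
                break


def element_paths(html: str) -> List[str]:
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


if __name__ == "__main__":
    doc = "<html><body><ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul></body></html>"
    for path in element_paths(doc):
        print(path)
```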
In one embodiment, the data acquisition apparatus for web page elements further includes (not shown in fig. 11):
a second generating unit, configured to acquire a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage, and to generate a mask layer for the first webpage element according to the boundary value.
The second obtaining module 20 further includes:
a path acquisition unit, configured to acquire the path information of the triggered first webpage element according to the mask layer.
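One possible reading of the mask layer mechanism is sketched below in Python: each element's coordinates yield a bounding rectangle (its boundary value), and a mouse trigger is resolved to the path of the mask that contains the click point. The `Mask` class, the helper names, and the rule of preferring the smallest containing mask (so that a nested element wins over its container) are assumptions of this sketch, not details stated in the application.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Mask:
    """Rectangular overlay covering one web page element (hypothetical type)."""
    path: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    @property
    def area(self) -> float:
        return (self.right - self.left) * (self.bottom - self.top)


def build_mask(path: str, x: float, y: float, width: float, height: float) -> Mask:
    """Derive the boundary values from the element's position and size."""
    return Mask(path, x, y, x + width, y + height)


def path_at(masks: Iterable[Mask], x: float, y: float) -> Optional[str]:
    """Return the path of the innermost mask under the click, if any."""
    hits = [m for m in masks if m.contains(x, y)]
    if not hits:
        return None
    # Assumption: the smallest containing rectangle is the intended element.
    return min(hits, key=lambda m: m.area).path


if __name__ == "__main__":
    masks = [
        build_mask("/html[1]/body[1]/ul[1]", 0, 0, 200, 100),
        build_mask("/html[1]/body[1]/ul[1]/li[1]/a[1]", 10, 10, 100, 20),
    ]
    print(path_at(masks, 15, 15))
```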
In one embodiment, the data acquisition apparatus for web page elements further includes (not shown in fig. 11):
a fourth acquisition module, configured to acquire page turning information in the first target webpage;
a fifth acquisition module, configured to acquire the second target webpage according to the page turning information;
a sixth acquisition module, configured to acquire second webpage element path information of the second target webpage;
a seventh obtaining module, configured to acquire path information of the triggered second webpage elements when at least two webpage elements of the same type in the second target webpage are triggered;
an eighth obtaining module, configured to acquire second similar path information with a similar path structure according to the path information of the triggered second webpage elements; and
a ninth obtaining module, configured to determine, according to the second similar path information and the second webpage element path information, a plurality of second target elements in the second target webpage, and to acquire webpage data of the plurality of second target elements.
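The page turning flow can be sketched in Python as a loop that follows the page turning information from one target webpage to the next. In this sketch a page is modeled as a dictionary with hypothetical `items` and `next` keys, `fetch` and `extract` are caller-supplied stand-ins for page loading and element extraction, and the same extraction routine is simply reapplied to every page; all of these are simplifying assumptions.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]  # hypothetical shape: {"items": [...], "next": url or None}


def crawl_pages(
    start_url: str,
    fetch: Callable[[str], Page],
    extract: Callable[[Page], List[Any]],
    max_pages: int = 50,
) -> List[Any]:
    """Collect data from the first target page and every page it links to."""
    results: List[Any] = []
    visited = set()
    url: Optional[str] = start_url
    # Stop on a missing next link, a repeated page, or the page limit.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        results.extend(extract(page))
        url = page.get("next")  # page turning information
    return results


if __name__ == "__main__":
    site = {
        "list?page=1": {"items": ["Phone", "Case"], "next": "list?page=2"},
        "list?page=2": {"items": ["Charger"], "next": None},
    }
    print(crawl_pages("list?page=1", site.__getitem__, lambda p: p["items"]))
```

The `visited` set and `max_pages` limit guard against page turning links that loop back to an earlier page.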
In one embodiment, the fourth obtaining module 40 further includes (not shown in fig. 11):
an element acquisition unit, configured to acquire, from the first target webpage according to the first similar path information and the first webpage element path information, the same-level elements and parent-level elements of the triggered first webpage elements of the same type, and to take the triggered first webpage elements of the same type, the same-level elements, and the parent-level elements as the first target elements.
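How same-level and parent-level elements might be derived from path information is sketched below in Python. The sketch assumes slash-separated indexed paths such as `/ul[1]/li[2]`, and the helper names are hypothetical.

```python
from typing import List, Optional


def parent_path(path: str) -> Optional[str]:
    """Return the path of the parent element, or None for a root element."""
    head, _, _ = path.rpartition("/")
    return head or None


def same_level_paths(all_paths: List[str], selected: str) -> List[str]:
    """Return the paths that share the selected element's parent."""
    parent = parent_path(selected)
    return [p for p in all_paths if p != selected and parent_path(p) == parent]


def expand_targets(all_paths: List[str], selected_paths: List[str]) -> List[str]:
    """Triggered elements plus their same-level and parent-level elements."""
    targets: List[str] = []
    for selected in selected_paths:
        candidates = [selected, *same_level_paths(all_paths, selected),
                      parent_path(selected)]
        for path in candidates:
            if path is not None and path not in targets:
                targets.append(path)   # keep first-seen order, no duplicates
    return targets


if __name__ == "__main__":
    page_paths = ["/ul[1]", "/ul[1]/li[1]", "/ul[1]/li[2]",
                  "/ul[1]/li[3]", "/ul[1]/li[1]/a[1]"]
    print(expand_targets(page_paths, ["/ul[1]/li[1]", "/ul[1]/li[2]"]))
```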
In one embodiment, the fourth obtaining module 40 further includes (not shown in fig. 11):
a data acquisition unit, configured to acquire configuration information of the first target webpage, the configuration information indicating that data of preset parameters in the webpage elements of the first target webpage is to be extracted, and to acquire webpage data of the plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters. The data acquisition unit further comprises (not shown in fig. 11):
a data acquisition subunit, configured to acquire the text data and/or the link data in the plurality of first target elements according to the configuration information, wherein the webpage data includes the text data and/or the link data.
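Configuration-driven extraction of text and link data can be sketched in Python as follows. The `Element` and `ExtractConfig` types are hypothetical stand-ins for a target element and for the configuration information; following the scenario described earlier, the link is read from the `href` attribute and is captured only when the element has one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """Minimal stand-in for a target web page element."""
    text: str
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExtractConfig:
    """Which preset parameters the developer checked."""
    text: bool = True
    link: bool = False


def extract(elements: List[Element], config: ExtractConfig) -> List[Dict[str, str]]:
    """Build one record per target element from the selected parameters."""
    rows: List[Dict[str, str]] = []
    for element in elements:
        row: Dict[str, str] = {}
        if config.text:
            row["text"] = element.text.strip()
        # A link is captured only if the element actually has an href attribute.
        if config.link and "href" in element.attrs:
            row["link"] = element.attrs["href"]
        rows.append(row)
    return rows


if __name__ == "__main__":
    targets = [Element(" Phone ", {"href": "/p/1"}), Element("Case")]
    print(extract(targets, ExtractConfig(text=True, link=True)))
```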
For specific limitations of the data acquisition device for web page elements, reference may be made to the above limitations of the data acquisition method for web page elements, and details are not repeated here. The modules in the data acquisition device for web page elements can be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server supporting the operation of an RPA designer, and whose internal structure may be as shown in fig. 12. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for connecting to an external terminal in order to read information such as web pages, web page elements, and web page data on the terminal. The computer program is executed by the processor to implement the data acquisition method for web page elements.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring first webpage element path information of a first target webpage; when at least two webpage elements of the same type in a first target webpage are triggered, acquiring path information of the triggered first webpage elements; acquiring first similar path information with similar path structures according to the path information of the first webpage element; determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In one embodiment, when the processor executes the computer program to implement the step of acquiring the first webpage element path information of the first target webpage, the following step is specifically implemented: traversing the DOM tree structure of the first target webpage, and generating the first webpage element path information according to the traversal result.
In one embodiment, the processor further implements the following steps when executing the computer program: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; and generating a mask layer for the first webpage element according to the boundary value. When the processor executes the computer program to implement the step of acquiring the path information of the triggered first webpage element, the following step is specifically implemented: acquiring the path information of the triggered first webpage element according to the mask layer.
In one embodiment, the processor, when executing the computer program, performs the steps of: acquiring page turning information in a first target webpage; acquiring a second target webpage according to the page turning information; acquiring second webpage element path information of a second target webpage; when at least two webpage elements of the same type in a second target webpage are triggered, acquiring path information of the triggered second webpage elements; acquiring second similar path information with a similar path structure according to the triggered path information of the second webpage element; and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In one embodiment, when the processor executes the computer program to implement the step of determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, the following step is specifically implemented: acquiring the same-level elements and parent-level elements of the triggered first webpage elements of the same type from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements of the same type, the same-level elements, and the parent-level elements as the first target elements.
In one embodiment, when the processor executes the computer program to implement the step of acquiring the web page data of the plurality of first target elements, the following steps are specifically implemented: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating to extract data of preset parameters in webpage elements of the first target webpage; and acquiring webpage data of a plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters, and when the processor executes the computer program to implement the step of acquiring the webpage data of the plurality of first target elements according to the configuration information, the following step is specifically implemented: acquiring the text data and/or the link data in the plurality of first target elements according to the configuration information, wherein the webpage data includes the text data and/or the link data.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring first webpage element path information of a first target webpage; when at least two webpage elements of the same type in a first target webpage are triggered, acquiring path information of the triggered first webpage elements; acquiring first similar path information with similar path structures according to the path information of the first webpage element; determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In one embodiment, when the computer program is executed by the processor to implement the step of acquiring the first webpage element path information of the first target webpage, the following step is specifically implemented: traversing the DOM tree structure of the first target webpage, and generating the first webpage element path information according to the traversal result.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; and generating a mask layer for the first webpage element according to the boundary value. When the computer program is executed by the processor to implement the step of acquiring the path information of the triggered first webpage element, the following step is specifically implemented: acquiring the path information of the triggered first webpage element according to the mask layer.
In one embodiment, the computer program when executed by the processor performs the steps of: acquiring page turning information in a first target webpage; acquiring a second target webpage according to the page turning information; acquiring second webpage element path information of a second target webpage; when at least two webpage elements of the same type in a second target webpage are triggered, acquiring path information of the triggered second webpage elements; acquiring second similar path information with a similar path structure according to the triggered path information of the second webpage element; and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In one embodiment, when the computer program is executed by the processor to implement the step of determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, the following step is specifically implemented: acquiring the same-level elements and parent-level elements of the triggered first webpage elements of the same type from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements of the same type, the same-level elements, and the parent-level elements as the first target elements.
In one embodiment, when the computer program is executed by the processor to implement the step of acquiring the web page data of the plurality of first target elements, the following steps are specifically implemented: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating to extract data of preset parameters in webpage elements of the first target webpage; and acquiring webpage data of a plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters, and when the computer program is executed by the processor to implement the step of acquiring the webpage data of the plurality of first target elements according to the configuration information, the following step is specifically implemented: acquiring the text data and/or the link data in the plurality of first target elements according to the configuration information, wherein the webpage data includes the text data and/or the link data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A device control right sharing method, the method comprising:
determining intelligent equipment with current open control right, and acquiring equipment type identification of the intelligent equipment and network information of the intelligent equipment;
establishing a corresponding relation between the intelligent equipment and the equipment type identification and the network information;
receiving a control right acquisition request sent by a first terminal, wherein the control right acquisition request carries a current equipment type identifier;
and determining a target intelligent device according to the current device type identifier and the corresponding relation, and sharing the control right of the target intelligent device for the first terminal.
2. The method according to claim 1, wherein the network information comprises network identification information of a wireless access point to which the intelligent device is connected and a peripheral wireless access point list of the intelligent device, preferably, the network identification information comprises a MAC address and an egress IP;
the determining the target intelligent device according to the current device type identifier and the corresponding relationship includes:
when the first terminal is connected with a wireless access point, acquiring network identification information of the wireless access point connected with the first terminal, and determining the target intelligent equipment according to the current equipment type identification, the network identification information of the wireless access point connected with the first terminal and the corresponding relation list;
and/or
and when the first terminal is not connected with a wireless access point, acquiring a peripheral wireless access point list of the first terminal, and determining the target intelligent equipment according to the current equipment type identifier, the peripheral wireless access point list connected with the first terminal and the corresponding relation list.
3. The method according to claim 2, wherein the determining the target smart device according to the current device type identifier, the peripheral wireless access point list to which the first terminal is connected, and the correspondence list comprises:
when the intelligent equipment to be compared meets a preset equipment matching condition, determining the intelligent equipment to be compared as the target intelligent equipment, wherein the intelligent equipment to be compared is the intelligent equipment with the established corresponding relation;
the equipment matching condition comprises that the peripheral wireless access point list of the first terminal comprises wireless access points connected with the intelligent equipment to be compared, and the matching degree of the peripheral wireless access point list of the first terminal and the peripheral wireless access point list of the intelligent equipment to be compared exceeds a preset threshold value;
or the device matching condition includes that the peripheral wireless access point list of the intelligent device to be compared includes the wireless access point connected with the first terminal, and the matching degree of the peripheral wireless access point list of the first terminal and the peripheral wireless access point list of the intelligent device to be compared exceeds a preset threshold value.
4. The method according to claim 3, wherein when determining the smart device that currently opens the control right, the geographical location information of the smart device is further obtained, and when the first terminal is not connected to the wireless access point, the geographical location information of the first terminal is further obtained;
the establishing of the corresponding relationship between the intelligent device and the device type identifier and the network information includes: establishing a corresponding relation between the intelligent equipment and the equipment type identification, the network information and the geographical position information;
the device matching condition further comprises that the distance value between the first terminal and the intelligent device to be compared is smaller than a preset distance threshold value, and the distance value is determined according to the geographical position information of the first terminal and the geographical position information of the intelligent device to be compared.
5. The method according to any one of claims 1 to 4, wherein the current device type identifier is obtained by scanning a two-dimensional code by the first terminal, and preferably, the two-dimensional code is disposed on a surface of a smart device.
6. The method of claim 5, further comprising:
receiving a control right opening request sent by a second terminal, wherein the control right opening request carries user information;
and when the binding relationship between the user information and the intelligent equipment exists, determining part of the intelligent equipment or all the intelligent equipment bound with the user information as the intelligent equipment with open control right.
7. The method of claim 5, further comprising:
periodically updating the information in the corresponding relation, preferably, interactively updating the information in the corresponding relation through intelligent equipment related to the corresponding relation;
and/or
the control right is a temporary control right, and the method further comprises the following steps: and when the time that the first terminal enjoys the control right of the target intelligent equipment exceeds a preset time threshold, terminating the control right of the first terminal to the target intelligent equipment, wherein the time threshold is preferably an adjustable value.
8. An apparatus for sharing control right of a device, the apparatus comprising:
the acquisition module is used for determining the intelligent equipment with the current open control right, and acquiring the equipment type identifier of the intelligent equipment and the network information of the intelligent equipment;
the establishing module is used for establishing the corresponding relation between the intelligent equipment and the equipment type identification and the network information;
the receiving module is used for receiving a control right acquisition request sent by a first terminal, wherein the control right acquisition request carries a current equipment type identifier;
and the sharing module is used for determining the target intelligent equipment according to the current equipment type identifier and the corresponding relation and sharing the control right of the target intelligent equipment for the first terminal.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201911198993.7A 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium Active CN111090797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911198993.7A CN111090797B (en) 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111090797A true CN111090797A (en) 2020-05-01
CN111090797B CN111090797B (en) 2023-07-25

Family

ID=70393709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911198993.7A Active CN111090797B (en) 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111090797B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method
CN102117289A (en) * 2009-12-30 2011-07-06 北京大学 Method and device for extracting comment content from webpage
CN102831121A (en) * 2011-06-15 2012-12-19 阿里巴巴集团控股有限公司 Method and system for extracting webpage information

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111638879A (en) * 2020-05-15 2020-09-08 民生科技有限责任公司 System, method and device for overcoming pixel point positioning limitation and readable storage medium
CN111638879B (en) * 2020-05-15 2023-10-31 民生科技有限责任公司 System, method, apparatus and readable storage medium for overcoming pixel positioning limitation
CN112882625A (en) * 2021-02-10 2021-06-01 南京苏宁软件技术有限公司 Element pickup method, element pickup device, computer equipment and storage medium
CN112882625B (en) * 2021-02-10 2022-05-17 南京苏宁软件技术有限公司 Element pickup method, element pickup device, computer equipment and storage medium
CN113918460A (en) * 2021-10-15 2022-01-11 京东科技信息技术有限公司 Page testing method, device, equipment and medium
CN114528005A (en) * 2021-11-29 2022-05-24 深圳市千源互联网科技服务有限公司 Grab tag updating method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111090797B (en) 2023-07-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant