CN114443929A

CN114443929A - Data capturing method, device and medium

Info

Publication number: CN114443929A
Application number: CN202210117885.8A
Authority: CN
Inventors: 麻荣雨; 李宁; 高鹏超; 毕云鹏
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2022-02-08
Filing date: 2022-02-08
Publication date: 2022-05-06

Abstract

The invention relates to the technical field of crawlers, and particularly provides a data capturing method. Compared with the prior art, when the invention encounters an anti-crawler mechanism it can effectively circumvent that mechanism through a series of operations, thereby greatly improving the threshold of data acquisition.

Description

Data capturing method, device and medium

Technical Field

The invention relates to the technical field of crawlers, and particularly provides a data capturing method, a data capturing device and a data capturing medium.

Background

A web crawler is a program or script that automatically crawls the World Wide Web according to certain rules. The workflow of a crawler is relatively complex: starting from the URLs of one or more initial web pages, it filters out links irrelevant to the topic according to a web page analysis algorithm, and continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. In addition, all web pages fetched by the crawler are stored by the system, then analyzed, filtered, and indexed for later query and retrieval.
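The queue-driven workflow described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the method of the invention: the names `crawl`, `fetch`, and `is_relevant`, the `max_pages` stop condition, and the in-memory `SITE` dictionary are all assumptions introduced for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed_urls, fetch, is_relevant, max_pages=100):
    """Breadth-first crawl that returns {url: html} for every page visited.

    fetch(url) returns the page text or None (hypothetical callback).
    is_relevant(url) is a hypothetical topic filter for extracted links.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:  # stop conditions
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # stored for later analysis and indexing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen and is_relevant(absolute):
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Tiny in-memory "web" so the sketch runs without network access.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/ads">Ads</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "http://example.test/b": "<p>leaf page</p>",
}

visited = crawl(
    ["http://example.test/"],
    fetch=SITE.get,
    is_relevant=lambda url: "ads" not in url,
)
print(sorted(visited))
```

A crawler of this kind only sees the HTML returned by `fetch`; content that a page loads later through asynchronous requests never reaches it.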

When simulating a request, an existing web crawler can issue the request and receive the response result only after the entire communication flow has been analyzed, and the intermediate communication flow is relatively complex. Some websites send a large number of AJAX requests, obtain data asynchronously, and render it on the page, and a web crawler cannot respond to and process these asynchronous requests in time. Some websites also add anti-crawler mechanisms, for which ordinary web crawlers are not suitable.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a data capturing method with strong practicability.

The invention further aims to provide a data grabbing device which is reasonable in design, safe and applicable.

It is a further technical task of the present invention to provide a computer readable medium.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A data capturing method, characterized in that a Python environment and the Selenium automated testing tool are used to call a browser to initiate a web page access request, simulate user operations, open a page, extract target data from the page, obtain the rendered result of the web page, and acquire the data returned to the page.

Further, the Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;

the Selenium IDE is a plug-in embedded in the Firefox browser and is used for recording and playing back Selenium scripts in Firefox and for converting the recorded scripts into the various programming languages supported by Selenium WebDriver.

Furthermore, the Selenium WebDriver is a set of APIs for operating a browser; it supports various types of browsers, works across operating systems, and provides complete third-party libraries in multiple languages for implementing web automation testing.

Further, the steps of capturing data using the Selenium automation tool in a Python environment are:

S1, installing the Python development environment and Selenium;

S2, installing the WebDriver browser driver in the Python environment;

S3, overcoming the anti-crawler mechanism by using a proxy IP and port, by hiding Selenium configuration items, or by controlling a browser opened in advance;

S4, simulating a real user browsing a web page;

S5, capturing useful data and storing it in a document storage tool;

and S6, repeating step S4 and step S5 until acquisition of the target data is completed.
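The three strategies named in step S3 might be configured for Chrome roughly as shown below. This is a sketch under stated assumptions rather than the configuration used by the invention: the source does not name a browser or any specific option, so the Chrome switches, the `excludeSwitches` and `useAutomationExtension` entries, the helper names, and the placeholder proxy address are all choices made for this example.

```python
def anti_detection_settings(proxy_host=None, proxy_port=None, debugger_address=None):
    """Return Chrome settings for one of the S3 strategies as a plain dict."""
    if debugger_address:
        # Strategy 3: attach to a browser the user opened beforehand, for example
        # one started with --remote-debugging-port=9222. No other option is set,
        # because this mode takes over an already configured browser.
        return {"arguments": [], "experimental": {}, "debugger_address": debugger_address}

    # Strategy 2: hide automation-related configuration items (assumed switches).
    arguments = ["--disable-blink-features=AutomationControlled"]
    experimental = {
        "excludeSwitches": ["enable-automation"],
        "useAutomationExtension": False,
    }
    # Strategy 1: route traffic through a proxy IP and port.
    if proxy_host and proxy_port:
        arguments.append(f"--proxy-server=http://{proxy_host}:{proxy_port}")
    return {"arguments": arguments, "experimental": experimental, "debugger_address": None}


def build_chrome_options(settings):
    """Translate the dict into ChromeOptions (requires selenium to be installed)."""
    from selenium import webdriver  # lazy import: the helper above runs without selenium

    options = webdriver.ChromeOptions()
    if settings["debugger_address"]:
        options.debugger_address = settings["debugger_address"]
        return options
    for argument in settings["arguments"]:
        options.add_argument(argument)
    for name, value in settings["experimental"].items():
        options.add_experimental_option(name, value)
    return options


# Placeholder proxy address, not a value from the source.
settings = anti_detection_settings(proxy_host="127.0.0.1", proxy_port=8080)
print(settings["arguments"])
```

With Selenium installed, `webdriver.Chrome(options=build_chrome_options(settings))` would start a browser with these settings. Whether a given site's anti-crawler checks are actually bypassed depends on the site.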

Further, in step S4, according to the specific web page structure, the element-locating features of WebDriver are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks, and the data capture logic code is written.
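The pause between two button clicks could be implemented with a small helper such as the one below. It is only a sketch: the source says "a few seconds" without giving a range, so the 2 to 5 second bounds and the randomization are assumptions, and `FakeButton` is a hypothetical stand-in for a Selenium `WebElement` so that the example runs without a browser.

```python
import random
import time


def click_with_pauses(elements, min_pause=2.0, max_pause=5.0, sleep=time.sleep):
    """Click each element in order, pausing a few seconds between two clicks.

    Returns the list of pauses that were applied (one fewer than the clicks).
    """
    pauses = []
    for index, element in enumerate(elements):
        if index > 0:
            pause = random.uniform(min_pause, max_pause)  # assumed range
            sleep(pause)
            pauses.append(pause)
        element.click()
    return pauses


class FakeButton:
    """Hypothetical stand-in for a WebElement; it only records its clicks."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def click(self):
        self.log.append(self.name)


clicked = []
buttons = [FakeButton("next", clicked), FakeButton("more", clicked)]
# The sleep function is replaced here so the demonstration finishes instantly.
pauses = click_with_pauses(buttons, sleep=lambda seconds: None)
print(clicked, len(pauses))
```

In a real script the elements would come from `driver.find_element(...)` and the default `time.sleep` would be kept.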

Further, in step S4, the method further includes:

S4-1, importing WebDriver from the Selenium package in order to use the Selenium WebDriver methods;

S4-2, calling Selenium commands to interact with the browser through the interface provided by the Selenium package;

S4-3, setting an implicit wait of 20-40 seconds to define the timeout of each Selenium execution step;

S4-4, calling the driver.get() method to access the application; after the method is called, WebDriver waits until the page has finished loading and then continues executing the script;

S4-5, using Selenium WebDriver to locate and operate elements;

S4-6, entering a new specific value through the send_keys() method, and calling submit() to submit a search request;

S4-7, loading the search result page, reading the content of the result list and printing it; acquiring, through find_elements_by_xpath, all div tags whose path satisfies class='c-abstract', which returns a list of more than one element;

and S4-8, finally printing the text content of the acquired tags.

Further, in step S4-7, at the end of the script, the browser is closed using driver.quit().
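The sub-steps of S4 above can be sketched as a single function, ending with `driver.quit()` to close the browser. Several details are assumptions rather than values from the source: the URL, the search field name `q`, the query text, and the choice of Chrome are placeholders, and the XPath uses the class name `c-abstract` as reconstructed from the garbled source text. Recent Selenium releases no longer provide `find_elements_by_xpath`, so the sketch uses `find_elements` with the `"xpath"` locator strategy instead.

```python
BY_NAME = "name"    # same string as selenium.webdriver.common.by.By.NAME
BY_XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH


def search_and_collect(driver, url, box_name, query, result_xpath, wait_seconds=30):
    """Run the S4 sub-steps on an existing WebDriver and return the result texts."""
    try:
        driver.implicitly_wait(wait_seconds)          # implicit wait (20-40 s in the text)
        driver.get(url)                               # returns once the page has loaded
        box = driver.find_element(BY_NAME, box_name)  # locate the search field
        box.clear()
        box.send_keys(query)                          # enter the value
        box.submit()                                  # submit the search request
        results = driver.find_elements(BY_XPATH, result_xpath)  # list of matching tags
        texts = [element.text for element in results]
        for text in texts:                            # print the text of each tag
            print(text)
        return texts
    finally:
        driver.quit()                                 # always close the browser at the end


def main():
    """Real-browser entry point; needs selenium and a matching browser driver."""
    from selenium import webdriver  # import WebDriver from the Selenium package

    driver = webdriver.Chrome()     # select a browser driver instance (assumed: Chrome)
    # The URL, field name, and query are placeholders, not values from the source.
    search_and_collect(
        driver,
        url="https://example.com/search",
        box_name="q",
        query="selenium",
        result_xpath="//div[@class='c-abstract']",
    )
```

Calling `main()` runs the flow in a real browser. Keeping the driver as a parameter of `search_and_collect` also lets the same logic be exercised against a stub object in tests.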

Further, in step S5, the code starts the browser through the WebDriver component, runs the logic code, and stores the captured useful data in a database, Excel, or Notepad.
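The storage part of step S5 could look like the sketch below, which uses only the Python standard library. The source names a database, Excel, and Notepad without specifying formats, so SQLite, a CSV file that Excel can open, the table and column names, and the sample rows are all assumptions made for illustration.

```python
import csv
import os
import sqlite3
import tempfile


def save_to_csv(rows, path):
    """Write (position, text) rows to a CSV file that Excel can open."""
    # utf-8-sig adds a byte order mark so Excel detects the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["position", "text"])
        writer.writerows(rows)


def save_to_sqlite(rows, connection):
    """Insert (position, text) rows into a hypothetical 'results' table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS results (position INTEGER, text TEXT)"
    )
    connection.executemany(
        "INSERT INTO results (position, text) VALUES (?, ?)", rows
    )
    connection.commit()


# Stand-in for the texts captured in step S4.
captured = ["first abstract", "second abstract"]
rows = list(enumerate(captured, start=1))

csv_path = os.path.join(tempfile.mkdtemp(), "results.csv")
save_to_csv(rows, csv_path)

connection = sqlite3.connect(":memory:")  # in-memory database for the demonstration
save_to_sqlite(rows, connection)
count = connection.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)
```

A file-backed database path in place of `":memory:"` would keep the rows after the script exits.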

A data capture device, comprising: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor is used for calling the machine readable program and executing the data capturing method.

A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the data capturing method.

Compared with the prior art, the data capturing method, the data capturing device and the data capturing medium have the following outstanding beneficial effects:

the invention uses the Selenium automated testing tool to fully simulate, through code, a person using a browser to access and operate the target site, so as to obtain the web page as rendered; this avoids a series of complex communication processes, makes asynchronous requests easy to handle, and effectively improves data capturing capability. When an anti-crawler mechanism is encountered, it can be effectively circumvented through a series of operations, and the threshold of data acquisition is greatly improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flow chart of the data capturing method.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A preferred embodiment is given below:

As shown in Fig. 1, in the data capturing method of this embodiment, a Python environment and the Selenium automated testing tool are used to call a browser to initiate a web page access request, simulate user operations, open a page, extract target data from the page, obtain the rendered result of the web page, and acquire the data in the returned page, thereby capturing data in batches. In addition, the invention includes some additional processing that can effectively counter anti-Selenium crawler strategies.

The Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;

The Selenium IDE is a plug-in embedded in the Firefox browser; it is used for recording and playing back Selenium scripts in Firefox and for converting the recorded scripts into the various programming languages supported by Selenium WebDriver, so that they can be extended to a wider range of browser types.

The Selenium WebDriver is a set of APIs for operating a browser; it supports multiple languages, supports various types of browsers, works across operating systems, and provides complete third-party libraries in multiple languages for implementing web automation testing.
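Because WebDriver exposes the same API for different browsers, the browser can be chosen by name when a script starts. The following is a minimal sketch of that idea; the `DRIVER_CLASSES` table and the helper names are hypothetical, and only the class names `Chrome`, `Firefox`, `Edge`, and `Safari` come from the Selenium Python package.

```python
# Selenium class names for a few supported browsers; the keys are hypothetical labels.
DRIVER_CLASSES = {
    "chrome": "Chrome",
    "firefox": "Firefox",
    "edge": "Edge",
    "safari": "Safari",
}


def driver_class_name(browser):
    """Map a browser label such as 'Firefox' to the selenium.webdriver class name."""
    key = browser.strip().lower()
    if key not in DRIVER_CLASSES:
        raise ValueError(f"unsupported browser: {browser!r}")
    return DRIVER_CLASSES[key]


def create_driver(browser="chrome"):
    """Start the chosen browser (requires selenium and the matching browser driver)."""
    from selenium import webdriver  # lazy import: the mapping works without selenium

    return getattr(webdriver, driver_class_name(browser))()


print(driver_class_name("Firefox"))
```

The rest of a capture script can then use the returned driver without caring which browser is behind it.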

The steps of capturing data using the Selenium automation tool in a Python environment are as follows:

S1, installing the Python development environment and Selenium.

S2, installing the WebDriver browser driver in the Python environment.

S3, for websites that have anti-crawler strategies and may block the crawler, the anti-crawler mechanism can be overcome by using a proxy IP and port, hiding Selenium configuration items, controlling a browser opened in advance, and the like.

S4, according to the specific web page structure (HTML code), using the element-locating features of WebDriver to simulate a real user browsing the web page, pausing for a few seconds between two button clicks, and writing the data capture logic code.

(1) importing WebDriver from the Selenium package in order to use the Selenium WebDriver methods;

(2) selecting a browser driver instance, and calling Selenium commands to interact with the browser through the interface provided by the Selenium package;

(3) setting an implicit wait of 30 s to define the timeout of each Selenium execution step;

(4) calling the driver.get() method to access the application; after the method is called, WebDriver waits until the page has finished loading and then continues executing the script;

(5) Selenium WebDriver provides a number of methods to locate and manipulate elements, such as setting a value, clicking a button, or selecting an option in a drop-down component;

(6) entering a new specific value through the send_keys() method, and calling submit() to submit a search request;

(7) loading the search result page, reading the content of the result list and printing it; acquiring, through find_elements_by_xpath, all div tags whose path satisfies class='c-abstract', which returns a list of more than one element;

(8) finally, printing the text content of the acquired tags; at the end of the script, the browser can be closed using driver.

S5, using the code to start the browser through the WebDriver component, running the logic code, and storing the captured useful data in a database or in document storage tools such as Excel and Notepad.

S6, repeating steps S4 and S5 until acquisition of the target data is completed.
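The repetition of steps S4 and S5 in step S6 can be sketched as a simple loop. The source does not say how completion is detected, so this example assumes that a round returning no records means the target data has been fully captured; `capture_until_done`, `browse_page`, `store`, the `max_rounds` guard, and the in-memory `PAGES` data are hypothetical names introduced here.

```python
def capture_until_done(browse_page, store, max_rounds=1000):
    """Repeat S4 (browse) and S5 (store) until a round yields no records.

    browse_page(round_number) returns the list of records captured in that round.
    store(records) persists them. Both are hypothetical callbacks.
    """
    total = 0
    for round_number in range(1, max_rounds + 1):
        records = browse_page(round_number)  # S4: simulate the user and capture data
        if not records:                      # assumed stop condition: nothing left
            break
        store(records)                       # S5: save the captured data
        total += len(records)
    return total


# In-memory stand-in for result pages so the sketch runs without a browser.
PAGES = {1: ["a", "b"], 2: ["c"]}
saved = []
total = capture_until_done(lambda n: PAGES.get(n, []), saved.extend)
print(total, saved)
```

In practice `browse_page` would wrap the WebDriver interactions of step S4 and `store` would write to the chosen database or file.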

The above embodiments are only specific cases of the present invention, and the protection scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of a data capture method, device and medium of the present invention and are made by those skilled in the art should fall within the protection scope of the present invention.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A data capturing method, characterized in that a Python environment and the Selenium automated testing tool are used to call a browser to initiate a web page access request, simulate user operations, open a page, extract target data from the page, obtain the rendered result of the web page, and acquire the data returned to the page.

2. The data capturing method of claim 1, wherein the Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;

the Selenium IDE is a plug-in embedded in the Firefox browser, and is used for recording and playing back Selenium scripts in Firefox and for converting the recorded scripts into the various programming languages supported by Selenium WebDriver.

3. The data capturing method as claimed in claim 2, wherein the Selenium WebDriver is a set of APIs for operating a browser, supports various types of browsers, works across operating systems, and provides complete third-party libraries in multiple languages for implementing web automation testing.

4. The method according to claim 3, wherein the steps of capturing data using the Selenium automation tool in a Python environment comprise:

S1, installing the Python development environment and Selenium;

S2, installing the WebDriver browser driver in the Python environment;

S4, simulating a real user browsing a web page;

S6, repeating step S4 and step S5 until acquisition of the target data is completed.

5. The data capturing method according to claim 4, wherein in step S4, according to the specific web page structure, the element-locating features of WebDriver are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks, and the data capture logic code is written.

6. The data capturing method as claimed in claim 5, wherein in step S4, the method further comprises:

S4-5, using Selenium WebDriver to locate and operate elements;

and S4-8, finally printing the text content of the acquired tags.

7. The method of claim 6, wherein in step S4-7, at the end of the script, the browser is closed using driver.

8. The data capturing method as claimed in claim 7, wherein in step S5, the WebDriver component is used to start the browser, the logic code is run, and the captured useful data is stored in a database, Excel, or Notepad.

9. A data capture device, comprising: at least one memory and at least one processor;

the at least one memory to store a machine readable program;

the at least one processor, configured to invoke the machine readable program, to perform the method of any of claims 1 to 8.

10. A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 8.