CN114443929A - Data capturing method, device and medium - Google Patents

Data capturing method, device and medium

Info

Publication number
CN114443929A
Authority
CN
China
Prior art keywords
selenium
data
webdriver
browser
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210117885.8A
Other languages
Chinese (zh)
Inventor
麻荣雨
李宁
高鹏超
毕云鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202210117885.8A
Publication of CN114443929A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to the technical field of web crawlers and in particular provides a data capturing method. Compared with the prior art, when the invention faces an anti-crawler mechanism it can effectively bypass that mechanism through a series of operations, greatly improving the ability to acquire data.

Description

Data capturing method, device and medium
Technical Field
The invention relates to the technical field of web crawlers, and in particular provides a data capturing method, a data capturing device and a data capturing medium.
Background
A web crawler is a program or script that automatically crawls the World Wide Web according to certain rules. Its workflow is fairly complex: starting from the URLs of one or more seed web pages, it filters out links irrelevant to the topic using a web page analysis algorithm, and keeps extracting new URLs from the current page and adding them to a queue until a stop condition of the system is met. In addition, all pages fetched by the crawler are stored by the system, then analyzed, filtered, and indexed for later query and retrieval.
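The queue-driven workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`crawl`, `extract_links`, `is_relevant`), the `max_pages` stop condition, and the toy link graph with `example.com` URLs are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          extract_links: Callable[[str], Iterable[str]],
          is_relevant: Callable[[str], bool],
          max_pages: int = 100) -> list[str]:
    """Visit pages breadth-first until the queue is empty or max_pages is reached."""
    queue = deque(seeds)         # URLs waiting to be fetched
    seen = set(queue)            # URLs already queued, to avoid revisits
    visited: list[str] = []      # pages "stored", in visit order

    while queue and len(visited) < max_pages:    # stop conditions
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(url):
            # Filter out duplicates and links unrelated to the topic.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real pages (hypothetical URLs).
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://ads.example.net/x"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": [],
}

pages = crawl(["https://example.com/"],
              extract_links=lambda u: LINKS.get(u, []),
              is_relevant=lambda u: u.startswith("https://example.com/"))
print(pages)
# prints ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

In a real crawler, `extract_links` would fetch and parse the page; here it is a dictionary lookup so that the queue logic can be read in isolation.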
When simulating a request, an existing web crawler can issue the request only after the whole communication flow has been analyzed, and only then receives the response; this intermediate communication flow is relatively complex. Some websites send a large number of AJAX requests, fetch data asynchronously and render it into the page, and a web crawler cannot respond to and process these asynchronous requests in time. Some websites also add anti-crawler mechanisms, for which ordinary web crawlers are unsuitable.
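The AJAX limitation can be illustrated with a small sketch. The markup below is a hypothetical server response, and the `/api/items` endpoint is an assumption. The result list is empty in the raw HTML because a script fills it in after loading, so a parser that reads only the raw response finds no list items.

```python
from html.parser import HTMLParser

# Hypothetical server response: the list is empty until a script fills it.
RAW_HTML = """
<html><body>
  <ul id="results"></ul>
  <script>
    fetch('/api/items').then(r => r.json()).then(items => {
      /* the browser inserts the list items here after the page has loaded */
    });
  </script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags found in the markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


counter = ListItemCounter()
counter.feed(RAW_HTML)
print(counter.count)  # prints 0: the data exists only after the script runs in a browser
```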
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a highly practical data capturing method.
A further object of the invention is to provide a data capturing device that is reasonably designed, safe and practical.
A further technical task of the invention is to provide a computer-readable medium.
The technical solution adopted by the invention to solve the technical problem is as follows:
A data capturing method, characterized in that a Python environment and the Selenium automated testing tool are used to call a browser that initiates the web page access request and simulates user operations: the page is opened, the rendered result of the web page is obtained, and the target data returned to the page is extracted.
Further, the Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;
Selenium IDE is a plug-in embedded in the Firefox browser. It records and plays back Selenium scripts in Firefox and converts the recorded scripts into the programming languages supported by Selenium WebDriver.
Further, Selenium WebDriver is a set of APIs for operating a browser. It supports many types of browsers, works across operating systems, and provides complete third-party libraries for web automation testing in multiple languages.
Further, the steps for capturing data with the Selenium automation tool in a Python environment are:
S1, installing the Python development environment and Selenium;
S2, installing the WebDriver browser driver in the Python environment;
S3, countering the anti-crawler mechanism by using a proxy IP and port, hiding Selenium configuration items, or controlling a browser opened in advance;
S4, simulating a real user browsing the web page;
S5, capturing the useful data and storing it in a document storage tool;
S6, repeating step S4 and step S5 until the target data has been acquired.
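The three measures named in step S3 might be configured as in the sketch below. The source does not say which configuration items are hidden, so the Chrome switches and options used here (`--disable-blink-features=AutomationControlled`, `excludeSwitches`, `debuggerAddress`) are assumptions based on commonly used ChromeDriver settings. The function names and the example proxy address are hypothetical.

```python
def build_chrome_arguments(proxy_host=None, proxy_port=None):
    """Return Chrome command-line switches for the S3 measures (assumed choices)."""
    args = [
        # Ask the rendering engine not to expose the automation flag to page scripts.
        "--disable-blink-features=AutomationControlled",
    ]
    if proxy_host and proxy_port:
        # Route traffic through a proxy IP and port.
        args.append(f"--proxy-server=http://{proxy_host}:{proxy_port}")
    return args


def make_driver(proxy_host=None, proxy_port=None, debugger_address=None):
    """Create a Chrome WebDriver; needs the selenium package and a Chrome driver."""
    # Imported lazily so build_chrome_arguments() works without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if debugger_address:
        # Attach to a browser opened in advance, for example one started with
        # --remote-debugging-port=9222 (the address is an assumption).
        options.add_experimental_option("debuggerAddress", debugger_address)
    else:
        for arg in build_chrome_arguments(proxy_host, proxy_port):
            options.add_argument(arg)
        # Drop the switch that marks the browser as controlled by automation.
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
    return webdriver.Chrome(options=options)


print(build_chrome_arguments("203.0.113.10", 8080))
```

Calling `make_driver(debugger_address="127.0.0.1:9222")` would attach to an already running browser instead of launching a new one. Launch-time switches are skipped in that branch because they apply only to a browser that Selenium starts itself.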
Further, in step S4, the data capture logic code is written according to the specific web page structure: WebDriver's element-locating features are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks.
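One way to pause for a few seconds between clicks is sketched below. The randomized delay and the 2 to 5 second bounds are assumptions, since the source only says "a few seconds". `click_with_pauses` and `FakeButton` are hypothetical names. The function accepts any objects with a `click()` method, which Selenium `WebElement` objects provide.

```python
import random
import time


def click_with_pauses(elements, min_pause=2.0, max_pause=5.0, sleep=time.sleep):
    """Click each element in turn, waiting a random few seconds between clicks."""
    pauses = []
    for index, element in enumerate(elements):
        if index > 0:
            delay = random.uniform(min_pause, max_pause)
            sleep(delay)         # pause between two consecutive clicks
            pauses.append(delay)
        element.click()          # WebElement.click() when used with Selenium
    return pauses


class FakeButton:
    """Stand-in for a Selenium WebElement so the sketch runs without a browser."""

    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


buttons = [FakeButton() for _ in range(3)]
pauses = click_with_pauses(buttons, sleep=lambda seconds: None)  # skip real sleeping
print(len(pauses))  # prints 2: one pause between each pair of clicks
```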
Further, step S4 further comprises:
S4-1, importing webdriver from the selenium package in order to use the Selenium WebDriver methods;
S4-2, calling Selenium commands to interact with the browser through the interface provided by the selenium package;
S4-3, setting an implicit wait of 20-40 seconds to define the timeout for Selenium execution steps;
S4-4, calling the driver.get() method to access the application; after the call, WebDriver waits until the page has finished loading before continuing to execute the script;
S4-5, using Selenium WebDriver to locate and operate elements;
S4-6, entering a new value through the send_keys() method and calling submit() to submit the search request;
S4-7, loading the search result page, reading the contents of the result list and printing them; find_elements_by_xpath is used to get all div tags whose class attribute satisfies class='c-abstract', and it returns a list of the matching elements;
S4-8, finally printing the text content of the acquired tags.
Further, in step S4-7, at the end of the script, the browser is closed using driver.quit().
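The S4 sub-steps above can be sketched as one function. This is an illustration under stated assumptions, not the patent's code. The target URL and the search box name `q` are hypothetical placeholders. The sketch uses the Selenium 4 locator call `find_elements(by, value)` because the `find_elements_by_xpath` helper named in the text was removed in Selenium 4.3. The function takes the driver as a parameter so it can be tried with a stand-in object.

```python
def search_and_collect(driver, url, query, box_name, result_xpath, wait_seconds=30):
    """Run one search through a WebDriver-like object and return the result texts."""
    try:
        driver.implicitly_wait(wait_seconds)    # implicit wait for element lookups
        driver.get(url)                         # returns once the page has loaded
        # "name" and "xpath" are the string values of By.NAME and By.XPATH.
        box = driver.find_element("name", box_name)
        box.clear()
        box.send_keys(query)                    # type the search value
        box.submit()                            # submit the search form
        results = driver.find_elements("xpath", result_xpath)
        return [element.text for element in results]
    finally:
        driver.quit()                           # always close the browser


def main():
    # Needs the selenium package and a Chrome driver to be installed.
    from selenium import webdriver

    texts = search_and_collect(
        webdriver.Chrome(),
        url="https://example.com/search",       # hypothetical target site
        query="selenium",
        box_name="q",                           # hypothetical search box name
        result_xpath="//div[@class='c-abstract']",
    )
    for text in texts:
        print(text)
```

Calling `main()` runs the flow against a real browser. Placing `driver.quit()` in a `finally` block is a design choice of this sketch, made so the browser closes even if a lookup fails.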
Further, in step S5, the code starts the browser through the WebDriver component, runs the logic code, and stores the captured useful data in a database, Excel, or Notepad.
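The storage targets in step S5 could be handled with the Python standard library alone, as in this sketch. The (title, abstract) row layout, the `captured` table name, and the function names are assumptions. CSV stands in for Excel here because Excel opens CSV files directly, and SQLite stands in for the database.

```python
import csv
import io
import sqlite3


def write_csv(rows, handle):
    """Write (title, abstract) rows as CSV to an open text file or buffer."""
    writer = csv.writer(handle)
    writer.writerow(["title", "abstract"])
    writer.writerows(rows)


def save_to_sqlite(rows, connection):
    """Insert (title, abstract) rows into a SQLite table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS captured (title TEXT, abstract TEXT)")
    connection.executemany(
        "INSERT INTO captured (title, abstract) VALUES (?, ?)", rows)
    connection.commit()


# Hypothetical captured data.
rows = [("Page A", "first abstract"), ("Page B", "second abstract")]

buffer = io.StringIO()                 # in-memory stand-in for a .csv file
write_csv(rows, buffer)

conn = sqlite3.connect(":memory:")     # in-memory stand-in for a database file
save_to_sqlite(rows, conn)
stored = conn.execute("SELECT COUNT(*) FROM captured").fetchone()[0]
print(stored)  # prints 2
```

To write a real file, pass `open(path, "w", newline="", encoding="utf-8-sig")` as the handle. The byte-order mark added by `utf-8-sig` helps Excel detect the encoding of non-ASCII text.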
A data capturing device, comprising: at least one memory and at least one processor;
the at least one memory is used for storing a machine-readable program;
the at least one processor is used for calling the machine-readable program and executing the data capturing method.
A computer-readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the data capturing method.
Compared with the prior art, the data capturing method, device and medium of the invention have the following outstanding beneficial effects:
The invention uses the Selenium automated testing tool to simulate, entirely in code, a human using a browser to visit and operate the target site, thereby obtaining the page as rendered. This avoids a series of complex communication flows, handles asynchronous requests conveniently, and effectively improves data capturing capability. When faced with an anti-crawler mechanism, the invention can effectively bypass it through a series of operations, greatly improving the ability to acquire data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flow chart diagram of a data capture method.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
As shown in Fig. 1, the data capturing method in this embodiment uses a Python environment and the Selenium automated testing tool to call a browser that initiates the web page access request and simulates user operations: the page is opened, the rendered result of the web page is obtained, and the target data in the returned page is extracted, so that data can be captured in batches. In addition, the invention applies some additional handling that effectively counters anti-crawler strategies aimed at Selenium.
The Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;
Selenium IDE is a plug-in embedded in the Firefox browser. It records and plays back Selenium scripts in Firefox and converts the recorded scripts into the programming languages supported by Selenium WebDriver, so that they can be extended to a wider range of browser types.
Selenium WebDriver supports multiple languages. It is a set of APIs for operating a browser, supports many types of browsers, works across operating systems, and provides complete third-party libraries for web automation testing in multiple languages.
The steps for capturing data with the Selenium automation tool in a Python environment are as follows:
S1, installing the Python development environment and Selenium.
S2, installing the WebDriver browser driver in the Python environment.
S3, some websites apply anti-crawler strategies and may block the crawler. This can be addressed by using a proxy IP and port, hiding Selenium configuration items, controlling a browser opened in advance, and similar measures.
S4, writing the data capture logic code according to the specific web page structure (HTML code): WebDriver's element-locating features are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks.
(1) importing webdriver from the selenium package in order to use the Selenium WebDriver methods;
(2) selecting a browser driver instance, and calling Selenium commands to interact with the browser through the interface provided by the selenium package;
(3) setting an implicit wait of 30 s to define the timeout for Selenium execution steps;
(4) calling the driver.get() method to access the application; after the call, WebDriver waits until the page has finished loading before continuing to execute the script;
(5) Selenium WebDriver provides a number of methods to locate and manipulate elements, such as setting a value, clicking a button, or selecting an option in a drop-down component;
(6) entering a new value through the send_keys() method and calling submit() to submit the search request;
(7) loading the search result page, reading the contents of the result list and printing them; find_elements_by_xpath is used to get all div tags whose class attribute satisfies class='c-abstract', and it returns a list of the matching elements;
(8) finally, printing the text content of the acquired tags; at the end of the script, the browser can be closed using driver.quit().
S5, starting the browser from the code through the WebDriver component, running the logic code, and storing the captured useful data in a database or in document storage tools such as Excel or Notepad.
S6, repeating steps S4 and S5 until the target data has been acquired.
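Repeating the browsing and storing steps until acquisition is complete amounts to a loop with a stop condition. The sketch below assumes the site is paged and that an empty page means the target data is exhausted; neither assumption comes from the source. `capture_all`, `fetch_page`, `store`, and the `max_pages` cap are hypothetical names.

```python
def capture_all(fetch_page, store, max_pages=50):
    """Repeat browsing (S4) and storing (S5) until no data is left or the cap is hit."""
    total = 0
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)    # S4: browse one page and return its rows
        if not rows:                      # empty page: target data exhausted
            break
        store(rows)                       # S5: persist the captured rows
        total += len(rows)
    return total


# Stand-ins for the Selenium-backed steps (hypothetical data).
FAKE_SITE = {1: ["a", "b"], 2: ["c"]}
collected = []
count = capture_all(lambda n: FAKE_SITE.get(n, []), collected.extend)
print(count, collected)  # prints 3 ['a', 'b', 'c']
```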
A data capturing device, comprising: at least one memory and at least one processor;
the at least one memory is used for storing a machine-readable program;
the at least one processor is used for calling the machine-readable program and executing the data capturing method.
A computer-readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the data capturing method.
The above embodiments are only specific cases of the present invention, and the protection scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of a data capture method, device and medium of the present invention and are made by those skilled in the art should fall within the protection scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A data capturing method, characterized in that a Python environment and the Selenium automated testing tool are used to call a browser that initiates the web page access request and simulates user operations: the page is opened, the rendered result of the web page is obtained, and the target data returned to the page is extracted.
2. The data capturing method of claim 1, wherein the Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;
Selenium IDE is a plug-in embedded in the Firefox browser, used for recording and playing back Selenium scripts in Firefox and converting the recorded scripts into the programming languages supported by Selenium WebDriver.
3. The data capturing method of claim 2, wherein Selenium WebDriver is a set of APIs for operating a browser, supports many types of browsers, works across operating systems, and provides complete third-party libraries for web automation testing in multiple languages.
4. The data capturing method of claim 3, wherein the steps for capturing data with the Selenium automation tool in a Python environment are:
S1, installing the Python development environment and Selenium;
S2, installing the WebDriver browser driver in the Python environment;
S3, countering the anti-crawler mechanism by using a proxy IP and port, hiding Selenium configuration items, or controlling a browser opened in advance;
S4, simulating a real user browsing the web page;
S5, capturing the useful data and storing it in a document storage tool;
S6, repeating step S4 and step S5 until the target data has been acquired.
5. The data capturing method of claim 4, wherein in step S4 the data capture logic code is written according to the specific web page structure: WebDriver's element-locating features are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks.
6. The data capturing method of claim 5, wherein step S4 further comprises:
S4-1, importing webdriver from the selenium package in order to use the Selenium WebDriver methods;
S4-2, calling Selenium commands to interact with the browser through the interface provided by the selenium package;
S4-3, setting an implicit wait of 20-40 seconds to define the timeout for Selenium execution steps;
S4-4, calling the driver.get() method to access the application; after the call, WebDriver waits until the page has finished loading before continuing to execute the script;
S4-5, using Selenium WebDriver to locate and operate elements;
S4-6, entering a new value through the send_keys() method and calling submit() to submit the search request;
S4-7, loading the search result page, reading the contents of the result list and printing them; find_elements_by_xpath is used to get all div tags whose class attribute satisfies class='c-abstract', and it returns a list of the matching elements;
S4-8, finally printing the text content of the acquired tags.
7. The data capturing method of claim 6, wherein in step S4-7, at the end of the script, the browser is closed using driver.quit().
8. The data capturing method of claim 7, wherein in step S5 the browser is started through the WebDriver component, the logic code is run, and the captured useful data is stored in a database, Excel, or Notepad.
9. A data capturing device, comprising: at least one memory and at least one processor;
the at least one memory is used for storing a machine-readable program;
the at least one processor is used for calling the machine-readable program to perform the method of any of claims 1 to 8.
10. A computer-readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210117885.8A CN114443929A (en) 2022-02-08 2022-02-08 Data capturing method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210117885.8A CN114443929A (en) 2022-02-08 2022-02-08 Data capturing method, device and medium

Publications (1)

Publication Number Publication Date
CN114443929A (en) 2022-05-06

Family

ID=81370741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210117885.8A Pending CN114443929A (en) 2022-02-08 2022-02-08 Data capturing method, device and medium

Country Status (1)

Country Link
CN (1) CN114443929A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination