CN114443929A - Data capturing method, device and medium - Google Patents
- Publication number
- CN114443929A (application CN202210117885.8A)
- Authority
- CN
- China
- Prior art keywords
- selenium
- data
- webdriver
- browser
- page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
Abstract
The invention relates to the technical field of web crawlers, and in particular provides a data capturing method. Compared with the prior art, when the invention encounters an anti-crawler mechanism it can effectively circumvent that mechanism through a series of operations, thereby greatly improving its data acquisition capability.
Description
Technical Field
The invention relates to the technical field of crawlers, and particularly provides a data capturing method, a data capturing device and a data capturing medium.
Background
A web crawler is a program or script that automatically crawls the World Wide Web according to certain rules. Its workflow is fairly complex: starting from the URLs of one or more initial web pages, it filters out links irrelevant to the subject according to a web page analysis algorithm, and keeps extracting new URLs from the current page and placing them in a queue until a stop condition of the system is met. In addition, all web pages fetched by the crawler are stored by the system, then analyzed, filtered, and indexed for later query and retrieval.
When an existing web crawler simulates a request, it can issue the request only after the entire communication flow has been analyzed, and only then does it receive the response; the intermediate communication flow is relatively complex. Some websites send a large number of AJAX requests, obtain data asynchronously and render it on the page, and a conventional web crawler cannot respond to and process these asynchronous requests in time. In addition, some websites add anti-crawler mechanisms, for which ordinary web crawlers are not suitable.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data capturing method with strong practicability.
The invention further aims to provide a data grabbing device which is reasonable in design, safe and applicable.
It is a further technical task of the present invention to provide a computer readable medium.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A data capturing method, characterized in that a Python environment and the Selenium automated testing tool are used to call a browser to initiate a webpage access request, simulate user operations, open a page, extract target data from the page, obtain the rendered result of the webpage, and acquire the data returned to the page.
Further, the Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;
the Selenium IDE is a plug-in embedded in the Firefox browser and is used for recording and playing back Selenium scripts in Firefox and for converting the recorded scripts into the various programming languages supported by Selenium WebDriver.
Furthermore, the Selenium WebDriver is a set of APIs for operating the browser; it supports many types of browsers, works across operating systems, and provides a complete third-party library for web automation testing in a number of languages.
Further, the steps of capturing data with the Selenium automation tool in a Python environment are:
S1, installing the Python development environment and Selenium;
S2, installing the WebDriver browser driver in the Python environment;
S3, overcoming the anti-crawler mechanism by using a proxy IP and port, hiding Selenium configuration items, or controlling a browser opened in advance;
S4, simulating a real user browsing a webpage;
S5, capturing useful data and storing the useful data in a document storage tool;
S6, repeating step S4 and step S5 until the target data acquisition is completed.
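The repetition of steps S4 and S5 described above can be sketched as a plain loop. This is only an illustration: the names `capture_all`, `browse` and `store` are hypothetical, and treating "acquisition is completed" as "the list of targets is exhausted" is an assumption, since the text does not define the stop condition.

```python
from typing import Callable, Iterable, List


def capture_all(
    targets: Iterable[str],
    browse: Callable[[str], List[str]],
    store: Callable[[List[str]], None],
) -> int:
    """Repeat S4 (browse) and S5 (store) for every target; return the row count."""
    total = 0
    for target in targets:
        rows = browse(target)   # S4: simulate a user and read the rendered page
        if rows:
            store(rows)         # S5: persist whatever was captured
            total += len(rows)
    return total                # S6: the loop ends when the targets run out
```

Passing the browsing and storing steps in as callables keeps the loop independent of the browser and of the storage tool, so each can be swapped without touching the control flow.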
Further, in step S4, according to the specific web page structure, the element-locating features of WebDriver are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks, and the data capture logic code is written.
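The pause between two button clicks might look like the following sketch. The helper name `click_with_pauses`, the 2 to 5 second bounds and the use of a random interval are assumptions made for illustration; the text only says "a few seconds". The function relies solely on the `click()` method that Selenium's WebElement provides.

```python
import random
import time
from typing import Callable, Iterable


def click_with_pauses(
    elements: Iterable,
    min_pause: float = 2.0,
    max_pause: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Click each element in turn, sleeping a random interval between clicks."""
    clicked = 0
    for element in elements:
        if clicked:  # pause only between two clicks, not before the first
            sleep(random.uniform(min_pause, max_pause))
        element.click()
        clicked += 1
    return clicked
```

The `sleep` parameter exists only so the delay can be replaced when the helper is exercised without a browser.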
Further, step S4 further includes:
S4-1, importing webdriver from the Selenium package in order to use the methods of Selenium WebDriver;
S4-2, calling Selenium commands to interact with the browser through the interface provided by the Selenium package;
S4-3, setting an implicit wait of 20-40 seconds to define the timeout of each Selenium execution step;
S4-4, calling the driver.get() method to access the application; after the method is called, WebDriver waits until the page has finished loading and then continues executing the script;
S4-5, using Selenium WebDriver to locate and operate elements;
S4-6, entering a new specific value through the send_keys() method, and calling submit() to submit the search request;
S4-7, loading the search result page, reading the content of the result list and printing it; all div tags whose class attribute satisfies class='c-abstract' are acquired through find_elements_by_xpath, which returns a list of one or more elements;
S4-8, finally printing the text content of the acquired tags.
Further, in step S4-7, at the end of the script, the browser is closed using driver.quit().
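The sub-steps of step S4 can be sketched roughly as below. Several details are assumptions rather than part of the described method: Chrome as the browser, the placeholder URL, field name and query in `main()`, and the helper name `search_abstracts`. Recent Selenium 4 releases no longer provide `find_elements_by_xpath`, so the sketch uses `find_elements` with a locator strategy instead.

```python
from typing import List

# Locator strategy names; they equal selenium's By.NAME and By.XPATH constants.
BY_NAME = "name"
BY_XPATH = "xpath"
ABSTRACT_XPATH = "//div[@class='c-abstract']"


def search_abstracts(driver, url: str, field_name: str, query: str) -> List[str]:
    """Run the search flow against an already created WebDriver instance."""
    driver.implicitly_wait(30)              # implicit wait, in seconds
    driver.get(url)                         # returns once the page has loaded
    box = driver.find_element(BY_NAME, field_name)   # locate the search box
    box.clear()
    box.send_keys(query)                    # type the new value ...
    box.submit()                            # ... and submit the search request
    hits = driver.find_elements(BY_XPATH, ABSTRACT_XPATH)  # a list of elements
    return [hit.text for hit in hits]       # text content of each div tag


def main() -> None:
    # Import webdriver and pick a browser driver (Chrome is an assumption).
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        # URL, field name and query are placeholders, not taken from the text.
        for text in search_abstracts(driver, "https://example.com/", "q", "selenium"):
            print(text)
    finally:
        driver.quit()                       # close the browser at the end
```

Calling `main()` requires the selenium package and a local Chrome installation. Wrapping the call in `try`/`finally` ensures the browser is closed even if an element cannot be located.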
Further, in step S5, the code starts the browser through the WebDriver component, runs the logic code, and stores the captured useful data in a database, Excel or Notepad.
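As one possible reading of the storage step, the captured text could be appended to a CSV file, which Excel and Notepad can both open. The function name `append_rows`, the single `text` column and the UTF-8 encoding are assumptions for illustration; the text does not specify a file layout.

```python
import csv
from pathlib import Path
from typing import Iterable


def append_rows(path: str, rows: Iterable[str]) -> int:
    """Append one captured text per line to a CSV file; return how many were written."""
    file = Path(path)
    is_new = not file.exists()
    written = 0
    with file.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["text"])   # header only when the file is created
        for row in rows:
            writer.writerow([row])
            written += 1
    return written
```

Opening the file in append mode lets the function be called once per repetition of steps S4 and S5 without overwriting earlier results.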
A data capture device, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to call the machine-readable program and execute the data capturing method.
A computer-readable medium having computer instructions stored thereon which, when executed by a processor, cause the processor to perform the data capturing method.
Compared with the prior art, the data capturing method, device and medium have the following outstanding beneficial effects:
the invention uses the Selenium automated testing tool to simulate, entirely through code, a person using a browser to visit the target site and operate it, so as to obtain the page as it appears after rendering. This avoids a series of complex communication processes, makes asynchronous requests easy to handle, and effectively improves the data capturing capability. Moreover, when an anti-crawler mechanism is encountered, it can be effectively circumvented through a series of operations, which greatly improves the data acquisition capability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flow chart diagram of a data capture method.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
As shown in Fig. 1, the data capturing method in this embodiment uses a Python environment and the Selenium automated testing tool to call a browser to initiate a webpage access request, simulate user operations, open a page, extract target data from the page, obtain the rendered result of the webpage, and acquire the data in the returned page, thereby capturing data in batches. In addition, the invention applies some additional processing that effectively counters anti-Selenium crawler-detection strategies.
The Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;
the Selenium IDE is a plug-in embedded in the Firefox browser; it is used for recording and playing back Selenium scripts in Firefox and for converting the recorded scripts into the various programming languages supported by Selenium WebDriver, which extends the recorded scripts to a wider range of browser types.
The Selenium WebDriver is a set of APIs for operating the browser; it supports multiple languages and many types of browsers, works across operating systems, and provides a complete third-party library for web automation testing in multiple languages.
The steps of capturing data with the Selenium automation tool in a Python environment are as follows:
S1, installing the Python development environment and Selenium.
S2, installing the WebDriver browser driver in the Python environment.
S3, some websites apply anti-crawler strategies and may block the crawler; the anti-crawler mechanism can be overcome by using a proxy IP and port, hiding Selenium configuration items, controlling a browser opened in advance, and the like.
S4, according to the specific webpage structure (HTML code), using the element-locating features of WebDriver to simulate a real user browsing the webpage, pausing for a few seconds between two button clicks, and writing the data capture logic code:
(1) importing webdriver from the Selenium package in order to use the methods of Selenium WebDriver;
(2) selecting a browser driver instance, and calling Selenium commands to interact with the browser through the interface provided by the Selenium package;
(3) setting an implicit wait of 30 s to define the timeout of each Selenium execution step;
(4) calling the driver.get() method to access the application; after the method is called, WebDriver waits until the page has finished loading and then continues executing the script;
(5) Selenium WebDriver provides a number of methods to locate and manipulate elements, such as setting a value, clicking a button, or selecting an option in a drop-down component;
(6) entering a new specific value through the send_keys() method, and calling submit() to submit the search request;
(7) loading the search result page, reading the content of the result list and printing it; all div tags whose class attribute satisfies class='c-abstract' are acquired through find_elements_by_xpath, which returns a list of one or more elements;
(8) finally, printing the text content of the acquired tags; at the end of the script, the browser can be closed using driver.quit().
S5, using the code to start the browser through the WebDriver component, running the logic code, and storing the captured useful data in a database or in document storage tools such as Excel or Notepad.
S6, repeating steps S4 and S5 until the target data acquisition is finished.
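The three measures named in step S3 (a proxy IP and port, hiding Selenium configuration items, and controlling a browser opened in advance) could be expressed with Chrome options as in the sketch below. The choice of Chrome, the specific flags and the helper names `plan_chrome_settings` and `make_driver` are assumptions; the text does not name any setting, and the sketch does not claim that these settings defeat any particular site's detection.

```python
from typing import Dict, List, Optional, Tuple


def plan_chrome_settings(
    proxy: Optional[str] = None,
    debugger_address: Optional[str] = None,
) -> Tuple[List[str], Dict[str, object]]:
    """Return (command-line arguments, experimental options) for Chrome."""
    if debugger_address:
        # Attach to a browser that was opened in advance, e.g. one started
        # with --remote-debugging-port; launch flags do not apply to it.
        return [], {"debuggerAddress": debugger_address}
    arguments = ["--disable-blink-features=AutomationControlled"]
    if proxy:
        arguments.append(f"--proxy-server={proxy}")   # proxy IP and port
    experimental = {"excludeSwitches": ["enable-automation"]}
    return arguments, experimental


def make_driver(proxy: Optional[str] = None, debugger_address: Optional[str] = None):
    """Create a Chrome WebDriver from the plan (requires the selenium package)."""
    from selenium import webdriver

    arguments, experimental = plan_chrome_settings(proxy, debugger_address)
    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    for name, value in experimental.items():
        options.add_experimental_option(name, value)
    return webdriver.Chrome(options=options)
```

For example, `make_driver(proxy="10.0.0.1:8080")` would launch a new browser through a proxy, while `make_driver(debugger_address="127.0.0.1:9222")` would attach to a browser already listening on that port; both addresses are placeholders.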
A data capture device, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to call the machine-readable program and execute the data capturing method.
A computer-readable medium having computer instructions stored thereon which, when executed by a processor, cause the processor to perform the data capturing method.
The above embodiments are only specific cases of the present invention, and the protection scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of a data capture method, device and medium of the present invention and are made by those skilled in the art should fall within the protection scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A data capturing method, characterized in that a Python environment and the Selenium automated testing tool are used to call a browser to initiate a webpage access request, simulate user operations, open a page, extract target data from the page, obtain the rendered result of the webpage, and acquire the data returned to the page.
2. The data capturing method of claim 1, wherein the Selenium automated testing tool comprises Selenium IDE and Selenium WebDriver;
the Selenium IDE is a plug-in embedded in the Firefox browser, and is used for recording and playing back Selenium scripts in Firefox and for converting the recorded scripts into the various programming languages supported by Selenium WebDriver.
3. The data capturing method as claimed in claim 2, wherein the Selenium WebDriver is a set of APIs for operating a browser, supports many types of browsers, works across operating systems, and provides a complete third-party library for web automation testing in multiple languages.
4. The method according to claim 3, wherein the steps of capturing data with the Selenium automation tool in a Python environment are:
S1, installing the Python development environment and Selenium;
S2, installing the WebDriver browser driver in the Python environment;
S3, overcoming the anti-crawler mechanism by using a proxy IP and port, hiding Selenium configuration items, or controlling a browser opened in advance;
S4, simulating a real user browsing a webpage;
S5, capturing useful data and storing the useful data in a document storage tool;
S6, repeating step S4 and step S5 until the target data acquisition is completed.
5. The data capturing method according to claim 4, wherein in step S4, according to the specific web page structure, the element-locating features of WebDriver are used to simulate a real user browsing the web page, with a pause of a few seconds between two button clicks, and the data capture logic code is written.
6. The data capturing method as claimed in claim 5, wherein step S4 further comprises:
S4-1, importing webdriver from the Selenium package in order to use the methods of Selenium WebDriver;
S4-2, calling Selenium commands to interact with the browser through the interface provided by the Selenium package;
S4-3, setting an implicit wait of 20-40 seconds to define the timeout of each Selenium execution step;
S4-4, calling the driver.get() method to access the application; after the method is called, WebDriver waits until the page has finished loading and then continues executing the script;
S4-5, using Selenium WebDriver to locate and operate elements;
S4-6, entering a new specific value through the send_keys() method, and calling submit() to submit the search request;
S4-7, loading the search result page, reading the content of the result list and printing it; all div tags whose class attribute satisfies class='c-abstract' are acquired through find_elements_by_xpath, which returns a list of one or more elements;
S4-8, finally printing the text content of the acquired tags.
7. The method of claim 6, wherein in step S4-7, at the end of the script, the browser is closed using driver.quit().
8. The data capturing method as claimed in claim 7, wherein in step S5, the WebDriver component is used to start the browser and run the logic code, and the captured useful data is stored in a database, Excel or Notepad.
9. A data capture device, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
the at least one processor is configured to invoke the machine-readable program to perform the method of any of claims 1 to 8.
10. A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210117885.8A CN114443929A (en) | 2022-02-08 | 2022-02-08 | Data capturing method, device and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210117885.8A CN114443929A (en) | 2022-02-08 | 2022-02-08 | Data capturing method, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114443929A true CN114443929A (en) | 2022-05-06 |
Family
ID=81370741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210117885.8A Pending CN114443929A (en) | 2022-02-08 | 2022-02-08 | Data capturing method, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114443929A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116719986A (en) * | 2023-08-10 | 2023-09-08 | 深圳传趣网络技术有限公司 | Python-based data grabbing method, device, equipment and storage medium |
CN116719986B (en) * | 2023-08-10 | 2023-12-26 | 深圳传趣网络技术有限公司 | Python-based data grabbing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||