CN110737813B - Method, equipment and medium for improving efficiency of crawler - Google Patents

Info

Publication number
CN110737813B
Authority
CN
China
Prior art keywords
response information
filtering
speed
analyzing
filtering rule
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910919214.1A
Other languages
Chinese (zh)
Other versions
CN110737813A (en)
Inventor
马玉斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-09-26
Filing date
2019-09-26
Publication date
2022-07-29
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201910919214.1A
Publication of CN110737813A
Application granted
Publication of CN110737813B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for improving crawler efficiency, which comprises the following steps: acquiring response information returned after the crawler sends a request to the website; analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the acquired response information; downloading the filtered response information to a database; and analyzing and filtering the response information in the database according to a second filtering rule. The invention also discloses a computer device and a readable storage medium. In the method, device and medium for improving crawler efficiency, a storage device is used to split the crawler's analysis into two passes, trading storage space for time and improving the working efficiency of the crawler.

Description

Method, equipment and medium for improving efficiency of crawler
Technical Field
The present invention relates to the field of data processing, and more particularly, to a method, an apparatus, and a readable medium for improving crawler efficiency.
Background
A web crawler is a program or script that automatically captures web information according to certain rules. To improve its working efficiency, a web crawler can adopt a crawling strategy. A parallel crawler runs a plurality of processes in parallel and can effectively improve working efficiency; its goal is to maximize the download speed while minimizing the overhead of parallelism and of downloading duplicate pages. A crawler typically extracts the designated information by regular-expression matching, that is, by constructing a regular expression and filtering character strings with it. At present, crawler efficiency is mostly improved by optimizing regular expressions, by multithreaded or parallel operation, by distributed crawlers, and the like. However, when the speed at which the crawler downloads information is much higher than the speed at which it parses the downloaded content with regular expressions, these methods cannot effectively improve the efficiency of the crawler.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, a device, and a medium for improving crawler efficiency, in which a storage device is used to decouple network downloading from content parsing and content parsing is divided into multiple passes, so that running time is effectively saved and the efficiency of crawler operation is effectively improved when the download speed is much higher than the parsing speed.
Based on the above purpose, an aspect of the embodiments of the present invention provides a method for improving efficiency of a crawler, including the following steps: acquiring response information returned after the crawler sends a request to the website; analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the response information; downloading the filtered response information to a database; and analyzing and filtering the response information in the database according to a second filtering rule.
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: dynamically adjusting the first filtering rule according to the change of the downloading speed.
In some embodiments, the dynamically adjusting the first filtering rule according to the change in the download speed further comprises: judging whether the difference value between the downloading speed and the analyzing speed exceeds a threshold value.
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: sending a request to a corresponding website according to the new URL address included in the analyzed content.
In some embodiments, the method further comprises: storing the response information filtered according to the second filtering rule in a second database.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: acquiring response information returned after the crawler sends a request to the website; analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the response information; downloading the filtered response information to a database; and analyzing and filtering the response information in the database according to a second filtering rule.
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: dynamically adjusting the first filtering rule according to the change of the downloading speed.
In some embodiments, the dynamically adjusting the first filtering rule according to the change in the download speed further comprises: judging whether the difference value between the downloading speed and the analyzing speed exceeds a threshold value.
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: sending a request to a corresponding website according to the new URL address included in the analyzed content.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that implements the above method steps when executed by a processor.
The invention has the following beneficial technical effects: network downloading and content parsing are decoupled by using the storage device, and content parsing is divided into a plurality of passes, so that running time is effectively saved and the efficiency of crawler operation is effectively improved when the download speed is much higher than the parsing speed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a flow chart of a prior art crawler operation;
FIG. 2 is a schematic diagram of an embodiment of a method of increasing crawler efficiency provided by the present invention;
FIG. 3 is a flowchart of an embodiment of a method for improving crawler efficiency according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.
FIG. 1 is a flow chart of a prior art crawler run. As shown in FIG. 1, the basic process of crawler operation is to send a request to a website according to a URL address, obtain the response information returned by the website, and parse the obtained response information. On the one hand, new URL addresses are parsed out and requests are sent to the corresponding websites according to them; on the other hand, data meeting the rule is filtered out by the parsing and then downloaded to a database.
The working efficiency of the crawler mainly depends on the two parts that acquire and parse the response information. Acquisition of the response information is mainly limited by the network environment, while parsing of the response information is mainly affected by the matching expression and the size of the memory. If the download speed of the acquired response information is far higher than the speed at which the response information is parsed, the final efficiency of the crawler is bounded by the parsing speed, so that a good network environment is wasted and the efficiency of the crawler is greatly reduced. Therefore, the embodiment of the invention decouples network downloading and content parsing by using a storage device and divides content parsing into a plurality of passes, which effectively saves running time and can effectively improve the efficiency of crawler operation when the download speed is much higher than the parsing speed.
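The bottleneck argument above can be illustrated with a small calculation. This is only a sketch: the function names, the page counts, and the rates (in pages per second) are assumptions chosen for illustration and do not come from the patent.

```python
def coupled_time(pages, download_rate, parse_rate):
    """Seconds needed when each page is fully parsed before it is stored.

    With download and parsing coupled, the slower stage sets the pace.
    """
    return pages / min(download_rate, parse_rate)


def online_time_with_light_filter(pages, download_rate, light_parse_rate):
    """Seconds the network stage is busy when only a cheap first-pass filter runs inline."""
    return pages / min(download_rate, light_parse_rate)


if __name__ == "__main__":
    # Hypothetical numbers: the full parse is 10x slower than the network.
    print(coupled_time(1000, download_rate=100, parse_rate=10))                   # 100.0
    print(online_time_with_light_filter(1000, download_rate=100, light_parse_rate=80))  # 12.5
```

The second pass still costs time, but it runs on locally stored, already reduced data and no longer holds back the download stage.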
In view of the above objects, a first aspect of embodiments of the present invention proposes an embodiment of a method for improving efficiency of a crawler. Fig. 2 is a schematic diagram illustrating an embodiment of the method for improving efficiency of a crawler according to the present invention. As shown in fig. 2, the embodiment of the present invention includes the following steps:
S1, acquiring response information returned after the crawler sends a request to the website;
S2, analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the response information;
S3, downloading the filtered response information to a database; and
S4, analyzing and filtering the response information in the database according to a second filtering rule.
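The four steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `run_pipeline`, the `raw` table, and both regular expressions are hypothetical, the `fetch` callable is injected so that no real network access is needed, and an in-memory SQLite database stands in for the storage device.

```python
import re
import sqlite3

# Hypothetical rules: a cheap first pass and a stricter second pass.
FIRST_RULE = re.compile(r"<script\b.*?</script>|<style\b.*?</style>", re.S | re.I)
SECOND_RULE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def run_pipeline(urls, fetch, db):
    """Run steps S1-S4 over `urls`; `fetch` maps a URL to its response text."""
    db.execute("CREATE TABLE IF NOT EXISTS raw (url TEXT, body TEXT)")
    for url in urls:
        body = fetch(url)                 # S1: acquire the response
        light = FIRST_RULE.sub("", body)  # S2: cheap first-pass filter
        db.execute("INSERT INTO raw VALUES (?, ?)", (url, light))  # S3: store
    results = []
    for _url, body in db.execute("SELECT url, body FROM raw"):
        results.extend(SECOND_RULE.findall(body))  # S4: second-pass filter
    return results


if __name__ == "__main__":
    pages = {"u1": '<html><script>var x="bob@evil.js";</script>'
                   "<p>Contact: alice@example.com</p></html>"}
    print(run_pipeline(["u1"], pages.get, sqlite3.connect(":memory:")))
```

In this sketch S4 runs after all downloads finish; in practice the second pass could equally run concurrently or later, since it reads only from local storage.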
In step S1, response information returned after the crawler sends a request to the website is acquired. A URL (Uniform Resource Locator) is a compact representation of the location and access method of a resource available on the internet, and is the standard address of a resource on the internet. Each file on the internet has a unique URL, which contains information indicating the location of the file and how the browser should handle it. The user can access various resources on the internet through URLs. The crawler sends a request to the website according to the URL address and acquires the response information returned by the website.
In step S2, the response information is parsed and filtered according to a first filtering rule, so that the parsing speed can be matched with the download speed of the response information. This is the first content parsing of the embodiment of the present invention. The first content parsing operates on the response information acquired by the crawler after sending the web page request, namely the web page source code, and applies a simple and efficient filtering expression to perform a preliminary filtering of that source code. The filtering expression is set so that the parsing action is as efficient and as fast as possible. For example, assuming that the crawler acquires an HTML page and needs to extract the email information it contains, the filtering rule may be set to acquire the data between <html> and </html> and to filter out CSS, JavaScript, jQuery, and the like, as a preliminary filtering. The filtered data is stored in a storage device to await the second content parsing. The simpler the filtering expression, the faster the crawler parses the response information, and the more information remains after filtering. To keep the difference between the parsing speed and the download speed small, a simple and efficient filtering expression can therefore be set for the preliminary filtering of the web page source code.
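A first filtering rule of the kind described in the example might look like the sketch below. The names `first_pass`, `HTML_BODY`, and `NOISE` are hypothetical, and the regular expressions are deliberately simple; they are an assumption about what "simple and efficient" could mean here, not a complete HTML parser.

```python
import re

# Everything between the opening <html ...> tag and the last </html>.
HTML_BODY = re.compile(r"<html\b[^>]*>(.*)</html>", re.S | re.I)
# Whole <script>...</script> and <style>...</style> blocks.
NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I)


def first_pass(source):
    """Keep what lies between <html> and </html>, minus script and style blocks."""
    match = HTML_BODY.search(source)
    inner = match.group(1) if match else source
    return NOISE.sub("", inner)


if __name__ == "__main__":
    page = ('<!DOCTYPE html><html lang="en"><head><style>p{color:red}</style></head>'
            "<body><p>a@b.com</p><script>alert(1)</script></body></html>")
    print(first_pass(page))  # <head></head><body><p>a@b.com</p></body>
```

The output is still markup rather than the final data; extracting the wanted fields is left to the second content parsing.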
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: dynamically adjusting the first filtering rule according to the change of the downloading speed. The network may fluctuate from time to time, and the download speed may vary accordingly. To further adapt to the download speed, the first filtering rule may be dynamically adjusted according to changes in the download speed. For example, the first filtering rule may be simplified when the download speed increases and made more complex when the download speed decreases.
In some embodiments, the dynamically adjusting the first filtering rule according to the change in the download speed further comprises: judging whether the difference value between the downloading speed and the analyzing speed exceeds a threshold value. To better adjust the parsing speed, a threshold may be set, and when the difference between the download speed and the parsing speed exceeds the threshold, the first filtering rule is adjusted accordingly.
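The threshold check can be sketched as a small function that selects among rule "levels". This is an assumption-laden illustration: the patent does not specify how rules are represented, so the integer levels (0 being the simplest, fastest rule), the two-sided comparison, and the name `adjust_rule_level` are all hypothetical.

```python
def adjust_rule_level(level, download_speed, parse_speed, threshold, max_level):
    """Return the new rule level; 0 is the simplest (fastest) rule.

    If downloading outpaces parsing by more than `threshold`, simplify the rule.
    If parsing outpaces downloading by more than `threshold`, make it stricter.
    Otherwise keep the current rule.
    """
    gap = download_speed - parse_speed
    if gap > threshold:
        return max(level - 1, 0)
    if -gap > threshold:
        return min(level + 1, max_level)
    return level


if __name__ == "__main__":
    print(adjust_rule_level(2, download_speed=100, parse_speed=50, threshold=20, max_level=3))  # 1
    print(adjust_rule_level(2, download_speed=50, parse_speed=100, threshold=20, max_level=3))  # 3
```

Using a threshold rather than reacting to every change avoids switching rules on small, transient fluctuations of the network.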
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: sending a request to a corresponding website according to the new URL address included in the analyzed content.
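Collecting the new URL addresses from the parsed content could be sketched as below. The helper name `new_urls`, the `href` regular expression, and the `seen` set used to avoid requesting duplicate pages are illustrative assumptions; `urljoin` is the standard-library function for resolving relative links.

```python
import re
from urllib.parse import urljoin

HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.I)


def new_urls(base_url, body, seen):
    """Return unseen absolute http(s) URLs found in `body`, recording them in `seen`."""
    found = []
    for href in HREF.findall(body):
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            found.append(url)
    return found


if __name__ == "__main__":
    seen = {"https://example.com/a"}
    body = '<a href="/a">A</a><a href="b.html">B</a><a href="mailto:x@y.z">M</a>'
    print(new_urls("https://example.com/dir/page.html", body, seen))
    # ['https://example.com/dir/b.html']
```

The returned URLs would then be fed back to the request stage, as in the loop of FIG. 1.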
In step S4, the response information in the database is parsed and filtered according to a second filtering rule; this is the second content parsing of the embodiment of the invention. For the second content parsing, a parsing rule that meets the actual requirement is set according to the specific conditions to complete the content screening. Because the data for the second parsing is loaded from the local storage device, the second parsing is not affected by network fluctuation. Moreover, because the data parsed the second time has already been filtered by the first parsing, the amount of code to be parsed is reduced.
In some embodiments, the method further comprises: storing the response information filtered according to the second filtering rule in a second database. For convenience of description, the database in step S3 is referred to as a first database; the second database may be independent of the first database or may be a part of the first database, and may be adjusted according to the circumstances.
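The second content parsing and the second database can be sketched with SQLite. Everything named here is an assumption for illustration: the first database is modeled as a table `raw(url, body)`, the second database as a table `results` in the same file (the "part of the first database" variant), and the second filtering rule as a simple email regular expression.

```python
import re
import sqlite3

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # hypothetical second rule


def second_pass(db):
    """Parse stored first-pass rows and save the matches in a second table."""
    db.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, email TEXT)")
    rows = db.execute("SELECT url, body FROM raw").fetchall()
    for url, body in rows:
        for email in EMAIL.findall(body):
            db.execute("INSERT INTO results VALUES (?, ?)", (url, email))
    db.commit()
    return db.execute("SELECT url, email FROM results ORDER BY email").fetchall()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw (url TEXT, body TEXT)")
    db.execute("INSERT INTO raw VALUES (?, ?)",
               ("u1", "<p>bob@example.org, alice@example.com</p>"))
    print(second_pass(db))
    # [('u1', 'alice@example.com'), ('u1', 'bob@example.org')]
```

Because this pass reads only from local storage, it can be scheduled independently of the download stage.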
FIG. 3 is a flow chart illustrating an embodiment of a method for improving crawler efficiency provided by the present invention. As shown in FIG. 3, the flow begins at block 101 and proceeds to block 102, where the crawler sends a request to a website; it then proceeds to block 103, where the returned response information is obtained; then to block 104, where the response information is parsed and filtered according to a first filtering rule, so that the parsing speed can be matched with the download speed of the response information; then to block 105, where the filtered response information is downloaded to the database; then to block 106, where the response information in the database is parsed and filtered according to the second filtering rule; and finally to block 107, where the flow ends.
It should be particularly noted that, the steps in the embodiments of the above-mentioned method for improving crawler efficiency can be mutually intersected, replaced, added, or deleted, so that these reasonable permutations and combinations are also included in the scope of the present invention, and the scope of the present invention should not be limited to the embodiments.
In view of the above object, a second aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: S1, acquiring response information returned after the crawler sends a request to the website; S2, analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the response information; S3, downloading the filtered response information to a database; and S4, analyzing and filtering the response information in the database according to a second filtering rule.
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: dynamically adjusting the first filtering rule according to the change of the downloading speed.
In some embodiments, the dynamically adjusting the first filtering rule according to the change in the download speed further comprises: judging whether the difference value between the downloading speed and the analyzing speed exceeds a threshold value.
In some embodiments, the parsing and filtering the response information according to the first filtering rule further comprises: sending a request to a corresponding website according to the new URL address included in the analyzed content.
In some embodiments, the steps further comprise: storing the response information filtered according to the second filtering rule in a second database.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the method for improving crawler efficiency can be stored in a computer readable storage medium, and when executed, the program can include the processes of the embodiments of the methods as described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When executed by a processor, the computer program performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant only to be exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (6)

1. A method of increasing crawler efficiency, comprising the steps of:
acquiring response information returned after the crawler sends a request to the website;
analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the response information;
downloading the filtered response information to a database; and
analyzing and filtering the response information in the database according to a second filtering rule;
wherein the analyzing and filtering the response information according to the first filtering rule further comprises:
dynamically adjusting the first filtering rule according to the change of the downloading speed;
the dynamically adjusting the first filtering rule according to the change of the download speed further comprises:
and judging whether the difference value between the downloading speed and the analyzing speed exceeds a threshold value.
2. The method of claim 1, wherein parsing and filtering the response information according to a first filtering rule further comprises:
and sending a request to a corresponding website according to the new URL address included in the analyzed content.
3. The method of claim 1, further comprising:
and storing the response information filtered according to the second filtering rule in a second database.
4. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of:
acquiring response information returned after the crawler sends a request to the website;
analyzing and filtering the response information according to a first filtering rule so that the analyzing speed can be matched with the downloading speed of the response information;
downloading the filtered response information to a database; and
analyzing and filtering the response information in the database according to a second filtering rule;
wherein the parsing and filtering the response information according to the first filtering rule further comprises:
dynamically adjusting the first filtering rule according to the change of the downloading speed;
the dynamically adjusting the first filtering rule according to the change of the download speed further comprises:
and judging whether the difference value between the downloading speed and the analyzing speed exceeds a threshold value.
5. The computer device of claim 4, wherein the parsing and filtering of the response information according to the first filtering rule further comprises:
sending a request to the corresponding website according to a new URL address included in the parsed content.
6. A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
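
For illustration only, the threshold check and two-stage filtering recited in the claims might be sketched as follows. This is a minimal sketch, not the patented implementation: the claims state only that the difference between the download speed and the parsing speed is compared against a threshold. The names `FirstFilterRule`, `fields_kept`, `adjust_first_rule`, `first_pass` and `second_pass` are hypothetical, and so is the direction of the adjustment (coarsening the first rule when parsing lags behind downloading, refining it when parsing has headroom).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FirstFilterRule:
    # Hypothetical knob: number of fields kept per response in the first pass.
    fields_kept: int = 8
    min_fields: int = 1
    max_fields: int = 16


def adjust_first_rule(rule, download_speed, parse_speed, threshold):
    """Return a rule adjusted when the speed gap exceeds the threshold.

    Both speeds use the same unit (e.g. responses per second). The
    direction of the adjustment is an assumption, not taken from the claims.
    """
    gap = download_speed - parse_speed
    if gap > threshold:
        # Parsing lags behind downloading: do less work per response.
        return replace(rule, fields_kept=max(rule.min_fields, rule.fields_kept // 2))
    if gap < -threshold:
        # Parsing has headroom: do more work per response.
        return replace(rule, fields_kept=min(rule.max_fields, rule.fields_kept * 2))
    return rule  # Gap within the threshold: keep the rule unchanged.


def first_pass(response, rule):
    """Coarse filter: keep only the first `fields_kept` fields of a response."""
    return dict(list(response.items())[: rule.fields_kept])


def second_pass(record):
    """Finer filter applied to stored records: drop empty values."""
    return {k: v for k, v in record.items() if v not in ("", None)}
```

In this sketch, a crawler loop would call `adjust_first_rule` periodically with measured speeds, apply `first_pass` before writing a response to the database, and run `second_pass` later over the stored records, so that the expensive filtering does not sit on the download path.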
CN201910919214.1A 2019-09-26 2019-09-26 Method, equipment and medium for improving efficiency of reptiles Active CN110737813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910919214.1A CN110737813B (en) 2019-09-26 2019-09-26 Method, equipment and medium for improving efficiency of reptiles

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910919214.1A CN110737813B (en) 2019-09-26 2019-09-26 Method, equipment and medium for improving efficiency of reptiles

Publications (2)

Publication Number Publication Date
CN110737813A (en) 2020-01-31
CN110737813B (en) 2022-07-29

Family

ID=69269618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910919214.1A Active CN110737813B (en) 2019-09-26 2019-09-26 Method, equipment and medium for improving efficiency of reptiles

Country Status (1)

Country Link
CN (1) CN110737813B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536075B (en) * 2021-07-20 2024-06-04 锐掣(杭州)科技有限公司 Data extraction method, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045838A (en) * 2015-07-01 2015-11-11 华东师范大学 Network crawler system based on distributed storage system
CN107257390A (en) * 2017-05-27 2017-10-17 北京思特奇信息技术股份有限公司 A kind of parsing method and system of URL addresses
CN107895009A (en) * 2017-11-10 2018-04-10 北京国信宏数科技有限责任公司 One kind is based on distributed internet data acquisition method and system
CN110134853A (en) * 2019-05-13 2019-08-16 重庆八戒传媒有限公司 Data crawling method and system

Also Published As

Publication number Publication date
CN110737813A (en) 2020-01-31

Similar Documents

Publication Publication Date Title
US10108595B2 (en) Method and system for automated analysis and transformation of web pages
CN108206802B (en) Method and device for detecting webpage backdoor
US8660976B2 (en) Web content rewriting, including responses
US20080168345A1 (en) Automatically collecting and compressing style attributes within a web document
CN107257390B (en) URL address resolution method and system
CN104766014A (en) Method and system used for detecting malicious website
CN109450844B (en) Method and device for triggering vulnerability detection
CN112800309A (en) Crawler system based on HTTP proxy and implementation method thereof
CN112637361A (en) Page proxy method, device, electronic equipment and storage medium
CN110737813B (en) Method, equipment and medium for improving efficiency of reptiles
CN105022824A (en) Method and device for recognizing invalid link
CN112860265A (en) Method and device for detecting operation abnormity of source code database
CN105912573B (en) Data updating method and device
CN104954363A (en) Method and device for generating interface document
CN102681996B (en) Pre-head method and device
CN105468776A (en) Method, device and system for operating database
CN111723369A (en) File management method, equipment and medium of FTP server
CN107562790B (en) Method and system for realizing batch warehousing of data processing
US11657161B2 (en) Correlation between source code repositories and web endpoints
CN115174133A (en) Application program interface API identification method and device
CN111443906B (en) Application access method and device
CN114020610A (en) Interface analysis method and device based on graph mining and related equipment
CN109635175B (en) Page data splicing method and device, readable storage medium and electronic equipment
CN106570044B (en) Method and device for analyzing webpage codes
CN112130860A (en) JSON object analysis method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant