CN104933138A

CN104933138A - Webpage crawler system and webpage crawling method

Info

Publication number: CN104933138A
Application number: CN201510334805.4A
Authority: CN
Inventors: 吴鹏越; 吴凌峰; 华浩锋
Original assignee: Ctrip Computer Technology Shanghai Co Ltd
Current assignee: Ctrip Computer Technology Shanghai Co Ltd
Priority date: 2015-06-16
Filing date: 2015-06-16
Publication date: 2015-09-23

Abstract

The invention discloses a webpage crawler system and a webpage crawling method. The webpage crawler system comprises: a page opening module for automatically invoking a browser to open a target page; an area crawling module for automatically taking a screenshot of a designated area in the target page and returning the screenshot to an OCR server; and the OCR server, for performing image recognition on the screenshot according to the designated area and a sample character library and outputting the recognition result according to a preset configuration format. The webpage crawler system and the webpage crawling method of the invention can break through all front-end anti-crawling restrictions: as long as the page can be opened and the crawler's IP address is not blocked, the information can be recognized and captured, which improves the availability of the crawler system.

Description

Webpage crawler system and webpage crawling method

Technical field

The present invention relates to a webpage crawler system and a webpage crawling method.

Background art

Crawler technology currently faces unprecedented challenges. As anti-crawling technology develops, data capture is becoming increasingly difficult, and it is foreseeable that traditional means will eventually no longer succeed in capturing valuable data. The prior art therefore urgently needs a new crawler technology for crawling web data.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the defect in the prior art that the development of anti-crawling technology makes data capture increasingly difficult, by providing a webpage crawler system and a webpage crawling method.

The present invention solves the above technical problem through the following technical solutions:

The invention provides a webpage crawler system, characterized in that it comprises:

a page opening module, for automatically invoking a browser to open a target page;

an area crawling module, for automatically taking a screenshot of a designated area in the target page and returning the screenshot to an OCR (optical character recognition) server;

the OCR server, for performing image recognition on the screenshot according to the designated area and a sample character library, and outputting the recognition result according to a preset configuration format.

Preferably, the area crawling module is further used for compressing the screenshot and returning the compressed screenshot to the OCR server.

Preferably, the configuration format is a customizable configuration format.

Preferably, the page opening module is used for opening the target page based on a task issued by a scheduling system.

The invention further provides a webpage crawling method, characterized in that it is implemented with the above webpage crawler system and comprises the following steps:

S₁: the page opening module automatically invokes a browser to open a target page;

S₂: the area crawling module automatically takes a screenshot of a designated area in the target page and returns the screenshot to the OCR server;

S₃: the OCR server performs image recognition on the screenshot according to the designated area and a sample character library, and outputs the recognition result according to a preset configuration format.

Preferably, in step S₂ the area crawling module also compresses the screenshot and returns the compressed screenshot to the OCR server.

Preferably, in step S₁ the page opening module opens the target page based on a task issued by a scheduling system.

The positive effect of the present invention is that it can break through all front-end anti-crawling restrictions of a website: as long as the page can be opened and the crawler's IP (Internet Protocol) address is not blocked, information can be recognized and captured, which improves the availability of the crawler system.

Brief description of the drawings

Fig. 1 is a module diagram of the webpage crawler system of a preferred embodiment of the present invention.

Fig. 2 is a flowchart of the webpage crawling method of a preferred embodiment of the present invention.

Detailed description of the embodiments

The present invention is further illustrated below by way of an embodiment, but the invention is not thereby limited to the scope of the described embodiment.

As shown in Fig. 1, the webpage crawler system of the present invention comprises a page opening module 1, an area crawling module 2 and an OCR server 3. The page opening module 1 automatically invokes a browser to open a target page: based on a task issued by a scheduling system, it opens the target page directly in an ordinary browser.
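The task-driven behavior of the page opening module can be sketched as follows. This is only an illustration under stated assumptions: the `CrawlTask` record, the `PageOpeningModule` class and the injected `open_page` callable are hypothetical names, and the patent does not specify which browser automation mechanism is used.

```python
from dataclasses import dataclass
from queue import Empty, Queue
from typing import Callable, List


@dataclass(frozen=True)
class CrawlTask:
    """Hypothetical task record issued by the scheduling system."""
    task_id: str
    url: str


class PageOpeningModule:
    """Pulls scheduled tasks and opens each target page in a browser.

    `open_page` is an assumption: any callable that drives a real,
    ordinary browser and returns some handle to the opened page.
    """

    def __init__(self, open_page: Callable[[str], object]) -> None:
        self._open_page = open_page

    def run(self, tasks: "Queue[CrawlTask]") -> List[object]:
        pages = []
        while True:
            try:
                task = tasks.get_nowait()
            except Empty:
                break  # the scheduler has issued no further tasks
            pages.append(self._open_page(task.url))
        return pages
```

Injecting `open_page` keeps the sketch independent of any particular browser driver; in practice it would wrap whichever automation tool launches the browser.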

The area crawling module 2 automatically takes a screenshot of a designated area in the target page and returns the screenshot to the OCR server 3. Because the crawler machine has limited capacity and OCR consumes a large amount of CPU (central processing unit) resources, the more suitable approach is for the area crawling module 2 to compress the screenshot appropriately and return the compressed screenshot to the OCR server, so that recognition is processed centrally by the latter.
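A minimal sketch of cropping a designated area and compressing it before upload is given below. It assumes the page image is already available as a grid of 8-bit grayscale pixels, uses lossless `zlib` compression as a stand-in (the patent does not say which compression scheme is used), and all function names and the 4-byte size header are hypothetical.

```python
import zlib
from typing import List, Tuple

# Hypothetical region layout: (left, top, width, height) in pixels.
Region = Tuple[int, int, int, int]


def crop_region(pixels: List[List[int]], region: Region) -> List[List[int]]:
    """Keep only the designated area of a grayscale pixel grid."""
    left, top, width, height = region
    return [row[left:left + width] for row in pixels[top:top + height]]


def compress_screenshot(pixels: List[List[int]]) -> bytes:
    """Pack width and height into a 4-byte header, then deflate the pixels."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    header = width.to_bytes(2, "big") + height.to_bytes(2, "big")
    body = bytes(value for row in pixels for value in row)
    return header + zlib.compress(body, 9)


def decompress_screenshot(payload: bytes) -> List[List[int]]:
    """Inverse of compress_screenshot, as the OCR server would run it."""
    width = int.from_bytes(payload[0:2], "big")
    height = int.from_bytes(payload[2:4], "big")
    body = zlib.decompress(payload[4:])
    return [list(body[r * width:(r + 1) * width]) for r in range(height)]
```

Cropping before compressing means only the designated area crosses the network, which is what keeps the load on the crawler machine small.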

The OCR server 3 performs image recognition on the screenshot according to the designated area and a sample character library, and outputs the recognition result according to the configured format. The configuration format can be customized according to user requirements.
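Recognition against a sample character library with a configurable output format might look like the sketch below. The tiny 3x3 glyph library, the fixed-width segmentation, the Hamming-distance matching and the `json`/`csv` format names are all assumptions made for illustration; the patent does not describe the recognition algorithm itself.

```python
import json
from typing import Dict, Tuple

Glyph = Tuple[str, ...]  # rows of '#' (ink) and '.' (background)

# Hypothetical sample character library; a real one would hold glyph
# samples collected from the target site's fonts.
SAMPLE_LIBRARY: Dict[str, Glyph] = {
    "1": (".#.", ".#.", ".#."),
    "7": ("###", "..#", "..#"),
    "0": ("###", "#.#", "###"),
}


def distance(a: Glyph, b: Glyph) -> int:
    """Number of cells in which two equally sized glyphs differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def recognize(screenshot: Glyph, glyph_width: int = 3) -> str:
    """Split the screenshot into fixed-width cells and match each cell
    to the closest glyph in the sample library."""
    chars = []
    for start in range(0, len(screenshot[0]), glyph_width):
        cell = tuple(row[start:start + glyph_width] for row in screenshot)
        chars.append(min(SAMPLE_LIBRARY,
                         key=lambda ch: distance(SAMPLE_LIBRARY[ch], cell)))
    return "".join(chars)


def format_result(field: str, text: str, output_format: str) -> str:
    """Emit the recognition result in the user-configured format."""
    if output_format == "json":
        return json.dumps({field: text})
    if output_format == "csv":
        return f"{field},{text}"
    raise ValueError(f"unsupported output format: {output_format}")
```

Nearest-match lookup tolerates a few noisy pixels, since a slightly damaged glyph is still closer to its own sample than to any other entry in the library.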

In summary, the whole system works in a way that is no different from the browsing behavior of a real user: it captures information by simulating the human visual system. As a result, all front-end-based anti-crawling strategies of the target website fail completely and the crawler can capture data as needed, which keeps the system available most of the time. Even if the website's front-end UI (user interface) undergoes a large-scale redesign, the system of the present invention can adapt dynamically by adjusting the corresponding configuration file in time.
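One way to picture this adaptation through configuration is a per-site profile that holds the designated area and the output format, so that a redesign only requires editing the profile. The JSON layout, the field names and the `SiteProfile` and `load_profile` names below are hypothetical; the patent does not define the configuration file format.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SiteProfile:
    """Hypothetical per-site configuration for the crawler."""
    site: str
    region: Tuple[int, int, int, int]  # left, top, width, height
    output_format: str


def load_profile(text: str) -> SiteProfile:
    """Parse a JSON profile; the output format defaults to "json"."""
    raw = json.loads(text)
    region = tuple(raw["region"][key]
                   for key in ("left", "top", "width", "height"))
    return SiteProfile(raw["site"], region, raw.get("output_format", "json"))
```

After a front-end redesign, only the `region` values in the profile would need to change; the crawling and recognition code stays as it is.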

As shown in Fig. 2, the webpage crawling method implemented with the webpage crawler system of the present embodiment comprises the following steps:

Step 101: the page opening module automatically invokes a browser to open a target page;

Step 102: the area crawling module automatically takes a screenshot of a designated area in the target page and returns the screenshot to the OCR server;

Step 103: the OCR server performs image recognition on the screenshot according to the designated area and a sample character library, and outputs the recognition result according to a preset configuration format.

In step 101, the page opening module opens the target page based on a task issued by a scheduling system; in step 102, the area crawling module also compresses the screenshot and returns the compressed screenshot to the OCR server; and the configuration format can be customized according to user needs.
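The three steps above can be chained as in the following sketch. The function `crawl_once` and its parameters are hypothetical; each stage is passed in as a callable so that the flow can be shown without assuming any specific browser, screenshot or OCR implementation.

```python
from typing import Callable, Optional


def crawl_once(
    url: str,
    open_page: Callable[[str], object],
    capture_region: Callable[[object], bytes],
    recognize: Callable[[bytes], str],
    compress: Optional[Callable[[bytes], bytes]] = None,
) -> str:
    """Run one crawl: open the page, capture the area, recognize the text."""
    page = open_page(url)                  # step 101: open the target page
    screenshot = capture_region(page)      # step 102: screenshot the area
    if compress is not None:
        screenshot = compress(screenshot)  # optional compression before upload
    return recognize(screenshot)           # step 103: OCR and formatted output
```

When `compress` is supplied, `recognize` is expected to undo it on the OCR server side before recognition.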

Although specific embodiments of the present invention have been described above, those skilled in the art will understand that these are merely illustrative and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.

Claims

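1. A webpage crawler system, characterized in that it comprises:

a page opening module, for automatically invoking a browser to open a target page;

an area crawling module, for automatically taking a screenshot of a designated area in the target page and returning the screenshot to an OCR server;

the OCR server, for performing image recognition on the screenshot according to the designated area and a sample character library, and outputting the recognition result according to a preset configuration format.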

2. The webpage crawler system as claimed in claim 1, characterized in that the area crawling module is further used for compressing the screenshot and returning the compressed screenshot to the OCR server.

3. The webpage crawler system as claimed in claim 1, characterized in that the configuration format is a customizable configuration format.

4. The webpage crawler system as claimed in claim 1, characterized in that the page opening module is used for opening the target page based on a task issued by a scheduling system.

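5. A webpage crawling method, characterized in that it is implemented with the webpage crawler system as claimed in claim 1 and comprises the following steps:

S₁: the page opening module automatically invokes a browser to open a target page;

S₂: the area crawling module automatically takes a screenshot of a designated area in the target page and returns the screenshot to the OCR server;

S₃: the OCR server performs image recognition on the screenshot according to the designated area and a sample character library, and outputs the recognition result according to a preset configuration format.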

6. The webpage crawling method as claimed in claim 5, characterized in that in step S₂ the area crawling module also compresses the screenshot and returns the compressed screenshot to the OCR server.

7. The webpage crawling method as claimed in claim 5, characterized in that the configuration format is a customizable configuration format.

8. The webpage crawling method as claimed in claim 5, characterized in that in step S₁ the page opening module opens the target page based on a task issued by a scheduling system.