CN104933138A - Webpage crawler system and webpage crawling method - Google Patents

Info

Publication number
CN104933138A
Authority
CN
China
Prior art keywords
screenshot
module
webpage
target page
OCR server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510334805.4A
Other languages
Chinese (zh)
Inventor
吴鹏越
吴凌峰
华浩锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd
Priority to CN201510334805.4A
Publication of CN104933138A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a webpage crawler system and a webpage crawling method. The webpage crawler system comprises a page opening module for automatically invoking a browser to open a target page; a region crawling module for automatically taking a screenshot of a designated area in the target page and sending the screenshot to an OCR server; and the OCR server, for performing image recognition on the screenshot according to the designated area and a sample character library and outputting the recognition result according to a preset configuration format. According to the invention, the system and method can break through all front-end anti-crawling restrictions: as long as the page can be opened and the crawler's IP address is not blocked, the information can be recognized and captured, which improves the usability of the crawler system.

Description

Web crawler system and webpage crawling method
Technical field
The present invention relates to a web crawler system and a webpage crawling method.
Background
Crawler technology currently faces unprecedented challenges. As anti-crawling technology develops, data capture is becoming increasingly difficult, and it is foreseeable that traditional means will no longer succeed in capturing valuable data. The prior art therefore urgently needs a new crawler technology for crawling web data.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the defect in the prior art that the development of anti-crawling technology makes data capture increasingly difficult, by providing a web crawler system and a webpage crawling method.
The present invention solves the above technical problem through the following technical solutions:
The invention provides a web crawler system, characterized in that it comprises:
A page opening module, configured to automatically invoke a browser to open a target page;
A region crawling module, configured to automatically take a screenshot of a designated area in the target page and send the screenshot to an OCR (optical character recognition) server;
The OCR server, configured to perform image recognition on the screenshot according to the designated area and a sample character library, and to output the recognition result according to a preset configuration format.
Preferably, the region crawling module is further configured to compress the screenshot and send the compressed screenshot to the OCR server.
Preferably, the configuration format is customizable.
Preferably, the page opening module is configured to open the target page based on a task issued by a scheduling system.
Another object of the invention is to provide a webpage crawling method, characterized in that it is implemented using the above web crawler system and comprises the following steps:
S1. The page opening module automatically invokes a browser to open the target page;
S2. The region crawling module automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server;
S3. The OCR server performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result according to the preset configuration format.
Preferably, in step S2 the region crawling module also compresses the screenshot and sends the compressed screenshot to the OCR server.
Preferably, the configuration format is customizable.
Preferably, in step S1 the page opening module opens the target page based on a task issued by a scheduling system.
The positive effect of the present invention is that it can break through all front-end anti-crawling restrictions of a website: as long as the page can be opened and the IP (Internet Protocol) address is not blocked, information can be recognized and captured, which improves the availability of the crawler system.
Brief description of the drawings
Fig. 1 is a module diagram of the web crawler system of a preferred embodiment of the present invention.
Fig. 2 is a flowchart of the webpage crawling method of a preferred embodiment of the present invention.
Detailed description of the embodiment
The present invention is further illustrated below by way of an embodiment, but is not thereby limited to the scope of the described embodiment.
As shown in Fig. 1, the web crawler system of the present invention comprises a page opening module 1, a region crawling module 2, and an OCR server 3. The page opening module 1 automatically invokes a browser to open the target page: based on a task issued by the scheduling system, it opens the target page directly in a general-purpose browser.
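The hand-off from the scheduling system to the page opening module can be sketched as a simple task queue. This is only an illustration under stated assumptions: the patent does not describe the task format or how the browser is driven, so `CrawlTask`, `PageOpenModule`, and the injected `open_in_browser` callable are hypothetical names, and the example URLs are placeholders.

```python
import queue
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CrawlTask:
    """A unit of work issued by the scheduling system (hypothetical shape)."""
    task_id: int
    url: str


class PageOpenModule:
    """Takes tasks from a queue and opens each target page via an injected opener."""

    def __init__(self, open_in_browser: Callable[[str], None]) -> None:
        # open_in_browser stands in for whatever drives the real browser.
        self._open = open_in_browser

    def run(self, tasks: "queue.Queue[CrawlTask]") -> List[int]:
        """Drain the queue, open every page, and return the handled task ids."""
        handled: List[int] = []
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                break
            self._open(task.url)
            handled.append(task.task_id)
        return handled


# Usage: record the URLs instead of launching a browser.
opened: List[str] = []
tasks: "queue.Queue[CrawlTask]" = queue.Queue()
tasks.put(CrawlTask(1, "https://example.com/a"))
tasks.put(CrawlTask(2, "https://example.com/b"))
handled_ids = PageOpenModule(opened.append).run(tasks)
```

Injecting the opener keeps the sketch independent of any particular browser automation tool; in a real deployment it would be replaced by code that drives an actual browser.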
The region crawling module 2 automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server 3. Because the crawler machine has limited capacity and OCR consumes a large amount of CPU (central processing unit) resources, a more suitable approach is for the region crawling module 2 to compress the screenshot appropriately and send the compressed screenshot to the OCR server, so that recognition is handled centrally there.
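As a rough illustration of the crop-then-compress step, the sketch below cuts a designated area out of a grayscale page image and compresses it before transfer. The patent does not name a compression scheme, so lossless `zlib` compression is an assumption made here for simplicity (a real system might well use an image codec instead), and the `(left, top, width, height)` region layout and the 4-byte size header are likewise hypothetical.

```python
import zlib
from typing import List, Tuple

Image = List[List[int]]             # rows of 8-bit grayscale pixels (0-255)
Region = Tuple[int, int, int, int]  # (left, top, width, height), assumed layout


def crop(image: Image, region: Region) -> Image:
    """Return only the designated area of the page image."""
    left, top, width, height = region
    return [row[left:left + width] for row in image[top:top + height]]


def compress(image: Image) -> bytes:
    """Pack width and height into a 4-byte header, then zlib-compress the pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    header = width.to_bytes(2, "big") + height.to_bytes(2, "big")
    raw = bytes(pixel for row in image for pixel in row)
    return header + zlib.compress(raw, 9)


def decompress(payload: bytes) -> Image:
    """Inverse of compress(), as the OCR server side would run it."""
    width = int.from_bytes(payload[0:2], "big")
    height = int.from_bytes(payload[2:4], "big")
    raw = zlib.decompress(payload[4:])
    return [list(raw[y * width:(y + 1) * width]) for y in range(height)]


# Usage: a white 200x100 page with a dark block, cropped to a 120x40 region.
page = [[255] * 200 for _ in range(100)]
for y in range(40, 60):
    for x in range(50, 150):
        page[y][x] = 0
region = (40, 30, 120, 40)
shot = crop(page, region)
payload = compress(shot)
```

Cropping before compressing means the crawler machine only ships the pixels the OCR server actually needs, which is the division of labor the paragraph above describes.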
The OCR server 3 performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result according to a set configuration format. The configuration format can be customized according to user requirements.
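One simple way to picture recognition against a sample character library is nearest-template matching, followed by rendering the result in a configurable format. The patent does not disclose its recognition algorithm, so the sketch below is an assumed stand-in: the 3x5 glyph bitmaps, the fixed-width cell segmentation, the `price` field name, and the `json` and `kv` format names are all hypothetical.

```python
import json
from typing import Dict, List

Glyph = List[str]  # rows of '#' (ink) and '.' (background)

# Hypothetical sample character library: one 3x5 bitmap per known character.
SAMPLE_LIBRARY: Dict[str, Glyph] = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}
GLYPH_WIDTH = 3


def distance(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two glyph bitmaps."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def recognize(rows: List[str]) -> str:
    """Split the image into fixed-width cells and pick the closest sample for each."""
    step = GLYPH_WIDTH + 1  # one blank column separates neighbouring glyphs
    chars = []
    for start in range(0, len(rows[0]), step):
        cell = [row[start:start + GLYPH_WIDTH] for row in rows]
        chars.append(min(SAMPLE_LIBRARY, key=lambda ch: distance(cell, SAMPLE_LIBRARY[ch])))
    return "".join(chars)


def render(fields: Dict[str, str], fmt: str) -> str:
    """Output the recognition result in the configured format."""
    if fmt == "json":
        return json.dumps(fields, sort_keys=True)
    if fmt == "kv":
        return "\n".join(f"{key}={value}" for key, value in sorted(fields.items()))
    raise ValueError(f"unsupported format: {fmt}")


# Usage: a screenshot of the designated area showing the characters 7, 1, 0.
screenshot = [
    "###..#..###",
    "..#.##..#.#",
    ".#...#..#.#",
    ".#...#..#.#",
    ".#..###.###",
]
text = recognize(screenshot)
as_json = render({"price": text}, "json")
as_kv = render({"price": text}, "kv")
```

Because matching picks the nearest sample rather than requiring an exact match, a library tuned to the fonts used in the designated area can tolerate a few noisy pixels; production OCR engines use far more robust methods.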
In summary, the way the whole system works is indistinguishable from the browsing behavior of a real user: it captures information by simulating the human visual system, so that all front-end-based anti-crawling strategies of the target website fail completely. The crawler can therefore capture data as needed, which ensures the availability of the system most of the time. Even if the website's front-end UI (user interface) undergoes a large-scale redesign, the system of the present invention can adapt dynamically by promptly adjusting the corresponding configuration file.
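Adapting to a redesign by editing a configuration file, rather than code, might look like the following sketch. The patent does not specify the configuration file's contents, so the JSON layout, the `RegionProfile` fields, and the `price` profile name are assumptions for illustration only.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RegionProfile:
    """Where to take the screenshot and how to format the result (hypothetical fields)."""
    left: int
    top: int
    width: int
    height: int
    output_format: str


def load_profiles(text: str) -> Dict[str, RegionProfile]:
    """Parse a JSON document that maps a profile name to its region settings."""
    raw = json.loads(text)
    return {name: RegionProfile(**spec) for name, spec in raw.items()}


# Profile before the site redesign.
OLD_CONFIG = (
    '{"price": {"left": 40, "top": 30, "width": 120, "height": 40,'
    ' "output_format": "json"}}'
)
# After the redesign only the coordinates change; the crawler code is untouched.
NEW_CONFIG = (
    '{"price": {"left": 300, "top": 500, "width": 160, "height": 48,'
    ' "output_format": "json"}}'
)

before = load_profiles(OLD_CONFIG)["price"]
after = load_profiles(NEW_CONFIG)["price"]
```

Keeping the designated area and output format in data means a layout change on the target site is handled by shipping a new profile, which matches the dynamic adaptation described above.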
As shown in Fig. 2, the webpage crawling method implemented using the web crawler system of the present embodiment comprises the following steps:
Step 101: the page opening module automatically invokes a browser to open the target page;
Step 102: the region crawling module automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server;
Step 103: the OCR server performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result according to the preset configuration format.
Here, in step 101 (step S1 above) the page opening module opens the target page based on a task issued by the scheduling system; in step 102 (step S2 above) the region crawling module also compresses the screenshot and sends the compressed screenshot to the OCR server; and the configuration format can be customized according to user needs.
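The three steps can be wired together as a small pipeline. The sketch below is not the patented implementation: every component is passed in as a callable, and the stand-in functions (`fake_open`, `fake_capture`, `fake_recognize`), the region layout, and the example URL are hypothetical so that the flow runs without a browser or an OCR engine.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int, int, int]  # (left, top, width, height), an assumed layout


def crawl_page(
    url: str,
    region: Region,
    open_page: Callable[[str], object],
    capture_region: Callable[[object, Region], bytes],
    recognize: Callable[[bytes, Region], str],
    render: Callable[[str], str],
) -> str:
    """Run steps 101 to 103 once for a single target page."""
    page = open_page(url)                # step 101: open the target page
    shot = capture_region(page, region)  # step 102: screenshot the designated area
    text = recognize(shot, region)       # step 103: image recognition on the OCR server
    return render(text)                  # step 103: output in the configured format


# Stand-in components that only record the order in which they are called.
calls: List[str] = []


def fake_open(url: str) -> object:
    calls.append("open")
    return {"url": url}


def fake_capture(page: object, region: Region) -> bytes:
    calls.append("capture")
    return b"fake-image-bytes"


def fake_recognize(shot: bytes, region: Region) -> str:
    calls.append("recognize")
    return "710"


output = crawl_page(
    "https://example.com/item",
    (40, 30, 120, 40),
    fake_open,
    fake_capture,
    fake_recognize,
    lambda text: f"price={text}",
)
```

Passing the region to both the capture and recognition callables mirrors the text, where the OCR server recognizes the screenshot "according to the designated area".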
Although specific embodiments of the present invention have been described above, those skilled in the art will understand that these are merely illustrative, and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.

Claims (8)

1. A web crawler system, characterized in that it comprises:
A page opening module, configured to automatically invoke a browser to open a target page;
A region crawling module, configured to automatically take a screenshot of a designated area in the target page and send the screenshot to an OCR server;
The OCR server, configured to perform image recognition on the screenshot according to the designated area and a sample character library, and to output the recognition result according to a preset configuration format.
2. The web crawler system as claimed in claim 1, characterized in that the region crawling module is further configured to compress the screenshot and send the compressed screenshot to the OCR server.
3. The web crawler system as claimed in claim 1, characterized in that the configuration format is customizable.
4. The web crawler system as claimed in claim 1, characterized in that the page opening module is configured to open the target page based on a task issued by a scheduling system.
5. A webpage crawling method, characterized in that it is implemented using the web crawler system as claimed in claim 1 and comprises the following steps:
S1. The page opening module automatically invokes a browser to open the target page;
S2. The region crawling module automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server;
S3. The OCR server performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result according to the preset configuration format.
6. The webpage crawling method as claimed in claim 5, characterized in that in step S2 the region crawling module also compresses the screenshot and sends the compressed screenshot to the OCR server.
7. The webpage crawling method as claimed in claim 5, characterized in that the configuration format is customizable.
8. The webpage crawling method as claimed in claim 5, characterized in that in step S1 the page opening module opens the target page based on a task issued by a scheduling system.
CN201510334805.4A 2015-06-16 2015-06-16 Webpage crawler system and webpage crawling method Pending CN104933138A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510334805.4A CN104933138A (en) 2015-06-16 2015-06-16 Webpage crawler system and webpage crawling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510334805.4A CN104933138A (en) 2015-06-16 2015-06-16 Webpage crawler system and webpage crawling method

Publications (1)

Publication Number Publication Date
CN104933138A true CN104933138A (en) 2015-09-23

Family

ID=54120305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510334805.4A Pending CN104933138A (en) 2015-06-16 2015-06-16 Webpage crawler system and webpage crawling method

Country Status (1)

Country Link
CN (1) CN104933138A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512193A (en) * 2015-11-26 2016-04-20 上海携程商务有限公司 Data acquisition system and method based on browser expansion
CN106095918A (en) * 2016-06-06 2016-11-09 山东科技大学 A kind of acquisition methods of the protected exponent data of network based on OCR technique
CN106599001A (en) * 2015-10-20 2017-04-26 中国电信股份有限公司 Webpage content acquisition method and system
CN106909694A (en) * 2017-03-13 2017-06-30 杭州普玄科技有限公司 Tag along sort data capture method and device
CN107465682A (en) * 2017-08-10 2017-12-12 深圳市华傲数据技术有限公司 Reptile logs in the realization method and system of targeted website
CN108595583A (en) * 2018-04-18 2018-09-28 平安科技(深圳)有限公司 Dynamic chart class page data crawling method, device, terminal and storage medium
CN109582850A (en) * 2018-12-03 2019-04-05 金瓜子科技发展(北京)有限公司 A kind of method, apparatus of web page crawl, storage medium and electronic equipment
CN110069688A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 Page display method, server, storage medium and the device of anti-crawler
CN111125217A (en) * 2019-12-13 2020-05-08 天津润华科技有限公司 Editable visual image recognition type intelligent data acquisition system and application thereof
CN111310693A (en) * 2020-02-26 2020-06-19 腾讯科技(深圳)有限公司 Intelligent labeling method and device for text in image and storage medium
CN114547418A (en) * 2022-02-25 2022-05-27 哈尔滨工程大学 Fatigue simulation model-based anthropomorphic crawler method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120051645A1 (en) * 2010-08-30 2012-03-01 Alibaba Group Holding Limited Recognition of digital images
CN102855423A (en) * 2011-06-29 2013-01-02 盛乐信息技术(上海)有限公司 Tracking method and device of literary works
CN104598902A (en) * 2015-01-29 2015-05-06 百度在线网络技术(北京)有限公司 Method and device for identifying screenshot and browser

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599001A (en) * 2015-10-20 2017-04-26 中国电信股份有限公司 Webpage content acquisition method and system
CN105512193A (en) * 2015-11-26 2016-04-20 上海携程商务有限公司 Data acquisition system and method based on browser expansion
CN106095918B (en) * 2016-06-06 2020-03-06 山东科技大学 Network protected index data acquisition method based on OCR technology
CN106095918A (en) * 2016-06-06 2016-11-09 山东科技大学 A kind of acquisition methods of the protected exponent data of network based on OCR technique
CN106909694A (en) * 2017-03-13 2017-06-30 杭州普玄科技有限公司 Tag along sort data capture method and device
CN106909694B (en) * 2017-03-13 2020-01-17 杭州普玄科技有限公司 Classification tag data acquisition method and device
CN107465682A (en) * 2017-08-10 2017-12-12 深圳市华傲数据技术有限公司 Reptile logs in the realization method and system of targeted website
CN108595583A (en) * 2018-04-18 2018-09-28 平安科技(深圳)有限公司 Dynamic chart class page data crawling method, device, terminal and storage medium
CN109582850A (en) * 2018-12-03 2019-04-05 金瓜子科技发展(北京)有限公司 A kind of method, apparatus of web page crawl, storage medium and electronic equipment
CN109582850B (en) * 2018-12-03 2021-07-02 金瓜子科技发展(北京)有限公司 Webpage crawling method and device, storage medium and electronic equipment
CN110069688A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 Page display method, server, storage medium and the device of anti-crawler
CN111125217A (en) * 2019-12-13 2020-05-08 天津润华科技有限公司 Editable visual image recognition type intelligent data acquisition system and application thereof
CN111310693A (en) * 2020-02-26 2020-06-19 腾讯科技(深圳)有限公司 Intelligent labeling method and device for text in image and storage medium
CN111310693B (en) * 2020-02-26 2023-08-29 腾讯科技(深圳)有限公司 Intelligent labeling method, device and storage medium for text in image
CN114547418A (en) * 2022-02-25 2022-05-27 哈尔滨工程大学 Fatigue simulation model-based anthropomorphic crawler method

Similar Documents

Publication Publication Date Title
CN104933138A (en) Webpage crawler system and webpage crawling method
CN109639481B (en) Deep learning-based network traffic classification method and system and electronic equipment
CN104462152B (en) A kind of recognition methods of webpage and device
CN109410036A (en) A kind of fraud detection model training method and device and fraud detection method and device
CN109743311B (en) WebShell detection method, device and storage medium
CN106294325B (en) The optimization method and device of spatial term sentence
CN105260662A (en) Detection device and method of unknown application bug threat
CN107341399A (en) Assess the method and device of code file security
CN105528422A (en) Focused crawler processing method and apparatus
CN113010944B (en) Model verification method, electronic equipment and related products
CN103927400A (en) Web site product detailed information classification crawling and product information base establishing method
CN111740923A (en) Method and device for generating application identification rule, electronic equipment and storage medium
CN107102993A (en) A kind of user's demand analysis method and device
CN104794051A (en) Automatic Android platform malicious software detecting method
CN106778910A (en) Deep learning system and method based on local training
CN110020161B (en) Data processing method, log processing method and terminal
CN113204695B (en) Website identification method and device
CN109347873A (en) A kind of detection method, device and the computer equipment of order injection attacks
CN109885708A (en) The searching method and device of certificate picture
CN108985059B (en) Webpage backdoor detection method, device, equipment and storage medium
CN106055571A (en) Method and system for website identification
CN109995605B (en) Flow identification method and device and computer readable storage medium
CN105279230A (en) Method and system for constructing internet application feature identification database with active learning method
CN102769607A (en) Malicious code detecting method and system based on network packet
CN105591842A (en) Method and device for obtaining version of mobile terminal operating system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150923