CN104933138A - Webpage crawler system and webpage crawling method - Google Patents
Webpage crawler system and webpage crawling method
- Publication number
- CN104933138A (application number CN201510334805.4A)
- Authority
- CN
- China
- Prior art keywords
- screenshot
- module
- webpage
- target page
- OCR server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Input (AREA)
Abstract
The invention discloses a webpage crawler system and a webpage crawling method. The webpage crawler system comprises: a page opening module for automatically invoking a browser to open a target page; an area crawling module for automatically taking a screenshot of a designated area in the target page and sending the screenshot to an OCR server; and the OCR server, which performs image recognition on the screenshot according to the designated area and a sample character library and outputs the recognition result in a preset configuration format. The system and method can break through all front-end anti-crawling restrictions: as long as the page can be opened and the IP address is not blocked, the information can be recognized and captured, which improves the availability of the crawler system.
Description
Technical field
The present invention relates to a webpage crawler system and a webpage crawling method.
Background art
Crawler technology currently faces unprecedented challenges. As anti-crawling technology develops, data capture becomes increasingly difficult, and it can be expected that traditional means will eventually fail to capture valuable data. The prior art therefore urgently needs a new crawler technology for crawling web data.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the defect in the prior art that the development of anti-crawling technology makes data capture increasingly difficult, by providing a webpage crawler system and a webpage crawling method.
The present invention solves the above technical problem through the following technical solutions:
The invention provides a webpage crawler system, characterized in that it comprises:
a page opening module, for automatically invoking a browser to open a target page;
an area crawling module, for automatically taking a screenshot of a designated area in the target page and sending the screenshot to an OCR (optical character recognition) server;
the OCR server, for performing image recognition on the screenshot according to the designated area and a sample character library, and outputting the recognition result in a preset configuration format.
Preferably, the area crawling module is also used to compress the screenshot and send the compressed screenshot to the OCR server.
Preferably, the configuration format is customizable.
Preferably, the page opening module opens the target page based on a task issued by a scheduling system.
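As a rough illustration of the last preference, the sketch below shows one way a page opening module might consume tasks issued by a scheduling system. The patent does not specify a task format or a browser interface, so `CrawlTask`, `PageOpenModule`, and their fields are hypothetical names; the default opener is Python's standard `webbrowser.open`, and any other callable (for example a wrapper around a browser automation tool) can be injected instead.

```python
import queue
import webbrowser
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CrawlTask:
    """A task issued by the scheduling system (field names are hypothetical)."""
    task_id: str
    url: str


class PageOpenModule:
    """Opens the target page of each scheduled task in a browser."""

    def __init__(self, open_page: Callable[[str], object] = webbrowser.open):
        # webbrowser.open launches the system default browser; inject another
        # callable to drive an automation tool or to run without a browser.
        self._open_page = open_page

    def run(self, tasks: "queue.Queue[CrawlTask]") -> List[str]:
        """Open every queued task's URL and return the handled task ids."""
        handled: List[str] = []
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                break
            self._open_page(task.url)
            handled.append(task.task_id)
        return handled
```

Injecting the opener keeps the module independent of any particular browser, which is an assumption of this sketch rather than something the patent states.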
The present invention also provides a webpage crawling method, characterized in that it is implemented using the above webpage crawler system and comprises the following steps:
S1. The page opening module automatically invokes a browser to open a target page;
S2. The area crawling module automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server;
S3. The OCR server performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result in the preset configuration format.
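The three steps can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the function name `crawl` and all four callables are hypothetical placeholders for the real browser, screenshot, compression, and OCR components, none of which the patent specifies in code.

```python
from typing import Callable


def crawl(url: str,
          open_page: Callable[[str], object],
          capture_region: Callable[[object], bytes],
          compress: Callable[[bytes], bytes],
          ocr_recognize: Callable[[bytes], str]) -> str:
    """Run steps S1 to S3 once for a single target page."""
    page = open_page(url)              # S1: open the target page in a browser
    screenshot = capture_region(page)  # S2: screenshot of the designated area
    payload = compress(screenshot)     # optional compression before upload
    return ocr_recognize(payload)      # S3: OCR server returns formatted text
```

Passing the components in as callables makes the order of the steps explicit and lets each stage be replaced or tested on its own.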
Preferably, in step S2, the area crawling module also compresses the screenshot and sends the compressed screenshot to the OCR server.
Preferably, the configuration format is customizable.
Preferably, in step S1, the page opening module opens the target page based on a task issued by a scheduling system.
The positive effect of the present invention is that it can break through all front-end anti-crawling restrictions of a website: as long as the page can be opened and the IP (Internet Protocol) address is not blocked, information can be recognized and captured, which improves the availability of the crawler system.
Brief description of the drawings
Fig. 1 is a module diagram of the webpage crawler system of a preferred embodiment of the present invention.
Fig. 2 is a flowchart of the webpage crawling method of a preferred embodiment of the present invention.
Detailed description of the embodiments
The present invention is further illustrated below by way of an embodiment, but is not thereby limited to the scope of the described embodiment.
As shown in Fig. 1, the webpage crawler system of the present invention comprises a page opening module 1, an area crawling module 2, and an OCR server 3. The page opening module 1 automatically invokes a browser to open a target page: based on a task issued by the scheduling system, it opens the target page directly in a general-purpose browser.
The area crawling module 2 automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server 3. Because the crawler machine has limited capacity and OCR consumes a large amount of CPU (central processing unit) resources, a more appropriate approach is for the area crawling module 2 to compress the screenshot moderately and send the compressed screenshot to the OCR server, so that the latter can handle recognition centrally.
The OCR server 3 performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result in the set configuration format. The configuration format can be customized according to user needs.
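The patent does not say how the OCR server uses the sample character library or what the configuration format looks like. As one plausible reading, the sketch below uses plain template matching: each segmented glyph is compared with the library bitmaps and the closest one (by Hamming distance) wins, and the recognized text is then rendered as JSON or CSV according to a small configuration dictionary. The library contents, the 3x5 glyph size, and the `format` and `fields` configuration keys are all assumptions made for illustration.

```python
import csv
import io
import json
from typing import Dict, List, Tuple

Glyph = Tuple[str, ...]  # rows of '#' and '.' pixels, all the same size

# Hypothetical sample character library: character -> 3x5 bitmap.
SAMPLE_LIBRARY: Dict[str, Glyph] = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
}


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyphs: List[Glyph], library: Dict[str, Glyph]) -> str:
    """Map each glyph to the library character with the closest bitmap."""
    return "".join(
        min(library, key=lambda ch: hamming(glyph, library[ch]))
        for glyph in glyphs
    )


def render(fields: Dict[str, str], config: Dict[str, object]) -> str:
    """Output the recognized fields in the format named by the config."""
    names = list(config["fields"])
    if config["format"] == "json":
        return json.dumps({name: fields[name] for name in names})
    if config["format"] == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(names)
        writer.writerow([fields[name] for name in names])
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {config['format']!r}")
```

A production OCR server would also need to segment the screenshot into glyphs and would likely use a more robust recognizer; this sketch only shows how a sample library and a configurable output format can fit together.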
In summary, the way the whole system works does not differ from the browsing behaviour of a real user: it captures information by simulating the human visual system. This defeats all front-end-based anti-crawling strategies of the target website, lets the crawler capture data as needed, and keeps the system available most of the time. Even if the website's front-end UI (user interface) undergoes a large-scale redesign, the system of the present invention can adapt dynamically by promptly adjusting the corresponding configuration files.
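To make the idea of adapting through configuration concrete, the sketch below keeps the designated area in a JSON profile and crops it out of a full-page image. The profile layout (`regions`, `left`, `top`, `width`, `height`) is a hypothetical schema, and the image is modelled as a nested list of grayscale values so that the example needs no imaging library.

```python
import json
from typing import Dict, List

Image = List[List[int]]  # grayscale pixel rows


def load_region(profile_json: str, name: str) -> Dict[str, int]:
    """Read one named region (left, top, width, height) from a JSON profile."""
    return json.loads(profile_json)["regions"][name]


def crop(image: Image, region: Dict[str, int]) -> Image:
    """Return only the pixels inside the designated area."""
    top, left = region["top"], region["left"]
    bottom, right = top + region["height"], left + region["width"]
    return [row[left:right] for row in image[top:bottom]]
```

When a site redesign moves the element of interest, only the coordinates in the profile change; the cropping code stays the same.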
As shown in Fig. 2, the webpage crawling method implemented with the webpage crawler system of this embodiment comprises the following steps:
Step 101: the page opening module automatically invokes a browser to open the target page;
Step 102: the area crawling module automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server;
Step 103: the OCR server performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result in the preset configuration format.
In step S1 (step 101), the page opening module opens the target page based on a task issued by the scheduling system; in step S2 (step 102), the area crawling module also compresses the screenshot and sends the compressed screenshot to the OCR server; and the configuration format can be customized according to user needs.
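The patent does not name a compression scheme. As a stand-in, the sketch below compresses raw screenshot bytes losslessly with Python's standard `zlib` module before upload and restores them on the OCR server side; a real deployment would more likely use an image codec or downscaling, and the function names here are hypothetical.

```python
import zlib


def compress_screenshot(raw: bytes, level: int = 6) -> bytes:
    """Losslessly compress raw screenshot bytes on the crawler machine."""
    return zlib.compress(raw, level)


def restore_screenshot(payload: bytes) -> bytes:
    """Inverse operation, run on the OCR server before recognition."""
    return zlib.decompress(payload)


def compression_ratio(raw: bytes, payload: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(payload) / len(raw)
```

Lossless compression keeps every pixel intact for recognition; whether lossy compression would be acceptable depends on how tolerant the recognizer is, which the patent leaves open.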
Although specific embodiments of the present invention have been described above, those skilled in the art will understand that these are merely illustrative, and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.
Claims (8)
1. A webpage crawler system, characterized in that it comprises:
a page opening module, for automatically invoking a browser to open a target page;
an area crawling module, for automatically taking a screenshot of a designated area in the target page and sending the screenshot to an OCR server;
the OCR server, for performing image recognition on the screenshot according to the designated area and a sample character library, and outputting the recognition result in a preset configuration format.
2. The webpage crawler system according to claim 1, characterized in that the area crawling module is also used to compress the screenshot and send the compressed screenshot to the OCR server.
3. The webpage crawler system according to claim 1, characterized in that the configuration format is customizable.
4. The webpage crawler system according to claim 1, characterized in that the page opening module opens the target page based on a task issued by a scheduling system.
5. A webpage crawling method, characterized in that it is implemented using the webpage crawler system according to claim 1 and comprises the following steps:
S1. The page opening module automatically invokes a browser to open a target page;
S2. The area crawling module automatically takes a screenshot of the designated area in the target page and sends the screenshot to the OCR server;
S3. The OCR server performs image recognition on the screenshot according to the designated area and the sample character library, and outputs the recognition result in the preset configuration format.
6. The webpage crawling method according to claim 5, characterized in that in step S2 the area crawling module also compresses the screenshot and sends the compressed screenshot to the OCR server.
7. The webpage crawling method according to claim 5, characterized in that the configuration format is customizable.
8. The webpage crawling method according to claim 5, characterized in that in step S1 the page opening module opens the target page based on a task issued by a scheduling system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510334805.4A CN104933138A (en) | 2015-06-16 | 2015-06-16 | Webpage crawler system and webpage crawling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510334805.4A CN104933138A (en) | 2015-06-16 | 2015-06-16 | Webpage crawler system and webpage crawling method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104933138A true CN104933138A (en) | 2015-09-23 |
Family
ID=54120305
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510334805.4A Pending CN104933138A (en) | 2015-06-16 | 2015-06-16 | Webpage crawler system and webpage crawling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104933138A (en) |
- 2015-06-16 CN CN201510334805.4A patent/CN104933138A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120051645A1 (en) * | 2010-08-30 | 2012-03-01 | Alibaba Group Holding Limited | Recognition of digital images |
CN102855423A (en) * | 2011-06-29 | 2013-01-02 | 盛乐信息技术(上海)有限公司 | Tracking method and device of literary works |
CN104598902A (en) * | 2015-01-29 | 2015-05-06 | 百度在线网络技术(北京)有限公司 | Method and device for identifying screenshot and browser |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599001A (en) * | 2015-10-20 | 2017-04-26 | 中国电信股份有限公司 | Webpage content acquisition method and system |
CN105512193A (en) * | 2015-11-26 | 2016-04-20 | 上海携程商务有限公司 | Data acquisition system and method based on browser expansion |
CN106095918B (en) * | 2016-06-06 | 2020-03-06 | 山东科技大学 | Network protected index data acquisition method based on OCR technology |
CN106095918A (en) * | 2016-06-06 | 2016-11-09 | 山东科技大学 | A kind of acquisition methods of the protected exponent data of network based on OCR technique |
CN106909694A (en) * | 2017-03-13 | 2017-06-30 | 杭州普玄科技有限公司 | Tag along sort data capture method and device |
CN106909694B (en) * | 2017-03-13 | 2020-01-17 | 杭州普玄科技有限公司 | Classification tag data acquisition method and device |
CN107465682A (en) * | 2017-08-10 | 2017-12-12 | 深圳市华傲数据技术有限公司 | Reptile logs in the realization method and system of targeted website |
CN108595583A (en) * | 2018-04-18 | 2018-09-28 | 平安科技(深圳)有限公司 | Dynamic chart class page data crawling method, device, terminal and storage medium |
CN109582850A (en) * | 2018-12-03 | 2019-04-05 | 金瓜子科技发展(北京)有限公司 | A kind of method, apparatus of web page crawl, storage medium and electronic equipment |
CN109582850B (en) * | 2018-12-03 | 2021-07-02 | 金瓜子科技发展(北京)有限公司 | Webpage crawling method and device, storage medium and electronic equipment |
CN110069688A (en) * | 2019-03-16 | 2019-07-30 | 平安城市建设科技(深圳)有限公司 | Page display method, server, storage medium and the device of anti-crawler |
CN111125217A (en) * | 2019-12-13 | 2020-05-08 | 天津润华科技有限公司 | Editable visual image recognition type intelligent data acquisition system and application thereof |
CN111310693A (en) * | 2020-02-26 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Intelligent labeling method and device for text in image and storage medium |
CN111310693B (en) * | 2020-02-26 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Intelligent labeling method, device and storage medium for text in image |
CN114547418A (en) * | 2022-02-25 | 2022-05-27 | 哈尔滨工程大学 | Fatigue simulation model-based anthropomorphic crawler method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20150923 | |