CN107784036A - Network crawler system and the data processing method based on network crawler system - Google Patents

Network crawler system and the data processing method based on network crawler system

Publication number
CN107784036A
Authority
CN
China
Prior art keywords
task
module
functional module
web page
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610798817.7A
Other languages
Chinese (zh)
Inventor
崔志伸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201610798817.7A priority Critical patent/CN107784036A/en
Publication of CN107784036A publication Critical patent/CN107784036A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a network crawler system and a data processing method based on the network crawler system. The network crawler system includes multiple functional modules, each of which can communicate with the others. After receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order. This solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain.

Description

Network crawler system and the data processing method based on network crawler system
Technical field
The present invention relates to the field of Internet technology, and in particular to a network crawler system and a data processing method based on the network crawler system.
Background
The Internet currently holds a massive amount of information, and obtaining it requires a web crawler. Traditional web crawlers fall into two kinds: standalone and clustered.
A standalone crawler deploys its crawling, processing, and storage components all on the same machine, or writes them directly into a single program. The advantages of this approach are easy deployment, migration, and maintenance, and low cost. The disadvantages are that performance depends on the capability of the single machine, the crawler is hard to scale, and it cannot adjust automatically when it hits a performance bottleneck.
A clustered crawler deploys all programs on a cluster of machines; each machine in the cluster can take on a single responsibility or several. The advantages of this approach are that performance is configurable, system resources can be used to the fullest, the crawler's configuration can scale elastically, and efficiency is higher than that of the standalone version. The disadvantages are that deployment is complex, the components depend closely on one another, the architecture is closed, the system is hard to extend and maintain, and the setup cost is higher.
No effective solution has yet been proposed for the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain.
Summary of the invention
The embodiments of the present invention provide a network crawler system and a data processing method based on the network crawler system, to at least solve the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain.
According to one aspect of the embodiments of the present invention, a network crawler system is provided, including multiple functional modules, each of which can communicate with the others. After receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order.
Further, the multiple functional modules include at least: a web page crawl module, configured to obtain from the Internet, according to a valid link address, the web page content corresponding to that address; and a result treatment module, configured to store the execution result of the task in the corresponding storage area and end the current task, or to generate a new pending task after an error occurs in the execution result of the task or a preset instruction is received.
Further, the multiple functional modules also include: a linkage extraction module, configured to extract valid links from the web page content; and/or a web page processing module, configured to perform first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or a link processing module, configured to perform second preset processing on the valid links, where the second preset processing includes deformation, deletion, and/or addition.
Further, the system also includes a central module, configured to store the registered address of each functional module and able to communicate with each functional module.
Further, each functional module includes: an address acquisition unit, configured to obtain the registered address of a target functional module according to the execution order, where the target functional module is the functional module that receives the task execution result of the current functional module; a receiving unit, configured to receive the task; a processing unit, configured to execute the task; and a sending unit, configured to send the execution result of the task to the target functional module.
Further, each functional module also includes: a first resource adjustment unit, configured to increase the number of processing units when the waiting time of a task exceeds a preset time; and a second resource adjustment unit, configured to reduce the number of processing units when the resource consumption of executing the task exceeds a preset threshold.
According to another aspect of the embodiments of the present invention, a data processing method based on a network crawler system is provided, where the network crawler system is any one of the network crawler systems in the above embodiments. The method includes: after any functional module among the multiple functional modules receives a task, determining, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task; and sending the task to the corresponding functional module, so that the functional modules execute the task in the execution order.
Further, the above method also includes: obtaining from the Internet, according to a valid link address, the web page content corresponding to that address; storing the execution result of the task in the corresponding storage area and ending the current task, or generating a new pending task after an error occurs in the execution result of the task or a preset instruction is received; and/or extracting valid links from the web page content; and/or performing first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or performing second preset processing on the valid links, where the second preset processing includes deformation, deletion, and/or addition.
Further, sending the task to the corresponding functional module includes: obtaining the registered address of the corresponding functional module; and sending the task to the corresponding functional module according to the registered address.
Further, when the waiting time of a task exceeds a preset time, the number of processing units is increased; when the resource consumption of executing the task exceeds a preset threshold, the number of processing units is reduced.
In the embodiments of the present invention, the network crawler system in the above scheme of the present application includes multiple functional modules. After receiving a task, any functional module determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional module executes the task in the execution order. By dividing the network crawler system into multiple functional modules, the above scheme ensures that there is no necessary coupling between the functional modules, thereby solving the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the present invention and form a part of the present application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic structural diagram of a network crawler system according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an optional network crawler system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a network crawler system processing a task according to an embodiment of the present application; and
Fig. 4 is a flow chart of the data processing method based on the network crawler system according to an embodiment of the present invention.
Detailed description
To enable those skilled in the art to better understand the scheme of the present invention, the technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, but can include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
Embodiment 1
The present invention provides a network crawler system. Fig. 1 is a schematic structural diagram of a network crawler system according to an embodiment of the present invention. As shown in Fig. 1, the system includes multiple functional modules, each of which can communicate with the others.
After receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order.
Specifically, the functional modules can be multiple separable modules. There is no dependency between the functional modules, and each module is itself a stateless processing unit: it returns no corresponding return value after receiving or executing a task, which makes the crawler system simpler to debug and deploy. The circulation information represents the execution order in which the functional modules execute the task. The corresponding functional module is the functional module recorded in the circulation information as the next to execute the task after the current functional module.
In an optional embodiment, take multiple functional modules numbered A, B, C, D, and E as an example. The circulation information can store the numbers in the execution order of the functional modules. For example, if the execution order of a task received by the system is A, C, E, D, B, the numbers are stored in the circulation information in the order A, C, E, D, B. In this example, module A receives the task, executes it, and forwards it to module C, which executes it in turn; forwarding continues in this way until every module in the circulation information has executed the received task.
It should also be noted that when the network crawler system executes a task, not all of its functional modules necessarily take part. Taking the functional modules numbered A, B, C, D, and E as an example again, if the circulation information corresponding to a task stores A, C, D, B, then the functional module numbered E does not execute that task. In other words, whether each functional module takes part in executing a task, and in what order, depends on the circulation information corresponding to the task.
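The forwarding described above can be sketched as follows. This is a minimal single-process illustration under assumed names (Task, run_module, and dispatch do not appear in the source); in the described system each module would be a separate deployment reached over the network.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    payload: str
    route: List[str]                       # circulation information: module numbers in order
    step: int = 0                          # position of the module that runs next
    trace: List[str] = field(default_factory=list)


def run_module(name: str, task: Task) -> Optional[str]:
    """Execute the task in module `name`; return the next module number, if any."""
    if task.route[task.step] != name:
        raise ValueError(f"task routed to {task.route[task.step]}, not {name}")
    task.trace.append(name)                # stand-in for the module's real work
    task.step += 1
    return task.route[task.step] if task.step < len(task.route) else None


def dispatch(task: Task) -> List[str]:
    """Forward the task from module to module until the route is exhausted."""
    current = task.route[0] if task.route else None
    while current is not None:
        current = run_module(current, task)
    return task.trace
```

With the route A, C, D, B, module E never appears in the trace, which matches the statement that participation depends only on the circulation information.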
In an optional embodiment, the functional modules communicate with one another through a communication protocol; the protocols used include, but are not limited to, the common HTTP and TCP protocols. Communication between the functional modules can be stateless, that is, no return value needs to be processed, which improves the efficiency with which the network crawler system processes tasks.
Because all communication between modules is based on IP, a central module can be used to manage all the addresses, classified by category. Each module can perform a self-check step when it starts. In this step the module registers its own access address with the central module and retrieves all valid next-step addresses. This ensures that the current module is deployed correctly and that the system is decentralized while it runs.
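The registration step above can be illustrated with a small in-memory registry. This is only a sketch: the class name CentralModule, the function self_check, and the category labels are assumptions made for illustration, and a real central module would be reached over the network rather than called directly.

```python
from collections import defaultdict


class CentralModule:
    """In-memory stand-in for the central module: addresses grouped by category."""

    def __init__(self):
        self._addresses = defaultdict(set)   # category -> registered addresses

    def register(self, category, address):
        self._addresses[category].add(address)

    def lookup(self, category):
        return sorted(self._addresses.get(category, ()))


def self_check(center, category, address, next_categories):
    """Startup self-check: register own address, then fetch next-step addresses."""
    center.register(category, address)
    return {c: center.lookup(c) for c in next_categories}
```

A module that gets back an empty list for every next-step category knows that nothing downstream is registered yet.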
In an optional embodiment, because tasks introduced from outside are uncertain, a task can be sent multiple times: after receiving input, each functional module can send multiple copies of the request to the next functional module. This ensures that the final result is not lost. The number of copies sent can be preset according to actual needs.
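A minimal sketch of this multi-send behaviour is given below. The function name multi_send and the injected send callable are hypothetical, and the source does not say how the receiving module treats duplicate copies, so that side is left out.

```python
def multi_send(task, send, copies=3):
    """Send `copies` identical requests; return how many were delivered."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    delivered = 0
    for _ in range(copies):
        try:
            send(task)
            delivered += 1
        except OSError:
            # A lost copy is tolerated as long as another copy gets through.
            continue
    return delivered
```

The copies parameter corresponds to the preset number of copies mentioned above; its default of 3 is an arbitrary illustrative value.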
It is worth noting that the above embodiment of the present invention subdivides the network crawler system, splitting the whole crawling flow into multiple separable modules, each of which executes the task in the task's preset processing order. There is therefore no dependency between the modules, and each module is a stateless processing unit, which makes the whole system simpler to debug and deploy. After the division into modules, the modules communicate through a communication protocol, including but not limited to the common HTTP and TCP protocols. This communication is stateless, that is, no return value needs to be processed, which can greatly speed up processing.
From the foregoing, the network crawler system in the above scheme of the present application includes multiple functional modules. After receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order. Because the above scheme divides the network crawler system into multiple functional modules with no necessary coupling between them, it solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain.
Optionally, according to the above embodiment of the present application, as shown in Fig. 2, the multiple functional modules include at least:
A web page crawl module, configured to obtain from the Internet, according to a valid link address, the web page content corresponding to that address.
Specifically, the valid URL address that is input can be a URL address carried in the task or a URL address obtained through another functional module.
In the above system, the role of the web page crawl module is to fetch, according to the valid URL (Uniform Resource Locator) address that is input, the web page corresponding to that address, and to return the page content. In an optional embodiment, the web page crawl module uses the input task information (the URL) to send an HTTP request to the website, packages the content returned by the request, and sends it to other modules as output.
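The crawl step can be sketched with the Python standard library as follows. The task layout (a dict with "url" and optional "headers" keys) and the function names are assumptions for illustration; the source does not specify how the returned content is packaged.

```python
import urllib.request


def http_fetch(url, headers=None, timeout=10):
    """Send an HTTP request for `url` and return the raw response body."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def crawl(task, fetch=http_fetch):
    """Fetch the page named in the task and package it as the module's output."""
    body = fetch(task["url"], task.get("headers"))
    # Keep the original task fields so later modules still see the routing data.
    return {**task, "content": body}
```

The fetch parameter is injectable so that the packaging logic can be exercised without network access.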
It should be noted that this module is the core module of the network crawler system and the only component that obtains information from the Internet.
A result treatment module, configured to store the execution result of the task in the corresponding storage area and end the current task, or to generate a new pending task after an error occurs in the execution result of the task or a preset instruction is received.
The result treatment module is the exit of the network crawler system. Its main role is to save the final result of the crawling system, storing the result content in the corresponding storage medium according to the input task.
It should be noted here that the result treatment module does not modify or convert the result; it is only responsible for storing it. In the crawler system, this module is the sole exit of the whole system. After a task has been processed by this module, a new task is usually not generated and sent back into the system. In some special cases, however, for example when an error occurs while the system executes the task or a preset instruction is received, the result treatment module can generate a new processing task and send it to the system.
From the foregoing, in the above system of the present application, the web page crawl module obtains from the Internet, according to a valid link address, the web page content corresponding to that address, and the result treatment module stores the execution result of the task in the corresponding storage area and ends the current task, or generates a new pending task after an error occurs in the execution result of the task or a preset instruction is received. By dividing the network crawler system into separate functional modules, the above scheme solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain.
Optionally, according to the above embodiment of the present application, the multiple functional modules can also include:
A linkage extraction module, configured to extract valid links from the web page content; and/or
In the above system, the role of the linkage extraction module is to extract valid links from the web page content obtained by crawling. A link contains the URL address of a web page that needs to be fetched and some parameters that the HTTP request needs to include.
In an optional embodiment, after extracting these links, the linkage extraction module can package the extracted valid links and send the task composed of the packaged valid links to other modules. The linkage extraction module is a core module of the crawler system: links are the foundation of the crawler system, because the pages on the network are connected to one another through links.
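One way to sketch the extraction step is with the standard html.parser module. The rule used here for a valid link (an absolute http or https URL taken from an anchor tag, with duplicates dropped) is an assumption; the source does not define validity.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return absolute http(s) links found in `html`, in order, without repeats."""
    collector = _AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links against the page
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

The returned list is what the module would package into a task for the next functional module.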
A web page processing module, configured to perform first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or
Specifically, performing web page screening on the web page content can mean screening out the useful web page content and/or links according to the task.
For example, the web page content obtained by the web page crawl module contains a large amount of information. Taking the web pages in the web page content as an example, they include useful web page content that is needed and useless web page content that is not; advertising content in a web page is useless web page content. Through the web page screening of the web page processing module, the useful web pages that are needed can be screened out of the large amount of information, and invalid web pages such as advertising information can be removed. Taking the links in the web page content as another example, the links can be divided into valid links and invalid links, and the valid links can be further divided into useful links and useless links. The system needs only the useful links among the valid links, so the web page processing module can screen the links in the web page content, select the useful links, and forward them to the next functional module after packaging.
It should be noted here that because the web page processing module performs the first preset processing on web page content, if the functional modules that execute a task include the web page processing module, it is normally ordered after the web page crawl module.
A link processing module, configured to perform second preset processing on the valid links, where the second preset processing includes deformation, deletion, and/or addition.
Specifically, the role of the link processing module is to process the extracted links and generate new crawling tasks.
It is worth noting that the link processing module and the linkage extraction module handle links from different angles. The link processing module can deform, delete, and add links. The input it receives can be a single link or a set of links, and what it produces after processing can be the tasks required by the web page crawl module. The link processing module can effectively reduce the number of useless links according to the link content, which improves the crawling efficiency of the crawler and reduces the load on the whole crawler system.
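The three operations named above can be sketched as follows. The concrete choices are assumptions made for illustration: deformation is shown as stripping the URL fragment, deletion as skipping links already seen, and addition as appending caller-supplied links. The function name process_links and the output task layout are likewise hypothetical.

```python
from urllib.parse import urldefrag


def process_links(links, seen, extra=()):
    """Turn a set of links into crawl tasks, updating `seen` in place."""
    tasks = []
    for link in list(links) + list(extra):       # addition: append extra links
        normalized, _fragment = urldefrag(link)  # deformation: drop "#fragment"
        if normalized in seen:                   # deletion: skip known links
            continue
        seen.add(normalized)
        tasks.append({"url": normalized})
    return tasks
```

Each returned dict stands for one crawl task that could be handed to the web page crawl module.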
From the foregoing, through the linkage extraction module, the web page processing module, and the link processing module, the above network crawler system of the present application not only solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine, or deployed on multiple machines but closely coupled to one another, which makes the crawler difficult to extend and maintain, but also achieves the technical effect of improving the crawling efficiency of the system.
Fig. 3 is a schematic diagram of a network crawler system processing a task according to an embodiment of the present application. An implementation in which the above network crawler system executes a task is described below with reference to the example shown in Fig. 3.
In an optional embodiment, with reference to the example shown in Fig. 3, the processing order recorded in the circulation information is that the link processing module, the web page crawl module, the web page processing module, and the linkage extraction module execute the task in a loop multiple times, after which the final result is sent to the result treatment module.
It should be noted here that in the above example the network crawler system can include all of the functional modules in the above embodiments or only some of them, but includes at least the web page crawl module and the result treatment module. Two optional architectures of the network crawler system are described below as examples:
Mode one: the network crawler system includes the link processing module, the web page crawl module, the web page processing module, the linkage extraction module, and the result treatment module.
When the network crawler system uses the architecture of mode one, the circulation information corresponding to a task can include all of the functional modules or only some of them.
Mode two: the network crawler system includes the web page crawl module, the linkage extraction module, and the result treatment module.
When the network crawler system uses the architecture of mode two, the task is executed according to the circulation information corresponding to the task.
The two modes above are only examples and do not cover all of the network crawler systems protected by the present invention.
It should also be noted here that when the network crawler system obtains a task, it can also obtain the information needed to run the task, such as the crawling depth, the header information used for crawling, and the crawling strategy used.
The way the network crawler system processes a task according to the circulation information is described below through an optional embodiment.
The circulation information can take the form of a stored list of identifiers of functional modules. In an example where the multiple functional modules are the web page crawl module, the result treatment module, the linkage extraction module, the web page processing module, and the link processing module, suppose their identifiers are 01A, 02B, 03C, 04D, and 05E respectively. If the circulation information contains 03C, 05E, 04D, and 02B, then the processing order when the network crawler system processes the corresponding task is: linkage extraction module, link processing module, web page processing module, and result treatment module.
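The identifier lookup in this example can be written out directly. The identifiers and module names come from the paragraph above; the table name MODULES and the function processing_order are illustrative.

```python
MODULES = {
    "01A": "web page crawl module",
    "02B": "result treatment module",
    "03C": "linkage extraction module",
    "04D": "web page processing module",
    "05E": "link processing module",
}


def processing_order(circulation):
    """Map the identifiers in the circulation information to module names, in order."""
    unknown = [ident for ident in circulation if ident not in MODULES]
    if unknown:
        raise ValueError(f"unknown module identifiers: {unknown}")
    return [MODULES[ident] for ident in circulation]
```

Calling processing_order(["03C", "05E", "04D", "02B"]) yields the four modules in the order stated above; the web page crawl module (01A) is absent because the circulation information does not list it.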
Optionally, according to the above embodiment of the present application, the above system also includes:
A central module, configured to store the registered address of each functional module and able to communicate with each functional module.
When the circulation information corresponding to a task does not include the registered addresses of the functional modules, a central module can manage the registered addresses of all functional modules, classified by category. In an optional embodiment, each functional module performs a self-check step at startup, in which it registers its own access address with the central module and retrieves all valid next-step addresses. This ensures that the current module is deployed correctly and that the system is decentralized while it runs. A functional module starts only once it has obtained the address of at least one next functional module; otherwise it keeps sending requests to the central module. The input each functional module receives with a task is structured information, which can use, but is not limited to, formats such as JSON and XML.
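The startup rule above (start only once at least one next-step address is known, otherwise keep asking the central module) can be sketched as a polling loop. The function name, the fetch_next callable, and the fixed delay between requests are assumptions; the source does not say how often a module repeats its request.

```python
import time


def wait_for_next_addresses(fetch_next, delay=1.0, sleep=time.sleep):
    """Poll the central module until it returns at least one next-step address."""
    while True:
        addresses = fetch_next()
        if addresses:
            return addresses   # the module may now start processing tasks
        sleep(delay)           # nothing registered yet; ask again later
```

The sleep parameter is injectable so that the loop can be exercised without real waiting.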
In an optional embodiment, the network crawler system includes the link processing module, the web page crawl module, the web page processing module, the linkage extraction module, the result treatment module, and the central module. The central module stores the registered addresses of the functional modules, and a functional module can obtain from the central module the address of the next functional module in the task's circulation information in order to pass on the task's processing result, so every pair of functional modules can communicate by obtaining registered addresses. Before the network crawler system executes a task, an initialization process is needed in which the central module obtains the registered address of each functional module. The network crawler system is then started by injecting a task into any functional module. After each module executes the task and obtains a result, it can obtain from the central module the registered address of the next functional module in the task's circulation information and send the result to that functional module.
Fig. 2 is a schematic structural diagram of an optional network crawler system according to an embodiment of the present invention. In an optional embodiment, the network crawler system includes the web page crawl module, the result treatment module, the linkage extraction module, the web page processing module, the link processing module, and the central module. Besides storing the registered addresses of the functional modules, the central module can also be used for monitoring, scheduling, and health checks of the network crawler system, and serves as the channel through which technical staff learn the running state of the network crawler system.
Optionally, according to the above embodiment of the present application, each functional module includes:
An address acquisition unit, configured to obtain the registered address of the target functional module according to the execution order, where the target functional module is the functional module that receives the task execution result of the current functional module.
Specifically, the target functional module is the module that follows the current functional module in the task execution order indicated by the circulation information; it receives the result output by the current functional module after executing the task.
It should be noted here that, where the circulation information includes the registered address of each functional module, the address acquisition unit obtains the registered address of the target functional module from the circulation information. Where it does not, the address acquisition unit identifies the target functional module from the circulation information and then obtains its registered address from the central module.
A receiving unit, configured to receive the task.
Specifically, the receiving unit receives an externally supplied task and invokes the processing unit according to that task.
A processing unit, configured to execute the task.
Specifically, the processing unit executes the task received by the receiving unit and passes the result to the sending unit.
A sending unit, configured to send the execution result of the task to the target functional module.
Specifically, the sending unit sends the processing result to the target functional module at the registered address obtained by the address acquisition unit.
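The cooperation of the four units can be sketched as a single class. This is a simplified illustration, assuming a task is a dictionary with a `flow` list (the circulation information) and a `payload`; the class name, field names, addresses and the `transport` callable are hypothetical, and the sketch assumes each module name appears only once in `flow`.

```python
class FunctionalModule:
    """Sketch of one functional module with its four units as methods."""

    def __init__(self, name, handler, addresses, transport):
        self.name = name
        self.handler = handler      # the work done by the processing unit
        self.addresses = addresses  # module name -> registered address
        self.transport = transport  # callable(address, task) that delivers a task

    def acquire_target_address(self, task):
        """Address acquisition unit: find the module that follows this one."""
        flow = task["flow"]
        position = flow.index(self.name)
        if position + 1 >= len(flow):
            return None  # this module is the last step
        return self.addresses[flow[position + 1]]

    def process(self, task):
        """Processing unit: run the handler and keep the circulation information."""
        return dict(task, payload=self.handler(task["payload"]))

    def send(self, address, result):
        """Sending unit: deliver the result to the target functional module."""
        self.transport(address, result)

    def receive(self, task):
        """Receiving unit: accept a task and hand it to the processing unit."""
        result = self.process(task)
        address = self.acquire_target_address(task)
        if address is not None:
            self.send(address, result)


sent = []
crawler = FunctionalModule(
    "web_page_crawl",
    str.upper,  # stand-in for real crawling work
    {"page_processing": "10.0.0.2:8002"},
    lambda address, task: sent.append((address, task)),
)
crawler.receive({"flow": ["web_page_crawl", "page_processing"], "payload": "<html>"})
```

Note that `receive` returns nothing to the caller: the result only moves forward to the next module, which matches the stateless behaviour described for the functional modules.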
Optionally, the functional module can further include an initialization unit, configured to initialize the functional module when the sending unit fails to send the execution result of the task to all functional modules, so as to request new registered addresses of the target functional module.
In an optional embodiment, when the sending unit fails to send to all of its send addresses, the functional module suspends receiving tasks and starts the initialization unit, which sends all failed addresses to the central module and requests new addresses. After receiving the request, the central module updates its own address list and returns the latest list to each functional module.
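The failure handling described above might look like the following sketch. The function names, the use of `ConnectionError` to signal a failed send, and the addresses are assumptions made for illustration; pausing task reception is omitted for brevity.

```python
def send_with_failover(addresses, result, send, request_new_addresses):
    """Try each known address; if all fail, report them and retry with a fresh list.

    Returns (address that accepted the result, or None; address list now in use).
    """
    failed = []
    for address in addresses:
        try:
            send(address, result)
            return address, addresses
        except ConnectionError:
            failed.append(address)

    # Every send address failed: hand the failed addresses to the central
    # module (represented here by a callable) and ask for replacements.
    fresh = request_new_addresses(failed)
    for address in fresh:
        try:
            send(address, result)
            return address, fresh
        except ConnectionError:
            continue
    return None, fresh


ALIVE = {"10.0.0.3:8002"}  # hypothetical reachable address
delivered = []


def fake_send(address, result):
    if address not in ALIVE:
        raise ConnectionError(address)
    delivered.append((address, result))


def fake_central(failed_addresses):
    # A real central module would also update its own address list here.
    return ["10.0.0.3:8002"]


used, current = send_with_failover(
    ["10.0.0.1:8002", "10.0.0.2:8002"], {"links": 3}, fake_send, fake_central
)
```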
Optionally, according to the above embodiment of the present application, each functional module further includes:
A first resource adjustment unit, configured to increase the number of processing units when the waiting time of a task exceeds a preset time.
When the waiting time of a task exceeds the preset time, the processing units in the functional module are heavily loaded. One optional way to handle an overloaded functional module is to add processing units to it. A newly added processing unit can be started directly and joins the functional module automatically, which reduces the load on each processing unit and improves data processing efficiency.
A second resource adjustment unit, configured to reduce the number of processing units when the resource consumption of executing the task exceeds a preset threshold.
When the resource consumption of executing the task exceeds the preset threshold, the functional module is considered not to need its current number of processing units to handle the current task, so the number of processing units in the functional module can be reduced to lower resource consumption.
It can be seen from the above that the system of the present application uses the first resource adjustment unit and the second resource adjustment unit to dynamically scale the capacity of the functional modules of the network crawler system.
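The two adjustment rules can be expressed as one small decision function. This is a sketch only: the parameter names, the step size of one unit, the lower bound of one unit and the choice to check the waiting time first are assumptions, since the source does not say how the two rules interact.

```python
def adjust_processing_units(units, wait_seconds, resource_usage,
                            max_wait_seconds, usage_threshold, min_units=1):
    """Return the new number of processing units for one functional module."""
    # First resource adjustment unit: tasks wait too long, so add a unit.
    if wait_seconds > max_wait_seconds:
        return units + 1
    # Second resource adjustment unit: resource consumption is above the
    # preset threshold, so remove a unit (never below min_units).
    if resource_usage > usage_threshold:
        return max(min_units, units - 1)
    return units


scaled_up = adjust_processing_units(2, wait_seconds=12.0, resource_usage=0.3,
                                    max_wait_seconds=10.0, usage_threshold=0.8)
scaled_down = adjust_processing_units(3, wait_seconds=1.0, resource_usage=0.9,
                                      max_wait_seconds=10.0, usage_threshold=0.8)
unchanged = adjust_processing_units(2, wait_seconds=1.0, resource_usage=0.3,
                                    max_wait_seconds=10.0, usage_threshold=0.8)
```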
Embodiment 2
According to an embodiment of the present invention, a method embodiment for data processing in a network crawler system is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from the one given here.
Fig. 4 is a flowchart of the data processing method based on a network crawler system according to an embodiment of the present invention. The network crawler system is any one of the network crawler systems of Embodiment 1. As shown in Fig. 4, the data processing method includes:
Step S402: after receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task.
Step S404: the task is sent to the corresponding functional module, so that the functional modules execute the task according to the execution order.
Specifically, the multiple functional modules can be separable modules with no dependencies among them, and each module is itself a stateless processing unit; that is, it does not return a corresponding return value after receiving or executing a task, which makes the crawler system simpler to debug and deploy. The circulation information characterizes the execution order in which the multiple functional modules execute the task. The corresponding functional module is the functional module, recorded in the circulation information, that executes the task next after the current functional module.
In an optional embodiment, with reference to the example shown in Fig. 3, the processing order recorded in the circulation information is as follows: the link processing module, web page crawl module, page processing module and link extraction module execute the task cyclically a number of times, and the final result is then sent to the result processing module.
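The cyclic order described above can be pictured as a list built from a repeated four-module cycle followed by the result processing module. The sketch below is only an illustration; the string names for the modules, the `rounds` parameter and the `build_circulation_info` helper are assumptions and do not appear in the source.

```python
# Hypothetical module names; the source identifies the modules only by role.
CRAWL_CYCLE = [
    "link_processing",
    "web_page_crawl",
    "page_processing",
    "link_extraction",
]


def build_circulation_info(rounds):
    """Repeat the four-module cycle `rounds` times, then hand off to result processing."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return CRAWL_CYCLE * rounds + ["result_processing"]


flow = build_circulation_info(2)
```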
It should be noted here that, in the above example, the web crawler system can include all of the functional modules of the above embodiment or only some of them, but includes at least the web page crawl module, the link extraction module and the result processing module. Two optional architectures of the network crawler system are described below as examples:
Mode one: the network crawler system includes a link preprocessing module, a web page crawl module, a page processing module, a link extraction module and a result processing module.
Where the network crawler system uses the architecture of mode one, the circulation information corresponding to a task can include all of the functional modules or only some of them.
Mode two: the network crawler system includes a web page crawl module, a link extraction module and a result processing module.
Where the network crawler system uses the architecture of mode two, the task is executed according to the circulation information corresponding to the task.
The two modes above are examples only and do not cover all of the network crawler systems protected by the present invention.
It should also be noted here that, when the network crawler system obtains a task, it can also obtain the information needed to run the task, for example the crawl depth, the header information used for crawling and the crawl strategy used.
An optional embodiment is used below to describe how the network crawler system processes a task according to the circulation information.
The circulation information can take the form of a list that stores the identifiers of multiple functional modules. In an example where the functional modules are the web page crawl module, result processing module, link extraction module, page processing module and link processing module, suppose their identifiers are 01A, 02B, 03C, 04D and 05E respectively. If the circulation information contains 03C, 05E, 04D and 02B, then the processing order when the network crawler system processes the corresponding task is: link extraction module, link processing module, page processing module, result processing module.
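The identifier example can be checked with a short sketch. The identifiers and their assignment to modules come from the paragraph above; the dictionary layout and the `processing_order` helper are illustrative assumptions.

```python
# Identifier-to-module mapping taken from the example in the text.
MODULE_IDS = {
    "01A": "web page crawl module",
    "02B": "result processing module",
    "03C": "link extraction module",
    "04D": "page processing module",
    "05E": "link processing module",
}


def processing_order(circulation_info):
    """Translate the identifiers in the circulation information into module names."""
    return [MODULE_IDS[module_id] for module_id in circulation_info]


order = processing_order(["03C", "05E", "04D", "02B"])
```

An identifier that is not in the mapping raises a `KeyError` in this sketch; how an unknown identifier should be treated is not specified in the source.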
It should be noted here that all modules in the network crawler system are stateless, so in theory a task can be started by injecting it into any module; that is, any functional module can serve as the initial functional module. In the usual case, however, crawling starts from a web page link, so the initial functional module is usually the link preprocessing module. When a link is injected, the information needed to run the task can be added as well, for example the crawl depth, the header information used for crawling and the strategy used when crawling. Each module can start multiple instances, so a task can be injected into any one task processing module at random or into several processing modules. Where the task goes next is handled internally by the combination of modules.
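An injected task that carries a link together with its run information could be serialized as JSON, one of the structured formats the description mentions. The field names, the example URL and the strategy value below are hypothetical; the source does not define a message schema.

```python
import json


def build_task(seed_url, circulation_info, crawl_depth, headers, strategy):
    """Serialize a task and its run information as JSON (field names are illustrative)."""
    task = {
        "url": seed_url,
        "circulation_info": circulation_info,
        "crawl_depth": crawl_depth,
        "headers": headers,
        "strategy": strategy,
    }
    return json.dumps(task, sort_keys=True)


message = build_task(
    "http://example.com/",
    ["link_processing", "web_page_crawl", "link_extraction", "result_processing"],
    crawl_depth=2,
    headers={"User-Agent": "example-crawler"},
    strategy="breadth_first",
)
decoded = json.loads(message)
```

Because the task carries everything needed to run it, any instance of the first module in the list can accept the message without shared state.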
It should also be noted here that each of the functional modules can include multiple processing units. When a task is injected into a functional module, it can therefore be injected into one processing unit of that module or into several. Injecting multiple copies of the task ensures that the task will be executed by the functional module.
It can be seen from the above that the network crawler system in the above scheme includes multiple functional modules. When executing a task, the network crawler system obtains the circulation information corresponding to the task, and the functional modules then execute the task in the processing order contained in the circulation information. By dividing the network crawler system into multiple functional modules with no necessary coupling among them, the scheme solves the technical problem in the prior art that the components of a web crawler are either deployed on a single machine or deployed across multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
In one embodiment, the above data processing method based on the network crawler system can include the following steps during data processing:
Obtaining, according to a valid link address, the web page content corresponding to the valid link address from the internet;
Storing the execution result of the task in a corresponding storage area and ending the current task; or generating a new pending task after an error occurs in the execution result of the task or a preset instruction is received; and/or
Extracting valid links from the web page content; and/or
Performing first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or
Performing second preset processing on the valid links, where the second preset processing includes transforming, deleting and/or adding.
Optionally, according to the above embodiment of the present application, sending the task to the corresponding functional module includes:
Step S4041: obtaining the registered address of the corresponding functional module.
Step S4043: sending the task to the corresponding functional module according to the registered address.
In the above steps, where the circulation information includes the registered address of each functional module, the registered address of the target functional module can be obtained from the circulation information. Where it does not, the target functional module is identified from the circulation information and its registered address is then obtained from the central module.
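The two ways of obtaining the registered address can be sketched as a single lookup with a fallback. The dictionary layout of the circulation information (an `order` list plus an optional `addresses` mapping), the function name and the addresses are assumptions for illustration; a plain dictionary stands in for the central module.

```python
def resolve_target_address(circulation_info, target_module, central_registry):
    """Prefer an address carried in the circulation information; otherwise ask the central module."""
    carried = circulation_info.get("addresses", {})
    if target_module in carried:
        return carried[target_module]
    return central_registry[target_module]


central_registry = {"web_page_crawl": "10.0.0.1:8001"}  # stand-in for the central module

with_addresses = {
    "order": ["web_page_crawl"],
    "addresses": {"web_page_crawl": "10.0.0.9:8001"},
}
without_addresses = {"order": ["web_page_crawl"]}

from_flow = resolve_target_address(with_addresses, "web_page_crawl", central_registry)
from_central = resolve_target_address(without_addresses, "web_page_crawl", central_registry)
```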
Optionally, according to the above embodiment of the present application, the functional module executing the task includes:
Step S4045: receiving and executing the task.
Step S4047: sending the execution result of the task to the target functional module, the target functional module being the functional module that receives the task execution result of the current functional module.
Specifically, sending the execution result of the task to the target functional module means sending the execution result to the registered address of the target functional module.
Optionally, when sending the execution result of the task to all functional modules fails, the functional module can also initialize itself in order to request new registered addresses of the target functional module.
In an optional embodiment, when sending to all send addresses fails, the functional module suspends receiving tasks and starts the initialization unit, which sends all failed addresses to the central module and requests new addresses. After receiving the request, the central module updates its own address list and returns the latest list to each functional module.
In an optional embodiment, after a functional module finishes processing, it calls a sending interface to send out the processing result. The send addresses are the addresses obtained from the central module. When sending to all send addresses fails, the functional module suspends receiving tasks, sends all failed addresses to the central module and requests new addresses. After receiving the request, the central module updates its own address list and returns the latest list.
It should be noted here that, where the circulation information of the task includes the registered addresses of the functional modules, the first functional module obtains the registered address of the second functional module directly from the circulation information. If, in this case, the first functional module fails to send the result to the second functional module at that registered address, the circulation information of the task can be updated and the updated circulation information obtained.
Optionally, according to the above embodiment of the present application, the method further includes:
Increasing the number of processing units when the waiting time of a task exceeds a preset time.
When the waiting time of a task exceeds the preset time, the processing units in the functional module are heavily loaded. One optional way to handle an overloaded functional module is to add processing units to it. A newly added processing unit can be started directly and joins the functional module automatically, which reduces the load on each processing unit and improves crawling efficiency.
Reducing the number of processing units when the resource consumption of executing the task exceeds a preset threshold.
When the resource consumption of executing the task exceeds the preset threshold, the functional module is considered not to need its current number of processing units to handle the current task, so the number of processing units in the functional module can be reduced to lower resource consumption.
It can be seen from the above that the method of the present application flexibly increases or decreases the number of processing units according to the load on the processing units in a functional module, thereby dynamically scaling the capacity of the functional modules of the network crawler system.
The multiple functional modules of the network crawler system further include a processor and a memory. The processor contains a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels can be provided. By adjusting kernel parameters, the web crawler is divided into multiple functional modules, which solves the technical problem in the prior art that the components of a web crawler are either deployed on a single machine or deployed across multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
The memory may take the form of volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: after receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task; the task is sent to the corresponding functional module, so that the functional modules execute the task according to the execution order.
The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, units or modules, and can be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they can be located in one place or distributed across multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the scheme of the embodiment.
In addition, the functional units in the embodiments of the present invention can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A network crawler system, characterized by comprising multiple functional modules, wherein the functional modules can communicate with one another;
    after receiving a task, any functional module among the multiple functional modules determines, according to circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task according to the execution order.
  2. The system according to claim 1, characterized in that the multiple functional modules include at least:
    a web page crawl module, configured to obtain, according to a valid link address, the web page content corresponding to the valid link address from the internet;
    a result processing module, configured to store the execution result of the task in a corresponding storage area and end the current task, or to generate a new pending task after an error occurs in the execution result of the task or a preset instruction is received.
  3. The system according to claim 2, characterized in that the multiple functional modules further include:
    a link extraction module, configured to extract valid links from the web page content; and/or
    a page processing module, configured to perform first preset processing on the web page content, wherein the first preset processing includes web page screening and/or link screening; and/or
    a link processing module, configured to perform second preset processing on the valid links, wherein the second preset processing includes transforming, deleting and/or adding.
  4. The system according to any one of claims 1 to 3, characterized in that the system further includes:
    a central module, configured to store the registered address of each functional module and able to communicate with each functional module.
  5. The system according to claim 4, characterized in that each functional module includes:
    an address acquisition unit, configured to obtain the registered address of a target functional module according to the execution order, the target functional module being the functional module that receives the task execution result of the current functional module;
    a receiving unit, configured to receive the task;
    a processing unit, configured to execute the task;
    a sending unit, configured to send the execution result of the task to the target functional module.
  6. The system according to claim 5, characterized in that each functional module further includes:
    a first resource adjustment unit, configured to increase the number of processing units when the waiting time of the task exceeds a preset time;
    a second resource adjustment unit, configured to reduce the number of processing units when the resource consumption of executing the task exceeds a preset threshold.
  7. A data processing method based on a network crawler system, characterized in that the network crawler system is the network crawler system according to any one of claims 1 to 6, and the method includes:
    after receiving a task, any functional module among the multiple functional modules determines, according to circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task;
    sending the task to the corresponding functional module, so that the functional modules execute the task according to the execution order.
  8. The method according to claim 7, characterized in that the method further includes:
    obtaining, according to a valid link address, the web page content corresponding to the valid link address from the internet;
    storing the execution result of the task in a corresponding storage area and ending the current task; or generating a new pending task after an error occurs in the execution result of the task or a preset instruction is received; and/or
    extracting valid links from the web page content; and/or
    performing first preset processing on the web page content, wherein the first preset processing includes web page screening and/or link screening; and/or
    performing second preset processing on the valid links, wherein the second preset processing includes transforming, deleting and/or adding.
  9. The method according to claim 7 or 8, characterized in that sending the task to the corresponding functional module includes:
    obtaining the registered address of the corresponding functional module;
    sending the task to the corresponding functional module according to the registered address.
  10. The method according to claim 9, characterized in that the method further includes:
    increasing the number of processing units when the waiting time of the task exceeds a preset time;
    reducing the number of processing units when the resource consumption of executing the task exceeds a preset threshold.
CN201610798817.7A 2016-08-31 2016-08-31 Network crawler system and the data processing method based on network crawler system Pending CN107784036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610798817.7A CN107784036A (en) 2016-08-31 2016-08-31 Network crawler system and the data processing method based on network crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610798817.7A CN107784036A (en) 2016-08-31 2016-08-31 Network crawler system and the data processing method based on network crawler system

Publications (1)

Publication Number Publication Date
CN107784036A true CN107784036A (en) 2018-03-09

Family

ID=61451578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610798817.7A Pending CN107784036A (en) 2016-08-31 2016-08-31 Network crawler system and the data processing method based on network crawler system

Country Status (1)

Country Link
CN (1) CN107784036A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951739A (en) * 2019-03-27 2019-06-28 北京市博汇科技股份有限公司 Video traffic processing method, device and electronic equipment
CN110377680A (en) * 2019-07-11 2019-10-25 中国水利水电科学研究院 The method of mountain flood database sharing and update based on web crawlers and semantics recognition

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1967488A (en) * 2005-11-15 2007-05-23 索尼计算机娱乐公司 Task allocation method and task allocation apparatus
US20080104113A1 (en) * 2006-10-26 2008-05-01 Microsoft Corporation Uniform resource locator scoring for targeted web crawling
US8255385B1 (en) * 2011-03-22 2012-08-28 Microsoft Corporation Adaptive crawl rates based on publication frequency
CN102999549A (en) * 2012-09-25 2013-03-27 金博 Method for realizing web crawler tasks
CN103761279A (en) * 2014-01-09 2014-04-30 北京京东尚科信息技术有限公司 Method and system for scheduling network crawlers on basis of keyword search
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN104182462A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Web crawler service system for housing library network
CN104536814A (en) * 2015-01-16 2015-04-22 北京京东尚科信息技术有限公司 Method and system for processing workflow
CN105260405A (en) * 2015-09-22 2016-01-20 北京云知声信息技术有限公司 Web crawler method and device
CN105677918A (en) * 2016-03-03 2016-06-15 浪潮软件股份有限公司 Distributed crawler architecture based on Kafka and Quartz and implementation method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
白鹤 等: ""分布式多主题网络爬虫***的研究与实现"", 《计算机工程》 *
韩璞: "《OpenStack技术原理与实战》", 1 April 2016 *
黄宇鹏 等: ""一种分布式的舆情分析***架构"", 《电信科学》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951739A (en) * 2019-03-27 2019-06-28 北京市博汇科技股份有限公司 Video traffic processing method, device and electronic equipment
CN109951739B (en) * 2019-03-27 2021-06-08 北京市博汇科技股份有限公司 Video service processing method and device and electronic equipment
CN110377680A (en) * 2019-07-11 2019-10-25 中国水利水电科学研究院 The method of mountain flood database sharing and update based on web crawlers and semantics recognition

Similar Documents

Publication Publication Date Title
CN103778254B (en) The processing method of page access data, apparatus and system
CN106897357A (en) A kind of method for crawling the network information for band checking distributed intelligence
CN108090091A (en) Web page crawl method and apparatus
CN106933871A (en) Short linking processing method, device and short linked server
CN106293794A (en) Load the methods, devices and systems of the page
CN107784036A (en) Network crawler system and the data processing method based on network crawler system
CN103888539B (en) Bootstrap technique, device and the P2P caching systems of P2P cachings
CN107948052A (en) Information crawler method, apparatus, electronic equipment and system
CN107104924A (en) The verification method and device of website backdoor file
CN106897217A (en) Method of testing and test device
CN106559447A (en) The method for processing business and system of JSLEE containers
CN107025230A (en) The processing method and processing device of web crawlers
CN109344126A (en) Processing method, device, storage medium and the electronic device of textures
CN106649357A (en) Data processing method and apparatus used for crawler program
CN106020891A (en) Page loading method and device
CN107623666A (en) The methods, devices and systems of information search
CN103248627B (en) Method, forward proxy server and system for visiting website resources
CN101645021B (en) Integrating method for multisystem single-spot logging under Java application server
CN107040427A (en) A kind of method and device of network card configuration
CN106572135A (en) Network request processing method and device
CN106484545A (en) The method and device of call subroutine
CN108268498A (en) The treating method and apparatus of batch reptile task
CN107547381A (en) A kind of ORF treating method and apparatus
CN105607928A (en) Supporting method for browser kernel and webpage display method and apparatus
CN105847363A (en) Method and system used for cross-region file sharing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180309
