CN107784036A - Network crawler system and the data processing method based on network crawler system - Google Patents
- Publication number
- CN107784036A CN201610798817.7A
- Authority
- CN
- China
- Prior art keywords
- task
- module
- functional module
- web page
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a network crawler system and a data processing method based on the network crawler system. The network crawler system includes multiple functional modules, each of which can communicate with the others. After receiving a task, any one of the functional modules determines, from the circulation information corresponding to the task, which functional modules execute the task and in what order, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order. This solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a network crawler system and a data processing method based on the network crawler system.
Background
The Internet currently holds a massive amount of information, and obtaining it requires web crawlers. Traditional web crawlers fall into two types: stand-alone and clustered.
A stand-alone crawler deploys the crawling, processing, storage and other components all on the same machine, or writes them directly into the same program. The advantages of this approach are that it is easy to deploy, migrate and maintain, and its cost is low. The disadvantages are that performance depends on a single machine, the crawler is hard to extend, and it cannot adjust automatically when it hits a performance bottleneck.
A clustered crawler deploys all programs on a cluster of machines; each machine in the cluster can be responsible for a single duty or for several duties. The advantages of this approach are that performance is configurable, system resources can be used to the fullest, the crawler configuration can scale elastically, and efficiency is higher than that of the stand-alone version. The disadvantages are that deployment is complex, the components depend closely on one another, the architecture is closed, the crawler is hard to extend and maintain, and the set-up cost is high.
No effective solution has yet been proposed for the technical problem that, in the prior art, the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
Summary of the invention
The embodiments of the present invention provide a network crawler system and a data processing method based on the network crawler system, so as to at least solve the technical problem in the prior art that the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
According to one aspect of the embodiments of the present invention, a network crawler system is provided, including multiple functional modules, where each functional module can communicate with the others. After receiving a task, any one of the functional modules determines, from the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order.
Further, the multiple functional modules include at least: a web page crawl module, for obtaining from the Internet the web page content corresponding to a valid link address; and a result treatment module, for storing the execution result of the task in the corresponding storage region and ending the task, or for generating a new pending task after an error occurs in the execution result of the task or a preset instruction is received.
Further, the multiple functional modules also include: a linkage extraction module, for extracting valid links from the web page content; and/or a Web Page Processing module, for performing first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or a link processing module, for performing second preset processing on the valid links, where the second preset processing includes transforming, deleting and/or adding links.
Further, the system also includes a central module, for storing the registered address of each functional module and capable of communicating with each functional module.
Further, each functional module includes: an address acquisition unit, for obtaining the registered address of the target functional module according to the execution order, where the target functional module is the functional module that receives the task execution result of the current functional module; a receiving unit, for receiving the task; a processing unit, for executing the task; and a sending unit, for sending the execution result of the task to the target functional module.
Further, each functional module also includes: a first resource adjustment unit, for increasing the number of processing units when the waiting time of a task exceeds a preset time; and a second resource adjustment unit, for reducing the number of processing units when the resource consumption of executing the task exceeds a preset threshold.
According to another aspect of the embodiments of the present invention, a data processing method based on a network crawler system is provided, where the network crawler system is any one of the network crawler systems in the above embodiments. The method includes: after any one of the multiple functional modules receives a task, determining, from the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task; and sending the task to the corresponding functional module, so that the functional modules execute the task in the execution order.
Further, the above method also includes: obtaining from the Internet the web page content corresponding to a valid link address; storing the execution result of the task in the corresponding storage region and ending the task, or generating a new pending task after an error occurs in the execution result of the task or a preset instruction is received; and/or extracting valid links from the web page content; and/or performing first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or performing second preset processing on the valid links, where the second preset processing includes transforming, deleting and/or adding links.
Further, sending the task to the corresponding functional module includes: obtaining the registered address of the corresponding functional module; and sending the task to the corresponding functional module according to the registered address.
Further, when the waiting time of a task exceeds a preset time, the number of processing units is increased; when the resource consumption of executing the task exceeds a preset threshold, the number of processing units is reduced.
In the embodiments of the present invention, the network crawler system in the above scheme includes multiple functional modules. After receiving a task, any functional module determines, from the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order. Because the scheme divides the network crawler system into multiple functional modules, there is no necessary coupling between the functional modules. This solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The schematic embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a structural diagram of a network crawler system according to an embodiment of the present invention;
Fig. 2 is a structural diagram of an optional network crawler system according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a network crawler system processing a task according to an embodiment of the present application; and
Fig. 4 is a flow chart of the data processing method based on the network crawler system according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the scheme of the invention, the technical scheme in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention, without creative work, shall fall within the scope of protection of the invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used can be interchanged where appropriate, so that the embodiments described here can be implemented in an order other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover a non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
Embodiment 1
The invention provides a network crawler system. Fig. 1 is a structural diagram of a network crawler system according to an embodiment of the present invention. As shown in Fig. 1, the system includes multiple functional modules, where each functional module can communicate with the others.
After receiving a task, any one of the functional modules determines, from the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order.
Specifically, the functional modules can be multiple detachable modules with no dependence between them, and each module is itself a stateless processing unit: it does not return a corresponding return value after receiving or executing a task, which makes the crawler system simpler to debug and deploy. The circulation information characterizes the order in which the functional modules execute the task. The corresponding functional module is the functional module that, according to the circulation information, executes the task next after the current functional module.
In an optional embodiment, take multiple functional modules numbered A, B, C, D and E as an example. The circulation information can store the numbers in the order in which the functional modules execute. For example, if the execution order of a task received by the system is A, C, E, D, B, the numbers are stored in the circulation information in the order A, C, E, D, B. In this example the task is received by module A; after module A executes the task it forwards the task to module C, which executes it and forwards it in turn, until every module in the circulation information has executed the received task.
It should also be noted that when the network crawler system executes a task, not every functional module in the system necessarily takes part. Taking the modules numbered A, B, C, D and E again, if the circulation information corresponding to a task stores A, C, D, B, then module E does not execute that task. In other words, whether each of the functional modules participates in executing a task, and in what order, depends on the circulation information corresponding to the task.
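The forwarding rule described above can be sketched in Python as follows. This is only an illustrative sketch: the `Module` class, the `flow` key and the in-process `registry` dictionary are names introduced here, not taken from the patent, and the sketch assumes that each module number appears at most once in the circulation information.

```python
from typing import Dict, List, Optional


def next_module(flow: List[str], current: str) -> Optional[str]:
    """Return the module number that follows `current`, or None at the end."""
    position = flow.index(current)  # assumes `current` occurs once in the flow
    return flow[position + 1] if position + 1 < len(flow) else None


class Module:
    """Hypothetical functional module that runs a task and forwards it."""

    def __init__(self, name: str, registry: Dict[str, "Module"], log: List[str]):
        self.name = name
        self.registry = registry
        self.log = log
        registry[name] = self

    def receive(self, task: dict) -> None:
        self.log.append(self.name)  # stand-in for the module's real work
        target = next_module(task["flow"], self.name)
        if target is not None:
            # Forward the task; no return value is passed back upstream.
            self.registry[target].receive(task)


log: List[str] = []
registry: Dict[str, Module] = {}
for name in "ABCDE":
    Module(name, registry, log)

# Circulation information A, C, D, B: module E never sees the task.
registry["A"].receive({"flow": ["A", "C", "D", "B"]})
print(log)
```

In a real deployment the call to `receive` would presumably be a network request to the next module's address rather than a direct method call.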
In an optional embodiment, the functional modules interact through a communication protocol; the protocols that can be used include, but are not limited to, the common HTTP protocol or TCP protocol. The communication between functional modules can be stateless, that is, the return value does not need to be processed, which improves the efficiency with which the network crawler system processes tasks.
Because all communication between modules is IP-based, a central module can manage all addresses and classify them by category. Each module can run a self-check step at startup, in which it registers its own access address with the central module and fetches all valid next-step addresses. This ensures that the current module is deployed correctly and keeps the system decentralized at run time.
In an optional embodiment, because tasks introduced from outside are uncertain, a task can be sent multiple times: after receiving its input, each functional module can send multiple copies of the request to the next functional module. This ensures that the final result is not lost. The number of copies sent can be preset according to actual needs.
It is worth noting that the above embodiment of the invention subdivides the network crawler system, splitting the whole crawl flow into multiple detachable modules. Each module executes the task according to the preset processing order of the task, so there is no dependence between modules and each module is a stateless processing unit. This makes the whole system simpler to debug and deploy. After the division into modules, the modules interact through a communication protocol, including but not limited to the common HTTP protocol or TCP protocol. All such communication is stateless, that is, the return value does not need to be processed, which can greatly speed up processing.
As can be seen from the above, the network crawler system in the above scheme includes multiple functional modules. After receiving a task, any one of the functional modules determines, from the circulation information corresponding to the task, the functional modules that execute the task and the execution order of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task in the execution order. Because the scheme divides the network crawler system into multiple functional modules, there is no necessary coupling between the functional modules. This solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
Optionally, according to the above embodiment of the application, as shown in Fig. 2, the multiple functional modules include at least:
A web page crawl module, for obtaining from the Internet the web page content corresponding to a valid link address.
Specifically, the valid URL address that is input can be a URL address carried in the task or a URL address obtained through other functional modules.
In the above system, the role of the web page crawl module is to fetch the web page at a valid input URL (Uniform Resource Locator) address and return the page content. In an optional embodiment, the web page crawl module uses the task information (URL) that is input to send an HTTP request to the website, wraps the content returned by the request, and sends it to other modules as output.
It should be noted that this module is the core module of the network crawler system and the only component that obtains information from the Internet.
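A minimal sketch of such a crawl step, using only the Python standard library, might look like the following. The task layout (`url` and `content` keys), the function names and the `User-Agent` string are assumptions made for illustration; the patent does not specify them. The fetcher is passed in as a parameter so that the wrapping logic can be exercised without network access.

```python
from typing import Callable
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Send an HTTP GET request and return the raw response body."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def crawl(task: dict, fetcher: Callable[[str], bytes] = fetch) -> dict:
    """Fetch task["url"] and wrap the page content into the outgoing task."""
    body = fetcher(task["url"])
    # Copy the incoming task so the circulation information travels with it.
    return {**task, "content": body.decode("utf-8", errors="replace")}
```

Calling `crawl({"url": "https://example.com/", "flow": [...]})` would perform a real request; substituting a stub for `fetcher` keeps the function easy to test.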
A result treatment module, for storing the execution result of the task in the corresponding storage region and ending the task, or for generating a new pending task after an error occurs in the execution result of the task or a preset instruction is received.
The result treatment module is the exit of the network crawler system. Its main role is to save the final result of the crawl and, according to the input task, store the result content in the corresponding storage medium.
It should be noted that the result treatment module does not modify or convert the result; it is only responsible for storing it. In the crawler system this module is the sole exit of the whole system. After a task has been processed by this module, usually no new task is generated and sent into the system. In some special cases, however, for example when the system finds an error while executing a task or receives a preset instruction, the result treatment module can generate a new processing task and send it to the system.
As can be seen from the above, the system obtains from the Internet, through the web page crawl module, the web page content corresponding to a valid link address and, through the result treatment module, stores the execution result of the task in the corresponding storage region and ends the task, or generates a new pending task after an error occurs in the execution result of the task or a preset instruction is received. By dividing the network crawler system into independent functional modules, the scheme solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain.
Optionally, according to the above embodiment of the application, the multiple functional modules can also include:
A linkage extraction module, for extracting valid links from the web page content; and/or
In the above system, the role of the linkage extraction module is to extract valid links from the web page content obtained by crawling. A link contains the URL address of a web page that needs to be fetched and some parameters that the HTTP request needs to include.
In an optional embodiment, after extracting these links, the linkage extraction module can package the extracted valid links and send the task composed of the packaged valid links to other modules. The linkage extraction module is a core module of the crawler system: the basis of a crawler system is links, and the pages on the network are connected to one another by links.
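As an illustration of link extraction, the sketch below collects the `href` values of anchor tags with the standard-library HTML parser and resolves them against the page URL. Treating only absolute `http` and `https` URLs as valid links is an assumption of this sketch; the patent does not define validity, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> List[str]:
    """Return the links in `html` that this sketch treats as valid."""
    parser = LinkParser(base_url)
    parser.feed(html)
    # Assumption: a valid link is an absolute http(s) URL.
    return [url for url in parser.links if url.startswith(("http://", "https://"))]
```

The returned list could then be packaged into a new task and forwarded to the next module named in the circulation information.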
A Web Page Processing module, for performing first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or
Specifically, performing web page screening on the web page content can mean screening out the useful web page content and/or links according to the task.
For example, the web page content obtained by the web page crawl module contains a large amount of information. Taking the web pages in the web page content as an example, they include useful web pages that are needed and useless web pages that are not; advertising content in a web page counts as useless. Through the web page screening of the Web Page Processing module, the needed useful web pages can be screened out of the large amount of information and invalid web pages such as advertising information can be removed. Taking the links in the web page content as another example, the links can be divided into valid links and invalid links, and valid links are further divided into useful links and useless links. The system needs only the useful links among the valid links, so the Web Page Processing module can screen the links in the web page content, select the useful links, package them and forward them to the next functional module.
It should be noted that because the Web Page Processing module performs the first preset processing on web page content, if the functional modules that execute a task include the Web Page Processing module, it normally comes after the web page crawl module in the order.
A link processing module, for performing second preset processing on the valid links, where the second preset processing includes transforming, deleting and/or adding links.
Specifically, the role of the link processing module is to process the extracted links and generate new crawl tasks.
It is worth noting that the link processing module and the linkage extraction module handle links from different angles. The link processing module can transform, delete and add links. The input it receives can be a single link or a set of links, and after processing by this module the output can be the tasks required by the web page crawl module. According to the link content, the link processing module can effectively reduce the number of useless links, which improves the crawl efficiency of the crawler and reduces the load on the whole crawler system.
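The three operations named above (transform, delete, add) can be illustrated with a small sketch. The specific rules used here, namely dropping URL fragments, discarding image links and duplicates, and appending extra seed links, are assumptions chosen for the example; the patent leaves the concrete rules open.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urldefrag


def process_links(links: Iterable[str],
                  blocked_suffixes: Tuple[str, ...] = (".jpg", ".png"),
                  extra: Iterable[str] = ()) -> List[str]:
    """Transform, delete and add links, keeping first-seen order."""
    seen = set()
    result: List[str] = []
    for link in list(links) + list(extra):            # add: append extra links
        url, _fragment = urldefrag(link)              # transform: drop "#fragment"
        if url.lower().endswith(blocked_suffixes):    # delete: unwanted file types
            continue
        if url in seen:                               # delete: duplicates
            continue
        seen.add(url)
        result.append(url)
    return result
```

Each URL in the returned list could become one crawl task for the web page crawl module.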
As can be seen from the above, through the linkage extraction module, the Web Page Processing module and the link processing module, the network crawler system of the application not only solves the technical problem in the prior art that the components of a web crawler are either deployed on the same machine or deployed on multiple machines but tightly coupled, which makes the crawler hard to extend and maintain, but also achieves the technical effect of improving the crawl efficiency of the system.
Fig. 3 is a structural diagram of a network crawler system processing a task according to an embodiment of the present application. An embodiment in which the above network crawler system executes a task is described below with reference to the example shown in Fig. 3.
In an optional embodiment, with reference to the example shown in Fig. 3, the processing order recorded in the circulation information is that the link processing module, web page crawl module, Web Page Processing module and linkage extraction module execute the task in a loop multiple times, and the final result is then sent to the result treatment module.
It should be noted that in the above example the network crawler system can include all of the functional modules in the above embodiments, or only some of them, but it includes at least the web page crawl module and the result treatment module. Two optional architectures of the network crawler system are described below as examples:
Mode one: the network crawler system includes a link processing module, a web page crawl module, a Web Page Processing module, a linkage extraction module and a result treatment module.
When the network crawler system uses the architecture of mode one, the circulation information corresponding to a task can contain all of the functional modules or only some of them.
Mode two: the network crawler system includes a web page crawl module, a linkage extraction module and a result treatment module.
When the network crawler system uses the architecture of mode two, the task is executed according to the circulation information corresponding to the task.
The two modes above are only examples and do not cover all of the network crawler systems protected by the present invention.
It should also be noted that when the network crawler system obtains a task, it can also obtain the information needed to run the task, for example the number of levels to crawl, the header information used for crawling and the crawl strategy used.
The way the network crawler system processes a task according to the circulation information is described below through an optional embodiment.
The circulation information can take the form of a stored list of the identifiers of multiple functional modules. In an example where the functional modules include a web page crawl module, a result treatment module, a linkage extraction module, a Web Page Processing module and a link processing module, suppose their identifiers are 01A, 02B, 03C, 04D and 05E respectively. If the circulation information contains 03C, 05E, 04D and 02B, then the processing order when the network crawler system handles the task corresponding to this circulation information is: linkage extraction module, link processing module, Web Page Processing module, result treatment module.
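The identifier example above amounts to a table lookup, which the following short sketch reproduces. The dictionary and function names are introduced here for illustration only; the identifiers and module names are the ones given in the example.

```python
from typing import Dict, List

# Identifiers from the example, in the order the modules are listed there.
MODULE_BY_ID: Dict[str, str] = {
    "01A": "web page crawl module",
    "02B": "result treatment module",
    "03C": "linkage extraction module",
    "04D": "Web Page Processing module",
    "05E": "link processing module",
}


def processing_order(circulation: List[str]) -> List[str]:
    """Translate the identifiers in the circulation information to module names."""
    return [MODULE_BY_ID[identifier] for identifier in circulation]


print(processing_order(["03C", "05E", "04D", "02B"]))
```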
Optionally, according to the above embodiment of the application, the system also includes:
A central module, for storing the registered address of each functional module and capable of communicating with each functional module.
When the circulation information corresponding to a task does not include the registered addresses of the functional modules, a central module can manage the registered addresses of all functional modules and classify them by category. In an optional embodiment, each functional module runs a self-check step at startup, in which it registers its own access address with the central module and fetches all valid next-step addresses. This ensures that the current module is deployed correctly and keeps the system decentralized at run time. A functional module starts only once it has obtained the address of at least one next functional module; otherwise it keeps sending requests to the central module. The input each functional module receives with a task is structured information, which can use, but is not limited to, formats such as JSON or XML.
In an optional embodiment, the network crawler system includes a link processing module, a web page crawl module, a Web Page Processing module, a linkage extraction module, a result treatment module and a central module. The central module stores the registered addresses of the functional modules, and a functional module can obtain from the central module the address of the next functional module in the circulation information of a task in order to pass on the task result. All functional modules can therefore communicate with one another by obtaining registered addresses. Before this network crawler system runs tasks, an initialization process is needed in which the central module obtains the registered address of each functional module. The network crawler system is started by injecting a task into any one of the functional modules. After each module executes the task and obtains a result, it can obtain from the central module the registered address of the next functional module in the circulation information of the task and send the result to that module.
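The registration and lookup behaviour of the central module, together with the startup self-check of a functional module, can be sketched as an in-memory registry. The class name, method names, category labels and addresses below are hypothetical; a real central module would presumably expose these operations over HTTP or TCP.

```python
from collections import defaultdict
from typing import Dict, List


class CentralModule:
    """Hypothetical registry of functional-module addresses, grouped by category."""

    def __init__(self) -> None:
        self._addresses: Dict[str, List[str]] = defaultdict(list)

    def register(self, category: str, address: str) -> None:
        if address not in self._addresses[category]:
            self._addresses[category].append(address)

    def lookup(self, category: str) -> List[str]:
        # .get avoids creating an empty entry for unknown categories.
        return list(self._addresses.get(category, []))


def self_check(central: CentralModule, own_category: str,
               own_address: str, next_category: str) -> List[str]:
    """Register this module, then fetch the addresses of the next step.

    An empty list means the module is not ready to start and should ask again.
    """
    central.register(own_category, own_address)
    return central.lookup(next_category)
```

A module would repeat `self_check` until it returns at least one address, matching the rule that a functional module starts only after obtaining a next-step address.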
Fig. 2 is a structural diagram of an optional network crawler system according to an embodiment of the present invention. In an optional embodiment, the network crawler system includes a web page crawl module, a result treatment module, a linkage extraction module, a Web Page Processing module, a link processing module and a central module. Besides storing the registered addresses of the functional modules, the central module can also be used for monitoring, scheduling and health checks of the network crawler system, and is the channel through which technical staff learn the running state of the network crawler system.
Optionally, according to the above embodiment of the application, each functional module includes:
An address acquisition unit, for obtaining the registered address of the target functional module according to the execution order, where the target functional module is the functional module that receives the task execution result of the current functional module.
Specifically, the target functional module is the next module after the current functional module in the task execution order indicated by the circulation information, and it receives the result output by the current functional module after executing the task.
It should be noted that when the circulation information includes the registered address of each functional module, the address acquisition unit can obtain the registered address of the target functional module from the circulation information. When the circulation information does not include the registered addresses, the address acquisition unit first determines the target functional module from the circulation information and can then obtain its registered address from the central module.
A receiving unit, for receiving the task.
Specifically, the receiving unit receives a task passed in from outside and calls the processing unit according to the incoming task.
A processing unit, for executing the task.
Specifically, the processing unit executes the task received by the receiving unit and passes the result to the sending unit.
A sending unit, for sending the execution result of the task to the target functional module.
Specifically, the sending unit sends the processing result to the target functional module according to the registered address obtained by the address acquisition unit.
Optionally, functional module can also include initialization unit, for being sent in transmitting element to all functional modules
During the implementing result failure of task, function of initializing module, with the new registered address of request target functional module.
In an optional embodiment, when the sending unit fails to send to all sending addresses, the functional module suspends reception of tasks and starts the initialization unit, which sends all the failed addresses to the central module and requests new addresses. After receiving the request, the central module updates its own address list and returns the latest list to each functional module.
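The failure handling described above might look like the following sketch. It is only an illustration under stated assumptions: `send_with_reinit`, `fake_send` and `fake_central` are hypothetical names, the network and the central module are replaced by in-memory stand-ins, pausing task reception is omitted, and retrying exactly once with the fresh list is a choice made here rather than something the text prescribes.

```python
from typing import Callable, List


def send_with_reinit(
    result: str,
    addresses: List[str],
    send: Callable[[str, str], bool],
    request_new_addresses: Callable[[List[str]], List[str]],
) -> str:
    """Try each known address; if all fail, report them and retry with a fresh list."""
    for address in addresses:
        if send(address, result):
            return address
    # Every known address failed: report the failures and ask for new addresses.
    fresh = request_new_addresses(list(addresses))
    for address in fresh:
        if send(address, result):
            return address
    raise ConnectionError("no reachable address for the target functional module")


# In-memory stand-ins for the network and the central module (assumptions).
reachable = {"10.0.0.9:7002"}
reported: List[str] = []


def fake_send(address: str, result: str) -> bool:
    return address in reachable


def fake_central(failed: List[str]) -> List[str]:
    reported.extend(failed)      # the central module learns which addresses failed
    return ["10.0.0.9:7002"]     # and returns its updated address list


used = send_with_reinit(
    "page-result", ["10.0.0.2:7002", "10.0.0.3:7002"], fake_send, fake_central
)
```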
Optionally, according to the above embodiment of the present application, each functional module also includes:
A first resource adjustment unit, for increasing the number of processing units when the waiting time of a task exceeds a preset time.
When the waiting time of a task exceeds the preset time, the load on the processing units in the functional module handling the task is high. When the load on a functional module is too high, one optional approach is to add processing units to the functional module. A newly added processing unit can be started directly and is automatically added to the functional module, which reduces the load on each processing unit and improves data processing efficiency.
A second resource adjustment unit, for reducing the number of processing units when the resource consumption of executing the task exceeds a predetermined threshold.
When the resource consumption of executing the task exceeds the predetermined threshold, the functional module is considered not to need its current number of processing units to handle the current task, so the processing units in the functional module can be reduced in order to lower resource consumption.
It can be seen from the above that the system of the present application achieves dynamic capacity scaling of the network crawler system's functional modules through the first resource adjustment unit and the second resource adjustment unit.
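The two adjustment rules can be condensed into a small decision function, sketched below. The function name, the default thresholds, the one-unit step size, the lower bound of one unit, and the precedence given to the waiting-time rule when both conditions hold are all assumptions made for illustration; the text only states the two conditions and the direction of each adjustment.

```python
def adjust_unit_count(
    units: int,
    wait_seconds: float,
    resource_usage: float,
    max_wait_seconds: float = 5.0,   # assumed "preset time"
    usage_threshold: float = 0.8,    # assumed "predetermined threshold"
    min_units: int = 1,
) -> int:
    """Return the new number of processing units for one functional module."""
    if wait_seconds > max_wait_seconds:
        # First resource adjustment unit: tasks wait too long, so add a unit.
        return units + 1
    if resource_usage > usage_threshold and units > min_units:
        # Second resource adjustment unit: consumption is too high, so remove a unit.
        return units - 1
    return units


scaled_out = adjust_unit_count(units=2, wait_seconds=10.0, resource_usage=0.1)
scaled_in = adjust_unit_count(units=3, wait_seconds=1.0, resource_usage=0.9)
```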
Embodiment 2
According to an embodiment of the present invention, a method embodiment of data processing for a network crawler system is provided. It should be noted that the steps illustrated in the flow chart of the accompanying drawings can be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flow chart, in some cases the steps shown or described can be performed in an order different from the one given here.
Fig. 4 is a flow chart of the data processing method based on a network crawler system according to an embodiment of the present invention. The network crawler system is any one of the network crawler systems in Embodiment 1. As shown in Fig. 4, the data processing method includes:
Step S402: after receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution sequence of the task.
Step S404: the task is sent to the corresponding functional module, so that the functional modules execute the task according to the execution sequence.
Specifically, the multiple functional modules can be multiple detachable modules with no dependencies between them, and each module is itself a stateless processing unit; that is, a module does not return a corresponding return value after receiving or executing a task, which makes the crawler system simpler to debug and deploy. The circulation information characterizes the execution sequence in which the multiple functional modules execute the task; the corresponding functional module is the functional module, recorded in the circulation information, that executes the task next after the current functional module.
In an optional embodiment, with reference to the example shown in Fig. 3, the processing order recorded in the circulation information is that the link processing module, the web page crawl module, the web page processing module and the link extraction module execute the task repeatedly in a loop, after which the final result is sent to the result processing module.
It should be noted here that, in the above example, the crawler system can include all of the functional modules of the above embodiment or only some of them, but it includes at least the web page crawl module, the link extraction module and the result processing module. Two optional network crawler system architectures are described below as examples:
Mode one: the network crawler system includes a link preprocessing module, a web page crawl module, a web page processing module, a link extraction module and a result processing module.
Where the network crawler system uses the architecture of mode one, the circulation information corresponding to a task can include all of the functional modules or only some of them.
Mode two: the network crawler system includes a web page crawl module, a link extraction module and a result processing module.
Where the network crawler system uses the architecture of mode two, the task is executed according to the circulation information corresponding to the task.
The two modes above are only examples and do not cover all of the network crawler systems protected by the present invention.
It should also be noted here that, when the network crawler system obtains a task, it can also obtain the information needed to run the task, for example the crawl depth, the header information used when crawling and the crawl strategy used when crawling.
The way in which the network crawler system processes a task according to the circulation information is described below through an optional embodiment.
The circulation information can take the form of a stored list of identifiers of multiple functional modules. In an example in which the multiple functional modules are the web page crawl module, the result processing module, the link extraction module, the web page processing module and the link processing module, suppose the identifiers of these functional modules are 01A, 02B, 03C, 04D and 05E respectively. If the circulation information contains 03C, 05E, 04D and 02B, then the processing order when the network crawler system handles the task corresponding to this circulation information is: link extraction module, link processing module, web page processing module and result processing module.
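The identifier example above can be reproduced with a short lookup. The identifiers and module names come from the text; the `MODULE_IDS` dictionary and the `processing_order` function are hypothetical names used only for this sketch.

```python
from typing import Dict, List

# Identifier assigned to each functional module in the example.
MODULE_IDS: Dict[str, str] = {
    "01A": "web page crawl module",
    "02B": "result processing module",
    "03C": "link extraction module",
    "04D": "web page processing module",
    "05E": "link processing module",
}


def processing_order(circulation_info: List[str]) -> List[str]:
    """Translate the identifiers in the circulation information into module names."""
    return [MODULE_IDS[identifier] for identifier in circulation_info]


order = processing_order(["03C", "05E", "04D", "02B"])
```

Because the order is carried by the task rather than fixed in the modules, a different task can use a different list without any change to the modules themselves.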
It should be noted here that all modules in the network crawler system are stateless, so in theory a task can be started by injecting it into any module; that is, any functional module can serve as the initial functional module. Under normal circumstances, however, crawling starts from a web page link, so the initial functional module is usually the link preprocessing module. When a link is injected, the information needed to run the task can also be added, such as the crawl depth, the header information used when crawling and the strategy used when crawling. Each module can start multiple instances, so a task can be injected into any one task processing module at random, or into several processing modules. The destination of the task is determined internally by the processing modules.
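A task that carries its own run information, together with random injection into one of several module instances, could be modelled as in the sketch below. The `CrawlTask` fields, their default values, the instance names and the `pick_instance` helper are assumptions for illustration; the text names only the kinds of information (crawl depth, header information, crawl strategy) and the possibility of random injection.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CrawlTask:
    """Hypothetical task record: the link plus the information needed to run it."""
    url: str
    circulation: List[str]
    crawl_depth: int = 1
    headers: Dict[str, str] = field(default_factory=dict)
    strategy: str = "breadth_first"


def pick_instance(instances: List[str], rng: random.Random) -> str:
    """Choose one running instance of the initial module at random."""
    return rng.choice(instances)


task = CrawlTask(
    url="https://example.com/",
    circulation=["link_preprocessing", "web_page_crawl", "result_processing"],
    crawl_depth=2,
    headers={"User-Agent": "example-crawler"},
)
instance = pick_instance(["link-pre-0", "link-pre-1"], random.Random(0))
```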
It should also be noted here that each of the multiple functional modules can include multiple processing units. When a task is injected into a functional module, it can therefore be injected into one processing unit of the functional module or into several of its units. Injecting multiple copies of the task ensures that the task will be executed by the functional module.
It can be seen from the above that the network crawler system in the above scheme of the present application includes multiple functional modules. When executing a task, the network crawler system obtains the circulation information corresponding to the task, and the functional modules in the network crawler system then execute the task in the processing order contained in the circulation information. By dividing the network crawler system into multiple functional modules, the scheme ensures that there is no necessary coupling between the functional modules, thereby solving the technical problem in the prior art that the components of a web crawler are deployed on the same machine, or are deployed on multiple machines but are tightly coupled, which makes the crawler difficult to extend and maintain.
In one embodiment, the data processing method based on a network crawler system can include the following steps during data processing:
According to a valid link address, obtaining from the Internet the web page content corresponding to the valid link address;
Storing the execution result of the task in the corresponding storage region and ending the current task; or, after an error occurs in the execution result of the task or a preset instruction is received, generating a new task to be processed; and/or
Extracting valid links from the web page content; and/or
Performing first preset processing on the web page content, where the first preset processing includes web page screening and/or link screening; and/or
Performing second preset processing on the valid links, where the second preset processing includes transforming, deleting and/or adding links.
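One possible reading of the second preset processing (transforming, deleting and adding links) is sketched below. The text does not define these operations, so the specific choices here are assumptions: transforming is taken to mean resolving a link to an absolute URL and dropping its fragment, deleting means discarding links that match a blocked prefix, and adding means appending extra seed links. The function name and the example URLs are hypothetical.

```python
from typing import Iterable, List, Set
from urllib.parse import urldefrag, urljoin


def process_links(
    base_url: str,
    links: Iterable[str],
    blocked_prefixes: Iterable[str],
    extra_links: Iterable[str],
) -> List[str]:
    """Transform, delete and add links, keeping first-seen order without duplicates."""
    blocked = list(blocked_prefixes)
    processed: List[str] = []
    seen: Set[str] = set()
    for link in list(links) + list(extra_links):          # add: append extra links
        absolute, _fragment = urldefrag(urljoin(base_url, link))  # transform
        if any(absolute.startswith(prefix) for prefix in blocked):
            continue                                       # delete: drop blocked links
        if absolute not in seen:
            seen.add(absolute)
            processed.append(absolute)
    return processed


links = process_links(
    "https://example.com/news/",
    ["a.html#top", "/login", "a.html"],
    blocked_prefixes=["https://example.com/login"],
    extra_links=["https://example.com/sitemap.xml"],
)
```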
Optionally, according to the above embodiment of the present application, sending the task to the corresponding functional module includes:
Step S4041: obtaining the registered address of the corresponding functional module.
Step S4043: sending the task to the corresponding functional module according to the registered address.
In the above steps, where the circulation information includes the registered address of each functional module, the registered address of the target functional module can be obtained from the circulation information; where the circulation information does not include the registered address of each functional module, the target functional module is identified from the circulation information and its registered address can then be obtained from the central module.
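The two ways of obtaining the registered address in step S4041 can be sketched as a single lookup with a fallback. The representation of a circulation entry as a dictionary with a `module` key and an optional `address` key, and the plain dictionary standing in for the central module, are assumptions made for this sketch.

```python
from typing import Dict, Optional


def resolve_address(step: Dict[str, Optional[str]], central_registry: Dict[str, str]) -> str:
    """Prefer the address carried in the circulation information, else ask the central module."""
    address = step.get("address")
    if address is not None:
        return address
    return central_registry[step["module"]]


# Stand-in for the addresses stored by the central module.
central_registry = {"link_extraction": "10.0.0.2:7002"}

# Circulation entry that already carries the registered address.
with_address = resolve_address(
    {"module": "link_extraction", "address": "10.0.0.8:7002"}, central_registry
)
# Circulation entry that only names the target functional module.
without_address = resolve_address({"module": "link_extraction"}, central_registry)
```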
Optionally, according to the above embodiment of the present application, the functional module executing the task includes:
Step S4045: receiving and executing the task.
Step S4047: sending the execution result of the task to the target functional module, where the target functional module is the functional module that receives the task execution result of the current functional module.
Specifically, sending the execution result of the task to the target functional module means sending the execution result to the registered address of the target functional module.
Optionally, when the sending unit fails to send the execution result of the task to all functional modules, the functional module can also be initialized in order to request a new registered address of the target functional module.
In an optional embodiment, when sending to all sending addresses fails, the functional module suspends reception of tasks and starts the initialization unit, which sends all the failed addresses to the central module and requests new addresses. After receiving the request, the central module updates its own address list and returns the latest list to each functional module.
In an optional embodiment, after a functional module completes processing, it can call a sending interface to send out the processing result. The sending addresses are the addresses obtained from the central module. When sending to all sending addresses fails, the functional module suspends reception of tasks, sends all the failed addresses to the central module and requests new addresses. After receiving the request, the central module updates its own address list and returns the latest list.
It should be noted here that, where the circulation information of a task includes the registered addresses of the functional modules, the first functional module obtains the registered address of the second functional module directly from the circulation information. In this case, if the first functional module fails to send the result to the second functional module according to the registered address of the second functional module, the circulation information of the task can be updated and the updated circulation information obtained.
Optionally, according to the above embodiment of the present application, the method also includes:
Increasing the number of processing units when the waiting time of a task exceeds a preset time.
When the waiting time of a task exceeds the preset time, the load on the processing units in the functional module handling the task is high. When the load on a functional module is too high, one optional approach is to add processing units to the functional module. A newly added processing unit can be started directly and is automatically added to the functional module, which reduces the load on each processing unit and improves crawling efficiency.
Reducing the number of processing units when the resource consumption of executing the task exceeds a predetermined threshold.
When the resource consumption of executing the task exceeds the predetermined threshold, the functional module is considered not to need its current number of processing units to handle the current task, so the processing units in the functional module can be reduced in order to lower resource consumption.
It can be seen from the above that the method of the present application flexibly increases or decreases the number of processing units according to the load on the processing units in a functional module, thereby achieving dynamic capacity scaling of the network crawler system's functional modules.
The multiple functional modules of the network crawler system also include a processor and a memory. The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels can be provided. By adjusting kernel parameters, the web crawler is divided into multiple functional modules, thereby solving the technical problem in the prior art that the components of a web crawler are deployed on the same machine, or are deployed on multiple machines but are tightly coupled, which makes the crawler difficult to extend and maintain.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: after receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution sequence of the task; and the task is sent to the corresponding functional module, so that the functional modules execute the task according to the execution sequence.
The numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are only illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (10)
- 1. A network crawler system, characterized by comprising: multiple functional modules, wherein the functional modules can communicate with one another; after receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution sequence of the task, and sends the task to the corresponding functional module, so that the functional modules execute the task according to the execution sequence.
- 2. The system according to claim 1, characterized in that the multiple functional modules include at least: a web page crawl module, for obtaining from the Internet, according to a valid link address, the web page content corresponding to the valid link address; and a result processing module, for storing the execution result of the task in the corresponding storage region and ending the current task, or for generating a new task to be processed after an error occurs in the execution result of the task or a preset instruction is received.
- 3. The system according to claim 2, characterized in that the multiple functional modules also include: a link extraction module, for extracting valid links from the web page content; and/or a web page processing module, for performing first preset processing on the web page content, wherein the first preset processing includes web page screening and/or link screening; and/or a link processing module, for performing second preset processing on the valid links, wherein the second preset processing includes transforming, deleting and/or adding.
- 4. The system according to any one of claims 1 to 3, characterized in that the system also includes: a central module, for storing the registered address of each functional module and capable of communicating with each functional module.
- 5. The system according to claim 4, characterized in that each functional module includes: an address acquisition unit, for obtaining the registered address of a target functional module according to the execution sequence, the target functional module being the functional module that receives the task execution result of the current functional module; a receiving unit, for receiving the task; a processing unit, for executing the task; and a sending unit, for sending the execution result of the task to the target functional module.
- 6. The system according to claim 5, characterized in that each functional module also includes: a first resource adjustment unit, for increasing the number of processing units when the waiting time of the task exceeds a preset time; and a second resource adjustment unit, for reducing the number of processing units when the resource consumption of executing the task exceeds a predetermined threshold.
- 7. A data processing method based on a network crawler system, characterized in that the network crawler system is the network crawler system according to any one of claims 1 to 6, and the method includes: after receiving a task, any functional module among the multiple functional modules determines, according to the circulation information corresponding to the task, the functional modules that execute the task and the execution sequence of the task; and the task is sent to the corresponding functional module, so that the functional modules execute the task according to the execution sequence.
- 8. The method according to claim 7, characterized in that the method also includes: according to a valid link address, obtaining from the Internet the web page content corresponding to the valid link address; storing the execution result of the task in the corresponding storage region and ending the current task, or generating a new task to be processed after an error occurs in the execution result of the task or a preset instruction is received; and/or extracting valid links from the web page content; and/or performing first preset processing on the web page content, wherein the first preset processing includes web page screening and/or link screening; and/or performing second preset processing on the valid links, wherein the second preset processing includes transforming, deleting and/or adding.
- 9. The method according to claim 7 or 8, characterized in that sending the task to the corresponding functional module includes: obtaining the registered address of the corresponding functional module; and sending the task to the corresponding functional module according to the registered address.
- 10. The method according to claim 9, characterized in that the method also includes: increasing the number of processing units when the waiting time of the task exceeds a preset time; and reducing the number of processing units when the resource consumption of executing the task exceeds a predetermined threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610798817.7A CN107784036A (en) | 2016-08-31 | 2016-08-31 | Network crawler system and the data processing method based on network crawler system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610798817.7A CN107784036A (en) | 2016-08-31 | 2016-08-31 | Network crawler system and the data processing method based on network crawler system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107784036A true CN107784036A (en) | 2018-03-09 |
Family
ID=61451578
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610798817.7A Pending CN107784036A (en) | 2016-08-31 | 2016-08-31 | Network crawler system and the data processing method based on network crawler system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107784036A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1967488A (en) * | 2005-11-15 | 2007-05-23 | 索尼计算机娱乐公司 | Task allocation method and task allocation apparatus |
US20080104113A1 (en) * | 2006-10-26 | 2008-05-01 | Microsoft Corporation | Uniform resource locator scoring for targeted web crawling |
US8255385B1 (en) * | 2011-03-22 | 2012-08-28 | Microsoft Corporation | Adaptive crawl rates based on publication frequency |
CN102999549A (en) * | 2012-09-25 | 2013-03-27 | 金博 | Method for realizing web crawler tasks |
CN103761279A (en) * | 2014-01-09 | 2014-04-30 | 北京京东尚科信息技术有限公司 | Method and system for scheduling network crawlers on basis of keyword search |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN104182462A (en) * | 2014-07-21 | 2014-12-03 | 安徽华贞信息科技有限公司 | Web crawler service system for housing library network |
CN104536814A (en) * | 2015-01-16 | 2015-04-22 | 北京京东尚科信息技术有限公司 | Method and system for processing workflow |
CN105260405A (en) * | 2015-09-22 | 2016-01-20 | 北京云知声信息技术有限公司 | Web crawler method and device |
CN105677918A (en) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
Non-Patent Citations (3)
Title |
---|
白鹤 等: ""分布式多主题网络爬虫***的研究与实现"", 《计算机工程》 * |
韩璞: "《OpenStack技术原理与实战》", 1 April 2016 * |
黄宇鹏 等: ""一种分布式的舆情分析***架构"", 《电信科学》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109951739A (en) * | 2019-03-27 | 2019-06-28 | 北京市博汇科技股份有限公司 | Video traffic processing method, device and electronic equipment |
CN109951739B (en) * | 2019-03-27 | 2021-06-08 | 北京市博汇科技股份有限公司 | Video service processing method and device and electronic equipment |
CN110377680A (en) * | 2019-07-11 | 2019-10-25 | 中国水利水电科学研究院 | The method of mountain flood database sharing and update based on web crawlers and semantics recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180309 |