CN103870329B

CN103870329B - Distributed crawler task scheduling method based on weighted round-robin algorithm

Info

Publication number: CN103870329B
Application number: CN201410073829.4A
Authority: CN
Inventors: 蒋昌俊; 陈闳中; 闫春钢; 丁志军; 王鹏伟; 孙海春; 邓晓栋; 葛大劼
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2014-03-03
Filing date: 2014-03-03
Publication date: 2017-01-18
Anticipated expiration: 2034-03-03
Also published as: CN103870329A

Abstract

A distributed crawler task scheduling method based on a weighted round-robin algorithm at least includes the following steps: (1) according to different scales, network crawlers are divided into five types of crawlers, i.e. a stand-alone multi-thread crawler, a homogeneous centralized crawler, a heterogeneous centralized crawler, a small-scale distributed crawler and a large-scale distributed crawler; (2) a master-slave architecture is deployed; (3) when a crawler node is connected to a master node for the first time, the master node gives an initial weight to the crawler node; (4) according to the scheduling algorithm based on weighted round-robin, the master node continuously chooses a crawler node and assigns a URL (Uniform Resource Locator) task to be crawled to the crawler node; (5) each time when a URL task is crawled by a crawler node, a result is returned to the master node, and the weight of the crawler node is updated by the master node, and the like. The distributed crawler scheduling policy based on the weighted round-robin algorithm, which is put forward by the invention, is designed for small-scale distributed crawlers and can ensure the load balance of each crawler node and ensure that crawler nodes have flexible scalability and fault tolerance.

Description

Distributed crawler task scheduling method based on weighted round-robin algorithm

Technical field

The present invention relates to the technical field of web search.

Background art

A search engine consists of several parts, such as the crawler, the indexer, the searcher, and the user interface. The crawler is responsible for continuously finding and collecting information on the Internet, and plays an important role in a search engine. With the rapid development of the network, the amount of information has grown explosively, and the crawling capacity of traditional stand-alone and centralized web crawlers can no longer keep up with the growth rate of Internet information. As distributed computing is discussed more and more widely, distributed crawlers have naturally become a solution to the problem of large data volumes. A distributed crawler consists of multiple nodes deployed across a wide area network that can crawl in parallel, meeting the demand for crawling capacity. Because the crawling ability of each node differs, a good scheduling strategy is indispensable. Crawlers of different scales use different scheduling algorithms; the mainstream ones are:

(1) Hash scheduling

A hash function is a mapping that converts a string, a number, or other information into an index value. Most early crawler systems in fact used this approach: the URL is the input of the hash function, and the value obtained from the hash function is the output of the scheduling. Such a scheduling strategy is very easy to compute and has little system overhead; at the same time, the mathematical randomness of the hash function ensures that tasks are distributed evenly among crawler nodes.
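The hash scheduling described above can be sketched as follows. This is a minimal illustration, not the method of any particular crawler: the function name `assign_node` and the choice of MD5 as the hash function are assumptions, since the text does not name a specific hash.

```python
import hashlib


def assign_node(url: str, node_count: int) -> int:
    """Map a URL to a crawler node index by hashing (hypothetical helper)."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    # MD5 is used only as a well-mixed hash here; it is an assumed choice.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the node count.
    return int.from_bytes(digest[:8], "big") % node_count
```

The same URL always maps to the same node, and no shared state is needed, which is why the overhead is small. The mapping ignores how capable or how busy each node is.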

(2) Central load scheduling

Take the extensively improved Peking University Tianwang crawler as an example. It uses a centrally controlled model whose overall framework is one master node working together with several crawler nodes. Its task scheduling works as follows: the master node is responsible for distributing URLs, and the crawler nodes are responsible for crawling them. Each website is handled by one crawler program, which crawls all URLs on that website. A crawler node can run multiple crawler programs, but each crawler program must run on exactly one crawler node. The master node starts distributing from the seed URLs. For each URL whose website has no crawler program started yet, it finds a crawler node according to some load-balancing principle, sends the URL to that node, and asks it to start a new crawler program. From then on, all URLs of that website are distributed to that crawler node and crawled by that crawler program.
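A rough sketch of this per-website assignment is given below. The class name `SiteScheduler` is hypothetical, and "fewest websites already assigned" stands in for the unspecified load-balancing principle; both are assumptions made for illustration.

```python
from urllib.parse import urlsplit


class SiteScheduler:
    """Assign every URL of a website to the node that owns that website."""

    def __init__(self, node_ids):
        # Number of websites currently owned by each node (assumed load measure).
        self.load = {node_id: 0 for node_id in node_ids}
        self.site_owner = {}

    def assign(self, url):
        site = urlsplit(url).netloc
        if site not in self.site_owner:
            # First URL of this website: pick the least-loaded node.
            node = min(self.load, key=self.load.get)
            self.site_owner[site] = node
            self.load[node] += 1
        return self.site_owner[site]
```

With nodes `"a"` and `"b"`, the first two websites seen go to `"a"` and `"b"` respectively, and later URLs of either website follow their site's owner.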

(3) Scheduling according to network location

In a large-scale search engine, the crawler nodes are deployed all over the world, so computing network location is very important. The basic idea of the scheduling strategy in such a crawler is to use an algorithm such as GNP: by measuring the network distances between a small, predetermined set of websites and the crawler nodes, it estimates the network distances between the large number of other nodes. The predicted network distances are then used to calculate the time each crawler node needs to crawl the web page of a URL, and the crawler node with the smallest time overhead is set as the scheduling target for that URL. Such a scheme schedules crawler tasks effectively according to network distance while reducing the time overhead of large-scale Internet measurement.

Summary of the invention

The distributed crawler scheduling strategy based on the weighted round-robin algorithm proposed by the present invention is designed for small-scale distributed crawlers. Because its idea is similar to the central load scheduling strategy, it is also suitable for heterogeneous centralized crawlers. It balances the load across crawler nodes and gives the crawler nodes flexible scalability and fault tolerance.

The technical solution of the method of the invention is characterized as follows:

A distributed crawler task scheduling method based on a weighted round-robin algorithm, characterized in that it is implemented according to the following steps in sequence:

1) According to scale, the present invention divides web crawlers into five classes: stand-alone multi-threaded, homogeneous centralized, heterogeneous centralized, small-scale distributed, and large-scale distributed. This crawler task scheduling method is intended for small-scale distributed crawlers. In a small-scale distributed crawler, the nodes are deployed in a distributed manner but still within a small physical region, so the network delay from each node to the Internet differs little. However, transmission between nodes does not necessarily take place in a LAN environment, so transmission may be unreliable and transmission delay must also be taken into account.

2) A master-slave architecture is deployed, i.e. one master node and several crawler nodes that are deployed in a distributed manner and can communicate with the master node; all crawler nodes must be able to connect to the Internet. The master node is responsible for scheduling crawler tasks, i.e. deciding which crawler node a URL to be crawled should be assigned to, and for deduplication, i.e. deduplicating the outlinks obtained from a URL returned by a crawler node and treating the remaining ones as new URLs to be crawled. The crawler node is responsible for the actual crawling: for each URL assigned to it by the master node, it fetches the complete HTML from the Internet, parses the outlinks contained in the page, and returns this information to the master node.

3) When a crawler node connects to the master node for the first time, the master node gives it an empirical value as its initial weight.

4) The master node, according to the weighted round-robin-based scheduling algorithm proposed by the present invention, repeatedly selects a crawler node and assigns it one URL task to be crawled. The body of this scheduling algorithm is the traditional weighted round-robin scheduling algorithm: a current scheduling weight is maintained and, whenever it has been reduced to a non-positive number, it is reinitialized to the maximum of all current node weights. Each node is then queried in turn to see whether its weight is not less than the current scheduling weight; if so, the node is scheduled. After all nodes have been queried, the current scheduling weight is reduced by one step, and the nodes are queried in turn again, and so on. In the traditional weighted round-robin scheduling algorithm, the step is the greatest common divisor of all weights, which can be regarded as 1 when there are many different weights. In the scheduling algorithm proposed by the present invention, the step is instead set to 4, based on the weight calculation method defined in this method and on a large number of experiments.

5) Each time a crawler node finishes crawling a URL task, it returns the result to the master node, and the master node updates the weight of this crawler node using the weight calculation method proposed by the present invention, which is based on recent task completion times and the number of uncompleted tasks.

6) When the weight of a crawler node drops to zero as its number of tasks increases, the master node no longer assigns tasks to it. When its weight becomes positive again, it can receive tasks again.

7) In this way, the master node keeps distributing URLs to crawler nodes, the crawler nodes keep crawling the URLs and return the HTML and outlinks to the master node, and the master node deduplicates the outlinks and distributes them again. Given the actual state of the Internet, the whole system will run endlessly, continuously crawling new web pages, until an operator stops it manually as the situation requires.

8) A fault recovery mechanism is provided: the master node can detect abnormal conditions of a crawler node and set its weight to zero.

9) The system has good scalability: new nodes can join the system at any time, and old nodes can be removed from the system at any time.
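The deduplication work that step 2) gives to the master node can be sketched as follows. This is only an illustration under assumptions: the class name `UrlFrontier` and the use of an in-memory set of already seen URLs are not taken from the patent, which does not say how deduplication is implemented.

```python
from collections import deque


class UrlFrontier:
    """Keep each URL at most once in the to-be-crawled queue (hypothetical)."""

    def __init__(self, seeds):
        self.seen = set()        # every URL ever queued
        self.pending = deque()   # URLs waiting to be assigned
        self.add_outlinks(seeds)

    def add_outlinks(self, urls):
        """Queue only URLs not seen before; return how many were added."""
        added = 0
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.pending.append(url)
                added += 1
        return added
```

When a crawler node returns the outlinks of a page, the master would pass them to `add_outlinks`, and only the new ones would reach the queue.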

According to scale, the present invention divides web crawlers into five classes:

(1) Stand-alone multi-threaded crawler

The stand-alone multi-threaded crawler is the most traditional form of crawler. Its load balancing consists in distributing tasks as evenly as possible across threads. All kinds of hash algorithms are suitable scheduling algorithms.

(2) Homogeneous centralized crawler

The homogeneous centralized crawler is similar to the stand-alone multi-threaded crawler; each node corresponds to a thread in the stand-alone multi-threaded crawler, only with a slightly larger scale and slightly greater capacity. Therefore, all kinds of hash algorithms remain suitable scheduling algorithms for this class of crawler.

(3) Heterogeneous centralized crawler

The heterogeneous centralized crawler differs from the previous two classes in that indicators such as the performance of each node are different, so the crawling abilities of the nodes differ. A stronger node should be assigned more tasks, and a weaker node fewer. Central load scheduling can schedule this class of crawler well.

(4) Small-scale distributed crawler

In a small-scale distributed crawler, the nodes are deployed in a distributed manner but still within a small region, and the network delay from each node to the Internet differs little. It is similar to the heterogeneous centralized crawler, but transmission between nodes does not necessarily take place in a LAN, so transmission may be considered unreliable and transmission delay must also be considered. There is currently no well-targeted scheduling algorithm for this class of crawler. Central load scheduling can schedule it to a certain degree, but a good scheduling strategy should modify central load scheduling so that it fits this class of crawler better.

(5) Large-scale distributed crawler

The large-scale distributed crawler is the form of crawler used by large commercial search engines today. Its nodes are distributed all over the world and their network delays differ greatly, so the strategy of scheduling according to network location is tailor-made for this class of crawler.

The distributed crawler scheduling strategy based on the weighted round-robin algorithm proposed by the present invention is designed for the small-scale distributed crawler in the above classification.

The present invention designs a weight calculation formula based on the current crawling efficiency of each crawler node, whose main function is to ensure the load balance of the system. The distributed crawler task scheduling algorithm based on the weighted round-robin algorithm then uses this weight calculation formula to carry out the actual scheduling of URL tasks. In addition, the fault recovery mechanism designed in the present invention ensures the stability of the system.

Brief description of the drawings

Fig. 1 is the scheduling flow chart.

Fig. 2 is the flow chart of the algorithm based on weighted round-robin.

Detailed description

The present invention adopts a master-slave crawler architecture. The master node contains a node table, three URL queues, a scheduling module, and a crawler feedback module. The node table records the information of each crawler node, including its node number, weight, and so on. It must be updated dynamically to stay consistent with the actual state of the crawler nodes. The update can be triggered each time a crawler node reports the completion of a URL task, or at fixed time intervals, and can be configured as the situation requires. The scheduling module first takes a URL from the to-be-crawled URL queue, then reads the information of each node from the node table, selects one crawler node for scheduling, assigns the URL to that crawler node, and stores the URL in the assigned URL queue. When a crawler node finishes crawling a URL, the crawler feedback module looks the URL up in the assigned URL queue; if it is present, the URL is removed from that queue and stored in the crawled URL queue. Finally, the outlinks crawled from this URL are sent to the deduplication module and then output to the to-be-crawled URL queue; only the scheduling process is considered here, and that process is ignored. The scheduling flow chart is shown in Fig. 1.
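The movement of a URL through the three queues can be sketched roughly as below. The names `MasterQueues`, `dispatch`, and `feedback` are hypothetical, the assigned queue is modeled as a dictionary for fast lookup, and node selection is passed in as a callable; all of these are assumptions made for illustration, and deduplication of outlinks is left out just as in the text.

```python
from collections import deque


class MasterQueues:
    """To-be-crawled, assigned, and crawled URL queues on the master node."""

    def __init__(self, seeds):
        self.to_crawl = deque(seeds)
        self.assigned = {}   # url -> id of the node it was assigned to
        self.crawled = []

    def dispatch(self, pick_node):
        """Scheduling module: move one URL from to_crawl to assigned."""
        if not self.to_crawl:
            return None
        url = self.to_crawl.popleft()
        node = pick_node()   # e.g. a weighted round-robin selection
        self.assigned[url] = node
        return url, node

    def feedback(self, url):
        """Feedback module: move a finished URL from assigned to crawled."""
        if url in self.assigned:
            del self.assigned[url]
            self.crawled.append(url)
            return True
        return False   # unknown or already reported URL is ignored
```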

In general, the load-balancing factors that can be considered include CPU performance, CPU utilization, memory usage, and transmission delay, but ultimately they are all reflected in time. We therefore use time as the measure of load balancing, i.e. as the factor that determines the weight. We estimate the likely future behavior of a crawler node from its earlier running behavior, and determine its weight accordingly.

Specifically, for a crawler node, assume that the number of tasks it has completed is n and that the total time spent is T milliseconds. The time here runs from the moment the master node dispatches a task to the node until the node's feedback is received; weights are calculated on the master node rather than on the crawler node precisely so that transmission delay is taken into account. The average time \bar{t} that this crawler node needs to complete one task is then:

\bar{t} = \frac{T}{n} \qquad (1)

Assume that the number of tasks assigned to this crawler node but not yet completed is m. The time t that this crawler node needs to complete its remaining tasks is then \bar{t} \cdot m, that is:

t = \frac{T}{n} \cdot m \qquad (2)

The larger the value of t, the more time the node needs to complete its remaining tasks, so the master node should assign fewer tasks to it, i.e. the weight w should be smaller. Taking the reciprocal of t therefore gives:

w = \frac{1}{t} \qquad (3)

Substituting (2) into (3) gives:

w = \frac{n}{T \cdot m} \qquad (4)

As crawling proceeds, n and T keep growing, but when the node is idle, m is zero. To keep the denominator from being zero, m is replaced by m + 1 in (4), which gives:

w = \frac{n}{T \cdot (m + 1)} \qquad (5)

Note that T and n record the cumulative totals since the first task, so as T and n keep growing, \bar{t} tends to become stable. This is not what we want: whatever problem the node now runs into while crawling can no longer show up in the formula, whereas the weight should reflect the current state of the node. The system therefore borrows the concept of a sliding window to correct the weight. Only the most recent k tasks are considered. Let t_i be the completion time of the i-th most recent task; the weight w is then:

w = \frac{k}{\left(\sum_{i=1}^{k} t_{i}\right) \cdot (m + 1)} \qquad (6)

As for the value of k, the present invention uses 100 in its implementation based on crawling conditions; in practice a suitable value can be chosen for each crawler.
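The sliding-window weight of formula (6) can be sketched as follows. The class name `NodeStats` and its methods are hypothetical. Two behaviors are assumptions not covered by the text: before k tasks have completed, the window simply uses the tasks seen so far, and a node with no history returns `None` so that the caller can fall back to the empirical initial weight.

```python
from collections import deque


class NodeStats:
    """Per-node bookkeeping kept on the master node (hypothetical)."""

    def __init__(self, k=100):
        # Completion times (ms) of the most recent k tasks; older ones drop out.
        self.recent_ms = deque(maxlen=k)
        self.pending = 0   # m: tasks assigned but not yet completed

    def on_assign(self):
        self.pending += 1

    def on_complete(self, elapsed_ms):
        self.pending -= 1
        self.recent_ms.append(elapsed_ms)

    def weight(self):
        """Formula (6): k / (sum of recent times * (m + 1))."""
        if not self.recent_ms:
            return None   # no history yet; caller uses the initial weight
        k = len(self.recent_ms)
        return k / (sum(self.recent_ms) * (self.pending + 1))
```

Because `elapsed_ms` is measured on the master from dispatch to feedback, a slow link to a node lowers its weight in the same way as a slow node does.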

The present invention proposes a scheduling strategy based on the weighted round-robin algorithm. The weighted round-robin algorithm was chosen as the basis of the scheduling strategy mainly for the following reasons:

(1) Simple and efficient

Apart from the weight of each crawler node, the algorithm only needs to store two simple variables (j and c), and one scheduling decision can be completed in O(x) time, where x is the number of crawler nodes.

(2) Support for dynamically changing weights

All weights used in the algorithm are read directly from the node table, which means that each weight read is the latest current value. Therefore, no matter how the feedback module asynchronously changes the weights of crawler nodes in the node table, the algorithm can always schedule according to the current state of the nodes in the system.

(3) The trend of weight changes can be anticipated

In the algorithm, the trend of the threshold weight matches the trend of the weight of each crawler node. This means that even if a crawler node runs into a temporary problem and cannot give feedback in time to update its weight, the algorithm can still predict the change of its weight fairly accurately and distribute tasks reasonably.

(4) Crawler nodes with low weights are not starved

In each round of allocation, during which the threshold weight is reduced from the current maximum weight of all nodes to zero (or below zero), every node gets a chance to be scheduled regardless of its weight. Thus, even when weights are not updated in time, crawler nodes with low weights are not starved.

Assume the node table n = {n₀, n₁, ..., n_{x-1}}, where w(n_j) denotes the weight of node n_j. The variable j denotes the last selected node, the variable c denotes the current scheduling weight, max(n) denotes the maximum weight of all nodes in n, and s denotes the amount by which c is reduced each time. This value had to be changed from the original algorithm to match the weights designed in the present invention; after experiments, we set s to 4. The initial value of both j and c is 0. The flow chart of the algorithm based on weighted round-robin is shown in Fig. 2.
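A minimal sketch of such a scheduler is given below. Fig. 2 is not reproduced in this text, so the control flow is an assumption modeled on the prose description of the algorithm. In particular, `j` is treated here as the index of the next node to query, so that `j` and `c` can both start at 0, and the class name `WeightedRoundRobin` is hypothetical. Weights are assumed to be non-negative integers held in a list that the feedback code may modify between calls.

```python
class WeightedRoundRobin:
    """Pick crawler nodes in weighted round-robin order with a fixed step."""

    def __init__(self, weights, step=4):
        self.weights = weights   # shared, mutable list of integer weights
        self.step = step         # s: amount by which c is reduced per pass
        self.j = 0               # index of the next node to query (assumed)
        self.c = 0               # current scheduling weight

    def next_node(self):
        x = len(self.weights)
        if x == 0 or max(self.weights) <= 0:
            return None          # no node is currently able to take a task
        while True:
            self.j %= x
            if self.j == 0:
                # A full pass over the nodes has finished: lower the threshold.
                self.c -= self.step
                if self.c <= 0:
                    self.c = max(self.weights)
            idx = self.j
            self.j = idx + 1
            if self.weights[idx] >= self.c:
                return idx
```

With weights `[8, 4, 0]` and the default step of 4, six consecutive calls return `0, 0, 1, 0, 0, 1`: the node with weight 8 receives twice as many tasks as the node with weight 4, and the node with weight 0 is never selected.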

Note that w(n_j) here is obtained from formula (6) and is a decimal in the interval (0, 1), so it cannot be used in the weighted round-robin algorithm directly and must be converted. It is first multiplied by a coefficient a and then rounded to an integer, so that the weight w falls in the range from 0 to some positive integer. The weight w is then:

w = \left[ \frac{a \cdot k}{\left(\sum_{i=1}^{k} t_{i}\right) \cdot (m + 1)} \right] \qquad (7)

Here a can be set to a suitable value according to the situation of the crawler. We set a to 300,000; when m is zero, w fluctuates around 120.
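As a worked check of formula (7), the sketch below computes the integer weight. The function name `integer_weight` is hypothetical, and rounding down is an assumed reading of the bracket notation. With a = 300,000 and m = 0, a weight of about 120 corresponds to an average completion time of about 2,500 ms per task, since 300,000 / 2,500 = 120.

```python
import math


def integer_weight(recent_ms, pending, a=300_000):
    """Formula (7): scale the sliding-window weight by a and round down."""
    k = len(recent_ms)
    return math.floor(a * k / (sum(recent_ms) * (pending + 1)))


# 100 recent tasks at 2500 ms each (illustrative numbers, not measured data).
window = [2500] * 100
print(integer_weight(window, 0))    # idle node
print(integer_weight(window, 1))    # one task outstanding
print(integer_weight(window, 120))  # heavily loaded node
```

Under these illustrative numbers the weight falls from 120 to 60 with one outstanding task and reaches 0 at 120 outstanding tasks, which is the point where the master would stop assigning tasks to the node.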

The fault recovery mechanism of the present invention consists of two parts: a fault recovery mechanism for nodes and a fault recovery mechanism for URLs.

When a node suddenly goes down, the master node should be able to detect this in real time and recover from the error. A commonly considered scheme is a heartbeat mechanism. In our implementation, however, we use sockets: by directly catching the I/O exception thrown by the socket, the master node detects that a node has disconnected. We then find all URLs in the assigned URL queue that were assigned to this crawler node and redistribute them. This completes the fault recovery mechanism for nodes.
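The recovery step for a failed node might look like the sketch below. The function name `recover_node` and the data layout (a dictionary from URL to node id for the assigned queue, a dictionary of weights) are assumptions. The sketch covers only the bookkeeping; in Python, a caller would typically invoke it from the handler that catches `OSError` on the node's socket.

```python
from collections import deque


def recover_node(failed_node, assigned, to_crawl, weights):
    """Zero the failed node's weight and requeue every URL assigned to it."""
    weights[failed_node] = 0   # the scheduler stops selecting this node
    lost = [url for url, node in assigned.items() if node == failed_node]
    for url in lost:
        del assigned[url]
        to_crawl.append(url)   # will be assigned again to a healthy node
    return lost


# Example state on the master (illustrative values).
assigned = {"u1": "n0", "u2": "n1", "u3": "n0"}
to_crawl = deque(["u4"])
weights = {"n0": 120, "n1": 90}
lost = recover_node("n0", assigned, to_crawl, weights)
```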

In addition, we monitor the assigned URL queue. When a URL has received no feedback for a long time, we consider that something has gone wrong with it, for example that it was lost in transmission, and it should be redistributed. This is the fault recovery mechanism for URLs.
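One way to monitor the assigned URL queue is sketched below. The function name `requeue_stale`, the per-URL dispatch timestamp, and the timeout parameter are all assumptions, since the text does not say how long "a long time" is or how it is measured.

```python
import time


def requeue_stale(assigned_at, to_crawl, timeout_s, now=None):
    """Move URLs with no feedback within timeout_s back to the to-crawl queue.

    assigned_at maps each assigned URL to the time it was dispatched,
    taken from the same clock as `now` (time.monotonic() by default).
    """
    now = time.monotonic() if now is None else now
    stale = [url for url, t0 in assigned_at.items() if now - t0 > timeout_s]
    for url in stale:
        del assigned_at[url]
        to_crawl.append(url)
    return stale
```

A periodic task on the master could call this function; a URL whose feedback arrives after it was requeued would then simply be crawled a second time.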

Innovations and advantages of the technical solution of the present invention:

1. Crawlers are classified in finer detail according to scale, making it easier for crawlers of different scales to choose different scheduling strategies.

2. A weight calculation formula based on the current crawling efficiency of each crawler node is designed, which better reflects the current state of each crawler node, so that a scheduling strategy based on this weight can balance the load.

3. A scheduling algorithm based on the weighted round-robin algorithm is designed; mainly, the step is modified according to experimental results so that the scheduling algorithm, working together with the proposed weight calculation formula for crawler nodes, allows the whole crawler system to achieve good load balancing.

4. The fault recovery mechanism gives the crawler system considerable fault tolerance, so that the whole system has good stability.

Claims

1. A distributed crawler task scheduling method based on a weighted round-robin algorithm, characterized in that it is implemented according to the following steps in sequence:

1) according to scale, web crawlers are divided into five classes: stand-alone multi-threaded, homogeneous centralized, heterogeneous centralized, small-scale distributed, and large-scale distributed; the method is directed at crawler task scheduling for small-scale distributed crawlers, where a small-scale distributed crawler means that the nodes are deployed in a distributed manner within a small physical region;

2) a master-slave architecture is deployed, i.e. one master node and several crawler nodes that are deployed in a distributed manner and can communicate with the master node, ensuring that all crawler nodes can connect to the Internet; the master node is responsible for scheduling crawler tasks, i.e. deciding which crawler node a URL to be crawled should be assigned to, and for deduplication, i.e. deduplicating the outlinks obtained from a URL returned by a crawler node and treating the remaining ones as new URLs to be crawled; the crawler node is responsible for the actual crawling: for each URL assigned to it by the master node, it fetches the complete HTML from the Internet, parses the outlinks contained in the page, and returns this information to the master node;

3) when a crawler node connects to the master node for the first time, the master node gives it an empirical value as its initial weight;

4) the master node, according to the scheduling algorithm based on weighted round-robin, repeatedly selects a crawler node and assigns it one URL task to be crawled; in this scheduling algorithm, a current scheduling weight is maintained and, whenever it has been reduced to a non-positive number, it is reinitialized to the maximum of all current node weights; each node is then queried in turn to see whether its weight is not less than the current scheduling weight, and if so, the node is scheduled; after all nodes have been queried, the current scheduling weight is reduced by one step, and the nodes are queried in turn again, and so on; the step of the scheduling algorithm is set to 4, based on the weight calculation method defined in this method and on a large number of experiments;

5) each time a crawler node finishes crawling a URL task, it returns the result to the master node, and the master node updates the weight of this crawler node using the weight calculation method based on recent task completion times and the number of uncompleted tasks;

6) when the weight of a crawler node drops to zero as its number of tasks increases, the master node no longer assigns tasks to it; when its weight becomes positive again, it can receive tasks again;

7) in this way, the master node keeps distributing URLs to crawler nodes, the crawler nodes keep crawling the URLs and return the HTML and outlinks to the master node, and the master node deduplicates the outlinks and distributes them again; given the actual state of the Internet, the whole system will run endlessly, continuously crawling new web pages, until an operator stops it manually as the situation requires;

8) a fault recovery mechanism is provided: the master node can detect abnormal conditions of a crawler node and set its weight to zero.