CN103870329B - Distributed crawler task scheduling method based on weighted round-robin algorithm - Google Patents

Distributed crawler task scheduling method based on weighted round-robin algorithm

Info

Publication number
CN103870329B
Authority
CN
China
Prior art keywords
node
reptile
crawler
url
main controlled
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410073829.4A
Other languages
Chinese (zh)
Other versions
CN103870329A (en)
Inventor
蒋昌俊
陈闳中
闫春钢
丁志军
王鹏伟
孙海春
邓晓栋
葛大劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A distributed crawler task scheduling method based on a weighted round-robin algorithm at least includes the following steps: (1) according to different scales, network crawlers are divided into five types of crawlers, i.e. a stand-alone multi-thread crawler, a homogeneous centralized crawler, a heterogeneous centralized crawler, a small-scale distributed crawler and a large-scale distributed crawler; (2) a master-slave architecture is deployed; (3) when a crawler node is connected to a master node for the first time, the master node gives an initial weight to the crawler node; (4) according to the scheduling algorithm based on weighted round-robin, the master node continuously chooses a crawler node and assigns a URL (Uniform Resource Locator) task to be crawled to the crawler node; (5) each time when a URL task is crawled by a crawler node, a result is returned to the master node, and the weight of the crawler node is updated by the master node, and the like. The distributed crawler scheduling policy based on the weighted round-robin algorithm, which is put forward by the invention, is designed for small-scale distributed crawlers and can ensure the load balance of each crawler node and ensure that crawler nodes have flexible scalability and fault tolerance.

Description

Distributed crawler task scheduling method based on a weighted round-robin algorithm
Technical Field
The present invention relates to the technical field of web search.
Background
A search engine can be divided into several parts: a crawler, an indexer, a searcher and a user interface. The crawler continuously discovers and collects information on the Internet and plays an important role in a search engine. As the network develops rapidly and the volume of information grows explosively, the crawling capacity of traditional stand-alone crawlers and centralized crawlers can no longer keep up with the growth of Internet information. Now that distributed approaches are increasingly common, distributed crawlers have naturally become a solution to the problem of large data volumes. A distributed crawler consists of multiple nodes deployed across a wide area network that crawl in parallel, which meets the demand for crawling capacity. Because the crawling capacity of each node differs, a good scheduling strategy is essential. Crawlers of different scales call for different scheduling algorithms; the mainstream ones are:
(1) Hash scheduling
A hash function is a mapping that converts a string, a number or other information into an index value. Most early crawler systems used this approach: the URL is the input to the hash function, and the resulting value is the output of the scheduler. Such a scheduling strategy is easy to compute and has very little overhead; in addition, the mathematical randomness of the hash function ensures that tasks are distributed evenly among crawler nodes.
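As an illustration of hash scheduling, the following is a minimal sketch and not code from the patent. The function name `assign_by_hash`, the choice of MD5 as the hash, and the example URLs are assumptions made for the example.

```python
import hashlib

def assign_by_hash(url: str, node_count: int) -> int:
    """Map a URL to a crawler node index (hypothetical helper).

    MD5 is used only as a stable, well-mixed hash; any hash function
    with similar properties would serve the same purpose.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count

# Distribute 1000 example URLs over 4 nodes and count tasks per node.
counts = [0] * 4
for i in range(1000):
    counts[assign_by_hash(f"http://example.com/page/{i}", 4)] += 1
```

The same URL always maps to the same node, and no state has to be stored. The mapping ignores how capable or how busy each node is, which is the limitation the later sections address.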
(2) Central load scheduling
Take Peking University's Tianwang crawler after its large-scale improvement as an example. It uses a centrally controlled model whose overall framework is one master node working together with several crawler nodes. Its task scheduling works as follows: the master node is responsible for distributing URLs, and the crawler nodes are responsible for crawling them. Each website is handled by one crawler program, and all URLs on that website are crawled by that program. A crawler node can run multiple crawler programs, but each crawler program runs on exactly one crawler node. The master node starts allocating from the seed URLs. For each URL whose website does not yet have a crawler program, it finds a crawler node according to some load-balancing principle, sends the URL to it, and asks it to start a new crawler program. All subsequent URLs of that website are then assigned to that crawler node and crawled by that crawler program.
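The site-to-node assignment described above can be sketched as follows. This is an illustrative sketch only: the class name `SiteScheduler` is hypothetical, and "fewest sites so far" stands in for the unspecified load-balancing principle.

```python
from urllib.parse import urlsplit

class SiteScheduler:
    """Assigns every URL of a website to the same crawler node (hypothetical)."""

    def __init__(self, node_ids):
        # Number of websites currently handled by each node.
        self.load = {node_id: 0 for node_id in node_ids}
        self.site_to_node = {}

    def assign(self, url):
        site = urlsplit(url).netloc
        if site not in self.site_to_node:
            # Assumed load-balancing rule: pick the node with the fewest sites.
            node = min(self.load, key=self.load.get)
            self.site_to_node[site] = node
            self.load[node] += 1
        return self.site_to_node[site]
```

Once a site has been bound to a node, later URLs from that site bypass the load-balancing decision entirely.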
(3) Scheduling by network location
In a large-scale search engine, crawler nodes are deployed all over the world, so calculating network location is very important. The basic idea of the scheduling strategy in such a crawler is to use an algorithm such as GNP: the network distances between a small, predetermined set of websites and the crawler nodes are measured and used to estimate the network distances between a large number of other nodes. The predicted network distances are then used to calculate the time each crawler node needs to crawl the page for a URL, and the crawler node with the smallest time overhead becomes the scheduling target for that URL. Such a scheme schedules crawling tasks effectively by network distance and reduces the time overhead of large-scale Internet measurement.
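The final selection step can be sketched as below, assuming that a coordinate-based estimator has already placed each crawler node and the target website in a common coordinate space. The function name, the coordinates and the use of Euclidean distance as a proxy for crawl time are assumptions for illustration; the text does not specify the estimation itself.

```python
import math

def pick_nearest_node(site_coord, node_coords):
    """Return the id of the node with the smallest predicted distance.

    site_coord:  coordinate tuple of the target website (assumed given).
    node_coords: dict mapping node id -> coordinate tuple (assumed given).
    """
    return min(node_coords, key=lambda n: math.dist(node_coords[n], site_coord))

nodes = {"node-a": (0.0, 0.0), "node-b": (10.0, 0.0)}
chosen = pick_nearest_node((8.0, 1.0), nodes)
```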
Summary of the Invention
The distributed crawler scheduling strategy based on a weighted round-robin algorithm proposed by the present invention is designed for small-scale distributed crawlers. Because its idea is similar to the central load scheduling strategy, it is also suitable for heterogeneous centralized crawlers. It balances the load across crawler nodes and gives the crawler nodes flexible scalability and fault tolerance.
The technical solution of the method is characterized as follows:
A distributed crawler task scheduling method based on a weighted round-robin algorithm, characterized in that it is implemented in the following steps in order:
1) According to scale, the present invention divides web crawlers into five classes: stand-alone multi-threaded, homogeneous centralized, heterogeneous centralized, small-scale distributed and large-scale distributed. This task scheduling method targets small-scale distributed crawlers. In a small-scale distributed crawler the nodes are deployed in a distributed manner but still within a small physical region, so the nodes differ little in their network latency to the Internet. However, transmission between nodes does not necessarily take place in a LAN environment, so it may be unreliable, and transmission delay must also be taken into account.
2) Master-slave deployment: one master node and several crawler nodes that are deployed in a distributed manner and can communicate with the master node; all crawler nodes must be able to connect to the Internet. The master node is responsible for scheduling crawl tasks, that is, deciding which crawler node each URL to be crawled is assigned to, and for deduplication, that is, deduplicating the outlinks that a crawler node returns for a URL to obtain the new URLs to be crawled. A crawler node is responsible for the actual crawling: for each URL the master node assigns to it, it fetches the full HTML from the Internet, parses the outlinks contained in the page, and returns this information to the master node.
3) When a crawler node connects to the master node for the first time, the master node gives it an empirical value as its initial weight.
4) The master node repeatedly selects a crawler node according to the weighted round-robin scheduling algorithm proposed by the present invention and assigns it one URL task to be crawled. The body of this algorithm is the traditional weighted round-robin scheduling algorithm: a current scheduling weight is maintained, and whenever it drops to a non-positive number it is reset to the maximum weight among all current nodes. Each node is then polled in turn to check whether its weight is not less than the current scheduling weight; if so, the node is scheduled. After all nodes have been polled, the current scheduling weight is decreased by one step, and polling starts again from the first node, and so on. In the traditional weighted round-robin scheduling algorithm the step is the greatest common divisor of all weights, which can be regarded as 1 when there are many distinct weights. In the algorithm proposed by the present invention, the step is instead set to 4, based on the weight calculation method of this method and a large number of experiments.
5) When a crawler node finishes crawling a URL task, it returns the result to the master node, and the master node updates the weight of that crawler node using the weight calculation method proposed by the present invention, which is based on recent task completion times and the number of unfinished tasks.
6) When the weight of a crawler node drops to zero as its number of tasks grows, the master node no longer assigns tasks to it. Once its weight becomes positive again, it is eligible for assignment again.
7) In this way the master node keeps distributing URLs to the crawler nodes, the crawler nodes keep crawling URLs and returning the HTML and outlinks to the master node, and the master node deduplicates the outlinks and distributes them again. Given the actual state of the Internet, the whole system would run indefinitely, continually crawling new pages, until an operator stops it manually.
8) A fault recovery mechanism is provided: the master node can detect abnormal conditions of a crawler node and sets its weight to zero.
9) The system scales well: new nodes can join the system at any time, and old nodes can be removed from the system at any time.
According to scale, the present invention divides web crawlers into five classes:
(1) Stand-alone multi-threaded crawler
The stand-alone multi-threaded crawler is the most traditional form of crawler. Its load balancing consists in distributing tasks as evenly as possible across threads. Hash algorithms of all kinds are suitable scheduling algorithms.
(2) Homogeneous centralized crawler
The homogeneous centralized crawler is similar to the stand-alone multi-threaded crawler: each node corresponds to a thread in the stand-alone case, only the scale is somewhat larger and the capacity somewhat greater. Hash algorithms therefore remain suitable for scheduling this class of crawler.
(3) Heterogeneous centralized crawler
The heterogeneous centralized crawler differs from the previous two classes in that the nodes differ in performance and other characteristics, so their crawling capacity differs. Stronger nodes should be assigned more tasks and weaker nodes fewer. Central load scheduling can schedule this class of crawler well.
(4) Small-scale distributed crawler
In a small-scale distributed crawler the nodes are deployed in a distributed manner but still within a small region, so the nodes differ little in their network latency to the Internet. It resembles the heterogeneous centralized crawler, but transmission between nodes does not necessarily take place within a LAN, so transmission may be unreliable and transmission delay must also be considered. There is currently no well-targeted scheduling algorithm for this class of crawler. Central load scheduling can schedule it to some extent, but a good scheduling strategy should modify central load scheduling to fit this class of crawler better.
(5) Large-scale distributed crawler
The large-scale distributed crawler is the form used by today's large commercial search engines. Its nodes are distributed all over the world and their network latencies differ greatly, so scheduling by network location is tailor-made for this class of crawler.
The distributed crawler scheduling strategy based on a weighted round-robin algorithm proposed by the present invention is designed for the small-scale distributed crawler in the above classification.
The present invention designs a weight formula based on the current crawling efficiency of each crawler node; its main function is to keep the system load balanced. The distributed crawler task scheduling algorithm based on weighted round-robin uses this weight formula to schedule URL tasks. In addition, the fault recovery mechanism designed by the present invention ensures the stability of the system.
Brief Description of the Drawings
Fig. 1 is the scheduling flowchart.
Fig. 2 is the flowchart of the weighted round-robin based algorithm.
Detailed Description
The present invention adopts a master-slave crawler architecture. The master node contains a node table, three URL queues, a scheduling module and a crawler feedback module. The node table records information about each crawler node, including its node number and weight. It must be updated dynamically to stay consistent with the actual state of the crawler nodes. The update can be triggered each time a crawler node reports back on a URL task, or once every fixed interval, depending on the situation. The scheduling module first takes a URL from the to-be-crawled URL queue, then reads the information of each node from the node table, selects one crawler node to schedule, assigns the URL to that crawler node, and stores the URL in the assigned URL queue. When a crawler node finishes crawling a URL, the crawler feedback module looks the URL up in the assigned URL queue; if it is present, the URL is removed from that queue and stored in the crawled URL queue. Finally, the outlinks crawled from the URL are passed to the deduplication module and then output to the to-be-crawled URL queue; only the scheduling process is considered here, so this step is ignored. The scheduling flow is shown in Fig. 1.
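The master-node state described above might be organized roughly as follows. This is a sketch under assumptions: the names `NodeInfo`, `Master`, `dispatch` and `feedback` are hypothetical, the assigned and crawled "queues" are held in a dict and a set for fast lookup, and deduplication is reduced to a simple membership check.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    node_id: int
    weight: int

@dataclass
class Master:
    nodes: dict = field(default_factory=dict)      # node table: id -> NodeInfo
    pending: deque = field(default_factory=deque)  # URLs waiting to be crawled
    assigned: dict = field(default_factory=dict)   # url -> node id
    crawled: set = field(default_factory=set)      # URLs already crawled

    def dispatch(self, choose_node):
        """Scheduling module: take one URL and hand it to a chosen node."""
        url = self.pending.popleft()
        node_id = choose_node(self.nodes)
        self.assigned[url] = node_id
        return url, node_id

    def feedback(self, url, outlinks):
        """Feedback module: record completion, then enqueue unseen outlinks."""
        if url in self.assigned:
            del self.assigned[url]
            self.crawled.add(url)
        for link in outlinks:
            seen = (link in self.crawled or link in self.assigned
                    or link in self.pending)
            if not seen:
                self.pending.append(link)
```

The `choose_node` callback is where a scheduling policy, such as the weighted round-robin selection, would plug in.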
In general, the load-balancing factors that could be considered include CPU performance, CPU utilization, memory usage and transmission delay, but in the end they all show up in time. We therefore use time as the measure of load balancing, that is, as the factor that determines the weight. We estimate a crawler node's likely future behavior from its past running behavior and determine its weight accordingly.
Specifically, for a crawler node, assume that it has completed n tasks and that the total time spent is T milliseconds. (The time for a task runs from the moment the master node dispatches the task to the node until the node reports back; the weight is calculated at the master node rather than at the crawler node so that transmission delay is included.) Then the average time $\bar{t}$ this crawler node needs to complete one task is:
$$\bar{t} = \frac{T}{n} \tag{1}$$
Assume that m tasks have been assigned to this crawler node but are not yet finished. Then the time t this crawler node needs to complete its remaining tasks is $\bar{t} \cdot m$, that is:
$$t = \frac{T}{n} \cdot m \tag{2}$$
The larger t is, the more time the node needs to finish its remaining tasks, so the master node should assign it fewer tasks, that is, its weight w should be smaller. Taking the reciprocal of t gives:
$$w = \frac{1}{t} \tag{3}$$
Substituting (2) into (3) gives:
$$w = \frac{n}{T \cdot m} \tag{4}$$
As crawling proceeds, n and T keep growing, but when the node is idle m is zero. To keep the denominator from being zero, m is replaced by m+1 in (4), which gives:
$$w = \frac{n}{T \cdot (m+1)} \tag{5}$$
Note that T and n accumulate from the first task onward, so as T and n keep growing, $\bar{t}$ tends to stabilize. This is not what we want, because then any problem the node runs into while crawling would not show up in the formula, whereas the weight should reflect the node's current state. The system therefore borrows the concept of a sliding window to revise the weight: only the most recent k tasks are considered. Let $t_i$ be the completion time of the i-th most recent task; then the weight w is:
$$w = \frac{k}{\left(\sum_{i=1}^{k} t_i\right) \cdot (m+1)} \tag{6}$$
The appropriate value of k depends on the crawling situation; the present implementation uses 100, and in practice a suitable value can be chosen for each crawler.
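The sliding-window weight of formula (6) can be sketched as follows. The class name `WeightTracker` and its methods are hypothetical, and using the number of recorded tasks in place of k before the window fills is an assumption the text does not address.

```python
from collections import deque

class WeightTracker:
    """Tracks one crawler node's recent timings (hypothetical helper)."""

    def __init__(self, k=100):
        self.recent = deque(maxlen=k)  # completion times (ms) of the last k tasks
        self.unfinished = 0            # m: assigned but not yet finished

    def on_assign(self):
        self.unfinished += 1

    def on_complete(self, elapsed_ms):
        self.unfinished -= 1
        self.recent.append(elapsed_ms)  # the oldest entry drops out automatically

    def weight(self):
        if not self.recent:
            return None  # no history yet; an initial empirical weight applies
        return len(self.recent) / (sum(self.recent) * (self.unfinished + 1))
```

Because the deque has a fixed maximum length, slow recent tasks lower the weight immediately instead of being averaged away by a long history.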
The present invention proposes a scheduling strategy based on the weighted round-robin algorithm. The weighted round-robin algorithm was chosen as the basis of the strategy mainly for the following reasons:
(1) Simple and efficient
Apart from the weight of each crawler node, the algorithm only needs to store two simple variables (j and c) and can complete one scheduling decision in O(x) time, where x is the number of crawler nodes.
(2) Support for dynamically changing weights
Every weight the algorithm uses is read directly from the node table, so each weight it reads is the latest value. No matter how the feedback module asynchronously changes the weights of crawler nodes in the node table, the algorithm schedules according to the current state of the nodes.
(3) The trend of the weights can be anticipated
The trend of the threshold weight in the algorithm matches the trend of each crawler node's weight. This means that even if a crawler node has a temporary problem and does not report back in time to update its weight, the algorithm can still predict the change of its weight fairly accurately and distribute tasks reasonably.
(4) Low-weight crawler nodes are not starved
In each round, in which the threshold weight decreases from the maximum weight among all current nodes to zero (or below), every node gets a chance to be scheduled regardless of its weight, so low-weight crawler nodes are not starved even when weight updates are delayed.
Let the node table be N = {n_0, n_1, ..., n_{x-1}}, let w(n_j) denote the weight of node n_j, let the variable j denote the most recently selected node, let the variable c denote the current scheduling weight, let max(N) denote the maximum weight among all nodes in N, and let s denote the amount by which c is reduced each time. The value of s in the original algorithm has to be modified to match the weights designed by the present invention; after experiments, we set s to 4. The initial values of j and c are both 0. The flowchart of the weighted round-robin based algorithm is shown in Fig. 2.
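Since Fig. 2 is not reproduced in the text, the following sketch follows the structure of the classic weighted round-robin loop with the step s = 4. It is an illustration, not the patented flow itself: the class name is hypothetical, weights are passed in as a plain list indexed by node number, and j starts at -1 (rather than 0 as stated above) so that the first poll begins at node 0.

```python
class WeightedRoundRobin:
    """Weighted round-robin selector with a fixed step (illustrative sketch)."""

    def __init__(self, step=4):
        self.step = step  # s: amount subtracted from c after each full pass
        self.j = -1       # index of the most recently polled node
        self.c = 0        # current scheduling weight

    def select(self, weights):
        """Return the index of the next node to schedule, or None.

        `weights` holds the current integer weight of every node and is
        read afresh on each call, so dynamic weight updates take effect.
        """
        x = len(weights)
        if x == 0:
            return None
        while True:
            self.j = (self.j + 1) % x
            if self.j == 0:
                # A full pass is complete: lower the threshold by one step.
                self.c -= self.step
                if self.c <= 0:
                    self.c = max(weights)
                    if self.c <= 0:
                        return None  # no node currently has a positive weight
            if weights[self.j] >= self.c:
                return self.j

wrr = WeightedRoundRobin(step=4)
picks = [wrr.select([8, 4, 0]) for _ in range(6)]
```

With weights 8, 4 and 0, the threshold alternates between 8 and 4, so the first node is selected twice for every selection of the second, and the zero-weight node is never selected.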
Note that w(n_j) here, as obtained from formula (6), is a decimal in the interval (0, 1) and cannot be used directly in the weighted round-robin algorithm, so it must be converted. It is first multiplied by a coefficient a and then rounded to an integer, so that the weight w falls in the range from 0 to some positive integer. The weight w is then:
$$w = \left[ \frac{a \cdot k}{\left(\sum_{i=1}^{k} t_i\right) \cdot (m+1)} \right] \tag{7}$$
Here a can be set to a suitable value according to the crawler's situation. We set a to 300,000; when m is zero, w fluctuates around 120.
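As a consistency check on these figures (an inference from the stated values, not a measurement reported in the source): with a = 300,000, k = 100 and m = 0, a weight of about 120 corresponds to a window total of about 250,000 ms, or roughly 2,500 ms per task on average. With the same timings and one unfinished task (m = 1), the integer weight is halved:

$$\sum_{i=1}^{k} t_i \approx \frac{a \cdot k}{w} = \frac{300000 \cdot 100}{120} = 250000\ \text{ms}, \qquad w\Big|_{m=1} = \left[ \frac{300000 \cdot 100}{250000 \cdot 2} \right] = 60$$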
The fault recovery mechanism of the present invention has two parts: fault recovery for nodes and fault recovery for URLs.
When a node crashes suddenly, the master node should detect this in real time and recover from the error. A commonly considered scheme is a heartbeat mechanism. In our implementation, however, we use sockets: catching the I/O exception thrown by the socket directly reveals that a node has disconnected. We then find all URLs in the assigned URL queue that were assigned to that crawler node and redistribute them. This completes the fault recovery mechanism for nodes.
In addition, we monitor the assigned URL queue. When a URL has received no feedback for a long time, we consider that something has gone wrong with it, for example that it was lost in transmission, and it should be redistributed. This is the error recovery mechanism for URLs.
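Both recovery paths amount to moving URLs from the assigned queue back to the to-be-crawled queue. The sketch below assumes the assigned queue is a dict from URL to a (node id, assignment time) pair; the function names and the timeout value are hypothetical, since the text does not say how long "a long time" is.

```python
import time

def reassign_from_failed_node(assigned, pending, failed_node):
    """Node recovery: requeue every URL that was assigned to a dead node."""
    lost = [u for u, (node, _) in assigned.items() if node == failed_node]
    for u in lost:
        del assigned[u]
        pending.append(u)
    return lost

def requeue_stale_urls(assigned, pending, timeout_s, now=None):
    """URL recovery: requeue URLs with no feedback for timeout_s seconds."""
    now = time.monotonic() if now is None else now
    stale = [u for u, (_, at) in assigned.items() if now - at > timeout_s]
    for u in stale:
        del assigned[u]
        pending.append(u)
    return stale
```

In a socket-based master, the first function would be called from the handler that catches the I/O exception for a node's connection, and the second from a periodic monitoring task.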
Innovations and advantages of the technical solution of the present invention:
1. Crawlers are classified in detail by scale, which makes it easy for crawlers of different scales to choose different scheduling strategies.
2. A weight formula based on the current crawling efficiency of each crawler node is designed; it better reflects the current state of each crawler node, so that a scheduling strategy based on this weight can balance the load.
3. A scheduling algorithm based on the weighted round-robin algorithm is designed, in which the step is modified according to experimental results so that the algorithm works together with the proposed weight formula and lets the whole crawler system achieve good load balancing.
4. The fault recovery mechanism gives the crawler system considerable fault tolerance, so that the whole system is stable.

Claims (1)

1. A distributed crawler task scheduling method based on a weighted round-robin algorithm, characterized in that it is implemented in the following steps in order:
1) according to scale, web crawlers are divided into five classes: stand-alone multi-threaded, homogeneous centralized, heterogeneous centralized, small-scale distributed and large-scale distributed; the method is directed at task scheduling for small-scale distributed crawlers, where a small-scale distributed crawler is one whose nodes are deployed in a distributed manner within a small physical region;
2) master-slave deployment, that is, one master node and several crawler nodes that are deployed in a distributed manner and can communicate with the master node, all crawler nodes being able to connect to the Internet; the master node is responsible for scheduling crawl tasks, that is, deciding which crawler node each URL to be crawled is assigned to, and for deduplication, that is, deduplicating the outlinks that a crawler node returns for a URL to obtain the new URLs to be crawled; a crawler node is responsible for the actual crawling: for each URL the master node assigns to it, it fetches the full HTML from the Internet, parses the outlinks contained in the page, and returns this information to the master node;
3) when a crawler node connects to the master node for the first time, the master node gives it an empirical value as its initial weight;
4) the master node repeatedly selects a crawler node according to the weighted round-robin based scheduling algorithm and assigns it one URL task to be crawled; in this scheduling algorithm, a current scheduling weight is maintained and, whenever it drops to a non-positive number, is reset to the maximum weight among all current nodes; each node is then polled in turn to check whether its weight is not less than the current scheduling weight, and if so the node is scheduled; after all nodes have been polled, the current scheduling weight is decreased by one step and polling starts again, and so on; the step of the scheduling algorithm is set to 4 according to the weight calculation method of this method and a number of experiments;
5) when a crawler node finishes crawling a URL task, it returns the result to the master node, and the master node updates the weight of that crawler node using a weight calculation method based on recent task completion times and the number of unfinished tasks;
6) when the weight of a crawler node drops to zero as its number of tasks grows, the master node no longer assigns tasks to it; once its weight becomes positive again, it is eligible for assignment again;
7) in this way the master node keeps distributing URLs to the crawler nodes, the crawler nodes keep crawling URLs and returning the HTML and outlinks to the master node, and the master node deduplicates the outlinks and distributes them again; given the actual state of the Internet, the whole system runs indefinitely, continually crawling new pages, until an operator stops it manually;
8) a fault recovery mechanism is provided: the master node can detect abnormal conditions of a crawler node and sets its weight to zero.
CN201410073829.4A 2014-03-03 2014-03-03 Distributed crawler task scheduling method based on weighted round-robin algorithm Active CN103870329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410073829.4A CN103870329B (en) 2014-03-03 2014-03-03 Distributed crawler task scheduling method based on weighted round-robin algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410073829.4A CN103870329B (en) 2014-03-03 2014-03-03 Distributed crawler task scheduling method based on weighted round-robin algorithm

Publications (2)

Publication Number Publication Date
CN103870329A CN103870329A (en) 2014-06-18
CN103870329B true CN103870329B (en) 2017-01-18

Family

ID=50908893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410073829.4A Active CN103870329B (en) 2014-03-03 2014-03-03 Distributed crawler task scheduling method based on weighted round-robin algorithm

Country Status (1)

Country Link
CN (1) CN103870329B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376063B (en) * 2014-11-11 2019-02-19 南京邮电大学 Multi-threaded network crawler method and information real-time update system based on Classification Management
CN105989151B (en) * 2015-03-02 2019-09-06 阿里巴巴集团控股有限公司 Webpage capture method and device
CN104766014B (en) 2015-04-30 2017-12-01 安一恒通(北京)科技有限公司 For detecting the method and system of malice network address
CN106021608A (en) * 2016-06-22 2016-10-12 广东亿迅科技有限公司 Distributed crawler system and implementing method thereof
CN106897129B (en) * 2017-01-24 2019-07-23 浙江工商大学 A kind of multiple agent internet data acquisition tasks dispatching method based on region
CN107122246B (en) * 2017-04-27 2020-05-19 中国海洋石油集团有限公司 Intelligent numerical simulation operation management and feedback method
CN110020066B (en) * 2017-07-31 2021-09-07 北京国双科技有限公司 Method and device for annotating tasks to crawler platform
CN107590188B (en) * 2017-08-08 2020-02-14 杭州灵皓科技有限公司 Crawler crawling method and management system for automatic vertical subdivision field
WO2019079966A1 (en) * 2017-10-24 2019-05-02 麦格创科技(深圳)有限公司 Distributed crawler task distribution method and system
CN108712503B (en) * 2018-05-30 2021-06-22 南京邮电大学 Multi-agent distributed crawler system and method for network load balancing
CN111092921B (en) * 2018-10-24 2022-05-10 北大方正集团有限公司 Data acquisition method, device and storage medium
CN111125487A (en) * 2019-12-24 2020-05-08 个体化细胞治疗技术国家地方联合工程实验室(深圳) Crawling method and device for web crawler
CN111158892B (en) * 2020-04-02 2020-10-02 支付宝(杭州)信息技术有限公司 Task queue generating method, device and equipment
CN112231534A (en) * 2020-10-14 2021-01-15 上海蜜度信息技术有限公司 Crawler configuration method and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8042112B1 (en) * 2003-07-03 2011-10-18 Google Inc. Scheduler for search engine crawler
CN102968465A (en) * 2012-11-09 2013-03-13 同济大学 Network information service platform and search service method based on network information service platform
CN103514301A (en) * 2013-10-24 2014-01-15 深圳市同洲电子股份有限公司 Method and system for scheduling tasks of distributed network crawlers
CN103559258A (en) * 2013-11-04 2014-02-05 同济大学 Webpage ranking method based on cloud computation
CN103605764A (en) * 2013-11-26 2014-02-26 Tcl集团股份有限公司 Web crawler system and web crawler multitask executing and scheduling method

Also Published As

Publication number Publication date
CN103870329A (en) 2014-06-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant