CN109871265A

CN109871265A - Scheduling method and device for Reduce tasks

Info

Publication number: CN109871265A
Application number: CN201711270644.2A
Authority: CN
Inventors: 林文辉; 舒南飞
Original assignee: Aisino Corp
Current assignee: Aisino Corp
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2019-06-11

Abstract

The present invention provides a scheduling method and device for Reduce tasks, to overcome the prior-art defects of high network transmission overhead and long task scheduling time during task scheduling. The method includes: when a requesting node applies for a Reduce task, determining the data locality metric of each unscheduled Reduce task; selecting the Reduce task with the smallest data locality metric, and determining whether the selected Reduce task meets the scheduling condition; and, if so, assigning the selected Reduce task to the requesting node. The embodiment of the present invention does not schedule a Reduce task directly; instead, it evaluates the data locality metrics of the Reduce tasks and then assigns a Reduce task that meets the scheduling condition. The data locality of Reduce tasks is therefore taken into account during task scheduling, which reduces task scheduling time and improves job throughput.

Description

Scheduling method and device for Reduce tasks

Technical field

The embodiments of the present invention relate to the field of computer technology, and in particular to a scheduling method and device for Reduce tasks.

Background art

MapReduce is a simple programming model in which the processing of data is reduced to two stages: the Map (mapping) stage and the Reduce (reduction) stage. Because of its simplicity and flexibility in analyzing large-scale data sets, MapReduce has become one of the mainstream frameworks for massive data processing. The open-source Hadoop, an outstanding implementation of it, is widely used. Hadoop's task scheduling algorithm determines the execution performance of tasks and thus affects the performance of the entire Hadoop cluster. Fortunately, the pluggable scheduler implemented by Hadoop can flexibly use different task scheduling algorithms to allocate cluster resources to tasks. Each of these algorithms has its own advantages, and their efficiency depends on the workload and the cluster characteristics.

There are currently three main MapReduce task scheduling methods: FIFO (First In First Out) scheduling, fair scheduling, and capacity scheduling. The FIFO scheduling method executes tasks in order of submission time without considering the priority or size of jobs; it is easy to implement and relatively efficient. The fair scheduling method (Fair Scheduler) was developed by Facebook. Its core idea is to allocate cluster resources evenly among jobs over time, which lets a Hadoop cluster achieve a higher response ratio across the multiple types of tasks submitted; it is especially suitable for small and medium-sized clusters. The capacity scheduler (Capacity Scheduler), developed by Yahoo, is similar to the Fair Scheduler; it offers greater control, guarantees each user's minimum capacity requirement, and shares spare capacity among users. It is mainly used for large clusters with multiple isolated users and target applications.

Of the above, the fair scheduling method is relatively widely used. However, it does not consider the data locality of Reduce tasks during task scheduling, which leads to high network transmission overhead and long task scheduling time.

Summary of the invention

In view of this, one of the technical problems solved by the embodiments of the present invention is to provide a scheduling method and device for Reduce tasks, to overcome the prior-art defect that the data locality of Reduce tasks is not considered during task scheduling, resulting in high network transmission overhead and long task scheduling time.

An embodiment of the present invention provides a scheduling method for Reduce tasks, the method comprising:

when a requesting node applies for a Reduce task, determining the data locality metric of each unscheduled Reduce task;

selecting the Reduce task with the smallest data locality metric, and determining whether the selected Reduce task meets the scheduling condition;

if so, assigning the selected Reduce task to the requesting node.

Optionally, the step of determining whether the selected Reduce task meets the scheduling condition comprises:

comparing the data locality metric of the selected Reduce task with the threshold corresponding to the waiting count of the selected Reduce task;

if the data locality metric is less than or equal to the threshold, determining that the selected Reduce task meets the scheduling condition.

Optionally, after the step of assigning the selected Reduce task to the requesting node, the method further comprises: setting the waiting count of the selected Reduce task to 0.

Optionally, the method further comprises: if the scheduling condition is not met, incrementing the waiting count of the selected Reduce task by 1.

Optionally, the step of data locality metric for the Reduce task that the determination is not scheduled respectively, comprising:

For each not scheduled Reduce task, the input of the not scheduled Reduce task is calculated separately Ratio data of the data in other each nodes, and calculate separately the topology between the requesting node and other each nodes Distance；

The product of the ratio data and topology distance corresponding to same node is calculated separately, and determines the tired of the product Data locality metric long-pending and for the not scheduled Reduce task.

Optionally, before the step of determining the data locality metric of each unscheduled Reduce task, the method further comprises: determining the topology distance between different nodes.

Optionally, the step of determining the topology distance between different nodes comprises:

determining that the topology distance between different nodes in the same rack is a first preset value;

determining that the topology distance between nodes in different racks of the same data center is a second preset value;

determining that the topology distance between nodes in different data centers is a third preset value;

wherein the first preset value is less than the second preset value, and the second preset value is less than the third preset value.

An embodiment of the present invention also provides a scheduling device for Reduce tasks, the device comprising:

a determining module, configured to determine, when a requesting node applies for a Reduce task, the data locality metric of each unscheduled Reduce task;

a selection module, configured to select the Reduce task with the smallest data locality metric and determine whether the selected Reduce task meets the scheduling condition;

a distribution module, configured to assign the selected Reduce task to the requesting node if the selection module determines that the scheduling condition is met.

Optionally, the selection module comprises:

a comparing unit, configured to compare the data locality metric of the selected Reduce task with the threshold corresponding to the waiting count of the selected Reduce task;

a determination unit, configured to determine that the selected Reduce task meets the scheduling condition if the data locality metric is less than or equal to the threshold.

Optionally, the determining module comprises:

a computing unit, configured to calculate, for each unscheduled Reduce task, the proportion of the task's input data located on each of the other nodes, and to calculate the topology distance between the requesting node and each of the other nodes;

an accumulation unit, configured to calculate, for each node, the product of the data proportion and the topology distance corresponding to that node, and to take the sum of the products as the data locality metric of the unscheduled Reduce task.

In the embodiments of the present invention, when a requesting node applies for a Reduce task, the data locality metric of each unscheduled Reduce task is determined first; the Reduce task with the smallest data locality metric is then selected, and it is determined whether the selected Reduce task meets the scheduling condition; if so, the selected Reduce task is assigned to the requesting node. As the above technical solution shows, the embodiments of the present invention do not schedule a Reduce task directly, but evaluate the data locality metrics of the Reduce tasks and then assign a Reduce task that meets the scheduling condition. The data locality of Reduce tasks is therefore considered during task scheduling, which reduces task scheduling time and improves job throughput.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments recorded in the embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them.

Fig. 1 is a flow chart of the steps of a scheduling method for Reduce tasks according to Embodiment one of the present invention;

Fig. 2 is a flow chart of the steps of a scheduling method for Reduce tasks according to Embodiment two of the present invention;

Fig. 3 is a flow chart of the steps of a scheduling method for Reduce tasks according to Embodiment three of the present invention;

Fig. 4 is a structural block diagram of a scheduling device for Reduce tasks according to Embodiment four of the present invention;

Fig. 5 is a structural block diagram of a scheduling device for Reduce tasks according to Embodiment five of the present invention.

Detailed description of the embodiments

Of course, a technical solution implementing the embodiments of the present invention need not achieve all of the above advantages at the same time.

To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, these technical solutions are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention shall fall within the scope of protection of the embodiments of the present invention.

The specific implementation of the embodiments of the present invention is further explained below with reference to the drawings.

Embodiment one

Referring to Fig. 1, a flow chart of the steps of a scheduling method for Reduce tasks according to Embodiment one of the present invention is shown.

The scheduling method for Reduce tasks of this embodiment comprises the following steps:

Step 101: when a requesting node applies for a Reduce task, determine the data locality metric of each unscheduled Reduce task.

When a Hadoop cluster splits a job into multiple tasks for execution, it faces one problem: the data a task needs may not be on the corresponding compute node. There are two ways to solve this: a) copy the data to the compute node; b) copy the computation program to the machine that holds the data block and execute it there. Because the computation program is usually much smaller than the data block, Hadoop chooses to move the computation program rather than the data in order to reduce network transmission overhead, and it is this idea that gives Hadoop its high efficiency in big-data computation. Because the data a task needs may be spread over multiple nodes, or the data node may not be idle, some data must still be copied over the network even when the computation program is moved. Data locality mainly reflects the cost of this network copying; specifically, the compute node should be as close as possible to the data node (data locality is highest when they are the same node). The execution of a Reduce task in a JVM (Java Virtual Machine) instance is subdivided into three stages: Shuffle, Sort, and Reduce. The Shuffle stage is mainly responsible for collecting all Map output belonging to the task, until the output of all Map tasks (except failed Map tasks) has been collected. The Sort stage mainly sorts the collected Map output and groups together the key-value pairs that have the same key. The Reduce stage performs the user-defined Reduce operation on the sorted Map output, while continuously saving the Reduce output to an intermediate file in the configured distributed file system.
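The three stages of a Reduce task described above can be illustrated with a minimal, single-process Python sketch. It is only a toy model under stated assumptions: the function name `run_reduce_task`, the in-memory lists standing in for Map output, and the word-count style reducer are all hypothetical and are not part of Hadoop or of this embodiment.

```python
from collections import defaultdict


def run_reduce_task(map_outputs, reduce_fn):
    """Toy model of the Shuffle, Sort, and Reduce stages of one Reduce task."""
    # Shuffle: collect every (key, value) pair that the Map tasks produced
    # for this Reduce task.
    collected = [pair for output in map_outputs for pair in output]

    # Sort: order the pairs by key and group values that share a key.
    collected.sort(key=lambda kv: kv[0])
    grouped = defaultdict(list)
    for key, value in collected:
        grouped[key].append(value)

    # Reduce: apply the user-defined function to each group of values.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical output of two Map tasks for the same Reduce partition.
map_outputs = [[("a", 1), ("b", 1)], [("a", 1)]]
result = run_reduce_task(map_outputs, lambda key, values: sum(values))
```

In a real cluster the Shuffle stage copies data over the network from the nodes that ran the Map tasks, which is why the placement of the Reduce task matters.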

Data locality is a key factor affecting Hadoop cluster performance. Its idea is that a task should, as far as possible, execute on the node that holds the data it needs. This is also a core idea of Hadoop: move the computation program, not the data. Higher data locality reduces data transfer overhead, saves valuable bandwidth, and thus improves cluster performance. The traditional Hadoop framework does not optimize the data locality of Reduce tasks, which can lead to excessive network transmission overhead. To address this, the embodiment of the present invention uses a waiting mechanism in the Shuffle stage to improve the data locality of Reduce tasks without reducing the concurrency between tasks, thereby reducing job execution time and improving job throughput.

The resource scheduling model of the Hadoop framework is a pull model: the scheduler does not actively push tasks to compute nodes for execution, but waits for idle nodes to request tasks through the heartbeat mechanism. When a requesting node applies for a Reduce task, the list of unscheduled Reduce tasks is traversed and the data locality metric of each unscheduled Reduce task is determined.

Step 102: select the Reduce task with the smallest data locality metric, and determine whether the selected Reduce task meets the scheduling condition.

After the data locality metric of each unscheduled Reduce task has been determined in step 101, the Reduce task with the smallest data locality metric is selected, and whether it meets the scheduling condition is determined from its data locality metric. For example, the data locality metric of the selected Reduce task can be compared with a certain threshold to determine whether the scheduling condition is met.

Step 103: if so, assign the selected Reduce task to the requesting node.

When the selected Reduce task meets the scheduling condition, it can be assigned to the requesting node; otherwise, it is not assigned for the time being.

Embodiment two

Referring to Fig. 2, a flow chart of the steps of a scheduling method for Reduce tasks according to Embodiment two of the present invention is shown.

Step 201: when a requesting node applies for a Reduce task, determine the data locality metric of each unscheduled Reduce task.

Unlike a Map task, a Reduce task runs for a long time, spanning almost the entire life cycle of the job, and its input comes from the output of all Map tasks. To allow quantitative calculation during scheduling, the embodiment of the present invention defines a data locality measurement model for Reduce tasks.

In a preferred embodiment, the data locality of a Reduce task can be measured by the sum, over nodes, of the product of data proportion and topology distance. The process of determining the data locality metric of each unscheduled Reduce task in step 201 may further comprise:

A1: for each unscheduled Reduce task, calculate the proportion of the task's input data located on each of the other nodes (that is, each node other than the requesting node), and calculate the topology distance between the requesting node and each of the other nodes;

A2: calculate, for each node, the product of the data proportion and the topology distance corresponding to that node, and take the sum of the products as the data locality metric of the unscheduled Reduce task.

Here, the data proportion and topology distance corresponding to the same node are, for example, the proportion of the unscheduled Reduce task's input data located on node u and the topology distance between the requesting node and node u.

Because only a small part of the Map tasks have completed by the Shuffle stage, so that only part of the output data is available, the overall data distribution can be predicted from the partial data distribution when measuring data locality. In the embodiment of the present invention, the data locality metric of the unscheduled Reduce task of group i can be calculated as shown in formula 1:

$$M_i(v) = \sum_{u \in V} p_i(u) \cdot h(u, v) \qquad (1)$$

where $M_i(v)$ denotes the data locality metric of node v (the requesting node) processing the Reduce task of group i; the larger the data locality metric, the lower the data locality. $p_i(u)$ denotes the proportion of the total input data of the Reduce task of group i that is located on node u, and $h(u, v)$ is the topology distance between node u and node v. Note that V can denote the set of nodes other than the requesting node; V can also denote the set of all nodes (the requesting node and the other nodes), in which case, because the topology distance from a node to itself is 0, the calculation in effect still involves only the values of the other nodes.
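The data locality metric, that is, the sum over nodes of data proportion multiplied by topology distance, can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's implementation: the function names, the node names, the byte counts, and the distance table are all assumptions made for the example, and the proportions are estimated from whatever partial Map output exists so far.

```python
def estimate_proportions(bytes_by_node):
    """Estimate the share of a Reduce task's input on each node from the
    Map output produced so far (partial data stands in for the whole)."""
    total = sum(bytes_by_node.values())
    if total == 0:
        return {node: 0.0 for node in bytes_by_node}
    return {node: size / total for node, size in bytes_by_node.items()}


def locality_metric(proportions, distance, requester):
    """Sum of proportion(u) * distance(u, requester) over all nodes u.
    A larger result means lower data locality."""
    return sum(p * distance(u, requester) for u, p in proportions.items())


# Hypothetical cluster: n1 is the requester, n2 shares its rack, n3 does not.
hops = {("n1", "n1"): 0, ("n2", "n1"): 2, ("n3", "n1"): 4}
proportions = estimate_proportions({"n1": 50, "n2": 30, "n3": 20})
metric = locality_metric(proportions, lambda u, v: hops[(u, v)], "n1")
# metric is 0.5*0 + 0.3*2 + 0.2*4, which is about 1.4
```

Including the requesting node in the sum is harmless here because its distance to itself is 0, which matches the note about the two possible meanings of V.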

Further, because calculating the network topology distance between nodes is relatively complex, and the Hadoop framework needs to calculate it frequently, the embodiment of the present invention proposes a simple topology distance model. Before the step of determining the data locality metric of each unscheduled Reduce task, the method further includes determining the topology distance between different nodes.

The Hadoop framework may include multiple data centers, each data center may include multiple racks, and each rack may include multiple nodes. Preferably, the step of determining the topology distance between different nodes may include: determining that the topology distance between different nodes in the same rack is a first preset value; determining that the topology distance between nodes in different racks of the same data center is a second preset value; determining that the topology distance between nodes in different data centers is a third preset value; wherein the first preset value is less than the second preset value, and the second preset value is less than the third preset value.

Those skilled in the art can set the first, second, and third preset values to any suitable values based on practical experience; the embodiment of the present invention places no restriction on this. For example, the topology distance from a node to itself can be set to 0, the topology distance between different nodes in the same rack to 2, the topology distance between nodes in different racks of the same data center to 4, and the topology distance between nodes in different data centers to 6. In this case, the maximum value of the data locality metric of a Reduce task is 6.
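The simple topology distance model can be sketched as a small Python function. The example values 0, 2, 4, and 6 are taken from the text above; the `Node` class, its field names, and the constant names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    datacenter: str


# Example preset values from the text; any increasing values would do.
SAME_NODE, SAME_RACK, SAME_DATACENTER, CROSS_DATACENTER = 0, 2, 4, 6


def topology_distance(u, v):
    """Distance between two nodes under the three-level preset model."""
    if u == v:
        return SAME_NODE
    # Check the data center first, in case rack names repeat across centers.
    if u.datacenter != v.datacenter:
        return CROSS_DATACENTER
    if u.rack != v.rack:
        return SAME_DATACENTER
    return SAME_RACK


a = Node("n1", "rack1", "dc1")
b = Node("n2", "rack1", "dc1")
c = Node("n3", "rack2", "dc1")
d = Node("n4", "rack1", "dc2")
distances = [topology_distance(a, x) for x in (a, b, c, d)]
```

Because the data proportions of a task sum to at most 1, the metric can never exceed the largest preset distance, which is why the maximum in this example is 6.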

Step 202: select the Reduce task with the smallest data locality metric, and determine whether the selected Reduce task meets the scheduling condition. If so, execute step 203; if not, execute step 204.

After the data locality metric of each unscheduled Reduce task has been determined, the Reduce task with the smallest data locality metric is selected and processed as follows.

In a preferred embodiment, the process of determining whether the selected Reduce task meets the scheduling condition in step 202 may further comprise: comparing the data locality metric of the selected Reduce task with the threshold corresponding to the waiting count of the selected Reduce task; if the data locality metric is less than or equal to the threshold, determining that the selected Reduce task meets the scheduling condition; if the data locality metric is greater than the threshold, determining that the selected Reduce task does not meet the scheduling condition.

In the embodiment of the present invention, an N-level waiting mechanism can be set, with its data locality thresholds preset as an array R. R is an array of N elements (its length is N) whose elements increase, and whose last element is h (h is the maximum topology distance between nodes, which is also the maximum value of the data locality metric of a Reduce task). A waiting count k can be set for each Reduce task; its initial value is 0, and k (the waiting count) < N (the number of waiting levels). Each waiting count corresponds to one threshold, that is, to one element of array R. For example, when k=0 the corresponding threshold is R[0] (the first element of R); when k=1 the corresponding threshold is R[1] (the second element of R); and so on.

Note that those skilled in the art can set the number of waiting levels N and the data locality thresholds R of the N-level waiting mechanism to any suitable values based on practical experience; the embodiment of the present invention places no restriction on this. For example, when N=3, array R can be [1.2, 3.5, h]: when k=0 the corresponding threshold is R[0]=1.2; when k=1 it is R[1]=3.5; and when k=2 it is R[2]=h. Because h is the maximum value of the data locality metric of a Reduce task, when k=2 the data locality metric of the selected Reduce task is necessarily less than or equal to h.

As for setting N and array R: N is related to the cluster size. A larger cluster has more abundant resources, so N can be set slightly larger to improve data locality. R is related to the topology of the cluster nodes and to the number of Map tasks. If the cluster topology is complex, the data locality metric may be larger, so the values in R can be set slightly larger. For the same cluster, multiple sets of values for R can be defined, or R can be configured dynamically: for a large-scale job with more Map tasks, the input of the Reduce tasks is distributed more widely, so the values in R can also be set slightly larger.
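The threshold lookup of the N-level waiting mechanism can be sketched in Python as below. The array `[1.2, 3.5, h]` with h = 6 follows the examples in the text; the function names are hypothetical, and clamping the index to the last element is an added safeguard that the text does not describe (it only states that k < N).

```python
def threshold_for(wait_count, thresholds):
    """Return the locality threshold for a given waiting count.
    Clamping to the last element is an assumption for robustness."""
    index = min(wait_count, len(thresholds) - 1)
    return thresholds[index]


def is_schedulable(metric, wait_count, thresholds):
    """A task may be scheduled when its metric does not exceed the
    threshold that corresponds to its current waiting count."""
    return metric <= threshold_for(wait_count, thresholds)


H = 6.0                      # maximum topology distance in the example
R = [1.2, 3.5, H]            # N = 3 waiting levels, increasing thresholds

first_try = is_schedulable(2.0, 0, R)    # 2.0 > 1.2, so the task waits
second_try = is_schedulable(2.0, 1, R)   # 2.0 <= 3.5, so it can be assigned
last_level = is_schedulable(H, 2, R)     # at the last level any metric passes
```

Since the last threshold equals the largest possible metric, a task can be refused at most N - 1 times before it becomes schedulable on whichever node requests it, which bounds the delay introduced by waiting.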

Step 203: if so, assign the selected Reduce task to the requesting node, and set the waiting count of the selected Reduce task to 0.

If the selected Reduce task meets the scheduling condition, it can be assigned to the requesting node for execution. Because the selected Reduce task has now been scheduled, its waiting count is reset to 0, so that the next time this Reduce task is selected the threshold for comparison is determined from the updated waiting count.

Step 204: if not, increment the waiting count of the selected Reduce task by 1.

If the selected Reduce task does not meet the scheduling condition, it is not assigned to the requesting node for the time being. Because the selected Reduce task was not scheduled this time and has waited one more time, its waiting count is incremented by 1, so that the next time this Reduce task is selected the threshold for comparison is determined from the updated waiting count.

When the Shuffle mechanism starts and Reduce tasks begin to be scheduled (by default after 5% of the Map tasks have completed), and an idle requesting node v applies for a Reduce task, the list of unscheduled Reduce tasks (the non-running Reduces) is traversed, and the data locality measurement model for Reduce tasks is used to calculate the data locality metric of each unscheduled Reduce task in turn. The Reduce task with the smallest data locality metric is selected, and it is judged whether its data locality metric is less than or equal to the threshold corresponding to its waiting count. If so, the selected Reduce task is assigned to node v for execution, its waiting count is reset, and the next scheduling round is awaited; otherwise, the waiting count of the selected Reduce task is incremented by 1, and the next scheduling round is awaited.

Because a node with high data locality may be busy, the principle of assigning tasks as nodes request them may reduce the data locality of Reduce tasks. The embodiment of the present invention applies the idea of delay scheduling to the scheduling of Reduce tasks in the Shuffle stage: the data locality of a Reduce task is first measured with the data locality measurement model for Reduce tasks, and a multi-level waiting mechanism for task scheduling is then established through locality thresholds to schedule Reduce tasks. This improves the data locality of Reduce tasks without reducing task concurrency, reduces task scheduling time, and thus improves the performance of the Hadoop cluster.

Embodiment three

Referring to Fig. 3, a flow chart of the steps of a scheduling method for Reduce tasks according to Embodiment three of the present invention is shown.

Step 301: initialize variables: define the initial waiting count k=0 of each Reduce task, and define the data locality thresholds (for example, array R).

Step 302: node v has idle resources and applies to start a Reduce task.

Step 303: calculate the data locality metric of each unscheduled Reduce task using the data locality measurement model.

Step 304: select the Reduce task with the smallest data locality metric.

Step 305: determine whether the data locality metric of the selected Reduce task is less than or equal to the threshold corresponding to the waiting count of the selected Reduce task. If so, execute step 306; if not, execute step 309.

Step 306: assign the selected Reduce task to node v.

Step 307: reset the waiting count of the selected Reduce task to 0.

Step 308: wait for a node to apply for a Reduce task again; when a node does so, execute step 302.

Step 309: do not assign the selected Reduce task to node v this time.

Step 310: increment the waiting count of the selected Reduce task by 1.

Step 311: wait for a node to apply for a Reduce task again; when a node does so, execute step 302.
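Steps 303 to 310 for a single request can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: `ReduceTask`, `handle_request`, and `metric_of` are hypothetical names, the metric values are made up, the thresholds reuse the example [1.2, 3.5, 6], and clamping the waiting count to the last threshold is an added safeguard.

```python
from dataclasses import dataclass


@dataclass
class ReduceTask:
    task_id: str
    wait_count: int = 0  # step 301: every task starts with k = 0


def handle_request(pending, metric_of, thresholds):
    """Serve one request from an idle node. `metric_of` maps a task to its
    data locality metric for that node. Returns the assigned task or None."""
    if not pending:
        return None
    # Steps 303-304: evaluate every unscheduled task, keep the smallest metric.
    chosen = min(pending, key=metric_of)
    # Step 305: compare against the threshold for the task's waiting count.
    limit = thresholds[min(chosen.wait_count, len(thresholds) - 1)]
    if metric_of(chosen) <= limit:
        pending.remove(chosen)      # step 306: assign to the requesting node
        chosen.wait_count = 0       # step 307: reset the waiting count
        return chosen
    chosen.wait_count += 1          # steps 309-310: skip and wait once more
    return None


# Hypothetical metrics of two pending tasks as seen from one requesting node.
metrics = {"r1": 3.0, "r2": 5.0}
pending = [ReduceTask("r1"), ReduceTask("r2")]
thresholds = [1.2, 3.5, 6.0]

first = handle_request(pending, lambda t: metrics[t.task_id], thresholds)
waits_after_first = pending[0].wait_count
second = handle_request(pending, lambda t: metrics[t.task_id], thresholds)
```

In this run the first request is refused because 3.0 exceeds R[0] = 1.2, and the second request succeeds because 3.0 does not exceed R[1] = 3.5. In practice the second request would usually come from a different node, so the metrics would be recomputed for that node.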

The embodiment of the present invention proposes a delay scheduling method for Reduce tasks based on the Shuffle stage. Without reducing the concurrency between tasks, the method shortens job execution time by improving the data locality of Reduce tasks. The embodiment of the present invention offers guidance for improving the data locality of Reduce tasks, reducing network transmission overhead, and shortening job execution time. In a single-job environment, the embodiment of the present invention can reduce the execution time of the job by improving the data locality of Reduce tasks, and the scheduling algorithm remains compatible with the high fault tolerance of the Hadoop framework. In a multi-job environment, the embodiment of the present invention can reduce job execution time by improving the data locality of Reduce tasks without reducing task concurrency, which can effectively improve the overall performance of the cluster.

Embodiment four

Referring to Fig. 4, a structural block diagram of a scheduling device for Reduce tasks according to Embodiment four of the present invention is shown.

The scheduling device for Reduce tasks of this embodiment comprises the following modules:

a determining module 401, configured to determine, when a requesting node applies for a Reduce task, the data locality metric of each unscheduled Reduce task;

a selection module 402, configured to select the Reduce task with the smallest data locality metric and determine whether the selected Reduce task meets the scheduling condition;

a distribution module 403, configured to assign the selected Reduce task to the requesting node if the selection module determines that the scheduling condition is met.

Embodiment five

Referring to Fig. 5, a structural block diagram of a scheduling device for Reduce tasks according to Embodiment five of the present invention is shown.

A determining module 501, configured to determine, when a requesting node applies for a Reduce task, the data locality metric of each unscheduled Reduce task.

A selection module 502, configured to select the Reduce task with the smallest data locality metric and determine whether the selected Reduce task meets the scheduling condition.

A distribution module 503, configured to assign the selected Reduce task to the requesting node if the selection module determines that the scheduling condition is met.

Preferably, the determining module 501 may include: computing unit, for for each not scheduled Reduce Task calculates separately ratio data of the input data of the not scheduled Reduce task in other each nodes, and Calculate separately the topology distance between the requesting node and other each nodes；Cumulative unit corresponds to together for calculating separately The ratio data of one node and the product of topology distance, and determine the accumulation of the product and be described not scheduled The data locality metric of Reduce task.

Preferably, the selection module 502 may include: comparing unit, the Reduce task for the selection Data locality metric threshold value corresponding with the waiting number of Reduce task of the selection；Determination unit, if for The data locality metric is less than or equal to the threshold value, it is determined that the Reduce task of the selection meets scheduling item Part.

Preferably, the Reduce task scheduling device of this embodiment may further include: a first setup module 504, configured to set the wait count of the selected Reduce task to 0 after the distribution module assigns the selected Reduce task to the requesting node.

Preferably, the Reduce task scheduling device of this embodiment may further include: a second setup module 505, configured to increment the wait count of the selected Reduce task by 1 if the selection module determines that the scheduling condition is not met.

Preferably, the Reduce task scheduling device of this embodiment may further include: a setting module 506, configured to determine the topology distance between different nodes.

Preferably, the setting module is further configured to: determine that the topology distance between different nodes in the same rack is a first preset value; determine that the topology distance between nodes in different racks of the same data center is a second preset value; and determine that the topology distance between nodes in different data centers is a third preset value; wherein the first preset value is less than the second preset value, and the second preset value is less than the third preset value.
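The three-level distance rule can be sketched as below. The `Node` type and the constants 2, 4, and 6 are hypothetical placeholders chosen only so that the first preset value is less than the second and the second is less than the third; the embodiment does not fix these numbers here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    data_center: str


# Hypothetical preset values; only their ordering follows from the text.
SAME_RACK = 2
SAME_DATA_CENTER = 4
DIFFERENT_DATA_CENTER = 6


def topology_distance(a: Node, b: Node) -> int:
    """Topology distance between two different nodes."""
    if a.data_center != b.data_center:
        return DIFFERENT_DATA_CENTER
    if a.rack != b.rack:
        return SAME_DATA_CENTER
    return SAME_RACK


a = Node("a", "rack1", "dc1")
b = Node("b", "rack1", "dc1")
c = Node("c", "rack2", "dc1")
d = Node("d", "rack9", "dc2")
```

The data center is compared first so that two racks with the same name in different data centers are not mistaken for one rack.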

The embodiment of the present invention applies the idea of delay scheduling to the scheduling of Reduce tasks in the Shuffle stage. First, the data locality of each Reduce task is measured with a Reduce task data locality metric model; then a multi-level waiting mechanism for task scheduling is established through locality thresholds, thereby scheduling the Reduce tasks. In this way, without reducing task parallelism, the data locality of Reduce tasks is improved and the task scheduling time is reduced, which in turn improves the performance of the Hadoop cluster.
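The overall flow (measure locality, select the smallest metric, test it against the threshold for the task's wait count, then assign or wait) can be sketched end to end as follows. All names, the threshold table, and the distance values are assumptions made for this example, and the sketch is not presented as the actual Hadoop scheduler interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical thresholds indexed by wait count (illustrative values only).
THRESHOLDS = (1.0, 2.0, 4.0, float("inf"))


@dataclass
class ReduceTask:
    task_id: str
    data_ratio: Dict[str, float]  # proportion of input data held on each node
    wait_count: int = 0


def locality_metric(task: ReduceTask, requester: str,
                    distance: Dict[str, Dict[str, float]]) -> float:
    """Sum of proportion x topology distance over nodes other than the requester."""
    return sum(ratio * distance[requester][node]
               for node, ratio in task.data_ratio.items() if node != requester)


def schedule(requester: str, pending: List[ReduceTask],
             distance: Dict[str, Dict[str, float]]) -> Optional[ReduceTask]:
    """Return the task assigned to the requester, or None if it must wait."""
    if not pending:
        return None
    best = min(pending, key=lambda t: locality_metric(t, requester, distance))
    limit = THRESHOLDS[min(best.wait_count, len(THRESHOLDS) - 1)]
    if locality_metric(best, requester, distance) <= limit:
        best.wait_count = 0      # reset the wait count after assignment
        pending.remove(best)
        return best
    best.wait_count += 1         # condition not met: the task waits one more round
    return None


# Hypothetical cluster: distances from requesting node n1 to the other nodes.
distance = {"n1": {"n2": 2, "n3": 4}}
pending = [
    ReduceTask("r1", {"n1": 0.5, "n2": 0.5}),  # metric 1.0 from n1
    ReduceTask("r2", {"n3": 1.0}),             # metric 4.0 from n1
]
first = schedule("n1", pending, distance)   # r1 meets the threshold at once
second = schedule("n1", pending, distance)  # r2 is deferred
```

Under these assumed thresholds, task `r2` is deferred twice and assigned on its third request, which illustrates how the multi-level waiting mechanism keeps a task with poor locality from waiting indefinitely.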

It should be noted that, according to implementation needs, each component/step described in the embodiments of the present invention may be split into more components/steps, and two or more components/steps, or partial operations of components/steps, may be combined into a new component/step, so as to achieve the purpose of the embodiments of the present invention.

The above method according to the embodiments of the present invention may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It will be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, or flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the Reduce task scheduling method described herein is implemented. In addition, when a general-purpose computer accesses code for implementing the Reduce task scheduling method shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing that method.

Those of ordinary skill in the art will appreciate that the units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the embodiments of the present invention.

The above embodiments are intended only to illustrate the embodiments of the present invention, not to limit them. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention; therefore all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention shall be defined by the claims.

Claims

1. A scheduling method for Reduce tasks, characterized in that the method comprises:

when a requesting node requests a Reduce task, determining the data locality metric of each unscheduled Reduce task;

selecting the Reduce task with the smallest data locality metric, and determining whether the selected Reduce task meets a scheduling condition;

if the scheduling condition is met, assigning the selected Reduce task to the requesting node.

2. The method according to claim 1, characterized in that the step of determining whether the selected Reduce task meets the scheduling condition comprises:

comparing the data locality metric of the selected Reduce task with the threshold corresponding to the wait count of the selected Reduce task;

if the data locality metric is less than or equal to the threshold, determining that the selected Reduce task meets the scheduling condition.

3. The method according to claim 1 or 2, characterized in that, after the step of assigning the selected Reduce task to the requesting node, the method further comprises:

setting the wait count of the selected Reduce task to 0.

4. The method according to claim 1 or 2, characterized in that the method further comprises:

if the scheduling condition is not met, incrementing the wait count of the selected Reduce task by 1.

5. The method according to claim 1, characterized in that the step of determining the data locality metric of each unscheduled Reduce task comprises:

for each unscheduled Reduce task, calculating the proportion of that task's input data held on each of the other nodes, and calculating the topology distance between the requesting node and each of the other nodes;

calculating, for each node, the product of the corresponding data proportion and topology distance, and determining the sum of these products as the data locality metric of the unscheduled Reduce task.

6. The method according to claim 1 or 5, characterized in that, before the step of determining the data locality metric of each unscheduled Reduce task, the method further comprises: determining the topology distance between different nodes.

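7. The method according to claim 6, characterized in that the step of determining the topology distance between different nodes comprises:

determining that the topology distance between different nodes in the same rack is a first preset value; determining that the topology distance between nodes in different racks of the same data center is a second preset value; determining that the topology distance between nodes in different data centers is a third preset value;

wherein the first preset value is less than the second preset value, and the second preset value is less than the third preset value.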

8. A scheduling device for Reduce tasks, characterized in that the device comprises:

a determining module, configured to determine the data locality metric of each unscheduled Reduce task when a requesting node requests a Reduce task;

a selection module, configured to select the Reduce task with the smallest data locality metric and to determine whether the selected Reduce task meets a scheduling condition;

a distribution module, configured to assign the selected Reduce task to the requesting node if the selection module determines that the scheduling condition is met.

9. The device according to claim 8, characterized in that the selection module comprises:

a comparing unit, configured to compare the data locality metric of the selected Reduce task with the threshold corresponding to the wait count of the selected Reduce task;

a determination unit, configured to determine that the selected Reduce task meets the scheduling condition if the data locality metric is less than or equal to the threshold.

10. The device according to claim 8, characterized in that the determining module comprises:

a computing unit, configured to calculate, for each unscheduled Reduce task, the proportion of that task's input data held on each of the other nodes, and to calculate the topology distance between the requesting node and each of the other nodes;

an accumulation unit, configured to calculate, for each node, the product of the corresponding data proportion and topology distance, and to determine the sum of these products as the data locality metric of the unscheduled Reduce task.