CN105808334A - MapReduce short job optimization system and method based on resource reuse - Google Patents


Publication number
CN105808334A
Authority
CN
China
Prior art keywords
task
node
resource
short
locality
Prior art date
Legal status
Granted
Application number
CN201610124760.2A
Other languages
Chinese (zh)
Other versions
CN105808334B (en)
Inventor
史玉良
崔立真
李庆忠
郑永清
张开会
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201610124760.2A priority Critical patent/CN105808334B/en
Publication of CN105808334A publication Critical patent/CN105808334A/en
Application granted granted Critical
Publication of CN105808334B publication Critical patent/CN105808334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a MapReduce short-job optimization system and method based on resource reuse. The system comprises a master node, a primary slave node, and several secondary slave nodes; the master node is connected to the primary slave node, and the primary slave node is connected to the secondary slave nodes. A resource manager and a primary scheduler are deployed on the master node; an application manager, a task performance estimator, and a sub-scheduler are deployed on the primary slave node, where the sub-scheduler is connected to both the task performance estimator and the master node; a node manager is deployed on each secondary slave node. The system optimizes short-job running performance by increasing effective resource utilization: it lowers the frequency of resource allocation and reclamation, devotes the time saved from allocation and reclamation to running short jobs, and improves short-job execution performance by shortening the time jobs wait for resources.

Description

A MapReduce short-job optimization system and method based on resource reuse
Technical field
The present invention relates to a MapReduce short-job optimization system and method based on resource reuse.
Background art
Industries such as the Internet, finance, and media face the challenge of processing large-scale datasets, which conventional data-processing tools and computation models cannot handle. The MapReduce model proposed by Google provides an effective solution, and Hadoop is its open-source implementation. Hadoop decomposes a submitted job into fine-grained Map tasks and Reduce tasks and runs these tasks in parallel on multiple nodes of a cluster, greatly shortening job running time. Hadoop hides the details of parallel computation, distributes data to computing nodes, and reruns failed tasks, letting users concentrate on their concrete business logic. Moreover, Hadoop offers good linear scalability, data redundancy, and fault-tolerant computation; these advantages have made it the mainstream computing framework for data-intensive and compute-intensive applications. Hadoop's success in industry has also prompted academia to study it and to propose improvements to its design and implementation.
Hadoop's original design target was running large jobs in parallel across many computing nodes, but in actual production Hadoop is frequently used to process small-scale short jobs. A short job is one whose completion time is below a set threshold, usually configured by the user. Small short jobs and large jobs differ in many respects, such as the size of the input dataset, the number of tasks the job decomposes into, the resources each task needs, task completion times, and the user's expectation of the job completion time. Because Hadoop was not designed with short jobs in mind, short jobs run rather inefficiently on it.
Many factors affect job running performance; three key ones are the configuration of cluster nodes, the job scheduling algorithm, and the cluster load. When scheduling tasks, Hadoop assumes that the nodes constituting the cluster are homogeneous, i.e., every node has identical hardware configuration such as CPU, memory, and disk. In practice, however, as business grows the cluster is gradually expanded, and newly added nodes are configured markedly better than older ones, so the same task finishes at quite different times on different nodes. Under high cluster load, not every task of a job can obtain resources to run immediately; some tasks enter a queue to wait for resources. When a task holding resources finishes and releases them, Hadoop selects a suitable task from the waiting queue according to its scheduling algorithm and assigns the freed resources to it. Hence, if a job decomposes into many tasks, the job's tasks may need several rounds of execution to complete. In Taobao's Hadoop cluster, for example, jobs whose Map tasks run for more than two rounds account for over 30%. The cluster load therefore has a decisive impact on job response time and completion time.
In recent years, optimizing the execution performance of MapReduce jobs has become a research hotspot, with much work improving job execution efficiency from many angles: the Hadoop architecture, job scheduling algorithms, the job execution mechanism, hardware accelerators, and so on.
Piranha concluded from job history data that small jobs have few tasks and a low failure rate. Prior work exploits the small task count to streamline small-job execution: for example, Map and Reduce tasks are started simultaneously, and Map intermediate results are written directly to the Reduce side. Since the failure rate of small jobs is below 1%, Piranha adopts a simple fault-tolerance mechanism: if a run fails, it re-executes the whole job rather than only the failed task. Other prior work decomposes a batch job into a large number of small tasks, each processing only 8 MB of data, so a task finishes its computation within seconds; because the task input is small and the running time short, tasks suffer neither data skew nor straggler problems, and interactive jobs no longer wait long for resources. Tenzing reduces the turnaround time of MapReduce jobs by providing a pool of job processes: when a new job is submitted, Tenzing's scheduler selects an idle process from the pool to run it. The process pool avoids the cost of starting new jobs, but Tenzing has two shortcomings: the number of reserved processes can exceed what is actually needed, wasting cluster resources; and the scheduler is limited to certain concrete processes, which compromises the data locality of tasks.
Prior work has also analyzed the shortcomings of the Hadoop framework when running short jobs and proposed improving short-job efficiency at the architectural level. One paper optimizes three aspects: 1) the setup and cleanup tasks are moved to run on the master node, avoiding job state changes via heartbeat messages and directly shortening job completion time by one heartbeat cycle; 2) task dispatch is changed from a "pull" model to a "push" model, reducing dispatch delay; 3) the control messages between the master node and slave nodes are decoupled from the existing heartbeat mechanism and delivered immediately.
Spark is another computing framework for processing large-scale datasets, with an engine that executes DAGs. It differs from Hadoop in that it computes in memory: Spark keeps the intermediate results of a job in memory rather than on local disk or HDFS, and downstream jobs in the DAG read their input directly from memory. SpongeFiles targets the mitigation of data skew, yet its reduction of job running time is highly significant. Other prior work introduces distributed memory: Map task outputs are preferentially written to distributed memory, cutting both the time to write data to local disk and the time the shuffle phase spends reading Map intermediate results. This clearly benefits jobs with large Map output, but has no obvious effect on Map-only jobs or jobs with little Map output. All of these works improve job execution efficiency by reducing disk I/O.
Sparrow's design goal is to provide low-latency task scheduling. Sparrow keeps long-running task processes on every node; these long-running processes execute new tasks, avoiding the cost of frequently starting task processes. The number of long-running task processes is either set statically by the user or adjusted automatically by the resource manager according to cluster load. Quincy is a task-level scheduler similar to Sparrow. It maps the scheduling problem to a graph to compute an optimal scheduling order, taking data locality, fairness, and starvation into account. Compared with Sparrow, Quincy takes longer to compute a scheduling order.
Hadoop's job scheduling algorithm has an important impact on job running time. The FIFO scheduler dispatches jobs in order of submission time; because it ignores differences between jobs, it executes small jobs and interactive jobs rather inefficiently. The FAIR scheduler gives jobs submitted by users a fair share of cluster resources, ensures short jobs complete within a reasonable time, and avoids job starvation; but it considers neither cluster heterogeneity nor jobs with time constraints. HFSP estimates job size at runtime and gives small jobs high priority, ensuring they complete in the shortest time; job priority is adjusted dynamically over time to prevent starvation.
Summary of the invention
The purpose of the present invention is to solve the above problems by providing a MapReduce short-job optimization system and method based on resource reuse, which optimizes short-job running performance by improving effective resource utilization: it reduces the frequency of resource allocation and reclamation, devotes the time saved from allocation and reclamation to running short jobs, and improves short-job execution performance by shortening the time jobs wait for resources.
To achieve these goals, the present invention adopts the following technical scheme:
A MapReduce short-job optimization system based on resource reuse, comprising: a master node, a primary slave node, and several secondary slave nodes, wherein the master node is connected to the primary slave node, and the primary slave node is connected to the secondary slave nodes;
A resource manager and a primary task scheduler are deployed on the master node;
An application manager, a task performance estimator, and a secondary task scheduler are deployed on the primary slave node, wherein the secondary task scheduler is connected to the task performance estimator and also to the master node;
A node manager is deployed on each secondary slave node.
The resource manager is responsible for global resource allocation and monitoring, and for starting and monitoring application managers.
The primary task scheduler schedules the job queue according to task priority, the resources tasks require, task submission order, and the like.
The application manager decomposes a job into Map tasks and Reduce tasks, applies for resources on their behalf, coordinates their execution with the node managers, and monitors the tasks. The application manager is the control unit of a running job; each job has its own application manager.
The task performance estimator predicts, based on a task performance model, the completion times of running tasks and of unscheduled tasks.
The secondary task scheduler judges from the estimator's predictions whether a running task is a short task, and selects tasks from the unscheduled task queue.
If the running task is a short task, the secondary task scheduler selects a new short task from the unscheduled task queue, and the new short task reuses the resources the running short task is about to release; if the running task is not a short task, its resources are released directly when it finishes executing.
When selecting an unscheduled task, the secondary task scheduler must consider task locality, task running time, resource fairness, and cluster heterogeneity.
The node manager monitors the resources tasks use, preventing a running task from using more resources than it applied for.
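As a concrete illustration, the secondary task scheduler's reuse decision can be sketched as follows. This is a minimal sketch under assumed names and an assumed 10-second threshold, not the patent's actual implementation:

```python
# Hypothetical sketch: on each task heartbeat, reuse the slot of a
# predicted-short task for the next queued task; otherwise let the
# resources be released back to the resource manager.

SHORT_TASK_THRESHOLD = 10.0  # seconds; in the patent this is user-set

def on_heartbeat(predicted_runtime, unscheduled_queue):
    """Return the next task that should reuse the slot, or None."""
    if predicted_runtime <= SHORT_TASK_THRESHOLD:
        # running task is short: pick a replacement so the released
        # resources are reused without a new allocation round trip
        if unscheduled_queue:
            return unscheduled_queue.pop(0)
    # long task (or nothing queued): resources are released normally
    return None
```

A real scheduler would select from the queue by locality and predicted running time rather than taking the head, as described in the selection steps of the method.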
A MapReduce short-job optimization method based on resource reuse, comprising the following steps:
Step (1): the application manager applies to the resource manager for resources via heartbeat messages;
Step (2): the resource manager allocates idle resources to the applying application manager, which thereby obtains the resources it applied for;
Step (3): the application manager assigns the obtained resources to unscheduled tasks, then notifies the corresponding node managers to start task processes;
Step (4): a node manager starts a task process to run the task; while running, the task sends a heartbeat message to its application manager at every configured heartbeat interval, carrying the task's progress, task statistics, and task health status; task statistics are the amount of data processed, the amount of data output, the time consumed, and the spill-write I/O rate; task health indicates whether the process executing the task is abnormal;
Step (5): the application manager receives the heartbeat and predicts the task's running time with the completion-time prediction model in the task performance estimator; if the predicted running time is less than or equal to the task completion time set by the user, the current task is a short task; otherwise it is a long task;
If the current task is a short task, the secondary task scheduler selects a new task from the unscheduled task queue; if it is a long task, the heartbeat message is ignored;
Step (6): the application manager notifies the task process of the selected new task; after the running task finishes, the task process continues with the new task;
Step (7): if the task process has received no new task when the running task finishes, the process exits and releases the resources it occupied, and the node manager reports the released resources to the resource manager via heartbeat.
In step (4), the task sends heartbeat messages to its application manager at every configured heartbeat interval as follows:
Step (401): judge whether the task's progress exceeds the set threshold; if so, compute the statistics of the current task, send a statistics heartbeat, and go to step (403); otherwise go to step (402); the statistics of the current task include the amount of data processed, the running time, and the amount of data output;
Step (402): the task's progress does not exceed the threshold; send a task-health heartbeat and go to step (404);
Step (403): the task receives the application manager's heartbeat reply; go to step (404);
Step (404): judge whether the task process has received a new task; if so, go to step (405); otherwise go to step (406);
Step (405): read the new task's input data onto the current node; after the current task finishes, run the newly received task;
Step (406): the task process has received no new task; after the current task finishes, release the resources the task occupies.
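The branching of steps (401)-(406) can be sketched as below; the progress threshold value and all names are illustrative assumptions, not taken from the patent:

```python
# Hypothetical sketch of the task-side heartbeat flow: which heartbeat
# a task sends depends on its progress (401)-(402), and what happens at
# completion depends on whether a new task arrived (404)-(406).

PROGRESS_THRESHOLD = 0.33  # assumed value for the progress threshold

def heartbeat_kind(progress):
    # (401)/(402): send statistics only once enough progress exists to
    # make the measured rates meaningful; otherwise just report health
    return "statistics" if progress > PROGRESS_THRESHOLD else "health"

def on_task_finished(new_task):
    # (404)-(406): reuse the slot if a new task arrived, else release it
    if new_task is not None:
        return f"run {new_task}"      # (405): input already prefetched
    return "release resources"        # (406)
```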
The completion-time prediction model in the task performance estimator of step (5) predicts a task's running time as follows:
The execution of a task can be divided into several sub-phases, and the task's completion time is closely related to the time consumed in each sub-phase, so the completion-time prediction model is built from the per-sub-phase times.
The execution of the task is divided into multiple sub-phases; if the sub-phases do not overlap, the completion time of the task is the sum of the sub-phase times;
If sub-phases overlap, the overlapping part must be removed.
Suppose the execution of task i decomposes into n sub-phases. The amount of data each sub-phase processes is written as the vector s = [s_1, s_2, ..., s_n], and the speed at which each sub-phase processes data as the vector r = [r_1, r_2, ..., r_n].
The completion time T_i of task i is:

T_i = sum_{j=1..n} s_j / r_j + α    (1)

where α is the time to start the task; it varies within a fixed range and is treated as a constant.
The execution of a Map task decomposes into four sub-phases: reading the input data, running the map operation, spill-writing the output data, and merging the intermediate result files, denoted read, map, spill, and combine respectively.
By formula (1), the completion time T_i of Map task i is:

T_i = s_read / r_read + s_map / r_map + s_spill / r_spill + s_combine / r_combine + α    (2)

For short tasks the intermediate-file merge phase runs very briefly, so its time is treated as a constant.
The read phase and the map phase process the same amount of data, namely the task's input volume s_task.
Because the map phase and the spill phase partially overlap, the data volume of the spill phase is taken to be the output left over after the map operation, i.e., the amount written by the last spill.
Let s_input^done be the input data volume processed so far, s_output^done the data volume output so far, s_buffer the buffer size set by the configuration item, and s_spill^last the amount written by the last spill. Formula (2) then develops into:

T_i = s_task * (1 / r_read + 1 / r_map) + s_spill^last / r_spill + β    (3)

s_spill^last = (s_task * ratio_output) mod s_buffer    (4)

ratio_output = s_output^done / s_input^done    (5)

where s_task is the task's input data volume, β is a constant, and r_read, r_map, and r_spill are obtained from the statistics heartbeats sent by the task side: with s_i^done the data volume processed in sub-phase i and t_i^done the time consumed in sub-phase i, r_i = s_i^done / t_i^done.
When the application manager receives a statistics heartbeat from the task side, it computes the current task's completion time by formula (3).
Suppose task k is running on node w. To predict the completion time of an unscheduled task m on node w, the processing speed of each sub-phase of task m is taken from the measured speeds of task k:

r_i^m = r_i^k, i ∈ {read, map, spill},
ratio_output^m = ratio_output^k.
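As a concrete illustration, the prediction of formulas (3)-(5) can be sketched in a few lines of code; the function name and the assumption that the phase rates and done-counters have already been extracted from a statistics heartbeat are illustrative, not part of the patent:

```python
# Hypothetical sketch of the Map completion-time model, formulas (3)-(5).
# Units are consistent (e.g. MB for volumes, MB/s for rates, s for time).

def predict_map_time(s_task, r_read, r_map, r_spill,
                     s_input_done, s_output_done, s_buffer, beta):
    """Predict a Map task's completion time from measured phase rates."""
    ratio_output = s_output_done / s_input_done          # formula (5)
    s_spill_last = (s_task * ratio_output) % s_buffer    # formula (4)
    return (s_task * (1.0 / r_read + 1.0 / r_map)        # formula (3)
            + s_spill_last / r_spill + beta)
```

For example, a task with 128 MB of input, measured rates of 64, 32, and 16 MB/s for the read, map, and spill phases, an observed output ratio of 0.5, a 100 MB buffer, and β = 1 s is predicted to finish in 11 seconds.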
The secondary task scheduler of step (5) selects a new task from the unscheduled task queue as follows:
Step (51):
Classify unscheduled tasks by locality into node-local tasks, rack-local tasks, and off-rack tasks;
If a task's input data reside on the node that will process the task, the task is a node-local task;
If the task's input data and its processing node are in the same rack, the task is a rack-local task;
If the task's input data and its processing node are not in the same rack, the task is an off-rack task;
Select tasks in the order node-local, rack-local, off-rack, preferring within each priority level the task with the minimum running time;
Step (52): select the optimal task according to the heterogeneity principle.
Step (51) proceeds as follows:
Step (511): the application manager receives a statistics heartbeat sent by a task process; go to step (512);
Step (512): the task performance estimator predicts the completion time of the task that sent the heartbeat; go to step (513);
Step (513): if the predicted completion time of the heartbeat-sending task is below the set threshold, compute the running time of each unscheduled task on every node running the current job; go to step (514);
Step (514): add the unscheduled tasks to the node-local task queue, the rack-local task queue, or the off-rack task queue according to their locality; go to step (515); locality means whether a task's input data are on the processing node, or in the same rack;
Step (515): sort the node-local, rack-local, and off-rack task queues by task running time; go to step (516);
Step (516): in order of locality priority, judge whether an unscheduled task is the optimal task; if so, delete it from the unscheduled task list and return it; otherwise judge the next unscheduled task, until all unscheduled tasks have been checked.
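Steps (511)-(516) can be sketched as follows, with the task representation and the optimality test left abstract; all names are illustrative assumptions:

```python
# Hypothetical sketch of step (51): bucket unscheduled tasks by locality,
# sort each bucket by predicted running time, and scan the buckets in
# priority order (node-local, then rack-local, then off-rack).

def pick_by_locality(tasks, is_optimal):
    """tasks: list of (name, locality, runtime) tuples, with locality in
    {'node', 'rack', 'off'}. Returns the first optimal task, or None."""
    queues = {"node": [], "rack": [], "off": []}
    for t in tasks:                          # step (514): bucket by locality
        queues[t[1]].append(t)
    for level in ("node", "rack", "off"):    # steps (515)-(516)
        for t in sorted(queues[level], key=lambda t: t[2]):
            if is_optimal(t):                # optimality test of step (52)
                return t
    return None
```

Note that a node-local task is returned ahead of a faster rack-local one: locality outranks predicted running time, as the selection order above specifies.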
Step (52) proceeds as follows:
Step (521): compute the candidate task's running time on every node running the current job; go to step (522);
Step (522): find the node on which the candidate task runs in the shortest time; go to step (523);
Step (523): judge whether that node is the same as the node holding the task's input data; if so, return the task as the optimal task; otherwise go to step (524);
Step (524): compute the benefit of transferring the task from its input node to the node with the shortest running time; if the benefit exceeds the set threshold, return the task as the optimal task; otherwise return that the task is not optimal.
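Steps (521)-(524) can be sketched as below; the function and parameter names are assumptions for the example:

```python
# Hypothetical sketch of step (52): run the task on its fastest node only
# when that node holds the input, or when the time saved by moving away
# from the input node (the transfer benefit) exceeds a threshold.

def choose_node(runtimes, input_node, benefit_threshold):
    """runtimes: dict mapping node -> predicted running time of the task."""
    fastest = min(runtimes, key=runtimes.get)           # (521)-(522)
    if fastest == input_node:                           # (523)
        return fastest
    benefit = runtimes[input_node] - runtimes[fastest]  # (524)
    return fastest if benefit > benefit_threshold else input_node
```

The threshold keeps the scheduler from sacrificing data locality for a marginal speedup on a heterogeneous cluster: only a clearly faster remote node justifies moving the task's input.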
Task locality means: if a task's input data and its processing node are the same node, the task is a node-local task; if they are in the same rack, a rack-local task; if not in the same rack, an off-rack task. Locality is simply the relation between a task's input data and its processing node.
Transfer benefit means: if running a task on another node takes less time than running it on the node holding its input data, the time saved is called the benefit.
The optimal task is the task with the shortest running time on the current node.
Beneficial effects of the present invention: Hadoop is optimized and the running efficiency of short jobs is improved. First, by analyzing how jobs execute in Hadoop, the invention characterizes the problems in short-job processing. Then, based on the multi-round running behavior of tasks under high load, it proposes a short-job optimization mechanism based on resource reuse, which reuses the resources released by completed tasks and so reduces the waste incurred in resource allocation and reclamation. Under high cluster load, Map tasks account for a far larger share of multi-round tasks than Reduce tasks, so the invention optimizes Map tasks only. Experimental results show that the proposed resource-reuse-based short-job optimization method effectively reduces the running time of short jobs and significantly improves cluster resource utilization.
Description of the drawings
Fig. 1 is the timing diagram of a long task run;
Fig. 2 is the short-job computing framework;
Fig. 3 is the timing diagram of the short-task execution flow;
Fig. 4(a) shows how the prediction error of the task performance model for running tasks changes as task progress takes different values;
Fig. 4(b) is the relation between task running time and prediction error;
Fig. 5(a) is the running time of the word-count job;
Fig. 5(b) is the running time of the converted Hive SQL job;
Fig. 5(c) is the job running time with short-job optimization;
Fig. 6 is computing-node CPU utilization;
Fig. 7 is computing-node memory utilization.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Hadoop is a parallel computing platform for processing large-scale datasets, with good scalability, high fault tolerance, and ease of programming. Although its original design target was running large jobs in parallel across many computing nodes, in actual production Hadoop is often used to process small-scale short jobs; because it was not designed with short jobs in mind, short jobs execute rather inefficiently on it. Addressing this challenge, the invention first analyzes Hadoop's job execution process to characterize the problems in short-job processing; then, based on the multi-round running behavior of tasks under high load, it proposes a resource-reuse mechanism for short-job optimization that reuses the resources released by completed tasks, reducing the waste incurred in resource allocation and reclamation. Experimental results show that the proposed method effectively reduces the running time of short jobs and significantly improves cluster resource utilization.
The optimization of short jobs in the present invention is based on Hadoop 2.x. This section first analyzes the Hadoop computing framework and the task execution process, then describes the problems Hadoop has executing short jobs.
Hadoop adopts a master-slave architecture, with one master node and multiple slave nodes, as shown in Fig. 2. The resource manager (ResourceManager) runs on the master node; it is responsible for global resource allocation and monitoring, and for starting and monitoring application managers. The application manager (ApplicationMaster) runs on a slave node; it decomposes the job into Map tasks and Reduce tasks, applies for resources on their behalf, and coordinates their execution with the node managers while monitoring the tasks. The application manager is the control unit of a running job; each job has its own application manager. The node manager (NodeManager) also runs on slave nodes; it monitors the resources (such as memory and CPU) that tasks use, preventing a running task from using more resources than it applied for.
Fig. 1 describes how the application manager applies for task resources and how tasks run. 1. The application manager divides the job into Map tasks and Reduce tasks and applies to the resource manager for resources via heartbeat messages. 2. The resource manager selects a suitable task from the resource-requesting task queue according to its scheduling algorithm and allocates idle resources to the selected task; the application manager obtains the granted resources on the next heartbeat. 3. The application manager assigns the granted resources to suitable tasks, then notifies the node manager on the node holding the resources to start the task. 4. The node manager starts a task process to run the task; while running, the task reports its progress and health status to its application manager at each heartbeat interval. 5. When the task completes, the resources it occupied are released, and the node manager returns the released resources to the resource manager via heartbeat.
As the execution process above shows, tasks with short run times and tasks with long run times both follow the same procedure, but executing short tasks this way raises the following problems:
1) High task startup cost. Applying for resources for a task takes the ApplicationMaster at least one heartbeat cycle (3 seconds by default), and from the ApplicationMaster notifying the NodeManager to run the task until task initialization completes takes another 1-2 seconds. In Taobao's cluster, Map tasks running less than 10 seconds account for more than 50%; in Yahoo!'s ad-hoc query and data analytics clusters, the average completion time of a large number of Map tasks is about 20 seconds, so task startup accounts for more than 20% (5/25) of the total time. Moreover, when a task applies for resources, if the ResourceManager has no available resources, the task must wait in a queue, which increases startup cost further; queueing for available resources is a common occurrence in clusters.
2) Serious resource waste. The resources occupied by a completed task are released, and the NodeManager returns them to the ResourceManager one heartbeat cycle later. The ResourceManager then allocates the idle resources to a task waiting for them, and the ApplicationMaster of that task obtains the resources another heartbeat cycle later. Starting a task process and completing task initialization takes 1-2 seconds, so with a 3-second heartbeat cycle, released resources take 7-8 seconds before being used again. A cluster that frequently executes short tasks therefore wastes a serious amount of resources.
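The 7-8 second reuse latency above can be reproduced with a back-of-envelope calculation; this is a minimal sketch of the arithmetic, with the constant names chosen here for illustration only.

```python
HEARTBEAT = 3           # default heartbeat cycle, in seconds
INIT_MIN, INIT_MAX = 1, 2  # task process start + initialization, in seconds

# NodeManager returns released resources one heartbeat later, the
# ApplicationMaster obtains them another heartbeat later, and then
# the new task process must start and initialize.
reuse_delay_min = HEARTBEAT + HEARTBEAT + INIT_MIN
reuse_delay_max = HEARTBEAT + HEARTBEAT + INIT_MAX
print(reuse_delay_min, reuse_delay_max)  # 7 8
```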
According to the above analysis, the current Hadoop task execution flow is not suitable for executing tasks with short run times. For convenience of description, we introduce a new concept, the short task, for tasks with short run times.
Definition 1. Let the completion time of task i be T_i, and let the task completion time set by the user be T_shortTask. If the completion time of the task satisfies T_i ≤ T_shortTask, task i is called a short task.
A short job is composed of a large number of short tasks, and running a large number of short tasks degrades the execution performance of the short job. Based on the observation that under high load tasks run in multiple rounds, the present invention proposes a short job optimization mechanism based on resource reuse: by reusing the resources released by completed tasks, the waste incurred in resource allocation and release is reduced, and the task that reuses the resources starts running earlier, thereby shortening the job's run time.
1 Resource-reuse short job computation framework
This section describes the resource-reuse short job computation framework and the task execution process of short jobs. Running a task requires a certain amount of resources, including memory, CPU, network, disk storage space, a task process, etc.; these resources can be used by unscheduled tasks. Because tasks of different jobs require different amounts of resources, the present invention only considers resource reuse among Map tasks of the same job.
1.1 Short job computation framework
In the Hadoop framework, the ApplicationMaster is the control center of a job, responsible for applying for resources for tasks and coordinating with NodeManagers to execute and monitor them. The present invention adds two components to the ApplicationMaster, a task performance estimator (TaskPerformanceEstimator) and a sub-scheduler (Sub-scheduler); the framework is shown in Fig. 2.
Based on a task performance model, the TaskPerformanceEstimator predicts the completion time of two kinds of tasks: running tasks and unscheduled tasks. Because the resource reuse mechanism applies only to short tasks, the run time of a task is needed according to the definition of a short task, and this time is unknown before the task completes. The predicted run time of an unscheduled task on a given node is the key basis on which the Sub-scheduler selects tasks, and the prediction directly affects the scheduling order of tasks. Since a task can be divided into multiple sub-stages, the task performance model is built on the time consumed in each sub-stage. During task execution, statistics are collected continuously and the time the task spends in each sub-stage is calculated.
The responsibility of the Sub-scheduler is to judge, according to the TaskPerformanceEstimator's prediction, whether a running task is a short task, and to select tasks from the unscheduled task queue. If the running task is a short task, the Sub-scheduler selects a suitable task from the unscheduled task queue to reuse the resources the short task is about to release. When selecting unscheduled tasks, the Sub-scheduler considers task locality, task run time, resource fairness, and cluster heterogeneity.
1.2 Task execution process of short jobs
Fig. 3 depicts the task execution flow of a short job. 1. The ApplicationMaster applies to the ResourceManager for resources via heartbeat messages. 2. The ResourceManager allocates idle resources to the requesting tasks, and the ApplicationMaster obtains the allocated resources on the next heartbeat. 3. The ApplicationMaster assigns the obtained resources to suitable tasks and then notifies the corresponding NodeManagers to start the tasks. 4. The NodeManager starts a task process to run the task; while running, the task reports its progress, statistics, and health status to its ApplicationMaster at intervals of one heartbeat cycle. 5. On receiving a heartbeat, if the current task is a short task, the Sub-scheduler selects a suitable task from the unscheduled task queue. 6. The ApplicationMaster notifies the task process of the selected task, and the task process continues with the newly received task after the running task finishes. 7. If the task process receives no new task after the running task finishes, it exits and releases the resources it occupies, and the NodeManager notifies the ResourceManager of the released resources via a heartbeat message.
The short job computation framework does not affect the execution of long tasks: long tasks still follow the process of Fig. 1, and the process of Fig. 3 applies only to short tasks.
2 Implementation of resource-reuse short job optimization
According to the design of the short job computation framework, the implementation is split between the ApplicationMaster side and the task process. The ApplicationMaster side relies on the statistics collected by task processes, so this section first introduces the implementation of task process heartbeat messages, then the implementation of the task performance model and the Sub-scheduler on the ApplicationMaster side.
2.1 Task process heartbeat messages
A task and its ApplicationMaster communicate via heartbeat messages, whose content includes information such as the task's progress and health status. The task performance model predicts a task's completion time based on statistics from the task's execution: the amount of input processed, the amount of data output, the task's output ratio, the speed of reading input data, the speed of the map operation, the speed of spilling output data, and so on. Because ordinary heartbeat messages are sent at a high frequency, in order to relieve pressure on the ApplicationMaster, the present invention adds a statistics heartbeat message between the task and the ApplicationMaster, responsible for sending statistics to the ApplicationMaster. A statistics heartbeat message is sent only when the task's progress exceeds a set threshold.
Algorithm 1. sendHeartbeat algorithm.
Input: the minimum task progress for sending a statistics heartbeat message, set by the user;
the progress of the current task;
Output: a heartbeat message.
Step 101: if the task's progress exceeds the set threshold, calculate the current task's statistics (amount of input processed, run time, amount of data output, etc.) and send a statistics heartbeat message, then go to Step 103; otherwise go to Step 102;
Step 102: the task's progress does not exceed the set threshold; send a task health heartbeat message;
Step 103: the task receives the ApplicationMaster's heartbeat message;
Step 104: if the task process has received a new task, go to Step 105; otherwise go to Step 106;
Step 105: read the new task's input data to the current node; after the current task finishes, run the newly received task;
Step 106: the task process has received no new task; after the current task finishes, the task process releases the resources the task occupies.
Algorithm 1 describes how a task process sends heartbeat messages; curTask is the running task and newTask is the newly received task. When the task's progress exceeds the set threshold, statistics such as the amount of input processed, the amount of data output, and the read/write speeds are calculated, and a statistics heartbeat message is sent to the ApplicationMaster. On receiving the ApplicationMaster's feedback heartbeat, the task process checks whether a new task has been received. If so, the new task's input data is read to the current node in advance; this read runs in parallel with the running task, avoiding reading the data again when the new task starts. After the running task completes, the newly received task is executed. If no new task has been received, the occupied resources are released after the running task completes.
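The flow of Algorithm 1 can be sketched as follows. This is a minimal Python illustration, not the Hadoop API: `TaskState`, the message dictionaries, and the return strings are hypothetical stand-ins for the task process's internal state and heartbeat payloads.

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    progress: float          # 0.0 .. 1.0
    records_done: int        # amount of input processed
    bytes_out: int           # amount of data output
    run_time: float          # seconds spent so far
    new_task: object = None  # set when the AM assigns a follow-on task

def send_heartbeat(task: TaskState, threshold: float) -> dict:
    """Steps 101-102: statistics heartbeat past the threshold, else health."""
    if task.progress > threshold:
        return {"type": "stats",
                "records_done": task.records_done,
                "bytes_out": task.bytes_out,
                "run_time": task.run_time}
    return {"type": "health", "progress": task.progress}

def on_heartbeat_reply(task: TaskState, reply: dict) -> str:
    """Steps 104-106: prefetch the new task's input, or release on finish."""
    if reply.get("new_task") is not None:
        task.new_task = reply["new_task"]   # Step 105: run it after curTask
        return "prefetch_input"
    return "release_on_finish"              # Step 106

t = TaskState(progress=0.8, records_done=1000, bytes_out=4096, run_time=12.0)
msg = send_heartbeat(t, threshold=0.7)
print(msg["type"])  # stats
```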
2.2 Task completion time prediction model
If a task's execution can be divided into multiple sub-stages, the task's completion time is closely related to the time consumed in each sub-stage; the present invention therefore builds the task performance model from the sub-stage times. If the sub-stages do not overlap, the task's completion time is the sum of the sub-stage times; if sub-stages overlap, the overlapping part must be removed. Suppose the execution of task i is decomposed into n sub-stages; the amount of data processed in each sub-stage is represented by the vector s = [s_1, s_2, ..., s_n], and the speed at which each sub-stage processes data by the vector r = [r_1, r_2, ..., r_n]. The completion time T_i of task i is:
T_i = Σ_{j=1}^{n} s_j / r_j + α    (1)
Here α is the task startup time; it varies within a fixed range and can be treated as a constant.
The execution of a Map task is decomposed into four sub-stages: reading input data, running the map operation, spilling output data, and merging intermediate result files, denoted read, map, spill, and combine respectively. According to formula (1), the completion time T_i of Map task i is:
T_i = s_read / r_read + s_map / r_map + s_spill / r_spill + s_combine / r_combine + α    (2)
For a short task, the run time of the intermediate-result file merge stage is very small and can be treated as a constant. The amounts of data processed by the read stage and the map stage are the same, namely the task's input data volume s_task. Because the map stage and the spill stage partially overlap, the spill stage's data volume is taken as the amount output after the map operation in the final spill, i.e., the amount written in the last spill. Let s_input^done be the amount of input data processed, s_output^done the amount of data output, s_buffer the buffer size set by the configuration item, and s_spill^last the amount of data written in the last spill; formula (2) then becomes:
T_i = s_task (1/r_read + 1/r_map) + s_spill^last / r_spill + β    (3)
s_spill^last = (s_task × ratio_output) mod s_buffer    (4)
ratio_output = s_output^done / s_input^done    (5)
Here s_task is the task's input data volume, β is a constant, and r_read, r_map, r_spill are obtained from the statistics heartbeat messages sent by the task: r_i = s_i^done / t_i^done, where s_i^done is the amount of data processed in stage i and t_i^done is the time consumed in stage i.
When the ApplicationMaster receives a statistics heartbeat message sent by a task, it can calculate the current task's completion time according to formula (3). If task k runs on node w, then when predicting the completion time of unscheduled task m on node w, the processing speed of each sub-stage of task m is taken as r_i^m = r_i^k, i ∈ {read, map, spill}, and ratio_output^m = ratio_output^k.
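Formulas (3)-(5) can be combined into one prediction routine. The sketch below is illustrative only; the parameter names mirror the symbols above, and the concrete numbers in the usage line are made up for demonstration.

```python
def predict_completion_time(s_task, r_read, r_map, r_spill,
                            s_input_done, s_output_done, s_buffer, beta):
    """Predict a Map task's completion time per formulas (3)-(5).

    The rates r_* and the done counters come from the statistics
    heartbeat of a task already running on the same node; beta
    absorbs the constant terms (startup, combine stage).
    """
    ratio_output = s_output_done / s_input_done          # (5)
    s_spill_last = (s_task * ratio_output) % s_buffer    # (4)
    return s_task * (1 / r_read + 1 / r_map) + s_spill_last / r_spill + beta  # (3)

# Hypothetical numbers: 100 units of input, output ratio 0.5,
# buffer of 30 units, so the last spill writes 50 mod 30 = 20 units.
t = predict_completion_time(s_task=100, r_read=50, r_map=100, r_spill=40,
                            s_input_done=200, s_output_done=100,
                            s_buffer=30, beta=1.0)
print(round(t, 2))  # 4.5
```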
2.3 The Sub-scheduler
After the ApplicationMaster receives a statistics heartbeat message sent by a task, the task's run time is predicted by the task performance model. If the task that sent the statistics heartbeat message is a short task, a suitable task is selected from the unscheduled task queue. This section describes the process of selecting unscheduled tasks; three issues are considered when selecting a task: task locality, cluster heterogeneity, and resource fairness.
Problem 1. Task locality. A task's run time is the sum of the time to read its input data and the time to process it, so reducing the time to read input data shortens the task's run time. In a cluster, the I/O rate of a local disk is higher than the network transfer rate, and the bandwidth within a rack is wider than the bandwidth between racks, so tasks should be assigned to nodes near their input data whenever possible. This strategy not only reduces data copy time but also relieves network load. The Sub-scheduler therefore selects tasks in the order node-local tasks, rack-local tasks, off-rack tasks.
Definition 2. Task migration gain. Let the run time of task i on node a be T_a^i and on node b be T_b^i. The gain of migrating task i from running on node a to running on node b is gain_a^b = (T_a^i − T_b^i) / T_a^i. If gain_a^b is positive, migrating task i from node a to node b shortens its run time; the larger the value, the higher the gain of migrating the task. If gain_a^b is negative, the task's run time is lengthened.
Definition 3. Optimal task on a node. If the execution of task i on node a satisfies (6) or (7), task i is called an optimal task on node a.
T_a^i = T_m^i    (6)
gain_a^m < gain_config    (7)
Here T_m^i = min_{x∈X} T_x^i, X is the set of nodes running the current job, and m is the node on which task i runs in the shortest time. (6) states that the task's run time on node a is the minimum; (7) states that the gain of running the task on the shortest-time node instead is less than the set gain threshold.
Algorithm 2. assignTask algorithm
Input: the set of unassigned tasks;
Output: the task to be scheduled;
Step 201: the ApplicationMaster receives a statistics heartbeat message sent by a task process; go to Step 202;
Step 202: the TaskPerformanceEstimator predicts the completion time of the task that sent the heartbeat; go to Step 203;
Step 203: if the completion time of the task that sent the heartbeat is less than the set completion time, calculate the run times of the unscheduled tasks on all nodes running the current job; go to Step 204;
Step 204: according to locality, add each unscheduled task to the node-local task queue, the rack-local task queue, or the off-rack task queue; go to Step 205;
Step 205: sort the node-local, rack-local, and off-rack task queues by task run time; go to Step 206;
Step 206: in order of locality priority, judge whether an unscheduled task is an optimal task; if it is, remove it from the undispatched list and return it; otherwise judge the next unscheduled task, until all unscheduled tasks have been checked.
Problem 2. Cluster heterogeneity. Cluster heterogeneity means that the hardware configurations of cluster nodes are inconsistent; hardware mainly refers to CPU, memory, and disk. Cluster heterogeneity has an important impact on task execution efficiency: for the same task, the run time on a high-performance node and on a low-performance node can differ greatly. When selecting an unscheduled task, the Sub-scheduler checks whether the selected task is an optimal task for some node. If the selected task is an optimal task, it is executed; otherwise it is skipped and another task is selected for execution.
Problem 3. Resource fairness. Resource fairness means that each job shares cluster resources fairly. Because the cluster is heterogeneous and the computing power of the nodes differs greatly, a job must be prevented from occupying the resources of high-performance or low-performance nodes for a long time. For fair sharing, the user sets a maximum resource reuse time, which is the same for every resource. Let T_reuse^max be the maximum reuse time of a resource and T_avg^x the average task run time on node x; the number of times a resource can be reused is then T_reuse^max / T_avg^x. Since the average task run time differs across nodes, the reuse count differs per node, and resources on high-performance nodes are reused more often.
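The fairness rule above can be sketched in a few lines; the function name and the sample numbers are illustrative assumptions, not values from the patent.

```python
def reuse_count(t_reuse_max, t_avg):
    """How many tasks a resource can serve on a node within the
    user-set maximum reuse time (Problem 3, fairness)."""
    return int(t_reuse_max // t_avg)

# With a 60 s reuse budget, a fast node (avg 10 s per task) reuses the
# same resource more often than a slow node (avg 25 s per task).
print(reuse_count(60, 10), reuse_count(60, 25))  # 6 2
```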
The Sub-scheduler selects unscheduled tasks in two steps. The first step divides the unscheduled tasks by locality into node-local tasks, rack-local tasks, and off-rack tasks, then selects tasks in locality priority order, preferring within each priority the task with the shortest run time; see Algorithm 2. The second step selects an optimal task according to the heterogeneity principle; see Algorithm 3.
In Algorithm 2, k is the node running task i, T_reuse^i is the reuse time of the resource used by task i, Task_node is the node-local task set, Task_rack the rack-local task set, Task_offRack the off-rack task set, and T_reuse^max the maximum resource reuse time set by the user. Algorithm 2 describes how the ApplicationMaster selects an unscheduled task on receiving a heartbeat message. It first judges whether the task that sent the heartbeat is a short task and whether the resource's reuse time exceeds the maximum. It then divides the unscheduled tasks into three groups, node-local, rack-local, and off-rack, and finally selects a task according to locality priority.
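Algorithm 2's selection loop can be sketched as below. This is a hedged illustration: `locality_of`, `run_time_of`, and `is_optimal` are hypothetical callbacks standing in for the locality lookup, the performance model's prediction, and Algorithm 3 respectively.

```python
def assign_task(unscheduled, locality_of, run_time_of, is_optimal):
    """Sketch of Algorithm 2: partition by locality, sort each queue by
    predicted run time, and return the first optimal task.

    locality_of(t) yields 'node', 'rack' or 'offRack';
    is_optimal(t) wraps the check of Algorithm 3.
    """
    queues = {"node": [], "rack": [], "offRack": []}
    for t in unscheduled:                        # Step 204
        queues[locality_of(t)].append(t)
    for q in queues.values():                    # Step 205
        q.sort(key=run_time_of)
    for level in ("node", "rack", "offRack"):    # Step 206
        for t in queues[level]:
            if is_optimal(t):
                return t
    return None

# Hypothetical data: two node-local tasks, one rack-local task.
locality = {"t1": "rack", "t2": "node", "t3": "node"}
times = {"t1": 5, "t2": 9, "t3": 4}
picked = assign_task(["t1", "t2", "t3"], locality.get, times.get,
                     lambda t: t != "t3")
print(picked)  # t2: t3 is faster but not optimal, so it is skipped
```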
Algorithm 3. selectOptimalTask algorithm.
Input: the set of nodes running the current job;
the minimum gain set by the user;
the selected task;
the node running the current task;
Output: whether the selected task is an optimal task.
Step 301: calculate the run time of the input task on all nodes running the current job; go to Step 302;
Step 302: obtain the node on which the input task runs in the shortest time; go to Step 303;
Step 303: judge whether the obtained node is the same as the input node; if so, return that the task is an optimal task; otherwise go to Step 304;
Step 304: calculate the gain of migrating the task from the input node to the shortest-time node; if the gain is less than the set threshold, return that the input task is an optimal task; otherwise return that the input task is not an optimal task.
Algorithm 3 describes judging whether a selected task is an optimal task for the current node. First, the selected task's run time on the other nodes running the current job is calculated, and it is judged whether its run time on the current node is the minimum. If the selected task's run time is minimal on some other node m, the gain of executing the task on node m is calculated. If the gain exceeds the set threshold, the selected task is skipped; otherwise the selected task is executed on the current node.
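Algorithm 3 and Definitions 2-3 can be sketched together as follows. This is an illustrative Python version, not the patent's implementation; the relative form of the gain in Definition 2 is an assumption of this sketch, and the sample run times are invented.

```python
def select_optimal_task(times, input_node, gain_config):
    """Sketch of Algorithm 3.

    times maps node -> predicted run time of the candidate task on
    that node; returns True iff the task is an optimal task for
    input_node, i.e. condition (6) or (7) holds.
    """
    m = min(times, key=times.get)            # Step 302: shortest-time node
    if m == input_node:                      # Step 303 / condition (6)
        return True
    t_in, t_m = times[input_node], times[m]
    gain = (t_in - t_m) / t_in               # Definition 2 (relative form assumed)
    return gain < gain_config                # Step 304 / condition (7)

# Hypothetical run times: n2 is only marginally faster than n1,
# so with a 10% threshold the task is still optimal on n1.
print(select_optimal_task({"n1": 10.0, "n2": 9.5}, "n1", 0.1))  # True
```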
3 Experiments and analysis
The present invention implements the short job optimization on Hadoop 2.2.0; the version before optimization (Apache Hadoop) is called AH, and the optimized version is called SJH. We assess the optimization effect on short jobs by comparing SJH with AH.
The experimental cluster consists of 1 master node and 8 computing nodes; the main node configurations are shown in Table 1. The master node and four computing nodes use configuration one, and the other four computing nodes use configuration two. The nodes are distributed across two racks and connected by Gigabit Ethernet. The HDFS block size is 64 MB and the block replication factor is set to 3. The cluster uses the fair scheduler provided by Hadoop; running a Map task requires 1 GB of memory and 1 CPU, while a Reduce task and the ApplicationMaster each require 1.5 GB of memory and 1 CPU.
Table 1 Cluster node configurations
The experiments use two test data sets: data set one is generated by the randomtextwriter provided by Hadoop, and data set two is user electricity consumption data collected by a power consumption acquisition terminal system. Randomtextwriter generates a data set of a specified size whose content consists of random words. The experiments use word count (wordcount), Hive single-table conditional queries, and Terasort as benchmarks. Hive builds a data warehouse on HDFS and converts its SQL-like queries into Hadoop jobs; a Hive single-table conditional query converts into a Hadoop job with only Map tasks. Word count counts the frequency of each word in the input data set, and Terasort sorts the input data set in lexicographic order.
3.1 Accuracy of the task performance model
The present invention first tests the accuracy with which the task performance model predicts running tasks and not-yet-run tasks. With the word count program as the benchmark and data set one at a size of 2.5 GB, the cluster uses 40 Map tasks to process the data set. The experiments use relative error to describe accuracy: e = |T_i^predict − T_i| / T_i, where e is the error, T_i^predict is the predicted completion time of the task, and T_i is the actual completion time. At task progress p_min of 40%, 60%, 70%, and 80% respectively, the error of predicting the completion time of running tasks is measured.
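The relative-error metric above is a one-liner; this sketch simply restates it, with the example numbers made up for demonstration.

```python
def relative_error(t_pred, t_actual):
    """Relative prediction error e = |T_pred - T_actual| / T_actual."""
    return abs(t_pred - t_actual) / t_actual

# e.g. predicting 22 s for a task that actually took 20 s gives e = 10%
print(round(relative_error(22.0, 20.0), 2))  # 0.1
```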
Fig. 4(a) shows how the error of the task performance model's prediction for running tasks changes as p_min takes different values. From Fig. 4(a), the error decreases as progress increases and then stabilizes. When the task's progress exceeds 60%, the error is within 20%; when it exceeds 70%, the error is about 10%. Fig. 4(b) shows the relation between task run time and error. The error increases with the task's run time, exceeding 20% when the completion time reaches 80 seconds, and the error of predicting not-yet-run tasks is higher than that of predicting running tasks. The task performance model uses the short-term read rate, map processing speed, and spill write speed to calculate completion time, without considering how these speeds change over time; its prediction error therefore grows with task duration. From Fig. 4(b), when the task completion time is below 30 seconds, the prediction error is less than 15%; since the average completion time of short tasks is 20 seconds, the prediction model meets the practical need.
3.2 Job completion time
We test the optimization effect when short jobs and long jobs run together; the test data are shown in Table 2. The experiment uses a shell script to submit benchmarks in a polling fashion, one job per second, with all jobs submitted within 30 seconds. The maximum number of containers the cluster provides is 96, so the Map tasks of each job need multiple rounds to be scheduled.
Fig. 5(a) and Fig. 5(b) show the impact of the short job optimization on job completion time. The run time of word count jobs is shortened by 6%-22%, and that of jobs converted from Hive SQL by 6%-27%. Both kinds of jobs are short jobs, and since a Hive single-table conditional query has only Map tasks, its optimization effect is better than that of the word count jobs. Fig. 5(c) shows that the short job optimization has no significant impact on long jobs; the longest job extension is 5%. The short task optimization lets short jobs preempt resources, which causes part of the long jobs' completion time to be extended.
Table 2 Experimental job data
3.3 Resource utilization
This section focuses on the CPU and memory utilization of the computing nodes; the benchmarks and data sets are as shown in Table 2. Resource utilization is collected with Ganglia, and the final result is the mean of each resource over the computation. Fig. 6 shows the CPU utilization of the computing nodes; after optimization, CPU utilization improves by 4%-13% on average. Among the benchmarks, word count and the jobs converted from Hive SQL are computation-intensive. The improvement in CPU utilization on nodes N1, N2, N6, and N7 is larger than on N3, N4, N5, and N8, because the former are high-performance nodes and the latter low-performance nodes. Fig. 7 shows the change in computing node memory; after optimization, memory utilization improves by 2.6%-6.18%. The improvement in memory utilization is small because, among the benchmarks, only Terasort outputs a relatively large amount of data. The experimental results show that the short job optimization method proposed by the present invention is clearly effective in improving the utilization of cluster resources.
4 Summary
Because Hadoop does not take short jobs into account, short jobs execute rather inefficiently in Hadoop. For this challenge, the present invention first analyzes the execution process of jobs in Hadoop and describes the problems in processing short jobs. Then, based on the observation that under high load tasks run in multiple rounds, a short job optimization mechanism based on resource reuse is proposed: by reusing the resources released by completed tasks, the waste incurred in resource allocation and release is reduced. By optimizing the Map task execution process, the present invention reduces the run time of short jobs. Future work will improve the execution efficiency of short jobs from the perspectives of the Reduce task execution process and task scheduling.
Although the specific embodiments of the present invention are described above in conjunction with the accompanying drawings, they do not limit the scope of the invention. Those of ordinary skill in the art should understand that, on the basis of the technical scheme of the present invention, various modifications or variations that can be made without creative work still fall within the protection scope of the present invention.

Claims (10)

1. A MapReduce short job optimization system based on resource reuse, characterized by comprising: a master node, a first-level slave node, and several second-level slave nodes, wherein the master node is connected with the first-level slave node, and the first-level slave node is connected with the several second-level slave nodes; a resource manager and a first-level task scheduler are deployed on the master node; an application manager, a task performance estimator, and a sub-scheduler are deployed on the first-level slave node, wherein the sub-scheduler is connected with the task performance estimator, and the sub-scheduler is also connected with the master node; a node manager is deployed on each second-level slave node;
the task performance estimator predicts the completion time of running tasks and of unscheduled tasks;
the sub-scheduler judges, according to the task performance estimator's prediction, whether the running task is a short task, and selects tasks from the unscheduled task queue.
2. The MapReduce short job optimization system based on resource reuse as claimed in claim 1, characterized in that,
if the running task is a short task, the sub-scheduler selects a new short task from the unscheduled task queue, and the new short task reuses the resources the currently executing short task is about to release; if the running task is not a short task, the occupied resources are released directly after the running task completes.
3. The MapReduce short job optimization system based on resource reuse as claimed in claim 1, characterized in that,
when selecting unscheduled tasks, the sub-scheduler needs to consider task locality, task run time, resource fairness, and cluster heterogeneity.
4. The MapReduce short job optimization system based on resource reuse as claimed in claim 1, characterized in that,
the resource manager is responsible for global resource allocation and monitoring, and for starting and monitoring application managers;
the application manager decomposes a job into Map tasks and Reduce tasks, applies for resources for the Map tasks and Reduce tasks, coordinates their execution with node managers, and also monitors tasks; the application manager is the control unit of a running job, with one application manager per job;
the node manager is responsible for monitoring the amount of resources used by tasks, preventing a running task from exceeding the amount of resources it applied for.
5. A resource-reuse-based MapReduce short job optimization method, characterized by comprising the following steps:
Step (1): an application manager applies to the resource manager for resources via heartbeat messages;
Step (2): the resource manager allocates idle resources to the requesting application manager, which obtains the allocated resources;
Step (3): the application manager assigns the obtained resources to unscheduled tasks, then notifies the corresponding node manager to start task processes;
Step (4): the node manager starts a task process to run a task; while running, the task sends a heartbeat message to its application manager at each configured heartbeat interval;
Step (5): on receiving a heartbeat message, the application manager predicts the task's running time using the task completion time prediction model in the task performance evaluator; if the predicted running time is less than or equal to the user-set task completion time, the current task is a short task, otherwise it is a long task; if the current task is a short task, the second task scheduler selects a new task from the unscheduled task queue; if it is a long task, the heartbeat message is ignored;
Step (6): the application manager notifies the task process of the selected new task; after finishing the currently running task, the task process continues with the new task;
Step (7): if the task process receives no new task after finishing the currently running task, it exits and releases the resources it occupies; the node manager reports the released resources to the resource manager via a heartbeat message.
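The reuse decision in steps (5)–(7) can be sketched as follows. `predicted_runtime` stands in for the output of the patent's completion-time prediction model, and the names are illustrative assumptions, not the patent's code:

```python
def on_heartbeat(predicted_runtime, user_limit, unscheduled):
    """Decide what a task process does next after its current task.

    Returns the next task the process should reuse its resources for,
    or None (the process will exit and release its resources).
    """
    if predicted_runtime <= user_limit:
        # Short task: hand the same process a new task instead of
        # tearing down and re-requesting a container.
        return unscheduled.pop(0) if unscheduled else None
    # Long task: the heartbeat is ignored; no reuse is scheduled.
    return None
```

This captures the core trade: short tasks amortize process startup and resource-allocation overhead by chaining tasks through one process, while long tasks gain nothing from reuse and are left alone.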
6. The method of claim 5, characterized in that
in step (4), the task sends a heartbeat message to its application manager at each configured heartbeat interval as follows:
Step (401): judge whether the task progress exceeds a set threshold; if so, compute the statistics of the current task, send a statistics heartbeat message, and go to step (403); otherwise go to step (402); the statistics of the current task include the amount of data processed, the running time, and the amount of output data;
Step (402): the task progress does not exceed the set threshold, so send a task-health heartbeat message; go to step (404);
Step (403): the task receives the application manager's heartbeat response; go to step (404);
Step (404): judge whether the task process has received a new task; if so, go to step (405); otherwise go to step (406);
Step (405): read the input data of the new task onto the current node; after the current task finishes, run the newly received task;
Step (406): if the task process has received no new task, release the resources occupied by the task process after the current task finishes.
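The heartbeat construction of steps (401)–(402) amounts to a simple branch on progress. This is a minimal sketch with assumed field names; the actual message format is not specified in the claims:

```python
def build_heartbeat(progress, threshold, stats):
    """Build the task-side heartbeat of steps (401)-(402).

    Past the progress threshold, a statistics heartbeat carries the
    processed amount, elapsed running time, and output size; before it,
    only a lightweight health heartbeat is sent.
    """
    if progress > threshold:
        return {"type": "statistics", **stats}
    return {"type": "health"}
```

The threshold keeps early heartbeats cheap: statistics only become meaningful (and worth predicting from) once enough of the task has run.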
7. The method of claim 5, characterized in that the step in which the second task scheduler of step (5) selects a new task from the unscheduled task queue comprises:
Step (51):
dividing the unscheduled tasks by locality into node-local tasks, rack-local tasks, and off-rack tasks;
then selecting tasks in the order node-local, rack-local, off-rack, preferring within each priority level the task with the shortest running time;
Step (52): selecting the optimal task according to the heterogeneity principle.
8. The method of claim 7, characterized in that
step (51) comprises:
Step (511): the application manager receives the statistics heartbeat message sent by a task process; go to step (512);
Step (512): the task performance evaluator predicts the completion time of the task that sent the heartbeat; go to step (513);
Step (513): if the completion time of the task that sent the heartbeat is less than the set deadline, compute the running time of each unscheduled task on all nodes running the current job; go to step (514);
Step (514): add each unscheduled task, according to its locality, to the node-local task queue, the rack-local task queue, or the off-rack task queue; go to step (515); locality refers to whether the input data of a task reside on the processing node, or within the same rack;
Step (515): sort the node-local, rack-local, and off-rack task queues by task running time; go to step (516);
Step (516): according to the locality priority of tasks, judge whether an unscheduled task is the optimal task; if so, delete it from the currently selected unscheduled task list and return it; otherwise judge the next unscheduled task, until all unscheduled tasks have been checked.
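Steps (514)–(516) can be sketched as bucketing plus a priority scan. Tasks here are `(name, locality, predicted_time)` tuples with made-up names; the bucket labels are illustrative:

```python
def build_queues(tasks):
    """Steps (514)-(515): bucket unscheduled tasks by locality and sort
    each bucket by predicted running time (shortest first)."""
    queues = {"node": [], "rack": [], "off-rack": []}
    for name, locality, runtime in tasks:
        queues[locality].append((runtime, name))
    for q in queues.values():
        q.sort()
    return queues


def pick_next(queues):
    """Step (516): scan locality levels in priority order and return the
    shortest task at the highest non-empty level, removing it."""
    for level in ("node", "rack", "off-rack"):
        if queues[level]:
            return queues[level].pop(0)[1]
    return None
```

Sorting within each bucket means a freed-up short-task slot is always refilled with the quickest local work available, which is the point of reusing the process at all.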
9. The method of claim 7, characterized in that
step (52) comprises:
Step (521): compute the running time of the candidate task on all nodes running the current job; go to step (522);
Step (522): obtain the node on which the candidate task has the shortest running time; go to step (523);
Step (523): judge whether the obtained node is the same as the node holding the task's input; if so, return the task as the optimal task; otherwise go to step (524);
Step (524): compute the gain of moving the task from its input node to the node with the shortest running time; if the gain exceeds a set threshold, return the candidate task as the optimal task; otherwise return that the candidate task is not the optimal task.
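The heterogeneity check of steps (521)–(524) weighs faster hardware against the cost of moving input data. The following sketch uses an assumed gain formula (time saved minus transfer cost); the claims do not define the gain function, so this is one plausible reading:

```python
def choose_node(runtimes, input_node, transfer_cost, threshold):
    """Steps (521)-(524): pick the node a candidate task should run on.

    runtimes: {node: predicted running time of the task on that node}.
    Moves the task off its input node only when the net gain exceeds
    the set threshold.
    """
    best = min(runtimes, key=runtimes.get)          # step (522)
    if best == input_node:                          # step (523)
        return best
    gain = runtimes[input_node] - runtimes[best] - transfer_cost
    return best if gain > threshold else input_node  # step (524)
```

On a heterogeneous cluster this lets a markedly faster node "steal" a task despite the data movement, while near-ties stay data-local.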
10. The method of claim 7, characterized in that:
if the input data of a task reside on the node processing the task, the task is called a node-local task;
if the input data of a task and the node processing the task are in the same rack, the task is called a rack-local task;
if the input data of a task and the node processing the task are not in the same rack, the task is called an off-rack task.
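The three locality classes of claim 10 reduce to a two-step comparison, assuming each node's rack is known (the node-to-rack mapping below is illustrative):

```python
def classify(input_node, compute_node, rack_of):
    """Classify a task's locality per claim 10.

    rack_of: mapping from node id to rack id.
    """
    if input_node == compute_node:
        return "node-local"
    if rack_of[input_node] == rack_of[compute_node]:
        return "rack-local"
    return "off-rack"
```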
CN201610124760.2A 2016-03-04 2016-03-04 MapReduce short job optimization system and method based on resource reuse Active CN105808334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610124760.2A CN105808334B (en) 2016-03-04 2016-03-04 MapReduce short job optimization system and method based on resource reuse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610124760.2A CN105808334B (en) 2016-03-04 2016-03-04 MapReduce short job optimization system and method based on resource reuse

Publications (2)

Publication Number Publication Date
CN105808334A true CN105808334A (en) 2016-07-27
CN105808334B CN105808334B (en) 2016-12-28

Family

ID=56466747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610124760.2A Active CN105808334B (en) 2016-03-04 2016-03-04 MapReduce short job optimization system and method based on resource reuse

Country Status (1)

Country Link
CN (1) CN105808334B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106506255A (en) * 2016-09-21 2017-03-15 微梦创科网络科技(中国)有限公司 Pressure test method, apparatus and system
CN106844027A (en) * 2017-01-13 2017-06-13 广西电网有限责任公司电力科学研究院 Task scheduling method based on node load
CN106991070A (en) * 2016-10-11 2017-07-28 阿里巴巴集团控股有限公司 Real-time computing method and device
CN107391250A (en) * 2017-08-11 2017-11-24 成都优易数据有限公司 Controller for improving MapReduce task Shuffle performance
CN107589985A (en) * 2017-07-19 2018-01-16 山东大学 Two-stage job scheduling method and system for big data platforms
CN108733462A (en) * 2017-04-18 2018-11-02 北京京东尚科信息技术有限公司 Method and apparatus for delaying tasks
CN108874549A (en) * 2018-07-19 2018-11-23 北京百度网讯科技有限公司 Resource multiplexing method, device, terminal and computer readable storage medium
CN109117285A (en) * 2018-07-27 2019-01-01 高新兴科技集团股份有限公司 Distributed memory computing cluster system supporting high concurrency
CN109416647A (en) * 2016-12-07 2019-03-01 塔塔咨询服务有限公司 System and method for scheduling tasks and managing computing resource allocation for closed-loop control systems
CN109697117A (en) * 2017-10-20 2019-04-30 中国电信股份有限公司 Terminal control method, device and computer readable storage medium
CN110262896A (en) * 2019-05-31 2019-09-20 天津大学 Data processing acceleration method for the Spark system
CN110737521A (en) * 2019-10-14 2020-01-31 中国人民解放军32039部队 Disaster recovery method and device based on a task scheduling center
CN111274067A (en) * 2018-12-04 2020-06-12 北京京东尚科信息技术有限公司 Method and device for executing computing tasks
CN111475297A (en) * 2018-06-27 2020-07-31 国家超级计算天津中心 Flexible job configuration method
CN113391906A (en) * 2021-06-25 2021-09-14 北京字节跳动网络技术有限公司 Job updating method and device, computer equipment and resource management system
CN113448719A (en) * 2020-03-27 2021-09-28 北京沃东天骏信息技术有限公司 Distributed task processing system
CN114745606A (en) * 2022-02-23 2022-07-12 江苏苏云信息科技有限公司 Flexible industrial data acquisition system and method based on rule scheduling
CN115904673A (en) * 2023-03-09 2023-04-04 华南师范大学 Cloud computing resource concurrent scheduling method, device, system, equipment and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945185A (en) * 2012-10-24 2013-02-27 深信服网络科技(深圳)有限公司 Task scheduling method and device
CN103685492A (en) * 2013-12-03 2014-03-26 北京智谷睿拓技术服务有限公司 Scheduling method, scheduling device and application for a Hadoop cluster system
CN104657204A (en) * 2013-11-22 2015-05-27 华为技术有限公司 Short task processing method, device and operating system
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105005503A (en) * 2015-07-26 2015-10-28 孙凌宇 Cellular-automaton-based cloud computing load balancing task scheduling method
CN105117286A (en) * 2015-09-22 2015-12-02 北京大学 Task scheduling and pipelined execution method in MapReduce
CN105138405A (en) * 2015-08-06 2015-12-09 湖南大学 MapReduce task speculative execution method and apparatus based on a to-be-released resource list
CN105224612A (en) * 2015-09-14 2016-01-06 成都信息工程大学 MapReduce data localization method based on dynamically labeled preferred values
CN105302647A (en) * 2015-11-06 2016-02-03 南京信息工程大学 Optimization of the speculative execution strategy for backup tasks in MapReduce

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945185A (en) * 2012-10-24 2013-02-27 深信服网络科技(深圳)有限公司 Task scheduling method and device
CN104657204A (en) * 2013-11-22 2015-05-27 华为技术有限公司 Short task processing method, device and operating system
CN103685492A (en) * 2013-12-03 2014-03-26 北京智谷睿拓技术服务有限公司 Scheduling method, scheduling device and application for a Hadoop cluster system
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105005503A (en) * 2015-07-26 2015-10-28 孙凌宇 Cellular-automaton-based cloud computing load balancing task scheduling method
CN105138405A (en) * 2015-08-06 2015-12-09 湖南大学 MapReduce task speculative execution method and apparatus based on a to-be-released resource list
CN105224612A (en) * 2015-09-14 2016-01-06 成都信息工程大学 MapReduce data localization method based on dynamically labeled preferred values
CN105117286A (en) * 2015-09-22 2015-12-02 北京大学 Task scheduling and pipelined execution method in MapReduce
CN105302647A (en) * 2015-11-06 2016-02-03 南京信息工程大学 Optimization of the speculative execution strategy for backup tasks in MapReduce

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GU RONG et al.: "Performance Optimization of Hadoop MapReduce Short Jobs", Journal of Computer Research and Development (《计算机研究与发展》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106506255B (en) * 2016-09-21 2019-11-05 微梦创科网络科技(中国)有限公司 Pressure test method, apparatus and system
CN106506255A (en) * 2016-09-21 2017-03-15 微梦创科网络科技(中国)有限公司 Pressure test method, apparatus and system
CN106991070B (en) * 2016-10-11 2021-02-26 创新先进技术有限公司 Real-time computing method and device
CN106991070A (en) * 2016-10-11 2017-07-28 阿里巴巴集团控股有限公司 Real-time computing method and device
CN109416647A (en) * 2016-12-07 2019-03-01 塔塔咨询服务有限公司 System and method for scheduling tasks and managing computing resource allocation for closed-loop control systems
CN106844027A (en) * 2017-01-13 2017-06-13 广西电网有限责任公司电力科学研究院 Task scheduling method based on node load
CN108733462A (en) * 2017-04-18 2018-11-02 北京京东尚科信息技术有限公司 Method and apparatus for delaying tasks
CN107589985A (en) * 2017-07-19 2018-01-16 山东大学 Two-stage job scheduling method and system for big data platforms
CN107391250A (en) * 2017-08-11 2017-11-24 成都优易数据有限公司 Controller for improving MapReduce task Shuffle performance
CN109697117B (en) * 2017-10-20 2021-03-09 中国电信股份有限公司 Terminal control method, terminal control device and computer-readable storage medium
CN109697117A (en) * 2017-10-20 2019-04-30 中国电信股份有限公司 Terminal control method, device and computer readable storage medium
CN111475297A (en) * 2018-06-27 2020-07-31 国家超级计算天津中心 Flexible job configuration method
CN111475297B (en) * 2018-06-27 2023-04-07 国家超级计算天津中心 Flexible job configuration method
CN108874549A (en) * 2018-07-19 2018-11-23 北京百度网讯科技有限公司 Resource multiplexing method, device, terminal and computer readable storage medium
CN109117285B (en) * 2018-07-27 2021-12-28 高新兴科技集团股份有限公司 Distributed memory computing cluster system supporting high concurrency
CN109117285A (en) * 2018-07-27 2019-01-01 高新兴科技集团股份有限公司 Distributed memory computing cluster system supporting high concurrency
CN111274067A (en) * 2018-12-04 2020-06-12 北京京东尚科信息技术有限公司 Method and device for executing computing tasks
CN110262896A (en) * 2019-05-31 2019-09-20 天津大学 Data processing acceleration method for the Spark system
CN110737521A (en) * 2019-10-14 2020-01-31 中国人民解放军32039部队 Disaster recovery method and device based on a task scheduling center
CN110737521B (en) * 2019-10-14 2021-03-05 中国人民解放军32039部队 Disaster recovery method and device based on a task scheduling center
CN113448719A (en) * 2020-03-27 2021-09-28 北京沃东天骏信息技术有限公司 Distributed task processing system
CN113391906A (en) * 2021-06-25 2021-09-14 北京字节跳动网络技术有限公司 Job updating method and device, computer equipment and resource management system
CN113391906B (en) * 2021-06-25 2024-03-01 北京字节跳动网络技术有限公司 Job updating method, job updating device, computer equipment and resource management system
CN114745606A (en) * 2022-02-23 2022-07-12 江苏苏云信息科技有限公司 Flexible industrial data acquisition system and method based on rule scheduling
CN115904673A (en) * 2023-03-09 2023-04-04 华南师范大学 Cloud computing resource concurrent scheduling method, device, system, equipment and medium

Also Published As

Publication number Publication date
CN105808334B (en) 2016-12-28

Similar Documents

Publication Publication Date Title
CN105808334A (en) MapReduce short job optimization system and method based on resource reuse
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
US9442760B2 (en) Job scheduling using expected server performance information
CN100533387C (en) System and method for executing job step
CN105718479A (en) Execution strategy generation method and device under cross-IDC (Internet Data Center) big data processing architecture
CN115373835A (en) Task resource adjusting method and device for Flink cluster and electronic equipment
CN113946431B (en) Resource scheduling method, system, medium and computing device
Wang et al. An efficient and non-intrusive GPU scheduling framework for deep learning training systems
CN114217966A (en) Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
Shi et al. MapReduce short jobs optimization based on resource reuse
CN114880130A (en) Method, system, device and storage medium for breaking memory limitation in parallel training
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
Fernández-Cerero et al. Bullfighting extreme scenarios in efficient hyper-scale cluster computing
CN111459648B (en) Heterogeneous multi-core platform resource optimization method and device for application program
CN110928659B (en) Numerical value pool system remote multi-platform access method with self-adaptive function
CN112559174A (en) Block chain parallel transaction processing method and device
Anselmi et al. Stability and optimization of speculative queueing networks
CN110728372A (en) Cluster design method and cluster architecture for dynamic loading of artificial intelligence model
CN114237858A (en) Task scheduling method and system based on multi-cluster network
CN112698931B (en) Distributed scheduling system for cloud workflow
CN115220908A (en) Resource scheduling method, device, electronic equipment and storage medium
Shang et al. spotDNN: Provisioning Spot Instances for Predictable Distributed DNN Training in the Cloud
JP2012038275A (en) Transaction calculation simulation system, method, and program
Li et al. Job placement strategy with opportunistic resource sharing for distributed deep learning clusters
CN111158847A (en) Method and system for scheduling resources of open source information acquisition virtual host

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant