CN111736959B - Spark task scheduling method considering data affinity under heterogeneous cluster - Google Patents


Info

Publication number
CN111736959B
CN111736959B (application CN202010683860.5A)
Authority
CN
China
Prior art keywords
stage
task
tasks
time
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010683860.5A
Other languages
Chinese (zh)
Other versions
CN111736959A (en
Inventor
文建璋
陈祥军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Nansoft Technology Co ltd
Original Assignee
Nanjing Nansoft Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nanjing Nansoft Technology Co ltd filed Critical Nanjing Nansoft Technology Co ltd
Priority to CN202010683860.5A priority Critical patent/CN111736959B/en
Publication of CN111736959A publication Critical patent/CN111736959A/en
Application granted granted Critical
Publication of CN111736959B publication Critical patent/CN111736959B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/4557 Distribution of virtual machine instances; Migration and load balancing
    • G06F 2009/45595 Network integration; Enabling network access in virtual machine instances
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Spark task scheduling method that considers data affinity in a heterogeneous cluster, aiming, from the user's perspective, to minimize the maximum completion time of a Spark application while accounting for data affinity. A Spark application submitted by the user is decomposed into a task scheduling sequence; tasks are assigned to suitable virtual machines to obtain an initial solution, which is then further improved by adjusting the task scheduling order to obtain the final scheduling result. By dynamically allocating appropriate resources, the method reduces the maximum completion time of the Spark application.

Description

Spark task scheduling method considering data affinity under heterogeneous cluster
Technical Field
The invention relates to a Spark task scheduling method considering data affinity under a heterogeneous cluster, and belongs to the technical field of cloud computing resource scheduling.
Background
In recent years, with the rapid development of social networks, the Internet of Things, and related technologies, large-scale data analysis is needed in many fields such as banking, medical care, business forecasting, and scientific exploration, making big data processing critical. The Spark framework has been widely adopted for big data processing.
Spark provides two default task scheduling strategies: FIFO and Fair sharing. In FIFO mode, Spark follows the principle of moving computation to the data rather than moving the data, assigning the tasks in a Stage as far as possible to the nodes that store the tasks' input data. As a result, some nodes in the cluster may run overloaded while others sit idle, which seriously wastes cluster computing resources and increases the total completion time of the application. Moreover, with the continuing development of computer hardware, data-center machines are replaced frequently, so data-center server clusters are no longer homogeneous; yet Spark's default scheduling policies were designed for task scheduling on homogeneous clusters. It is therefore necessary to study Spark task scheduling on heterogeneous clusters.
A Spark application consists of Jobs with partial-order constraints, which form a DAG. Each Job can be divided into multiple Stages, and partial-order constraints also exist among the Stages within a Job, which likewise form a DAG. In addition, each Stage contains multiple independent tasks, and the data a task requires may come from original (raw) data or from intermediate data produced by tasks in its predecessor Stages. The coupling between a task and the data it requires, both original and intermediate, is called data affinity. Considering data affinity means placing tasks as close as possible to their data, reducing the network transmission cost of the data.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and shortcomings of the prior art, the invention provides a Spark task scheduling method that considers data affinity in a heterogeneous cluster. For the Spark task scheduling problem on clusters composed of heterogeneous data-center servers, it improves the existing Spark task scheduler, builds a Spark workflow scheduling system architecture, and minimizes the maximum completion time of the Spark application while taking into account the characteristics of Spark application workflows and the data affinity of virtual machines.
The technical scheme is as follows: a Spark task scheduling method considering data affinity under a heterogeneous cluster comprises the following steps:
step 1, calculating the time parameters of all Stages in Spark according to the partial-order properties of the Spark workflow, and generating a Stage scheduling queue RSQ according to a Stage ordering model;
step 2, taking Stages out of the Stage scheduling queue RSQ in order, and generating a task scheduling queue TQ for the parallel tasks in each Stage according to a task ordering model;
step 3, taking tasks out of the task scheduling queue TQ in order, maintaining an earliest-available virtual machine List for each task, then computing the data affinity of every virtual machine in that List, and adding the several virtual machines with the highest data affinity to the VMList as required;
step 4, for each task in the task scheduling queue TQ, searching the VMList for a virtual machine according to the virtual machine search strategy, and assigning the task to that virtual machine;
step 5, repeating steps 3 and 4 until the task scheduling queue TQ is empty; an empty TQ indicates that all tasks in the current Stage have been scheduled;
step 6, invoking the RSA algorithm to update the elements in the Stage scheduling queue RSQ and returning to step 2 until the RSQ is empty, which yields an initial scheduling solution;
step 7, improving the initial scheduling solution by adjusting the task scheduling order to obtain the final scheduling result, and ending the method.
The partial-order properties of the Spark workflow in step 1 are considered at two levels: the Spark workflow is composed of Jobs with partial-order relations, and each Job is composed of Stages with partial-order relations; that is, partial order exists at both the Job level and the Stage level. Jobs and Stages are merged so that the Spark workflow is represented as a directed acyclic graph (DAG) over Stages.
Spark task flow: G = {S1, S2, ..., Sn} is a DAG composed of n Stages, and each Stage contains multiple tasks that can be executed in parallel.
In step 1, the Stage ordering model is as follows:
(1) earliest start time priority rule: calculate the earliest start time EST of each Stage, and arrange the Stages in the Stage scheduling queue RSQ in increasing order of EST;
(2) maximum estimated processing time priority rule: calculate the estimated processing time EDT of each Stage, and arrange the Stages in the RSQ in decreasing order of EDT;
(3) minimum float time priority rule: calculate the float FL (the difference between the latest start time and the earliest start time) of each Stage, and arrange the Stages in the RSQ in increasing order of FL, so that the Stage with the smallest float has the highest priority;
(4) random rule: as a baseline for comparison with the rules above, randomly select a Stage in the RSQ as the highest-priority Stage.
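The time parameters and the first three ordering rules can be sketched on a toy Stage DAG. This is an illustrative Python sketch, not the patent's implementation: the DAG, durations, and function names are invented, and the forward/backward passes follow standard critical-path-method conventions since the patent does not give the exact formulas.

```python
# Toy sketch: compute EST and float FL for a Stage DAG, then apply the ordering rules.
def time_params(stages, preds, edt):
    # forward pass: earliest start time (EST); `stages` is assumed topologically ordered
    est = {}
    for s in stages:
        est[s] = max((est[p] + edt[p] for p in preds[s]), default=0)
    makespan = max(est[s] + edt[s] for s in stages)
    # backward pass: latest start time (LST), then float FL = LST - EST
    succs = {s: [t for t in stages if s in preds[t]] for s in stages}
    lst = {}
    for s in reversed(stages):
        lst[s] = min((lst[t] for t in succs[s]), default=makespan) - edt[s]
    fl = {s: lst[s] - est[s] for s in stages}
    return est, fl

stages = ["S1", "S2", "S3", "S4"]
preds = {"S1": [], "S2": ["S1"], "S3": ["S1"], "S4": ["S2", "S3"]}
edt = {"S1": 2, "S2": 1, "S3": 3, "S4": 2}          # estimated processing times
est, fl = time_params(stages, preds, edt)

by_est = sorted(stages, key=lambda s: est[s])        # rule (1): increasing EST
by_edt = sorted(stages, key=lambda s: -edt[s])       # rule (2): decreasing EDT
by_fl  = sorted(stages, key=lambda s: fl[s])         # rule (3): smallest float first
```

With these numbers, S2 is the only Stage with nonzero float, so rule (3) ranks it last.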
The task ordering model in step 2 is specifically as follows:
(1) instruction number priority rule: all tasks in the ith Stage S_i are sorted non-increasingly by instruction count;
(2) transmission time priority rule: all tasks in the ith Stage S_i are sorted non-increasingly by estimated transmission time;
(3) processing time priority rule: all tasks in the ith Stage S_i are sorted non-increasingly by estimated processing time.
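The three task-ordering rules amount to sorting a Stage's tasks by one key in non-increasing order. A minimal sketch follows; the task records and their numeric values are invented for illustration and are not from the patent.

```python
# Three hypothetical parallel tasks of one Stage, with assumed attributes.
tasks = [
    {"id": "t1", "instructions": 800,  "transfer": 3, "proc": 5},
    {"id": "t2", "instructions": 1200, "transfer": 1, "proc": 4},
    {"id": "t3", "instructions": 500,  "transfer": 6, "proc": 7},
]
by_instructions = sorted(tasks, key=lambda t: -t["instructions"])  # rule (1)
by_transfer     = sorted(tasks, key=lambda t: -t["transfer"])      # rule (2)
by_proc         = sorted(tasks, key=lambda t: -t["proc"])          # rule (3)
tq = [t["id"] for t in by_proc]   # task scheduling queue TQ under rule (3)
```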
The data affinity of a virtual machine in step 3 is calculated as follows. For a task t_{j,k} in Stage S_j and a server s_r, the amount of required data that is local to s_r is

D(t_{j,k}, s_r) = Σ_{S_i ∈ pre(S_j)} Σ_{t_{i,l} ∈ S_i} x(t_{i,l}, s_r) · d(t_{i,l}, t_{j,k}) + rd(t_{j,k}, s_r)

where pre(S_j) denotes the set of all direct predecessor Stages of S_j; d(t_{i,l}, t_{j,k}) is the amount of data transferred between task t_{i,l} and task t_{j,k}; x(t_{i,l}, s_r) ∈ {0, 1} indicates whether task t_{i,l} runs on server s_r; rd(t_{j,k}, s_r) is the amount of raw data required by t_{j,k} that is stored on s_r; D(t_{j,k}, s_r) thus represents the amount of data required by t_{j,k} that is stored on server s_r; and TD(t_{j,k}) denotes the total amount of data the task requires. Because data is stored at the server level and data transmission between virtual machines within the same server is not counted, all virtual machines in the same server have the same data affinity. For a virtual machine vm_m located on server s_r, the data affinity is

aff(vm_m, t_{j,k}) = D(t_{j,k}, s_r) / TD(t_{j,k})
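The data-affinity computation can be sketched numerically. This is a hedged illustration of the formula above: the placement, transfer amounts, and raw-data sizes are invented, and the dictionary-based representation is an assumption rather than the patent's data model.

```python
# Data affinity of server s_r for task `task`: locally available data
# (intermediate shuffle data from predecessor tasks placed on s_r, plus raw
# input data stored on s_r) divided by the task's total required data.
def data_affinity(task, server, placement, transfer, raw_data, total_data):
    # intermediate data: sum d(t', task) over predecessor tasks t' placed on `server`
    inter = sum(amount for (t_prev, t), amount in transfer.items()
                if t == task and placement.get(t_prev) == server)
    local = inter + raw_data.get((task, server), 0)
    return local / total_data[task]

placement = {"t_prev1": "s1", "t_prev2": "s2"}           # where predecessor tasks ran
transfer = {("t_prev1", "t"): 40, ("t_prev2", "t"): 10}  # d(t', t): intermediate data amounts
raw_data = {("t", "s1"): 20}                             # raw input stored per server
total_data = {"t": 100}                                  # TD: total data the task needs

aff_s1 = data_affinity("t", "s1", placement, transfer, raw_data, total_data)
aff_s2 = data_affinity("t", "s2", placement, transfer, raw_data, total_data)
```

Every virtual machine hosted on the same server inherits that server's affinity value, matching the statement that intra-server transfers are not counted.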
the virtual machine search strategy in the step 4 specifically includes:
(1) the fastest speed priority strategy is as follows: considering the processing speed of the virtual machine, preferentially distributing the tasks to the virtual machine with high processing speed in the VMList table for execution, and shortening the execution time of the tasks as much as possible;
(2) earliest available time first policy: allocating the task to the virtual machine with the earliest availability for execution by considering the earliest availability time of the virtual machine in the VMList table;
(3) earliest completion time priority strategy: the method comprises the steps that the start time and the task execution time of a task are considered, and the task is distributed to a virtual machine in a VMList table which can guarantee that the completion time of the task is earliest to be executed;
(4) random strategy: and comparing with the virtual machine searching strategy, randomly selecting a virtual machine from the VMList, and distributing the task to the virtual machine for execution.
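The four strategies can be sketched as selection rules over a VMList. The VM attributes (`speed`, `available_at`) and the execution-time model `task_len / speed` are assumptions made for illustration; the patent does not specify these names.

```python
import random

def pick_vm(task_len, vmlist, strategy, rng=None):
    if strategy == "fastest":         # (1) highest processing speed
        return max(vmlist, key=lambda v: v["speed"])
    if strategy == "earliest_avail":  # (2) earliest available time
        return min(vmlist, key=lambda v: v["available_at"])
    if strategy == "earliest_finish": # (3) earliest completion: start + task_len/speed
        return min(vmlist, key=lambda v: v["available_at"] + task_len / v["speed"])
    return (rng or random).choice(vmlist)  # (4) random baseline

vmlist = [
    {"id": "vm1", "speed": 2.0, "available_at": 4.0},
    {"id": "vm2", "speed": 1.0, "available_at": 3.0},
    {"id": "vm3", "speed": 4.0, "available_at": 6.0},
]
fast   = pick_vm(4.0, vmlist, "fastest")["id"]          # vm3: highest speed
avail  = pick_vm(4.0, vmlist, "earliest_avail")["id"]   # vm2: available at t=3
finish = pick_vm(4.0, vmlist, "earliest_finish")["id"]  # vm1: finishes at 4 + 4/2 = 6
```

Note that the three deterministic strategies can pick three different machines for the same task, which is exactly why the patent compares them.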
The RSA (Ready Stage Addition) algorithm in step 6 is specifically:
(1) the input is the ith Stage S_i (all tasks of S_i have already been scheduled) and the sorted ready-Stage scheduling queue RSQ; the output is the RSQ after the newly ready Stages have been added and the queue re-sorted;
(2) for each Stage S_i' in the direct-successor set of S_i, delete S_i from the direct-predecessor set of S_i'; then judge whether the direct-predecessor set of S_i' is empty, and if so, insert S_i' into the Stage scheduling queue RSQ;
(3) re-sort the elements in the RSQ according to the Stage ordering model.
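The RSA step can be sketched in a few lines. This is an illustrative rendering, not the patent's code: the graph structures are plain dicts/sets, and float time stands in for whichever Stage-ordering rule is in force.

```python
# After stage `done` finishes scheduling, release its successors and re-sort the RSQ.
def rsa(done, succs, preds, rsq, sort_key):
    for s in succs[done]:
        preds[s].discard(done)
        if not preds[s]:           # all direct predecessors scheduled -> ready
            rsq.append(s)
    rsq.sort(key=sort_key)         # re-apply the Stage-ordering model
    return rsq

succs = {"S1": ["S2", "S3"], "S2": ["S4"], "S3": ["S4"], "S4": []}
preds = {"S2": {"S1"}, "S3": {"S1"}, "S4": {"S2", "S3"}}
fl = {"S2": 1, "S3": 0, "S4": 0}   # float times from the time-parameter pass
rsq = rsa("S1", succs, preds, [], lambda s: fl[s])
```

After S1 completes, S2 and S3 both become ready, and the minimum-float rule puts S3 ahead of S2, mirroring the FIG. 4-6 walkthrough later in the description.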
The method for adjusting the task scheduling order in step 7 is as follows:
for the tasks of a Stage on the critical path, the completion time of the Stage is determined by its latest-finishing task; search for an idle time gap between two tasks on a virtual machine and move the latest-finishing task into that gap, reducing the Stage's completion time as much as possible, i.e., optimizing the Stage's completion time;
mark a Stage whose completion time has been optimized as true, so that it need not be optimized again even if it still lies on the critical path later;
after a Stage is optimized, the start and completion times of its successor Stages change; the critical path may therefore change and must be recomputed.
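The gap-filling idea in step 7 can be illustrated with a small helper that scans a virtual machine's timeline for an idle window. This is a toy sketch under invented data, not the patent's exact procedure: intervals are (start, end) tuples and only gaps between already-scheduled tasks are considered.

```python
# Find the earliest idle window of length `duration` on a VM, no earlier than
# `earliest` (the task's ready time); return its start, or None if no gap fits.
def find_gap(busy, duration, earliest):
    busy = sorted(busy)
    prev_end = earliest
    for start, end in busy:
        if start - prev_end >= duration:   # idle window large enough
            return prev_end
        prev_end = max(prev_end, end)
    return None

busy = [(0, 2), (5, 9)]        # tasks already scheduled on the candidate VM
start = find_gap(busy, 3, 2)   # a 3-unit task ready at t=2 fits in [2, 5)
```

If the latest-finishing task of a critical-path Stage fits into such a gap, moving it there pulls the Stage's completion time forward without delaying the tasks already in place.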
Drawings
FIG. 1 is a block diagram of a Spark workflow;
FIG. 2 is a block diagram of a method of performing an embodiment of the invention;
FIG. 3 is a flow chart of an implementation of a method of an embodiment of the present invention;
FIG. 4 is an initial state diagram of an example RSA algorithm prior to scheduling;
FIG. 5 is a state diagram of an example RSA algorithm after execution of S1;
FIG. 6 is a state diagram of an example RSA algorithm after execution of S3;
FIG. 7 is a process diagram of the DAGParser merging Jobs and Stages.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 2, the data center in this embodiment includes two kinds of servers: a master server (Master) and ordinary servers (Server). The main function of the master server is task scheduling; the ordinary servers are responsible for executing tasks, and each ordinary server hosts several heterogeneous virtual machines in differing numbers.
In the present embodiment, the Spark task scheduling method considering data affinity under a heterogeneous cluster first merges Jobs and Stages to obtain a Stage-based DAG, estimates the execution speed of each virtual machine using a maximum-value method, then estimates the execution time and data transmission time of each task so as to estimate the processing time of each Stage, and on this basis calculates the Stage time parameters (i.e., the earliest start time, earliest completion time, latest start time, and latest completion time of each Stage). Next, a ready-Stage priority queue, the Stage scheduling queue RSQ, is created to hold the ready Stages. Considering that a Stage's start time, execution time, and completion time all have a crucial influence on Spark task scheduling, four Stage ordering rules (the Stage ordering model) are proposed: the earliest start time priority rule, the maximum estimated processing time priority rule, the minimum float time priority rule, and the random rule. The Stages in the RSQ are ordered by one of these rules, yielding a topological order over Stages. During task scheduling, the highest-priority Stage is taken out of the RSQ each time, and all parallel tasks in that Stage are ordered according to one of three task ordering rules (the task ordering model): the instruction number priority rule, the transmission time priority rule, or the processing time priority rule, producing the task scheduling sequence TQ.
Resource allocation is then performed according to the task scheduling sequence: considering data affinity and load balance, a list of available virtual machines with high data affinity is maintained, and a virtual machine is selected for each task according to one of the four designed virtual machine search strategies: the fastest speed priority strategy, the earliest available time priority strategy, the earliest completion time priority strategy, or the random selection strategy. After all tasks of the current Stage have been scheduled, the RSA algorithm is invoked to add the newly ready Stages to the RSQ, and the time parameters of the remaining unscheduled Stages are recalculated. These steps repeat until the RSQ is empty, yielding an initial scheduling solution. Finally, the initial scheduling solution is improved by adjusting the task scheduling order to obtain the final scheduling solution.
As shown in fig. 3, the specific implementation steps of the Spark task scheduling method in the heterogeneous cluster environment according to the embodiment of the present invention are as follows:
In step s201, the Spark application G consists of h Jobs with partial-order constraints, and each Job J_i contains h_i Stages, which also have partial-order relations among them; S_{i,l} denotes the l-th Stage of J_i. The DAGParser merges the Jobs and Stages, converting the Spark application into a DAG of Stages with partial-order relations; S_{i,l} is mapped to S_j according to

j = h_1 + h_2 + ... + h_{i-1} + l.

As shown in FIG. 7, the DAGParser merges Jobs and Stages as follows: consider a simple example Spark application G containing h Jobs, where J_1 denotes the first Job and contains 3 Stages. A partial-order relation exists between J_1 and J_2, specifically between S_{1,3} and S_{2,1}. The DAGParser simplifies this by discarding the Job-level partial order and expressing it directly at the Stage level: by the mapping above, the original S_{1,3} becomes S_3 and the original S_{2,1} becomes S_4. J_i denotes the ith Job in the application.
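The index mapping can be checked with a two-line sketch. The function name and list encoding are illustrative; the mapping itself is reconstructed from the example (S_{1,3} maps to S_3 and S_{2,1} maps to S_4 when Job 1 has 3 Stages).

```python
# Flatten Stage S_{i,l} to global index j = h_1 + ... + h_{i-1} + l.
def flatten(i, l, h):          # h[k-1] = number of Stages in Job k
    return sum(h[:i - 1]) + l

h = [3, 2]                     # Job 1 has 3 Stages, Job 2 has 2
s3 = flatten(1, 3, h)          # S_{1,3}
s4 = flatten(2, 1, h)          # S_{2,1}
```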
Step s202: calculate the time parameters of all Stages in G, including the earliest start time, earliest completion time, latest start time, and latest completion time.
Step s203: build the Stage scheduling queue RSQ according to the Stage ordering model, adding each S_i to the RSQ in order of priority.
Step s204: judge whether the RSQ is empty; if so, go to step s210, otherwise go to step s205.
Step s205, taking out the Stage with the highest priority in the RSQ, adding all tasks in the taken-out Stage into a queue TQ, sorting the tasks in the TQ according to a task sorting model, and maintaining an earliest available virtual machine List for each task in the TQ;
step s206, one task in the TQ is fetched, a List of virtual machines is obtained for the fetched task, data affinities of all the virtual machines in the List are calculated, and a plurality of virtual machines with higher data affinity are selected from the List and added into the VMList table according to requirements;
step s207, searching a virtual machine from the VMList table for the current task according to a virtual machine search strategy, and distributing the task to the searched virtual machine for execution;
in step s208, it is determined whether the TQ is empty, and if so, step s209 is performed. If not, go to step s 206;
Step s209: invoke the RSA algorithm to add the newly ready Stages, obtain the re-sorted RSQ queue, and go to step s204;
Step s210: the steps above yield an initial scheduling solution; adjust the task scheduling order of the Stages on the critical path according to the Stage ordering model to further minimize the completion time of the Spark application. Here, once the current Stage has completed scheduling, any Stage whose remaining in-degree (number of unscheduled direct predecessors) is 0 is a ready Stage.
In step s211, the maximum completion time of the entire Spark application is obtained.
The RSA algorithm is specifically:
(1) the input is the ith Stage S_i (all tasks of S_i have already been scheduled) and the sorted ready-Stage scheduling queue RSQ; the output is the RSQ after the newly ready Stages have been added and the queue re-sorted;
(2) for each Stage S_i' in the direct-successor set of S_i, delete S_i from the direct-predecessor set of S_i'; then judge whether the direct-predecessor set of S_i' is empty, and if so, insert S_i' into the Stage scheduling queue RSQ;
(3) re-sort the elements in the RSQ according to the Stage ordering model.
As shown in figs. 4-6, each vertex of the directed graph represents a Stage, and the five-tuple above each vertex gives the Stage's currently estimated earliest start time, earliest completion time, latest start time, latest completion time, and float time, where the float time is the latest start time minus the earliest start time; a float time of 0 means the Stage lies on the critical path. Fig. 4 shows the initial state before scheduling: S1, S3, S5, S6 and S7 are on the critical path (their float times are all 0), while the float times of S2 and S4 are both 1. Stage S1 is ready (all of its direct predecessors have completed scheduling) and is added to the ready-Stage priority queue RSQ. S1 is scheduled onto a resource with a sufficiently fast processing rate, so Stage S1 completes earlier than expected (expected to complete in two time units, actually completing in one). After S1 completes, the earliest start and completion times of its direct successors S2 and S3 are updated, and then the earliest start and completion times of all Stages are updated front to back in topological order. S7, as the last Stage, has its latest start time equal to its earliest start time and its latest completion time equal to its earliest completion time; starting from S7, the latest start and completion times of all unscheduled Stages are computed, and finally the float time of each Stage is calculated. The result is shown in fig. 5. S2 and S3 are now ready and are added to the ready-Stage queue RSQ; since the float time of S3 is 0, the minimum float time priority rule gives it the highest priority among all ready Stages.
S3 is then scheduled. At this point no sufficiently fast resource is available, so S3 completes later than expected (expected to complete within 1 time unit, actually taking 2 time units). The earliest start time, earliest completion time, latest start time, latest completion time and float time are updated by the method above; the result is shown in fig. 6.

Claims (5)

1. A Spark task scheduling method considering data affinity under a heterogeneous cluster is characterized by comprising the following steps:
step 1, calculating the time parameters of all Stages in Spark according to the partial-order properties of the Spark workflow, and generating a Stage scheduling queue RSQ according to a Stage ordering model;
step 2, taking Stages out of the Stage scheduling queue RSQ in order, and generating a task scheduling queue TQ for the parallel tasks in each Stage according to a task ordering model;
step 3, taking out the tasks from the task scheduling queue TQ in sequence, establishing an earliest available virtual machine List for each task, then calculating the data affinity of all the virtual machines in the earliest available virtual machine List, and adding a plurality of virtual machines with the highest data affinity into the VMList according to the requirement;
step 4, searching the virtual machines from the VMList table according to the virtual machine search strategy for the tasks in the task scheduling queue TQ, and distributing the tasks to the searched virtual machines;
step 5, repeating the step 3 and the step 4 until the task scheduling queue TQ is empty;
step 6, invoking an RSA algorithm, updating elements in the Stage scheduling queue RSQ, turning to step 2 until the Stage scheduling queue RSQ is empty, and obtaining an initial scheduling solution;
step 7, adjusting a task scheduling sequence to obtain a final scheduling result;
the calculation formula of the data affinity of the virtual machine in step 3 is:

D(t_{j,k}, s_r) = Σ_{S_i ∈ pre(S_j)} Σ_{t_{i,l} ∈ S_i} x(t_{i,l}, s_r) · d(t_{i,l}, t_{j,k}) + rd(t_{j,k}, s_r)

wherein pre(S_j) represents the set of all direct predecessor Stages of S_j; d(t_{i,l}, t_{j,k}) is the amount of data transferred between task t_{i,l} and task t_{j,k}; x(t_{i,l}, s_r) represents whether task t_{i,l} is on server s_r; rd(t_{j,k}, s_r) is the amount of raw data of task t_{j,k} that is stored on s_r; D(t_{j,k}, s_r) represents the amount of data required by task t_{j,k} that is stored on server s_r; and TD(t_{j,k}) represents the total amount of data the task requires; data is stored on the servers, and data transmission between virtual machines within the same server is not considered, so the data affinity of virtual machines in the same server is the same; the data affinity of virtual machine vm_m on server s_r is calculated as:

aff(vm_m, t_{j,k}) = D(t_{j,k}, s_r) / TD(t_{j,k});
in step 1, the Stage ordering model is executed according to the following rules:
earliest start time priority rule: calculate the earliest start time EST of each Stage, and arrange the Stages in the Stage scheduling queue RSQ in increasing order of EST;
maximum estimated processing time priority rule: calculate the estimated processing time EDT of each Stage, and arrange the Stages in the RSQ in decreasing order of EDT;
minimum float time priority rule: calculate the difference FL between the latest start time and the earliest start time of each Stage, and arrange the Stages in the RSQ in increasing order of FL;
random rule: for comparison with the rules above, randomly select a Stage in the RSQ as the highest-priority Stage;
The virtual machine search strategies in step 4 are as follows:
Fastest-speed-first strategy: according to the processing speed of each virtual machine, preferentially assign the task to the virtual machine in the VMList table with the highest processing speed;
Earliest-available-time-first strategy: according to the earliest available time of each virtual machine in the VMList table, assign the task to the virtual machine that becomes available earliest;
Earliest-completion-time-first strategy: according to the task's start time and execution time, assign the task to the virtual machine in the VMList table that yields the earliest task completion time;
Random strategy: as a baseline for comparison with the search strategies above, randomly select a virtual machine from the VMList and assign the task to it for execution;
The RSA algorithm in step 6 is as follows:
The input of the RSA algorithm is the i-th Stage Si and the sorted ready-Stage scheduling queue RSQ; the output of the RSA algorithm is the Stage scheduling queue RSQ, reordered after the newly ready Stages have been added; here all tasks of Si have been scheduled;
For each Stage Si' in the direct-successor set of the i-th Stage Si, delete Si from the direct-predecessor set of Si', then judge whether the direct-predecessor set of Si' is empty; if so, insert Si' into the Stage scheduling queue RSQ;
Finally, reorder the elements in the Stage scheduling queue RSQ according to the Stage sorting model.
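The RSA step above can be sketched as follows (for illustration only): once every task of a Stage has been scheduled, remove it from each successor's predecessor set; any successor left with an empty predecessor set becomes ready and joins RSQ, which is then reordered by the Stage sorting model (represented here by a caller-supplied sort key, an assumption).

```python
def rsa(done, rsq, successors, predecessors, sort_key):
    """successors/predecessors: dicts mapping a stage name to a set of stage names."""
    for nxt in successors.get(done, set()):
        predecessors[nxt].discard(done)
        if not predecessors[nxt]:      # all predecessors finished -> ready
            rsq.append(nxt)
    rsq.sort(key=sort_key)             # reapply the Stage sorting model
    return rsq

succ = {"S1": {"S2", "S3"}}
pred = {"S2": {"S1"}, "S3": {"S1", "S0"}}
est = {"S2": 4.0, "S3": 1.0}
# S2's only predecessor was S1, so it becomes ready; S3 still waits on S0.
print(rsa("S1", [], succ, pred, est.get))  # -> ['S2']
```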
2. The Spark task scheduling method considering data affinity under a heterogeneous cluster according to claim 1, wherein the partial-order property of the Spark workflow in step 1 includes a partial-order relationship at the Job level and a partial-order relationship at the Stage level; the Job-level and Stage-level partial-order relationships are merged so that the Spark workflow is represented as a directed acyclic graph over Stages.
3. The Spark task scheduling method considering data affinity under a heterogeneous cluster according to claim 1, wherein the Spark workflow is a directed acyclic graph composed of n Stages, denoted G = {S1, S2, ..., Sn}, and each Stage contains multiple tasks that can be executed in parallel.
4. The Spark task scheduling method considering data affinity under a heterogeneous cluster according to claim 1, wherein the task ordering model in step 2 is executed according to the following rules:
(1) Instruction-count-first rule: all tasks in the i-th Stage Si are sorted in non-increasing order of task instruction count;
(2) Transmission-time-first rule: all tasks in the i-th Stage Si are sorted in non-increasing order of estimated task data transmission time;
(3) Processing-time-first rule: all tasks in the i-th Stage Si are sorted in non-increasing order of estimated task processing time.
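The three task-ordering rules of claim 4 all sort a Stage's tasks in non-increasing order of one chosen metric. A minimal sketch, with illustrative field names that are assumptions:

```python
def order_tasks(tasks, rule):
    """tasks: list of dicts with 'instructions', 'xfer_time', 'proc_time'."""
    key = {"instructions": lambda t: t["instructions"],   # rule (1)
           "transmission": lambda t: t["xfer_time"],      # rule (2)
           "processing":   lambda t: t["proc_time"]}[rule]  # rule (3)
    return sorted(tasks, key=key, reverse=True)  # non-increasing order

ts = [{"instructions": 10, "xfer_time": 3, "proc_time": 7},
      {"instructions": 25, "xfer_time": 1, "proc_time": 2}]
print([t["instructions"] for t in order_tasks(ts, "instructions")])  # -> [25, 10]
```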
5. The Spark task scheduling method considering data affinity under a heterogeneous cluster according to claim 1, wherein the method for adjusting the task scheduling sequence in step 7 is:
for all tasks in the Stages on the critical path, search for a time gap between two tasks on a virtual machine, and migrate some of the tasks into that gap for execution;
mark an optimized Stage as true, so that it is not optimized again the next time it still lies on the critical path;
after the current Stage has been optimized, recompute the critical path.
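The gap search in claim 5 can be sketched as scanning a virtual machine's scheduled (start, end) task intervals for an idle window long enough to host a migrated task. The interval representation is an assumption for illustration:

```python
def find_gap(intervals, duration, not_before=0.0):
    """Return the earliest start time of an idle gap fitting `duration`.

    intervals: the VM's busy (start, end) periods; falls back to the time
    after the last scheduled task if no interior gap is large enough.
    """
    cursor = not_before
    for start, end in sorted(intervals):
        if start - cursor >= duration:   # idle window before this task fits
            return cursor
        cursor = max(cursor, end)
    return cursor                        # run after the last scheduled task

busy = [(0.0, 4.0), (6.0, 9.0), (9.5, 12.0)]
print(find_gap(busy, 2.0))  # -> 4.0 (the gap [4.0, 6.0) fits a 2-unit task)
print(find_gap(busy, 3.0))  # -> 12.0 (no interior gap of 3; append at the end)
```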
CN202010683860.5A 2020-07-16 2020-07-16 Spark task scheduling method considering data affinity under heterogeneous cluster Active CN111736959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010683860.5A CN111736959B (en) 2020-07-16 2020-07-16 Spark task scheduling method considering data affinity under heterogeneous cluster


Publications (2)

Publication Number Publication Date
CN111736959A CN111736959A (en) 2020-10-02
CN111736959B true CN111736959B (en) 2020-11-27

Family

ID=72654738


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015539B (en) * 2020-10-29 2021-02-02 北京世纪好未来教育科技有限公司 Task allocation method, device and computer storage medium
CN116430738B (en) * 2023-06-14 2023-08-15 北京理工大学 Self-adaptive dynamic scheduling method of hybrid key system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089272A1 (en) * 2016-09-26 2018-03-29 Splunk Inc. Techniques for generating structured metrics from ingested events
US20180300174A1 (en) * 2017-04-17 2018-10-18 Microsoft Technology Licensing, Llc Efficient queue management for cluster scheduling
CN109857526A (en) * 2018-12-27 2019-06-07 曙光信息产业(北京)有限公司 A kind of scheduling system towards mixing computation frame
CN111209104A (en) * 2020-04-21 2020-05-29 南京南软科技有限公司 Energy perception scheduling method for Spark application under heterogeneous cluster

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804211A (en) * 2018-04-27 2018-11-13 西安华为技术有限公司 Thread scheduling method, device, electronic equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant