CN112416585B - Deep learning-oriented GPU resource management and intelligent scheduling method - Google Patents


Info

Publication number
CN112416585B
CN112416585B (application CN202011310749.8A)
Authority
CN
China
Prior art keywords
job
deep learning
resource
scheduling
task
Prior art date
Legal status
Active
Application number
CN202011310749.8A
Other languages
Chinese (zh)
Other versions
CN112416585A (en)
Inventor
顾荣
刘率
王肇康
袁春风
黄宜华
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority claimed from CN202011310749.8A
Publication of CN112416585A (application publication)
Application granted; publication of CN112416585B
Legal status: Active


Classifications

    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5011: Allocation of resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, the resource being the memory
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/4557: Distribution of virtual machine instances; migration and load balancing
    • G06F 2009/45575: Starting, stopping, suspending or resuming virtual machine instances
    • G06N 20/00: Machine learning
    • G06N 3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps: first, a user submits a deep learning job through a front-end interface component, the job comprising the deep learning program to be executed and its training dataset; second, after verification, the job is added to the scheduler's corresponding queue to await scheduling; third, an independent job manager is started for the job; fourth, the job manager applies to the resource manager for the computing resources the job needs to run; fifth, feature modeling and analysis are performed on the job to be scheduled; sixth, a resource scheduling scheme is generated from the job's features and the features of the cluster's computing nodes; seventh, the job is dispatched to the designated computing nodes according to the scheme; eighth, the job executor starts containers and executes the deep learning program. The method addresses the low GPU resource utilization and poor job execution performance of conventional cluster resource scheduling methods in deep learning scenarios.

Description

Deep learning-oriented GPU resource management and intelligent scheduling method
Technical Field
The invention relates to the technical field of cluster resource scheduling, in particular to a deep learning-oriented GPU resource management and intelligent scheduling method.
Background
Research and practice in recent years show that, compared with traditional machine learning, deep learning achieves higher accuracy in fields such as computer vision and speech recognition, and has therefore come into wide use. Training a deep learning model is computationally intensive, and the graphics processing unit (GPU) executes this kind of simple but massive computation efficiently, making it the key underlying computing resource for running deep learning programs.
Because GPU cards are typically expensive, deploying a separate private cluster for every user or group is costly, and users do not train models continuously; users therefore usually share GPU resources to reduce cost. To avoid contention and fully utilize cluster resources, the GPUs and other resources must be managed efficiently and user jobs scheduled uniformly and reasonably.
For GPU resource management and scheduling in a deep learning scene, the following problems exist:
In terms of resource utilization: with rapid hardware progress, new GPU models are released continually, so a cluster usually contains GPU cards of different models that differ greatly in compute power and memory; allocating them indiscriminately leaves some jobs under-provisioned and others over-provisioned. Because mature, efficient GPU virtualization technology is lacking, GPUs are currently used exclusively, yet small jobs for development and test purposes have low resource demands, so exclusive use aggravates resource waste.
In terms of resource scheduling strategy: much deep learning training is still performed on a single machine with a single GPU card, but in pursuit of higher accuracy, networks grow deeper, parameters more numerous, and training datasets larger, so a single GPU card can no longer hold them and becomes a performance bottleneck; single-machine multi-card and multi-machine multi-card distributed training modes have therefore emerged. Unlike big data applications, the instances of a distributed deep learning job exchange massive amounts of data and synchronize information in complex patterns, so an unreasonable resource scheduling scheme greatly degrades job execution performance.
Therefore, designing a scheduling mechanism with which the scheduler still achieves good GPU resource utilization and job execution performance in a deep learning scenario is a very challenging task.
Disclosure of Invention
Purpose of the invention: in view of the above problems and shortcomings of the prior art, the invention aims to provide a deep-learning-oriented GPU resource management and intelligent scheduling method that resolves the low GPU resource utilization and poor job execution performance of existing systems in deep learning scenarios.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme, a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps:
(1) A user submits a deep learning job (job for short) through a front-end interface component, wherein the job comprises a deep learning program to be executed, a program running input data set and task division information of the job;
(2) Performing parameter validity check and authority verification on the job, and then adding the job into a designated queue to be scheduled for waiting scheduling;
(3) When a job is selected to start scheduling, starting an independent job manager for the job to take charge of subsequent operation of the job;
(4) The job manager applies for computing resources required by operation to the global resource manager for each task according to the task division of the deep learning job;
(5) Model and analyze the job's characteristics with an intelligent job-resource-demand prediction model (prediction model for short), covering the job's runtime demand for GPU compute power, GPU memory, CPU, host memory, and network bandwidth, and generate the job's execution resource demand vector;
(6) Generating a resource scheduling scheme of the job by utilizing the resource demand vector returned in the step (5) and combining the distributed architecture of the job and the cluster network topology structure;
(7) According to the resource scheduling scheme, scheduling the job to a designated computing node through a pushing mechanism;
(8) The job executor initiates a separate run container for each task of the job to specifically execute the deep learning program.
Further, in step (3), since most current deep learning frameworks lack an elastic mechanism, a gang (group) scheduling mechanism is adopted to avoid resource deadlock when several jobs are scheduled simultaneously: resource allocation for the next job starts only after all resource requirements of the previous job have been satisfied.
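The gang scheduling mechanism above can be sketched in a few lines of Python; this is an illustrative, all-or-nothing toy, and every name in it (ResourceManager, gang_schedule, the GPU-count fields) is a hypothetical simplification, not the patent's actual implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    gpus: int                      # simplified one-dimensional demand

@dataclass
class Job:
    name: str
    tasks: list

class ResourceManager:
    def __init__(self, free_gpus):
        self.free_gpus = free_gpus
    def can_satisfy_all(self, tasks):
        return sum(t.gpus for t in tasks) <= self.free_gpus
    def allocate(self, task):
        self.free_gpus -= task.gpus

def gang_schedule(queue, rm):
    """All-or-nothing: only after every task of the head job is granted
    do we consider the next job; a job that cannot be fully placed blocks
    the queue rather than holding partial resources (deadlock avoidance)."""
    started = []
    while queue:
        job = queue[0]
        if not rm.can_satisfy_all(job.tasks):
            break                  # hold the head job; no partial grants
        for t in job.tasks:
            rm.allocate(t)
        started.append(queue.popleft().name)
    return started
```

Holding the head of the queue rather than granting partial allocations is what prevents two half-scheduled jobs from each waiting on resources the other holds.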
Further, in the step (3), in order to reduce the load of the job scheduler, an independent job manager is started for each job, and the job manager is responsible for the life cycle management of the job, where the life cycle management includes applying for resources, pushing to a computing node, monitoring the running state, retrying failed tasks, and the like.
Further, in step (4), since most current distributed deep learning frameworks adopt static mapping, the task division of a job is fixed before execution; the scheduling system therefore only needs to allocate resources to the pre-divided tasks and determine a resource scheduling scheme.
Further, in step (5), an intelligent prediction model of job resource demand is built. Its input features include the task division, hyperparameter settings, and dataset scale; its output label is the job resource demand vector (vector for short), comprising CPU, memory, GPU compute power, GPU memory, and network bandwidth. The regression problem corresponding to the prediction model is solved with a traditional machine learning algorithm.
Further, in step (5), features such as the actual resource usage of similar jobs in historical runs are collected, and the prediction model is used to predict the resource demand features of subsequent jobs.
Further, in step (6), the execution order is first determined and the job to be scheduled is selected under the principle of fair scheduling between queues and first-come-first-served scheduling within a queue; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is built from the job's resource demand features; finally, a heuristic genetic algorithm solves for and generates the scheduling scheme.
Beneficial effects: in a deep learning scenario, the method effectively addresses low GPU resource utilization and poor job execution performance. First, the invention abstracts the commonality of existing mainstream deep learning frameworks and provides a framework-agnostic service interface, giving good framework compatibility and usability. Second, it provides an intelligent job-resource-demand prediction model that predicts the runtime characteristics of a job to be scheduled from historical scheduling data, automatically determining the job's resource demand vector and strengthening scheduling. Third, unlike previous methods that treat the job to be scheduled entirely as a black box, the method uses the collected information and considers both the job's distributed topology and the cluster network topology during scheduling, generating a more efficient scheduling scheme and improving job execution performance.
Drawings
FIG. 1 is a schematic diagram of the overall process of the method of the present invention;
FIG. 2 is a schematic diagram illustrating a resource scheduling scheme according to the present invention by using a block coding scheme;
FIG. 3 is a flow chart of a scheduling policy according to the present invention.
Detailed Description
The present invention is further illustrated by the accompanying drawings and the detailed description below, which should be understood as merely illustrating rather than limiting the scope of the invention; modifications of equivalent form that occur to those skilled in the art upon reading the invention fall within the scope defined by the appended claims.
The invention provides a deep learning-oriented GPU resource management and intelligent scheduling method, which solves the problems of low GPU resource utilization rate and poor job execution performance in a deep learning scene.
As shown in FIG. 1, the complete flow of the invention comprises eight stages: job submission, authority verification, job manager startup, resource application, job feature modeling and analysis, resource scheduling scheme generation, job distribution, and job execution. The specific embodiments are described below:
The job submission stage corresponds to step (1) of the technical scheme. Specific implementation: a user submits a deep learning job through the visual management front end or the API, comprising the executable deep learning program, the input training set the program requires, the job's task division, and the program's startup parameters. In the scheduling system of the invention, a job is defined as follows: a job consists of several tasks; in the Parameter Server architecture, for example, these include both parameter servers and workers (Worker). A single-machine single-card or single-machine multi-card job has exactly one task, while a multi-machine multi-card job comprises several tasks, each corresponding to a parameter server or a worker. Since most deep learning frameworks lack an elastic mechanism, the number and division of tasks are specified by the user at submission. During scheduling, each task is scheduled to run on one computing node (physical machine), and one computing node can run several tasks simultaneously. A job is the basic unit of user submission, while a task is the basic unit the system schedules for execution.
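The job/task model above can be sketched as plain data types. The field names, roles, and numbers below are illustrative assumptions for a Parameter Server style job, not the patent's actual data structures.

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    PARAMETER_SERVER = "ps"
    WORKER = "worker"

@dataclass
class Task:
    role: Role
    gpus: int                 # GPUs this task needs on its node
    gpu_memory_gb: float

@dataclass
class Job:
    user: str
    program: str              # path to the deep learning program
    dataset: str
    tasks: list               # task division is fixed at submission time

# A single-machine job has exactly one task; a multi-machine PS job has
# one task per parameter server or worker instance.
single = Job("alice", "train.py", "imagenet", [Task(Role.WORKER, 1, 16.0)])
dist = Job("bob", "train.py", "imagenet",
           [Task(Role.PARAMETER_SERVER, 0, 4.0)] +
           [Task(Role.WORKER, 2, 16.0) for _ in range(3)])
```

Here the job is the unit the user submits, while each entry of `tasks` is the unit the scheduler places onto a computing node.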
The authority verification stage corresponds to the technical scheme step (2). The specific implementation mode is as follows: after receiving the job submitted by the user, firstly checking the validity and the integrity of the job parameters, then verifying whether the user has authority to submit the job to a designated queue to be scheduled, finally adding the job to the designated queue to be scheduled of the scheduler after the verification is passed, and recording the request.
The starting stage of the job manager corresponds to the technical scheme step (3). The specific implementation mode is as follows: the scheduler decides the execution sequence of the job to be scheduled according to the fair scheduling principle. When the job is selected to begin scheduling execution, a separate job manager is started to take charge of the subsequent lifecycle flow of the job.
The resource application stage corresponds to the technical scheme step (4). The specific implementation mode is as follows: the job manager divides the job into tasks according to the job, and then applies for computing resources for each task until all task resource requirements of the job are satisfied.
The job feature modeling and analysis stage corresponds to step (5) of the technical scheme. Specific implementation: the scheduling system uses the actual CPU, memory, GPU compute power, GPU memory, and network bandwidth usage collected from historical runs of similar jobs, trains a job-resource-demand-vector prediction model on these data with a traditional machine learning method, the random forest algorithm, and then uses the model to predict the runtime resource demand vector of the job to be scheduled, so that appropriate resources are allocated to the job and an appropriate scheduling scheme is selected.
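The patent names a random forest trained on historical job features; a minimal sketch using scikit-learn's `RandomForestRegressor` (which handles multi-output regression directly) might look like the following. The feature columns, label columns, and all numbers are invented for illustration and are not the patent's actual training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Features per historical job: [num_workers, num_ps, batch_size, dataset_gb]
X_hist = np.array([
    [1, 0,  32,  10],
    [4, 1,  64, 100],
    [8, 2, 128, 500],
    [2, 1,  32,  50],
])
# Labels per job: [cpu_cores, mem_gb, gpu_util, gpu_mem_gb, net_gbps]
y_hist = np.array([
    [ 4,  16, 0.4,  8, 0.1],
    [16,  64, 0.7, 12, 2.0],
    [32, 128, 0.9, 16, 8.0],
    [ 8,  32, 0.5, 10, 1.0],
])

# One forest predicts the whole five-component resource demand vector.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_hist, y_hist)

# Predict the demand vector of a new job to be scheduled.
demand = model.predict([[4, 1, 64, 120]])[0]
```

The predicted vector then feeds the scheme generation stage, which matches it against GPU models and node capacities.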
The resource scheduling scheme generation stage corresponds to step (6) of the technical scheme. Specific implementation: first, computing nodes and GPUs that do not meet the requirements are filtered out to obtain a candidate node list; next, using the job resource demand vector returned in step (5), the GPU model best matching the job's GPU compute demand is determined and the candidate list is narrowed to nodes holding that model (if GPUs of that model are insufficient, the closest-performing model is chosen instead); finally, a heuristic grouping genetic algorithm generates a better resource scheduling scheme.
To solve the resource scheduling problem of the invention, the scheduling object must be defined formally. For a distributed model training job, different resource scheduling schemes greatly influence job execution performance, which is mainly determined by network communication quality; network communication must therefore be considered when generating a scheme, with other objectives such as reducing resource fragmentation considered as far as possible on that basis. First, candidate schemes are evaluated from the job topology, the cluster network topology, and related factors: a score is computed from two parts, the network communication overhead Cost_network and the node matching fitness Fitness_node. The goal of scheduling is to minimize the score.

Score = Cost_network + Σ Fitness_node
The following describes how to solve the resource scheduling scheme generation problem of the present invention in combination with a packet genetic algorithm:
1) Encoding scheme. Fig. 2 illustrates how a resource scheduling scheme is represented with block (group) encoding under the scheduling policy of the invention (two schemes placing 8 tasks onto several computing nodes). In genetic algorithm terms, each chromosome represents a resource scheduling scheme, each gene represents a task, and each gene group represents a computing node; crossover and mutation operate on whole computing nodes.
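A minimal sketch of the block encoding, with hypothetical task ids: a chromosome is a list of groups, one group per computing node used, and each group holds the tasks placed on that node.

```python
# Two example chromosomes placing the same 8 tasks (ids 0..7):
chrom_a = [[0, 1, 2], [3, 4], [5, 6, 7]]     # uses 3 computing nodes
chrom_b = [[0, 4], [1, 5], [2, 6], [3, 7]]   # uses 4 computing nodes

def is_valid(chrom, num_tasks):
    """A chromosome is valid when every task appears in exactly one group."""
    flat = sorted(t for group in chrom for t in group)
    return flat == list(range(num_tasks))
```

Because a group is a whole node's placement, crossover and mutation can move nodes around without ever splitting a task, which is why the encoding suits the grouping genetic algorithm.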
2) Initial population generation. Two simple algorithms, First-Fit and Random-Fit, are used to generate a number of initial resource scheduling schemes. First-Fit schedules each task onto the first computing node where it can be placed, while Random-Fit randomly selects a node that meets the requirements. Both algorithms have low complexity and run fast enough.
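The two initialization algorithms might be sketched as follows; this toy uses a single scalar demand per task and a single capacity per node, whereas the real system would check the full resource demand vector.

```python
import random

def first_fit(tasks, node_capacity):
    """Place each task demand on the first node (group) with room;
    open a new node when none fits."""
    nodes = []                         # each node: list of demands placed
    for demand in tasks:
        for node in nodes:
            if sum(node) + demand <= node_capacity:
                node.append(demand)
                break
        else:
            nodes.append([demand])
    return nodes

def random_fit(tasks, node_capacity, rng=random):
    """Place each task demand on a random node that has room."""
    nodes = []
    for demand in tasks:
        feasible = [n for n in nodes if sum(n) + demand <= node_capacity]
        if feasible:
            rng.choice(feasible).append(demand)
        else:
            nodes.append([demand])
    return nodes
```

Seeding the population with both gives the genetic algorithm one compact scheme and several diverse ones to recombine.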
3) Fitness function and selection strategy. The purpose of the scheduling is to reduce communication overhead, so the fitness function is the negative of the scheduling objective (the smaller the network communication overhead, the better the fitness). To speed up convergence, an additional term NumNodes_using is introduced, the number of computing nodes the scheme requires, so that among schemes with similar network overhead the algorithm prefers those using fewer nodes:

Fitness = -(Cost_network + Σ Fitness_node + NumNodes_using)
The tournament method is chosen as the selection strategy. It runs several elimination rounds and keeps the best candidate, needs no full sort of the population, has low complexity, can be parallelized, and incurs little time overhead, making it well suited to online scheduling scenarios.
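A sketch of tournament selection (the function name and signature are illustrative): draw k candidates at random and keep the fittest, so no round ever needs a full sort of the population and rounds are independent of one another.

```python
import random

def tournament_select(population, fitness, k=3, rng=random):
    """One tournament round: sample k contenders uniformly without
    replacement and return the one with the best fitness."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)
```

With k equal to the population size the round degenerates to picking the global best; small k (2 or 3) keeps selection pressure gentle and cheap.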
4) Crossover and mutation rules. The crossover process is as follows: first, two schemes X and Y are selected from the current population with the selection strategy, and a crossover point (a computing node) and a crossover position are chosen in each; next, the selected computing node, together with the tasks on it, is inserted at the crossover position of the other scheme; then, since the new scheme may now contain duplicated computing nodes and tasks, these duplicates must be deleted, and because the basic unit of crossover and mutation is a computing node, the duplicated nodes and the nodes holding duplicated tasks are removed as a whole; finally, since the tasks on the deleted nodes are deleted with them, these tasks are re-placed onto the remaining computing nodes with the first-fit algorithm. Mutation is similar: a scheme Y is selected, a computing node is chosen at random, the node and its tasks are deleted, and the deleted tasks are re-placed onto the remaining nodes with the first-fit algorithm, yielding a new scheduling scheme.
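The delete-and-repair pattern shared by both operators can be sketched for the mutation case; crossover follows the same repair step after injecting a node from the other parent. All names are illustrative, and as above a single scalar demand stands in for the full resource vector.

```python
import random

def repair(chrom, removed_tasks, capacity, demand):
    """Re-insert tasks from deleted groups using first-fit."""
    for t in removed_tasks:
        for group in chrom:
            if sum(demand[x] for x in group) + demand[t] <= capacity:
                group.append(t)
                break
        else:
            chrom.append([t])        # no room anywhere: open a new node

def mutate(chrom, capacity, demand, rng=random):
    """Grouping-GA mutation: delete one random group (computing node)
    and re-place its tasks with first-fit, yielding a valid scheme."""
    chrom = [list(g) for g in chrom]             # work on a copy
    victim = chrom.pop(rng.randrange(len(chrom)))
    repair(chrom, victim, capacity, demand)
    return chrom
```

Because repair always re-places every evicted task, the operator can never produce a chromosome where a task is lost or duplicated, which is the invariant the block encoding needs.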
Fig. 3 illustrates the resource scheduling scheme generation flow of the invention. Because of resource fragmentation, the selected jobs may not all be schedulable at once; in that case the scheduler waits for a period and tries again to generate a valid resource scheduling scheme. When resources are scattered, the generated scheme may not be good enough, and the scheduler can wait to see whether more, and more suitable, resources are released; finally, once the generated scheme is good enough or the scheduling algorithm's running time exceeds its limit, the job is scheduled according to the current best resource scheduling scheme.
The job distribution stage corresponds to the technical scheme step (7). The specific implementation mode is as follows: after all task resource requirements of the job are met, the job manager pushes the task of the job to the corresponding computing node according to a resource scheduling scheme, and the job is waited for execution.
The job execution stage corresponds to step (8) of the technical scheme. Specific implementation: first, a corresponding runtime environment (a container) is created for the job, with its available resources limited according to the job's resource demands; after the container starts, the user's deep learning program contained in the job is downloaded to a designated location inside the container; the training dataset required for model training is then mounted to the corresponding local directory; next, the user's deep learning program (program for short) is launched via the start command and its running state is continuously monitored; finally, after the program finishes, its output files are transferred to reliable external storage (HDFS), the container is destroyed, and the system resources it occupied are released.
The invention provides a deep-learning-oriented GPU resource management and intelligent scheduling method. By modeling and analyzing jobs, the system can effectively predict a job's runtime resource demand in advance. Compared with common scheduling methods (scatter, pack, random), the execution time of a single job is reduced by 33.5% to 59.5%, and the average job completion time (Job Completion Time, JCT) of multiple jobs is further reduced by 10%. Compared with the existing Kubernetes system, the resource scheduling method reduces average job completion time by 48%. In terms of system scalability, scheduler throughput remains stable as cluster nodes are added, showing good scalability. The deep-learning-oriented GPU resource management and intelligent scheduling method of the invention thus has a significant performance optimization effect.

Claims (7)

1. A deep learning-oriented GPU resource management and intelligent scheduling method, comprising the following steps:
(1) A user submits a deep learning job through a front-end interface component, wherein the deep learning job comprises the deep learning program to be executed, the input data set for the program, and the task division information of the job;
(2) Parameter validity checking and permission verification are performed on the deep learning job, which is then added to a designated queue to await scheduling;
(3) When the deep learning job is selected for scheduling, an independent job manager is started for it to take charge of its subsequent operation;
(4) The job manager applies to the global resource manager for the computing resources required by each task, according to the task division of the deep learning job;
(5) The job's runtime characteristics are modeled and analyzed by an intelligent job resource demand prediction model, covering GPU compute power, GPU memory, CPU, main memory, and network bandwidth demands at runtime, and a job execution resource demand vector is generated;
(6) A resource scheduling scheme for the deep learning job is generated using the resource demand vector returned in step (5), combined with the job's distributed architecture and the cluster network topology;
(7) According to the resource scheduling scheme, the deep learning job is dispatched to the designated computing nodes through a push mechanism;
(8) The job executor starts a separate running container for each task of the deep learning job to execute the deep learning program.
2. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), a gang (group) scheduling mechanism is adopted: all resource requirements of the preceding job are satisfied before resources are allocated to the next job.
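A minimal sketch of the gang-scheduling rule in claim 2 (the function name and job representation are illustrative assumptions): a job receives resources only when all of its tasks can be placed at once, and jobs behind an unsatisfiable queue head must wait.

```python
def gang_schedule(queue, free_gpus):
    """Strict FIFO gang scheduling: a job starts only when ALL of its tasks
    fit simultaneously; otherwise the queue head blocks, which prevents
    partial allocations from deadlocking each other."""
    started = []
    for job_id, gpus_per_task, n_tasks in queue:
        demand = gpus_per_task * n_tasks
        if demand <= free_gpus:
            free_gpus -= demand      # all-or-nothing allocation
            started.append(job_id)
        else:
            break                    # head job waits; later jobs do not jump ahead
    return started, free_gpus

started, left = gang_schedule([("jobA", 1, 2), ("jobB", 2, 3), ("jobC", 1, 1)], free_gpus=7)
# jobA takes 2 GPUs; jobB needs 6 but only 5 remain, so jobB and jobC wait
```

The strict FIFO variant shown here trades some utilization for simplicity; backfilling smaller jobs behind a blocked head is a common refinement but is not part of this claim.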
3. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), an independent job manager is started for each deep learning job; the job manager is responsible for life-cycle management of the job, including resource application, pushing to computing nodes, running-state monitoring, and retrying failed tasks.
4. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (4), the task division of the deep learning job is determined before execution, so the scheduling system only needs to allocate resources to the pre-divided tasks and determine the resource scheduling scheme.
5. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), an intelligent task resource demand prediction model is built, whose input features include the task division, hyper-parameter settings, and data set scale, and whose output label is the task execution resource demand vector, covering CPU, memory, GPU memory, and network bandwidth; the corresponding regression problem is solved with a traditional machine learning algorithm.
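To illustrate the regression formulation in claim 5, the sketch below fits a single-feature ordinary-least-squares model in pure Python. The feature (batch size), the numbers, and the function name are assumptions for illustration; the patent's model takes richer inputs such as task division, hyper-parameter settings, and data set scale, and may use any traditional regressor.

```python
# Historical jobs: (batch size, observed GPU memory demand in GB).
# These numbers are made up for illustration.
history = [(16, 3.1), (32, 5.0), (64, 9.2), (128, 17.5)]

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b, standing in for the
    # "traditional machine learning algorithm" named in the claim.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

a, b = fit_linear([bs for bs, _ in history], [gb for _, gb in history])
predicted_gb = a * 96 + b  # predicted GPU memory for a new job with batch size 96
```

A real deployment would fit one such regressor per resource dimension (CPU, memory, GPU memory, bandwidth) to produce the full demand vector.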
6. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), the actual resource demand characteristics of similar jobs in historical runs are collected, and the resource demand characteristics of subsequent deep learning jobs are predicted with the intelligent job resource demand prediction model.
7. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (6), the execution order of jobs is first determined by fair scheduling among queues and first-come-first-served scheduling within a queue, and the job to be scheduled is selected; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is built from the job's resource demand characteristics; finally, a heuristic genetic algorithm is used to solve for and generate the resource scheduling scheme.
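The genetic-algorithm placement step in claim 7 can be sketched compactly under simplifying assumptions: node capacity constraints are omitted, the communication cost model is just pairwise task traffic weighted by inter-node network distance, and all names, parameters, and data are illustrative, not the patented algorithm.

```python
import random

def comm_cost(placement, traffic, dist):
    # Total cross-task traffic weighted by the network distance between
    # the nodes the two tasks are placed on (0 when co-located here).
    return sum(v * dist[placement[i]][placement[j]] for (i, j), v in traffic.items())

def genetic_place(n_tasks, nodes, traffic, dist, pop=30, gens=60, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(nodes) for _ in range(n_tasks)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda p: comm_cost(p, traffic, dist))
        survivors = population[: pop // 2]   # selection: keep the cheaper half
        children = []
        while len(survivors) + len(children) < pop:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:           # mutation: move one task
                child[rng.randrange(n_tasks)] = rng.choice(nodes)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda p: comm_cost(p, traffic, dist))

# Four tasks on two nodes; tasks 0-1 and 2-3 exchange heavy traffic.
traffic = {(0, 1): 10, (2, 3): 10, (0, 2): 1}
dist = [[0, 1], [1, 0]]  # co-located = 0, cross-node = 1
best = genetic_place(4, [0, 1], traffic, dist)
```

The heuristic drives heavily-communicating task pairs onto the same node; a production version would add GPU/CPU capacity constraints to the fitness function so infeasible placements are penalized.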
CN202011310749.8A 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method Active CN112416585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310749.8A CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Publications (2)

Publication Number Publication Date
CN112416585A CN112416585A (en) 2021-02-26
CN112416585B true CN112416585B (en) 2024-03-15

Family

ID=74776959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310749.8A Active CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Country Status (1)

Country Link
CN (1) CN112416585B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094116B (en) * 2021-04-01 2022-10-11 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
CN113608722A (en) * 2021-07-31 2021-11-05 云南电网有限责任公司信息中心 Algorithm packaging method based on distributed technology
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN115202850B (en) * 2022-09-09 2022-12-20 国家超级计算天津中心 Job scheduling method, device, electronic equipment and storage medium
CN117827415A (en) * 2022-09-27 2024-04-05 中兴通讯股份有限公司 GPU resource scheduling method, server and storage medium

Citations (10)

Publication number Priority date Publication date Assignee Title
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling
CN108881446A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A kind of artificial intelligence plateform system based on deep learning
CN109034396A (en) * 2018-07-11 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for handling the deep learning operation in distributed type assemblies
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 A kind of operation method and device of deep learning operation
CN109189401A (en) * 2018-07-06 2019-01-11 曙光信息产业(北京)有限公司 A kind of dispositions method and system of deep learning frame
CN110399222A (en) * 2019-07-25 2019-11-01 北京邮电大学 GPU cluster deep learning task parallel method, device and electronic equipment
CN110442451A (en) * 2019-07-12 2019-11-12 中电海康集团有限公司 A kind of polymorphic type GPU cluster resource management dispatching method and system towards deep learning
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
KR102140730B1 (en) * 2019-12-17 2020-08-04 (주) 씨이랩 Method and system for providing develop environment of deep learning based gpu
CN111694656A (en) * 2020-04-22 2020-09-22 北京大学 Cluster resource scheduling method and system based on multi-agent deep reinforcement learning

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN107885762B (en) * 2017-09-19 2021-06-11 北京百度网讯科技有限公司 Intelligent big data system, method and equipment for providing intelligent big data service
US10884795B2 (en) * 2018-04-26 2021-01-05 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster


Non-Patent Citations (3)

Title
Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster; Víctor Campos et al.; Procedia Computer Science; full text *
Research on deep learning cloud service adaptation; Lin Jian et al.; Software Guide (软件导刊); full text *
A self-learning load-balancing scheduling algorithm for heterogeneous GPU clusters; Liu Hui et al.; Journal of Xi'an Shiyou University (Natural Science Edition); Vol. 30, No. 3; full text *


Similar Documents

Publication Publication Date Title
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN110737529B (en) Short-time multi-variable-size data job cluster scheduling adaptive configuration method
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
US20200257968A1 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
Polo et al. Performance-driven task co-scheduling for mapreduce environments
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
CN113377540A (en) Cluster resource scheduling method and device, electronic equipment and storage medium
CN104765640B (en) A kind of intelligent Service dispatching method
CN111381950A (en) Task scheduling method and system based on multiple copies for edge computing environment
CN110262897B (en) Hadoop calculation task initial allocation method based on load prediction
WO2021180092A1 (en) Task dispatching method and apparatus
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN110221909A (en) A kind of Hadoop calculating task supposition execution method based on load estimation
CN109710372A (en) A kind of computation-intensive cloud workflow schedule method based on cat owl searching algorithm
CN109740870A (en) The resource dynamic dispatching method that Web is applied under cloud computing environment
CN114911613A (en) Cross-cluster resource high-availability scheduling method and system in inter-cloud computing environment
CN111061565A (en) Two-stage pipeline task scheduling method and system in Spark environment
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
Ibrahim et al. Improving mapreduce performance with progress and feedback based speculative execution
CN115098240B (en) Multiprocessor application scheduling method and system and storage medium
Chen et al. Deadline-constrained MapReduce scheduling based on graph modelling
Singh et al. Market-inspired dynamic resource allocation in many-core high performance computing systems
CN116010051A (en) Federal learning multitasking scheduling method and device
Hu et al. An optimal resource allocator of elastic training for deep learning jobs on cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant