CN112416585B - Deep learning-oriented GPU resource management and intelligent scheduling method - Google Patents


Info

Publication number
CN112416585B
CN112416585B (application CN202011310749.8A)
Authority
CN
China
Prior art keywords
job
deep learning
resource
scheduling
task
Prior art date
Legal status
Active
Application number
CN202011310749.8A
Other languages
Chinese (zh)
Other versions
CN112416585A (en)
Inventor
顾荣
刘率
王肇康
袁春风
黄宜华
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority claimed from CN202011310749.8A
Publication of CN112416585A (application publication)
Application granted; publication of CN112416585B
Legal status: Active


Classifications

    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5011: Allocation of resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, the resource being the memory
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/4557: Distribution of virtual machine instances; migration and load balancing
    • G06F 2009/45575: Starting, stopping, suspending or resuming virtual machine instances
    • G06N 20/00: Machine learning
    • G06N 3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps: first, a user submits a deep learning job through a front-end interface component, the job comprising the deep learning program to be executed and its training dataset; second, after verification, the job is added to the scheduler's corresponding queue to await scheduling; third, an independent job manager is started for the job; fourth, the job manager applies to the resource manager for the computing resources the job needs to run; fifth, feature modeling and analysis are performed on the job to be scheduled; sixth, a resource scheduling scheme is generated from the job's features and the features of the cluster's computing nodes; seventh, the job is dispatched to the designated computing nodes according to the scheme; eighth, the job executor starts containers and executes the deep learning program. The method addresses the low GPU resource utilization and poor job execution performance of conventional cluster resource scheduling methods in deep learning scenarios.

Description

Deep learning-oriented GPU resource management and intelligent scheduling method
Technical Field
The invention relates to the technical field of cluster resource scheduling, in particular to a deep learning-oriented GPU resource management and intelligent scheduling method.
Background
Research and practice in recent years show that, compared with traditional machine learning, deep learning achieves higher accuracy in fields such as computer vision and speech recognition, and has therefore come into wide use. Training a deep learning model is computationally intensive, and the graphics processing unit (GPU) executes this kind of simple but massive computation efficiently, making it the key underlying computing resource for running deep learning programs.
Because GPU cards are typically expensive, deploying a separate private cluster for every user or group is costly, and users do not train models continuously; users therefore usually share GPU resources to reduce cost. To avoid contention and fully utilize cluster resources, the GPUs and other resources must be managed efficiently and user jobs scheduled uniformly and reasonably.
For GPU resource management and scheduling in a deep learning scene, the following problems exist:
In terms of resource utilization: with rapid hardware progress, new GPU models are released continually, so a cluster usually contains GPU cards of different models that differ greatly in compute power and memory; allocating them indiscriminately leaves some jobs under-provisioned and others over-provisioned. Because mature, efficient GPU virtualization technology is lacking, GPUs are currently used exclusively, yet small jobs for development and test purposes have low resource demands, so exclusive use aggravates resource waste.
In terms of resource scheduling strategy: much deep learning training is still performed on a single machine with a single GPU card, but in pursuit of higher accuracy, networks grow deeper, parameters more numerous, and training datasets larger, so a single GPU card can no longer hold them and becomes a performance bottleneck; single-machine multi-card and multi-machine multi-card distributed training modes have therefore emerged. Unlike big data applications, the instances of a distributed deep learning job exchange massive amounts of data and synchronize information in complex patterns, so an unreasonable resource scheduling scheme greatly degrades job execution performance.
Therefore, designing a scheduling mechanism with which the scheduler still achieves good GPU resource utilization and job execution performance in a deep learning scenario is a very challenging task.
Disclosure of Invention
Purpose of the invention: in view of the above problems and shortcomings of the prior art, the invention aims to provide a deep-learning-oriented GPU resource management and intelligent scheduling method that resolves the low GPU resource utilization and poor job execution performance of existing systems in deep learning scenarios.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme, a deep-learning-oriented GPU resource management and intelligent scheduling method comprising the following steps:
(1) A user submits a deep learning job (job for short) through a front-end interface component, wherein the job comprises a deep learning program to be executed, a program running input data set and task division information of the job;
(2) Performing parameter validity check and authority verification on the job, and then adding the job into a designated queue to be scheduled for waiting scheduling;
(3) When a job is selected to start scheduling, starting an independent job manager for the job to take charge of subsequent operation of the job;
(4) The job manager applies for computing resources required by operation to the global resource manager for each task according to the task division of the deep learning job;
(5) Model and analyze the job's characteristics with an intelligent job-resource-demand prediction model (prediction model for short), covering the job's runtime demand for GPU compute power, GPU memory, CPU, host memory, and network bandwidth, and generate the job's execution resource demand vector;
(6) Generating a resource scheduling scheme of the job by utilizing the resource demand vector returned in the step (5) and combining the distributed architecture of the job and the cluster network topology structure;
(7) According to the resource scheduling scheme, scheduling the job to a designated computing node through a pushing mechanism;
(8) The job executor initiates a separate run container for each task of the job to specifically execute the deep learning program.
Further, in step (3), since most current deep learning frameworks lack an elastic mechanism, a gang (group) scheduling mechanism is adopted to avoid resource deadlock when several jobs are scheduled simultaneously: resource allocation for the next job starts only after all resource requirements of the previous job have been satisfied.
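The gang scheduling mechanism above can be sketched in a few lines of Python; this is an illustrative, all-or-nothing toy, and every name in it (ResourceManager, gang_schedule, the GPU-count fields) is a hypothetical simplification, not the patent's actual implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    gpus: int                      # simplified one-dimensional demand

@dataclass
class Job:
    name: str
    tasks: list

class ResourceManager:
    def __init__(self, free_gpus):
        self.free_gpus = free_gpus
    def can_satisfy_all(self, tasks):
        return sum(t.gpus for t in tasks) <= self.free_gpus
    def allocate(self, task):
        self.free_gpus -= task.gpus

def gang_schedule(queue, rm):
    """All-or-nothing: only after every task of the head job is granted
    do we consider the next job; a job that cannot be fully placed blocks
    the queue rather than holding partial resources (deadlock avoidance)."""
    started = []
    while queue:
        job = queue[0]
        if not rm.can_satisfy_all(job.tasks):
            break                  # hold the head job; no partial grants
        for t in job.tasks:
            rm.allocate(t)
        started.append(queue.popleft().name)
    return started
```

Holding the head of the queue rather than granting partial allocations is what prevents two half-scheduled jobs from each waiting on resources the other holds.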
Further, in the step (3), in order to reduce the load of the job scheduler, an independent job manager is started for each job, and the job manager is responsible for the life cycle management of the job, where the life cycle management includes applying for resources, pushing to a computing node, monitoring the running state, retrying failed tasks, and the like.
Further, in step (4), since most current distributed deep learning frameworks adopt static mapping, the task division of a job is fixed before execution; the scheduling system therefore only needs to allocate resources to the pre-divided tasks and determine a resource scheduling scheme.
Further, in step (5), an intelligent prediction model of job resource demand is built. Its input features include the task division, hyperparameter settings, and dataset scale; its output label is the job resource demand vector (vector for short), comprising CPU, memory, GPU compute power, GPU memory, and network bandwidth. The regression problem corresponding to the prediction model is solved with a traditional machine learning algorithm.
Further, in step (5), features such as the actual resource usage of similar jobs in historical runs are collected, and the prediction model is used to predict the resource demand features of subsequent jobs.
Further, in step (6), the execution order is first determined and the job to be scheduled is selected under the principle of fair scheduling between queues and first-come-first-served scheduling within a queue; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is built from the job's resource demand features; finally, a heuristic genetic algorithm solves for and generates the scheduling scheme.
Beneficial effects: in a deep learning scenario, the method effectively addresses low GPU resource utilization and poor job execution performance. First, the invention abstracts the commonality of existing mainstream deep learning frameworks and provides a framework-agnostic service interface, giving good framework compatibility and usability. Second, it provides an intelligent job-resource-demand prediction model that predicts the runtime characteristics of a job to be scheduled from historical scheduling data, automatically determining the job's resource demand vector and strengthening scheduling. Third, unlike previous methods that treat the job to be scheduled entirely as a black box, the method uses the collected information and considers both the job's distributed topology and the cluster network topology during scheduling, generating a more efficient scheduling scheme and improving job execution performance.
Drawings
FIG. 1 is a schematic diagram of the overall process of the method of the present invention;
FIG. 2 is a schematic diagram illustrating a resource scheduling scheme according to the present invention by using a block coding scheme;
FIG. 3 is a flow chart of a scheduling policy according to the present invention.
Detailed Description
The present invention is further illustrated by the accompanying drawings and the detailed description below, which should be understood as merely illustrating rather than limiting the scope of the invention; modifications of equivalent form that occur to those skilled in the art upon reading the invention fall within the scope defined by the appended claims.
The invention provides a deep learning-oriented GPU resource management and intelligent scheduling method, which solves the problems of low GPU resource utilization rate and poor job execution performance in a deep learning scene.
As shown in FIG. 1, the complete flow of the invention comprises eight stages: job submission, authority verification, job manager startup, resource application, job feature modeling and analysis, resource scheduling scheme generation, job distribution, and job execution. The specific embodiments are described below:
The job submission stage corresponds to step (1) of the technical scheme. Specific implementation: a user submits a deep learning job through the visual management front end or the API, comprising the executable deep learning program, the input training set the program requires, the job's task division, and the program's startup parameters. In the scheduling system of the invention, a job is defined as follows: a job consists of several tasks; in the Parameter Server architecture, for example, these include both parameter servers and workers (Worker). A single-machine single-card or single-machine multi-card job has exactly one task, while a multi-machine multi-card job comprises several tasks, each corresponding to a parameter server or a worker. Since most deep learning frameworks lack an elastic mechanism, the number and division of tasks are specified by the user at submission. During scheduling, each task is scheduled to run on one computing node (physical machine), and one computing node can run several tasks simultaneously. A job is the basic unit of user submission, while a task is the basic unit the system schedules for execution.
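The job/task model above can be sketched as plain data types. The field names, roles, and numbers below are illustrative assumptions for a Parameter Server style job, not the patent's actual data structures.

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    PARAMETER_SERVER = "ps"
    WORKER = "worker"

@dataclass
class Task:
    role: Role
    gpus: int                 # GPUs this task needs on its node
    gpu_memory_gb: float

@dataclass
class Job:
    user: str
    program: str              # path to the deep learning program
    dataset: str
    tasks: list               # task division is fixed at submission time

# A single-machine job has exactly one task; a multi-machine PS job has
# one task per parameter server or worker instance.
single = Job("alice", "train.py", "imagenet", [Task(Role.WORKER, 1, 16.0)])
dist = Job("bob", "train.py", "imagenet",
           [Task(Role.PARAMETER_SERVER, 0, 4.0)] +
           [Task(Role.WORKER, 2, 16.0) for _ in range(3)])
```

Here the job is the unit the user submits, while each entry of `tasks` is the unit the scheduler places onto a computing node.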
The authority verification stage corresponds to the technical scheme step (2). The specific implementation mode is as follows: after receiving the job submitted by the user, firstly checking the validity and the integrity of the job parameters, then verifying whether the user has authority to submit the job to a designated queue to be scheduled, finally adding the job to the designated queue to be scheduled of the scheduler after the verification is passed, and recording the request.
The starting stage of the job manager corresponds to the technical scheme step (3). The specific implementation mode is as follows: the scheduler decides the execution sequence of the job to be scheduled according to the fair scheduling principle. When the job is selected to begin scheduling execution, a separate job manager is started to take charge of the subsequent lifecycle flow of the job.
The resource application stage corresponds to the technical scheme step (4). The specific implementation mode is as follows: the job manager divides the job into tasks according to the job, and then applies for computing resources for each task until all task resource requirements of the job are satisfied.
The job feature modeling and analysis stage corresponds to step (5) of the technical scheme. Specific implementation: the scheduling system uses the actual CPU, memory, GPU compute power, GPU memory, and network bandwidth usage collected from historical runs of similar jobs, trains a job-resource-demand-vector prediction model on these data with a traditional machine learning method, the random forest algorithm, and then uses the model to predict the runtime resource demand vector of the job to be scheduled, so that appropriate resources are allocated to the job and an appropriate scheduling scheme is selected.
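The patent names a random forest trained on historical job features; a minimal sketch using scikit-learn's `RandomForestRegressor` (which handles multi-output regression directly) might look like the following. The feature columns, label columns, and all numbers are invented for illustration and are not the patent's actual training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Features per historical job: [num_workers, num_ps, batch_size, dataset_gb]
X_hist = np.array([
    [1, 0,  32,  10],
    [4, 1,  64, 100],
    [8, 2, 128, 500],
    [2, 1,  32,  50],
])
# Labels per job: [cpu_cores, mem_gb, gpu_util, gpu_mem_gb, net_gbps]
y_hist = np.array([
    [ 4,  16, 0.4,  8, 0.1],
    [16,  64, 0.7, 12, 2.0],
    [32, 128, 0.9, 16, 8.0],
    [ 8,  32, 0.5, 10, 1.0],
])

# One forest predicts the whole five-component resource demand vector.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_hist, y_hist)

# Predict the demand vector of a new job to be scheduled.
demand = model.predict([[4, 1, 64, 120]])[0]
```

The predicted vector then feeds the scheme generation stage, which matches it against GPU models and node capacities.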
The resource scheduling scheme generation stage corresponds to step (6) of the technical scheme. Specific implementation: first, computing nodes and GPUs that do not meet the requirements are filtered out to obtain a candidate node list; next, using the job resource demand vector returned in step (5), the GPU model best matching the job's GPU compute demand is determined and the candidate list is narrowed to nodes holding that model (if GPUs of that model are insufficient, the closest-performing model is chosen instead); finally, a heuristic grouping genetic algorithm generates a better resource scheduling scheme.
To solve the resource scheduling problem of the invention, the scheduling object must be defined formally. For a distributed model training job, different resource scheduling schemes greatly influence job execution performance, which is mainly determined by network communication quality; network communication must therefore be considered when generating a scheme, with other objectives such as reducing resource fragmentation considered as far as possible on that basis. First, candidate schemes are evaluated from the job topology, the cluster network topology, and related factors: a score is computed from two parts, the network communication overhead Cost_network and the node matching fitness Fitness_node. The goal of scheduling is to minimize the score.

Score = Cost_network + Σ Fitness_node
The following describes how to solve the resource scheduling scheme generation problem of the present invention in combination with a packet genetic algorithm:
1) Encoding scheme. Fig. 2 illustrates how a resource scheduling scheme is represented with block (group) encoding under the scheduling policy of the invention (two schemes placing 8 tasks onto several computing nodes). In genetic algorithm terms, each chromosome represents a resource scheduling scheme, each gene represents a task, and each gene group represents a computing node; crossover and mutation operate on whole computing nodes.
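A minimal sketch of the block encoding, with hypothetical task ids: a chromosome is a list of groups, one group per computing node used, and each group holds the tasks placed on that node.

```python
# Two example chromosomes placing the same 8 tasks (ids 0..7):
chrom_a = [[0, 1, 2], [3, 4], [5, 6, 7]]     # uses 3 computing nodes
chrom_b = [[0, 4], [1, 5], [2, 6], [3, 7]]   # uses 4 computing nodes

def is_valid(chrom, num_tasks):
    """A chromosome is valid when every task appears in exactly one group."""
    flat = sorted(t for group in chrom for t in group)
    return flat == list(range(num_tasks))
```

Because a group is a whole node's placement, crossover and mutation can move nodes around without ever splitting a task, which is why the encoding suits the grouping genetic algorithm.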
2) Initial population generation. Two simple algorithms, First-Fit and Random-Fit, are used to generate a number of initial resource scheduling schemes. First-Fit schedules each task onto the first computing node where it can be placed, while Random-Fit randomly selects a node that meets the requirements. Both algorithms have low complexity and run fast enough.
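The two initialization algorithms might be sketched as follows; this toy uses a single scalar demand per task and a single capacity per node, whereas the real system would check the full resource demand vector.

```python
import random

def first_fit(tasks, node_capacity):
    """Place each task demand on the first node (group) with room;
    open a new node when none fits."""
    nodes = []                         # each node: list of demands placed
    for demand in tasks:
        for node in nodes:
            if sum(node) + demand <= node_capacity:
                node.append(demand)
                break
        else:
            nodes.append([demand])
    return nodes

def random_fit(tasks, node_capacity, rng=random):
    """Place each task demand on a random node that has room."""
    nodes = []
    for demand in tasks:
        feasible = [n for n in nodes if sum(n) + demand <= node_capacity]
        if feasible:
            rng.choice(feasible).append(demand)
        else:
            nodes.append([demand])
    return nodes
```

Seeding the population with both gives the genetic algorithm one compact scheme and several diverse ones to recombine.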
3) Fitness function and selection strategy. The purpose of the scheduling is to reduce communication overhead, so the fitness function is the negative of the scheduling objective (the smaller the network communication overhead, the better the fitness). To speed up convergence, an additional term NumNodes_using is introduced, the number of computing nodes the scheme requires, so that among schemes with similar network overhead the algorithm prefers those using fewer nodes:

Fitness = -(Cost_network + Σ Fitness_node + NumNodes_using)
The tournament method is chosen as the selection strategy. It runs several elimination rounds and keeps the best candidate, needs no full sort of the population, has low complexity, can be parallelized, and incurs little time overhead, making it well suited to online scheduling scenarios.
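A sketch of tournament selection (the function name and signature are illustrative): draw k candidates at random and keep the fittest, so no round ever needs a full sort of the population and rounds are independent of one another.

```python
import random

def tournament_select(population, fitness, k=3, rng=random):
    """One tournament round: sample k contenders uniformly without
    replacement and return the one with the best fitness."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)
```

With k equal to the population size the round degenerates to picking the global best; small k (2 or 3) keeps selection pressure gentle and cheap.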
4) Crossover and mutation rules. The crossover process is as follows: first, two schemes X and Y are selected from the current population with the selection strategy, and a crossover point (a computing node) and a crossover position are chosen in each; next, the selected computing node, together with the tasks on it, is inserted at the crossover position of the other scheme; then, since the new scheme may now contain duplicated computing nodes and tasks, these duplicates must be deleted, and because the basic unit of crossover and mutation is a computing node, the duplicated nodes and the nodes holding duplicated tasks are removed as a whole; finally, since the tasks on the deleted nodes are deleted with them, these tasks are re-placed onto the remaining computing nodes with the first-fit algorithm. Mutation is similar: a scheme Y is selected, a computing node is chosen at random, the node and its tasks are deleted, and the deleted tasks are re-placed onto the remaining nodes with the first-fit algorithm, yielding a new scheduling scheme.
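The delete-and-repair pattern shared by both operators can be sketched for the mutation case; crossover follows the same repair step after injecting a node from the other parent. All names are illustrative, and as above a single scalar demand stands in for the full resource vector.

```python
import random

def repair(chrom, removed_tasks, capacity, demand):
    """Re-insert tasks from deleted groups using first-fit."""
    for t in removed_tasks:
        for group in chrom:
            if sum(demand[x] for x in group) + demand[t] <= capacity:
                group.append(t)
                break
        else:
            chrom.append([t])        # no room anywhere: open a new node

def mutate(chrom, capacity, demand, rng=random):
    """Grouping-GA mutation: delete one random group (computing node)
    and re-place its tasks with first-fit, yielding a valid scheme."""
    chrom = [list(g) for g in chrom]             # work on a copy
    victim = chrom.pop(rng.randrange(len(chrom)))
    repair(chrom, victim, capacity, demand)
    return chrom
```

Because repair always re-places every evicted task, the operator can never produce a chromosome where a task is lost or duplicated, which is the invariant the block encoding needs.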
Fig. 3 illustrates the resource scheduling scheme generation flow of the invention. Because of resource fragmentation, the selected jobs may not all be schedulable at once; in that case the scheduler waits for a period and tries again to generate a valid resource scheduling scheme. When resources are scattered, the generated scheme may not be good enough, and the scheduler can wait to see whether more, and more suitable, resources are released; finally, once the generated scheme is good enough or the scheduling algorithm's running time exceeds its limit, the job is scheduled according to the current best resource scheduling scheme.
The job distribution stage corresponds to the technical scheme step (7). The specific implementation mode is as follows: after all task resource requirements of the job are met, the job manager pushes the task of the job to the corresponding computing node according to a resource scheduling scheme, and the job is waited for execution.
The job execution stage corresponds to step (8) of the technical scheme. Specific implementation: first, a corresponding runtime environment (a container) is created for the job, with its available resources limited according to the job's resource demands; after the container starts, the user's deep learning program contained in the job is downloaded to a designated location inside the container; the training dataset required for model training is then mounted to the corresponding local directory; next, the user's deep learning program (program for short) is launched via the start command and its running state is continuously monitored; finally, after the program finishes, its output files are transferred to reliable external storage (HDFS), the container is destroyed, and the system resources it occupied are released.
The invention provides a deep-learning-oriented GPU resource management and intelligent scheduling method. By modeling and analyzing jobs, the system can effectively predict a job's runtime resource demand in advance. Compared with common scheduling methods (scatter, pack, random), the execution time of a single job is reduced by 33.5% to 59.5%, and the average job completion time (Job Completion Time, JCT) of multiple jobs is further reduced by 10%. Compared with the existing Kubernetes system, the resource scheduling method reduces average job completion time by 48%. In terms of system scalability, scheduler throughput remains stable as cluster nodes are added, showing good scalability. The deep-learning-oriented GPU resource management and intelligent scheduling method of the invention thus has a significant performance optimization effect.

Claims (7)

1. A deep learning-oriented GPU resource management and intelligent scheduling method, comprising the following steps:
(1) A user submits a deep learning job through a front-end interface component, wherein the deep learning job comprises the deep learning program to be executed, the input data set for the program, and the task division information of the job;
(2) Parameter validity checking and permission verification are performed on the deep learning job, which is then added to a designated queue to await scheduling;
(3) When the deep learning job is selected for scheduling, an independent job manager is started for it to take charge of its subsequent operation;
(4) The job manager applies to the global resource manager for the computing resources required by each task, according to the task division of the deep learning job;
(5) The job's runtime characteristics are modeled and analyzed by an intelligent job resource demand prediction model, covering GPU compute power, GPU memory, CPU, main memory, and network bandwidth demands at runtime, and a job execution resource demand vector is generated;
(6) A resource scheduling scheme for the deep learning job is generated using the resource demand vector returned in step (5), combined with the job's distributed architecture and the cluster network topology;
(7) According to the resource scheduling scheme, the deep learning job is dispatched to the designated computing nodes through a push mechanism;
(8) The job executor starts a separate running container for each task of the deep learning job to execute the deep learning program.
2. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), a gang (group) scheduling mechanism is adopted: all resource requirements of the preceding job are satisfied before resources are allocated to the next job.
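A minimal sketch of the gang-scheduling rule in claim 2 (the function name and job representation are illustrative assumptions): a job receives resources only when all of its tasks can be placed at once, and jobs behind an unsatisfiable queue head must wait.

```python
def gang_schedule(queue, free_gpus):
    """Strict FIFO gang scheduling: a job starts only when ALL of its tasks
    fit simultaneously; otherwise the queue head blocks, which prevents
    partial allocations from deadlocking each other."""
    started = []
    for job_id, gpus_per_task, n_tasks in queue:
        demand = gpus_per_task * n_tasks
        if demand <= free_gpus:
            free_gpus -= demand      # all-or-nothing allocation
            started.append(job_id)
        else:
            break                    # head job waits; later jobs do not jump ahead
    return started, free_gpus

started, left = gang_schedule([("jobA", 1, 2), ("jobB", 2, 3), ("jobC", 1, 1)], free_gpus=7)
# jobA takes 2 GPUs; jobB needs 6 but only 5 remain, so jobB and jobC wait
```

The strict FIFO variant shown here trades some utilization for simplicity; backfilling smaller jobs behind a blocked head is a common refinement but is not part of this claim.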
3. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (3), an independent job manager is started for each deep learning job; the job manager is responsible for life-cycle management of the job, including resource application, pushing to computing nodes, running-state monitoring, and retrying failed tasks.
4. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (4), the task division of the deep learning job is determined before execution, so the scheduling system only needs to allocate resources to the pre-divided tasks and determine the resource scheduling scheme.
5. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), an intelligent task resource demand prediction model is built, whose input features include the task division, hyper-parameter settings, and data set scale, and whose output label is the task execution resource demand vector, covering CPU, memory, GPU memory, and network bandwidth; the corresponding regression problem is solved with a traditional machine learning algorithm.
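To illustrate the regression formulation in claim 5, the sketch below fits a single-feature ordinary-least-squares model in pure Python. The feature (batch size), the numbers, and the function name are assumptions for illustration; the patent's model takes richer inputs such as task division, hyper-parameter settings, and data set scale, and may use any traditional regressor.

```python
# Historical jobs: (batch size, observed GPU memory demand in GB).
# These numbers are made up for illustration.
history = [(16, 3.1), (32, 5.0), (64, 9.2), (128, 17.5)]

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b, standing in for the
    # "traditional machine learning algorithm" named in the claim.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

a, b = fit_linear([bs for bs, _ in history], [gb for _, gb in history])
predicted_gb = a * 96 + b  # predicted GPU memory for a new job with batch size 96
```

A real deployment would fit one such regressor per resource dimension (CPU, memory, GPU memory, bandwidth) to produce the full demand vector.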
6. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (5), the actual resource demand characteristics of similar jobs in historical runs are collected, and the resource demand characteristics of subsequent deep learning jobs are predicted with the intelligent job resource demand prediction model.
7. The deep learning-oriented GPU resource management and intelligent scheduling method of claim 1, wherein: in step (6), the execution order of jobs is first determined by fair scheduling among queues and first-come-first-served scheduling within a queue, and the job to be scheduled is selected; then the job's distributed topology and the cluster network topology are extracted, and a network communication cost model is built from the job's resource demand characteristics; finally, a heuristic genetic algorithm is used to solve for and generate the resource scheduling scheme.
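The genetic-algorithm placement step in claim 7 can be sketched compactly under simplifying assumptions: node capacity constraints are omitted, the communication cost model is just pairwise task traffic weighted by inter-node network distance, and all names, parameters, and data are illustrative, not the patented algorithm.

```python
import random

def comm_cost(placement, traffic, dist):
    # Total cross-task traffic weighted by the network distance between
    # the nodes the two tasks are placed on (0 when co-located here).
    return sum(v * dist[placement[i]][placement[j]] for (i, j), v in traffic.items())

def genetic_place(n_tasks, nodes, traffic, dist, pop=30, gens=60, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(nodes) for _ in range(n_tasks)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda p: comm_cost(p, traffic, dist))
        survivors = population[: pop // 2]   # selection: keep the cheaper half
        children = []
        while len(survivors) + len(children) < pop:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:           # mutation: move one task
                child[rng.randrange(n_tasks)] = rng.choice(nodes)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda p: comm_cost(p, traffic, dist))

# Four tasks on two nodes; tasks 0-1 and 2-3 exchange heavy traffic.
traffic = {(0, 1): 10, (2, 3): 10, (0, 2): 1}
dist = [[0, 1], [1, 0]]  # co-located = 0, cross-node = 1
best = genetic_place(4, [0, 1], traffic, dist)
```

The heuristic drives heavily-communicating task pairs onto the same node; a production version would add GPU/CPU capacity constraints to the fitness function so infeasible placements are penalized.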
CN202011310749.8A 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method Active CN112416585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310749.8A CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Publications (2)

Publication Number Publication Date
CN112416585A CN112416585A (en) 2021-02-26
CN112416585B true CN112416585B (en) 2024-03-15

Family

ID=74776959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310749.8A Active CN112416585B (en) 2020-11-20 2020-11-20 Deep learning-oriented GPU resource management and intelligent scheduling method

Country Status (1)

Country Link
CN (1) CN112416585B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094116B (en) * 2021-04-01 2022-10-11 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on load characteristic analysis
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
CN113608722A (en) * 2021-07-31 2021-11-05 云南电网有限责任公司信息中心 Algorithm packaging method based on distributed technology
CN113791906A (en) * 2021-08-09 2021-12-14 戴西(上海)软件有限公司 Scheduling system and optimization algorithm based on GPU resources in artificial intelligence and engineering fields
CN115202850B (en) * 2022-09-09 2022-12-20 国家超级计算天津中心 Job scheduling method, device, electronic equipment and storage medium
CN117827415A (en) * 2022-09-27 2024-04-05 中兴通讯股份有限公司 GPU resource scheduling method, server and storage medium

Citations (10)

Publication number Priority date Publication date Assignee Title
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling
CN108881446A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A kind of artificial intelligence plateform system based on deep learning
CN109034396A (en) * 2018-07-11 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for handling the deep learning operation in distributed type assemblies
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 A kind of operation method and device of deep learning operation
CN109189401A (en) * 2018-07-06 2019-01-11 曙光信息产业(北京)有限公司 A kind of dispositions method and system of deep learning frame
CN110399222A (en) * 2019-07-25 2019-11-01 北京邮电大学 GPU cluster deep learning task parallel method, device and electronic equipment
CN110442451A (en) * 2019-07-12 2019-11-12 中电海康集团有限公司 A kind of polymorphic type GPU cluster resource management dispatching method and system towards deep learning
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
KR102140730B1 (en) * 2019-12-17 2020-08-04 (주) 씨이랩 Method and system for providing develop environment of deep learning based gpu
CN111694656A (en) * 2020-04-22 2020-09-22 北京大学 Cluster resource scheduling method and system based on multi-agent deep reinforcement learning

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN107885762B (en) * 2017-09-19 2021-06-11 北京百度网讯科技有限公司 Intelligent big data system, method and equipment for providing intelligent big data service
US10884795B2 (en) * 2018-04-26 2021-01-05 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster


Non-Patent Citations (3)

Title
Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster; Víctor Campos et al.; Procedia Computer Science; full text *
Research on deep learning cloud service adaptation; Lin Jian et al.; Software Guide (软件导刊); full text *
A self-learning load-balancing scheduling algorithm for heterogeneous GPU clusters; Liu Hui et al.; Journal of Xi'an Shiyou University (Natural Science Edition); Vol. 30, No. 3; full text *


Similar Documents

Publication Publication Date Title
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN110737529B (en) Short-time multi-variable-size data job cluster scheduling adaptive configuration method
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
US20200257968A1 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
Polo et al. Performance-driven task co-scheduling for mapreduce environments
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
CN113377540A (en) Cluster resource scheduling method and device, electronic equipment and storage medium
CN104765640B (en) A kind of intelligent Service dispatching method
CN111381950A (en) Task scheduling method and system based on multiple copies for edge computing environment
CN110262897B (en) Hadoop calculation task initial allocation method based on load prediction
WO2021180092A1 (en) Task dispatching method and apparatus
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN110221909A (en) A kind of Hadoop calculating task supposition execution method based on load estimation
CN109710372A (en) A kind of computation-intensive cloud workflow schedule method based on cat owl searching algorithm
CN109740870A (en) The resource dynamic dispatching method that Web is applied under cloud computing environment
CN114911613A (en) Cross-cluster resource high-availability scheduling method and system in inter-cloud computing environment
CN111061565A (en) Two-stage pipeline task scheduling method and system in Spark environment
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
Ibrahim et al. Improving mapreduce performance with progress and feedback based speculative execution
CN115098240B (en) Multiprocessor application scheduling method and system and storage medium
Chen et al. Deadline-constrained MapReduce scheduling based on graph modelling
Singh et al. Market-inspired dynamic resource allocation in many-core high performance computing systems
CN116010051A (en) Federal learning multitasking scheduling method and device
Hu et al. An optimal resource allocator of elastic training for deep learning jobs on cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant