CN114820278A - Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment - Google Patents


Info

Publication number
CN114820278A
CN114820278A (application CN202210463699.XA)
Authority
CN
China
Prior art keywords
gpu
task
deep learning
training
scheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210463699.XA
Other languages
Chinese (zh)
Inventor
周方
何水兵
秦亦
朱春节
方启明
曾令仿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210463699.XA
Publication of CN114820278A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of deep learning within artificial intelligence and discloses a heterogeneous GPU (graphics processing unit) allocation system and method for multiple deep learning tasks in a distributed environment. The system comprises a GPU Profile module, a task information acquisition module, a GPU selection module and a deep learning training module. The heterogeneous GPU allocation method assigns GPUs of different computing capabilities to tasks with matching requirements: tasks with complex model hierarchies and large batch sizes are matched to the best-performing GPUs and run on nodes with sufficient video memory, so that tasks requiring longer deep learning training are accelerated and the overall multi-task execution efficiency in the heterogeneous environment is significantly improved. When multiple deep learning tasks execute concurrently, all of them can be completed quickly, saving the time programmers or users spend waiting for results.

Description

Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment
Technical Field
The invention belongs to the field of deep learning under artificial intelligence, and particularly relates to a heterogeneous GPU (graphics processing unit) distribution system and method for multi-deep learning tasks under a distributed environment.
Background
Today, deep neural networks obtain highly accurate models by training on large-scale data, which has driven their continued application in fields such as image classification, speech recognition, and autonomous driving. These trends have led to increasingly complex deep neural network models and to the emergence of devices that accelerate deep neural network training, such as GPUs, FPGAs, and TPUs. How to use heterogeneous acceleration devices more efficiently in a distributed environment has become an important research topic.
Concurrent deep learning training of multiple tasks in a distributed heterogeneous GPU environment has gradually become common. In one typical scenario, multiple deep learning tasks in a distributed environment are trained cooperatively, and each task must meet its own training target.
In current deep learning training frameworks, the traditional GPU allocation method is to statically specify GPU parameters when multiple tasks are started in a distributed environment, using the GPU parameters provided by the framework to schedule tasks with different requirements onto corresponding GPUs for training. The frameworks also provide a method that allocates all available GPUs, so that the batch data of every task is distributed across all GPUs for training; as a result, a GPU with strong computing power finishes training its share of a small batch quickly and then sits idle for long periods, leaving the strongest GPUs underutilized.
Because the traditional GPU allocation scheme for multiple deep learning tasks in a distributed environment considers neither the characteristics and requirements of the tasks nor the full performance of the heterogeneous GPUs, when different deep learning training tasks run concurrently, the overall execution efficiency of the multiple tasks is low.
Given the defects of the conventional scheme above, an effective GPU allocation method is needed for the scenario of multiple deep learning tasks in a distributed environment, and an effective solution to this problem is urgently required.
Disclosure of Invention
The present invention is directed to a system and a method for allocating heterogeneous GPUs for multiple deep learning tasks in a distributed environment, so as to solve the above technical problems.
In order to solve the technical problems, the specific technical scheme of the heterogeneous GPU allocation system and method for the multi-deep learning task in the distributed environment is as follows:
a heterogeneous GPU distribution system for multi-deep learning tasks in a distributed environment comprises a GPU Profile module, a task information acquisition module, a GPU selection module and a deep learning training module;
the GPU Profile module: used to detect whether each machine in the heterogeneous environment contains a GPU, the relative performance of each GPU, and the video memory size of each GPU;
the task information acquisition module: the system is responsible for collecting a training model of each task, the size of a data batch and the training time of batch data of each task;
the GPU selection module: the system is responsible for target GPU selection and task batch data distribution from a memory cache space to a GPU;
the deep learning training module: responsible for applying the GPU decision information issued by the GPU selection module and for obtaining task model information and data batch size information, so as to execute the corresponding network-level deep learning training computation on the GPU.
The invention also discloses a heterogeneous GPU allocation method for the multi-deep learning task in the distributed environment, which comprises the following steps:
s1, initializing a multi-deep learning training task;
s2, cold starting of the multi-deep learning training task;
s3, dynamically adjusting the GPU scheme of the multi-deep learning training task;
and S4, iterating the multi-deep learning training task loop.
Further, when each deep learning training task is initialized, the GPU Profile module collects the characteristic information of each GPU in the heterogeneous environment and records the parameter information each task carries at startup; this information serves as reference factors for the GPU selection module.
Further, in step S2, when the first Epoch of the multitask deep learning training is started, the GPU selection module gives a GPU allocation scheme of the multitask deep learning training cold start.
Further, the S2 includes the following specific steps:
s21, sorting all tasks according to the level of the deep learning training network of the tasks by taking the model types of the tasks as a first priority sorting factor;
s22, sorting all tasks according to the data batch size of the task deep learning training by taking the batch data size of the tasks as a second priority sorting factor;
s23, sorting all GPUs according to the GPU computing power as a first priority sorting factor;
s24, sorting all GPUs according to the video memory size of the GPU by taking the video memory size of the GPU as a second priority sorting factor;
and S25, according to the orderings produced in steps S22 and S24, assign the first through last GPUs in sequence to the first through last tasks, and record this one-to-one mapping in the global GPU allocation table as the cold-start allocation scheme.
Further, in step S3, according to the cold-start scheme, the first Epoch of training of the multiple deep learning tasks is started; the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU, records it in the global running time list, and passes it to the allocation algorithm of the GPU selection module, which dynamically adjusts and optimizes the current GPU allocation until the optimal scheme is reached.
Further, the S3 includes the following specific steps:
S31, each task performs one round of mini-batch data training; by recording the training running time of the current batch of each task in the log and taking the maximum over these per-task batch times, the longest running time of the current GPU allocation scheme is obtained and recorded as T_cur;
S32, the multiple tasks begin deep learning training under the current GPU allocation scheme. When the first batch of data of every task has finished training, the batch training running time of each task, i.e. the information in the global running time list, is computed. The GPUs of the tasks with the longest and shortest running times recorded in the global running time list are then reassigned: the GPU with stronger computing power executes the task with the longest current running time, and the GPU with weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme. The running time of the next batch of deep learning training of each task is then obtained, and the longest running time of the new scheme is recorded as T_next. If the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is written into the global GPU allocation table;
S33, when the batch data of each of the current tasks has finished running, the running times are recorded in the global running time list; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table. Before each task starts deep learning training on its next batch of data, the allocation scheme lets the tasks, with automatically set parameters, carry out deep learning training on their corresponding GPUs;
S34, during the first Epoch of each task, the allocation algorithm in the GPU selection module repeatedly takes each task's batch data for training iterations and adjusts the scheme until execution efficiency is optimal. Once the first Epoch of each task's deep learning training is finished, the scheme with the best multi-task execution efficiency is selected as the final scheme for the subsequent rounds of deep learning training of each task.
Further, in S4, when the first Epoch of training of the multiple deep learning tasks is completed, the deep learning training of each subsequent Epoch distributes the batch data of each task to the corresponding GPU according to the optimal GPU allocation scheme generated during the first Epoch; when all tasks have completed their training rounds, the deep learning training ends.
The heterogeneous GPU allocation system and method for multiple deep learning tasks in a distributed environment have the following advantages: GPUs of different computing capabilities are assigned to tasks with matching requirements; tasks with complex model hierarchies and large batch sizes are matched to the best-performing GPUs and run on nodes with sufficient video memory; and tasks requiring longer deep learning training are accelerated, so the multi-task execution efficiency in the heterogeneous environment is significantly improved. When multiple deep learning tasks execute concurrently, all of them can be completed quickly, saving the time programmers or users spend waiting for results.
Drawings
FIG. 1 is a block diagram of a logical architecture of a system employing the method of the present invention.
FIG. 2 is a data flow diagram of the multitask deep learning in the present invention.
FIG. 3 is a flow chart of the GPU assignment algorithm of the present invention.
Fig. 4 is an exemplary diagram of GPU allocation at cold start in the present invention.
Fig. 5 is a schematic diagram of a GPU allocation scheme in the present invention.
Fig. 6 is a performance verification diagram of an example GPU selection scheme.
Detailed Description
For better understanding of the purpose, structure and function of the present invention, the following describes a system and method for allocating heterogeneous GPUs for multi-deep learning task in distributed environment in further detail with reference to the accompanying drawings.
As shown in fig. 1, the heterogeneous GPU allocation system for multi-deep learning tasks in a distributed environment of the present invention includes a GPU Profile module, a task information acquisition module, a GPU selection module, and a deep learning training module;
GPU Profile module: used to detect whether each machine in the heterogeneous environment contains a GPU, the relative performance of each GPU, and the video memory size of each GPU;
the task information acquisition module: the system is responsible for collecting a training model of each task, the size of a data batch and the training time of batch data of each task;
a GPU selection module: the system is responsible for target GPU selection and task batch data distribution from a memory cache space to a GPU;
the deep learning training module: responsible for applying the GPU decision information issued by the GPU selection module and for obtaining task model information and data batch size information, so as to execute the corresponding network-level deep learning training computation on the GPU.
The invention discloses a heterogeneous GPU allocation method for multi-deep learning tasks in a distributed environment, which comprises the following steps:
S1, initializing the multi-deep learning training tasks; when each deep learning training task is initialized, the GPU Profile module collects the characteristic information of each GPU in the heterogeneous environment and records the parameter information each task carries at startup; this information serves as reference factors for the GPU selection module;
s2, cold starting of the multi-deep learning training task; when a first Epoch of the multitask deep learning training is started, a GPU selection module gives a GPU allocation scheme of the multitask deep learning training cold start;
s21, sorting all tasks according to the level of the deep learning training network of the tasks by taking the model types of the tasks as a first priority sorting factor;
s22, sorting all tasks according to the data batch size of the task deep learning training by taking the batch data size of the tasks as a second priority sorting factor;
s23, sorting all GPUs according to the GPU computing power as a first priority sorting factor;
s24, sorting all GPUs according to the video memory size of the GPU by taking the video memory size of the GPU as a second priority sorting factor;
and S25, according to the orderings produced in steps S22 and S24, assign the first through last GPUs in sequence to the first through last tasks, and record this one-to-one mapping in the global GPU allocation table as the cold-start allocation scheme.
S3, dynamically adjusting the GPU scheme of the multi-deep learning training task; according to the cold-start scheme, the first Epoch of training of the multiple deep learning tasks is started; the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU, records it in the global running time list, and passes it to the allocation algorithm of the GPU selection module, which dynamically adjusts and optimizes the current GPU allocation until the optimal scheme is reached, comprising the following steps:
S31, each task performs one round of mini-batch data training; by recording the training running time of the current batch of each task in the log and taking the maximum over these per-task batch times, the longest running time of the current GPU allocation scheme is obtained and recorded as T_cur;
S32, the multiple tasks begin deep learning training under the current GPU allocation scheme. When the first batch of data of every task has finished training, the batch training running time of each task, i.e. the information in the global running time list, is computed. The GPUs of the tasks with the longest and shortest running times recorded in the global running time list are then reassigned: the GPU with stronger computing power executes the task with the longest current running time, and the GPU with weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme. The running time of the next batch of deep learning training of each task is then obtained, and the longest running time of the new scheme is recorded as T_next. If the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is written into the global GPU allocation table;
and S33, when the batch data of each of the current tasks has finished running, the running times are recorded in the global running time list; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table. Before each task starts deep learning training on its next batch of data, the allocation scheme lets the tasks, with automatically set parameters, carry out deep learning training on their corresponding GPUs;
and S34, during the first Epoch of each task, the allocation algorithm in the GPU selection module repeatedly takes each task's batch data for training iterations and adjusts the scheme until execution efficiency is optimal. Once the first Epoch of each task's deep learning training is finished, the algorithm selects the scheme with the best multi-task execution efficiency as the final scheme for the subsequent rounds of deep learning training of each task;
S4, performing loop iteration of the multi-deep learning training tasks; when the first Epoch of training of the multiple deep learning tasks is completed, the deep learning training of each subsequent Epoch distributes the batch data of each task to the corresponding GPU according to the optimal GPU allocation scheme generated during the first Epoch; when all tasks have completed their training rounds, the deep learning training ends.
Example (b):
the invention designs a heterogeneous GPU allocation method for multi-deep learning tasks in a distributed environment according to factors such as current heterogeneous GPU configuration, whether the video memory capacity of a GPU can be used for loading batch data of a plurality of tasks at the same time, the execution efficiency of deep learning training of each task and the like, and the method comprises the following steps:
s1, initializing a multi-deep learning training task; as shown in fig. 2, the data flow of the multi-task deep learning training in the heterogeneous environment is shown, the heterogeneous GPUs shown in the present invention are different, the bottom layer stored raw data is transmitted to the DRAM cache in the heterogeneous environment, the time required for each task is approximately the same in the process, and the present invention needs to optimize how the batch data in the memory cache is better distributed to the corresponding GPU. When multi-task deep learning is initialized, a GPU Profile module acquires heterogeneous GPU characteristic information, and the characteristic information is recorded as GpuInfMap < gpuId, [ computrCapability, memSze ] >, wherein the gpuId represents the unique number of a GPU in an environment, the computrCapability represents the value of GPU computing capability, the memSze represents the value of GPU display memory, information in the GpuInfMap serves as a reference factor of a GPU selection module, the task starts with self-contained parameter information, the characteristic information is recorded as JobInfMap < jobId, [ modelType, BatchSize ] >, wherein jobId represents the unique number of a deep learning training task, modelType represents the model type of deep learning training, and BatsSize represents the size of deep learning-trained batch data and serves as the reference factor of the GPU selection module; the data structure of these static information records is shown in fig. 4, and this data structure is recorded in the global memory as the multitasking operation.
S11, the computing capability of each GPU in the heterogeneous environment is read from the cluster's configuration information file, and the compute capability table published by NVIDIA is used to convert it into an integer value representing the GPU's computing performance;
S12, the running parameters used by the multiple tasks for deep learning training can be read in advance from the multi-task start command script and recorded in the JobInfoMap data structure in global memory.
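The static records described above can be illustrated with a small sketch. This is not the patent's actual code; the field names follow the text (GpuInfoMap<gpuId, [computeCapability, memSize]>, JobInfoMap<jobId, [modelType, BatchSize]>), while the concrete GPU and task values are invented for the example.

```python
# Illustrative sketch of the two static data structures kept in global memory.
# Values are hypothetical: compute capability as an integer rank (step S11),
# video memory in GB.

# GpuInfoMap: one entry per GPU in the heterogeneous environment.
gpu_info_map = {
    0: {"computeCapability": 86, "memSize": 24},  # e.g. a stronger GPU, 24 GB
    1: {"computeCapability": 61, "memSize": 8},   # e.g. a weaker GPU, 8 GB
}

# JobInfoMap: one entry per deep learning training task, read from the
# multi-task start command script (step S12).
job_info_map = {
    "job-a": {"modelType": "resnet50", "BatchSize": 256},
    "job-b": {"modelType": "lenet5", "BatchSize": 32},
}
```

Both maps are read-only reference factors for the GPU selection module; only the global GPU allocation table changes during training.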
S2, cold starting of the multi-deep learning training task; when the first Epoch of the multitask deep learning training is started, the GPU selection module gives a GPU allocation scheme of the multitask deep learning training cold start, and an example of the allocation scheme is shown in FIG. 4.
The step S2 may use a multi-factor sorting rule with different priorities to obtain a GPU allocation scheme during cold start, and is implemented according to the following steps:
s21, according to the model type of the task as the first sequencing factor, making a priority sequencing according to the deep learning training network level of the task, and sequencing all the tasks recorded in the JobInfoMap;
s22, according to the size of the batch data of the tasks as a second sequencing factor, performing priority sequencing according to the size of the data batch of the deep learning training of the tasks, and then sequencing all the tasks recorded in the JobInfoMap;
s23, according to the calculation capability of the GPU as a first sorting factor, performing priority sorting according to the strength of the calculation capability of the GPU, and sorting all GPUs recorded in the GpuInfoMap;
s24, according to the size of the video memory of the GPU as a second sorting factor, performing priority sorting according to the size of the video memory of the GPU, and then sorting all GPUs recorded in the GpuInfoMap;
and S25, according to the orderings produced in steps S22 and S24, assign the first through last GPUs in sequence to the tasks in the JobInfoMap; this one-to-one mapping is recorded in a global GPU allocation table, denoted GpuAllocList, which serves as the cold-start allocation scheme.
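The multi-factor priority sort of steps S21 through S25 can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: it assumes dictionary forms of GpuInfoMap and JobInfoMap, and a made-up `model_level` helper table standing in for the network-level ordering of each model type.

```python
# Cold-start allocation sketch (steps S21-S25): sort tasks by
# (model level, batch size) descending, sort GPUs by
# (compute capability, video memory) descending, then pair them one-to-one.

def cold_start_allocation(job_info_map, gpu_info_map, model_level):
    # model_level: assumed helper table mapping modelType -> network depth rank
    jobs = sorted(
        job_info_map,
        key=lambda j: (model_level[job_info_map[j]["modelType"]],
                       job_info_map[j]["BatchSize"]),
        reverse=True,  # most demanding task first
    )
    gpus = sorted(
        gpu_info_map,
        key=lambda g: (gpu_info_map[g]["computeCapability"],
                       gpu_info_map[g]["memSize"]),
        reverse=True,  # strongest GPU first
    )
    # One-to-one mapping recorded as the global GPU allocation table.
    return dict(zip(jobs, gpus))

gpu_info = {0: {"computeCapability": 86, "memSize": 24},
            1: {"computeCapability": 61, "memSize": 8}}
jobs = {"job-a": {"modelType": "resnet50", "BatchSize": 256},
        "job-b": {"modelType": "lenet5", "BatchSize": 32}}
levels = {"resnet50": 50, "lenet5": 5}  # hypothetical depth ranks

alloc = cold_start_allocation(jobs, gpu_info, levels)
# the deeper, larger-batch task is paired with the stronger GPU
```

The tuple sort keys realize the two-level priority directly: model level (or compute capability) dominates, and batch size (or video memory) breaks ties.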
S3, dynamically adjusting the GPU scheme of the multi-deep learning training task. During the first Epoch of the multi-task deep learning training, the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU and records it in the global running time list as T[jobId], which is passed to the allocation algorithm of the GPU selection module to dynamically adjust and optimize the current GPU allocation until the optimal scheme is reached. The algorithm's implementation steps are shown in fig. 3 and further explained as follows:
S31, acquiring the multi-task running time under the current GPU allocation scheme, recorded in T[jobId]; taking the maximum over the batch-data times of all tasks gives the longest running time of the current scheme, recorded as T_cur, as shown in the following equations:
T[jobId] = T(jobId_batch)
T_cur = Max(T[jobId])
S32, the multiple tasks begin deep learning training under the current GPU allocation scheme. When the first batch of data of every task has finished training, the batch training running time of each task, i.e. the information in T[jobId], is computed. The GPUs of the tasks with the longest and shortest running times recorded in T[jobId] are then reassigned: the GPU with stronger computing power executes the task with the longest current running time, and the GPU with weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme. As shown in fig. 5, GPUs that best match the requirements are reselected for the two tasks of the first scheme, and the resulting second scheme outperforms the first. The longest running time obtained by the batch deep learning training of the multiple tasks under the new scheme is recorded as T_next; if the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is recorded in the GpuAllocList, as shown in the following equations:
T_best = min(T_cur, T_next)
if T_best = T_next, then GpuAllocList = [..., (jobId, newGpuId), ...]
and S33, when the batch data of each of the current tasks has finished running, the running time is recorded in T[jobId]; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table, i.e. the GpuAllocList is updated. Before each task starts deep learning training on its next batch of data, the allocation scheme lets the tasks, with automatically set parameters, carry out deep learning training on the corresponding GPUs. If the multiple tasks are currently in the middle of training a batch of data, they must finish that batch under the previous GPU allocation; the new GPU allocation scheme is not adopted until the next batch of data of each task is ready;
and S34, during the first Epoch of each task, the allocation algorithm in the GPU selection module repeatedly takes each task's batch data for training iterations and adjusts the scheme to optimize multi-task execution efficiency. Finally, the scheme with the best multi-task execution efficiency is selected as the final scheme for the subsequent rounds of deep learning training of each task. In the example shown in fig. 6, the optimal scheme produced by the method and system of the invention is scheme two; executing the entire multi-task deep learning training process under scheme two improves multi-task execution efficiency compared with the other schemes;
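The swap-and-keep-if-better adjustment of steps S31 through S32 can be sketched as follows. This is a simplified illustration, not the patent's implementation: per-batch times are supplied by a callable rather than measured on real GPUs, and, as the text describes, only the GPUs of the longest- and shortest-running tasks are exchanged per iteration.

```python
def adjust_allocation(alloc, batch_times):
    """One adjustment step (S31-S32): swap the GPUs of the slowest and
    fastest tasks; keep the new scheme only if its longest batch time
    (T_next) beats the current one (T_cur).

    alloc: {jobId: gpuId}; batch_times: callable mapping an allocation
    to {jobId: seconds} for one mini-batch (assumed measured externally).
    """
    t = batch_times(alloc)
    t_cur = max(t.values())        # T_cur = Max(T[jobId])
    slowest = max(t, key=t.get)    # task with the longest running time
    fastest = min(t, key=t.get)    # task with the shortest running time
    new_alloc = dict(alloc)
    # give the slowest task the fastest task's GPU and vice versa
    new_alloc[slowest], new_alloc[fastest] = alloc[fastest], alloc[slowest]
    t_next = max(batch_times(new_alloc).values())
    # T_best = min(T_cur, T_next): adopt the new scheme only if better
    return (new_alloc, t_next) if t_next < t_cur else (alloc, t_cur)

# Toy cost model for the sketch: GPU 0 is twice as fast as GPU 1,
# and job-a has five times the per-batch work of job-b.
def times(alloc):
    speed = {0: 2.0, 1: 1.0}
    work = {"job-a": 10.0, "job-b": 2.0}
    return {j: work[j] / speed[g] for j, g in alloc.items()}

# Start from a deliberately bad scheme: the heavy task on the slow GPU.
alloc, t_best = adjust_allocation({"job-a": 1, "job-b": 0}, times)
```

In the first Epoch this step would be repeated after every batch (S34) until the longest per-batch time stops improving.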
S4, performing loop iteration of the multi-deep learning training tasks; when the first Epoch of training of the multiple deep learning tasks is completed, the deep learning training of each subsequent Epoch distributes the batch data of each task to the corresponding GPU according to the optimal GPU allocation scheme generated during the first Epoch; when all tasks have completed their training rounds, the deep learning training ends.
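Putting S1 through S4 together, the overall control flow is: tune the allocation after every batch during the first Epoch, then freeze the best scheme for all remaining Epochs. A minimal sketch under stated assumptions — `adjust` and `train_batch` are hypothetical callables standing in for the GPU selection module and the deep learning training module:

```python
def run_training(jobs, num_epochs, batches_per_epoch, alloc, adjust, train_batch):
    """Sketch of the S1-S4 loop: re-adjust the allocation after each batch
    of the first Epoch (S3); later Epochs reuse the frozen scheme (S4)."""
    for epoch in range(num_epochs):
        for batch in range(batches_per_epoch):
            for job in jobs:
                # dispatch this task's batch to its currently assigned GPU
                train_batch(job, alloc[job], epoch, batch)
            if epoch == 0:
                # dynamic adjustment happens only during the first Epoch
                alloc = adjust(alloc)
    return alloc

# Tiny usage example with a no-op adjuster and a recording trainer.
calls = []
final = run_training(["job-a"], 2, 3, {"job-a": 0},
                     lambda a: a,
                     lambda j, g, e, b: calls.append((j, g, e, b)))
```

The returned allocation is the scheme that all Epochs after the first would reuse unchanged.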
It is to be understood that the present invention has been described with reference to certain embodiments and that various changes in form and details may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (8)

1. A heterogeneous GPU distribution system for multi-deep learning tasks in a distributed environment is characterized by comprising a GPU Profile module, a task information acquisition module, a GPU selection module and a deep learning training module;
the GPU Profile module: used for detecting whether each machine in the heterogeneous environment contains a GPU, as well as the performance level and the video memory size of each GPU;
the task information acquisition module: used for acquiring the training model of each task, the size of its data batches, and the training time of each batch of data of each task;
the GPU selection module: responsible for selecting the target GPU and for distributing task batch data from the memory cache space to the GPU;
the deep learning training module: responsible for applying the GPU decision information issued by the GPU selection module to this module and for acquiring task model information and data batch size information, so as to execute the corresponding network-level deep learning training computation on the GPU.
2. A method for performing heterogeneous GPU allocation using the heterogeneous GPU allocation system for multi-deep learning task in distributed environment according to claim 1, comprising the steps of:
S1, initializing a multi-deep learning training task;
S2, cold starting of the multi-deep learning training task;
S3, dynamically adjusting the GPU scheme of the multi-deep learning training task;
and S4, performing loop iteration on the multi-deep learning training tasks.
3. The method according to claim 2, wherein in step S1, when each deep learning training task is initialized, the GPU Profile module collects the feature information of each GPU in the heterogeneous environment and records the parameter information of each task itself at startup, where the feature information serves as a reference factor for the GPU selection module.
4. The method according to claim 2, wherein in S2, when the first Epoch of the multitask deep learning training is started, the GPU selection module provides a GPU allocation scheme of the multitask deep learning training cold start.
5. The method according to claim 4, wherein the step S2 comprises the following steps:
S21, sorting all tasks according to the number of layers of the deep learning training network of each task, taking the model type of the task as a first priority sorting factor;
S22, sorting all tasks according to the data batch size of each task's deep learning training, taking the batch data size of the task as a second priority sorting factor;
S23, sorting all GPUs according to GPU computing power, taking the computing power as a first priority sorting factor;
S24, sorting all GPUs according to the video memory size of each GPU, taking the video memory size as a second priority sorting factor;
and S25, according to the order established in steps S22 and S24, allocating the first GPU through the last GPU in sequence to the first task through the last task, and recording the one-to-one mapping in the global GPU allocation table as the cold-start allocation scheme.
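A minimal sketch of the cold-start pairing in steps S21-S25 follows; the task and GPU attribute names (`layers`, `batch_size`, `compute`, `memory`) are illustrative assumptions, not identifiers from the patent.

```python
# Hypothetical sketch of the cold-start allocation (S21-S25).
# Attribute names are assumptions chosen for illustration.

def cold_start_allocation(tasks, gpus):
    """Pair the heaviest tasks with the strongest GPUs, one-to-one."""
    # S21/S22: sort tasks by network depth first, batch size second
    ordered_tasks = sorted(
        tasks, key=lambda t: (t["layers"], t["batch_size"]), reverse=True)
    # S23/S24: sort GPUs by computing power first, video memory second
    ordered_gpus = sorted(
        gpus, key=lambda g: (g["compute"], g["memory"]), reverse=True)
    # S25: record the one-to-one mapping as the global GPU allocation table
    return {t["name"]: g["id"] for t, g in zip(ordered_tasks, ordered_gpus)}
```

With two tasks and two GPUs, the deeper-network task is paired with the higher-compute GPU, matching the priority ordering described in the claim.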
6. The method as claimed in claim 2, wherein in S3, the first Epoch of the multi-deep learning training is started according to the cold-start scheme; the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU, records it in the global runtime list, and transmits it to the allocation algorithm of the GPU selection module, which dynamically adjusts and optimizes the current GPU allocation until the optimal scheme is reached.
7. The method according to claim 6, wherein the step S3 comprises the following steps:
S31, performing one mini-batch data training for each task; by recording the training running time of the current batch data of each task in the log and taking the maximum value over the multi-task batch times, the longest running time of the current GPU allocation scheme is obtained and recorded as T_cur;
S32, the multiple tasks start deep learning training under the current GPU allocation scheme; when the first batch data training of each task is finished, the batch data training running time of each task, namely the information in the global running time list, is calculated and recorded; the GPUs of the tasks with the longest and the shortest running times are then reallocated, so that the GPU with stronger computing power executes the task with the longest current running time and the GPU with relatively weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme; the running time of the next batch of deep learning training of each task is then obtained, giving the longest running time of the new GPU allocation scheme, recorded as T_next; if the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is written into the global GPU allocation table;
S33, the batch data of each current task is run and the running time is recorded in the global running time list; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table, and before each task starts deep learning training on the next batch of data, the allocation scheme directs each task, with its automatically set parameters, to perform deep learning training on the corresponding GPU;
S34, in the first Epoch stage of each task, the allocation algorithm in the GPU selection module continuously takes batch data from each task for training iterations and adjusts the scheme until the execution efficiency is optimal; once the first Epoch of each task's deep learning training has been trained, the scheme with the optimal multi-task execution efficiency is selected as the final scheme for the subsequent rounds of deep learning training of each task.
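A single adjustment step of S31-S32 can be sketched as follows; the dictionary shapes and the helper names are assumptions for illustration, not the patent's implementation.

```python
# Hypothetical sketch of one adjustment step (S31-S32): swap the GPUs of
# the longest- and shortest-running tasks, then keep whichever scheme has
# the smaller longest batch time. All names are assumptions.

def swap_extremes(alloc, runtimes):
    """Return a candidate scheme with the GPUs of the slowest and fastest
    tasks exchanged, so the stronger GPU serves the slowest task."""
    slowest = max(runtimes, key=runtimes.get)
    fastest = min(runtimes, key=runtimes.get)
    candidate = dict(alloc)
    candidate[slowest], candidate[fastest] = alloc[fastest], alloc[slowest]
    return candidate

def better_scheme(current, t_cur, candidate, t_next):
    """Compare T_cur and T_next; the smaller becomes the recorded T_best."""
    return (candidate, t_next) if t_next < t_cur else (current, t_cur)
```

Repeating this swap-and-compare over the batches of the first Epoch is what drives the scheme toward the optimum selected in S34.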
8. The method according to claim 2, wherein in S4, when the first Epoch training of the multi-deep learning tasks is completed, the subsequent Epochs of each task are trained, and the batch data of each task is distributed to the corresponding GPU for deep learning training according to the optimal GPU allocation scheme generated during the first Epoch; once all tasks have completed their training rounds, the deep learning training task ends.
CN202210463699.XA 2022-04-29 2022-04-29 Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment Pending CN114820278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210463699.XA CN114820278A (en) 2022-04-29 2022-04-29 Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment

Publications (1)

Publication Number Publication Date
CN114820278A true CN114820278A (en) 2022-07-29

Family

ID=82510102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210463699.XA Pending CN114820278A (en) 2022-04-29 2022-04-29 Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment

Country Status (1)

Country Link
CN (1) CN114820278A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860114A (en) * 2022-11-07 2023-03-28 北京百度网讯科技有限公司 Deep learning model training method and device, electronic equipment and storage medium
CN115860114B (en) * 2022-11-07 2023-09-08 北京百度网讯科技有限公司 Training method and device for deep learning model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
CN110321222B (en) Decision tree prediction-based data parallel operation resource allocation method
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN110069341B (en) Method for scheduling tasks with dependency relationship configured according to needs by combining functions in edge computing
WO2019193075A1 (en) Coordinated heterogeneous processing of training data for deep neural networks
CN111190712A (en) Task scheduling method, device, equipment and medium
CN113127203B (en) Deep learning distributed compiler for cloud edge computing and construction method
CN109684088B (en) Remote sensing big data rapid processing task scheduling method based on cloud platform resource constraint
CN114820278A (en) Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment
CN111176637B (en) Schedulability analysis method of AADL model based on cache preemption delay constraint
CN114237869A (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN115543626A (en) Power defect image simulation method adopting heterogeneous computing resource load balancing scheduling
CN113741999B (en) Dependency-oriented task unloading method and device based on mobile edge calculation
Feljan et al. Task allocation optimization for multicore embedded systems
CN111061565A (en) Two-stage pipeline task scheduling method and system in Spark environment
CN111767121A (en) Operation method, device and related product
CN116938323B (en) Satellite transponder resource allocation method based on reinforcement learning
CN112463340A (en) Tensorflow-based multi-task flexible scheduling method and system
CN106354633A (en) Task schedule generation method based on algorithmic plug-in
CN112650449A (en) Release method and release system of cache space, electronic device and storage medium
CN114490094B (en) GPU (graphics processing Unit) video memory allocation method and system based on machine learning
CN115827225A (en) Distribution method of heterogeneous operation, model training method, device, chip, equipment and medium
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN116010051A (en) Federal learning multitasking scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination