CN114820278A - Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment - Google Patents


Info

Publication number
CN114820278A
CN114820278A (application CN202210463699.XA)
Authority
CN
China
Prior art keywords
gpu
task
deep learning
training
scheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210463699.XA
Other languages
Chinese (zh)
Inventor
周方
何水兵
秦亦
朱春节
方启明
曾令仿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210463699.XA
Publication of CN114820278A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of deep learning within artificial intelligence and discloses a heterogeneous GPU (graphics processing unit) allocation system and method for multiple deep learning tasks in a distributed environment. The system comprises a GPU Profile module, a task information acquisition module, a GPU selection module and a deep learning training module. The heterogeneous GPU allocation method assigns GPUs of different computing capabilities to tasks with matching requirements: tasks with complex model hierarchies and large batch sizes are matched to the best-performing GPUs and run on nodes with sufficient video memory, so that tasks requiring longer deep learning training are accelerated and the overall multi-task execution efficiency in the heterogeneous environment is significantly improved. When multiple deep learning tasks execute concurrently, all of them can be completed quickly, saving the time programmers or users spend waiting for results.

Description

Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment
Technical Field
The invention belongs to the field of deep learning under artificial intelligence, and particularly relates to a heterogeneous GPU (graphics processing unit) distribution system and method for multi-deep learning tasks under a distributed environment.
Background
Today, deep neural networks obtain highly accurate models by training on large-scale data, which has driven their continued application in fields such as image classification, speech recognition, and autonomous driving. These trends have led to increasingly complex deep neural network models and to the emergence of devices that accelerate deep neural network training, such as GPUs, FPGAs, and TPUs. How to use heterogeneous acceleration devices more efficiently in a distributed environment has become an important research topic.
Concurrent deep learning training of multiple tasks in a distributed heterogeneous GPU environment has gradually become common. In one typical scenario, multiple deep learning tasks in a distributed environment are trained cooperatively, and each task must meet its own training target.
In current deep learning training frameworks, the traditional GPU allocation method is to statically specify GPU parameters when multiple tasks are started in a distributed environment, using the GPU parameters provided by the framework to schedule tasks with different requirements onto corresponding GPUs for training. The frameworks also provide a method that allocates all available GPUs, so that the batch data of every task is distributed across all GPUs for training; as a result, a GPU with strong computing power finishes training its share of a small batch quickly and then sits idle for long periods, leaving the strongest GPUs underutilized.
Because the traditional GPU allocation scheme for multiple deep learning tasks in a distributed environment considers neither the characteristics and requirements of the tasks nor the full performance of the heterogeneous GPUs, when different deep learning training tasks run concurrently, the overall execution efficiency of the multiple tasks is low.
Given the defects of the conventional scheme above, an effective GPU allocation method is needed for the scenario of multiple deep learning tasks in a distributed environment, and an effective solution to this problem is urgently required.
Disclosure of Invention
The present invention is directed to a system and a method for allocating heterogeneous GPUs for multiple deep learning tasks in a distributed environment, so as to solve the above technical problems.
In order to solve the technical problems, the specific technical scheme of the heterogeneous GPU allocation system and method for the multi-deep learning task in the distributed environment is as follows:
a heterogeneous GPU distribution system for multi-deep learning tasks in a distributed environment comprises a GPU Profile module, a task information acquisition module, a GPU selection module and a deep learning training module;
the GPU Profile module: used to detect whether each machine in the heterogeneous environment contains a GPU, the relative performance of each GPU, and the video memory size of each GPU;
the task information acquisition module: the system is responsible for collecting a training model of each task, the size of a data batch and the training time of batch data of each task;
the GPU selection module: the system is responsible for target GPU selection and task batch data distribution from a memory cache space to a GPU;
the deep learning training module: responsible for applying the GPU decision information issued by the GPU selection module and for obtaining task model information and data batch size information, so as to execute the corresponding network-level deep learning training computation on the GPU.
The invention also discloses a heterogeneous GPU allocation method for the multi-deep learning task in the distributed environment, which comprises the following steps:
s1, initializing a multi-deep learning training task;
s2, cold starting of the multi-deep learning training task;
s3, dynamically adjusting the GPU scheme of the multi-deep learning training task;
and S4, iterating the multi-deep learning training task loop.
Further, when each deep learning training task is initialized, the GPU Profile module collects the characteristic information of each GPU in the heterogeneous environment and records the parameter information each task carries at startup; this information serves as reference factors for the GPU selection module.
Further, in step S2, when the first Epoch of the multitask deep learning training is started, the GPU selection module gives a GPU allocation scheme of the multitask deep learning training cold start.
Further, the S2 includes the following specific steps:
s21, sorting all tasks according to the level of the deep learning training network of the tasks by taking the model types of the tasks as a first priority sorting factor;
s22, sorting all tasks according to the data batch size of the task deep learning training by taking the batch data size of the tasks as a second priority sorting factor;
s23, sorting all GPUs according to the GPU computing power as a first priority sorting factor;
s24, sorting all GPUs according to the video memory size of the GPU by taking the video memory size of the GPU as a second priority sorting factor;
and S25, according to the orderings produced in steps S22 and S24, assign the first through last GPUs in sequence to the first through last tasks, and record this one-to-one mapping in the global GPU allocation table as the cold-start allocation scheme.
Further, in step S3, according to the cold-start scheme, the first Epoch of training of the multiple deep learning tasks is started; the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU, records it in the global running time list, and passes it to the allocation algorithm of the GPU selection module, which dynamically adjusts and optimizes the current GPU allocation until the optimal scheme is reached.
Further, the S3 includes the following specific steps:
S31, each task performs one round of mini-batch data training; by recording the training running time of the current batch of each task in the log and taking the maximum over these per-task batch times, the longest running time of the current GPU allocation scheme is obtained and recorded as T_cur;
S32, the multiple tasks begin deep learning training under the current GPU allocation scheme. When the first batch of data of every task has finished training, the batch training running time of each task, i.e. the information in the global running time list, is computed. The GPUs of the tasks with the longest and shortest running times recorded in the global running time list are then reassigned: the GPU with stronger computing power executes the task with the longest current running time, and the GPU with weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme. The running time of the next batch of deep learning training of each task is then obtained, and the longest running time of the new scheme is recorded as T_next. If the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is written into the global GPU allocation table;
S33, when the batch data of each of the current tasks has finished running, the running times are recorded in the global running time list; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table. Before each task starts deep learning training on its next batch of data, the allocation scheme lets the tasks, with automatically set parameters, carry out deep learning training on their corresponding GPUs;
S34, during the first Epoch of each task, the allocation algorithm in the GPU selection module repeatedly takes each task's batch data for training iterations and adjusts the scheme until execution efficiency is optimal. Once the first Epoch of each task's deep learning training is finished, the scheme with the best multi-task execution efficiency is selected as the final scheme for the subsequent rounds of deep learning training of each task.
Further, in S4, when the first Epoch of training of the multiple deep learning tasks is completed, the deep learning training of each subsequent Epoch distributes the batch data of each task to the corresponding GPU according to the optimal GPU allocation scheme generated during the first Epoch; when all tasks have completed their training rounds, the deep learning training ends.
The heterogeneous GPU allocation system and method for multiple deep learning tasks in a distributed environment have the following advantages: GPUs of different computing capabilities are assigned to tasks with matching requirements; tasks with complex model hierarchies and large batch sizes are matched to the best-performing GPUs and run on nodes with sufficient video memory; and tasks requiring longer deep learning training are accelerated, so the multi-task execution efficiency in the heterogeneous environment is significantly improved. When multiple deep learning tasks execute concurrently, all of them can be completed quickly, saving the time programmers or users spend waiting for results.
Drawings
FIG. 1 is a block diagram of a logical architecture of a system employing the method of the present invention.
FIG. 2 is a data flow diagram of the multitask deep learning in the present invention.
FIG. 3 is a flow chart of the GPU assignment algorithm of the present invention.
Fig. 4 is an exemplary diagram of GPU allocation at cold start in the present invention.
Fig. 5 is a schematic diagram of a GPU allocation scheme in the present invention.
Fig. 6 is a performance verification diagram of an example GPU selection scheme.
Detailed Description
For better understanding of the purpose, structure and function of the present invention, the following describes a system and method for allocating heterogeneous GPUs for multi-deep learning task in distributed environment in further detail with reference to the accompanying drawings.
As shown in fig. 1, the heterogeneous GPU allocation system for multi-deep learning tasks in a distributed environment of the present invention includes a GPU Profile module, a task information acquisition module, a GPU selection module, and a deep learning training module;
GPU Profile module: used to detect whether each machine in the heterogeneous environment contains a GPU, the relative performance of each GPU, and the video memory size of each GPU;
the task information acquisition module: the system is responsible for collecting a training model of each task, the size of a data batch and the training time of batch data of each task;
a GPU selection module: the system is responsible for target GPU selection and task batch data distribution from a memory cache space to a GPU;
the deep learning training module: responsible for applying the GPU decision information issued by the GPU selection module and for obtaining task model information and data batch size information, so as to execute the corresponding network-level deep learning training computation on the GPU.
The invention discloses a heterogeneous GPU allocation method for multi-deep learning tasks in a distributed environment, which comprises the following steps:
S1, initializing the multi-deep learning training tasks; when each deep learning training task is initialized, the GPU Profile module collects the characteristic information of each GPU in the heterogeneous environment and records the parameter information each task carries at startup; this information serves as reference factors for the GPU selection module;
s2, cold starting of the multi-deep learning training task; when a first Epoch of the multitask deep learning training is started, a GPU selection module gives a GPU allocation scheme of the multitask deep learning training cold start;
s21, sorting all tasks according to the level of the deep learning training network of the tasks by taking the model types of the tasks as a first priority sorting factor;
s22, sorting all tasks according to the data batch size of the task deep learning training by taking the batch data size of the tasks as a second priority sorting factor;
s23, sorting all GPUs according to the GPU computing power as a first priority sorting factor;
s24, sorting all GPUs according to the video memory size of the GPU by taking the video memory size of the GPU as a second priority sorting factor;
and S25, according to the orderings produced in steps S22 and S24, assign the first through last GPUs in sequence to the first through last tasks, and record this one-to-one mapping in the global GPU allocation table as the cold-start allocation scheme.
S3, dynamically adjusting the GPU scheme of the multi-deep learning training task; according to the cold-start scheme, the first Epoch of training of the multiple deep learning tasks is started; the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU, records it in the global running time list, and passes it to the allocation algorithm of the GPU selection module, which dynamically adjusts and optimizes the current GPU allocation until the optimal scheme is reached, comprising the following steps:
S31, each task performs one round of mini-batch data training; by recording the training running time of the current batch of each task in the log and taking the maximum over these per-task batch times, the longest running time of the current GPU allocation scheme is obtained and recorded as T_cur;
S32, the multiple tasks begin deep learning training under the current GPU allocation scheme. When the first batch of data of every task has finished training, the batch training running time of each task, i.e. the information in the global running time list, is computed. The GPUs of the tasks with the longest and shortest running times recorded in the global running time list are then reassigned: the GPU with stronger computing power executes the task with the longest current running time, and the GPU with weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme. The running time of the next batch of deep learning training of each task is then obtained, and the longest running time of the new scheme is recorded as T_next. If the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is written into the global GPU allocation table;
and S33, when the batch data of each of the current tasks has finished running, the running times are recorded in the global running time list; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table. Before each task starts deep learning training on its next batch of data, the allocation scheme lets the tasks, with automatically set parameters, carry out deep learning training on their corresponding GPUs;
and S34, during the first Epoch of each task, the allocation algorithm in the GPU selection module repeatedly takes each task's batch data for training iterations and adjusts the scheme until execution efficiency is optimal. Once the first Epoch of each task's deep learning training is finished, the algorithm selects the scheme with the best multi-task execution efficiency as the final scheme for the subsequent rounds of deep learning training of each task;
S4, performing loop iteration of the multi-deep learning training tasks; when the first Epoch of training of the multiple deep learning tasks is completed, the deep learning training of each subsequent Epoch distributes the batch data of each task to the corresponding GPU according to the optimal GPU allocation scheme generated during the first Epoch; when all tasks have completed their training rounds, the deep learning training ends.
Example (b):
the invention designs a heterogeneous GPU allocation method for multi-deep learning tasks in a distributed environment according to factors such as current heterogeneous GPU configuration, whether the video memory capacity of a GPU can be used for loading batch data of a plurality of tasks at the same time, the execution efficiency of deep learning training of each task and the like, and the method comprises the following steps:
s1, initializing a multi-deep learning training task; as shown in fig. 2, the data flow of the multi-task deep learning training in the heterogeneous environment is shown, the heterogeneous GPUs shown in the present invention are different, the bottom layer stored raw data is transmitted to the DRAM cache in the heterogeneous environment, the time required for each task is approximately the same in the process, and the present invention needs to optimize how the batch data in the memory cache is better distributed to the corresponding GPU. When multi-task deep learning is initialized, a GPU Profile module acquires heterogeneous GPU characteristic information, and the characteristic information is recorded as GpuInfMap < gpuId, [ computrCapability, memSze ] >, wherein the gpuId represents the unique number of a GPU in an environment, the computrCapability represents the value of GPU computing capability, the memSze represents the value of GPU display memory, information in the GpuInfMap serves as a reference factor of a GPU selection module, the task starts with self-contained parameter information, the characteristic information is recorded as JobInfMap < jobId, [ modelType, BatchSize ] >, wherein jobId represents the unique number of a deep learning training task, modelType represents the model type of deep learning training, and BatsSize represents the size of deep learning-trained batch data and serves as the reference factor of the GPU selection module; the data structure of these static information records is shown in fig. 4, and this data structure is recorded in the global memory as the multitasking operation.
S11, the computing capability of each GPU in the heterogeneous environment is read from the cluster's configuration information file, and the compute capability table published by NVIDIA is used to convert it into an integer value representing the GPU's computing performance;
S12, the running parameters used by the multiple tasks for deep learning training can be read in advance from the multi-task start command script and recorded in the JobInfoMap data structure in global memory.
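The static records described above can be illustrated with a small sketch. This is not the patent's actual code; the field names follow the text (GpuInfoMap<gpuId, [computeCapability, memSize]>, JobInfoMap<jobId, [modelType, BatchSize]>), while the concrete GPU and task values are invented for the example.

```python
# Illustrative sketch of the two static data structures kept in global memory.
# Values are hypothetical: compute capability as an integer rank (step S11),
# video memory in GB.

# GpuInfoMap: one entry per GPU in the heterogeneous environment.
gpu_info_map = {
    0: {"computeCapability": 86, "memSize": 24},  # e.g. a stronger GPU, 24 GB
    1: {"computeCapability": 61, "memSize": 8},   # e.g. a weaker GPU, 8 GB
}

# JobInfoMap: one entry per deep learning training task, read from the
# multi-task start command script (step S12).
job_info_map = {
    "job-a": {"modelType": "resnet50", "BatchSize": 256},
    "job-b": {"modelType": "lenet5", "BatchSize": 32},
}
```

Both maps are read-only reference factors for the GPU selection module; only the global GPU allocation table changes during training.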
S2, cold starting of the multi-deep learning training task; when the first Epoch of the multitask deep learning training is started, the GPU selection module gives a GPU allocation scheme of the multitask deep learning training cold start, and an example of the allocation scheme is shown in FIG. 4.
The step S2 may use a multi-factor sorting rule with different priorities to obtain a GPU allocation scheme during cold start, and is implemented according to the following steps:
s21, according to the model type of the task as the first sequencing factor, making a priority sequencing according to the deep learning training network level of the task, and sequencing all the tasks recorded in the JobInfoMap;
s22, according to the size of the batch data of the tasks as a second sequencing factor, performing priority sequencing according to the size of the data batch of the deep learning training of the tasks, and then sequencing all the tasks recorded in the JobInfoMap;
s23, according to the calculation capability of the GPU as a first sorting factor, performing priority sorting according to the strength of the calculation capability of the GPU, and sorting all GPUs recorded in the GpuInfoMap;
s24, according to the size of the video memory of the GPU as a second sorting factor, performing priority sorting according to the size of the video memory of the GPU, and then sorting all GPUs recorded in the GpuInfoMap;
and S25, according to the orderings produced in steps S22 and S24, assign the first through last GPUs in sequence to the tasks in the JobInfoMap; this one-to-one mapping is recorded in a global GPU allocation table, denoted GpuAllocList, which serves as the cold-start allocation scheme.
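The multi-factor priority sort of steps S21 through S25 can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: it assumes dictionary forms of GpuInfoMap and JobInfoMap, and a made-up `model_level` helper table standing in for the network-level ordering of each model type.

```python
# Cold-start allocation sketch (steps S21-S25): sort tasks by
# (model level, batch size) descending, sort GPUs by
# (compute capability, video memory) descending, then pair them one-to-one.

def cold_start_allocation(job_info_map, gpu_info_map, model_level):
    # model_level: assumed helper table mapping modelType -> network depth rank
    jobs = sorted(
        job_info_map,
        key=lambda j: (model_level[job_info_map[j]["modelType"]],
                       job_info_map[j]["BatchSize"]),
        reverse=True,  # most demanding task first
    )
    gpus = sorted(
        gpu_info_map,
        key=lambda g: (gpu_info_map[g]["computeCapability"],
                       gpu_info_map[g]["memSize"]),
        reverse=True,  # strongest GPU first
    )
    # One-to-one mapping recorded as the global GPU allocation table.
    return dict(zip(jobs, gpus))

gpu_info = {0: {"computeCapability": 86, "memSize": 24},
            1: {"computeCapability": 61, "memSize": 8}}
jobs = {"job-a": {"modelType": "resnet50", "BatchSize": 256},
        "job-b": {"modelType": "lenet5", "BatchSize": 32}}
levels = {"resnet50": 50, "lenet5": 5}  # hypothetical depth ranks

alloc = cold_start_allocation(jobs, gpu_info, levels)
# the deeper, larger-batch task is paired with the stronger GPU
```

The tuple sort keys realize the two-level priority directly: model level (or compute capability) dominates, and batch size (or video memory) breaks ties.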
S3, dynamically adjusting the GPU scheme of the multi-deep learning training task. During the first Epoch of the multi-task deep learning training, the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU and records it in the global running time list as T[jobId], which is passed to the allocation algorithm of the GPU selection module to dynamically adjust and optimize the current GPU allocation until the optimal scheme is reached. The algorithm's implementation steps are shown in fig. 3 and further explained as follows:
S31, acquiring the multi-task running time under the current GPU allocation scheme, recorded in T[jobId]; taking the maximum over the batch-data times of all tasks gives the longest running time of the current scheme, recorded as T_cur, as shown in the following equations:
T[jobId] = T(jobId_batch)
T_cur = Max(T[jobId])
S32, the multiple tasks begin deep learning training under the current GPU allocation scheme. When the first batch of data of every task has finished training, the batch training running time of each task, i.e. the information in T[jobId], is computed. The GPUs of the tasks with the longest and shortest running times recorded in T[jobId] are then reassigned: the GPU with stronger computing power executes the task with the longest current running time, and the GPU with weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme. As shown in fig. 5, GPUs that best match the requirements are reselected for the two tasks of the first scheme, and the resulting second scheme outperforms the first. The longest running time obtained by the batch deep learning training of the multiple tasks under the new scheme is recorded as T_next; if the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is recorded in the GpuAllocList, as shown in the following equations:
T_best = min(T_cur, T_next)
if T_best = T_next, then GpuAllocList = [..., (jobId, newGpuId), ...]
and S33, when the batch data of each of the current tasks has finished running, the running time is recorded in T[jobId]; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table, i.e. the GpuAllocList is updated. Before each task starts deep learning training on its next batch of data, the allocation scheme lets the tasks, with automatically set parameters, carry out deep learning training on the corresponding GPUs. If the multiple tasks are currently in the middle of training a batch of data, they must finish that batch under the previous GPU allocation; the new GPU allocation scheme is not adopted until the next batch of data of each task is ready;
and S34, during the first Epoch of each task, the allocation algorithm in the GPU selection module repeatedly takes each task's batch data for training iterations and adjusts the scheme to optimize multi-task execution efficiency. Finally, the scheme with the best multi-task execution efficiency is selected as the final scheme for the subsequent rounds of deep learning training of each task. In the example shown in fig. 6, the optimal scheme produced by the method and system of the invention is scheme two; executing the entire multi-task deep learning training process under scheme two improves multi-task execution efficiency compared with the other schemes;
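The swap-and-keep-if-better adjustment of steps S31 through S32 can be sketched as follows. This is a simplified illustration, not the patent's implementation: per-batch times are supplied by a callable rather than measured on real GPUs, and, as the text describes, only the GPUs of the longest- and shortest-running tasks are exchanged per iteration.

```python
def adjust_allocation(alloc, batch_times):
    """One adjustment step (S31-S32): swap the GPUs of the slowest and
    fastest tasks; keep the new scheme only if its longest batch time
    (T_next) beats the current one (T_cur).

    alloc: {jobId: gpuId}; batch_times: callable mapping an allocation
    to {jobId: seconds} for one mini-batch (assumed measured externally).
    """
    t = batch_times(alloc)
    t_cur = max(t.values())        # T_cur = Max(T[jobId])
    slowest = max(t, key=t.get)    # task with the longest running time
    fastest = min(t, key=t.get)    # task with the shortest running time
    new_alloc = dict(alloc)
    # give the slowest task the fastest task's GPU and vice versa
    new_alloc[slowest], new_alloc[fastest] = alloc[fastest], alloc[slowest]
    t_next = max(batch_times(new_alloc).values())
    # T_best = min(T_cur, T_next): adopt the new scheme only if better
    return (new_alloc, t_next) if t_next < t_cur else (alloc, t_cur)

# Toy cost model for the sketch: GPU 0 is twice as fast as GPU 1,
# and job-a has five times the per-batch work of job-b.
def times(alloc):
    speed = {0: 2.0, 1: 1.0}
    work = {"job-a": 10.0, "job-b": 2.0}
    return {j: work[j] / speed[g] for j, g in alloc.items()}

# Start from a deliberately bad scheme: the heavy task on the slow GPU.
alloc, t_best = adjust_allocation({"job-a": 1, "job-b": 0}, times)
```

In the first Epoch this step would be repeated after every batch (S34) until the longest per-batch time stops improving.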
S4, performing loop iteration of the multi-deep learning training tasks; when the first Epoch of training of the multiple deep learning tasks is completed, the deep learning training of each subsequent Epoch distributes the batch data of each task to the corresponding GPU according to the optimal GPU allocation scheme generated during the first Epoch; when all tasks have completed their training rounds, the deep learning training ends.
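Putting S1 through S4 together, the overall control flow is: tune the allocation after every batch during the first Epoch, then freeze the best scheme for all remaining Epochs. A minimal sketch under stated assumptions — `adjust` and `train_batch` are hypothetical callables standing in for the GPU selection module and the deep learning training module:

```python
def run_training(jobs, num_epochs, batches_per_epoch, alloc, adjust, train_batch):
    """Sketch of the S1-S4 loop: re-adjust the allocation after each batch
    of the first Epoch (S3); later Epochs reuse the frozen scheme (S4)."""
    for epoch in range(num_epochs):
        for batch in range(batches_per_epoch):
            for job in jobs:
                # dispatch this task's batch to its currently assigned GPU
                train_batch(job, alloc[job], epoch, batch)
            if epoch == 0:
                # dynamic adjustment happens only during the first Epoch
                alloc = adjust(alloc)
    return alloc

# Tiny usage example with a no-op adjuster and a recording trainer.
calls = []
final = run_training(["job-a"], 2, 3, {"job-a": 0},
                     lambda a: a,
                     lambda j, g, e, b: calls.append((j, g, e, b)))
```

The returned allocation is the scheme that all Epochs after the first would reuse unchanged.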
It is to be understood that the present invention has been described with reference to certain embodiments and that various changes in form and details may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (8)

1. A heterogeneous GPU distribution system for multi-deep learning tasks in a distributed environment is characterized by comprising a GPU Profile module, a task information acquisition module, a GPU selection module and a deep learning training module;
the GPU Profile module: used for detecting whether each machine in the heterogeneous environment contains a GPU, as well as the performance level and the video memory size of each GPU;
the task information acquisition module: used for acquiring the training model of each task, the size of its data batches, and the training time of each batch of data of each task;
the GPU selection module: responsible for selecting the target GPU and for distributing task batch data from the memory cache space to the GPU;
the deep learning training module: responsible for applying the GPU decision information issued by the GPU selection module to this module and for acquiring task model information and data batch size information, so as to execute the corresponding network-level deep learning training computation on the GPU.
2. A method for performing heterogeneous GPU allocation using the heterogeneous GPU allocation system for multi-deep learning task in distributed environment according to claim 1, comprising the steps of:
S1, initializing a multi-deep learning training task;
S2, cold starting of the multi-deep learning training task;
S3, dynamically adjusting the GPU scheme of the multi-deep learning training task;
and S4, performing loop iteration on the multi-deep learning training tasks.
3. The method according to claim 2, wherein in step S1, when each deep learning training task is initialized, the GPU Profile module collects the feature information of each GPU in the heterogeneous environment and records the parameter information of each task itself at startup, where the feature information serves as a reference factor for the GPU selection module.
4. The method according to claim 2, wherein in S2, when the first Epoch of the multitask deep learning training is started, the GPU selection module provides a GPU allocation scheme of the multitask deep learning training cold start.
5. The method according to claim 4, wherein the step S2 comprises the following steps:
S21, sorting all tasks according to the number of layers of the deep learning training network of each task, taking the model type of the task as a first priority sorting factor;
S22, sorting all tasks according to the data batch size of each task's deep learning training, taking the batch data size of the task as a second priority sorting factor;
S23, sorting all GPUs according to GPU computing power, taking the computing power as a first priority sorting factor;
S24, sorting all GPUs according to the video memory size of each GPU, taking the video memory size as a second priority sorting factor;
and S25, according to the order established in steps S22 and S24, allocating the first GPU through the last GPU in sequence to the first task through the last task, and recording the one-to-one mapping in the global GPU allocation table as the cold-start allocation scheme.
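A minimal sketch of the cold-start pairing in steps S21-S25 follows; the task and GPU attribute names (`layers`, `batch_size`, `compute`, `memory`) are illustrative assumptions, not identifiers from the patent.

```python
# Hypothetical sketch of the cold-start allocation (S21-S25).
# Attribute names are assumptions chosen for illustration.

def cold_start_allocation(tasks, gpus):
    """Pair the heaviest tasks with the strongest GPUs, one-to-one."""
    # S21/S22: sort tasks by network depth first, batch size second
    ordered_tasks = sorted(
        tasks, key=lambda t: (t["layers"], t["batch_size"]), reverse=True)
    # S23/S24: sort GPUs by computing power first, video memory second
    ordered_gpus = sorted(
        gpus, key=lambda g: (g["compute"], g["memory"]), reverse=True)
    # S25: record the one-to-one mapping as the global GPU allocation table
    return {t["name"]: g["id"] for t, g in zip(ordered_tasks, ordered_gpus)}
```

With two tasks and two GPUs, the deeper-network task is paired with the higher-compute GPU, matching the priority ordering described in the claim.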
6. The method as claimed in claim 2, wherein in S3, the first Epoch of the multi-deep learning training is started according to the cold-start scheme; the task performance calculation module calculates the training time of each task's batch data on the corresponding GPU, records it in the global runtime list, and transmits it to the allocation algorithm of the GPU selection module, which dynamically adjusts and optimizes the current GPU allocation until the optimal scheme is reached.
7. The method according to claim 6, wherein the step S3 comprises the following steps:
S31, performing one mini-batch data training for each task; by recording the training running time of the current batch data of each task in the log and taking the maximum value over the multi-task batch times, the longest running time of the current GPU allocation scheme is obtained and recorded as T_cur;
S32, the multiple tasks start deep learning training under the current GPU allocation scheme; when the first batch data training of each task is finished, the batch data training running time of each task, namely the information in the global running time list, is calculated and recorded; the GPUs of the tasks with the longest and the shortest running times are then reallocated, so that the GPU with stronger computing power executes the task with the longest current running time and the GPU with relatively weaker computing power executes the task with the shortest running time, yielding a new GPU allocation scheme; the running time of the next batch of deep learning training of each task is then obtained, giving the longest running time of the new GPU allocation scheme, recorded as T_next; if the execution efficiency of the new scheme is better than that of the previous scheme, the overall optimal performance is recorded as T_best and the new GPU allocation scheme is written into the global GPU allocation table;
S33, the batch data of each current task is run and the running time is recorded in the global running time list; the GPU allocation algorithm then updates the GPU allocation scheme and modifies it in the global GPU allocation table, and before each task starts deep learning training on the next batch of data, the allocation scheme directs each task, with its automatically set parameters, to perform deep learning training on the corresponding GPU;
S34, in the first Epoch stage of each task, the allocation algorithm in the GPU selection module continuously takes batch data from each task for training iterations and adjusts the scheme until the execution efficiency is optimal; once the first Epoch of each task's deep learning training has been trained, the scheme with the optimal multi-task execution efficiency is selected as the final scheme for the subsequent rounds of deep learning training of each task.
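A single adjustment step of S31-S32 can be sketched as follows; the dictionary shapes and the helper names are assumptions for illustration, not the patent's implementation.

```python
# Hypothetical sketch of one adjustment step (S31-S32): swap the GPUs of
# the longest- and shortest-running tasks, then keep whichever scheme has
# the smaller longest batch time. All names are assumptions.

def swap_extremes(alloc, runtimes):
    """Return a candidate scheme with the GPUs of the slowest and fastest
    tasks exchanged, so the stronger GPU serves the slowest task."""
    slowest = max(runtimes, key=runtimes.get)
    fastest = min(runtimes, key=runtimes.get)
    candidate = dict(alloc)
    candidate[slowest], candidate[fastest] = alloc[fastest], alloc[slowest]
    return candidate

def better_scheme(current, t_cur, candidate, t_next):
    """Compare T_cur and T_next; the smaller becomes the recorded T_best."""
    return (candidate, t_next) if t_next < t_cur else (current, t_cur)
```

Repeating this swap-and-compare over the batches of the first Epoch is what drives the scheme toward the optimum selected in S34.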
8. The method according to claim 2, wherein in S4, when the first Epoch training of the multi-deep learning tasks is completed, the subsequent Epochs of each task are trained, and the batch data of each task is distributed to the corresponding GPU for deep learning training according to the optimal GPU allocation scheme generated during the first Epoch; once all tasks have completed their training rounds, the deep learning training task ends.
CN202210463699.XA 2022-04-29 2022-04-29 Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment Pending CN114820278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210463699.XA CN114820278A (en) 2022-04-29 2022-04-29 Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment

Publications (1)

Publication Number Publication Date
CN114820278A true CN114820278A (en) 2022-07-29

Family

ID=82510102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210463699.XA Pending CN114820278A (en) 2022-04-29 2022-04-29 Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment

Country Status (1)

Country Link
CN (1) CN114820278A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860114A (en) * 2022-11-07 2023-03-28 北京百度网讯科技有限公司 Deep learning model training method and device, electronic equipment and storage medium
CN115860114B (en) * 2022-11-07 2023-09-08 北京百度网讯科技有限公司 Training method and device for deep learning model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
CN110321222B (en) Decision tree prediction-based data parallel operation resource allocation method
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN110069341B (en) Method for scheduling tasks with dependency relationship configured according to needs by combining functions in edge computing
WO2019193075A1 (en) Coordinated heterogeneous processing of training data for deep neural networks
CN111190712A (en) Task scheduling method, device, equipment and medium
CN113127203B (en) Deep learning distributed compiler for cloud edge computing and construction method
CN109684088B (en) Remote sensing big data rapid processing task scheduling method based on cloud platform resource constraint
CN114820278A (en) Heterogeneous GPU (graphics processing Unit) distribution system and method for multi-deep learning task in distributed environment
CN111176637B (en) Schedulability analysis method of AADL model based on cache preemption delay constraint
CN114237869A (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN115543626A (en) Power defect image simulation method adopting heterogeneous computing resource load balancing scheduling
CN113741999B (en) Dependency-oriented task unloading method and device based on mobile edge calculation
Feljan et al. Task allocation optimization for multicore embedded systems
CN111061565A (en) Two-stage pipeline task scheduling method and system in Spark environment
CN111767121A (en) Operation method, device and related product
CN116938323B (en) Satellite transponder resource allocation method based on reinforcement learning
CN112463340A (en) Tensorflow-based multi-task flexible scheduling method and system
CN106354633A (en) Task schedule generation method based on algorithmic plug-in
CN112650449A (en) Release method and release system of cache space, electronic device and storage medium
CN114490094B (en) GPU (graphics processing Unit) video memory allocation method and system based on machine learning
CN115827225A (en) Distribution method of heterogeneous operation, model training method, device, chip, equipment and medium
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN116010051A (en) Federal learning multitasking scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination