WO2024060789A1 - Distributed training task scheduling method, system and device for intelligent computing - Google Patents

Distributed training task scheduling method, system and device for intelligent computing Download PDF

Info

Publication number
WO2024060789A1
WO2024060789A1 (PCT/CN2023/105626, CN2023105626W)
Authority
WO
WIPO (PCT)
Prior art keywords
computing
gpu
subtask
task
training
Prior art date
Application number
PCT/CN2023/105626
Other languages
English (en)
French (fr)
Inventor
朱世强
李勇
程稳
陈�光
曾令仿
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Publication of WO2024060789A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of intelligent computing, and in particular to a distributed training task scheduling method, system and device for intelligent computing.
  • a computing task is divided into multiple subtasks and assigned to different GPUs for execution.
  • Different distributed training methods have different communication and computing efficiencies.
  • the simple scheduling method obviously cannot bring out the best performance of the intelligent computing cluster.
  • when models trained with distributed methods run in an intelligent computing cluster at the same time, relying only on local resource schedulers may lead to problems such as training tasks waiting for each other, idle GPUs, and communication congestion.
  • the purpose of the present invention is to provide a distributed training task scheduling method, system and device for intelligent computing, which solves the problem in the prior art that single-card, single-task scheduling cannot make use of the coordinated scheduling characteristics of distributed training methods and cannot fully exploit the performance potential of distributed training in intelligent computing clusters.
  • Embodiments of the present invention provide a distributed training task scheduling system for intelligent computing.
  • the computing cluster includes multiple computing nodes.
  • the multiple computing nodes can communicate with each other.
  • Each computing node includes at least one CPU and at least one GPU.
  • the system includes:
  • Model performance prediction and decomposition module, used to determine the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, to divide the model to be trained into multiple subtasks, and to determine the resource consumption information of each subtask.
  • the distributed training method includes one of data parallelism, pipeline parallelism and hybrid parallelism.
  • the hybrid parallelism includes data parallelism and pipeline parallelism.
  • the resource consumption information includes computing consumption and memory consumption;
  • Global GPU resource scheduler, used to, after receiving the subtask request sent by the model performance prediction and decomposition module, allocate each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, build the communication topology between the subtasks, monitor the computing resource operating status of each computing node's GPU while that GPU trains its subtask, and control the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, wherein the subtask request carries the distributed training method corresponding to the model to be trained, the multiple subtasks, and the resource consumption information of each subtask; and
  • a local GPU resource scheduler configured on each computing node, used to locally schedule the subtasks assigned to that computing node according to the distributed training method.
  • Embodiments of the present invention also provide a distributed training task scheduling method for intelligent computing.
  • the computing cluster includes multiple computing nodes.
  • the multiple computing nodes can communicate with each other.
  • Each computing node includes at least one CPU and at least one GPU.
  • the methods include:
  • the model performance prediction and decomposition module determines the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, divides the model to be trained into multiple subtasks, and determines the resource consumption information of each subtask.
  • the distributed training method includes one of data parallelism, pipeline parallelism and hybrid parallelism.
  • the hybrid parallelism includes data parallelism and pipeline parallelism.
  • the resource consumption information includes computing consumption and memory consumption;
  • after receiving the subtask request sent by the model performance prediction and decomposition module, the global GPU resource scheduler allocates each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, builds the communication topology between the subtasks, monitors the computing resource operating status of each computing node's GPU while that GPU trains its subtask, and controls the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, where the subtask request carries the distributed training method corresponding to the model to be trained, the multiple subtasks, and the resource consumption information of each subtask; and
  • the local GPU resource scheduler configured on each computing node performs local scheduling on the subtasks assigned to the computing node according to the distributed training method.
  • An embodiment of the present invention also provides a distributed training task scheduling device for intelligent computing, comprising a memory and one or more processors, wherein the memory stores executable code, and when the one or more processors execute the executable code, they are used to implement any of the above-mentioned distributed training task scheduling methods for intelligent computing.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored.
  • when the program is executed by a processor, the distributed training task scheduling method for intelligent computing described in any one of the above is implemented.
  • the beneficial effects of the present invention are: by setting up a global GPU resource scheduler to allocate subtasks, build the communication topology between subtasks, monitor the computing resource operating status of each computing node's GPU, and schedule the subtasks, the utilization of the computing cluster's GPU, network and other resources is improved and the waiting time of subtask training is reduced.
  • Figure 1 is a schematic diagram of the structure of a computing cluster provided by an embodiment of the present invention.
  • Figure 2 is a schematic structural diagram of a distributed training task scheduling system for intelligent computing provided by another embodiment of the present invention.
  • Figure 3 is a schematic flow chart of an implementation method for a model performance prediction and decomposition module provided by an embodiment of the present invention to determine the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user;
  • Figure 4 is a schematic flow chart of a method for controlling the scheduling of subtasks by a global GPU resource scheduler according to the computing resource operation status of GPUs of all computing nodes according to an embodiment of the present invention
  • Figure 5 is a schematic diagram of the functions of a global GPU resource scheduler and the interaction between the global GPU resource scheduler and the local resource scheduler provided by an embodiment of the present invention
  • Figure 6 is a schematic flowchart of a method for implementing a local GPU resource scheduler configured on each computing node to locally schedule subtasks assigned to the computing node according to a distributed training method according to an embodiment of the present invention
  • Figure 7 is a schematic flow chart of a first scheduling strategy provided by an embodiment of the present invention.
  • Figure 8 is a schematic flow chart of a second scheduling strategy provided by an embodiment of the present invention.
  • Figure 9 is a schematic flow chart of a distributed training task scheduling method for intelligent computing provided by an embodiment of the present invention.
  • Figure 10 is a structural block diagram of a distributed training task scheduling device for intelligent computing provided by an embodiment of the present invention.
  • 10. Model performance prediction and decomposition module; 20. Global GPU resource scheduler; 30. Local GPU resource scheduler.
  • Model parameter training methods can include data parallelism and pipeline parallelism.
  • Data parallelism copies the model to multiple GPUs, and performs gradient interaction and parameter updating through collective communication or parameter servers.
  • the training process can be divided into a gradient calculation stage and a gradient synchronization stage; in the gradient calculation stage the computational efficiency is high, GPU utilization is high and there is basically no communication, while in the gradient synchronization stage the computation overhead is relatively low and the communication overhead is very high.
  • Pipeline parallelism divides the model into multiple stages by layer, and each stage is deployed on a GPU. The stages execute the forward calculation in order, the loss function is calculated in the last stage, and the backward calculation then proceeds from the last stage back to the first stage. Throughout this process, the idle waiting time between the forward calculation and the backward calculation differs from stage to stage.
  • GPU computing resource scheduling is often based on a single card as the scheduling unit, which cannot fully utilize the coordinated scheduling characteristics of distributed training methods and cannot fully tap the performance potential of distributed training in intelligent computing clusters.
  • the present invention sets up a global GPU resource scheduler 20 to allocate subtasks, build the communication topology between subtasks, monitor the computing resource operating status of each computing node's GPU and schedule the subtasks, thereby improving the utilization of resources such as the GPU and network of the computing cluster and reducing the waiting time of subtask training.
  • a computing cluster in an embodiment of the present invention may include multiple computing nodes.
  • the multiple computing nodes can communicate with each other.
  • Each computing node includes at least one CPU and at least one GPU.
  • the computing cluster may include computing node 1, computing node 2, ..., computing node N, where N is a positive integer, and N is greater than or equal to 3.
  • the distributed training task scheduling method, system and device for intelligent computing in the embodiments of the present invention are applicable to the distributed task scheduling of the computing cluster as shown in FIG1 .
  • the distributed training task scheduling system for intelligent computing may include a model performance prediction and decomposition module 10, a global GPU resource scheduler 20 and a local GPU resource scheduler 30.
  • each computing node is configured with a local GPU resource scheduler.
  • the computing cluster may include computing node 1, computing node 2, ..., computing node N, where N is a positive integer and N is greater than or equal to 3; computing node 1, computing node 2, ..., computing node N are configured with local GPU resource scheduler 31, local GPU resource scheduler 32, ..., local GPU resource scheduler 3N, respectively.
  • the model performance prediction and decomposition module 10 is used to determine the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, to divide the model to be trained into multiple subtasks, and to determine the resource consumption information of each subtask.
  • the distributed training method includes one of data parallelism, pipeline parallelism and hybrid parallelism.
  • Hybrid parallelism includes data parallelism and pipeline parallelism.
  • the resource consumption information includes computing consumption and memory consumption.
  • the global GPU resource scheduler 20 is configured to, after receiving the subtask request sent by the model performance prediction and decomposition module 10, allocate each subtask to the GPU of a matching computing node for training based on the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, build the communication topology between the subtasks, monitor the computing resource operating status of each computing node's GPU while that GPU trains its subtask, and control the scheduling of the subtasks based on the computing resource operating status of the GPUs of all computing nodes.
  • the subtask request carries the distributed training method corresponding to the model to be trained, multiple subtasks, and resource consumption information of each subtask.
  • the local GPU resource scheduler configured on each computing node is used to locally schedule the subtasks assigned to the computing node according to the distributed training method.
  • the model to be trained may be a neural network model or another type of model, such as a mathematical model to be trained.
  • the model to be trained may include one model or multiple models.
  • the target completion time can be determined based on the predicted time required to complete the training of the model to be trained.
  • the target completion time can be equal to the predicted time required to complete the training of the model to be trained, or the target completion time can be slightly larger than the predicted time required to complete the training of the model to be trained.
  • the predicted time required to complete training of the model to be trained can be predicted based on experience, such as historical training data prediction.
  • the target input resources can be determined based on the predicted size of resources required to complete the training of the model to be trained.
  • the target input resource size can be equal to the predicted size of resources required to complete the training of the model to be trained, or the target input resource size can be slightly larger than the predicted size.
  • the amount of resources required to complete the training of the model to be trained can be predicted based on experience, such as historical training data prediction.
  • when the model performance prediction and decomposition module 10 determines the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, this can be achieved using the following steps:
  • the model performance prediction and decomposition module 10 obtains the computing time and memory overhead required for each layer parameter in the model to be trained by performing pre-training on one machine. It should be noted that the pre-training in this step does not complete the training, but only performs several training iterations on the model to be trained, and then takes the average to predict the calculation time and memory overhead of each layer parameter.
  • a specific implementation of pre-training the model to be trained and determining the computing time and memory overhead required for each layer of parameters may include: performing multiple training iterations for each layer of parameters in the model to be trained and recording the calculation time and memory overhead of each iteration; determining the calculation time required for a layer's parameters from the average calculation time of its training iterations; and determining the memory overhead required for a layer's parameters from the average memory overhead of its training iterations (an illustrative profiling sketch is given after this list).
  • the model to be trained includes three layers, namely a first layer, a second layer and a third layer. The model performance prediction and decomposition module 10 first performs 10 training iterations on the first-layer, second-layer and third-layer parameters respectively, and determines the calculation time T1i and memory overhead R1i of each training iteration of the first-layer parameters, the calculation time T2i and memory overhead R2i of each training iteration of the second-layer parameters, and the calculation time T3i and memory overhead R3i of each training iteration of the third-layer parameters, where i is the index of the iteration, i = 0, 1, 2, ..., 9.
  • the average calculation time of the 10 training iterations of the first-layer parameters is T1 = (T10 + T11 + ... + T19)/10, and the average memory overhead is R1 = (R10 + R11 + ... + R19)/10; the averages for the second-layer and third-layer parameters are computed in the same way.
  • the calculation time required for each layer of parameters may be equal to, or slightly greater than, the average calculation time of the multiple training iterations of that layer's parameters.
  • Following the above example, the calculation time required for the first-layer parameters may be equal to T1 or slightly greater than T1.
  • the memory overhead required for each layer of parameters may be equal to, or slightly greater than, the average memory overhead of the multiple training iterations of that layer's parameters.
  • Following the above example, the memory overhead required for the first-layer parameters may be equal to R1 or slightly greater than R1.
  • each iterative training run in the above step S11 can be data parallel or pipeline parallel; each layer of parameters can adopt a single iterative training method, or both data parallelism and pipeline parallelism (a training method that includes both data parallelism and pipeline parallelism can be called a hybrid training method).
  • in step S12, the iterative training methods of the multiple layers of parameters of the model to be trained are first arranged into permutations and combinations, each permutation and combination being equivalent to one distributed training method; the GPU resources and task completion time required by the multi-layer parameters of the model to be trained under each permutation and combination are then evaluated.
  • the permutations and combinations whose GPU resources exceed the target input resources in the above step S12 are excluded, and the permutation and combination with the smallest task completion time is then selected from the remaining ones as the distributed training method of the model to be trained, thereby ensuring the best training efficiency (a sketch of this selection step is given after this list).
  • In other embodiments, after excluding the permutations and combinations whose GPU resources exceed the target input resources, the permutation and combination with the next-smallest task completion time can be selected from the remaining ones as the distributed training method of the model to be trained, thereby meeting different training needs.
  • after selecting the distributed training method of the model to be trained, the model performance prediction and decomposition module 10 divides the model to be trained into multiple subtasks according to the GPUs of the multiple computing nodes in the computing cluster. If the method is data parallelism, each subtask is a complete model, and the subtask on each GPU performs gradient exchange and parameter updates through collective communication or a parameter server; if the method is pipeline parallelism, the subtask on each GPU is a sub-model containing several layers of parameters, and the sub-models on the GPUs communicate intermediate parameters in a point-to-point manner (see the subtask-division sketch after this list).
  • the model performance prediction and decomposition module 10 sends the assigned subtasks and descriptive information such as the resource consumption information of each subtask to the global GPU resource scheduler 20, and the global GPU resource scheduler 20 finds the GPUs of suitable computing nodes to run them and builds the communication topology.
  • after receiving the subtask request sent by the model performance prediction and decomposition module 10, the global GPU resource scheduler 20 assigns the subtasks to suitable GPUs for execution according to the current GPU operating status of the computing cluster (that is, the GPU operating status of each computing node in the computing cluster) combined with the calculation time and memory requirements of all subtasks of the model (that is, the resource consumption information of each subtask), and constructs the communication topology between the subtasks; the GPU of each computing node then trains the subtasks assigned to it. That is, the global GPU resource scheduler 20 in the embodiment of the present invention has a global resource allocation function.
  • the global GPU resource scheduler 20 maps the subtasks decomposed by the model performance prediction and decomposition module 10 to specific GPUs, so that subtasks of multiple models can share a GPU while the waiting time between the multiple subtasks of one model is minimized (see the placement and topology sketch after this list).
  • the computing resource operating status in the embodiment of the present invention may include the waiting time of subtasks and the GPU utilization. It can be understood that the computing resource operating status is not limited to the above waiting time of subtasks and GPU utilization, and may also include other information, such as the CPU utilization of the computing node.
  • the way in which the global GPU resource scheduler 20 monitors the computing resource operating status of each computing node's GPU can be selected as needed. For example, in some embodiments the global GPU resource scheduler 20 actively obtains the GPU computing resource operating status from each computing node in the computing cluster; in other embodiments each computing node in the computing cluster actively reports the computing resource operating status of its GPU to the global GPU resource scheduler 20.
  • the global GPU resource scheduler 20 periodically obtains the computing resource operation status of the GPU of each computing node.
  • the global GPU resource scheduler 20 periodically receives the computing resource operating status of each computing node's GPU as fed back by that computing node; that is, each computing node periodically reports the computing resource operating status of its GPU to the global GPU resource scheduler 20.
  • the global GPU resource scheduler 20 periodically obtains the computing resource operation status of the GPU from each computing node in the computing cluster. That is, the global GPU resource scheduler 20 actively obtains the computing resource operation status of the GPU from each computing node in the computing cluster periodically.
  • the length of the acquisition cycle for the computing resource running status of the GPU of the computing node can be set as needed, for example, 10 minutes.
  • the global GPU resource scheduler 20 aperiodically obtains the computing resource operation status of the GPU of each computing node.
  • the global GPU resource scheduler 20 can obtain the computing resource operation status of the GPU of each computing node when needed.
  • the global GPU resource scheduler 20 in the embodiment of the present invention has a sub-task cooperative scheduling function.
  • when controlling the scheduling of subtasks based on the computing resource operating status of the GPUs of all computing nodes, the global GPU resource scheduler 20 may perform, but is not limited to, the following steps:
  • a backup node is added for a subtask whose waiting time is greater than or equal to a preset duration threshold; the backup node is a computing node, among the multiple computing nodes, other than the current computing node corresponding to the subtask whose waiting time is greater than or equal to the preset duration threshold, and the GPU utilization of the backup node is less than or equal to a preset utilization threshold.
  • the size of the preset duration threshold, preset utilization threshold, etc. can be set by the user according to actual needs.
  • the preset duration threshold is 5 minutes
  • the preset utilization threshold is 70%
  • the GPU of computing node 1 executes subtasks 11 and 12
  • computing node 2 executes subtask 13
  • computing node 3 executes subtasks 14 and 15
  • the waiting time of subtask 12 is greater than 5 minutes
  • the GPU utilization of computing node 2 is less than 70%
  • the GPU utilization of computing node 3 is greater than 70%, so computing node 2 can be used as the backup node.
  • before subtask 12 is scheduled to computing node 2, computing node 1 may already have performed some training on subtask 12; therefore, the latest model parameters of subtask 12 are copied to computing node 2.
  • by adding a backup node for a subtask whose waiting time is greater than or equal to the preset duration threshold and copying the latest model parameters of that subtask to the backup node, the latest model parameters are added to the backup node in a data-parallel manner so that it participates in the training of that task in the next iteration, thereby reducing the training waiting time of subtasks whose waiting time is greater than or equal to the preset duration threshold, making full use of the idle time of the backup node, and ultimately reducing the overall training waiting time and improving training efficiency (see the backup-node sketch after this list).
  • the latest model parameters corresponding to the subtasks whose waiting time is greater than or equal to the preset time threshold are added to the backup node in a data parallel manner to participate in the training of the task in the next round of iterations.
  • specifically, the current computing node and the backup node form a small-scale data-parallel group of these two nodes.
  • the current computing node only needs to train half of the data in the next iteration, which can reduce the load of the current computing node.
  • the global GPU resource scheduler 20 sends first scheduling information to the local GPU resource scheduler of the backup node.
  • the first scheduling information carries the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset duration threshold; after receiving the first scheduling information, the local GPU resource scheduler of the backup node adds those latest model parameters to the backup node in a data-parallel manner so that the backup node participates in the training of the task in the next iteration.
  • the global GPU resource scheduler 20 sends the first scheduling information to the local GPU resource scheduler of the computing node 2.
  • the first scheduling information carries the latest model parameters of subtask 12; after receiving the first scheduling information, the local GPU resource scheduler of computing node 2 adds the latest model parameters of subtask 12 to computing node 2 in a data-parallel manner, and computing node 2, as a new data-parallel node, participates in the training of the task in the next iteration.
  • when controlling the scheduling of subtasks according to the computing resource operating status of the GPUs of all computing nodes, the global GPU resource scheduler 20 may also consider the distributed training method corresponding to the subtask. For example, in some embodiments, when the distributed training method corresponding to a subtask is data parallelism, the training process of the subtask includes a gradient calculation phase and a gradient synchronization phase, and the global GPU resource scheduler 20 controls the prefetching of the model parameters and intermediate variables of the corresponding subtasks according to the computing resource operating status of the GPUs of the computing nodes where all data-parallel subtasks are located.
  • specifically, when controlling this prefetching, the global GPU resource scheduler 20, after receiving notice that the parameter server has started calculating the global gradient information of all data-parallel subtasks, sends second scheduling information to the computing nodes corresponding to the data-parallel subtasks, so as to prompt those computing nodes, through the second scheduling information, to prioritize executing the corresponding data-parallel subtasks and to copy the latest model parameters and intermediate variables of those subtasks from the CPU main memory back into the GPU memory of the corresponding computing nodes.
  • while a data-parallel subtask is waiting for the calculation results of other subtasks it depends on, and the expected waiting time exceeds the CPU-GPU memory copy time of the corresponding computing node, that computing node temporarily migrates the model parameters and intermediate variables of the subtask from its GPU memory to its CPU main memory.
  • taking advantage of the larger capacity of CPU main memory, the model parameters and intermediate variables of such a subtask are temporarily moved to CPU main memory and then prefetched back before the next calculation.
  • This can improve the utilization of resources such as GPU and network in the computing cluster.
  • the CPU-GPU memory copy is transmitted through the PCI-E channel and the transmission rate is relatively fixed. Therefore, the CPU-GPU memory copy time can be calculated by dividing the amount of transferred data by the PCI-E channel transmission rate.
  • after the global GPU resource scheduler 20 learns that the parameter server of the computing cluster has started calculating the global gradient information, it sends the second scheduling information to the computing node of the corresponding subtask (the computing node corresponding to the above data-parallel subtask); after receiving the second scheduling information, that computing node is prompted to execute the subtask first and to copy its model parameters and intermediate variables from the CPU main memory back to the GPU memory as soon as possible, thereby improving the computing efficiency of the GPU while minimizing the waiting time of the subtask's calculation (see the offload/prefetch sketch after this list).
  • the global GPU resource scheduler 20 in this embodiment has a computing resource adjustment function. It should be noted that the global gradient information is determined according to the gradient information of each subtask; optionally, the global gradient information includes the gradient information of each subtask, or the global gradient information is obtained by processing the gradient information of each subtask. The gradient information includes gradient calculation information and gradient synchronization information.
  • the global GPU resource scheduler 20 also has a task resource recycling function. Specifically, after the training of the model to be trained is completed, the global GPU resource scheduler 20 determines the computing node where each subtask is located according to the historical allocation information of each subtask of the model to be trained, controls the computing node where each subtask is located to recycle the local resources used on that node for training the corresponding subtask, and, after confirming that resource recycling on all computing nodes is completed, releases the resources used for training the model to be trained on the global GPU resource scheduler 20.
  • the global GPU resource scheduler 20 in one embodiment of the present invention integrates the functions of global resource allocation, subtask coordination scheduling, computing resource adjustment, and task resource recovery.
  • the local GPU resource scheduler configured on each computing node performs local scheduling of subtasks assigned to the computing node according to the distributed training method, including but not limited to the following steps:
  • the training types include data parallel tasks and pipeline parallel tasks.
  • when the training type is a data parallel task, the local scheduling strategy of the subtask is the first scheduling strategy; when the training type is a pipeline parallel task, the local scheduling strategy of the subtask is the second scheduling strategy.
  • the first scheduling policy can be set as needed.
  • the training process of the subtask includes a gradient calculation phase and a gradient synchronization phase.
  • the gradient calculation stage has high computational efficiency and very low communication overhead; while the gradient synchronization stage has low computational efficiency and high communication overhead.
  • the first scheduling strategy schedules and manages subtasks based on the above characteristics, thereby achieving optimal scheduling, improving the utilization of GPU and network resources of the computing cluster, and reducing the waiting time for subtask training.
  • the first scheduling strategy includes: obtaining the first computing requirements of the current subtask in the gradient calculation phase and the second computing requirements of other subtasks on the current computing node; and determining, according to the first computing requirements and the second computing requirements, the training order of all subtasks of the current computing node in order of computing efficiency.
  • computing efficiency is negatively related to the size of computing requirements, that is, the greater the computing requirements, the lower the computing efficiency; the smaller the computing requirements, the greater the computing efficiency.
  • the current computing node feeds this judgment result back to the global GPU resource scheduler 20 together with the computing resource usage of the current computing node's GPU, and asks the global GPU resource scheduler 20 whether there are other computing nodes that can meet the user's expected task completion time.
  • if the global GPU resource scheduler 20 arranges the task onto another computing node, the scheduling of this new task on the current node ends.
  • the first scheduling strategy also includes: when the computing requirements of the current subtask exceed the local computing resources of the current computing node, the computing resource operating status of the current computing node's GPU is fed back to the global GPU resource scheduler 20 to ask whether there are other computing nodes whose computing resources can meet the computing requirements of the current subtask.
  • when the computing requirements of the current subtask exceed the local computing resources of the current computing node, the training time of the current subtask will exceed the user's expected task completion time; when the computing requirements of the current subtask are less than the local computing resources of the computing node, the training time of the current subtask will be less than the user's expected task completion time.
  • the first scheduling strategy also includes: when the global GPU resource scheduler 20 feeds back that there are no other suitable computing nodes, the current computing node builds a high-priority queue and a low-priority queue, puts the gradient calculation phase task of the current subtask into the high-priority queue, and puts the gradient synchronization phase task of the current subtask into the low-priority queue; the GPU of the current computing node executes the gradient calculation phase task, and the CPU of the current computing node executes the gradient synchronization phase task.
  • the first scheduling strategy also includes: when the gradient calculation phase task is completed, copying the model parameters and intermediate variables corresponding to the gradient calculation phase task to the CPU main memory of the current computing node; when both the gradient calculation phase task and the gradient synchronization phase task are completed, copying the model parameters and intermediate variables of the corresponding subtask back to the GPU memory of the current computing node; and/or, after the current computing node receives the first scheduling information sent by the global GPU resource scheduler 20, marking the subtasks in the low-priority queue for prefetching, with the GPU of each computing node executing the subtasks carrying the prefetch mark first.
  • the current computing node builds a two-level queue, divides the data parallel tasks into gradient calculation phase tasks and gradient synchronization phase tasks, puts the gradient calculation phase tasks into the high-priority queue and the gradient synchronization phase tasks into the low-priority queue; if the GPU computing resources of the current computing node are tight, the gradient synchronization phase task is completed by the CPU of the current computing node (see the two-level queue sketch after this list).
  • the current computing node marks the current sub-task in the low-priority queue with a prefetch mark. The local scheduling policy of the current computing node will give priority to subtasks with prefetch marks when selecting tasks.
  • when the gradient synchronization of the current subtask has been completed, the model parameters and intermediate variables of the current subtask are copied from the CPU main memory of the current computing node to the GPU memory of the current computing node; otherwise, monitoring continues until the gradient synchronization is completed, at which point the copy is started.
  • when the training type is a pipeline parallel task, the current subtask includes multiple task stages, in which the computing task of the last stage of the current subtask is a complete computing task, and the computing tasks of the other stages of the current subtask include a forward calculation task and a backward calculation task.
  • the pipeline parallel task divides the training process into multiple stages and is a two-way pipeline: the forward calculation proceeds from the first stage to the last stage, the loss function is then calculated, and the backward calculation proceeds from the last stage back to the first stage; the idle time differs between stages, being largest in the first stage and decreasing successively, and in the last stage the forward calculation and backward calculation are connected together without any idle time.
  • the second scheduling strategy schedules and manages subtasks based on this characteristic to achieve optimal scheduling, improve the utilization of GPU and network resources of the computing cluster, and reduce the waiting time for subtask training.
  • the current computing node decides whether to divide the current subtask into a forward computing task and a backward computing task according to the stage to which it belongs: if it is the last stage, the current subtask is treated as one complete computing task, and the subtasks of the other stages are each divided into a forward computing task and a backward computing task.
  • the current computing node determines whether it can meet the computing requirements of the current subtask based on its current local GPU resource operating status; if not, it asks the global GPU resource scheduler 20 whether there are other computing nodes that can meet the user's expected task completion time. If the global GPU resource scheduler 20 arranges the subtask onto another computing node, the scheduling of this model subtask ends.
  • the second scheduling strategy includes: the current computing node determines, based on the operating status of its local GPU resources, whether its GPU resources can meet the computing requirements of the current subtask; if not, it asks the global GPU resource scheduler 20 whether there are other computing nodes that can meet the computing requirements of the current subtask; when the global GPU resource scheduler 20 feeds back that there are no other computing nodes, the current computing node puts the current subtask into a high-priority queue.
  • the second scheduling strategy of this embodiment can also include: inserting the computing tasks of other subtasks into the idle time between the forward computing task phase and the backward computing task phase of the current subtask; and/or, after the forward computing task of the current subtask is completed, copying the model parameters and intermediate variables corresponding to the forward computing task of the current subtask from the GPU of the current computing node to the CPU main memory of the current computing node, and marking the backward computing task associated with the forward computing task of the current subtask with a pre-execution time based on the estimated idle time (see the pipeline idle-time sketch after this list).
  • Embodiments of the present invention also provide a distributed training task scheduling method for intelligent computing.
  • the distributed training task scheduling method for intelligent computing in the embodiment of the present invention may include:
  • use the model performance prediction and decomposition module to determine the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, divide the model to be trained into multiple subtasks, and determine the resource consumption information of each subtask.
  • Distributed training methods include one of data parallelism, pipeline parallelism and hybrid parallelism.
  • Hybrid parallelism includes data parallelism and pipeline parallelism.
  • Resource consumption information includes computing consumption and memory consumption;
  • after receiving the subtask request sent by the model performance prediction and decomposition module, the global GPU resource scheduler allocates each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, builds the communication topology between the subtasks, monitors the computing resource operating status of each computing node's GPU during the training of the corresponding subtask, and controls the scheduling of subtasks according to the computing resource operating status of the GPUs of all computing nodes, wherein the subtask request carries the distributed training method corresponding to the model to be trained, the multiple subtasks and the resource consumption information of each subtask; and
  • S300: use the local GPU resource scheduler configured on each computing node to locally schedule the subtasks assigned to that computing node according to the distributed training method.
  • the present invention also provides an embodiment of a distributed training task scheduling device for intelligent computing.
  • an embodiment of the present invention provides a distributed training task scheduling device for intelligent computing, including a memory and one or more processors.
  • the memory stores executable code.
  • when the one or more processors execute the executable code, they are used to implement the distributed training task scheduling method for intelligent computing in the above embodiment.
  • the distributed training task scheduling device for intelligent computing provided by the embodiment of the present invention can be applied to any device with data processing capabilities, which may be a device such as a computer.
  • the device embodiments may be implemented by software, or may be implemented by hardware or a combination of software and hardware. Taking software implementation as an example, as a logical device, it is formed by reading the corresponding computer program instructions in the non-volatile memory into the memory and running them through the processor of any device with data processing capabilities. From the hardware level, as shown in Figure 10, it is a hardware structure diagram of any device with data processing capabilities where the distributed training task scheduling device for intelligent computing provided by the embodiment of the present invention is located.
  • the device with data processing capabilities where the device in the embodiment is located may also include other hardware according to its actual functions, which will not be described in detail here.
  • since the device embodiment basically corresponds to the method embodiment, please refer to the description of the method embodiment for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention, and persons of ordinary skill in the art can understand and implement it without creative effort.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored.
  • when the program is executed by a processor, the distributed training task scheduling method for intelligent computing in the above embodiments is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium can also be an external storage device of any device with data processing capabilities, such as a plug-in hard disk, a smart memory card (SMC), an SD card, or a flash card (Flash Card) equipped on the device.
  • the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.
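The per-layer pre-training profiling described in the entries above (several training iterations per layer, then averaging the calculation time and memory overhead) can be sketched as follows. This is only an illustrative sketch: the `layer_fn` callable, its memory-reporting convention, and the iteration count of 10 are assumptions rather than the patented implementation; in practice the timing and memory figures would come from the training framework's own instrumentation.

```python
import time
from statistics import mean

def profile_layer(layer_fn, sample, iters=10):
    """Run `iters` training iterations of one layer and return the average
    calculation time (seconds) and average memory overhead (bytes).

    `layer_fn` is a hypothetical callable standing in for one forward/backward
    step of a single layer; it is assumed to return the peak memory it used.
    """
    times, mems = [], []
    for _ in range(iters):
        start = time.perf_counter()
        mem_used = layer_fn(sample)               # one training iteration of this layer
        times.append(time.perf_counter() - start)
        mems.append(mem_used)
    # T1 = (T10 + ... + T19)/10 and R1 likewise, as in the three-layer example above
    return mean(times), mean(mems)

def profile_model(layers, sample):
    """Per-layer calculation time and memory overhead for the model to be trained.

    `layers` maps a layer name to its training-step callable.
    """
    return {name: profile_layer(fn, sample) for name, fn in layers.items()}
```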
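The permutation-and-combination search of steps S12 and S13 (enumerate per-layer parallelism choices, drop plans whose GPU demand exceeds the target input resources, then pick the plan with the smallest estimated completion time) could look like the sketch below. The two cost estimators are assumptions supplied by the caller; the patent does not fix their form.

```python
from itertools import product

def choose_training_plan(layer_costs, target_gpus, gpu_estimator, time_estimator):
    """Enumerate per-layer assignments of "data" or "pipeline" parallelism,
    filter out plans that need more GPUs than the target input resources,
    and return the remaining plan with the smallest estimated completion time.

    `layer_costs` maps layer name -> (avg_time, avg_memory) from profiling;
    `gpu_estimator(plan, layer_costs)` and `time_estimator(plan, layer_costs)`
    are assumed cost models.
    """
    layers = list(layer_costs)
    best_plan, best_time = None, float("inf")
    for assignment in product(("data", "pipeline"), repeat=len(layers)):
        plan = dict(zip(layers, assignment))
        if gpu_estimator(plan, layer_costs) > target_gpus:   # exceeds target input resources
            continue
        finish = time_estimator(plan, layer_costs)
        if finish < best_time:
            best_plan, best_time = plan, finish
    return best_plan, best_time
```

A variant that returns the plan with the next-smallest completion time, as mentioned for other embodiments, only needs to keep the two best candidates instead of one.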
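The division of the selected model into subtasks (full replicas for data parallelism, contiguous groups of layers for pipeline parallelism) is sketched below; the `Subtask` record and the even layer split are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    task_id: str
    layers: list         # layer names carried by this subtask
    mode: str            # "data" (complete model replica) or "pipeline" (stage)
    compute_cost: float  # estimated computing consumption
    memory_cost: float   # estimated memory consumption

def split_into_subtasks(layer_names, layer_costs, mode, num_gpus):
    """Divide the model to be trained into one subtask per target GPU."""
    subtasks = []
    if mode == "data":
        # data parallelism: every subtask is a complete copy of the model
        total_t = sum(t for t, _ in layer_costs.values())
        total_r = sum(r for _, r in layer_costs.values())
        for i in range(num_gpus):
            subtasks.append(Subtask(f"dp-{i}", list(layer_names), "data", total_t, total_r))
    else:
        # pipeline parallelism: contiguous layers form a sub-model per stage
        per_stage = max(1, len(layer_names) // num_gpus)
        for i in range(num_gpus):
            lo = i * per_stage
            hi = len(layer_names) if i == num_gpus - 1 else (i + 1) * per_stage
            stage = layer_names[lo:hi]
            if not stage:
                break
            t = sum(layer_costs[name][0] for name in stage)
            r = sum(layer_costs[name][1] for name in stage)
            subtasks.append(Subtask(f"pp-{i}", stage, "pipeline", t, r))
    return subtasks
```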
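A minimal sketch of the global resource allocation function (greedy placement of subtasks onto GPUs with enough free memory, preferring the least-loaded one) and of building the communication topology (peer-to-peer gradient-synchronization links between data-parallel replicas, point-to-point links between adjacent pipeline stages). It reuses the `Subtask` record from the previous sketch; the GPU dictionary fields are assumptions, not an interface defined by the patent.

```python
def assign_subtasks(subtasks, gpus):
    """Greedy global allocation: place each subtask on a GPU that has enough
    free memory, preferring the least-loaded one.  `gpus` is a list of dicts
    like {"id": "node1-gpu0", "free_mem": ..., "load": 0.0}.
    """
    placement = {}
    for st in sorted(subtasks, key=lambda s: s.memory_cost, reverse=True):
        candidates = [g for g in gpus if g["free_mem"] >= st.memory_cost]
        if not candidates:
            raise RuntimeError(f"no GPU can hold subtask {st.task_id}")
        chosen = min(candidates, key=lambda g: g["load"])
        chosen["free_mem"] -= st.memory_cost
        chosen["load"] += st.compute_cost
        placement[st.task_id] = chosen["id"]
    return placement

def build_topology(subtasks, placement):
    """Communication topology: data-parallel replicas exchange gradients with
    every peer (collective/parameter-server style), while pipeline stages
    communicate intermediate parameters point-to-point with their neighbours."""
    edges = []
    dp = [s for s in subtasks if s.mode == "data"]
    pp = sorted((s for s in subtasks if s.mode == "pipeline"), key=lambda s: s.task_id)
    for i, a in enumerate(dp):
        for b in dp[i + 1:]:
            edges.append((placement[a.task_id], placement[b.task_id], "gradient-sync"))
    for a, b in zip(pp, pp[1:]):
        edges.append((placement[a.task_id], placement[b.task_id], "intermediate-params"))
    return edges
```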
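The sub-task cooperative scheduling step (add a backup node for a subtask whose waiting time reaches the preset duration threshold, then ship its latest model parameters so the backup node joins the next iteration as an extra data-parallel worker) might be sketched as below. The 5-minute and 70% defaults mirror the example given above; the dictionary fields and the `send_first_scheduling_info` callback are assumptions.

```python
def pick_backup_node(waiting_subtask, nodes, wait_threshold_s=300, util_threshold=0.70):
    """Return a backup node for a subtask that has waited at least
    `wait_threshold_s`: any node other than its current node whose GPU
    utilization is at or below `util_threshold`, or None if there is none.
    """
    if waiting_subtask["wait_s"] < wait_threshold_s:
        return None
    candidates = [n for n in nodes
                  if n["id"] != waiting_subtask["node"] and n["gpu_util"] <= util_threshold]
    return min(candidates, key=lambda n: n["gpu_util"]) if candidates else None

def add_backup_replica(waiting_subtask, backup_node, send_first_scheduling_info):
    """Copy the subtask's latest model parameters to the backup node (the
    "first scheduling information") so it joins the next iteration in a
    data-parallel manner; the current node and backup node then split the data.
    """
    payload = {
        "subtask": waiting_subtask["id"],
        "latest_params": waiting_subtask["latest_params"],
        "join_mode": "data_parallel",
    }
    send_first_scheduling_info(backup_node["id"], payload)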
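The computing resource adjustment function compares an expected waiting time with the CPU-GPU memory copy time (data size divided by the PCI-E transfer rate) to decide when to park a data-parallel subtask's parameters and intermediate variables in CPU main memory, and when to prefetch them back. A small sketch, assuming an illustrative 12 GB/s PCI-E rate and simple callbacks:

```python
def cpu_gpu_copy_time(num_bytes, pcie_bytes_per_s=12e9):
    """CPU<->GPU copies go over PCI-E at a roughly fixed rate, so the copy
    time is estimated as the amount of data divided by the transfer rate.
    The 12 GB/s figure is only an assumed example."""
    return num_bytes / pcie_bytes_per_s

def should_offload_to_host(expected_wait_s, state_bytes):
    """Move model parameters and intermediate variables to CPU main memory only
    when the expected wait on other dependent subtasks exceeds the copy time."""
    return expected_wait_s > cpu_gpu_copy_time(state_bytes)

def on_second_scheduling_info(subtask_state, copy_host_to_gpu):
    """When the global scheduler signals that the parameter server has started
    computing the global gradient information, prioritise this subtask and
    prefetch its state from CPU main memory back into GPU memory."""
    subtask_state["priority"] = "high"
    copy_host_to_gpu(subtask_state["id"])
```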
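A sketch of the first scheduling strategy's two-level queue on a computing node: gradient-calculation tasks go into the high-priority queue for the GPU, gradient-synchronization tasks go into the low-priority queue (handed to the CPU when GPU resources are tight), and a prefetch mark set on receipt of the first scheduling information promotes a task when the next one is selected. Class and method names are assumptions for illustration only.

```python
from collections import deque

class TwoLevelQueueScheduler:
    def __init__(self):
        self.high = deque()           # gradient calculation phase tasks
        self.low = deque()            # gradient synchronization phase tasks
        self.prefetch_marked = set()

    def submit(self, task_id, phase):
        (self.high if phase == "gradient_calc" else self.low).append(task_id)

    def mark_prefetch(self, task_id):
        # set when the first scheduling information arrives from the global scheduler
        self.prefetch_marked.add(task_id)

    def next_for_gpu(self):
        # prefetch-marked tasks are picked first, then the high-priority queue
        for queue in (self.high, self.low):
            for task_id in list(queue):
                if task_id in self.prefetch_marked:
                    queue.remove(task_id)
                    return task_id
        return self.high.popleft() if self.high else None

    def next_for_cpu(self):
        # gradient synchronization is handed to the CPU when the GPU is busy
        return self.low.popleft() if self.low else None
```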
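Finally, a sketch of the second scheduling strategy for pipeline-parallel subtasks: every stage except the last is split into a forward and a backward computing task, and the idle window between them is filled with other subtasks' computing tasks. The per-task time estimates and the shortest-first packing are an assumed cost model, not the patented algorithm.

```python
def split_pipeline_stage(subtask_id, stage_index, num_stages):
    """The last stage stays one complete computing task (its forward and
    backward calculations run back to back); every other stage is split into
    a forward task and a backward task so its idle gap can be reused."""
    if stage_index == num_stages - 1:
        return [(subtask_id, "full")]
    return [(subtask_id, "forward"), (subtask_id, "backward")]

def fill_idle_gap(idle_seconds, pending_tasks):
    """Insert other subtasks' computing tasks into the idle time between a
    stage's forward and backward tasks.  `pending_tasks` maps a task id to its
    estimated run time; shorter tasks are packed first."""
    scheduled, remaining = [], idle_seconds
    for task_id, est_time in sorted(pending_tasks.items(), key=lambda kv: kv[1]):
        if est_time <= remaining:
            scheduled.append(task_id)
            remaining -= est_time
    return scheduled
```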

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a distributed training task scheduling method, system and device for intelligent computing. The system includes a model performance prediction and decomposition module, a global GPU resource scheduler, and a local GPU resource scheduler configured on each computing node. After receiving the subtask request sent by the model performance prediction and decomposition module, the global GPU resource scheduler allocates each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, builds the communication topology between the subtasks, monitors the computing resource operating status of each computing node's GPU while that GPU trains its subtask, and controls the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes. The present invention can improve the utilization of resources such as the GPUs and network of the computing cluster and reduce the waiting time of subtask training.

Description

Distributed training task scheduling method, system and device for intelligent computing
Technical field
The present invention relates to the field of intelligent computing, and in particular to a distributed training task scheduling method, system and device for intelligent computing.
Background
The emergence of deep learning has brought great advances to natural language processing, audio and video processing, converged media and other fields. However, as deep learning models grow larger and larger, some large models have more than tens of billions of parameters, and such large-scale models are usually trained by building distributed machine learning systems. At the same time, because the computing power of a single GPU is limited, accelerating model training by building distributed training methods across multiple machines and multiple GPU cards has become a very common approach.
In distributed training, a computing task is divided into multiple subtasks that are assigned to different GPUs for execution, and different distributed training methods have different communication and computing efficiencies. When multiple models are trained in a computing cluster at the same time, simple scheduling methods obviously cannot bring out the best performance of the intelligent computing cluster. When models trained with distributed methods are trained in the intelligent computing cluster at the same time, relying only on local resource schedulers may lead to problems such as training tasks waiting for each other, idle GPUs, and communication congestion.
Summary of the invention
The purpose of the present invention is to provide a distributed training task scheduling method, system and device for intelligent computing, which solves the problem in the prior art that single-card, single-task scheduling cannot make use of the coordinated scheduling characteristics of distributed training methods and cannot fully exploit the performance potential of distributed training in intelligent computing clusters.
The technical solution adopted by the present invention is as follows:
An embodiment of the present invention provides a distributed training task scheduling system for intelligent computing. The computing cluster includes multiple computing nodes that can communicate with each other, and each computing node includes at least one CPU and at least one GPU. The system includes:
a model performance prediction and decomposition module, used to determine the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, divide the model to be trained into multiple subtasks, and determine the resource consumption information of each subtask, where the distributed training method includes one of data parallelism, pipeline parallelism and hybrid parallelism, hybrid parallelism includes data parallelism and pipeline parallelism, and the resource consumption information includes computing consumption and memory consumption;
a global GPU resource scheduler, used to, after receiving the subtask request sent by the model performance prediction and decomposition module, allocate each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, build the communication topology between the subtasks, monitor the computing resource operating status of each computing node's GPU while that GPU trains its subtask, and control the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, wherein the subtask request carries the distributed training method corresponding to the model to be trained, the multiple subtasks and the resource consumption information of each subtask; and
a local GPU resource scheduler configured on each computing node, used to locally schedule the subtasks assigned to that computing node according to the distributed training method.
An embodiment of the present invention also provides a distributed training task scheduling method for intelligent computing. The computing cluster includes multiple computing nodes that can communicate with each other, and each computing node includes at least one CPU and at least one GPU. The method includes:
determining, through the model performance prediction and decomposition module, the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, dividing the model to be trained into multiple subtasks, and determining the resource consumption information of each subtask, where the distributed training method includes one of data parallelism, pipeline parallelism and hybrid parallelism, hybrid parallelism includes data parallelism and pipeline parallelism, and the resource consumption information includes computing consumption and memory consumption;
after receiving, through the global GPU resource scheduler, the subtask request sent by the model performance prediction and decomposition module, allocating each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the multiple computing nodes, building the communication topology between the subtasks, monitoring the computing resource operating status of each computing node's GPU while that GPU trains its subtask, and controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, wherein the subtask request carries the distributed training method corresponding to the model to be trained, the multiple subtasks and the resource consumption information of each subtask; and
locally scheduling, through the local GPU resource scheduler configured on each computing node, the subtasks assigned to that computing node according to the distributed training method.
An embodiment of the present invention also provides a distributed training task scheduling device for intelligent computing, including a memory and one or more processors, wherein the memory stores executable code, and when the one or more processors execute the executable code, they are used to implement any one of the above distributed training task scheduling methods for intelligent computing.
An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, any one of the above distributed training task scheduling methods for intelligent computing is implemented.
The beneficial effects of the present invention are as follows: by setting up a global GPU resource scheduler to allocate subtasks, build the communication topology between subtasks, monitor the computing resource operating status of each computing node's GPU and schedule the subtasks, the utilization of resources such as the GPUs and network of the computing cluster is improved and the waiting time of subtask training is reduced.
Brief description of the drawings
Figure 1 is a schematic structural diagram of a computing cluster provided by an embodiment of the present invention;
Figure 2 is a schematic structural diagram of a distributed training task scheduling system for intelligent computing provided by another embodiment of the present invention;
Figure 3 is a schematic flow chart of an implementation method by which the model performance prediction and decomposition module determines the distributed training method corresponding to the model to be trained based on the model to be trained, the target completion time and the target input resources input by the user, provided by an embodiment of the present invention;
Figure 4 is a schematic flow chart of an implementation method by which the global GPU resource scheduler controls the scheduling of subtasks according to the computing resource operating status of the GPUs of all computing nodes, provided by an embodiment of the present invention;
Figure 5 is a schematic diagram of the functions of the global GPU resource scheduler and the interaction between the global GPU resource scheduler and the local resource schedulers, provided by an embodiment of the present invention;
Figure 6 is a schematic flow chart of an implementation method by which the local GPU resource scheduler configured on each computing node locally schedules the subtasks assigned to that computing node according to the distributed training method, provided by an embodiment of the present invention;
Figure 7 is a schematic flow chart of a first scheduling strategy provided by an embodiment of the present invention;
Figure 8 is a schematic flow chart of a second scheduling strategy provided by an embodiment of the present invention;
Figure 9 is a schematic flow chart of a distributed training task scheduling method for intelligent computing provided by an embodiment of the present invention;
Figure 10 is a structural block diagram of a distributed training task scheduling device for intelligent computing provided by an embodiment of the present invention.
Reference numerals:
10. Model performance prediction and decomposition module; 20. Global GPU resource scheduler; 30. Local GPU resource scheduler.
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
模型参数训练方法可包括数据并行和流水并行,其中,数据并行将模型复制到多个GPU中,通过集合通信或者参数服务器进行梯度交互和参数更新,训练过程可以分为梯度计算和梯度同步两个阶段,其中梯度计算的计算效果较高,GPU利用率较高,基本上没有通信量,但是在梯度同步阶段计算开销则相对较低,而通信开销非常高。流水并行将模型按层划分成多个阶段,每个阶段部署到GPU上,多个阶段执行顺序执行前向计算,并在最后一个阶段计算损失函数,再从最后一个阶段到第一个阶段依次进行反向计算,整个过程中不同阶段的前 向计算和反向计算的之间的空闲等待时间并不相同。
当前的分布式训练中,GPU计算资源调度往往都是针对单卡为调度单位,无法充分利用分布式训练方法的协调调度特征,无法充分挖掘分布式训练在智能计算集群的性能潜力。
对于此,本发明通过设置全局GPU资源调度器20,进行子任务的分配、各子任务之间的通信拓扑、各计算节点的GPU的计算资源运行情况的监控以及子任务的调度,提高计算集群的GPU和网络等资源的利用率,减少子任务训练的等待时间。
需要说明的是,在不冲突的情况下,下述的实施例及实施方式中的特征可以相互组合。
参见图1,本发明实施例中的计算集群可包括多个计算节点,多个计算节点之间能够相互通信,各计算节点包括至少一CPU和至少一个GPU。如图1所示,计算集群可包括计算节点1、计算节点2、…、计算节点N,其中,N为正整数,且N大于或等于3。
需要说明的是,本发明实施例中的面向智能计算的分布式训练任务调度方法、***和装置适用于如图1所示的计算集群的分布式任务调度。
具体地,参见图2,本发明实施例的面向智能计算的分布式训练任务调度***可包括模型性能预测和分解模块10、全局GPU资源调度器20和本地GPU资源调度器30,本发明实施例中,各计算节点均配置本地GPU资源调度,如图2所示,计算集群可包括计算节点1、计算节点2、…、计算节点N,其中,N为正整数,且N大于或等于3,计算节点1、计算节点2、…、计算节点N分别配置本地GPU资源调度器31、本地GPU资源调度器32、…、本地GPU资源调度器3N。
其中,模型性能预测和分解模块10用于根据用户输入的待训练模型、目标完成时间和目标投入资源,确定待训练模型对应的分布式训练方式,并将待训练模型划分成多个子任务,以及确定各子任务的资源消耗信息,分布式训练方式包括数据并行、流水并行和混合并行中的一种,混合并行包括数据并行和流水并行,资源消耗信息包括计算消耗和内存消耗。
全局GPU资源调度器20用于在接收到模型性能预测和分解模块10发送的子任务请求后,根据各子任务的资源消耗信息及多个计算节点的GPU运行情况,将各子任务分配到匹配的计算节点的GPU进行训练,并构建各子任务之间的通信拓扑,并在各计算节点的GPU训练对应子任务的过程中,监控各计算节点的GPU的计算资源运行情况,以及根据所有计算节点的GPU的计算资源运行情况,控制子任务的调度,其中,子任务请求携带有待训练模型对应的分布式训练方式、多个子任务及各子任务的资源消耗信息。
各计算节点均配置的本地GPU资源调度器用于根据分布式训练方式,对分配到该计算节点的子任务进行本地调度。
在本发明实施例中,待训练模型可以为神经网络模型,也可以为其他类型的模型,如待 训练的数学模型。
另外,待训练模型可以包括一个模型,也可以包括多个模型。
目标完成时间大小可根据预测的完成待训练模型的训练所需的时长确定,例如,目标完成时间的大小可等于预测的完成待训练模型的训练所需的时长,或者目标完成时间的大小稍大于预测的完成待训练模型的训练所需的时长。其中,预测的完成待训练模型的训练所需的时长可根据经验预测,如历史训练数据预测。
目标投入资源可根据预测的完成待训练模型的训练所需的资源大小确定,例如,目标投入资源大小可等于预测的完成待训练模型的训练所需的资源大小,或者目标投入资源大小稍大于预测的完成待训练模型的训练所需的资源大小。其中,预测的完成待训练模型的训练所需的资源大小可根据经验预测,如历史训练数据预测。
在一可行的实现方式中,参见图3,模型性能预测和分解模块10在用于根据用户输入的待训练模型、目标完成时间和目标投入资源,确定待训练模型对应的分布式训练方式时,可采用如下步骤实现:
S11: pre-train the model to be trained, and determine the computation time and memory overhead required by each layer of parameters in the model to be trained.
Since the training time varies little from iteration to iteration, the model performance prediction and decomposition module 10 obtains the computation time and memory overhead required by each layer of parameters in the model to be trained by pre-training on a single machine. It should be noted that the pre-training in this step does not complete the training; it only runs several training iterations on the model to be trained and then takes averages to predict the computation time and memory overhead of each layer of parameters.
In a specific implementation, the process of pre-training the model to be trained and determining the computation time and memory overhead required by each layer of parameters in the model to be trained may include: performing multiple training iterations on each layer of parameters in the model to be trained, and determining the computation time and memory overhead of each training iteration of each layer of parameters; determining the computation time required by a layer of parameters according to the average of the computation times of the multiple training iterations of that layer of parameters; and determining the memory overhead required by a layer of parameters according to the average of the memory overheads of the multiple training iterations of that layer of parameters.
For example, suppose the model to be trained includes three layers, namely a first layer, a second layer and a third layer. The model performance prediction and decomposition module 10 first performs 10 training iterations on the parameters of the first layer, the second layer and the third layer, respectively, and determines the computation time T1,i and memory overhead R1,i of each training iteration of the first-layer parameters, the computation time T2,i and memory overhead R2,i of each training iteration of the second-layer parameters, and the computation time T3,i and memory overhead R3,i of each training iteration of the third-layer parameters, where i is the index of the iteration, i = 0, 1, 2, ..., 9.
The average computation time of the 10 training iterations of the first-layer parameters is T1 = (T1,0 + T1,1 + ... + T1,9)/10, and the average memory overhead of the 10 training iterations of the first-layer parameters is R1 = (R1,0 + R1,1 + ... + R1,9)/10. The average computation time and average memory overhead of the 10 training iterations of the second-layer and third-layer parameters are computed in a similar manner and are not repeated here.
The computation time required by each layer of parameters may be equal to, or slightly greater than, the average computation time of the multiple training iterations of that layer of parameters. Following the above example, the computation time required by the first-layer parameters may be equal to T1 or slightly greater than T1.
The memory overhead required by each layer of parameters may be equal to, or slightly greater than, the average memory overhead of the multiple training iterations of that layer of parameters. Following the above example, the memory overhead required by the first-layer parameters may be equal to R1 or slightly greater than R1.
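As a rough illustration of step S11, the following Python sketch averages per-iteration measurements into per-layer estimates; the run_layer_iteration callable, assumed to run one training iteration of a single layer and return its measured memory use in bytes, is a hypothetical stand-in for the actual profiling hooks and is not part of the disclosure.

```python
import time


def profile_layers(run_layer_iteration, num_layers, num_iters=10):
    """Estimate per-layer compute time and memory overhead by averaging
    measurements over several pre-training iterations (step S11)."""
    estimates = []
    for layer in range(num_layers):
        times, mems = [], []
        for _ in range(num_iters):
            start = time.perf_counter()
            mem_bytes = run_layer_iteration(layer)  # hypothetical hook
            times.append(time.perf_counter() - start)
            mems.append(mem_bytes)
        estimates.append({
            "layer": layer,
            "avg_time_s": sum(times) / num_iters,
            "avg_mem_bytes": sum(mems) / num_iters,
        })
    return estimates
```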
S12: determine, according to the computation time and memory overhead required by each layer of parameters, the GPU resources and task completion time respectively required by different distributed training modes.
It should be noted that each training iteration in step S11 above may use data parallelism or pipeline parallelism; each layer of parameters may use a single iterative training mode, or may use both data parallelism and pipeline parallelism (a training mode that includes both data parallelism and pipeline parallelism may be referred to as a hybrid training mode).
In step S12, the iterative training of the multiple layers of parameters of the model to be trained is first arranged into combinations, each combination corresponding to a distributed training mode, and the GPU resources and task completion time required by the multiple layers of parameters of the model to be trained under each combination are then evaluated.
S13: select, according to the target completion time and the target input resources, the distributed training mode with the smallest task completion time as the distributed training mode corresponding to the model to be trained.
Optionally, the combinations in step S12 whose GPU resources exceed the target input resources are excluded, and among the remaining combinations the one with the smallest task completion time is selected as the distributed training mode of the model to be trained, thereby ensuring the best training efficiency. Of course, in other embodiments, after excluding the combinations in step S12 whose GPU resources exceed the target input resources, the combination with the second smallest task completion time among the remaining combinations may be selected as the distributed training mode of the model to be trained, so as to meet different training requirements.
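A minimal sketch of the selection logic in steps S12 and S13 follows; the candidate plans and their estimated GPU demand and completion time are hypothetical inputs assumed to come from the per-layer profile, and the field names are illustrative only.

```python
def select_training_mode(candidates, target_gpus, pick_second_best=False):
    """Pick a distributed training mode (steps S12-S13).

    `candidates` is a list of dicts such as
    {"mode": "pipeline", "gpus": 8, "completion_time_s": 3600.0}.
    Plans whose GPU demand exceeds the target input resources are
    discarded; the remaining plan with the smallest (or, optionally,
    second smallest) completion time is returned.
    """
    feasible = [c for c in candidates if c["gpus"] <= target_gpus]
    if not feasible:
        return None
    ranked = sorted(feasible, key=lambda c: c["completion_time_s"])
    index = 1 if pick_second_best and len(ranked) > 1 else 0
    return ranked[index]
```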
After the distributed training mode of the model to be trained is selected, the model performance prediction and decomposition module 10 divides the model to be trained into a plurality of subtasks according to the GPUs of the plurality of computing nodes in the computing cluster. In the case of data parallelism, each subtask is a complete model, and the subtasks on the GPUs exchange gradients and update parameters through collective communication or a parameter server; in the case of pipeline parallelism, the subtask on each GPU is a sub-model containing several layers of parameters, and the sub-models on the GPUs communicate intermediate parameters in a point-to-point manner.
The model performance prediction and decomposition module 10 sends the assigned subtasks and descriptive information such as the resource consumption information of each subtask to the global GPU resource scheduler 20, and the global GPU resource scheduler 20 finds suitable GPUs of computing nodes to run them and constructs the communication topology.
After receiving the subtask request sent by the model performance prediction and decomposition module 10, the global GPU resource scheduler 20 allocates the subtasks to suitable GPUs for execution according to the current GPU operating status of the computing cluster (that is, the GPU operating status of each computing node in the computing cluster) combined with the computation time and memory requirements of all subtasks of the model (that is, the resource consumption information of each subtask), and at the same time constructs the communication topology between the subtasks. The GPUs of the computing nodes then train the subtasks allocated to them. In other words, the global GPU resource scheduler 20 of the embodiment of the present invention has a global resource allocation function.
Optionally, the global GPU resource scheduler 20 maps the subtasks decomposed by the model performance prediction and decomposition module 10 to specific GPUs, so that subtasks of multiple models can share a GPU while the waiting time between the multiple subtasks of one model is minimized as far as possible.
The computing resource operating status in the embodiment of the present invention may include the waiting time of subtasks and the GPU utilization. It can be understood that the computing resource operating status is not limited to the above waiting time of subtasks and GPU utilization, and may also include other information, such as the CPU utilization of the computing node.
The manner in which the global GPU resource scheduler 20 monitors the computing resource operating status of the GPU of each computing node may be selected as required. For example, in some embodiments, the global GPU resource scheduler 20 actively obtains the computing resource operating status of the GPUs from the computing nodes in the computing cluster; in other embodiments, each computing node in the computing cluster actively reports the computing resource operating status of its GPUs to the global GPU resource scheduler 20.
Optionally, in some embodiments, the global GPU resource scheduler 20 periodically obtains the computing resource operating status of the GPU of each computing node. For example, the global GPU resource scheduler 20 periodically receives the computing resource operating status of the GPU of a computing node fed back by that computing node, that is, each computing node periodically reports the computing resource operating status of its GPU to the global GPU resource scheduler 20. As another example, the global GPU resource scheduler 20 periodically obtains the computing resource operating status of the GPUs from the computing nodes in the computing cluster, that is, the global GPU resource scheduler 20 actively and periodically pulls the computing resource operating status of the GPUs from the computing nodes. The period for obtaining the computing resource operating status of the GPUs of the computing nodes may be set as required, for example, 10 minutes.
It should be understood that, in other embodiments, the global GPU resource scheduler 20 obtains the computing resource operating status of the GPU of each computing node non-periodically. In such embodiments, the global GPU resource scheduler 20 may obtain the computing resource operating status of the GPU of each computing node when needed.
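For the periodic reporting variant described above, a minimal sketch of a per-node reporting loop is shown below; collect_gpu_status and send_to_global_scheduler are hypothetical callables standing in for the node-side monitoring and the transport to the global GPU resource scheduler 20, and the 10-minute default period mirrors the example given in the text.

```python
import threading
import time


def start_status_reporter(collect_gpu_status, send_to_global_scheduler,
                          period_s=600.0):
    """Periodically report local GPU status to the global scheduler.

    `collect_gpu_status` returns a dict such as
    {"gpu_util": 0.63, "subtask_wait_s": 12.0};
    `send_to_global_scheduler` delivers it to the global GPU resource
    scheduler. Both are assumed, not disclosed, interfaces.
    """
    def loop():
        while True:
            send_to_global_scheduler(collect_gpu_status())
            time.sleep(period_s)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```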
The global GPU resource scheduler 20 of the embodiment of the present invention has a subtask cooperative scheduling function. In a feasible implementation, referring to FIG. 4, when the global GPU resource scheduler 20 controls the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, the following steps may be included, but are not limited to:
S21: add a backup node for a subtask whose waiting time is greater than or equal to a preset time threshold, where the backup node is a computing node, among the plurality of computing nodes, other than the current computing node corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold, and the GPU utilization of the backup node is less than or equal to a preset utilization threshold.
The preset time threshold, the preset utilization threshold and the like may be set by the user according to actual requirements.
For example, suppose the preset time threshold is 5 minutes and the preset utilization threshold is 70%, the GPU of computing node 1 executes subtask 11 and subtask 12, computing node 2 executes subtask 13, and computing node 3 executes subtask 14 and subtask 15. If the waiting time of subtask 12 is greater than 5 minutes, the GPU utilization of computing node 2 is less than 70%, and the utilization of computing node 3 is greater than 70%, then computing node 2 can be used as the backup node.
S22: copy the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold to the backup node, so that the latest model parameters corresponding to that subtask are added to the backup node in a data-parallel manner and the backup node participates in the training of that task in the next iteration round.
Following the above example, it should be noted that before subtask 12 is scheduled to computing node 2, computing node 1 may already have carried out corresponding training on subtask 12; therefore, the latest model parameters of subtask 12 are copied to computing node 2 here.
By adding a backup node for the subtask whose waiting time is greater than or equal to the preset time threshold and copying the latest model parameters corresponding to that subtask to the backup node, so that those latest model parameters are added to the backup node in a data-parallel manner and the backup node participates in the training of that task in the next iteration round, the training waiting time of the subtask whose waiting time is greater than or equal to the preset time threshold is reduced, the idle time of the backup node is fully utilized, the overall training waiting time is ultimately reduced, and training efficiency is improved.
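The backup-node selection of step S21 can be sketched as follows; the dictionary layouts are hypothetical, while the 5-minute waiting threshold and 70% utilization threshold match the example above.

```python
def pick_backup_node(subtask, nodes, wait_threshold_s=300.0,
                     util_threshold=0.7):
    """Choose a backup node for a long-waiting subtask (step S21).

    `subtask` is a dict like {"id": 12, "node": 1, "wait_s": 360.0}
    and `nodes` maps node id -> {"gpu_util": ...}. A backup node must
    differ from the subtask's current node and have GPU utilization
    at or below the threshold.
    """
    if subtask["wait_s"] < wait_threshold_s:
        return None  # the subtask has not waited long enough
    for node_id, status in nodes.items():
        if node_id != subtask["node"] and status["gpu_util"] <= util_threshold:
            return node_id
    return None


# Example matching the 5-minute / 70% thresholds described above.
nodes = {1: {"gpu_util": 0.9}, 2: {"gpu_util": 0.5}, 3: {"gpu_util": 0.8}}
print(pick_backup_node({"id": 12, "node": 1, "wait_s": 360.0}, nodes))  # -> 2
```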
In this step, the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold are added to the backup node in a data-parallel manner so that the backup node participates in the training of that task in the next iteration round. Specifically, the current computing node and the backup node form a small-scale data-parallel group between these two nodes, so the current computing node only needs to train half of the data in the next iteration round, which reduces the load of the current computing node.
When implementing step S22, optionally, the global GPU resource scheduler 20 sends first scheduling information to the local GPU resource scheduler of the backup node, where the first scheduling information carries the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold; after receiving the first scheduling information, the local GPU resource scheduler of the backup node adds those latest model parameters to the backup node in a data-parallel manner, and the backup node participates in the training of that task in the next iteration round.
Following the above example, the global GPU resource scheduler 20 sends the first scheduling information to the local GPU resource scheduler of computing node 2, where the first scheduling information carries the latest model parameters of subtask 12; after receiving the first scheduling information, the local GPU resource scheduler of computing node 2 adds the latest model parameters of subtask 12 to computing node 2 in a data-parallel manner, and computing node 2 participates, as a new data-parallel node, in the training of that task in the next iteration round.
When controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, the global GPU resource scheduler 20 may also take into account the distributed training mode corresponding to a subtask. For example, in some embodiments, when the distributed training mode corresponding to a subtask is data parallelism, the training process of the subtask includes a gradient computation phase and a gradient synchronization phase, and the global GPU resource scheduler 20 controls the prefetching of the model parameters and intermediate variables of the corresponding subtasks according to the computing resource operating status of the GPUs of the computing nodes where all data-parallel subtasks are located.
Specifically, when controlling this prefetching, the global GPU resource scheduler 20, after receiving notice that the parameter server has started computing the global gradient information of all data-parallel subtasks, sends second scheduling information to the computing nodes corresponding to the data-parallel subtasks, so as to prompt, through the second scheduling information, those computing nodes to execute the corresponding data-parallel subtasks with priority and to copy the latest model parameters and intermediate variables of those subtasks from the CPU main memory of the computing node back into the GPU memory of the computing node. While a data-parallel subtask is waiting for the computation results of other subtasks it depends on, and the expected waiting time exceeds the CPU-GPU memory copy time of the computing node where it is located, that computing node temporarily migrates the model parameters and intermediate variables corresponding to the data-parallel subtask from the GPU memory of the computing node to the CPU main memory of the computing node.
This takes advantage of the large capacity of CPU main memory: while a model's subtask on a GPU is waiting for the computation results of other tasks it depends on, and the expected waiting time exceeds the CPU-GPU memory copy time, the model parameters and intermediate variables of that subtask are temporarily moved to CPU main memory and then prefetched back before the next computation, which improves the utilization of resources such as GPUs and the network in the computing cluster. The CPU-GPU memory copy is transferred over the PCI-E channel, whose transfer rate is relatively fixed; therefore, the CPU-GPU memory copy time can be calculated by dividing the amount of data to be transferred by the PCI-E transfer rate.
After the global GPU resource scheduler 20 learns that the parameter server of the computing cluster has started computing the global gradient information, it sends the second scheduling information to the computing nodes of the corresponding subtasks (the computing nodes corresponding to the above data-parallel subtasks). After receiving the second scheduling information, the computing node of the corresponding subtask is prompted to execute that subtask with priority and to copy its model parameters and intermediate variables from CPU main memory back into GPU memory as soon as possible, thereby improving GPU computing efficiency while minimizing the waiting time of subtask computation. The global GPU resource scheduler 20 in this embodiment has a computing resource adjustment function.
It should be noted that the global gradient information is determined according to the gradient information of each subtask. Optionally, the global gradient information includes the gradient information of each subtask; optionally, the global gradient information is obtained by processing the gradient information of each subtask. The gradient information includes gradient computation information and gradient synchronization information.
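The offloading criterion described above, where the expected waiting time is compared with a CPU-GPU copy time estimated as data size divided by the PCI-E transfer rate, can be sketched as follows; the 16 GB/s default bandwidth is an assumed figure and not one stated in the text.

```python
def cpu_gpu_copy_time_s(data_bytes, pcie_bandwidth_bytes_per_s=16e9):
    """Estimate the CPU-GPU memory copy time as data size divided by
    the (roughly fixed) PCI-E transfer rate."""
    return data_bytes / pcie_bandwidth_bytes_per_s


def should_offload_to_cpu(expected_wait_s, data_bytes,
                          pcie_bandwidth_bytes_per_s=16e9):
    """Temporarily move a waiting subtask's model parameters and
    intermediate variables to CPU main memory only if the expected
    wait exceeds the estimated copy time."""
    return expected_wait_s > cpu_gpu_copy_time_s(
        data_bytes, pcie_bandwidth_bytes_per_s)


# Example: 2 GB of parameters and intermediate variables, 1 s wait.
print(should_offload_to_cpu(1.0, 2 * 1024**3))  # -> True
```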
In some embodiments, the global GPU resource scheduler 20 also has a task resource reclamation function. Specifically, the global GPU resource scheduler 20 is further configured to: after the training of the model to be trained is completed, determine the computing node where each subtask is located according to the historical allocation information of each subtask of the model to be trained; control the computing node where each subtask is located to reclaim the local resources used on that computing node for training the corresponding subtask; and, after determining that resource reclamation on all computing nodes has finished, release the resources used on the global GPU resource scheduler 20 for training the model to be trained.
Referring to FIG. 5, the global GPU resource scheduler 20 in an embodiment of the present invention integrates the functions of global resource allocation, subtask cooperative scheduling, computing resource adjustment and task resource reclamation.
Referring to FIG. 6, when the local GPU resource scheduler configured on each computing node performs local scheduling of the subtasks allocated to that computing node according to the distributed training mode, the following steps are included, but are not limited to:
S31: determine, according to the distributed training mode, the training type of the subtask allocated locally, where the training type includes a data-parallel task and a pipeline-parallel task.
For example, the training type of the subtask local to the computing node corresponds to the combination determined in step S13.
S32: determine, according to the training type of the subtask allocated locally, the local scheduling policy of the subtask allocated locally.
S33: perform local scheduling of the subtask allocated locally according to the local scheduling policy.
When the training type is a data-parallel task, the local scheduling policy of the subtask is a first scheduling policy; when the training type is a pipeline-parallel task, the local scheduling policy of the subtask is a second scheduling policy.
The first scheduling policy may be set as required. For example, in some embodiments, when the training type is a data-parallel task, the training process of the subtask includes a gradient computation phase and a gradient synchronization phase. The gradient computation phase has high computational efficiency and very low communication overhead, while the gradient synchronization phase has lower computational efficiency and higher communication overhead. The first scheduling policy schedules and manages subtasks according to these characteristics, so as to obtain optimal scheduling, improve the utilization of resources such as GPUs and the network in the computing cluster, and reduce the waiting time of subtask training. Specifically, referring to FIG. 7, the first scheduling policy includes: obtaining a first computation demand of the current subtask in the gradient computation phase and second computation demands of other subtasks on the current computing node; and determining the training order of all subtasks on the current computing node according to the first computation demand and the second computation demands, in order of computational efficiency. It should be understood that computational efficiency is negatively correlated with the size of the computation demand, that is, the greater the computation demand, the lower the computational efficiency, and the smaller the computation demand, the higher the computational efficiency. Optionally, the higher the computational efficiency, the earlier in the training order; the lower the computational efficiency, the later in the training order.
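Because computational efficiency is treated as inversely related to computation demand, ordering the subtasks on a node under the first scheduling policy reduces to sorting by demand, as in the sketch below; the demand values (for example, estimated FLOPs) and task names are hypothetical inputs.

```python
def order_subtasks_by_efficiency(current_demand, other_demands):
    """Order the subtasks on a node for the first scheduling policy.

    Sorting by ascending computation demand puts the most efficient
    subtasks first, since efficiency falls as demand grows.
    """
    tasks = [("current", current_demand)]
    tasks += [(name, d) for name, d in other_demands.items()]
    return [name for name, _ in sorted(tasks, key=lambda t: t[1])]


print(order_subtasks_by_efficiency(3.0, {"task_a": 1.5, "task_b": 4.0}))
# -> ['task_a', 'current', 'task_b']
```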
If the local computing resources of the current computing node can no longer meet the computation demand of a new task, which may affect the task completion time expected by the user, the current computing node feeds back this judgment result together with the computing resource usage of its GPU to the global GPU resource scheduler 20, and asks the global GPU resource scheduler 20 whether there is another computing node that can meet the task completion time expected by the user.
If the global GPU resource scheduler 20 arranges the task on another computing node, the scheduling of the new task ends. Further, referring to FIG. 7, the first scheduling policy also includes: when the computation demand of the current subtask exceeds the local computing resources of the current computing node, feeding back the computing resource operating status of the GPU of the current computing node to the global GPU resource scheduler 20, so as to ask the global GPU resource scheduler 20 whether there is another computing node whose computing resources can meet the computation demand of the current subtask. When the computation demand of the current subtask exceeds the local computing resources of a computing node, the training time of the current subtask exceeds the task completion time expected by the user; when the computation demand of the current subtask is less than the local computing resources of the computing node, the training time of the current subtask is less than the task completion time expected by the user.
In the case where the global GPU resource scheduler 20 does not arrange the task on another computing node, further, referring to FIG. 7, the first scheduling policy also includes: when the global GPU resource scheduler 20 feeds back that no other computing node exists, the current computing node builds a high-priority queue and a low-priority queue, puts the gradient computation phase task of the current subtask into the high-priority queue, and puts the gradient synchronization phase task of the current subtask into the low-priority queue; the GPU of the current computing node executes the gradient computation phase task, and the CPU of the current computing node executes the gradient synchronization phase task. Still further, the first scheduling policy also includes: when the gradient computation phase task has finished executing, copying the model parameters and intermediate variables corresponding to the gradient computation phase task into the CPU main memory of the current computing node; when both the gradient computation phase task and the gradient synchronization phase task have finished, copying the model parameters and intermediate variables of the corresponding subtask into the GPU memory of the current computing node; and/or, after the current computing node receives the first scheduling information sent by the global GPU resource scheduler 20, marking the subtask in the low-priority queue with a prefetch mark, where the GPU of each computing node preferentially executes subtasks marked with the prefetch mark. For example, when the global GPU resource scheduler 20 feeds back that no other computing node exists, the current computing node builds a two-level queue, divides the data-parallel task into a gradient computation phase task and a gradient synchronization phase task, puts the gradient computation phase task into the high-priority queue and the gradient synchronization phase task into the low-priority queue. If the GPU computing resources of the current computing node are tight, the gradient synchronization phase task is completed by the CPU of the current computing node. After the current computing node receives the first scheduling information sent by the global GPU resource scheduler 20, the current subtask in the low-priority queue is marked with a prefetch mark. When selecting tasks, the local scheduling policy of the current computing node preferentially executes subtasks with a prefetch mark; if the gradient synchronization of the current subtask has completed, the model parameters and intermediate variables of the current subtask are copied from the CPU host memory of the current computing node into the GPU memory of the current computing node; otherwise, monitoring continues until the gradient synchronization completes, and the copy then starts.
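A minimal sketch of the two-level queue in the first scheduling policy follows; the task dictionaries and field names are hypothetical, and the sketch only models queue membership and the prefetch mark, not the actual GPU/CPU execution.

```python
from collections import deque


class TwoLevelQueueScheduler:
    """Two-level queue of the first scheduling policy: gradient
    computation tasks go to the high-priority queue (run on the GPU),
    gradient synchronization tasks go to the low-priority queue (run
    on the CPU when the GPU is busy)."""

    def __init__(self):
        self.high = deque()  # gradient computation phase tasks
        self.low = deque()   # gradient synchronization phase tasks

    def submit_data_parallel_task(self, name):
        self.high.append({"name": name, "phase": "grad_compute"})
        self.low.append({"name": name, "phase": "grad_sync",
                         "prefetch": False})

    def mark_prefetch(self, name):
        # Called after the first scheduling information arrives.
        for task in self.low:
            if task["name"] == name:
                task["prefetch"] = True

    def next_task(self):
        # Prefetch-marked synchronization tasks are executed first.
        for task in list(self.low):
            if task["prefetch"]:
                self.low.remove(task)
                return task
        if self.high:
            return self.high.popleft()
        return self.low.popleft() if self.low else None
```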
In other embodiments, when the training type is a pipeline-parallel task, the current subtask includes multiple task stages, where the computation task of the last stage of the current subtask is a complete computation task, and the computation tasks of the other stages of the current subtask include a forward computation task and a backward computation task. A pipeline-parallel task divides the training process into multiple stages and forms a bidirectional pipeline: forward computation proceeds from the first stage to the last stage, the loss function is then computed, and backward computation proceeds from the last stage back to the first stage, and the idle time of different stages differs. The first stage has the largest idle time, which then decreases stage by stage, and in the last stage the forward computation and backward computation are joined together with no idle time. The second scheduling policy schedules and manages subtasks according to this characteristic, so as to obtain optimal scheduling, improve the utilization of resources such as GPUs and the network in the computing cluster, and reduce the waiting time of subtask training.
Optionally, after receiving a scheduling request from the global GPU resource scheduler, the current computing node decides, according to the stage to which the current subtask belongs, whether to divide it into two computation tasks, forward and backward. If it is the last stage, the current subtask is treated as one complete computation task; the other stages are divided into forward and backward computation tasks. The current computing node then judges, according to its current local GPU resource operating status, whether it can meet the computation demand of the current subtask; if not, it asks the global GPU resource scheduler 20 whether there is another computing node that can meet the task completion time expected by the user. If the global GPU resource scheduler 20 arranges the task on another computing node, the scheduling of this model subtask ends. If the global GPU resource scheduler 20 feeds back that no other computing node exists, the current computing node puts the current subtask into the high-priority queue. Optionally, referring to FIG. 8, the second scheduling policy includes: the current computing node judges, according to the local GPU resource operating status, whether the GPU resources of the current computing node can meet the computation demand of the current subtask, and if not, asks the global GPU resource scheduler 20 whether there is another computing node that can meet the computation demand of the current subtask; when the global GPU resource scheduler 20 feeds back that no other computing node exists, the current computing node puts the current subtask into the high-priority queue. Still further, to make full use of GPU computing efficiency, the computation tasks of other tasks can be inserted into the idle time of the forward and backward stages. Referring to FIG. 8, the second scheduling policy of this embodiment may also include: inserting computation tasks of other subtasks according to the idle time of the forward computation task stage and backward computation task stage of the current subtask; and/or, after the forward computation task of the current subtask is completed, copying the model parameters and intermediate variables corresponding to the forward computation task of the current subtask from the GPU of the current computing node to the CPU main memory of the current computing node, and marking the backward computation task associated with the forward computation task of the current subtask with a pre-execution time according to the expected idle time; after the pre-execution time ends, if the associated backward computation task has not started executing, copying the model parameters and intermediate variables corresponding to the forward computation task of the current subtask from the CPU main memory back into the GPU of the current computing node, so as to reduce the waiting time of the backward computation.
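One possible reading of the offload and copy-back step of the second scheduling policy is sketched below; the last stage is never offloaded because its forward and backward computations run back to back, and the pre-execution time is set so that the copy back can finish before the expected start of the backward computation. The timing fields and the returned dictionary layout are hypothetical assumptions.

```python
def plan_stage_offload(stage_index, num_stages, expected_idle_s,
                       copy_time_s, now_s=0.0):
    """Plan offloading for one pipeline stage under the second
    scheduling policy: move the forward pass's model parameters and
    intermediate variables to CPU main memory during the stage's idle
    window and attach a pre-execution time to the associated backward
    task so the data can be copied back in time."""
    if stage_index == num_stages - 1 or expected_idle_s <= copy_time_s:
        return None  # keep everything in GPU memory
    return {
        "stage": stage_index,
        "offload_to_cpu": True,
        # Copy back early enough that the backward pass does not wait.
        "pre_execution_time_s": now_s + expected_idle_s - copy_time_s,
    }


# Example: stage 0 of 3 with a 6.3 s idle window and a 0.2 s copy.
print(plan_stage_offload(0, 3, 6.3, 0.2))
```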
An embodiment of the present invention further provides an intelligent-computing-oriented distributed training task scheduling method. Referring to FIG. 9, the intelligent-computing-oriented distributed training task scheduling method in the embodiment of the present invention may include:
S100: determining, by the model performance prediction and decomposition module, the distributed training mode corresponding to the model to be trained according to the model to be trained, the target completion time and the target input resources entered by the user, dividing the model to be trained into a plurality of subtasks, and determining the resource consumption information of each subtask, where the distributed training mode includes one of data parallelism, pipeline parallelism and hybrid parallelism, the hybrid parallelism includes data parallelism and pipeline parallelism, and the resource consumption information includes computation consumption and memory consumption;
S200: after receiving the subtask request sent by the model performance prediction and decomposition module, allocating, by the global GPU resource scheduler, each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the plurality of computing nodes, constructing the communication topology between the subtasks, monitoring the computing resource operating status of the GPU of each computing node while the GPUs of the computing nodes train the corresponding subtasks, and controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, where the subtask request carries the distributed training mode corresponding to the model to be trained, the plurality of subtasks, and the resource consumption information of each subtask; and
S300: performing, by the local GPU resource scheduler configured on each computing node, local scheduling of the subtasks allocated to that computing node according to the distributed training mode.
Corresponding to the foregoing embodiments of the intelligent-computing-oriented distributed training task scheduling method, the present invention also provides embodiments of an intelligent-computing-oriented distributed training task scheduling apparatus.
Referring to FIG. 10, an intelligent-computing-oriented distributed training task scheduling apparatus provided by an embodiment of the present invention includes a memory and one or more processors, where executable code is stored in the memory, and the one or more processors, when executing the executable code, are configured to implement the intelligent-computing-oriented distributed training task scheduling method in the above embodiments.
The embodiments of the intelligent-computing-oriented distributed training task scheduling apparatus provided by the embodiments of the present invention can be applied to any device with data processing capability, which may be a device or apparatus such as a computer. The apparatus embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. Taking software implementation as an example, as an apparatus in a logical sense, the apparatus is formed by the processor of the device with data processing capability where it is located reading the corresponding computer program instructions from non-volatile memory into memory and running them. From a hardware perspective, FIG. 10 is a hardware structure diagram of the device with data processing capability where the intelligent-computing-oriented distributed training task scheduling apparatus provided by the embodiment of the present invention is located. In addition to the processor, memory, network interface and non-volatile memory shown in FIG. 10, the device with data processing capability where the apparatus of the embodiment is located may also include other hardware according to the actual functions of that device, which is not described in detail here.
For the implementation process of the functions and effects of the units in the above apparatus, refer to the implementation process of the corresponding steps in the above method, which is not repeated here.
Since the apparatus embodiments basically correspond to the method embodiments, for relevant parts, reference may be made to the description of the method embodiments. The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored, and the program, when executed by a processor, implements the intelligent-computing-oriented distributed training task scheduling method in the above embodiments.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card or a flash card provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capability. The computer-readable storage medium is used to store the computer program and other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been or will be output.

Claims (19)

  1. An intelligent-computing-oriented distributed training task scheduling system, wherein a computing cluster comprises a plurality of computing nodes, the plurality of computing nodes are able to communicate with one another, and each computing node comprises at least one CPU and at least one GPU, the system comprising:
    a model performance prediction and decomposition module, configured to determine a distributed training mode corresponding to a model to be trained according to the model to be trained, a target completion time and target input resources entered by a user, divide the model to be trained into a plurality of subtasks, and determine resource consumption information of each subtask, wherein the distributed training mode comprises one of data parallelism, pipeline parallelism and hybrid parallelism, the hybrid parallelism comprises data parallelism and pipeline parallelism, and the resource consumption information comprises computation consumption and memory consumption;
    a global GPU resource scheduler, configured to, after receiving a subtask request sent by the model performance prediction and decomposition module, allocate each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the plurality of computing nodes, construct a communication topology between the subtasks, monitor the computing resource operating status of the GPU of each computing node while the GPUs of the computing nodes train the corresponding subtasks, and control the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, wherein the subtask request carries the distributed training mode corresponding to the model to be trained, the plurality of subtasks and the resource consumption information of each subtask; and
    a local GPU resource scheduler configured on each computing node, configured to perform local scheduling of the subtasks allocated to that computing node according to the distributed training mode;
    wherein the computing resource operating status comprises the waiting time of subtasks and the GPU utilization;
    when controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, the global GPU resource scheduler is specifically configured to:
    add a backup node for a subtask whose waiting time is greater than or equal to a preset time threshold, wherein the backup node is a computing node, among the plurality of computing nodes, other than the current computing node corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold, and the GPU utilization of the backup node is less than or equal to a preset utilization threshold; and
    copy the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold to the backup node, so that the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold are added to the backup node in a data-parallel manner and the backup node participates in the training of that task in the next iteration round.
  2. The intelligent-computing-oriented distributed training task scheduling system according to claim 1, wherein, when copying the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold to the backup node so that those latest model parameters are added to the backup node in a data-parallel manner and the backup node participates in the training of that task in the next iteration round, the global GPU resource scheduler is specifically configured to: send first scheduling information to the local GPU resource scheduler of the backup node, wherein the first scheduling information carries the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold;
    and, after receiving the first scheduling information, the local GPU resource scheduler of the backup node adds the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold to the backup node in a data-parallel manner, and the backup node participates in the training of that task in the next iteration round.
  3. The intelligent-computing-oriented distributed training task scheduling system according to claim 1, wherein, when the distributed training mode corresponding to a subtask is data parallelism, the training process of the subtask comprises a gradient computation phase and a gradient synchronization phase, and when controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, the global GPU resource scheduler is specifically configured to:
    control the prefetching of the model parameters and intermediate variables of the corresponding subtasks according to the computing resource operating status of the GPUs of the computing nodes where all data-parallel subtasks are located.
  4. The intelligent-computing-oriented distributed training task scheduling system according to claim 3, wherein, when controlling the prefetching of the model parameters and intermediate variables of the corresponding subtasks according to the computing resource operating status of the GPUs of the computing nodes where all data-parallel subtasks are located, the global GPU resource scheduler is specifically configured to:
    after receiving notice that a parameter server has started computing the global gradient information of all data-parallel subtasks, send second scheduling information to the computing node corresponding to a data-parallel subtask, so as to prompt, through the second scheduling information, the computing node to execute the corresponding data-parallel subtask with priority and to copy the latest model parameters and intermediate variables corresponding to that data-parallel subtask from the CPU main memory of the computing node back into the GPU memory of the computing node;
    wherein the latest model parameters and intermediate variables corresponding to the data-parallel subtask are the model parameters and intermediate variables that the computing node temporarily migrated from the GPU memory of the computing node to the CPU main memory of the computing node while the data-parallel subtask was waiting for the computation results of other subtasks it depends on and the expected waiting time exceeded the CPU-GPU memory copy time of the computing node.
  5. The intelligent-computing-oriented distributed training task scheduling system according to any one of claims 1 to 4, wherein, when monitoring the computing resource operating status of the GPU of each computing node, the global GPU resource scheduler is specifically configured to:
    periodically obtain the computing resource operating status of the GPU of each computing node.
  6. The intelligent-computing-oriented distributed training task scheduling system according to claim 5, wherein the global GPU resource scheduler is specifically configured to: periodically receive the computing resource operating status of the GPU of a computing node fed back by that computing node to the global GPU resource scheduler.
  7. The intelligent-computing-oriented distributed training task scheduling system according to claim 1, wherein the global GPU resource scheduler is further configured to:
    after the training of the model to be trained is completed, determine the computing node where each subtask is located according to the historical allocation information of each subtask of the model to be trained;
    control the computing node where each subtask is located to reclaim the local resources used on that computing node for training the corresponding subtask; and
    after determining that resource reclamation on all computing nodes has finished, release the resources used on the global GPU resource scheduler for training the model to be trained.
  8. The intelligent-computing-oriented distributed training task scheduling system according to claim 5, wherein, when performing local scheduling of the subtasks allocated to that computing node according to the distributed training mode, the local GPU resource scheduler configured on each computing node is specifically configured to:
    determine, according to the distributed training mode, the training type of the subtask allocated locally, wherein the training type comprises a data-parallel task and a pipeline-parallel task;
    determine, according to the training type of the subtask allocated locally, a local scheduling policy of the subtask allocated locally; and
    perform local scheduling of the subtask allocated locally according to the local scheduling policy;
    wherein, when the training type is a data-parallel task, the local scheduling policy of the subtask is a first scheduling policy; and
    when the training type is a pipeline-parallel task, the local scheduling policy of the subtask is a second scheduling policy.
  9. The intelligent-computing-oriented distributed training task scheduling system according to claim 8, wherein, when the training type is a data-parallel task, the training process of the subtask comprises a gradient computation phase and a gradient synchronization phase;
    the first scheduling policy comprises:
    obtaining a first computation demand of the current subtask in the gradient computation phase and second computation demands of other subtasks on the current computing node; and
    determining the training order of all subtasks on the current computing node according to the first computation demand and the second computation demands, in order of computational efficiency.
  10. The intelligent-computing-oriented distributed training task scheduling system according to claim 9, wherein the first scheduling policy further comprises:
    when the computation demand of the current subtask exceeds the local computing resources of the current computing node, feeding back the computing resource operating status of the GPU of that computing node to the global GPU resource scheduler, so as to ask the global GPU resource scheduler whether there is another computing node whose computing resources can meet the computation demand of the current subtask;
    wherein, when the computation demand of the current subtask exceeds the local computing resources of a computing node, the training time of the current subtask exceeds the task completion time expected by the user; and when the computation demand of the current subtask is less than the local computing resources of the computing node, the training time of the current subtask is less than the task completion time expected by the user.
  11. The intelligent-computing-oriented distributed training task scheduling system according to claim 10, wherein the first scheduling policy further comprises:
    when the global GPU resource scheduler feeds back that no other computing node exists, building, by the current computing node, a high-priority queue and a low-priority queue, putting the gradient computation phase task of the current subtask into the high-priority queue, and putting the gradient synchronization phase task of the current subtask into the low-priority queue;
    wherein the GPU of the current computing node executes the gradient computation phase task, and the CPU of the current computing node executes the gradient synchronization phase task.
  12. The intelligent-computing-oriented distributed training task scheduling system according to claim 11, wherein the first scheduling policy further comprises:
    when the gradient computation phase task has finished executing, copying the model parameters and intermediate variables corresponding to the gradient computation phase task into the CPU main memory of the current computing node; and when both the gradient computation phase task and the gradient synchronization phase task have finished, copying the model parameters and intermediate variables of the corresponding subtask into the GPU memory of the current computing node;
    and/or,
    after the current computing node receives the first scheduling information sent by the global GPU resource scheduler, marking the subtask in the low-priority queue with a prefetch mark, wherein the GPU of each computing node preferentially executes subtasks marked with the prefetch mark.
  13. The intelligent-computing-oriented distributed training task scheduling system according to claim 8, wherein, when the training type is a pipeline-parallel task, the current subtask comprises a plurality of task stages, the computation task of the last stage of the current subtask is a complete computation task, and the computation tasks of the other stages of the current subtask comprise a forward computation task and a backward computation task;
    the second scheduling policy comprises:
    judging, by the current computing node according to the local GPU resource operating status, whether the GPU resources of the computing node can meet the computation demand of the current subtask, and if not, asking the global GPU resource scheduler whether there is another computing node that can meet the computation demand of the current subtask; and
    when the global GPU resource scheduler feeds back that no other computing node exists, putting, by the current computing node, the current subtask into a high-priority queue.
  14. The intelligent-computing-oriented distributed training task scheduling system according to claim 13, wherein the second scheduling policy further comprises:
    inserting computation tasks of other subtasks according to the idle time of the forward computation task stage and the backward computation task stage of the current subtask;
    and/or,
    after the forward computation task of the current subtask is completed, copying the model parameters and intermediate variables corresponding to the forward computation task of the current subtask from the GPU of the current computing node to the CPU main memory of the current computing node, and marking the backward computation task associated with the forward computation task of the current subtask with a pre-execution time according to the expected idle time; and, after the pre-execution time ends, if the associated backward computation task has not started executing, copying the model parameters and intermediate variables corresponding to the forward computation task of the current subtask from the CPU main memory back into the GPU of the current computing node.
  15. The intelligent-computing-oriented distributed training task scheduling system according to claim 1, wherein, when determining the distributed training mode corresponding to the model to be trained according to the model to be trained, the target completion time and the target input resources entered by the user, the model performance prediction and decomposition module is specifically configured to:
    pre-train the model to be trained, and determine the computation time and memory overhead required by each layer of parameters in the model to be trained;
    determine, according to the computation time and memory overhead required by each layer of parameters, the GPU resources and task completion time respectively required by different distributed training modes; and
    select, according to the target completion time and the target input resources, the distributed training mode with the smallest task completion time as the distributed training mode corresponding to the model to be trained.
  16. The intelligent-computing-oriented distributed training task scheduling system according to claim 15, wherein, when pre-training the model to be trained and determining the computation time and memory overhead required by each layer of parameters in the model to be trained, the model performance prediction and decomposition module is specifically configured to:
    perform multiple training iterations on each layer of parameters in the model to be trained, and determine the computation time and memory overhead of each training iteration of each layer of parameters;
    determine the computation time required by a layer of parameters according to the average of the computation times of the multiple training iterations of that layer of parameters; and
    determine the memory overhead required by a layer of parameters according to the average of the memory overheads of the multiple training iterations of that layer of parameters.
  17. An intelligent-computing-oriented distributed training task scheduling method, wherein a computing cluster comprises a plurality of computing nodes, the plurality of computing nodes are able to communicate with one another, and each computing node comprises at least one CPU and at least one GPU, the method comprising:
    determining, by a model performance prediction and decomposition module, a distributed training mode corresponding to a model to be trained according to the model to be trained, a target completion time and target input resources entered by a user, dividing the model to be trained into a plurality of subtasks, and determining resource consumption information of each subtask, wherein the distributed training mode comprises one of data parallelism, pipeline parallelism and hybrid parallelism, the hybrid parallelism comprises data parallelism and pipeline parallelism, and the resource consumption information comprises computation consumption and memory consumption;
    after receiving a subtask request sent by the model performance prediction and decomposition module, allocating, by a global GPU resource scheduler, each subtask to the GPU of a matching computing node for training according to the resource consumption information of each subtask and the GPU operating status of the plurality of computing nodes, constructing a communication topology between the subtasks, monitoring the computing resource operating status of the GPU of each computing node while the GPUs of the computing nodes train the corresponding subtasks, and controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, wherein the subtask request carries the distributed training mode corresponding to the model to be trained, the plurality of subtasks and the resource consumption information of each subtask; and
    performing, by a local GPU resource scheduler configured on each computing node, local scheduling of the subtasks allocated to that computing node according to the distributed training mode;
    wherein the computing resource operating status comprises the waiting time of subtasks and the GPU utilization;
    when controlling the scheduling of the subtasks according to the computing resource operating status of the GPUs of all computing nodes, the global GPU resource scheduler is specifically configured to:
    add a backup node for a subtask whose waiting time is greater than or equal to a preset time threshold, wherein the backup node is a computing node, among the plurality of computing nodes, other than the current computing node corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold, and the GPU utilization of the backup node is less than or equal to a preset utilization threshold; and
    copy the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold to the backup node, so that the latest model parameters corresponding to the subtask whose waiting time is greater than or equal to the preset time threshold are added to the backup node in a data-parallel manner and the backup node participates in the training of that task in the next iteration round.
  18. An intelligent-computing-oriented distributed training task scheduling apparatus, comprising a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors, when executing the executable code, are configured to implement the intelligent-computing-oriented distributed training task scheduling method according to claim 17.
  19. A computer-readable storage medium on which a program is stored, wherein, when the program is executed by a processor, the intelligent-computing-oriented distributed training task scheduling method according to claim 17 is implemented.
PCT/CN2023/105626 2022-09-21 2023-07-04 Intelligent-computing-oriented distributed training task scheduling method, system and apparatus WO2024060789A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211148202.1 2022-09-21
CN202211148202.1A CN115248728B (zh) 2022-09-21 2022-09-21 Intelligent-computing-oriented distributed training task scheduling method, system and apparatus

Publications (1)

Publication Number Publication Date
WO2024060789A1 true WO2024060789A1 (zh) 2024-03-28

Family

ID=83700043

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105626 WO2024060789A1 (zh) 2022-09-21 2023-07-04 Intelligent-computing-oriented distributed training task scheduling method, system and apparatus

Country Status (2)

Country Link
CN (1) CN115248728B (zh)
WO (1) WO2024060789A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115248728B (zh) * 2022-09-21 2023-02-03 之江实验室 面向智能计算的分布式训练任务调度方法、***和装置
CN115454654B (zh) * 2022-11-11 2023-01-13 中诚华隆计算机技术有限公司 一种自适应资源匹配获取方法和装置
CN116069495A (zh) * 2022-11-18 2023-05-05 深圳先进技术研究院 弹性深度学习作业调度方法、***及计算机设备
CN116167463B (zh) * 2023-04-26 2023-07-07 之江实验室 一种面向智能计算的分布式模型训练容器调度方法及装置
CN116204327B (zh) * 2023-05-06 2023-08-01 阿里巴巴(中国)有限公司 分布式***通信调度方法及分布式机器学习***
CN116680060B (zh) * 2023-08-02 2023-11-03 浪潮电子信息产业股份有限公司 面向异构计算***的任务分配方法、装置、设备和介质
CN116702885B (zh) * 2023-08-02 2023-11-07 浪潮电子信息产业股份有限公司 同步数据并行训练控制方法、***、装置、设备及介质
CN117057411B (zh) * 2023-10-11 2024-01-09 北京燧原智能科技有限公司 一种大语言模型训练方法、装置、设备及存储介质
CN117155928B (zh) * 2023-10-31 2024-02-09 浪潮电子信息产业股份有限公司 通信任务处理方法、***、设备、集群及可读存储介质
CN117519953B (zh) * 2024-01-08 2024-04-05 北京大学 一种面向服务器无感知计算的分离式内存管理方法
CN117519996B (zh) * 2024-01-08 2024-03-15 长春吉大正元信息技术股份有限公司 一种数据处理方法、装置、设备以及存储介质
CN117555696B (zh) * 2024-01-11 2024-03-15 西北工业大学 一种多模型并发执行的数据交互方法及***
CN117687802B (zh) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 一种基于云平台的深度学***台

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489223A (zh) * 2019-08-26 2019-11-22 北京邮电大学 一种异构集群中任务调度方法、装置及电子设备
US20200042362A1 (en) * 2018-08-03 2020-02-06 EMC IP Holding Company LLC Self-adaptive batch dataset partitioning for distributed deep learning using hybrid set of accelerators
CN112559147A (zh) * 2020-12-08 2021-03-26 和美(深圳)信息技术股份有限公司 基于gpu占用资源特点的动态匹配算法、***和设备
US20220012089A1 (en) * 2020-07-13 2022-01-13 Accenture Global Solutions Limited System for computational resource prediction and subsequent workload provisioning
CN114647515A (zh) * 2022-04-12 2022-06-21 杭州电子科技大学 一种面向gpu集群的动态资源调度方法
CN114675964A (zh) * 2022-03-08 2022-06-28 杭州博盾习言科技有限公司 基于联邦决策树模型训练的分布式调度方法、***及介质
CN115248728A (zh) * 2022-09-21 2022-10-28 之江实验室 面向智能计算的分布式训练任务调度方法、***和装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399222B (zh) * 2019-07-25 2022-01-21 北京邮电大学 Gpu集群深度学习任务并行化方法、装置及电子设备
CN111079921A (zh) * 2019-11-29 2020-04-28 杭州电子科技大学舟山同博海洋电子信息研究院有限公司 一种基于异构分布式***的高效神经网络训练调度方法
WO2021115082A1 (zh) * 2019-12-09 2021-06-17 华为技术有限公司 作业调度方法以及作业调度装置
CN112114951A (zh) * 2020-09-22 2020-12-22 北京华如科技股份有限公司 一种自下而上的分布式调度***及方法
WO2022116095A1 (en) * 2020-12-03 2022-06-09 Nvidia Corporation Distributed neural network training system
CN114035937A (zh) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 一种基于人工智能的分布式训练和推理方法、***、设备和可读存储介质
CN114741207B (zh) * 2022-06-10 2022-09-30 之江实验室 一种基于多维度组合并行的gpu资源调度方法和***

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200042362A1 (en) * 2018-08-03 2020-02-06 EMC IP Holding Company LLC Self-adaptive batch dataset partitioning for distributed deep learning using hybrid set of accelerators
CN110489223A (zh) * 2019-08-26 2019-11-22 北京邮电大学 一种异构集群中任务调度方法、装置及电子设备
US20220012089A1 (en) * 2020-07-13 2022-01-13 Accenture Global Solutions Limited System for computational resource prediction and subsequent workload provisioning
CN112559147A (zh) * 2020-12-08 2021-03-26 和美(深圳)信息技术股份有限公司 基于gpu占用资源特点的动态匹配算法、***和设备
CN114675964A (zh) * 2022-03-08 2022-06-28 杭州博盾习言科技有限公司 基于联邦决策树模型训练的分布式调度方法、***及介质
CN114647515A (zh) * 2022-04-12 2022-06-21 杭州电子科技大学 一种面向gpu集群的动态资源调度方法
CN115248728A (zh) * 2022-09-21 2022-10-28 之江实验室 面向智能计算的分布式训练任务调度方法、***和装置

Also Published As

Publication number Publication date
CN115248728B (zh) 2023-02-03
CN115248728A (zh) 2022-10-28


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23867084

Country of ref document: EP

Kind code of ref document: A1