CN116932201A - Multi-resource sharing scheduling method for deep learning training task - Google Patents

Multi-resource sharing scheduling method for deep learning training task

Info

Publication number
CN116932201A
Authority
CN
China
Prior art keywords
training
task
training task
sharing
resource
Prior art date
2023-02-07
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310124944.9A
Other languages
Chinese (zh)
Inventor
金鑫
刘譞哲
赵怡浩
刘渊强
黄罡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.) 2023-02-07
Filing date 2023-02-07
Publication date 2023-10-24
Application filed by Peking University filed Critical Peking University
Priority to CN202310124944.9A priority Critical patent/CN116932201A/en
Publication of CN116932201A publication Critical patent/CN116932201A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 - Allocation of resources to service a request, the resource being a machine, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 - Partitioning or combining of resources
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 - Indexing scheme relating to G06F9/00
    • G06F 2209/50 - Indexing scheme relating to G06F9/50
    • G06F 2209/5021 - Priority
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a multi-resource sharing scheduling method for deep learning training tasks. The method comprises the following steps: acquiring resource usage data of each training task submitted to a task queue; determining the sharing efficiency between training tasks from the acquired resource usage data and a sharing mechanism; determining a sharing scheduling scheme from the acquired resource usage data and the sharing efficiency; and controlling an executor cluster to execute the training tasks according to the sharing scheduling scheme. Through multi-resource sharing and scheduling, the method aims to greatly improve the utilization of the various resources in the cluster and greatly reduce the completion time of deep learning training tasks.

Description

Multi-resource sharing scheduling method for deep learning training task
Technical Field
The invention relates to the technical field of deep learning, in particular to a multi-resource sharing scheduling method for deep learning training tasks.
Background
Deep learning has been widely integrated into web services and applications, and training deep learning models is becoming an important workload for data centers. Typically, enterprises build graphics processing unit (GPU) clusters to support deep learning training: users submit deep learning training tasks to the cluster, and a cluster scheduler schedules the submitted tasks and allocates resources to achieve higher resource utilization and task execution efficiency.
Cluster schedulers for deep learning training tasks (deep learning schedulers for short) have made some progress in improving resource utilization and training efficiency, but most deep learning schedulers focus only on GPU allocation. In practice, however, deep learning training requires a variety of resources, including storage resources for reading training data, central processing unit (CPU) resources for preprocessing and simulation, GPU resources for forward and backward propagation, and network resources for gradient synchronization between workers in distributed training. For early deep learning models (e.g., ResNet), GPU resources were indeed the bottleneck of model training, which is why existing deep learning schedulers concentrate on GPU allocation. With the rapid development of deep learning, however, this assumption no longer holds. Deep learning models now differ greatly in size and type, and their training has different resource requirements and resource bottlenecks. For example, reinforcement learning training requires interaction between a CPU-based simulation environment and the agent, and the simulation time is typically longer than the computation time, so the CPU is the resource bottleneck of reinforcement learning training; for small deep learning models trained for edge devices, the bottleneck is usually the storage read/write used for loading data; and large models trained in a distributed manner spend substantial time on network communication, so network read/write is their resource bottleneck.
According to their resource allocation mode, existing deep learning schedulers fall into two categories: those that give each training task exclusive resources and those that share only GPU resources. When deep learning training uses only GPU resources, GPU sharing can effectively improve resource utilization and task execution efficiency. However, the resource usage of today's deep learning tasks varies widely, so considering only GPU sharing brings limited benefit and can even degrade performance. Multi-resource sharing therefore remains unexplored by existing deep learning schedulers.
Disclosure of Invention
In view of this, an embodiment of the invention provides a multi-resource sharing scheduling method for deep learning training tasks, which aims to greatly improve the utilization of the various resources in a cluster and greatly reduce the completion time of deep learning training tasks through multi-resource sharing and scheduling.
A first aspect of an embodiment of the invention provides a multi-resource sharing scheduling method for deep learning training tasks, comprising the following steps:
acquiring resource usage data of each training task submitted to a task queue;
determining the sharing efficiency between training tasks from the acquired resource usage data and a sharing mechanism;
determining a sharing scheduling scheme from the acquired resource usage data and the sharing efficiency;
and controlling an executor cluster to execute the training tasks according to the sharing scheduling scheme.
Optionally, the acquiring resource usage data of each training task submitted to the task queue includes:
determining whether each training task in the task queue is submitted to the task queue for the first time;
when a training task is submitted to the task queue for the first time, executing a preset number of training iterations for the training task, acquiring its resource usage data during execution, and storing the acquired resource usage data in a database;
when a training task has previously been submitted to the task queue, acquiring the resource usage data corresponding to the training task from the database.
Optionally, when the resource usage data includes the resource usage type, executing a preset number of training iterations for a training task submitted to the task queue for the first time, acquiring its resource usage data during execution, and storing the acquired resource usage data in a database includes:
when the training task is submitted to the task queue for the first time, executing a preset number of training iterations for the training task;
acquiring the utilization of each resource in each time period of the training iterations, and normalizing the utilization peak of each resource within the same time period;
and determining the resource type with the highest normalized utilization as the resource usage type of that time period.
Optionally, the determining a sharing scheduling scheme from the acquired resource usage data and the sharing efficiency includes:
determining the priorities of the training tasks waiting in the task queue and of the training tasks already executing, from the acquired resource usage data of each training task;
sorting the training tasks by these priorities to obtain a corresponding ordering;
selecting, according to the ordering, the top-ranked training tasks that can occupy the executor cluster resources to form a subset of training tasks to be executed;
and determining a sharing scheduling scheme for the subset of training tasks to be executed from the acquired resource usage data and sharing efficiency of each training task in the subset.
Optionally, the determining a sharing scheduling scheme for the subset of training tasks to be executed from the acquired resource usage data and sharing efficiency of each training task in the subset includes:
constructing a corresponding hypergraph from the acquired resource usage data and sharing efficiency of each training task in the subset, wherein each node of the hypergraph represents one training task or one group of training tasks, the nodes on an edge represent training tasks that share the same resources, and the weight of an edge is the sharing efficiency among the nodes on that edge;
determining a plurality of matchings from the hypergraph;
determining the matching with the largest sum of sharing efficiencies among the plurality of matchings as the training task sharing scheme;
and determining the sharing scheduling scheme of the subset of training tasks to be executed from a preset scheduling strategy and the training task sharing scheme.
Optionally, the nodes on any edge of the hypergraph have the same resource occupancy.
Optionally, when the hypergraph contains a hyperedge connecting more than two nodes, the method further includes:
converting the hyperedge connecting the plurality of nodes into edges that each connect only two nodes, to construct a subgraph;
determining a plurality of sub-matchings from the subgraph;
determining the sub-matching with the largest sum of sharing efficiencies among the plurality of sub-matchings as the sub-training-task sharing scheme;
and merging each pair of nodes shared in the sub-training-task sharing scheme into one node, so that a hyperedge connecting a plurality of nodes in the hypergraph is reduced to an edge connecting only two nodes.
Optionally, the determining the priorities of the training tasks waiting in the task queue and already executing from the acquired resource usage data of each training task includes:
when the duration of a training task is known, determining its priority from the remaining time of the training task and the number of resources it requires;
when the duration of a training task is unknown, determining its priority from the already executed time of the training task and the number of resources it requires.
Optionally, the sharing scheduling scheme at least includes allocating resources to the training tasks in descending order of the number of resources each training task requires.
Optionally, the method further includes:
monitoring the resource utilization of the executor cluster and the execution status of each training task;
determining whether a training task is abnormal from its execution status;
and when the training task is abnormal, controlling the executor to terminate the training task and put it back into the task queue.
The embodiment of the invention has the following advantages:
In the multi-resource sharing scheduling method for deep learning training tasks provided by the embodiment of the invention, the resource usage data of each training task in the task queue is first acquired to determine the resources required by each training task and the sharing efficiency between training tasks. From this resource usage and sharing efficiency, it is decided which training tasks are combined into sharing groups to share resources, yielding a sharing scheduling scheme. Based on the obtained sharing scheduling scheme, the various resources in the executor cluster are allocated to each sharing group, so that the training tasks in the same sharing group share the resources allocated to that group. Because the multiple resources of the executor cluster are allocated to sharing groups that share and schedule them internally, the utilization of the various resources in the executor cluster is greatly improved and the completion time of deep learning training tasks is greatly reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a multi-resource sharing scheduling method for deep learning training tasks according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating determination of a training task sharing scheme in a multi-resource sharing scheduling method for deep learning training tasks according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of hypergraph simplification in a multi-resource sharing scheduling method for deep learning training tasks according to an embodiment of the present invention;
FIG. 4 is another flow chart of a multi-resource sharing scheduling method for deep learning training tasks according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the present invention, fig. 1 is a flowchart of a multi-resource sharing scheduling method for deep learning training tasks according to an embodiment of the present invention. Referring to fig. 1, the multi-resource sharing scheduling method for deep learning training tasks provided by the invention comprises the following steps:
step S11: acquiring resource use data of each training task submitted to a task queue;
step S12: determining the sharing efficiency between training tasks according to the acquired resource use data and the sharing mechanism;
step S13: determining a sharing scheduling scheme according to the acquired resource use data and the sharing efficiency;
step S14: and controlling the executor cluster to execute the training task through the shared scheduling scheme.
In the embodiment of the invention, the training tasks are shared and scheduled periodically: sharing scheduling is performed once every preset time interval, i.e. a new sharing scheduling scheme is generated each interval. In each scheduling round, the resource usage data of each training task submitted to the task queue is first acquired, for example the execution time of each training stage of the task, the remaining training time, the resource types occupied in each training stage and the amount of each resource type occupied. The training tasks considered include both the training tasks waiting to be executed in the current task queue and the training tasks currently executing.
After the resource usage data of each training task is obtained, the sharing efficiency between training tasks is determined according to the sharing mechanism. It should be understood that the sharing mechanism may be any existing sharing mechanism and is not specifically limited here.
A sharing scheduling scheme is then created for the training tasks from the obtained resource usage data and sharing efficiency of each training task, and the executor cluster is controlled to train the training tasks according to the constructed sharing scheduling scheme.
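Purely for illustration, the periodic flow of steps S11 to S14 can be sketched as follows in Python. The helper objects (task_queue, profiler, database, sharing_mechanism, build_scheme and cluster) and their method names are assumptions of this sketch rather than elements of the claimed method.

```python
import time

def scheduling_loop(task_queue, cluster, profiler, database, sharing_mechanism,
                    build_scheme, profile_iters=10, interval=60):
    """Illustrative periodic loop for steps S11-S14; every argument is an assumed helper."""
    while True:
        # Step S11: obtain resource usage data of every task submitted to the queue.
        usage = {}
        for task in task_queue.pending_and_running():
            if database.has(task.id):                          # previously submitted task
                usage[task.id] = database.load(task.id)
            else:                                              # first submission: profile it
                usage[task.id] = profiler.profile(task, profile_iters)
                database.store(task.id, usage[task.id])
        # Step S12: sharing efficiency between tasks under the chosen sharing mechanism.
        efficiency = sharing_mechanism.pairwise_efficiency(usage)
        # Step S13: sharing scheduling scheme (sharing groups plus resource allocation).
        scheme = build_scheme(task_queue, usage, efficiency, cluster)
        # Step S14: let the executor cluster run the tasks under that scheme.
        cluster.apply(scheme)
        time.sleep(interval)                                   # periodic rescheduling
```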
In the multi-resource sharing scheduling method for deep learning training tasks provided by the embodiment of the invention, the resource usage data of each training task in the task queue is first acquired to determine the resources required by each training task and the sharing efficiency between training tasks; from this resource usage and sharing efficiency it is decided which training tasks are combined into sharing groups to share resources, yielding a sharing scheduling scheme; and based on the obtained sharing scheduling scheme, the various resources in the executor cluster are allocated to each sharing group, so that the training tasks in the same sharing group share the resources allocated to that group. Because the multiple resources of the executor cluster are allocated to sharing groups that share and schedule them internally, the utilization of the various resources in the executor cluster is greatly improved and the completion time of deep learning training tasks is greatly reduced.
In the present invention, the acquiring resource usage data of each training task submitted to the task queue includes: determining whether each training task in the task queue is submitted to the task queue for the first time; when a training task is submitted to the task queue for the first time, executing a preset number of training iterations for it, acquiring its resource usage data during execution, and storing the acquired resource usage data in a database; and when a training task has previously been submitted to the task queue, acquiring the resource usage data corresponding to the training task from the database.
In the embodiment of the present invention, one implementation of step S11, acquiring the resource usage data of each training task submitted to the task queue, is as follows: determine whether each training task in the task queue is submitted for the first time; for a training task submitted to the task queue for the first time, execute several deep learning training iterations under the profiling tool PyTorch Profiler to acquire operator-level resource usage data of the deep learning training. Specifically, the execution time of each training stage, the resource types used and the amount of each resource occupied by the operators are obtained; the collected information is aggregated into the average iteration time and the average usage time and amount of each resource; and the resource usage data is stored in a database.
The preset number of iterations may be set according to the actual application scenario and is not limited here, for example 10 or 20 iterations.
When a training task has previously been submitted to the task queue, its resource usage data is already stored in the database, and the resource usage data corresponding to the training task is obtained directly from the database.
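As an illustrative example only, profiling a first-time submission with the PyTorch Profiler mentioned above could look like the following sketch; the callable train_one_iteration is an assumed stand-in for one training iteration of the submitted task.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_first_submission(train_one_iteration, num_iters=10):
    """Run a preset number of training iterations under the profiler and
    summarize per-operator execution time; the caller stores the result in the database."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_iters):
            train_one_iteration()
    # Aggregate per-operator statistics; a full system would also extract the
    # resource types and amounts used in each training stage before storing them.
    return prof.key_averages().table(sort_by="cuda_time_total", row_limit=20)
```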
In the present invention, when the resource usage data includes the resource usage type, executing a preset number of training iterations for a training task submitted to the task queue for the first time, acquiring its resource usage data during execution, and storing the acquired resource usage data in a database includes: when the training task is submitted to the task queue for the first time, executing a preset number of training iterations for the training task; acquiring the utilization of each resource in each time period of the training iterations, and normalizing the utilization peak of each resource within the same time period; and determining the resource type with the highest normalized utilization as the resource usage type of that time period.
In the embodiment of the invention, at any given moment during training a training task mainly uses a single dominant resource type; other resource types may also be used, but their utilization is very low, and resource types with such low utilization are barely affected by other training tasks.
For example, in time period S, the utilization peaks of the resources used by training task 1 are normalized, giving a CPU occupancy peak of 1%, a network occupancy peak of 0.5% and a GPU occupancy peak of 60% in that period. The resource type used by training task 1 in time period S is therefore determined to be the GPU, with the CPU and network treated as unused; when the sharing scheduling scheme is later created, the occupancy of the network and CPU by training task 1 is not considered and only its GPU usage in time period S is taken into account, which improves the efficiency of computing the scheduling function.
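The per-period normalization and selection of the dominant resource type can be illustrated with the following minimal sketch; the dictionary layout used for the peaks and capacities is an assumption of the example.

```python
def dominant_resource_per_period(peaks, capacities):
    """peaks: {period: {resource: raw peak usage}}; capacities: {resource: maximum value}.
    Returns, for each period, the resource type whose normalized peak is highest."""
    usage_type = {}
    for period, per_resource in peaks.items():
        normalized = {r: v / capacities[r] for r, v in per_resource.items()}
        usage_type[period] = max(normalized, key=normalized.get)
    return usage_type

# Example matching the description: the GPU dominates period S for training task 1.
peaks = {"S": {"CPU": 0.01, "network": 0.005, "GPU": 0.60}}
caps = {"CPU": 1.0, "network": 1.0, "GPU": 1.0}
print(dominant_resource_per_period(peaks, caps))   # {'S': 'GPU'}
```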
In the present invention, the determining a sharing scheduling scheme from the acquired resource usage data and the sharing efficiency includes: determining the priorities of the training tasks waiting in the task queue and of the training tasks already executing, from the acquired resource usage data of each training task; sorting the training tasks by these priorities to obtain a corresponding ordering; selecting, according to the ordering, the top-ranked training tasks that can occupy the executor cluster resources to form a subset of training tasks to be executed; and determining a sharing scheduling scheme for the subset from the acquired resource usage data and sharing efficiency of each training task in the subset.
In the embodiment of the invention, compared with deep learning scheduling based on exclusive resources, multi-resource-sharing scheduling can execute more deep learning tasks simultaneously, but the submitted tasks still cannot all be executed at once. The invention therefore proposes a priority-based compression method for the matching candidate set, so that when selecting the subset of training tasks that should currently be executed, the metric of average task completion time is optimized. Specifically, the priorities of the training tasks waiting in the task queue and of the training tasks already executing are first determined from the acquired resource usage data of each training task; that is, a training task that is already executing may be returned to the task queue in the next round of sharing scheduling, or may continue to execute under the next round's sharing scheduling scheme. After the priorities of the training tasks are obtained, the training tasks are sorted by priority to obtain the corresponding ordering.
According to the resulting ordering, several top-ranked training tasks that together can occupy the executor cluster resources are selected to form the subset of training tasks to be executed, and the next round's sharing scheduling scheme is created for this subset, thereby optimizing the average completion time of the training tasks.
In the embodiment of the present invention, the resource type that occupies the executor cluster resources is preferably the GPU.
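A minimal sketch of the priority-based compression of the matching candidate set described above is given below; it assumes that a smaller priority value means the task should be scheduled earlier and that the GPU count is the resource bounding the subset.

```python
def select_candidate_subset(tasks, total_gpus):
    """tasks: list of dicts with assumed fields 'priority' (smaller = more urgent)
    and 'gpus'. Greedily keeps the highest-priority tasks that fit into the cluster."""
    subset, used = [], 0
    for task in sorted(tasks, key=lambda t: t["priority"]):
        if used + task["gpus"] <= total_gpus:
            subset.append(task)
            used += task["gpus"]
    return subset
```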
In the present invention, the determining a sharing scheduling scheme for the subset of training tasks to be executed from the acquired resource usage data and sharing efficiency of each training task in the subset includes: constructing a corresponding hypergraph from the acquired resource usage data and sharing efficiency of each training task in the subset, wherein each node of the hypergraph represents one training task or one group of training tasks, the nodes on an edge represent training tasks that share the same resources, and the weight of an edge is the sharing efficiency among the nodes on that edge; determining a plurality of matchings from the hypergraph; determining the matching with the largest sum of sharing efficiencies as the training task sharing scheme; and determining the sharing scheduling scheme of the subset from a preset scheduling strategy and the training task sharing scheme.
In an embodiment of the present invention, a corresponding hypergraph G(V, E) is constructed from the acquired resource usage data and sharing efficiency of each training task in the subset of training tasks to be executed, where each node v ∈ V represents one task or one group of tasks, each edge (u_0, u_1, …) ∈ E represents that the tasks u_0, u_1, … share the same resources, and the weight of an edge is the sharing efficiency of the training tasks on that edge. A number of matchings are obtained from the hypergraph G, each matching consisting of several edges without common vertices. After the matchings are obtained, the matching with the largest sum of sharing efficiencies is selected as the training task sharing scheme, where the sum of sharing efficiencies is the sum over all sharing groups. The training task sharing scheme contains at least one sharing group, any sharing group contains at least one training task, and the training tasks in the same sharing group share the executor cluster resources allocated to that group. As shown in fig. 2, two simple hypergraph matchings are illustrated, with overall sharing efficiencies of 2 and 1.5 respectively; the matching algorithm outputs matching scheme 1 with the higher overall sharing efficiency, which makes better use of cluster resources and completes the tasks faster, so matching scheme 1 is determined to be the training task sharing scheme.
In the embodiment of the invention, after the training task sharing scheme is obtained, the sharing scheduling scheme of each sharing group is determined from the scheduling strategy and the training task sharing scheme. That is, executor cluster resources are allocated to each sharing group, and the execution time and execution order of each training task within the same sharing group on the allocated executor cluster resources are determined. The scheduling strategy can be set according to the actual application scenario and is not specifically limited here.
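For the special case in which every sharing group contains at most two training tasks, the matching step above reduces to maximum-weight matching on an ordinary graph, which the blossom algorithm solves exactly. The following sketch uses the networkx implementation; the task names and efficiency values are illustrative only and loosely modeled on the situation of fig. 2.

```python
import networkx as nx

def best_pairwise_sharing(tasks, efficiency):
    """Pairwise special case of the matching step: nodes are tasks, edge weights are
    sharing efficiencies, and the maximum-weight matching gives the sharing groups
    whose total efficiency is largest. `efficiency` maps (u, v) -> float."""
    g = nx.Graph()
    g.add_nodes_from(tasks)
    for (u, v), eff in efficiency.items():
        g.add_edge(u, v, weight=eff)
    matching = nx.max_weight_matching(g, weight="weight")
    return [tuple(sorted(pair)) for pair in matching]   # each pair is one sharing group

# Toy example: scheme {(A, B), (C, D)} has total efficiency 2.0 and is preferred over 1.5.
eff = {("A", "B"): 1.2, ("C", "D"): 0.8, ("A", "C"): 0.9, ("B", "D"): 0.6}
print(best_pairwise_sharing(["A", "B", "C", "D"], eff))
```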
In the invention, the nodes on any edge of the hypergraph have the same resource occupancy.
In the embodiment of the invention, in order to reduce the complexity of creating the sharing scheduling scheme, when the hypergraph is created from the subset of training tasks to be executed, only nodes with the same resource occupancy are placed on the same edge, which reduces the complexity of constructing the training task sharing scheme and hence of the sharing scheduling scheme. For example, when two training tasks each require 8 GPUs, one edge can directly connect the two corresponding nodes; but if a node corresponding to a training task requiring 8 GPUs were connected to a node corresponding to a training task requiring 2 GPUs, a node corresponding to a training task requiring 6 GPUs would also have to be attached to that edge. Differences in resource occupancy among nodes therefore raise the dimension of the hyperedges and reduce task sharing efficiency, which increases the complexity of constructing the training task sharing scheme.
In an embodiment of the present invention, when the hypergraph contains a hyperedge connecting more than two nodes, the method further includes: converting the hyperedge connecting the plurality of nodes into edges that each connect only two nodes, to construct a subgraph; determining a plurality of sub-matchings from the subgraph; determining the sub-matching with the largest sum of sharing efficiencies as the sub-training-task sharing scheme; and merging each pair of shared nodes in the sub-training-task sharing scheme into one node, so that the hyperedge connecting a plurality of nodes in the hypergraph is reduced to an edge connecting only two nodes.
In the embodiment of the invention, a deep learning training task in practice generally uses four kinds of resources, so at most four training tasks can share with each other, and the hypergraph G may then contain hyperedges connecting up to four nodes. The optimal combination problem thereby becomes the maximum-weight k-dimensional hypergraph matching problem in graph theory, which is equivalent to the maximum-weight independent set problem; it is NP-hard, and its optimal solution cannot be found in polynomial time. The invention therefore proposes a multi-round heuristic scheduling algorithm that decomposes the sharing and matching of training tasks into several rounds, each round combining two groups of training tasks using the blossom algorithm. At the end of each round, each pair of nodes selected by the blossom algorithm is merged into a single node and used as input to the next round of sharing matching. This process is repeated for at most two rounds, so that each sharing group contains at most four deep learning training tasks, and the time complexity remains polynomial, namely O(|V|^3).
Specifically, a deep learning training task in practice generally uses four kinds of resources (GPU, network, CPU and storage), so at most four training tasks can share resources, and the hyperedges of the hypergraph G connect at most four nodes. A hyperedge connecting several nodes in the hypergraph G is converted into multiple edges each connecting only two nodes, so as to construct a subgraph whose nodes are exactly the nodes of that hyperedge. Several sub-matchings are obtained from the subgraph, and the sub-matching with the largest sum of sharing efficiencies is determined to be the sub-training-task sharing scheme. After the sub-sharing matching scheme is obtained, each pair of shared nodes in the sub-training-task sharing scheme is merged into one node and substituted back into the hypergraph, so that the hyperedge connecting a plurality of nodes is reduced to an edge connecting only two nodes.
For example, as shown in fig. 3, the hypergraph G contains a hyperedge X connecting 4 nodes. The hyperedge X is converted into several edges each connecting only two nodes, giving the subgraph G0. Several sub-matchings are obtained from this subgraph; among them, the sub-matching that shares node 1 with node 3 and node 2 with node 4 has the largest sum of sharing efficiencies, so it is determined to be the sub-training-task sharing scheme. Node 1 and node 3 are then merged into one node 13, node 2 and node 4 are merged into one node 24, and the merged nodes are substituted back into the hypergraph G, so that the hyperedge connecting the plurality of nodes is reduced to an edge connecting only two nodes.
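The multi-round merging described above can be sketched as follows: each round runs a pairwise maximum-weight matching and merges every matched pair into a single node for the next round, so that after two rounds a sharing group contains at most four tasks. The pair_efficiency callable is an assumption of this sketch.

```python
import networkx as nx

def multi_round_sharing(tasks, pair_efficiency, rounds=2):
    """Multi-round heuristic sketch. `pair_efficiency(a, b)` is an assumed callable
    returning the sharing efficiency of two (possibly merged) groups of tasks."""
    groups = [(t,) for t in tasks]            # start with singleton groups
    for _ in range(rounds):
        g = nx.Graph()
        g.add_nodes_from(range(len(groups)))
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                eff = pair_efficiency(groups[i], groups[j])
                if eff > 0:
                    g.add_edge(i, j, weight=eff)
        matching = nx.max_weight_matching(g, weight="weight")
        merged, used = [], set()
        for i, j in matching:                 # merge each matched pair into one node
            merged.append(groups[i] + groups[j])
            used.update((i, j))
        merged.extend(groups[k] for k in range(len(groups)) if k not in used)
        groups = merged
    return groups
```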
In the present invention, the determining the priorities of the training tasks waiting in the task queue and already executing from the acquired resource usage data of each training task includes: when the duration of a training task is known, determining its priority from its remaining time and the number of resources it requires; when the duration of a training task is unknown, determining its priority from its already executed time and the number of resources it requires.
In the embodiment of the present invention, one implementation of determining the priorities of the training tasks waiting in the task queue and already executing from the acquired resource usage data is as follows: when the duration of a training task is known, its priority is determined by its remaining time and the number of resources it requires; when the duration of a training task is unknown, its priority is determined by its already executed time and the number of resources it requires.
For example, when the duration of training task 1 is known, its priority is determined by a remaining-service priority policy: if the remaining time of training task 1 is r and it requires g GPUs, its priority is r × g. When the duration of training task 2 is unknown, its priority is determined by a two-dimensional attained-service priority policy: if the already executed time of training task 2 is a and it requires h GPUs, its priority is a × h.
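A minimal sketch of the two priority rules is given below; the field names of the task dictionary are assumptions of the example, and a smaller value means a higher scheduling priority.

```python
def task_priority(task):
    """Priority sketch: remaining service (remaining_time * gpus) when the duration is
    known, attained service (executed_time * gpus) otherwise."""
    if task.get("duration_known"):
        return task["remaining_time"] * task["gpus"]
    return task["executed_time"] * task["gpus"]

print(task_priority({"duration_known": True, "remaining_time": 3600, "gpus": 4}))  # 14400
```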
In the invention, the sharing scheduling scheme at least includes allocating resources to the training tasks in descending order of the number of resources each training task requires.
In the embodiment of the invention, in order to avoid fragmenting executor resources and to minimize the number of executors used by each training task, and given that the number of GPUs used by a deep learning training task is always a power of 2, the invention allocates executor GPU resources to the training tasks in descending order of their GPU counts.
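One possible way to realize this descending-order allocation is sketched below; the best-fit choice of executor is an assumption of the sketch, and multi-server placement is omitted for brevity.

```python
def allocate_gpus(groups, num_servers=3, gpus_per_server=8):
    """Placement sketch: sharing groups are placed in descending order of GPU demand
    (demands assumed to be powers of two), each on the executor with the fewest free
    GPUs that still fit, which keeps tasks on few executors and limits fragmentation."""
    free = [gpus_per_server] * num_servers
    placement = {}
    for name, demand in sorted(groups.items(), key=lambda kv: -kv[1]):
        fitting = [i for i, f in enumerate(free) if f >= demand]
        if not fitting:
            placement[name] = None            # no single-server fit; multi-server logic omitted
            continue
        best = min(fitting, key=lambda i: free[i])   # tightest fit
        free[best] -= demand
        placement[name] = best
    return placement

print(allocate_gpus({"job_a": 8, "job_b": 4, "job_c": 4, "job_d": 2}))
```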
In the present invention, the method further includes: monitoring the resource utilization of the executor cluster and the execution status of each training task; determining whether a training task is abnormal from its execution status; and when the training task is abnormal, controlling the executor to terminate the training task and put it back into the task queue.
In the embodiment of the invention, the resource utilization reported by each executor is monitored, including GPU utilization, CPU utilization and storage read/write speed. The execution status of each training task is monitored at the same time, and the executors are controlled through inter-process communication to handle events such as the start, completion, termination and abnormality of training tasks. When an abnormality occurs, the executor is controlled to terminate the abnormal training task and put it back into the task queue, so that the task participates in the next scheduling round and can be executed later.
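The exception-handling path can be illustrated with the following minimal sketch; the structure of the status reports and the executor and task-queue interfaces are assumptions of the example.

```python
def handle_reports(reports, task_queue, executors):
    """Monitoring sketch: `reports` is an assumed list of per-task status dicts pushed
    by executors; abnormal tasks are terminated and put back into the task queue."""
    for report in reports:
        if report["status"] == "error":
            executors[report["executor"]].terminate(report["task_id"])
            task_queue.requeue(report["task_id"])   # rescheduled in a later round
```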
In the embodiment of the invention, the multi-resource sharing scheduling method for deep learning training tasks is implemented in a multi-resource sharing scheduler for deep learning training tasks, and the implementation involves the multi-resource sharing scheduler and an executor cluster; the multi-resource sharing scheduler for deep learning training tasks is referred to below simply as the scheduler. As shown in fig. 4, the scheduler includes a resource analysis module, a task scheduling module and a monitoring module. Users submit deep learning training tasks to the scheduler; the scheduler maintains a task queue to buffer the submitted training tasks, makes scheduling decisions for them, and at the same time monitors the execution of the training tasks on the executors and performs exception handling based on that execution.
The resource analysis module analyzes the resource usage of each training task submitted to the task queue to obtain the corresponding resource usage data, computes the sharing efficiency between training tasks according to the sharing mechanism, and, once the resource usage data and sharing efficiency of the training tasks are obtained, passes them to the task scheduling module as its input.
The task scheduling module periodically produces a sharing scheduling scheme from the acquired resource usage data and sharing efficiency, and controls the executor cluster to execute the training tasks according to the sharing scheduling scheme.
The monitoring module monitors the resource utilization of the executor cluster and the execution status of each training task, determines from the execution status whether a training task is abnormal, and, when a training task is abnormal, controls the executor to terminate the training task and put it back into the task queue.
The executor cluster executes the deep learning training tasks according to the scheduling decisions of the scheduler, and at the same time reports its own resource utilization and the execution progress of the deep learning training tasks to the monitoring module of the scheduler. When an abnormality occurs during training, the executor reports the abnormality information to the monitoring module of the scheduler and assists in handling the abnormality.
In the embodiment of the invention, the operation flow of the scheduler has three stages: the scheduler collects execution data of the deep learning training tasks, the scheduler generates the scheduling policy, and the executors perform training.
First, the scheduler collects execution data of the deep learning training tasks. The scheduler first profiles with the resource analysis module: it executes a preset number of training iterations for each training task submitted to the task queue for the first time and collects operator-level resource usage data during deep learning training. Specifically, the execution time of each training stage, the resource types used and the amount of each resource occupied by the operators are obtained; this information is aggregated into the average iteration time and the average usage time and amount of each resource; the resource usage data is stored in the database; and the sharing efficiency of the training tasks is computed from these data and the sharing mechanism. The resource analysis module passes the resource usage data and the sharing efficiency to the task scheduling module.
The scheduler then generates the scheduling policy. In order to respond in time to events such as task completion, task submission and abnormalities, the task scheduling module uses periodic scheduling, i.e. a global scheduling round is started every preset time interval. The decision produced by each round includes the training tasks to be executed in the next period, the sharing pattern between training tasks, and the resource allocation of the training tasks. For sharing mechanisms that can provide a sharing efficiency, the task scheduling module abstracts the decision process as a maximum-weight graph matching problem; the invention extends the blossom algorithm for general graphs into a multi-round heuristic scheduling algorithm and provides a priority-based method for compressing the matching candidate set. The final scheduling policy consists of a training task sharing scheme and a resource allocation scheme, which together form the sharing scheduling scheme.
Finally, the executors execute the tasks. The executor deployed on each machine receives the sharing scheduling scheme and performs model training accordingly. Different training tasks are executed according to the sharing mechanism; the resource usage and training progress of the training tasks during execution are recorded and reported to the monitoring module of the scheduler. In addition, when an abnormality occurs, the executor reports the abnormality information to the monitoring module of the scheduler and assists the monitoring module in terminating or re-executing the training task.
In the embodiment of the invention, the task scheduling module periodically generates the sharing scheduling scheme of the training tasks from the tasks buffered in the task queue and the selected sharing mechanism. The invention adopts a time-interleaved sharing mechanism to realize sharing among multiple training tasks: the sharing mechanism computes a sharing policy from the iteration time and per-resource usage time provided by the resource analysis module, and broadcasts the policy to the executor deployed on each machine through inter-process communication.
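Purely as an illustration of the time-interleaved idea, the following sketch staggers the start offsets of the tasks in a sharing group so that their GPU phases do not overlap; the per-iteration phase profiles and the assumption of a single GPU phase per iteration are simplifications of this example.

```python
def interleave_offsets(iteration_profiles):
    """Time-interleaved sharing sketch: each profile is an assumed list of
    (resource, duration) phases for one iteration; tasks are started with staggered
    offsets so their GPU phases run back to back rather than overlapping."""
    offsets, gpu_busy_until = {}, 0.0
    for task_id, phases in iteration_profiles.items():
        gpu_start = 0.0
        for resource, duration in phases:
            if resource == "GPU":
                break
            gpu_start += duration              # time spent before the GPU phase begins
        offsets[task_id] = max(0.0, gpu_busy_until - gpu_start)
        gpu_phase = sum(d for r, d in phases if r == "GPU")
        gpu_busy_until = offsets[task_id] + gpu_start + gpu_phase
    return offsets

profiles = {
    "rl_job":  [("CPU", 40.0), ("GPU", 10.0)],     # simulation then back-propagation (ms)
    "cnn_job": [("storage", 5.0), ("GPU", 30.0)],
}
print(interleave_offsets(profiles))
```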
In the embodiment of the invention, an executor is deployed on each server of the executor cluster, and each executor executes training tasks according to the sharing scheduling scheme of the scheduler. The executor uses PyTorch as the deep learning framework and Horovod as the distributed training framework, and executes shared tasks with the multi-resource time-interleaved sharing mechanism. The executor records the number of iterations executed and the average iteration time of each training task, and reports them to the monitoring module of the scheduler through inter-process communication. When an abnormality occurs, the executor reports the abnormality information to the monitoring module and terminates the abnormal task.
In the embodiment of the invention, the multi-resource sharing scheduler for deep learning training tasks fully considers the diversity of resource usage (GPU, CPU, storage and network) in existing deep learning training, and performs cluster-level optimization on top of existing deep learning sharing mechanisms. It consists mainly of two parts, the scheduler and the executors: the scheduler is responsible for acquiring model information, deciding the scheduling policy and monitoring task execution; the executor is responsible for executing tasks according to the scheduling policy. Eight common deep learning models were selected and the Philly Trace cluster trace was used to evaluate the system composed of the proposed scheduler and executor cluster in terms of task completion time, makespan, resource utilization and other metrics, and a real-cluster experiment was performed on a cluster of 64 V100 GPUs. The experimental results show that, compared with the baseline methods, the average completion time, tail completion time and makespan of the training tasks are effectively reduced while the utilization of the various resources is improved. A simulation experiment with 128 GPUs further verifies the effectiveness of the system and its generality under different parameter configurations. The multi-resource sharing scheduling method for deep learning training tasks thus improves cluster and task efficiency in deep learning training scenarios: without affecting training accuracy, multi-resource sharing and scheduling greatly reduces the completion time and makespan of deep learning tasks and greatly improves the utilization of the multiple resources in the cluster.
In the embodiment of the invention, the multi-resource sharing scheduling method for deep learning training tasks targets data-parallel distributed training and can also be applied to model-parallel, pipeline-parallel and other distributed training modes. The system of the invention is implemented on homogeneous V100 GPUs and can also be applied to heterogeneous clusters containing various acceleration hardware.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
The multi-resource sharing scheduling method for deep learning training tasks provided by the invention is described in detail, and specific examples are applied to illustrate the principles and the implementation modes of the invention, and the description of the above examples is only used for helping to understand the method and the core ideas of the method; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (10)

1. A multi-resource sharing scheduling method for deep learning training tasks, characterized by comprising the following steps:
acquiring resource usage data of each training task submitted to a task queue;
determining the sharing efficiency between training tasks from the acquired resource usage data and a sharing mechanism;
determining a sharing scheduling scheme from the acquired resource usage data and the sharing efficiency;
and controlling an executor cluster to execute the training tasks according to the sharing scheduling scheme.
2. The multi-resource sharing scheduling method for deep learning training tasks according to claim 1, wherein the acquiring resource usage data of each training task submitted to the task queue comprises:
determining whether each training task in the task queue is submitted to the task queue for the first time;
when a training task is submitted to the task queue for the first time, executing a preset number of training iterations for the training task, acquiring its resource usage data during execution, and storing the acquired resource usage data in a database;
when a training task has previously been submitted to the task queue, acquiring the resource usage data corresponding to the training task from the database.
3. The multi-resource sharing scheduling method for deep learning training tasks according to claim 1, wherein, when the resource usage data includes the resource usage type, executing a preset number of training iterations for a training task submitted to the task queue for the first time, acquiring its resource usage data during execution, and storing the acquired resource usage data in a database comprises:
when the training task is submitted to the task queue for the first time, executing a preset number of training iterations for the training task;
acquiring the utilization of each resource in each time period of the training iterations, and normalizing the utilization peak of each resource within the same time period;
and determining the resource type with the highest normalized utilization as the resource usage type of that time period.
4. The multi-resource sharing scheduling method for deep learning training tasks according to claim 1, wherein the determining a sharing scheduling scheme from the acquired resource usage data and the sharing efficiency comprises:
determining the priorities of the training tasks waiting in the task queue and of the training tasks already executing, from the acquired resource usage data of each training task;
sorting the training tasks by these priorities to obtain a corresponding ordering;
selecting, according to the ordering, the top-ranked training tasks that can occupy the executor cluster resources to form a subset of training tasks to be executed;
and determining a sharing scheduling scheme for the subset of training tasks to be executed from the acquired resource usage data and sharing efficiency of each training task in the subset.
5. The multi-resource sharing scheduling method for deep learning training tasks according to claim 4, wherein the determining a sharing scheduling scheme for the subset of training tasks to be executed from the acquired resource usage data and sharing efficiency of each training task in the subset comprises:
constructing a corresponding hypergraph from the acquired resource usage data and sharing efficiency of each training task in the subset, wherein each node of the hypergraph represents one training task or one group of training tasks, the nodes on an edge represent training tasks that share the same resources, and the weight of an edge is the sharing efficiency among the nodes on that edge;
determining a plurality of matchings from the hypergraph;
determining the matching with the largest sum of sharing efficiencies among the plurality of matchings as the training task sharing scheme;
and determining the sharing scheduling scheme of the subset of training tasks to be executed from a preset scheduling strategy and the training task sharing scheme.
6. The multi-resource sharing scheduling method for deep learning training tasks according to claim 5, wherein the nodes on any edge of the hypergraph have the same resource occupancy.
7. The multi-resource sharing scheduling method for deep learning training tasks according to claim 5, wherein, when the hypergraph contains a hyperedge connecting more than two nodes, the method further comprises:
converting the hyperedge connecting the plurality of nodes into edges that each connect only two nodes, to construct a subgraph;
determining a plurality of sub-matchings from the subgraph;
determining the sub-matching with the largest sum of sharing efficiencies among the plurality of sub-matchings as the sub-training-task sharing scheme;
and merging each pair of nodes shared in the sub-training-task sharing scheme into one node, so that a hyperedge connecting a plurality of nodes in the hypergraph is reduced to an edge connecting only two nodes.
8. The multi-resource sharing scheduling method for deep learning training tasks according to claim 4, wherein the determining the priorities of the training tasks waiting in the task queue and already executing from the acquired resource usage data of each training task comprises:
when the duration of a training task is known, determining its priority from the remaining time of the training task and the number of resources it requires;
when the duration of a training task is unknown, determining its priority from the already executed time of the training task and the number of resources it requires.
9. The multi-resource sharing scheduling method for deep learning training tasks according to claim 1, wherein the sharing scheduling scheme at least comprises allocating resources to the training tasks in descending order of the number of resources each training task requires.
10. The multi-resource sharing scheduling method for deep learning training tasks according to claim 1, further comprising:
monitoring the resource utilization of the executor cluster and the execution status of each training task;
determining whether a training task is abnormal from its execution status;
and when the training task is abnormal, controlling the executor to terminate the training task and put it back into the task queue.
CN202310124944.9A 2023-02-07 2023-02-07 Multi-resource sharing scheduling method for deep learning training task Pending CN116932201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310124944.9A CN116932201A (en) 2023-02-07 2023-02-07 Multi-resource sharing scheduling method for deep learning training task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310124944.9A CN116932201A (en) 2023-02-07 2023-02-07 Multi-resource sharing scheduling method for deep learning training task

Publications (1)

Publication Number Publication Date
CN116932201A true CN116932201A (en) 2023-10-24

Family

ID=88388458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310124944.9A Pending CN116932201A (en) 2023-02-07 2023-02-07 Multi-resource sharing scheduling method for deep learning training task

Country Status (1)

Country Link
CN (1) CN116932201A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519954A (en) * 2024-01-08 2024-02-06 北京大学 Multi-resource function task scheduling system oriented to server non-perception calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination