WO2021051713A1 - Working method and device for deep learning training tasks - Google Patents

Working method and device for deep learning training tasks

Info

Publication number
WO2021051713A1
WO2021051713A1 PCT/CN2019/129995 CN2019129995W WO2021051713A1 WO 2021051713 A1 WO2021051713 A1 WO 2021051713A1 CN 2019129995 W CN2019129995 W CN 2019129995W WO 2021051713 A1 WO2021051713 A1 WO 2021051713A1
Authority
WO
WIPO (PCT)
Prior art keywords
deep learning
learning training
gpu
task
parameters
Prior art date
Application number
PCT/CN2019/129995
Other languages
English (en)
French (fr)
Inventor
赵仁明
陈培
Original Assignee
广东浪潮大数据研究有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广东浪潮大数据研究有限公司 filed Critical 广东浪潮大数据研究有限公司
Priority to US17/761,877 priority Critical patent/US20230333898A1/en
Priority to KR1020227010633A priority patent/KR20220054396A/ko
Publication of WO2021051713A1 publication Critical patent/WO2021051713A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/547Remote procedure calls [RPC]; Web services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the field of deep learning, and in particular to a working method and device for deep learning training tasks.
  • Deep learning training technology is a brand-new technology that is developing very rapidly. With the increase in the amount of data used for deep learning training services and the increase in training speed requirements, the demand for computing power has also increased significantly. The demand for basic resources for training tasks has evolved from single-server single-GPU training to single-server multi-GPU training and multi-server multi-GPU training. The overall scale of GPU server clusters has also increased significantly.
  • this application provides a working method and device for deep learning training tasks, which, by reasonably allocating the remaining GPU resources on single-server nodes and multi-server nodes, solves the problem that the prior art cannot guarantee GPU utilization while accommodating both single-machine tasks and multi-machine tasks.
  • the present invention provides a working method for deep learning training tasks, including:
  • the type of the deep learning training task is determined from the task parameters, and the type of the deep learning training task includes: single-machine and multi-machine;
  • when the task type is single-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is selected within a single server node, according to the deep learning training task parameters, to perform the work;
  • when the task type is multi-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is first selected from a single server node; if no single server node satisfies the conditions, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources are selected from a combination of multiple server nodes to perform the work;
  • the CPU with the shortest communication distance to the GPU is selected to perform the work.
  • the deep learning training task parameters include: a neural network model, a data set, a training batch size (Batch size), and a training mode.
  • selecting the GPU with the smallest remaining resources according to the deep learning training task parameters includes: selecting the GPU with the smallest remaining resources that satisfies the network model, data set and Batch size conditions to perform the work.
  • selecting, within a server node, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources includes: using the BestFit algorithm to compute, according to the deep learning training task parameters, the GPU within the server node that satisfies those parameters and has the smallest amount of remaining resources, and having it perform the work.
  • the single-machine task includes: a single-machine single-card task or a single-machine multi-card task.
  • the multi-machine task includes: a multi-machine multi-card Ring-AllReduce task or a multi-machine multi-card PS-Worker task.
  • when the task type is a multi-machine multi-card PS-Worker task, the scheduler first searches a single CPU subtree for a GPU with the smallest remaining resources that satisfies the deep learning training task parameters; if none exists in a single CPU subtree, it searches a single server node; if none exists in a single server node, it searches across multiple server nodes; if none exists at all, it waits for the next scheduling round.
  • when the task type is a multi-machine multi-card Ring-AllReduce task, the scheduler first searches a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, it searches across multiple server nodes; if none exist at all, it waits for the next scheduling round.
  • before acquiring the deep learning training task parameters input by the user, the method further includes: establishing a resource topology for each server according to its resources, the resource topology being used to show the communication overhead between GPU nodes within the server.
  • before acquiring the deep learning training task parameters input by the user, the method further includes: establishing a topology between the server nodes according to the network interconnection mode and network topology between them, the topology being used to show the communication speed between the server nodes.
  • when the task type is multi-machine, the selection further includes: when a single server node satisfying the conditions exists, selecting the GPUs with the lowest communication overhead within that node; when no such node exists, selecting, from a combination of multiple server nodes, a group of GPUs whose communication links are identical and fastest.
  • before acquiring the deep learning training task parameters input by the user, the method further includes: dynamically updating the resource usage of each node and each GPU card.
  • This application also provides a working device for deep learning training tasks, the device including:
  • the acquiring unit is used to acquire the deep learning training task parameters input by the user;
  • the identifying unit is used to determine the type of the deep learning training task from the task parameters; the type of the deep learning training task includes: single-machine and multi-machine;
  • the first allocation unit is configured to allocate GPU nodes to the training task; when the task type is single-machine, it selects, within a single server node and according to the deep learning training task parameters, the GPU that satisfies those parameters and has the smallest amount of remaining resources; when the task type is multi-machine, it first selects such a GPU from a single server node and, if no single server node satisfies the conditions, selects GPUs that satisfy the parameters and have the smallest amount of remaining resources from a combination of multiple server nodes;
  • the second allocation unit is configured to allocate, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to the training task.
  • the deep learning training task parameters include: a neural network model, a data set, a training batch size (Batch size), and a training mode.
  • the first allocation unit includes:
  • the first selection unit is used to screen out the GPU with the smallest remaining resources that satisfies the network model, data set and Batch size conditions.
  • the first allocation unit includes:
  • the calculation unit is configured to compute, via the BestFit algorithm and according to the deep learning training task parameters, the GPU within the server node that satisfies those parameters and has the smallest amount of remaining resources, and have it perform the work.
  • the single-machine task includes: a single-machine single-card task or a single-machine multi-card task.
  • the multi-machine task includes: a multi-machine multi-card Ring-AllReduce task or a multi-machine multi-card PS-Worker task.
  • the first allocation unit includes:
  • the PS-Worker allocation unit is used, when the task type is a multi-machine multi-card PS-Worker task, to first search a single CPU subtree for a GPU with the smallest remaining resources that satisfies the deep learning training task parameters; if no such GPU exists in a single CPU subtree, to search a single server node; if no such GPU exists in a single server node, to search across multiple server nodes; and, if none exists, to wait for the next scheduling round.
  • the first allocation unit includes:
  • the Ring-AllReduce allocation unit is used, when the task type is a multi-machine multi-card Ring-AllReduce task, to first search a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, to search multiple server nodes for such GPUs; and, if none exist, to wait for the next scheduling round.
  • the device further includes:
  • the first topology unit, used to establish a resource topology for each server according to its resources, the resource topology being used to show the communication overhead between GPU nodes within the server.
  • the device further includes:
  • the second topology unit, used to establish a topology between server nodes according to the network interconnection mode and network topology between them, the topology being used to show the communication speed between the server nodes.
  • the first allocation unit further includes:
  • the second selection unit is used, when the training task is multi-machine and a single server node satisfying the conditions exists, to select the GPUs with the lowest communication overhead within that server node; when no single server node satisfies the conditions, to select, from a combination of multiple server nodes, a group of GPUs whose communication links are identical and fastest.
  • before the acquiring unit, the device further includes:
  • the update unit, used to dynamically update the resource usage of each node and each GPU card.
  • the method described in this application has the following advantages: by setting different GPU allocation strategies for single-machine and multi-machine tasks, and by allocating GPU node resources to training tasks according to the principle of minimal remaining GPU resources, a single server cluster can process single-machine tasks and multi-machine tasks at the same time while maximizing the use of GPU node resources.
  • in addition, this application establishes a resource topology for each server according to the resources within that server, the resource topology being used to show the communication overhead between GPU nodes in the server; it also establishes a topology between server nodes according to the network interconnection mode and network topology between them, the topology being used to show the communication speed between server nodes, so that when multi-machine tasks need to be processed, the GPU group with the lowest communication overhead and the fastest communication speed can be selected to perform the work.
  • FIG. 1 is a flowchart of the working method for a deep learning training task provided in Embodiment 1 of this application;
  • FIG. 2 is a server resource topology diagram provided in Embodiment 2 of this application;
  • FIG. 3 is a server resource topology tree provided in Embodiment 2 of this application;
  • FIG. 4 is a server node adjacency matrix table provided in Embodiment 2 of this application;
  • FIG. 5 is a topology diagram between server nodes provided in Embodiment 2 of this application;
  • FIG. 6 is a topology tree diagram between server nodes provided in Embodiment 2 of this application;
  • FIG. 7 is a topology result table between server nodes provided in Embodiment 2 of this application;
  • FIG. 8 is a flowchart of another embodiment of the working method for a deep learning training task provided in Embodiment 2 of this application;
  • FIG. 9 is a structural block diagram of the working device for a deep learning training task provided in Embodiment 3 of this application.
  • the first embodiment of the present application provides a working method of a deep learning training task, which will be described in detail below with reference to the accompanying drawings.
  • FIG. 1 is a flowchart of a working method of a deep learning training task provided in Embodiment 1 of the application.
  • the deep learning training task parameters include: a neural network model, a data set, a training batch size (Batch size), and a training mode.
  • the neural network model, the data set and the Batch size determine the amount of GPU resources required, and the training mode determines how GPUs are allocated.
  • S102 Determine the type of the deep learning training task from the task parameters; the type of the deep learning training task includes: single-machine and multi-machine;
  • single-machine tasks include the single-machine single-card type and the single-machine multi-card type.
  • a single-machine single-card task is a training task consisting of a single process that uses only one GPU card of one physical server; a single-machine multi-card task is a training task consisting of a single process that uses multiple GPU cards of the same physical server.
  • multi-machine tasks include multi-machine multi-card Ring-AllReduce tasks and multi-machine multi-card PS-Worker tasks.
  • a multi-machine multi-card Ring-AllReduce task is a multi-machine multi-card task with 0 PS nodes and more than 1 Worker node.
  • a multi-machine multi-card PS-Worker task is a multi-machine multi-card task with more than 1 PS node and more than 0 Worker nodes.
  • the user specifies the corresponding training type by setting the numbers of PS nodes and Worker nodes.
  • S103 When the task type is single-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is selected within a single server node according to those parameters;
  • in this case the task contains only one process and GPU resources only need to be found within one physical server, so only the reasonableness of the GPU resource allocation needs to be considered.
  • the principle for allocating GPU resources to a task is: on the premise that the amount of resources required by the task is satisfied, the GPU with the least remaining resources is selected, which guarantees maximum utilization of GPU resources.
  • the amount of resources required by the task is determined by the neural network model, the data set and the Batch Size in the deep learning training task parameters set by the user.
  • the specific calculation is:
  • memory occupied by the model outputs = output size of each layer × batchSize
  • memory occupied by the model = memory occupied by the parameters W + memory occupied by the gradients + memory occupied by the optimizer momentum (for the SGD + Momentum training mode)
  • once the model is determined, the output size of each layer is also determined:
  • out denotes the computed width of the feature map, inw the input size, P the padding, and f the convolution kernel size
  • total memory occupied = memory occupied by the model + memory occupied by the model outputs; when the model is small or the Batch size is large, total memory occupied ≈ Batch size × memory occupied by a single sample
  • after obtaining the memory footprint required by the task, the BestFit algorithm is used to find the most suitable GPU for allocation.
  • the content of the BestFit algorithm is given as pseudocode in the description.
  • S104 When the task type is multi-machine, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources are first selected from a single server node according to those parameters; if no single server node satisfies the conditions, GPUs that satisfy the parameters and have the smallest amount of remaining resources are selected from a combination of multiple server nodes to perform the work;
  • when the multi-machine task is a multi-machine multi-card PS-Worker task, information is passed upwards level by level, so the tree-structured GPUs within a single physical server execute this kind of task fastest; the scheduler therefore first searches a single CPU subtree for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters; if none exist in a single CPU subtree, it searches a single server node; if none exist in a single server node, it searches across multiple server nodes.
  • when the task type is a multi-machine multi-card Ring-AllReduce task, which is a closed-loop information-passing task, the scheduler first searches a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed loop; if none exist, it searches across multiple server nodes; if none exist at all, it waits for the next scheduling round.
  • Embodiment 2 of this application provides another embodiment of the deep learning training task method, described in detail below with reference to the drawings.
  • FIG. 8 is a flowchart of the deep learning training task method provided in Embodiment 2 of this application.
  • S201 Establish a resource topology structure for each server according to the resource situation of each server, where the resource topology structure is used to display the communication overhead between GPU nodes in the server.
  • Figure 2 is a resource topology diagram of one server;
  • Figure 3 is the topology tree generated from the resource topology diagram;
  • Figure 4 is the node adjacency matrix table generated from the resource topology diagram.
  • for GPUs connected by different link types within the same server, the communication overhead is defined in six levels:
  • SYS: communication only across sockets, through QPI/UPI (i.e., across NUMA groups).
  • NODE: communication within a NUMA group through different PCIe Host Bridges.
  • PHB: communication through a single PCIe Host Bridge (communication between the CPU and GPUs in the same group uses this path).
  • PXB: communication through multiple PCIe switches without passing through a PCIe Host Bridge.
  • PIX: communication within the same PCIe switch.
  • NV#: communication through NVLink.
  • the communication overhead decreases from top to bottom: SYS has the largest overhead and NVLink the smallest.
  • the most common connection types are NV, NODE and SYS.
  • S202 Establish a topological structure between the server nodes according to the network interconnection mode and the network topology between the server nodes, where the topological structure is used to display the communication speed between the server nodes.
  • the topology between servers is shown in Fig. 5; Fig. 6 is the topology tree generated from that topology, and Fig. 7 is the node adjacency matrix generated from it.
  • for the communication speed between servers, this application uses the following definitions:
  • IB 1: the two nodes can communicate through one level of IB switch.
  • IB 2: the two nodes need to communicate through two levels of IB switches.
  • IB n: the two nodes need to communicate through n levels of IB switches.
  • Ethernet 1: the two nodes can communicate through a single switch (two stacked switches of the same level are regarded as one switch).
  • Ethernet 2: the two nodes need to communicate through two levels of switches.
  • Ethernet n: the two nodes need to communicate through n levels of switches.
  • where n > 2 and n ∈ N+; the communication overhead increases from top to bottom, and IB switching is faster than Ethernet switching.
  • NODE1 and NODE2 belong to the same access switch.
  • NODE2 and NODE3 are connected through different access switches but through the same aggregation switch.
  • NODE3 and NODE4 are interconnected through the same IB switch.
  • NODE4 and NODE5 are interconnected through an IB router; from this, the adjacency matrix of FIG. 6 can be obtained, in which X indicates that two nodes are not connected to each other.
  • IB1 is the fastest link type.
  • the weight of IB1 is defined as 1, and the weight of Ethernet1 is defined as 100.
  • the weight of X is defined as -1.
  • the weights of the link types increase in the order IB1 < IB2 < Ethernet1 < Ethernet2 (i.e., the weights are 1, 2, 100 and 101; IB hierarchies of more than 100 levels are not considered for now).
  • S203 Dynamically update the resource usage of each node and each GPU card.
  • This step is to enable the system to find qualified GPU cards in time.
  • S207 When the task type is multi-machine, a group of GPUs that satisfies the deep learning training task parameters, has the smallest amount of remaining resources and has the lowest communication overhead is first selected from a single server node according to those parameters; if no single server node satisfies the conditions, GPUs that satisfy the parameters, have the smallest amount of remaining resources and have the fastest and identical communication speed are selected from a combination of multiple server nodes to perform the work.
  • when a multi-machine training task runs within a single server, the GPU group with the lowest communication overhead is selected according to the single-server resource topology described above; when a multi-machine training task runs across multiple servers, the communication speed between servers is far slower than the communication speed between the resource nodes of a single server, so only the communication between servers needs to be considered at that point.
  • using the server-node topology, a group of GPUs on the server nodes with the fastest communication speed can be selected.
  • when the multi-machine task is a multi-machine multi-card PS-Worker task, in the tree structure a transfer is only complete after every lower-level GPU has finished passing its data up to the top;
  • when the task is a multi-machine multi-card Ring-AllReduce task, in the ring structure a transfer is only complete after every GPU node has passed its information to the next GPU node; therefore, by the bucket (weakest-link) effect,
  • the transfer time of both kinds of multi-machine task is determined by the two slowest GPU nodes; this application therefore restricts GPU selection to a group of GPU nodes with identical communication speed, to reduce resource waste.
  • the third embodiment of the present application also provides a working device of the deep learning training task, which will be described in detail below with reference to the accompanying drawings.
  • FIG. 9 is a working device for deep learning training tasks provided in the third embodiment of the application, and the device includes:
  • an identifying unit, configured to determine the type of the deep learning training task from the task parameters, where the type of the deep learning training task includes: single-machine and multi-machine;
  • a first allocation unit, configured to allocate GPU nodes to the training task, where, when the task type is single-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is selected within a single server node according to those parameters; when the task type is multi-machine, such a GPU is first selected from a single server node and, if no qualifying single server node exists, GPUs that satisfy the parameters and have the smallest amount of remaining resources are selected from a combination of multiple server nodes;
  • a second allocation unit, configured to allocate, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to the training task.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural.
  • the character "/" generally indicates an "or" relationship between the associated objects before and after it.
  • "at least one of the following items" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or multiple.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A working method and device for deep learning training tasks. By allocating GPUs to multiple types of deep learning training tasks according to the remaining GPU resources on single-server or multi-server nodes, the method guarantees GPU utilization while accommodating multiple types of deep learning training tasks. The method includes: acquiring deep learning training task parameters input by a user; determining the type of the deep learning training task from the task parameters, the type including single-machine and multi-machine; selecting GPUs according to different strategies for different task types; and selecting, according to the location of the selected GPUs, the CPU with the shortest communication distance to the GPUs to perform the work.

Description

Working method and device for deep learning training tasks
This application claims priority to Chinese patent application No. 201910894815.1, entitled "Working method and device for deep learning training tasks" and filed with the China Patent Office on September 20, 2019, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of deep learning, and in particular to a working method and device for deep learning training tasks.
Background
Deep learning training is a new technology that is developing very rapidly. As the amount of data used in deep learning training services grows and the requirements on training speed rise, the demand for computing power has also increased significantly. The basic resource requirements of training tasks have evolved from single-server single-GPU training to single-server multi-GPU training and multi-server multi-GPU training, and the overall scale of GPU server clusters has also grown markedly.
As the most heavily used resource in a cluster, and one that is scarcer than CPU and memory, the GPU usually determines, through its utilization, the overall efficiency of deep learning training tasks. How to guarantee GPU utilization while accommodating both single-machine tasks and multi-machine tasks has become an urgent problem, and the prior art lacks a working method for deep learning training tasks that can solve it.
Summary
To solve the above technical problems in the prior art, this application provides a working method and device for deep learning training tasks, which, by reasonably allocating the remaining GPU resources on single-server nodes and multi-server nodes, solves the problem that the prior art cannot guarantee GPU utilization while accommodating both single-machine tasks and multi-machine tasks.
The present invention provides a working method for deep learning training tasks, including:
acquiring deep learning training task parameters input by a user;
determining the type of the deep learning training task from the task parameters, the type of the deep learning training task including: single-machine and multi-machine;
when the task type is single-machine, selecting, according to the deep learning training task parameters, a GPU within a single server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources to perform the work;
when the task type is multi-machine, first selecting, according to the deep learning training task parameters, a GPU from a single server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources to perform the work, and, if no single server node satisfies the conditions, selecting, from a combination of multiple server nodes, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources to perform the work;
selecting, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
Optionally, the deep learning training task parameters include: a neural network model, a data set, a training batch size (Batch size) and a training mode.
Optionally, selecting the GPU with the smallest remaining resources according to the deep learning training task parameters includes:
selecting the GPU with the smallest remaining resources that satisfies the network model, data set and Batch size conditions to perform the work.
Optionally, selecting, according to the deep learning training task parameters, a GPU within a server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources includes:
using the BestFit algorithm to compute, according to the deep learning training task parameters, the GPU within the server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources, and having it perform the work.
Optionally, the single-machine task includes: a single-machine single-card task or a single-machine multi-card task.
Optionally, the multi-machine task includes: a multi-machine multi-card Ring-AllReduce task or a multi-machine multi-card PS-Worker task.
Optionally, the step of, when the task type is multi-machine, first selecting from a single server node a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources and, if no single server node satisfies the conditions, selecting such GPUs from a combination of multiple server nodes, includes:
when the task type is a multi-machine multi-card PS-Worker task, first searching a single CPU subtree for a GPU with the smallest remaining resources that satisfies the deep learning training task parameters; when no such GPU exists in a single CPU subtree, searching a single server node for such a GPU; when no such GPU exists in a single server node, searching multiple server nodes for such a GPU; and, when none exists, waiting for the next scheduling round.
Optionally, the step of, when the task type is multi-machine, first selecting from a single server node a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources and, if no single server node satisfies the conditions, selecting such GPUs from a combination of multiple server nodes, includes:
when the task type is a multi-machine multi-card Ring-AllReduce task, first searching a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, searching multiple server nodes for such GPUs; and, if none exist, waiting for the next scheduling round.
Optionally, before acquiring the deep learning training task parameters input by the user, the method further includes:
establishing a resource topology for each server according to the resources of that server, the resource topology being used to show the communication overhead between GPU nodes within the server.
Optionally, before acquiring the deep learning training task parameters input by the user, the method further includes:
establishing a topology between the server nodes according to the network interconnection mode and network topology between the server nodes, the topology being used to show the communication speed between the server nodes.
Optionally, the step of, when the task type is multi-machine, first selecting from a single server node a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources and, if no single server node satisfies the conditions, selecting such GPUs from a combination of multiple server nodes, further includes:
when a single server node satisfying the conditions exists, selecting the GPUs with the lowest communication overhead within that single server node; when no single server node satisfies the conditions, selecting, from a combination of multiple server nodes, a group of GPUs whose communication speeds are identical and whose communication weight is the smallest.
Optionally, before acquiring the deep learning training task parameters input by the user, the method further includes:
dynamically updating the resource usage of each node and each GPU card.
This application also provides a working device for deep learning training tasks, the device including:
an acquiring unit, configured to acquire deep learning training task parameters input by a user;
an identifying unit, configured to determine the type of the deep learning training task from the task parameters, the type of the deep learning training task including: single-machine and multi-machine;
a first allocation unit, configured to allocate GPU nodes to the training task, wherein, when the task type is single-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is selected within a single server node according to the deep learning training task parameters; and, when the task type is multi-machine, such a GPU is first selected from a single server node and, if no single server node satisfies the conditions, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources are selected from a combination of multiple server nodes to perform the work;
a second allocation unit, configured to allocate to the training task, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
Optionally, the deep learning training task parameters include: a neural network model, a data set, a training batch size (Batch size) and a training mode.
Optionally, the first allocation unit includes:
a first selection unit, used to screen out the GPU with the smallest remaining resources that satisfies the network model, data set and Batch size conditions.
Optionally, the first allocation unit includes:
a calculation unit, configured to compute, via the BestFit algorithm and according to the deep learning training task parameters, the GPU within the server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources, and have it perform the work.
Optionally, the single-machine task includes: a single-machine single-card task or a single-machine multi-card task.
Optionally, the multi-machine task includes: a multi-machine multi-card Ring-AllReduce task or a multi-machine multi-card PS-Worker task.
Optionally, the first allocation unit includes:
a PS-Worker allocation unit, used, when the task type is a multi-machine multi-card PS-Worker task, to first search a single CPU subtree for a GPU with the smallest remaining resources that satisfies the deep learning training task parameters; when no such GPU exists in a single CPU subtree, to search a single server node for such a GPU; when no such GPU exists in a single server node, to search multiple server nodes for such a GPU; and, when none exists, to wait for the next scheduling round.
Optionally, the first allocation unit includes:
a Ring-AllReduce allocation unit, used, when the task type is a multi-machine multi-card Ring-AllReduce task, to first search a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, to search multiple server nodes for such GPUs; and, if none exist, to wait for the next scheduling round.
Optionally, before the acquiring unit, the device further includes:
a first topology unit, used to establish a resource topology for each server according to the resources of that server, the resource topology being used to show the communication overhead between GPU nodes within the server.
Optionally, before the acquiring unit, the device further includes:
a second topology unit, used to establish a topology between server nodes according to the network interconnection mode and network topology between the server nodes, the topology being used to show the communication speed between the server nodes.
Optionally, the first allocation unit further includes:
a second selection unit, used, when the training task is multi-machine and a single server node satisfying the conditions exists, to select the GPUs with the lowest communication overhead within that single server node; and, when no single server node satisfies the conditions, to select, from a combination of multiple server nodes, a group of GPUs whose communication speeds are identical and whose communication weight is the smallest.
Optionally, before the acquiring unit, the device further includes:
an update unit, used to dynamically update the resource usage of each node and each GPU card.
The method described in this application has the following advantages: by setting different GPU allocation strategies for single-machine and multi-machine tasks, and by allocating GPU node resources to training tasks according to the principle of minimal remaining GPU resources, a single server cluster can process single-machine tasks and multi-machine tasks at the same time while maximizing the use of GPU node resources.
In addition, this application establishes a topology for each server according to the resources within that server, the resource topology being used to show the communication overhead between GPU nodes in the server; it also establishes a topology between server nodes according to the network interconnection mode and network topology between them, the topology being used to show the communication speed between server nodes, so that when multi-machine tasks need to be processed, the GPU group with the lowest communication overhead and the fastest communication speed can be selected to perform the work.
Brief description of the drawings
In order to illustrate the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a flowchart of the working method for a deep learning training task provided in Embodiment 1 of this application;
FIG. 2 is a server resource topology diagram provided in Embodiment 2 of this application;
FIG. 3 is a server resource topology tree provided in Embodiment 2 of this application;
FIG. 4 is a server node adjacency matrix table provided in Embodiment 2 of this application;
FIG. 5 is a topology diagram between server nodes provided in Embodiment 2 of this application;
FIG. 6 is a topology tree diagram between server nodes provided in Embodiment 2 of this application;
FIG. 7 is a topology result table between server nodes provided in Embodiment 2 of this application;
FIG. 8 is a flowchart of another embodiment of the working method for a deep learning training task provided in Embodiment 2 of this application;
FIG. 9 is a structural block diagram of the working device for a deep learning training task provided in Embodiment 3 of this application.
Detailed description of the embodiments
Embodiment 1:
Embodiment 1 of this application provides a working method for deep learning training tasks, described in detail below with reference to the drawings.
Referring to FIG. 1, which is a flowchart of the working method for deep learning training tasks provided in Embodiment 1 of this application.
The method described in Embodiment 1 of this application includes the following steps:
S101: acquire deep learning training task parameters input by a user;
The deep learning training task parameters include: a neural network model, a data set, a training batch size (Batch size) and a training mode. The neural network model, the data set and the Batch size determine the amount of GPU resources required, and the training mode determines how GPUs are allocated.
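For illustration only, the task parameters above could be carried in a structure such as the following Python sketch; the field names are assumptions of this example, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class TaskParams:
    model: str             # neural network model identifier
    dataset: str           # data set identifier
    batch_size: int        # training batch size (Batch size)
    training_mode: str     # e.g. "single-machine" or "multi-machine"
    ps_nodes: int = 0      # number of PS nodes, used to derive the multi-machine sub-type
    worker_nodes: int = 0  # number of Worker nodes
```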
S102: determine the type of the deep learning training task from the task parameters, the type of the deep learning training task including: single-machine and multi-machine;
Single-machine tasks are further divided into the single-machine single-card type and the single-machine multi-card type. A single-machine single-card task is a training task consisting of a single process that uses only one GPU card of one physical server; a single-machine multi-card task is a training task consisting of a single process that uses multiple GPU cards of the same physical server.
Multi-machine tasks are further divided into multi-machine multi-card Ring-AllReduce tasks and multi-machine multi-card PS-Worker tasks. A multi-machine multi-card Ring-AllReduce task is a multi-machine multi-card task with 0 PS nodes and more than 1 Worker node; a multi-machine multi-card PS-Worker task is a multi-machine multi-card task with more than 1 PS node and more than 0 Worker nodes. The user specifies the corresponding training type by setting the numbers of PS nodes and Worker nodes.
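The two rules above map directly onto the PS/Worker counts set by the user. A minimal sketch, assuming that any other combination should be rejected (the patent does not say how such combinations are handled):

```python
def classify_multi_machine(ps_nodes: int, worker_nodes: int) -> str:
    """Derive the multi-machine sub-type from the PS/Worker counts, per the rules above."""
    if ps_nodes == 0 and worker_nodes > 1:
        return "multi-machine multi-card Ring-AllReduce"
    if ps_nodes > 1 and worker_nodes > 0:
        return "multi-machine multi-card PS-Worker"
    raise ValueError("PS/Worker counts do not match either multi-machine sub-type")
```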
S103: when the task type is single-machine, select, according to the deep learning training task parameters, a GPU within a single server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources to perform the work;
When the task type is single-machine, the task contains only one process and GPU resources only need to be found within one physical server, so only the reasonableness of the GPU resource allocation needs to be considered.
The principle for allocating GPU resources to a task is: on the premise that the amount of resources required by the task is satisfied, the GPU with the least remaining resources is selected, which guarantees maximum utilization of GPU resources. The amount of resources required by a task is determined by the neural network model, the data set and the Batch Size in the deep learning training task parameters set by the user. The specific calculation is:
memory occupied by the model outputs = output size of each layer × batchSize
memory occupied by the model = memory occupied by the parameters W + memory occupied by the gradients + memory occupied by the optimizer momentum
(for the SGD + Momentum training mode)
Once the model is determined, the output size of each layer is also determined:
(the layer output-size formula is given as image PCTCN2019129995-appb-000001 in the original)
where out is the computed width of the feature map, inw is the input size, P is the padding, and f is the convolution kernel size.
total memory occupied = memory occupied by the model + memory occupied by the model outputs
When the model is small or the batch size is large:
total memory occupied ≈ batch size × memory occupied by a single sample
Therefore, once a model and a batch size are given, a definite memory footprint can be obtained. During scheduling, it must be guaranteed that the GPU memory is sufficient.
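As a rough illustration of the bookkeeping above (not the patent's own formula, which is only shown as an image), the following Python sketch assumes FP32 (4 bytes per element), a stride-1 convolution unless a stride is given, and that only weights, gradients and optimizer momentum are counted for the model:

```python
def conv_out_width(in_w: int, f: int, p: int, s: int = 1) -> int:
    # Standard convolution output-width formula; stride s assumed 1 unless given.
    return (in_w + 2 * p - f) // s + 1

def estimate_total_bytes(param_count, per_layer_output_elems, batch_size,
                         bytes_per_elem=4, with_momentum=True):
    # Model memory: weights W + gradients (+ optimizer momentum, e.g. SGD+Momentum).
    copies = 3 if with_momentum else 2
    model_bytes = copies * param_count * bytes_per_elem
    # Output memory: per-layer activation elements times the batch size.
    output_bytes = batch_size * sum(per_layer_output_elems) * bytes_per_elem
    return model_bytes + output_bytes
```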
After obtaining the memory footprint required by the task, the BestFit algorithm is used to find the most suitable GPU for allocation. The BestFit algorithm is as follows:
BestFit algorithm pseudocode
(given as images PCTCN2019129995-appb-000002 and PCTCN2019129995-appb-000003 in the original)
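Since the pseudocode itself is only available as images, the following is a minimal sketch of one common reading of BestFit in this setting: among the GPUs whose free memory can hold the task, pick the one that would be left with the least slack, i.e. the GPU with the smallest remaining resources.

```python
def best_fit_gpu(gpus, required_bytes):
    # gpus: iterable of dicts such as {"id": "node1-gpu0", "free_bytes": 8 * 2**30}.
    candidates = [g for g in gpus if g["free_bytes"] >= required_bytes]
    if not candidates:
        return None  # nothing fits: the task waits for the next scheduling round
    # Smallest leftover after allocation = tightest (best) fit.
    return min(candidates, key=lambda g: g["free_bytes"] - required_bytes)
```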
S104: when the task type is multi-machine, first select, according to the deep learning training task parameters, GPUs from a single server node that satisfy the deep learning training task parameters and have the smallest amount of remaining resources to perform the work; if no single server node satisfies the conditions, select, from a combination of multiple server nodes, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources to perform the work;
When the multi-machine task is a multi-machine multi-card PS-Worker task, information is passed upwards level by level, so GPUs arranged in a tree structure within a single physical server execute this kind of task fastest. The scheduler therefore first searches a single CPU subtree for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters; if no such GPUs exist in a single CPU subtree, it searches a single server node; if no such GPUs exist in a single server node, it searches across multiple server nodes — because communication between servers is far slower than information transfer between GPUs within one server, the speed advantage of the tree structure is negligible once the task spans servers. If none exist, the scheduling round is considered to have found no suitable resources, and the task waits for the next round.
When the task type is a multi-machine multi-card Ring-AllReduce task, which is a closed-loop information-passing task, the scheduler first searches a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, it searches across multiple server nodes for such GPUs; if none exist at all, the scheduling round is considered to have found no suitable resources, and the task waits for the next round.
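Both searches follow the same "widen the scope until something fits" pattern. The sketch below is an illustration of that pattern only; how candidate groups and the fitness test are built (including the closed-loop check for Ring-AllReduce) is left to the caller.

```python
def allocate_hierarchically(scopes, fits):
    # scopes: candidate GPU groups in priority order, e.g.
    #   PS-Worker:      [cpu_subtree_groups, node_groups, [cluster_wide_group]]
    #   Ring-AllReduce: [node_ring_groups, [cluster_wide_ring_group]]
    # fits(group) checks the task parameters (and, for Ring-AllReduce, the closed loop).
    for scope in scopes:
        for group in scope:
            if fits(group):
                return group
    return None  # no suitable resources this round: wait for the next scheduling
```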
S105: select, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
Selecting the CPU in this way guarantees the fastest information transfer and the lowest communication overhead.
Embodiment 2:
Embodiment 2 of this application provides another embodiment of the deep learning training task method, described in detail below with reference to the drawings.
Referring to FIG. 8, which is a flowchart of the deep learning training task method provided in Embodiment 2 of this application.
This embodiment includes the following steps:
S201: establish a resource topology for each server according to the resources of that server, the resource topology being used to show the communication overhead between GPU nodes within the server.
As shown in FIG. 2, FIG. 2 is a resource topology diagram of one server, FIG. 3 is the topology tree generated from the resource topology diagram, and FIG. 4 is the node adjacency matrix table generated from the resource topology diagram. For GPUs connected by different link types within the same server, this application defines the communication overhead in the following six levels:
SYS: communication only across sockets, through QPI/UPI (i.e., across NUMA groups).
NODE: communication within a NUMA group through different PCIe Host Bridges.
PHB: communication through a single PCIe Host Bridge (communication between the CPU and GPUs in the same group uses this path).
PXB: communication through multiple PCIe switches without passing through a PCIe Host Bridge.
PIX: communication within the same PCIe switch.
NV#: communication through NVLink.
The communication overhead decreases from top to bottom: SYS has the largest overhead and NVLink the smallest. The most common connection types are NV, NODE and SYS.
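These are the same link labels that `nvidia-smi topo -m` reports, so the six levels can be turned into a simple overhead ranking. A minimal sketch, with the rank values being an assumption of this example:

```python
# Overhead ranks for intra-server link types, smaller = cheaper, following the six levels above.
LINK_OVERHEAD = {"NV": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

def link_rank(link: str) -> int:
    # 'NV1', 'NV2', ... all count as NVLink.
    return LINK_OVERHEAD["NV"] if link.startswith("NV") else LINK_OVERHEAD[link]

def cheapest_pair(adjacency):
    # adjacency: dict mapping (gpu_a, gpu_b) -> link label, e.g. {("GPU0", "GPU1"): "NV2"}.
    (a, b), link = min(adjacency.items(), key=lambda kv: link_rank(kv[1]))
    return a, b, link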
S202: establish a topology between the server nodes according to the network interconnection mode and network topology between the server nodes, the topology being used to show the communication speed between the server nodes.
The topology between servers is shown in FIG. 5; FIG. 6 is the topology tree generated from that topology, and FIG. 7 is the node adjacency matrix generated from it. For the communication speed between servers, this application uses the following definitions:
IB 1: the two nodes can communicate through one level of IB switch.
IB 2: the two nodes need to communicate through two levels of IB switches.
IB n: the two nodes need to communicate through n levels of IB switches.
Ethernet 1: the two nodes can communicate through a single switch (two stacked switches of the same level are regarded as one switch).
Ethernet 2: the two nodes need to communicate through two levels of switches.
Ethernet n: the two nodes need to communicate through n levels of switches.
Here n > 2, n ∈ N+. The communication overhead increases from top to bottom, and IB switching is faster than Ethernet switching.
As can be seen from FIG. 5, NODE1 and NODE2 belong to the same access switch; NODE2 and NODE3 are connected through different access switches but through the same aggregation switch; NODE3 and NODE4 are interconnected through the same IB switch; NODE4 and NODE5 are interconnected through an IB router. From this, the adjacency matrix of FIG. 6 can be obtained, in which X indicates that two nodes are not connected to each other. IB1 is the fastest; the weight of IB1 is defined as 1, the weight of Ethernet1 as 100, and the weight of X as -1. The weights increase in the order IB1 < IB2 < Ethernet1 < Ethernet2 (i.e., 1, 2, 100 and 101; IB hierarchies of more than 100 levels are not considered for now).
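Given those weights, building the weighted adjacency matrix and picking the fastest reachable pair of nodes is straightforward. A minimal sketch, where `link_of(a, b)` is a hypothetical lookup that returns one of the labels defined above:

```python
# Edge weights for the inter-node topology as defined above; X means "not connected".
WEIGHTS = {"IB1": 1, "IB2": 2, "Ethernet1": 100, "Ethernet2": 101, "X": -1}

def build_adjacency(nodes, link_of):
    # link_of(a, b) returns a label such as "IB1" or "X" for two distinct nodes.
    return [[0 if a == b else WEIGHTS[link_of(a, b)] for b in nodes] for a in nodes]

def fastest_pair(nodes, matrix):
    # Pick the reachable pair with the smallest weight, i.e. the fastest interconnect.
    pairs = [(matrix[i][j], nodes[i], nodes[j])
             for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if matrix[i][j] > 0]
    return min(pairs) if pairs else None
```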
S203: dynamically update the resource usage of each node and each GPU card.
This step enables the system to find qualifying GPU cards in time.
S204-S206: the same as S101-S103.
S207: when the task type is multi-machine, first select, according to the deep learning training task parameters, a group of GPUs from a single server node that satisfies the task parameters, has the smallest amount of remaining resources and has the lowest communication overhead; if no single server node satisfies the conditions, select, from a combination of multiple server nodes, GPUs that satisfy the task parameters, have the smallest amount of remaining resources and have the fastest and identical communication speed to perform the work.
Since a multi-machine training task contains multiple processes, information must be passed between the processes. When a multi-machine training task runs within a single server, the GPU group with the lowest communication overhead is chosen according to the single-server resource topology described above. When a multi-machine training task runs across multiple servers, the communication speed between servers is far slower than between the resource nodes of a single server, so at this point only the communication between servers needs to be considered; using the server-node topology described above, a group of GPUs on the server nodes with the fastest communication can be selected. Moreover, when the multi-machine task is a multi-machine multi-card PS-Worker task, in the tree structure one transfer is only complete after every lower-level GPU has finished passing its data upwards; when the task is a multi-machine multi-card Ring-AllReduce task, in the ring structure one transfer is only complete after every GPU node has passed its information to the next GPU node. Therefore, by the bucket (weakest-link) effect, the time of one transfer for both kinds of multi-machine task is determined by the two slowest GPU nodes. For this reason, this application restricts GPU selection to a group of GPU nodes with identical communication speed, to reduce resource waste.
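The "identical and fastest speed" constraint can be illustrated with a brute-force search over the weighted adjacency matrix from S202; the sketch below assumes groups of at least two nodes and is intended only for small clusters.

```python
from itertools import combinations

def same_speed_group(node_names, matrix, k):
    # Enumerate k-node groups (k >= 2), keep those whose pairwise weights are all equal
    # (so no slow link drags the transfer down, per the bucket effect), and return the
    # group with the smallest, i.e. fastest, common weight.
    best = None
    for group in combinations(range(len(node_names)), k):
        weights = {matrix[i][j] for i, j in combinations(group, 2)}
        if len(weights) == 1 and -1 not in weights:   # identical speed, all reachable
            w = weights.pop()
            if best is None or w < best[0]:
                best = (w, [node_names[i] for i in group])
    return best
```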
S208: select, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
Embodiment 3:
Based on the working method for deep learning training tasks provided in the above embodiments, Embodiment 3 of this application further provides a working device for deep learning training tasks, described below with reference to the drawings.
As shown in FIG. 9, which shows the working device for deep learning training tasks provided in Embodiment 3 of this application, the device includes:
101: an acquiring unit, configured to acquire deep learning training task parameters input by a user;
102: an identifying unit, configured to determine the type of the deep learning training task from the task parameters, the type of the deep learning training task including: single-machine and multi-machine;
103: a first allocation unit, configured to allocate GPU nodes to the training task, wherein, when the task type is single-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is selected within a single server node according to the deep learning training task parameters; and, when the task type is multi-machine, such a GPU is first selected from a single server node and, if no single server node satisfies the conditions, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources are selected from a combination of multiple server nodes to perform the work;
104: a second allocation unit, configured to allocate to the training task, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method embodiments, which is not repeated here.
It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or multiple.
The embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the others. In particular, the device embodiment is described relatively simply because it is substantially similar to the method embodiment, and the relevant parts may refer to the description of the method embodiment. The device embodiments described above are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is only a specific implementation of this application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of this application, and these improvements and refinements shall also fall within the protection scope of this application.

Claims (23)

  1. A working method for deep learning training tasks, wherein the method comprises:
    acquiring deep learning training task parameters input by a user;
    determining the type of the deep learning training task from the task parameters, the type of the deep learning training task comprising: single-machine and multi-machine;
    when the task type is single-machine, selecting, according to the deep learning training task parameters, a GPU within a single server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources to perform the work;
    when the task type is multi-machine, first selecting, according to the deep learning training task parameters, a GPU from a single server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources to perform the work, and, if no single server node satisfies the conditions, selecting, from a combination of multiple server nodes, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources to perform the work;
    selecting, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
  2. The method according to claim 1, wherein the deep learning training task parameters comprise: a neural network model, a data set, a training batch size (Batch size) and a training mode.
  3. The method according to claim 2, wherein selecting the GPU with the smallest remaining resources according to the deep learning training task parameters comprises:
    selecting the GPU with the smallest remaining resources that satisfies the network model, data set and Batch size conditions to perform the work.
  4. The method according to claim 1, wherein selecting, according to the deep learning training task parameters, a GPU within a server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources comprises:
    using the BestFit algorithm to compute, according to the deep learning training task parameters, the GPU within the server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources, and having it perform the work.
  5. The method according to claim 1, wherein the single-machine task comprises: a single-machine single-card task or a single-machine multi-card task.
  6. The method according to claim 1, wherein the multi-machine task comprises: a multi-machine multi-card Ring-AllReduce task or a multi-machine multi-card PS-Worker task.
  7. The method according to claim 6, wherein the step of, when the task type is multi-machine, first selecting from a single server node a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources and, if no single server node satisfies the conditions, selecting such GPUs from a combination of multiple server nodes, comprises:
    when the task type is a multi-machine multi-card PS-Worker task, first searching a single CPU subtree for a GPU with the smallest remaining resources that satisfies the deep learning training task parameters; when no such GPU exists in a single CPU subtree, searching a single server node for such a GPU; when no such GPU exists in a single server node, searching multiple server nodes for such a GPU; and, when none exists, waiting for the next scheduling round.
  8. The method according to claim 6, wherein the step of, when the task type is multi-machine, first selecting from a single server node a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources and, if no single server node satisfies the conditions, selecting such GPUs from a combination of multiple server nodes, comprises:
    when the task type is a multi-machine multi-card Ring-AllReduce task, first searching a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, searching multiple server nodes for such GPUs; and, if none exist, waiting for the next scheduling round.
  9. The method according to claim 1, wherein, before acquiring the deep learning training task parameters input by the user, the method further comprises:
    establishing a resource topology for each server according to the resources of that server, the resource topology being used to show the communication overhead between GPU nodes within the server.
  10. The method according to claim 1, wherein, before acquiring the deep learning training task parameters input by the user, the method further comprises:
    establishing a topology between the server nodes according to the network interconnection mode and network topology between the server nodes, the topology being used to show the communication speed between the server nodes.
  11. The method according to claim 1, wherein the step of, when the task type is multi-machine, first selecting from a single server node a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources and, if no single server node satisfies the conditions, selecting such GPUs from a combination of multiple server nodes, further comprises:
    when a single server node satisfying the conditions exists, selecting the GPUs with the lowest communication overhead within that single server node; when no single server node satisfies the conditions, selecting, from a combination of multiple server nodes, a group of GPUs whose communication speeds are identical and whose communication weight is the smallest.
  12. The method according to claim 1, wherein, before acquiring the deep learning training task parameters input by the user, the method further comprises:
    dynamically updating the resource usage of each node and each GPU card.
  13. A working device for deep learning training tasks, wherein the device comprises:
    an acquiring unit, configured to acquire deep learning training task parameters input by a user;
    an identifying unit, configured to determine the type of the deep learning training task from the task parameters, the type of the deep learning training task comprising: single-machine and multi-machine;
    a first allocation unit, configured to allocate GPU nodes to the training task, wherein, when the task type is single-machine, a GPU that satisfies the deep learning training task parameters and has the smallest amount of remaining resources is selected within a single server node according to the deep learning training task parameters to perform the work; and, when the task type is multi-machine, such a GPU is first selected from a single server node and, if no single server node satisfies the conditions, GPUs that satisfy the deep learning training task parameters and have the smallest amount of remaining resources are selected from a combination of multiple server nodes to perform the work;
    a second allocation unit, configured to allocate to the training task, according to the location of the GPU, the CPU with the shortest communication distance to the GPU to perform the work.
  14. The device according to claim 13, wherein the deep learning training task parameters comprise: a neural network model, a data set, a training batch size (Batch size) and a training mode.
  15. The device according to claim 14, wherein the first allocation unit comprises:
    a first selection unit, used to screen out the GPU with the smallest remaining resources that satisfies the network model, data set and Batch size conditions.
  16. The device according to claim 13, wherein the first allocation unit comprises:
    a calculation unit, configured to compute, via the BestFit algorithm and according to the deep learning training task parameters, the GPU within the server node that satisfies the deep learning training task parameters and has the smallest amount of remaining resources, and have it perform the work.
  17. The device according to claim 13, wherein the multi-machine task comprises: a multi-machine multi-card Ring-AllReduce task or a multi-machine multi-card PS-Worker task.
  18. The device according to claim 17, wherein the first allocation unit comprises:
    a PS-Worker allocation unit, used, when the task type is a multi-machine multi-card PS-Worker task, to first search a single CPU subtree for a GPU with the smallest remaining resources that satisfies the deep learning training task parameters; when no such GPU exists in a single CPU subtree, to search a single server node for such a GPU; when no such GPU exists in a single server node, to search multiple server nodes for such a GPU; and, when none exists, to wait for the next scheduling round.
  19. The device according to claim 17, wherein the first allocation unit comprises:
    a Ring-AllReduce allocation unit, used, when the task type is a multi-machine multi-card Ring-AllReduce task, to first search a single server node for GPUs with the smallest remaining resources that satisfy the deep learning training task parameters and can form a closed-loop structure; if none exist, to search multiple server nodes for such GPUs; and, if none exist, to wait for the next scheduling round.
  20. The device according to claim 13, wherein, before the acquiring unit, the device further comprises:
    a first topology unit, used to establish a resource topology for each server according to the resources of that server, the resource topology being used to show the communication overhead between GPU nodes within the server.
  21. The device according to claim 13, wherein, before the acquiring unit, the device further comprises:
    a second topology unit, used to establish a topology between server nodes according to the network interconnection mode and network topology between the server nodes, the topology being used to show the communication speed between the server nodes.
  22. The device according to claim 13, wherein the first allocation unit further comprises:
    a second selection unit, used, when the training task is multi-machine and a single server node satisfying the conditions exists, to select the GPUs with the lowest communication overhead within that single server node; and, when no single server node satisfies the conditions, to select, from a combination of multiple server nodes, a group of GPUs whose communication speeds are identical and whose communication weight is the smallest.
  23. The device according to claim 13, wherein, before the acquiring unit, the device further comprises:
    an update unit, used to dynamically update the resource usage of each node and each GPU card.
PCT/CN2019/129995 2019-09-20 2019-12-30 一种深度学习训练任务的工作方法及装置 WO2021051713A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/761,877 US20230333898A1 (en) 2019-09-20 2019-12-30 Working method and device for deep learning training task
KR1020227010633A KR20220054396A (ko) 2019-09-20 2019-12-30 딥러닝 트레이닝 태스크의 작동 방법 및 장치

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910894815.1 2019-09-20
CN201910894815.1A CN110618870B (zh) 2019-09-20 2019-09-20 一种深度学习训练任务的工作方法及装置

Publications (1)

Publication Number Publication Date
WO2021051713A1 true WO2021051713A1 (zh) 2021-03-25

Family

ID=68923824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/129995 WO2021051713A1 (zh) 2019-09-20 2019-12-30 一种深度学习训练任务的工作方法及装置

Country Status (4)

Country Link
US (1) US20230333898A1 (zh)
KR (1) KR20220054396A (zh)
CN (1) CN110618870B (zh)
WO (1) WO2021051713A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033098A (zh) * 2021-03-26 2021-06-25 山东科技大学 一种基于AdaRW算法的海洋目标检测深度学习模型训练方法
CN113377520A (zh) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 资源调度方法、装置、设备以及存储介质
CN113469372A (zh) * 2021-07-02 2021-10-01 北京市商汤科技开发有限公司 强化学习训练方法、装置、电子设备以及存储介质
CN115378818A (zh) * 2022-10-26 2022-11-22 西南民族大学 一种适用于大规模分布式机器学习的新型拓扑设计方法
CN116155750A (zh) * 2023-04-19 2023-05-23 之江实验室 深度学习作业资源放置方法、***、设备和存储介质

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110618870B (zh) * 2019-09-20 2021-11-19 广东浪潮大数据研究有限公司 一种深度学习训练任务的工作方法及装置
CN111191794B (zh) * 2019-12-29 2023-03-14 广东浪潮大数据研究有限公司 一种训练任务处理方法、装置、设备及可读存储介质
CN113127163A (zh) * 2019-12-31 2021-07-16 杭州海康威视数字技术股份有限公司 模型验证方法、装置及电子设备
CN111738404B (zh) * 2020-05-08 2024-01-12 深圳市万普拉斯科技有限公司 模型训练任务处理方法、装置、电子设备和存储介质
CN111880911A (zh) * 2020-06-19 2020-11-03 浪潮电子信息产业股份有限公司 一种任务负载调度方法、装置、设备及可读存储介质
CN112084017B (zh) * 2020-07-30 2024-04-19 北京聚云科技有限公司 一种内存管理方法、装置、电子设备及存储介质
WO2022102009A1 (ja) * 2020-11-11 2022-05-19 日本電信電話株式会社 分散処理システムおよび方法
CN112988383A (zh) * 2021-03-12 2021-06-18 中国平安人寿保险股份有限公司 一种资源分配方法、装置、设备以及存储介质
CN113094183B (zh) * 2021-06-09 2021-09-17 苏州浪潮智能科技有限公司 Ai训练平台的训练任务创建方法、装置、***及介质
CN113900793B (zh) * 2021-07-29 2023-11-10 苏州浪潮智能科技有限公司 一种服务器集群及其深度学习的集合通信***和方法
CN114091688B (zh) * 2021-11-25 2022-05-20 北京九章云极科技有限公司 一种计算资源获取方法、装置、电子设备和存储介质
CN116187426B (zh) * 2022-11-09 2024-04-19 北京百度网讯科技有限公司 深度学习模型的模型参数多流广播方法及其装置
CN117687802B (zh) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 一种基于云平台的深度学***台

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110228A1 (en) * 2014-06-17 2016-04-21 Huawei Technologies Co., Ltd. Service Scheduling Method, Apparatus, and System
CN108805798A (zh) * 2017-05-05 2018-11-13 英特尔公司 用于深度学习框架的细粒度计算通信执行
CN109918199A (zh) * 2019-02-28 2019-06-21 中国科学技术大学苏州研究院 基于gpu的分布式图处理***
CN110231976A (zh) * 2019-05-20 2019-09-13 西安交通大学 一种基于负载预测的边缘计算平台容器部署方法及***
CN110618870A (zh) * 2019-09-20 2019-12-27 广东浪潮大数据研究有限公司 一种深度学习训练任务的工作方法及装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130088512A (ko) * 2012-01-31 2013-08-08 한국전자통신연구원 클러스터 컴퓨팅 환경에서의 자원 관리 장치 및 방법
CN103699440B (zh) * 2012-09-27 2017-05-24 北京搜狐新媒体信息技术有限公司 一种云计算平台***为任务分配资源的方法和装置
CN105573827A (zh) * 2015-12-11 2016-05-11 联动优势电子商务有限公司 一种多机并行处理方法及装置
CN106919442A (zh) * 2015-12-24 2017-07-04 中国电信股份有限公司 多gpu调度装置和分布式计算***以及多gpu调度方法
CN106878439B (zh) * 2017-03-03 2020-08-11 广东浪潮大数据研究有限公司 一种多节点计算机***内中继节点选择和资源分配方法
US11797837B2 (en) * 2017-04-24 2023-10-24 Intel Corporation Dynamic distributed training of machine learning models
CN108460457A (zh) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 一种面向卷积神经网络的多机多卡混合并行异步训练方法
CN109660526A (zh) * 2018-12-05 2019-04-19 国网江西省电力有限公司信息通信分公司 一种应用于信息安全领域的大数据分析方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160110228A1 (en) * 2014-06-17 2016-04-21 Huawei Technologies Co., Ltd. Service Scheduling Method, Apparatus, and System
CN108805798A (zh) * 2017-05-05 2018-11-13 英特尔公司 用于深度学习框架的细粒度计算通信执行
CN109918199A (zh) * 2019-02-28 2019-06-21 中国科学技术大学苏州研究院 基于gpu的分布式图处理***
CN110231976A (zh) * 2019-05-20 2019-09-13 西安交通大学 一种基于负载预测的边缘计算平台容器部署方法及***
CN110618870A (zh) * 2019-09-20 2019-12-27 广东浪潮大数据研究有限公司 一种深度学习训练任务的工作方法及装置

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033098A (zh) * 2021-03-26 2021-06-25 山东科技大学 一种基于AdaRW算法的海洋目标检测深度学习模型训练方法
CN113033098B (zh) * 2021-03-26 2022-05-17 山东科技大学 一种基于AdaRW算法的海洋目标检测深度学习模型训练方法
CN113469372A (zh) * 2021-07-02 2021-10-01 北京市商汤科技开发有限公司 强化学习训练方法、装置、电子设备以及存储介质
CN113377520A (zh) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 资源调度方法、装置、设备以及存储介质
CN115378818A (zh) * 2022-10-26 2022-11-22 西南民族大学 一种适用于大规模分布式机器学习的新型拓扑设计方法
CN115378818B (zh) * 2022-10-26 2023-02-24 西南民族大学 一种适用于大规模分布式机器学习的新型拓扑设计方法
CN116155750A (zh) * 2023-04-19 2023-05-23 之江实验室 深度学习作业资源放置方法、***、设备和存储介质
CN116155750B (zh) * 2023-04-19 2023-08-01 之江实验室 深度学习作业资源放置方法、***、设备和存储介质

Also Published As

Publication number Publication date
US20230333898A1 (en) 2023-10-19
KR20220054396A (ko) 2022-05-02
CN110618870B (zh) 2021-11-19
CN110618870A (zh) 2019-12-27

Similar Documents

Publication Publication Date Title
WO2021051713A1 (zh) 一种深度学习训练任务的工作方法及装置
CN108829494B (zh) 基于负载预测的容器云平台智能资源优化方法
US9054977B2 (en) Automatic NoC topology generation
WO2015196911A1 (zh) 数据挖掘方法和节点
CN104834569A (zh) 一种基于应用类型的集群资源调度方法及***
WO2013015905A1 (en) Method and apparatus for assignment of virtual resources within a cloud environment
CN113904923B (zh) 一种基于软件定义网络的服务功能链联合优化方法
CN111158909B (zh) 集群资源分配处理方法、装置、设备及存储介质
Paul et al. MG-Join: A scalable join for massively parallel multi-GPU architectures
CN110191155B (zh) 一种面向胖树互连网络的并行作业调度方法、***及存储介质
CN113238848A (zh) 一种任务调度方法、装置、计算机设备和存储介质
CN113033800A (zh) 分布式深度学习方法、装置、参数服务器及主工作节点
CN108427602B (zh) 一种分布式计算任务的协同调度方法及装置
Wang et al. Dependency-aware network adaptive scheduling of data-intensive parallel jobs
WO2020124488A1 (zh) 应用进程映射方法、电子装置及计算机可读存储介质
CN112596879B (zh) 用于量子云计算平台任务调度的方法
CN110958192B (zh) 一种基于虚拟交换机的虚拟数据中心资源分配***及方法
WO2017100987A1 (zh) 基于拥塞规避的非均匀带宽虚拟数据中心嵌入实现方法
WO2015055502A2 (en) Method of partitioning storage in a distributed data storage system and corresponding device
CN113094179A (zh) 作业分配方法、装置、电子设备及可读存储介质
CN107918676A (zh) 结构化查询的资源优化方法及数据库查询***
CN110308965A (zh) 云数据中心的基于规则的启发式虚拟机分配方法及***
Lu et al. NPIY: A novel partitioner for improving mapreduce performance
CN114691370A (zh) 一种基于遗传算法的任务分配方法和装置
CN116610422A (zh) 一种任务调度方法、装置和***

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945480

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20227010633

Country of ref document: KR

Kind code of ref document: A

122 Ep: pct application non-entry in european phase

Ref document number: 19945480

Country of ref document: EP

Kind code of ref document: A1