CN110297670B - Method and system for improving training efficiency of distributed tasks on container cloud - Google Patents

Method and system for improving training efficiency of distributed tasks on container cloud

Info

Publication number
CN110297670B
CN110297670B (application CN201910413700.6A)
Authority
CN
China
Prior art keywords: container, network, RDMA, training, distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910413700.6A
Other languages
Chinese (zh)
Other versions
CN110297670A (en)
Inventor
张春海
孙夏
冉玫美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd
Priority to CN201910413700.6A
Publication of CN110297670A
Application granted
Publication of CN110297670B
Legal status: Active


Classifications

    • G06F9/44505: Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4484: Executing subprograms (procedural execution paradigms)
    • G06F9/465: Distributed object oriented systems
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/547: Remote procedure calls [RPC]; Web services
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method and a system for improving the training efficiency of distributed tasks on a container cloud. An additional RDMA network is provided for the training container cluster, and after the container cluster is deployed but before the distributed training task starts executing, the network environment of each subtask is reconfigured in time so that RDMA network communication is used when the subtasks communicate. This improves the communication efficiency of training-task data, relieves the training-data communication bottleneck of distributed training on the container cloud under large-model and large-data scenarios, and thereby improves the efficiency of executing distributed training on a container cloud platform.

Description

Method and system for improving training efficiency of distributed tasks on container cloud
Technical Field
The invention relates to the field of container clouds and distributed machine learning; in particular, to a method and system for improving the training efficiency of distributed tasks on a container cloud.
Background
As research into big data and machine learning deepens, machine learning in the big-data era increasingly exhibits the characteristics of large models and large data. "Large model" means that, as machine learning (especially deep learning) advances, many problems require ever larger models to approximate the objective function of the problem as closely as possible. "Large data" means that when the training set is small, the results of machine learning (especially deep learning) are unsatisfactory, so a training set as large as possible is generally needed to improve those results. Consequently, in large-scale machine learning training scenarios, the training data and model parameters are too large for a single machine to handle, and distributed machine learning has emerged to meet this need.
Distributed machine learning decomposes a training task into many small tasks and distributes them to multiple devices for training. It involves not only spreading the training tasks across multiple processors but also distributing the data (both training data and intermediate results) across the storage of different devices. To obtain greater computing power, storage, throughput, and fault tolerance, training is increasingly performed in this distributed fashion.
However, building, deploying, operating, and maintaining a bare-metal cluster (i.e., a physical host cluster) capable of practically meaningful distributed machine learning training is a highly specialized, complex, even cumbersome undertaking. Container cloud technology has therefore been applied to distributed machine learning to reduce the difficulty of construction, deployment, operation, and maintenance.
Container cloud technology not only enables rapid deployment of container clusters but is also a lightweight solution that can effectively integrate and manage bare-metal resources. Taking a Kubernetes platform running distributed machine learning training tasks as an example: Kubernetes provides a consistent way to package applications and ensures they behave consistently on different devices; it also isolates the resources of each application's runtime environment, abstracts away the complexity of the hardware layer and node management, supports GPU scheduling, and scales elastically with the needs of applications and clusters.
Containers and container orchestration tools on a container cloud platform all run on top of the operating system, so default communication is typically provided through the connection access services offered by the platform. Although this is a highly available container-cloud networking solution, it cannot bypass the operating system. Because communication under this scheme requires the intervention of the operating system and its protocol stack, transmitting training gradients over the network in big-data training scenarios inevitably consumes a large amount of CPU and incurs significant network latency, which severely constrains training efficiency.
RDMA (Remote Direct Memory Access) is a technique that transfers buffers directly between applications on two nodes across the network. Compared with traditional network transmission, RDMA requires no intervention from the operating system or protocol stack, avoiding heavy CPU consumption during transmission and reducing network latency. When physical hosts act as the nodes of a distributed computing cluster, RDMA communication has long been achieved by fitting each physical node with an RDMA network card (i.e., a physical NIC supporting the RDMA protocol).
To use computing resources more efficiently, when distributed machine training is deployed on a container cloud platform, the task to be trained is usually decomposed into several subtasks, and environment configuration parameters are generated for each subtask (to preserve the dependency relationships among subtasks and control data consistency among them). A corresponding container/container group is then created for each subtask (a container/container group is the minimum unit managed by container orchestration; a container runs an independent application in the container environment, while a container group is a 'logical host' running one or more tightly coupled application containers, such as a Pod on the Kubernetes platform), and connection access services are run for the distributed training. During training, each subtask uses the connection parameter in its environment configuration, namely the connection access service name, to reach the corresponding service, so training-data communication takes place over the default network. However, such connection access services, which realize default inter-container communication through mechanisms such as kernel iptables, are only suited to the default network and clearly cannot support an RDMA network. (A minimal sketch of such a service follows.)
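To make the default communication path concrete, here is a minimal sketch, assuming a Kubernetes platform, of such a connection access service; all names and the port are illustrative, not taken from the patent:

    import json

    # A headless Service (illustrative names) that exposes one training
    # container group under a stable DNS name on the default network.
    # kube-proxy/iptables resolve and route this name, which is exactly why
    # this path cannot carry RDMA traffic.
    worker_service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "train-worker-0"},  # the default connection parameter
        "spec": {
            "clusterIP": "None",  # headless: the name resolves to the Pod IP
            "selector": {"job": "train", "task": "worker-0"},
            "ports": [{"port": 2222}],  # a common TensorFlow gRPC port
        },
    }
    print(json.dumps(worker_service, indent=2))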
Disclosure of Invention
In view of the foregoing, the present invention provides a method and system for improving the training efficiency of distributed tasks on a container cloud.
In one aspect, the embodiment of the invention provides a method for improving the training efficiency of distributed tasks on a container cloud.
The method comprises the following steps:
when the container cloud platform deploys the distributed training tasks,
decomposing the training task into a plurality of subtasks;
generating environment configuration parameters for the subtasks respectively;
deploying a container cluster for training tasks:
creating a corresponding container/container group for each subtask,
and providing connection access services plus additional RDMA network access;
after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they use RDMA network communication;
after configuration is complete, starting execution of the distributed training task.
On the other hand, the embodiment of the invention provides a system for improving the training efficiency of distributed tasks on a container cloud.
Corresponding to the first aspect, the system includes:
a distributed training task management unit, a task scheduling unit, and a container cloud platform, wherein:
the distributed training task management unit is used for decomposing a task to be trained into a plurality of subtasks;
the task scheduling unit is used for scheduling the various tasks, including the subtasks, on the container cloud platform; specifically:
generating environment configuration parameters and defining the containers/container groups and the like to be created for performing the various tasks;
the container cloud platform is used for deploying the container cluster for training and managing that cluster; specifically:
creating a container/container group corresponding to the subtask, providing connection access service and additionally providing RDMA network access according to the definition of the task scheduling unit;
and after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they use RDMA network communication;
and after configuration is complete, starting execution of the distributed training task.
With the method and system for improving the training efficiency of distributed tasks on the container cloud described here, an additional RDMA network is provided for the training container cluster, and the subtasks' network environment is reconfigured in the window after the cluster is deployed and before the distributed training task starts, so that communication uses the RDMA network. This improves the communication efficiency of training-task data, relieves the training-data communication bottleneck of distributed training on the container cloud under large-model and large-data scenarios, and thereby improves the efficiency of executing distributed training on a container cloud platform.
Drawings
To illustrate the embodiments of the present invention or prior-art technical solutions more clearly, the drawings required for describing some embodiments or the prior art are briefly introduced below.
Fig. 1 is a flow chart of a method for improving training efficiency of distributed tasks on a container cloud according to some preferred embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, embodiments of the invention. All other embodiments obtained from them by a person of ordinary skill in the art without creative effort fall within the scope of the present invention.
The following are some preferred embodiments of the present invention.
These preferred embodiments provide a method of improving the efficiency of distributed task training on a container cloud. As shown in fig. 1, the method includes:
when the container cloud platform deploys the distributed training tasks,
the task to be trained is decomposed into a number of subtasks (no fewer than two),
and environment configuration parameters are generated for each subtask to preserve the dependency relationships among the subtasks and the data consistency among the training subtasks;
deploying a container cluster for training tasks:
creating a container/container group corresponding to each subtask, providing connection access service and additionally providing RDMA network access;
the above method for providing connection access service and RDMA network access may specifically be:
at least two virtual network-card interfaces are provided for each container/container group through a multi-NIC CNI, wherein:
the first network-card interface is used to mount a virtual network card, through which each container/container group accesses the default network;
the second network-card interface is used to mount a virtual RDMA network card, through which each container/container group accesses the RDMA network;
thus the containers/container groups, together with the connection access service, the RDMA network access, and the default and RDMA networks between them, form a dual-network container cluster for training (a configuration sketch follows below);
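As a concrete illustration, the following sketch assumes Multus as the multi-NIC CNI; the attachment name "rdma-net" and all other names are hypothetical. The Pod keeps its default interface and requests one extra, RDMA-backed attachment via the Multus annotation:

    import json

    # Sketch of a training Pod that requests a second network attachment.
    # Multus reads the annotation below and invokes the CNI plugin behind
    # the "rdma-net" NetworkAttachmentDefinition to add the second interface
    # alongside the default network.
    training_pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "train-worker-0",
            "annotations": {"k8s.v1.cni.cncf.io/networks": "rdma-net"},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "tensorflow/tensorflow:latest",  # illustrative image
            }],
        },
    }
    print(json.dumps(training_pod, indent=2))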
after the container cluster is deployed and before the distributed training task starts, the default connection parameters and/or the default DNS used by each subtask's container/container group for network access are updated in the subtask's environment configuration parameters, and an additional RDMA-network DNS is provided, so that the subtask's network environment on the cluster is reconfigured and the subtask communicates over the RDMA network;
after configuration is complete, execution of the distributed training task is started.
Specifically, in the method for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, the default connection parameters in the environment configuration parameters are updated as follows:
the RDMA connection parameters of the containers/container groups involved in the subtask's communication are obtained (for example, the RDMA-network IP assigned to the container/container group executing the subtask, and the RDMA-network IPs assigned to the other containers/container groups it communicates with), and these are used to update and replace the default connection parameters in the subtask's environment configuration parameters (see the sketch below).
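A sketch of gathering those RDMA IPs, under the assumption that Multus records each attachment's IPs in the Pod annotation k8s.v1.cni.cncf.io/network-status; the attachment name "rdma-net" and the label selector are hypothetical:

    import json
    from kubernetes import client, config

    def rdma_ip_of(pod):
        # Multus publishes a JSON list of attachments with their IPs.
        raw = (pod.metadata.annotations or {}).get(
            "k8s.v1.cni.cncf.io/network-status", "[]")
        for net in json.loads(raw):
            if net.get("name", "").endswith("rdma-net"):
                return net["ips"][0]  # the Pod's RDMA-network IP
        return None

    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        "default", label_selector="job=train").items
    rdma_ips = {p.metadata.name: rdma_ip_of(p) for p in pods}
    print(rdma_ips)  # e.g. {"train-worker-0": "192.168.100.2", ...}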
In particular, in the method of improving the efficiency of distributed task training on a container cloud provided by some of these embodiments,
the default-network DNS of the subtask's container/container group is masked and an RDMA-network DNS is provided in its place. One implementation is to run a DNS service for the RDMA network and designate it as the primary DNS of the container/container group executing the subtask (i.e., the container/container group created for that subtask); a sketch follows.
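One way to express this on Kubernetes is the Pod-level DNS policy; a minimal sketch, assuming an RDMA-network DNS server at the illustrative address 192.168.100.53:

    import json

    # With dnsPolicy "None" the Pod ignores the default-network (cluster)
    # DNS entirely; the nameserver listed in dnsConfig becomes its primary.
    dns_spec = {
        "dnsPolicy": "None",
        "dnsConfig": {
            "nameservers": ["192.168.100.53"],  # RDMA-network DNS service
            "searches": ["rdma.svc.cluster.local"],  # hypothetical search domain
        },
    }
    print(json.dumps(dns_spec, indent=2))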
Specifically, in the method for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, the second network-card interface is provided for the container/container group through sriov-cni; correspondingly, the virtual RDMA network cards to be mounted are obtained by SR-IOV virtualization of a physical RDMA network card.
The following takes the deployment of a distributed TensorFlow task on a Kubernetes platform according to the above method as an example, to further aid understanding of the method for improving the training efficiency of distributed tasks on the container cloud in the preferred embodiments above. The process is as follows:
according to the type of distributed training, and in light of the available computing resources and the model, the overall task to be trained is decomposed into several subtasks (no fewer than two), and a TF_CONFIG is generated for each subtask (TF_CONFIG contains the connection parameters the subtask needs for communication during execution; by default, Service names are used as the connection parameters) to preserve the dependency relationships among the subtasks and the data consistency among the training subtasks; other related parameters are also generated to define the Pods to be created in subsequent steps (a Pod is the 'container group' of the Kubernetes platform, the minimum scheduling unit the platform manages); for example, the Pod corresponding to a subtask is defined as a training Pod (a sketch of such a TF_CONFIG follows);
container clusters for training were deployed on the Kubernetes platform:
a corresponding training Pod, a Service (supporting default-network access), and additional RDMA network access are created for each subtask according to TF_CONFIG and the other parameters:
by calling the corresponding CNI plugins through multus-cni, two virtual network-card interfaces are provided for each training Pod:
a default network interface, provided by calling the flannel-cni plugin, mounts a virtual network card and accesses the default network; the default network is typically used for the data communication of platform management tasks;
an RDMA network interface, provided by calling the sriov-cni plugin, mounts a virtual RDMA network card (these virtual RDMA network cards are obtained by SR-IOV virtualization of a physical RDMA network card) and accesses the RDMA network; the RDMA network is used for the data communication of the training task (i.e., the subtasks), such as exchanging gradient data during gradient aggregation (a sketch of the attachment definition follows);
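A hedged sketch of what the RDMA attachment's definition might look like: a NetworkAttachmentDefinition whose embedded config invokes sriov-cni. The name, subnet, and IPAM type are assumptions; a real setup also depends on how the SR-IOV virtual functions are exposed to Kubernetes:

    import json

    rdma_net = {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": "rdma-net"},
        "spec": {
            # The embedded CNI config: sriov-cni mounts an SR-IOV virtual
            # function of the physical RDMA NIC into the Pod.
            "config": json.dumps({
                "cniVersion": "0.3.1",
                "type": "sriov",
                "ipam": {"type": "host-local",           # illustrative IPAM
                         "subnet": "192.168.100.0/24"},  # RDMA subnet
            }),
        },
    }
    print(json.dumps(rdma_net, indent=2))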
after the container cluster is deployed and before the distributed TensorFlow task starts, the network environment of the subtasks is reconfigured so that they use RDMA network communication; here the approach of updating the default connection parameters is adopted: the RDMA-network IP of each training Pod involved in a subtask's communication is obtained, and these IPs update and replace the default connection parameters, i.e., the Service names, in the subtask's TF_CONFIG (a sketch follows);
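A sketch of that update step; the IPs and names are illustrative and would in practice come from the Pods' network status after deployment:

    import json
    import os

    # RDMA-network IPs collected from the training Pods after deployment.
    rdma_ips = {
        "train-worker-0": "192.168.100.2",
        "train-worker-1": "192.168.100.3",
        "train-ps-0": "192.168.100.4",
    }

    def to_rdma(tf_config_json):
        """Replace Service-name endpoints in TF_CONFIG with RDMA IPs."""
        cfg = json.loads(tf_config_json)
        for role, addrs in cfg["cluster"].items():
            cfg["cluster"][role] = [
                rdma_ips[host] + ":" + port
                for host, port in (a.split(":") for a in addrs)
            ]
        return json.dumps(cfg)

    tf_config = json.dumps({
        "cluster": {"worker": ["train-worker-0:2222", "train-worker-1:2222"],
                    "ps": ["train-ps-0:2222"]},
        "task": {"type": "worker", "index": 0},
    })
    os.environ["TF_CONFIG"] = to_rdma(tf_config)
    print(os.environ["TF_CONFIG"])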
after the update is finished, execution of the distributed TensorFlow task begins, i.e., each training Pod executes its corresponding subtask.
Further preferred embodiments of the present invention provide a system for improving the efficiency of distributed task training on a container cloud. The system comprises:
a distributed training task management unit, a task scheduling unit, and a container cloud platform, wherein:
the distributed training task management unit is used for decomposing the overall task to be trained into a number of subtasks (no fewer than two);
the task scheduling unit is used for scheduling the various tasks, including the subtasks, on the container cloud platform; specifically:
generating environment configuration parameters for the various tasks (in particular, generating environment configuration parameters for each subtask to preserve the dependency relationships among the subtasks and the data consistency among them),
and defining the containers/container groups and the like to be created for performing the various tasks; for example, defining the container/container group corresponding to a subtask as a training container/container group, so that when creation is requested, the container cloud platform can apply custom settings suited to training, such as multiple networks;
the container cloud platform is used for deploying the container cluster for training, managing the cluster, and the like; specifically:
creating the container/container group corresponding to each subtask, providing connection access service, and additionally providing RDMA network access according to the definition of the task scheduling unit; providing connection access service and RDMA network access may specifically proceed as follows: providing at least two virtual network-card interfaces for each container/container group through a multi-NIC CNI, wherein:
the first network card interface is used for mounting a virtual network card; the container/container group accesses a default network through respective virtual network cards;
the second network card interface is used for mounting a virtual RDMA network card; the container/container group accesses the RDMA network through the respective RDMA network card;
the containers/container groups, together with the connection access service, the RDMA network access, and the default and RDMA networks between them, form a dual-network container cluster for training;
it further comprises: after the container cluster is deployed and before the distributed training task starts, updating the default connection parameters and/or the default DNS used by each subtask's container/container group for network access in the subtask's environment configuration parameters, and providing an additional RDMA-network DNS, so that the subtask's network environment on the cluster is reconfigured and the subtask communicates over the RDMA network;
and after configuration is complete, starting execution of the distributed training task.
Specifically, in the system for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, updating the default connection parameters in the environment configuration parameters may include:
obtaining the RDMA connection parameters of the containers/container groups involved in the subtask's communication (for example, the RDMA-network IP assigned to the container/container group executing the subtask, and the RDMA-network IPs assigned to the other containers/container groups it communicates with), and using these to update and replace the default connection parameters in the subtask's environment configuration parameters.
In particular, in the system that improves the efficiency of distributed task training on a container cloud provided by some of these embodiments,
the default-network DNS of the subtask's container/container group is masked and an RDMA-network DNS is provided for it. One implementation is that the container cloud platform runs a DNS service for the RDMA network and designates it as the primary DNS of the container/container group executing the subtask (i.e., the container/container group created for that subtask).
Specifically, in the system for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, the container cloud platform provides the second network-card interface for the container/container group through sriov-cni; correspondingly, the virtual RDMA network cards to be mounted are obtained by SR-IOV virtualization of a physical RDMA network card.
The above description covers only embodiments of the present invention; the scope of the invention is not limited thereto.

Claims (6)

1. A method for improving the training efficiency of distributed tasks on a container cloud, comprising:
when the container cloud platform deploys the distributed training tasks,
decomposing the training task into a plurality of subtasks;
generating environment configuration parameters for the subtasks;
deploying a container cluster for training tasks:
creating a corresponding container/container group for the subtasks,
and providing connection access services plus additional RDMA network access;
after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they communicate using the RDMA network;
after configuration is completed, starting execution of the distributed training task;
wherein providing the connection access services and RDMA network access comprises:
providing at least two virtual network-card interfaces for the container/container group through a multi-NIC CNI, wherein:
the first network-card interface is used for mounting a virtual network card, through which the container/container group accesses the default network;
the second network-card interface is used for mounting a virtual RDMA network card, through which the container/container group accesses the RDMA network;
wherein the containers/container groups, by means of the connection access services, the RDMA network access, and the default and RDMA networks between them, constitute a dual-network container cluster for training;
the reconfiguration of the network environment is accomplished by updating default connection parameters in the environment configuration parameters:
obtaining the RDMA connection parameters of the containers/container groups related to the subtask, and using them to update and replace the default connection parameters in the environment configuration parameters of the subtask.
2. The method for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 1,
the reconfiguration network environment is implemented by masking the default network DNS for the subtask corresponding container/group of containers and providing RDMA network DNS for it:
RDMA network DNS services are provided for the RDMA network, while primary DNS is designated for the corresponding container/group of containers for the subtask.
3. The method for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 1,
providing the second network card interface for the container/container group through sriov-cni;
correspondingly, the virtual RDMA network card is obtained by SR-IOV virtualization of a physical RDMA network card.
4. A system for improving the training efficiency of distributed tasks on a container cloud, comprising:
the system comprises a distributed training task management unit, a task scheduling unit and a container cloud platform, wherein:
the distributed training task management unit is used for decomposing a task to be trained into a plurality of subtasks;
the task scheduling unit is used for scheduling various tasks, including the subtasks, on the container cloud platform; specifically:
generating environment configuration parameters and defining containers/container groups for the subtasks; and the container cloud platform is used for deploying container clusters and managing the container clusters; specifically:
creating a container/container group corresponding to the subtask, providing connection access service and additionally providing RDMA network access according to the definition of the task scheduling unit;
and after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they communicate using the RDMA network;
after the configuration is completed, starting execution of the distributed training task;
wherein providing the connection access service and RDMA network access comprises:
providing at least two virtual network-card interfaces for the container/container group through a multi-NIC CNI, wherein:
the first network-card interface is used for mounting a virtual network card, through which the container/container group accesses the default network;
the second network-card interface is used for mounting a virtual RDMA network card, through which the container/container group accesses the RDMA network;
wherein the containers/container groups, by means of the connection access services, the RDMA network access, and the default and RDMA networks between them, constitute a dual-network container cluster for training;
the reconfiguration of the network environment is accomplished by updating default connection parameters in the environment configuration parameters:
obtaining the RDMA connection parameters of the containers/container groups related to the subtask, and using them to update and replace the default connection parameters in the environment configuration parameters of the subtask.
5. The system for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 4,
the reconfiguration network environment is implemented by masking the default network DNS for the subtask corresponding container/group of containers and providing RDMA network DNS for it:
the container cloud platform provides RDMA network DNS services for the RDMA network, and simultaneously designates the RDMA network DNS services as a main DNS of the container/container group corresponding to the subtask.
6. The system for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 4,
providing the second network card interface for the container/container group through sriov-cni;
correspondingly, the virtual RDMA network card is obtained by SR-IOV virtualization of a physical RDMA network card.
CN201910413700.6A 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud Active CN110297670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910413700.6A CN110297670B (en) 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910413700.6A CN110297670B (en) 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud

Publications (2)

Publication Number Publication Date
CN110297670A 2019-10-01
CN110297670B 2023-06-27

Family

ID=68026929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910413700.6A Active CN110297670B (en) 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud

Country Status (1)

Country Link
CN (1) CN110297670B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111147297B * 2019-12-23 2022-07-15 广东省新一代通信与网络创新研究院 Multi-layer network plane construction method of Kubernetes
CN113138832B (en) * 2020-01-17 2024-03-01 深圳致星科技有限公司 Distributed training method and system based on reset training data transmission network
CN113138831B (en) * 2020-01-17 2024-03-08 深圳致星科技有限公司 Network resetting method and acceleration distributed training method and system based on same
CN113517991A (en) * 2020-04-09 2021-10-19 深圳致星科技有限公司 Deployment method for accelerating distributed AI training cloud platform and related platform
CN113515341A (en) * 2020-04-09 2021-10-19 深圳致星科技有限公司 Flexible distributed AI training cloud platform deployment method and related platform
CN112671896A (en) * 2020-12-22 2021-04-16 上海上实龙创智能科技股份有限公司 Agricultural management method, equipment and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN109117265A (en) * 2018-07-12 2019-01-01 北京百度网讯科技有限公司 The method, apparatus, equipment and storage medium of schedule job in the cluster
CN109284184A (en) * 2018-03-07 2019-01-29 中山大学 A kind of building method of the distributed machines learning platform based on containerization technique
CN109639455A (en) * 2018-11-09 2019-04-16 武汉烽火信息集成技术有限公司 A kind of network management and system of container cloud platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102197247B1 (en) * 2017-06-01 2020-12-31 한국전자통신연구원 Parameter server and method for sharing distributed deep learning parameter using the same

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN109284184A (en) * 2018-03-07 2019-01-29 中山大学 A kind of building method of the distributed machines learning platform based on containerization technique
CN109117265A (en) * 2018-07-12 2019-01-01 北京百度网讯科技有限公司 The method, apparatus, equipment and storage medium of schedule job in the cluster
CN109639455A (en) * 2018-11-09 2019-04-16 武汉烽火信息集成技术有限公司 A kind of network management and system of container cloud platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Daehyeok Kim, Tianlong Yu, et al. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation. 2019, chapters 1-10. *
FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds; Daehyeok Kim, Tianlong Yu, et al.; Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation; 2019-02-28; chapters 1-10 *
Kubernetes and HPC: (1) RDMA networks; yiduyangyi; CSDN, https://blog.csdn.net/yiduyangyi/article/details/90183733?; 2019-05-13; pages 1-3 *

Also Published As

Publication number Publication date
CN110297670A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
CN110297670B (en) Method and system for improving training efficiency of distributed tasks on container cloud
CN110308986B (en) Method for distributed training data communication on container cloud based on optimal scheduling
CN110308987B (en) Method for updating connection parameters of distributed training tasks on container cloud
CN112187545B (en) Network slice deployment method and device
CN105979009B (en) A kind of increase load automatic balancing method for cloud application container
CN110311948B (en) Communication method between container groups and container cloud network system based on same
CN108337109B (en) Resource allocation method and device and resource allocation system
CN109194502B (en) Management method of multi-tenant container cloud computing system
CN105103506A (en) Network function virtualization method and device
CN110198364B (en) Container cloud distributed training data communication method based on designated DNS analysis
CN110838939B (en) Scheduling method based on lightweight container and edge Internet of things management platform
CN111371616B (en) Virtual network function chain deployment method and system for NUMA (non Uniform memory Access) architecture server
US9774542B2 (en) Computer-implemented method and a system for providing a networking service, and a computer program product adapted to perform the method
CN103747107A (en) Compatible cloud operating platform and realizing method thereof
Song et al. Gaia scheduler: A kubernetes-based scheduler framework
CN114374609B (en) Deep learning job operation method and system based on RDMA equipment
CN109525413B (en) CDN network function virtualization management method, device and system
WO2020108337A1 (en) Cpu resource scheduling method and electronic equipment
CN110011984B (en) REST and RPC-based distributed cluster system and method
Tseng et al. An mec-based vnf placement and scheduling scheme for ar application topology
CN114124714A (en) Multi-level network deployment method, device, equipment and storage medium
JP2024501005A (en) Management method and device for container clusters
CN110300192B (en) Method for updating distributed training task connection parameters according to IP distribution table
CN114924835A (en) Method and system for improving virtual machine access performance under super-fusion environment
CN112087311B (en) Virtual network function VNF deployment method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 518000 a2405, building 9, zone 2, Shenzhen Bay science and technology ecological park, 3609 Baishi Road, community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Zhixing Technology Co.,Ltd.

Address before: Room 408, Building 3, 4 Chegongzhuang Street, Xicheng District, Beijing 100044

Applicant before: BEIJING HANHAI CLUSTAR TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant