CN110297670B - Method and system for improving training efficiency of distributed tasks on container cloud - Google Patents

Method and system for improving training efficiency of distributed tasks on container cloud

Info

Publication number
CN110297670B
CN110297670B (application CN201910413700.6A)
Authority
CN
China
Prior art keywords: container, network, RDMA, training, distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910413700.6A
Other languages
Chinese (zh)
Other versions
CN110297670A (en)
Inventor
张春海
孙夏
冉玫美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd
Priority to CN201910413700.6A
Publication of CN110297670A
Application granted
Publication of CN110297670B
Legal status: Active


Classifications

    • G06F9/44505: Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4484: Executing subprograms (procedural execution paradigms)
    • G06F9/465: Distributed object oriented systems
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/547: Remote procedure calls [RPC]; Web services
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method and a system for improving the training efficiency of distributed tasks on a container cloud. An additional RDMA network is provided for the training container cluster, and after the container cluster is deployed but before the distributed training task starts executing, the network environment of each subtask is reconfigured in time so that RDMA network communication is used when the subtasks communicate. This improves the communication efficiency of training-task data, relieves the training-data communication bottleneck of distributed training on the container cloud under large-model and large-data scenarios, and thereby improves the efficiency of executing distributed training on a container cloud platform.

Description

Method and system for improving training efficiency of distributed tasks on container cloud
Technical Field
The invention relates to the field of container clouds and distributed machine learning; in particular, to a method and system for improving the training efficiency of distributed tasks on a container cloud.
Background
As research into big data and machine learning deepens, machine learning in the big-data era increasingly exhibits the characteristics of large models and large data. "Large model" means that, as machine learning (especially deep learning) advances, many problems require ever larger models to approximate the objective function of the problem as closely as possible. "Large data" means that when the training set is small, the results of machine learning (especially deep learning) are unsatisfactory, so a training set as large as possible is generally needed to improve those results. Consequently, in large-scale machine learning training scenarios, the training data and model parameters are too large for a single machine to handle, and distributed machine learning has emerged to meet this need.
Distributed machine learning decomposes a training task into many small tasks and distributes them to multiple devices for training. It involves not only spreading the training tasks across multiple processors but also distributing the data (both training data and intermediate results) across the storage of different devices. To obtain greater computing power, storage, throughput, and fault tolerance, training is increasingly performed in this distributed fashion.
However, building, deploying, operating, and maintaining a bare-metal cluster (i.e., a physical host cluster) capable of practically meaningful distributed machine learning training is a highly specialized, complex, even cumbersome undertaking. Container cloud technology has therefore been applied to distributed machine learning to reduce the difficulty of construction, deployment, operation, and maintenance.
Container cloud technology not only enables rapid deployment of container clusters but is also a lightweight solution that can effectively integrate and manage bare-metal resources. Taking a Kubernetes platform running distributed machine learning training tasks as an example: Kubernetes provides a consistent way to package applications and ensures they behave consistently on different devices; it also isolates the resources of each application's runtime environment, abstracts away the complexity of the hardware layer and node management, supports GPU scheduling, and scales elastically with the needs of applications and clusters.
Containers and container orchestration tools on a container cloud platform all run on top of the operating system, so default communication is typically provided through the connection access services offered by the platform. Although this is a highly available container-cloud networking solution, it cannot bypass the operating system. Because communication under this scheme requires the intervention of the operating system and its protocol stack, transmitting training gradients over the network in big-data training scenarios inevitably consumes a large amount of CPU and incurs significant network latency, which severely constrains training efficiency.
RDMA (Remote Direct Memory Access) is a technique that transfers buffers directly between applications on two nodes across the network. Compared with traditional network transmission, RDMA requires no intervention from the operating system or protocol stack, avoiding heavy CPU consumption during transmission and reducing network latency. When physical hosts act as the nodes of a distributed computing cluster, RDMA communication has long been achieved by fitting each physical node with an RDMA network card (i.e., a physical NIC supporting the RDMA protocol).
To use computing resources more efficiently, when distributed machine training is deployed on a container cloud platform, the task to be trained is usually decomposed into several subtasks, and environment configuration parameters are generated for each subtask (to preserve the dependency relationships among subtasks and control data consistency among them). A corresponding container/container group is then created for each subtask (a container/container group is the minimum unit managed by container orchestration; a container runs an independent application in the container environment, while a container group is a 'logical host' running one or more tightly coupled application containers, such as a Pod on the Kubernetes platform), and connection access services are run for the distributed training. During training, each subtask uses the connection parameter in its environment configuration, namely the connection access service name, to reach the corresponding service, so training-data communication takes place over the default network. However, such connection access services, which realize default inter-container communication through mechanisms such as kernel iptables, are only suited to the default network and clearly cannot support an RDMA network. (A minimal sketch of such a service follows.)
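To make the default communication path concrete, here is a minimal sketch, assuming a Kubernetes platform, of such a connection access service; all names and the port are illustrative, not taken from the patent:

    import json

    # A headless Service (illustrative names) that exposes one training
    # container group under a stable DNS name on the default network.
    # kube-proxy/iptables resolve and route this name, which is exactly why
    # this path cannot carry RDMA traffic.
    worker_service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "train-worker-0"},  # the default connection parameter
        "spec": {
            "clusterIP": "None",  # headless: the name resolves to the Pod IP
            "selector": {"job": "train", "task": "worker-0"},
            "ports": [{"port": 2222}],  # a common TensorFlow gRPC port
        },
    }
    print(json.dumps(worker_service, indent=2))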
Disclosure of Invention
In view of the foregoing, the present invention provides a method and system for improving the training efficiency of distributed tasks on a container cloud.
In one aspect, the embodiment of the invention provides a method for improving the training efficiency of distributed tasks on a container cloud.
The method comprises the following steps:
when the container cloud platform deploys the distributed training tasks,
decomposing the training task into a plurality of subtasks;
generating environment configuration parameters for the subtasks respectively;
deploying a container cluster for training tasks:
creating a corresponding container/container group for each subtask,
and providing connection access services plus additional RDMA network access;
after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they use RDMA network communication;
after configuration is complete, starting execution of the distributed training task.
On the other hand, the embodiment of the invention provides a system for improving the training efficiency of distributed tasks on a container cloud.
Corresponding to the first aspect, the system includes:
a distributed training task management unit, a task scheduling unit, and a container cloud platform, wherein:
the distributed training task management unit is used for decomposing a task to be trained into a plurality of subtasks;
the task scheduling unit is used for scheduling the various tasks, including the subtasks, on the container cloud platform; specifically:
generating environment configuration parameters and defining the containers/container groups and the like to be created for performing the various tasks;
the container cloud platform is used for deploying the container cluster for training and managing that cluster; specifically:
creating a container/container group corresponding to the subtask, providing connection access service and additionally providing RDMA network access according to the definition of the task scheduling unit;
and after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they use RDMA network communication;
and after configuration is complete, starting execution of the distributed training task.
With the method and system for improving the training efficiency of distributed tasks on the container cloud described here, an additional RDMA network is provided for the training container cluster, and the subtasks' network environment is reconfigured in the window after the cluster is deployed and before the distributed training task starts, so that communication uses the RDMA network. This improves the communication efficiency of training-task data, relieves the training-data communication bottleneck of distributed training on the container cloud under large-model and large-data scenarios, and thereby improves the efficiency of executing distributed training on a container cloud platform.
Drawings
To illustrate the embodiments of the present invention or prior-art technical solutions more clearly, the drawings required for describing some embodiments or the prior art are briefly introduced below.
Fig. 1 is a flow chart of a method for improving training efficiency of distributed tasks on a container cloud according to some preferred embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, embodiments of the invention. All other embodiments obtained from them by a person of ordinary skill in the art without creative effort fall within the scope of the present invention.
The following are some preferred embodiments of the present invention.
These preferred embodiments provide a method of improving the efficiency of distributed task training on a container cloud. As shown in fig. 1, the method includes:
when the container cloud platform deploys the distributed training tasks,
the task to be trained is decomposed into a number of subtasks (no fewer than two),
and environment configuration parameters are generated for each subtask to preserve the dependency relationships among the subtasks and the data consistency among the training subtasks;
deploying a container cluster for training tasks:
creating a container/container group corresponding to each subtask, providing connection access service and additionally providing RDMA network access;
the above method for providing connection access service and RDMA network access may specifically be:
at least two virtual network-card interfaces are provided for each container/container group through a multi-NIC CNI, wherein:
the first network-card interface is used to mount a virtual network card, through which each container/container group accesses the default network;
the second network-card interface is used to mount a virtual RDMA network card, through which each container/container group accesses the RDMA network;
thus the containers/container groups, together with the connection access service, the RDMA network access, and the default and RDMA networks between them, form a dual-network container cluster for training (a configuration sketch follows below);
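As a concrete illustration, the following sketch assumes Multus as the multi-NIC CNI; the attachment name "rdma-net" and all other names are hypothetical. The Pod keeps its default interface and requests one extra, RDMA-backed attachment via the Multus annotation:

    import json

    # Sketch of a training Pod that requests a second network attachment.
    # Multus reads the annotation below and invokes the CNI plugin behind
    # the "rdma-net" NetworkAttachmentDefinition to add the second interface
    # alongside the default network.
    training_pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": "train-worker-0",
            "annotations": {"k8s.v1.cni.cncf.io/networks": "rdma-net"},
        },
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "tensorflow/tensorflow:latest",  # illustrative image
            }],
        },
    }
    print(json.dumps(training_pod, indent=2))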
after the container cluster is deployed and before the distributed training task starts, the default connection parameters and/or the default DNS used by each subtask's container/container group for network access are updated in the subtask's environment configuration parameters, and an additional RDMA-network DNS is provided, so that the subtask's network environment on the cluster is reconfigured and the subtask communicates over the RDMA network;
after configuration is complete, execution of the distributed training task is started.
Specifically, in the method for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, the default connection parameters in the environment configuration parameters are updated as follows:
the RDMA connection parameters of the containers/container groups involved in the subtask's communication are obtained (for example, the RDMA-network IP assigned to the container/container group executing the subtask, and the RDMA-network IPs assigned to the other containers/container groups it communicates with), and these are used to update and replace the default connection parameters in the subtask's environment configuration parameters (see the sketch below).
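A sketch of gathering those RDMA IPs, under the assumption that Multus records each attachment's IPs in the Pod annotation k8s.v1.cni.cncf.io/network-status; the attachment name "rdma-net" and the label selector are hypothetical:

    import json
    from kubernetes import client, config

    def rdma_ip_of(pod):
        # Multus publishes a JSON list of attachments with their IPs.
        raw = (pod.metadata.annotations or {}).get(
            "k8s.v1.cni.cncf.io/network-status", "[]")
        for net in json.loads(raw):
            if net.get("name", "").endswith("rdma-net"):
                return net["ips"][0]  # the Pod's RDMA-network IP
        return None

    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        "default", label_selector="job=train").items
    rdma_ips = {p.metadata.name: rdma_ip_of(p) for p in pods}
    print(rdma_ips)  # e.g. {"train-worker-0": "192.168.100.2", ...}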
In particular, in the method of improving the efficiency of distributed task training on a container cloud provided by some of these embodiments,
the default-network DNS of the subtask's container/container group is masked and an RDMA-network DNS is provided in its place. One implementation is to run a DNS service for the RDMA network and designate it as the primary DNS of the container/container group executing the subtask (i.e., the container/container group created for that subtask); a sketch follows.
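One way to express this on Kubernetes is the Pod-level DNS policy; a minimal sketch, assuming an RDMA-network DNS server at the illustrative address 192.168.100.53:

    import json

    # With dnsPolicy "None" the Pod ignores the default-network (cluster)
    # DNS entirely; the nameserver listed in dnsConfig becomes its primary.
    dns_spec = {
        "dnsPolicy": "None",
        "dnsConfig": {
            "nameservers": ["192.168.100.53"],  # RDMA-network DNS service
            "searches": ["rdma.svc.cluster.local"],  # hypothetical search domain
        },
    }
    print(json.dumps(dns_spec, indent=2))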
Specifically, in the method for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, the second network-card interface is provided for the container/container group through sriov-cni; correspondingly, the virtual RDMA network cards to be mounted are obtained by SR-IOV virtualization of a physical RDMA network card.
The following takes the deployment of a distributed TensorFlow task on a Kubernetes platform according to the above method as an example, to further aid understanding of the method for improving the training efficiency of distributed tasks on the container cloud in the preferred embodiments above. The process is as follows:
according to the type of distributed training, and in light of the available computing resources and the model, the overall task to be trained is decomposed into several subtasks (no fewer than two), and a TF_CONFIG is generated for each subtask (TF_CONFIG contains the connection parameters the subtask needs for communication during execution; by default, Service names are used as the connection parameters) to preserve the dependency relationships among the subtasks and the data consistency among the training subtasks; other related parameters are also generated to define the Pods to be created in subsequent steps (a Pod is the 'container group' of the Kubernetes platform, the minimum scheduling unit the platform manages); for example, the Pod corresponding to a subtask is defined as a training Pod (a sketch of such a TF_CONFIG follows);
container clusters for training were deployed on the Kubernetes platform:
a corresponding training Pod, a Service (supporting default-network access), and additional RDMA network access are created for each subtask according to TF_CONFIG and the other parameters:
by calling the corresponding CNI plugins through multus-cni, two virtual network-card interfaces are provided for each training Pod:
a default network interface, provided by calling the flannel-cni plugin, mounts a virtual network card and accesses the default network; the default network is typically used for the data communication of platform management tasks;
an RDMA network interface, provided by calling the sriov-cni plugin, mounts a virtual RDMA network card (these virtual RDMA network cards are obtained by SR-IOV virtualization of a physical RDMA network card) and accesses the RDMA network; the RDMA network is used for the data communication of the training task (i.e., the subtasks), such as exchanging gradient data during gradient aggregation (a sketch of the attachment definition follows);
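A hedged sketch of what the RDMA attachment's definition might look like: a NetworkAttachmentDefinition whose embedded config invokes sriov-cni. The name, subnet, and IPAM type are assumptions; a real setup also depends on how the SR-IOV virtual functions are exposed to Kubernetes:

    import json

    rdma_net = {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": "rdma-net"},
        "spec": {
            # The embedded CNI config: sriov-cni mounts an SR-IOV virtual
            # function of the physical RDMA NIC into the Pod.
            "config": json.dumps({
                "cniVersion": "0.3.1",
                "type": "sriov",
                "ipam": {"type": "host-local",           # illustrative IPAM
                         "subnet": "192.168.100.0/24"},  # RDMA subnet
            }),
        },
    }
    print(json.dumps(rdma_net, indent=2))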
after the container cluster is deployed and before the distributed TensorFlow task starts, the network environment of the subtasks is reconfigured so that they use RDMA network communication; here the approach of updating the default connection parameters is adopted: the RDMA-network IP of each training Pod involved in a subtask's communication is obtained, and these IPs update and replace the default connection parameters, i.e., the Service names, in the subtask's TF_CONFIG (a sketch follows);
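A sketch of that update step; the IPs and names are illustrative and would in practice come from the Pods' network status after deployment:

    import json
    import os

    # RDMA-network IPs collected from the training Pods after deployment.
    rdma_ips = {
        "train-worker-0": "192.168.100.2",
        "train-worker-1": "192.168.100.3",
        "train-ps-0": "192.168.100.4",
    }

    def to_rdma(tf_config_json):
        """Replace Service-name endpoints in TF_CONFIG with RDMA IPs."""
        cfg = json.loads(tf_config_json)
        for role, addrs in cfg["cluster"].items():
            cfg["cluster"][role] = [
                rdma_ips[host] + ":" + port
                for host, port in (a.split(":") for a in addrs)
            ]
        return json.dumps(cfg)

    tf_config = json.dumps({
        "cluster": {"worker": ["train-worker-0:2222", "train-worker-1:2222"],
                    "ps": ["train-ps-0:2222"]},
        "task": {"type": "worker", "index": 0},
    })
    os.environ["TF_CONFIG"] = to_rdma(tf_config)
    print(os.environ["TF_CONFIG"])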
after the update is finished, execution of the distributed TensorFlow task begins, i.e., each training Pod executes its corresponding subtask.
Further preferred embodiments of the present invention provide a system for improving the efficiency of distributed task training on a container cloud. The system comprises:
a distributed training task management unit, a task scheduling unit, and a container cloud platform, wherein:
the distributed training task management unit is used for decomposing the overall task to be trained into a number of subtasks (no fewer than two);
the task scheduling unit is used for scheduling the various tasks, including the subtasks, on the container cloud platform; specifically:
generating environment configuration parameters for the various tasks (in particular, generating environment configuration parameters for each subtask to preserve the dependency relationships among the subtasks and the data consistency among them),
and defining the containers/container groups and the like to be created for performing the various tasks; for example, defining the container/container group corresponding to a subtask as a training container/container group, so that when creation is requested, the container cloud platform can apply custom settings suited to training, such as multiple networks;
the container cloud platform is used for deploying the container cluster for training, managing the cluster, and the like; specifically:
creating the container/container group corresponding to each subtask, providing connection access service, and additionally providing RDMA network access according to the definition of the task scheduling unit; providing connection access service and RDMA network access may specifically proceed as follows: providing at least two virtual network-card interfaces for each container/container group through a multi-NIC CNI, wherein:
the first network card interface is used for mounting a virtual network card; the container/container group accesses a default network through respective virtual network cards;
the second network card interface is used for mounting a virtual RDMA network card; the container/container group accesses the RDMA network through the respective RDMA network card;
the containers/container groups, together with the connection access service, the RDMA network access, and the default and RDMA networks between them, form a dual-network container cluster for training;
it further comprises: after the container cluster is deployed and before the distributed training task starts, updating the default connection parameters and/or the default DNS used by each subtask's container/container group for network access in the subtask's environment configuration parameters, and providing an additional RDMA-network DNS, so that the subtask's network environment on the cluster is reconfigured and the subtask communicates over the RDMA network;
and after configuration is complete, starting execution of the distributed training task.
Specifically, in the system for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, updating the default connection parameters in the environment configuration parameters may include:
obtaining the RDMA connection parameters of the containers/container groups involved in the subtask's communication (for example, the RDMA-network IP assigned to the container/container group executing the subtask, and the RDMA-network IPs assigned to the other containers/container groups it communicates with), and using these to update and replace the default connection parameters in the subtask's environment configuration parameters.
In particular, in the system that improves the efficiency of distributed task training on a container cloud provided by some of these embodiments,
the default-network DNS of the subtask's container/container group is masked and an RDMA-network DNS is provided for it. One implementation is that the container cloud platform runs a DNS service for the RDMA network and designates it as the primary DNS of the container/container group executing the subtask (i.e., the container/container group created for that subtask).
Specifically, in the system for improving the training efficiency of distributed tasks on the container cloud provided by some of these embodiments, the container cloud platform provides the second network-card interface for the container/container group through sriov-cni; correspondingly, the virtual RDMA network cards to be mounted are obtained by SR-IOV virtualization of a physical RDMA network card.
The above description covers only embodiments of the present invention; the scope of the invention is not limited thereto.

Claims (6)

1. A method for improving the training efficiency of distributed tasks on a container cloud, comprising:
when the container cloud platform deploys the distributed training tasks,
decomposing the training task into a plurality of subtasks;
generating environment configuration parameters for the subtasks;
deploying a container cluster for training tasks:
creating a corresponding container/container group for the subtasks,
and providing connection access services plus additional RDMA network access;
after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they communicate using the RDMA network;
after configuration is completed, starting execution of the distributed training task;
wherein providing the connection access services and RDMA network access comprises:
providing at least two virtual network-card interfaces for the container/container group through a multi-NIC CNI, wherein:
the first network-card interface is used for mounting a virtual network card, through which the container/container group accesses the default network;
the second network-card interface is used for mounting a virtual RDMA network card, through which the container/container group accesses the RDMA network;
wherein the containers/container groups, by means of the connection access services, the RDMA network access, and the default and RDMA networks between them, constitute a dual-network container cluster for training;
the reconfiguration of the network environment is accomplished by updating default connection parameters in the environment configuration parameters:
obtaining the RDMA connection parameters of the containers/container groups related to the subtask, and using them to update and replace the default connection parameters in the environment configuration parameters of the subtask.
2. The method for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 1,
the reconfiguration network environment is implemented by masking the default network DNS for the subtask corresponding container/group of containers and providing RDMA network DNS for it:
RDMA network DNS services are provided for the RDMA network, while primary DNS is designated for the corresponding container/group of containers for the subtask.
3. The method for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 1,
providing the second network card interface for the container/container group through sriov-cni;
correspondingly, the virtual RDMA network card is obtained by SR-IOV virtualization of a physical RDMA network card.
4. A system for improving the training efficiency of distributed tasks on a container cloud, comprising:
the system comprises a distributed training task management unit, a task scheduling unit and a container cloud platform, wherein:
the distributed training task management unit is used for decomposing a task to be trained into a plurality of subtasks;
the task scheduling unit is used for scheduling various tasks, including the subtasks, on the container cloud platform; specifically:
generating environment configuration parameters and defining containers/container groups for the subtasks; and the container cloud platform is used for deploying container clusters and managing the container clusters; specifically:
creating a container/container group corresponding to the subtask, providing connection access service and additionally providing RDMA network access according to the definition of the task scheduling unit;
and after the container cluster deployment is completed and before the distributed training task is initiated,
reconfiguring the network environment of the subtasks so that they communicate using the RDMA network;
after the configuration is completed, starting execution of the distributed training task;
wherein providing the connection access service and RDMA network access comprises:
providing at least two virtual network-card interfaces for the container/container group through a multi-NIC CNI, wherein:
the first network-card interface is used for mounting a virtual network card, through which the container/container group accesses the default network;
the second network-card interface is used for mounting a virtual RDMA network card, through which the container/container group accesses the RDMA network;
wherein the containers/container groups, by means of the connection access services, the RDMA network access, and the default and RDMA networks between them, constitute a dual-network container cluster for training;
the reconfiguration of the network environment is accomplished by updating default connection parameters in the environment configuration parameters:
obtaining the RDMA connection parameters of the containers/container groups related to the subtask, and using them to update and replace the default connection parameters in the environment configuration parameters of the subtask.
5. The system for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 4,
the reconfiguration network environment is implemented by masking the default network DNS for the subtask corresponding container/group of containers and providing RDMA network DNS for it:
the container cloud platform provides RDMA network DNS services for the RDMA network, and simultaneously designates the RDMA network DNS services as a main DNS of the container/container group corresponding to the subtask.
6. The system for improving the training efficiency of distributed tasks on a container cloud as claimed in claim 4,
providing the second network card interface for the container/container group through sriov-cni;
correspondingly, the virtual RDMA network card is obtained by SR-IOV virtualization of a physical RDMA network card.
CN201910413700.6A 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud Active CN110297670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910413700.6A CN110297670B (en) 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910413700.6A CN110297670B (en) 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud

Publications (2)

Publication Number Publication Date
CN110297670A 2019-10-01
CN110297670B 2023-06-27

Family

ID=68026929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910413700.6A Active CN110297670B (en) 2019-05-17 2019-05-17 Method and system for improving training efficiency of distributed tasks on container cloud

Country Status (1)

Country Link
CN (1) CN110297670B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111147297B * 2019-12-23 2022-07-15 广东省新一代通信与网络创新研究院 Multi-layer network plane construction method of Kubernetes
CN113138832B (en) * 2020-01-17 2024-03-01 深圳致星科技有限公司 Distributed training method and system based on reset training data transmission network
CN113138831B (en) * 2020-01-17 2024-03-08 深圳致星科技有限公司 Network resetting method and acceleration distributed training method and system based on same
CN113517991A (en) * 2020-04-09 2021-10-19 深圳致星科技有限公司 Deployment method for accelerating distributed AI training cloud platform and related platform
CN113515341A (en) * 2020-04-09 2021-10-19 深圳致星科技有限公司 Flexible distributed AI training cloud platform deployment method and related platform
CN112671896A (en) * 2020-12-22 2021-04-16 上海上实龙创智能科技股份有限公司 Agricultural management method, equipment and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN109117265A (en) * 2018-07-12 2019-01-01 北京百度网讯科技有限公司 The method, apparatus, equipment and storage medium of schedule job in the cluster
CN109284184A (en) * 2018-03-07 2019-01-29 中山大学 A kind of building method of the distributed machines learning platform based on containerization technique
CN109639455A (en) * 2018-11-09 2019-04-16 武汉烽火信息集成技术有限公司 A kind of network management and system of container cloud platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102197247B1 (en) * 2017-06-01 2020-12-31 한국전자통신연구원 Parameter server and method for sharing distributed deep learning parameter using the same

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN109284184A (en) * 2018-03-07 2019-01-29 中山大学 A kind of building method of the distributed machines learning platform based on containerization technique
CN109117265A (en) * 2018-07-12 2019-01-01 北京百度网讯科技有限公司 The method, apparatus, equipment and storage medium of schedule job in the cluster
CN109639455A (en) * 2018-11-09 2019-04-16 武汉烽火信息集成技术有限公司 A kind of network management and system of container cloud platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Daehyeok Kim, Tianlong Yu, et al. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation. 2019, chapters 1-10. *
FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds; Daehyeok Kim, Tianlong Yu, et al.; Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation; 2019-02-28; chapters 1-10 *
Kubernetes and HPC: (1) RDMA networks; yiduyangyi; CSDN, https://blog.csdn.net/yiduyangyi/article/details/90183733?; 2019-05-13; pages 1-3 *

Also Published As

Publication number Publication date
CN110297670A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
CN110297670B (en) Method and system for improving training efficiency of distributed tasks on container cloud
CN110308986B (en) Method for distributed training data communication on container cloud based on optimal scheduling
CN110308987B (en) Method for updating connection parameters of distributed training tasks on container cloud
CN112187545B (en) Network slice deployment method and device
CN105979009B (en) A kind of increase load automatic balancing method for cloud application container
CN110311948B (en) Communication method between container groups and container cloud network system based on same
CN108337109B (en) Resource allocation method and device and resource allocation system
CN109194502B (en) Management method of multi-tenant container cloud computing system
CN105103506A (en) Network function virtualization method and device
CN110198364B (en) Container cloud distributed training data communication method based on designated DNS analysis
CN110838939B (en) Scheduling method based on lightweight container and edge Internet of things management platform
CN111371616B (en) Virtual network function chain deployment method and system for NUMA (non Uniform memory Access) architecture server
US9774542B2 (en) Computer-implemented method and a system for providing a networking service, and a computer program product adapted to perform the method
CN103747107A (en) Compatible cloud operating platform and realizing method thereof
Song et al. Gaia scheduler: A kubernetes-based scheduler framework
CN114374609B (en) Deep learning job operation method and system based on RDMA equipment
CN109525413B (en) CDN network function virtualization management method, device and system
WO2020108337A1 (en) Cpu resource scheduling method and electronic equipment
CN110011984B (en) REST and RPC-based distributed cluster system and method
Tseng et al. An mec-based vnf placement and scheduling scheme for ar application topology
CN114124714A (en) Multi-level network deployment method, device, equipment and storage medium
JP2024501005A (en) Management method and device for container clusters
CN110300192B (en) Method for updating distributed training task connection parameters according to IP distribution table
CN114924835A (en) Method and system for improving virtual machine access performance under super-fusion environment
CN112087311B (en) Virtual network function VNF deployment method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 518000 a2405, building 9, zone 2, Shenzhen Bay science and technology ecological park, 3609 Baishi Road, community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Zhixing Technology Co.,Ltd.

Address before: Room 408, Building 3, 4 Chegongzhuang Street, Xicheng District, Beijing 100044

Applicant before: BEIJING HANHAI CLUSTAR TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant