CN112667594A

CN112667594A - Heterogeneous computing platform based on hybrid cloud resources and model training method

Info

Publication number: CN112667594A
Application number: CN202110049064.0A
Authority: CN
Inventors: 曹岗; 邵洲; 张肖龙; 曲含笑
Original assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Current assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date: 2021-01-14
Filing date: 2021-01-14
Publication date: 2021-04-16

Abstract

The invention discloses a heterogeneous computing platform based on hybrid cloud resources and a model training method. The platform comprises a basic component layer, a computing framework layer, a resource management layer and an infrastructure layer. The method comprises the following steps: a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting a model, a data set, a learning framework and/or computing resources; the computing framework layer provides the selected learning framework; and the resource management layer allocates resources to the model training task according to its settings and calls computing resources, network resources and storage resources of the infrastructure layer to perform model training. By supporting multiple reinforcement learning architectures and ultra-large-scale distributed training, the heterogeneous computing platform makes the whole process of machine learning modeling visual, and it addresses problems common in existing cloud management platforms, such as limited computing power, adaptation to only a single type of AI chip, and fixed frameworks, so that model training is convenient, fast and efficient.

Description

Heterogeneous computing platform based on hybrid cloud resources and model training method

Technical Field

The invention relates to the technical field of cloud computing, and in particular to a heterogeneous computing platform based on hybrid cloud resources and a model training method.

Background

At present, computing, storage and network resources are isolated in different virtualization platforms, so unified monitoring and management cannot be achieved at the private cloud layer. As cloud computing technology develops, administrators must switch frequently between different management interfaces and master the different management logic and virtualization models of each platform, so enterprises need to hire or train managers familiar with each specific virtualization platform to manage it separately.

The hybrid cloud is a solution combining a private cloud and one or more public cloud services, and not only can provide a private and safe data storage and computing environment, but also can provide more flexible and lower-cost computing, storage and network resources.

At present, most hybrid cloud management systems manage multi-cloud environments through a Cloud Management Platform (CMP). However, cloud management platforms generally suffer from lengthy processes and error-prone manual operations, so users cannot request and use resources in a uniform manner or improve their self-service capability.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides the following technical scheme.

One aspect of the present invention relates to a heterogeneous computing platform based on hybrid cloud resources, comprising:

a basic component layer, configured to provide an interface for user operations, the user operations including setting a model training task;

a computing framework layer, configured to provide the learning framework used by the model training task;

a resource management layer, configured to allocate and schedule the hybrid cloud resources in the infrastructure layer to execute the model training task; and

an infrastructure layer, configured to provide the hybrid cloud resources, including heterogeneous computing resources, network resources and storage resources.

Further, the learning framework includes a deep learning framework and a reinforcement learning framework.

Further, the resource management layer comprises a resource management module, a Kubernetes module and a Docker module, and the resource management module schedules the heterogeneous computing resources, network resources and storage resources in the infrastructure layer through the Kubernetes module and the Docker module.

Further, the heterogeneous computing resources include distributed CPU, GPU and ASIC processor resources, the network resources include an RDMA network, and the storage resources include the distributed storage systems HDFS, Ceph and/or ClusterFS.

Further, the user operations further comprise uploading a data set and/or uploading an algorithm.

Further, the computing framework layer further comprises a big data engine for managing the uploaded data set.

Another aspect of the present invention relates to a model training method implemented by using the above heterogeneous computing platform based on hybrid cloud resources, including:

a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting a model, a data set, a learning framework and/or computing resources;

the computing framework layer provides the selected learning framework;

and the resource management layer allocates and calls computing resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task to perform model training.

Preferably, the resource management layer allocating and calling computing resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task includes:

the resource management layer allocates computing resources, network resources and storage resources to the model training task according to the settings of the model training task, and calls a Kubernetes module and a Docker module to create a container for the model training task, wherein the container includes images of the allocated computing resources, network resources and storage resources.

Further, the resource management layer allocating computing resources to the model training task according to the settings of the model training task includes:

acquiring currently available computing power resources;

if the settings of the model training task include a selection of computing resources, allocating the corresponding computing resources based on the selection;

otherwise, identifying the type of the model training task, and determining the type and amount of computing resources required according to that type;

and allocating the computing resources from the currently available computing resources according to the type and amount of computing resources required.
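The allocation steps above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the names `ComputeRequest`, `DEFAULTS_BY_TASK_TYPE` and `allocate_compute`, the task type labels and the default sizes are all assumptions introduced here, since the description does not state how a task type maps to a processor type and amount.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical mapping from task type to (processor type, count).
# The description gives no concrete values; these are placeholders.
DEFAULTS_BY_TASK_TYPE = {
    "deep_learning": ("GPU", 4),
    "reinforcement_learning": ("CPU", 16),
}


@dataclass
class ComputeRequest:
    processor_type: str
    count: int


def allocate_compute(
    available: Dict[str, int],
    task_type: str,
    selection: Optional[ComputeRequest] = None,
) -> ComputeRequest:
    """Allocate processors from `available` (processor type -> free count)."""
    if selection is not None:
        # The user selected computing resources: honor the selection.
        request = selection
    else:
        # Otherwise derive type and amount from the task type.
        processor_type, count = DEFAULTS_BY_TASK_TYPE[task_type]
        request = ComputeRequest(processor_type, count)

    free = available.get(request.processor_type, 0)
    if free < request.count:
        raise RuntimeError(
            f"only {free} {request.processor_type} free, {request.count} needed"
        )
    # Take the processors out of the currently available pool.
    available[request.processor_type] = free - request.count
    return request
```

In this sketch a shortfall raises an error; a variant that grants a partial allocation instead would fit the behavior described in the second embodiment.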

Further, the resource management layer records the resource usage of each model training task in real time, and dynamically adjusts the allocated computing resources, network resources and storage resources during model training.

The beneficial effects of the invention are as follows. The invention provides a heterogeneous computing platform based on hybrid cloud resources and a model training method, in which an administrator-centered operation and maintenance mode is replaced by a decentralized self-service mode, and a one-way supply mode of operation is replaced by a transparent, autonomous one, thereby improving the efficiency of managing and using heterogeneous resources. The heterogeneous computing platform also enables unified management of multiple clusters and simultaneous use by many users at large scale. By supporting multiple reinforcement learning architectures and ultra-large-scale distributed training, it makes the whole process of machine learning modeling visual, and it addresses problems common in existing cloud platforms, such as limited computing power, adaptation to only a single type of AI chip, and fixed frameworks, so that model training is convenient, fast and efficient.

Drawings

FIG. 1 is a schematic structural diagram of a hybrid cloud resource-based heterogeneous computing platform according to the present invention;

FIG. 2 is a schematic flow chart of a model training method implemented by using a heterogeneous computing platform based on hybrid cloud resources according to the present invention.

Detailed Description

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

Example one

As shown in FIG. 1, an embodiment of the present invention provides a heterogeneous computing platform based on hybrid cloud resources, including:

a basic component layer 11, configured to provide an interface for user operations, the user operations including setting a model training task;

a computing framework layer 12, configured to provide the learning framework used by the model training task;

a resource management layer 13, configured to allocate and schedule the hybrid cloud resources in the infrastructure layer to execute the model training task; and

an infrastructure layer 14, configured to provide the hybrid cloud resources, including heterogeneous computing resources, network resources and storage resources.

The basic component layer 11 includes a data management module 111, an algorithm development module 112, and a model training module 113. Through the data management module 111, the user uploads data sets and deletes, modifies, and exports them. Through the algorithm development module 112, the user uploads algorithms and modifies and deletes them. Through the model training module 113, the user sets model training tasks, including setting the algorithm, data set, and/or learning framework used for model training. Optionally, the basic component layer 11 further includes a customization orchestration module 114 for customizing the resources used by model training, including the processor type, the number of processors, and the like.
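The settings collected by the model training module 113 and the customization orchestration module 114 can be pictured as a small record. The sketch below is illustrative only: the class and field names are assumptions, and for simplicity it assumes the algorithm, data set and learning framework are all specified, with resource customization left optional as in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceCustomization:
    """Hypothetical view of what module 114 lets the user customize."""
    processor_type: str   # e.g. "GPU" or "CPU"
    processor_count: int


@dataclass(frozen=True)
class TrainingTaskSettings:
    """Hypothetical view of what module 113 collects for one task."""
    algorithm: str
    dataset: str
    framework: str
    # None means the user did not customize resources and the
    # resource management layer decides.
    resources: Optional[ResourceCustomization] = None

    def __post_init__(self) -> None:
        if self.resources is not None and self.resources.processor_count < 1:
            raise ValueError("processor_count must be at least 1")
```

For example, `TrainingTaskSettings("resnet50", "imagenet-subset", "pytorch")` describes a task that leaves resource selection to the platform; the algorithm and data set names here are made up.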

The learning framework 121 provided by the computing framework layer 12 includes deep learning frameworks and reinforcement learning frameworks. The deep learning frameworks include mainstream international frameworks such as TensorFlow, MXNet, Caffe and PyTorch, and domestic frameworks such as OneFlow, MegEngine, PaddlePaddle and MindSpore. The reinforcement learning frameworks include the multi-tenant reinforcement learning framework Ray. The learning frameworks are preset in the platform.

During use, the user can designate the computing framework for model training through the model training module 113 of the basic component layer 11; when the model is trained, the platform calls the designated framework directly from the computing framework layer 12. This is convenient and fast, greatly simplifies deployment, and improves operating efficiency.
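Because the frameworks are preset, resolving a user's choice can be as simple as a lookup. The following is a sketch under assumptions: the framework names come from the description, but the registry `PRESET_FRAMEWORKS`, the function `resolve_framework` and the image references (on the placeholder host `registry.example.com`) are hypothetical.

```python
from typing import Dict

# Framework names are from the description; the image references are
# hypothetical placeholders, not real locations.
PRESET_FRAMEWORKS: Dict[str, str] = {
    "tensorflow": "registry.example.com/frameworks/tensorflow:latest",
    "pytorch": "registry.example.com/frameworks/pytorch:latest",
    "paddlepaddle": "registry.example.com/frameworks/paddlepaddle:latest",
    "ray": "registry.example.com/frameworks/ray:latest",
}


def resolve_framework(name: str) -> str:
    """Return the preset image for the framework the user designated."""
    key = name.strip().lower()
    try:
        return PRESET_FRAMEWORKS[key]
    except KeyError:
        raise ValueError(
            f"framework {name!r} is not preset on the platform"
        ) from None
```

Rejecting unknown names early keeps a misconfigured task from reaching the resource management layer.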

The computing framework layer 12 also includes a big data engine 122 for operating on the uploaded data sets, including storage, computation, mining, management, and the like. The big data engine comprises multiple data engines, such as Spark, Hadoop, Storm, Hive, Flink and Kafka, so that data can be shared fully across engines and used with zero configuration, creating a unified and rich data ecosystem.

The resource management layer 13 includes a resource management module 131, a Kubernetes module 132, and a Docker module 133, where the resource management module 131 allocates and schedules the heterogeneous computing resources, network resources, and storage resources in the infrastructure layer through the Kubernetes module 132 and the Docker module 133. Specifically, the resource management module 131 allocates computing resources, network resources, and storage resources of the infrastructure layer 14 to the model training task according to the settings of the model training task, and then calls the Kubernetes module 132 to create a container for the model training task. The container includes an image of the allocated computing resources, network resources, and storage resources and is stored in the Docker module 133, so that resource scheduling is performed in units of Docker containers.
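One possible reading of this step in terms of stock Kubernetes is a Pod whose container declares the allocated resources. The sketch below builds such a manifest as a plain dictionary. It is an assumption about how the description maps onto Kubernetes, not the patent's mechanism: in standard Kubernetes, compute is declared as resource limits on the container rather than packaged into the image. The function name, the label, the mount path and the arguments are hypothetical; `nvidia.com/gpu` is the extended resource name exposed by the NVIDIA device plugin.

```python
from typing import Any, Dict


def build_pod_manifest(
    task_name: str, image: str, gpus: int, dataset_claim: str
) -> Dict[str, Any]:
    """Build a Kubernetes Pod manifest for one model training task."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name, "labels": {"app": "model-training"}},
        "spec": {
            # A training job runs once; do not restart it on exit.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,  # framework image chosen for the task
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                    "volumeMounts": [
                        {"name": "dataset", "mountPath": "/data"}
                    ],
                }
            ],
            # Storage allocated to the task, referenced by claim name.
            "volumes": [
                {
                    "name": "dataset",
                    "persistentVolumeClaim": {"claimName": dataset_claim},
                }
            ],
        },
    }
```

The resulting dictionary can be serialized to JSON or YAML and submitted to a cluster with whatever client the platform uses.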

The infrastructure layer 14 includes a private cloud module 141 and a public cloud module 142. The private cloud module 141 provides private cloud resources, and the public cloud module 142 provides public cloud resources. The private cloud resources include heterogeneous computing resources, network resources, and storage resources. The heterogeneous computing resources include various types of processors, such as distributed CPUs, GPUs and ASICs, and processor families from different manufacturers, such as Cambricon, Huawei Ascend, hectoritan, etc., so as to satisfy the various computing requirements of users. The network resources include an RDMA network, which avoids the overhead of copying from user space to system space and improves CPU efficiency on the remote server. The storage resources include the distributed storage systems HDFS, Ceph and/or ClusterFS, so that users can more conveniently access shared files distributed across the network. The public cloud resources include Huawei Cloud, Alibaba Cloud, Kingsoft Cloud and the like.

The invention provides a heterogeneous computing platform based on hybrid cloud resources. By integrating heterogeneous computing resources, multiple computing frameworks and a big data engine, the platform achieves unified multi-cluster management and simultaneous use by many users at large scale. It supports multiple reinforcement learning architectures and ultra-large-scale distributed training, makes the whole process of machine learning modeling visual, and addresses problems common in existing cloud platforms, such as limited computing power, adaptation to only a single type of AI chip, and fixed frameworks, so that model training is convenient, fast and efficient.

Example two

As shown in FIG. 2, this embodiment provides a model training method implemented by using the heterogeneous computing platform based on hybrid cloud resources according to the first embodiment, including:

S101, a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting the algorithm, data set, learning framework and/or computing resources used for training;

S102, the computing framework layer provides the selected learning framework;

S103, the resource management layer allocates and calls computing resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task to perform model training.

Specifically, the user sets a model training task through the model training module 113, including setting the algorithm, data set, and/or learning framework used for model training. The algorithm may be uploaded by the user through the algorithm development module 112, the data set may be uploaded by the user through the data management module 111, and the algorithm and the data set may also be uploaded in advance by an administrator or other users. When uploading an algorithm or data set, the administrator or user can choose whether to make it public; if it is made public, all users of the platform can choose to use it. The user customizes the resource usage scheme for model training through the customization orchestration module 114, including selecting a public cloud or a private cloud and selecting computing resources in the private cloud, such as the processor type, number of processors, processor family, etc. The user can therefore configure training resources flexibly according to their own requirements. For example, a user who wishes to increase training speed may select a greater number of processors; a user with a requirement on processor type can select GPU or CPU; and a user who wants to verify a particular vendor's processors may select that vendor's processor family, for example Cambricon. In this way, a flexible and uniform way of using resources is provided to the user, and the user's individual resource requirements can be met.
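The public or private choice made at upload time amounts to a visibility filter when a user picks an algorithm or data set. The sketch below illustrates this; the `Asset` record and `selectable_assets` function are hypothetical, and treating a non-public item as visible only to its uploader is an assumption, since the description only states what happens when an item is made public.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Asset:
    """An uploaded algorithm or data set (hypothetical record)."""
    name: str
    owner: str
    public: bool


def selectable_assets(assets: List[Asset], user: str) -> List[Asset]:
    """Return the assets `user` may select for a training task.

    Public assets are selectable by everyone (as the description
    states); private ones only by their uploader (assumed here).
    """
    return [a for a in assets if a.public or a.owner == user]
```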

After the setup is complete, the user starts the model training task. According to the user's settings, the platform calls the selected learning framework from the learning framework 121 of the computing framework layer 12, retrieves the data set as training data, and retrieves and executes the algorithm code.

Meanwhile, the resource management layer 13 allocates computing resources, network resources and storage resources according to the customized resource usage scheme. Where no scheme is customized, or only some resources are customized, the resources that are not customized are allocated according to the settings of the model training task and current resource usage. For example, if the customized resource usage scheme defines only the type and number of processors, the resource management layer 13 allocates free network resources and storage resources according to the settings of the model training task (e.g., the size of the data set used). If the currently available resources are fewer than the customized resource usage scheme requires, resources are allocated to the model training task from what is currently available and the task is recorded; when resources later become idle, they are allocated to the recorded task preferentially until the customized resource usage scheme is satisfied.
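The shortfall handling described above (grant what is available, record the task, then top it up as resources free) can be sketched for a single processor type. The class and method names are hypothetical, and serving recorded tasks in the order they were recorded is an assumption; the description says only that recorded tasks are served preferentially.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    requested: int    # processors asked for in the customized scheme
    granted: int = 0  # processors currently held


class Pool:
    """Free processors of one type, plus tasks still short of their request."""

    def __init__(self, free: int) -> None:
        self.free = free
        self.waiting: List[Task] = []  # recorded tasks, oldest first

    def start(self, task: Task) -> None:
        # Grant as much as is currently available, up to the request.
        grant = min(task.requested, self.free)
        task.granted = grant
        self.free -= grant
        if task.granted < task.requested:
            self.waiting.append(task)  # record the shortfall

    def release(self, count: int) -> None:
        # Newly idle processors go to recorded tasks first.
        self.free += count
        for task in list(self.waiting):
            if self.free == 0:
                break
            extra = min(task.requested - task.granted, self.free)
            task.granted += extra
            self.free -= extra
            if task.granted == task.requested:
                self.waiting.remove(task)
```

For instance, a task requesting 8 processors from a pool with 3 free starts with 3 and reaches 8 once at least 5 more have been released.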

Then, the platform calls the Kubernetes module to create a Docker container for the model training task. The container is stored in the Docker module, and the allocated computing resources, network resources and storage resources are packaged into an image and placed in the created container. Thus, when multiple model training tasks run in parallel, each task has its own Docker container, and the platform can call the Kubernetes module to manage the Docker containers uniformly, for example by recording the resource usage of each model training task in real time and dynamically adjusting the images of computing resources, network resources and storage resources contained in each container.

The heterogeneous computing platform based on hybrid cloud resources and the model training method provided by the embodiments of the invention are well suited to a variety of scenarios in the field of artificial intelligence, such as machine translation, face recognition, AI-assisted medical treatment, brain-inspired computing, intelligent simulation and the like.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A heterogeneous computing platform based on hybrid cloud resources, comprising:

2. The hybrid cloud resource-based heterogeneous computing platform of claim 1, wherein the learning framework comprises a deep learning framework and a reinforcement learning framework.

3. The hybrid cloud resource-based heterogeneous computing platform of claim 2, wherein the resource management layer comprises a resource management module, a Kubernetes module, and a Docker module, and wherein the resource management module implements scheduling of heterogeneous computing, network, and storage resources in the infrastructure layer via the Kubernetes module and the Docker module.

4. The hybrid cloud resource-based heterogeneous computing platform of claim 3, wherein the heterogeneous computing resources comprise distributed CPU, GPU, and ASIC processor resources, the network resources comprise an RDMA network, and the storage resources comprise the distributed storage systems HDFS, Ceph, and/or ClusterFS.

5. The hybrid cloud resource-based heterogeneous computing platform of claim 4, wherein the user operations further comprise uploading a data set and/or uploading an algorithm.

6. The hybrid cloud resource-based heterogeneous computing platform of claim 5, wherein the computing framework layer further comprises a big data engine to manage the uploaded data set.

7. A model training method implemented by the hybrid cloud resource-based heterogeneous computing platform of claim 6, comprising:

the computing framework layer provides the selected learning framework;

8. The model training method of claim 7, wherein the resource management layer allocating and invoking computational resources, network resources, and storage resources of the infrastructure layer for the model training task according to the settings of the model training task comprises:

9. The model training method of claim 8, wherein the resource management layer assigning computational resources to the model training task based on the settings of the model training task comprises:

acquiring currently available computing power resources;

10. The model training method of claim 9, wherein the resource management layer records the resource situation used by each model training task in real time and dynamically adjusts the allocated computational resources, network resources and storage resources during the model training process.