CN112667594A - Heterogeneous computing platform based on hybrid cloud resources and model training method - Google Patents

Info

Publication number
CN112667594A
Authority
CN
China
Prior art keywords
resources
model training
layer
training task
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110049064.0A
Other languages
Chinese (zh)
Inventor
曹岗
邵洲
张肖龙
曲含笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202110049064.0A priority Critical patent/CN112667594A/en
Publication of CN112667594A publication Critical patent/CN112667594A/en
Pending legal-status Critical Current

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention discloses a heterogeneous computing platform based on hybrid cloud resources and a model training method. The platform comprises a basic component layer, a computing framework layer, a resource management layer and an infrastructure layer. The method comprises the following steps: a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting a model, a data set, a learning framework and/or computing resources; the computing framework layer provides the selected learning framework; and the resource management layer allocates resources to the model training task according to its settings and calls the computing resources, network resources and storage resources of the infrastructure layer to perform model training. By supporting multiple reinforcement learning architectures and ultra-large-scale distributed training, the heterogeneous computing platform makes the whole machine learning modeling process visual. It also addresses problems common in existing cloud management platforms, such as limited computing power, adaptation to only a single type of AI chip, and fixed frameworks, so that model training is convenient, fast and efficient.

Description

Heterogeneous computing platform based on hybrid cloud resources and model training method
Technical Field
The invention relates to the technical field of cloud computing, and in particular to a heterogeneous computing platform based on hybrid cloud resources and a model training method.
Background
At present, computing, storage and network resources are isolated in different virtualization platforms, so they cannot be monitored and managed in a unified way at the private cloud layer. As cloud computing technology develops, administrators have to switch frequently among different management interfaces and master the management logic and virtualization model of each platform, so enterprises need to hire or train administrators familiar with specific virtualization platforms to manage each platform separately.
The hybrid cloud is a solution combining a private cloud and one or more public cloud services, and not only can provide a private and safe data storage and computing environment, but also can provide more flexible and lower-cost computing, storage and network resources.
At present, most hybrid cloud management systems manage a multi-cloud system through a Cloud Management Platform (CMP). However, such platforms generally suffer from long processes and error-prone manual operation, so users cannot apply for and use resources in a uniform manner, and their self-service capability is limited.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
One aspect of the present invention relates to a heterogeneous computing platform based on hybrid cloud resources, comprising:
the basic component layer is used for providing an interface of user operation, and the user operation comprises setting a model training task;
the computing framework layer is used for providing a learning framework used by the model training task;
a resource management layer for allocating and scheduling the hybrid cloud resources in the infrastructure layer to perform the model training task;
and the infrastructure layer is used for providing hybrid cloud resources, including heterogeneous computing resources, network resources and storage resources.
Further, the learning framework includes a deep learning framework and a reinforcement learning framework.
Further, the resource management layer comprises a resource management module, a Kubernetes module and a Docker module, and the resource management module schedules the heterogeneous computing resources, network resources and storage resources in the infrastructure layer through the Kubernetes module and the Docker module.
Further, the heterogeneous computing resources include distributed CPU, GPU, ASIC processor resources, the network resources include RDMA networks, and the storage resources include distributed storage systems HDFS, Ceph, and/or ClusterFS.
Further, the user operation further comprises uploading a data set and/or uploading an algorithm.
Further, the computing framework layer further comprises a big data engine for managing the uploaded data set.
Another aspect of the present invention relates to a model training method implemented by using the above heterogeneous computing platform based on hybrid cloud resources, including:
a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting a model, a data set, a learning framework and/or computing resources;
the computing framework layer provides the selected learning framework;
and the resource management layer allocates and calls computing resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task to perform model training.
Preferably, the resource management layer allocates and calls computational resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task, including:
and the resource management layer allocates computing resources, network resources and storage resources to the model training task according to the settings of the model training task, and calls the Kubernetes module and the Docker module to establish a container for the model training task, wherein the container comprises images of the allocated computing resources, network resources and storage resources.
Further, the resource management layer allocating computing resources to the model training task according to the settings of the model training task includes:
acquiring currently available computing power resources;
if the setting of the model training task comprises selection of computing resources, distributing corresponding computing resources based on the selection;
otherwise, identifying the type of the model training task, and determining the type and the size of the needed computing resources according to the task type;
and allocating the computing resources from the currently available computing resources according to the type and the size of the needed computing resources.
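The allocation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the mapping from task type to processor type and count (`DEFAULTS_BY_TASK_TYPE`), the names `ComputeRequest` and `allocate_compute`, and the choice to raise an error when the pool is too small are all assumptions, since the source does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical defaults: the source does not say how a task type maps to resources.
DEFAULTS_BY_TASK_TYPE: Dict[str, Tuple[str, int]] = {
    "deep_learning": ("GPU", 4),
    "reinforcement_learning": ("CPU", 16),
}


@dataclass
class ComputeRequest:
    processor_type: str
    count: int


def allocate_compute(
    available: Dict[str, int],
    task_type: str,
    selection: Optional[ComputeRequest] = None,
) -> ComputeRequest:
    """Allocate processors from the currently available pool.

    `available` maps a processor type to the number of free processors.
    """
    if selection is not None:
        # The task settings include an explicit selection: honor it.
        wanted = selection
    else:
        # Otherwise derive the type and size from the task type.
        processor_type, count = DEFAULTS_BY_TASK_TYPE[task_type]
        wanted = ComputeRequest(processor_type, count)

    free = available.get(wanted.processor_type, 0)
    if free < wanted.count:
        # Assumption: this sketch simply fails when the pool is too small.
        raise RuntimeError(
            f"only {free} {wanted.processor_type} free, {wanted.count} needed"
        )
    available[wanted.processor_type] = free - wanted.count
    return wanted
```

For example, with a pool of `{"GPU": 8, "CPU": 32}`, a deep learning task without an explicit selection would receive 4 GPUs under these assumed defaults, leaving 4 free.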
Further, the resource management layer records the resource condition used by each model training task in real time, and dynamically adjusts the distributed computing resources, network resources and storage resources in the model training process.
The invention has the following beneficial effects. It provides a heterogeneous computing platform based on hybrid cloud resources and a model training method, in which an administrator-centered operation and maintenance mode is converted into a decentralized self-service mode, and a one-way supply mode of operation is converted into a transparent, autonomous mode, which improves the efficiency of managing and using heterogeneous resources. The heterogeneous computing platform also provides unified management of multiple clusters and simultaneous use by many users at large scale. By supporting multiple reinforcement learning architectures and ultra-large-scale distributed training, it makes the whole machine learning modeling process visual, and it addresses problems common in existing cloud platforms, such as limited computing power, adaptation to only a single type of AI chip, and fixed frameworks, so that model training is convenient, fast and efficient.
Drawings
FIG. 1 is a schematic structural diagram of a hybrid cloud resource-based heterogeneous computing platform according to the present invention;
FIG. 2 is a schematic flow chart of a model training method implemented by using a heterogeneous computing platform based on hybrid cloud resources according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example one
As shown in fig. 1, an embodiment of the present invention provides a heterogeneous computing platform based on hybrid cloud resources, including:
the basic component layer 11 is used for providing an interface of user operation, and the user operation comprises setting a model training task;
a computation framework layer 12, configured to provide a learning framework used by the model training task;
a resource management layer 13, configured to allocate and schedule the hybrid cloud resources in the infrastructure layer to execute the model training task;
the infrastructure layer 14 is configured to provide hybrid cloud resources including heterogeneous computing resources, network resources, and storage resources.
The basic component layer 11 includes a data management module 111, an algorithm development module 112, and a model training module 113. The user uploads the data set through the data management module 111, and deletes, modifies, and exports the data set. The user uploads the algorithm, and modifies and deletes the algorithm through the algorithm development module 112. The user sets model training tasks, including setting algorithms, data sets, and/or learning frameworks used for model training, through the model training module 113. Optionally, the base component layer 11 further includes a customization orchestration module 114 for customizing resources used by the model training, including processor type, number of processors, and the like.
The learning framework 121 provided by the computing framework layer 12 includes deep learning frameworks and a reinforcement learning framework. The deep learning frameworks include mainstream international frameworks such as TensorFlow, MXNet, Caffe and PyTorch, and domestic frameworks such as OneFlow, MegEngine, PaddlePaddle and MindSpore. The reinforcement learning framework includes the multi-tenant reinforcement learning framework Ray. The learning frameworks are preinstalled in the platform.
In use, a user can designate the learning framework used for model training through the model training module 113 of the basic component layer 11. When the model is trained, the platform calls the designated framework directly from the computing framework layer 12, which is convenient and fast, greatly simplifies deployment, and improves operating efficiency.
The computing framework layer 12 also includes a big data engine 122 for operating on the uploaded data sets, including storage, computation, mining, management, and the like. The big data engine comprises a plurality of data engines, such as Spark, Hadoop, Storm, Hive, Flink and Kafka, so that data is fully interoperable and usable with zero configuration, forming a unified and rich data ecosystem.
The resource management layer 13 includes a resource management module 131, a Kubernetes module 132, and a Docker module 133. The resource management module 131 allocates and schedules the heterogeneous computing resources, network resources, and storage resources in the infrastructure layer through the Kubernetes module 132 and the Docker module 133. Specifically, the resource management module 131 allocates computing resources, network resources, and storage resources of the infrastructure layer 14 to the model training task according to the settings of the model training task, and then calls the Kubernetes module 132 to establish a container for the model training task. The container includes images of the allocated computing resources, network resources, and storage resources, and is stored in the Docker module 133, so that resource scheduling is performed in units of Docker containers.
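One way to express a per-task container with allocated resources in Kubernetes terms is a Pod manifest whose container carries resource limits. The source does not show any manifest, so the sketch below is only an illustration: the function name `build_training_pod`, the label, the container name and the example image name are hypothetical, and `nvidia.com/gpu` is the extended resource name exposed by the NVIDIA device plugin, which the platform may or may not use.

```python
import json
from typing import Any, Dict


def build_training_pod(
    task_name: str, image: str, cpus: int, memory_gib: int, gpus: int = 0
) -> Dict[str, Any]:
    """Build a Kubernetes Pod manifest (as a dict) for one training task."""
    limits = {"cpu": str(cpus), "memory": f"{memory_gib}Gi"}
    if gpus > 0:
        # Extended resource name used by the NVIDIA device plugin (assumed here).
        limits["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name, "labels": {"app": "model-training"}},
        "spec": {
            # A training job runs to completion and should not be restarted.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "resources": {"limits": limits},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; print the manifest as JSON, which kubectl accepts.
    pod = build_training_pod(
        "task-001", "registry.example.com/trainer:latest", cpus=8, memory_gib=32, gpus=2
    )
    print(json.dumps(pod, indent=2))
```

Because each task maps to its own Pod, the scheduler can place and account for tasks container by container, which matches the per-container scheduling described above.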
The infrastructure layer 14 includes a private cloud module 141 and a public cloud module 142. The private cloud module 141 provides private cloud resources, and the public cloud module 142 provides public cloud resources. The private cloud resources include heterogeneous computing resources, network resources, and storage resources. The heterogeneous computing resources include various types of processors, such as distributed CPUs, GPUs and ASICs, and processor families from different manufacturers, such as Cambricon, Huawei Ascend, hectoritan, etc., so as to satisfy the various computing requirements of users. The network resources include an RDMA network, which avoids the overhead of copying data from user space to system space and improves the CPU efficiency of the remote server. The storage resources include the distributed storage systems HDFS, Ceph and/or ClusterFS, so that users can more conveniently access shared files distributed across the network. The public cloud resources include Huawei Cloud, Alibaba Cloud (Aliyun), Kingsoft Cloud and the like.
The invention provides a heterogeneous computing platform based on hybrid cloud resources. By integrating heterogeneous computing resources, multiple learning frameworks and a big data engine, the platform provides unified management of multiple clusters and simultaneous use by many users at large scale. It supports multiple reinforcement learning architectures and ultra-large-scale distributed training, makes the whole machine learning modeling process visual, and addresses problems common in existing cloud platforms, such as limited computing power, adaptation to only a single type of AI chip, and fixed frameworks, so that model training is convenient, fast and efficient.
Example two
As shown in fig. 2, the embodiment provides a model training method implemented by using the heterogeneous computing platform based on hybrid cloud resources according to the first embodiment, including:
S101, a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting the algorithm, data set, learning framework and/or computing resources used for training;
S102, the computing framework layer provides the selected learning framework;
S103, the resource management layer allocates and calls computing resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task to perform model training.
Specifically, the user sets a model training task through the model training module 113, including setting the algorithm, data set, and/or learning framework used for model training. The algorithm may be uploaded by the user through the algorithm development module 112, and the data set may be uploaded by the user through the data management module 111; the algorithm and the data set may also be uploaded in advance by an administrator or other users. When uploading an algorithm or a data set, the administrator or user can choose whether to make it public, and if so, all users of the platform can choose to use it. The user customizes the resource usage scheme for model training through the customization orchestration module 114, including selecting a public cloud or a private cloud and selecting computing resources in the private cloud, such as processor type, number of processors, processor family, etc. The user can therefore set the training resources flexibly according to their own needs. For example, if the user wishes to increase the training speed, a greater number of processors may be selected; if the user has a requirement on the type of processor, a GPU or a CPU can be selected; if the user wants to verify the processors of a particular vendor, that vendor's processor family may be selected, such as Cambricon. In this way, the platform provides a flexible and uniform way of using resources and meets the personalized resource requirements of users.
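The task settings described above can be modeled as a small data structure. The sketch below is illustrative only: the class names `TrainingTask` and `ResourcePlan`, the field names and the validation rules are assumptions, and the framework list simply repeats the frameworks named in the first embodiment.

```python
from dataclasses import dataclass
from typing import Optional

# Frameworks named in the description of the computing framework layer.
SUPPORTED_FRAMEWORKS = {
    "TensorFlow", "MXNet", "Caffe", "PyTorch",
    "OneFlow", "MegEngine", "PaddlePaddle", "MindSpore", "Ray",
}


@dataclass
class ResourcePlan:
    """Optional customized resource usage scheme (all fields hypothetical)."""
    cloud: str = "private"                  # "private" or "public"
    processor_type: Optional[str] = None    # e.g. "GPU" or "CPU"
    processor_count: Optional[int] = None
    processor_family: Optional[str] = None  # e.g. a specific vendor's family


@dataclass
class TrainingTask:
    algorithm: str
    dataset: str
    framework: str
    resources: Optional[ResourcePlan] = None  # None means "let the platform decide"

    def validate(self) -> None:
        """Raise ValueError if the settings are inconsistent."""
        if self.framework not in SUPPORTED_FRAMEWORKS:
            raise ValueError(f"unsupported framework: {self.framework}")
        if self.resources is None:
            return
        if self.resources.cloud not in ("private", "public"):
            raise ValueError(f"unknown cloud: {self.resources.cloud}")
        count = self.resources.processor_count
        if count is not None and count < 1:
            raise ValueError("processor_count must be at least 1")
```

A task such as `TrainingTask("resnet50.py", "imagenet-subset", "PyTorch", ResourcePlan(processor_type="GPU", processor_count=4))` leaves the processor family unset, which corresponds to the partially customized case.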
After the setup is complete, the user starts the model training task. According to the user's settings, the platform calls the selected learning framework from the learning framework 121 of the computing framework layer 12, extracts the data set as training data, and extracts and executes the algorithm code.
Meanwhile, the resource management layer 13 allocates computing resources, network resources, and storage resources according to the customized resource usage scheme. Where no scheme or only a partial scheme has been customized, it allocates the uncustomized resources according to the model training settings and the current usage of those resources. For example, if the customized resource usage scheme only defines the type and number of processors, the resource management layer 13 allocates free network resources and storage resources according to the settings of the model training (e.g., the size of the data set used by the model training). If the currently available resources are less than the customized resource usage scheme requires, the task is allocated the currently available resources and the shortfall is recorded; when new idle resources appear, they are allocated preferentially to this task until the customized resource usage scheme is satisfied.
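The partial allocation and later top-up described above can be sketched for a single resource type. This is a simplified model under stated assumptions: the class names `ProcessorPool` and `TaskAllocation` are hypothetical, only processor counts are tracked, and waiting tasks are served in submission order, which the source does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskAllocation:
    task_id: str
    requested: int   # processors asked for in the customized scheme
    granted: int = 0

    @property
    def shortfall(self) -> int:
        return self.requested - self.granted


class ProcessorPool:
    """Tracks free processors and tasks still waiting for their full request."""

    def __init__(self, free: int) -> None:
        self.free = free
        self.waiting: List[TaskAllocation] = []

    def _grant(self, alloc: TaskAllocation) -> None:
        extra = min(self.free, alloc.shortfall)
        alloc.granted += extra
        self.free -= extra

    def submit(self, task_id: str, requested: int) -> TaskAllocation:
        """Grant what is available now and record any shortfall."""
        alloc = TaskAllocation(task_id, requested)
        self._grant(alloc)
        if alloc.shortfall > 0:
            self.waiting.append(alloc)
        return alloc

    def release(self, count: int) -> None:
        """Return processors to the pool, topping up waiting tasks first."""
        self.free += count
        for alloc in list(self.waiting):
            if self.free == 0:
                break
            self._grant(alloc)
            if alloc.shortfall == 0:
                self.waiting.remove(alloc)
```

With a pool of 4 free processors, a task requesting 6 starts with 4; when 3 processors are later released, the task is topped up to 6 and 1 processor remains free.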
Then, the platform calls the Kubernetes module to establish a Docker container for the model training task. The container is stored in the Docker module, and the allocated computing resources, network resources and storage resources are packaged into an image and placed in the container. Therefore, when a plurality of model training tasks are executed in parallel, each task has a corresponding Docker container, and the platform can call the Kubernetes module to manage the Docker containers uniformly, for example by recording in real time the resources used by each model training task and dynamically adjusting the images of computing resources, network resources and storage resources contained in each container.
The heterogeneous computing platform based on hybrid cloud resources and the model training method provided by the embodiments of the invention are well suited to a variety of scenarios in the field of artificial intelligence, such as machine translation, face recognition, AI healthcare, brain-inspired computing, intelligent simulation and the like.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A heterogeneous computing platform based on hybrid cloud resources, comprising:
the basic component layer is used for providing an interface of user operation, and the user operation comprises setting a model training task;
the computing framework layer is used for providing a learning framework used by the model training task;
a resource management layer for allocating and scheduling the hybrid cloud resources in the infrastructure layer to perform the model training task;
and the infrastructure layer is used for providing hybrid cloud resources, including heterogeneous computing resources, network resources and storage resources.
2. The hybrid cloud resource-based heterogeneous computing platform of claim 1, wherein the learning framework comprises a deep learning framework and a reinforcement learning framework.
3. The hybrid cloud resource-based heterogeneous computing platform of claim 2, wherein the resource management layer comprises a resource management module, a Kubernetes module, and a Docker module, and wherein the resource management module schedules the heterogeneous computing resources, network resources, and storage resources in the infrastructure layer via the Kubernetes module and the Docker module.
4. The hybrid cloud resource-based heterogeneous computing platform of claim 3, wherein the heterogeneous computing resources comprise distributed CPU, GPU, ASIC processor resources, the network resources comprise RDMA networks, and the storage resources comprise distributed storage systems HDFS, Ceph, and/or ClusterFS.
5. The hybrid cloud resource-based heterogeneous computing platform of claim 4, wherein the user operations further comprise uploading a data set and/or uploading an algorithm.
6. The hybrid cloud resource-based heterogeneous computing platform of claim 5, wherein the computing framework layer further comprises a big data engine to manage the uploaded data set.
7. A model training method implemented by the hybrid cloud resource-based heterogeneous computing platform of claim 6, comprising:
a user sets a model training task through the basic component layer and starts the task, wherein setting the model training task comprises selecting a model, a data set, a learning framework and/or computing resources;
the computing framework layer provides the selected learning framework;
and the resource management layer allocates and calls computing resources, network resources and storage resources of the infrastructure layer according to the settings of the model training task to perform model training.
8. The model training method of claim 7, wherein the resource management layer allocating and invoking computational resources, network resources, and storage resources of the infrastructure layer for the model training task according to the settings of the model training task comprises:
and the resource management layer allocates computing resources, network resources and storage resources to the model training task according to the settings of the model training task, and calls the Kubernetes module and the Docker module to establish a container for the model training task, wherein the container comprises images of the allocated computing resources, network resources and storage resources.
9. The model training method of claim 8, wherein the resource management layer assigning computational resources to the model training task based on the settings of the model training task comprises:
acquiring currently available computing power resources;
if the setting of the model training task comprises selection of computing resources, distributing corresponding computing resources based on the selection;
otherwise, identifying the type of the model training task, and determining the type and the size of the needed computing resources according to the task type;
and allocating the computing resources from the currently available computing resources according to the type and the size of the needed computing resources.
10. The model training method of claim 9, wherein the resource management layer records the resource situation used by each model training task in real time and dynamically adjusts the allocated computational resources, network resources and storage resources during the model training process.
CN202110049064.0A 2021-01-14 2021-01-14 Heterogeneous computing platform based on hybrid cloud resources and model training method Pending CN112667594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110049064.0A CN112667594A (en) 2021-01-14 2021-01-14 Heterogeneous computing platform based on hybrid cloud resources and model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110049064.0A CN112667594A (en) 2021-01-14 2021-01-14 Heterogeneous computing platform based on hybrid cloud resources and model training method

Publications (1)

Publication Number Publication Date
CN112667594A true CN112667594A (en) 2021-04-16

Family

ID=75415161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110049064.0A Pending CN112667594A (en) 2021-01-14 2021-01-14 Heterogeneous computing platform based on hybrid cloud resources and model training method

Country Status (1)

Country Link
CN (1) CN112667594A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107995043A (en) * 2017-12-15 2018-05-04 南京南瑞信息通信科技有限公司 Application disaster recovery and backup systems and calamity based on mixing cloud platform are for collocation method
WO2019130009A1 (en) * 2017-12-29 2019-07-04 Agarik Sas Orchestrated hybrid cloud platform for multi-cloud environment
CN108881446A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A kind of artificial intelligence plateform system based on deep learning
WO2020092446A2 (en) * 2018-10-29 2020-05-07 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN109670600A (en) * 2018-12-14 2019-04-23 启元世界(北京)信息技术服务有限公司 Decision-making technique and system based on cloud platform
WO2020135806A1 (en) * 2018-12-28 2020-07-02 华为技术有限公司 Operation maintenance method and equipment applied to data center
CN109933306A (en) * 2019-02-11 2019-06-25 山东大学 Mix Computational frame generation, data processing method, device and mixing Computational frame
CN110347498A (en) * 2019-06-10 2019-10-18 华南理工大学 A kind of load dynamic migration method under container and virtual machine mixing cloud environment
CN110490450A (en) * 2019-08-15 2019-11-22 安诺优达生命科学研究院 Biological information management system based on mixed cloud
CN111209077A (en) * 2019-12-26 2020-05-29 中科曙光国际信息产业有限公司 Deep learning framework design method
CN111427549A (en) * 2020-03-30 2020-07-17 中国科学院计算机网络信息中心 Artificial intelligence reinforcement learning service platform
CN111612300A (en) * 2020-04-16 2020-09-01 国网甘肃省电力公司信息通信公司 Scene anomaly perception index calculation method and system based on deep hybrid cloud model
CN111626338A (en) * 2020-05-06 2020-09-04 中移雄安信息通信科技有限公司 Cloud environment matching method, device, equipment and medium based on fusion classification model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
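Zhu Lianzhang et al.: "Research on a pervasive cloud service migration method based on deep learning", Journal of Taiyuan University of Technology, no. 05, 15 September 2018 (2018-09-15), pages 736 - 744 *
Lin Jian; Xie Dongming; Yu Bo: "Research on the adaptation problem of deep learning cloud services", Software Guide, no. 06, 15 June 2020 (2020-06-15), pages 1 - 8 *
Chen Jianhui: "Privacy protection model based on elliptic curve encryption in a hybrid cloud environment", Microelectronics & Computer, no. 08, 5 August 2017 (2017-08-05), pages 128 - 132 *
Chen Xing; Lan Xingtu; Li Aipeng; Guo Wenzhong; Huang Gang: "Hybrid cloud management method based on runtime models", Journal of Software, no. 07, pages 1881 - 1897 *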

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326116A (en) * 2021-06-30 2021-08-31 北京九章云极科技有限公司 Data processing method and system
CN114661482A (en) * 2022-05-25 2022-06-24 成都索贝数码科技股份有限公司 GPU computing power management method, medium, equipment and system
DE202022104275U1 (en) 2022-07-28 2022-08-25 Ahmed Alemran System for intelligent resource management for distributed machine learning tasks
CN115562877A (en) * 2022-11-15 2023-01-03 北京阿丘科技有限公司 Arrangement method, device and equipment of distributed computing power resources and storage medium
CN115562877B (en) * 2022-11-15 2023-03-24 北京阿丘科技有限公司 Arranging method, device and equipment of distributed computing power resources and storage medium
CN116521380A (en) * 2023-07-05 2023-08-01 之江实验室 Resource self-adaptive collaborative model training acceleration method, device and equipment
CN117271424A (en) * 2023-11-24 2023-12-22 北京中星微人工智能芯片技术有限公司 Processing device and processing method based on multimode fusion computing framework
CN117271424B (en) * 2023-11-24 2024-02-06 北京中星微人工智能芯片技术有限公司 Processing device and processing method based on multimode fusion computing framework
CN117421108A (en) * 2023-12-15 2024-01-19 企商在线(北京)数据技术股份有限公司 Heterogeneous computing power platform design method, heterogeneous computing power platform and resource scheduling method

Similar Documents

Publication Publication Date Title
CN112667594A (en) Heterogeneous computing platform based on hybrid cloud resources and model training method
CN103810023B Intelligent deployment method and system for distributed applications in a cloud platform
US10467725B2 (en) Managing access to a resource pool of graphics processing units under fine grain control
CN105207798B (en) Service arrangement method and device in software defined network
CN103870314B (en) Method and system for simultaneously operating different types of virtual machines by single node
Sotomayor et al. Virtual infrastructure management in private and hybrid clouds
CN107222531B (en) Container cloud resource scheduling method
CN109672709B (en) Hybrid cloud service scheduling system and method
CN104123182B Client/server-based cross-data-center MapReduce task scheduling system and method
CN102971724B Method and apparatus related to management of modular virtual resources in a data center environment
CN104503832B Virtual machine scheduling system and method balancing fairness and efficiency
CN106325975A (en) Method for automatically deploying and managing big data clusters through Docker container
CN108920153A Dynamic scheduling method for Docker containers based on load estimation
CN111045786B (en) Container creation system and method based on mirror image layering technology in cloud environment
US20240111586A1 (en) Multi-policy intelligent scheduling method and apparatus oriented to heterogeneous computing power
CN104112049B P2P-architecture-based cross-data-center MapReduce task scheduling system and method
CN104021029B (en) Spatial information cloud computing system and implementing method thereof
CN110069341A Scheduling method for dependent tasks in edge computing combined with on-demand function configuration
CN109144661A Deep learning management method based on Docker
CN116541134B (en) Method and device for deploying containers in multi-architecture cluster
CN104331332A (en) Virtual resource preallocation algorithm based on SLA (Service Level Agreement)
CN109992373A Resource scheduling method, information management method and device, and task deployment system
CN112433823A (en) Apparatus and method for dynamically virtualizing physical card
EP2923320A1 (en) Transparently routing job submissions between disparate environments
Kherbache et al. Scheduling live-migrations for fast, adaptable and energy-efficient relocation operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination