CN114356543A - Kubernetes-based multi-tenant machine learning task resource scheduling method - Google Patents


Info

Publication number
CN114356543A
CN114356543A
Authority
CN
China
Prior art keywords
node
resource
gpu
cpu
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111460970.6A
Other languages
Chinese (zh)
Inventor
杨立波
王宇冬
马斌
李一鹏
栗维勋
袁龙
李昊
季学纯
孙云枫
李佳阳
沈嘉灵
徐丽燕
胡锐锋
劳莹莹
陈子韵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hebei Electric Power Co Ltd
Nari Technology Co Ltd
State Grid Electric Power Research Institute
Original Assignee
State Grid Corp of China SGCC
State Grid Hebei Electric Power Co Ltd
Nari Technology Co Ltd
State Grid Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hebei Electric Power Co Ltd, Nari Technology Co Ltd, State Grid Electric Power Research Institute filed Critical State Grid Corp of China SGCC
Priority to CN202111460970.6A priority Critical patent/CN114356543A/en
Publication of CN114356543A publication Critical patent/CN114356543A/en
Pending legal-status Critical Current

Abstract

The invention discloses a Kubernetes-based multi-tenant machine learning task resource scheduling method. The method applies quota management to the computing resources available to different users and monitors the resource state of the nodes in a Kubernetes platform, taking into account the resource utilization of the host on which each node runs so as to avoid inaccurate scheduling results. By monitoring real-time scheduling and pre-scheduling request demands, the nodes are priority-ranked according to the resource demands of the task to be scheduled, the host label of the optimal Node is obtained, and the resource demands of the various machine learning model training and prediction tasks are allocated reasonably according to this label. The invention effectively prevents and reduces skewed node resource usage in the Kubernetes platform, achieves load balancing across multiple nodes, and improves node resource utilization.

Description

Kubernetes-based multi-tenant machine learning task resource scheduling method
Technical Field
The invention relates to a Kubernetes-based multi-tenant machine learning task resource scheduling method, and belongs to the technical field of power regulation and control.
Background
At present, artificial intelligence technology in the field of power grid regulation and control has achieved preliminary results, but computing-power resource management still faces the problems of dispersed computing power and constrained applications: each application deploys its own artificial intelligence development and operation environment in a siloed ("chimney") fashion, leading to repeated construction of underlying hardware resources, scattered computing power, and difficult expansion.
The IaaS layer of a cloud computing platform mainly uses virtualization technology to achieve multi-tenant resource isolation and dynamic allocation. However, traditional virtualization occupies a high share of hardware resources and is therefore ill-suited to the high-utilization computing scenarios of machine learning model training and prediction tasks; moreover, the complexity of application configuration, operation, and management is high, which hinders unified cluster-level management.
Kubernetes offers automatic orchestration, deployment, and resource scheduling for services and is popular with developers. The present method performs custom orchestration and scheduling of resources on top of Kubernetes, supports product development of the artificial intelligence application development and service support platform in the new-generation dispatching technology support system, and is used for machine learning training and resource scheduling of prediction tasks such as power grid fault identification and analysis, power grid operation prediction and analysis, and intelligent power grid dispatching decision support; application results have verified the technical route and reliability of the method.
Disclosure of Invention
The purpose is as follows: to overcome the defects of the prior art, the invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method that uses Kubernetes and container technology to uniformly manage the CPU, GPU, and memory resources of the IaaS layer, constructs a standardized runtime environment for multi-tenant machine learning model training and prediction applications, and improves the controllability, elastic scaling capability, and resource isolation capability of the power grid regulation and control system.
The technical scheme is as follows: to solve the above technical problems, the invention adopts the following technical scheme:
a Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
and calculating the difference value of the used resources of the Node nodes in the cluster and the used resources of the created containers to obtain the resource information occupied by all processes of the Node operating system.
And calling a Kubernetes API to acquire resource information applied by all machine learning models and prediction task containers on the Node.
And subtracting the resource information occupied by all processes of the Node operating system and the resource information applied by all machine learning model training and prediction task containers on the Node from the inherent resource capacity of the Node, and calculating the real-time available resource information of the Node.
And calculating the availability ratios of the CPU, the GPU and the memory of the Node according to the real-time available resource information of the Node and the inherent resource capacity of the Node.
The Node nodes with the availability ratios of the Node CPU, the GPU and the memory not lower than the preset resource threshold percentage allocate computing resources for the machine learning model training and predicting tasks.
And the machine learning task scheduling service sends the quantities of CPU, GPU and memory resources applied by the machine learning model training and prediction tasks of different users to the system cluster resource management and control service.
The system cluster resource management and control service obtains the user-applicable residual resources by calculating the resource difference value of the multi-tenant resource quota table and the user resource use condition table, and checks whether the number of CPUs (central processing units), GPUs (graphic processing units) and memories applied by the machine learning model training and prediction tasks exceeds the number of the user-applicable residual resources.
And selecting the Node nodes which do not exceed the amount of the residual resources which can be applied by the user, and calculating the difference value of the real-time available resource information of the Node nodes and the amount of the applied CPU, GPU and memory by the system cluster resource management and control service, and dividing the difference value by the inherent resource capacity of the Node nodes to obtain the percentage of the residual resources of the CPU, GPU and memory after the resources are distributed.
Selecting Node nodes with the percentage of the resources left by the CPU, the GPU and the memory after the resources are distributed being larger than the preset threshold percentage of the resources, carrying out score calculation on the percentage of the resources left by the CPU, the GPU and the memory after the resources are distributed of each Node, and sequencing according to the score from large to small.
And the Node ordered in the sequence of the system cluster resource management and control service is the optimal Node, the Node name of the optimal Node is returned to the machine learning task scheduling service, and persistent storage is carried out in the user resource use condition table.
And dynamically generating a Kubernets yaml file by the machine learning task scheduling service, and calling a Kubernets API to create a container in the optimal node to run a machine learning model training and predicting task.
As a preferred scheme, a CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster.
As a preferred scheme, the inherent resource capacities of the Node nodes in the cluster are integrated into a virtual resource pool, which is logically partitioned and isolated using the user ID as the namespace in Kubernetes.
As a preferred solution, the multi-tenant resource quota table is as follows:
Field name        Purpose
user_id           Unique user ID, consistent with permissions and single sign-on
cpu_capacity      Total number of logical CPU cores
memory_capacity   Total memory (GB)
gpu_capacity      Number of GPU cards
storage_capacity  Storage space (GB)
Preferably, the user resource usage table is as follows:
(table image not reproduced in the source)
preferably, the role-based access control by Kubernetes gives access rights to namespaces operable by different users.
Preferably, the Kubernetes cluster comprises the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, Container runtime.
As a preferred scheme, the score of each Node is calculated from the percentages of CPU, GPU, and memory resources remaining after allocation as follows:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
where Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i, and percent_mem_i are the percentages of CPU, GPU, and memory resources remaining on the i-th Node after allocation; and request_cpu, request_gpu, and request_mem are the CPU, GPU, and memory quantities requested by the task.
Beneficial effects: the invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method that applies quota management to the computing resources available to different users and monitors the resource state of the nodes in the Kubernetes platform, taking into account the resource utilization of the host on which each node runs so as to avoid inaccurate scheduling results. By monitoring real-time scheduling and pre-scheduling request demands, the nodes are priority-ranked according to the resource demands of the task to be scheduled, the host label of the optimal Node is obtained, and the resource demands of the various machine learning model training and prediction tasks are allocated reasonably according to this label, effectively preventing and reducing skewed node resource usage in the Kubernetes platform, achieving load balancing across multiple nodes, and improving node resource utilization.
Drawings
FIG. 1 is a schematic diagram of cluster resource multi-tenant management in an example of the invention.
Fig. 2 is a schematic diagram of a kubernets cluster resource management architecture in an embodiment of the present invention.
FIG. 3 is a flow chart of machine learning training and prediction task creation in an embodiment of the invention.
Detailed Description
The present invention will be further described with reference to the following examples.
A Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
1) Compute the difference between each Node's used resources (node_cpu_used_i, node_gpu_used_i, node_mem_used_i) and the resources used by the containers created on it (pod_cpu_used_i, pod_gpu_used_i, pod_mem_used_i) to obtain the resources occupied by all processes of the node operating system.
2) Call the Kubernetes API to obtain the resources requested by all machine learning model training and prediction task containers on the node (pod_cpu_req_i, pod_gpu_req_i, pod_mem_req_i).
3) Subtract both quantities from the Node's inherent resource capacity (node_cpu_total_i, node_gpu_total_i, node_mem_total_i) to obtain the Node's real-time available resources (node_cpu_i, node_gpu_i, node_mem_i).
4) Calculate the availability ratios of each Node's CPU, GPU, and memory by the following formulas:
percent_cpu_i = node_cpu_i / node_cpu_total_i
percent_gpu_i = node_gpu_i / node_gpu_total_i
percent_mem_i = node_mem_i / node_mem_total_i
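Steps 1)–4) can be sketched in Python as follows; the NodeStats container and its field names are illustrative assumptions standing in for the figures that the per-node collection program and the Kubernetes API would supply:

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Per-node figures gathered by the collection program (illustrative names)."""
    cpu_total: float; gpu_total: float; mem_total: float            # inherent capacity
    cpu_used: float; gpu_used: float; mem_used: float               # whole-node usage
    pod_cpu_used: float; pod_gpu_used: float; pod_mem_used: float   # usage by created containers
    pod_cpu_req: float; pod_gpu_req: float; pod_mem_req: float      # container requests

def available_resources(n: NodeStats):
    """Steps 1)-3): OS-process usage = node used minus container used;
    available = capacity minus OS-process usage minus container requests."""
    os_cpu = n.cpu_used - n.pod_cpu_used
    os_gpu = n.gpu_used - n.pod_gpu_used
    os_mem = n.mem_used - n.pod_mem_used
    node_cpu = n.cpu_total - os_cpu - n.pod_cpu_req
    node_gpu = n.gpu_total - os_gpu - n.pod_gpu_req
    node_mem = n.mem_total - os_mem - n.pod_mem_req
    return node_cpu, node_gpu, node_mem

def availability_ratios(n: NodeStats):
    """Step 4): percent_x_i = node_x_i / node_x_total_i."""
    cpu, gpu, mem = available_resources(n)
    return cpu / n.cpu_total, gpu / n.gpu_total, mem / n.mem_total
```

For example, a node with 64 CPU cores, of which 8 are consumed by host processes outside containers and 16 are requested by containers, has node_cpu = 40 and a CPU availability ratio of 0.625.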
5) Only the Node nodes whose CPU, GPU, and memory availability ratios are not lower than the preset resource threshold percentage are allowed by the system cluster resource management and control service to receive computing resources for machine learning model training and prediction tasks, ensuring that no node runs overloaded.
6) The machine learning task scheduling service sends the quantities of CPU, GPU, and memory resources (request_cpu, request_gpu, request_mem) required by different users' machine learning model training and prediction tasks to the system cluster resource management and control service.
7) The system cluster resource management and control service obtains the remaining resources the user may request by computing the difference between the multi-tenant resource quota table and the user resource usage table, and verifies whether the request_cpu, request_gpu, and request_mem requested by the machine learning model training and prediction task exceed that remainder.
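The quota check of step 7) is a per-user subtraction of the two tables. A minimal sketch, assuming the quota table (Table 1) and the user resource usage table have been loaded into dictionaries keyed by user_id (the in-memory layout and the sample values are assumptions; the patent persists these as database tables):

```python
# Quota table (Table 1) and usage table, keyed by user_id (illustrative values).
quota = {"u001": {"cpu": 32, "gpu": 4, "mem": 128}}
usage = {"u001": {"cpu": 20, "gpu": 1, "mem": 64}}

def check_quota(user_id, request_cpu, request_gpu, request_mem):
    """Return True if the request fits within the user's remaining quota."""
    q, u = quota[user_id], usage[user_id]
    remaining = {k: q[k] - u[k] for k in q}   # quota minus current usage
    return (request_cpu <= remaining["cpu"]
            and request_gpu <= remaining["gpu"]
            and request_mem <= remaining["mem"])
```

With the sample tables, user u001 has 12 CPU cores, 3 GPUs, and 64 GB of memory left, so a request of (8, 2, 32) passes while (16, 2, 32) is rejected.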
8) The system cluster resource management and control service subtracts the requested request_cpu, request_gpu, and request_mem from each Node's real-time available resources (node_cpu_i, node_gpu_i, node_mem_i) and divides the differences by the Node's inherent resource capacity, obtaining the percentage of each resource that would remain after allocation:
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i
percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i
percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining percentage falls below the preset resource threshold percentage are filtered out; the remaining nodes are then scored on their post-allocation percentages and sorted by score:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
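The filtering and scoring of step 8) can be sketched as follows; the node names and the 10% threshold are illustrative. Weighting each remaining percentage by the corresponding requested quantity biases the ranking toward the resources the task actually needs, which is the stated intent of the score formula:

```python
def rank_nodes(nodes, request_cpu, request_gpu, request_mem, threshold=0.10):
    """nodes: {name: (cpu_avail, gpu_avail, mem_avail, cpu_total, gpu_total, mem_total)}.
    Returns (name, score) pairs sorted by score, best first."""
    scored = []
    for name, (c, g, m, ct, gt, mt) in nodes.items():
        # Percentage of each resource left if this node ran the task.
        pc = (c - request_cpu) / ct
        pg = (g - request_gpu) / gt
        pm = (m - request_mem) / mt
        if min(pc, pg, pm) < threshold:   # would dip below the reserve: filter out
            continue
        score = request_cpu * pc + request_gpu * pg + request_mem * pm
        scored.append((name, score))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

For a task requesting 8 CPU cores, 1 GPU, and 32 GB of memory, a node with (40, 4, 160) available out of (64, 8, 256) scores 8×0.5 + 1×0.375 + 32×0.5 = 20.375.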
9) The system cluster resource management and control service selects the highest-scoring Node from the ordering as the optimal Node, returns its node name to the machine learning task scheduling service, and persists the allocation in the user resource usage table.
10) The machine learning task scheduling service dynamically generates a Kubernetes yaml file and calls the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
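Step 10) can be sketched by rendering a pod manifest pinned to the optimal node. The pod layout below is an illustrative assumption; the nodeSelector on the node's unique label (here the well-known kubernetes.io/hostname label) is what lets the pod land directly on the designated node, and in practice the rendered yaml would be submitted through the Kubernetes API (e.g. the official client's create_namespaced_pod):

```python
def render_pod_yaml(task_name, user_id, node_name, image,
                    request_cpu, request_gpu, request_mem_gb):
    """Render a minimal pod manifest pinned to the optimal node (illustrative sketch)."""
    return f"""apiVersion: v1
kind: Pod
metadata:
  name: {task_name}
  namespace: {user_id}          # the user ID doubles as the tenant namespace
spec:
  nodeSelector:
    kubernetes.io/hostname: {node_name}   # unique label of the optimal node
  containers:
  - name: task
    image: {image}
    resources:
      requests:
        cpu: "{request_cpu}"
        memory: "{request_mem_gb}Gi"
        nvidia.com/gpu: "{request_gpu}"
      limits:
        nvidia.com/gpu: "{request_gpu}"
"""
```

The nvidia.com/gpu extended resource is the standard way GPU cards are requested in Kubernetes; an assumed training image and task name stand in for the real workload.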
The invention aims to use Kubernetes and container technology to uniformly manage the CPU, GPU, memory, and storage resources of the IaaS layer, construct a standardized runtime environment for multi-tenant machine learning model training and prediction applications, and improve the controllability, elastic scaling capability, and resource isolation capability of the power grid regulation and control system. The technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings:
Tag-based management is performed according to the available CPU, GPU, memory, and storage resources of the different Node nodes in the cluster; Kubernetes integrates the cluster resources into one resource pool, and the virtual resource pool is logically partitioned and isolated using the user ID as the namespace (Namespace) in Kubernetes, as shown in Fig. 1.
A system administrator distributes required resource information for different users through a cluster multi-tenant resource management interface tool, and the information is stored persistently by adopting a multi-tenant resource quota table, as shown in table 1. The role-based access control (RBAC) of Kubernetes gives access rights to namespaces which can be operated by different users, and prevents resource usage among the users from interfering with each other.
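The RBAC isolation described above can be expressed as a Role/RoleBinding pair scoped to the user's namespace. The resource list and verbs below are illustrative assumptions; the API group and kinds are standard Kubernetes RBAC:

```python
def render_tenant_rbac(user_id):
    """Render a Role/RoleBinding pair confining a user to their own namespace
    (illustrative sketch; resources and verbs are assumptions)."""
    return f"""apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: {user_id}
  name: tenant-role
rules:
- apiGroups: ["", "batch"]
  resources: ["pods", "jobs"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: {user_id}
  name: tenant-binding
subjects:
- kind: User
  name: {user_id}
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-role
  apiGroup: rbac.authorization.k8s.io
"""
```

Because the Role is namespaced, a user bound this way cannot read or create pods in another tenant's namespace, which is the mutual non-interference the text describes.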
A CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster, and the available resources and availability ratios of all Node nodes are calculated from the collected information.
TABLE 1 Multi-tenant resource quota table
Field name        Purpose
user_id           Unique user ID, consistent with permissions and single sign-on
cpu_capacity      Total number of logical CPU cores
memory_capacity   Total memory (GB)
gpu_capacity      Number of GPU cards
storage_capacity  Storage space (GB)
TABLE 2 User resource usage table
(table image not reproduced in the source)
Fig. 2 is a schematic diagram of the Kubernetes cluster resource management architecture in this embodiment, in which the Kubernetes cluster consists of 2 Master nodes and 6 Node nodes. The Master node is the main control unit of the cluster, responsible for scheduling and managing the cluster; to cope with growth in project requirements and access volume, this embodiment builds a high-availability mode with dual Master nodes. Node nodes are workload nodes that run the containers of business applications; they form two clusters, a CPU cluster used mainly to create conventional pod tasks and a GPU cluster used mainly to create pod tasks involving image computation. This dual-cluster arrangement lets the applications deployed on the Node nodes run more rationally and efficiently.
The Kubernetes cluster consists mainly of seven components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, and Container runtime. The scheduling strategy of the invention acts mainly at the Scheduler and computes the evaluation score of real-time and timed tasks at each node. The evaluation score takes two aspects into account: the actual resource usage of the Node, and the pod's relative demand for CPU, GPU, and memory resources. Finally, the scheduling strategy evaluates each Node comprehensively for real-time and timed tasks, selects the Node with the highest evaluation score as the target scheduling node, skips the Scheduler's default pre-selection (predicate) and priority (scoring) policies, and creates the pod directly on the designated node by setting a unique label. Fig. 3 is a flowchart of the pod task creation request in this embodiment; the specific procedure is as follows:
step 1: acquiring the CPU, GPU and memory use information of a host machine where each Node is located in a Kubernetes platform, the use information and request distribution information of the CPU, GPU and memory of the Node, and respectively calculating the available resource condition and the available rate of each Node according to the acquired information;
firstly, calculating a difference value of the host machine and the pod used resources to obtain the resource use condition of the host machine outside the pod container; secondly, acquiring the actual resource allocation condition of the pod, and summing the use condition of the host outside the pod to calculate the actual available condition of the Node; the actual available resource conditions of the CPUs, GPUs and memories of all the Node nodes are calculated through the following formula:
node_cpu_i = node_cpu_total_i - (host_cpu_used_i - pod_cpu_used_i) - pod_cpu_req_i
node_mem_i = node_mem_total_i - (host_mem_used_i - pod_mem_used_i) - pod_mem_req_i
node_gpu_i = node_gpu_total_i - (host_gpu_used_i - pod_gpu_used_i) - pod_gpu_req_i
where node_cpu_i, node_mem_i, and node_gpu_i are the actually available CPU, memory, and GPU resources of the Node; node_cpu_total_i, node_mem_total_i, and node_gpu_total_i are the Node's total configured CPU, memory, and GPU resources; host_cpu_used_i, host_mem_used_i, and host_gpu_used_i are the host's CPU, memory, and GPU usage on the Node; pod_cpu_used_i, pod_mem_used_i, and pod_gpu_used_i are the pods' CPU, memory, and GPU usage on the Node; and pod_cpu_req_i, pod_mem_req_i, and pod_gpu_req_i are the pods' current CPU, memory, and GPU resource request allocations on the Node.
calculating the availability ratios of the CPU, the GPU and the memory of each Node through the following formula:
percent_cpui=node_cpui/node_cpu_totali
percent_memi=node_memi/node_mem_totali
percent_gpui=node_gpui/node_gpu_totali
step 2: comparing the CPU, GPU and memory availability of each Node with a preset threshold, if any Node is lower than the specified threshold, indicating that the Node is overloaded, filtering the Node, and if the number of the filtered nodes is 0, returning to scheduling failure; if the number of the filtered nodes is more than 0, continuing to carry out the step 3;
and step 3: acquiring request information of a real-time task and a timed task pod to CPU, GPU and memory resources through a K8s scheduler, wherein the request information is request _ CPU, request _ GPU, request _ mem and a user ID, searching a table according to the user ID to acquire the residual information of the current user resource, judging whether to support continuous pod creation or not through comparison, if not, returning to the scheduling failure, and if so, continuing the next step;
and 4, step 4: comparing the task resource request information acquired in the step 3 with available resources of the Node nodes, filtering the Node nodes with insufficient CPU, GPU and memory resources, if the number of the filtered nodes is 0, returning to scheduling failure, if the number of the filtered nodes is equal to 1, setting the Node nodes as hosts of the pod to be created, and if the number of the filtered nodes is more than 1, continuing to perform the next step;
and 5: and scoring the filtered Node nodes, and calculating the percentage of the resources left after the resources are distributed by each Node by the request task according to the following formula.
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i
percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i
percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining percentage is below the reserved-resource threshold percentage are excluded; the post-allocation remaining percentages of CPU, GPU, and memory of all remaining nodes are then accumulated into scores and sorted.
The Node nodes are priority-ranked, and the number of candidate optimal nodes is determined from the ordering. If only one node remains, it is the optimal Node and its label is obtained; if more than one remains, the optimal Node is selected according to the ordering and its label is obtained. Finally, the pod is started with this label specified in the machine learning task's yaml file.
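Steps 1–5 can be combined into a single end-to-end scheduling decision. The sketch below reuses the formulas of this section; the node table, remaining-quota table, and thresholds are illustrative assumptions, and each None return corresponds to a "scheduling failure" branch in the flow above:

```python
def schedule(nodes, remaining_quota, user_id, req_cpu, req_gpu, req_mem,
             overload_threshold=0.10, reserve_threshold=0.10):
    """Return the name of the optimal node, or None on scheduling failure.
    nodes: {name: (cpu_avail, gpu_avail, mem_avail, cpu_total, gpu_total, mem_total)}
    remaining_quota: {user_id: (cpu, gpu, mem)} remaining per-user quota."""
    # Step 2: drop overloaded nodes (any availability ratio below the threshold).
    live = {n: r for n, r in nodes.items()
            if min(r[0] / r[3], r[1] / r[4], r[2] / r[5]) >= overload_threshold}
    if not live:
        return None
    # Step 3: per-user quota check.
    qc, qg, qm = remaining_quota[user_id]
    if req_cpu > qc or req_gpu > qg or req_mem > qm:
        return None
    # Step 4: drop nodes with insufficient free resources.
    live = {n: r for n, r in live.items()
            if r[0] >= req_cpu and r[1] >= req_gpu and r[2] >= req_mem}
    if not live:
        return None
    # Step 5: keep nodes above the reserve threshold after allocation, score, pick best.
    best, best_score = None, float("-inf")
    for n, (c, g, m, ct, gt, mt) in live.items():
        pc, pg, pm = (c - req_cpu) / ct, (g - req_gpu) / gt, (m - req_mem) / mt
        if min(pc, pg, pm) < reserve_threshold:
            continue
        score = req_cpu * pc + req_gpu * pg + req_mem * pm
        if score > best_score:
            best, best_score = n, score
    return best
```

The pod would then be created pinned to the returned node's label, as described above.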
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (8)

1. A Kubernetes-based multi-tenant machine learning task resource scheduling method, characterized by comprising the following steps:
calculating the difference between the resources used by each Node in the cluster and the resources used by the containers created on it, to obtain the resource information occupied by all processes of the Node operating system;
calling the Kubernetes API to obtain the resource information requested by all machine learning model training and prediction task containers on the Node;
subtracting, from the Node's inherent resource capacity, the resources occupied by all Node operating system processes and the resources requested by all machine learning model training and prediction task containers on the Node, to calculate the Node's real-time available resource information;
calculating the availability ratios of the Node's CPU, GPU, and memory from its real-time available resource information and its inherent resource capacity;
allocating computing resources for machine learning model training and prediction tasks only on Node nodes whose CPU, GPU, and memory availability ratios are not lower than the preset resource threshold percentage;
the machine learning task scheduling service sending the quantities of CPU, GPU, and memory resources requested by different users' machine learning model training and prediction tasks to the system cluster resource management and control service;
the system cluster resource management and control service obtaining the remaining resources the user may request by computing the difference between the multi-tenant resource quota table and the user resource usage table, and checking whether the CPU, GPU, and memory quantities requested by the machine learning model training and prediction task exceed that remainder;
for the Node nodes passing this check, the system cluster resource management and control service subtracting the requested CPU, GPU, and memory quantities from each Node's real-time available resources and dividing the differences by the Node's inherent resource capacity, to obtain the percentages of CPU, GPU, and memory resources remaining after allocation;
among the Node nodes whose post-allocation remaining percentages of CPU, GPU, and memory all exceed the preset resource threshold percentage, computing a score for each Node from those percentages and sorting the nodes by score in descending order;
the system cluster resource management and control service taking the first Node in the ordering as the optimal Node, returning its node name to the machine learning task scheduling service, and persisting the allocation in the user resource usage table; and
the machine learning task scheduling service dynamically generating a Kubernetes yaml file and calling the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
2. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: a CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster.
3. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the inherent resource capacities of the Node nodes in the cluster are integrated into a virtual resource pool that is logically partitioned and isolated using the user ID as the namespace in Kubernetes.
4. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the multi-tenant resource quota table is as follows:
name of field Use of user_id User unique ID consistent with authority and single sign-on cpu_capacity Total core number of CPU logic memory_capacity Total memory (GB) gpu_capacity GPU card number storage_capacity Storage space (GB)
5. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the user resource usage table is as follows:
(table image not reproduced in the source)
6. the Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the role-based access control by Kubernetes gives access rights to namespaces that are operable by different users.
7. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the Kubernetes cluster includes the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, Container runtime.
8. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the score of each Node is calculated from the percentages of CPU, GPU, and memory resources remaining after allocation as follows:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
where Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i, and percent_mem_i are the percentages of CPU, GPU, and memory resources remaining on the i-th Node after allocation; and request_cpu, request_gpu, and request_mem are the requested CPU, GPU, and memory quantities.
CN202111460970.6A 2021-12-02 2021-12-02 Kubernetes-based multi-tenant machine learning task resource scheduling method Pending CN114356543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111460970.6A CN114356543A (en) 2021-12-02 2021-12-02 Kubernetes-based multi-tenant machine learning task resource scheduling method

Publications (1)

Publication Number Publication Date
CN114356543A true CN114356543A (en) 2022-04-15

Family

ID=81096598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111460970.6A Pending CN114356543A (en) 2021-12-02 2021-12-02 Kubernetes-based multi-tenant machine learning task resource scheduling method

Country Status (1)

Country Link
CN (1) CN114356543A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661482A (en) * 2022-05-25 2022-06-24 成都索贝数码科技股份有限公司 GPU computing power management method, medium, equipment and system
CN115098238A (en) * 2022-07-07 2022-09-23 北京鼎成智造科技有限公司 Application program task scheduling method and device
CN115098238B (en) * 2022-07-07 2023-05-05 北京鼎成智造科技有限公司 Application program task scheduling method and device
CN115237608A (en) * 2022-09-21 2022-10-25 之江实验室 Multi-mode scheduling system and method based on unified computing power of multiple clusters
CN115373764A (en) * 2022-10-27 2022-11-22 中诚华隆计算机技术有限公司 Automatic container loading method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination