CN114356543A - Kubernetes-based multi-tenant machine learning task resource scheduling method - Google Patents
- Publication number
- CN114356543A (application number CN202111460970.6A)
- Authority
- CN
- China
- Prior art keywords
- node
- resource
- gpu
- cpu
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a Kubernetes-based multi-tenant machine learning task resource scheduling method. The method performs quota management on the computing resources available to different users and monitors the resource state of the nodes in a Kubernetes platform, taking the resource utilization of the host on which each node runs into account so as to avoid inaccurate scheduling results. By monitoring real-time scheduling and pre-scheduling request demands, the nodes are priority-ranked according to the resource demands of the task to be scheduled, the host label of the optimal node is obtained, and the resource demands of various machine learning model training and prediction tasks are reasonably allocated according to that label. The invention effectively prevents and reduces skew in node resource usage within the Kubernetes platform, achieves load balancing across multiple nodes, and improves node resource utilization.
Description
Technical Field
The invention relates to a Kubernetes-based multi-tenant machine learning task resource scheduling method, and belongs to the technical field of power regulation and control.
Background
At present, artificial intelligence technology in the field of power grid regulation and control has achieved preliminary results, but computing-power resource management still faces the problems of dispersed computing power and constrained applications: each application deploys its AI development and operation environment in a siloed ("chimney") fashion, leading to repeated construction of underlying hardware resources, scattered computing power, and difficulty in scaling.
The IaaS layer of a cloud computing platform mainly uses virtualization technology to achieve multi-tenant resource isolation and dynamic allocation. However, traditional virtualization has a high hardware-resource overhead and is ill-suited to the high-utilization computing scenarios of machine learning model training and prediction tasks; moreover, the complexity of application configuration, operation, and management is high, which hinders unified cluster-level management.
Kubernetes provides automatic orchestration, deployment, and resource scheduling for services and is popular with developers. The present method performs custom orchestration and scheduling of resources based on Kubernetes, supports the product development of an AI application development and service support platform within a new-generation dispatching technology support system, and is used for resource scheduling of machine learning training and prediction tasks such as power grid fault identification and analysis, power grid operation prediction and analysis, and intelligent dispatching decision support; application results have verified the technical route and reliability of the method.
Disclosure of Invention
Purpose: in order to overcome the defects of the prior art, the invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method that uses Kubernetes and container technology to uniformly regulate the CPU, GPU, and memory resources of the IaaS layer, constructs a standardized application runtime environment for multi-tenant machine learning model training and prediction, and improves the controllability, elastic scaling capability, and resource isolation capability of the power grid regulation and control system.
Technical scheme: in order to solve the above technical problems, the invention adopts the following technical scheme:
a Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
and calculating the difference value of the used resources of the Node nodes in the cluster and the used resources of the created containers to obtain the resource information occupied by all processes of the Node operating system.
And calling a Kubernetes API to acquire resource information applied by all machine learning models and prediction task containers on the Node.
And subtracting the resource information occupied by all processes of the Node operating system and the resource information applied by all machine learning model training and prediction task containers on the Node from the inherent resource capacity of the Node, and calculating the real-time available resource information of the Node.
And calculating the availability ratios of the CPU, the GPU and the memory of the Node according to the real-time available resource information of the Node and the inherent resource capacity of the Node.
The Node nodes with the availability ratios of the Node CPU, the GPU and the memory not lower than the preset resource threshold percentage allocate computing resources for the machine learning model training and predicting tasks.
And the machine learning task scheduling service sends the quantities of CPU, GPU and memory resources applied by the machine learning model training and prediction tasks of different users to the system cluster resource management and control service.
The system cluster resource management and control service obtains the user-applicable residual resources by calculating the resource difference value of the multi-tenant resource quota table and the user resource use condition table, and checks whether the number of CPUs (central processing units), GPUs (graphic processing units) and memories applied by the machine learning model training and prediction tasks exceeds the number of the user-applicable residual resources.
And selecting the Node nodes which do not exceed the amount of the residual resources which can be applied by the user, and calculating the difference value of the real-time available resource information of the Node nodes and the amount of the applied CPU, GPU and memory by the system cluster resource management and control service, and dividing the difference value by the inherent resource capacity of the Node nodes to obtain the percentage of the residual resources of the CPU, GPU and memory after the resources are distributed.
Selecting Node nodes with the percentage of the resources left by the CPU, the GPU and the memory after the resources are distributed being larger than the preset threshold percentage of the resources, carrying out score calculation on the percentage of the resources left by the CPU, the GPU and the memory after the resources are distributed of each Node, and sequencing according to the score from large to small.
And the Node ordered in the sequence of the system cluster resource management and control service is the optimal Node, the Node name of the optimal Node is returned to the machine learning task scheduling service, and persistent storage is carried out in the user resource use condition table.
And dynamically generating a Kubernets yaml file by the machine learning task scheduling service, and calling a Kubernets API to create a container in the optimal node to run a machine learning model training and predicting task.
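The filtering, scoring, and selection steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation; the function and field names (`pick_optimal_node`, `cpu_avail`, etc.) are hypothetical.

```python
# Minimal sketch of the node ranking described above (hypothetical names;
# the actual cluster services and data structures are not specified here).

def pick_optimal_node(nodes, req_cpu, req_gpu, req_mem, threshold):
    """nodes: list of dicts holding real-time available resources and totals."""
    scored = []
    for n in nodes:
        # Percentage of each resource left if the request were placed here.
        p_cpu = (n["cpu_avail"] - req_cpu) / n["cpu_total"]
        p_gpu = (n["gpu_avail"] - req_gpu) / n["gpu_total"]
        p_mem = (n["mem_avail"] - req_mem) / n["mem_total"]
        # Keep only nodes whose remaining percentages exceed the threshold.
        if min(p_cpu, p_gpu, p_mem) <= threshold:
            continue
        # The score weights each remaining percentage by the requested amount.
        score = req_cpu * p_cpu + req_gpu * p_gpu + req_mem * p_mem
        scored.append((score, n["name"]))
    if not scored:
        return None  # scheduling fails
    scored.sort(reverse=True)  # descending by score; first entry is optimal
    return scored[0][1]

nodes = [
    {"name": "node-1", "cpu_total": 32, "cpu_avail": 16,
     "gpu_total": 4, "gpu_avail": 2, "mem_total": 128, "mem_avail": 64},
    {"name": "node-2", "cpu_total": 32, "cpu_avail": 28,
     "gpu_total": 4, "gpu_avail": 4, "mem_total": 128, "mem_avail": 100},
]
best = pick_optimal_node(nodes, req_cpu=4, req_gpu=1, req_mem=16, threshold=0.1)
print(best)
```

With the sample figures, node-2 retains the larger post-allocation percentages on every resource and is selected as the optimal node.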
As a preferred scheme, a CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster.

As a preferred scheme, the inherent resource capacity of the Nodes in the cluster is logically partitioned and isolated as a virtual resource pool, using the user ID as the namespace in Kubernetes.
As a preferred solution, the multi-tenant resource quota table is as follows:
preferably, the user resource usage table is as follows:
preferably, the role-based access control by Kubernetes gives access rights to namespaces operable by different users.
Preferably, the Kubernetes cluster comprises the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, Container runtime.
As a preferred scheme, the score of each Node is calculated from the percentages of CPU, GPU, and memory resources remaining after allocation as follows:

Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i

wherein Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i and percent_mem_i are respectively the percentages of CPU, GPU, and memory resources remaining on the i-th Node after allocation; and request_cpu, request_gpu and request_mem are respectively the requested quantities of CPU, GPU, and memory.
Beneficial effects: the invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method that performs quota management on the computing resources available to different users and monitors the resource state of the nodes in a Kubernetes platform, taking the resource utilization of the host on which each node runs into account to avoid inaccurate scheduling results. By monitoring real-time scheduling and pre-scheduling request demands, the nodes are priority-ranked according to the resource demands of the task to be scheduled, the host label of the optimal node is obtained, and the resource demands of various machine learning model training and prediction tasks are reasonably allocated according to that label, effectively preventing and reducing skew in node resource usage within the Kubernetes platform, achieving load balancing across multiple nodes, and improving node resource utilization.
Drawings
FIG. 1 is a schematic diagram of cluster resource multi-tenant management in an example of the invention.
Fig. 2 is a schematic diagram of the Kubernetes cluster resource management architecture in an embodiment of the present invention.
FIG. 3 is a flow chart of machine learning training and prediction task creation in an embodiment of the invention.
Detailed Description
The present invention will be further described with reference to the following examples.
A Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
1) Calculate the difference between the resources used by each Node (node_cpu_used_i, node_gpu_used_i and node_mem_used_i) and the resources used by the created containers (pod_cpu_used_i, pod_gpu_used_i and pod_mem_used_i), obtaining the resources occupied by all processes of the node operating system.

2) Obtain, by calling the Kubernetes API, the resources requested by all machine learning model training and prediction task containers on the node (pod_cpu_req_i, pod_gpu_req_i and pod_mem_req_i).

3) Subtract both values from the Node's inherent resource capacity (node_cpu_total_i, node_gpu_total_i and node_mem_total_i) to calculate the Node's real-time available resources (node_cpu_i, node_gpu_i and node_mem_i).
4) Calculating the availability ratios of the CPU, the GPU and the memory of each Node through the following formula:
percent_cpu_i = node_cpu_i / node_cpu_total_i

percent_gpu_i = node_gpu_i / node_gpu_total_i

percent_mem_i = node_mem_i / node_mem_total_i
5) The system cluster resource management service allocates computing resources for machine learning model training and prediction tasks only on Nodes whose availability ratios are not lower than the preset resource threshold percentage, ensuring that no node becomes overloaded.
6) The machine learning task scheduling service sends the quantities of CPU, GPU, and memory resources (request_cpu, request_gpu and request_mem) required by different users' machine learning model training and prediction tasks to the system cluster resource management service.
7) The system cluster resource management service obtains the remaining resources the user may request by computing the resource difference between the multi-tenant resource quota table and the user resource usage table, and verifies whether the requested request_cpu, request_gpu, and request_mem of the training or prediction task exceed those remaining resources.
8) The system cluster resource management service subtracts the requested request_cpu, request_gpu, and request_mem from each Node's real-time available resources (node_cpu_i, node_gpu_i and node_mem_i) and divides the differences by the Node's inherent resource capacity, obtaining the percentage of resources remaining after allocation.
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i

percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i

percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining percentage is smaller than the preset resource threshold percentage are filtered out; scores are then computed from the remaining Nodes' post-allocation percentages, and the Nodes are sorted by score.
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
9) The system cluster resource management service selects the highest-scoring Node from the ranking as the optimal Node, returns its node name to the machine learning task scheduling service, and persists it in the user resource usage table.
10) The machine learning task scheduling service dynamically generates a Kubernetes yaml file and calls the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
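Step 10 can be sketched as follows. The manifest layout is an assumption for illustration (the patent does not reproduce the generated yaml), and the label key `scheduler/optimal-node` is hypothetical; the actual API submission, shown in a comment, would use the official Kubernetes Python client against a live cluster.

```python
# Sketch of step 10: build a pod manifest pinned to the optimal node's
# label and submit it via the Kubernetes API. Field layout and the label
# key below are assumptions for illustration only.

def build_pod_manifest(task_name, user_id, node_label, image,
                       req_cpu, req_gpu, req_mem_gb):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name, "namespace": user_id},
        "spec": {
            # Per the embodiment, the pod is started on the node carrying
            # the optimal node's unique label.
            "nodeSelector": {"scheduler/optimal-node": node_label},
            "containers": [{
                "name": task_name,
                "image": image,
                "resources": {"limits": {
                    "cpu": str(req_cpu),
                    "memory": f"{req_mem_gb}Gi",
                    "nvidia.com/gpu": str(req_gpu),
                }},
            }],
        },
    }

manifest = build_pod_manifest("train-job-1", "user-42", "node-2",
                              "ml-train:latest", 4, 1, 16)
# With a live cluster, the scheduling service would then call, e.g.:
#   from kubernetes import client, config
#   config.load_kube_config()
#   client.CoreV1Api().create_namespaced_pod("user-42", manifest)
print(manifest["spec"]["nodeSelector"])
```

Using the tenant's user ID as the pod's namespace mirrors the namespace-per-user isolation described in the embodiment.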
The invention aims to uniformly regulate the CPU, GPU, memory, and storage resources of the IaaS layer using Kubernetes and container technology, construct a standardized application runtime environment for multi-tenant machine learning model training and prediction, and improve the controllability, elastic scaling, and resource isolation capability of the power grid system. The technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings:
Label management is performed according to the available CPU, GPU, memory, and storage resources of the different Nodes in the cluster; the cluster resources are integrated into one resource pool through Kubernetes, and the virtual resource pool is logically partitioned and isolated using the user ID as the Kubernetes Namespace, as shown in fig. 1.
A system administrator allocates the required resources to different users through a cluster multi-tenant resource management interface tool, and the information is persisted in a multi-tenant resource quota table, as shown in Table 1. Kubernetes role-based access control (RBAC) grants different users access rights to the namespaces they may operate, preventing users' resource usage from interfering with one another.
A CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster, and the available resources and availability ratios of all Nodes are calculated from the collected information.
TABLE 1 Multi-tenant resource quota table

| Field name | Use |
|---|---|
| user_id | Unique user ID, consistent with the authority and single sign-on systems |
| cpu_capacity | Total number of logical CPU cores |
| memory_capacity | Total memory capacity (GB) |
| gpu_capacity | Number of GPU cards |
| storage_capacity | Storage space (GB) |
TABLE 2 User resource usage table
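Using the field names of Table 1, the quota check of step 7 can be sketched as follows. The layout of the usage table is an assumption (its body is not reproduced here); the tables are modeled as plain dictionaries, and `within_quota` is a hypothetical name.

```python
# Quota-check sketch using Table 1's field names. The usage table is
# assumed to mirror the quota table's fields; illustrative only.

quota_table = {
    "user-42": {"cpu_capacity": 16, "gpu_capacity": 2,
                "memory_capacity": 64, "storage_capacity": 500},
}
usage_table = {
    "user-42": {"cpu_capacity": 10, "gpu_capacity": 1,
                "memory_capacity": 32, "storage_capacity": 100},
}

def within_quota(user_id, req_cpu, req_gpu, req_mem_gb):
    quota, used = quota_table[user_id], usage_table[user_id]
    # Remaining resources the user may still request (quota minus usage).
    remaining = {k: quota[k] - used[k] for k in quota}
    return (req_cpu <= remaining["cpu_capacity"]
            and req_gpu <= remaining["gpu_capacity"]
            and req_mem_gb <= remaining["memory_capacity"])

print(within_quota("user-42", 4, 1, 16))  # 4 <= 6, 1 <= 1, 16 <= 32
```

A request that passes this check proceeds to node selection; one that fails is rejected before any node is considered.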
Fig. 2 is a schematic diagram of the Kubernetes cluster resource management architecture in this embodiment. The Kubernetes cluster consists of 2 Master nodes and 6 Nodes. The Master node is the main control unit of the cluster, chiefly responsible for scheduling and management; to cope with growing project demands and access volume, this embodiment builds a high-availability dual-Master configuration. The Nodes are workload nodes that mainly run the containers of business applications and comprise two groups, a CPU cluster and a GPU cluster: the CPU cluster mainly creates conventional pod tasks, while the GPU cluster mainly creates image-related pod tasks. This dual-cluster arrangement lets the applications deployed on the Nodes run more reasonably and efficiently.
The Kubernetes cluster mainly consists of seven components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, and the container runtime. The scheduling strategy of the invention mainly operates in the Scheduler and computes evaluation scores of real-time and timed tasks on the nodes. The evaluation score considers two aspects: the actual resource usage of the Node, and the pod's preference regarding CPU, GPU, and memory resource demands. Finally, the scheduling strategy comprehensively evaluates each Node for real-time and timed tasks, selects the Node with the highest evaluation score as the target scheduling node, skips the Scheduler's pre-selection and optimization strategies, and creates the pod directly on the designated node by setting a unique label. Fig. 3 is a flowchart illustrating the creation of a pod task request in this embodiment; the specific procedure is as follows:
Step 1: obtain the CPU, GPU, and memory usage of the host on which each Node runs in the Kubernetes platform, together with the Node's CPU, GPU, and memory usage and request allocation information, and calculate each Node's available resources and availability ratios from the collected information;

First, the difference between the host's and the pods' used resources is calculated to obtain the host's resource usage outside the pod containers; next, the pods' actual resource allocation is obtained and combined with the host usage outside the pods to calculate the Node's actual availability. The actual available CPU, GPU, and memory resources of all Nodes are calculated by the following formulas:
node_cpu_i = node_cpu_total_i - (host_cpu_used_i - pod_cpu_used_i) - pod_cpu_req_i

node_mem_i = node_mem_total_i - (host_mem_used_i - pod_mem_used_i) - pod_mem_req_i

node_gpu_i = node_gpu_total_i - (host_gpu_used_i - pod_gpu_used_i) - pod_gpu_req_i
wherein node_cpu_i, node_mem_i and node_gpu_i are the actual available CPU, memory, and GPU resources of the Node; node_cpu_total_i, node_mem_total_i and node_gpu_total_i are the Node's total configured CPU, memory, and GPU resources; host_cpu_used_i, host_mem_used_i and host_gpu_used_i are the host usage of the Node's CPU, memory, and GPU; pod_cpu_used_i, pod_mem_used_i and pod_gpu_used_i are the pod usage of the Node's CPU, memory, and GPU; and pod_cpu_req_i, pod_mem_req_i and pod_gpu_req_i are the existing pods' resource requests for the Node's CPU, memory, and GPU;
The availability ratios of each Node's CPU, GPU, and memory are calculated by the following formulas:

percent_cpu_i = node_cpu_i / node_cpu_total_i

percent_mem_i = node_mem_i / node_mem_total_i

percent_gpu_i = node_gpu_i / node_gpu_total_i
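The formulas of step 1 can be checked with a short numeric sketch; the variable names follow the formulas above, and the sample figures are invented for illustration.

```python
# Numeric sketch of the available-resource and availability-ratio formulas.
# All sample figures are invented.

node_cpu_total = 32.0   # node_cpu_total_i: configured CPU cores
host_cpu_used = 6.0     # host_cpu_used_i: usage by all host processes
pod_cpu_used = 4.0      # pod_cpu_used_i: portion attributable to pods
pod_cpu_req = 12.0      # pod_cpu_req_i: CPU requested by existing pods

# OS overhead outside the pods = host usage minus pod usage.
os_overhead = host_cpu_used - pod_cpu_used             # 2.0 cores
# Real-time available CPU per the formula above.
node_cpu = node_cpu_total - os_overhead - pod_cpu_req  # 18.0 cores
# Availability ratio used by the threshold filter in step 2.
percent_cpu = node_cpu / node_cpu_total                # 0.5625

print(node_cpu, percent_cpu)
```

The GPU and memory ratios follow the same pattern with their respective totals, host usage, pod usage, and pod requests.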
Step 2: compare each Node's CPU, GPU, and memory availability ratios with the preset threshold. If any ratio of a Node is below the specified threshold, the Node is overloaded and is filtered out. If the number of remaining nodes is 0, scheduling fails; if it is greater than 0, proceed to step 3;
Step 3: obtain, through the K8s scheduler, the real-time and timed task pods' requests for CPU, GPU, and memory resources, namely request_cpu, request_gpu, and request_mem, together with the user ID; look up the current user's remaining resources in the table by user ID, and judge by comparison whether pod creation can continue. If not, scheduling fails; if so, continue to the next step;
Step 4: compare the task resource requests obtained in step 3 with the Nodes' available resources and filter out Nodes with insufficient CPU, GPU, or memory resources. If the number of remaining nodes is 0, scheduling fails; if it equals 1, that Node is set as the host of the pod to be created; if it is greater than 1, continue to the next step;
Step 5: score the filtered Nodes; for each Node, calculate the percentage of resources that would remain after the requested task's resources are allocated, by the following formulas.
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i

percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i

percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining resource percentage is below the reserved-resource threshold percentage are excluded; the post-allocation remaining CPU, GPU, and memory percentages of all remaining nodes are then accumulated into scores and sorted.
Each Node is priority-ranked and the optimal node is determined from the ranking: if the number of nodes is 1, that Node is the optimal node and its label is obtained; if the number is greater than 1, the optimal Node is selected according to the ranking and its label is obtained. Finally, the pod is started with the label specified in the machine learning task's yaml file.
The above description presents only preferred embodiments of the invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.
Claims (8)
1. A Kubernetes-based multi-tenant machine learning task resource scheduling method is characterized by comprising the following steps: the method comprises the following steps:
calculating the difference value of the used resources of the Node nodes in the cluster and the used resources of the created containers to obtain the resource information occupied by all processes of the Node operating system;
calling a Kubernetes API to acquire resource information applied by all machine learning models and prediction task containers on Node nodes;
subtracting resource information occupied by all processes of a Node operating system and resource information applied by all machine learning model training and prediction task containers on the Node from the inherent resource capacity of the Node, and calculating real-time available resource information of the Node;
calculating the availability ratios of a Node CPU, a GPU and a memory according to the real-time available resource information of the Node and the inherent resource capacity of the Node;
the Node nodes with the availability ratios of Node CPUs, GPUs and memories not lower than the preset resource threshold percentage allocate computing resources for the machine learning model training and predicting tasks;
the machine learning task scheduling service sends the quantities of CPU, GPU and memory resources applied by machine learning model training and prediction tasks of different users to the system cluster resource management and control service;
the system cluster resource management and control service calculates resource difference values of a multi-tenant resource quota table and a user resource use condition table to obtain user-applicable residual resources, and checks whether the quantity of CPUs (central processing units), GPUs (graphic processing units) and memories applied by the machine learning model training and prediction tasks exceeds the quantity of the user-applicable residual resources or not;
selecting Node nodes which do not exceed the amount of the residual resources which can be applied by the user, calculating the difference value of the real-time available resource information of the Node nodes and the amount of the applied CPU, GPU and memory by the system cluster resource management and control service, and dividing the difference value by the inherent resource capacity of the Node nodes to obtain the percentage of the residual resources of the CPU, GPU and memory after the resources are distributed;
selecting Node nodes with the percentage of the residual resources of the CPU, the GPU and the memory after the resources are distributed being larger than the preset resource threshold percentage, carrying out score calculation on the percentage of the residual resources of the CPU, the GPU and the memory after the resources are distributed of each Node, and sequencing according to the score from large to small;
the Node ordered in the sequence of the system cluster resource management and control service is the optimal Node, the Node name of the optimal Node is returned to the machine learning task scheduling service, and persistent storage is carried out in the user resource use condition table;
and dynamically generating a Kubernets yaml file by the machine learning task scheduling service, and calling a Kubernets API to create a container in the optimal node to run a machine learning model training and predicting task.
2. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: and each Kubernetes Node in the cluster is provided with a CPU, a GPU and a memory use condition acquisition program.
3. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the inherent resource capacity of the Node nodes in the cluster takes the user ID as a name space in Kubernetes to logically divide and isolate the virtual resource pool.
4. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the multi-tenant resource quota table is as follows:
6. the Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the role-based access control by Kubernetes gives access rights to namespaces that are operable by different users.
7. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the Kubernetes cluster includes the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, Container runtime.
8. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the method for calculating the score from the percentages of CPU, GPU, and memory resources remaining on each Node after resource allocation is as follows:

Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i

wherein Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i and percent_mem_i are respectively the percentages of CPU, GPU, and memory resources remaining on the i-th Node after resource allocation; and request_cpu, request_gpu and request_mem are respectively the requested quantities of CPU, GPU, and memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111460970.6A CN114356543A (en) | 2021-12-02 | 2021-12-02 | Kubernetes-based multi-tenant machine learning task resource scheduling method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114356543A (en) | 2022-04-15
Family
ID=81096598
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114661482A (en) * | 2022-05-25 | 2022-06-24 | 成都索贝数码科技股份有限公司 | GPU computing power management method, medium, equipment and system |
CN115098238A (en) * | 2022-07-07 | 2022-09-23 | 北京鼎成智造科技有限公司 | Application program task scheduling method and device |
CN115098238B (en) * | 2022-07-07 | 2023-05-05 | 北京鼎成智造科技有限公司 | Application program task scheduling method and device |
CN115237608A (en) * | 2022-09-21 | 2022-10-25 | 之江实验室 | Multi-mode scheduling system and method based on unified computing power of multiple clusters |
CN115373764A (en) * | 2022-10-27 | 2022-11-22 | 中诚华隆计算机技术有限公司 | Automatic container loading method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200287961A1 (en) | Balancing resources in distributed computing environments | |
CN114356543A (en) | Kubernetes-based multi-tenant machine learning task resource scheduling method | |
US20070169127A1 (en) | Method, system and computer program product for optimizing allocation of resources on partitions of a data processing system | |
CN105446816B (en) | A kind of energy optimization dispatching method towards heterogeneous platform | |
CN107346264A (en) | A kind of method, apparatus and server apparatus of virtual machine load balance scheduling | |
CN110221920B (en) | Deployment method, device, storage medium and system | |
CN111966500A (en) | Resource scheduling method and device, electronic equipment and storage medium | |
CN106020934A (en) | Optimized deploying method based on virtual cluster online migration | |
CN104679594B (en) | A kind of middleware distributed computing method | |
WO2021180092A1 (en) | Task dispatching method and apparatus | |
CN114968601B (en) | Scheduling method and scheduling system for AI training jobs with resources reserved in proportion | |
CN111459684A (en) | Cloud computing resource fusion scheduling management method, system and medium for multiprocessor architecture | |
Mylavarapu et al. | An optimized capacity planning approach for virtual infrastructure exhibiting stochastic workload | |
CN112559122A (en) | Virtualization instance management and control method and system based on electric power special security and protection equipment | |
CN114968566A (en) | Container scheduling method and device under shared GPU cluster | |
CN108694083B (en) | Data processing method and device for server | |
CN113391914A (en) | Task scheduling method and device | |
CN107203256B (en) | Energy-saving distribution method and device under network function virtualization scene | |
Emara et al. | Genetic-Based Multi-objective Task Scheduling Algorithm in Cloud Computing Environment. | |
CN112416520B (en) | Intelligent resource scheduling method based on vSphere | |
CN113672391A (en) | Parallel computing task scheduling method and system based on Kubernetes | |
CN111966447A (en) | Container placing method based on double-row genetic algorithm | |
CN109298949B (en) | Resource scheduling system of distributed file system | |
CN116302327A (en) | Resource scheduling method and related equipment | |
EP4206915A1 (en) | Container creation method and apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||