CN114356543A - Kubernetes-based multi-tenant machine learning task resource scheduling method - Google Patents


Info

Publication number
CN114356543A
CN114356543A
Authority
CN
China
Prior art keywords
node
resource
gpu
cpu
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111460970.6A
Other languages
Chinese (zh)
Inventor
杨立波
王宇冬
马斌
李一鹏
栗维勋
袁龙
李昊
季学纯
孙云枫
李佳阳
沈嘉灵
徐丽燕
胡锐锋
劳莹莹
陈子韵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hebei Electric Power Co Ltd
Nari Technology Co Ltd
State Grid Electric Power Research Institute
Original Assignee
State Grid Corp of China SGCC
State Grid Hebei Electric Power Co Ltd
Nari Technology Co Ltd
State Grid Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hebei Electric Power Co Ltd, Nari Technology Co Ltd, State Grid Electric Power Research Institute filed Critical State Grid Corp of China SGCC
Priority to CN202111460970.6A priority Critical patent/CN114356543A/en
Publication of CN114356543A publication Critical patent/CN114356543A/en
Pending legal-status Critical Current

Abstract

The invention discloses a Kubernetes-based multi-tenant machine learning task resource scheduling method. The method applies quota management to the computing resources available to different users and monitors the resource state of the nodes in a Kubernetes platform, taking into account the resource utilization of the host on which each node runs so as to avoid inaccurate scheduling results. By monitoring real-time scheduling and pre-scheduling request demands, the nodes are priority-ranked according to the resource demands of the task to be scheduled, the host label of the optimal Node is obtained, and the resource demands of the various machine learning model training and prediction tasks are allocated reasonably according to this label. The invention effectively prevents and reduces skewed node resource usage in the Kubernetes platform, achieves load balancing across multiple nodes, and improves node resource utilization.

Description

Kubernetes-based multi-tenant machine learning task resource scheduling method
Technical Field
The invention relates to a Kubernetes-based multi-tenant machine learning task resource scheduling method, and belongs to the technical field of power regulation and control.
Background
At present, artificial intelligence technology in the field of power grid regulation and control has achieved preliminary results, but computing-power resource management still faces the problems of dispersed computing power and constrained applications: each application deploys its own artificial intelligence development and operation environment in a siloed ("chimney") fashion, leading to repeated construction of underlying hardware resources, scattered computing power, and difficult expansion.
The IaaS layer of a cloud computing platform mainly uses virtualization technology to achieve multi-tenant resource isolation and dynamic allocation. However, traditional virtualization occupies a high share of hardware resources and is therefore ill-suited to the high-utilization computing scenarios of machine learning model training and prediction tasks; moreover, the complexity of application configuration, operation, and management is high, which hinders unified cluster-level management.
Kubernetes offers automatic orchestration, deployment, and resource scheduling for services and is popular with developers. The present method performs custom orchestration and scheduling of resources on top of Kubernetes, supports product development of the artificial intelligence application development and service support platform in the new-generation dispatching technology support system, and is used for machine learning training and resource scheduling of prediction tasks such as power grid fault identification and analysis, power grid operation prediction and analysis, and intelligent power grid dispatching decision support; application results have verified the technical route and reliability of the method.
Disclosure of Invention
The purpose is as follows: to overcome the defects of the prior art, the invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method that uses Kubernetes and container technology to uniformly manage the CPU, GPU, and memory resources of the IaaS layer, constructs a standardized runtime environment for multi-tenant machine learning model training and prediction applications, and improves the controllability, elastic scaling capability, and resource isolation capability of the power grid regulation and control system.
The technical scheme is as follows: to solve the above technical problems, the invention adopts the following technical scheme:
a Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
and calculating the difference value of the used resources of the Node nodes in the cluster and the used resources of the created containers to obtain the resource information occupied by all processes of the Node operating system.
And calling a Kubernetes API to acquire resource information applied by all machine learning models and prediction task containers on the Node.
And subtracting the resource information occupied by all processes of the Node operating system and the resource information applied by all machine learning model training and prediction task containers on the Node from the inherent resource capacity of the Node, and calculating the real-time available resource information of the Node.
And calculating the availability ratios of the CPU, the GPU and the memory of the Node according to the real-time available resource information of the Node and the inherent resource capacity of the Node.
The Node nodes with the availability ratios of the Node CPU, the GPU and the memory not lower than the preset resource threshold percentage allocate computing resources for the machine learning model training and predicting tasks.
And the machine learning task scheduling service sends the quantities of CPU, GPU and memory resources applied by the machine learning model training and prediction tasks of different users to the system cluster resource management and control service.
The system cluster resource management and control service obtains the user-applicable residual resources by calculating the resource difference value of the multi-tenant resource quota table and the user resource use condition table, and checks whether the number of CPUs (central processing units), GPUs (graphic processing units) and memories applied by the machine learning model training and prediction tasks exceeds the number of the user-applicable residual resources.
And selecting the Node nodes which do not exceed the amount of the residual resources which can be applied by the user, and calculating the difference value of the real-time available resource information of the Node nodes and the amount of the applied CPU, GPU and memory by the system cluster resource management and control service, and dividing the difference value by the inherent resource capacity of the Node nodes to obtain the percentage of the residual resources of the CPU, GPU and memory after the resources are distributed.
Selecting Node nodes with the percentage of the resources left by the CPU, the GPU and the memory after the resources are distributed being larger than the preset threshold percentage of the resources, carrying out score calculation on the percentage of the resources left by the CPU, the GPU and the memory after the resources are distributed of each Node, and sequencing according to the score from large to small.
And the Node ordered in the sequence of the system cluster resource management and control service is the optimal Node, the Node name of the optimal Node is returned to the machine learning task scheduling service, and persistent storage is carried out in the user resource use condition table.
And dynamically generating a Kubernets yaml file by the machine learning task scheduling service, and calling a Kubernets API to create a container in the optimal node to run a machine learning model training and predicting task.
As a preferred scheme, a CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster.
As a preferred scheme, the inherent resource capacities of the Node nodes in the cluster are integrated into a virtual resource pool, which is logically partitioned and isolated using the user ID as the namespace in Kubernetes.
As a preferred solution, the multi-tenant resource quota table is as follows:
Field name        Purpose
user_id           Unique user ID, consistent with permissions and single sign-on
cpu_capacity      Total number of logical CPU cores
memory_capacity   Total memory (GB)
gpu_capacity      Number of GPU cards
storage_capacity  Storage space (GB)
Preferably, the user resource usage table is as follows:
(table image not reproduced in the source)
preferably, the role-based access control by Kubernetes gives access rights to namespaces operable by different users.
Preferably, the Kubernetes cluster comprises the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, Container runtime.
As a preferred scheme, the score of each Node is calculated from the percentages of CPU, GPU, and memory resources remaining after allocation as follows:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
where Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i, and percent_mem_i are the percentages of CPU, GPU, and memory resources remaining on the i-th Node after allocation; and request_cpu, request_gpu, and request_mem are the CPU, GPU, and memory quantities requested by the task.
Beneficial effects: the invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method that applies quota management to the computing resources available to different users and monitors the resource state of the nodes in the Kubernetes platform, taking into account the resource utilization of the host on which each node runs so as to avoid inaccurate scheduling results. By monitoring real-time scheduling and pre-scheduling request demands, the nodes are priority-ranked according to the resource demands of the task to be scheduled, the host label of the optimal Node is obtained, and the resource demands of the various machine learning model training and prediction tasks are allocated reasonably according to this label, effectively preventing and reducing skewed node resource usage in the Kubernetes platform, achieving load balancing across multiple nodes, and improving node resource utilization.
Drawings
FIG. 1 is a schematic diagram of cluster resource multi-tenant management in an example of the invention.
Fig. 2 is a schematic diagram of a kubernets cluster resource management architecture in an embodiment of the present invention.
FIG. 3 is a flow chart of machine learning training and prediction task creation in an embodiment of the invention.
Detailed Description
The present invention will be further described with reference to the following examples.
A Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
1) Compute the difference between each Node's used resources (node_cpu_used_i, node_gpu_used_i, node_mem_used_i) and the resources used by the containers created on it (pod_cpu_used_i, pod_gpu_used_i, pod_mem_used_i) to obtain the resources occupied by all processes of the node operating system.
2) Call the Kubernetes API to obtain the resources requested by all machine learning model training and prediction task containers on the node (pod_cpu_req_i, pod_gpu_req_i, pod_mem_req_i).
3) Subtract both quantities from the Node's inherent resource capacity (node_cpu_total_i, node_gpu_total_i, node_mem_total_i) to obtain the Node's real-time available resources (node_cpu_i, node_gpu_i, node_mem_i).
4) Calculate the availability ratios of each Node's CPU, GPU, and memory by the following formulas:
percent_cpu_i = node_cpu_i / node_cpu_total_i
percent_gpu_i = node_gpu_i / node_gpu_total_i
percent_mem_i = node_mem_i / node_mem_total_i
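Steps 1)–4) can be sketched in Python as follows; the NodeStats container and its field names are illustrative assumptions standing in for the figures that the per-node collection program and the Kubernetes API would supply:

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Per-node figures gathered by the collection program (illustrative names)."""
    cpu_total: float; gpu_total: float; mem_total: float            # inherent capacity
    cpu_used: float; gpu_used: float; mem_used: float               # whole-node usage
    pod_cpu_used: float; pod_gpu_used: float; pod_mem_used: float   # usage by created containers
    pod_cpu_req: float; pod_gpu_req: float; pod_mem_req: float      # container requests

def available_resources(n: NodeStats):
    """Steps 1)-3): OS-process usage = node used minus container used;
    available = capacity minus OS-process usage minus container requests."""
    os_cpu = n.cpu_used - n.pod_cpu_used
    os_gpu = n.gpu_used - n.pod_gpu_used
    os_mem = n.mem_used - n.pod_mem_used
    node_cpu = n.cpu_total - os_cpu - n.pod_cpu_req
    node_gpu = n.gpu_total - os_gpu - n.pod_gpu_req
    node_mem = n.mem_total - os_mem - n.pod_mem_req
    return node_cpu, node_gpu, node_mem

def availability_ratios(n: NodeStats):
    """Step 4): percent_x_i = node_x_i / node_x_total_i."""
    cpu, gpu, mem = available_resources(n)
    return cpu / n.cpu_total, gpu / n.gpu_total, mem / n.mem_total
```

For example, a node with 64 CPU cores, of which 8 are consumed by host processes outside containers and 16 are requested by containers, has node_cpu = 40 and a CPU availability ratio of 0.625.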
5) Only the Node nodes whose CPU, GPU, and memory availability ratios are not lower than the preset resource threshold percentage are allowed by the system cluster resource management and control service to receive computing resources for machine learning model training and prediction tasks, ensuring that no node runs overloaded.
6) The machine learning task scheduling service sends the quantities of CPU, GPU, and memory resources (request_cpu, request_gpu, request_mem) required by different users' machine learning model training and prediction tasks to the system cluster resource management and control service.
7) The system cluster resource management and control service obtains the remaining resources the user may request by computing the difference between the multi-tenant resource quota table and the user resource usage table, and verifies whether the request_cpu, request_gpu, and request_mem requested by the machine learning model training and prediction task exceed that remainder.
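The quota check of step 7) is a per-user subtraction of the two tables. A minimal sketch, assuming the quota table (Table 1) and the user resource usage table have been loaded into dictionaries keyed by user_id (the in-memory layout and the sample values are assumptions; the patent persists these as database tables):

```python
# Quota table (Table 1) and usage table, keyed by user_id (illustrative values).
quota = {"u001": {"cpu": 32, "gpu": 4, "mem": 128}}
usage = {"u001": {"cpu": 20, "gpu": 1, "mem": 64}}

def check_quota(user_id, request_cpu, request_gpu, request_mem):
    """Return True if the request fits within the user's remaining quota."""
    q, u = quota[user_id], usage[user_id]
    remaining = {k: q[k] - u[k] for k in q}   # quota minus current usage
    return (request_cpu <= remaining["cpu"]
            and request_gpu <= remaining["gpu"]
            and request_mem <= remaining["mem"])
```

With the sample tables, user u001 has 12 CPU cores, 3 GPUs, and 64 GB of memory left, so a request of (8, 2, 32) passes while (16, 2, 32) is rejected.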
8) The system cluster resource management and control service subtracts the requested request_cpu, request_gpu, and request_mem from each Node's real-time available resources (node_cpu_i, node_gpu_i, node_mem_i) and divides the differences by the Node's inherent resource capacity, obtaining the percentage of each resource that would remain after allocation:
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i
percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i
percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining percentage falls below the preset resource threshold percentage are filtered out; the remaining nodes are then scored on their post-allocation percentages and sorted by score:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
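The filtering and scoring of step 8) can be sketched as follows; the node names and the 10% threshold are illustrative. Weighting each remaining percentage by the corresponding requested quantity biases the ranking toward the resources the task actually needs, which is the stated intent of the score formula:

```python
def rank_nodes(nodes, request_cpu, request_gpu, request_mem, threshold=0.10):
    """nodes: {name: (cpu_avail, gpu_avail, mem_avail, cpu_total, gpu_total, mem_total)}.
    Returns (name, score) pairs sorted by score, best first."""
    scored = []
    for name, (c, g, m, ct, gt, mt) in nodes.items():
        # Percentage of each resource left if this node ran the task.
        pc = (c - request_cpu) / ct
        pg = (g - request_gpu) / gt
        pm = (m - request_mem) / mt
        if min(pc, pg, pm) < threshold:   # would dip below the reserve: filter out
            continue
        score = request_cpu * pc + request_gpu * pg + request_mem * pm
        scored.append((name, score))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

For a task requesting 8 CPU cores, 1 GPU, and 32 GB of memory, a node with (40, 4, 160) available out of (64, 8, 256) scores 8×0.5 + 1×0.375 + 32×0.5 = 20.375.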
9) The system cluster resource management and control service selects the highest-scoring Node from the ordering as the optimal Node, returns its node name to the machine learning task scheduling service, and persists the allocation in the user resource usage table.
10) The machine learning task scheduling service dynamically generates a Kubernetes yaml file and calls the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
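Step 10) can be sketched by rendering a pod manifest pinned to the optimal node. The pod layout below is an illustrative assumption; the nodeSelector on the node's unique label (here the well-known kubernetes.io/hostname label) is what lets the pod land directly on the designated node, and in practice the rendered yaml would be submitted through the Kubernetes API (e.g. the official client's create_namespaced_pod):

```python
def render_pod_yaml(task_name, user_id, node_name, image,
                    request_cpu, request_gpu, request_mem_gb):
    """Render a minimal pod manifest pinned to the optimal node (illustrative sketch)."""
    return f"""apiVersion: v1
kind: Pod
metadata:
  name: {task_name}
  namespace: {user_id}          # the user ID doubles as the tenant namespace
spec:
  nodeSelector:
    kubernetes.io/hostname: {node_name}   # unique label of the optimal node
  containers:
  - name: task
    image: {image}
    resources:
      requests:
        cpu: "{request_cpu}"
        memory: "{request_mem_gb}Gi"
        nvidia.com/gpu: "{request_gpu}"
      limits:
        nvidia.com/gpu: "{request_gpu}"
"""
```

The nvidia.com/gpu extended resource is the standard way GPU cards are requested in Kubernetes; an assumed training image and task name stand in for the real workload.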
The invention aims to use Kubernetes and container technology to uniformly manage the CPU, GPU, memory, and storage resources of the IaaS layer, construct a standardized runtime environment for multi-tenant machine learning model training and prediction applications, and improve the controllability, elastic scaling capability, and resource isolation capability of the power grid regulation and control system. The technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings:
Tag-based management is performed according to the available CPU, GPU, memory, and storage resources of the different Node nodes in the cluster; Kubernetes integrates the cluster resources into one resource pool, and the virtual resource pool is logically partitioned and isolated using the user ID as the namespace (Namespace) in Kubernetes, as shown in Fig. 1.
A system administrator distributes required resource information for different users through a cluster multi-tenant resource management interface tool, and the information is stored persistently by adopting a multi-tenant resource quota table, as shown in table 1. The role-based access control (RBAC) of Kubernetes gives access rights to namespaces which can be operated by different users, and prevents resource usage among the users from interfering with each other.
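The RBAC isolation described above can be expressed as a Role/RoleBinding pair scoped to the user's namespace. The resource list and verbs below are illustrative assumptions; the API group and kinds are standard Kubernetes RBAC:

```python
def render_tenant_rbac(user_id):
    """Render a Role/RoleBinding pair confining a user to their own namespace
    (illustrative sketch; resources and verbs are assumptions)."""
    return f"""apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: {user_id}
  name: tenant-role
rules:
- apiGroups: ["", "batch"]
  resources: ["pods", "jobs"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: {user_id}
  name: tenant-binding
subjects:
- kind: User
  name: {user_id}
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-role
  apiGroup: rbac.authorization.k8s.io
"""
```

Because the Role is namespaced, a user bound this way cannot read or create pods in another tenant's namespace, which is the mutual non-interference the text describes.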
A CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster, and the available resources and availability ratios of all Node nodes are calculated from the collected information.
TABLE 1 Multi-tenant resource quota table
Field name        Purpose
user_id           Unique user ID, consistent with permissions and single sign-on
cpu_capacity      Total number of logical CPU cores
memory_capacity   Total memory (GB)
gpu_capacity      Number of GPU cards
storage_capacity  Storage space (GB)
TABLE 2 User resource usage table
(table image not reproduced in the source)
Fig. 2 is a schematic diagram of the Kubernetes cluster resource management architecture in this embodiment, in which the Kubernetes cluster consists of 2 Master nodes and 6 Node nodes. The Master node is the main control unit of the cluster, responsible for scheduling and managing the cluster; to cope with growth in project requirements and access volume, this embodiment builds a high-availability mode with dual Master nodes. Node nodes are workload nodes that run the containers of business applications; they form two clusters, a CPU cluster used mainly to create conventional pod tasks and a GPU cluster used mainly to create pod tasks involving image computation. This dual-cluster arrangement lets the applications deployed on the Node nodes run more rationally and efficiently.
The Kubernetes cluster consists mainly of seven components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, and Container runtime. The scheduling strategy of the invention acts mainly at the Scheduler and computes the evaluation score of real-time and timed tasks at each node. The evaluation score takes two aspects into account: the actual resource usage of the Node, and the pod's relative demand for CPU, GPU, and memory resources. Finally, the scheduling strategy evaluates each Node comprehensively for real-time and timed tasks, selects the Node with the highest evaluation score as the target scheduling node, skips the Scheduler's default pre-selection (predicate) and priority (scoring) policies, and creates the pod directly on the designated node by setting a unique label. Fig. 3 is a flowchart of the pod task creation request in this embodiment; the specific procedure is as follows:
step 1: acquiring the CPU, GPU and memory use information of a host machine where each Node is located in a Kubernetes platform, the use information and request distribution information of the CPU, GPU and memory of the Node, and respectively calculating the available resource condition and the available rate of each Node according to the acquired information;
firstly, calculating a difference value of the host machine and the pod used resources to obtain the resource use condition of the host machine outside the pod container; secondly, acquiring the actual resource allocation condition of the pod, and summing the use condition of the host outside the pod to calculate the actual available condition of the Node; the actual available resource conditions of the CPUs, GPUs and memories of all the Node nodes are calculated through the following formula:
node_cpu_i = node_cpu_total_i - (host_cpu_used_i - pod_cpu_used_i) - pod_cpu_req_i
node_mem_i = node_mem_total_i - (host_mem_used_i - pod_mem_used_i) - pod_mem_req_i
node_gpu_i = node_gpu_total_i - (host_gpu_used_i - pod_gpu_used_i) - pod_gpu_req_i
where node_cpu_i, node_mem_i, and node_gpu_i are the actually available CPU, memory, and GPU resources of the Node; node_cpu_total_i, node_mem_total_i, and node_gpu_total_i are the Node's total configured CPU, memory, and GPU resources; host_cpu_used_i, host_mem_used_i, and host_gpu_used_i are the host's CPU, memory, and GPU usage on the Node; pod_cpu_used_i, pod_mem_used_i, and pod_gpu_used_i are the pods' CPU, memory, and GPU usage on the Node; and pod_cpu_req_i, pod_mem_req_i, and pod_gpu_req_i are the pods' current CPU, memory, and GPU resource request allocations on the Node.
calculating the availability ratios of the CPU, the GPU and the memory of each Node through the following formula:
percent_cpui=node_cpui/node_cpu_totali
percent_memi=node_memi/node_mem_totali
percent_gpui=node_gpui/node_gpu_totali
step 2: comparing the CPU, GPU and memory availability of each Node with a preset threshold, if any Node is lower than the specified threshold, indicating that the Node is overloaded, filtering the Node, and if the number of the filtered nodes is 0, returning to scheduling failure; if the number of the filtered nodes is more than 0, continuing to carry out the step 3;
and step 3: acquiring request information of a real-time task and a timed task pod to CPU, GPU and memory resources through a K8s scheduler, wherein the request information is request _ CPU, request _ GPU, request _ mem and a user ID, searching a table according to the user ID to acquire the residual information of the current user resource, judging whether to support continuous pod creation or not through comparison, if not, returning to the scheduling failure, and if so, continuing the next step;
and 4, step 4: comparing the task resource request information acquired in the step 3 with available resources of the Node nodes, filtering the Node nodes with insufficient CPU, GPU and memory resources, if the number of the filtered nodes is 0, returning to scheduling failure, if the number of the filtered nodes is equal to 1, setting the Node nodes as hosts of the pod to be created, and if the number of the filtered nodes is more than 1, continuing to perform the next step;
and 5: and scoring the filtered Node nodes, and calculating the percentage of the resources left after the resources are distributed by each Node by the request task according to the following formula.
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i
percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i
percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining percentage is below the reserved-resource threshold percentage are excluded; the post-allocation remaining percentages of CPU, GPU, and memory of all remaining nodes are then accumulated into scores and sorted.
The Node nodes are priority-ranked, and the number of candidate optimal nodes is determined from the ordering. If only one node remains, it is the optimal Node and its label is obtained; if more than one remains, the optimal Node is selected according to the ordering and its label is obtained. Finally, the pod is started with this label specified in the machine learning task's yaml file.
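Steps 1–5 can be combined into a single end-to-end scheduling decision. The sketch below reuses the formulas of this section; the node table, remaining-quota table, and thresholds are illustrative assumptions, and each None return corresponds to a "scheduling failure" branch in the flow above:

```python
def schedule(nodes, remaining_quota, user_id, req_cpu, req_gpu, req_mem,
             overload_threshold=0.10, reserve_threshold=0.10):
    """Return the name of the optimal node, or None on scheduling failure.
    nodes: {name: (cpu_avail, gpu_avail, mem_avail, cpu_total, gpu_total, mem_total)}
    remaining_quota: {user_id: (cpu, gpu, mem)} remaining per-user quota."""
    # Step 2: drop overloaded nodes (any availability ratio below the threshold).
    live = {n: r for n, r in nodes.items()
            if min(r[0] / r[3], r[1] / r[4], r[2] / r[5]) >= overload_threshold}
    if not live:
        return None
    # Step 3: per-user quota check.
    qc, qg, qm = remaining_quota[user_id]
    if req_cpu > qc or req_gpu > qg or req_mem > qm:
        return None
    # Step 4: drop nodes with insufficient free resources.
    live = {n: r for n, r in live.items()
            if r[0] >= req_cpu and r[1] >= req_gpu and r[2] >= req_mem}
    if not live:
        return None
    # Step 5: keep nodes above the reserve threshold after allocation, score, pick best.
    best, best_score = None, float("-inf")
    for n, (c, g, m, ct, gt, mt) in live.items():
        pc, pg, pm = (c - req_cpu) / ct, (g - req_gpu) / gt, (m - req_mem) / mt
        if min(pc, pg, pm) < reserve_threshold:
            continue
        score = req_cpu * pc + req_gpu * pg + req_mem * pm
        if score > best_score:
            best, best_score = n, score
    return best
```

The pod would then be created pinned to the returned node's label, as described above.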
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (8)

1. A Kubernetes-based multi-tenant machine learning task resource scheduling method, characterized by comprising the following steps:
calculating the difference between the resources used by each Node in the cluster and the resources used by the containers created on it, to obtain the resource information occupied by all processes of the Node operating system;
calling the Kubernetes API to obtain the resource information requested by all machine learning model training and prediction task containers on the Node;
subtracting, from the Node's inherent resource capacity, the resources occupied by all Node operating system processes and the resources requested by all machine learning model training and prediction task containers on the Node, to calculate the Node's real-time available resource information;
calculating the availability ratios of the Node's CPU, GPU, and memory from its real-time available resource information and its inherent resource capacity;
allocating computing resources for machine learning model training and prediction tasks only on Node nodes whose CPU, GPU, and memory availability ratios are not lower than the preset resource threshold percentage;
the machine learning task scheduling service sending the quantities of CPU, GPU, and memory resources requested by different users' machine learning model training and prediction tasks to the system cluster resource management and control service;
the system cluster resource management and control service obtaining the remaining resources the user may request by computing the difference between the multi-tenant resource quota table and the user resource usage table, and checking whether the CPU, GPU, and memory quantities requested by the machine learning model training and prediction task exceed that remainder;
for the Node nodes passing this check, the system cluster resource management and control service subtracting the requested CPU, GPU, and memory quantities from each Node's real-time available resources and dividing the differences by the Node's inherent resource capacity, to obtain the percentages of CPU, GPU, and memory resources remaining after allocation;
among the Node nodes whose post-allocation remaining percentages of CPU, GPU, and memory all exceed the preset resource threshold percentage, computing a score for each Node from those percentages and sorting the nodes by score in descending order;
the system cluster resource management and control service taking the first Node in the ordering as the optimal Node, returning its node name to the machine learning task scheduling service, and persisting the allocation in the user resource usage table; and
the machine learning task scheduling service dynamically generating a Kubernetes yaml file and calling the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
2. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: a CPU, GPU, and memory usage collection program is deployed on each Kubernetes Node in the cluster.
3. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the inherent resource capacities of the Node nodes in the cluster are integrated into a virtual resource pool that is logically partitioned and isolated using the user ID as the namespace in Kubernetes.
4. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the multi-tenant resource quota table is as follows:
name of field Use of user_id User unique ID consistent with authority and single sign-on cpu_capacity Total core number of CPU logic memory_capacity Total memory (GB) gpu_capacity GPU card number storage_capacity Storage space (GB)
5. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the user resource usage table is as follows:
(table image not reproduced in the source)
6. the Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the role-based access control by Kubernetes gives access rights to namespaces that are operable by different users.
7. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the Kubernetes cluster includes the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, Container runtime.
8. The Kubernetes-based multi-tenant machine learning task resource scheduling method according to claim 1, characterized in that: the score of each Node is calculated from the percentages of CPU, GPU, and memory resources remaining after allocation as follows:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
where Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i, and percent_mem_i are the percentages of CPU, GPU, and memory resources remaining on the i-th Node after allocation; and request_cpu, request_gpu, and request_mem are the requested CPU, GPU, and memory quantities.
CN202111460970.6A 2021-12-02 2021-12-02 Kubernetes-based multi-tenant machine learning task resource scheduling method Pending CN114356543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111460970.6A CN114356543A (en) 2021-12-02 2021-12-02 Kubernetes-based multi-tenant machine learning task resource scheduling method

Publications (1)

Publication Number Publication Date
CN114356543A true CN114356543A (en) 2022-04-15

Family

ID=81096598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111460970.6A Pending CN114356543A (en) 2021-12-02 2021-12-02 Kubernetes-based multi-tenant machine learning task resource scheduling method

Country Status (1)

Country Link
CN (1) CN114356543A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661482A (en) * 2022-05-25 2022-06-24 成都索贝数码科技股份有限公司 GPU computing power management method, medium, equipment and system
CN115098238A (en) * 2022-07-07 2022-09-23 北京鼎成智造科技有限公司 Application program task scheduling method and device
CN115098238B (en) * 2022-07-07 2023-05-05 北京鼎成智造科技有限公司 Application program task scheduling method and device
CN115237608A (en) * 2022-09-21 2022-10-25 之江实验室 Multi-mode scheduling system and method based on unified computing power of multiple clusters
CN115373764A (en) * 2022-10-27 2022-11-22 中诚华隆计算机技术有限公司 Automatic container loading method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination