WO2024032587A1 - GPU resource usage method, GPU virtualization method, job scheduling apparatus, and cluster - Google Patents

GPU resource usage method, GPU virtualization method, job scheduling apparatus, and cluster

Info

Publication number
WO2024032587A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
video memory
virtual
container
application
Prior art date
Application number
PCT/CN2023/111673
Other languages
English (en)
French (fr)
Inventor
李孟轩
张冠一
Original Assignee
第四范式(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司
Publication of WO2024032587A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • the present disclosure relates to the field of computers, and in particular to a GPU resource usage method, a GPU virtualization method, a job scheduling device, and a cluster.
  • Modern graphics processing units (GPUs) began as accelerators for Windows video games but have evolved over the past 20 years into enterprise server processors for high-performance computing and artificial intelligence applications.
  • GPUs lead performance in supercomputing, artificial intelligence training and inference, drug research, financial modeling, and medical imaging. They are also used in more mainstream tasks where CPUs are not fast enough, such as in GPU-driven relational databases. GPUs are better suited than CPUs to handle many of the calculations required for artificial intelligence and machine learning in enterprise data centers and hyperscale networks. The CPU can handle the work, but it takes longer. Because GPUs are designed to solve complex mathematical problems in parallel by breaking them down into separate tasks that they handle simultaneously, they can solve these problems faster.
  • GPU virtualization technology improves GPU utilization by dividing the GPU into multiple smaller-granularity virtual GPUs and using each virtual GPU to run applications or tasks that consume less computing power and video memory.
  • GPU computing power utilization and video memory utilization are not directly proportional.
  • In some scenarios (for example, model-inference A/B testing and notebook research), the utilization of computing power is far lower than the utilization of video memory. To make fuller use of computing power, the GPU must be divided into smaller-granularity virtual GPUs; with a traditional GPU virtualization solution, however, the video memory of each virtual GPU then becomes too small to properly support a single application or task. Conversely, if the GPU is divided into larger-granularity virtual GPUs to satisfy the video memory requirement, part of the GPU's computing power sits idle.
  • a technical problem to be solved by this disclosure is to provide a solution that can improve the overall utilization of GPU resources while solving the problem of incompatible computing power utilization and video memory utilization of the GPU.
  • a method for using GPU resources, including: dividing the GPU into multiple virtual GPUs; for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU; and replacing the video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • a GPU virtualization method, including: dividing the GPU into multiple virtual GPUs; and, for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • a job scheduling device, including: a scheduler component configured to schedule container jobs to one or more GPUs and divide each GPU into one or more virtual GPUs, where each virtual GPU corresponds to a container in a container-type job, each container corresponds to an application or task, and the application or task runs in the container. For at least one virtual GPU, the scheduler component also allocates at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU. The device further includes a hijacking library configured to intercept the call requests of the application or task to the GPU; for applications or tasks that need to use virtual video memory, the hijacking library also sets the interface used to apply for video memory to a video memory application interface based on the unified address space.
  • a Kubernetes cluster, including: multiple GPU nodes, each GPU node including one or more GPUs; and a job scheduling device deployed on at least one GPU node, where the job scheduling device is the GPU resource management device described in the second aspect of this disclosure.
  • a GPU virtualization device, including: a splitting module configured to split the GPU into multiple virtual GPUs; and an allocation module configured to, for at least one virtual GPU, allocate part of the memory of the host where the GPU is located to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • a computing device, including: a processor; and a memory on which executable code is stored, where, when the executable code is executed by the processor, the processor is caused to execute the method of the above-mentioned first aspect.
  • a computer-readable storage medium having executable code stored thereon, where, when the executable code is executed by a processor of a computing device, the processor is caused to execute the method of the above-described first aspect.
  • the present disclosure allocates host memory to the virtual GPU as a video memory swap area, so that the swap area can act as virtual video memory and increase the available video memory of the virtual GPU. This solves the problem that, when the GPU is divided into smaller-granularity virtual GPUs in order to fully utilize computing power, the onboard video memory of each virtual GPU is not enough to support a single application or task, and thereby addresses the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied.
  • the present disclosure replaces the application's or task's video memory application request for the virtual GPU with a video memory application request based on a unified address space, so that the application or task can use the virtual video memory (i.e., the video memory swap area) without awareness.
  • Figure 1 shows a schematic diagram of the principle of virtualizing a GPU according to the present disclosure.
  • Figure 2 shows a schematic structural diagram of a job scheduling device according to an embodiment of the present disclosure.
  • Figure 3 shows a schematic diagram of the hijacking library.
  • Figure 4 shows a schematic diagram of the overall processing flow for vGPU tasks.
  • Figure 5 shows a schematic structural diagram of a GPU virtualization device according to an embodiment of the present disclosure.
  • FIG. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
  • Figure 1 shows a schematic diagram of the principle of virtualizing a GPU according to the present disclosure.
  • a GPU can be divided into multiple virtual GPUs.
  • A GPU can refer to a single physical GPU card.
  • the segmentation granularity can be flexibly set as needed.
  • Virtual GPU can also be called vGPU.
  • At least part of the host memory of the host (such as a server) where the GPU is located can be used as a video memory swap area and allocated to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • Onboard video memory refers to the video memory integrated on the physical GPU card, that is, the video memory provided by the physical GPU card itself. Onboard video memory can also be called physical video memory.
  • For a virtual GPU that has been allocated a video memory swap area, when the onboard video memory of the virtual GPU is not enough, part of the space in the onboard video memory can be released for use by the current program. The data in the released space is saved to the video memory swap area; when that data needs to be used again, it can be swapped from the video memory swap area back into the onboard video memory.
  • the total available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • the memory in the video memory swap area can serve as the virtual video memory of the virtual GPU, thereby increasing the available video memory of the virtual GPU.
  • the total available video memory of a virtual GPU is equal to the onboard video memory of the virtual GPU plus the video memory swap area allocated for the virtual GPU.
  • by using host memory as a video memory swap area, the present disclosure can make the available video memory exceed the onboard video memory, so that when the GPU is divided into smaller-granularity virtual GPUs for the purpose of fully utilizing GPU resources, there is no need to worry that the onboard video memory of each virtual GPU is too small to support a single application or task. The disclosure can therefore improve the overall utilization of GPU resources while solving the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied, and can help customers set the partitioning granularity of GPU virtualization more flexibly to maximize GPU resource utilization.
  • a virtual GPU can be used exclusively by one application or task.
  • An application or task that needs to use the video memory swap area can refer to an application or task whose maximum video memory usage during operation exceeds the onboard video memory of the virtual GPU.
  • the present disclosure proposes that, for applications or tasks that need to use the video memory swap area, the application's or task's video memory application request for the virtual GPU can be replaced with a video memory application request based on a unified address space.
  • the unified address space can be regarded as mapping host memory and device memory (i.e., the onboard video memory of the GPU) into one unified (virtual) address space.
  • In the unified address space, memory and video memory are no longer distinguished, thus providing support for free data exchange between memory and video memory.
  • the present disclosure replaces the video memory application request (i.e., the default video memory application request) made by an application or task for the virtual GPU with a video memory application request based on a unified address space, so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • This disclosure can use the hijacking library to intercept the call requests of the application or task to the virtual GPU and replace the default interface used by the application or task to apply for video memory with a video memory application interface based on a unified address space, thereby replacing the application's or task's video memory application request for the virtual GPU with a video memory application request based on the unified address space.
  • This disclosure can replace the default video memory application interface (cuMemAlloc) with a unified address space-based video memory application interface (cuMemAllocManaged).
  • When the video memory requested through the unified-address-space application interface exceeds the currently available onboard video memory of the virtual GPU, system components (such as the kernel module part of the Nvidia driver and the HMM component in the Linux kernel) can automatically swap at least part of the data stored in the onboard video memory to the video memory swap area, releasing at least part of the onboard video memory to meet the video memory usage requirements of the application or task.
  • the present disclosure replaces ordinary video memory applications with video memory applications based on a unified address space by replacing the Cuda call chain, so that the application or task gains the ability to use virtual video memory while the entire process remains imperceptible to the application or task.
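  • As an illustration only, the C sketch below shows one way such an interposition could look. It assumes the hijacking library is preloaded ahead of the real driver library (assumed here to be named libcuda.so.1) and re-exports the default allocation entry point; the driver-API types are declared locally so the sketch is self-contained, and it is not the disclosure's actual implementation.

```c
#include <dlfcn.h>    /* dlopen, dlsym */
#include <stddef.h>

typedef int CUresult;                   /* 0 == CUDA_SUCCESS */
typedef unsigned long long CUdeviceptr; /* CUDA driver API device pointer */
#define CU_MEM_ATTACH_GLOBAL 0x1

/* Pointer to the real unified-address-space allocator in the driver library. */
static CUresult (*real_alloc_managed)(CUdeviceptr *, size_t, unsigned int);

static void resolve_real_symbol(void)
{
    if (!real_alloc_managed) {
        void *libcuda = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
        if (libcuda)
            real_alloc_managed = (CUresult (*)(CUdeviceptr *, size_t, unsigned int))
                dlsym(libcuda, "cuMemAllocManaged");
    }
}

/* Exported under the name of the default video memory application interface:
 * an ordinary allocation request is transparently turned into a managed
 * (unified-address-space) allocation that the kernel can page to host memory. */
CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize)
{
    resolve_real_symbol();
    if (!real_alloc_managed)
        return 1;  /* CUDA_ERROR_INVALID_VALUE, as a conservative fallback */
    return real_alloc_managed(dptr, bytesize, CU_MEM_ATTACH_GLOBAL);
}
```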
  • the GPU mentioned in this disclosure can be deployed in a K8S cluster.
  • the K8S cluster is a Kubernetes-based GPU cluster and includes multiple GPU nodes.
  • Each GPU node can include one or more GPUs.
  • the GPU node may refer to the GPU server.
  • applications or tasks can be deployed in containers. Containers can provide environmental support for the running of applications or tasks.
  • the virtual GPU that an application or task needs to use can be mounted in the container for exclusive use by that application or task.
  • the present disclosure proposes that the virtualization information can be set by the user based on the actual situation of the cluster and upper-layer applications or tasks.
  • the virtualization information may represent the granularity of partitioning the GPU and/or the size of the video memory swap area allocated for the virtual GPU.
  • the virtualization information may include the maximum number of virtual GPUs and virtual video memory size.
  • the maximum number of virtual GPUs is used to limit the maximum number of virtual GPUs that a physical card can be divided into.
  • the virtual video memory size is used to limit the size of the virtual video memory that needs to be configured for the virtual GPU.
  • the size of virtual video memory can be expressed by the indicator "virtual video memory multiple".
  • Virtual video memory multiple means that the virtual video memory is a multiple of the actual onboard video memory. For example, if the virtual video memory multiple is set to 10, the virtual video memory is 10 times the actual onboard video memory.
  • when the GPU is divided into multiple virtual GPUs, it can be divided on the condition that the number of resulting virtual GPUs is not greater than the maximum number of virtual GPUs; and when at least part of the host memory is allocated to a virtual GPU as a video memory swap area, host memory equal to the virtual video memory size can be used as the video memory swap area and allocated to the virtual GPU.
  • To avoid interference between tasks, a virtual GPU can be assigned to one application or task. This solution limits the task's access to and use of the GPU based on the virtualization parameters set above; therefore, a change in the running status of one application or task will not affect applications or tasks on other virtual GPUs.
  • the resource requirement information of the application or task can be obtained.
  • the resource requirement information is used to characterize the GPU resources required by the application or task.
  • the GPU resources required by an application or task may be different in different states during the running process.
  • the GPU resources required by the application may refer to the maximum GPU resources required by the application or task during the entire running process. Based on the resource requirement information, a virtual GPU that can meet the GPU resources required by the application or task can be selected.
  • Resource demand information may include computing power usage ratio and video memory usage size.
  • the computing power usage ratio is used to characterize the proportion of GPU computing power required by an application or task. The unit is %. If the GPU computing power ratio is set to 10, applications or tasks can be scheduled to virtual GPUs with a GPU computing power ratio of more than 10%.
  • the unit of the video memory usage size can be MB. If the video memory usage size is set to 1024, the application or task will occupy 1024MB of GPU video memory.
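  • Purely as an illustrative sketch (the struct and function names below are hypothetical and not part of the disclosure), the resource requirement information and the matching check described above could be represented as follows:

```c
#include <stdbool.h>
#include <stddef.h>

/* Resource requirement information of one application or task. */
struct resource_demand {
    int    compute_percent; /* required share of GPU computing power, in % */
    size_t memory_mb;       /* required video memory, in MB */
};

/* Remaining resources of one candidate virtual GPU. */
struct vgpu_state {
    int    free_compute_percent; /* remaining computing power share, in % */
    size_t free_memory_mb;       /* remaining available video memory, in MB */
};

/* A vGPU can host the task only if both demands fit into what is left. */
static bool vgpu_can_host(const struct vgpu_state *v, const struct resource_demand *d)
{
    return v->free_compute_percent >= d->compute_percent &&
           v->free_memory_mb      >= d->memory_mb;
}
```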
  • the present disclosure proposes that when creating the container, environment variables corresponding to the virtual GPU need to be set, which may include but are not limited to the container identifier, the identifier of the virtual GPU mounted into the container, and the upper limit of GPU resources that the container can access.
  • the container identifier can be a Universally Unique Identifier (UUID), which is used by the scheduler plug-in mentioned below to identify the container.
  • the identifier of the virtual GPU can be the GPU device number accessible within the container, which is used for device mapping inside and outside the container.
  • the upper limit of GPU resources that a container can access can refer to the maximum GPU resources that a virtual GPU can provide, which can include the upper limit of video memory and computing power usage.
  • the virtual GPU can be mounted in the container based on environment variables and the container can be started to run applications or tasks in the container.
  • the present disclosure can also monitor the resource usage of the virtual GPU.
  • the resource usage of the virtual GPU can also be output (such as visual display) in real time.
  • This disclosure also proposes a job scheduling device that supports cloud native, which is suitable for deployment on nodes that require GPU virtualization in a K8S cluster, and schedules container jobs in a cloud-native manner.
  • Container jobs refer to pod, deployment, job and other jobs containing vGPU resources submitted to the K8S cluster.
  • Container jobs can also be called vGPU tasks.
  • Container jobs can involve one or more containers.
  • K8S Pod is the smallest unit managed by K8S.
  • a Pod can contain multiple containers.
  • Each container can run an application or task.
  • the GPU resources required to run applications or tasks in the container may refer to vGPU resources, that is, an application or task can exclusively use a vGPU.
  • vGPU resources can be used to characterize the size of vGPU resources required by each application or task, such as the proportion of GPU computing power and the size of used video memory.
  • Figure 2 shows a schematic structural diagram of a job scheduling device according to an embodiment of the present disclosure.
  • the job scheduling device 200 may include a scheduler component 210 and a hijacking library 220 .
  • the job scheduling apparatus 200 may also include one or more of the device component 230, the mounting component 240, the monitoring component 250, and the marking component 260 shown in the dashed boxes in the figure. Each component can be packaged as a chart.
  • the scheduler component 210 may be configured to schedule the container job to one or more GPUs based on the GPU resource requirement information of the container job, and divide each GPU into one or more virtual GPUs, each virtual GPU corresponding to a container in the container-type job and each container corresponding to an application or task.
  • the scheduler component 210 can also allocate at least part of the host memory as a video memory swap area to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • the scheduler component 210 can divide the GPU according to the vGPU resource information in the container job.
  • the vGPU resource information may include the number of vGPUs and the maximum resource information that each vGPU needs to provide, such as the upper limit of computing power and the upper limit of video memory.
  • the scheduler component 210 may also be called a scheduler plug-in.
  • the scheduler component 210 can be a scheduler plug-in that schedules all vGPU tasks and is obtained by improving the Scheduler extender in the K8S cluster.
  • the scheduler plug-in can be recorded as 4PD-vGPU-Scheduler.
  • the scheduler component 210 can hijack and take over the scheduling of all vGPU tasks, coordinate the GPU resources of all clusters, and allocate (i.e., schedule) tasks to certain GPUs on appropriate GPU nodes.
  • the original official K8S scheduler only supports allocating GPUs by count, and the allocated GPUs are regarded as exclusive resources that other applications or tasks cannot use.
  • In contrast, the scheduler component 210 of the present disclosure allows a task to specify the GPU resources it needs (such as video memory size and computing power usage ratio). By scheduling applications or tasks to virtual GPUs that meet those needs, a task can use only part of a GPU's video memory and computing power, allowing multiple tasks to share the resources of one GPU.
  • the hijacking library 220 is configured to intercept calls made by applications or tasks to the GPU. For applications or tasks that need to use virtual video memory, the hijacking library 220 also sets (for example, replaces) the interface used to apply for video memory (such as cuMemAlloc) to a video memory application interface based on a unified address space (such as cuMemAllocManaged), so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • When the video memory requested through the unified-address-space application interface is greater than the currently available onboard video memory, system components can be used to automatically swap at least part of the data stored in the onboard video memory to the video memory swap area, freeing up at least some of the onboard video memory.
  • the hijacking library 220 may be an improvement of the existing component Hooked Cuda Driver (i.e., a CUDA hijacking library).
  • the hijacking library itself is an existing technology, but the hijacking library in the existing technology does not support multi-card segmentation and cannot be used directly on K8S.
  • the present disclosure can add the function of communicating with the scheduler component 210 to the hijacking library, so that the K8S cluster can dynamically control each application or task.
  • Cuda8 and subsequent Cuda versions proposed the concept of a unified address space.
  • In the unified address space, device memory (video memory) and host memory are no longer distinguished. All data exchange between device memory and host memory is performed automatically by the Linux kernel together with the HMM component in the Nvidia kernel module, and no longer requires the user to control it manually by calling cuMemcpyDtoH or cuMemcpyHtoD. This control mechanism makes free exchange between memory and video memory possible, allowing host memory to be used as a swap area for video memory.
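  • The following standalone C program is a hedged illustration of this capability with the CUDA driver API: a managed allocation larger than the onboard video memory can be created and touched from the host, with paging handled automatically. It assumes a Linux host and a GPU/driver combination that supports managed-memory oversubscription, and it omits most error handling.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <cuda.h>   /* CUDA driver API */

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr p;
    size_t free_bytes, total_bytes;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemGetInfo(&free_bytes, &total_bytes);

    /* Ask for more than the whole onboard video memory of the device. */
    size_t request = total_bytes + total_bytes / 2;
    if (cuMemAllocManaged(&p, request, CU_MEM_ATTACH_GLOBAL) == CUDA_SUCCESS) {
        /* Managed memory is host-accessible; pages not resident on the GPU
         * live in host memory, which acts as the swap area. */
        memset((void *)(uintptr_t)p, 0, request);
        printf("managed allocation of %zu bytes succeeded\n", request);
        cuMemFree(p);
    } else {
        printf("managed allocation was rejected on this system\n");
    }
    cuCtxDestroy(ctx);
    return 0;
}
```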
  • This disclosure creatively applies unified address space and hijacking library technology to the field of GPU multiplexing: the hijacking library 220 (i.e., the Cuda hijacking library) replaces the Cuda call chain and turns ordinary video memory applications into "unified address space" applications, so that the task gains the ability to use host memory as a video memory swap area without any awareness.
  • the hijacking library 220 can hijack all upper-layer call requests by hijacking symbol calls, and forward them to the lower-layer real CUDA execution library after processing.
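  • A minimal sketch of this intercept-and-forward pattern is shown below (the real driver library is assumed to be libcuda.so.1; cuMemGetInfo is used only as an example of a forwarded call, and the per-container processing is left as a placeholder):

```c
#include <dlfcn.h>
#include <stddef.h>

typedef int CUresult;

static void *real_libcuda;

/* Resolve a symbol from the lower-layer real CUDA execution library. */
static void *real_symbol(const char *name)
{
    if (!real_libcuda)
        real_libcuda = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
    return real_libcuda ? dlsym(real_libcuda, name) : NULL;
}

/* Every hijacked symbol follows the same shape: intercept the upper-layer
 * call, apply the per-container processing, then forward it downward. */
CUresult cuMemGetInfo(size_t *free_bytes, size_t *total_bytes)
{
    CUresult (*real_fn)(size_t *, size_t *) =
        (CUresult (*)(size_t *, size_t *))real_symbol("cuMemGetInfo");
    if (!real_fn)
        return 1;

    CUresult rc = real_fn(free_bytes, total_bytes);
    /* Post-processing hook: a vGPU implementation could clamp the reported
     * values to the limits assigned to this container (not shown here). */
    return rc;
}
```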
  • Figure 3 shows a schematic diagram of the hijacking library.
  • the hijacking library (Libcudaso) is located between the driver layer (Nvidia GPU Driver) and the Cuda runtime (Cuda Runtime) layer, so it can intercept all requests sent to the driver by the Cuda runtime.
  • the hijacking library can also be configured to check whether the video memory requested by the application or task is greater than the video memory allocated to it; if the requested video memory is not greater than the allocated video memory, the hijacking library sends the video memory application request to the lower-level driver, such as the GPU driver.
  • the hijacking library can collect statistics on video memory and computing power separately for each container, perform the corresponding legality check (the requested video memory cannot exceed the allocated video memory size), and then pass the request to the lower-level driver.
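  • A hedged sketch of such a legality check is given below; the counters and names are illustrative, and in practice they would be driven by the limits the scheduler assigned to the container:

```c
#include <pthread.h>
#include <stddef.h>

typedef int CUresult;
#define CUDA_SUCCESS             0
#define CUDA_ERROR_OUT_OF_MEMORY 2

static pthread_mutex_t vgpu_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t vgpu_mem_limit;   /* video memory quota allocated to this container */
static size_t vgpu_mem_used;    /* running total of video memory granted so far */

/* Called by the intercepted allocation entry points before the request is
 * passed on to the lower-level driver. */
static CUresult vgpu_check_and_account(size_t bytesize)
{
    CUresult rc = CUDA_SUCCESS;
    pthread_mutex_lock(&vgpu_lock);
    if (vgpu_mem_limit != 0 && vgpu_mem_used + bytesize > vgpu_mem_limit)
        rc = CUDA_ERROR_OUT_OF_MEMORY;  /* requested memory exceeds the allocation */
    else
        vgpu_mem_used += bytesize;      /* accepted: record it in the statistics */
    pthread_mutex_unlock(&vgpu_lock);
    return rc;
}
```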
  • the lower-layer driver can control the graphics processing unit (Nvidia GPU) by calling the interface function ioctl.
  • the hijacking library can be an interface (cuda API) added by modifying the guest operating system (Guest OS) using paravirtualization (Para Virtualization).
  • a virtualization layer can be added to Libcuda to build the hijacking library.
  • the application (CUDA Application) in the guest operating system can connect to the Cuda runtime through the calling library (CUDA Library) as a static link, to support the running of the application.
  • the hijacking library can be mounted to the runtime through dynamic link (Dynamic link) by calling the function dlopen.
  • the device component 230 is configured to mount (i.e., map) the hijacking library 220 into the container and set the preload library in the container so that the hijacking library is forced to be mounted before the process in the container is started.
  • Device component 230 may also be called a device plug-in.
  • the device component 230 may be an improvement on the Device Plugin (ie, device plug-in) in the K8S cluster.
  • the improved device plug-in can be recorded as 4PD-vGPU-Device Plugin.
  • the device component 230 is responsible for mapping the Cuda hijacking library (libvgpu.so) into the container and setting the preload library (such as the preload file /etc/ld.so.preload) in the container.
  • the function of /etc/ld.so.preload is to force libvgpu.so to be mounted before any process is started, ensuring that users cannot bypass vGPU and directly access the GPU. As a result, all calls to Cuda inside the container will be forwarded through the hijacking library.
  • the mounting component 240 may be configured to obtain the identity of the virtual GPU by communicating with the scheduler component 210 .
  • the identifier of the virtual GPU is used to identify the device number of the virtual GPU, and the mounting component 240 can mount the virtual GPU into the container according to the identifier of the virtual GPU, together with the corresponding driver library.
  • the mounting component 240 is located at the container layer and can be improved from nvidia-container-runtime.
  • the improved nvidia-container-runtime can be recorded as 4PD-nvidia-container-runime.
  • Compared with the ordinary nvidia-container-runtime, the mounting component 240 additionally communicates with the scheduler component 210; based on this communication, the mounting component 240 can actually mount the vGPU assigned to the container by the scheduler component 210 into the container.
  • the mounting component 240 may also be configured to obtain the GPU resource information of the virtual GPU by communicating with the scheduler component 210, and record the GPU resource information in the environment variable of the container.
  • the GPU resource information may refer to the upper limit of GPU resources that the virtual GPU can provide, such as the upper limit of video memory and the upper limit of computing power.
  • the GPU resource information in the container's environment variables is the GPU resources that the container can access (that is, use).
  • Containers can be created and started based on environment variables. The operation of creating and starting the container can be performed by runc, the container runtime.
  • the monitoring component 250 can be recorded as 4PD-VGPU-monitor and is configured to monitor the resource usage of the vGPU. For example, it can monitor some preset metrics.
  • the monitoring component 250 can also push metrics to the outside to facilitate real-time monitoring and visualization of vGPU resources across the entire cluster.
  • the marking component 260 may be configured to record, for each container in the container-type job, a container identifier that can uniquely identify the container in the environment variable of the container, to facilitate identification by the scheduler component 210 .
  • a Universally Unique Identifier (UUID) can be used as a container identifier. Marking component 260 may refer to MutatingWebhook in K8S.
  • Figure 4 shows a schematic diagram of the overall processing flow for vGPU tasks.
  • steps S410 and S420 can be executed by the marking component; steps S340, S440 and S480 can be executed by the scheduler component; step S450 can be executed by the device component; and step S460 can be executed by the mounting component.
  • In step S410, when the task is submitted, the marking component may first check whether the vGPU task contains vGPU resources. If vGPU resources are detected, the process from step S420 to step S460 can be executed: by virtualizing the GPU, applications or tasks can use virtual video memory, and in particular can use it without any awareness. If no vGPU resource is detected, the submitted task does not require virtualization of the GPU, so step S470 can be executed to run the default scheduling process.
  • In step S420, a container UUID is added for each discovered vGPU resource so that the mounting component can identify the corresponding container.
  • Each vGPU resource corresponds to a container, and the vGPU resource is the vGPU resource that the container needs to use.
  • In step S430, the GPU nodes in the cluster are filtered.
  • In step S440, the filtered GPU nodes are scored.
  • Through filtering and scoring, the most suitable GPU node (for example, the one or more GPU nodes with the highest score) can finally be selected to perform the vGPU task. For example, GPU nodes that support virtualization can be filtered first, and the remaining nodes can then be scored based on the current remaining GPU computing power on the node, the supported splitting granularity, and so on; the most appropriate (i.e., highest-scoring) GPU node is selected, and the GPUs in the finally selected node are split to obtain one or more vGPUs that meet the requirements.
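  • As a purely hypothetical sketch of this filtering-and-scoring step (the fields and weights below are illustrative and are not specified by the disclosure):

```c
#include <stdbool.h>

struct gpu_node {
    bool supports_virtualization;    /* filter criterion from step S430 */
    int  remaining_compute_percent;  /* remaining GPU computing power, in % */
    int  max_vgpus_per_card;         /* supported splitting granularity */
};

/* Nodes that do not support virtualization are filtered out; the rest are
 * ranked by remaining computing power and splitting granularity. */
static int score_node(const struct gpu_node *n)
{
    if (!n->supports_virtualization)
        return -1;
    return n->remaining_compute_percent + 10 * n->max_vgpus_per_card;
}
```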
  • In step S450, the hijacking library mount is added.
  • After the task has been submitted to the corresponding GPU node, the CUDA hijacking library libvgpu.so and the preload file /etc/ld.so.preload can be mounted.
  • In step S460, the environment variables are set.
  • After step S450 is completed, the task can be submitted to the container layer. The nvidia container runtime at the container layer communicates with the vGPU scheduler and obtains the GPU serial number corresponding to the vGPU, the corresponding usable video memory, and the upper limit of utilization, and fills them into three environment variables: NVIDIA_VISIBLE_DEVICES, CUDA_DEVICE_MEMORY_LIMIT, and CUDA_DEVICE_SM_LIMIT. The first environment variable controls the GPU device number mounted into the container; the second and third environment variables control access to the GPU, namely the upper limit of video memory and the upper limit of utilization that the container can access.
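  • A small illustrative sketch of how a preloaded hijacking library could read these environment variables at start-up is shown below (the variable names come from the description; the value formats and the parsing are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>

static unsigned long long vgpu_mem_limit_bytes; /* from CUDA_DEVICE_MEMORY_LIMIT */
static int vgpu_sm_limit_percent;               /* from CUDA_DEVICE_SM_LIMIT */

/* Runs automatically when the preloaded library is mapped into the process,
 * i.e., before the application in the container starts doing any work. */
__attribute__((constructor))
static void vgpu_read_limits(void)
{
    const char *devs = getenv("NVIDIA_VISIBLE_DEVICES");   /* mounted GPU device number(s) */
    const char *mem  = getenv("CUDA_DEVICE_MEMORY_LIMIT"); /* video memory cap for this container */
    const char *sm   = getenv("CUDA_DEVICE_SM_LIMIT");     /* utilization cap for this container */

    if (mem)
        vgpu_mem_limit_bytes = strtoull(mem, NULL, 10);    /* assumed: plain byte count */
    if (sm)
        vgpu_sm_limit_percent = atoi(sm);

    fprintf(stderr, "[vgpu] devices=%s mem_limit=%llu sm_limit=%d%%\n",
            devs ? devs : "(unset)", vgpu_mem_limit_bytes, vgpu_sm_limit_percent);
}
```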
  • nvidia-container-cli can be called to mount the specific GPU device and corresponding driver library, and then runc can be used to start the container.
  • this disclosure discusses a cloud-native GPU virtualization technology that can realize virtual video memory capabilities, and provides a complete process and solution that resolve the above defects at both the product and technical levels.
  • the disclosed solution can be implemented based on cloud native K8S technology and can be directly adapted to cloud native scenarios. Therefore, the present disclosure can be implemented as a cloud-native GPU virtualization solution that supports virtual video memory.
  • the cloud-native GPU virtualization solution implemented based on the job scheduling device of the present disclosure mainly includes a component deployment phase, an application or task creation phase, and an application or task running phase.
  • each component in the GPU resource management device can be installed on the nodes in the K8S cluster that require GPU virtualization.
  • the "virtual memory multiplier” and “maximum number of vGPUs” need to be set.
  • Parameters, users can set according to the actual conditions of the cluster and upper-layer applications or tasks. These two parameters are new parameters in this solution, and their functions can be referred to the relevant descriptions above.
  • the application or task will run on the virtual GPU with the set parameters, and the applications on each virtual GPU will not interfere with each other: a change in the running status of one application or task will not affect applications or tasks on other virtual GPUs.
  • the expected effect is that the total amount of video memory used by the applications or tasks can exceed the total amount of actual physical GPU video memory, and that, in inference scenarios where the GPU's computing power is not fully used, the performance of each application or task is basically the same as when GPU virtualization is not used, with a loss within 10%. This is because inference tasks have strong locality, so the exchange between video memory and memory is not frequent enough to cause large performance degradation.
  • the loss here refers to the increase in request processing time in the inference scenario.
  • This disclosure enables multiple AI applications (such as inference models) to be deployed on one GPU by reusing the GPU.
  • multiple inference models can also be deployed on one GPU through multi-model loading methods (for example, the multi-model loading modes of Nvidia Triton server, torchserve, and tf-serving).
  • Nvidia has launched the Nvidia triton server inference engine for inference scenarios, which can load multiple inference models on one GPU.
  • many AI training frameworks have also launched corresponding inference service engines. They can load multiple models simultaneously in one task and provide inference services for each model.
  • the disadvantage of this multi-model loading method is that this technology is often only applicable to certain models.
  • For example, tf-serving can only load tensorflow models, and torchserve can only load torch models; even the Nvidia Triton server, which has the widest applicability, can only load pytorch script models for pytorch, so these methods cannot cover all application scenarios.
  • the disclosed GPU multiplexing scheme can cover all application scenarios.
  • the inference model mentioned in this disclosure may refer to a neural network model.
  • Neural network models can be used to predict image categories, text categories, voice emotions, fraudulent transactions, advertising click-through rates, etc.
  • the neural network model is designed to predict problems related to objects or events in relevant scenarios. For example, it can be used to predict image categories, text in images, text categories, voice emotion categories, fraudulent transactions, advertising click-through rates, commodity prices, and so on, so that the prediction results can be used directly as a basis for decision-making, or further combined with other rules to form the basis for decision-making.
  • the scenarios in which the neural network model can be used include but are not limited to the following scenarios:
  • Image processing scenarios include: optical character recognition (OCR), face recognition, object recognition, and image classification. More specifically, for example, OCR can be applied to bill (such as invoice) recognition and handwriting recognition, face recognition can be applied to security scenarios, and image classification can be applied to e-commerce platform features such as "photo shopping" and "find the same style";
  • Speech recognition scenarios include products that enable human-computer interaction through voice, such as mobile phone voice assistants (such as Siri on Apple phones), smart speakers, etc.;
  • Natural language processing scenarios include: review of texts (such as contracts, legal documents and customer service records, etc.), spam content identification (such as spam text message identification) and text classification (emotion, intention and theme, etc.);
  • Automatic control scenarios include: mine group adjustment operation prediction, wind turbine adjustment operation prediction, and air conditioning system adjustment operation prediction. Specifically, for a mine group, a set of adjustment operations with a high mining rate can be predicted; for wind turbine units, a set of adjustment operations with high power generation efficiency can be predicted; and for air conditioning systems, a set of adjustment operations that meet demand while saving energy consumption can be predicted;
  • Intelligent question and answer scenarios including: chat robots and intelligent customer service;
  • the field of financial technology includes: marketing (such as coupon usage prediction, advertising click behavior prediction, user portrait mining, etc.) and customer acquisition, anti-fraud, anti-money laundering, underwriting and credit scoring, and commodity price prediction;
  • Medical fields include: disease screening and prevention, personalized health management and auxiliary diagnosis;
  • Municipal fields include: social governance and regulatory law enforcement, resource environment and facility management, industrial development and economic analysis, public services and people's livelihood security, smart cities (allocation and management of various urban resources such as buses, online ride-hailing, shared bicycles, etc.);
  • Recommended business scenarios including: recommendation of news, advertising, music, consulting, video and financial products (such as financial management, insurance, etc.);
  • Search scenarios include: web search, image search, text search, video search, etc.;
  • Abnormal behavior detection scenarios include: detection of abnormal power consumption behavior of State Grid customers, detection of malicious network traffic, detection of abnormal behavior in operation logs, etc.
  • the GPU virtualization method described above in conjunction with Figure 1 of the present disclosure can also be implemented as a GPU virtualization device.
  • Figure 5 shows a schematic structural diagram of a GPU virtualization device according to an embodiment of the present disclosure.
  • the functional units of the GPU virtualization device may be implemented by hardware, software, or a combination of hardware and software that implement the principles of the present disclosure.
  • Those skilled in the art can understand that the functional units described in Figure 5 can be combined or divided into sub-units to implement the above disclosed principles. Therefore, the description herein may support any possible combination, division, or further limitation of the functional units described herein.
  • the GPU virtualization device 500 may include a splitting module 510 and an allocation module 520.
  • the splitting module 510 is configured to divide the GPU into multiple virtual GPUs.
  • the allocation module 520 is configured to, for at least one virtual GPU, allocate part of the memory of the host where the GPU is located as a video memory swap area to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • the GPU virtualization apparatus 500 may further include a first acquisition module configured to acquire virtualization information, where the virtualization information includes a maximum number of virtual GPUs and a virtual video memory size.
  • the splitting module can split the GPU into multiple virtual GPUs on the condition that the number of virtual GPUs obtained by splitting is not greater than the maximum number of virtual GPUs.
  • the allocation module can allocate host memory equal to the size of the virtual video memory as a video memory swap area to the virtual GPU.
  • the GPU virtualization device 500 may further include a second acquisition module and an allocation module.
  • the second acquisition module is configured to acquire resource requirement information.
  • the resource requirement information is used to characterize the GPU resources required by the application or task.
  • the allocation module is configured to allocate virtual GPUs to applications or tasks based on resource requirement information.
  • the GPU virtualization apparatus 500 may further include a replacement module configured to replace the video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • the replacement module can use the hijacking library to intercept the call request of the application or task to the virtual GPU, and replace the default interface used by the application or task to apply for video memory with a video memory application interface based on a unified address space.
  • the GPU virtualization device 500 may also include a setting module, a mounting module, and a startup module.
  • the setting module is configured to set the environment variables of the container.
  • the environment variables include the container ID, the ID of the virtual GPU mounted into the container, and the upper limit of GPU resources that the container can access.
  • the mount module is configured to mount the virtual GPU into the container based on environment variables.
  • the startup module is configured to launch a container to run an application or task in the container.
  • the GPU virtualization device 500 may further include a monitoring module configured to monitor resource usage of the virtual GPU.
  • the present disclosure can also be implemented as a Kubernetes cluster, including multiple GPU nodes, each of which includes one or more GPUs, and a job scheduling device deployed on at least one of the GPU nodes; the job scheduling device may be the GPU resource management device described above in conjunction with Figure 2.
  • FIG. 6 shows a schematic structural diagram of a computing device that can be used to implement the above GPU resource usage method or GPU virtualization method according to an embodiment of the present disclosure.
  • computing device 600 includes memory 610 and processor 620 .
  • the processor 620 may be a multi-core processor or may include multiple processors.
  • processor 620 may include a general main processor and one or more special co-processors, such as a graphics processing unit (GPU), a digital signal processor (DSP), and the like.
  • the processor 620 may be implemented using customized circuits, such as Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs).
  • Memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM can store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage may be readable and writable storage, and may be a non-volatile storage device that does not lose stored instructions and data even when the computer is powered off. In some embodiments, a large-capacity storage device (such as a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or optical drive).
  • System memory can be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory.
  • System memory can store some or all of the instructions and data the processor needs to run.
  • memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used.
  • memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (such as an SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc.
  • Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.
  • the memory 610 stores executable code; when the executable code is executed by the processor 620, the processor 620 can be caused to execute the above-mentioned GPU resource usage method or GPU virtualization method.
  • the method according to the present disclosure can also be implemented as a computer program or computer program product, which computer program or computer program product includes computer program code instructions for executing the above-mentioned steps defined in the above-mentioned method of the present disclosure.
  • the present disclosure may also be implemented as a computer-readable storage medium (or machine-readable storage medium) having executable code (or a computer program, or computer instruction code) stored thereon; when the executable code is executed by a processor, the processor is caused to execute each step of the above method according to the present disclosure.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Stored Programmes (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A GPU resource usage method, a virtualization method, a job scheduling apparatus, and a cluster. A GPU is divided into multiple virtual GPUs. For at least one virtual GPU, at least part of the host memory is allocated to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU. The video memory application request of an application or task for the virtual GPU is replaced with a video memory application request based on a unified address space, so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space. In this way, the video memory swap area can act as virtual video memory and increase the available video memory of the virtual GPU, solving the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied; at the same time, by replacing the application's or task's video memory application request for the virtual GPU with a video memory application request based on the unified address space, the application or task can use the virtual video memory without awareness.

Description

GPU resource usage method, GPU virtualization method, job scheduling apparatus, and cluster
This disclosure claims priority to the Chinese patent application filed on August 9, 2022, with application number 202210950598.5 and the invention title "GPU resource usage method, GPU virtualization method, job scheduling apparatus, and cluster".
Technical Field
The present disclosure relates to the field of computers, and in particular to a GPU resource usage method, a GPU virtualization method, a job scheduling apparatus, and a cluster.
Background
Modern graphics processing units (GPUs) began as accelerators for Windows video games, but over the past 20 years they have evolved into enterprise server processors for high-performance computing and artificial intelligence applications.
Today, GPUs lead in performance for supercomputing, artificial intelligence training and inference, drug research, financial modeling, and medical imaging. They are also applied to more mainstream tasks where CPUs are not fast enough, such as GPU-driven relational databases. GPUs are better suited than CPUs to handle many of the computations required by artificial intelligence and machine learning in enterprise data centers and hyperscale networks. A CPU can handle the work, but it takes longer. Because GPUs are designed to solve complex mathematical problems in parallel by breaking them into separate tasks that are processed simultaneously, they can solve these problems faster.
After an enterprise has purchased a large number of GPU servers, improving GPU utilization becomes an important problem for reducing procurement costs. On the one hand, many AI applications or tasks are not large enough to fully occupy the computing power or video memory of one GPU, yet each needs to occupy a whole GPU exclusively so as not to interfere with other applications or tasks, so the resources on that GPU are wasted. On the other hand, K8S is an increasingly widely adopted container orchestration and cluster organization tool, but K8S manages GPUs in a task-exclusive way, which also leads to low utilization.
To improve GPU utilization, GPU virtualization technology has been widely developed and applied.
GPU virtualization technology divides the GPU into multiple smaller-granularity virtual GPUs and uses each virtual GPU to run applications or tasks that consume less computing power and video memory, thereby improving GPU utilization.
However, in some scenarios the utilization of GPU computing power is not proportional to the utilization of video memory. For example, in scenarios such as model-inference A/B testing and notebook research, the utilization of computing power is far lower than the utilization of video memory. To make fuller use of computing power, the GPU has to be divided into smaller-granularity virtual GPUs; with a traditional GPU virtualization solution, however, the video memory of each virtual GPU becomes too small to properly support a single application or task. Conversely, if the GPU is divided into larger-granularity virtual GPUs to satisfy the video memory requirement, part of the GPU's computing power sits idle.
Therefore, a solution is needed that can improve the overall utilization of GPU resources while solving the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied.
Summary
A technical problem to be solved by the present disclosure is to provide a solution that can improve the overall utilization of GPU resources while solving the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied.
According to a first aspect of the present disclosure, a GPU resource usage method is provided, including: dividing a GPU into multiple virtual GPUs; for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU; and replacing the video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
According to a second aspect of the present disclosure, a GPU virtualization method is provided, including: dividing a GPU into multiple virtual GPUs; and, for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
According to a third aspect of the present disclosure, a job scheduling apparatus is provided, including: a scheduler component configured to schedule container jobs to one or more GPUs and divide each GPU into one or more virtual GPUs, where each virtual GPU corresponds to one container in a container job, each container corresponds to one application or task, and the application or task runs in the container, and where, for at least one virtual GPU, the scheduler component also allocates at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU; and a hijacking library configured to intercept the call requests of the application or task to the GPU, where, for applications or tasks that need to use virtual video memory, the hijacking library also sets the interface used to apply for video memory to a video memory application interface based on the unified address space.
According to the third aspect of the present disclosure, a Kubernetes cluster is also provided, including: multiple GPU nodes, each GPU node including one or more GPUs; and a job scheduling apparatus deployed on at least one GPU node, the job scheduling apparatus being the GPU resource management apparatus described in the second aspect of the present disclosure.
According to a fourth aspect of the present disclosure, a GPU virtualization apparatus is also provided, including: a splitting module configured to divide a GPU into multiple virtual GPUs; and an allocation module configured to, for at least one virtual GPU, allocate part of the memory of the host where the GPU is located to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
According to a fifth aspect of the present disclosure, a computing device is provided, including: a processor; and a memory on which executable code is stored, where, when the executable code is executed by the processor, the processor is caused to perform the method of the first aspect above.
According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided, on which executable code is stored, where, when the executable code is executed by a processor of a computing device, the processor is caused to perform the method of the first aspect above.
Thus, by allocating host memory to the virtual GPU as a video memory swap area, the present disclosure allows the swap area to act as virtual video memory and increase the available video memory of the virtual GPU, which solves the problem that the onboard video memory of a virtual GPU is not enough to support a single application or task when the GPU is divided into smaller-granularity virtual GPUs in order to fully utilize computing power, and thereby solves the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied. On this basis, by replacing the application's or task's video memory application request for the virtual GPU with a video memory application request based on a unified address space, the present disclosure enables the application or task to use the virtual video memory (i.e., the video memory swap area) without awareness.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components in the exemplary embodiments of the present disclosure.
Figure 1 shows a schematic diagram of the principle of virtualizing a GPU according to the present disclosure.
Figure 2 shows a schematic structural diagram of a job scheduling apparatus according to an embodiment of the present disclosure.
Figure 3 shows a schematic diagram of the principle of the hijacking library.
Figure 4 shows a schematic diagram of the overall processing flow for a vGPU task.
Figure 5 shows a schematic structural diagram of a GPU virtualization apparatus according to an embodiment of the present disclosure.
Figure 6 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the accompanying drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
Figure 1 shows a schematic diagram of the principle of virtualizing a GPU according to the present disclosure.
Referring to Figure 1, a GPU can be divided into multiple virtual GPUs. A GPU can refer to one physical GPU card. The splitting granularity can be flexibly set as needed. A virtual GPU can also be called a vGPU.
For at least one virtual GPU, at least part of the host memory of the host (such as a server) where the GPU is located can be allocated to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU. Onboard video memory refers to the video memory integrated on the physical GPU card, that is, the video memory provided by the physical GPU card itself. Onboard video memory can also be called physical video memory.
For a virtual GPU that has been allocated a video memory swap area, when the onboard video memory of the virtual GPU is not enough, part of the space in the onboard video memory can be released for use by the current program. The data in the released space is saved to the video memory swap area. When the data in the released space needs to be used, the data in the video memory swap area can be swapped back into the onboard video memory.
Thus, under the action of the video memory swap area, the total available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU. In other words, the memory serving as the video memory swap area can act as the virtual video memory of the virtual GPU, increasing the available video memory of the virtual GPU. The total available video memory of a virtual GPU equals the onboard video memory of the virtual GPU plus the video memory swap area allocated for the virtual GPU.
By using memory as a video memory swap area, the present disclosure can achieve available video memory that exceeds the onboard video memory, so that when the GPU is divided into smaller-granularity virtual GPUs for the purpose of fully utilizing GPU resources, there is no need to worry that the onboard video memory of each virtual GPU obtained by splitting is too small to support a single application or task. Therefore, the overall utilization of GPU resources can be improved while solving the problem that the GPU's computing power utilization and video memory utilization cannot both be satisfied, and customers can be helped to set the splitting granularity of GPU virtualization more flexibly, so as to maximize GPU resource utilization.
One virtual GPU can be used exclusively by one application or task. An application or task that needs to use the video memory swap area can refer to an application or task whose maximum video memory used during operation exceeds the onboard video memory of the virtual GPU.
In one embodiment, so that upper-layer applications or tasks can use the video memory swap area without awareness, the present disclosure proposes that, for applications or tasks that need to use the video memory swap area, the application's or task's video memory application request for the virtual GPU can be replaced with a video memory application request based on a unified address space.
The unified address space can be regarded as mapping host memory and device memory (i.e., the onboard video memory of the GPU) into one unified (virtual) address space. In the unified address space, memory and video memory are no longer distinguished, which provides support for freely exchanging data between memory and video memory.
Therefore, by replacing the application's or task's video memory application request for the virtual GPU (i.e., the default video memory application request) with a video memory application request based on the unified address space, the present disclosure enables at least part of the data in the onboard video memory to be swapped to the video memory swap area based on the unified address space when the currently available onboard video memory of the virtual GPU is insufficient.
The present disclosure can use the hijacking library to intercept the call requests of the application or task to the virtual GPU, and replace the default interface used by the application or task to apply for video memory with a video memory application interface based on the unified address space, thereby replacing the application's or task's video memory application request for the virtual GPU with a video memory application request based on the unified address space.
Cuda8 and subsequent Cuda versions introduced the concept of a unified address space. In the unified address space, device memory (video memory) and host memory are no longer distinguished; all data exchange between device memory and host memory is carried out automatically by the Linux kernel together with the HMM component in the Nvidia kernel module, and no longer requires the user to control it manually by calling cuMemcpyDtoH or cuMemcpyHtoD. This control mechanism makes free exchange between memory and video memory possible, so that memory can be used as a swap area for video memory.
The present disclosure can replace the default video memory application interface (cuMemAlloc) with a video memory application interface based on the unified address space (cuMemAllocManaged). When the video memory requested through the unified-address-space application interface exceeds the currently available onboard video memory of the virtual GPU, system components (such as the kernel module part of the Nvidia driver and the HMM component in the Linux kernel) can automatically swap at least part of the data stored in the onboard video memory to the video memory swap area, so as to release at least part of the onboard video memory and meet the video memory usage requirements of the application or task.
It can be seen that, by replacing the Cuda call chain, the present disclosure replaces ordinary video memory applications with video memory applications based on the unified address space, so that the application or task gains the ability to use virtual video memory while the whole process remains transparent to the application or task.
The GPU mentioned in the present disclosure can be deployed in a K8S cluster. The K8S cluster is a Kubernetes-based GPU cluster and includes multiple GPU nodes, each of which can include one or more GPUs. A GPU node can refer to a GPU server. In a K8S cluster, applications or tasks can be deployed in containers, which provide environmental support for running applications or tasks; the virtual GPU that an application or task needs to use can be mounted in the container for exclusive use by that application or task.
To set the splitting granularity of GPU virtualization flexibly, the present disclosure proposes that the virtualization information can be set by the user based on the actual situation of the cluster and the upper-layer applications or tasks. The virtualization information can characterize the granularity at which the GPU is split and/or the size of the video memory swap area allocated for the virtual GPU.
For example, the virtualization information can include the maximum number of virtual GPUs and the virtual video memory size. The maximum number of virtual GPUs limits how many virtual GPUs one physical card can be divided into at most. The virtual video memory size limits the size of the virtual video memory that needs to be configured for the virtual GPU. The virtual video memory size can be expressed by the indicator "virtual video memory multiple", which means that the virtual video memory is a multiple of the actual onboard video memory; for example, if the virtual video memory multiple is set to 10, the virtual video memory is 10 times the actual onboard video memory.
Thus, when dividing the GPU into multiple virtual GPUs, the GPU can be divided on the condition that the number of virtual GPUs obtained by splitting is not greater than the maximum number of virtual GPUs; when allocating at least part of the host memory to a virtual GPU as a video memory swap area, host memory equal to the virtual video memory size can be used as the video memory swap area and allocated to the virtual GPU.
To avoid interference between tasks, one virtual GPU can be allocated to one application or task. This solution limits the task's access to and use of the GPU based on the virtualization parameters set above; therefore, a change in the running status of one application or task will not affect applications or tasks on other virtual GPUs. When allocating a virtual GPU to an application or task, the resource requirement information of the application or task can be obtained; the resource requirement information characterizes the GPU resources required by the application or task. The GPU resources required by an application or task may differ in different states during operation, and the GPU resources required by an application can refer to the maximum GPU resources required during its entire running process. According to the resource requirement information, a virtual GPU that can satisfy the GPU resources required by the application or task can be selected.
The resource requirement information can include the computing power usage ratio and the video memory usage size. The computing power usage ratio characterizes the proportion of GPU computing power that the application or task needs to use, in units of %. For example, if the GPU computing power ratio is set to 10, the application or task can be scheduled to a virtual GPU with a GPU computing power ratio of 10% or more. The unit of the video memory usage size can be MB; for example, if the video memory usage size is set to 1024, the application or task will occupy 1024 MB of GPU video memory.
To enable containers to use virtual GPUs transparently, the present disclosure proposes that environment variables corresponding to the virtual GPU need to be set when creating the container, which can include, but are not limited to, the container identifier, the identifier of the virtual GPU mounted into the container, and the upper limit of GPU resources that the container can access. The container identifier can be a Universally Unique Identifier (UUID), used by the scheduler plug-in mentioned below to identify the container. The identifier of the virtual GPU can be the GPU device number accessible within the container, used for device mapping inside and outside the container. The upper limit of GPU resources that the container can access can refer to the maximum GPU resources that the virtual GPU can provide, which can include the upper limits of video memory and computing power usage ratio. Based on the environment variables, the virtual GPU can be mounted in the container, and the container can be started to run the application or task in the container.
The present disclosure can also monitor the resource usage of the virtual GPU. Optionally, the resource usage of the virtual GPU can also be output in real time (for example, displayed visually).
So far, the basic flow involved in the GPU virtualization method and the GPU resource usage method of the present disclosure has been described with reference to Figure 1. The present disclosure also proposes a cloud-native job scheduling apparatus, which is suitable for deployment on nodes in a K8S cluster that need to use GPU virtualization and schedules container jobs in a cloud-native manner.
Container jobs refer to jobs containing vGPU resources, such as pods, deployments, and jobs, submitted to the K8S cluster. Container jobs can also be called vGPU tasks. A container job can involve one or more containers; in K8S, a Pod is the smallest unit managed by K8S, and one Pod can contain multiple containers. Each container can run one application or task. The GPU resources required for running the application or task in a container can refer to vGPU resources, that is, one application or task can exclusively use one vGPU. The vGPU resources can characterize the size of the vGPU resources required by each application or task, such as the GPU computing power ratio and the video memory size used.
图2示出了根据本公开一个实施例的作业调度装置的结构示意图。
如图2所示,作业调度装置200可以包括调度器组件210和劫持库220。在一个实施例中,作业调度装置200还可以包括图中虚线框所示的设备组件230、挂载组件240、监测组件250以及标记组件260中的一个或多个。各组件可以通过chart的方式进行打包。
调度器组件210可以被配置为根据容器类作业对GPU资源的需求信息,将容器类作业调度到一个或多个GPU上,并将GPU切分成一个或多个虚拟GPU,每个虚拟GPU对应容器类作业中的一个容器,每个容器对应一个应用或任务,应用或任务运行在容器中。其中,对于已经投入使用的GPU,调度器组件可以对其剩余算力和显存进行切分,得到虚拟GPU。
针对至少一个虚拟GPU,调度器组件210还可以将至少部分主机内存作为显存交换区,分配给虚拟GPU,以使虚拟GPU的可用显存大于虚拟GPU的板载显存。调度器组件210可以根据容器类作业中的vGPU资源信息,对GPU进行切分。vGPU资源信息可以包括vGPU的个数以及各个vGPU需要提供的最大资源信息,如算力上限和显存上限。
调度器组件210,也可以称为调度器插件。调度器组件210可以是对K8S集群中的Scheduler extender进行改进得到的对所有vGPU任务进行调度的调度器插件,该调度器插件可以记为4PD-vGPU-Scheduler。
调度器组件210可以劫持并接管所有vGPU任务的调度,统筹所有集群的GPU资源并将任务分配(即调度)到合适的GPU节点上的某几个GPU上面去。原版的K8S官方调度器只支持按照个数分配GPU,其所分配到的GPU视为独占资源,其它应用或任务无 法使用。与之相比,本公开的调度器组件210可以支持任务指定需要使用的GPU资源(如显存大小和算力使用比例),通过将应用或任务调度到满足需求的虚拟GPU上,支持任务支持只使用GPU的一部分显存和算力,从而可以让多个任务共享一个GPU的资源。
The hijack library 220 is configured to intercept the application's or task's call requests to the GPU. For an application or task that needs virtual video memory, the hijack library 220 further sets (e.g., replaces) the interface used to request video memory (such as cuMemAlloc) to a unified-address-space-based allocation interface (such as cuMemAllocManaged), so that when the virtual GPU's currently available onboard video memory is insufficient, at least part of the data in the onboard video memory can be swapped to the swap area based on the unified address space. When the video memory requested via the unified-address-space-based interface exceeds the currently available onboard video memory, a system component (such as the HMM component) can automatically swap at least part of the data in the onboard video memory to the swap area, freeing onboard video memory.
The hijack library 220 may be obtained by improving the existing Hooked Cuda Driver component (i.e., a CUDA hijack library). Hijack libraries per se are prior art, but existing hijack libraries do not support multi-card slicing and cannot be used directly on K8S. The present disclosure may add to the hijack library the ability to communicate with the scheduler component 210, so that the K8S cluster can control each application or task dynamically.
As mentioned above, the concept of a unified address space was introduced in CUDA 8 and subsequent CUDA versions. In the unified address space, device memory (video memory) and host memory are no longer distinguished; all data exchange between them is performed automatically by the Linux kernel and the HMM component in the Nvidia kernel module, with no need for the user to control it manually by calling cuMemcpyDtoH or cuMemcpyHtoD. This mechanism makes free exchange between host memory and video memory possible, so that host memory can serve as a swap area for video memory.
By creatively applying the unified-address-space and hijack-library techniques to the field of GPU sharing, the present disclosure has the hijack library 220 (i.e., the CUDA hijack library) replace the CUDA call chain, turning ordinary video memory allocations into "unified address space" allocations, so that tasks gain the ability to use host memory as a video memory swap area while remaining completely unaware of it.
The hijack library 220 can hijack all upper-layer call requests by hijacking symbol calls, process them, and then forward them to the real underlying CUDA execution library. Fig. 3 shows a schematic diagram of the principle of the hijack library.
As shown in Fig. 3, the hijack library (Libcuda.so) sits between the driver layer (Nvidia GPU Driver) and the CUDA runtime (Cuda Runtime) layer, so it can intercept all requests the CUDA runtime sends to the driver. On this basis, the hijack library can also be configured to check whether the video memory requested by the application or task exceeds the video memory allocated to it, and, if it does not, forward the allocation request to the underlying driver, e.g., the GPU driver. For example, the hijack library may keep separate per-container statistics on video memory and compute usage, perform the corresponding validity check (the requested video memory must not exceed the allocated size), and then pass the request down to the driver; the driver can control the graphics processing unit (Nvidia GPU) by calling the interface function ioctl. The hijack library may be an interface (cuda API) added by modifying the guest operating system (Guest OS) using para-virtualization; for example, a virtualization layer may be added on top of Libcuda to build the hijack library. An application in the guest OS (CUDA Application) can be linked to the CUDA runtime through the calling library (CUDA Library) by static linking to support its execution, and the hijack library can be mounted into the runtime by dynamic linking via the function dlopen.
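As an illustration of the per-container validity check described above, the C sketch below tracks usage and rejects a request that exceeds the container's allocation before it would be forwarded to the driver. The structure and helper names are assumptions for exposition and are not asserted to be the implementation of libvgpu.so.

```c
/* Illustrative sketch of the per-container validity check performed by the
 * hijack library before forwarding a request to the underlying driver. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t vmem_limit;   /* video memory allocated to this container */
    size_t vmem_in_use;  /* video memory the container has requested so far */
} ContainerQuota;

/* Returns true if the request may be forwarded to the GPU driver. */
static bool check_and_account(ContainerQuota *q, size_t requested) {
    if (q->vmem_in_use + requested > q->vmem_limit)
        return false;              /* requested memory exceeds the allocation */
    q->vmem_in_use += requested;   /* track per-container usage */
    return true;
}
```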
The device component 230 is configured to mount (i.e., map) the hijack library 220 into the container and to set a preload library in the container, so that the hijack library is forcibly loaded before any process in the container starts.
The device component 230 may also be called a device plugin. It may be obtained by improving the Device Plugin of the K8S cluster; the improved device plugin may be denoted 4PD-vGPU-Device Plugin. The device component 230 is responsible for mapping the CUDA hijack library (libvgpu.so) into the container and setting a preload library in the container (e.g., the preload file /etc/ld.so.preload). The purpose of /etc/ld.so.preload is to force libvgpu.so to be loaded before any process starts, ensuring that the user cannot bypass the vGPU and access the GPU directly. As a result, all CUDA calls inside the container are forwarded through the hijack library.
The mount component 240 may be configured to obtain the identifier of the virtual GPU by communicating with the scheduler component 210. The virtual GPU identifier identifies the device number of the virtual GPU, and the mount component 240 can mount the virtual GPU into the container according to the identifier. The mount component 240 may mount the corresponding driver libraries into the container as well.
The mount component 240 resides at the container layer and may be obtained by improving nvidia-container-runtime; the improved nvidia-container-runtime may be denoted 4PD-nvidia-container-runtime. Compared with the ordinary nvidia-container-runtime, the mount component 240 additionally communicates with the scheduler component 210, and based on this communication it actually mounts into the container the vGPU that the scheduler component 210 has assigned to it.
The mount component 240 may further be configured to obtain the GPU resource information of the virtual GPU by communicating with the scheduler component 210, and to record the GPU resource information in the container's environment variables. The GPU resource information may refer to the upper limit of GPU resources the virtual GPU can provide, such as the video memory cap and the compute cap. The GPU resource information in the container's environment variables is the GPU resources the container can access (i.e., use). The container can be created and started according to the environment variables; creating and starting the container may be delegated to the container runtime runc.
The monitoring component 250, which may be denoted 4PD-VGPU-monitor, is configured to monitor the resource usage of the vGPUs, e.g., a set of predefined metrics. The monitoring component 250 may also push the metrics outward, making it easy to monitor and visualize vGPU resources across the whole cluster in real time.
The marking component 260 may be configured to record, for each container in a container job, a container identifier that uniquely identifies the container in the container's environment variables, so that the scheduler component 210 can identify it. A Universally Unique Identifier (UUID) may be used as the container identifier. The marking component 260 may be a MutatingWebhook in K8S.
Fig. 4 shows a schematic diagram of the overall processing flow for a vGPU task.
The steps shown in Fig. 4 may be performed by the corresponding components of the job scheduling apparatus of the present disclosure. Specifically, steps S410 and S420 may be performed by the marking component; steps S430, S440, and S480 may be performed by the scheduler component; step S450 may be performed by the device component; and step S460 may be performed by the mount component.
Referring to Fig. 4, in step S410, when a task is submitted, the marking component may first check whether the vGPU task contains vGPU resources. If vGPU resources are found, steps S420 to S460 may be executed to virtualize the GPU so that the application or task can use virtual video memory, in particular transparently. If no vGPU resources are found, the submitted task does not need GPU virtualization, so step S470 may be executed to run the default scheduling flow.
In step S420, a container UUID is added for each vGPU resource found, so that the mount component can identify the corresponding container. Each vGPU resource corresponds to one container, and the vGPU resource is the vGPU resource that the container needs.
In step S430, the GPU nodes in the cluster are filtered.
In step S440, the filtered GPU nodes are scored.
Through filtering and scoring, the most suitable GPU nodes (e.g., the one or more highest-scoring nodes) can finally be selected to run the vGPU task. For example, GPU nodes that support virtualization may be filtered first, and then the nodes may be scored according to the GPU compute power currently remaining on the node, the slicing granularity supported, and so on; the most suitable (i.e., highest-scoring) GPU node is selected, and the GPUs on the selected node are sliced to obtain one or more vGPUs that satisfy the demand.
In step S450, the hijack library mount is added.
After the task is submitted to the corresponding GPU node, the CUDA hijack library libvgpu.so and the preload file /etc/ld.so.preload can be mounted.
In step S460, the environment variables are set.
After step S450, the task can be submitted to the container layer, where the nvidia container runtime communicates with the vGPU scheduler, obtains the GPU index corresponding to the vGPU together with the video memory and utilization caps it may use, and fills them into the following three environment variables respectively: NVIDIA_VISIBLE_DEVICES, CUDA_DEVICE_MEMORY_LIMIT, and CUDA_DEVICE_SM_LIMIT. The first environment variable controls the GPU device number mounted into the container; the second and third control access to the GPU, being the caps on the video memory and the utilization that this container can access.
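As a concrete illustration, code running inside the container could read these limits roughly as sketched below in C. The parsing details (treating CUDA_DEVICE_MEMORY_LIMIT as a value in MiB and CUDA_DEVICE_SM_LIMIT as a percentage) are assumptions made for the example rather than a specification of the variables' formats.

```c
/* Illustrative sketch: reading the container's vGPU limits from the environment. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *devices   = getenv("NVIDIA_VISIBLE_DEVICES");   /* GPU index(es) */
    const char *mem_limit = getenv("CUDA_DEVICE_MEMORY_LIMIT"); /* memory cap */
    const char *sm_limit  = getenv("CUDA_DEVICE_SM_LIMIT");     /* utilization cap */

    /* Assumed formats: memory cap in MiB, utilization cap as a percentage. */
    size_t mem_cap_bytes = mem_limit
        ? (size_t)strtoull(mem_limit, NULL, 10) * 1024ULL * 1024ULL
        : 0;
    int sm_cap_pct = sm_limit ? atoi(sm_limit) : 100;

    printf("visible devices: %s\n", devices ? devices : "(unset)");
    printf("memory cap: %zu bytes, utilization cap: %d%%\n",
           mem_cap_bytes, sm_cap_pct);
    return 0;
}
```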
Finally, based on these environment variables, nvidia-container-cli can be invoked to mount the specific GPU devices and the corresponding driver libraries, and runc is left to start the container.
In summary, the present disclosure discusses a cloud-native GPU virtualization technique capable of providing virtual video memory, giving, at both the product and technical levels, a complete flow and solution for the above-mentioned shortcomings. The solution of the present disclosure can be implemented on cloud-native K8S technology and adapts directly to cloud-native scenarios; it can therefore be implemented as a cloud-native GPU virtualization scheme supporting virtual video memory. The cloud-native GPU virtualization scheme implemented with the job scheduling apparatus of the present disclosure mainly comprises a component deployment phase, an application or task creation phase, and an application or task running phase.
In the component deployment phase, the components of the GPU resource management apparatus can be installed on the nodes of the K8S cluster that need GPU virtualization. During deployment, parameters such as the "virtual video memory multiplier" and the "maximum number of vGPUs" need to be set; the user can set them according to the actual conditions of the cluster and of the upper-layer applications or tasks. These two parameters are newly introduced by this solution, and their roles are described above.
In the application or task creation phase, the user sets parameters such as the "GPU compute ratio used" and the "GPU video memory size used" according to the GPU resource consumption of the application or task and the compute power/video memory size of the GPU cards in the cluster, flexibly specifying the compute power and video memory each application or task needs. These two parameters are also newly introduced by the present disclosure.
In the application or task running phase, the application or task runs on a virtual GPU with the configured parameters; applications on different virtual GPUs do not interfere with one another, and a user changing the running state of one application or task does not affect the applications or tasks on other virtual GPUs.
The expected effect is that the total video memory used by all applications or tasks can exceed the total video memory of the physical GPUs, and that in inference scenarios, as long as the GPU compute power is not saturated, each application's or task's performance is essentially the same as without GPU virtualization, with overhead within 10%. This is because inference tasks have strong locality, so swapping between video memory and host memory is not frequent enough to cause a large performance degradation; overhead here refers to the increase in request processing time in the inference scenario.
By sharing a GPU, the present disclosure enables multiple AI applications (such as inference models) to be deployed on a single GPU.
In the prior art, multiple inference models can be deployed on one GPU through multi-model loading (e.g., the multi-model loading features of Nvidia Triton server, torchserve, or tf-serving). For example, Nvidia has introduced the Nvidia Triton server inference engine for inference scenarios, which can load multiple inference models on one GPU. As another example, many AI training frameworks have introduced corresponding inference serving engines, which can load multiple models in one task and serve inference for each model. The drawback of multi-model loading is that such techniques usually apply only to a few specific kinds of models: tf-serving can load only TensorFlow models, torchserve can load only Torch models, and even Nvidia Triton server, the most broadly applicable, can load only PyTorch script models when it comes to PyTorch, so they cannot cover all application scenarios. The GPU sharing scheme of the present disclosure can cover all application scenarios.
In the prior art, multiple inference models can also be deployed on one GPU by choosing other GPU virtualization schemes. However, most currently popular GPU virtualization schemes cannot be adapted to privately deployed cloud-native scenarios: for example, the virtualization capability provided by Nvidia vGPU targets virtual machine scenarios, and the qGPU and cGPU schemes proposed by some companies target their public clouds. Such schemes are difficult to adapt directly to cloud-native scenarios, whereas the present disclosure adapts directly to cloud-native scenarios and can be implemented as a cloud-native GPU virtualization scheme.
The inference models referred to in the present disclosure may be neural network models. Neural network models may be used to predict image categories, text categories, speech emotions, fraudulent transactions, advertisement click-through rates, and the like. A neural network model is intended to make predictions for problems related to objects or events in relevant scenarios, for example predicting image categories, text in images, text categories, speech emotion categories, fraudulent transactions, advertisement click-through rates, commodity prices, and so on, such that the prediction results can serve directly as a basis for decisions or be further combined with other rules to serve as such a basis.
In one embodiment, the scenarios in which a neural network model may be used include, but are not limited to, the following:
Image processing scenarios, including optical character recognition (OCR), face recognition, object recognition, and image classification; more specifically, for example, OCR can be applied to bill (e.g., invoice) recognition and handwriting recognition, face recognition can be applied to fields such as security, object recognition can be applied to traffic sign recognition in autonomous driving, and image classification can be applied to the "shoot to shop" and "find the same item" features of e-commerce platforms;
Speech recognition scenarios, including products that support human-machine interaction by voice, such as phone voice assistants (e.g., Siri on the iPhone) and smart speakers;
Natural language processing scenarios, including text review (e.g., contracts, legal documents, and customer service records), spam content identification (e.g., spam SMS identification), and text classification (sentiment, intent, topic, etc.);
Automatic control scenarios, including predicting adjustment operations for mine groups, wind turbine groups, and air-conditioning systems; specifically, a set of adjustment operations with a high extraction rate can be predicted for a mine group, a set with high generation efficiency for a wind turbine group, and, for an air-conditioning system, a set that meets demand while saving energy;
Intelligent question answering scenarios, including chatbots and intelligent customer service;
Business decision scenarios, including scenarios in the fintech, healthcare, and municipal fields, where:
the fintech field includes marketing (e.g., coupon usage prediction, advertisement click behavior prediction, user profiling) and customer acquisition, anti-fraud, anti-money-laundering, underwriting and credit scoring, and commodity price prediction;
the healthcare field includes disease screening and prevention, personalized health management, and assisted diagnosis;
the municipal field includes social governance and regulatory enforcement, resource, environment and facility management, industrial development and economic analysis, public services and livelihood support, and smart cities (allocation and management of urban resources such as public transit, ride-hailing, and shared bicycles);
Recommendation scenarios, including recommendation of news, advertisements, music, consulting, videos, and financial products (e.g., wealth management, insurance);
Search scenarios, including web search, image search, text search, video search, and so on;
Abnormal behavior detection scenarios, including detection of abnormal electricity consumption by State Grid customers, detection of malicious network traffic, detection of abnormal behavior in operation logs, and so on.
The GPU virtualization method described above with reference to Fig. 1 may also be implemented as a GPU virtualization apparatus.
Fig. 5 shows a schematic structural diagram of a GPU virtualization apparatus according to an embodiment of the present disclosure.
The functional units of the GPU virtualization apparatus may be implemented by hardware, software, or a combination of hardware and software that embodies the principles of the present disclosure. Those skilled in the art will understand that the functional units described in Fig. 5 may be combined or divided into sub-units to implement the principles disclosed above. Therefore, the description herein supports any possible combination, division, or further definition of the functional units described herein.
The functional units the GPU virtualization apparatus may have, and the operations each functional unit may perform, are briefly described below; for the details involved, refer to the relevant description above, which is not repeated here.
Referring to Fig. 5, the GPU virtualization apparatus 500 may include a slicing module 510 and an allocation module 520.
The slicing module 510 is configured to slice a GPU into multiple virtual GPUs.
The allocation module 520 is configured to allocate, for at least one virtual GPU, part of the memory of the host where the GPU resides to the virtual GPU as a video memory swap area, so that the virtual GPU's available video memory exceeds its onboard video memory.
The GPU virtualization apparatus 500 may further include a first acquisition module configured to acquire virtualization information, the virtualization information including a maximum number of virtual GPUs and a virtual video memory size. The slicing module may slice the GPU into multiple virtual GPUs subject to the condition that the number of resulting virtual GPUs does not exceed the maximum number of virtual GPUs. The allocation module may allocate host memory equal in size to the virtual video memory to the virtual GPU as the swap area.
The GPU virtualization apparatus 500 may further include a second acquisition module and an assignment module. The second acquisition module is configured to acquire resource demand information characterizing the GPU resources an application or task requires. The assignment module is configured to assign a virtual GPU to the application or task according to the resource demand information.
The GPU virtualization apparatus 500 may further include a replacement module configured to replace the application's or task's video memory allocation request for the virtual GPU with a unified-address-space-based allocation request, so that when the virtual GPU's currently available onboard video memory is insufficient, at least part of the data in the onboard video memory can be swapped to the swap area based on the unified address space. The replacement module may use a hijack library to intercept the application's or task's call requests to the virtual GPU and replace the default interface used for requesting video memory with a unified-address-space-based allocation interface.
The application or task may run in a container. The GPU virtualization apparatus 500 may further include a setting module, a mounting module, and a starting module. The setting module is configured to set the container's environment variables, the environment variables including a container identifier, the identifier of the virtual GPU mounted into the container, and the upper limit of GPU resources the container can access. The mounting module is configured to mount the virtual GPU into the container based on the environment variables. The starting module is configured to start the container to run the application or task in it.
The GPU virtualization apparatus 500 may further include a monitoring module configured to monitor the resource usage of the virtual GPU.
The present disclosure may also be implemented as a Kubernetes cluster comprising multiple GPU nodes, each GPU node including one or more GPUs, and a job scheduling apparatus deployed on at least one of the GPU nodes; the job scheduling apparatus may be the job scheduling apparatus described above with reference to Fig. 2.
Fig. 6 shows a schematic structural diagram of a computing device, according to an embodiment of the present disclosure, that can be used to implement the above GPU resource usage method or GPU virtualization method.
Referring to Fig. 6, the computing device 600 includes a memory 610 and a processor 620.
The processor 620 may be a multi-core processor or may comprise multiple processors. In some embodiments, the processor 620 may comprise a general-purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), and so on. In some embodiments, the processor 620 may be implemented with customized circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 620 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even when the computer is powered off. In some embodiments, the permanent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In some other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory, and may store some or all of the instructions and data the processor needs at runtime. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical discs may also be used. In some embodiments, the memory 610 may include readable and/or writable removable storage devices, such as compact discs (CD), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density optical discs, flash memory cards (e.g., SD cards, mini SD cards, Micro-SD cards), magnetic floppy disks, and so on. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.
Executable code is stored in the memory 610; when the executable code is processed by the processor 620, the processor 620 can be caused to perform the GPU resource usage method or GPU virtualization method described above.
The GPU resource usage method, GPU virtualization method, job scheduling apparatus, and cluster according to the present disclosure have been described above in detail with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for executing the steps defined in the above method of the present disclosure.
Alternatively, the present disclosure may also be implemented as a computer-readable storage medium (or a machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or a computing device, a server, etc.), the processor is caused to perform the steps of the above method according to the present disclosure.
Those skilled in the art will also understand that the various exemplary logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the accompanying drawings show possible architectures, functions, and operations of systems and methods according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in a block may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented with a dedicated hardware-based system that performs the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application, or the improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (19)

  1. A GPU resource usage method, comprising:
    slicing a GPU into a plurality of virtual GPUs;
    for at least one of the virtual GPUs, allocating at least part of host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the onboard video memory of the virtual GPU; and
    replacing an application's or task's video memory allocation request for the virtual GPU with a video memory allocation request based on a unified address space, so that, when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  2. The method according to claim 1, wherein replacing an application's or task's video memory allocation request for the virtual GPU with a video memory allocation request based on a unified address space comprises:
    using a hijack library to intercept the application's or task's call requests to the virtual GPU, and replacing the default interface used by the application or task to request video memory with a video memory allocation interface based on the unified address space.
  3. The method according to claim 1, further comprising:
    acquiring virtualization information, the virtualization information comprising a maximum number of virtual GPUs and a virtual video memory size,
    wherein slicing a GPU into a plurality of virtual GPUs comprises: slicing the GPU into a plurality of virtual GPUs subject to the condition that the number of virtual GPUs obtained by slicing is not greater than the maximum number of virtual GPUs,
    and allocating part of the memory of the device where the GPU resides to the virtual GPU as a video memory swap area comprises: allocating host memory equal in size to the virtual video memory size to the virtual GPU as the video memory swap area.
  4. The method according to claim 1, further comprising:
    acquiring resource demand information, the resource demand information characterizing the GPU resources required by an application or task;
    assigning a virtual GPU to the application or task according to the resource demand information.
  5. The method according to claim 4, wherein the resource demand information comprises the GPU compute ratio to be used and the video memory size to be used.
  6. The method according to claim 1, wherein the application or task runs in a container, the method further comprising:
    setting environment variables of the container, the environment variables comprising a container identifier, an identifier of the virtual GPU mounted into the container, and an upper limit of GPU resources the container can access;
    mounting the virtual GPU into the container based on the environment variables; and
    starting the container to run the application or task in the container.
  7. The method according to claim 1, further comprising:
    monitoring resource usage of the virtual GPU.
  8. A GPU virtualization method, comprising:
    slicing a GPU into a plurality of virtual GPUs;
    for at least one of the virtual GPUs, allocating at least part of host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the onboard video memory of the virtual GPU.
  9. A job scheduling apparatus, comprising:
    a scheduler component configured to schedule a container job onto one or more GPUs and slice the GPU into one or more virtual GPUs, each virtual GPU corresponding to one container in the container job, each container corresponding to one application or task, the application or task running in the container, wherein, for at least one of the virtual GPUs, the scheduler component further allocates at least part of host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the onboard video memory of the virtual GPU; and
    a hijack library configured to intercept the application's or task's call requests to the GPU, wherein, for an application or task that needs to use virtual video memory, the hijack library further sets the interface used to request video memory to a video memory allocation interface based on a unified address space.
  10. The apparatus according to claim 9, wherein
    the hijack library is further configured to check whether the video memory requested by the application or task is larger than the video memory allocated to it, and, if the requested video memory is not larger than the allocated video memory, send the video memory allocation request to a driver.
  11. The apparatus according to claim 9, further comprising:
    a device component configured to mount the hijack library into a container and set a preload library in the container, so that the hijack library is forcibly loaded before a process in the container starts.
  12. The apparatus according to claim 9, further comprising:
    a mount component configured to obtain a virtual GPU identifier of the virtual GPU by communicating with the scheduler component, and mount the virtual GPU into a container according to the virtual GPU identifier.
  13. The apparatus according to claim 12, wherein
    the mount component is further configured to obtain GPU resource information of the virtual GPU by communicating with the scheduler component, and record the GPU resource information in the container's environment variables.
  14. The apparatus according to claim 9, further comprising:
    a monitoring component configured to monitor and output the resource usage of the virtual GPU.
  15. The apparatus according to claim 9, further comprising:
    a marking component configured to record, for each container in the container job, a container identifier that uniquely identifies the container in the container's environment variables.
  16. A Kubernetes cluster, comprising:
    a plurality of GPU nodes, each of the GPU nodes comprising one or more GPUs; and
    a job scheduling apparatus deployed on at least one of the GPU nodes, the job scheduling apparatus being the job scheduling apparatus according to any one of claims 9 to 15.
  17. A GPU virtualization apparatus, comprising:
    a slicing module configured to slice a GPU into a plurality of virtual GPUs; and
    an allocation module configured to, for at least one of the virtual GPUs, allocate part of the memory of the host where the GPU resides to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the onboard video memory of the virtual GPU.
  18. A computing device, comprising:
    a processor; and
    a memory having executable code stored thereon, wherein, when the executable code is executed by the processor, the processor is caused to perform the method according to any one of claims 1 to 8.
  19. A computer-readable storage medium having executable code stored thereon, wherein, when the executable code is executed by a processor of a computing device, the processor is caused to perform the method according to any one of claims 1 to 8.
PCT/CN2023/111673 2022-08-09 2023-08-08 Gpu资源使用方法、gpu虚拟化方法以及作业调度装置、集群 WO2024032587A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210950598.5 2022-08-09
CN202210950598.5A CN117632447A (zh) 2022-08-09 2022-08-09 Gpu资源使用方法、gpu虚拟化方法以及作业调度装置、集群

Publications (1)

Publication Number Publication Date
WO2024032587A1 true WO2024032587A1 (zh) 2024-02-15

Family

ID=89850888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111673 WO2024032587A1 (zh) 2022-08-09 2023-08-08 Gpu资源使用方法、gpu虚拟化方法以及作业调度装置、集群

Country Status (2)

Country Link
CN (1) CN117632447A (zh)
WO (1) WO2024032587A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133574A (zh) * 2024-05-06 2024-06-04 沐曦集成电路(上海)有限公司 一种sram生成***

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160188251A1 (en) * 2014-07-15 2016-06-30 Nvidia Corporation Techniques for Creating a Notion of Privileged Data Access in a Unified Virtual Memory System
CN111223036A (zh) * 2019-12-29 2020-06-02 广东浪潮大数据研究有限公司 一种gpu虚拟化共享方法、装置及电子设备和存储介质
US20210133123A1 (en) * 2019-11-04 2021-05-06 Nvidia Corporation Techniques for an efficient fabric attached memory
US20210208951A1 (en) * 2020-08-04 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for sharing gpu, electronic device and readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160188251A1 (en) * 2014-07-15 2016-06-30 Nvidia Corporation Techniques for Creating a Notion of Privileged Data Access in a Unified Virtual Memory System
US20210133123A1 (en) * 2019-11-04 2021-05-06 Nvidia Corporation Techniques for an efficient fabric attached memory
CN111223036A (zh) * 2019-12-29 2020-06-02 广东浪潮大数据研究有限公司 一种gpu虚拟化共享方法、装置及电子设备和存储介质
US20210208951A1 (en) * 2020-08-04 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for sharing gpu, electronic device and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Taking stock of GPU sharing solutions from the industry", QIONGLING TECHNOLOGY, 31 August 2021 (2021-08-31), XP093138346, Retrieved from the Internet <URL:https://www.qiongling.com/2021/08/31/15370.html> [retrieved on 20240306] *

Also Published As

Publication number Publication date
CN117632447A (zh) 2024-03-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851808

Country of ref document: EP

Kind code of ref document: A1