WO2024032587A1 - Method for using GPU resources, GPU virtualization method, job scheduling apparatus, and cluster - Google Patents

Method for using GPU resources, GPU virtualization method, job scheduling apparatus, and cluster

Info

Publication number
WO2024032587A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
video memory
virtual
container
application
Prior art date
Application number
PCT/CN2023/111673
Other languages
English (en)
Chinese (zh)
Inventor
李孟轩
张冠一
Original Assignee
第四范式(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司
Publication of WO2024032587A1

Classifications

    • G06F 9/5027 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06T 1/20 — Processor architectures; processor configuration, e.g. pipelining

Definitions

  • the present disclosure relates to the field of computers, and in particular to a GPU resource usage method, a GPU virtualization method, a job scheduling device, and a cluster.
  • Modern graphics processing units (GPUs) began as accelerators for video games but have evolved over the past 20 years into enterprise server processors for high-performance computing and artificial intelligence applications.
  • GPUs lead performance in supercomputing, artificial intelligence training and inference, drug research, financial modeling, and medical imaging. They are also used in more mainstream tasks where CPUs are not fast enough, such as in GPU-driven relational databases. GPUs are better suited than CPUs to handle many of the calculations required for artificial intelligence and machine learning in enterprise data centers and hyperscale networks. The CPU can handle the work, but it takes longer. Because GPUs are designed to solve complex mathematical problems in parallel by breaking them down into separate tasks that they handle simultaneously, they can solve these problems faster.
  • GPU virtualization technology improves GPU utilization by dividing the GPU into multiple smaller-granularity virtual GPUs and using each virtual GPU to run applications or tasks that consume less computing power and video memory.
  • GPU computing power utilization and video memory utilization are not directly proportional.
  • When the utilization of computing power is far lower than the utilization of video memory, making full use of the computing power requires dividing the GPU into smaller-granularity virtual GPUs. With a traditional GPU virtualization solution, the video memory of each such virtual GPU is then very small and cannot properly support a single application or task; conversely, if the GPU is divided into larger-granularity virtual GPUs to meet the video memory requirements, the GPU's computing power sits idle.
  • A technical problem to be solved by this disclosure is therefore to provide a solution that improves the overall utilization of GPU resources while resolving the conflict between the GPU's computing power utilization and video memory utilization.
  • A method for using GPU resources is provided, including: dividing the GPU into multiple virtual GPUs; for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU; and replacing the video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • A GPU virtualization method is provided, including: dividing the GPU into multiple virtual GPUs; and, for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • A job scheduling device is provided, including: a scheduler component configured to schedule container jobs to one or more GPUs and to divide the GPU into one or more virtual GPUs, where each virtual GPU corresponds to a container in a container-type job and each container corresponds to an application or task.
  • The application or task runs in the container.
  • The scheduler component also allocates at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU; and a hijacking library is configured to intercept the call requests of the application or task to the GPU.
  • The hijacking library also sets the interface used to apply for video memory to a video memory application interface based on the unified address space.
  • A Kubernetes cluster is provided, including: multiple GPU nodes, each GPU node including one or more GPUs; and a job scheduling device deployed on at least one GPU node, the job scheduling device being the GPU resource management device (job scheduling device) described above in this disclosure.
  • A GPU virtualization device is provided, including: a splitting module configured to split the GPU into multiple virtual GPUs; and an allocation module configured to, for at least one virtual GPU, allocate at least part of the memory of the host where the GPU is located to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • A computing device is provided, including: a processor; and a memory on which executable code is stored.
  • When the executable code is executed by the processor, the processor is caused to execute the method of the first aspect described above.
  • A computer-readable storage medium having executable code stored thereon is provided.
  • When the executable code is executed by a processor of a computing device, the processor is caused to execute the method of the first aspect described above.
  • The present disclosure allocates host memory to the virtual GPU as a video memory swap area, so that the swap area can act as virtual video memory and increase the available video memory of the virtual GPU. This solves the problem that, when the GPU is divided at a smaller granularity in order to fully utilize its computing power, the onboard video memory of each virtual GPU is not enough to support a single application or task, and thus addresses the problem that the GPU's computing power utilization and video memory utilization cannot both be accommodated.
  • In addition, the present disclosure replaces the video memory application request of the application or task for the virtual GPU with a video memory application request based on a unified address space, so that the application or task can use the virtual video memory (i.e., the video memory swap area) without being aware of it.
  • Figure 1 shows a schematic diagram of the principle of virtualizing a GPU according to the present disclosure.
  • Figure 2 shows a schematic structural diagram of a job scheduling device according to an embodiment of the present disclosure.
  • Figure 3 shows a schematic diagram of the hijacking library.
  • Figure 4 shows a schematic diagram of the overall processing flow for vGPU tasks.
  • Figure 5 shows a schematic structural diagram of a GPU virtualization device according to an embodiment of the present disclosure.
  • FIG. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
  • Figure 1 shows a schematic diagram of the principle of virtualizing a GPU according to the present disclosure.
  • a GPU can be divided into multiple virtual GPUs.
  • GPU can refer to a physical GPU card.
  • the segmentation granularity can be flexibly set as needed.
  • Virtual GPU can also be called vGPU.
  • At least part of the host memory of the host (such as a server) where the GPU is located can be used as a video memory swap area and allocated to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • Onboard video memory refers to the video memory integrated on the physical GPU card, that is, the video memory provided by the physical GPU card itself. Onboard video memory can also be called physical video memory.
  • For a virtual GPU that has been allocated a video memory swap area, when its onboard video memory is insufficient, part of the space in the onboard video memory can be released for use by the current program. The data originally in the released space is saved to the video memory swap area; when that data is needed again, it can be swapped back from the swap area to the onboard video memory.
  • the total available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • the memory in the video memory swap area can serve as the virtual video memory of the virtual GPU, thereby increasing the available video memory of the virtual GPU.
  • the total available video memory of a virtual GPU is equal to the onboard video memory of the virtual GPU plus the video memory swap area allocated for the virtual GPU.
  • In this way, the present disclosure makes the available video memory exceed the onboard video memory, so that when the GPU is divided into smaller-granularity virtual GPUs in order to fully utilize GPU resources, there is no need to worry that the onboard video memory of a virtual GPU is too small to support a single application or task. This improves the overall utilization of GPU resources while resolving the trade-off between the GPU's computing power utilization and video memory utilization, and helps customers set the partitioning granularity of GPU virtualization more flexibly to maximize GPU resource utilization.
  • a virtual GPU can be used exclusively by one application or task.
  • Applications or tasks that need to use the video memory swap area are those whose maximum video memory usage during operation exceeds the onboard video memory of the virtual GPU.
  • For applications or tasks that need to use the video memory swap area, the present disclosure proposes that the application's or task's video memory application request for the virtual GPU can be replaced with a video memory application request based on a unified address space.
  • the unified address space can be regarded as mapping host memory and device memory (ie, the onboard video memory of the GPU) into a unified (virtual) address space.
  • In the unified address space, memory and video memory are no longer distinguished, which provides support for free data exchange between memory and video memory.
  • The present disclosure replaces the video memory application request (i.e., the default video memory application request) of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • The present disclosure can use the hijacking library to intercept the call requests of the application or task to the virtual GPU, and replace the default interface used by the application or task to apply for video memory with a video memory application interface based on a unified address space, thereby replacing the application's or task's video memory application request for the virtual GPU with a video memory application request based on the unified address space.
  • This disclosure can replace the default video memory application interface (cuMemAlloc) with a unified address space-based video memory application interface (cuMemAllocManaged).
  • When the currently available onboard video memory of the virtual GPU is insufficient, system components (such as the kernel module part of the Nvidia driver and the HMM component in the Linux kernel) can automatically swap at least part of the data stored in the onboard video memory to the video memory swap area, releasing at least part of the onboard video memory to meet the video memory usage requirements of the application or task.
  • By replacing the Cuda call chain, the present disclosure replaces ordinary video memory applications with video memory applications based on a unified address space, so that the application or task gains the ability to use virtual video memory while the entire process remains transparent to the application or task. A simplified sketch of this call-chain replacement follows.
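  • As an illustration only (not the actual libvgpu.so of this disclosure), the following C sketch shows the general idea of such a call-chain replacement: a preloaded shim exports the default allocation symbol and forwards the request to the unified-address-space interface of the real driver library. Symbol names, error handling, and the fact that real CUDA driver symbols are versioned (e.g., cuMemAlloc_v2) are simplified assumptions here; the actual hijacking library also performs per-container accounting and communicates with the scheduler component.

```c
/* Illustrative sketch of the call-chain replacement idea; not the disclosed
 * libvgpu.so. A shim loaded via LD_PRELOAD / /etc/ld.so.preload exports
 * cuMemAlloc and silently upgrades it to cuMemAllocManaged, so the driver and
 * the Linux HMM machinery can later page data between onboard video memory
 * and the host-memory swap area. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef unsigned long long CUdeviceptr;   /* 64-bit device pointer */
typedef int CUresult;                     /* 0 == CUDA_SUCCESS     */
#define CU_MEM_ATTACH_GLOBAL 0x1
#define CUDA_ERROR_NOT_INITIALIZED 3

/* Real cuMemAllocManaged, resolved lazily from the actual driver library. */
static CUresult (*real_alloc_managed)(CUdeviceptr *, size_t, unsigned int);

static int resolve_real_symbol(void) {
    if (real_alloc_managed) return 0;
    void *libcuda = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_GLOBAL);
    if (!libcuda) return -1;
    *(void **)&real_alloc_managed = dlsym(libcuda, "cuMemAllocManaged");
    return real_alloc_managed ? 0 : -1;
}

/* Exported under the name of the default allocation interface: applications
 * keep calling cuMemAlloc, but actually obtain unified-address-space memory. */
CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize) {
    if (resolve_real_symbol() != 0) return CUDA_ERROR_NOT_INITIALIZED;
    fprintf(stderr, "[vgpu-shim] cuMemAlloc(%zu bytes) -> cuMemAllocManaged\n",
            bytesize);
    return real_alloc_managed(dptr, bytesize, CU_MEM_ATTACH_GLOBAL);
}
```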
  • the GPU mentioned in this disclosure can be deployed in a K8S cluster.
  • the K8S cluster is a Kubernetes-based GPU cluster and includes multiple GPU nodes.
  • Each GPU node can include one or more GPUs.
  • the GPU node may refer to the GPU server.
  • applications or tasks can be deployed in containers. Containers can provide environmental support for the running of applications or tasks.
  • The virtual GPU that an application or task needs to use can be mounted into the container for exclusive use by that application or task.
  • the present disclosure proposes that the virtualization information can be set by the user based on the actual situation of the cluster and upper-layer applications or tasks.
  • the virtualization information may represent the granularity of partitioning the GPU and/or the size of the video memory swap area allocated for the virtual GPU.
  • the virtualization information may include the maximum number of virtual GPUs and virtual video memory size.
  • the maximum number of virtual GPUs is used to limit the maximum number of virtual GPUs that a physical card can be divided into.
  • the virtual video memory size is used to limit the size of the virtual video memory that needs to be configured for the virtual GPU.
  • the size of virtual video memory can be expressed by the indicator "virtual video memory multiple".
  • The virtual video memory multiple means that the virtual video memory is a multiple of the actual onboard video memory. For example, if the virtual video memory multiple is set to 10, the virtual video memory is 10 times the actual onboard video memory.
  • When the GPU is divided into multiple virtual GPUs, the division can be performed on the condition that the number of virtual GPUs obtained is not greater than the maximum number of virtual GPUs; and when at least part of the host memory is allocated to a virtual GPU as a video memory swap area, host memory equal in size to the virtual video memory can be used as the video memory swap area and allocated to the virtual GPU.
  • A virtual GPU can be assigned to an application or task. This solution limits the task's access to and use of the GPU based on the virtualization parameters set above, so changes in the running status of one application or task will not affect applications or tasks on other virtual GPUs.
  • the resource requirement information of the application or task can be obtained.
  • the resource requirement information is used to characterize the GPU resources required by the application or task.
  • the GPU resources required by an application or task may be different in different states during the running process.
  • the GPU resources required by the application may refer to the maximum GPU resources required by the application or task during the entire running process. Based on the resource requirement information, a virtual GPU that can meet the GPU resources required by the application or task can be selected.
  • Resource demand information may include computing power usage ratio and video memory usage size.
  • the computing power usage ratio is used to characterize the proportion of GPU computing power required by an application or task. The unit is %. If the GPU computing power ratio is set to 10, applications or tasks can be scheduled to virtual GPUs with a GPU computing power ratio of more than 10%.
  • the unit of the video memory usage size can be MB. If the video memory usage size is set to 1024, the application or task will occupy 1024MB of GPU video memory.
  • The present disclosure proposes that when creating the container, environment variables corresponding to the virtual GPU need to be set, which may include but are not limited to the container identifier, the identifier of the virtual GPU mounted into the container, and the upper limit of GPU resources that the container can access.
  • the container identifier can be a Universally Unique Identifier (UUID), which is used by the scheduler plug-in mentioned below to identify the container.
  • The identifier of the virtual GPU can be the GPU device number accessible within the container, and is used for device mapping between the inside and outside of the container.
  • the upper limit of GPU resources that a container can access can refer to the maximum GPU resources that a virtual GPU can provide, which can include the upper limit of video memory and computing power usage.
  • the virtual GPU can be mounted in the container based on environment variables and the container can be started to run applications or tasks in the container.
  • the present disclosure can also monitor the resource usage of the virtual GPU.
  • the resource usage of the virtual GPU can also be output (such as visual display) in real time.
  • This disclosure also proposes a job scheduling device that supports cloud native, which is suitable for deployment on nodes that require GPU virtualization in a K8S cluster, and schedules container jobs in a cloud-native manner.
  • Container jobs refer to pod, deployment, job and other jobs containing vGPU resources submitted to the K8S cluster.
  • Container jobs can also be called vGPU tasks.
  • Container jobs can involve one or more containers.
  • A K8S Pod is the smallest unit managed by K8S.
  • a Pod can contain multiple containers.
  • Each container can run an application or task.
  • the GPU resources required to run applications or tasks in the container may refer to vGPU resources, that is, an application or task can exclusively use a vGPU.
  • vGPU resources can be used to characterize the size of vGPU resources required by each application or task, such as the proportion of GPU computing power and the size of used video memory.
  • Figure 2 shows a schematic structural diagram of a job scheduling device according to an embodiment of the present disclosure.
  • the job scheduling device 200 may include a scheduler component 210 and a hijacking library 220 .
  • The job scheduling apparatus 200 may also include one or more of the device component 230, the mounting component 240, the monitoring component 250, and the marking component 260 shown in the dotted boxes in the figure. Each component can be packaged via charts (for example, Helm charts).
  • The scheduler component 210 may be configured to schedule a container job to one or more GPUs based on the GPU resource requirement information of the container job, and to divide the GPU into one or more virtual GPUs, each virtual GPU corresponding to a container in the container job.
  • the scheduler component 210 can also allocate at least part of the host memory as a video memory swap area to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • the scheduler component 210 can divide the GPU according to the vGPU resource information in the container job.
  • the vGPU resource information may include the number of vGPUs and the maximum resource information that each vGPU needs to provide, such as the upper limit of computing power and the upper limit of video memory.
  • the scheduler component 210 may also be called a scheduler plug-in.
  • the scheduler component 210 can be a scheduler plug-in that schedules all vGPU tasks and is obtained by improving the Scheduler extender in the K8S cluster.
  • the scheduler plug-in can be recorded as 4PD-vGPU-Scheduler.
  • the scheduler component 210 can hijack and take over the scheduling of all vGPU tasks, coordinate the GPU resources of all clusters, and allocate (i.e., schedule) tasks to certain GPUs on appropriate GPU nodes.
  • The original official K8S scheduler only supports allocating GPUs by number, and an allocated GPU is regarded as an exclusive resource that cannot be used by other applications or tasks.
  • In contrast, the scheduler component 210 of the present disclosure supports tasks specifying the GPU resources they need (such as video memory size and computing power usage ratio). By scheduling applications or tasks to virtual GPUs that meet those needs, a task uses only part of the GPU's video memory and computing power, allowing multiple tasks to share the resources of one GPU.
  • The hijacking library 220 is configured to intercept calls made by applications or tasks to the GPU. For applications or tasks that need to use virtual video memory, the hijacking library 220 also sets (for example, replaces) the interface used to apply for video memory (such as cuMemAlloc) to a video memory application interface based on a unified address space (such as cuMemAllocManaged), so that when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • In that case, system components can automatically swap at least part of the data stored in the onboard video memory to the video memory swap area, freeing up at least part of the onboard video memory.
  • the hijacking library 220 may be an improvement of the existing component Hooked Cuda Driver (ie, CUDA hijacking library).
  • the hijacking library itself is an existing technology, but the hijacking library in the existing technology does not support multi-card segmentation and cannot be used directly on K8S.
  • the present disclosure can add the function of communicating with the scheduler component 210 to the hijacking library, so that the K8S cluster can dynamically control each application or task.
  • Cuda 8 and subsequent Cuda versions introduced the concept of a unified address space.
  • In the unified address space, device memory (video memory) and host memory are no longer distinguished: all data exchange between device memory and host memory is performed automatically by the Nvidia kernel module working together with the HMM component in the Linux kernel, and no longer requires the user to control it manually by calling cuMemcpyDtoH or cuMemcpyHtoD. This mechanism makes it possible to freely exchange data between memory and video memory, allowing host memory to be used as a swap area for video memory; a minimal usage sketch is given below.
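  • The following CUDA sketch is an illustration written against the public runtime-API equivalent (cudaMallocManaged), not code from the disclosure. It shows why the unified address space makes the explicit copy calls unnecessary: a managed buffer, which may be sized beyond a vGPU's onboard video memory, is touched from both the host and the device, and pages migrate on demand. Oversubscription of this kind assumes a GPU and driver generation that support demand paging on Linux (roughly Pascal or newer).

```cuda
// Illustrative only: unified/managed memory accessed from host and device
// without cuMemcpyHtoD/DtoH; the driver and HMM migrate pages on demand.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n, float k) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

int main() {
    // Deliberately larger than a small vGPU's assumed onboard video memory:
    // 2 Gi floats = 8 GiB; with managed memory the allocation can still succeed.
    const size_t n = 2ULL << 30;
    float *buf = nullptr;
    if (cudaMallocManaged(&buf, n * sizeof(float)) != cudaSuccess) return 1;

    for (size_t i = 0; i < n; ++i) buf[i] = 1.0f;        // host writes the pages

    scale<<<(unsigned)((n + 255) / 256), 256>>>(buf, n, 2.0f);  // device touches them
    cudaDeviceSynchronize();

    std::printf("buf[0] = %f\n", buf[0]);                // pages migrate back on demand
    cudaFree(buf);
    return 0;
}
```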
  • This disclosure creatively applies unified address space and hijacking library technology to the field of GPU multiplexing: the Cuda call chain is replaced by the hijacking library 220 (i.e., the Cuda hijacking library), and ordinary video memory applications are replaced with unified-address-space applications, so that the task gains the ability to use host memory as a video memory swap area without any awareness of it.
  • the hijacking library 220 can hijack all upper-layer call requests by hijacking symbol calls, and forward them to the lower-layer real CUDA execution library after processing.
  • Figure 3 shows a schematic diagram of the hijacking library.
  • The hijacking library (libcuda.so in the figure) is located between the driver layer (Nvidia GPU Driver) and the Cuda runtime (Cuda Runtime) layer, so it can intercept all requests sent to the driver by the Cuda runtime.
  • The hijacking library can also be configured to check whether the video memory requested by the application or task exceeds the allocated video memory; if it does not, the hijacking library forwards the video memory application request to the lower-level driver, such as the GPU driver.
  • In other words, the hijacking library can keep per-container statistics on video memory and computing power, perform the corresponding legality check (the requested video memory cannot exceed the allocated video memory size), and then pass the request to the lower-level driver; a simple accounting sketch is given below.
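  • The legality check can be pictured with the following small C sketch. It is an assumption-laden illustration of per-container accounting, not the disclosed implementation; the environment variable name VGPU_MEMORY_LIMIT_MB and the single-counter scheme are invented here for clarity.

```c
/* Illustrative per-container video memory accounting for the hijacking library:
 * before a request is forwarded to the real driver, it is checked against the
 * limit recorded in the container's environment. Names are hypothetical. */
#include <stdlib.h>

static size_t used_bytes;  /* video memory already granted inside this container */

static size_t container_limit_bytes(void) {
    const char *s = getenv("VGPU_MEMORY_LIMIT_MB");     /* hypothetical variable */
    return s ? (size_t)strtoull(s, NULL, 10) * 1024 * 1024 : (size_t)-1;
}

/* Returns 0 if the request may be passed to the lower-level driver, or -1 if it
 * would exceed the video memory allocated to this container. */
int vgpu_check_and_account(size_t request_bytes) {
    if (used_bytes + request_bytes > container_limit_bytes())
        return -1;                   /* illegal: over the per-container limit      */
    used_bytes += request_bytes;     /* legal: record the usage and let it through */
    return 0;
}
```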
  • The lower-layer driver can control the graphics processing unit (Nvidia GPU) by calling the interface function ioctl.
  • The hijacking library can be a Cuda API interface added by modifying the guest operating system (Guest OS) in a paravirtualization (Para-Virtualization) manner.
  • A virtualization layer can be added to Libcuda to build the hijacking library.
  • The application (CUDA Application) in the guest operating system can call the library (CUDA Library), which is connected to the Cuda runtime through static linking to support the running of the application.
  • The hijacking library can be mounted into the runtime through dynamic linking (Dynamic link) by calling the function dlopen.
  • the device component 230 is configured to mount (i.e., map) the hijacking library 220 into the container and set the preload library in the container so that the hijacking library is forced to be mounted before the process in the container is started.
  • Device component 230 may also be called a device plug-in.
  • the device component 230 may be an improvement on the Device Plugin (ie, device plug-in) in the K8S cluster.
  • the improved device plug-in can be recorded as 4PD-vGPU-Device Plugin.
  • the device component 230 is responsible for mapping the Cuda hijacking library (libvgpu.so) into the container and setting the preload library (such as the preload file /etc/ld.so.preload) in the container.
  • the function of /etc/ld.so.preload is to force libvgpu.so to be mounted before any process is started, ensuring that users cannot bypass vGPU and directly access the GPU. As a result, all calls to Cuda inside the container will be forwarded through the hijacking library.
  • the mounting component 240 may be configured to obtain the identity of the virtual GPU by communicating with the scheduler component 210 .
  • The identifier of the virtual GPU identifies the device number of the virtual GPU, and the mounting component 240 can mount the virtual GPU into the container according to this identifier. The virtual GPU can be mounted into the container together with the corresponding driver library.
  • The mounting component 240 is located at the container layer and can be obtained by improving nvidia-container-runtime.
  • The improved nvidia-container-runtime can be recorded as 4PD-nvidia-container-runtime.
  • Compared with the original component, the mounting component 240 additionally communicates with the scheduler component 210; based on this communication, the mounting component 240 actually mounts the vGPU assigned to the container by the scheduler component 210 into the container.
  • the mounting component 240 may also be configured to obtain the GPU resource information of the virtual GPU by communicating with the scheduler component 210, and record the GPU resource information in the environment variable of the container.
  • the GPU resource information may refer to the upper limit of GPU resources that the virtual GPU can provide, such as the upper limit of video memory and the upper limit of computing power.
  • the GPU resource information in the container's environment variables is the GPU resources that the container can access (that is, use).
  • Containers can be created and started based on environment variables. The operation of creating and starting the container can be performed by runc, the container runtime.
  • the monitoring component 250 can be recorded as 4PD-VGPU-monitor and is configured to monitor the resource usage of the vGPU. For example, it can monitor some preset metrics.
  • the monitoring component 250 can also push metrics to the outside to facilitate real-time monitoring and visualization of vGPU resources across the entire cluster.
  • the marking component 260 may be configured to record, for each container in the container-type job, a container identifier that can uniquely identify the container in the environment variable of the container, to facilitate identification by the scheduler component 210 .
  • a Universally Unique Identifier (UUID) can be used as a container identifier. Marking component 260 may refer to MutatingWebhook in K8S.
  • Figure 4 shows a schematic diagram of the overall processing flow for vGPU tasks.
  • Among them, steps S410 and S420 can be executed by the marking component; steps S430, S440 and S480 can be executed by the scheduler component; step S450 can be executed by the device component; and step S460 can be executed by the mounting component.
  • In step S410, when a task is submitted, the marking component first checks whether the submitted vGPU task contains vGPU resources. If vGPU resources are detected, steps S420 to S460 are executed, so that through GPU virtualization the application or task can use virtual video memory, and can do so without any awareness. If no vGPU resources are detected, the submitted task does not require GPU virtualization, and step S470 is executed to run the default scheduling process.
  • In step S420, a container UUID is added for each discovered vGPU resource to help the mounting component identify the corresponding container.
  • Each vGPU resource corresponds to a container, and the vGPU resource is the vGPU resource that the container needs to use.
  • In step S430, GPU nodes in the cluster are filtered.
  • In step S440, the filtered GPU nodes are scored.
  • The most suitable GPU node (such as the highest-scoring GPU node or nodes) can then be selected to execute the vGPU task. For example, the GPU nodes that support virtualization can be filtered first; the remaining GPU nodes can then be scored based on the GPU computing power currently remaining on the node, the supported segmentation granularity, and so on; the most appropriate (that is, the highest-scoring) GPU node is selected; and the GPUs in the finally selected node are split to obtain one or more vGPUs that meet the requirements, as in the sketch below.
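  • For illustration, a toy version of the filter-and-score idea in steps S430 and S440 might look like the following C sketch; the fields, thresholds, and scoring weights are assumptions made for this example and are not taken from the disclosure.

```c
/* Toy illustration of step S430 (filter) and step S440 (score); all fields and
 * weights are hypothetical. */
typedef struct {
    int supports_vgpu;     /* node supports GPU virtualization                  */
    int free_compute_pct;  /* remaining GPU computing power on the node, 0-100  */
    int free_memory_mb;    /* remaining onboard video memory on the node, in MB */
} GpuNode;

/* Returns a score (higher is better), or -1 if the node is filtered out. */
int score_node(const GpuNode *n, int want_compute_pct, int want_memory_mb) {
    if (!n->supports_vgpu) return -1;                    /* S430: filter step    */
    if (n->free_compute_pct < want_compute_pct) return -1;
    if (n->free_memory_mb < want_memory_mb) return -1;
    /* S440: prefer nodes that keep the most headroom after placement. */
    return (n->free_compute_pct - want_compute_pct)
         + (n->free_memory_mb - want_memory_mb) / 1024;
}
```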
  • In step S450, the hijacking library mount is added.
  • For example, the CUDA hijacking library libvgpu.so and the preload file /etc/ld.so.preload can be mounted.
  • In step S460, environment variables are set.
  • the task can be submitted to the container layer.
  • the nvidia container runtime at the container layer will communicate with the vGPU scheduler and obtain the GPU serial number corresponding to the vGPU, the corresponding usable video memory, and the upper limit of utilization, respectively.
  • The first environment variable is used to control the GPU device number mounted into the container; the second and third environment variables are used to control access to the GPU, namely the upper limit of video memory that the container can access and the upper limit of utilization, respectively.
  • nvidia-container-cli can be called to mount the specific GPU device and corresponding driver library, and then runc can be used to start the container.
  • this disclosure discusses GPU virtualization technology based on cloud-native solutions that can realize virtual video memory capabilities, and provides a complete process and solution to solve the above defects from the product and technical levels.
  • the disclosed solution can be implemented based on cloud native K8S technology and can be directly adapted to cloud native scenarios. Therefore, the present disclosure can be implemented as a cloud-native GPU virtualization solution that supports virtual video memory.
  • the cloud-native GPU virtualization solution implemented based on the job scheduling device of the present disclosure mainly includes a component deployment phase, an application or task creation phase, and an application or task running phase.
  • each component in the GPU resource management device can be installed on the nodes in the K8S cluster that require GPU virtualization.
  • the "virtual memory multiplier” and “maximum number of vGPUs” need to be set.
  • Parameters, users can set according to the actual conditions of the cluster and upper-layer applications or tasks. These two parameters are new parameters in this solution, and their functions can be referred to the relevant descriptions above.
  • the application or task will run on the virtual GPU using the set parameters, and the applications on each virtual GPU will not interfere with each other. It does not matter if the user changes the running status of an application or task. Will affect applications or tasks on other virtual GPUs.
  • The expected effect is that the total amount of video memory used by the applications or tasks can exceed the total amount of actual physical GPU video memory, and that in inference scenarios where the GPU's computing power is not fully used, the performance of each application or task is basically the same as when GPU virtualization is not used, with a loss within 10%. This is because inference tasks have strong locality, so swapping between video memory and host memory is not frequent enough to cause a large performance degradation.
  • the loss here refers to the increase in request processing time in the inference scenario.
  • This disclosure enables multiple AI applications (such as inference models) to be deployed on one GPU by reusing the GPU.
  • multiple inference models can be deployed on one GPU through multiple model loading methods (such as selecting the multi-model loading method in Nvidia Triton server, torchserve, and tf-serving).
  • Nvidia has launched the Nvidia triton server inference engine for inference scenarios, which can load multiple inference models on one GPU.
  • many AI training frameworks have also launched corresponding inference service engines. They can load multiple models simultaneously in one task and provide inference services for each model.
  • the disadvantage of this multi-model loading method is that this technology is often only applicable to certain models.
  • For example, tf-serving can only load TensorFlow models, and torchserve can only load Torch models.
  • Even the Nvidia Triton server, which has the widest applicability, can only load TorchScript models for PyTorch, and so cannot cover all application scenarios.
  • the disclosed GPU multiplexing scheme can cover all application scenarios.
  • the inference model mentioned in this disclosure may refer to a neural network model.
  • Neural network models can be used to predict image categories, text categories, voice emotions, fraudulent transactions, advertising click-through rates, etc.
  • In general, the neural network model is designed to predict problems related to objects or events in the relevant scene. For example, it can be used to predict image categories, text in images, text categories, voice emotion categories, fraudulent transactions, advertising click-through rates, commodity prices, and so on, so that the prediction results can be used directly as a basis for decision-making or can be further combined with other rules to form a basis for decision-making.
  • the scenarios in which the neural network model can be used include but are not limited to the following scenarios:
  • Image processing scenarios include: optical character recognition (OCR), face recognition, object recognition, and image classification. More specifically, OCR can be applied to bill (such as invoice) recognition and handwriting recognition, face recognition can be applied to security and similar fields, and image classification can be applied to e-commerce platforms' "photo shopping" and "find the same style" features.
  • Speech recognition scenarios include products that enable human-computer interaction through voice, such as mobile phone voice assistants (such as Siri on Apple phones), smart speakers, etc.;
  • Natural language processing scenarios include: review of texts (such as contracts, legal documents and customer service records, etc.), spam content identification (such as spam text message identification) and text classification (emotion, intention and theme, etc.);
  • Automatic control scenarios include: mine group adjustment operation prediction, wind turbine adjustment operation prediction, and air-conditioning system adjustment operation prediction. Specifically, for a mine group, a set of adjustment operations with a high mining rate can be predicted; for wind turbine units, a set of adjustment operations with high power generation efficiency can be predicted; and for air-conditioning systems, a set of adjustment operations that meet demand while saving energy consumption can be predicted;
  • Intelligent question and answer scenarios including: chat robots and intelligent customer service;
  • the field of financial technology includes: marketing (such as coupon usage prediction, advertising click behavior prediction, user portrait mining, etc.) and customer acquisition, anti-fraud, anti-money laundering, underwriting and credit scoring, and commodity price prediction;
  • Medical fields include: disease screening and prevention, personalized health management and auxiliary diagnosis;
  • Municipal fields include: social governance and regulatory law enforcement, resource environment and facility management, industrial development and economic analysis, public services and people's livelihood security, smart cities (allocation and management of various urban resources such as buses, online ride-hailing, shared bicycles, etc.);
  • Recommended business scenarios including: recommendation of news, advertising, music, consulting, video and financial products (such as financial management, insurance, etc.);
  • Search scenarios include: web search, image search, text search, video search, etc.;
  • Abnormal behavior detection scenarios include: detection of abnormal power consumption behavior of State Grid customers, detection of malicious network traffic, detection of abnormal behavior in operation logs, etc.
  • the GPU virtualization method described above in conjunction with Figure 1 of the present disclosure can also be implemented as a GPU virtualization device.
  • Figure 5 shows a schematic structural diagram of a GPU virtualization device according to an embodiment of the present disclosure.
  • the functional units of the GPU virtualization device may be implemented by hardware, software, or a combination of hardware and software that implement the principles of the present disclosure.
  • Those skilled in the art can understand that the functional units described in Figure 5 can be combined or divided into sub-units to implement the above disclosed principles. Therefore, the description herein may support any possible combination, division, or further limitation of the functional units described herein.
  • The GPU virtualization device 500 may include a splitting module 510 and an allocation module 520.
  • The splitting module 510 is configured to split the GPU into multiple virtual GPUs.
  • the allocation module 520 is configured to, for at least one virtual GPU, allocate part of the memory of the host where the GPU is located as a video memory swap area to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU.
  • The GPU virtualization apparatus 500 may further include a first acquisition module configured to acquire virtualization information, where the virtualization information includes the maximum number of virtual GPUs and the virtual video memory size.
  • the splitting module can split the GPU into multiple virtual GPUs on the condition that the number of virtual GPUs obtained by splitting is not greater than the maximum number of virtual GPUs.
  • the allocation module can allocate host memory equal to the size of the virtual video memory as a video memory swap area to the virtual GPU.
  • the GPU virtualization device 500 may further include a second acquisition module and an allocation module.
  • the second acquisition module is configured to acquire resource requirement information.
  • the resource requirement information is used to characterize the GPU resources required by the application or task.
  • the allocation module is configured to allocate virtual GPUs to applications or tasks based on resource requirement information.
  • The GPU virtualization apparatus 500 may further include a replacement module configured to replace the video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area based on the unified address space.
  • the replacement module can use the hijacking library to intercept the call request of the application or task to the virtual GPU, and replace the default interface used by the application or task to apply for video memory with a video memory application interface based on a unified address space.
  • the GPU virtualization device 500 may also include a setting module, a mounting module, and a startup module.
  • the setting module is configured to set the environment variables of the container.
  • the environment variables include the container ID, the ID of the virtual GPU mounted into the container, and the upper limit of GPU resources that the container can access.
  • the mount module is configured to mount the virtual GPU into the container based on environment variables.
  • the startup module is configured to launch a container to run an application or task in the container.
  • the GPU virtualization device 500 may further include a monitoring module configured to monitor resource usage of the virtual GPU.
  • The present disclosure can also be implemented as a Kubernetes cluster, including multiple GPU nodes, each of which includes one or more GPUs, and a job scheduling device deployed on at least one of the GPU nodes. The job scheduling device may be the job scheduling device (GPU resource management device) described above in conjunction with Figure 2.
  • FIG. 6 shows a schematic structural diagram of a computing device that can be used to implement the above GPU resource usage method or GPU virtualization method according to an embodiment of the present disclosure.
  • computing device 600 includes memory 610 and processor 620 .
  • the processor 620 may be a multi-core processor or may include multiple processors.
  • processor 620 may include a general main processor and one or more special co-processors, such as a graphics processing unit (GPU), a digital signal processor (DSP), and the like.
  • the processor 620 may be implemented using customized circuits, such as Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs).
  • Memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM can store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage may be readable and writable storage, and may be a non-volatile storage device that does not lose stored instructions and data even when the computer is powered off. In some embodiments, a large-capacity storage device (such as a magnetic or optical disk, or flash memory) is used as the persistent storage device; in other embodiments, the persistent storage device may be a removable storage device (e.g., a floppy disk or an optical drive).
  • System memory can be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory.
  • System memory can store some or all of the instructions and data the processor needs to run.
  • The memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used.
  • The memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (such as an SD card, a mini SD card, a Micro-SD card, etc.), a magnetic floppy disk, and the like.
  • Computer-readable storage media do not contain carrier waves and transient electronic signals that are transmitted wirelessly or wired.
  • The memory 610 stores executable code.
  • When the executable code is executed by the processor 620, the processor 620 is caused to execute the above-mentioned GPU resource usage method or GPU virtualization method.
  • the method according to the present disclosure can also be implemented as a computer program or computer program product, which computer program or computer program product includes computer program code instructions for executing the above-mentioned steps defined in the above-mentioned method of the present disclosure.
  • The present disclosure may also be implemented as a computer-readable storage medium (or machine-readable storage medium) having executable code (or a computer program, or computer instruction code) stored thereon.
  • When the executable code (or computer program, or computer instruction code) is executed by a processor of a computing device, the processor is caused to execute each step of the above method according to the present disclosure.
  • Each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Stored Programmes (AREA)

Abstract

A GPU resource usage method, a GPU virtualization method, a job scheduling apparatus, and a cluster are provided. The GPU resource usage method comprises: dividing a GPU into a plurality of virtual GPUs; for at least one virtual GPU, using at least part of a host memory as a video memory swap area and allocating it to the virtual GPU, so that the available video memory of the virtual GPU is greater than the onboard video memory of the virtual GPU; and replacing a video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that when the currently available onboard video memory of the virtual GPU is insufficient, at least part of the data in the onboard video memory can be swapped to the video memory swap area on the basis of the unified address space. By means of the method, the video memory swap area can serve as virtual video memory to increase the available video memory of a virtual GPU, which solves the problem that the computing power utilization and the video memory utilization of a GPU cannot both be accommodated; furthermore, the video memory application request of an application or task for a virtual GPU is replaced with a video memory application request based on a unified address space, so that the application or task can use the virtual video memory without being aware of it.
PCT/CN2023/111673 2022-08-09 2023-08-08 Method for using GPU resources, GPU virtualization method, job scheduling apparatus, and cluster WO2024032587A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210950598.5 2022-08-09
CN202210950598.5A CN117632447A (zh) 2022-08-09 2022-08-09 Gpu资源使用方法、gpu虚拟化方法以及作业调度装置、集群

Publications (1)

Publication Number Publication Date
WO2024032587A1 (fr)

Family

ID=89850888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111673 WO2024032587A1 (fr) 2022-08-09 2023-08-08 Method for using GPU resources, GPU virtualization method, job scheduling apparatus, and cluster

Country Status (2)

Country Link
CN (1) CN117632447A (fr)
WO (1) WO2024032587A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160188251A1 (en) * 2014-07-15 2016-06-30 Nvidia Corporation Techniques for Creating a Notion of Privileged Data Access in a Unified Virtual Memory System
US20210133123A1 (en) * 2019-11-04 2021-05-06 Nvidia Corporation Techniques for an efficient fabric attached memory
CN111223036A (zh) * 2019-12-29 2020-06-02 广东浪潮大数据研究有限公司 一种gpu虚拟化共享方法、装置及电子设备和存储介质
US20210208951A1 (en) * 2020-08-04 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for sharing gpu, electronic device and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Taking stock of GPU sharing solutions from the industry", QIONGLING TECHNOLOGY, 31 August 2021 (2021-08-31), XP093138346, Retrieved from the Internet <URL:https://www.qiongling.com/2021/08/31/15370.html> [retrieved on 20240306] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133574A (zh) * 2024-05-06 2024-06-04 沐曦集成电路(上海)有限公司 一种sram生成***

Also Published As

Publication number Publication date
CN117632447A (zh) 2024-03-01

Similar Documents

Publication Publication Date Title
CN104424013B (zh) 在计算环境中部署虚拟机的方法和设备
US20220078036A1 (en) Asset management with respect to a shared pool of configurable computing resources
US8442955B2 (en) Virtual machine image co-migration
US9128765B2 (en) Assigning restored virtual machine based on past application usage of requesting user
US8996452B2 (en) Generating a predictive model from multiple data sources
US9158590B2 (en) Dynamically acquiring computing resources in a networked computing environment
CN103365725B (zh) 在多个云之间动态分配工作负荷部署单元的方法和***
US20140201362A1 (en) Real-time data analysis for resource provisioning among systems in a networked computing environment
US20110137805A1 (en) Inter-cloud resource sharing within a cloud computing environment
US20100115510A1 (en) Virtual graphics device and methods thereof
WO2024032587A1 (fr) Method for using GPU resources, GPU virtualization method, job scheduling apparatus, and cluster
CN103034453A (zh) 管理虚拟机实例中预安装应用的持久数据的方法和装置
CN103368767A (zh) 用于具有故障的云中的高效应用管理的方法和***
CN103914511A (zh) 选择用于云存储的图像或者视频文件
CN113434261A (zh) 异构计算设备虚拟化方法及***
CN115169587A (zh) 联邦学习***及实现多方联合处理任务的方法与设备
CN109976907A (zh) 任务分配方法和***、电子设备、计算机可读介质
US20170123968A1 (en) Flash memory management
CN110858326A (zh) 模型训练及获取附加特征数据的方法、装置、设备及介质
US8548881B1 (en) Credit optimization to minimize latency
CN113849503A (zh) 一种开放式大数据处理***、方法及介质
US10007559B1 (en) Virtual tiering
CN112449021B (zh) 一种互联网资源的筛选方法及装置
WO2022078060A1 (fr) Planification commandée par étiquette de ressources informatiques pour exécution de fonction
CN115080242A (zh) 一种pci设备资源统一调度的方法、装置及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851808

Country of ref document: EP

Kind code of ref document: A1