CN114048026A - GPU resource dynamic allocation method under multitask concurrency condition - Google Patents

GPU resource dynamic allocation method under multitask concurrency condition

Info

Publication number
CN114048026A
CN114048026A (application CN202111258248.4A)
Authority
CN
China
Prior art keywords
gpu
program
kernel
worker
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111258248.4A
Other languages
Chinese (zh)
Inventor
肖利民
常佳辉
秦广军
朱乃威
徐向荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111258248.4A priority Critical patent/CN114048026A/en
Publication of CN114048026A publication Critical patent/CN114048026A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/71Version control; Configuration management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a method for dynamically allocating GPU resources under multitask conditions, aimed at the large amounts of idle resources, reduced system throughput, and unreasonable resource allocation that result from the static resource allocation used when multiple tasks run concurrently on an NVIDIA GPU. The method has three notable characteristics: (1) it is a pure software method that makes the amount of resources used by a running GPU program configurable without modifying any hardware or driver internals; (2) it is efficient: it considers the affinity of tasks for different kinds of resources and runs tasks with complementary resource demands concurrently, thereby improving GPU resource utilization and accelerating multitask processing; (3) it offers a simple program conversion procedure, so developers can migrate a native program onto the system with a fixed set of steps.

Description

GPU resource dynamic allocation method under multitask concurrency condition
Technical field:
The invention discloses a method for dynamically allocating GPU resources under multitask conditions, addresses challenges faced by high-performance computing, and belongs to the field of computer technology.
Background art:
GPUs, as devices with large numbers of computing cores that provide high-speed parallel computation, have long since outgrown graphics computation and rendering and are widely used in large-scale parallel computing such as high-performance computing, mass data processing, and artificial intelligence. As an emerging heterogeneous computing platform, the GPU is supported by many programming frameworks, including the CUDA programming model proposed by NVIDIA and the OpenCL programming language supported by numerous vendors. Although programming models for GPU devices have developed rapidly, the parallelism of a GPU under multitasking is badly hurt by the default resource-allocation strategy that the GPU driver applies at runtime. Even when a developer writes a well-designed, high-performance GPU program, under multitask concurrency the driver's scheduling strategy is not intelligent enough, GPU resources are under-utilized, and the GPU acceleration effect becomes weak or is lost entirely. How to design and implement an intelligent multitask concurrency platform that can make scheduling decisions based on workload characteristics is therefore a key problem for high-performance computing on GPUs.
To solve the problem of unreasonable GPU resource allocation under multitasking, a number of methods have appeared in recent years:
1) Hyper-Q technology: in the NVIDIA Fermi architecture, the GPU already supports basic spatial multiplexing; the user can specify which kernels are independent and may run simultaneously, and during task scheduling the GPU tries to hand the remaining resources to other kernels that can run in parallel. The Kepler architecture further improves support for running several kernels in parallel: by providing multiple kernel queues without mutual dependencies, the GPU can bring tasks from another queue onto the device when the tasks in one queue do not occupy all GPU resources. This kernel-queue technology is called Hyper-Q; each CPU core can submit a GPU task queue, and at runtime the GPU selects kernels from several parallel queues to execute. Unlike Fermi, the Kepler architecture truly supports concurrent scheduling of multiple queues; the Fermi architecture merges multiple queues into a single execution queue at compile time and therefore provides only a pseudo-concurrent effect.
2) MPS technology: MPS is a GPU multitask concurrency framework developed by NVIDIA. Hyper-Q addresses concurrency on the GPU hardware, while MPS addresses concurrency of multiple processes in software. Through a client-server architecture, MPS maps the CUDA contexts of several processes onto the same CUDA context; however, the task-scheduling policy is sealed inside the GPU driver, and developers are not allowed to modify it to suit their needs.
3) GPU-Sim: GPU-Sim is an open-source GPU simulator that simulates the operation of a GPU. Researchers can modify its source code to explore improvements to GPU hardware or scheduling policies. Many hardware-based GPU scheduling methods have been studied on top of GPU-Sim, for example measuring the throughput of the SMs under different resource allocations at runtime and adjusting the amount of resources occupied by two kernels accordingly, or using shared memory and the L1 cache to switch GPU contexts quickly. These methods were validated with simulation results from GPU-Sim; real GPUs, however, lack the corresponding hardware functions, so the resource allocation and scheduling strategies demonstrated in those experiments cannot actually be used in a production environment.
Existing GPU multitask concurrency support and resource-allocation strategies do not solve GPU resource scheduling under multitask concurrency in a way that is both efficient and feasible. MPS and Hyper-Q waste resources severely, while the hardware modifications studied on GPU-Sim are not supported by real GPU devices, so the waste of GPU resources under multitask concurrency remains unsolved.
Specifically, the problem with existing multitask GPU concurrency is the lack of an effective, controllable GPU resource-allocation method: no intelligent allocation decision can be made at runtime according to the load characteristics of the tasks, so GPU resources are seriously wasted and the full capability of parallel computing cannot be exploited.
Summary of the invention:
The main purpose of the invention is to provide a method for dynamically allocating GPU resources under multitask conditions, which addresses the large amounts of idle resources, reduced system throughput, and unreasonable resource allocation caused by the static resource allocation used when multiple tasks run concurrently on an NVIDIA GPU. The method has three notable characteristics: (1) it is a pure software method that makes the amount of resources used by a running GPU program configurable without modifying any hardware or driver internals; (2) it is efficient: it considers the affinity of tasks for different kinds of resources and runs tasks with complementary resource demands concurrently, thereby improving GPU resource utilization and accelerating multitask processing; (3) it offers a simple program conversion procedure, so developers can migrate a native program onto the system with a fixed set of steps.
The technical scheme of the invention is as follows:
a dynamic allocation method for GPU resources under the condition of multitask is characterized by that firstly, a control code segment is inserted into a CUDA program by means of a fixed source code modification mode, so that the resource quantity occupied by program operation can be controlled. And then modifying the running mode of the CUDA program into a C/S architecture, wherein the client is a set of self-realized CUDAAPI and replaces the CUDAAPI in the original host program. And the server side adopts a back-end process to receive CUDA calls from different clients and queues the calls. And finally, dynamically regulating and controlling the resource usage amount by the server, traversing the CUDA request queue by the server, executing the calculation task on the GPU, and dynamically reducing the resource amount occupied by the running task by the server and starting the waiting task to realize concurrent processing when the resource requirements of the task at the head of the queue and the running task are complemented.
The method comprises the following steps:
1) Modify the GPU program: insert a code segment that controls the program's resource usage, turn the dynamically assigned blocks of the CUDA program into workers, and let each worker pull blocks in a loop for execution.
2) Sample the GPU program: run it long enough that GPU resources are fully occupied and keep it running for a period of time, obtaining the total number of instructions executed in that period and the number of memory instructions.
3) Save the sampling data of the GPU program in a JSON text, together with the number of registers and the amount of shared memory it occupies.
4) At compile time, replace the native CUDA API and link the CUDA API library implemented by the system; this library sends CUDA requests to the server through a socket.
5) The server creates a thread for every active client connection, and each thread maintains a CUDA task queue.
6) Before starting a kernel, allocate a region of GPU video memory that stores the resource-allocation configuration used at runtime; while the program runs, the control code segment limits the amount of GPU resources used.
7) When two or more GPU programs need to run, compute the best allocation by comparing their sampling data, set the configuration data in GPU video memory, and run the kernels.
8) When a GPU task finishes, it notifies the server, which either pulls a new task from the queue for concurrent execution or expands the GPU program that is still running.
Wherein, step 1) comprises the following sub-steps (an illustrative sketch follows the list):
Step (1.1): the developer adds a global-memory declaration to the GPU code segment; this global memory holds, in video memory, the runtime configuration and the amount of work already completed, and the running kernel threads can read it concurrently;
Step (1.2): insert the control code segment, which executes immediately when the kernel starts, determines the number of the SM it resides on, and sets its own worker number;
Step (1.3): make a copy of the source kernel and replace the blockIdx and gridDim variables inside it;
Step (1.4): the worker pulls blocks in a loop and executes them until all tasks have been executed.
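The transformation in step 1) can be illustrated with a minimal CUDA sketch. It is only a sketch under assumptions: the structure and field names of RuntimeConfig, the kernel names, and the use of a single pulled-block counter are illustrative and are not the exact code of the invention.

    // Illustrative sketch of the step-1 transformation (assumed names, not the invention's exact code).
    #include <cuda_runtime.h>

    struct RuntimeConfig {                            // runtime configuration kept in GPU global memory (step 6)
        unsigned short maxWorkersPerSM;               // resource-limit segment: max workers per SM
        unsigned char  smUpper, smLower;              // resource-limit segment: usable SM range
        unsigned int   gridDimX, gridDimY, gridDimZ;  // grid dimensions of the original kernel
        unsigned int   completedBlocks;               // counter incremented atomically when a worker pulls a block
        unsigned int   totalBlocks;                   // total number of blocks of the task
        unsigned int   activeWorkers[128];            // workers active per SM (sized to the SM count; 128 is a placeholder)
    };

    // Copy of the original kernel body (step 1.3); blockIdx/gridDim are replaced by parameters.
    __device__ void kernel_copy(dim3 blockIdxV, dim3 gridDimV /*, original kernel arguments */) {
        // ... original kernel body with blockIdx -> blockIdxV and gridDim -> gridDimV ...
    }

    __global__ void kernel_worker(RuntimeConfig* cfg /*, original kernel arguments */) {
        // Control code segment (step 1.2): runs immediately at kernel start.
        bool leader = (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0);
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));        // SM this block landed on
        __shared__ unsigned int workerId;
        if (leader) workerId = atomicAdd(&cfg->activeWorkers[smid], 1u);
        __syncthreads();
        if (smid < cfg->smLower || smid > cfg->smUpper || workerId >= cfg->maxWorkersPerSM) {
            if (leader) atomicSub(&cfg->activeWorkers[smid], 1u);
            return;                                     // this block exceeds the configured limits
        }
        dim3 gd(cfg->gridDimX, cfg->gridDimY, cfg->gridDimZ);
        while (true) {                                  // worker loop (step 1.4): pull blocks until done
            __shared__ unsigned int linear;
            if (leader) linear = atomicAdd(&cfg->completedBlocks, 1u);
            __syncthreads();
            if (linear >= cfg->totalBlocks) break;      // all blocks have been handed out
            dim3 bi(linear % gd.x, (linear / gd.x) % gd.y, linear / (gd.x * gd.y));
            kernel_copy(bi, gd /*, original kernel arguments */);  // run one logical block
            __syncthreads();                            // do not pull the next block until all threads finish
        }
        if (leader) atomicSub(&cfg->activeWorkers[smid], 1u);
    }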
Wherein, step 2) comprises the following sub-steps (an illustrative sampling command follows the list):
Step (2.1): choose input data of a suitable scale so that the kernel occupies all GPU resources and runs continuously for more than 1 ms;
Step (2.2): run the test program under nvprof and output the total number of executed instructions and the number of global-memory access instructions.
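As an illustration of step (2.2) only: the counts can be collected with nvprof's metric mode. Metric names differ between GPU generations, so the names below are an assumption rather than metrics prescribed by the method.

    # Hypothetical sampling run; metric names vary by architecture and are assumptions.
    # inst_executed approximates the total instruction count, ldst_executed the memory instructions.
    nvprof --metrics inst_executed,ldst_executed ./sampled_kernel_program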
Wherein, step 3) comprises the following sub-steps (an illustrative sampling file follows the list):
Step (3.1): run the test program under nvvp and record the number of registers occupied by one thread of the kernel and the amount of shared memory occupied by one block;
Step (3.2): save the program sampling data as JSON text with four fields: the number of sampled global-memory access instructions; the total number of sampled instructions; the number of registers; the amount of shared memory;
Step (3.3): the server reads this configuration file during initialization.
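The sampling file of step (3.2) could look like the following. Only the four fields are specified by the method; the key names and the numbers are illustrative assumptions.

    {
        "global_memory_instructions": 1843200,
        "total_instructions": 35127296,
        "registers_per_thread": 32,
        "shared_memory_per_block": 4096
    }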
Wherein, step 4) comprises the following sub-steps (an illustrative client-side sketch follows the list):
Step (4.1): modify the CUDA API library path linked when the GPU program is compiled so that it points to the CUDA API implementation of the system;
Step (4.2): replace the CUDA header file in the source program with the CUDA API header file of the system;
Step (4.3): change kernel launches written with the triple-angle-bracket (<<<...>>>) syntax in the source program into calls through the cudaLaunchKernel interface;
Step (4.4): all CUDA API calls in the source program are thereby redirected to the CUDA API implemented by the system;
Step (4.5): when the system's CUDA API library is initialized, it first connects to the CUDA API server; if the server is not running, the user is told to start the server before running the program;
Step (4.6): once the CUDA API library has connected successfully, the server creates a thread to handle the newly connected API client.
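A minimal sketch of the client side of step 4), assuming a Unix-domain socket and a trivially tagged message format (the text does not fix the transport or the wire format). Only cudaLaunchKernel is shown; argument marshalling and the remaining API calls are omitted.

    // Hypothetical client-side replacement of one CUDA runtime API call (step 4).
    #include <cuda_runtime.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstring>

    static int g_srv = -1;                                   // socket to the scheduling server

    static void connect_server() {                           // step 4.5: connect on first use
        g_srv = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strcpy(addr.sun_path, "/tmp/gpu_sched.sock");   // assumed socket file
        if (connect(g_srv, (sockaddr*)&addr, sizeof(addr)) != 0) {
            // server not running: the user must start the server first (step 4.5)
        }
    }

    // Same signature as the native cudaLaunchKernel; the call is serialized and forwarded
    // to the server instead of being issued to the driver directly (steps 4.3 and 4.4).
    extern "C" cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim,
                                            void** args, size_t sharedMem, cudaStream_t stream) {
        (void)args; (void)stream;                            // argument marshalling omitted in this sketch
        if (g_srv < 0) connect_server();
        struct { int op; const void* func; dim3 grid, block; size_t shm; } msg =
            { 1, func, gridDim, blockDim, sharedMem };       // op = 1: "launch kernel" (assumed tag)
        write(g_srv, &msg, sizeof(msg));
        cudaError_t err = cudaSuccess;
        read(g_srv, &err, sizeof(err));                      // server replies after queuing/executing the call
        return err;
    }

In this sketch the replacement library is linked in place of the native runtime, as described in step (4.1); whether the remaining API functions are forwarded the same way or served locally is an implementation choice.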
Wherein, step 5) comprises the following sub-steps (an illustrative server skeleton follows the list):
Step (5.1): the server listens on a socket file and waits for client connections;
Step (5.2): the client connects to the server through the socket and transmits CUDA call requests in a specified message format;
Step (5.3): after receiving a client connection request, the server opens a new thread to serve that client;
Step (5.4): the server establishes a task queue for each client.
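A sketch of the server side of step 5), again under assumptions: the socket path, the request structure, and the queue handling are illustrative, and the scheduler that would drain these queues (steps 6 to 8) is omitted.

    // Hypothetical server skeleton (step 5): one thread and one CUDA task queue per client.
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <thread>
    #include <queue>
    #include <cstring>

    struct CudaRequest { int op; /* serialized call payload */ };

    static void serve_client(int fd) {                       // step 5.3: one thread per connection
        std::queue<CudaRequest> tasks;                       // step 5.4: per-client task queue
        CudaRequest req;
        while (read(fd, &req, sizeof(req)) == (ssize_t)sizeof(req))
            tasks.push(req);                                 // in the full system the scheduler drains this queue
        close(fd);
    }

    int main() {
        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strcpy(addr.sun_path, "/tmp/gpu_sched.sock");   // step 5.1: assumed socket file
        unlink(addr.sun_path);
        bind(srv, (sockaddr*)&addr, sizeof(addr));
        listen(srv, 16);
        while (true) {
            int fd = accept(srv, nullptr, nullptr);          // step 5.2/5.3: a client connected
            std::thread(serve_client, fd).detach();
        }
    }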
Wherein, step 6) comprises the following sub-steps (a host-side sketch follows the list):
Step (6.1): before starting the kernel, the server allocates a block of global memory in GPU video memory that stores the numbers of the SMs the kernel may occupy, the maximum number of workers that may coexist on each SM, the number of workers currently present on each SM, the number of completed blocks, the total number of blocks, and the GridDim of the kernel;
Step (6.2): the server starts the kernel; each block assigned to an SM becomes a worker, and the worker pulls blocks in a loop for execution;
Step (6.3): when a worker initializes, it first checks the sm_id of the SM it resides on and the maximum number of workers allowed on that SM; a worker that exceeds the limit exits immediately, otherwise it remains resident on the device;
Step (6.4): a worker that stays resident on the device consists of a loop body that pulls one block per iteration;
Step (6.5): the worker computes the blockIdx of the pulled block, passes it to the source-program copy described in step (1.3), and executes that copy;
Step (6.6): after the block finishes, the worker checks whether all tasks are complete; if so, it exits, otherwise it enters the next iteration.
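A host-side counterpart of step 6), reusing the illustrative RuntimeConfig and kernel_worker from the sketch after step 1); every concrete number is a placeholder chosen for the example, not a value prescribed by the method.

    // Host-side sketch of step 6: allocate and fill the runtime configuration, then start the kernel.
    #include <cuda_runtime.h>

    void launch_with_config() {
        RuntimeConfig host{};                       // zero-initialized: completedBlocks = 0, activeWorkers = 0
        host.maxWorkersPerSM = 2;                   // limits chosen by the scheduler (step 7)
        host.smLower = 0;  host.smUpper = 39;       // SMs this kernel may occupy (placeholder values)
        host.gridDimX = 1024; host.gridDimY = 1; host.gridDimZ = 1;
        host.totalBlocks = 1024;                    // gridDimX * gridDimY * gridDimZ

        RuntimeConfig* dev_cfg = nullptr;
        cudaMalloc(&dev_cfg, sizeof(RuntimeConfig));                      // step 6.1: lives as long as the kernel
        cudaMemcpy(dev_cfg, &host, sizeof(host), cudaMemcpyHostToDevice);

        // Step 6.2: launch enough resident blocks; each block that passes the limit check becomes a worker.
        kernel_worker<<<80, 256>>>(dev_cfg /*, original kernel arguments */);
        cudaDeviceSynchronize();
        cudaFree(dev_cfg);
    }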
Wherein, step 7) comprises the following sub-steps (an illustrative pairing test follows the list):
Step (7.1): the server computes the memory-instruction ratio of the running kernel, i.e. the total number of memory instructions in its sampling data divided by its total instruction count;
Step (7.2): compute the memory-instruction ratio of the kernel waiting to run;
Step (7.3): from the register and shared-memory fields of the sampling data, enumerate all resource-allocation combinations under which the two kernels can run as a pair;
Step (7.4): if there is a combination for which the memory-instruction ratio of the system, when the two kernels run concurrently, is close to 0.05, run them concurrently with that configuration;
Step (7.5): if the system memory-instruction ratio of every combination is greater than 0.055 or smaller than 0.045, do not run them concurrently.
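The decision rule of steps (7.1) to (7.5) can be written compactly. How the combined ratio of two concurrently running kernels is formed is not spelled out above, so the instruction-count-weighted combination used here is an assumption.

    // Hypothetical pairing test for step 7.
    struct Sample {                        // the four fields of the sampling file (step 3)
        double memInstructions;            // sampled global-memory access instructions
        double totalInstructions;          // sampled total instructions
        int    registersPerThread;
        int    sharedMemPerBlock;
    };

    // Assumed combination rule: memory-instruction ratio of the two kernels running together.
    static double combined_ratio(const Sample& a, const Sample& b) {
        return (a.memInstructions + b.memInstructions) /
               (a.totalInstructions + b.totalInstructions);
    }

    // True if the pair may run concurrently: the combined ratio lies within 0.045-0.055 (steps 7.4/7.5).
    static bool can_pair(const Sample& running, const Sample& waiting) {
        double r = combined_ratio(running, waiting);
        return r >= 0.045 && r <= 0.055;
    }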
Wherein, step 8) comprises the following sub-steps (an illustrative expansion sketch follows the list):
Step (8.1): the server checks the task queue; if a runnable task exists, it proceeds as in step 7), and if the queue is empty it proceeds to step (8.2);
Step (8.2): the server updates the configuration of the running kernel so that it may occupy all resources;
Step (8.3): the server reads the runtime configuration of the kernel; if the surviving workers already satisfy the resource configuration, the kernel keeps running unchanged; if fewer workers survive than the configuration allows, proceed to step (8.4);
Step (8.4): start a standby instance of the kernel that shares the configuration with the original instance, so that the number of active workers of this kernel on the system meets the configuration.
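Step 8) on the host side might look as follows, again reusing the illustrative RuntimeConfig and kernel_worker. The idea of rewriting only the 4-byte resource-limit segment and of counting surviving workers follows the description; the helper structure and the launch shape are assumptions.

    // Hypothetical expansion logic for step 8: the surviving kernel gets all resources back.
    #include <cuda_runtime.h>

    void expand_running_kernel(RuntimeConfig* dev_cfg, unsigned int numSMs,
                               unsigned short maxWorkersPerSM) {
        // Step 8.2: rewrite only the 4-byte resource-limit segment so the kernel may use every SM;
        // the run-state counters behind it must stay untouched.
        struct { unsigned short maxWorkersPerSM; unsigned char smUpper, smLower; } limits =
            { maxWorkersPerSM, (unsigned char)(numSMs - 1), 0 };
        cudaMemcpy(dev_cfg, &limits, sizeof(limits), cudaMemcpyHostToDevice);

        // Step 8.3: read back the per-SM active-worker counters.
        RuntimeConfig cur{};
        cudaMemcpy(&cur, dev_cfg, sizeof(cur), cudaMemcpyDeviceToHost);
        unsigned int alive = 0;
        for (unsigned int i = 0; i < numSMs && i < 128; ++i) alive += cur.activeWorkers[i];

        // Step 8.4: if fewer workers survive than the new configuration allows, start a standby
        // instance of the same kernel that shares dev_cfg, so new workers fill the freed SMs.
        if (alive < (unsigned int)numSMs * maxWorkersPerSM)
            kernel_worker<<<80, 256>>>(dev_cfg /*, original kernel arguments */);
    }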
the advantages of the invention include:
compared with the prior art, the dynamic allocation method of the GPU resources under the multitask condition has the advantages that:
aiming at the problems of idle large amount of resources, reduced system throughput rate and unreasonable resource allocation caused by adopting a static resource allocation method when NVIDIAGPU multitasks are concurrent, the system provides an efficient GPU multitask operation system, and the system has three obvious characteristics: (1) the system provides a software method, and realizes the configurability of the resource usage amount when the GPU program runs under the condition of not modifying any hardware and driving details. (2) The method has high efficiency, considers the affinity of the tasks to different types of resources, and executes the tasks with complementary resource requirements concurrently, thereby improving the utilization efficiency of GPU resources and accelerating multitask processing. (3) The method provides a simple program conversion mode, and developers can migrate the native program to the system to run only by adopting fixed operation steps.
Description of the drawings:
FIG. 1 is a flowchart of the method for dynamically allocating GPU resources under multitask concurrency according to the invention.
FIG. 2 is a diagram of the GPU resource configuration data structure.
FIG. 3 is a schematic diagram of the worker execution mode.
FIG. 4 is a diagram of the GPU multitask concurrent execution system.
FIG. 5 is a flowchart of starting a kernel.
FIG. 6 is a flowchart of dynamic resource allocation.
Detailed description of the embodiments:
The invention will be described in further detail below with reference to the accompanying drawings.
FIG. 2 illustrates the resource-allocation data structure used while a GPU program runs. The configuration consists of a fixed resource-limit segment of 4 bytes (32 bits) and a run-state segment of variable length. The first two bytes of the resource-limit segment give the maximum number of workers that may run on each SM, and the third and fourth bytes give the upper and lower bounds of the SM numbers that may be used. The first three fields of the run-state segment store the Grid dimensions of the kernel; the next field records the number of blocks completed so far; the following field holds the total number of blocks the task must complete; the remaining entries, one per SM of the GPU hardware, record the number of workers of the kernel currently active on SMs 0, 1, 2, ..., n. When a kernel is initialized, the system first allocates GPU memory for it and stores there the runtime configuration shown in FIG. 2; the lifetime of this configuration equals the running time of the kernel.
the execution flow of Worker is shown in fig. 2, and becomes Worker after a block is allocated to the SM. Firstly, the Worker acquires the current SMid and the Worker number through the PTX assembly code, and simultaneously adds one to the active Worker number through atomic operation; and reading the runtime configuration in the figure 1 by the worker, comparing the SMid with the upper limit and the lower limit in the configuration, and comparing the worker with the upper limit of the worker in the configuration to judge whether the worker exceeds the configuration limit. If the SM or worker limit is exceeded, the worker is exited, and the number of active workers is reduced by one; if the number exceeds the total block amount, the worker exits, and the number of the active worker is reduced by one; and if the number does not exceed the block total amount, calculating the blockIdx according to the GridDIM in the configuration and calling the source program copy. After the source program copy exits, the worker pulls the next unexecuted block and repeats the process until the number of blocks completed exceeds the total number of blocks.
FIG. 4 shows the structure of the GPU multitask concurrent execution system. The overall architecture is divided into a client side and a server side; the client consists of the CUDA API layer and the GPU programs, and the system supports concurrent execution of multiple GPU programs. The CUDA API layer communicates with the server through an IPC library. The server supports concurrent connections from multiple clients: the Receiver module creates one thread per client connection to handle all CUDA requests of that client. The CUDA requests of the clients are placed into unified CommandLists; the Scheduler module of the server polls the requests in the CommandLists and, once a decision has been made, starts an APIConductor that executes the real CUDA task. The Device module keeps track of the computing capability and total resources of the current device, as well as the running kernels and the amount of resources they occupy; when the Scheduler makes a scheduling decision, it queries the Device module for the latest state of the device.
FIG. 5 shows the flow of starting a kernel. Before starting a kernel, the scheduler checks whether the current device is idle; if the device is completely idle, the scheduler configures the kernel to occupy all GPU resources. If the device is not idle, the scheduler checks whether one or two kernels are currently running; if the device is already occupied by two kernels, the kernel to be scheduled blocks and waits for a task on the GPU to release resources. If the device is occupied by a single kernel, the scheduler computes the combined memory-instruction ratio of the kernel to be scheduled and the running kernel; if a combination close to 0.05 exists, the runtime configuration is set and stored on the GPU, otherwise the kernel to be scheduled enters the blocked state. When the scheduler decides that a kernel may run on the GPU, it first sets the runtime configuration and then starts the kernel. When a kernel completes, it releases the GPU resources it occupied and notifies the scheduler, which then either wakes a previously blocked kernel or pulls a new kernel from the waiting queue for scheduling.
FIG. 6 shows the dynamic scheduling of GPU resources. In the initial state one kernel is running on the GPU. When the scheduler schedules a new task, it first computes the memory-instruction ratios of the possible combinations; if a combination falls within 0.045-0.055, concurrent execution is configured, and if no such combination exists, the scheduler blocks the newly scheduled task until the existing kernel releases resources. If concurrent execution is chosen, the scheduler first updates the runtime configuration of the running kernel on the GPU, which reduces the resources it occupies, and then starts the new task; the two tasks continue to execute concurrently. If the original kernel finishes first, it releases its resources and notifies the scheduler. If the later kernel finishes first, it checks whether a blocked task is waiting to be scheduled; if so, it releases its resources immediately and notifies the scheduler, and if not, the scheduler lets the original kernel re-occupy all GPU resources. To do so, the scheduler first reads the runtime configuration of that kernel stored on the GPU; if the number of active workers is below the new maximum, the scheduler starts the standby instance of the kernel to raise the number of active workers on the GPU, and if the number of active workers already meets the maximum, the scheduler does nothing. When that kernel finishes, it releases its resources and notifies the scheduler, which enters a new round of task scheduling.
Finally, it should be noted that: the present invention may be used in various other applications, and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (10)

1. A method for dynamically allocating GPU resources under multitask conditions, characterized in that a control code segment is inserted into the CUDA program through a fixed source-code modification so that the amount of resources the program occupies at runtime becomes controllable; the execution mode of the CUDA program is changed to a client/server (C/S) architecture, the client being a self-implemented set of CUDA APIs that replaces the CUDA API in the original host program, and the server being a back-end process that receives CUDA calls from different clients and queues them; and the server dynamically regulates resource usage, traverses the CUDA request queue and places computation tasks on the GPU for execution, and, when the resource demand of the task at the head of the queue complements that of the running task, dynamically reduces the amount of resources occupied by the running task and starts the waiting task, thereby achieving concurrent processing.
2. The method according to claim 1, characterized by comprising the following steps:
1) modifying the GPU program: inserting a code segment that controls the program's resource usage, turning the dynamically assigned blocks of the CUDA program into workers, and letting each worker pull blocks in a loop for execution;
2) sampling the GPU program: running it long enough that GPU resources are fully occupied and keeping it running for a period of time, thereby obtaining the total number of instructions executed in that period and the number of memory instructions;
3) saving the sampling data of the GPU program in a JSON text, together with the number of registers and the amount of shared memory it occupies;
4) at compile time, replacing the native CUDA API and linking the CUDA API library implemented by the system, which sends CUDA requests to the server through a socket;
5) the server creating a thread for every active client connection, each thread maintaining a CUDA task queue;
6) before starting a kernel, allocating a region of GPU video memory that stores the resource-allocation configuration used at runtime, the control code segment limiting the amount of GPU resources used while the program runs;
7) when two or more GPU programs need to run, computing the best allocation by comparing their sampling data, setting the configuration data in GPU video memory, and running the kernels;
8) when a GPU task finishes, notifying the server, which either pulls a new task from the queue for concurrent execution or expands the GPU program that is still running.
3. The method according to claim 2, wherein step 1) comprises the following steps:
step (1.1): a developer adds a global-memory declaration to the GPU code segment; this global memory holds, in video memory, the runtime configuration and the amount of work already completed, and the running kernel threads can read it concurrently;
step (1.2): inserting the control code segment, which executes immediately when the kernel starts, determines the number of the SM it resides on, and sets its own worker number;
step (1.3): making a copy of the source kernel and replacing the blockIdx and gridDim variables inside it;
step (1.4): the worker pulls blocks in a loop and executes them until all tasks have been executed.
4. The method according to claim 2, wherein step 2) comprises the following steps:
step (2.1): choosing input data of a suitable scale so that the kernel occupies all GPU resources and runs continuously for more than 1 ms;
step (2.2): running the test program under nvprof and outputting the total number of executed instructions and the number of global-memory access instructions.
5. The method according to claim 2, wherein step 3) comprises the following steps:
step (3.1): running the test program under nvvp and recording the number of registers occupied by one thread of the kernel and the amount of shared memory occupied by one block;
step (3.2): saving the program sampling data as JSON text with four fields: the number of sampled global-memory access instructions; the total number of sampled instructions; the number of registers; the amount of shared memory;
step (3.3): the server reading this configuration file during initialization.
6. The method according to claim 2, wherein step 4) comprises the following steps:
step (4.1): modifying the CUDA API library path linked when the GPU program is compiled so that it points to the CUDA API implementation of the system;
step (4.2): replacing the CUDA header file in the source program with the CUDA API header file of the system;
step (4.3): changing kernel launches written with the triple-angle-bracket (<<<...>>>) syntax in the source program into calls through the cudaLaunchKernel interface;
step (4.4): redirecting all CUDA API calls in the source program to the CUDA API implemented by the system;
step (4.5): when the system's CUDA API library is initialized, first connecting to the CUDA API server, and if the server is not running, telling the user to start the server before running the program;
step (4.6): once the CUDA API library has connected successfully, the server creating a thread to handle the newly connected API client.
7. The method according to claim 2, wherein step 5) comprises the following steps:
step (5.1): the server listening on a socket file and waiting for client connections;
step (5.2): the client connecting to the server through the socket and transmitting CUDA call requests in a specified message format;
step (5.3): after receiving a client connection request, the server opening a new thread to serve that client;
step (5.4): the server establishing a task queue for each client.
8. The method according to claim 2, wherein step 6) comprises the following steps:
step (6.1): before starting the kernel, the server allocating a block of global memory in GPU video memory that stores the numbers of the SMs the kernel may occupy, the maximum number of workers that may coexist on each SM, the number of workers currently present on each SM, the number of completed blocks, the total number of blocks, and the GridDim of the kernel;
step (6.2): the server starting the kernel, each block assigned to an SM becoming a worker that pulls blocks in a loop for execution;
step (6.3): when a worker initializes, first checking the sm_id of the SM it resides on and the maximum number of workers allowed on that SM, a worker that exceeds the limit exiting immediately and a worker within the limit remaining resident on the device;
step (6.4): a worker that stays resident on the device consisting of a loop body that pulls one block per iteration;
step (6.5): the worker computing the blockIdx of the pulled block, passing it to the source-program copy described in step (1.3), and executing that copy;
step (6.6): after the block finishes, the worker checking whether all tasks are complete, exiting if so and otherwise entering the next iteration.
9. The method according to claim 2, wherein step 7) comprises the following steps:
step (7.1): the server computing the memory-instruction ratio of the running kernel, i.e. the total number of memory instructions in its sampling data divided by its total instruction count;
step (7.2): computing the memory-instruction ratio of the kernel waiting to run;
step (7.3): from the register and shared-memory fields of the sampling data, enumerating all resource-allocation combinations under which the two kernels can run as a pair;
step (7.4): if there is a combination for which the memory-instruction ratio of the system, when the two kernels run concurrently, is close to 0.05, running them concurrently with that configuration;
step (7.5): if the system memory-instruction ratio of every combination is greater than 0.055 or smaller than 0.045, not running them concurrently.
10. The method according to claim 2, wherein step 8) comprises the following steps:
step (8.1): the server checking the task queue, proceeding as in step 7) if a runnable task exists and proceeding to step (8.2) if the queue is empty;
step (8.2): the server updating the configuration of the running kernel so that it may occupy all resources;
step (8.3): reading the runtime configuration of the kernel, keeping the kernel running unchanged if the surviving workers already satisfy the resource configuration, and proceeding to step (8.4) if fewer workers survive than the configuration allows;
step (8.4): starting a standby instance of the kernel that shares the configuration with the original instance, so that the number of active workers of this kernel on the system meets the configuration.
CN202111258248.4A 2021-10-27 2021-10-27 GPU resource dynamic allocation method under multitask concurrency condition Pending CN114048026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111258248.4A CN114048026A (en) 2021-10-27 2021-10-27 GPU resource dynamic allocation method under multitask concurrency condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111258248.4A CN114048026A (en) 2021-10-27 2021-10-27 GPU resource dynamic allocation method under multitask concurrency condition

Publications (1)

Publication Number Publication Date
CN114048026A true CN114048026A (en) 2022-02-15

Family

ID=80206147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111258248.4A Pending CN114048026A (en) 2021-10-27 2021-10-27 GPU resource dynamic allocation method under multitask concurrency condition

Country Status (1)

Country Link
CN (1) CN114048026A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718665A (en) * 2023-01-10 2023-02-28 北京卡普拉科技有限公司 Asynchronous I/O thread processor resource scheduling control method, device, medium and equipment


Similar Documents

Publication Publication Date Title
Wang et al. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing
Faggioli et al. An EDF scheduling class for the Linux kernel
US9274832B2 (en) Method and electronic device for thread scheduling
JP5367816B2 (en) Protected mode scheduling of operations
EP3425502A1 (en) Task scheduling method and device
US20100131956A1 (en) Methods and systems for managing program-level parallelism
WO2010082244A1 (en) Information processing device and information processing method
KR20050000487A (en) Scheduling method and realtime processing system
KR20170116439A (en) Apparatus for scheduling task
KR20050004688A (en) Scheduling method and information processing system
KR20090061177A (en) Multi-threading framework supporting dynamic load-balancing and multi-thread processing method using by it
Margiolas et al. Portable and transparent software managed scheduling on accelerators for fair resource sharing
Calandrino et al. Soft real-time scheduling on performance asymmetric multicore platforms
CN111209046A (en) Multitask-oriented embedded SPARC processor operating system design method
CN104090826B (en) Task optimization deployment method based on correlation
CN114048026A (en) GPU resource dynamic allocation method under multitask concurrency condition
Suo et al. Preserving i/o prioritization in virtualized oses
Marau et al. Performing flexible control on low-cost microcontrollers using a minimal real-time kernel
Ino et al. Cooperative multitasking for GPU‐accelerated grid systems
Baek et al. CARSS: Client-aware resource sharing and scheduling for heterogeneous applications
US20210004278A1 (en) Program execution control method and vehicle control device
CN116225688A (en) Multi-core collaborative rendering processing method based on GPU instruction forwarding
CN111124691B (en) Multi-process shared GPU (graphics processing Unit) scheduling method and system and electronic equipment
Elrad Comprehensive race control: A versatile scheduling mechanism for real-time applications
Beronić et al. On Analyzing Virtual Threads–a Structured Concurrency Model for Scalable Applications on the JVM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination