CN115114003B - GPU dynamic multitasking controllable concurrent execution method and system - Google Patents
- Publication number
- CN115114003B CN115114003B CN202210780174.9A CN202210780174A CN115114003B CN 115114003 B CN115114003 B CN 115114003B CN 202210780174 A CN202210780174 A CN 202210780174A CN 115114003 B CN115114003 B CN 115114003B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a GPU dynamic multitasking controllable concurrent execution method and system, comprising the following steps: step S1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed; step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently; step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed; step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel and executes it. Because the proxy kernel dynamically allocates a specified number of compute units to each kernel to be executed, the invention achieves fine-grained compute unit allocation at the running stage of a GPU program.
Description
Technical Field
The invention relates to the field of GPU task scheduling, and in particular to a GPU dynamic multitasking controllable concurrent execution method and system.
Background
Compared with a CPU, a GPU has far stronger parallel processing capability and is commonly used for tasks such as graphics rendering, high-performance computing, simulation, and the training and inference of artificial intelligence models. As the number of compute units (Compute Units) in commercial GPUs keeps growing, it is difficult for a single computing task to fully utilize all of the compute units in a GPU; letting multiple tasks share the GPU simultaneously is therefore the most common way to increase compute unit utilization.
Current GPU programming frameworks (e.g., CUDA, HIP) provide the GPU Stream abstraction for multitasking concurrency: multiple tasks submitted to different GPU Streams can execute concurrently and together make full use of the compute units in the GPU. However, when GPU Streams are used to execute multiple tasks concurrently, the user cannot control how the GPU compute units are allocated, so different tasks compete for them. While this competition can improve resource utilization and system throughput, it can also significantly increase the execution latency of each task, which seriously harms the real-time behavior of latency-sensitive tasks. Take neural network inference in intelligent driving as an example: the obstacle detection task needs the GPU for low-latency inference, and the driver state monitoring task also needs the GPU but has a loose latency requirement. When the two tasks are executed in parallel on two GPU Streams, their competition for the GPU compute units means the real-time requirement of the strongly real-time task (obstacle detection) cannot be met.
To control how concurrent tasks are allocated to GPU compute units, academia has proposed kernel fusion (Kernel Fusion) for GPU multitasking concurrency: the source code of two GPU kernels is merged so that the two kernels share all of the GPU compute units, and the compute unit allocation can be controlled inside the merged code. However, kernel fusion can only choose which kernels to merge at the program compiling stage, so it cannot be applied to dynamic scheduling scenarios in which the task combination is uncertain or only becomes known at the program running stage.
In summary, how to design a method that controls the allocation of GPU compute units to concurrently executing GPU tasks at the program running stage is a major problem to be solved by researchers in the field.
Patent document CN114048026A (application number: CN202111258248.4) discloses a dynamic allocation method for GPU resources under multitasking, aiming to solve the problems of large amounts of idle resources, reduced system throughput, and unreasonable resource allocation that arise when a static resource allocation method is used for concurrent multitasking on NVIDIA GPUs. However, that method does not dynamically allocate a specified number of compute units to the kernels to be executed through a proxy kernel, and thus does not achieve fine-grained compute unit allocation at the running stage of a GPU program.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a GPU dynamic multitasking controllable concurrent execution method and system.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
Step S1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Preferably, in said step S1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
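As a hedged illustration of the bookkeeping in step S1 (not part of the patent text): the descriptor below mirrors the stated proxy kernel parameters, and one variant is generated per CU occupancy level. The register budget formula `regFilePerCU / (occupancy * threadsPerBlock)` is a hypothetical simplification, since the patent does not specify how the per-occupancy register limit is derived; real GPUs also apply allocation granularity and other limits.

```cpp
#include <cstdint>
#include <vector>

// Descriptor for one proxy kernel invocation. The fields mirror step S1:
// entry address of the kernel function to be executed, address of its
// arguments, and the number of compute units granted to it.
struct ProxyKernelParams {
    std::uintptr_t kernelEntryAddr;  // entry address of the kernel to run
    std::uintptr_t kernelArgsAddr;   // address of that kernel's arguments
    int            numComputeUnits;  // compute units granted to the kernel
};

// One proxy kernel is generated per CU-occupancy level, each compiled
// with the maximum register count that still sustains that occupancy.
struct ProxyKernelVariant {
    int occupancy;  // thread blocks resident per compute unit
    int numRegs;    // registers per thread this variant is compiled with
};

// Hypothetical register budget: divide the per-CU register file among the
// resident thread blocks (real hardware adds rounding/granularity rules).
int maxRegsForOccupancy(int regFilePerCU, int threadsPerBlock, int occupancy) {
    return regFilePerCU / (occupancy * threadsPerBlock);
}

std::vector<ProxyKernelVariant> generateVariants(int regFilePerCU,
                                                 int threadsPerBlock,
                                                 int maxOccupancy) {
    std::vector<ProxyKernelVariant> variants;
    for (int occ = 1; occ <= maxOccupancy; ++occ)
        variants.push_back({occ, maxRegsForOccupancy(regFilePerCU,
                                                     threadsPerBlock, occ)});
    return variants;
}
```

For example, with a 65536-register file per compute unit and 256-thread blocks, occupancy 1 would allow 256 registers per thread and occupancy 4 would allow 64, under the simplified formula above.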
Preferably, in said step S2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
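The two criteria above can be sketched as a small host-side search. This is an illustrative sketch, not the patent's implementation; the names are ours, and when several kernels run concurrently, `requiredRegs` would be the maximum register count over the selected kernels.

```cpp
#include <vector>

struct ProxyKernelVariant {
    int occupancy;  // thread blocks resident per compute unit
    int numRegs;    // registers per thread this variant was compiled with
};

// Pick the proxy kernel that (1) uses more registers than the kernels to
// be executed require, and (2) uses the fewest registers among all
// variants satisfying (1). Returns -1 if no variant qualifies.
int selectProxyKernel(const std::vector<ProxyKernelVariant>& variants,
                      int requiredRegs) {
    int best = -1;
    for (int i = 0; i < (int)variants.size(); ++i) {
        if (variants[i].numRegs <= requiredRegs) continue;       // criterion 1
        if (best < 0 || variants[i].numRegs < variants[best].numRegs)
            best = i;                                            // criterion 2
    }
    return best;
}
```

Choosing the smallest qualifying register count keeps the CU occupancy of the proxy kernel (and hence of the kernels it hosts) as high as possible, which is the point of generating several variants.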
Preferably, in said step S3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
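A minimal host-side sketch of these launch parameter rules (function and type names are ours, not the patent's):

```cpp
#include <algorithm>
#include <vector>

// Per-kernel launch requirements of the kernels to be executed.
struct KernelReq {
    int threadsPerBlock;  // largest thread block of this kernel
    int sharedMemBytes;   // dynamic shared memory this kernel uses
};

struct LaunchConfig {
    int numBlocks;        // thread blocks to launch for the proxy kernel
    int threadsPerBlock;  // threads per proxy-kernel thread block
    int sharedMemBytes;   // dynamic shared memory per thread block
};

// Step S3 rules: blocks = GPU compute units x CU occupancy of the chosen
// proxy kernel; threads per block and dynamic shared memory are the
// maxima over all concurrently executed kernels.
LaunchConfig proxyLaunchConfig(int numComputeUnits, int proxyOccupancy,
                               const std::vector<KernelReq>& kernels) {
    LaunchConfig cfg{numComputeUnits * proxyOccupancy, 0, 0};
    for (const KernelReq& k : kernels) {
        cfg.threadsPerBlock = std::max(cfg.threadsPerBlock, k.threadsPerBlock);
        cfg.sharedMemBytes  = std::max(cfg.sharedMemBytes, k.sharedMemBytes);
    }
    return cfg;
}
```

Launching exactly `numComputeUnits * occupancy` blocks fills every compute unit to the proxy kernel's occupancy, so the proxy kernel occupies the whole GPU and can then hand compute units out to the hosted kernels.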
Preferably, in said step S4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
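The per-compute-unit dispatch in step S4 can be modeled on the host side as follows. This is an illustrative sketch, not the patent's device code: it assumes each kernel is granted a contiguous range of compute unit IDs, which the patent does not specify, and the actual JMP to the kernel's entry address happens in GPU assembly and is not shown.

```cpp
#include <vector>

// Model of the dispatch performed inside the proxy kernel: given the ID
// of the compute unit a thread block is running on and the number of
// compute units granted to each kernel, decide which kernel this block
// should execute. (On the GPU, the proxy kernel then sets that kernel's
// function parameters, thread block ID, and thread ID, and jumps to its
// entry address with a JMP instruction.)
// Returns -1 for compute units left unassigned.
int kernelForComputeUnit(int cuId, const std::vector<int>& cusPerKernel) {
    int firstCU = 0;  // start of the current kernel's CU range
    for (int k = 0; k < (int)cusPerKernel.size(); ++k) {
        if (cuId < firstCU + cusPerKernel[k]) return k;
        firstCU += cusPerKernel[k];
    }
    return -1;  // cuId beyond all assigned ranges
}
```

For example, with allocations of 4 and 8 compute units, compute units 0-3 would run the first kernel and 4-11 the second, while any remaining units stay idle.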
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises:
Module M1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Preferably, in said module M1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
Preferably, in said module M2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Preferably, in said module M3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Preferably, in said module M4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
Compared with the prior art, the invention has the following beneficial effects:
1. The proxy kernel dynamically allocates a specified number of compute units to each kernel to be executed, so fine-grained compute unit allocation at the running stage of a GPU program is achieved;
2. Multiple proxy kernels are used to meet the register requirements of different kernels to be executed, which maximizes the CU occupancy of the kernels to be executed and reduces the performance cost;
3. The proxy kernel jumps directly to the kernel to be executed with a JMP instruction, which avoids the context saving caused by using a function pointer and reduces the cost of calling a function through a function pointer.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the implementation of the present invention;
FIG. 2 is a schematic diagram of the execution flow of the proxy kernel.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1:
The invention provides a GPU dynamic multitasking controllable concurrent execution method and system. In the program compiling stage, the method generates several proxy kernels as the entry points for the kernels to be executed. In the program running stage, the user can dynamically select the kernels to be executed concurrently, and a suitable proxy kernel is selected and submitted to the GPU according to the number of registers required by the selected kernels. Through the proxy kernel, the user can dynamically control the number of compute units used by each kernel to be executed; finally the proxy kernel jumps to each kernel to be executed and executes it.
As shown in FIG. 1 and FIG. 2, the GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
Step S1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Specifically, in the step S1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Specifically, in the step S2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Specifically, in the step S3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Specifically, in the step S4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
Example 2:
Example 2 is a preferred example of Example 1 and explains the present invention more specifically.
Those skilled in the art will understand the GPU dynamic multitasking controllable concurrent execution method provided by the present invention as a specific implementation of the GPU dynamic multitasking controllable concurrent execution system; that is, the system can be implemented by executing the step flow of the method.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises:
Module M1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Specifically, in the module M1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Specifically, in the module M2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Specifically, in the module M3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Specifically, in the module M4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
Example 3:
Example 3 is a preferred example of Example 1 and explains the present invention more specifically.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
(1) Generating the proxy kernel source code: generating the source code of several proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and all concurrently executed kernels are jumped to from a proxy kernel.
(2) Compiling the proxy kernels and the kernels to be executed: compiling the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Loading the proxy kernels and the kernels to be executed: loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Selecting the kernels to be executed: selecting the concurrently executed kernels according to the user's requirements, and selecting a suitable proxy kernel according to the number of registers required by the selected kernels to be executed;
(5) Launching the proxy kernel: allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
(6) Executing the proxy kernel: executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed.
Specifically, in the step (1), the parameters of the proxy kernel include the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed.
Specifically, in the step (1), a proxy kernel corresponding to the maximum number of registers should be generated for each CU occupancy level.
Specifically, in the step (4), the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Specifically, in the step (5), the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel.
Specifically, in the step (5), the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed.
Specifically, in the step (5), the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Specifically, in the step (6), the proxy kernel needs to set the function parameters, the thread block ID, and the thread ID of the kernel to be executed.
Specifically, in the step (6), a JMP instruction is used to jump directly to the function entry address of the kernel to be executed.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises the following modules:
(1) Generating the proxy kernel source code: generating the source code of several proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and all concurrently executed kernels are jumped to from a proxy kernel.
(2) Compiling the proxy kernels and the kernels to be executed: compiling the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Loading the proxy kernels and the kernels to be executed: loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Selecting the kernels to be executed: selecting the concurrently executed kernels according to the user's requirements, and selecting a suitable proxy kernel according to the number of registers required by the selected kernels to be executed;
(5) Launching the proxy kernel: allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
(6) Executing the proxy kernel: executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed.
In the module (1), the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed.
In the module (1), a proxy kernel corresponding to the maximum number of registers should be generated for each CU occupancy level.
In the module (4), the selected proxy kernel should be the one that uses the fewest registers among all proxy kernels whose register count is greater than the number of registers required by the selected kernels to be executed.
In the module (5), the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel.
In the module (5), the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed.
In the module (5), the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
In the module (6), the proxy kernel needs to set the function parameters, the thread block ID, and the thread ID of the kernel to be executed.
In the module (6), a JMP instruction is used to jump directly to the function entry address of the kernel to be executed.
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the present invention, comprising the following steps:
(1) Generating proxy kernel source code: generating source code for several proxy kernels, where every proxy kernel has the same source code but a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel jumps from a proxy kernel.
(2) Compiling the proxy kernels and the kernels to be executed: compiling the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Loading the proxy kernels and the kernels to be executed: loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Selecting kernels to be executed: selecting the kernels to execute concurrently according to the user's requirements, and selecting a suitable proxy kernel according to the number of registers required by the selected kernels to be executed;
(5) Starting the proxy kernel: allocating a number of computing units to each kernel to be executed according to the user's requirements, setting the proxy kernel's launch parameters, and starting the proxy kernel;
(6) Executing the proxy kernel: executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current computing unit, setting its parameters, and jumping to the selected kernel for execution.
Specifically, in the step (4), the number of registers of a proxy kernel may be set through a kernel attribute.
Specifically, in the step (4), the user may select the concurrently executed kernels according to the application's requirements, for example, selecting kernels with similar execution times, or pairing a compute-intensive kernel with a memory-intensive one.
Specifically, in the step (5), the user may allocate computing units to each kernel to be executed according to the application's requirements, for example, first allocating enough computing units to high-priority kernels and then allocating the remaining computing units to low-priority kernels.
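The priority-driven allocation described for step (5) is essentially a greedy pass over the kernels in priority order. An illustrative Python sketch (the name and `(kernel, wanted)` request shape are assumptions):

```python
def allocate_cus(total_cus, requests):
    """Greedy computing-unit allocation: `requests` is a list of
    (kernel_name, wanted_cus) pairs ordered from high to low priority.
    High-priority kernels receive their full request while computing
    units remain; lower-priority kernels share whatever is left."""
    allocation, remaining = {}, total_cus
    for name, wanted in requests:
        given = min(wanted, remaining)   # never exceed what is left
        allocation[name] = given
        remaining -= given
    return allocation
```

With 10 computing units, a high-priority kernel asking for 6 and a low-priority one asking for 8, the low-priority kernel would only receive the remaining 4 units.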
Fig. 2 is a schematic diagram of the execution flow of the proxy kernel in the step (6) of the present invention, which includes the following steps:
(6.1) acquiring the ID of the computing unit where the current thread block is located;
(6.2) acquiring the next kernel to be executed;
(6.3) if the current computing unit ID is smaller than the maximum computing unit ID allocated to the kernel acquired in the previous step, executing (6.4); otherwise, returning to (6.2);
(6.4) setting the thread block ID, thread ID and function parameters for the selected kernel to be executed;
(6.5) jumping to the entry address of the selected kernel using the JMP instruction;
(6.6) executing the selected kernel to be executed.
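Steps (6.1)-(6.6) can be modeled on the host as a walk over the kernels in allocation order, picking the first one whose computing-unit range covers the current unit. This Python sketch is an illustrative model only; the `(kernel_name, max_cu_id)` table layout is an assumption, and on the GPU the matching entry's function parameters, thread block ID and thread ID would be set before the JMP to its entry address:

```python
def dispatch_for_cu(cu_id, kernel_table):
    """Model of the per-computing-unit dispatch loop of Fig. 2:
    `kernel_table` lists (kernel_name, max_cu_id) pairs in allocation
    order, where max_cu_id is the exclusive upper bound of the CU range
    assigned to that kernel. Returns the kernel this CU should run."""
    for name, max_cu_id in kernel_table:
        if cu_id < max_cu_id:            # step (6.3): range check
            return name                  # steps (6.4)-(6.6) would follow
    return None  # no kernel assigned to this computing unit
```

For example, with kernel A owning computing units 0-3 and kernel B owning 4-7, the table would be `[("A", 4), ("B", 8)]`, and unit 5 would be dispatched to B.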
Those skilled in the art will appreciate that, in addition to being implemented as pure computer-readable program code, the systems, apparatus, and their respective modules provided herein may be implemented entirely by programming the method steps in logic, so that they take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system, the apparatus, and their modules provided by the present invention may be regarded as hardware components, and the modules they contain for implementing various programs may also be regarded as structures within those hardware components; modules for implementing various functions may likewise be regarded either as software programs implementing the method or as structures within hardware components.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.
Claims (6)
1. A method for dynamically multitasking and concurrently controllable execution of a GPU, comprising:
Step S1: generating one or more proxy kernels as entries for the kernels to be executed in the program compiling stage;
Step S2: in the program running stage, the user dynamically selecting the kernels to be executed concurrently;
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the kernels to be executed;
Step S4: the user dynamically controlling, through the proxy kernel, the number of computing units used by each kernel to be executed, jumping to each kernel to be executed and executing it;
in the step S1:
generating source code for one or more proxy kernels, each of which has the same source code but a different number of registers; the proxy kernel is the entry for all kernels to be executed, and all concurrently executed kernels jump from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of computing units used by the kernel to be executed;
generating a proxy kernel corresponding to the maximum number of registers for each CU Occupancy level;
in the step S4:
executing the proxy kernel in the GPU, selecting the corresponding kernel to be executed according to the ID of the current computing unit, setting its parameters and jumping to the kernel to be executed for execution;
the proxy kernel sets the function parameters, the thread block ID and the thread ID of the kernel to be executed;
jumping to the function entry address of the kernel to be executed using the JMP instruction.
2. The GPU dynamic multitasking controllable concurrent execution method according to claim 1, characterized in that in said step S2:
selecting the kernels to be executed concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should meet the following two criteria: first, the number of registers it uses is larger than the number of registers required by the selected kernels to be executed; second, it uses the fewest registers among all proxy kernels meeting the first criterion.
3. The GPU dynamic multitasking controllable concurrent execution method according to claim 1, characterized in that in said step S3:
allocating a number of computing units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel and starting the proxy kernel;
the number of thread blocks of the started proxy kernel is the product of the number of GPU computing units and the CU Occupancy of the current proxy kernel;
the number of threads in each thread block of the started proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the started proxy kernel is the maximum shared memory size used by any kernel to be executed.
4. A GPU dynamic multitasking controllable concurrent execution system, comprising:
module M1: generating one or more proxy kernels as entries for the kernels to be executed in the program compiling stage;
module M2: in the program running stage, the user dynamically selecting the kernels to be executed concurrently;
module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the kernels to be executed;
module M4: the user dynamically controlling, through the proxy kernel, the number of computing units used by each kernel to be executed, jumping to each kernel to be executed and executing it;
In the module M1:
generating source code for one or more proxy kernels, each of which has the same source code but a different number of registers; the proxy kernel is the entry for all kernels to be executed, and all concurrently executed kernels jump from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of computing units used by the kernel to be executed;
generating a proxy kernel corresponding to the maximum number of registers for each CU Occupancy level;
in the module M4:
executing the proxy kernel in the GPU, selecting the corresponding kernel to be executed according to the ID of the current computing unit, setting its parameters and jumping to the kernel to be executed for execution;
the proxy kernel sets the function parameters, the thread block ID and the thread ID of the kernel to be executed;
jumping to the function entry address of the kernel to be executed using the JMP instruction.
5. The GPU dynamic multitasking controllable concurrent execution system of claim 4, wherein in said module M2:
selecting the kernels to be executed concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should meet the following two criteria: first, the number of registers it uses is larger than the number of registers required by the selected kernels to be executed; second, it uses the fewest registers among all proxy kernels meeting the first criterion.
6. The GPU dynamic multitasking controllable concurrent execution system of claim 4, wherein in said module M3:
allocating a number of computing units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel and starting the proxy kernel;
the number of thread blocks of the started proxy kernel is the product of the number of GPU computing units and the CU Occupancy of the current proxy kernel;
the number of threads in each thread block of the started proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the started proxy kernel is the maximum shared memory size used by any kernel to be executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210780174.9A CN115114003B (en) | 2022-07-04 | 2022-07-04 | GPU dynamic multitasking controllable concurrent execution method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115114003A CN115114003A (en) | 2022-09-27 |
CN115114003B true CN115114003B (en) | 2024-05-28 |
Family
ID=83330376
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210780174.9A Active CN115114003B (en) | 2022-07-04 | 2022-07-04 | GPU dynamic multitasking controllable concurrent execution method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115114003B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101963918A (en) * | 2010-10-26 | 2011-02-02 | 上海交通大学 | Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform |
CN106502771A (en) * | 2016-09-09 | 2017-03-15 | 中国农业大学 | Time overhead model building method and system based on kernel functions |
CN110196753A (en) * | 2019-01-21 | 2019-09-03 | 腾讯科技(北京)有限公司 | Graphics processor GPU vitualization method, apparatus and readable medium based on container |
CN111722915A (en) * | 2020-06-22 | 2020-09-29 | 上海商汤智能科技有限公司 | Task processing method, device and system |
CN112417470A (en) * | 2020-11-06 | 2021-02-26 | 上海壁仞智能科技有限公司 | Method and device for realizing GPU data security access, electronic equipment and storage medium |
CN113490924A (en) * | 2019-02-22 | 2021-10-08 | 英特尔公司 | Dynamic switching between EPT and shadow page tables for runtime processor validation |
WO2021211809A1 (en) * | 2020-04-16 | 2021-10-21 | Texas Instruments Incorporated | Scalable hardware thread scheduler |
CN114356547A (en) * | 2021-12-07 | 2022-04-15 | 北京百度网讯科技有限公司 | Low-priority blocking method and device based on processor virtualization environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170371662A1 (en) * | 2016-06-23 | 2017-12-28 | Intel Corporation | Extension of register files for local processing of data in computing environments |
Non-Patent Citations (3)
Title |
---|
"Efficient Execution of OpenMP on GPUs";Joseph Huber;《2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)》;20220329;第41-52页 * |
"Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences";Mingcong Han;《Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation》;20220713;第539-558页 * |
"一种基于冗余线程的GPU多副本容错技术";贾佳;《计算机研究与发展》;20130715;第50卷(第07期);第1551-1562页 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100962531B1 (en) | Apparatus for processing multi-threading framework supporting dynamic load-balancing and multi-thread processing method using by it | |
CN1043932C (en) | Multi-tasking low-power controller | |
CN112463709A (en) | Configurable heterogeneous artificial intelligence processor | |
CN112465129A (en) | On-chip heterogeneous artificial intelligence processor | |
US20090160867A1 (en) | Autonomous Context Scheduler For Graphics Processing Units | |
US9152462B2 (en) | Parallel processing device, parallel processing method, optimization device, optimization method and computer program | |
US9582320B2 (en) | Computer systems and methods with resource transfer hint instruction | |
CN105893126A (en) | Task scheduling method and device | |
US20050125793A1 (en) | Operating system kernel-assisted, self-balanced, access-protected library framework in a run-to-completion multi-processor environment | |
CN110308982B (en) | Shared memory multiplexing method and device | |
KR20040069257A (en) | Method of scheduling in a reconfigurable hardware architecture with multiple hardware configurations | |
CN112711478A (en) | Task processing method, device, server and storage medium based on neural network | |
US11947999B2 (en) | Multi-phased and multi-threaded program execution based on SIMD ratio | |
WO2020227582A2 (en) | Method and apparatus for scheduling matrix operations in digital processing systems | |
Li et al. | Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time | |
CN111597044A (en) | Task scheduling method and device, storage medium and electronic equipment | |
CN115114003B (en) | GPU dynamic multitasking controllable concurrent execution method and system | |
US9760969B2 (en) | Graphic processing system and method thereof | |
CN116107753A (en) | Task node distribution method and device, electronic equipment and storage medium | |
CN110969565A (en) | Image processing method and device | |
CN114048026A (en) | GPU resource dynamic allocation method under multitask concurrency condition | |
JP2006099579A (en) | Information processor and information processing method | |
WO2023283767A1 (en) | Task scheduling method and apparatus | |
Chow et al. | Energy efficient task graph execution using compute unit masking in GPUs | |
Gannouni | A Gamma-calculus GPU-based parallel programming framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||