CN115114003B - GPU dynamic multitasking controllable concurrent execution method and system - Google Patents
- Publication number
- CN115114003B CN115114003B CN202210780174.9A CN202210780174A CN115114003B CN 115114003 B CN115114003 B CN 115114003B CN 202210780174 A CN202210780174 A CN 202210780174A CN 115114003 B CN115114003 B CN 115114003B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a GPU dynamic multitasking controllable concurrent execution method and system, comprising the following steps: step S1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed; step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently; step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed; step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel and executes it. Because the proxy kernel dynamically allocates a specified number of compute units to each kernel to be executed, the invention achieves fine-grained compute unit allocation at the running stage of a GPU program.
Description
Technical Field
The invention relates to the field of GPU task scheduling, and in particular to a GPU dynamic multitasking controllable concurrent execution method and system.
Background
Compared with a CPU, a GPU has far stronger parallel processing capability and is commonly used for tasks such as graphics rendering, high-performance computing, simulation, and the training and inference of artificial intelligence models. As the number of compute units (Compute Units) in commercial GPUs keeps growing, it is difficult for a single computing task to fully utilize all of the compute units in a GPU; letting multiple tasks share the GPU simultaneously is therefore the most common way to increase compute unit utilization.
Current GPU programming frameworks (e.g., CUDA, HIP) provide the GPU Stream abstraction for multitasking concurrency: multiple tasks submitted to different GPU Streams can execute concurrently and together make full use of the compute units in the GPU. However, when GPU Streams are used to execute multiple tasks concurrently, the user cannot control how the GPU compute units are allocated, so different tasks compete for them. While this competition can improve resource utilization and system throughput, it can also significantly increase the execution latency of each task, which seriously harms the real-time behavior of latency-sensitive tasks. Take neural network inference in intelligent driving as an example: the obstacle detection task needs the GPU for low-latency inference, and the driver state monitoring task also needs the GPU but has a loose latency requirement. When the two tasks are executed in parallel on two GPU Streams, their competition for the GPU compute units means the real-time requirement of the strongly real-time task (obstacle detection) cannot be met.
To control how concurrent tasks are allocated to GPU compute units, academia has proposed kernel fusion (Kernel Fusion) for GPU multitasking concurrency: the source code of two GPU kernels is merged so that the two kernels share all of the GPU compute units, and the compute unit allocation can be controlled inside the merged code. However, kernel fusion can only choose which kernels to merge at the program compiling stage, so it cannot be applied to dynamic scheduling scenarios in which the task combination is uncertain or only becomes known at the program running stage.
In summary, how to design a method that controls the allocation of GPU compute units to concurrently executing GPU tasks at the program running stage is a major problem to be solved by researchers in the field.
Patent document CN114048026A (application number: CN202111258248.4) discloses a dynamic allocation method for GPU resources under multitasking, aiming to solve the problems of large amounts of idle resources, reduced system throughput, and unreasonable resource allocation that arise when a static resource allocation method is used for concurrent multitasking on NVIDIA GPUs. However, that method does not dynamically allocate a specified number of compute units to the kernels to be executed through a proxy kernel, and thus does not achieve fine-grained compute unit allocation at the running stage of a GPU program.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a GPU dynamic multitasking controllable concurrent execution method and system.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
Step S1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Preferably, in said step S1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
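As a hedged illustration of the bookkeeping in step S1 (not part of the patent text): the descriptor below mirrors the stated proxy kernel parameters, and one variant is generated per CU occupancy level. The register budget formula `regFilePerCU / (occupancy * threadsPerBlock)` is a hypothetical simplification, since the patent does not specify how the per-occupancy register limit is derived; real GPUs also apply allocation granularity and other limits.

```cpp
#include <cstdint>
#include <vector>

// Descriptor for one proxy kernel invocation. The fields mirror step S1:
// entry address of the kernel function to be executed, address of its
// arguments, and the number of compute units granted to it.
struct ProxyKernelParams {
    std::uintptr_t kernelEntryAddr;  // entry address of the kernel to run
    std::uintptr_t kernelArgsAddr;   // address of that kernel's arguments
    int            numComputeUnits;  // compute units granted to the kernel
};

// One proxy kernel is generated per CU-occupancy level, each compiled
// with the maximum register count that still sustains that occupancy.
struct ProxyKernelVariant {
    int occupancy;  // thread blocks resident per compute unit
    int numRegs;    // registers per thread this variant is compiled with
};

// Hypothetical register budget: divide the per-CU register file among the
// resident thread blocks (real hardware adds rounding/granularity rules).
int maxRegsForOccupancy(int regFilePerCU, int threadsPerBlock, int occupancy) {
    return regFilePerCU / (occupancy * threadsPerBlock);
}

std::vector<ProxyKernelVariant> generateVariants(int regFilePerCU,
                                                 int threadsPerBlock,
                                                 int maxOccupancy) {
    std::vector<ProxyKernelVariant> variants;
    for (int occ = 1; occ <= maxOccupancy; ++occ)
        variants.push_back({occ, maxRegsForOccupancy(regFilePerCU,
                                                     threadsPerBlock, occ)});
    return variants;
}
```

For example, with a 65536-register file per compute unit and 256-thread blocks, occupancy 1 would allow 256 registers per thread and occupancy 4 would allow 64, under the simplified formula above.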
Preferably, in said step S2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
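The two criteria above can be sketched as a small host-side search. This is an illustrative sketch, not the patent's implementation; the names are ours, and when several kernels run concurrently, `requiredRegs` would be the maximum register count over the selected kernels.

```cpp
#include <vector>

struct ProxyKernelVariant {
    int occupancy;  // thread blocks resident per compute unit
    int numRegs;    // registers per thread this variant was compiled with
};

// Pick the proxy kernel that (1) uses more registers than the kernels to
// be executed require, and (2) uses the fewest registers among all
// variants satisfying (1). Returns -1 if no variant qualifies.
int selectProxyKernel(const std::vector<ProxyKernelVariant>& variants,
                      int requiredRegs) {
    int best = -1;
    for (int i = 0; i < (int)variants.size(); ++i) {
        if (variants[i].numRegs <= requiredRegs) continue;       // criterion 1
        if (best < 0 || variants[i].numRegs < variants[best].numRegs)
            best = i;                                            // criterion 2
    }
    return best;
}
```

Choosing the smallest qualifying register count keeps the CU occupancy of the proxy kernel (and hence of the kernels it hosts) as high as possible, which is the point of generating several variants.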
Preferably, in said step S3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
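A minimal host-side sketch of these launch parameter rules (function and type names are ours, not the patent's):

```cpp
#include <algorithm>
#include <vector>

// Per-kernel launch requirements of the kernels to be executed.
struct KernelReq {
    int threadsPerBlock;  // largest thread block of this kernel
    int sharedMemBytes;   // dynamic shared memory this kernel uses
};

struct LaunchConfig {
    int numBlocks;        // thread blocks to launch for the proxy kernel
    int threadsPerBlock;  // threads per proxy-kernel thread block
    int sharedMemBytes;   // dynamic shared memory per thread block
};

// Step S3 rules: blocks = GPU compute units x CU occupancy of the chosen
// proxy kernel; threads per block and dynamic shared memory are the
// maxima over all concurrently executed kernels.
LaunchConfig proxyLaunchConfig(int numComputeUnits, int proxyOccupancy,
                               const std::vector<KernelReq>& kernels) {
    LaunchConfig cfg{numComputeUnits * proxyOccupancy, 0, 0};
    for (const KernelReq& k : kernels) {
        cfg.threadsPerBlock = std::max(cfg.threadsPerBlock, k.threadsPerBlock);
        cfg.sharedMemBytes  = std::max(cfg.sharedMemBytes, k.sharedMemBytes);
    }
    return cfg;
}
```

Launching exactly `numComputeUnits * occupancy` blocks fills every compute unit to the proxy kernel's occupancy, so the proxy kernel occupies the whole GPU and can then hand compute units out to the hosted kernels.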
Preferably, in said step S4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
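The per-compute-unit dispatch in step S4 can be modeled on the host side as follows. This is an illustrative sketch, not the patent's device code: it assumes each kernel is granted a contiguous range of compute unit IDs, which the patent does not specify, and the actual JMP to the kernel's entry address happens in GPU assembly and is not shown.

```cpp
#include <vector>

// Model of the dispatch performed inside the proxy kernel: given the ID
// of the compute unit a thread block is running on and the number of
// compute units granted to each kernel, decide which kernel this block
// should execute. (On the GPU, the proxy kernel then sets that kernel's
// function parameters, thread block ID, and thread ID, and jumps to its
// entry address with a JMP instruction.)
// Returns -1 for compute units left unassigned.
int kernelForComputeUnit(int cuId, const std::vector<int>& cusPerKernel) {
    int firstCU = 0;  // start of the current kernel's CU range
    for (int k = 0; k < (int)cusPerKernel.size(); ++k) {
        if (cuId < firstCU + cusPerKernel[k]) return k;
        firstCU += cusPerKernel[k];
    }
    return -1;  // cuId beyond all assigned ranges
}
```

For example, with allocations of 4 and 8 compute units, compute units 0-3 would run the first kernel and 4-11 the second, while any remaining units stay idle.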
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises:
Module M1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Preferably, in said module M1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
Preferably, in said module M2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Preferably, in said module M3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Preferably, in said module M4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
Compared with the prior art, the invention has the following beneficial effects:
1. The proxy kernel dynamically allocates a specified number of compute units to each kernel to be executed, so fine-grained compute unit allocation at the running stage of a GPU program is achieved;
2. Multiple proxy kernels are used to meet the register requirements of different kernels to be executed, which maximizes the CU occupancy of the kernels to be executed and reduces the performance cost;
3. The proxy kernel jumps directly to the kernel to be executed with a JMP instruction, which avoids the context saving caused by using a function pointer and reduces the cost of calling a function through a function pointer.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the implementation of the present invention;
FIG. 2 is a schematic diagram of the execution flow of the proxy kernel.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1:
The invention provides a GPU dynamic multitasking controllable concurrent execution method and system. In the program compiling stage, the method generates several proxy kernels as the entry points for the kernels to be executed. In the program running stage, the user can dynamically select the kernels to be executed concurrently, and a suitable proxy kernel is selected and submitted to the GPU according to the number of registers required by the selected kernels. Through the proxy kernel, the user can dynamically control the number of compute units used by each kernel to be executed; finally the proxy kernel jumps to each kernel to be executed and executes it.
As shown in FIG. 1 and FIG. 2, the GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
Step S1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Specifically, in the step S1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Specifically, in the step S2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Specifically, in the step S3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Specifically, in the step S4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
Example 2:
Example 2 is a preferred example of Example 1 and explains the present invention more specifically.
Those skilled in the art will understand the GPU dynamic multitasking controllable concurrent execution method provided by the present invention as a specific implementation of the GPU dynamic multitasking controllable concurrent execution system; that is, the system can be implemented by executing the step flow of the method.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises:
Module M1: in the program compiling stage, generating one or more proxy kernels as the entry points for the kernels to be executed;
Specifically, in the module M1:
Generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel is jumped to from a proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; and loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum number of registers is generated for each CU occupancy level.
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Specifically, in the module M2:
Selecting the concurrently executed kernels according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Specifically, in the module M3:
Allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it.
Specifically, in the module M4:
Executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed;
the proxy kernel sets the function parameters, the thread block ID, and the thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed is performed with a JMP instruction.
Example 3:
Example 3 is a preferred example of Example 1 and explains the present invention more specifically.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
(1) Generating the proxy kernel source code: generating the source code of several proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and all concurrently executed kernels are jumped to from a proxy kernel.
(2) Compiling the proxy kernels and the kernels to be executed: compiling the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Loading the proxy kernels and the kernels to be executed: loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Selecting the kernels to be executed: selecting the concurrently executed kernels according to the user's requirements, and selecting a suitable proxy kernel according to the number of registers required by the selected kernels to be executed;
(5) Launching the proxy kernel: allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
(6) Executing the proxy kernel: executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed.
Specifically, in the step (1), the parameters of the proxy kernel include the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed.
Specifically, in the step (1), a proxy kernel corresponding to the maximum number of registers should be generated for each CU occupancy level.
Specifically, in the step (4), the selected proxy kernel should satisfy two criteria: first, the number of registers it uses is greater than the number of registers required by the selected kernels to be executed; second, among all proxy kernels satisfying the first criterion, it should be the one that uses the fewest registers.
Specifically, in the step (5), the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel.
Specifically, in the step (5), the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed.
Specifically, in the step (5), the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
Specifically, in the step (6), the proxy kernel needs to set the function parameters, the thread block ID, and the thread ID of the kernel to be executed.
Specifically, in the step (6), a JMP instruction is used to jump directly to the function entry address of the kernel to be executed.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises the following modules:
(1) Generating the proxy kernel source code: generating the source code of several proxy kernels, where every proxy kernel has the same source code but each uses a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and all concurrently executed kernels are jumped to from a proxy kernel.
(2) Compiling the proxy kernels and the kernels to be executed: compiling the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Loading the proxy kernels and the kernels to be executed: loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Selecting the kernels to be executed: selecting the concurrently executed kernels according to the user's requirements, and selecting a suitable proxy kernel according to the number of registers required by the selected kernels to be executed;
(5) Launching the proxy kernel: allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel, and launching the proxy kernel;
(6) Executing the proxy kernel: executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current compute unit, setting its parameters, and jumping to the kernel to be executed.
In the module (1), the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of compute units used by the kernel to be executed.
In the module (1), a proxy kernel corresponding to the maximum number of registers should be generated for each CU occupancy level.
In the module (4), the selected proxy kernel should be the one that uses the fewest registers among all proxy kernels whose register count is greater than the number of registers required by the selected kernels to be executed.
In the module (5), the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel.
In the module (5), the number of threads contained in each thread block of the launched proxy kernel is the maximum thread count over all thread blocks of the kernels to be executed.
In the module (5), the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any of the kernels to be executed.
In the module (6), the proxy kernel needs to set the function parameters, the thread block ID, and the thread ID of the kernel to be executed.
In the module (6), a JMP instruction is used to jump directly to the function entry address of the kernel to be executed.
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the present invention, comprising the following steps:
(1) Generating proxy kernel source code: generating source code for several proxy kernels, where every proxy kernel has the same source code but a different number of registers; the proxy kernel is the entry point for all kernels to be executed, and every concurrently executed kernel jumps from a proxy kernel.
(2) Compiling the proxy kernels and the kernels to be executed: compiling the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Loading the proxy kernels and the kernels to be executed: loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Selecting kernels to be executed: selecting the kernels to execute concurrently according to the user's requirements, and selecting a suitable proxy kernel according to the number of registers required by the selected kernels to be executed;
(5) Starting the proxy kernel: allocating a number of computing units to each kernel to be executed according to the user's requirements, setting the proxy kernel's launch parameters, and starting the proxy kernel;
(6) Executing the proxy kernel: executing the proxy kernel on the GPU, selecting the corresponding kernel to be executed according to the ID of the current computing unit, setting its parameters, and jumping to the selected kernel for execution.
Specifically, in the step (4), the number of registers of a proxy kernel may be set through a kernel attribute.
Specifically, in the step (4), the user may select the concurrently executed kernels according to the application's requirements, for example, selecting kernels with similar execution times, or pairing a compute-intensive kernel with a memory-intensive one.
Specifically, in the step (5), the user may allocate computing units to each kernel to be executed according to the application's requirements, for example, first allocating enough computing units to high-priority kernels and then allocating the remaining computing units to low-priority kernels.
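The priority-driven allocation described for step (5) is essentially a greedy pass over the kernels in priority order. An illustrative Python sketch (the name and `(kernel, wanted)` request shape are assumptions):

```python
def allocate_cus(total_cus, requests):
    """Greedy computing-unit allocation: `requests` is a list of
    (kernel_name, wanted_cus) pairs ordered from high to low priority.
    High-priority kernels receive their full request while computing
    units remain; lower-priority kernels share whatever is left."""
    allocation, remaining = {}, total_cus
    for name, wanted in requests:
        given = min(wanted, remaining)   # never exceed what is left
        allocation[name] = given
        remaining -= given
    return allocation
```

With 10 computing units, a high-priority kernel asking for 6 and a low-priority one asking for 8, the low-priority kernel would only receive the remaining 4 units.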
Fig. 2 is a schematic diagram of the execution flow of the proxy kernel in the step (6) of the present invention, which includes the following steps:
(6.1) acquiring the ID of the computing unit where the current thread block is located;
(6.2) acquiring the next kernel to be executed;
(6.3) if the current computing unit ID is smaller than the maximum computing unit ID allocated to the kernel acquired in the previous step, executing (6.4); otherwise, returning to (6.2);
(6.4) setting the thread block ID, thread ID and function parameters for the selected kernel to be executed;
(6.5) jumping to the entry address of the selected kernel using the JMP instruction;
(6.6) executing the selected kernel to be executed.
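Steps (6.1)-(6.6) can be modeled on the host as a walk over the kernels in allocation order, picking the first one whose computing-unit range covers the current unit. This Python sketch is an illustrative model only; the `(kernel_name, max_cu_id)` table layout is an assumption, and on the GPU the matching entry's function parameters, thread block ID and thread ID would be set before the JMP to its entry address:

```python
def dispatch_for_cu(cu_id, kernel_table):
    """Model of the per-computing-unit dispatch loop of Fig. 2:
    `kernel_table` lists (kernel_name, max_cu_id) pairs in allocation
    order, where max_cu_id is the exclusive upper bound of the CU range
    assigned to that kernel. Returns the kernel this CU should run."""
    for name, max_cu_id in kernel_table:
        if cu_id < max_cu_id:            # step (6.3): range check
            return name                  # steps (6.4)-(6.6) would follow
    return None  # no kernel assigned to this computing unit
```

For example, with kernel A owning computing units 0-3 and kernel B owning 4-7, the table would be `[("A", 4), ("B", 8)]`, and unit 5 would be dispatched to B.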
Those skilled in the art will appreciate that, in addition to being implemented as pure computer-readable program code, the systems, apparatus, and their respective modules provided herein may be implemented entirely by programming the method steps in logic, so that they take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system, the apparatus, and their modules provided by the present invention may be regarded as hardware components, and the modules they contain for implementing various programs may also be regarded as structures within those hardware components; modules for implementing various functions may likewise be regarded either as software programs implementing the method or as structures within hardware components.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.
Claims (6)
1. A method for dynamically multitasking and concurrently controllable execution of a GPU, comprising:
Step S1: generating one or more proxy kernels as entries for the kernels to be executed in the program compiling stage;
Step S2: in the program running stage, the user dynamically selecting the kernels to be executed concurrently;
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the kernels to be executed;
Step S4: the user dynamically controlling, through the proxy kernel, the number of computing units used by each kernel to be executed, jumping to each kernel to be executed and executing it;
in the step S1:
generating source code for one or more proxy kernels, each of which has the same source code but a different number of registers; the proxy kernel is the entry for all kernels to be executed, and all concurrently executed kernels jump from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of computing units used by the kernel to be executed;
generating a proxy kernel corresponding to the maximum number of registers for each CU Occupancy level;
in the step S4:
executing the proxy kernel in the GPU, selecting the corresponding kernel to be executed according to the ID of the current computing unit, setting its parameters and jumping to the kernel to be executed for execution;
the proxy kernel sets the function parameters, the thread block ID and the thread ID of the kernel to be executed;
jumping to the function entry address of the kernel to be executed using the JMP instruction.
2. The GPU dynamic multitasking controllable concurrent execution method according to claim 1, characterized in that in said step S2:
selecting the kernels to be executed concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should meet the following two criteria: first, the number of registers it uses is larger than the number of registers required by the selected kernels to be executed; second, it uses the fewest registers among all proxy kernels meeting the first criterion.
3. The GPU dynamic multitasking controllable concurrent execution method according to claim 1, characterized in that in said step S3:
allocating a number of computing units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel and starting the proxy kernel;
the number of thread blocks of the started proxy kernel is the product of the number of GPU computing units and the CU Occupancy of the current proxy kernel;
the number of threads in each thread block of the started proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the started proxy kernel is the maximum shared memory size used by any kernel to be executed.
4. A GPU dynamic multitasking controllable concurrent execution system, comprising:
module M1: generating one or more proxy kernels as entries for the kernels to be executed in the program compiling stage;
module M2: in the program running stage, the user dynamically selecting the kernels to be executed concurrently;
module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the kernels to be executed;
module M4: the user dynamically controlling, through the proxy kernel, the number of computing units used by each kernel to be executed, jumping to each kernel to be executed and executing it;
In the module M1:
generating source code for one or more proxy kernels, each of which has the same source code but a different number of registers; the proxy kernel is the entry for all kernels to be executed, and all concurrently executed kernels jump from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binary files containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of the proxy kernel comprise the entry address of the kernel function to be executed, the address of the parameters of the kernel to be executed, and the number of computing units used by the kernel to be executed;
generating a proxy kernel corresponding to the maximum number of registers for each CU Occupancy level;
in the module M4:
executing the proxy kernel in the GPU, selecting the corresponding kernel to be executed according to the ID of the current computing unit, setting its parameters and jumping to the kernel to be executed for execution;
the proxy kernel sets the function parameters, the thread block ID and the thread ID of the kernel to be executed;
jumping to the function entry address of the kernel to be executed using the JMP instruction.
5. The GPU dynamic multitasking controllable concurrent execution system of claim 4, wherein in said module M2:
selecting the kernels to be executed concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should meet the following two criteria: first, the number of registers it uses is larger than the number of registers required by the selected kernels to be executed; second, it uses the fewest registers among all proxy kernels meeting the first criterion.
6. The GPU dynamic multitasking controllable concurrent execution system of claim 4, wherein in said module M3:
allocating a number of computing units to each kernel to be executed according to the user's requirements, setting the launch parameters of the proxy kernel and starting the proxy kernel;
the number of thread blocks of the started proxy kernel is the product of the number of GPU computing units and the CU Occupancy of the current proxy kernel;
the number of threads in each thread block of the started proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the started proxy kernel is the maximum shared memory size used by any kernel to be executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210780174.9A CN115114003B (en) | 2022-07-04 | 2022-07-04 | GPU dynamic multitasking controllable concurrent execution method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115114003A CN115114003A (en) | 2022-09-27 |
CN115114003B true CN115114003B (en) | 2024-05-28 |
Family
ID=83330376
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210780174.9A Active CN115114003B (en) | 2022-07-04 | 2022-07-04 | GPU dynamic multitasking controllable concurrent execution method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115114003B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101963918A (en) * | 2010-10-26 | 2011-02-02 | 上海交通大学 | Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform |
CN106502771A (en) * | 2016-09-09 | 2017-03-15 | 中国农业大学 | Time overhead model building method and system based on kernel functions |
CN110196753A (en) * | 2019-01-21 | 2019-09-03 | 腾讯科技(北京)有限公司 | Graphics processor GPU vitualization method, apparatus and readable medium based on container |
CN111722915A (en) * | 2020-06-22 | 2020-09-29 | 上海商汤智能科技有限公司 | Task processing method, device and system |
CN112417470A (en) * | 2020-11-06 | 2021-02-26 | 上海壁仞智能科技有限公司 | Method and device for realizing GPU data security access, electronic equipment and storage medium |
CN113490924A (en) * | 2019-02-22 | 2021-10-08 | 英特尔公司 | Dynamic switching between EPT and shadow page tables for runtime processor validation |
WO2021211809A1 (en) * | 2020-04-16 | 2021-10-21 | Texas Instruments Incorporated | Scalable hardware thread scheduler |
CN114356547A (en) * | 2021-12-07 | 2022-04-15 | 北京百度网讯科技有限公司 | Low-priority blocking method and device based on processor virtualization environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170371662A1 (en) * | 2016-06-23 | 2017-12-28 | Intel Corporation | Extension of register files for local processing of data in computing environments |
Non-Patent Citations (3)
Title |
---|
"Efficient Execution of OpenMP on GPUs";Joseph Huber;《2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)》;20220329;第41-52页 * |
"Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences";Mingcong Han;《Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation》;20220713;第539-558页 * |
"一种基于冗余线程的GPU多副本容错技术";贾佳;《计算机研究与发展》;20130715;第50卷(第07期);第1551-1562页 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100962531B1 (en) | Apparatus for processing multi-threading framework supporting dynamic load-balancing and multi-thread processing method using by it | |
CN1043932C (en) | Multi-tasking low-power controller | |
CN112463709A (en) | Configurable heterogeneous artificial intelligence processor | |
CN112465129A (en) | On-chip heterogeneous artificial intelligence processor | |
US20090160867A1 (en) | Autonomous Context Scheduler For Graphics Processing Units | |
US9152462B2 (en) | Parallel processing device, parallel processing method, optimization device, optimization method and computer program | |
US9582320B2 (en) | Computer systems and methods with resource transfer hint instruction | |
CN105893126A (en) | Task scheduling method and device | |
US20050125793A1 (en) | Operating system kernel-assisted, self-balanced, access-protected library framework in a run-to-completion multi-processor environment | |
CN110308982B (en) | Shared memory multiplexing method and device | |
KR20040069257A (en) | Method of scheduling in a reconfigurable hardware architecture with multiple hardware configurations | |
CN112711478A (en) | Task processing method, device, server and storage medium based on neural network | |
US11947999B2 (en) | Multi-phased and multi-threaded program execution based on SIMD ratio | |
WO2020227582A2 (en) | Method and apparatus for scheduling matrix operations in digital processing systems | |
Li et al. | Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time | |
CN111597044A (en) | Task scheduling method and device, storage medium and electronic equipment | |
CN115114003B (en) | GPU dynamic multitasking controllable concurrent execution method and system | |
US9760969B2 (en) | Graphic processing system and method thereof | |
CN116107753A (en) | Task node distribution method and device, electronic equipment and storage medium | |
CN110969565A (en) | Image processing method and device | |
CN114048026A (en) | GPU resource dynamic allocation method under multitask concurrency condition | |
JP2006099579A (en) | Information processor and information processing method | |
WO2023283767A1 (en) | Task scheduling method and apparatus | |
Chow et al. | Energy efficient task graph execution using compute unit masking in GPUs | |
Gannouni | A Gamma-calculus GPU-based parallel programming framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||