CN115114003B - GPU dynamic multitasking controllable concurrent execution method and system

Info

Publication number: CN115114003B
Application number: CN202210780174.9A
Authority: CN (China)
Prior art keywords: kernel, executed, proxy, GPU, registers
Legal status: Active (granted)
Original language: Chinese (zh)
Other versions: CN115114003A (application publication)
Inventors: Rong Chen (陈榕), Mingcong Han (韩明聪), Haibo Chen (陈海波), Binyu Zang (臧斌宇)
Current assignee: Shanghai Jiaotong University
Original assignee: Shanghai Jiaotong University
Application filed by Shanghai Jiaotong University
Priority date / filing date: 2022-07-04 (CN202210780174.9A)
Publication of CN115114003A (application): 2022-09-27
Publication of CN115114003B (grant): 2024-05-28

Classifications

    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F8/41: Compilation (transformation of program code)
    • G06F9/30098: Register arrangements
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/544: Interprogram communication; buffers; shared memory; pipes
    • G06T1/20: Processor architectures; processor configuration, e.g. pipelining
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a GPU dynamic multitasking controllable concurrent execution method and system, comprising the following steps: step S1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed; step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently; step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed; step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed, and the proxy kernel jumps to each kernel to be executed and executes it. Because the proxy kernel dynamically allocates a specified number of compute units to each kernel to be executed, the invention achieves fine-grained compute-unit allocation during the running stage of a GPU program.

Description

GPU dynamic multitasking controllable concurrent execution method and system
Technical Field
The invention relates to the field of GPU task scheduling, and in particular to a GPU dynamic multitasking controllable concurrent execution method and system.
Background
Compared with a CPU, a GPU has stronger parallel processing capability and is commonly used for tasks such as graphics rendering, high-performance computing, simulation, and artificial-intelligence model training and inference. As the number of compute units (Compute Units, CUs) in commercial GPUs keeps increasing, it is difficult for a single computing task to fully utilize all of the compute units in a GPU, so letting multiple tasks share the GPU simultaneously is the most common practice for increasing the utilization of the GPU's compute units.
Current GPU programming frameworks (e.g., CUDA, HIP) provide the GPU stream abstraction for multitasking concurrency: multiple tasks can execute concurrently on different GPU streams, making full use of the compute units in the GPU. However, when GPU streams are used to execute multiple tasks concurrently, the user cannot control how GPU compute units are allocated, so different tasks compete for them. While this competition can improve resource utilization and system throughput, it can also significantly increase the execution latency of each task, which seriously harms the real-time behaviour of latency-sensitive tasks. Take neural-network inference in intelligent driving as an example: an obstacle-detection task needs the GPU for low-latency inference, while a driver-state-monitoring task also needs the GPU but has loose latency requirements. When the two tasks run in parallel on two GPU streams, their competition for GPU compute units means the real-time requirement of the strongly real-time task (obstacle detection) cannot be met.
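For illustration, a minimal CUDA sketch of the stream-based sharing described above (the kernel names and bodies, such as detect_obstacles, are hypothetical stand-ins, not code from the patent):

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two inference tasks in the example above.
__global__ void detect_obstacles(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}
__global__ void monitor_driver(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main() {
    float *a, *b;
    cudaMalloc(&a, 1024 * sizeof(float));
    cudaMalloc(&b, 1024 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both kernels may run concurrently on different streams, but the
    // hardware scheduler alone decides how compute units are divided
    // between them -- the application has no control over the split.
    detect_obstacles<<<8, 128, 0, s1>>>(a);   // latency-critical task
    monitor_driver<<<8, 128, 0, s2>>>(b);     // latency-tolerant task

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```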
In order to control how concurrent tasks are allocated to GPU compute units, academia has proposed kernel fusion (Kernel Fusion) to implement concurrent GPU multitasking: the source code of two GPU kernels is merged so that the two kernels share all the GPU compute units, and the allocation of compute units can be controlled inside the merged code. However, kernel fusion can only choose which kernels to merge at compile time, and therefore cannot be applied to dynamic scheduling scenarios in which the task combination is uncertain or only becomes known while the program is running.
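As a sketch of that compile-time fusion approach (kernel bodies hypothetical): the two kernel bodies are merged into one source file and the compute-unit split is encoded as a partition of the thread-block index space, which is exactly why the pair of kernels must be fixed when the program is compiled:

```cuda
// Compile-time kernel fusion (illustrative sketch): the bodies of two
// kernels are merged into one, and thread blocks are partitioned between
// them. The pair of fused kernels is fixed at compile time; only the
// split point can still be varied.
__global__ void fused_kernel(float* a, float* b, int split) {
    if (blockIdx.x < split) {
        // body of the first kernel, re-indexed to blocks [0, split)
        a[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    } else {
        // body of the second kernel, re-indexed to blocks [split, gridDim.x)
        b[(blockIdx.x - split) * blockDim.x + threadIdx.x] += 1.0f;
    }
}
```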
In summary, how to design a concurrent-execution method that lets users control the allocation of GPU compute units to GPU tasks during the running stage of a program is a major problem to be solved by researchers in the field.
Patent document CN114048026A (application number CN202111258248.4) discloses a dynamic allocation method for GPU resources under multitasking, aiming at the problems of large amounts of idle resources, reduced system throughput, and unreasonable resource allocation caused by static resource allocation when NVIDIA GPU tasks run concurrently. However, that method does not dynamically allocate a specified number of compute units to the kernels to be executed through a proxy kernel, and therefore does not achieve fine-grained compute-unit allocation during the running stage of a GPU program.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a GPU dynamic multitasking controllable concurrent execution method and system.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
Step S1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed;
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed; the proxy kernel then jumps to each kernel to be executed and executes it.
Preferably, in said step S1:
generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel is jumped to from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed;
a proxy kernel with the corresponding maximum register count is generated for each CU occupancy level.
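To make the generation step concrete, a CUDA-flavoured sketch of what the generated proxy-kernel variants might look like (all names are hypothetical; the patent does not prescribe this source form). Each variant shares one body, and a kernel attribute such as __launch_bounds__ caps the register budget so that each variant corresponds to one CU occupancy level with the maximum register count for that level:

```cuda
// Parameters of a proxy kernel, as listed above (names hypothetical).
struct ProxyTask {
    const void* entry_addr;  // entry address of the kernel to be executed
    void*       args;        // address of that kernel's parameter block
    int         cu_begin;    // compute units [cu_begin, cu_end) assigned to it
    int         cu_end;
};

// Shared body; the per-CU dispatch logic is sketched in the FIG. 2
// discussion below.
__device__ void proxy_body(ProxyTask* tasks, int num_tasks) { /* ... */ }

// One variant per CU occupancy level. The second __launch_bounds__
// argument (minimum resident blocks per compute unit) drives the compiler
// to a register cap, so each variant carries the maximum register count
// that still permits its occupancy.
#define DEFINE_PROXY(NAME, BLOCKS_PER_CU)                  \
    __global__ void __launch_bounds__(1024, BLOCKS_PER_CU) \
    NAME(ProxyTask* tasks, int num_tasks) {                \
        proxy_body(tasks, num_tasks);                      \
    }

DEFINE_PROXY(proxy_occ1, 1)  // most registers per thread, 1 block per CU
DEFINE_PROXY(proxy_occ2, 2)
DEFINE_PROXY(proxy_occ4, 4)  // fewest registers per thread, 4 blocks per CU
```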
Preferably, in said step S2:
selecting the kernels to execute concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
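The two criteria translate directly into a small host-side search. A sketch, assuming the register count of each compiled variant has been queried beforehand (e.g., with cudaFuncGetAttributes) and that ProxyVariant is a hypothetical bookkeeping struct:

```cpp
#include <vector>

struct ProxyVariant {
    int num_regs;   // registers per thread used by this proxy kernel
    int occupancy;  // resident thread blocks per compute unit at that budget
};

// Return the index of the proxy kernel whose register count exceeds the
// register demand of the selected kernels (criterion 1) and is minimal
// among those candidates (criterion 2); -1 if no variant is large enough.
int select_proxy(const std::vector<ProxyVariant>& variants, int regs_needed) {
    int best = -1;
    for (int i = 0; i < (int)variants.size(); ++i) {
        if (variants[i].num_regs <= regs_needed) continue;       // criterion 1
        if (best < 0 || variants[i].num_regs < variants[best].num_regs)
            best = i;                                            // criterion 2
    }
    return best;
}
```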
Preferably, in said step S3:
allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the proxy kernel's launch parameters, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
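The three launch-parameter rules can be computed on the host as follows; a sketch with hypothetical names, where the compute-unit count comes from the standard device-attribute query:

```cpp
#include <cuda_runtime.h>
#include <vector>

struct PendingKernel {
    int    max_block_threads;  // largest thread-block size this kernel uses
    size_t shared_bytes;       // dynamic shared memory it needs
};

// Grid size, block size and dynamic shared memory for one proxy launch,
// following the three rules above.
void proxy_launch_params(const std::vector<PendingKernel>& kernels,
                         int proxy_occupancy,  // blocks per CU of the variant
                         int* grid_blocks, int* block_threads,
                         size_t* shared_mem) {
    int num_cus = 0;
    cudaDeviceGetAttribute(&num_cus, cudaDevAttrMultiProcessorCount, 0);

    // Rule 1: thread blocks = #compute units x CU occupancy of the proxy.
    *grid_blocks = num_cus * proxy_occupancy;

    // Rules 2 and 3: take the maximum over all kernels to be executed.
    *block_threads = 0;
    *shared_mem = 0;
    for (const PendingKernel& k : kernels) {
        if (k.max_block_threads > *block_threads) *block_threads = k.max_block_threads;
        if (k.shared_bytes > *shared_mem) *shared_mem = k.shared_bytes;
    }
}
```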
Preferably, in said step S4:
executing the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution;
the proxy kernel sets the function parameters, thread-block ID, and thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed uses the JMP instruction.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises:
Module M1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed;
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed; the proxy kernel then jumps to each kernel to be executed and executes it.
Preferably, in said module M1:
generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel is jumped to from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed;
a proxy kernel with the corresponding maximum register count is generated for each CU occupancy level.
Preferably, in said module M2:
selecting the kernels to execute concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
Preferably, in said module M3:
allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the proxy kernel's launch parameters, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
Preferably, in said module M4:
executing the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution;
the proxy kernel sets the function parameters, thread-block ID, and thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed uses the JMP instruction.
Compared with the prior art, the invention has the following beneficial effects:
1. The proxy kernel dynamically allocates a specified number of compute units to each kernel to be executed, achieving fine-grained compute-unit allocation during the running stage of a GPU program;
2. Multiple proxy kernels satisfy the register demands of different kernels to be executed, maximizing the CU occupancy of the kernels to be executed and reducing the performance cost;
3. The proxy kernel jumps directly to the kernel to be executed with a JMP instruction, avoiding the context saving caused by using a function pointer and reducing the performance cost of calling a function through a function pointer.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the implementation of the present invention;
FIG. 2 is a schematic diagram of the execution flow of the proxy kernel.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any way. It should be noted that those skilled in the art may make variations and modifications without departing from the inventive concept; all such variations fall within the scope of the present invention.
Example 1:
The invention provides a GPU dynamic multitasking controllable concurrent execution method and system. In the program compiling stage, the method generates several proxy kernels as the entry of the kernels to be executed. In the program running stage, the user can dynamically select the kernels to execute concurrently and, according to the number of registers those kernels require, choose a suitable proxy kernel to submit to the GPU. Through the proxy kernel, the user can dynamically control the number of compute units used by each kernel to be executed; finally, the proxy kernel jumps to each kernel to be executed and executes it.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention, as shown in FIG. 1 and FIG. 2, comprises the following steps:
Step S1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed;
specifically, in said step S1:
generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel is jumped to from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed;
a proxy kernel with the corresponding maximum register count is generated for each CU occupancy level.
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
specifically, in said step S2:
selecting the kernels to execute concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
specifically, in said step S3:
allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the proxy kernel's launch parameters, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed; the proxy kernel then jumps to each kernel to be executed and executes it.
Specifically, in said step S4:
executing the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution;
the proxy kernel sets the function parameters, thread-block ID, and thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed uses the JMP instruction.
Example 2:
Example 2 is a preferred example of Example 1 and explains the present invention more specifically.
Those skilled in the art can understand the GPU dynamic multitasking controllable concurrent execution method provided by the invention as a specific embodiment of the GPU dynamic multitasking controllable concurrent execution system; that is, the system can be implemented by executing the step flow of the method.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises:
Module M1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed;
specifically, in said module M1:
generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel is jumped to from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed;
a proxy kernel with the corresponding maximum register count is generated for each CU occupancy level.
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
specifically, in said module M2:
selecting the kernels to execute concurrently according to the user's requirements, and selecting a proxy kernel according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel should satisfy two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the selected kernels to be executed;
specifically, in said module M3:
allocating a number of compute units to each kernel to be executed according to the user's requirements, setting the proxy kernel's launch parameters, and launching the proxy kernel;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed; the proxy kernel then jumps to each kernel to be executed and executes it.
Specifically, in said module M4:
executing the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution;
the proxy kernel sets the function parameters, thread-block ID, and thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed uses the JMP instruction.
Example 3:
Example 3 is a preferred example of Example 1 and explains the present invention more specifically.
The GPU dynamic multitasking controllable concurrent execution method provided by the invention comprises the following steps:
(1) Generate the proxy kernel source code: generate the source code of several proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel jumps from the proxy kernel.
(2) Compile the proxy kernels and the kernels to be executed: compile the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Load the proxy kernels and the kernels to be executed: load the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Select the kernels to be executed: select the kernels to execute concurrently according to the user's requirements, and select a suitable proxy kernel according to the number of registers required by the selected kernels;
(5) Launch the proxy kernel: allocate a number of compute units to each kernel to be executed according to the user's requirements, set the proxy kernel's launch parameters, and launch the proxy kernel;
(6) Execute the proxy kernel: execute the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution.
Specifically, in step (1), the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed.
Specifically, in step (1), a proxy kernel with the corresponding maximum register count should be generated for each CU occupancy level.
Specifically, in step (4), the selected proxy kernel should satisfy two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
Specifically, in step (5), the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel.
Specifically, in step (5), the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed.
Specifically, in step (5), the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
Specifically, in step (6), the proxy kernel needs to set the function parameters, thread-block ID, and thread ID of the kernel to be executed.
Specifically, in step (6), the JMP instruction is used to jump directly to the function entry address of the kernel to be executed.
The GPU dynamic multitasking controllable concurrent execution system provided by the invention comprises the following modules:
(1) Generate the proxy kernel source code: generate the source code of several proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel jumps from the proxy kernel.
(2) Compile the proxy kernels and the kernels to be executed: compile the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Load the proxy kernels and the kernels to be executed: load the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Select the kernels to be executed: select the kernels to execute concurrently according to the user's requirements, and select a suitable proxy kernel according to the number of registers required by the selected kernels;
(5) Launch the proxy kernel: allocate a number of compute units to each kernel to be executed according to the user's requirements, set the proxy kernel's launch parameters, and launch the proxy kernel;
(6) Execute the proxy kernel: execute the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution.
In module (1), the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed.
In module (1), a proxy kernel with the corresponding maximum register count should be generated for each CU occupancy level.
In module (4), the selected proxy kernel should be the one with the smallest register count among all proxy kernels whose register count is larger than the number of registers required by the selected kernels to be executed.
In module (5), the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel.
In module (5), the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed.
In module (5), the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
In module (6), the proxy kernel needs to set the function parameters, thread-block ID, and thread ID of the kernel to be executed.
In module (6), the JMP instruction is used to jump directly to the function entry address of the kernel to be executed.
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the present invention, comprising the following steps:
(1) Generate the proxy kernel source code: generate the source code of several proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel jumps from the proxy kernel.
(2) Compile the proxy kernels and the kernels to be executed: compile the source code of the proxy kernels and of the kernels to be executed into binary files;
(3) Load the proxy kernels and the kernels to be executed: load the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
(4) Select the kernels to be executed: select the kernels to execute concurrently according to the user's requirements, and select a suitable proxy kernel according to the number of registers required by the selected kernels;
(5) Launch the proxy kernel: allocate a number of compute units to each kernel to be executed according to the user's requirements, set the proxy kernel's launch parameters, and launch the proxy kernel;
(6) Execute the proxy kernel: execute the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution.
Specifically, in step (4), the register count of a proxy kernel may be set through a kernel attribute.
Specifically, in step (4), the user may select the kernels to execute concurrently according to the application's requirements, for example choosing kernels with similar execution times, or pairing a compute-intensive kernel with a memory-intensive one.
Specifically, in step (5), the user may allocate the number of compute units to each kernel to be executed according to the application's requirements, for example first allocating enough compute units to high-priority kernels and then allocating the remaining compute units to low-priority kernels, as in the sketch below.
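A host-side sketch of that allocation policy (all names hypothetical; the policy itself is application-defined): compute units are handed out as contiguous ranges, highest priority first, with the remainder going to the lowest-priority kernel:

```cpp
#include <algorithm>
#include <vector>

struct TaskAlloc {
    int priority;    // higher value = more urgent
    int cus_wanted;  // compute units requested for this kernel
    int cu_begin = 0, cu_end = 0;  // assigned range [cu_begin, cu_end)
};

void assign_compute_units(std::vector<TaskAlloc>& tasks, int num_cus) {
    // High-priority kernels are served first.
    std::sort(tasks.begin(), tasks.end(),
              [](const TaskAlloc& a, const TaskAlloc& b) {
                  return a.priority > b.priority;
              });
    int next = 0;
    for (TaskAlloc& t : tasks) {
        t.cu_begin = next;
        t.cu_end = std::min(num_cus, next + t.cus_wanted);
        next = t.cu_end;
    }
    // Any remaining compute units go to the lowest-priority kernel.
    if (!tasks.empty()) tasks.back().cu_end = num_cus;
}
```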
FIG. 2 is a schematic diagram of the execution flow of the proxy kernel in step (6), which comprises the following steps:
(6.1) acquire the ID of the compute unit on which the current thread block is located;
(6.2) acquire the next kernel to be executed;
(6.3) if the current compute-unit ID is smaller than the maximum compute-unit ID allocated to the kernel acquired in the previous step, execute (6.4); otherwise, return to (6.2);
(6.4) set the thread-block ID, thread ID, and function parameters for the selected kernel to be executed;
(6.5) jump to the entry address of the selected kernel to be executed using the JMP instruction;
(6.6) execute the selected kernel to be executed.
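Rendering the FIG. 2 flow as CUDA-flavoured pseudocode (a sketch, not the patented binary-level mechanism; ProxyTask here is a variant of the earlier sketch with the entry address typed as a device function pointer): %smid is the PTX special register holding the ID of the compute unit a thread block is resident on, and the final control transfer is approximated with an indirect call, whereas the patent instead emits a direct JMP to the target's entry address at the binary level precisely to avoid the context saving that an indirect call entails:

```cuda
struct ProxyTask {
    void (*entry)(void*);  // stand-in for the target kernel's entry address
    void*  args;           // target kernel's parameter block
    int    cu_end;         // one past the highest compute-unit ID assigned
    // remapping data for thread-block and thread IDs would also live here
};

// (6.1) read the ID of the compute unit this thread block landed on
__device__ int current_cu_id() {
    unsigned smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return (int)smid;
}

__global__ void proxy_kernel(ProxyTask* tasks, int num_tasks) {
    int cu = current_cu_id();
    // (6.2)-(6.3) walk the task list until the kernel whose compute-unit
    // range covers this compute unit is found
    for (int i = 0; i < num_tasks; ++i) {
        if (cu < tasks[i].cu_end) {
            // (6.4) here the thread-block ID, thread ID and function
            // parameters would be remapped so the target kernel sees the
            // coordinates it expects
            // (6.5)-(6.6) transfer control; the real system jumps (JMP)
            // to tasks[i].entry instead of making this indirect call
            tasks[i].entry(tasks[i].args);
            return;
        }
    }
}
```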
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.

Claims (6)

1. A GPU dynamic multitasking controllable concurrent execution method, characterized by comprising:
Step S1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed;
Step S2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Step S3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the kernels to be executed;
Step S4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed; the proxy kernel then jumps to each kernel to be executed and executes it;
in said step S1:
generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel is jumped to from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum register count is generated for each CU occupancy level;
in said step S4:
executing the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution;
the proxy kernel sets the function parameters, thread-block ID, and thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed uses the JMP instruction.
2. The GPU dynamic multitasking controllable concurrent execution method according to claim 1, characterized in that in said step S2:
the kernels to be executed concurrently are selected according to the user's requirements, and a proxy kernel is selected according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel satisfies two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
3. The GPU dynamic multitasking controllable concurrent execution method according to claim 1, characterized in that in said step S3:
a number of compute units is allocated to each kernel to be executed according to the user's requirements, the proxy kernel's launch parameters are set, and the proxy kernel is launched;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
4. A GPU dynamic multitasking controllable concurrent execution system, characterized by comprising:
Module M1: in the program compiling stage, generating one or more proxy kernels to serve as the entry of the kernels to be executed;
Module M2: in the program running stage, the user dynamically selects the kernels to be executed concurrently;
Module M3: selecting a proxy kernel to submit to the GPU according to the number of registers required by the kernels to be executed;
Module M4: through the proxy kernel, the user dynamically controls the number of compute units used by each kernel to be executed; the proxy kernel then jumps to each kernel to be executed and executes it;
in said module M1:
generating the source code of one or more proxy kernels, where every proxy kernel has the same source code but a different register count; the proxy kernel is the entry for all kernels to be executed, and every concurrently executed kernel is jumped to from the proxy kernel; compiling the source code of the proxy kernels and of the kernels to be executed into binary files; loading the compiled binaries containing the proxy kernels and the kernels to be executed into GPU memory;
the parameters of a proxy kernel comprise the entry address of the kernel function to be executed, the address of that kernel's parameters, and the number of compute units used by the kernel to be executed;
a proxy kernel corresponding to the maximum register count is generated for each CU occupancy level;
in said module M4:
executing the proxy kernel on the GPU; the proxy kernel selects the corresponding kernel to be executed according to the ID of the compute unit it is running on, sets its parameters, and jumps to it for execution;
the proxy kernel sets the function parameters, thread-block ID, and thread ID of the kernel to be executed;
the jump to the function entry address of the kernel to be executed uses the JMP instruction.
5. The GPU dynamic multitasking controllable concurrent execution system according to claim 4, characterized in that in said module M2:
the kernels to be executed concurrently are selected according to the user's requirements, and a proxy kernel is selected according to the number of registers required by the selected kernels to be executed;
the selected proxy kernel satisfies two criteria: first, its register count is larger than the number of registers required by the selected kernels to be executed; second, its register count is the smallest among all proxy kernels satisfying the first criterion.
6. The GPU dynamic multitasking controllable concurrent execution system according to claim 4, characterized in that in said module M3:
a number of compute units is allocated to each kernel to be executed according to the user's requirements, the proxy kernel's launch parameters are set, and the proxy kernel is launched;
the number of thread blocks of the launched proxy kernel is the product of the number of GPU compute units and the CU occupancy of the current proxy kernel;
the number of threads in each thread block of the launched proxy kernel is the maximum thread-block size over all kernels to be executed;
the dynamic shared memory size of the launched proxy kernel is the maximum shared memory size used by any kernel to be executed.
Priority Applications (1)

Application Number: CN202210780174.9A
Priority Date / Filing Date: 2022-07-04
Title: GPU dynamic multitasking controllable concurrent execution method and system
Status: Active (granted as CN115114003B)

Publications (2)

Publication Number    Publication Date
CN115114003A          2022-09-27
CN115114003B          2024-05-28

Family ID: 83330376

Country Status (1)

Country: CN

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101963918A (en) * 2010-10-26 2011-02-02 上海交通大学 Method for realizing virtual execution environment of central processing unit (CPU)/graphics processing unit (GPU) heterogeneous platform
CN106502771A (en) * 2016-09-09 2017-03-15 中国农业大学 Time overhead model building method and system based on kernel functions
CN110196753A (en) * 2019-01-21 2019-09-03 腾讯科技(北京)有限公司 Graphics processor GPU virtualization method, apparatus and readable medium based on container
CN111722915A (en) * 2020-06-22 2020-09-29 上海商汤智能科技有限公司 Task processing method, device and system
CN112417470A (en) * 2020-11-06 2021-02-26 上海壁仞智能科技有限公司 Method and device for realizing GPU data security access, electronic equipment and storage medium
CN113490924A (en) * 2019-02-22 2021-10-08 英特尔公司 Dynamic switching between EPT and shadow page tables for runtime processor validation
WO2021211809A1 (en) * 2020-04-16 2021-10-21 Texas Instruments Incorporated Scalable hardware thread scheduler
CN114356547A (en) * 2021-12-07 2022-04-15 北京百度网讯科技有限公司 Low-priority blocking method and device based on processor virtualization environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170371662A1 (en) * 2016-06-23 2017-12-28 Intel Corporation Extension of register files for local processing of data in computing environments

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Efficient Execution of OpenMP on GPUs";Joseph Huber;《2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)》;20220329;第41-52页 *
"Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences";Mingcong Han;《Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation》;20220713;第539-558页 *
"一种基于冗余线程的GPU多副本容错技术";贾佳;《计算机研究与发展》;20130715;第50卷(第07期);第1551-1562页 *

Also Published As

Publication number Publication date
CN115114003A (en) 2022-09-27

Similar Documents

Publication Publication Date Title
KR100962531B1 (en) Apparatus for processing multi-threading framework supporting dynamic load-balancing and multi-thread processing method using by it
CN1043932C (en) Multi-tasking low-power controller
CN112463709A (en) Configurable heterogeneous artificial intelligence processor
CN112465129A (en) On-chip heterogeneous artificial intelligence processor
US20090160867A1 (en) Autonomous Context Scheduler For Graphics Processing Units
US9152462B2 (en) Parallel processing device, parallel processing method, optimization device, optimization method and computer program
US9582320B2 (en) Computer systems and methods with resource transfer hint instruction
CN105893126A (en) Task scheduling method and device
US20050125793A1 (en) Operating system kernel-assisted, self-balanced, access-protected library framework in a run-to-completion multi-processor environment
CN110308982B (en) Shared memory multiplexing method and device
KR20040069257A (en) Method of scheduling in a reconfigurable hardware architecture with multiple hardware configurations
CN112711478A (en) Task processing method, device, server and storage medium based on neural network
US11947999B2 (en) Multi-phased and multi-threaded program execution based on SIMD ratio
WO2020227582A2 (en) Method and apparatus for scheduling matrix operations in digital processing systems
Li et al. Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time
CN111597044A (en) Task scheduling method and device, storage medium and electronic equipment
CN115114003B (en) GPU dynamic multitasking controllable concurrent execution method and system
US9760969B2 (en) Graphic processing system and method thereof
CN116107753A (en) Task node distribution method and device, electronic equipment and storage medium
CN110969565A (en) Image processing method and device
CN114048026A (en) GPU resource dynamic allocation method under multitask concurrency condition
JP2006099579A (en) Information processor and information processing method
WO2023283767A1 (en) Task scheduling method and apparatus
Chow et al. Energy efficient task graph execution using compute unit masking in GPUs
Gannouni A Gamma-calculus GPU-based parallel programming framework

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant