CN111708639A - Task scheduling system and method, storage medium and electronic device - Google Patents

Task scheduling system and method, storage medium and electronic device

Info

Publication number
CN111708639A
Authority
CN
China
Prior art keywords
task
subtask
processed
subtasks
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010573441.6A
Other languages
Chinese (zh)
Inventor
安虹
李名凡
韩文廷
林晗
林增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010573441.6A
Publication of CN111708639A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources to service a request
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5038: Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F2209/00: Indexing scheme relating to G06F9/00
    • G06F2209/50: Indexing scheme relating to G06F9/50
    • G06F2209/5021: Priority

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a task scheduling system and method, a storage medium, and an electronic device. The system is applied to a graphics processing unit (GPU) and comprises a global task scheduling unit and a local task scheduling unit. The global task scheduling unit is configured to, upon detecting a target subtask in the task queue stored in global memory, send that subtask to the task buffer of the stream processor that currently has the least task amount; a target subtask is a subtask in the task queue with no forward dependency. The local task scheduling unit is configured to determine the target subtasks delivered to the task buffer as to-be-processed subtasks and, according to the processing priority of each to-be-processed subtask currently remaining in the task buffer, schedule them in turn to the execution core of that stream processor. Tasks can thus be allocated reasonably among the GPU's stream processors, improving the GPU's computing performance.

Description

Task scheduling system and method, storage medium and electronic device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a task scheduling system and method, a storage medium, and an electronic device.
Background
With the development of science and technology, graphics processing units (GPUs) have come into wide use in fields such as graphics processing, video processing, machine learning, and data mining, where the GPU's computing speed greatly affects working efficiency.
To improve the operating efficiency of the graphics processor, the usual approach has been to raise hardware performance metrics such as capacity, working frequency, and bit width. With the arrival of the post-Moore's-law era, however, the GPU's hardware metrics are approaching their limits, so attention has turned to increasing the number of GPU cores to improve the GPU's data-parallel computing capability: each core can independently complete a computing task, which can greatly increase the GPU's computing speed.
When processing dependent tasks, a multi-core GPU generally allocates the stages of a multi-stage task with dependency relationships to independent stream processors in the GPU. Tasks with dependency relationships have a specific execution order; for example, if task B is a forward dependency of task A, task B must finish before task A can execute.
However, each stream processor cannot perceive the task-processing status of the other stream processors in the GPU, so the stream processors cannot reasonably arrange the execution order of their pending tasks, which limits the GPU's computing performance.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a task scheduling system and method, a storage medium, and an electronic device that can reasonably allocate tasks to the stream processors of a GPU and improve the GPU's computing performance.
The invention further provides a task scheduling apparatus to ensure that the method can be implemented and applied in practice.
A task scheduling system for a graphics processor, GPU, the GPU comprising a global memory and a plurality of stream processors, the system comprising:
a global task scheduling unit and a local task scheduling unit;
the global task scheduling unit is configured to send a target subtask to a task buffer of a stream processor with a smallest task amount among the current plurality of stream processors when it is detected that the target subtask exists in a task queue stored in the global memory;
the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks; the target subtask is a subtask without forward dependency in the task queue;
the local task scheduling unit is configured to determine the target subtask that has been sent to the task buffer as a to-be-processed subtask, determine whether the number of currently remaining to-be-processed subtasks in the task buffer is multiple, and if yes, sequentially schedule each to-be-processed subtask to an execution kernel of the stream processor with the smallest task amount according to a processing priority of each to-be-processed subtask, so that the execution kernel executes the received to-be-processed subtask.
The task scheduling system optionally further includes: a detection unit;
the detection unit is used for acquiring a task dependency relationship table of each task group of the task queue; determining a forward dependency count for each subtask of each of the task groups based on each of the task dependency tables; and detecting whether a target subtask currently exists in the task queue according to the forward dependency count of each subtask.
The task scheduling system optionally further includes: an update unit;
and the updating unit is used for determining the task group to which the executed subtask to be processed belongs and updating the forward dependent task count of each currently remaining subtask in the task group.
The task scheduling system optionally further includes: a task decomposition unit;
the task decomposition unit is used for determining a task to be decomposed corresponding to a task decomposition instruction and a task granularity specified by the task decomposition instruction when the task decomposition instruction is received, and decomposing the task to be decomposed based on the task granularity to obtain a task group corresponding to the task to be decomposed, wherein the task group comprises sub-tasks; determining the operation type of each subtask, and constructing a dependency relationship table based on the operation type of each subtask, wherein the dependency relationship table records the dependency relationship among the subtasks; and sending the task group corresponding to the task to be decomposed and the dependency relationship table corresponding to the task group to the global memory, so that the global memory stores each subtask of the task group into the task queue.
Optionally, in the task scheduling system, the local task scheduling unit that sequentially schedules each to-be-processed sub-task to the execution core of the stream processor according to the processing priority of each to-be-processed sub-task currently stored in the task buffer includes:
a third determining subunit, configured to determine a task out-degree corresponding to each to-be-processed sub-task currently stored in the task buffer, where the task out-degree is a number of sub-tasks corresponding to a to-be-decomposed task to which the to-be-processed sub-task belongs;
the fourth determining subunit is configured to determine a processing priority of each to-be-processed sub-task based on the task out-degree of each to-be-processed sub-task;
and the scheduling subunit is used for sequentially scheduling each sub-task to be processed to the execution kernel of the stream processor according to the sequence that the processing priority of each sub-task to be processed is from high to low.
A task scheduling method applied to a graphics processing unit (GPU), wherein the GPU comprises a global memory and a plurality of stream processors, the method comprising the following steps:
when detecting that a target subtask exists in a task queue stored in the global memory, sending the target subtask to a task buffer area of a stream processor with the least task amount in the plurality of current stream processors;
the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks with dependency relationship; the target subtask is a subtask without forward dependency in the task queue;
determining the target subtasks sent to the task buffer as subtasks to be processed, and judging whether multiple to-be-processed subtasks currently remain in the task buffer;
if multiple to-be-processed subtasks currently remain in the task buffer, sequentially scheduling each to-be-processed subtask to the execution core of the stream processor with the least task amount according to the processing priority of each to-be-processed subtask, so that the execution core executes the received to-be-processed subtask.
Optionally, in the task scheduling method, the process of detecting the target subtask in the task queue stored in the global memory includes:
acquiring a task dependency relationship table of each task group of the task queue;
determining a forward dependent task count for each subtask of each of the task groups based on each of the task dependency tables;
and detecting whether a target subtask exists in the task queue at present according to the forward dependent task count of each subtask.
Optionally, in the task scheduling method, after the executing core executes the received to-be-processed subtask, the method further includes:
and determining a task group to which the executed to-be-processed subtasks belong, and updating the forward dependent task count of each currently remaining subtask in the task group.
A storage medium, comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium is located to perform a task scheduling method as described above.
An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the task scheduling method described above.
Compared with the prior art, the invention has the following advantages:
the invention provides a task scheduling system and a method, which are applied to a GPU (graphics processing Unit), wherein the GPU comprises a global memory and a plurality of stream processors, and the system comprises: a global task scheduling unit and a local task scheduling unit; the global task scheduling unit is configured to send a target subtask to a task buffer of a stream processor with a smallest task amount among the current plurality of stream processors when it is detected that the target subtask exists in a task queue stored in the global memory; the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks; the target subtask is a subtask without forward dependency in the task queue; the local task scheduling unit is configured to determine the target subtask that has been sent to the task buffer as a to-be-processed subtask, determine whether the number of currently remaining to-be-processed subtasks in the task buffer is multiple, and if yes, sequentially schedule each to-be-processed subtask to an execution kernel of the stream processor with the smallest task amount according to a processing priority of each to-be-processed subtask, so that the execution kernel executes the received to-be-processed subtask. By applying the task scheduling system provided by the invention, the subtasks without forward dependency relationship can be sent to each stream processor, so that the task execution process of each stream processor is not influenced by the dependency between the tasks, the processing efficiency of the stream processors can be improved, and the operation performance of the GPU is effectively improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a task scheduling system according to the present invention;
FIG. 2 is a schematic diagram of another structure of a task scheduling system according to the present invention;
FIG. 3 is a flowchart of a task scheduling method according to the present invention;
FIG. 4 is a diagram illustrating an exemplary backend architecture provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a task scheduling system that can be applied to a graphics processing unit (GPU) comprising a global memory and a plurality of stream processors. A structural schematic diagram of the task scheduling system is shown in FIG. 1; it specifically comprises:
a global task scheduling unit 101 and a local task scheduling unit 102;
the global task scheduling unit 101 is configured to, when detecting that a target subtask exists in the task queue stored in the global memory, send the target subtask to the task buffer of the stream processor with the least task amount among the current plurality of stream processors;
the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks; the target subtask is a subtask without forward dependency in the task queue;
the local task scheduling unit 102 is configured to determine the target subtask sent to the task buffer as a to-be-processed subtask, determine whether more than one to-be-processed subtask currently remains in the task buffer, and if so, sequentially schedule each to-be-processed subtask to the execution core of the stream processor to which the task buffer belongs according to the processing priority of each to-be-processed subtask, so that the execution core executes the received to-be-processed subtask.
In the task scheduling system provided by the embodiment of the present invention, each stream processor includes an execution core and a shared memory, and a task buffer is arranged in the shared memory to store the to-be-processed subtasks.
One way to determine a stream processor's task amount is by the number of pending subtasks in its task buffer; that is, the stream processor with the least task amount may be the one with the fewest pending subtasks.
Another way is by the time the stream processor's execution core is estimated to need to execute the pending subtasks in its task buffer; that is, the stream processor with the least task amount may be the one with the shortest estimated processing time.
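As an illustration of these two metrics, the following host-side C++ sketch picks the stream processor with the least task amount either by pending-subtask count or by estimated processing time. The types and names are illustrative assumptions, not structures prescribed by the patent text:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-stream-processor load record; field names are illustrative.
struct StreamProcessorLoad {
    std::size_t pendingSubtasks;   // subtasks waiting in this SM's task buffer
    double estimatedSeconds;       // estimated time to drain that buffer
};

// Metric (a): the stream processor with the fewest pending subtasks.
std::size_t leastLoadedByCount(const std::vector<StreamProcessorLoad>& sms) {
    return static_cast<std::size_t>(
        std::min_element(sms.begin(), sms.end(),
            [](const StreamProcessorLoad& a, const StreamProcessorLoad& b) {
                return a.pendingSubtasks < b.pendingSubtasks;
            }) - sms.begin());
}

// Metric (b): the stream processor with the shortest estimated processing time.
std::size_t leastLoadedByTime(const std::vector<StreamProcessorLoad>& sms) {
    return static_cast<std::size_t>(
        std::min_element(sms.begin(), sms.end(),
            [](const StreamProcessorLoad& a, const StreamProcessorLoad& b) {
                return a.estimatedSeconds < b.estimatedSeconds;
            }) - sms.begin());
}
```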
Optionally, if the stream processor currently receiving the target subtask does not hold multiple to-be-processed subtasks, the single to-be-processed subtask it holds may be scheduled directly to its execution core, so that the execution core executes it.
Specifically, the global task scheduling unit may check at preset time intervals whether a target subtask exists in the task queue stored in the global memory.
It should be noted that the GPU's storage system is designed hierarchically: the global memory can be accessed by all stream processors, while the shared memory inside each stream processor is exclusive to that stream processor; that is, a stream processor generally cannot access the shared memory of the others. Consequently, when tasks with dependency relationships are distributed across the stream processors, no stream processor can reasonably arrange the execution order of the tasks in its own shared memory; during execution, some stream processors must wait for others to finish the forward dependencies of their tasks, which limits the execution performance of the GPU. By applying the task scheduling system provided by the invention, the globally stored subtasks without forward dependency relationships can be detected and sent to the stream processors, so that no stream processor is stalled by inter-task dependencies while executing; this improves the processing efficiency of the stream processors and effectively improves the computing performance of the GPU.
In the task scheduling system provided in the embodiment of the present invention, based on the above scheme, specifically, the task scheduling system further includes: a detection unit;
the detection unit is used for acquiring a task dependency relationship table of each task group of the task queue; determining a forward dependency count for each subtask of each of the task groups based on each of the task dependency tables; and detecting whether a target subtask currently exists in the task queue according to the forward dependency count of each subtask.
Specifically, each task group has a corresponding task dependency relationship table, and whether a subtask has no forward dependency can be determined from the forward task count of each subtask in that table; that is, a subtask whose forward task count is 0 may be determined to be a subtask without a forward dependency.
If there is exactly one subtask without a forward dependency, it is determined to be the target subtask. If there are several, the one with the highest processing priority may be taken as the target subtask, any one of them may be selected, or the selection may follow their queuing order in the task queue, i.e. the earliest-queued subtask without a forward dependency is preferred as the target subtask.
If no subtask without a forward dependency exists, whether one exists in the task queue can be judged again later according to the forward dependent task count of each subtask.
It should be noted that, after a target subtask is detected, it may be determined whether any subtasks remain in the task queue; if so, whether a further target subtask exists can again be determined from the forward dependent task count of each subtask.
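A minimal sketch of this detection step, assuming the forward dependency counts have already been derived from the task dependency relationship tables (the types and names here are illustrative assumptions):

```cpp
#include <vector>

struct Subtask {
    int id;
    int forwardCount;  // number of unfinished forward (predecessor) tasks
};

// Returns the ids of all current target subtasks: those whose forward
// dependency count is zero. An empty result means the queue must be
// re-checked after some subtask finishes and the counts are updated.
std::vector<int> findTargetSubtasks(const std::vector<Subtask>& taskQueue) {
    std::vector<int> targets;
    for (const Subtask& t : taskQueue) {
        if (t.forwardCount == 0) {
            targets.push_back(t.id);
        }
    }
    return targets;
}
```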
In the task scheduling system provided in the embodiment of the present invention, the task dependency relationship table may include dependency information of each subtask of the task group to which the task dependency relationship table belongs, where the dependency information includes: a forward task count, a successor task count, and a successor task list.
For example, suppose task 0, task 1, task 2, task 3, and task 4 exist in the task group.
For task 0, the forward task count is 0, the successor task count is 4, and the successor task list is [task 1, task 2, task 3, task 4].
For task 2, the forward task count may be 2, the successor task count 2, and the successor task list [task 3, task 4].
For task 4, the forward task count may be 4, the successor task count 0, and the successor task list empty.
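For illustration, the dependency information of this example can be encoded as follows (an assumed in-memory layout; the patent text does not fix one):

```cpp
#include <map>
#include <vector>

// One entry of a task dependency relationship table.
struct DependencyInfo {
    int forwardTaskCount;          // unfinished forward (predecessor) tasks
    int successorTaskCount;        // number of successor tasks
    std::vector<int> successors;   // successor task list
};

// The example group: task 0 starts the group, task 4 ends it.
const std::map<int, DependencyInfo> dependencyTable = {
    {0, {0, 4, {1, 2, 3, 4}}},   // no predecessors, four successors
    {2, {2, 2, {3, 4}}},         // two predecessors, two successors
    {4, {4, 0, {}}},             // four predecessors, no successors
};
```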
In the task scheduling system provided in the embodiment of the present invention, based on the above scheme, specifically, the task scheduling system further includes: an update unit;
and the updating unit is used for determining the task group to which the executed subtask to be processed belongs and updating the forward dependent task count of each currently remaining subtask in the task group.
It should be noted that the execution core may decrement by one the forward dependency count of each currently remaining successor of the executed subtask in the task group.
The currently remaining subtasks in the task group can be determined from the successor task list of the executed to-be-processed subtask, which records the task identifiers of its successor tasks.
The system provided by the embodiment of the invention can thus update the forward dependency counts of the subtasks currently remaining in the task group to which the executed to-be-processed subtask belongs, so that the global task scheduling unit can detect new target subtasks from those updated counts.
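A sketch of this update step under the same assumed table layout (illustrative names; the decrement-and-collect logic mirrors the text above):

```cpp
#include <map>
#include <vector>

struct DependencyInfo {
    int forwardTaskCount;
    int successorTaskCount;
    std::vector<int> successors;  // successor task list of this subtask
};

// Called after a subtask finishes: decrement the forward dependent task
// count of each of its successors; successors that reach zero are new
// target subtasks for the global task scheduling unit to dispatch.
std::vector<int> onSubtaskCompleted(int finishedId,
                                    std::map<int, DependencyInfo>& table) {
    std::vector<int> newTargets;
    for (int s : table.at(finishedId).successors) {
        if (--table.at(s).forwardTaskCount == 0) {
            newTargets.push_back(s);
        }
    }
    return newTargets;
}
```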
Referring to FIG. 2, which shows another exemplary structure of the task scheduling system provided by the invention, the task scheduling system further includes a task decomposition unit 103; the task decomposition unit 103 is configured to:
when a task decomposition instruction is received, determining a task to be decomposed corresponding to the task decomposition instruction and task granularity specified by the task decomposition instruction;
decomposing the task to be decomposed based on the task granularity to obtain a task group corresponding to the task to be decomposed, wherein the task group comprises each subtask;
determining the operation type of each subtask, and constructing a dependency relationship table based on the operation type of each subtask, wherein the dependency relationship table records the dependency relationship among the subtasks;
and sending the task group corresponding to the task to be decomposed and the dependency relationship table corresponding to the task group to the global memory, so that the global task scheduling unit 101 stores each subtask of the task group into a task queue of the global memory.
The task to be decomposed is obtained by applying a preset task-decomposition function to the application program to be executed, and may be an application task in matrix form, for example a 4 × 4 matrix, a 4 × 5 matrix, a 5 × 5 matrix, or a matrix of any other order. Execution of the application task begins with one or more start tasks whose forward task count is 0 and ends with end tasks that have no successor tasks.
The structure of the task group obtained by the decomposition may include: a task index, the number of subtasks, the number of dependent tasks, and the task data.
Specifically, the task granularity of the subtasks may be set according to actual requirements; for example, it may be a 1 × 1 matrix. The task to be decomposed can then be decomposed at that granularity to obtain subtasks of the specified granularity.
It should be noted that a preset matrix decomposition algorithm may be called to decompose the to-be-decomposed task according to the task granularity.
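The granularity-driven split itself can be pictured with the simple tiling sketch below; it shows only the partitioning of a matrix task into granularity-sized subtasks, not the QR-specific operations described next (names are illustrative):

```cpp
#include <algorithm>
#include <vector>

// One tile of the original matrix, to become one subtask.
struct TileTask {
    int row, col;      // top-left element of the tile
    int rows, cols;    // tile extent (edge tiles may be smaller)
};

// Split an n x m matrix task into tiles of granularity g x g.
std::vector<TileTask> decompose(int n, int m, int g) {
    std::vector<TileTask> tiles;
    for (int i = 0; i < n; i += g) {
        for (int j = 0; j < m; j += g) {
            tiles.push_back({i, j, std::min(g, n - i), std::min(g, m - j)});
        }
    }
    return tiles;
}
```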
Optionally, the matrix decomposition algorithm may be carried out through four operations: DGEQT, DLARFB, DTSQT, and DSSRFB.
DGEQT performs QR decomposition on the sub-matrix blocks on the matrix diagonal, producing two intermediate sub-matrix results: an upper triangular matrix and a lower triangular matrix. Based on the upper triangular matrix, DLARFB updates all sub-matrix blocks of that column; based on the lower triangular matrix, DTSQT updates all sub-matrix blocks of that row; DSSRFB then updates all remaining sub-matrix blocks based on the blocks produced by DTSQT and DLARFB.
Specifically, DGEQT mainly performs QR decomposition on the sub-matrix block A_kk on the main diagonal of the matrix, recording the transformation matrix V_kk used in the operation while simultaneously generating an upper triangular matrix U_kk and a lower triangular matrix L_kk:
U_kk, L_kk ← A_kk (the QR transform process records the matrix V_kk)
DLARFB: u obtained by DGEQT decompositionkkAnd transformation matrix V of the total operation recordskkUpdating the sub-matrix blocks of other rows in the column and updating the sub-matrix blocks of the rows from k to i, wherein the updating method comprises the following steps:
Figure BDA0002550200810000101
where I is an identity matrix. Such a matrix plays a role in matrix multiplication analogous to 1 in the multiplication of numbers: the elements on the diagonal from the upper-left to the lower-right corner (the main diagonal) are all 1, and all other elements are 0.
DTSQT: l decomposed by DGEQTkkAnd transformation matrix V of the total operation recordskkUpdating the sub-matrix blocks of other columns of the row and updating the sub-matrix blocks of k to j columns, wherein the updating method comprises the following steps:
Figure BDA0002550200810000102
DSSRFB: matrix block A updated according to DTSQT and DLARFBikAnd AkjAnd transformation matrix V of the total operation recordskkFor the corresponding matrix block A in the original matrixijUpdating:
Figure BDA0002550200810000103
Thus, by cycling through the four operations DGEQT, DLARFB, DTSQT, and DSSRFB, the task to be decomposed is decomposed into a plurality of subtasks.
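The cyclic decomposition can be summarized as the loop nest below. The four kernel names come from the text above; their bodies are left as stubs, and the loop structure follows the description (DLARFB down column k, DTSQT along row k, DSSRFB over the trailing blocks). This is a sketch of the task-generation order, not the patent's literal implementation:

```cpp
// Stub kernels: in a real implementation each call would emit one subtask
// into the task group rather than compute anything here.
inline void dgeqt(int /*k*/) {}                          // QR-factor diagonal block A_kk
inline void dlarfb(int /*k*/, int /*i*/) {}              // update block A_ik in column k
inline void dtsqt(int /*k*/, int /*j*/) {}               // update block A_kj in row k
inline void dssrfb(int /*k*/, int /*i*/, int /*j*/) {}   // update trailing block A_ij

// One full pass of the cyclic decomposition over an nTiles x nTiles block matrix.
void generateSubtasks(int nTiles) {
    for (int k = 0; k < nTiles; ++k) {
        dgeqt(k);
        for (int i = k + 1; i < nTiles; ++i) dlarfb(k, i);
        for (int j = k + 1; j < nTiles; ++j) dtsqt(k, j);
        for (int i = k + 1; i < nTiles; ++i)
            for (int j = k + 1; j < nTiles; ++j) dssrfb(k, i, j);
    }
}
```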
In the task scheduling system provided in the embodiment of the present invention, based on the above scheme, the local task scheduling unit 102 that sequentially schedules each to-be-processed subtask to the execution core of the stream processor, according to the processing priority of each to-be-processed subtask currently stored in the task buffer, may include:
a third determining subunit, configured to determine the task out-degree corresponding to each to-be-processed subtask currently stored in the task buffer, where the task out-degree is the number of subtasks corresponding to the to-be-decomposed task to which the to-be-processed subtask belongs;
a fourth determining subunit, configured to determine the processing priority of each to-be-processed subtask based on its task out-degree;
and a scheduling subunit, configured to sequentially schedule each to-be-processed subtask to the execution core of the stream processor in descending order of the task out-degree corresponding to each to-be-processed subtask.
Specifically, the larger the task out-degree of a to-be-processed subtask, the higher its processing priority.
For example, suppose the pending subtasks are task A, task B, task C, and task D, with task out-degrees of 4, 6, 5, and 8 respectively; the processing priorities of the to-be-processed subtasks, from high to low, are then: task D, task B, task C, task A.
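This ordering is simply a descending sort on task out-degree; the short sketch below reproduces the example (illustrative types):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct PendingSubtask {
    std::string name;
    int outDegree;  // larger out-degree means higher processing priority
};

int main() {
    std::vector<PendingSubtask> buffer = {
        {"A", 4}, {"B", 6}, {"C", 5}, {"D", 8}};
    std::stable_sort(buffer.begin(), buffer.end(),
        [](const PendingSubtask& x, const PendingSubtask& y) {
            return x.outDegree > y.outDegree;
        });
    for (const PendingSubtask& t : buffer) {
        std::cout << t.name << ' ';  // prints: D B C A
    }
}
```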
It should be noted that, in the system provided by the present invention, the local task scheduling unit 102 that sequentially schedules each to-be-processed subtask to the execution core of the stream processor according to processing priority may alternatively be configured to:
determine the task decomposition sequence corresponding to each to-be-processed subtask currently stored in the task buffer, where the task decomposition sequence is the order in which the to-be-processed subtask was produced when the task decomposition algorithm was applied to the task to be decomposed to which it belongs;
determine the processing priority of each to-be-processed subtask based on its task decomposition sequence;
and schedule each to-be-processed subtask to the execution core of the stream processor in turn, in order of processing priority from high to low.
Specifically, the processing priority of each to-be-processed subtask may be determined by its task decomposition sequence, from earliest to latest; that is, the earlier a to-be-processed subtask was produced during decomposition, the higher its processing priority.
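In this variant, priority follows the order of production during decomposition, effectively first-decomposed, first-served. A minimal sketch (illustrative names):

```cpp
#include <algorithm>
#include <vector>

struct PendingSubtask {
    int id;
    int decompositionSeq;  // position in which it was produced; smaller = earlier
};

// Order the buffer so that subtasks produced earlier are scheduled first.
void orderByDecomposition(std::vector<PendingSubtask>& buffer) {
    std::stable_sort(buffer.begin(), buffer.end(),
        [](const PendingSubtask& a, const PendingSubtask& b) {
            return a.decompositionSeq < b.decompositionSeq;
        });
}
```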
Based on the task scheduling system provided by the embodiment of the invention, an embodiment of the invention further provides a task scheduling method corresponding to that system. The method is applied to a graphics processing unit (GPU) comprising a global memory and a plurality of stream processors; its flow chart is shown in FIG. 3, and it specifically comprises the following steps:
s301: and when detecting that a target subtask exists in the task queue stored in the global memory, sending the target subtask to a task buffer area of a stream processor with the least task amount in the plurality of current stream processors.
The task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks with dependency relationship; the target subtask is a subtask without forward dependency in the task queue.
S302: and determining the target subtask which is sent to the task buffer area as a subtask to be processed.
S303: and judging whether the number of the currently remaining to-be-processed subtasks in the task cache area is multiple, if so, executing S304, and if not, executing S305.
S304: and according to the processing priority of each sub-task to be processed, scheduling each sub-task to be processed to the execution kernel of the stream processor with the least task quantity in sequence, and enabling the execution kernel to execute the received sub-task to be processed.
S305: and scheduling the to-be-processed subtask to an execution core of the stream processor with the least task quantity, so that the execution core executes the received to-be-processed subtask.
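Steps S301 to S305 can be read as a single dispatch routine. The sketch below strings them together using host-side stand-ins for the GPU structures; all types and helper names are illustrative assumptions:

```cpp
#include <algorithm>
#include <vector>

struct Subtask {
    int id;
    int forwardCount;  // zero means no forward dependency (a target subtask)
    int priority;      // e.g. the task out-degree
};

struct StreamProcessor {
    std::vector<Subtask> buffer;  // task buffer in shared memory
};

void execute(const Subtask&) { /* stand-in for the execution core */ }

// S301: move one target subtask to the stream processor with the least
// task amount. S302-S305: drain that buffer, sorting by priority when it
// holds several to-be-processed subtasks.
void dispatchOnce(std::vector<Subtask>& taskQueue,
                  std::vector<StreamProcessor>& sms) {
    auto target = std::find_if(taskQueue.begin(), taskQueue.end(),
        [](const Subtask& t) { return t.forwardCount == 0; });
    if (target == taskQueue.end()) return;          // no target subtask yet

    auto sm = std::min_element(sms.begin(), sms.end(),
        [](const StreamProcessor& a, const StreamProcessor& b) {
            return a.buffer.size() < b.buffer.size();
        });
    sm->buffer.push_back(*target);                  // S301
    taskQueue.erase(target);                        // S302

    if (sm->buffer.size() > 1) {                    // S303 -> S304
        std::stable_sort(sm->buffer.begin(), sm->buffer.end(),
            [](const Subtask& a, const Subtask& b) {
                return a.priority > b.priority;
            });
    }
    for (const Subtask& t : sm->buffer) execute(t); // S304 / S305
    sm->buffer.clear();
}
```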
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the process of detecting the target subtask in the task queue stored in the global memory includes:
acquiring a task dependency relationship table of each task group of the task queue;
determining a forward dependent task count for each subtask of each of the task groups based on each of the task dependency tables;
and detecting whether a target subtask exists in the task queue at present according to the forward dependent task count of each subtask.
In the method provided in the embodiment of the present invention, based on the foregoing implementation process, specifically, after the executing core executes the received to-be-processed sub task, the method further includes:
and determining a task group to which the executed to-be-processed subtasks belong, and updating the forward dependent task count of each currently remaining subtask in the task group.
The task scheduling method disclosed in the above embodiment of the invention follows the same principles and execution process as the units and modules of the task scheduling system disclosed above; for details, reference may be made to the corresponding parts of the task scheduling system provided in the above embodiment, which are not repeated here.
In an embodiment provided by the present invention, referring to FIG. 4, an example backend architecture for the GPU-based task scheduling method is provided;
the GPU comprises a global memory and a plurality of stream processors; each stream processor comprises a shared memory and an execution core, and the shared memory contains a task buffer. The global memory receives a dependency graph and the corresponding task queue from the dependency-analysis front end, which may be a central processing unit (CPU). The CPU decomposes the application program's tasks to obtain the data operations of each subtask, determines the dependency relationships among the subtasks from those data operations, and determines the execution order of the subtasks based on the data dependencies among them.
Specifically, the backend architecture includes a global scheduler and a local scheduler for each stream processor.
It should be noted that, in general, the GPU's storage system is designed hierarchically: the global memory can be accessed by the execution cores of all stream processors, while the shared memory inside each stream processor is exclusive to that stream processor and accessible only by its own execution core.
The global scheduler may search the task queue in the global memory for a target subtask, i.e. a subtask without a forward dependency, whose forward task count is zero.
After finding a target subtask, the global scheduler determines the stream processor with the least current task amount and sends the target subtask to the task buffer in that stream processor's shared memory.
The local scheduler can select the subtask with the highest processing priority from the task buffer and deliver it to the persistently running execution core, which executes the task.
The stream processor then takes control and checks the dependencies of each subtask in the completed subtask's successor task list; that is, it can consult the dependency graph to find the successors of the executed subtask and decrement the forward dependent task count of each successor, so that new subtasks whose forward dependent task count reaches 0 become available for scheduling by the global scheduler.
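Taken together, each stream processor's local scheduler behaves like a persistent worker loop: take the highest-priority subtask from the buffer, run it, then lower the forward counts of its successors and report newly ready subtasks back to the global scheduler. The following host-side sketch illustrates that loop with assumed structures (it is not device code):

```cpp
#include <map>
#include <queue>
#include <vector>

struct Ready {
    int id;
    int priority;
    bool operator<(const Ready& o) const { return priority < o.priority; }
};

void execute(int /*taskId*/) { /* stand-in for the persistent execution core */ }

// One stream processor's local scheduling loop: runs until its buffer is
// empty, decrementing successors' forward counts after each completion and
// returning newly ready subtasks for the global scheduler to dispatch.
std::vector<int> localSchedulerLoop(
        std::priority_queue<Ready>& buffer,
        std::map<int, std::vector<int>>& successorsOf,
        std::map<int, int>& forwardCount) {
    std::vector<int> newlyReady;
    while (!buffer.empty()) {
        Ready next = buffer.top();   // highest processing priority first
        buffer.pop();
        execute(next.id);
        for (int s : successorsOf[next.id]) {
            if (--forwardCount[s] == 0) {
                newlyReady.push_back(s);  // awaits the global scheduler
            }
        }
    }
    return newlyReady;
}
```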
The embodiment of the invention also provides a storage medium, which comprises a stored instruction, wherein when the instruction runs, the device where the storage medium is located is controlled to execute the task scheduling method.
An embodiment of the present invention provides an electronic device, whose structural diagram is shown in FIG. 5. It specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and are configured to be executed by one or more processors 503 to perform the following operations:
when detecting that a target subtask exists in a task queue stored in a global memory, sending the target subtask to a task buffer area of a stream processor with the least task amount in the plurality of current stream processors;
the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks with dependency relationship; the target subtask is a subtask without forward dependency in the task queue;
determining the target subtasks sent to the task buffer as subtasks to be processed, and judging whether multiple to-be-processed subtasks currently remain in the task buffer;
if multiple to-be-processed subtasks currently remain in the task buffer, sequentially scheduling each to-be-processed subtask to the execution core of the stream processor with the least task amount according to the processing priority of each to-be-processed subtask, so that the execution core executes the received to-be-processed subtask.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The task scheduling method provided by the present invention is described in detail above, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the description of the above examples is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A task scheduling system for a graphics processor, GPU, the GPU comprising a global memory and a plurality of stream processors, the system comprising:
a global task scheduling unit and a local task scheduling unit;
the global task scheduling unit is configured to send a target subtask to a task buffer of a stream processor with a smallest task amount among the current plurality of stream processors when it is detected that the target subtask exists in a task queue stored in the global memory;
the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks; the target subtask is a subtask without forward dependency in the task queue;
the local task scheduling unit is configured to determine the target subtask that has been sent to the task buffer as a to-be-processed subtask, determine whether the number of currently remaining to-be-processed subtasks in the task buffer is multiple, and if yes, sequentially schedule each to-be-processed subtask to an execution kernel of the stream processor with the smallest task amount according to a processing priority of each to-be-processed subtask, so that the execution kernel executes the received to-be-processed subtask.
2. The task scheduling system of claim 1, further comprising: a detection unit;
the detection unit is used for acquiring a task dependency relationship table of each task group of the task queue; determining a forward dependency count for each subtask of each of the task groups based on each of the task dependency tables; and detecting whether a target subtask currently exists in the task queue according to the forward dependency count of each subtask.
3. The task scheduling system of claim 2, further comprising: an update unit;
and the updating unit is used for determining the task group to which the executed subtask to be processed belongs and updating the forward dependent task count of each currently remaining subtask in the task group.
4. The task scheduling system of claim 2, further comprising: a task decomposition unit;
the task decomposition unit is used for determining a task to be decomposed corresponding to a task decomposition instruction and a task granularity specified by the task decomposition instruction when the task decomposition instruction is received, and decomposing the task to be decomposed based on the task granularity to obtain a task group corresponding to the task to be decomposed, wherein the task group comprises sub-tasks; determining the operation type of each subtask, and constructing a dependency relationship table based on the operation type of each subtask, wherein the dependency relationship table records the dependency relationship among the subtasks; and sending the task group corresponding to the task to be decomposed and the dependency relationship table corresponding to the task group to the global memory, so that the global memory stores each subtask of the task group into the task queue.
5. The task scheduling system of claim 4, wherein the local task scheduling unit that sequentially schedules each of the to-be-processed subtasks to the execution core of the stream processor according to the processing priority of each of the to-be-processed subtasks currently stored in the task buffer comprises:
a third determining subunit, configured to determine the task out-degree corresponding to each to-be-processed subtask currently stored in the task buffer, where the task out-degree is the number of subtasks corresponding to the to-be-decomposed task to which the to-be-processed subtask belongs;
a fourth determining subunit, configured to determine the processing priority of each to-be-processed subtask based on its task out-degree;
and a scheduling subunit, configured to sequentially schedule each to-be-processed subtask to the execution core of the stream processor in order of processing priority from high to low.
6. A task scheduling method applied to a graphics processing unit (GPU), wherein the GPU comprises a global memory and a plurality of stream processors, the method comprising:
when detecting that a target subtask exists in a task queue stored in the global memory, sending the target subtask to a task buffer area of a stream processor with the least task amount in the plurality of current stream processors;
the task queue comprises at least one task group, and the task group comprises one subtask or a plurality of subtasks with dependency relationship; the target subtask is a subtask without forward dependency in the task queue;
determining the target subtasks sent to the task buffer as subtasks to be processed, and judging whether multiple to-be-processed subtasks currently remain in the task buffer;
if multiple to-be-processed subtasks currently remain in the task buffer, sequentially scheduling each to-be-processed subtask to the execution core of the stream processor with the least task amount according to the processing priority of each to-be-processed subtask, so that the execution core executes the received to-be-processed subtask.
7. The task scheduling method according to claim 6, wherein the process of detecting the target subtask in the task queue stored in the global memory comprises:
acquiring a task dependency relationship table of each task group of the task queue;
determining a forward dependent task count for each subtask of each of the task groups based on each of the task dependency tables;
and detecting whether a target subtask exists in the task queue at present according to the forward dependent task count of each subtask.
8. The task scheduling method according to claim 7, wherein after the causing the execution core to execute the received pending subtasks, further comprising:
and determining a task group to which the executed to-be-processed subtasks belong, and updating the forward dependent task count of each currently remaining subtask in the task group.
9. A storage medium, characterized in that the storage medium comprises stored instructions, wherein when the instructions are executed, a device on which the storage medium is located is controlled to execute the task scheduling method according to any one of claims 6 to 8.
10. An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the task scheduling method according to any one of claims 6 to 8.
CN202010573441.6A (priority and filing date 2020-06-22) · Task scheduling system and method, storage medium and electronic device · Pending · CN111708639A (en)

Priority Applications (1)

Application Number: CN202010573441.6A · Priority Date: 2020-06-22 · Filing Date: 2020-06-22 · Title: Task scheduling system and method, storage medium and electronic device

Applications Claiming Priority (1)

Application Number: CN202010573441.6A · Priority Date: 2020-06-22 · Filing Date: 2020-06-22 · Title: Task scheduling system and method, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN111708639A (en) 2020-09-25

Family

ID=72542698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010573441.6A Pending CN111708639A (en) 2020-06-22 2020-06-22 Task scheduling system and method, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN111708639A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829764B1 (en) * 1997-06-23 2004-12-07 International Business Machines Corporation System and method for maximizing usage of computer resources in scheduling of application tasks
CN103823706A (en) * 2014-02-12 2014-05-28 浙江大学 RTLinux (real-time Linux) based real-time scheduling method for analog simulation of controlled object model
CN103995743A (en) * 2014-05-21 2014-08-20 中国人民解放军国防科学技术大学 Two-stage mixed task scheduling method based on resource reservation
CN106598707A (en) * 2015-10-19 2017-04-26 沈阳新松机器人自动化股份有限公司 Task scheduling optimization method
CN105893158A (en) * 2016-06-08 2016-08-24 北京工业大学 Big data hybrid scheduling model on private cloud condition
CN110494848A (en) * 2018-03-28 2019-11-22 深圳市大疆创新科技有限公司 Task processing method, equipment and machine readable storage medium
CN110991808A (en) * 2019-11-06 2020-04-10 中国建设银行股份有限公司 Task allocation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mingfan Li, "Gdarts: A GPU-Based Runtime System for Dataflow Task Programming on Dependency Applications," 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom).
Deng Zhilong, "Research on time-slot-optimized task scheduling strategy based on Hadoop," Journal of Northwestern Polytechnical University (西北工业大学学报).

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184085A (en) * 2020-11-05 2021-01-05 上海亿保健康管理有限公司 Task allocation method, device and equipment
WO2022160626A1 (en) * 2021-01-29 2022-08-04 上海阵量智能科技有限公司 Command processing apparatus and method, electronic device, and computer storage medium
WO2022160628A1 (en) * 2021-01-29 2022-08-04 上海阵量智能科技有限公司 Command processing apparatus and method, electronic device, and computer-readable storage medium
CN113641476A (en) * 2021-08-16 2021-11-12 腾讯科技(深圳)有限公司 Task scheduling method, game engine, equipment and storage medium
CN113641476B (en) * 2021-08-16 2023-07-14 腾讯科技(深圳)有限公司 Task scheduling method, game engine, device and storage medium
CN113934535A (en) * 2021-10-11 2022-01-14 广东科诺勘测工程有限公司 Mass point cloud data processing method, device, server and system
CN116954954A (en) * 2023-09-20 2023-10-27 摩尔线程智能科技(北京)有限责任公司 Method and device for processing multi-task queues, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111708639A (en) Task scheduling system and method, storage medium and electronic device
US11550627B2 (en) Hardware accelerated dynamic work creation on a graphics processing unit
Yang et al. Intermediate data caching optimization for multi-stage and parallel big data frameworks
CN104714785A (en) Task scheduling device, task scheduling method and data parallel processing device
CN109840149B (en) Task scheduling method, device, equipment and storage medium
JP2003044295A (en) Sleep queue management
CN112363821A (en) Computing resource scheduling method and device and computer equipment
CN111831410A (en) Task processing method and device, storage medium and electronic equipment
CN111190712A (en) Task scheduling method, device, equipment and medium
WO2023082575A1 (en) Graph execution pipeline parallelism method and apparatus for neural network model computation
US7243354B1 (en) System and method for efficiently processing information in a multithread environment
CN107977275B (en) Task processing method based on message queue and related equipment
CN113010286A (en) Parallel task scheduling method and device, computer equipment and storage medium
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
CN115509704A (en) Task scheduling method, device, equipment and storage medium
CN111597044A (en) Task scheduling method and device, storage medium and electronic equipment
CN114217930A (en) Accelerator system resource optimization management method based on mixed task scheduling
CN111930485A (en) Job scheduling method based on performance expression
CN114896295B (en) Data desensitization method, desensitization device and desensitization system in big data scene
CN115964164A (en) Computer-implemented method, hardware accelerator, and storage medium
CN116260876A (en) AI application scheduling method and device based on K8s and electronic equipment
CN112130977B (en) Task scheduling method, device, equipment and medium
CN117093335A (en) Task scheduling method and device for distributed storage system
CN112114967B (en) GPU resource reservation method based on service priority
JP2716019B2 (en) Job class determination method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200925