KR102224446B1

KR102224446B1 - GPGPU thread block scheduling extension method and apparatus

Info

Publication number: KR102224446B1
Application number: KR1020190126843A
Authority: KR
Inventors: 반효경; 조경운
Original assignee: 이화여자대학교 산학협력단
Priority date: 2019-10-14
Filing date: 2019-10-14
Publication date: 2021-03-09

Abstract

Disclosed are a GPGPU thread block scheduling extension method and an apparatus thereof, which provide various scheduling policies. The GPGPU thread block scheduling extension method comprises: a step of configuring a macro thread block (MTB) consisting of a plurality of micro thread blocks (mTBs) for each of stream multiprocessors (SMs) making up a GPGPU; and a step of selectively assigning thread blocks (TBs) included in different sub kernels to an MTB configured in one SM among the plurality of SMs based on a scheduling policy implemented in software.

Description

GPGPU THREAD BLOCK SCHEDULING EXTENSION METHOD AND APPARATUS

The present invention relates to a GPGPU thread block scheduling extension method and apparatus, and more particularly to an apparatus and method that use a thread block scheduler implemented in software to provide a variety of scheduling policies suited to the characteristics of a workload.

The computing hardware of a GPU is a set of computing processors (typically called Stream Multiprocessors, hereinafter SMs) composed of hundreds or thousands of compute cores. The number of SMs may range from tens to hundreds depending on the type of GPU device. A single SM can typically execute tens to hundreds of instances of the same instruction simultaneously in hardware, each instance operating on different data.

The programming model most commonly used on GPUs takes a single-threaded program written for one core and executes many instances of it in parallel in hardware. However, a computing problem to be solved on a GPU may consist of hundreds of thousands to millions of threads, and executing all of them simultaneously on the GPU is not possible in hardware.

To address this, the prior art groups the hundreds of thousands to millions of threads into work units of a fixed size suited to the problem and executes those units on the SMs. Such a work unit is called a thread block (TB), and the manager that assigns TBs to the multiple SMs in the GPU device is called a thread block scheduler (TBS).

However, the hardware (H/W) TBS implemented in the GPU can apply only a fixed scheduling policy that assigns TBs to the SMs one at a time, in order, in a round-robin (RR) manner. As a result, the conventional H/W TBS cannot offer different scheduling policies according to the workload characteristics of the problem to be solved, and therefore cannot deliver optimal performance.

By using a thread block scheduler implemented in software, the present invention can provide a thread block scheduling apparatus and method that offer a variety of scheduling policies according to the workload characteristics of the problem to be solved.

In addition, the present invention can provide a thread block scheduling apparatus and method that serve as a performance verification tool for use when developing TBS policies.

A GPGPU thread block scheduling extension method according to an embodiment of the present invention may include: configuring, for each of a plurality of stream multiprocessors (SMs) constituting a GPGPU, a macro thread block (MTB) composed of a plurality of micro thread blocks (mTBs); and selectively assigning thread blocks (TBs) included in different sub-kernels to an MTB configured in any one of the plurality of SMs, based on a scheduling policy implemented in software.

In the configuring step, the number of MTBs may be determined, based on the maximum number of threads that can be supported simultaneously, so as to cover all hardware resources in the SM.

In the assigning step, the thread blocks may be assigned one at a time, in order, to the MTBs of the plurality of SMs in a round-robin manner.

In the assigning step, the thread blocks may be assigned sequentially, starting from the MTB of the first SM among the plurality of SMs, so as to exhaust all hardware resources.

The different sub-kernels are merged at the source level into a single unified kernel and assigned to the MTB, and the TBs of the unified kernel assigned to the MTB may identify the sub-kernel information they are to execute and run the corresponding sub-kernel.

The TBs of the unified kernel assigned to the MTB may run the corresponding sub-kernel by using an mTB allocation table (mAT) that provides the sub-kernel information referenced for each SM and each mTB.

A different mTB allocation table may be designed for each epoch, based on the speed at which the TBs assigned to the mTBs execute their sub-kernels.

A GPGPU thread block scheduling extension apparatus according to an embodiment of the present invention includes a processor. The processor may configure, for each of a plurality of stream multiprocessors (SMs) constituting a GPGPU, a macro thread block (MTB) composed of a plurality of micro thread blocks (mTBs), and may selectively assign thread blocks (TBs) included in different sub-kernels to an MTB configured in any one of the plurality of SMs, based on a scheduling policy implemented in software.

The processor may determine the number of MTBs, based on the maximum number of threads that can be supported simultaneously, so as to cover all hardware resources in the SM.

The processor may assign the thread blocks one at a time, in order, to the MTBs of the plurality of SMs in a round-robin manner.

The processor may assign the thread blocks sequentially, starting from the MTB of the first SM among the plurality of SMs, so as to exhaust all hardware resources.

By using a thread block scheduler implemented in software, the GPGPU thread block scheduling extension apparatus and method of the present invention can provide a variety of scheduling policies according to the workload characteristics of the problem to be solved.

In addition, the GPGPU thread block scheduling extension apparatus and method of the present invention can be used as a performance verification tool when developing TBS policies.

In addition, the GPGPU thread block scheduling extension apparatus and method of the present invention can provide a programming framework for improving performance through scheduling optimization at the GPU application level.

FIG. 1 is a conceptual diagram of thread block scheduling according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of a software-defined TBS according to an embodiment of the present invention.
FIG. 3 shows a specific example of partitioning and configuring the hardware resources of an SM according to an embodiment of the present invention.
FIG. 4 shows a configuration in which sub-kernels are merged into a unified kernel according to an embodiment of the present invention.
FIG. 5 shows the reference table that threads in a TB assigned to an MTB of an SM consult in order to execute a sub-kernel according to an embodiment of the present invention.
FIG. 6 shows the configuration of mTB allocation tables that differ from epoch to epoch according to an embodiment of the present invention.
FIG. 7 shows how the S/W TBS assigns TBs to SMs according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram of thread block scheduling according to an embodiment of the present invention.

Referring to FIG. 1, the TBS 100 may receive TBs included in a plurality of different sub-kernels and assign them to SMs, which are hardware within the GPU. Here, a sub-kernel is code executed on the GPU and can be understood as the counterpart of a program on a CPU. Sub-kernels may contain TBs of different sizes depending on the kind of kernel, whereas the TBs within any one sub-kernel may all be the same size. That is, the TBs of sub-kernel A (Kernel A) may differ in size from those of sub-kernel B (Kernel B), while the TBs within sub-kernel A, and likewise those within sub-kernel B, may be the same size as one another.

A TB may consist of anywhere from one thread up to, typically, 1024 threads, and an SM may simultaneously execute the threads of a TB assigned to it by the TBS 100. A warp is the unit of threads that the GPU hardware can physically execute at the same time, and one warp typically consists of 32 threads. For example, a TB of 64 threads consists of two warps. When that TB is assigned to an SM by the TBS 100, the SM may execute it 32 threads at a time, that is, in units of warps.

A TB consisting of a single thread would still occupy one warp, but single-thread TBs are not used outside of testing, so the number of threads in a TB is usually a multiple of 32.
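The warp arithmetic above can be illustrated with a short Python sketch. The names `WARP_SIZE` and `warps_per_tb` are hypothetical and introduced only for this illustration; the value 32 is the typical warp size cited in the text, not a property guaranteed for every GPU.

```python
WARP_SIZE = 32  # typical warp size cited in the text (assumption for this sketch)


def warps_per_tb(threads_per_tb: int, warp_size: int = WARP_SIZE) -> int:
    """Return how many warps are needed to hold one thread block."""
    if threads_per_tb < 1:
        raise ValueError("a thread block needs at least one thread")
    # Round up: a partially filled warp still occupies a whole warp.
    return (threads_per_tb + warp_size - 1) // warp_size


print(warps_per_tb(64))    # 2
print(warps_per_tb(1))     # 1
print(warps_per_tb(1024))  # 32
```

The rounding-up step is why a one-thread TB still occupies a full warp, and why thread counts that are multiples of 32 leave no warp lanes unused.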

As described, the throughput of the work (threads) executed on the SMs, the hardware within the GPU, can be strongly affected by the scheduling policy applied by the TBS 100. If the TBS 100 is the H/W TBS implemented in the GPU, it may assign the TBs included in the different sub-kernels to the SMs one at a time, in order, in a round-robin manner.

However, this H/W TBS scheduling method cannot offer different scheduling policies according to the workload characteristics of the problem to be solved, and therefore cannot optimize GPU performance.

To solve this problem, the present invention may provide a software-defined TBS (hereinafter sdTBS) that can assign the TBs included in different sub-kernels to a plurality of SMs based on various scheduling policies chosen according to the workload characteristics of the problem to be solved.

FIG. 2 is a conceptual diagram of a software-defined TBS according to an embodiment of the present invention.

As described above, the software-defined TBS of the present invention, that is, the sdTBS 200, may provide a software-side method of assigning the TBs included in different sub-kernels to the SMs, the hardware within the GPU, according to various scheduling policies. To this end, the sdTBS 200 may include an S/W TBS 210 implemented in software. The S/W TBS 210 may assign the TBs included in the different user-defined sub-kernels to one of the SMs in the GPU, using a scheduling policy that matches the workload characteristics of the problem to be solved, so that the GPU delivers optimal performance.

Meanwhile, the H/W TBS 220 provided in an existing GPU, on receiving user-defined TBs, may assign them to the SMs one at a time, in order, in a round-robin manner. To neutralize this scheduling behavior of the H/W TBS 220, the sdTBS 200 may first occupy the entire resources of the SMs through the S/W TBS 210 and then have the TBs occupying those SMs internally execute whichever kernel is required. For this, the sdTBS 200 must merge the existing, separate kernels at the source level to produce a single kernel; further details are described below with reference to FIG. 3.

Specifically, the S/W TBS 210 may configure virtual macro TBs (hereinafter MTBs) for each of the SMs constituting the GPU. The S/W TBS 210 may determine the number of MTBs for each SM, based on the maximum number of threads the SM can support simultaneously, so that the MTBs cover all hardware resources of that SM.

For example, an MTB may typically be configured with 1024 threads, the maximum number it can support simultaneously. Since one SM can typically support up to 2048 threads simultaneously, two MTBs may be configured per SM.

That is, the S/W TBS 210 may configure an appropriate number of MTBs so as to consume all hardware resources held by an SM, and may schedule the TBs included in the different user-defined sub-kernels to run on the MTBs of that SM.

The S/W TBS 210 may then selectively assign the TBs included in the different sub-kernels to an MTB configured in one of the SMs, according to a scheduling policy chosen, from among various software-implemented policies, to suit the workload characteristics of the problem to be solved. For example, GPGPU performance may be best when only one TB is assigned per MTB configured in an SM. In that case, TBs assigned to MTBs of the same SM may compete for shared resources (e.g., global memory) and degrade one another's performance, so it may be preferable for the S/W TBS 210 to assign only one TB to each MTB. Alternatively, depending on the workload, GPGPU performance may be best when as many TBs as possible are assigned to each MTB configured in an SM.

In this way, the S/W TBS 210 may improve GPGPU performance by selectively assigning the TBs included in different sub-kernels to an MTB configured in one of the SMs, using different scheduling policies according to the workload characteristics of the problem to be solved. The scheduling policies of the S/W TBS 210 are described in more detail below with reference to FIG. 7.

FIG. 3 shows a specific example of partitioning and configuring the hardware resources of an SM according to an embodiment of the present invention.

As mentioned above, before assigning the TBs included in different sub-kernels to the SMs, the S/W TBS 210 of the sdTBS 200 may configure an appropriate number of MTBs for each SM so as to consume all hardware resources held by that SM. FIG. 3 shows an example in which two MTBs are configured in one SM, but the number of MTBs configured in an SM may be determined based on the maximum number of threads that the SM can support simultaneously.

In addition, the S/W TBS 210 may divide an MTB into units that the hardware can process efficiently. As noted above, the GPU can typically execute 32 threads physically at the same time, and such a group of 32 threads constitutes a warp. Accordingly, the S/W TBS 210 may divide an MTB into mTBs, each an execution unit in which the 32 threads corresponding to one warp can run simultaneously.

For example, the SM shown in FIG. 3 can execute a total of 2048 threads simultaneously, so the S/W TBS 210 may divide the SM into two MTBs, each able to process 1024 threads simultaneously. The S/W TBS 210 may further divide each MTB into 32 mTBs, matching the warp, the unit of threads that can physically execute at the same time.
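The partitioning arithmetic of this example can be sketched in Python as follows. The function `partition_sm` and its parameter names are hypothetical, and the default values (2048, 1024, 32) are simply the typical figures quoted in the text; on real hardware they would have to be taken from the device's actual limits.

```python
def partition_sm(max_threads_per_sm: int = 2048,
                 threads_per_macro_tb: int = 1024,
                 warp_size: int = 32) -> tuple[int, int]:
    """Return (MTBs per SM, mTBs per MTB) for the given thread limits."""
    if max_threads_per_sm % threads_per_macro_tb != 0:
        raise ValueError("SM thread capacity must be a multiple of the MTB size")
    if threads_per_macro_tb % warp_size != 0:
        raise ValueError("MTB size must be a multiple of the warp size")
    # Enough MTBs to cover every thread slot the SM can host at once.
    macro_tbs_per_sm = max_threads_per_sm // threads_per_macro_tb
    # One mTB per warp-sized group of threads inside an MTB.
    micro_tbs_per_macro_tb = threads_per_macro_tb // warp_size
    return macro_tbs_per_sm, micro_tbs_per_macro_tb


print(partition_sm())  # (2, 32)
```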

FIG. 4 shows a configuration in which sub-kernels are merged into a unified kernel according to an embodiment of the present invention.

For sub-kernels, the programs that run threads on the GPU, to be executed under the sdTBS 200, they must be merged into a new kind of unified kernel. The unified kernel may have a program structure that determines, for each individual mTB, which sub-kernel is actually to be executed and branches to that sub-kernel. In this selection process, the unified kernel uses conditional branches to run a sub-kernel according to sub-kernel information (e.g., a sub-kernel ID), and the selection may be set dynamically or statically depending on the policy chosen by the S/W TBS 210 of the sdTBS 200.
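The conditional-branch structure of the unified kernel can be sketched in plain Python as below. This is only a host-side illustration, not GPU code: the two sub-kernels, their ID values (1 and 2), and the convention that ID 0 means "nothing assigned" are all assumptions made for the sketch.

```python
IDLE = 0  # assumed ID meaning "no sub-kernel assigned to this mTB"


def sub_kernel_a(thread_id: int) -> str:
    """Stand-in for the body of a first sub-kernel (hypothetical)."""
    return f"A:{thread_id}"


def sub_kernel_b(thread_id: int) -> str:
    """Stand-in for the body of a second sub-kernel (hypothetical)."""
    return f"B:{thread_id}"


def unified_kernel(sub_kernel_id: int, thread_id: int):
    """Branch to the sub-kernel selected for the calling mTB."""
    if sub_kernel_id == IDLE:
        return None  # nothing scheduled on this mTB
    elif sub_kernel_id == 1:
        return sub_kernel_a(thread_id)
    elif sub_kernel_id == 2:
        return sub_kernel_b(thread_id)
    raise ValueError(f"unknown sub-kernel id {sub_kernel_id}")


print(unified_kernel(1, 5))  # A:5
print(unified_kernel(2, 0))  # B:0
```

Because all threads of one mTB receive the same sub-kernel ID, every thread in the warp takes the same branch, which is presumably why the selection is made per mTB rather than per thread.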

A sub-kernel may define the TBs to be executed on the GPU in one to at most three dimensions, and this dimensionality is intended to exploit information about the placement of adjacent TBs. That is, TBs within the same sub-kernel use the same amount of SM resources, whereas TBs of different kernels usually differ in their SM resource usage, so the S/W TBS 210 can improve GPU performance by using this information about adjacent TB placement.

The programmer can decide how many threads make up one TB. As with the kernel's dimensions, the S/W TBS 210 may define a TB in one to at most three dimensions and exploit information about the placement of adjacent threads to improve GPU performance.

FIG. 5 shows the reference table that threads in a TB assigned to an MTB of an SM consult in order to execute a sub-kernel according to an embodiment of the present invention.

Because the threads of a TB assigned to an mTB of an SM through the S/W TBS 210 of the sdTBS 200 run in parallel, each thread must be able to identify, independently and efficiently, the sub-kernel it is to execute. To this end, the sdTBS 200 of the present invention may use a reference table with which each thread in a TB identifies the sub-kernel it is to execute.

More specifically, as shown in FIG. 5, the sdTBS 200 may provide an mTB allocation table (hereinafter mAT) containing the sub-kernel information referenced for each SM and each mTB. In other words, the threads of a TB assigned to an mTB of an SM may use the mAT to identify the sub-kernel they are to execute, and then execute the identified sub-kernel.

For example, 32 mTBs may typically be configured in one SM, so if the GPGPU has 10 SMs, a total of 320 mTBs may be configured. The mAT may be a table with one 1-byte entry per mTB that identifies which sub-kernel that mTB is to execute. That is, with 10 SMs, the mAT may consist of 320 numeric values occupying 320 bytes.
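A minimal Python sketch of such a table follows, using a `bytearray` so that each entry occupies exactly one byte. The helper names `make_mat` and `mat_slot`, and the row-major layout (all mTBs of SM 0 first, then SM 1, and so on), are assumptions for illustration; the text specifies only the entry size and the total count.

```python
MICRO_TBS_PER_SM = 32  # figure quoted in the text for a typical SM


def make_mat(num_sms: int, micro_tbs_per_sm: int = MICRO_TBS_PER_SM) -> bytearray:
    """Create a zero-filled mAT with one 1-byte entry per mTB."""
    return bytearray(num_sms * micro_tbs_per_sm)


def mat_slot(sm_id: int, micro_tb_id: int,
             micro_tbs_per_sm: int = MICRO_TBS_PER_SM) -> int:
    """Index of an (SM, mTB) pair, assuming a row-major layout."""
    return sm_id * micro_tbs_per_sm + micro_tb_id


mat = make_mat(10)
print(len(mat))            # 320
mat[mat_slot(3, 5)] = 2    # mTB 5 of SM 3 is to run sub-kernel 2
print(mat[mat_slot(3, 5)])  # 2
```

One consequence of the 1-byte entry is that at most 256 distinct values, including any value reserved for "idle", can be encoded per mTB.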

However, because the mTBs do not all execute the same sub-kernel, multiple sets of mATs may be needed. Suppose the TBs submitted by the user are to be executed as 10 mTBs running a first sub-kernel. In the extreme case where the sdTBS 200 places all of them on the very first mTB, every other mTB sits idle and only the first mTB does the work. In that case 10 mAT sets are needed, and in each mAT only the first of the 320 bytes may be 1 while all the others are 0.

Among the threads of a TB assigned to an mTB of an SM, only the first thread (the leader thread) initially attempts mAT allocation (identifying the sub-kernel by means of the mAT), while the remaining threads wait for synchronization (sync) at the warp level. That is, once the leader thread has used the mAT to identify the sub-kernel to be executed, the remaining threads may execute the sub-kernel identified by the leader thread.

Locking may be provided among the leader threads of all mTBs by busy waiting. A leader thread that acquires the lock for mAT allocation may invoke the scheduling policy to set the mAT and then release the lock.
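The leader-thread locking protocol can be imitated on the CPU with Python's `threading` module. This is only an analogy under stated assumptions: each Python thread stands in for the leader thread of one mTB, the non-leader threads and their warp-level sync are omitted, and the `policy` function (which simply hands out pending sub-kernel IDs in order) is a hypothetical placeholder for a real scheduling policy.

```python
import threading


def run_leaders(pending_ids):
    """Simulate one leader thread per mTB filling in its own mAT entry."""
    pending = list(pending_ids)
    mat = bytearray(len(pending))  # one 1-byte entry per mTB
    lock = threading.Lock()

    def policy() -> int:
        # Hypothetical policy: hand out pending sub-kernel IDs in order.
        return pending.pop(0)

    def leader(micro_tb_id: int) -> None:
        # Busy waiting: retry without blocking until the lock is acquired.
        while not lock.acquire(blocking=False):
            pass
        try:
            # Only the lock holder calls the policy and updates the mAT.
            mat[micro_tb_id] = policy()
        finally:
            lock.release()

    leaders = [threading.Thread(target=leader, args=(i,))
               for i in range(len(mat))]
    for t in leaders:
        t.start()
    for t in leaders:
        t.join()
    return mat


result = run_leaders([1, 2, 1, 2])
print(sorted(result))
```

Which mTB receives which ID depends on the order in which the leaders win the lock, so only the multiset of assigned IDs is deterministic here.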

FIG. 6 shows the configuration of mTB allocation tables that differ from epoch to epoch according to an embodiment of the present invention.

Because the mTBs of an SM process the threads of their assigned TBs at different speeds, a single mAT may not be sufficient. The sdTBS 200 may therefore be designed so that mTBs at different epochs refer to different mATs. Here, an epoch is a number that starts at 0 and keeps increasing, and it indicates which of the several mATs an mTB refers to. That is, every mTB identifies its sub-kernel by referring to the mAT for epoch 0, executes that sub-kernel, and then changes its epoch to 1. The mTB may then refer to the mAT for epoch 1 to identify the sub-kernel it is to execute next.

Owing to the hardware constraints of the GPU, mATs cannot be provided without limit; instead, a fixed number of mATs may be reused repeatedly.
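The per-mTB epoch counter and the reuse of a fixed pool of mATs can be sketched as follows. The class `EpochTables` and its methods are hypothetical, and mapping an epoch to a table with a modulo operation is one plausible reading of "reused repeatedly", not something the text states.

```python
class EpochTables:
    """A fixed pool of mATs indexed by each mTB's own epoch counter."""

    def __init__(self, num_micro_tbs: int, num_tables: int) -> None:
        self.tables = [bytearray(num_micro_tbs) for _ in range(num_tables)]
        self.epoch = [0] * num_micro_tbs  # every mTB starts at epoch 0

    def lookup(self, micro_tb_id: int) -> int:
        """Sub-kernel ID this mTB should run at its current epoch."""
        # Assumed wrap-around: epochs beyond the pool size reuse old tables.
        table = self.tables[self.epoch[micro_tb_id] % len(self.tables)]
        return table[micro_tb_id]

    def advance(self, micro_tb_id: int) -> None:
        """Move one mTB to its next epoch after it finishes a sub-kernel."""
        self.epoch[micro_tb_id] += 1


pool = EpochTables(num_micro_tbs=4, num_tables=2)
pool.tables[0][0] = 1   # epoch 0: mTB 0 runs sub-kernel 1
pool.tables[1][0] = 2   # epoch 1: mTB 0 runs sub-kernel 2
print(pool.lookup(0))   # 1
pool.advance(0)
print(pool.lookup(0))   # 2
```

Because each mTB keeps its own epoch, a fast mTB can move on to a later table while slower mTBs are still working from an earlier one.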

FIG. 7 shows how the S/W TBS assigns TBs to SMs according to an embodiment of the present invention.

As mentioned above, the S/W TBS 210 may selectively assign the TBs included in different sub-kernels to an MTB configured in one of the SMs, according to a scheduling policy chosen, from among various software-implemented policies, to suit the workload characteristics of the problem to be solved.

For example, as in FIG. 7(a), the S/W TBS 210 may assign the TBs included in different sub-kernels to the SMs one at a time, in order, in a round-robin manner. Alternatively, as in FIG. 7(b), the S/W TBS 210 may preferentially assign TBs to the SM with the lowest usage among the SMs. As in FIG. 7(c), the S/W TBS 210 may also assign TBs sequentially, starting from the MTB of the first SM among the SMs, so as to exhaust all hardware resources. Finally, the S/W TBS 210 may preferentially assign TBs to the SM with the highest usage among the SMs.
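The four policies named above can be sketched as small Python functions. All function names are hypothetical; "usage" is modeled as the number of TBs currently placed on an SM and `capacity` as the maximum number of TBs one SM can hold, both of which are simplifying assumptions not found in the text.

```python
def round_robin(num_tbs: int, num_sms: int) -> list[int]:
    """SM index for each TB when TBs are dealt out one at a time in order."""
    return [tb % num_sms for tb in range(num_tbs)]


def fill_first(num_tbs: int, num_sms: int, capacity: int) -> list[int]:
    """SM index for each TB when SM 0 is filled completely before SM 1, etc."""
    if num_tbs > num_sms * capacity:
        raise ValueError("not enough SM capacity for all TBs")
    return [tb // capacity for tb in range(num_tbs)]


def least_loaded(loads: list[int]) -> int:
    """Pick the SM with the lowest current usage (first one on ties)."""
    return min(range(len(loads)), key=lambda sm: loads[sm])


def most_loaded(loads: list[int], capacity: int) -> int:
    """Pick the busiest SM that still has room (first one on ties)."""
    free = [sm for sm, load in enumerate(loads) if load < capacity]
    if not free:
        raise ValueError("every SM is full")
    return max(free, key=lambda sm: loads[sm])


print(round_robin(5, 3))          # [0, 1, 2, 0, 1]
print(fill_first(5, 3, 2))        # [0, 0, 1, 1, 2]
print(least_loaded([3, 1, 2]))    # 1
print(most_loaded([3, 1, 2], 3))  # 2
```

Round-robin and least-loaded spread TBs across SMs, which may suit TBs that contend for shared resources, whereas fill-first and most-loaded pack TBs onto as few SMs as possible.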

In this way, the S/W TBS 210 may provide a way to optimize GPU performance by assigning TBs to the MTBs of the SMs using different scheduling policies according to the workload characteristics of the TBs to be scheduled.

Meanwhile, the method according to the present invention may be written as a program executable on a computer and may be implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier, for example a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or for controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program such as the computer program(s) described above may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. In general, a processor receives instructions and data from a read-only memory, a random access memory, or both. The elements of a computer may include at least one processor that executes instructions and one or more memory devices that store instructions and data. In general, a computer may also include one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks, or may be coupled to receive data from them, transmit data to them, or both. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and the like. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, the computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable sub-combination. Furthermore, although features may be described as acting in certain combinations, and even initially claimed as such, one or more features of a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

Likewise, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. In addition, the separation of various apparatus components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and apparatuses can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be practiced.

100: thread block scheduler (TBS)
200: software-defined TBS
210: S/W TBS
220: H/W TBS

Claims

A GPGPU thread block scheduling extension method performed by a processor, the method comprising:
configuring, by the processor, a macro thread block (MTB) consisting of a plurality of micro thread blocks (mTBs) for each of a plurality of stream multiprocessors (SMs) constituting a GPGPU; and
selectively allocating, by the processor, thread blocks (TBs) included in different sub-kernels to the MTB configured in any one of the plurality of SMs, based on a scheduling policy implemented in software,
wherein the different sub-kernels are merged into one unified kernel at the source level and allocated to the MTB, and
wherein the TBs in the unified kernel allocated to the MTB identify information on the sub-kernel to be performed and perform the corresponding sub-kernel.

The method of claim 1,
wherein the configuring comprises
determining the number of MTBs so as to cover all hardware resources in the SM, based on the maximum number of threads that can be supported simultaneously.

The method of claim 1,
wherein the allocating comprises
allocating the thread blocks sequentially, one at a time, to the MTBs of the plurality of SMs in a round-robin manner.

The method of claim 1,
wherein the allocating comprises
allocating the thread blocks sequentially, starting from the MTB of the first SM among the plurality of SMs, so as to exhaust all hardware resources.

(Deleted)

The method of claim 1,
wherein the TBs in the unified kernel allocated to the MTB
perform the corresponding sub-kernel using an mTB allocation table (mAT) that provides the sub-kernel information referenced for each SM and each mTB.

The method of claim 6,
wherein, for the mTB allocation table,
a different mTB allocation table is designed for each epoch based on the sub-kernel execution speed of the TBs allocated to the mTBs.

A computer-readable recording medium on which a program for executing the method of any one of claims 1 to 4 and 6 to 7 is recorded.

A GPGPU thread block scheduling extension apparatus comprising
a processor,
wherein the processor is configured to:
configure a macro thread block (MTB) consisting of a plurality of micro thread blocks (mTBs) for each of a plurality of stream multiprocessors (SMs) constituting a GPGPU; and
selectively allocate thread blocks (TBs) included in different sub-kernels to the MTB configured in any one of the plurality of SMs, based on a scheduling policy implemented in software,
wherein the different sub-kernels are merged into one unified kernel at the source level and allocated to the MTB, and
wherein the TBs in the unified kernel allocated to the MTB identify information on the sub-kernel to be performed and perform the corresponding sub-kernel.

The apparatus of claim 9,
wherein the processor
determines the number of MTBs so as to cover all hardware resources in the SM, based on the maximum number of threads that can be supported simultaneously.

The apparatus of claim 9,
wherein the processor
allocates the thread blocks sequentially, one at a time, to the MTBs of the plurality of SMs in a round-robin manner.

The apparatus of claim 9,
wherein the processor
allocates the thread blocks sequentially, starting from the MTB of the first SM among the plurality of SMs, so as to exhaust all hardware resources.

(Deleted)

The apparatus of claim 9,
wherein the TBs in the unified kernel allocated to the MTB
perform the corresponding sub-kernel using an mTB allocation table (mAT) that provides the sub-kernel information referenced for each SM and each mTB.

The apparatus of claim 14,
wherein, for the mTB allocation table,
a different mTB allocation table is designed for each epoch based on the sub-kernel execution speed of the TBs allocated to the mTBs.