KR20200114702A - A method and apparatus for long latency hiding based warp scheduling

Info

Publication number: KR20200114702A
Application number: KR1020190036886A
Authority: KR
Inventors: 김광복; 김철홍
Original assignee: 전남대학교산학협력단
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-07
Also published as: KR102210765B1

Abstract

The present invention relates to a method and apparatus for long latency hiding based warp scheduling. According to an embodiment of the present invention, the warp scheduling method may comprise the steps of: (a) issuing warps that execute memory instructions generating a latency greater than or equal to a first threshold; and (b) after issuing the warps that execute the memory instructions, issuing warps that execute computation instructions.

Description

A method and apparatus for long latency hiding based warp scheduling

The present invention relates to warp scheduling and, more particularly, to a method and apparatus for warp scheduling based on hiding long latencies.

Throughput-oriented processors such as GPUs have recently become able to execute general-purpose programs quickly by using their powerful computational resources. Accordingly, various warp scheduling schemes are being developed to improve GPU efficiency.

CCWS, one of these warp scheduling schemes, was proposed to mitigate the failure to properly exploit intra-warp locality in the L1 data cache. It stores tag information evicted from the L1 data cache in an additional cache and monitors how often thrashing occurs for each warp. If a warp fails to exploit locality, that warp is issued with priority over other warps so that the requested data can be reused before it is replaced. This can improve cache locality, but it may reduce thread-level parallelism. In addition, it requires hardware complexity and area to update and compute thrashing-related information for each warp at run time.

CAWA proposes an instruction- and stall-based criticality predictor and classifies the warps in a thread block accordingly. In addition, its criticality-based warp scheduler assigns critical warps to resources and predicts cache reuse.

However, conventional warp scheduling schemes have the problem that they cause bottlenecks or are inadequate at hiding latency and utilizing memory resources.

[Non-Patent Document 1] T. G. Rogers, M. O'Connor, and T. M. Aamodt, “Cache-conscious wavefront scheduling,” Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 72-83, 2012.

The present invention was devised to solve the problems described above, and an object thereof is to provide a method and apparatus for warp scheduling based on long latency hiding.

Another object of the present invention is to provide a method and apparatus for warp scheduling that preferentially issues warps executing memory instructions that cause long latency and evenly issues warps executing computation instructions.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the description below.

To achieve the above objects, a warp scheduling method according to an embodiment of the present invention may include: (a) issuing warps that execute memory instructions generating a latency equal to or greater than a first threshold; and (b) after issuing the warps that execute the memory instructions, issuing warps that execute computation instructions.

In an embodiment, step (a) may include issuing warps whose allocation time for the memory instructions is equal to or greater than a second threshold.

In an embodiment, the warp scheduling method may further include, after step (a), sorting the warps that execute the memory instructions based on the allocation time and inserting them into a memory warp queue.

In an embodiment, the warp scheduling method may further include, after the inserting step, executing the memory instructions in the order of the warps inserted into the memory warp queue.

In an embodiment, step (b) may include issuing the warps that execute the computation instructions when a stall occurs in the memory pipeline.

In an embodiment, the warp scheduling method may further include, after step (b), sorting the warps that execute the computation instructions by warp ID (identification) and inserting them into a computation warp queue.

In an embodiment, the warp scheduling method may further include, after the inserting step, executing the computation instructions in the order of the warps inserted into the computation warp queue.

In an embodiment, the warp scheduling method may, after step (b), sort the warps that execute the computation instructions based on a recency bit of a recency register that includes some of the active warps, and insert them into the computation warp queue.

In an embodiment, the warps that execute the memory instructions may be prioritized according to the latency, and the warps that execute the computation instructions may be given equal priority.

In an embodiment, the memory instructions and the computation instructions may be stored separately according to instruction type.

In an embodiment, a warp scheduling apparatus may include a warp scheduler that issues warps executing memory instructions that generate a latency equal to or greater than a first threshold and, after issuing the warps that execute the memory instructions, issues warps that execute computation instructions.

In an embodiment, the warp scheduler may issue warps whose allocation time for the memory instructions is equal to or greater than a second threshold.

In an embodiment, the warp scheduler may sort the warps that execute the memory instructions based on the allocation time and insert them into a memory warp queue.

In an embodiment, the warp scheduler may execute the memory instructions in the order of the warps inserted into the memory warp queue.

In an embodiment, the warp scheduler may issue the warps that execute the computation instructions when a stall occurs in the memory pipeline.

In an embodiment, the warp scheduler may sort the warps that execute the computation instructions by warp ID (identification) and insert them into a computation warp queue.

In an embodiment, the warp scheduler may execute the computation instructions in the order of the warps inserted into the computation warp queue.

In an embodiment, the warp scheduler may sort the warps that execute the computation instructions based on a recency bit of a recency register that includes some of the active warps, and insert them into the computation warp queue.

In an embodiment, the memory instructions and the computation instructions may be stored separately according to instruction type.

Specific details for achieving the above objects will become clear with reference to the embodiments described in detail below together with the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms; the embodiments are provided so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art") is fully informed of the scope of the invention.

According to an embodiment of the present invention, memory resources can be utilized to the fullest by preferentially issuing warps that execute memory instructions causing long latency, and warp-level parallelism can be increased by issuing warps that execute computation instructions with equal priority.

The effects of the present invention are not limited to those described above, and potential effects expected from the technical features of the present invention will be clearly understood from the description below.

FIG. 1 is a diagram illustrating the functional configuration of a streaming multiprocessor (SM) of a graphics processing unit (GPU) according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of warp scheduling according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the functional configuration of a warp scheduler according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a method of operating a warp scheduling apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating a graph of the miss rate of the L1 data cache according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a graph of reservation failures for the L1 data cache according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating a graph of stall cycles according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a graph of performance under each warp scheduling scheme according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

The various features of the invention disclosed in the claims may be better understood in view of the drawings and the detailed description. The apparatuses, methods, processes, and various embodiments disclosed in the specification are provided for illustration. The disclosed structural and functional features are intended to enable a person skilled in the art to practice the various embodiments, not to limit the scope of the invention. The disclosed terms and sentences are intended to explain the various features of the disclosed invention in an easily understandable way, not to limit the scope of the invention.

In describing the present invention, detailed descriptions of related known technology are omitted where they are judged to unnecessarily obscure the gist of the present invention.

Hereinafter, a method and apparatus for long latency hiding based warp scheduling according to an embodiment of the present invention are described.

FIG. 1 is a diagram illustrating the functional configuration of a streaming multiprocessor (SM) 100 of a graphics processing unit (GPU) according to an embodiment of the present invention. In an embodiment, the graphics processing unit may consist of a plurality of SMs 100. Each SM 100 may be connected to its own separate memory partition through an interconnection network.

Referring to FIG. 1, the SM 100 may include an instruction buffer 110, an L1 instruction cache 120, a scoreboard 130, a register file 140, a warp scheduler 150, and an instruction execution unit 160.

The instruction buffer 110 may select a program counter (PC) for a warp and an empty slot. The instruction is fetched from the L1 instruction cache 120 and then placed in the empty slot of the instruction buffer 110. In an embodiment, the instruction may be decoded by a decoding unit (not shown) before being placed in the instruction buffer 110.

An instruction may wait in the instruction buffer 110 until its ready bit is set by the scoreboard 130, that is, until the previous instruction from the same warp has completed.

The warp scheduler 150 may check the instruction buffer 110 according to the warp scheduling technique of the present invention, so that no hazard occurs, and forward instructions to the appropriate instruction execution unit 160.

The warp scheduler 150 may perform warp scheduling by applying the GTO (greedy-then-oldest) scheme and the LRR (loose round-robin) scheme. In an embodiment, the warp scheduler 150 preferentially issues warps that execute memory instructions causing long latency, so that memory resources are utilized to the fullest, and evenly issues warps that execute computation instructions, so that warp-level parallelism is increased.

That is, the warp scheduler 150 may perform static warp scheduling that favors latency hiding and memory resource utilization without causing bottlenecks.

Here, the LRR scheme issues warps sequentially in warp ID order and skips any warp that is not ready to issue in a given issue cycle. Because the LRR scheme treats every warp as an issue candidate, warps tend to be issued evenly from a warp/CTA perspective, which makes it a scheme that can maximize parallelism.
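
By way of non-limiting illustration, the LRR selection described above may be sketched in Python as follows. The function name `lrr_select` and its parameters are hypothetical and do not appear in the original disclosure; the sketch assumes warp IDs are the integers 0 to N-1 and that readiness is given as a list of booleans.

```python
from typing import Optional, Sequence


def lrr_select(ready: Sequence[bool], last_issued: int) -> Optional[int]:
    """Pick the next warp in warp-ID order after `last_issued`.

    Warps that are not ready in this issue cycle are skipped.
    Returns None when no warp is ready.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        wid = (last_issued + offset) % n  # wrap around to warp 0
        if ready[wid]:
            return wid
    return None


# Example call: warps 0, 2, and 3 are ready; warp 0 issued last.
next_wid = lrr_select([True, False, True, True], last_issued=0)
```

Because the search always starts one position past the last issued warp, every ready warp gets a turn before any warp issues twice, which is the even-issue property the text attributes to LRR.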

The GTO scheme keeps issuing the same warp as long as no stall occurs. When a stall occurs, it preferentially issues the warp with the oldest allocation time. Therefore, unless all warps experience long latency overall, the warps with the oldest allocation times are executed continuously.
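
By way of non-limiting illustration, the GTO selection described above may be sketched in Python as follows. The `Warp` class, its `alloc_cycle` field (the cycle at which the warp was allocated to the SM), and the function `gto_select` are hypothetical names introduced only for this sketch; a stalled warp is modeled simply as a warp whose `ready` flag is false.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    wid: int
    alloc_cycle: int  # hypothetical: cycle when the warp was allocated to the SM
    ready: bool = True


def gto_select(warps: List[Warp], current: Optional[int]) -> Optional[int]:
    """Greedy: keep the current warp while it is ready.
    Otherwise fall back to the ready warp with the oldest allocation time."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    if any(w.wid == current for w in ready):
        return current
    # Ties on allocation time are broken by warp ID (an assumption).
    return min(ready, key=lambda w: (w.alloc_cycle, w.wid)).wid
```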

The SM 100 may include a warp pool that stores the context of every running thread. In this case, the warp scheduler 150 may select from the warp pool a warp that will not cause a pipeline hazard and issue it every cycle.

The instruction execution unit 160 may execute instructions according to instruction type. The instruction execution unit 160 may include a streaming processor (SP), a special function unit (SFU), and a load/store (LD/ST) unit.

In an embodiment, a GPU application may consist of a set of threads that execute the same portion of code, called a kernel. At run time, multiple threads are grouped into a unit called a cooperative thread array (CTA) or thread block. All threads in a thread block are executed and managed in units of warps.

With multithreading and the register file 140, context switching can be performed quickly whenever a thread or warp stalls, minimizing unnecessary latency. Accordingly, the graphics processing unit can achieve higher average pipeline utilization and throughput than conventional CMP designs.

FIG. 2 is a diagram illustrating an example of warp scheduling according to an embodiment of the present invention.

Referring to FIG. 2, a memory instruction loads data from, or writes data to, the memory hierarchy. If the data block is not found in on-chip memory, additional long latency may be incurred. Accessing the off-chip L2 cache may incur a long latency of 100 clock cycles or more. In addition, the access latency of off-chip DRAM may reach 400 to 600 clock cycles.

Therefore, latency hiding by the scheduler must be actively exploited so that other instructions can execute during these long latencies. The degree of thread-level parallelism (TLP) is also a major factor in using GPU resources effectively. When a warp, which is a group of threads, executes a memory instruction, the thread scheduler performs context switching so that another warp can execute. If enough warps in the SM are ready to issue, management resources that allow many warps to execute concurrently are also required.

Because the LRR scheme yields nearly the same progress rate for all warps, the warps may occupy the same resource at the same time. In particular, if memory instructions that cause long latency are issued simultaneously, a bottleneck on cache resources may arise and reduce performance. Furthermore, if no instruction that uses a different execution unit is ready during the long latency, the utilization of some execution units drops.

FIG. 2 shows, under the assumption of unlimited resources, how the outcome differs between warp scheduling schemes when memory instructions and computation instructions are issued by the warp scheduler.

The GTO scheme 210 keeps issuing the same warp as long as no stall occurs. When a stall occurs, it preferentially issues the warp with the oldest allocation time.

The MTO (memory then oldest) scheme 220 issues memory instructions preferentially and, among warps with the same instruction type, preferentially issues the warp with the oldest allocation time.

The MOTRR (memory oldest then round robin) scheme 230 follows allocation-time order for memory instructions, while warps executing the other, computation instructions follow the LRR scheme. Because both the MTO scheme 220 and the MOTRR scheme 230 issue memory instructions preferentially, they can reduce the total execution time through the latency hiding effect.
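
By way of non-limiting illustration, one possible reading of the MOTRR selection may be sketched in Python as follows. The names `ReadyWarp`, `motrr_pick`, and `mem_pipeline_stalled` are hypothetical. The sketch assumes that a memory-instruction warp is chosen whenever one is ready and the memory pipeline is not stalled, and that computation-instruction warps are otherwise chosen round-robin by warp ID.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReadyWarp:
    wid: int
    alloc_cycle: int  # hypothetical: cycle when the warp was allocated to the SM
    is_memory: bool   # True if the warp's next instruction is a memory instruction


def motrr_pick(ready: List[ReadyWarp], last_compute_wid: int, num_warps: int,
               mem_pipeline_stalled: bool = False) -> Optional[ReadyWarp]:
    # Memory first: the oldest-allocated memory warp, unless the pipeline is stalled.
    if not mem_pipeline_stalled:
        mem = [w for w in ready if w.is_memory]
        if mem:
            return min(mem, key=lambda w: (w.alloc_cycle, w.wid))
    # Otherwise round-robin over computation warps, starting just after
    # the computation warp that issued last.
    comp = [w for w in ready if not w.is_memory]
    if not comp:
        return None
    return min(comp, key=lambda w: (w.wid - last_compute_wid - 1) % num_warps)
```

Under the MTO scheme, by contrast, the final `min` would use `alloc_cycle` as its key for computation warps as well.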

To implement the warp scheduler 150 included in each SM 100, warps may be stored separately as memory-instruction warps and computation-instruction warps according to the type of instruction each warp executes. However, the priority with which the warps of each instruction type are issued must also be decided.

Warp scheduling according to various embodiments of the present invention may apply the LRR scheme and the GTO scheme. Both schemes have low implementation complexity, and each performs well depending on the characteristics of the workload.

Warp scheduling according to the present invention can increase the long latency hiding effect by scheduling instructions that have different processing times so that they effectively execute in parallel. Accordingly, a GPU to which this warp scheduling is applied may need fewer total cycles to process its instructions.

Warp scheduling according to the present invention proposes two main techniques for making effective use of latency: preferentially issuing memory instructions that cause long latency so that memory resources are utilized to the fullest, and applying the LRR scheme to warps that execute computation instructions so that warp-level parallelism is increased.

In the conventional LRR scheme, performance degrades because many warps use the same resource at the same time. In contrast, because warp scheduling according to the present invention issues memory instructions as early as possible, it can alleviate the situation in which memory instructions are executed simultaneously.

With warp scheduling according to the present invention, the ready warps in the warp pool are classified into two groups according to the type of instruction they will execute, and stored. When the classified warps are inserted into their respective queues, they are re-sorted according to different priority schemes.
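
By way of non-limiting illustration, the classification and sorted insertion described above may be sketched in Python as follows. The class `WarpQueues` and its fields are hypothetical. The sketch assumes the memory queue is kept in oldest-allocation-first order and the computation queue in ascending warp ID order; the rotation that round-robin issue applies on top of the ID order is omitted.

```python
import bisect
from typing import List, Tuple


class WarpQueues:
    """Two queues, each kept sorted under its own priority scheme."""

    def __init__(self) -> None:
        self.memory: List[Tuple[int, int]] = []  # (alloc_cycle, wid), oldest first
        self.compute: List[int] = []             # wid, ascending

    def insert(self, wid: int, alloc_cycle: int, is_memory: bool) -> None:
        if is_memory:
            # GTO-like order: the smaller (older) allocation cycle comes first.
            bisect.insort(self.memory, (alloc_cycle, wid))
        else:
            # LRR-like order: by warp ID.
            bisect.insort(self.compute, wid)
```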

Referring to FIG. 2, the total number of execution cycles decreases when memory instructions are issued preferentially. Both the MTO scheme 220 and the MOTRR scheme 230 require fewer total execution clock cycles than the GTO scheme. When memory instructions are issued preferentially and the LRR scheme is used for computation instructions, it can be seen that latency is hidden effectively compared with the MTO scheme 220 and the MOTRR scheme 230.

FIG. 3 is a diagram illustrating the functional configuration of the warp scheduler 150 according to an embodiment of the present invention.

Referring to FIG. 3, the warp scheduler 150 may include a GTO-based sorting module 310, an LRR-based sorting module 320, a memory warp queue 330, and a computation warp queue 340.

The GTO-based sorting module 310 may sort the warps that execute memory instructions by allocation time and insert them into the memory warp queue 330.

The LRR-based sorting module 320 may sort the warps that execute computation instructions based on the ID of the most recently issued warp. In an embodiment, the LRR-based sorting module 320 may refer to the recency bits in the recency register 350 and sort the warps that execute computation instructions so that warps that recently issued a memory instruction come first. The LRR-based sorting module 320 may insert the sorted warps into the computation warp queue 340.

The recency register 350 may consist of three fields in total. The index is set to the warp ID (WID), and the register may contain up to 48 entries. The counter may be a 6-bit counter per warp, used to track the warps that most recently completed a memory instruction. Based on the counter value, the R bit of a recently issued warp may be set to 1.
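
By way of non-limiting illustration, the three-field layout described above may be sketched in Python as follows. The disclosure does not specify how the counter is updated, so the aging policy here is an assumption made only for this sketch: each counter acts as a saturating age that is reset to 0 when its warp completes a memory instruction, and the R bit is set for warps whose age is below a threshold. The names `RecencyRegister`, `on_memory_complete`, and `refresh_r_bits` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

NUM_ENTRIES = 48             # one entry per warp ID (WID)
COUNTER_MAX = (1 << 6) - 1   # 6-bit counter saturates at 63


@dataclass
class RecencyEntry:
    counter: int = COUNTER_MAX  # assumed meaning: age since last memory completion
    r_bit: int = 0              # 1 marks a recently issued warp


class RecencyRegister:
    def __init__(self) -> None:
        # The list index plays the role of the WID index field.
        self.entries: List[RecencyEntry] = [RecencyEntry() for _ in range(NUM_ENTRIES)]

    def on_memory_complete(self, wid: int) -> None:
        """Age every entry (saturating), then mark `wid` as the most recent."""
        for e in self.entries:
            e.counter = min(e.counter + 1, COUNTER_MAX)
        self.entries[wid].counter = 0

    def refresh_r_bits(self, threshold: int) -> None:
        """Set R = 1 for warps whose age is below `threshold` (assumed policy)."""
        for e in self.entries:
            e.r_bit = 1 if e.counter < threshold else 0
```

A 6-bit counter can represent 64 distinct values, which is enough to give each of the 48 entries a distinct age.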

The warp scheduler 150 according to the present invention may sort and store warps in the memory warp queue 330 and the computation warp queue 340 by applying a different priority scheme depending on the type of instruction each warp executes.

The warp scheduler 150 according to the present invention may apply the LRR scheme and the GTO scheme statically, taking advantage of both. In particular, the performance advantages of the GTO scheme are that it actively exploits intra-warp locality and relieves contention for resources.

For memory instructions, the warp scheduler 150 according to the present invention may, similarly to the GTO scheme, sort warps from the oldest time of allocation to the SM 100. Accordingly, depending on whether the next instruction of each warp is a memory instruction, warps with older allocation times may be issued preferentially so that they can keep using memory resources.

명령어 버퍼는 워프 당 최대 2개의 명령어를 명령어 캐시로부터 인출하고 해석하여 정보를 저장할 수 있다. 따라서, 현재 명령어 버퍼에 저장된 명령어를 처리해야만 다음 수행할 명령어를 인출 및 해석할 수 있다. The instruction buffer can store information by fetching and interpreting up to two instructions per warp from the instruction cache. Therefore, the instruction to be executed next can be retrieved and interpreted only by processing the instruction currently stored in the instruction buffer.

만약, GTO 방식에 따라 할당 시간이 오래된 워프가 메모리 명령어를 우선적으로 발행하도록 설계하더라도 LRR 방식에 따라 연산 명령어를 발행하는 방식에 영향을 받을 수밖에 없다. 다시 말해, 부분적인 LRR 방식의 영향으로 메모리 명령어를 발행을 할 때, 명령어에 대하여 할당 시간이 오래된 워프들을 그렇지 않은 워프들보다 발행할 기회를 항상 보장하진 못한다. Even if a warp with an old allocation time according to the GTO method is designed to issue memory instructions preferentially, the method of issuing operation instructions according to the LRR method will inevitably be affected. In other words, when issuing a memory instruction due to the influence of the partial LRR method, it is not always possible to guarantee an opportunity to issue warps with an older allocation time for the instruction than warps that do not.

The warp scheduler 150 according to the present invention may select, as issue candidates, warps that have recently completed a memory instruction and are therefore presumed to have loaded their data into the L1 instruction cache 120. For a warp that has recently completed a memory instruction, the requested data is very likely to be present in the L1 instruction cache 120. In addition, intra-warp locality can be improved by limiting memory access to an unspecified warp group.

In one embodiment, whether each active warp has processed an instruction may be tracked in real time, and only (number of active warps)/2 warps, taken in order starting from the most recently issued warp, may be classified into a separate group. In one embodiment, the warps in this group may be referred to as "recently completed warps" or by a term with an equivalent technical meaning.

To distinguish the recently completed warps when the warp scheduler 150 issues computation instructions, a recency register 350 that indicates, for each warp, whether it recently completed may be required. In this case, for example, the maximum number of active warps that can be assigned to the SM 100 may be set to 48.

When a memory instruction such as a load or store completes in an active warp slot, the counter corresponding to that warp may be reset to 0, and the counters of all other warps may be incremented by 1. In addition, the number of active warps (W_Active) may be counted in real time so that a recent warp group of W_Active/2 warps is maintained.

If the number of recent warps reaches W_Active/2 or more, the warp with the largest counter value may be removed from the recent warp group. Whether each warp is a recent warp may be managed through its recency bit. Accordingly, the number of active warps is measured after warps are actually assigned, and the number of warps that recently completed memory instructions can be determined from this information.
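
The counter and recency-bit update described above can be sketched in Python as follows. This is an illustrative reading, not the patented circuit: the function name `on_memory_complete`, the dictionary-based state, saturation of the counter at 63, integer division for W_Active/2, and the tie-break on warp ID are all assumptions. The sketch evicts only when adding the completing warp would push the group past W_Active/2, which is one way to interpret the "W_Active/2 or more" condition.

```python
COUNTER_MAX = 63  # assumed saturation value for a 6-bit counter


def on_memory_complete(counters, r_bits, wid, active):
    """Update recency state when warp `wid` completes a memory instruction.

    counters, r_bits: dicts keyed by warp ID; active: set of active warp IDs.
    """
    if wid not in active:
        raise ValueError("completing warp must be active")
    for w in active:
        if w == wid:
            counters[w] = 0                                   # reset the completing warp
        else:
            counters[w] = min(counters.get(w, 0) + 1, COUNTER_MAX)  # age the others
    r_bits[wid] = 1                                           # mark as recently completed
    limit = len(active) // 2                                  # W_Active / 2 (assumed floor)
    recent = [w for w in active if r_bits.get(w, 0) == 1]
    while len(recent) > limit:
        # Evict the recent warp with the largest counter (ties broken by larger ID).
        victim = max(recent, key=lambda w: (counters[w], w))
        r_bits[victim] = 0
        recent.remove(victim)
```

With four active warps, for example, completing memory instructions on warps 0, 1, and 2 in that order leaves warps 1 and 2 in the recent group, because warp 0 then holds the largest counter among the marked warps.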

The warp scheduler 150 according to the present invention issues memory instructions first, giving priority to the warps that were assigned to the SM 100 longest ago. If no more memory instructions can be issued, or if a stall in the memory pipeline makes issue impossible in the current cycle, it may issue a warp that executes a computation instruction.
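
The per-cycle choice between the two queues might be sketched in Python as below. The function name `select_warp`, its arguments, and the tuple return value are hypothetical; both queues are assumed to be already sorted by their respective policies, and real hardware would also consult the scoreboard and other readiness conditions that this sketch omits.

```python
def select_warp(memory_queue, compute_queue, memory_pipeline_stalled):
    """Pick the warp to issue in the current cycle.

    memory_queue: warp IDs with a pending memory instruction, oldest allocation first.
    compute_queue: warp IDs with a pending computation instruction, in issue order.
    Returns (kind, warp_id), or None when nothing can be issued.
    """
    # Memory instructions go first, unless the memory pipeline is stalled.
    if memory_queue and not memory_pipeline_stalled:
        return ("memory", memory_queue[0])
    # Otherwise fall back to a warp that executes a computation instruction.
    if compute_queue:
        return ("compute", compute_queue[0])
    return None
```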

FIG. 4 is a diagram illustrating a method of operating a warp scheduling device according to an embodiment of the present invention. In one embodiment, the warp scheduling device may include the warp scheduler 150 of the streaming multiprocessor 100.

Referring to FIG. 4, step S401 is a step of issuing warps that execute memory instructions generating a latency greater than or equal to a first threshold. In one embodiment, warps whose allocation time for the memory instruction is greater than or equal to a second threshold may be issued.

In one embodiment, warps that execute memory instructions may be sorted by allocation time and inserted into the memory warp queue. The memory instructions may then be executed in the order of the warps in the memory warp queue.
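
A minimal Python sketch of building such a queue follows. The tuple layout `(warp_id, assigned_cycle, next_is_memory)` and the function name are assumptions for illustration; the oldest allocation is modeled as the smallest assignment cycle, and ties are broken by warp ID.

```python
def build_memory_warp_queue(warps):
    """Return IDs of warps whose next instruction is a memory instruction, oldest first.

    warps: iterable of (warp_id, assigned_cycle, next_is_memory) tuples (hypothetical layout).
    """
    memory_warps = [w for w in warps if w[2]]
    # A smaller assigned_cycle means the warp was assigned to the SM longer ago.
    memory_warps.sort(key=lambda w: (w[1], w[0]))
    return [warp_id for warp_id, _, _ in memory_warps]
```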

Step S403 is a step of issuing warps that execute computation instructions after the warps that execute memory instructions have been issued. In one embodiment, when a stall occurs in the memory pipeline, warps that execute computation instructions may be issued.

In one embodiment, warps that execute computation instructions may be sorted by warp ID (identification) and inserted into the computation warp queue. The computation instructions may then be executed in the order of the warps in the computation warp queue.
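
One way to picture the warp-ID ordering is the Python sketch below. Sorting by warp ID follows the text; rotating the sorted list so that issue resumes after the last issued warp is an added assumption meant to mimic round-robin behavior, and the names `build_compute_warp_queue` and `last_issued_id` are hypothetical.

```python
def build_compute_warp_queue(compute_warp_ids, last_issued_id=None):
    """Order warps that execute computation instructions by warp ID.

    If last_issued_id is given, rotate the order so the next higher ID comes first
    (an assumed round-robin starting point, not stated in the text).
    """
    ordered = sorted(compute_warp_ids)
    if last_issued_id is None:
        return ordered
    later = [w for w in ordered if w > last_issued_id]
    earlier = [w for w in ordered if w <= last_issued_id]
    return later + earlier
```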

In one embodiment, warps that execute computation instructions may be sorted and inserted into the computation warp queue based on the recency bit of the recency register, which includes a portion of the active warps. For example, the portion of the active warps may be half of all active warps.
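
A short Python sketch of recency-based ordering is given below. Placing warps whose recency bit is 1 ahead of the others, and ordering by warp ID within each group, are assumptions about how the bit is used; the function name and the dictionary of bits are hypothetical.

```python
def order_compute_warps_by_recency(compute_warp_ids, r_bits):
    """Order computation warps so that recently completed warps come first.

    r_bits: dict mapping warp ID to its recency bit (missing entries count as 0).
    """
    return sorted(
        compute_warp_ids,
        key=lambda w: (0 if r_bits.get(w, 0) == 1 else 1, w),
    )
```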

In one embodiment, warps that execute memory instructions may be prioritized according to the latency, while warps that execute computation instructions may be given the same priority.

In one embodiment, memory instructions and computation instructions may be stored separately according to instruction type.

FIG. 5 is a graph of the L1 data cache miss rate according to an embodiment of the present invention.

Referring to FIG. 5, the GTO scheme shows a low miss rate on the ATAX, MVT, and BICG benchmarks, but its miss rate on the MC benchmark is 2.8 times higher than that of LRR. The warp scheduling scheme according to the present invention shows a low miss rate across all benchmarks, confirming that it preserves locality in the L1 data cache.

FIG. 6 is a graph of reservation failures for the L1 data cache according to an embodiment of the present invention.

Referring to FIG. 6, when the MSHR or the miss queue, which are resources associated with the L1 data cache, has too few free entries, or when a cache set is accessed too frequently, the LD/ST unit 190 cannot be accessed.

Such a cycle is defined as a reservation fail cycle, and the counts for the LRR scheme, the GTO scheme, and the warp scheduling scheme according to the present invention can be compared.
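
As a rough illustration of how reservation fail cycles could be counted from a per-cycle trace, consider the Python sketch below. The trace format and both function names are hypothetical, and the boolean `set_conflict` is only a stand-in for the condition that a cache set is accessed too frequently, which the text does not define precisely.

```python
def is_reservation_fail(mshr_free, miss_queue_free, set_conflict):
    """Return True if an L1 data cache access cannot be reserved in this cycle."""
    return mshr_free <= 0 or miss_queue_free <= 0 or bool(set_conflict)


def count_reservation_fail_cycles(trace):
    """Count reservation fail cycles.

    trace: iterable of (mshr_free, miss_queue_free, set_conflict), one tuple per cycle.
    """
    return sum(1 for cycle in trace if is_reservation_fail(*cycle))
```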

In this comparison, these stall cycles increase on some benchmarks, but the average number of reservation fail cycles over all benchmarks is low. This confirms the advantage of the warp scheduling scheme according to the present invention in terms of cache efficiency and contention for cache resources.

FIG. 7 is a graph of stall cycles according to an embodiment of the present invention.

Referring to FIG. 7, the stall cycles in which not a single warp can be issued at the issue stage because computational resources are unavailable can be measured and compared.

In this case, a structure of an electronic device (e.g., a GPU) to which the warp scheduling scheme according to the present invention is applied shows 23% fewer such stall cycles than a structure using the LRR scheme. This confirms that the warp scheduling scheme according to the present invention improves performance by effectively hiding latency. Along with better cache efficiency and fewer LD/ST unit stalls, the number of cycles in which every warp is stalled in the pipeline and no warp can be issued during a long latency is also reduced.

The LRR scheme provides high parallelism and keeps execution progress even across warps. The warp scheduling scheme according to the present invention retains these characteristics of the LRR scheme while issuing memory instructions based on allocation time, so that memory access is granted mainly to a subset of warps.

FIG. 8 is a graph of performance for different warp scheduling schemes according to an embodiment of the present invention.

Referring to FIG. 8, the performance of structures using the LRR and GTO schemes can be compared with that of several other schemes, including the warp scheduling according to the present invention. Specifically, the comparison covers the MTO (memory then oldest) scheme, which issues memory instructions first and then issues the oldest warps by allocation time; the MOTRR (memory oldest then round robin) scheme, which issues the warp with the oldest allocation time among those executing memory instructions and orders warps executing other instruction types in round-robin fashion; and MOTRR+Recency according to the present invention, which extends MOTRR with information about warps that recently executed memory instructions.

For the performance comparison, IPC (instructions per cycle), the number of instructions processed per cycle, is measured for each scheme on the same benchmarks and normalized to the result of the GTO scheme. Unlike the MTO scheme, the MOTRR scheme issues computation instructions fairly and thereby achieves higher performance. This shows that applying a different policy per instruction type is more effective than simply using the LRR scheme or the GTO scheme on its own.
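
The normalization step can be illustrated with a small Python sketch. The function names and the dictionary keyed by scheme name are assumptions, and any numbers passed in are placeholders rather than measurements from the patent.

```python
def ipc(instructions, cycles):
    """Instructions per cycle for one benchmark run."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def normalize_to_gto(ipc_by_scheme):
    """Divide every scheme's IPC by the GTO IPC, so that GTO maps to 1.0.

    ipc_by_scheme: dict such as {"GTO": ..., "LRR": ..., "MOTRR": ...} for one benchmark.
    """
    gto = ipc_by_scheme["GTO"]
    return {name: value / gto for name, value in ipc_by_scheme.items()}
```

A normalized value above 1.0 means the scheme achieved higher IPC than GTO on that benchmark.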

On the LPS, STC, SQRNG, and MC benchmarks, the GTO and MTO schemes show much lower IPC than the LRR scheme. On average, MOTRR performs 11.5% better than the LRR scheme and 4.4% better than the GTO scheme.

According to various embodiments of the present invention, the IPC of MOTRR+Recency, which uses information on warps that recently completed memory instructions, improves on the LRR and GTO schemes by 12.7% and 5.6% on average, respectively.

The LRR and GTO schemes behave differently across benchmarks, and on some benchmarks the LRR scheme outperforms GTO. The MOTRR+Recency scheme according to the present invention shows a substantial performance improvement on ATAX, MVT, and BICG. On the MVT benchmark in particular, MOTRR+Recency improves performance by 3.1% over MOTRR.

The above description merely illustrates the technical idea of the present invention, and those skilled in the art will be able to make various changes and modifications without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in this specification are intended to describe, not to limit, the technical idea of the present invention, and the scope of the present invention is not limited by these embodiments.

The scope of protection of the present invention should be interpreted according to the claims, and all technical ideas within a scope equivalent to the claims should be understood as falling within the scope of the present invention.

100: streaming multiprocessor
110: instruction buffer
120: L1 instruction cache
130: scoreboard
140: register file
150: warp scheduler
160: instruction execution unit
210: GTO scheme
220: MTO scheme
230: MOTRR scheme
310: GTO-based sorting module
320: LRR-based sorting module
330: memory warp queue
340: computation warp queue
350: recency register

Claims

A warp scheduling method comprising:
(a) issuing warps that execute memory instructions generating a latency greater than or equal to a first threshold; and
(b) after issuing the warps that execute the memory instructions, issuing warps that execute computation instructions.

The method of claim 1, wherein step (a) comprises:
issuing warps whose allocation time for the memory instructions is greater than or equal to a second threshold.

The method of claim 2, further comprising, after step (a):
sorting the warps that execute the memory instructions based on the allocation time and inserting them into a memory warp queue.

The method of claim 3, further comprising, after the inserting:
executing the memory instructions according to the order of the warps inserted into the memory warp queue.

The method of claim 1, wherein step (b) comprises:
issuing the warps that execute the computation instructions when a stall occurs in a memory pipeline.

The method of claim 1, further comprising, after step (b):
sorting the warps that execute the computation instructions according to warp ID (identification) and inserting them into a computation warp queue.

The method of claim 6, further comprising, after the inserting:
executing the computation instructions according to the order of the warps inserted into the computation warp queue.

The method of claim 1, further comprising, after step (b):
sorting the warps that execute the computation instructions based on a recency bit of a recency register including a portion of active warps, and inserting them into a computation warp queue.

The method of claim 1, wherein
the warps that execute the memory instructions are prioritized according to the latency, and
the warps that execute the computation instructions are given the same priority.

The method of claim 1, wherein
the memory instructions and the computation instructions are stored separately according to instruction type.

A warp scheduling device comprising:
a warp scheduler configured to issue warps that execute memory instructions generating a latency greater than or equal to a first threshold and, after issuing the warps that execute the memory instructions, to issue warps that execute computation instructions.

The device of claim 11, wherein the warp scheduler
issues warps whose allocation time for the memory instructions is greater than or equal to a second threshold.

The device of claim 11, wherein the warp scheduler
sorts the warps that execute the memory instructions based on the allocation time and inserts them into a memory warp queue.

The device of claim 13, wherein the warp scheduler
executes the memory instructions according to the order of the warps inserted into the memory warp queue.

The device of claim 11, wherein the warp scheduler
issues the warps that execute the computation instructions when a stall occurs in a memory pipeline.

The device of claim 11, wherein the warp scheduler
sorts the warps that execute the computation instructions according to warp ID (identification) and inserts them into a computation warp queue.

The device of claim 16, further comprising:
executing the computation instructions according to the order of the warps inserted into the computation warp queue.

The device of claim 11, wherein the warp scheduler
sorts the warps that execute the computation instructions based on a recency bit of a recency register including a portion of active warps, and inserts them into a computation warp queue.

The device of claim 11, wherein
the warps that execute the memory instructions are prioritized according to the latency, and
the warps that execute the computation instructions are given the same priority.

The device of claim 11, wherein
the memory instructions and the computation instructions are stored separately according to instruction type.